This section answers some of the frequently asked questions about query performance.
Answer: There are generally two measures of query performance:
Response time, the time to get an answer to an individual query, and
Throughput, the number of queries that can be run in any time period; for example, queries each second.
These two are related, but are not the same. In a heavily loaded system, you normally want maximum throughput, whereas in a relatively lightly loaded system, you probably want minimum response time. Also, some applications require a query to deliver all its hits to the user, whereas others might only require the first 20 hits from an ordered set. It is important to distinguish between these two scenarios.
Answer: The fastest type of query will meet the following conditions:
Single CONTAINS
clause
No other conditions in the WHERE
clause
No ORDER
BY
clause at all
Only the first page of results is returned (for example, the first 10 or 20 hits).
Answer: Yes. Collecting statistics on your tables enables Oracle Text to do cost-based analysis. This helps Oracle Text choose the most efficient execution plan for your queries.
If your queries are always pure text queries (no structured predicate and no joins), you should delete statistics on your Oracle Text index.
Answer: The speed at which the text index can deliver ROWIDs is not affected by the actual size of the data. Text query speed will be related to the number of rows that must be fetched from the index table, number of hits requested, number of hits produced by the query, and the presence or absence of sorting.
Answer: The format of the documents (plain ASCII text, HTML or Microsoft Word) should make no difference to query speed. The documents are filtered to plain text at indexing time, not query time.
The cleanliness of the data will make a difference. Spell-checked and sub-edited text for publication tends to have a much smaller total vocabulary (and therefore size of the index table) than informal text such as e-mails, which will contain many spelling errors and abbreviations. For a given index memory setting, the extra text takes up more memory, which can lead to more fragmented rows than in the cleaner text, which can adversely affect query response time.
Answer: There are two ways the kernel can query the text index. In the first and most common case, the kernel asks the text index for all the rowids that satisfy a particular text search. These rowids are returned in batches. In the second, the kernel passes individual rowids to the text index, and asks whether that particular rowid satisfies a certain text criterion.
The second is known as a functional lookup, and is most commonly done where there is a very selective structured clause, so that only a few rowids must be checked against the text index. An example of a search where a functional lookup may be used:
SELECT ID, SCORE(1), TEXT FROM MYTABLE
WHERE START_DATE = '21 Oct 1992' <- highly selective AND CONTAINS (TEXT, 'commonword') > 0 <- unselective
Functional invocation is also used for text query ordered by structured column (for example date, price) and text query is unselective.
Answer: All queries look at the index token table. Its name has the form DR$indexname$I
. This contains the list of tokens (column TOKEN_TEXT
) and the information about the row and word positions where the token occurs (column TOKEN_INFO
).
The row information is stored as internal DOCID values. These must be translated into external ROWID values. The table used for this depends on the type of lookup: For functional lookups, the $K
table, DR$indexname$K
, is used. This is a simple Index Organized Table (IOT) which contains a row for each DOCID/ROWID pair.
For indexed lookups, the $R
table, DR$indexname$R
, is used. This holds the complete list of ROWIDs in a BLOB column.
Hence we can easily find out whether a functional or indexed lookup is being used by examining a SQL trace, and looking for the $K or $R tables.
Answer: Yes, it certainly does.
If there is no sorting, then Oracle Text can return results as it finds them, which is quicker in the common case where the application needs to display only a page of results at a time.
Answer: Sorting by relevance (SCORE(n)
) can be extremely quick if the FIRST_ROWS(n)
hint is used. In this case, Oracle Text performs a high speed internal sort when fetching from the text index tables.
An example of such a query:
SELECT /*+ FIRST_ROWS(10) */ ID, SCORE(1), TEXT FROM mytable WHERE CONTAINS (TEXT, 'searchterm', 1) > 0 ORDER BY SCORE(1) DESC;
Note that for this to work efficiently, there must be no other criteria in the WHERE
clause other than a single CONTAINS
.
Answer: For querying, you want to strive for a large system global area (SGA). You can set these parameters related to SGA in your Oracle Database initialization file. You can also set these parameters dynamically.
The SORT_AREA_SIZE
parameter controls the memory available for sorting for ORDER BY
queries. You should increase the size of this parameter if you frequently order by structured columns.
Oracle Database Administrator's Guide for more information on setting SGA related parameters
Oracle Database Performance Tuning Guide for more information on memory allocation and setting the SORT_AREA_SIZE
parameter
Answer: Yes. Typically, a SELECT
statement selects more than one column from your base table. Because Oracle Text fetches columns to memory, it is more efficient to store wide base table columns such as LOBs out of line, especially when these columns are rarely updated but frequently selected.
When LOBs are stored out of line, only the LOB locators need to be fetched to memory during querying. Out of line storage reduces the effective size of the base table making it easier for Oracle Text to cache the entire table to memory. This reduces the cost of selecting columns from the base table, and hence speeds up text queries.
In addition, having smaller base tables cached in memory enables more index table data to be cached during querying, which improves performance.
Answer: The fastest type of query is one where there is only a single CONTAINS
clause, and no other conditions in the WHERE
clause.
Consider the following multiple CONTAINS
query:
SELECT title, isbn FROM booklist WHERE CONTAINS (title, 'horse') > 0 AND CONTAINS (abstract, 'racing') > 0
We can obtain the same result with section searching and the WITHIN
operator as follows:
SELECT title, isbn FROM booklist WHERE CONTAINS (alltext, 'horse WITHIN title AND racing WITHIN abstract')>0
This query completes more quickly. To use a query like this, we must copy all the data into a single text column for indexing, with section tags around each column's data. This can be done with PL/SQL procedures before indexing, or by making use of the USER_DATASTORE
datastore during indexing to synthesize structured columns with the text column into one document.
Answer: Each distinct word used in a query requires at least one row to be fetched from the index table. It is therefore best to keep the number of expansions down as much as possible.
You should not use expansions such as wild cards, thesaurus, stemming and fuzzy matching unless they are necessary to the task. In general, a few expansions (for example, 10 to 20) does not cause difficulty, but avoid having large numbers of explansions (80 or 100) in a query. The query feedback mechanism can be used to determine the number of expansions for any particular query expression.
In addition for wildcard and stem queries, you can remove the cost of term expansion from query time to index time by creating prefix, substring or stem indexes. Query performance increases at the cost of longer indexing time and added disk space.
Prefix and substring indexes can improve wildcard performance. You enable prefix and substring indexing with the BASIC_WORDLIST
preference. The following example sets the wordlist preference for prefix and substring indexing. For prefix indexing, it specifies that Oracle Text create token prefixes between 3 and 4 characters long:
begin
ctx_ddl.create_preference('mywordlist', 'BASIC_WORDLIST'); ctx_ddl.set_attribute('mywordlist','PREFIX_INDEX','TRUE'); ctx_ddl.set_attribute('mywordlist','PREFIX_MIN_LENGTH', '3'); ctx_ddl.set_attribute('mywordlist','PREFIX_MAX_LENGTH', '4'); ctx_ddl.set_attribute('mywordlist','SUBSTRING_INDEX', 'YES');
end
You enable stem indexing with the BASIC_LEXER
preference:
begin
ctx_ddl.create_preference('mylex', 'BASIC_LEXER'); ctx_ddl.set_attribute ( 'mylex', 'index_stems', 'ENGLISH');
end;
Answer: You can create local partitioned CONTEXT
indexes on partitioned tables. This means that on a partitioned table, each partition has its own set of index tables. Effectively, there are multiple indexes, but the results from each are combined as necessary to produce the final result set.
The index is created using the LOCAL
keyword:
CREATE INDEX index_name ON table_name (column_name) INDEXTYPE IS ctxsys.context PARAMETERS ('...') LOCAL
With partitioned tables and local indexes, you can improve performance of the following types of CONTAINS
queries:
Range Search on Partition Key Column
This is a query that restricts the search to a particular range of values on a column that is also the partition key.
This is a query that requires only the first n
hits and the ORDER BY
clause names the partition key
Answer: It depends on system load and server capacity. Even though parallel querying is the default behavior for indexes created in parallel, it usually results in degrading overall query throughput on heavily loaded systems.
In general, parallel queries are particularly appropriate for DSS or analytical systems with large data collections, multiple CPUs, and low number of concurrent users.
Answer: Indexing theme information with a CONTEXT
index takes longer and also increases the size of your index. However, theme indexes enable ABOUT
queries to be more precise by using the knowledge base. If your application uses ABOUT
queries heavily, it might be worthwhile to create a theme component to the index, despite the extra indexing time and extra storage space required.
Answer: CTXCAT
indexes work best when text is in small chunks, maybe a few lines maximum, and searches need to restrict or sort the result set according to certain structured criteria, usually numbers or dates.
For example, consider an on-line auction site. Each item for sale has a short description, a current bid price, and dates for the start and end of the auction. A user might want to see all the records with antique cabinet in the description, with a current bid price less than $500. Because he is particularly interested in newly posted items, he wants the results sorted by auction start time.
Such a search is not always efficient with a CONTAINS
structured query on a CONTEXT
index, where the response time can vary significantly depending on the structured and CONTAINS
clauses. This is because the intersection of structured and CONTAINS
clauses or the ordering of text query is computed during query time.
By including structured information such as price and date within the CTXCAT
index, query response time is always in an optimal range regardless of search criteria. This is because the interaction between text and structured query is pre-computed during indexing. Consequently query response time is optimum.
Answer: There are differences in the time and space needed to create the index. CTXCAT
indexes take a bit longer to create and use considerably more disk space than CONTEXT
indexes. If you are tight on disk space, you should consider carefully whether CTXCAT
indexes are appropriate for you.
With respect to query operators, you can now use the richer CONTEXT
grammar in CATSEARCH
queries with query templates. The older restriction of a single CATSEARCH
query grammar no longer holds.
Answer: The optimizer hint INDEX(table column)
can be used in the usual way to drive the query with a text or b-tree index.
You can also use the NO_INDEX(table column)
hint to disable a specific index.
Additionally, the FIRST_ROWS(n)
hint has a special meaning for text queries and should be used when you need the first n
hits to a query. Use of the DOMAIN_INDEX_SORT
hint in conjunction with ORDER BY SCORE(n) DESC
tells the Oracle optimizer to accept a sorted set from the text index, and not to do a further sort.