2 Oracle Text Indexing Elements

This chapter describes the indexing elements that you can use to create an Oracle Text index.

The following topics are discussed in this chapter:

Overview
Datastore Types
Filter Types
Lexer Types
Wordlist Type
Storage Types
Section Group Types
Classifier Types
Cluster Types
Stoplists
System-Defined Preferences
System Parameters

2.1 Overview

When you use the CREATE INDEX statement to create an index or the ALTER INDEX statement to manage an index, you can optionally specify indexing preferences, stoplists, and section groups in the parameter string. Specifying a preference, stoplist, or section group answers one of the following questions about the way Oracle Text indexes text:

Preference Class	Answers the Question
Datastore	How are your documents stored?
Filter	How can the documents be converted to plain text?
Lexer	What language is being indexed?
Wordlist	How should stem and fuzzy queries be expanded?
Storage	How should the index tables be stored?
Stop List	What words or themes are not to be indexed?
Section Group	Is querying within sections enabled, and how are the document sections defined?

This chapter describes how to set each preference. Enable an option by creating a preference with one of the types described in this chapter.

For example, to specify that your documents are stored in external files, you can create a datastore preference called mydatastore using the FILE_DATASTORE type. Specify mydatastore as the datastore preference in the parameter clause of the CREATE INDEX statement.

2.1.1 Creating Preferences

To create a datastore, lexer, filter, classifier, wordlist, or storage preference, use the CTX_DDL.CREATE_PREFERENCE procedure and specify one of the types described in this chapter. For some types, you can also set attributes with the CTX_DDL.SET_ATTRIBUTE procedure.

An indexing type names a class of indexing objects that you can use to create an index preference. A type, therefore, is an abstract ID, while a preference is an entity that corresponds to a type. Many system-defined preferences have the same name as types (for example, BASIC_LEXER), but exact correspondence is not guaranteed. For example, the DEFAULT_DATASTORE preference uses the DIRECT_DATASTORE type, and there is no system preference corresponding to the CHARSET_FILTER type. Be careful in assuming the existence or nature of either indexing types or system preferences.

You specify indexing preferences with the CREATE INDEX and ALTER INDEX statements. Indexing preferences determine how your index is created. For example, lexer preferences indicate the language of the text to be indexed. You can create and specify your own user-defined preferences, or you can use system-defined preferences.

To create a stoplist, use the CTX_DDL.CREATE_STOPLIST procedure. Add stopwords to a stoplist with CTX_DDL.ADD_STOPWORD.

To create section groups, use CTX_DDL.CREATE_SECTION_GROUP and specify a section group type. Add sections to section groups with the CTX_DDL.ADD_ZONE_SECTION or CTX_DDL.ADD_FIELD_SECTION procedures.
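As a rough sketch of how these procedures fit together, the following block creates a stoplist and a section group and then names both in the parameter string. The names mystop, mysg, and docs_idx, the tags title and body, and the table mydocs with its text column are hypothetical and chosen only for illustration.

```sql
begin
  -- hypothetical stoplist with two stopwords
  ctx_ddl.create_stoplist('mystop', 'BASIC_STOPLIST');
  ctx_ddl.add_stopword('mystop', 'the');
  ctx_ddl.add_stopword('mystop', 'of');

  -- hypothetical section group: one zone section and one field section
  ctx_ddl.create_section_group('mysg', 'BASIC_SECTION_GROUP');
  ctx_ddl.add_zone_section('mysg', 'body', 'body');
  ctx_ddl.add_field_section('mysg', 'title', 'title', true);
end;
/

-- mydocs(text) is an assumed table and column
create index docs_idx on mydocs(text)
  indextype is ctxsys.context
  parameters ('stoplist mystop section group mysg');
```

Any class that is not named in the parameter string falls back to its system default.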

2.2 Datastore Types

Use the datastore types to specify how your text is stored. To create a datastore preference, you must use one of the datastore types described in Table 2-1.

Table 2-1 Datastore Types

Datastore Type	Use When
DIRECT_DATASTORE	Data is stored internally in the text column. Each row is indexed as a single document.
MULTI_COLUMN_DATASTORE	Data is stored in a text table in more than one column. Columns are concatenated to create a virtual document, one for each row.
DETAIL_DATASTORE	Data is stored internally in the text column. Document consists of one or more rows stored in a text column in a detail table, with header information stored in a master table.
FILE_DATASTORE	Data is stored externally in operating system files. File names are stored in the text column, one for each row.
NESTED_DATASTORE	Data is stored in a nested table.
URL_DATASTORE	Data is stored externally in files located on an intranet or the Internet. Uniform Resource Locators (URLs) are stored in the text column.
USER_DATASTORE	Documents are synthesized at index time by a user-defined stored procedure.

2.2.1 DIRECT_DATASTORE

Use the DIRECT_DATASTORE type for text stored directly in the text column, one document for each row. The DIRECT_DATASTORE type has no attributes.

The following column types are supported: CHAR, VARCHAR, VARCHAR2, BLOB, CLOB, BFILE, XMLType, and URIType.

Note:

If your column is a BFILE, then the index owner must have read permission on all directories used by the BFILEs.

2.2.1.1 DIRECT_DATASTORE CLOB Example

The following example creates a table with a CLOB column to store text data. It then populates two rows with text data and indexes the table using the system-defined preference CTXSYS.DEFAULT_DATASTORE.

create table mytable(id number primary key, docs clob); 

insert into mytable values(111555,'this text will be indexed');
insert into mytable values(111556,'this is a direct_datastore example');
commit;

create index myindex on mytable(docs) 
  indextype is ctxsys.context 
  parameters ('DATASTORE CTXSYS.DEFAULT_DATASTORE');

2.2.2 MULTI_COLUMN_DATASTORE

Use the MULTI_COLUMN_DATASTORE datastore when your text is stored in more than one column. During indexing, the system concatenates the text columns, tags the column text, and indexes the text as a single document. The XML-like tagging is optional. You can also set the system to filter and concatenate binary columns.

The MULTI_COLUMN_DATASTORE type has the attributes shown in Table 2-2.

Table 2-2 MULTI_COLUMN_DATASTORE Attributes

Attribute	Attribute Value
columns	Specify a comma-delimited list of columns to be concatenated during indexing. You can also specify any allowed expression for the `SELECT` statement column list for the base table. This includes expressions, PL/SQL functions, column aliases, and so on. The `NUMBER` and `DATE` column types are supported. They are converted to text before indexing using the default format mask. The `TO_CHAR` function can be used in the column list for formatting. The `RAW` and `BLOB` columns are directly concatenated as binary data. The `LONG`, `LONG` `RAW`, `NCHAR`, and `NCLOB` data types, nested table columns, and collections are not supported. The column list is limited to 500 bytes.
filter	Specify a comma-delimited list of Y/N flags. Each flag corresponds to a column in the `COLUMNS` list and denotes whether to filter the column using the `AUTO_FILTER`. Allowed values: `Y` (column is filtered with `AUTO_FILTER`); `N` or no value (column is not filtered; this is the default).
delimiter	Specify the delimiter that separates column text as follows: `COLUMN_NAME_TAG`: Column text is set off by XML-like open and close tags (default). `NEWLINE`: Column text is separated with a newline.

2.2.2.1 Indexing and DML

To index, you must create a dummy column to specify in the CREATE INDEX statement. This column's contents are not made part of the virtual document, unless its name is specified in the columns attribute.

The index is synchronized only when the dummy column is updated. You can create triggers to propagate changes if needed.
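The following is a minimal sketch of that behavior, not a prescribed pattern. It assumes a hypothetical table mc2 with a dedicated dummy column, a preference mymds2, and an index mc2_idx; none of these names come from the product.

```sql
-- Assumed setup: the index is built on the dummy column, while the
-- preference lists the real text columns (name, address).
create table mc2(id number primary key, name varchar2(30),
                 address varchar2(80), dummy char(1));
insert into mc2 values(1, 'John Smith', '123 Main Street', null);
commit;

begin
  ctx_ddl.create_preference('mymds2', 'MULTI_COLUMN_DATASTORE');
  ctx_ddl.set_attribute('mymds2', 'columns', 'name, address');
end;
/

create index mc2_idx on mc2(dummy)
  indextype is ctxsys.context
  parameters ('datastore mymds2');

-- Updating address alone would not queue the row for re-indexing.
-- Including the dummy column in the same statement does.
update mc2 set address = '456 Oak Avenue', dummy = dummy where id = 1;
commit;

-- Apply the pending change to the index.
exec ctx_ddl.sync_index('mc2_idx');
```

A trigger on the table can perform the same dummy-column assignment automatically when application code cannot be changed.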

2.2.2.2 MULTI_COLUMN_DATASTORE Restriction

You cannot create a multicolumn datastore with XMLType columns. MULTI_COLUMN_DATASTORE does not support XMLType. You can create a CONTEXT index with an XMLType column, as described in Chapter 1, "Oracle Text SQL Statements and Operators".

2.2.2.3 MULTI_COLUMN_DATASTORE Example

The following example creates a multicolumn datastore preference called my_multi with three text columns:

begin

ctx_ddl.create_preference('my_multi', 'MULTI_COLUMN_DATASTORE');
ctx_ddl.set_attribute('my_multi', 'columns', 'column1, column2, column3');

end;

2.2.2.4 MULTI_COLUMN_DATASTORE Filter Example

The following example creates a multicolumn datastore preference and denotes that the bar column is to be filtered with the AUTO_FILTER.

ctx_ddl.create_preference('MY_MULTI','MULTI_COLUMN_DATASTORE');
ctx_ddl.set_attribute('MY_MULTI', 'COLUMNS','foo,bar');
ctx_ddl.set_attribute('MY_MULTI','FILTER','N,Y');

The multicolumn datastore fetches the content of the foo and bar columns, filters bar, then composes the compound document as:

<FOO>
foo contents
</FOO>
<BAR>
bar filtered contents (probably originally HTML)
</BAR>

The N flags do not need to be specified, and there does not need to be a flag for every column. Only the Y flags must be specified, with commas to denote the columns to which they apply. For instance:

ctx_ddl.create_preference('MY_MULTI','MULTI_COLUMN_DATASTORE');
ctx_ddl.set_attribute('MY_MULTI', 'COLUMNS','foo,bar,zoo,jar');
ctx_ddl.set_attribute('MY_MULTI','FILTER',',,Y');

This filters only the column zoo.

2.2.2.5 Tagging Behavior

During indexing, the system creates a virtual document for each row. The virtual document is composed of the contents of the columns concatenated in the listing order with column name tags automatically added. For example:

create table mc(id number primary key, name varchar2(10), address varchar2(80));
insert into mc values(1, 'John Smith', '123 Main Street');

exec ctx_ddl.create_preference('mymds', 'MULTI_COLUMN_DATASTORE');
exec ctx_ddl.set_attribute('mymds', 'columns', 'name, address');

This produces the following virtual text for indexing:

<NAME>
John Smith
</NAME>
<ADDRESS>
123 Main Street
</ADDRESS>

The system indexes the text between the tags, ignoring the tags themselves.

2.2.2.6 Indexing Columns as Sections

To index the tags as sections, you can optionally create field sections with BASIC_SECTION_GROUP.

Note:

No section group is created when you use the MULTI_COLUMN_DATASTORE. To create sections for these tags, you must create a section group.
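As an illustration only, the tags generated for the mc table in the previous section could be mapped to field sections as follows. The section group name mc_sg and the index name mc_idx are hypothetical, and the sketch assumes that the index is created on the name column because mc has no separate dummy column.

```sql
begin
  ctx_ddl.create_section_group('mc_sg', 'BASIC_SECTION_GROUP');
  -- arguments: group, section name, tag generated from the column name, visible
  ctx_ddl.add_field_section('mc_sg', 'name', 'NAME', true);
  ctx_ddl.add_field_section('mc_sg', 'address', 'ADDRESS', true);
end;
/

create index mc_idx on mc(name)
  indextype is ctxsys.context
  parameters ('datastore mymds section group mc_sg');

-- Restrict the search to text that came from the address column
select id from mc where contains(name, 'Main within address') > 0;
```

Because the field sections are created as visible, the same text can also be found by a query that does not use WITHIN.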

When you use expressions or functions, the tag is composed of the first 30 characters of the expression unless a column alias is used.

For example, if your expression is as follows:

exec ctx_ddl.set_attribute('mymds', 'columns', '4 + 17');

then it produces the following virtual text:

<4 + 17>
21
</4 + 17>

If your expression is as follows:

exec ctx_ddl.set_attribute('mymds', 'columns', '4 + 17 col1');

then it produces the following virtual text:

<COL1>
21
</COL1>

The tags are in uppercase unless the column name or column alias is in lowercase and surrounded by double quotation marks. For example:

exec ctx_ddl.set_attribute('mymds', 'COLUMNS', 'foo');

This produces the following virtual text:

<FOO>
content of foo
</FOO>

For lowercase tags, use the following:

exec ctx_ddl.set_attribute('mymds', 'COLUMNS', 'foo "foo"');

This expression produces:

<foo>
content of foo
</foo>

2.2.3 DETAIL_DATASTORE

Use the DETAIL_DATASTORE type for text stored directly in the database in detail tables, with the indexed text column located in the master table.

The DETAIL_DATASTORE type has the attributes described in Table 2-3.

Table 2-3 DETAIL_DATASTORE Attributes

Attribute	Attribute Value
binary	Specify `TRUE` for Oracle Text to add no newline character after each detail row. Specify `FALSE` for Oracle Text to add a newline character (\n) after each detail row automatically.
detail_table	Specify the name of the detail table (`OWNER.TABLE` if necessary).
detail_key	Specify the name of the detail table foreign key column.
detail_lineno	Specify the name of the detail table sequence column.
detail_text	Specify the name of the detail table text column.

2.2.3.1 Synchronizing Master/Detail Indexes

Changes to the detail table do not trigger re-indexing when you synchronize the index. Only changes to the indexed column in the master table trigger a re-index when you synchronize the index.

You can create triggers on the detail table to propagate changes to the indexed column in the master table row.

2.2.3.2 Example Master/Detail Tables

This example illustrates how master and detail tables are related to each other.

2.2.3.2.1 Master Table Example

Master tables define the documents in a master/detail relationship. Assign an identifying number to each document. The following table is an example master table, called my_master:

Column Name	Column Type	Description
`article_id`	`NUMBER`	Document ID, unique for each document (primary key)
`author`	`VARCHAR2(30)`	Author of document
`title`	`VARCHAR2(50)`	Title of document
`body`	`CHAR(1)`	Dummy column to specify in `CREATE` `INDEX`

Note:

Your master table must include a primary key column when you use the DETAIL_DATASTORE type.

2.2.3.2.2 Detail Table Example

Detail tables contain the text for a document, whose content is usually stored across a number of rows. The following detail table my_detail is related to the master table my_master with the article_id column. This column identifies the master document to which each detail row (sub-document) belongs.

Column Name	Column Type	Description
`article_id`	`NUMBER`	Document ID that relates to master table
`seq`	`NUMBER`	Sequence of document in the master document defined by `article_id`
`text`	`VARCHAR2`	Document text

2.2.3.2.3 Detail Table Example Attributes

In this example, the DETAIL_DATASTORE attributes have the following values:

Attribute	Attribute Value
`binary`	`TRUE`
`detail_table`	`my_detail`
`detail_key`	`article_id`
`detail_lineno`	`seq`
`detail_text`	`text`

Use CTX_DDL.CREATE_PREFERENCE to create a preference with DETAIL_DATASTORE. Use CTX_DDL.SET_ATTRIBUTE to set the attributes for this preference as described earlier. The following example shows how this is done:

begin

ctx_ddl.create_preference('my_detail_pref', 'DETAIL_DATASTORE');
ctx_ddl.set_attribute('my_detail_pref', 'binary', 'true');
ctx_ddl.set_attribute('my_detail_pref', 'detail_table', 'my_detail');
ctx_ddl.set_attribute('my_detail_pref', 'detail_key', 'article_id');
ctx_ddl.set_attribute('my_detail_pref', 'detail_lineno', 'seq');
ctx_ddl.set_attribute('my_detail_pref', 'detail_text', 'text');

end;

2.2.3.2.4 Master/Detail Index Example

To index the document defined in this master/detail relationship, specify a column in the master table using the CREATE INDEX statement. The column you specify must be one of the allowed types.

This example uses the body column, whose function is to enable the creation of the master/detail index and to improve readability of the code. The my_detail_pref preference is set to DETAIL_DATASTORE with the required attributes:

CREATE INDEX myindex on my_master(body) indextype is ctxsys.context
parameters('datastore my_detail_pref');

In this example, you can also specify the title or author column to create the index. However, if you do so, changes to these columns will trigger a re-index operation.
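As noted under "Synchronizing Master/Detail Indexes", changes to the detail table alone are not picked up. One possible way to propagate them, sketched here with the my_master and my_detail tables from this example and a hypothetical trigger name, is a row-level trigger that touches the indexed body column of the affected master rows:

```sql
create or replace trigger my_detail_touch
  after insert or update or delete on my_detail
  for each row
begin
  -- Re-assign body to itself so the master row is queued for re-indexing.
  -- :old is null on INSERT and :new is null on DELETE; IN ignores the null.
  update my_master
     set body = body
   where article_id in (:new.article_id, :old.article_id);
end;
/

-- After the detail changes are committed, apply them to the index.
exec ctx_ddl.sync_index('myindex');
```

This is a sketch under the stated assumptions rather than a recommended design; a statement-level approach may be preferable when many detail rows change at once.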

2.2.4 FILE_DATASTORE

The FILE_DATASTORE type is used for text stored in files accessed through the local file system.

Note:

The FILE_DATASTORE type may not work with certain types of remote-mounted file systems.

The FILE_DATASTORE type has the attributes described in Table 2-4.

Table 2-4 FILE_DATASTORE Attributes

Attribute	Attribute Value
`path`	path1:path2:pathn
`filename_charset`	name

path

Specifies the full directory path name of the files stored externally in a file system. When you specify the full directory path as such, you need to include only file names in your text column.

You can specify multiple paths for the path attribute, with each path separated by a colon (:) on UNIX and a semicolon (;) on Windows. File names are stored in the text column in the text table.

If you do not specify a path for external files with this attribute, then Oracle Text requires that the path be included in the file names stored in the text column.

filename_charset

Specifies a valid Oracle character set name (maximum length 30 characters) to be used by the file datastore for converting file names. In general, the Oracle database can use a different character set than the operating system. This can lead to problems in finding files (which may raise DRG-11513 errors) when the indexed column contains characters that are not convertible to the operating system character set. By default, the file datastore will convert the file name to WE8ISO8859p1 for ASCII platforms or WE8EBCDIC1047 for EBCDIC platforms.

However, this may not be sufficient for applications with multibyte character sets for both the database and the operating system, because neither WE8ISO8859p1 nor WE8EBCDIC1047 supports multibyte characters. The attribute filename_charset rectifies this problem. If specified, then the datastore will convert from the database character set to the specified character set rather than to ISO8859 or EBCDIC.

If the filename_charset attribute is the same as the database character set, then the file name is used as is. If filename_charset is not a valid character set, then the error "DRG-10763: value %s is not a valid character set" is raised.
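For illustration, a file datastore preference that converts file names to a multibyte character set might be defined as follows. The preference name JA_DIR and the path are hypothetical, and JA16EUC is only an assumed example of a character set that matches the operating system.

```sql
begin
  ctx_ddl.create_preference('JA_DIR', 'FILE_DATASTORE');
  ctx_ddl.set_attribute('JA_DIR', 'PATH', '/mydocs');
  -- convert file names from the database character set to JA16EUC
  ctx_ddl.set_attribute('JA_DIR', 'FILENAME_CHARSET', 'JA16EUC');
end;
/
```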

2.2.4.1 PATH Attribute Limitations

The PATH attribute has the following limitations:

If you specify a PATH attribute, then you can only use a simple file name in the indexed column. You cannot combine the PATH attribute with a path as part of the file name. If the files exist in multiple folders or directories, you must leave the PATH attribute unset and include the full file name, with its path, in the indexed column.
On Windows systems, the files must be located on a local drive. They cannot be on a remote drive, whether or not the remote drive is mapped to a local drive letter.

2.2.4.2 FILE_DATASTORE and Security

File and URL datastores enable access to files on the actual database disk. This may be undesirable when security is an issue, because any user can browse the file system that is accessible to the Oracle user.

The FILE_ACCESS_ROLE system parameter can be used to set the name of a database role that is authorized to create an index using FILE or URL datastores. If it is set, any user attempting to create an index using FILE or URL datastores must have this role, or the index creation fails. Only SYS can set FILE_ACCESS_ROLE, and an error is raised if any other user tries to modify it.

If FILE_ACCESS_ROLE is left at the default of NULL, access is disallowed. Thus, by default, users are not able to create indexes that use the file or URL datastores. To preserve the behavior of earlier releases, FILE_ACCESS_ROLE can be set to PUBLIC.

For example, the following statement sets the name of the database role:

ctx_adm.set_parameter('FILE_ACCESS_ROLE','TOPCAT');

where TOPCAT is the role that is authorized to create an index on a file or URL datastore. The CREATE INDEX operation will fail when a user that does not have an authorized role tries to create an index on a file or URL datastore. For example:

CREATE INDEX myindex ON mydocument(TEXT) INDEXTYPE IS ctxsys.context  PARAMETERS('DATASTORE ctxsys.file_datastore');

In this case, if the user does not have the role TOPCAT, then index creation will fail and return an error. For users who have the TOPCAT role, the index creation will proceed normally.

The authorized role name is checked any time the datastore is accessed. This includes index creation, index sync, and calls to document services, such as CTX_DOC.HIGHLIGHT.
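Putting the steps of this section together, an administrator session might look like the following sketch. It assumes that the role TOPCAT does not yet exist and that appowner is the (hypothetical) user who needs to create file or URL datastore indexes.

```sql
-- Run as SYS, because only SYS can set FILE_ACCESS_ROLE.
create role topcat;
grant topcat to appowner;

exec ctxsys.ctx_adm.set_parameter('FILE_ACCESS_ROLE', 'TOPCAT');

-- Check the current setting.
select par_value from ctxsys.ctx_parameters
 where par_name = 'FILE_ACCESS_ROLE';
```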

2.2.4.3 FILE_DATASTORE Example

This example creates a file datastore preference called COMMON_DIR that has a path of /mydocs:

begin
 ctx_ddl.create_preference('COMMON_DIR','FILE_DATASTORE');
 ctx_ddl.set_attribute('COMMON_DIR','PATH','/mydocs');
end;

When you populate the table mytable, you need only insert file names. The path attribute tells the system where to look during the indexing operation.

create table mytable(id number primary key, docs varchar2(2000)); 
insert into mytable values(111555,'first.txt');
insert into mytable values(111556,'second.txt');
commit;

Create the index as follows:

create index myindex on mytable(docs)
  indextype is ctxsys.context
  parameters ('datastore COMMON_DIR');

2.2.5 URL_DATASTORE

Use the URL_DATASTORE type for text stored:

In files on the World Wide Web (accessed through HTTP or FTP)
In files in the local file system (accessed through the file protocol)

Store each URL in a single text field.

2.2.5.1 URL Syntax

The syntax of a URL you store in a text field is as follows (with brackets indicating optional parameters):

[URL:]<access_scheme>://<host_name>[:<port_number>]/[<url_path>]

The access_scheme string can be ftp, http, or file. For example:

http://mymachine.us.oracle.com/home.html

Note:

The login:password@ syntax within the URL is supported only for the ftp access scheme.

Because this syntax is partially compliant with the RFC 1738 specification, the following restriction holds for the URL syntax: The URL must contain only printable ASCII characters. Non-printable ASCII characters and multibyte characters must be escaped with the %xx notation, where xx is the hexadecimal representation of the special character.

2.2.5.2 URL_DATASTORE Attributes

URL_DATASTORE has the following attributes:

Table 2-5 URL_DATASTORE Attributes

Attribute	Attribute Value
`timeout`	The value of this attribute is ignored. This is provided for backward compatibility.
`maxthreads`	The value of this attribute is ignored. `URL_DATASTORE` is single-threaded. This is provided for backward compatibility.
`urlsize`	The value of this attribute is ignored. This is provided for backward compatibility.
`maxurls`	The value of this attribute is ignored. This is provided for backward compatibility.
`maxdocsize`	The value of this attribute is ignored. This is provided for backward compatibility.
`http_proxy`	Specify the host name of http proxy server. Optionally specify port number with a colon in the form `hostname:port`.
`ftp_proxy`	Specify the host name of ftp proxy server. Optionally specify port number with a colon in the form `hostname:port`.
`no_proxy`	Specify the domain for no proxy server. Use a comma-separated string of up to 16 domain names.

timeout

The value of this attribute is ignored. This is provided for backward compatibility.

maxthreads

The value of this attribute is ignored. URL_DATASTORE is single-threaded. This is provided for backward compatibility.

urlsize

The value of this attribute is ignored. This is provided for backward compatibility.

maxdocsize

The value of this attribute is ignored. This is provided for backward compatibility.

maxurls

The value of this attribute is ignored. This is provided for backward compatibility.

http_proxy

Specify the fully qualified name of the host machine that serves as the HTTP proxy (gateway) for the machine on which Oracle Text is installed. You can optionally specify port number with a colon in the form hostname:port.

You must set this attribute if the machine is in an intranet that requires authentication through a proxy server to access Web files located outside the firewall.

ftp_proxy

Specify the fully qualified name of the host machine that serves as the FTP proxy (gateway) for the server on which Oracle Text is installed. You can optionally specify a port number with a colon in the form hostname:port.

This attribute must be set if the machine is in an intranet that requires authentication through a proxy server to access Web files located outside the firewall.

no_proxy

Specify a string of domains (up to sixteen, separated by commas) that are found in most, if not all, of the machines in your intranet. When one of the domains is encountered in a host name, no request is sent to the server(s) specified for ftp_proxy and http_proxy. Instead, the request is processed directly by the host machine identified in the URL.

For example, if the string us.example.com, uk.example.com is entered for no_proxy, any URL requests to machines that contain either of these domains in their host names are not processed by your proxy server(s).

2.2.5.3 URL_DATASTORE and Security

For a discussion of how to control file access security for file and URL datastores, refer to "FILE_DATASTORE and Security".

2.2.5.4 URL_DATASTORE Example

This example creates a URL_DATASTORE preference called URL_PREF for which the http_proxy, no_proxy, and timeout attributes are set. The timeout value is accepted for backward compatibility but ignored, as described in Table 2-5. The defaults are used for the attributes that are not set.

begin
 ctx_ddl.create_preference('URL_PREF','URL_DATASTORE');
 ctx_ddl.set_attribute('URL_PREF','HTTP_PROXY','www-proxy.us.oracle.com');
 ctx_ddl.set_attribute('URL_PREF','NO_PROXY','us.oracle.com');
 ctx_ddl.set_attribute('URL_PREF','Timeout','300');
end;

Create the table and insert values into it:

create table urls(id number primary key, docs varchar2(2000));
insert into urls values(111555,'http://context.us.oracle.com');
insert into urls values(111556,'http://www.sun.com');
commit;

To create the index, specify URL_PREF as the datastore:

create index datastores_text on urls ( docs ) 
  indextype is ctxsys.context 
  parameters ( 'Datastore URL_PREF' );

2.2.6 USER_DATASTORE

Use the USER_DATASTORE type to define stored procedures that synthesize documents during indexing. For example, a user procedure might synthesize author, date, and text columns into one document to have the author and date information be part of the indexed text.

USER_DATASTORE has the following attributes:

Table 2-6 USER_DATASTORE Attributes

Attribute	Attribute Value
`procedure`	Specify the procedure that synthesizes the document to be indexed. This procedure can be owned by any user and must be executable by the index owner.
`output_type`	Specify the data type of the second argument to procedure. Valid values are `CLOB`, `BLOB`, `CLOB_LOC`, `BLOB_LOC`, or `VARCHAR2`. The default is `CLOB`. When you specify `CLOB_LOC` or `BLOB_LOC`, you indicate that no temporary `CLOB` or `BLOB` is needed, because your procedure copies a locator to the `IN`/`OUT` second parameter.

procedure

Specify the name of the procedure that synthesizes the document to be indexed. This specification must be in the form PROCEDURENAME or PACKAGENAME.PROCEDURENAME. You can also specify the schema owner name.

The procedure you specify must have two arguments defined as follows:

procedure (r IN ROWID, c IN OUT NOCOPY output_type)

The first argument r must be of type ROWID. The second argument c must be of type output_type. NOCOPY is a compiler hint that instructs the PL/SQL compiler to pass parameter c by reference if possible.

Note:

The procedure name and its arguments can be named anything. The arguments r and c are used in this example for simplicity.

The stored procedure is called once for each row indexed. Given the rowid of the current row, procedure must write the text of the document into its second argument, whose type you specify with output_type.

2.2.6.1 Constraints

The following constraints apply to procedure:

It can be owned by any user, but the user must have database permissions to execute procedure correctly
It must be executable by the index owner
It must not issue DDL or transaction control statements, such as COMMIT

2.2.6.2 Editing Procedure after Indexing

When you change or edit the stored procedure, indexes based on it are not notified, so you must manually re-create such indexes. Likewise, if the stored procedure makes use of other columns and those column values change, the row is not re-indexed. The row is re-indexed only when the indexed column changes.

2.2.6.3 USER_DATASTORE with CLOB Example

Consider a table in which the author, title, and text fields are separate, as in the articles table defined as follows:

create table articles( 
    id       number, 
    author   varchar2(80), 
    title    varchar2(120), 
    text     clob );

The author and title fields are to be part of the indexed document text. Assume user appowner writes a stored procedure with the user datastore interface that synthesizes a document from the text, author, and title fields:

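create procedure myproc(rid in rowid, tlob in out nocopy clob) is
begin
  for c1 in (select author, title, text from articles
              where rowid = rid)
  loop
    -- title and author are VARCHAR2 values, appended as character buffers;
    -- a newline after each keeps adjacent words from running together
    dbms_lob.writeappend(tlob, length(c1.title), c1.title);
    dbms_lob.writeappend(tlob, 1, chr(10));
    dbms_lob.writeappend(tlob, length(c1.author), c1.author);
    dbms_lob.writeappend(tlob, 1, chr(10));
    -- text is a CLOB, so it is appended LOB-to-LOB
    dbms_lob.append(tlob, c1.text);
  end loop;
end;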

This procedure takes in a rowid and a temporary CLOB locator, and concatenates all the article's columns into the temporary CLOB. The for loop executes only once.

The user appowner creates the preference as follows:

begin

ctx_ddl.create_preference('myud', 'user_datastore'); 
ctx_ddl.set_attribute('myud', 'procedure', 'myproc'); 
ctx_ddl.set_attribute('myud', 'output_type', 'CLOB');

end;

When appowner creates the index on articles(text) using this preference, the indexing operation sees author and title in the document text.
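A possible form of that statement is sketched below. The index name articles_idx and the query term are hypothetical; the preference myud is the one created above.

```sql
create index articles_idx on articles(text)
  indextype is ctxsys.context
  parameters ('datastore myud');

-- Author and title text is part of each indexed document, so a query on
-- the text column can match it (exact word boundaries depend on how
-- myproc joins the fields).
select id from articles where contains(text, 'Smith') > 0;
```

Only changes to the text column queue a row for re-indexing, so an update that touches only author or title also needs to update text for the change to reach the index.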

2.2.6.4 USER_DATASTORE with BLOB_LOC Example

The following procedure might be used with OUTPUT_TYPE BLOB_LOC:

procedure myds(rid in rowid, dataout in out nocopy blob)
is
  l_dtype varchar2(10);
  l_pk    number;
begin
  select dtype, pk into l_dtype, l_pk from mytable where rowid = rid;
  if (l_dtype = 'MOVIE') then
    select movie_data into dataout from movietab where fk = l_pk;
  elsif (l_dtype = 'SOUND') then
    select sound_data into dataout from soundtab where fk = l_pk;
  end if;
end;

The user appowner creates the preference as follows:

begin

ctx_ddl.create_preference('myud', 'user_datastore'); 
ctx_ddl.set_attribute('myud', 'procedure', 'myds'); 
ctx_ddl.set_attribute('myud', 'output_type', 'blob_loc');

end;

2.2.7 NESTED_DATASTORE

Use the nested datastore type to index documents stored as rows in a nested table.

Table 2-7 NESTED_DATASTORE Attributes

Attribute	Attribute Value
`nested_column`	Specify the name of the nested table column. This attribute is required. Specify only the column name. Do not specify schema owner or containing table name.
`nested_type`	Specify the type of nested table. This attribute is required. You must provide owner name and type.
`nested_lineno`	Specify the name of the attribute in the nested table that orders the lines. This is like `DETAIL_LINENO` in detail datastore. This attribute is required.
`nested_text`	Specify the name of the column in the nested table type that contains the text of the line. This is like `DETAIL_TEXT` in detail datastore. This attribute is required. `LONG` column types are not supported as nested table text columns.
`binary`	Specify `FALSE` for Oracle Text to automatically insert a newline character when synthesizing the document text. If you specify `TRUE`, Oracle Text does not do this. This attribute is not required. The default is `FALSE`.

When using the nested table datastore, you must index a dummy column, because the extensible indexing framework disallows indexing the nested table column. See the example.

DML on the nested table is not automatically propagated to the dummy column used for indexing. For DML on the nested table to be propagated to the dummy column, your application code or trigger must explicitly update the dummy column.

Filter defaults for the index are based on the type of the nested_text column.

During validation, Oracle Text checks that the type exists and that the attributes you specify for nested_lineno and nested_text exist in the nested table type. Oracle Text does not check that the named nested table column exists in the indexed table.

2.2.7.1 NESTED_DATASTORE Example

This section shows an example of using the NESTED_DATASTORE type to index documents stored as rows in a nested table.

2.2.7.1.1 Create the Nested Table

The following code creates a nested table and a storage table mytab for the nested table:

create type nt_rec as object (
  lno number, -- line number
  ltxt varchar2(80) -- text of line
);

create type nt_tab as table of nt_rec;
create table mytab (
   id number primary key, -- primary key
   dummy char(1), -- dummy column for indexing
   doc nt_tab -- nested table
)
nested table doc store as myntab;

2.2.7.1.2 Insert Values into Nested Table

The following code inserts values into the nested table for the parent row with ID equal to 1.

insert into mytab values (1, null, nt_tab());
insert into table(select doc from mytab where id=1) values (1, 'the dog');
insert into table(select doc from mytab where id=1) values (2, 'sat on mat ');
commit;

2.2.7.1.3 Create Nested Table Preferences

The following code sets the preferences and attributes for the NESTED_DATASTORE according to the definitions of the nested table type nt_tab and the parent table mytab:

begin
-- create nested datastore pref
ctx_ddl.create_preference('ntds','nested_datastore'); 

-- nest tab column in main table
ctx_ddl.set_attribute('ntds','nested_column', 'doc'); 

-- nested table type
ctx_ddl.set_attribute('ntds','nested_type', 'scott.nt_tab');

-- lineno column in nested table
ctx_ddl.set_attribute('ntds','nested_lineno','lno');

--text column in nested table
ctx_ddl.set_attribute('ntds','nested_text', 'ltxt');
end;

2.2.7.1.4 Create Index on Nested Table

The following code creates the index using the nested table datastore:

create index myidx on mytab(dummy) -- index dummy column, not nest table
indextype is ctxsys.context parameters ('datastore ntds');

2.2.7.1.5 Query Nested Datastore

The following select statement queries the index built from a nested table:

select * from mytab where contains(dummy, 'dog and mat')>0;
-- returns document 1, because it has dog in line 1 and mat in line 2.
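
Because DML on the nested table is not propagated automatically, a later change to the nested rows must be followed by an explicit update of the dummy column. The following sketch continues the example and assumes that an update that leaves the dummy value unchanged is enough to queue the row for reindexing, and that the index is synchronized manually. The new line of text is hypothetical.

```sql
-- Add a third line to the nested table of document 1.
insert into table(select doc from mytab where id=1) values (3, 'by the door');

-- Touch the indexed dummy column so that the parent row is queued for reindexing.
update mytab set dummy = dummy where id = 1;
commit;

-- Synchronize the index so that the pending change becomes searchable.
begin
  ctx_ddl.sync_index('myidx');
end;
/

-- The new line is now part of the indexed document text.
select id from mytab where contains(dummy, 'door') > 0;
```

In an application, the update of the dummy column could equally be issued from a trigger or from the code that maintains the nested table.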

2.3 Filter Types

Use the filter types to create preferences that determine how text is filtered for indexing. Filters enable word processor documents, formatted documents, plain text, HTML, and XML documents to be indexed.

For formatted documents, Oracle Text stores documents in their native format and uses filters to build interim plain text or HTML versions of the documents. Oracle Text indexes the words derived from the plain text or HTML version of the formatted document. The TMP_DIR environment variable sets the directory path for storing temporary files created by the filter.

To create a filter preference, you must use one of the following types:

Table 2-8 Filter Types

Filter	When Used
CHARSET_FILTER	Character set converting filter.
AUTO_FILTER	Auto filter for filtering formatted documents.
NULL_FILTER	No filtering required. Use for indexing plain text, HTML, or XML documents.
MAIL_FILTER	Use the `MAIL_FILTER` to transform RFC-822 and RFC-2045 messages into text that can be indexed.
USER_FILTER	User-defined external filter to be used for custom filtering.
PROCEDURE_FILTER	User-defined stored procedure filter to be used for custom filtering.

2.3.1 CHARSET_FILTER

Use the CHARSET_FILTER to convert documents from a non-database character set to the character set used by the database.

CHARSET_FILTER has the attribute described in Table 2-9.

Table 2-9 CHARSET_FILTER Attributes

Attribute	Attribute Value
charset	Specify the Globalization Support name of the source character set. If you specify UTF16AUTO, then this filter automatically detects whether the character set is UTF-16 big- or little-endian. Specify JAAUTO for Japanese character set auto-detection. This filter automatically detects the custom character specification in JA16EUC or JA16SJIS and converts to the database character set. This filter is useful in Japanese when your data files have mixed character sets. JAAUTO can only be specified on a database whose character set is JA16EUC, JA16SJIS, or UTF8. Specify `AUTO` to have `CHARSET_FILTER` automatically detect and convert character sets that Oracle Database supports, as shown in Table 2-10.

When the charset column or attribute is set to AUTO, the CHARSET_FILTER automatically detects the document character set and converts the document from the detected character set to the database character set. CHARSET_FILTER can detect the supported character sets shown in Table 2-10.

Table 2-10 Character Sets Supported for CHARSET_FILTER Auto-detection

Character Set
AL16UTF16	JA16EUC
AL32UTF8	JA16SJIS
AR8ISO8859P6	KO16KSC5601
AR8MSWIN1256	TH8TISASCII
CL8ISO8859P5	WE8ISO8859P1
CL8KOI8R	WE8ISO8859P9
CL8MSWIN1251	WE8MSWIN1252
EE8ISO8859P2	ZHS16CGB231280
EE8MSWIN1250	ZHS32GB18030
EL8ISO8859P7	ZHT16BIG5
EL8MSWIN1253

See Also:

Oracle Database Globalization Support Guide for more information about the supported globalization character sets
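
For illustration, a filter preference that relies on auto-detection could be created as follows. The preference name cs_auto_filter is hypothetical.

```sql
begin
  -- cs_auto_filter is a hypothetical preference name.
  ctx_ddl.create_preference('cs_auto_filter', 'CHARSET_FILTER');
  -- AUTO requests detection among the character sets listed in Table 2-10.
  ctx_ddl.set_attribute('cs_auto_filter', 'charset', 'AUTO');
end;
/
```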

2.3.1.1 UTF-16 Big- and Little-Endian Detection

If your character set is UTF-16, then you can specify UTF16AUTO to automatically detect big- or little-endian data. Oracle Text does so by examining the first two bytes of the document row.

If the first two bytes are 0xFE, 0xFF, the document is recognized as big-endian and the remainder of the document minus those two bytes is passed on for indexing.

If the first two bytes are 0xFF, 0xFE, the document is recognized as little-endian and the remainder of the document minus those two bytes is passed on for indexing.

If the first two bytes are anything else, the document is assumed to be big-endian and the whole document including the first two bytes is passed on for indexing.
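
The three rules above can be restated as a small PL/SQL function. This is only an illustrative sketch of the documented decision logic, not the code that Oracle Text runs; the function name and the returned strings are hypothetical.

```sql
-- Illustrative only: mirrors the documented rules, not Oracle Text internals.
create or replace function utf16_byte_order(doc in blob) return varchar2 is
  first_two raw(2);
begin
  -- Read two bytes starting at offset 1 (the first byte of the document).
  first_two := dbms_lob.substr(doc, 2, 1);

  if first_two = hextoraw('FEFF') then
    return 'BIG-ENDIAN, SKIP FIRST TWO BYTES';
  elsif first_two = hextoraw('FFFE') then
    return 'LITTLE-ENDIAN, SKIP FIRST TWO BYTES';
  else
    -- No byte order mark (or an empty document): assume big-endian
    -- and keep every byte.
    return 'BIG-ENDIAN, KEEP ALL BYTES';
  end if;
end;
/
```

A call such as `select utf16_byte_order(to_blob(hextoraw('FFFE4100'))) from dual;` exercises the little-endian branch.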

2.3.1.2 Indexing Mixed-Character Set Columns

A mixed character set column is one that stores documents of different character sets. For example, a text table might store some documents in WE8ISO8859P1 and others in UTF8.

To index a table of documents in different character sets, you must create your base table with a character set column. In this column, specify the document character set on a per-row basis. To index the documents, Oracle Text converts the documents into the database character set.

Character set conversion works with the CHARSET_FILTER. When the charset column is NULL or not recognized, Oracle Text assumes the source character set is the one specified in the charset attribute.

Note:

Character set conversion also works with the AUTO_FILTER when the document format column is set to TEXT.

2.3.1.2.1 Indexing Mixed-Character Set Example

For example, create the table with a charset column:

create table hdocs (
     id number primary key,
     fmt varchar2(10),
     cset varchar2(20),
     text varchar2(80)
);

Create a preference for this filter:

begin
ctx_ddl.create_preference('cs_filter', 'CHARSET_FILTER');
ctx_ddl.set_attribute('cs_filter', 'charset', 'UTF8');
end;
/

Insert plain-text documents and name the character set:

insert into hdocs values(1, 'text', 'WE8ISO8859P1', '/docs/iso.txt');
insert into hdocs values (2, 'text', 'UTF8', '/docs/utf8.txt');
commit;

Create the index and name the charset column:

create index hdocsx on hdocs(text) indextype is ctxsys.context
  parameters ('datastore ctxsys.file_datastore 
  filter cs_filter 
  format column fmt
  charset column cset');

2.3.2 AUTO_FILTER

The AUTO_FILTER is a universal filter that filters most document formats, including PDF and Microsoft Word documents. Use it for indexing both single-format and mixed-format columns. This filter automatically bypasses plain text, HTML, XHTML, SGML, and XML documents.

See Also:

Appendix B, "Oracle Text Supported Document Formats", for a list of the formats supported by AUTO_FILTER, and to learn more about how to set up your environment

Note:

The AUTO_FILTER replaces the INSO_FILTER, which has been deprecated. While every effort has been made to ensure maximal backward compatibility between the two filters, so that applications using INSO_FILTER will continue to work without modification, some differences may arise. Users should therefore use AUTO_FILTER in their new programs and, when possible, replace instances of INSO_FILTER, and any system preferences or constants that make use of it, in older applications.

The AUTO_FILTER preference has the following attributes:

Table 2-11 AUTO_FILTER Attributes

Attribute	Attribute Value
`timeout`	Specify the `AUTO_FILTER` timeout in seconds. Use a number between 0 and 42,949,672. Default is 120. Setting this value to 0 disables the feature. How this wait period is used depends on how you set `timeout_type`. This feature is disabled for rows for which the corresponding charset and format column cause the `AUTO_FILTER` to bypass the row, such as when format is marked `TEXT`. Use this feature to prevent the Oracle Text indexing operation from waiting indefinitely on a hanging filter operation.
`timeout_type`	Specify either `HEURISTIC` or `FIXED`. Default is `HEURISTIC`. Specify `HEURISTIC` for Oracle Text to check every `TIMEOUT` seconds if output from Outside In HTML Export has increased. The operation terminates for the document if output has not increased. An error is recorded in the `CTX_USER_INDEX_ERRORS` view and Oracle Text moves to the next document row to be indexed. Specify `FIXED` to terminate the Outside In HTML Export processing after `TIMEOUT` seconds regardless of whether filtering was progressing normally or just hanging. This value is useful when indexing throughput is more important than taking the time to successfully filter large documents.
`output_formatting`	Setting this attribute has no effect on filter performance or filter output. It is maintained for backward compatibility.
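
As a sketch of how these attributes fit together, the following block creates a filter preference that abandons any single document after a fixed 300 seconds, and then lists indexing errors after the index is built. The preference name my_auto_filter and the timeout value are hypothetical choices.

```sql
begin
  -- my_auto_filter is a hypothetical preference name.
  ctx_ddl.create_preference('my_auto_filter', 'AUTO_FILTER');
  -- Stop filtering a document after 300 seconds, even if output is still growing.
  ctx_ddl.set_attribute('my_auto_filter', 'timeout', '300');
  ctx_ddl.set_attribute('my_auto_filter', 'timeout_type', 'FIXED');
end;
/

-- After indexing, list the rows that failed, including any filter timeouts.
select err_index_name, err_timestamp, err_textkey, err_text
  from ctx_user_index_errors
 order by err_timestamp desc;
```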

2.3.2.1 Indexing Formatted Documents

To index a text column containing formatted documents such as Microsoft Word, use the AUTO_FILTER. This filter automatically detects the document format. Use the CTXSYS.AUTO_FILTER system-defined preference in the parameter clause as follows:

create index hdocsx on hdocs(text) indextype is ctxsys.context
  parameters ('datastore ctxsys.file_datastore 
  filter ctxsys.auto_filter');

Note:

The CTXSYS.AUTO_FILTER replaces CTXSYS.INSO_FILTER, which has been deprecated. Programs making use of CTXSYS.INSO_FILTER should still work. New programs should use CTXSYS.AUTO_FILTER.

2.3.2.2 Explicitly Bypassing Plain Text or HTML in Mixed Format Columns

A mixed-format column is a text column containing more than one document format, such as a column that contains Microsoft Word, PDF, plain text, and HTML documents.

The AUTO_FILTER can index mixed-format columns, automatically bypassing plain text, HTML, and XML documents. However, if you prefer not to depend on the built-in bypass mechanism, you can explicitly tag your rows as text and cause the AUTO_FILTER to ignore the row and not process the document in any way.

The format column in the base table enables you to specify the type of document contained in the text column. You can specify the following document types: TEXT, BINARY, and IGNORE. During indexing, the AUTO_FILTER ignores any document typed TEXT, assuming the charset column is not specified. (The difference between a document with a TEXT format column type and one with an IGNORE type is that the TEXT document is indexed, but ignored by the filter, while the IGNORE document is not indexed at all. Use IGNORE to overlook documents such as image files, or documents in a language that you do not want to index. IGNORE can be used with any filter type.)

To set up the AUTO_FILTER bypass mechanism, you must create a format column in your base table.

For example:

create table hdocs (
     id number primary key,
     fmt varchar2(10),
     text varchar2(80)
);

Assuming you are indexing mostly Word documents, you specify BINARY in the format column to filter the Word documents. Alternatively, to have the AUTO_FILTER ignore an HTML document, specify TEXT in the format column.

For example, the following statements add two documents to the text table, assigning one format as BINARY and the other TEXT:

insert into hdocs values(1, 'binary', '/docs/myword.doc');
insert into hdocs values(2, 'text', '/docs/index.html');
commit;

To create the index, use CREATE INDEX and specify the format column name in the parameter string:

create index hdocsx on hdocs(text) indextype is ctxsys.context
  parameters ('datastore ctxsys.file_datastore 
  filter ctxsys.auto_filter 
  format column fmt');

If you do not specify TEXT or BINARY for the format column, BINARY is used.
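
The IGNORE document type can be used in the same table. As a minimal sketch, the following row points at an image file that should not be indexed at all; the file path is hypothetical.

```sql
-- Row 3 is an image file: it is skipped entirely and never indexed.
insert into hdocs values(3, 'ignore', '/docs/logo.gif');
commit;
```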

Note:

You need not specify the format column in CREATE INDEX when using the AUTO_FILTER.

2.3.2.3 Character Set Conversion With AUTO_FILTER

The AUTO_FILTER converts documents to the database character set when the document format column is set to TEXT. In this case, the AUTO_FILTER looks at the charset column to determine the document character set.

If the charset column value is not an Oracle Text character set name, the document is passed through without any character set conversion.

Note:

You need not specify the charset column when using the AUTO_FILTER.

If you do specify the charset column and do not specify the format column, the AUTO_FILTER works like the CHARSET_FILTER, except that in this case there is no Japanese character set auto-detection.
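
For illustration, an index that names both columns might be created as follows. This sketch assumes the hdocs table has both a format column fmt and a charset column cset, as in the mixed-character set example for CHARSET_FILTER.

```sql
-- Rows with fmt = 'text' are converted from the character set named in cset;
-- rows with fmt = 'binary' are filtered as formatted documents.
create index hdocsx on hdocs(text) indextype is ctxsys.context
  parameters ('datastore ctxsys.file_datastore
  filter ctxsys.auto_filter
  format column fmt
  charset column cset');
```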

See Also:

"CHARSET_FILTER".

2.3.3 NULL_FILTER

Use the NULL_FILTER type when plain text or HTML is to be indexed and no filtering needs to be performed. NULL_FILTER has no attributes.

2.3.3.1 Indexing HTML Documents

If your document set is entirely HTML, Oracle recommends that you use the NULL_FILTER in your filter preference.

For example, to index an HTML document set, specify the system-defined preferences for NULL_FILTER and HTML_SECTION_GROUP as follows:

create index myindex on docs(htmlfile) indextype is ctxsys.context 
  parameters('filter ctxsys.null_filter
  section group ctxsys.html_section_group');

See Also:

For more information on section groups and indexing HTML documents, see "Section Group Types".

2.3.4 MAIL_FILTER

Use MAIL_FILTER to transform RFC-822 and RFC-2045 messages into indexable text. The following limitations apply to the input:

Documents must be US-ASCII.
Lines must not be longer than 1024 bytes.
Documents must be syntactically valid with regard to RFC-822.

Behavior for invalid input is not defined. Some deviations may be robustly handled by the filter without error. Others may result in a fetch-time or filter-time error.

The MAIL_FILTER has the following attributes:

Table 2-12 MAIL_FILTER Attributes

Attribute	Attribute Value
`INDEX_FIELDS`	Specify a colon-separated list of fields to preserve in the output. These fields are transformed to tag markup. For example, if `INDEX_FIELDS` is set to "FROM", then `From: Scott Tiger` becomes `<FROM>Scott Tiger</FROM>`. Only top-level fields are transformed in this way.
`AUTO_FILTER_TIMEOUT`	Specify a timeout value for the `AUTO_FILTER` filtering invoked by the mail filter. Default is 60. (Replaces the `INSO_TIMEOUT` attribute and is backward compatible with `INSO_TIMEOUT`.)
`AUTO_FILTER_OUTPUT_FORMATTING`	Specify either `TRUE` or `FALSE`. Default is `TRUE`. This attribute replaces the previous `INSO_OUTPUT_FORMATTING` attribute. However, it has no effect in the current release.
`PART_FIELD_STYLE`	Specify how fields occurring in lower-level parts and identified by the `INDEX_FIELDS` attribute should be transformed. The fields of the top-level message part identified by `INDEX_FIELDS` are always transformed to tag markup (see the previous description of `INDEX_FIELDS`); `PART_FIELD_STYLE` controls the transformation of subsequent parts; for example, attached e-mails. Possible values include `IGNORE` (the default), in which the part fields are not included for indexing; `TAG`, in which the part field names are transformed to tags, as occurs with top-level part fields; `FIELD`, in which the part field names are preserved as fields, not as tags; and `TEXT`, in which the part field names are eliminated and only the field content is preserved for indexing. See "Mail_Filter Example" for an example of how `PART_FIELD_STYLE` works.

2.3.4.1 Filter Behavior

This filter behaves in the following way for each document:

Read and remove header fields
Decode message body if needed, depending on Content-transfer-encoding field
Take action depending on the Content-Type field value and the user-specified behavior specified in a mail filter configuration file. (See "About the Mail Filter Configuration File".) The possible actions are:
- produce the body in the output text (INCLUDE). If no character set is encountered in the INCLUDE parts in the Content-Type header field, then Oracle defaults to the value specified in the character set column in the base table. Name your populated character set column in the parameter string of the CREATE INDEX command.
- AUTO_FILTER the body contents (AUTO_FILTER directive).
- remove the body contents from the output text (IGNORE)
If no behavior is specified for the type in the configuration file, then the defaults are as follows:
- text/*: produce body in the output text
- application/*: AUTO_FILTER the body contents
- image/*, audio/*, video/*, model/*: ignore
Multipart messages are parsed, and the mail filter applied recursively to each part. Each part is appended to the output.
All text produced will be charset-converted to the database character set, if needed.

2.3.4.2 About the Mail Filter Configuration File

The MAIL_FILTER filter makes use of a mail filter configuration file, which contains directives specifying how a mail document should be filtered. The mail filter configuration file is an editable text file. Here you can override default behavior for each Content-Type. The configuration file also contains IANA-to-Oracle Globalization Support character set name mappings.

The file must be located in ORACLE_HOME/ctx/config. The name of the file to use is stored in the system parameter MAIL_FILTER_CONFIG_FILE. On installation, this parameter is set to drmailfl.txt, which has useful default contents.

Oracle recommends that you create your own mail filter configuration files to avoid overwrite by the installation of a new version or patch set. The mail filter configuration file should be in the database character set.

2.3.4.2.1 Mail Filter Configuration File Structure

The file has two sections, BEHAVIOR and CHARSETS. Indicate the start of the behavior section as follows:

[behavior]

Each following line starts with a MIME type, then whitespace, then a behavior specification. The MIME type can be a full TYPE/SUBTYPE or just TYPE, which applies to all subtypes of that type. A TYPE/SUBTYPE specification overrides a TYPE specification, which overrides default behavior. Behavior can be INCLUDE, AUTO_FILTER, or IGNORE (see "Filter Behavior" for definitions). For instance:

application/zip     IGNORE
application/msword  AUTO_FILTER
model               IGNORE

You cannot specify behavior for "multipart" or "message" types. If you do, such lines are ignored. Duplicate specification for a type replaces earlier specifications.

Comments can be included in the mail configuration file by starting lines with the # symbol.

The charset mapping section begins with

[charsets]

Lines consist of an IANA name, then whitespace, then an Oracle Globalization Support charset name, like:

US-ASCII     US7ASCII
ISO-8859-1   WE8ISO8859P1

This file is the only way the mail filter gets the mappings. There are no defaults.

When you change the configuration file, the changes affect only the documents indexed after that point. You must flush the shared pool after changing the file.
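
The following sketch shows one way the steps above might be carried out. It assumes that MAIL_FILTER_CONFIG_FILE can be set with CTX_ADM.SET_PARAMETER in the same way as other Oracle Text system parameters, and that the statements are run by suitably privileged users. The file name my_mailfl.txt is hypothetical.

```sql
-- Point Oracle Text at a site-specific configuration file.
-- my_mailfl.txt is a hypothetical name; the file must already exist
-- in ORACLE_HOME/ctx/config.
begin
  ctx_adm.set_parameter('mail_filter_config_file', 'my_mailfl.txt');
end;
/

-- Flush the shared pool so that the changed configuration takes effect
-- for documents indexed from this point on.
alter system flush shared_pool;
```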

2.3.4.3 Mail_Filter Example

Suppose there is an e-mail with the following form, in which other e-mails with different subject lines are attached to this e-mail:

To:  somebody@someplace
Subject:  mainheader
Content-Type:  multipart/mixed
. . .
Content-Type: text/plain
X-Ref:  some_value
Subject:  subheader1
. . .
Content-Type:  text/plain
X-Control:  blah blah blah
Subject:  subheader2
. . .

Set INDEX_FIELDS to be "Subject" and, initially, PART_FIELD_STYLE to IGNORE.

begin
  CTX_DDL.CREATE_PREFERENCE('my_mail_filt', 'mail_filter');
  CTX_DDL.SET_ATTRIBUTE('my_mail_filt', 'INDEX_FIELDS', 'subject');
  CTX_DDL.SET_ATTRIBUTE('my_mail_filt', 'PART_FIELD_STYLE', 'ignore');
end;

Now when the index is created, the file will be indexed as follows:

<SUBJECT>mainheader</SUBJECT>

If PART_FIELD_STYLE is instead set to TAG, this becomes:

<SUBJECT>mainheader</SUBJECT>
<SUBJECT>subheader1</SUBJECT>
<SUBJECT>subheader2</SUBJECT>

If PART_FIELD_STYLE is set to FIELD instead, this is the result:

<SUBJECT>mainheader</SUBJECT>
SUBJECT:subheader1
SUBJECT:subheader2

Finally, if PART_FIELD_STYLE is instead set to TEXT, then the result is:

<SUBJECT>mainheader</SUBJECT>
subheader1
subheader2
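
The tag markup produced by the mail filter becomes useful for queries when a section group maps the tag to a section. The following is a minimal sketch under stated assumptions: the section group name mail_sg, the table emails, its column msg, and the index name emails_idx are all hypothetical, and the my_mail_filt preference is the one created in this example.

```sql
begin
  -- mail_sg is a hypothetical section group name.
  ctx_ddl.create_section_group('mail_sg', 'BASIC_SECTION_GROUP');
  -- Map the <SUBJECT> tag emitted by the mail filter to a visible field section.
  ctx_ddl.add_field_section('mail_sg', 'subject', 'SUBJECT', true);
end;
/

-- emails(msg) is a hypothetical table and column that hold the messages.
create index emails_idx on emails(msg) indextype is ctxsys.context
  parameters ('filter my_mail_filt section group mail_sg');

-- Find messages whose subject contains the word mainheader.
select count(*) from emails where contains(msg, 'mainheader within subject') > 0;
```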

2.3.5 USER_FILTER

Use the USER_FILTER type to specify an external filter for filtering documents in a column. USER_FILTER has the following attribute:

Table 2-13 USER_FILTER Attribute

Attribute	Attribute Value
`command`	Specify the name of the filter executable.

CAUTION:

The USER_FILTER type introduces the potential for security threats. A database user granted the CTXAPP role could potentially use USER_FILTER to load a malicious application. Therefore, the DBA must safeguard against any combination of input and output file parameters that would enable the named filter executable to compromise system security.

command

Specify the executable for the single external filter that is used to filter all text stored in a column. If more than one document format is stored in the column, then the external filter specified for command must recognize and handle all such formats.

The executable that you specify must exist in the $ORACLE_HOME/ctx/bin directory on UNIX, and in the %ORACLE_HOME%/ctx/bin directory on Windows.

You must create your user-filter command with two parameters:

The first parameter is the name of the input file to be read.
The second parameter is the name of the output file to be written to.

If all the document formats are supported by AUTO_FILTER, then use AUTO_FILTER instead of USER_FILTER, unless additional tasks besides filtering are required for the documents.

2.3.5.1 Using USER_FILTER with Charset and Format Columns

USER_FILTER bypasses documents that do not need to be filtered. Its behavior is sensitive to the values of the format and charset columns. In addition, USER_FILTER performs character set conversion according to the charset column values.

2.3.5.2 Explicitly Bypassing Plain Text or HTML in Mixed Format Columns

A mixed-format column is a text column containing more than one document format, such as a column that contains Microsoft Word, PDF, plain text, and HTML documents.

The USER_FILTER executable can index mixed-format columns, automatically bypassing textual documents. However, if you prefer not to depend on the built-in bypass mechanism, you can explicitly tag your rows as text and cause the USER_FILTER executable to ignore the row and not process the document in any way.

The format column in the base table enables you to specify the type of document contained in the text column. You can specify the following document types: TEXT, BINARY, and IGNORE. During indexing, the USER_FILTER executable ignores any document typed TEXT, assuming the charset column is not specified. (The difference between a document with a TEXT format column type and one with an IGNORE type is that the TEXT document is indexed, but ignored by the filter, while the IGNORE document is not indexed at all. Use IGNORE to overlook documents such as image files, or documents in a language that you do not want to index. IGNORE can be used with any filter type.)

To set up the USER_FILTER bypass mechanism, you must create a format column in your base table. For example:

create table hdocs (
   id number primary key,
   fmt varchar2(10),
   text varchar2(80)
);

Assuming you are indexing mostly Word documents, you specify BINARY in the format column to filter the Word documents. Alternatively, to have the USER_FILTER executable ignore an HTML document, specify TEXT in the format column.

For example, the following statements add two documents to the text table, assigning one format as BINARY and the other TEXT:

insert into hdocs values(1, 'binary', '/docs/myword.doc');
insert into hdocs values(2, 'text', '/docs/index.html');
commit;

Create the filter preference and name your filter executable in the COMMAND attribute. This example uses upcase.pl, the script shown in "User Filter Example":

begin
  ctx_ddl.create_preference
    (
      preference_name => 'USER_FILTER_PREF',
      object_name     => 'USER_FILTER'
    );
  ctx_ddl.set_attribute ('USER_FILTER_PREF', 'COMMAND', 'upcase.pl');
end;

To create the index, use CREATE INDEX and specify the format column name in the parameter string:

create index hdocsx on hdocs(text) indextype is ctxsys.context
   parameters ('datastore ctxsys.file_datastore
   filter USER_FILTER_PREF
   format column fmt');

If you do not specify TEXT or BINARY for the format column, BINARY is used.

2.3.5.3 Character Set Conversion with USER_FILTER

The USER_FILTER executable converts documents to the database character set when the document format column is set to TEXT. In this case, the USER_FILTER executable looks at the charset column to determine the document character set.

If the charset column value is not an Oracle Text character set name, the document is passed through without any character set conversion.

If you do specify the charset column and do not specify the format column, the USER_FILTER executable works like the CHARSET_FILTER, except that in this case, there is no Japanese character set auto-detection. See "CHARSET_FILTER" for more information regarding CHARSET_FILTER.

2.3.5.4 User Filter Example

The following example shows a Perl script to be used as the user filter. This script converts the input text file specified in the first argument to uppercase and writes the output to the location specified in the second argument.

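#!/usr/local/bin/perl
use strict;
use warnings;

# $ARGV[0] is the input file to read; $ARGV[1] is the output file to write.
open(my $in,  '<', $ARGV[0]) or die "Cannot open input file $ARGV[0]: $!";
open(my $out, '>', $ARGV[1]) or die "Cannot open output file $ARGV[1]: $!";

# Convert each line to uppercase and write it to the output file.
while (my $line = <$in>)
{
  $line =~ tr/a-z/A-Z/;
  print $out $line;
}

close($in);
close($out);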

Assuming that this file is named upcase.pl, create the filter preference as follows:

begin 
  ctx_ddl.create_preference 
    ( 
      preference_name => 'USER_FILTER_PREF', 
      object_name     => 'USER_FILTER' 
    ); 
  ctx_ddl.set_attribute
    ('USER_FILTER_PREF','COMMAND','upcase.pl');
end;

Create the index in SQL*Plus as follows:

create index user_filter_idx on user_filter ( docs ) 
  indextype is ctxsys.context 
  parameters ('FILTER USER_FILTER_PREF');

2.3.6 PROCEDURE_FILTER

Use the PROCEDURE_FILTER type to filter your documents with a stored procedure. The stored procedure is called each time a document needs to be filtered.

Table 2-14 lists the attributes for PROCEDURE_FILTER.

Table 2-14 PROCEDURE_FILTER Attributes

Attribute	Purpose	Allowable Values
`procedure`	Name of the filter stored procedure.	Any procedure. The procedure can be a PL/SQL stored procedure.
`input_type`	Type of input argument for stored procedure.	`VARCHAR2, BLOB, CLOB, FILE`
`output_type`	Type of output argument for stored procedure.	`VARCHAR2, CLOB, FILE`
`rowid_parameter`	Include rowid parameter?	`TRUE/FALSE`
`format_parameter`	Include format parameter?	`TRUE/FALSE`
`charset_parameter`	Include charset parameter?	`TRUE/FALSE`

procedure

Specify the name of the stored procedure to use for filtering. The procedure can be a PL/SQL stored procedure. The procedure can be a safe callout, or call a safe callout.

With the rowid_parameter, format_parameter, and charset_parameter set to FALSE, the procedure can have one of the following signatures:

PROCEDURE(IN BLOB, IN OUT NOCOPY CLOB)
PROCEDURE(IN CLOB, IN OUT NOCOPY CLOB)
PROCEDURE(IN VARCHAR2, IN OUT NOCOPY CLOB)
PROCEDURE(IN BLOB, IN OUT NOCOPY VARCHAR2)
PROCEDURE(IN CLOB, IN OUT NOCOPY VARCHAR2)
PROCEDURE(IN VARCHAR2, IN OUT NOCOPY VARCHAR2)
PROCEDURE(IN BLOB, IN VARCHAR2)
PROCEDURE(IN CLOB, IN VARCHAR2)
PROCEDURE(IN VARCHAR2, IN VARCHAR2)

The first argument is the content of the unfiltered row, output by the datastore. The second argument is for the procedure to pass back the filtered document text.

The procedure attribute is mandatory and has no default.
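
As an illustration of the second signature in the list, PROCEDURE(IN CLOB, IN OUT NOCOPY CLOB), the following sketch defines a filter procedure that converts each document to uppercase and then registers it in a preference. The procedure name upcase_filter, the preference name upcase_pref, and the chunk size are hypothetical, and the sketch does not cover any ownership or privilege requirements that your release may impose on filter procedures.

```sql
create or replace procedure upcase_filter(
  input  in clob,
  output in out nocopy clob
) is
  chunk_size constant pls_integer := 8000;  -- characters read per iteration
  pos        pls_integer := 1;
  total      pls_integer := nvl(dbms_lob.getlength(input), 0);
  buf        varchar2(32767);
begin
  -- Copy the document in chunks, uppercasing each chunk before
  -- appending it to the output CLOB supplied by Oracle Text.
  while pos <= total loop
    buf := upper(dbms_lob.substr(input, chunk_size, pos));
    dbms_lob.writeappend(output, length(buf), buf);
    pos := pos + chunk_size;
  end loop;
end;
/

begin
  -- upcase_pref is a hypothetical preference name.
  ctx_ddl.create_preference('upcase_pref', 'PROCEDURE_FILTER');
  ctx_ddl.set_attribute('upcase_pref', 'procedure', 'upcase_filter');
  ctx_ddl.set_attribute('upcase_pref', 'input_type', 'CLOB');
  ctx_ddl.set_attribute('upcase_pref', 'output_type', 'CLOB');
end;
/
```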

input_type

Specify the type of the input argument of the filter procedure. You can specify one of the following types:

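Type	Description
`BLOB`	The input argument is `IN` `BLOB`. The unfiltered document content is passed in the `BLOB`.
`CLOB`	The input argument is `IN` `CLOB`. The unfiltered document content is passed in the `CLOB`.
`VARCHAR2`	The input argument is `IN` `VARCHAR2`. The unfiltered document content is passed in the `VARCHAR2` variable.
`FILE`	The input argument is `IN` `VARCHAR2`. On entering the filter procedure, the input argument is the name of a temporary file that contains the unfiltered document content. Using a FILE input type is useful only when the procedure is a safe callout, which can read the file.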
The input_type attribute is not mandatory. If not specified, then BLOB is the default.

output_type

Specify the type of output argument of the filter procedure. You can specify one of the following types:

Type	Description
`CLOB`	The output argument is `IN` `OUT` `NOCOPY` `CLOB`. Your procedure must write the filtered content to the `CLOB` passed in.
`VARCHAR2`	The output argument is `IN` `OUT` `NOCOPY` `VARCHAR2`. Your procedure must write the filtered content to the `VARCHAR2` variable passed in.
`FILE`	The output argument must be `IN` `VARCHAR2`. On entering the filter procedure, the output argument is the name of a temporary file. The filter procedure must write the filtered contents to this named file. Using a FILE output type is useful only when the procedure is a safe callout, which can write to the file.

The output_type attribute is not mandatory. If not specified, then CLOB is the default.

rowid_parameter

When you specify TRUE, the rowid of the document to be filtered is passed as the first parameter, before the input and output parameters.

For example, with INPUT_TYPE BLOB, OUTPUT_TYPE CLOB, and ROWID_PARAMETER TRUE, the filter procedure must have the following signature:

procedure(in rowid, in blob, in out nocopy clob)

This attribute is useful when your procedure requires data from other columns or tables. This attribute is not mandatory. The default is FALSE.
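The following sketch shows one way a filter might use the rowid to read another column. The table docs, its title column, and the procedure name title_filter are all hypothetical. The sketch assumes INPUT_TYPE CLOB, OUTPUT_TYPE CLOB, and ROWID_PARAMETER TRUE.

```sql
-- Hypothetical filter: prepends the row's title to the document text
create or replace procedure title_filter(
  rid    in            rowid,
  input  in            clob,
  output in out nocopy clob
) is
  v_title varchar2(200);
begin
  -- Look up a second column of the row being indexed (assumed table and column)
  select title into v_title from docs where rowid = rid;

  if v_title is not null then
    v_title := v_title || ' ';
    dbms_lob.writeappend(output, length(v_title), v_title);
  end if;

  -- Append the document body after the title, skipping empty documents
  if dbms_lob.getlength(input) > 0 then
    dbms_lob.append(output, input);
  end if;
end;
/
```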

format_parameter

When you specify TRUE, the value of the format column of the document being filtered is passed to the filter procedure before input and output parameters, but after the rowid parameter, if enabled.

Specify the name of the format column at index time in the parameters string, using the keyword 'format column <columnname>'. The parameter type must be IN VARCHAR2.

The format column value can be read by means of the rowid parameter, but this attribute enables a single filter to work on multiple table structures, because the format attribute is abstracted and does not require the knowledge of the name of the table or format column.

FORMAT_PARAMETER is not mandatory. The default is FALSE.

charset_parameter

When you specify TRUE, the value of the charset column of the document being filtered is passed to the filter procedure before input and output parameters, but after the rowid and format parameter, if enabled.

Specify the name of the charset column at index time in the parameters string, using the keyword 'charset column <columnname>'. The parameter type must be IN VARCHAR2.

The CHARSET_PARAMETER attribute is not mandatory. The default is FALSE.

2.3.6.1 Parameter Order

ROWID_PARAMETER, FORMAT_PARAMETER, and CHARSET_PARAMETER are all independent. The order is rowid, then format, then charset. However, the filter procedure is passed only the minimum parameters required.

For example, assume that INPUT_TYPE is BLOB and OUTPUT_TYPE is CLOB. If your filter procedure requires all parameters, then the procedure signature must be:

(id IN ROWID, format IN VARCHAR2, charset IN VARCHAR2, input IN BLOB, output IN OUT NOCOPY CLOB)

If your procedure requires only the ROWID, then the procedure signature must be:

(id IN ROWID, input IN BLOB, output IN OUT NOCOPY CLOB)
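A preference that matches the full five-argument signature might be set up as in the following sketch. The preference name allparams_filt and the procedure name my_filter_proc are hypothetical.

```sql
begin
  ctx_ddl.create_preference('allparams_filt', 'procedure_filter');
  ctx_ddl.set_attribute('allparams_filt', 'procedure', 'my_filter_proc');
  ctx_ddl.set_attribute('allparams_filt', 'input_type', 'blob');
  ctx_ddl.set_attribute('allparams_filt', 'output_type', 'clob');
  -- Each flag adds one leading argument, in the order rowid, format, charset
  ctx_ddl.set_attribute('allparams_filt', 'rowid_parameter', 'TRUE');
  ctx_ddl.set_attribute('allparams_filt', 'format_parameter', 'TRUE');
  ctx_ddl.set_attribute('allparams_filt', 'charset_parameter', 'TRUE');
end;
/
```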

2.3.6.2 Procedure Filter Execute Requirements

To create an index using a PROCEDURE_FILTER preference, the index owner must have execute permission on the procedure.
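For example, assuming the filter procedure is CTXSYS.NORMALIZE and the index owner is a hypothetical user named scott, the procedure owner or a suitably privileged user could grant the permission as follows:

```sql
-- scott is an assumed index owner; substitute the actual schema name
grant execute on ctxsys.normalize to scott;
```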

2.3.6.3 Error Handling

The filter procedure can raise any errors needed through the normal PL/SQL raise_application_error facility. These errors are propagated to the CTX_USER_INDEX_ERRORS view or reported to the user, depending on how the filter is invoked.
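The following sketch illustrates the idea with a hypothetical procedure, checked_filter, that rejects empty documents. The error number, message text, and index name MYINDEX are assumptions chosen for illustration.

```sql
create or replace procedure checked_filter(
  input  in            clob,
  output in out nocopy clob
) is
  len integer := dbms_lob.getlength(input);
begin
  if len is null or len = 0 then
    -- User-defined error numbers must be in the range -20000 to -20999
    raise_application_error(-20001, 'checked_filter: empty document');
  end if;
  dbms_lob.copy(output, input, len);
end;
/

-- After indexing, review any errors recorded for the index
select err_textkey, err_text
  from ctx_user_index_errors
 where err_index_name = 'MYINDEX';
```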

2.3.6.4 Procedure Filter Preference Example

Consider a filter procedure CTXSYS.NORMALIZE that you define with the following signature:

PROCEDURE NORMALIZE(id IN ROWID, charset IN VARCHAR2, input IN CLOB, 
output IN OUT NOCOPY VARCHAR2);

To use this procedure as your filter, set up your filter preference as follows:

begin
ctx_ddl.create_preference('myfilt', 'procedure_filter');
ctx_ddl.set_attribute('myfilt', 'procedure', 'normalize');
ctx_ddl.set_attribute('myfilt', 'input_type', 'clob');
ctx_ddl.set_attribute('myfilt', 'output_type', 'varchar2');
ctx_ddl.set_attribute('myfilt', 'rowid_parameter', 'TRUE');
ctx_ddl.set_attribute('myfilt', 'charset_parameter', 'TRUE');
end;
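Because the procedure takes a charset argument, the index must name a charset column in its parameter string. A sketch of the corresponding CREATE INDEX statement follows. The table docs, the text column text, the charset column cset, and the index name myindex are hypothetical.

```sql
create index myindex on docs(text)
  indextype is ctxsys.context
  parameters ('filter myfilt charset column cset');
```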

2.4 Lexer Types

Use the lexer preference to specify the language of the text to be indexed. To create a lexer preference, you must use one of the lexer types described in Table 2-15.

Table 2-15 Lexer Types

Type	Description
AUTO_LEXER	Lexer for indexing columns that contain documents of different languages.
BASIC_LEXER	Lexer for extracting tokens from text in languages that use white-space-delimited words, such as English and most western European languages.
MULTI_LEXER	Lexer for indexing tables containing documents of different languages such as English, German, and Japanese.
CHINESE_VGRAM_LEXER	Lexer for extracting tokens from Chinese text.
CHINESE_LEXER	Lexer for extracting tokens from Chinese text. This lexer offers the following benefits over the `CHINESE_VGRAM` lexer: generates a smaller index; gives better query response time; generates real-world tokens, resulting in better query precision; supports stop words.
JAPANESE_VGRAM_LEXER	Lexer for extracting tokens from Japanese text.
JAPANESE_LEXER	Lexer for extracting tokens from Japanese text. This lexer offers the following advantages over the `JAPANESE_VGRAM` lexer: generates a smaller index; gives better query response time; generates real-world tokens, resulting in better precision.
KOREAN_MORPH_LEXER	Lexer for extracting tokens from Korean text.
USER_LEXER	Lexer you create to index a particular language.
WORLD_LEXER	Lexer for indexing tables containing documents of different languages; autodetects languages in a document.

2.4.1 AUTO_LEXER

Use the AUTO_LEXER type to index columns that contain documents of different languages. It performs language identification, word segmentation, document analysis, part-of-speech tagging, and stemming. The AUTO_LEXER also enables customization of these components. Although parts-of-speech information that is generated by the AUTO_LEXER is not exposed for your use, AUTO_LEXER uses it for context-sensitive or tagged stemming.

At index time, AUTO_LEXER automatically detects the language of the document, and tokenizes and stems the document appropriately. At query time, the language of the query is inherited from the query template. If the query template is not used, or if no language is specified in the query template, then the language of the query is inherited from the session language. Table 2-16 lists the supported languages.

Table 2-16 Languages Supported for AUTO_LEXER

Language
ARABIC	JAPANESE
CATALAN	KOREAN
SIMPLIFIED CHINESE (see Note)	TRADITIONAL CHINESE (see Note)
CROATIAN	POLISH
CZECH	PORTUGUESE
DANISH	ROMANIAN
DUTCH	RUSSIAN
ENGLISH	SERBIAN
FINNISH	SLOVAK
FRENCH	SLOVENIAN
GERMAN	SPANISH
GREEK	SWEDISH
HEBREW	THAI
HUNGARIAN	TURKISH
ITALIAN	NORWEGIAN: NYNORSK
NORWEGIAN: BOKMAL	PERSIAN

Note:

Due to the limitation on the string, Traditional Chinese must be specified as Trad-Chinese. Simplified Chinese must be specified as Simp-Chinese.

2.4.1.1 AUTO_LEXER Attributes Inherited from BASIC_LEXER

The following attributes are used in the same way and have the same effect on the AUTO_LEXER as their corresponding attributes in BASIC_LEXER:

base_letter
base_letter_type
override_base_letter
mixed_case
alternate_spelling

See Also:

"BASIC_LEXER"

2.4.1.2 AUTO_LEXER Language-Independent Attributes

Table 2-17 lists the language-independent attributes available in the AUTO_LEXER.

Table 2-17 AUTO_LEXER Language-Independent Attributes

Attribute	Attribute Value	Description
`language`	<characters> (space-delimited string)	Specifies the possible languages of the input documents. If no language is specified, then `AUTO_LEXER` performs auto detection. If one language is specified, then the language is set manually and `AUTO_LEXER` does not perform auto detection. If more than one language is specified, then `AUTO_LEXER` performs auto detection but limits the detected language to be among the language set. Note: The automatic detection of language is statistically based and, thus, inherently imperfect.
`deriv_stems`	NO (disabled)	Specifies whether the derivational stemming should be used or not. Currently, derivational stemming is only available for English. Hence, the `DERIV_STEMS` has no effect in other languages. Also, when derivational stemming is performed, tagging and tag stemming is not used. As a result, the tagging and tagged stemming client dictionary has no effect on the stemming result.
	YES (default)
`german_decompound`	NO (disabled)	Specifies whether German de-compounding should be performed in the stemmer or not.
	YES (default, enabled for German only)
`sentence_token_limit`	n (default 100)	Specifies the maximum number of tokens allowed in a single sentence. This parameter can affect memory usage for `AUTO_LEXER`. If set too high, then it might cause memory overload. If set too low, then the sentence might be truncated incorrectly.
`index_stems`	NO (disabled)	Specifies whether an index stemmer should be used. If specified as YES, then the stemmer that corresponds to the document language is used, and the stemmer is always configured to maximize document recall. If specified as NO, then queries with stem operators use word list stemming to try to stem the tokens. If word list stemming is not available, then the stem operator is ignored. For documents in Finnish, Swedish, and Dutch, if `index_stems` is set to YES, then compound word stemming is performed automatically, and compounds are always separated into their component stems. Note: If the `INDEX_STEMS` attribute is set to YES, then the `STEMMER` attribute of `BASIC_WORDLIST` is ignored, and the stemmer used by the `AUTO_LEXER` is used at query time to determine the stem of the given query term.
	YES (default)
`base_letter`	NO (disabled)	Specify whether characters that have diacritical marks (umlauts, cedillas, acute accents, and so on) are converted to their base form before being stored in the Text index.
	YES (enabled)
`base_letter_type`	GENERIC (default)	The `GENERIC` value is the default and means that base letter transformation uses one transformation table that applies to all languages.
	SPECIFIC
`override_base_letter`	TRUE FALSE (default)	When `base_letter` is enabled at the same time as `alternate_spelling`, it is sometimes necessary to override `base_letter` to prevent unexpected results from serial transformations.
`mixed_case`	NO (disabled)	Specify whether the lexer leaves the tokens exactly as they appear in the text or converts the tokens to all uppercase. The default is NO (tokens are converted to all uppercase).
	YES (enabled)
`alternate_spelling`	GERMAN (German alternate spelling)
	DANISH (Danish alternate spelling)
	SWEDISH (Swedish alternate spelling)
	NONE (No alternate spelling, default)
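As a sketch of how the language-independent attributes might be set, the following block creates an AUTO_LEXER preference and limits language detection to three candidate languages. The preference name a_lex matches the later examples, and the chosen attribute values are illustrative assumptions.

```sql
begin
  ctx_ddl.create_preference('a_lex', 'auto_lexer');
  -- More than one language: auto detection, limited to this set
  ctx_ddl.set_attribute('a_lex', 'language', 'english german french');
  ctx_ddl.set_attribute('a_lex', 'index_stems', 'YES');
  ctx_ddl.set_attribute('a_lex', 'german_decompound', 'YES');
end;
/
```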

2.4.1.3 AUTO_LEXER Language-Dependent Attributes

Table 2-18 lists the language-dependent attributes available in the AUTO_LEXER. The <language> variable in the attribute name refers to any of the supported language names that are listed in Table 2-16, "Languages Supported for AUTO_LEXER". Examples are provided in "Examples for AUTO_LEXER Language-Dependent Attributes".

Important:

Attribute names must not exceed 30 characters. Therefore, where the <language> variable is specified, the language name may need to be abbreviated in certain instances. For example, traditional_chinese should be abbreviated to trad_chinese and simplified_chinese should be abbreviated to simp_chinese.

Table 2-18 AUTO_LEXER Language-Dependent Attributes

Attribute	Attribute Value	Description
<language>`_prefix_morphemes`	characters (space-delimited string)	Specifies the list of inflectional prefixes that, when enclosed by parentheses, are kept together with the base word. For example, (re)analyze.
<language>`_suffix_morphemes`	characters (space-delimited string)	Specifies the list of inflectional suffixes that, when enclosed by parentheses, are kept together with the base word. For example, file(s).
<language>`_punctuations`	characters (space-delimited string)	Specifies punctuation that breaks sentences.
<language>`_sentence_starts`	characters (space-delimited string)	Specifies text that starts a sentence.
<language>`_non_sent_end_abbr`	characters (space-delimited string)	Specifies abbreviations that do not end sentences.

Table 2-19 Default Values for AUTO_LEXER Language-Dependent Attributes

Attribute	Language	Default Value
<language>`_prefix_morphemes`	All languages	None
<language>`_suffix_morphemes`	English	s es er
	Spanish	ba n s es
	Portuguese	s es
	German	in innen
	French	ne e
	All other languages	None
<language>`_punctuations`	English	. ? !
	Arabic, Catalan, Croatian, Czech, English, Greek, Hebrew, Hungarian, Polish, Romanian, Russian, Serbian, Slovak, Slovenian, Turkish	. ? ! - --
	Bokmal, Danish, Finnish, French, German, Italian, Korean, Nynorsk, Portuguese, Spanish, Swedish	, ? !
	Japanese	[illustration imga.gif; characters not reproduced]
	Persian	[illustration imgb.gif; characters not reproduced]
	Simplified Chinese (abbreviate to simp_chinese)	[illustration imgd.gif; characters not reproduced]
	Thai	[illustration imge.gif; characters not reproduced]
	Traditional Chinese (abbreviate to trad_chinese)	[illustration imgf.gif; characters not reproduced]
<language>`_sentence_starts`	Arabic, Catalan, Croatian, Czech, English, Finnish, French, Greek, Hebrew, Hungarian, Italian, Japanese, Korean, Persian, Polish, Portuguese, Romanian, Russian, Serbian, Simplified Chinese (abbreviate to simp_chinese), Slovak, Slovenian, Thai, Traditional Chinese (abbreviate to trad_chinese), Turkish	"
	Bokmal, Danish, Nynorsk, Swedish	" –
	German	" ,, „
	Spanish	" ¿ ¡
<language>`_non_sent_end_abbr`	Arabic, Polish, Romanian, Russian, Serbian, Slovak, Slovenian, Turkish	e.g. i.e. viz. a.k.a.
	Bokmal	f.eks. f. eks. inkl. 1.kons. 1.sekr. 1.aman. vit.ass. vit. ass. sekr. stortingsrep. stip. prof. kons. hr. gen.sekr. gen. sekr. førstekons. førstesekr. førsteaman. fullm. frk. d.e. d.y. dr. dir. aman. adm.dir. adm. dir.
	Catalan	R.D. pp.
	Croatian, Czech, Greek, Hebrew, Hungarian	e.g. i.e. viz. a.k.a.
	Danish	f.eks. f. eks. inkl. sr. skuesp. sekr. prof. mus. lrs. logr. kgl. insp. hr. hrs. gdr. frk. fr. forst. forf. fm. fmd. esq. d.æ d.æ. d.y. dr. dir. dept.chef civiling. bibl. ass. admn. adj. Skt. H.K.H.
	English, Japanese, Simplified Chinese (abbreviate to simp_chinese), Thai, Traditional Chinese (abbreviate to trad_chinese)	e.g. i.e. viz. a.k.a. Adm. Br. Capt. Cdr. Cmdr. Col. Comdr. Comdt. Dr. Drs. Fr. Gen. Gov. Hon. Ins. Lieut. Lt. Maj. Messrs. Mdm. Mlle. Mlles. Mme. Mmes. Mr. Mrs. Ms. Pres. Prof. Profs. Pvt. Rep. Rev. Revd. Secy. Sen. Sgt. Sra. Srta. St. Ste.
	Finnish	inkl. dipl. prof. hrr. hr. Hrr. Hr. dr. Dr.
	French	c.-à-d. cf. e.g. ex. i.e. Pr. Prof. M. Mr. Mrs. Mme Mmes Mlle Mlles Mgr. MM. Lieut. Gén. Dr. Col.
	German	ca. bzw. e.g. i.e. inkl. Fr. Frl. Mme. Mile. Mag. Stud. Tel. Hr. Hrn. apl.Prof. Prof.
	Italian	e.g. i.e. pag. pagg. tel. T.V. N.H. N.D. comm. col. cav. cap. geom. gen. ing. jr. mr. mons. mar. magg. prof. prof.ssa prof.sse proff. pres. perito ind. p. p.i. sr. s.ten. sottoten. sig. serg. sen. segr. sac. ten. uff. vicepres. vesc. S.S. S.E. avv. app. amm. arch. on. dir. dott. dott.ssa dr. rag.
	Korean	e.g. i.e. a.k.a. Dr. Mr. Mrs. Ms. Prof.
	Nynorsk	f.eks. f. eks. inkl. 1.kons. 1.sekr. 1.aman. vit.ass. vit. ass. sekr. stortingsrep. stip. prof. kons. hr. gen.sekr. gen. sekr. førstekons. førstesekr. førsteaman. fullm. frk. d.e. d.y. dr. dir. aman. adm.dir. adm. dir. fyrstesekr. fyrstekons. fyrsteaman. hr
	Persian	[illustration imgc.gif; characters not reproduced] Dr. Mr. Mrs. Ms. Prof. e.g. i.e. viz. a.k.a.
	Portuguese	cf. Cf. e.g. E.g. i.é. I.é. p.ex. P.ex. pág. pag. Pág. Pag. tel. telef. Tel. Telef. sr. srs. sra. mr. eng. dr. dra. Dr. Dra. V.Ex. V.Exa. S. N. S. Mrs. Eng. Ex. Exa.
	Spanish	e.g. i.e. ej. p.ej. pág. págs. tel. tfno. Fr. Ldo. Lda. Lic. Pbro. D. Dña. Dr. Dres. Dra. Dras. Dn. Mons. Rvdo. Sto. Sta. Sr. Srs. Srta. Srtas. Sres. Sra. Sras. Excmo. Excma. Ilmo. Ilma. Sto. Sta.
	Swedish	inkl. prof. hrr. hr. Hrr. Hr. dr. Dr.

Examples for AUTO_LEXER Language-Dependent Attributes

Example 2-1 <language>_prefix_morphemes

ctx_ddl.set_attribute(
      'a_lex', 'english_prefix_morphemes', 're'
);

Example 2-2 <language>_suffix_morphemes

ctx_ddl.set_attribute(
      'a_lex', 'english_suffix_morphemes', 's es'
);

Example 2-3 <language>_punctuations

ctx_ddl.set_attribute(
      'a_lex', 'english_punctuations', '. ? !'
);

Example 2-4 <language>_sentence_starts

ctx_ddl.set_attribute(
      'a_lex', 'english_sentence_starts', '" '
);

Example 2-5 <language>_non_sent_end_abbr

ctx_ddl.set_attribute(
      'a_lex', 'english_non_sent_end_abbr', 'e.g. a.k.a. Dr.'
);

2.4.1.4 AUTO_LEXER User-Defined Dictionary Attributes

The attributes in this section are language-specific and are used to set the name of user-supplied dictionary files. The attributes share the following behavior:

The value of the attribute specifies only the file name (excluding the file path) of the dictionary. The file should be placed at the following location: $ORACLE_HOME/ctx/data/user.
The set_attribute method does not load the file; it only records the file name. Therefore, the file must be at the specified location when the dictionary is needed. Otherwise, an error will be raised.
The client dictionaries specify the character encoding of the file in the <?xml ...?> markup (for example, <?xml encoding="cp-1252" ?>). If no <?xml ... ?> specification exists and no special identification is given (for example, as for UCS-4), then the system assumes that the encoding is cp-1252, instead of UTF-8 as specified in the XML standard.

The user-supplied dictionary attributes are listed in Table 2-20, and examples are given for each attribute under "Examples for AUTO_LEXER User-Supplied Dictionary Attributes". The values inside the brackets in Table 2-20 refer to parts-of-speech tags.

See Also:

Appendix I, "AUTO_LEXER Parts-of-Speech Tagging" for the detailed list of parts-of-speech tags for each supported language

Table 2-20 AUTO_LEXER User-Supplied Dictionary Attributes

Attribute	Attribute Value	Description
language`_abbr_dict`	Any valid file name.	Specifies the file name of the user-supplied abbreviation dictionary. The abbreviation dictionary lets you add custom abbreviations that should be processed as such. The abbreviation dictionary is used in word segmentation and document analysis to help resolve ambiguity and, therefore, improve sentence/paragraph searches.
language`_tag_dict`	Any valid file name.	Specifies the file name of the user-supplied tagging dictionary. The tagging dictionary lets you add the appropriate part-of-speech tag for words that may not occur in the prepackaged tagging lexicons. The objective of setting this attribute is to improve internal parts-of-speech tagging and, therefore, improve the quality of context-sensitive stemming. This is accomplished by providing higher quality context using the POS information of the surrounding words.
language`_tagged_tagstem_dict`	Any valid file name.	Specifies the file name of the user-supplied tagged stemming dictionary. The tagged stemming dictionary lets you add the appropriate stem for words of a particular part of speech that may not occur in the prepackaged tagged stemming lexicons.
language`_tagstem_dict`	Any valid file name.	Specifies the file name of the user-supplied stemming dictionary. The stemming dictionary lets you add the appropriate stem for words that may not occur in the prepackaged tagged stemming lexicons.
language`_CCJT_dictionary`	Any valid file name.	Specifies the file name of the user-supplied CCJT dictionary. The CCJT dictionaries can be used for each of the languages in a limited way to influence all the following processes: segmentation, tagging, stemming and tagged stemming. (As Thai supports only segmentation and stemming, the client dictionary cannot be used for tagging.) The processing algorithm for the CCJT languages uses pre-calculated probabilities of tag sequences to segment, stem and tag a sentence. For this reason, it cannot be guaranteed that a user-defined entry will always be returned for all parts of speech. However, client dictionary entries tagged `Nn` or `Nn-Prop` will be given precedence and will always be returned in the final results.

Table 2-21 Languages and Default Values for AUTO_LEXER User-Supplied Dictionary Attributes

Attribute	Language	Default Value
language`_abbr_dict`	All	Null
language`_tag_dict`	All	Null
language`_tagstem_dict`	All	Null
language`_CCJT_dictionary`	All	Null

Examples for AUTO_LEXER User-Supplied Dictionary Attributes

Example 2-6 <language>_abbr_dict

begin
  ctx_ddl.set_attribute(
       'a_lex',
       'english_abbr_dict',
       'english_abbr_dict.xml'
  );
end;

Dictionary example:

<explicit-pair-list>
     <item key="inc." analysis="[Abbrev]"/>
     <item key="Inc." analysis="[Abbrev]"/>
</explicit-pair-list>

Example 2-7 <language>_tag_dict

begin
  ctx_ddl.set_attribute('a_lex', 'english_tag_dict', 'english_tag_dict.xml');
end;

Dictionary example:

<explicit-pair-list>
     <item key="Inxight" analysis = "[Prop]"></item>
     <item key="furby" analysis = "[Nn-Indef-Sg]"></item>
     <item key="furbys"
          analysis = "[Nn-Indef-Sg-Gen]"></item>
     <item key="furbyer" analysis = "[Nn-Indef-Pl]"></item>
     <item key="furbyen" analysis = "[Nn-Def-Sg]"></item>
     <item key="furbyene" analysis = "[Nn-Def-Pl]"></item>
</explicit-pair-list>

The values inside the brackets are part-of-speech tags.

Example 2-8 <language>_tagstem_dict

begin
  ctx_ddl.set_attribute(
      'a_lex',
      'english_tagstem_dict',
      'english_tagged_stem_dictionary.xml'
  );
end;

Dictionary example:

<tag-stem-list>
     <item key = "running[V-PrPart]" stem = "run" > </item>
</tag-stem-list>

The values inside the brackets are part-of-speech tags.

Example 2-9 <language>_tagstem_dict

begin
  ctx_ddl.set_attribute(
      'a_lex',
      'english_tagstem_dict',
      'english_tagstem_dict.xml'
  );
end;

Dictionary example:

<explicit-pair-list>
     <item key="running" stem="run"></item>
     <item key="flying" stem="fly"></item>
</explicit-pair-list>

Example 2-10 <language>_CCJT_dictionary

begin
  ctx_ddl.set_attribute(
      'a_lex',
      'thai_ccjt_dictionary',
      'thai_ccjt_dictionary.xml'
  );
end;

CCJT dictionary example:

[illustration ccjt_dict.gif; dictionary content not reproduced]

2.4.2 BASIC_LEXER

Use the BASIC_LEXER type to identify tokens for creating Text indexes for English and all other supported whitespace-delimited languages.

The BASIC_LEXER also enables base-letter conversion, composite word indexing, case-sensitive indexing and alternate spelling for whitespace-delimited languages that have extended character sets.

In English and French, you can use the BASIC_LEXER to enable theme indexing.

Note:

Any processing that the lexer does to tokens before indexing (for example, removal of characters and base-letter conversion) is also performed on query terms at query time. This ensures that the query terms match the form of the tokens in the Text index.

BASIC_LEXER supports any database character set.

BASIC_LEXER has the attributes shown in Table 2-22.

Table 2-22 BASIC_LEXER Attributes

Attribute	Attribute Value
`continuation`	characters
`numgroup`	characters
`numjoin`	characters
`printjoins`	characters
`punctuations`	characters
`skipjoins`	characters
`startjoins`	non-alphanumeric characters that occur at the beginning of a token (string)
`endjoins`	non-alphanumeric characters that occur at the end of a token (string)
`whitespace`	characters (string)
`newline`	NEWLINE (\n) CARRIAGE_RETURN (\r)
`base_letter`	NO (disabled)
	YES (enabled)
`base_letter_type`	GENERIC (default)
	SPECIFIC
`override_base_letter`	TRUE FALSE (default)
`mixed_case`	NO (disabled)
	YES (enabled)
`composite`	DEFAULT (no composite word indexing, default)
	GERMAN (German composite word indexing)
	DUTCH (Dutch composite word indexing)
`index_stems`	0 NONE, 1 ENGLISH, 2 DERIVATIONAL, 3 DUTCH, 4 FRENCH, 5 GERMAN, 6 ITALIAN, 7 SPANISH, 8 ARABIC, 9 BOKMAL, 10 CATALAN, 11 CROATIAN, 12 CZECH, 13 DANISH, 28 DERIVATIONAL_NEW (see Note), 29 DUTCH_NEW (see Note), 27 ENGLISH_NEW (see Note), 14 FINNISH (see Note), 30 FRENCH_NEW (see Note), 31 GERMAN_NEW (see Note), 15 GREEK, 16 HEBREW, 17 HUNGARIAN, 32 ITALIAN_NEW (see Note), 18 NYNORSK, 19 POLISH, 20 PORTUGUESE, 21 ROMANIAN, 22 RUSSIAN, 23 SERBIAN, 24 SLOVAK, 25 SLOVENIAN, 33 SPANISH_NEW (see Note), 26 SWEDISH (see Note). Note: De-compounding word stemming is automatically performed when `index_stems` is set to SWEDISH, FINNISH, or DUTCH_NEW. Note: Seven of the `index_stems` values that are new for this release have a "_NEW" suffix to enable you to use the new stemmers while maintaining backward compatibility with previous releases of Oracle Text.
`index_themes`	YES (enabled)
	NO (disabled, default)
`index_text`	YES (enabled, default)
	NO (disabled)
`prove_themes`	YES (enabled, default)
	NO (disabled)
`theme_language`	AUTO (default)
	(any Globalization Support language)
`alternate_spelling`	GERMAN (German alternate spelling)
	DANISH (Danish alternate spelling)
	SWEDISH (Swedish alternate spelling)
	NONE (No alternate spelling, default)
`new_german_spelling`	YES NO (default)

continuation

Specify the characters that indicate a word continues on the next line and should be indexed as a single token. The most common continuation characters are hyphen '-' and backslash '\'.

numgroup

Specify a single character that, when it appears in a string of digits, indicates that the digits are groupings within a larger single unit.

For example, comma ',' might be defined as a numgroup character because it often indicates a grouping of thousands when it appears in a string of digits.

numjoin

Specify the characters that, when they appear in a string of digits, cause Oracle Text to index the string of digits as a single unit or word.

For example, period '.' can be defined as a numjoin character because it often serves as a decimal point when it appears in a string of digits.

Note:

The default values for numjoin and numgroup are determined by the globalization support initialization parameters that are specified for the database.

In general, a value need not be specified for either numjoin or numgroup when creating a lexer preference for BASIC_LEXER.

printjoins

Specify the non-alphanumeric characters that, when they appear anywhere in a word (beginning, middle, or end), are processed as alphanumeric and included with the token in the Text index. This includes printjoins that occur consecutively.

For example, if the hyphen '-' and underscore '_' characters are defined as printjoins, terms such as pseudo-intellectual and _file_ are stored in the Text index as pseudo-intellectual and _file_.

Note:

If a printjoins character is also defined as a punctuations character, the character is only processed as an alphanumeric character if the character immediately following it is a standard alphanumeric character or has been defined as a printjoins or skipjoins character.
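The hyphen and underscore case described above can be sketched as follows. The preference name pj_lex is hypothetical.

```sql
begin
  ctx_ddl.create_preference('pj_lex', 'basic_lexer');
  -- Treat '_' and '-' as part of the token wherever they appear
  ctx_ddl.set_attribute('pj_lex', 'printjoins', '_-');
end;
/
```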

punctuations

Specify a list of non-alphanumeric characters that, when they appear at the end of a word, indicate the end of a sentence. The defaults are period '.', question mark '?', and exclamation point '!'.

Characters that are defined as punctuations are removed from a token before text indexing. However, if a punctuations character is also defined as a printjoins character, then the character is removed only when it is the last character in the token.

For example, if the period (.) is defined as both a printjoins and a punctuations character, then the following transformations take place during indexing and querying as well:

Token	Indexed Token
.doc	.doc
dog.doc	dog.doc
dog..doc	dog..doc
dog.	dog
dog...	dog..

In addition, BASIC_LEXER uses punctuations characters in conjunction with newline and whitespace characters to determine sentence and paragraph delimiters for sentence/paragraph searching.
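A lexer preference that produces the transformations in the preceding table might be defined as in this sketch, where the period is both a printjoins and a punctuations character. The preference name dot_lex is hypothetical.

```sql
begin
  ctx_ddl.create_preference('dot_lex', 'basic_lexer');
  -- '.' is kept inside tokens such as dog.doc ...
  ctx_ddl.set_attribute('dot_lex', 'printjoins', '.');
  -- ... but removed when it is the last character of a token
  ctx_ddl.set_attribute('dot_lex', 'punctuations', '.?!');
end;
/
```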

skipjoins

Specify the non-alphanumeric characters that, when they appear within a word, identify the word as a single token; however, the characters are not stored with the token in the Text index.

For example, if the hyphen character '-' is defined as a skipjoins, then the word pseudo-intellectual is stored in the Text index as pseudointellectual.

Note:

Printjoins and skipjoins are mutually exclusive. The same characters cannot be specified for both attributes.

startjoins/endjoins

For startjoins, specify the characters that when encountered as the first character in a token explicitly identify the start of the token. The character, as well as any other startjoins characters that immediately follow it, is included in the Text index entry for the token. In addition, the first startjoins character in a string of startjoins characters implicitly ends the previous token.

For endjoins, specify the characters that when encountered as the last character in a token explicitly identify the end of the token. The character, as well as any other endjoins characters that immediately follow it, is included in the Text index entry for the token.

The following rules apply to both startjoins and endjoins:

The characters specified for startjoins/endjoins cannot occur in any of the other attributes for BASIC_LEXER.
startjoins/endjoins characters can occur only at the beginning or end of tokens.

Printjoins differ from endjoins and startjoins in that position does not matter. For example, $35 will be indexed as one token if $ is a startjoin or a printjoin, but as two tokens if it is defined as an endjoin.
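To index a string such as $35 as a single token, the dollar sign could be defined as a startjoins character, as in this sketch. The preference name price_lex is hypothetical.

```sql
begin
  ctx_ddl.create_preference('price_lex', 'basic_lexer');
  -- '$' starts a token and is stored with it, so $35 stays one token
  ctx_ddl.set_attribute('price_lex', 'startjoins', '$');
end;
/
```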

whitespace

Specify the characters that are treated as blank spaces between tokens. BASIC_LEXER uses whitespace characters in conjunction with punctuations and newline characters to identify character strings that serve as sentence delimiters for sentence and paragraph searching.

The predefined default values for whitespace are space and tab. These values cannot be changed. Specifying characters as whitespace characters adds to these defaults.

newline

Specify the characters that indicate the end of a line of text. BASIC_LEXER uses newline characters in conjunction with punctuations and whitespace characters to identify character strings that serve as paragraph delimiters for sentence and paragraph searching.

The only valid values for newline are NEWLINE and CARRIAGE_RETURN (for carriage returns). The default is NEWLINE.

base_letter

Specify whether characters that have diacritical marks (umlauts, cedillas, acute accents, and so on) are converted to their base form before being stored in the Text index. The default is NO (base-letter conversion disabled). For more information on base-letter conversions and base_letter_type, see "Base-Letter Conversion".

base_letter_type

Specify GENERIC or SPECIFIC.

The GENERIC value is the default and means that base letter transformation uses one transformation table that applies to all languages. For more information on base-letter conversions and base_letter_type, see "Base-Letter Conversion".

override_base_letter

When base_letter is enabled at the same time as alternate_spelling, it is sometimes necessary to override base_letter to prevent unexpected results from serial transformations. See "Overriding Base-Letter Transformations with Alternate Spelling". Default is FALSE.

mixed_case

Specify whether the lexer leaves the tokens exactly as they appear in the text or converts the tokens to all uppercase. The default is NO (tokens are converted to all uppercase).

Note:

Oracle Text ensures that word queries match the case sensitivity of the index being queried. As a result, if you enable case sensitivity for your Text index, queries against the index are always case sensitive.
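As a sketch, a case-sensitive lexer that also applies base-letter conversion might be defined as follows. The preference name case_lex is hypothetical.

```sql
begin
  ctx_ddl.create_preference('case_lex', 'basic_lexer');
  -- Keep tokens in their original case; queries become case sensitive
  ctx_ddl.set_attribute('case_lex', 'mixed_case', 'YES');
  -- Convert characters with diacritical marks to their base form
  ctx_ddl.set_attribute('case_lex', 'base_letter', 'YES');
end;
/
```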

composite

Specify whether composite word indexing is disabled or enabled for either GERMAN or DUTCH text. The default is DEFAULT (composite word indexing disabled).

Words that are usually one entry in a German dictionary are not split into composite stems, while words that are not dictionary entries are split into composite stems.

To retrieve the indexed composite stems, you must enter a stem query, such as $bahnhof. The language of the wordlist stemmer must match the language of the composite stems.
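
The following sketch shows one way the pieces could fit together: a lexer with German composite indexing, a wordlist whose stemmer language matches, and a stem query. The table de_docs, its columns, and all preference and index names are hypothetical.

```sql
begin
  -- lexer with composite word indexing for German text
  ctx_ddl.create_preference('de_lexer', 'BASIC_LEXER');
  ctx_ddl.set_attribute('de_lexer', 'composite', 'GERMAN');
  -- the wordlist stemmer language must match the composite language
  ctx_ddl.create_preference('de_wordlist', 'BASIC_WORDLIST');
  ctx_ddl.set_attribute('de_wordlist', 'stemmer', 'GERMAN');
end;
/

-- de_docs(id, text) is an assumed table
create index de_docs_idx on de_docs (text)
  indextype is ctxsys.context
  parameters ('LEXER de_lexer WORDLIST de_wordlist');

-- a stem query retrieves the indexed composite stems
select id from de_docs where contains(text, '$bahnhof') > 0;
```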

2.4.2.1 Stemming User-Dictionaries

You can create a user-dictionary for your own language to customize how words are decomposed. These dictionaries are shown in Table 2-23.

Table 2-23 Stemming User-Dictionaries

Dictionary	Stemmer
`$ORACLE_HOME/ctx/data/frlx/drfr.dct`	French
`$ORACLE_HOME/ctx/data/delx/drde.dct`	German
`$ORACLE_HOME/ctx/data/nllx/drnl.dct`	Dutch
`$ORACLE_HOME/ctx/data/itlx/drit.dct`	Italian
`$ORACLE_HOME/ctx/data/eslx/dres.dct`	Spanish
`$ORACLE_HOME/ctx/data/enlx/dren.dct`	English and Derivational

Stemming user-dictionaries are not supported for languages other than those listed in Table 2-23.

The format for the user dictionary is as follows:

input term <tab> output term

The individual parts of the decomposed word must be separated by the # character. The following example entries are for the German word Hauptbahnhof:

Hauptbahnhof<tab>Haupt#Bahnhof
Hauptbahnhofes<tab>Haupt#Bahnhof
Hauptbahnhoefe<tab>Haupt#Bahnhof

index_themes

Specify YES to index theme information in English or French. This makes ABOUT queries more precise. The index_themes and index_text attributes cannot both be NO. The default is NO.

You can set this parameter to YES for any index type, including CTXCAT. To enter an ABOUT query with CATSEARCH, use the query template with CONTEXT grammar.

Note:

index_themes requires an installed knowledge base. A knowledge base may or may not have been installed with Oracle Text. For more information on knowledge bases, see Oracle Text Application Developer's Guide.

prove_themes

Specify YES to prove themes. Theme proving attempts to find related themes in a document. When no related themes are found, parent themes are eliminated from the document.

While theme proving is acceptable for large documents, short text descriptions with a few words rarely prove parent themes, resulting in poor recall performance with ABOUT queries.

Theme proving results in higher precision and lower recall (fewer rows returned) for ABOUT queries. For higher recall in ABOUT queries and possibly lower precision, you can disable theme proving. The default is YES.

The prove_themes attribute is supported for CONTEXT and CTXRULE indexes.

theme_language

Specify which knowledge base to use for theme generation when index_themes is set to YES. When index_themes is NO, setting this parameter has no effect.

Specify any globalization support language or AUTO. You must have a knowledge base for the language you specify. This release provides knowledge bases only for English and French. For other languages, you can create your own knowledge base.

The default is AUTO, which instructs the system to set this parameter according to the language of the environment.
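
As an illustration of the three theme attributes, the following sketch enables theme indexing against the English knowledge base and disables theme proving, which may suit short text descriptions. The preference name theme_lex and the table product_notes are hypothetical, and the sketch assumes that a knowledge base is installed.

```sql
begin
  ctx_ddl.create_preference('theme_lex', 'BASIC_LEXER');
  -- index themes so that ABOUT queries are more precise
  ctx_ddl.set_attribute('theme_lex', 'index_themes', 'YES');
  -- trade some precision for higher recall on short documents
  ctx_ddl.set_attribute('theme_lex', 'prove_themes', 'NO');
  -- use the English knowledge base rather than the environment language
  ctx_ddl.set_attribute('theme_lex', 'theme_language', 'ENGLISH');
end;
/

-- product_notes(id, note) is an assumed table
create index product_notes_idx on product_notes (note)
  indextype is ctxsys.context
  parameters ('LEXER theme_lex');

select id from product_notes where contains(note, 'about(gardening)') > 0;
```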

index_stems

Specify the stemmer to use for stem indexing. Choose one of the following stemmers:

NONE
ARABIC
CATALAN
CROATIAN
CZECH
DANISH
DERIVATIONAL
DUTCH
ENGLISH
FINNISH
FRENCH
GERMAN
HEBREW
HUNGARIAN
ITALIAN
NORWEGIAN
POLISH
PORTUGUESE
ROMANIAN
SLOVAK
SLOVENIAN
SPANISH
SWEDISH

Tokens are stemmed to a single base form at index time in addition to the normal forms. Indexing stems enables better query performance for stem ($) queries, such as $computed.

Note:

If the index_stems attribute is set to one of the languages with ID 8 to 33, which are listed in Table 2-22, "BASIC_LEXER Attributes", then the stemmer attribute of BASIC_WORDLIST is ignored, and the stemmer used by the BASIC_LEXER is used at query time to determine the stem of the given query term.
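
A minimal sketch of stem indexing, assuming a hypothetical table articles with a body column; the preference and index names are also illustrative.

```sql
begin
  ctx_ddl.create_preference('stem_lex', 'BASIC_LEXER');
  -- store the English stem of each token in addition to its normal form
  ctx_ddl.set_attribute('stem_lex', 'index_stems', 'ENGLISH');
end;
/

-- articles(id, body) is an assumed table
create index articles_idx on articles (body)
  indextype is ctxsys.context
  parameters ('LEXER stem_lex');

-- stem queries such as this one can use the stems stored at index time
select id from articles where contains(body, '$computed') > 0;
```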

index_text

Specify YES to index word information. The index_themes and index_text attributes cannot both be NO.

The default is YES.

alternate_spelling

Specify either GERMAN, DANISH, or SWEDISH to enable the alternate spelling in one of these languages. Enabling alternate spelling enables you to query a word in any of its alternate forms.

Alternate spelling is off by default; however, in the language-specific scripts that Oracle provides in admin/defaults (drdefd.sql for German, drdefdk.sql for Danish, and drdefs.sql for Swedish), alternate spelling is turned on. If your installation uses these scripts, then alternate spelling is on. However, you can specify NONE for no alternate spelling. For more information about the alternate spelling conventions Oracle Text uses, see Alternate Spelling.

new_german_spelling

Specify whether the queries using the BASIC_LEXER return both traditional and reformed (new) spellings of German words. If new_german_spelling is set to YES, then both traditional and new forms of words are indexed. If it is set to NO, then each word is indexed only in the form in which it appears. The default is NO.

See Also:

"New German Spelling"

2.4.2.2 BASIC_LEXER Example

The following example sets printjoin characters and disables theme indexing with the BASIC_LEXER:

begin
  ctx_ddl.create_preference('mylex', 'BASIC_LEXER');
  ctx_ddl.set_attribute('mylex', 'printjoins', '_-');
  ctx_ddl.set_attribute('mylex', 'index_themes', 'NO');
  ctx_ddl.set_attribute('mylex', 'index_text', 'YES');
end;

To create the index with no theme indexing and with printjoin characters set as described, enter the following statement:

create index myindex on mytable ( docs ) 
  indextype is ctxsys.context 
  parameters ( 'LEXER mylex' );

2.4.3 MULTI_LEXER

Use MULTI_LEXER to index text columns that contain documents of different languages. For example, use this lexer to index a text column that stores English, German, and Japanese documents.

This lexer has no attributes.

You must have a language column in your base table. To index multi-language tables, specify the language column when you create the index.

Create a multi-lexer preference with CTX_DDL.CREATE_PREFERENCE. Add language-specific lexers to the multi-lexer preference with the CTX_DDL.ADD_SUB_LEXER procedure.

During indexing, the MULTI_LEXER examines each row's language column value and switches in the language-specific lexer to process the document.

The WORLD_LEXER lexer also performs multi-language indexing, but without the need for separate language columns (that is, it has automatic language detection). For more on WORLD_LEXER, see "WORLD_LEXER".

2.4.3.1 Multi-language Stoplists

When you use the MULTI_LEXER, you can also use a multi-language stoplist for indexing.
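
As a brief sketch, a multi-language stoplist might be created as follows. The stoplist name global_stoplist and the stop words chosen are illustrative only; each stop word is registered for one language.

```sql
begin
  -- 'global_stoplist' is a hypothetical stoplist name
  ctx_ddl.create_stoplist('global_stoplist', 'MULTI_STOPLIST');
  -- each stop word applies only to documents in the given language
  ctx_ddl.add_stopword('global_stoplist', 'the', 'english');
  ctx_ddl.add_stopword('global_stoplist', 'die', 'german');
end;
/
```

The stoplist would then be named in the index parameter string together with the multi-lexer preference and the language column.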

See Also:

"Multi-Language Stoplists".

2.4.3.2 MULTI_LEXER Example

Create the multi-language table with a primary key, a text column, and a language column as follows:

create table globaldoc (
   doc_id number primary key,
   lang varchar2(3),
   text clob
);

Assume that the table holds mostly English documents, with the occasional German or Japanese document. To handle the three languages, you must create three sub-lexers, one for English, one for German, and one for Japanese:

ctx_ddl.create_preference('english_lexer','basic_lexer');
ctx_ddl.set_attribute('english_lexer','index_themes','yes');
ctx_ddl.set_attribute('english_lexer','theme_language','english');

ctx_ddl.create_preference('german_lexer','basic_lexer');
ctx_ddl.set_attribute('german_lexer','composite','german');
ctx_ddl.set_attribute('german_lexer','mixed_case','yes');
ctx_ddl.set_attribute('german_lexer','alternate_spelling','german');

ctx_ddl.create_preference('japanese_lexer','japanese_vgram_lexer');

Create the multi-lexer preference:

ctx_ddl.create_preference('global_lexer', 'multi_lexer');

Because the stored documents are mostly English, make the English lexer the default using CTX_DDL.ADD_SUB_LEXER:

ctx_ddl.add_sub_lexer('global_lexer','default','english_lexer');

Now add the German and Japanese lexers in their respective languages with the CTX_DDL.ADD_SUB_LEXER procedure. Also assume that the language column is expressed in the standard ISO 639-2 language codes, so add those as alternative values.

ctx_ddl.add_sub_lexer('global_lexer','german','german_lexer','ger');
ctx_ddl.add_sub_lexer('global_lexer','japanese','japanese_lexer','jpn');

Now create the index globalx, specifying the multi-lexer preference and the language column in the parameter clause as follows:

create index globalx on globaldoc(text) indextype is ctxsys.context
parameters ('lexer global_lexer language column lang');

2.4.3.3 Querying Multi-Language Tables

At query time, the multi-lexer examines the language setting and uses the sub-lexer preference for that language to parse the query.

If the language is not set, then the default lexer is used. Otherwise, the query is parsed and run as usual. The index contains tokens from multiple languages, so such a query can return documents in several languages. To limit your query to a given language, use a structured clause on the language column.
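
For illustration, and assuming the globaldoc table and globalx index from the preceding example, a query might be limited to German documents as sketched below. The sketch assumes that the language setting examined at query time is the session language (NLS_LANGUAGE) and that German rows store either german or ger in the lang column.

```sql
-- assumption: the session language selects the sub-lexer that parses the query
alter session set nls_language = 'GERMAN';

select doc_id
  from globaldoc
 where contains(text, 'bahnhof') > 0
   -- structured clause on the language column limits the result to German rows
   and lang in ('german', 'ger');
```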

If the language column is set to AUTO, then the multi-lexer detects the language of the document for the supported languages shown in Table 2-24.

Table 2-24 Languages Supported for MULTI_LEXER Auto-detection

Language
ARABIC	JAPANESE
CATALAN	KOREAN
TRADITIONAL CHINESE	NORWEGIAN
CROATIAN	POLISH
CZECH	PORTUGUESE
DANISH	ROMANIAN
DUTCH	RUSSIAN
ENGLISH	LATIN SERBIAN
GERMAN	SLOVAK
GREEK	SWEDISH
HEBREW	THAI
HUNGARIAN	TURKISH
ITALIAN

2.4.4 CHINESE_VGRAM_LEXER

The CHINESE_VGRAM_LEXER type identifies tokens in Chinese text for creating Text indexes.

2.4.4.1 CHINESE_VGRAM_LEXER Attribute

The CHINESE_VGRAM_LEXER has the following attribute:

Table 2-25 CHINESE_VGRAM_LEXER Attributes

Attribute	Attribute Value
`mixed_case_ASCII7`	Enable mixed-case (upper- and lower-case) searches of ASCII7 text (for example, cat and Cat). Allowable values are `YES` and `NO` (default).

2.4.4.2 Character Sets

You can use this lexer if your database uses one of the following character sets:

AL32UTF8
ZHS16CGB231280
ZHS16GBK
ZHS32GB18030
ZHT32EUC
ZHT16BIG5
ZHT32TRIS
ZHT16HKSCS
ZHT16MSWIN950
UTF8

2.4.5 CHINESE_LEXER

The CHINESE_LEXER type identifies tokens in traditional and simplified Chinese text for creating Oracle Text indexes.

This lexer offers the following benefits over the CHINESE_VGRAM_LEXER:

generates a smaller index
better query response time
generates real word tokens resulting in better query precision
supports stop words

Because the CHINESE_LEXER uses a different algorithm to generate tokens, indexing time is longer than with CHINESE_VGRAM_LEXER.

You can use this lexer if your database character set is one of the Chinese or Unicode character sets supported by Oracle.

2.4.5.1 CHINESE_LEXER Attribute

The CHINESE_LEXER has the following attribute:

Table 2-26 CHINESE_LEXER Attributes

Attribute	Attribute Value
`mixed_case_ASCII7`	Enable mixed-case (upper- and lower-case) searches of ASCII7 text (for example, cat and Cat). Allowable values are `YES` and `NO` (default).
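
A minimal sketch of a CHINESE_LEXER preference with mixed-case ASCII7 searching enabled. The preference name zh_lexer, the table zh_docs, and the index name are hypothetical.

```sql
begin
  ctx_ddl.create_preference('zh_lexer', 'CHINESE_LEXER');
  -- keep the case of ASCII7 text, so that cat and Cat are distinct tokens
  ctx_ddl.set_attribute('zh_lexer', 'mixed_case_ASCII7', 'YES');
end;
/

-- zh_docs(id, text) is an assumed table
create index zh_docs_idx on zh_docs (text)
  indextype is ctxsys.context
  parameters ('LEXER zh_lexer');
```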

2.4.5.2 Customizing the Chinese Lexicon

You can modify the existing lexicon (dictionary) used by the Chinese lexer, or create your own Chinese lexicon, with the ctxlc command.

2.4.6 JAPANESE_VGRAM_LEXER

The JAPANESE_VGRAM_LEXER type identifies tokens in Japanese for creating Text indexes. This lexer supports the stem ($) operator.

2.4.6.1 JAPANESE_VGRAM_LEXER Attributes

This lexer has the following attributes:

Table 2-27 JAPANESE_VGRAM_LEXER Attributes

Attribute	Attribute Value
`delimiter`	Specify whether to consider certain Japanese blank characters, such as a full-width forward slash or a full-width middle dot. `ALL` considers these characters, while `NONE` ignores them. Default is `NONE`.
`mixed_case_ASCII7`	Enable mixed-case (upper- and lower-case) searches of ASCII7 text (for example, cat and Cat). Allowable values are `YES` and `NO` (default).

2.4.6.2 JAPANESE_VGRAM_LEXER Character Sets

You can use this lexer if your database uses one of the following character sets:

JA16SJIS
JA16EUC
UTF8
AL32UTF8
JA16EUCTILDE
JA16EUCYEN
JA16SJISTILDE
JA16SJISYEN

2.4.7 JAPANESE_LEXER

The JAPANESE_LEXER type identifies tokens in Japanese for creating Text indexes. This lexer supports the stem ($) operator.

This lexer offers the following benefits over the JAPANESE_VGRAM_LEXER:

generates a smaller index
better query response time
generates real word tokens resulting in better query precision

Because the JAPANESE_LEXER uses a new algorithm to generate tokens, indexing time is longer than with JAPANESE_VGRAM_LEXER.

2.4.7.1 Customizing the Japanese Lexicon

You can modify the existing lexicon (dictionary) used by the Japanese lexer, or create your own Japanese lexicon, with the ctxlc command.

2.4.7.2 JAPANESE_LEXER Attributes

This lexer has the following attributes:

Table 2-28 JAPANESE_LEXER Attributes

Attribute	Attribute Value
`delimiter`	Specify `NONE` or `ALL` to ignore certain Japanese blank characters, such as a full-width forward slash or a full-width middle dot. Default is `NONE`.
`mixed_case_ASCII7`	Enable mixed-case (upper- and lower-case) searches of ASCII7 text (for example, cat and Cat). Allowable values are `YES` and `NO` (default).

2.4.7.3 JAPANESE_LEXER Character Sets

The JAPANESE_LEXER supports the following character sets:

JA16SJIS
JA16EUC
UTF8
AL32UTF8
JA16EUCTILDE
JA16EUCYEN
JA16SJISTILDE
JA16SJISYEN

2.4.7.4 Japanese Lexer Example

When you specify JAPANESE_LEXER to create a Text index, the lexer resolves a sentence into words.

For example, the following compound word (natural language institute)

Description of the illustration nihongo1.gif

is indexed as three tokens:

Description of the illustration nihongo2.gif

To resolve a sentence into words, the internal dictionary is referenced. When a word cannot be found in the internal dictionary, Oracle Text uses the JAPANESE_VGRAM_LEXER to resolve it.

2.4.8 KOREAN_MORPH_LEXER

The KOREAN_MORPH_LEXER type identifies tokens in Korean text for creating Oracle Text indexes.

2.4.8.1 Supplied Dictionaries

The KOREAN_MORPH_LEXER uses four dictionaries:

Table 2-29 KOREAN_MORPH_LEXER Dictionaries

Dictionary	File
System	`$ORACLE_HOME/ctx/data/kolx/drk2sdic.dat`
Grammar	`$ORACLE_HOME/ctx/data/kolx/drk2gram.dat`
Stopword	`$ORACLE_HOME/ctx/data/kolx/drk2xdic.dat`
User-defined	`$ORACLE_HOME/ctx/data/kolx/drk2udic.dat`

The grammar, user-defined, and stopword dictionaries should be written using the KSC 5601 or MSWIN949 character sets. You can modify these dictionaries using the defined rules. The system dictionary must not be modified.

You can add unregistered words to the user-defined dictionary file. The rules for specifying new words are in the file.

2.4.8.2 Supported Character Sets

You can use KOREAN_MORPH_LEXER if your database uses one of the following character sets:

KO16KSC5601
KO16MSWIN949
UTF8
AL32UTF8

The KOREAN_MORPH_LEXER enables mixed-case searches.

2.4.8.3 Unicode Support

The KOREAN_MORPH_LEXER supports:

Words in non-KSC5601 Korean characters defined in Unicode
Supplementary characters

See Also:

For information on supplementary characters, see the Oracle Database Globalization Support Guide

Some Korean documents may have non-KSC5601 characters in them. As the KOREAN_MORPH_LEXER can recognize all possible 11,172 Korean (Hangul) characters, such documents can also be interpreted by using the UTF8 or AL32UTF8 character sets.

Use the AL32UTF8 character set for your database to extract surrogate characters. By default, the KOREAN_MORPH_LEXER extracts all series of surrogate characters in a document as one token for each series.

2.4.8.3.1 Limitations on Korean Unicode Support

For conversion of Hanja to Hangul (Korean), the KOREAN_MORPH_LEXER supports only the 4888 Hanja characters defined in KSC5601.

2.4.8.4 KOREAN_MORPH_LEXER Attributes

When you use the KOREAN_MORPH_LEXER, you can specify the following attributes:

Table 2-30 KOREAN_MORPH_LEXER Attributes

Attribute	Attribute Value
`verb_adjective`	Specify `TRUE` or `FALSE` to index verbs, adjectives, and adverbs. Default is `FALSE`.
`one_char_word`	Specify `TRUE` or `FALSE` to index one-syllable words. Default is `FALSE`.
`number`	Specify `TRUE` or `FALSE` to index numbers. Default is `FALSE`.
`user_dic`	Specify `TRUE` or `FALSE` to use the user-defined dictionary. Default is `TRUE`.
`stop_dic`	Specify `TRUE` or `FALSE` to use the stop-word dictionary. Default is `TRUE`. The stop-word dictionary belongs to `KOREAN_MORPH_LEXER`.
`composite`	Specify indexing style of composite noun. Specify `COMPOSITE_ONLY` to index only composite nouns. Specify `NGRAM` to index all noun components of a composite noun. Specify `COMPONENT_WORD` to index single noun components of composite nouns as well as the composite noun itself. Default is `COMPONENT_WORD`. The following example describes the difference between `NGRAM` and `COMPONENT_WORD`.
`morpheme`	Specify `TRUE` or `FALSE` for morphological analysis. If set to `FALSE`, tokens are created from the words that are divided by delimiters such as white space in the document. Default is `TRUE`.
`to_upper`	Specify `TRUE` or `FALSE` to convert English to uppercase. Default is `TRUE`.
`hanja`	Specify `TRUE` to index hanja characters. If set to `FALSE`, hanja characters are converted to hangul characters. Default is `FALSE`.
`long_word`	Specify `TRUE` to index long words that have more than 16 syllables in Korean. Default is `FALSE`.
`japanese`	Specify `TRUE` to index Japanese characters in Unicode (only in the 2-byte area). Default is `FALSE`.
`english`	Specify `TRUE` to index alphanumeric strings. Default is `TRUE`.
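
As a sketch of how several of these attributes might be combined, the following block indexes verbs, adjectives, adverbs, and numbers, keeps Hanja characters unconverted, and indexes only the composite noun itself. The preference name ko_lexer is hypothetical.

```sql
begin
  ctx_ddl.create_preference('ko_lexer', 'KOREAN_MORPH_LEXER');
  -- index verbs, adjectives, and adverbs in addition to nouns
  ctx_ddl.set_attribute('ko_lexer', 'verb_adjective', 'TRUE');
  -- index numbers
  ctx_ddl.set_attribute('ko_lexer', 'number', 'TRUE');
  -- index Hanja characters instead of converting them to Hangul
  ctx_ddl.set_attribute('ko_lexer', 'hanja', 'TRUE');
  -- index composite nouns only, without their components
  ctx_ddl.set_attribute('ko_lexer', 'composite', 'COMPOSITE_ONLY');
end;
/
```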

2.4.8.5 Limitations

Sentence and paragraph sections are not supported with the KOREAN_MORPH_LEXER.

2.4.8.6 KOREAN_MORPH_LEXER Example: Setting Composite Attribute

Use the composite attribute to control how composite nouns are indexed.

2.4.8.6.1 NGRAM Example

When you specify NGRAM for the composite attribute, composite nouns are indexed with all possible component tokens. For example, the following composite noun (information processing institute)

Description of the illustration a1.jpg

is indexed as six tokens:

Description of the illustration a2.jpg

Description of the illustration a3.jpg

Specify NGRAM indexing as follows:

begin
ctx_ddl.create_preference('my_lexer','KOREAN_MORPH_LEXER');
ctx_ddl.set_attribute('my_lexer','COMPOSITE','NGRAM');
end;

To create the index:

create index koreanx on korean(text) indextype is ctxsys.context
parameters ('lexer my_lexer');

2.4.8.6.2 COMPONENT_WORD Example

When you specify COMPONENT_WORD for the composite attribute, composite nouns and their components are indexed. For example, the following composite noun (information processing institute)

Description of the illustration a1.jpg

is indexed as four tokens:

Description of the illustration a1.jpg

Description of the illustration comp.jpg

Specify COMPONENT_WORD indexing as follows:

begin
ctx_ddl.create_preference('my_lexer','KOREAN_MORPH_LEXER');
ctx_ddl.set_attribute('my_lexer','COMPOSITE','COMPONENT_WORD');
end;

To create the index:

create index koreanx on korean(text) indextype is ctxsys.context
parameters ('lexer my_lexer');

2.4.9 USER_LEXER

Use USER_LEXER to plug in your own language-specific lexing solution. This enables you to define lexers for languages that are not supported by Oracle Text. It also enables you to define a new lexer for a language that is supported but whose lexer is inappropriate for your application.

The user-defined lexer you register with Oracle Text is composed of two routines that you must supply:

Table 2-31 User-Defined Routines for USER_LEXER

User-Defined Routine	Description
Indexing Procedure	Stored procedure (PL/SQL) which implements the tokenization of documents and stop words. Output must be an XML document as specified in this section.
Query Procedure	Stored procedure (PL/SQL) which implements the tokenization of query words. Output must be an XML document as specified in this section.

2.4.9.1 Limitations

The following features are not supported with the USER_LEXER:

CTX_DOC.GIST and CTX_DOC.THEMES
CTX_QUERY.HFEEDBACK
ABOUT query operator
CTXRULE index type
VGRAM indexing algorithm

2.4.9.2 USER_LEXER Attributes

USER_LEXER has the following attributes:

Table 2-32 USER_LEXER Attributes

Attribute	Attribute Value
`INDEX_PROCEDURE`	Name of a stored procedure. No default provided.
`INPUT_TYPE`	`VARCHAR2`, `CLOB`. Default is `CLOB`.
`QUERY_PROCEDURE`	Name of a stored procedure. No default provided.
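
A minimal sketch of a USER_LEXER preference that sets all three attributes. The preference name user_lexer_pref and the procedure names ul_idx_proc and ul_qry_proc are hypothetical; both procedures are assumed to exist already and to be executable by the index owner.

```sql
begin
  ctx_ddl.create_preference('user_lexer_pref', 'USER_LEXER');
  -- stored procedure that tokenizes documents and stop words
  ctx_ddl.set_attribute('user_lexer_pref', 'index_procedure', 'ul_idx_proc');
  -- interface implemented by the indexing procedure
  ctx_ddl.set_attribute('user_lexer_pref', 'input_type', 'VARCHAR2');
  -- stored procedure that tokenizes query words
  ctx_ddl.set_attribute('user_lexer_pref', 'query_procedure', 'ul_qry_proc');
end;
/
```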

2.4.9.3 INDEX_PROCEDURE

This callback stored procedure is called by Oracle Text as needed to tokenize a document or a stop word found in the stoplist object.

2.4.9.3.1 Requirements

This procedure can be a PL/SQL stored procedure.

The index owner must have EXECUTE privilege on this stored procedure.

This stored procedure must not be replaced or dropped after the index is created. You can replace or drop this stored procedure after the index is dropped.

2.4.9.3.2 Parameters

Two different interfaces are supported for the user-defined lexer indexing procedure:

VARCHAR2 Interface
CLOB Interface

2.4.9.3.3 Restrictions

This procedure must not perform any of the following operations:

Rollback
Explicitly or implicitly commit the current transaction
Enter any other transaction control statement
Alter the session language or territory

The child elements of the root element tokens of the XML document returned must be in the same order as the tokens occur in the document or stop word being tokenized.

The behavior of this stored procedure must be deterministic with respect to all parameters.

2.4.9.4 INPUT_TYPE

Two different interfaces are supported for the user-defined lexer indexing procedure. One interface passes the document or stop word, and the corresponding tokens encoded as XML, as the VARCHAR2 datatype; the other interface uses the CLOB datatype. This attribute indicates the interface implemented by the stored procedure specified by the INDEX_PROCEDURE attribute.

2.4.9.4.1 VARCHAR2 Interface

Table 2-33 describes the VARCHAR2 interface, in which the document or stop word from the stoplist object to be tokenized is passed as VARCHAR2 from Oracle Text to the stored procedure, and the tokens are passed as VARCHAR2 from the stored procedure back to Oracle Text.

Your user-defined lexer indexing procedure should use this interface when all documents in the column to be indexed are smaller than or equal to 32512 bytes and the tokens can be represented by less than or equal to 32512 bytes. In this case the CLOB interface given in Table 2-34 can also be used, although the VARCHAR2 interface will generally perform faster than the CLOB interface.

This procedure must be defined with the following parameters:

Table 2-33 VARCHAR2 Interface for INDEX_PROCEDURE

Parameter Position	Parameter Mode	Parameter Datatype	Description
1	`IN`	`VARCHAR2`	Document or stop word from stoplist object to be tokenized. If the document is larger than 32512 bytes then Oracle Text will report a document level indexing error.
2	`IN` `OUT`	`VARCHAR2`	Tokens encoded as XML. If the document contains no tokens, then either NULL must be returned or the tokens element in the XML document returned must contain no child elements. Byte length of the data must be less than or equal to 32512. To improve performance, use the `NOCOPY` hint when declaring this parameter. This passes the data by reference, rather than passing data by value. The XML document returned by this procedure should not include unnecessary whitespace characters (typically used to improve readability). This reduces the size of the XML document which in turn minimizes the transfer time. To improve performance, index_procedure should not validate the XML document with the corresponding XML schema at run-time. Note that this parameter is `IN` `OUT` for performance purposes. The stored procedure has no need to use the `IN` value.
3	`IN`	`BOOLEAN`	Oracle Text sets this parameter to `TRUE` when it needs the character offset and character length of the tokens as found in the document being tokenized. Oracle Text sets this parameter to `FALSE` when it does not need the character offset and character length of the tokens; in this case, the XML attributes off and len must not be used.
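
The following is a rough sketch of an indexing procedure that matches the VARCHAR2 interface; it is not a production lexer. It splits the input on whitespace and emits one word element for each token. The procedure and parameter names are hypothetical. Placing off and len on the word element, with off as a zero-based character offset, is an assumption to check against the XML Schema for the indexing procedure with location.

```sql
-- SQL*Plus only: prevent & in the string literals from being read as a substitution variable
set define off

create or replace procedure ul_idx_proc_sketch (
  p_text     in            varchar2,  -- document or stop word to tokenize
  p_tokens   in out nocopy varchar2,  -- tokens encoded as XML
  p_location in            boolean    -- TRUE when offsets and lengths are needed
) is
  c_word  constant varchar2(20) := '[^[:space:]]+';
  l_pos   pls_integer := 1;  -- 1-based scan position in p_text
  l_start pls_integer;       -- 1-based start of the current token
  l_word  varchar2(32767);
begin
  -- no whitespace is added between elements, to keep the XML small
  p_tokens := '<tokens>';
  loop
    l_start := regexp_instr(p_text, c_word, l_pos);
    exit when l_start is null or l_start = 0;
    l_word := regexp_substr(p_text, c_word, l_pos);
    p_tokens := p_tokens || '<word';
    if p_location then
      -- assumption: off is a zero-based character offset, len a character count
      p_tokens := p_tokens || ' off="' || to_char(l_start - 1)
                           || '" len="' || to_char(length(l_word)) || '"';
    end if;
    -- escape markup characters with the built-in entities
    p_tokens := p_tokens || '>'
      || replace(replace(replace(l_word, '&', '&amp;'), '<', '&lt;'), '>', '&gt;')
      || '</word>';
    l_pos := l_start + length(l_word);
  end loop;
  -- an empty document yields a tokens element with no child elements
  p_tokens := p_tokens || '</tokens>';
end ul_idx_proc_sketch;
/
```

The procedure does not commit, roll back, or alter session settings, and its output depends only on its parameters, in line with the restrictions on the indexing procedure.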

2.4.9.4.2 CLOB Interface

Table 2-34 describes the CLOB interface that enables the document or stop word from stoplist object to be tokenized to be passed as CLOB from Oracle Text to the stored procedure and for the tokens to be passed as CLOB as well from the stored procedure back to Oracle Text.

The user-defined lexer indexing procedure should use this interface when at least one of the documents in the column to be indexed is larger than 32512 bytes or the corresponding tokens are represented by more than 32512 bytes.

Table 2-34 CLOB Interface for INDEX_PROCEDURE

Parameter Position	Parameter Mode	Parameter Datatype	Description
1	`IN`	`CLOB`	Document or stop word from stoplist object to be tokenized.
2	`IN` `OUT`	`CLOB`	Tokens encoded as XML. If the document contains no tokens, then either NULL must be returned or the tokens element in the XML document returned must contain no child elements. To improve performance, use the `NOCOPY` hint when declaring this parameter. This passes the data by reference, rather than passing data by value. The XML document returned by this procedure should not include unnecessary whitespace characters (typically used to improve readability). This reduces the size of the XML document which in turn minimizes the transfer time. To improve performance, index_procedure should not validate the XML document with the corresponding XML schema at run-time. Note that this parameter is `IN` `OUT` for performance purposes. The stored procedure has no need to use the IN value. The `IN` value will always be a truncated `CLOB`.
3	`IN`	`BOOLEAN`	Oracle Text sets this parameter to `TRUE` when it needs the character offset and character length of the tokens as found in the document being tokenized. Oracle Text sets this parameter to `FALSE` when it does not need the character offset and character length of the tokens; in this case, the XML attributes off and len must not be used.

The first and second parameters are temporary CLOBs. Avoid assigning these CLOB locators to other locator variables. Assigning the formal parameter CLOB locator to another locator variable creates a new copy of the temporary CLOB, which degrades performance.

2.4.9.5 QUERY_PROCEDURE

This callback stored procedure is called by Oracle Text as needed to tokenize words in the query. A space-delimited group of characters (excluding the query operators) in the query will be identified by Oracle Text as a word.

2.4.9.5.1 Requirements

This procedure can be a PL/SQL stored procedure.

The index owner must have EXECUTE privilege on this stored procedure.

This stored procedure must not be replaced or be dropped after the index is created. You can replace or drop this stored procedure after the index is dropped.

2.4.9.5.2 Restrictions

This procedure must not perform any of the following operations:

Rollback
Explicitly or implicitly commit the current transaction
Enter any other transaction control statement
Alter the session language or territory

The child elements of the root element tokens of the XML document returned must be in the same order as the tokens occur in the query word being tokenized.

The behavior of this stored procedure must be deterministic with respect to all parameters.

2.4.9.5.3 Parameters

Table 2-35 describes the interface for the user-defined lexer query procedure:

Table 2-35 User-defined Lexer Query Procedure XML Schema Attributes

Parameter Position	Parameter Mode	Parameter Datatype	Description
1	`IN`	`VARCHAR2`	Query word to be tokenized.
2	`IN`	`CTX_ULEXER.WILDCARD_TAB`	Character offsets of wildcard characters (% and _) in the query word. If the query word passed in by Oracle Text does not contain any wildcard characters, then this index-by table is empty. The wildcard characters in the query word must be preserved in the tokens returned for the wildcard query feature to work properly. The character offset is 0 (zero) based. Offset information follows UCS-2 codepoint semantics.
3	`IN` `OUT`	`VARCHAR2`	Tokens encoded as XML. If the query word contains no tokens, then either NULL must be returned or the tokens element in the XML document returned must contain no child elements. The length of the data must be less than or equal to 32512 bytes.

2.4.9.6 Encoding Tokens as XML

The sequence of tokens returned by your stored procedure must be represented as an XML 1.0 document. The XML document must be valid with respect to the XML Schemas given in the following sections.

XML Schema for No-Location, User-defined Indexing Procedure
XML Schema for User-defined Indexing Procedure with Location
XML Schema for User-defined Lexer Query Procedure

2.4.9.6.1 Limitations

To improve performance, the XML parser in Oracle Text does not perform validation and is not a full-featured, XML-compliant parser. As a result, only minimal XML features are supported. The following XML features are not supported:

Document Type Declaration (for example, <!DOCTYPE [...]>) and therefore entity declarations. Only the following built-in entities can be referenced: lt, gt, amp, quot, and apos.
CDATA sections.
Comments.
Processing Instructions.
XML declaration (for example, <?xml version="1.0" ...?>).
Namespaces.
Use of elements and attributes other than those defined by the corresponding XML Schema.
Character references (for example, &#x099F;).
xml:space attribute.
xml:lang attribute.
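
The limitations above can be checked before a token document is returned. The following Python sketch is only a heuristic illustration: the function name `find_unsupported` and the substring-based approach are assumptions made for this example, not part of Oracle Text, and a real lexer procedure would be written in PL/SQL. It does not detect elements or attributes outside the schema.

```python
# Heuristic pre-flight check for constructs listed as unsupported above.
# The marker table and function name are hypothetical; a plain substring
# match can report false positives when a token's text contains a marker.
UNSUPPORTED_MARKERS = {
    "<!DOCTYPE": "document type declaration",
    "<![CDATA[": "CDATA section",
    "<!--": "comment",
    "<?": "processing instruction or XML declaration",
    "&#": "character reference",
    "xmlns": "namespace declaration",
    "xml:space": "xml:space attribute",
    "xml:lang": "xml:lang attribute",
}


def find_unsupported(xml_text):
    """Return descriptions of unsupported constructs found in xml_text."""
    return [description
            for marker, description in UNSUPPORTED_MARKERS.items()
            if marker in xml_text]
```

An empty result means none of the listed markers were found; it does not prove that the document is valid against the schema.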

2.4.9.7 XML Schema for No-Location, User-defined Indexing Procedure

This section describes additional constraints imposed on the XML document returned by the user-defined lexer indexing procedure when the third parameter is FALSE. The XML document returned must be valid with respect to the following XML Schema:

<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

  <xsd:element name="tokens">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:choice minOccurs="0" maxOccurs="unbounded"> 
          <xsd:element name="eos" type="EmptyTokenType"/>
          <xsd:element name="eop" type="EmptyTokenType"/>
          <xsd:element name="num" type="xsd:token"/> 
          <xsd:group ref="IndexCompositeGroup"/>
        </xsd:choice>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>

  <!-- 
  Enforce constraint that compMem element must be preceded by word element
  or compMem element for indexing 
  -->
  <xsd:group name="IndexCompositeGroup">
    <xsd:sequence>
      <xsd:element name="word" type="xsd:token"/>
      <xsd:element name="compMem" type="xsd:token" minOccurs="0"
                   maxOccurs="unbounded"/>
    </xsd:sequence>
  </xsd:group>

  <!-- EmptyTokenType defines an empty element without attributes -->
  <xsd:complexType name="EmptyTokenType"/>

</xsd:schema>

Here are some of the constraints imposed by this XML Schema:

The root element is tokens. This is mandatory. It has no attributes.
The root element can have zero or more child elements. The child elements can be one of the following elements: eos, eop, num, word, and compMem. Each of these represents a specific type of token.
The compMem element must be preceded by a word element or a compMem element.
The eos and eop elements have no attributes and must be empty elements.
The num, word, and compMem elements have no attributes. Oracle Text will normalize the content of these elements as follows: convert whitespace characters to space characters, collapse adjacent space characters to a single space character, remove leading and trailing spaces, perform entity reference replacement, and truncate to 64 bytes.
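
The normalization steps listed above can be sketched as follows. This Python function is an illustration, not Oracle Text's implementation: the name `normalize_token` is hypothetical, and the byte truncation assumes a UTF-8 encoding and drops any partially cut character, neither of which the text above specifies.

```python
import re

# The five built-in entities; &amp; is replaced last so that an escaped
# ampersand cannot combine with following text to form a new entity.
_ENTITIES = (
    ("&lt;", "<"),
    ("&gt;", ">"),
    ("&quot;", '"'),
    ("&apos;", "'"),
    ("&amp;", "&"),
)


def normalize_token(raw, max_bytes=64, encoding="utf-8"):
    """Apply the documented normalization steps in the documented order."""
    text = re.sub(r"[ \t\r\n]", " ", raw)   # whitespace characters -> spaces
    text = re.sub(r" +", " ", text)          # collapse adjacent spaces
    text = text.strip(" ")                   # remove leading/trailing spaces
    for reference, character in _ENTITIES:   # entity reference replacement
        text = text.replace(reference, character)
    # Truncate to max_bytes (assumed UTF-8; a partial trailing character
    # is discarded rather than kept as invalid bytes).
    data = text.encode(encoding)[:max_bytes]
    return data.decode(encoding, errors="ignore")
```

For example, `normalize_token("  F&amp;B \n")` yields `F&B`.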

Table 2-36 describes the element names defined in the preceding XML Schema.

Table 2-36 User-defined Lexer Indexing Procedure XML Schema Element Names

Element	Description
word	This element represents a simple word token. The content of the element is the word itself. Oracle Text identifies this token as a stop word or non-stop word and processes it appropriately.
num	This element represents an arithmetic number token. The content of the element is the arithmetic number itself. Oracle Text treats this token as a stop word if the stoplist preference has `NUMBERS` added as a stopclass. Otherwise, this token is treated the same way as the word token. Supporting this token type is optional. Without support for this token type, adding the `NUMBERS` stopclass has no effect.
eos	This element represents an end-of-sentence token. Oracle Text uses this information so that it can support `WITHIN SENTENCE` queries. Supporting this token type is optional. Without support for this token type, queries against the `SENTENCE` section will not work as expected.
eop	This element represents an end-of-paragraph token. Oracle Text uses this information so that it can support `WITHIN PARAGRAPH` queries. Supporting this token type is optional. Without support for this token type, queries against the `PARAGRAPH` section will not work as expected.
compMem	Same as the word element, except that the implicit word offset is the same as that of the previous word token. Support for this token type is optional.

2.4.9.7.1 Example

Document: Vom Nordhauptbahnhof und aus der Innenstadt zum Messegelände.

Tokens:

<tokens>
  <word> VOM </word>
  <word> NORDHAUPTBAHNHOF </word>
  <compMem>NORD</compMem>
  <compMem>HAUPT </compMem>
  <compMem>BAHNHOF </compMem>
  <compMem>HAUPTBAHNHOF </compMem>
  <word> UND </word>
  <word> AUS </word>
  <word> DER </word>
  <word> INNENSTADT </word>
  <word> ZUM </word>
  <word> MESSEGELÄNDE </word>
  <eos/>
</tokens>

2.4.9.7.2 Example

Document: Oracle Database 11g Release 1

Tokens:

<tokens>
  <word> ORACLE11G</word>
  <word> RELEASE </word>
  <num> 1 </num>
</tokens>

2.4.9.7.3 Example

Document: WHERE salary<25000.00 AND job = 'F&B Manager'

Tokens:

<tokens>
  <word> WHERE </word>
  <word> salary&lt;25000.00 </word>
  <word> AND </word>
  <word> job </word>
  <word> F&amp;B </word>
  <word> Manager </word>
</tokens>
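
The examples above can be produced mechanically. The following Python sketch assembles a no-location token document from a list of token kinds and texts, escaping `&`, `<`, and `>` with the built-in entities and enforcing the compMem ordering rule. The function name `build_tokens_xml` and the `(kind, text)` input shape are assumptions for illustration; an actual user-defined lexer would do the equivalent in its PL/SQL indexing procedure.

```python
from xml.sax.saxutils import escape  # escapes &, < and > only

EMPTY_KINDS = {"eos", "eop"}
TEXT_KINDS = {"word", "num", "compMem"}


def build_tokens_xml(tokens):
    """Build a <tokens> document from (kind, text) pairs.

    text is ignored for eos and eop, which must be empty elements.
    """
    parts = ["<tokens>"]
    previous = None
    for kind, text in tokens:
        if kind in EMPTY_KINDS:
            parts.append(f"<{kind}/>")
        elif kind in TEXT_KINDS:
            # compMem must be preceded by a word or another compMem.
            if kind == "compMem" and previous not in ("word", "compMem"):
                raise ValueError("compMem must follow word or compMem")
            parts.append(f"<{kind}>{escape(text)}</{kind}>")
        else:
            raise ValueError(f"unknown token kind: {kind}")
        previous = kind
    parts.append("</tokens>")
    return "".join(parts)
```

No XML declaration, comments, or character references are emitted, which keeps the output within the limitations described earlier.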

2.4.9.8 XML Schema for User-defined Indexing Procedure with Location

This section describes additional constraints imposed on the XML document returned by the user-defined lexer indexing procedure when the third parameter is TRUE. The XML document returned must be valid according to the following XML schema:

<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

  <xsd:element name="tokens">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:choice minOccurs="0" maxOccurs="unbounded">
          <xsd:element name="eos" type="EmptyTokenType"/>
          <xsd:element name="eop" type="EmptyTokenType"/>
          <xsd:element name="num" type="DocServiceTokenType"/>
          <xsd:group ref="DocServiceCompositeGroup"/>
        </xsd:choice>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>

  <!-- 
  Enforce constraint that compMem element must be preceded by word element
  or compMem element for document service
  -->
  <xsd:group name="DocServiceCompositeGroup">
    <xsd:sequence>
      <xsd:element name="word" type="DocServiceTokenType"/>
      <xsd:element name="compMem" type="DocServiceTokenType" minOccurs="0"
           maxOccurs="unbounded"/>
    </xsd:sequence>
  </xsd:group>

  <!-- EmptyTokenType defines an empty element without attributes -->
  <xsd:complexType name="EmptyTokenType"/>

  <!-- 
  DocServiceTokenType defines an element with content and mandatory attributes 
  -->
  <xsd:complexType name="DocServiceTokenType">
    <xsd:simpleContent>
      <xsd:extension base="xsd:token">
        <xsd:attribute name="off" type="OffsetType" use="required"/>
        <xsd:attribute name="len" type="xsd:unsignedShort" use="required"/>
      </xsd:extension>
    </xsd:simpleContent>
  </xsd:complexType>

  <xsd:simpleType name="OffsetType">
    <xsd:restriction base="xsd:unsignedInt">
      <xsd:maxInclusive value="2147483647"/>
    </xsd:restriction>
  </xsd:simpleType>

</xsd:schema>

Some of the constraints imposed by this XML Schema are as follows:

The root element is tokens. This is mandatory. It has no attributes.
The root element can have zero or more child elements. The child elements can be one of the following elements: eos, eop, num, word, and compMem. Each of these represents a specific type of token.
The compMem element must be preceded by a word element or a compMem element.
The eos and eop elements have no attributes and must be empty elements.
The num, word, and compMem elements have two mandatory attributes: off and len. Oracle Text will normalize the content of these elements as follows: convert whitespace characters to space characters, collapse adjacent space characters to a single space character, remove leading and trailing spaces, perform entity reference replacement, and truncate to 64 bytes.
The off attribute value must be an integer between 0 and 2147483647 inclusive.
The len attribute value must be an integer between 0 and 65535 inclusive.

Table 2-36 describes the element types defined in the preceding XML Schema.

Table 2-37 describes the attributes defined in the preceding XML Schema.

Table 2-37 User-defined Lexer Indexing Procedure XML Schema Attributes

Attribute	Description
off	This attribute represents the character offset of the token as it appears in the document being tokenized. The offset is with respect to the character document passed to the user-defined lexer indexing procedure, not the document fetched by the datastore. The document fetched by the datastore may be pre-processed by the filter object or the section group object, or both, before being passed to the user-defined lexer indexing procedure. The offset of the first character in the document being tokenized is 0 (zero). Offset information follows UCS-2 codepoint semantics.
len	This attribute represents the character length (same semantics as SQL function `LENGTH`) of the token as it appears in the document being tokenized. The length is with respect to the character document passed to the user-defined lexer indexing procedure, not the document fetched by the datastore. The document fetched by the datastore may be pre-processed by the filter object or the section group object, or both, before being passed to the user-defined lexer indexing procedure. Length information follows UCS-2 codepoint semantics.

The sum of the off and len attribute values must be less than or equal to the total number of characters in the document being tokenized. This ensures that the document offset and the characters being referenced are within the document boundary.

2.4.9.8.1 Example

Document: User-defined Lexer.

Tokens:

<tokens>
  <word off="0" len="4"> USE </word>
  <word off="5" len="7"> DEF </word>
  <word off="13" len="5"> LEX </word>
  <eos/>
</tokens>
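
The off and len values in this example can be derived directly from the input text. The following Python sketch is an illustration under stated assumptions: the function name `locate_words` and the letters-only word pattern are hypothetical, and Python string indexes count code points, which matches UCS-2 offsets only while the text stays within the Basic Multilingual Plane.

```python
import re


def locate_words(document):
    """Return (token_text, off, len) for each run of ASCII letters.

    off is the 0-based character offset and len the character length of
    the word as it appears in the document passed to the lexer.
    """
    located = []
    for match in re.finditer(r"[A-Za-z]+", document):
        off = match.start()
        length = match.end() - match.start()
        # Documented constraint: off + len must stay inside the document.
        if off + length > len(document):
            raise ValueError("token extends past the end of the document")
        located.append((match.group().upper(), off, length))
    return located
```

Applied to `User-defined Lexer.`, this yields the same off and len pairs as the example above (0 and 4, 5 and 7, 13 and 5). The token text may differ from the source text, as the example shows, because off and len always describe the original span.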

2.4.9.9 XML Schema for User-defined Lexer Query Procedure

This section describes additional constraints imposed on the XML document returned by the user-defined lexer query procedure. The XML document returned must be valid with respect to the following XML Schema:

<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

  <xsd:element name="tokens">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:choice minOccurs="0" maxOccurs="unbounded">
          <xsd:element name="num" type="QueryTokenType"/>
          <xsd:group ref="QueryCompositeGroup"/>
        </xsd:choice>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>

  <!--
  Enforce constraint that compMem element must be preceded by word element
  or compMem element for query
  -->
  <xsd:group name="QueryCompositeGroup">
    <xsd:sequence>
      <xsd:element name="word" type="QueryTokenType"/>
      <xsd:element name="compMem" type="QueryTokenType" minOccurs="0"
                                              maxOccurs="unbounded"/>
    </xsd:sequence>
  </xsd:group>

  <!-- 
  QueryTokenType defines an element with content and with an optional attribute
  -->
  <xsd:complexType name="QueryTokenType">
    <xsd:simpleContent>
      <xsd:extension base="xsd:token">
        <xsd:attribute name="wildcard" type="WildcardType" use="optional"/>
      </xsd:extension>
    </xsd:simpleContent>
  </xsd:complexType>

  <xsd:simpleType name="WildcardType">
    <xsd:restriction base="WildcardBaseType">
      <xsd:minLength value="1"/>
      <xsd:maxLength value="64"/>
    </xsd:restriction>     
  </xsd:simpleType>

  <xsd:simpleType name="WildcardBaseType">
    <xsd:list>
      <xsd:simpleType>
        <xsd:restriction base="xsd:unsignedShort">
          <xsd:maxInclusive value="378"/>
        </xsd:restriction>
      </xsd:simpleType>
    </xsd:list>
  </xsd:simpleType>

</xsd:schema>

Here are some of the constraints imposed by this XML Schema:

The root element is tokens. This is mandatory. It has no attributes.
The root element can have zero or more child elements. The child elements can be one of the following elements: num and word. Each of these represents a specific type of token.
The compMem element must be preceded by a word element or a compMem element.

The purpose of compMem is to enable USER_LEXER queries to return multiple forms for a single query. For example, if a user-defined lexer indexes the word bank as BANK(FINANCIAL) and BANK(RIVER), the query procedure can return one term as a word element and the other as a compMem element:
```
<tokens>
  <word>BANK(RIVER)</word>
  <compMem>BANK(FINANCIAL)</compMem>
</tokens>
```
See Table 2-38, "User-defined Lexer Query Procedure XML Schema Attributes" for more on the compMem element.
The num and word elements have a single optional attribute: wildcard. Oracle Text will normalize the content of these elements as follows: convert whitespace characters to space characters, collapse adjacent space characters to a single space character, remove leading and trailing spaces, perform entity reference replacement, and truncate to 64 bytes.
The wildcard attribute value is a whitespace-separated list of integers. The minimum number of integers is 1 and the maximum number of integers is 64. The value of each integer must be between 0 and 378 inclusive. The integers in the list can be in any order.

Table 2-36 describes the element types defined in the preceding XML Schema.

Table 2-38 describes the attribute defined in the preceding XML Schema.

Table 2-38 User-defined Lexer Query Procedure XML Schema Attributes

Attribute	Description
`compMem`	Same as the `word` element, but its implicit word offset is the same as the previous `word` token. Oracle Text will equate this token with the previous `word` token and with subsequent `compMem` tokens using the query `EQUIV` operator.
`wildcard`	Any % or _ characters in the query that are not escaped by the user are considered wildcard characters because they are replaced by other characters. These wildcard characters in the query must be preserved during tokenization for the wildcard query feature to work properly. This attribute represents the character offsets (same semantics as SQL function `LENGTH`) of wildcard characters in the content of the element. Oracle Text adjusts these offsets for any normalization performed on the content of the element. The characters pointed to by the offsets must be either % or _ characters. The offset of the first character in the content of the element is 0. Offset information follows UCS-2 codepoint semantics. If the token does not contain any wildcard characters, then this attribute must not be specified.

2.4.9.9.1 Example

Query word: pseudo-%morph%

Tokens:

<tokens>
  <word> PSEUDO </word>
  <word wildcard="1 7"> %MORPH% </word>
</tokens>

2.4.9.9.2 Example

Query word: <%>

Tokens:

<tokens>
  <word wildcard="5"> &lt;%&gt; </word>
</tokens>
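
Both query examples use offsets counted in the raw element content, before Oracle Text trims spaces or replaces entity references. The following Python sketch computes such a wildcard attribute value. It is an illustration only: the function name `wildcard_attribute` is hypothetical, and it assumes that every % and _ in the content is an unescaped wildcard, whereas a real query procedure would rely on the offsets passed in through the `CTX_ULEXER.WILDCARD_TAB` parameter.

```python
MAX_OFFSETS = 64        # at most 64 integers in the list
MAX_OFFSET_VALUE = 378  # each integer must be between 0 and 378


def wildcard_attribute(content):
    """Return the wildcard attribute value for raw element content.

    Returns None when the content has no wildcard characters, because
    the attribute must then be omitted.
    """
    offsets = [index for index, char in enumerate(content) if char in "%_"]
    if not offsets:
        return None
    if len(offsets) > MAX_OFFSETS:
        raise ValueError("more than 64 wildcard offsets")
    if offsets[-1] > MAX_OFFSET_VALUE:
        raise ValueError("wildcard offset greater than 378")
    return " ".join(str(offset) for offset in offsets)
```

With the content of the two examples above, including the leading space, the function returns `1 7` and `5` respectively.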

2.4.10 WORLD_LEXER

Use the WORLD_LEXER to index text columns that contain documents of different languages. For example, use this lexer to index a text column that stores English, Japanese, and German documents.

WORLD_LEXER differs from MULTI_LEXER in that WORLD_LEXER automatically detects the language(s) of a document. Unlike MULTI_LEXER, WORLD_LEXER does not require you to have a language column in your base table nor to specify the language column when you create the index. Moreover, it is not necessary to use sub-lexers, as with MULTI_LEXER. (See "MULTI_LEXER".)

WORLD_LEXER supports all database character sets, and for languages whose character sets are Unicode-based, it supports the Unicode 5.0 standard. For a list of languages that WORLD_LEXER can work with, see "World Lexer Features".

2.4.10.1 WORLD_LEXER Attribute

WORLD_LEXER has the following attribute:

Table 2-39 WORLD_LEXER Attribute

Attribute	Attribute Value
`mixed_case`	Enable mixed-case (upper- and lower-case) searches of text (for example, cat and Cat). Allowable values are `YES` and `NO` (default).

2.4.10.2 WORLD_LEXER Example

Here is an example of creating an index using WORLD_LEXER.

exec ctx_ddl.create_preference('MYLEXER', 'world_lexer');
create index doc_idx on doc(data)
  indextype is CONTEXT
  parameters ('lexer MYLEXER
               stoplist CTXSYS.EMPTY_STOPLIST');

2.5 Wordlist Type

Use the wordlist preference to enable query options such as stemming and fuzzy matching for your language. You can also use the wordlist preference to enable substring and prefix indexing, which improves performance for wildcard queries with CONTAINS and CATSEARCH.

To create a wordlist preference, you must use BASIC_WORDLIST, which is the only type available.

2.5.1 BASIC_WORDLIST

Use the BASIC_WORDLIST type to enable stemming and fuzzy matching, or to create substring and prefix indexes with Text indexes.

BASIC_WORDLIST has the following attributes:

Table 2-40 BASIC_WORDLIST Attributes

Attribute	Attribute Values
`stemmer`	Specify which language stemmer to use. You can specify one of the following stemmers: NULL (no stemming), ENGLISH (English inflectional), DERIVATIONAL (English derivational), DUTCH, FRENCH, GERMAN, ITALIAN, SPANISH, JAPANESE, or AUTO (automatic language detection for stemming, derived from the database session language; if the database session language is AMERICAN or ENGLISH, then the ENGLISH stemmer is used; does not auto-detect JAPANESE).
`fuzzy_match`	Specify which fuzzy matching cluster to use. You can specify one of the following types: AUTO (automatic language detection for fuzzy matching), CHINESE_VGRAM, DUTCH, ENGLISH, FRENCH, GENERIC, GERMAN, ITALIAN, JAPANESE_VGRAM, KOREAN, OCR, or SPANISH.
`fuzzy_score`	Specify a default lower limit of fuzzy score. Specify a number between 0 and 80. Text with scores below this number is not returned. Default is 60.
`fuzzy_numresults`	Specify the maximum number of fuzzy expansions. Use a number between 0 and 5,000. Default is 100.
`substring_index`	Specify `TRUE` for Oracle Text to create a substring index. A substring index improves left-truncated and double-truncated wildcard queries such as %ing or %benz%. Default is `FALSE`.
`prefix_index`	Specify `TRUE` to enable prefix indexing. Prefix indexing improves performance for right truncated wildcard searches such as TO%. Default is `FALSE`.
`prefix_min_length`	Specify the minimum length of indexed prefixes. Default is 1. Length information must follow UCS-2 codepoint semantics.
`prefix_max_length`	Specify the maximum length of indexed prefixes. Default is 64. Length information must follow UCS-2 codepoint semantics.
`wildcard_maxterms`	Specify the maximum number of terms in a wildcard expansion. The maximum value is 50000 and the default value is 20000. If you specify a value of 0, then the number of wildcard expansions is unbounded. Note that when this attribute is set to 0, the system may run out of memory because of the high number of wildcard expansions.
`ndata_base_letter`	Specify whether characters that have diacritical marks are converted to their base form before being stored in the Text index or queried by the `NDATA` operator. Allowable values are `FALSE` (default) and `TRUE`. When set to `FALSE`, no base lettering is used.
`ndata_alternate_spelling`	Specify whether to enable alternate spelling for German, Danish, and Swedish. Enabling alternate spelling allows you to index `NDATA` section data and query using the `NDATA` operator in alternate form. Allowable values are `FALSE` (default) and `TRUE`. When set to `FALSE`, no alternate spelling is used.
`ndata_thesaurus`	Name of the thesaurus used for alternate name expansion.
`ndata_join_particles`	A list of colon-separated name particles that can be joined with a name that follows them.

stemmer

Specify the stemmer used for word stemming in Text queries. When you do not specify a value for STEMMER, the default is ENGLISH.

Specify AUTO for the system to automatically set the stemming language according to the language setting of the database session. If the database language is AMERICAN or ENGLISH, then the ENGLISH stemmer is automatically used. Otherwise, the stemmer that maps to the database session language is used.

When there is no stemmer for a language, the default is NULL. With the NULL stemmer, the stem operator is ignored in queries.

You can create your own stemming user-dictionary. See "Stemming User-Dictionaries" for more information.

Note:

The STEMMER attribute of the BASIC_WORDLIST preference is ignored if any of the following conditions is true:

1. The INDEX_STEMS attribute of the BASIC_LEXER preference is set to BOKMAL, CATALAN, CROATIAN, CZECH, DANISH, FINNISH, GREEK, HEBREW, HUNGARIAN, NYNORSK, POLISH, PORTUGUESE, ROMANIAN, RUSSIAN, SERBIAN, SLOVAK, SLOVENIAN, SWEDISH, ENGLISH_NEW, DERIVATIONAL_NEW, DUTCH_NEW, FRENCH_NEW, GERMAN_NEW, ITALIAN_NEW, or SPANISH_NEW.

2. The INDEX_STEMS attribute of the AUTO_LEXER preference is set to YES.

3. The database session language causes MULTI_LEXER to choose a SUB_LEXER with the same setting as condition 1 or 2.

In these cases, the same stemmer that is used by the BASIC_LEXER or AUTO_LEXER during indexing will be used to determine the stem of the query term during query.

fuzzy_match

Specify which fuzzy matching routines are used for the column. Fuzzy matching is currently supported for English, Japanese, and, to a lesser extent, the Western European languages.

Note:

The fuzzy_match attribute values for Chinese and Korean are dummy values that prevent the English and Japanese fuzzy matching routines from being used on Chinese and Korean text.

The default for fuzzy_match is GENERIC.

Specify AUTO for the system to automatically set the fuzzy matching language according to the language setting of the session.

fuzzy_score

Specify a default lower limit of fuzzy score. Specify a number between 0 and 80. Text with scores below this number is not returned. The default is 60.

Fuzzy score is a measure of how close the expanded word is to the query word. The higher the score, the better the match. Use this parameter to limit fuzzy expansions to the best matches.

fuzzy_numresults

Specify the maximum number of fuzzy expansions. Use a number between 0 and 5000. The default is 100.

Setting a fuzzy expansion limit restricts the expansion to the specified number of best-matching words.

substring_index

Specify TRUE for Oracle Text to create a substring index. A substring index improves performance for left-truncated or double-truncated wildcard queries such as %ing or %benz%. The default is FALSE.

Substring indexing has the following impact on indexing and disk resources:

Index creation and DML processing are up to 4 times slower.
Index creation with substring_index enabled requires more rollback segments during index flushes than with substring_index disabled. Oracle recommends that you do either of the following when creating a substring index:
- Make available double the usual rollback, or
- Decrease the index memory to reduce the size of the index flushes to disk.

prefix_index

Specify TRUE to enable prefix indexing. Prefix indexing improves performance for right-truncated wildcard searches such as TO%. The default is FALSE.

Note:

Enabling prefix indexing increases index size.

Prefix indexing splits tokens into multiple prefixes to store in the $I table. For example, the words TOKEN and TOY are normally indexed as follows in the $I table:

Token	Type	Information
TOKEN	0	DOCID 1 POS 1
TOY	0	DOCID 1 POS 3

With prefix indexing, Oracle Text indexes the prefix substrings of these tokens as follows with a new token type of 6:

Token	Type	Information
TOKEN	0	DOCID 1 POS 1
TOY	0	DOCID 1 POS 3
T	6	DOCID 1 POS 1 POS 3
TO	6	DOCID 1 POS 1 POS 3
TOK	6	DOCID 1 POS 1
TOKE	6	DOCID 1 POS 1
TOKEN	6	DOCID 1 POS 1
TOY	6	DOCID 1 POS 3

Wildcard searches such as TO% are now faster because Oracle Text does not need to expand terms and merge result sets. To obtain the result, Oracle Text need only examine the (TO,6) row.
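
The prefix rows in the table above follow a simple rule, sketched here in Python. This is an illustration of the idea, not Oracle Text's internal code: the function name `prefix_rows` and the `(token, position)` input shape are assumptions, and the `min_len` and `max_len` parameters stand in for the prefix_min_length and prefix_max_length attributes.

```python
def prefix_rows(tokens, min_len=1, max_len=64):
    """Map each indexed prefix to the positions of tokens that start with it.

    tokens is a list of (token, position) pairs. Prefixes shorter than
    min_len or longer than max_len are not generated.
    """
    rows = {}
    for token, position in tokens:
        longest = min(len(token), max_len)
        for length in range(min_len, longest + 1):
            rows.setdefault(token[:length], []).append(position)
    return rows
```

Called with `[("TOKEN", 1), ("TOY", 3)]` and the default lengths, the function produces the six type 6 rows shown above. With `min_len=3` and `max_len=4`, only TOK, TOKE, and TOY are generated, so a search such as TO% would fall back to term expansion.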

prefix_min_length

Specify the minimum length of indexed prefixes. Default is 1.

For example, setting prefix_min_length to 3 and prefix_max_length to 5 indexes all prefixes between 3 and 5 characters long.

Note:

A wildcard search whose pattern is below the minimum length or above the maximum length is searched using the slower method of equivalence expansion and merging.

prefix_max_length

Specify the maximum length of indexed prefixes. Default is 64.

For example, setting prefix_min_length to 3 and prefix_max_length to 5 indexes all prefixes between 3 and 5 characters long.

Note:

A wildcard search whose pattern is below the minimum length or above the maximum length is searched using the slower method of equivalence expansion and merging.

wildcard_maxterms

Specify the maximum number of terms in a wildcard (%) expansion. Use this parameter to keep wildcard query performance within an acceptable limit. Oracle Text returns an error when the wildcard query expansion exceeds this number.

ndata_base_letter

Specify whether characters that have diacritical marks (umlauts, cedillas, acute accents, and so on) are converted to their base form before being stored in the Text index or queried by the NDATA operator. The default is FALSE (base-letter conversion disabled). For more information on base-letter conversions, see "Base-Letter Conversion".

ndata_alternate_spelling

Specify whether to enable alternate spelling for German, Danish, and Swedish. Enabling alternate spelling allows you to index NDATA section data and query using the NDATA operator in alternate form.

When ndata_base_letter is enabled at the same time as ndata_alternate_spelling, NDATA section data is serially transformed first by alternate spelling and then by base lettering. For more information about the alternate spelling conventions Oracle Text uses, see "Alternate Spelling".

ndata_thesaurus

Specify the name of the thesaurus used for alternate name expansion. The indexing engine expands names in documents using synonym rings in the thesaurus. Use the homographic disambiguation feature of the thesaurus to distinguish common nicknames.

An example is:

Albert
  SYN Al
  SYN Bert
Alfred
  SYN Al
  SYN Fred

A simple definition such as the above will put Albert, Alfred, Al, Bert, and Fred into the same synonym ring. This will cause an unexpected expansion such that the expansion of Bert includes Fred. To prevent this, you can use homographic disambiguation as in:

Albert
  SYN Al (Albert)
  SYN Bert (Albert)
Alfred
  SYN Al (Alfred)
  SYN Fred (Alfred)

This forms two synonym rings, Albert-Al-Bert and Alfred-Al-Fred. Thus, the expansion of Bert no longer includes Fred. A more detailed example is:

begin
  ctx_ddl.create_preference('NDATA_PREF', 'BASIC_WORDLIST');
  ctx_ddl.set_attribute('NDATA_PREF', 'NDATA_ALTERNATE_SPELLING', 'FALSE');
  ctx_ddl.set_attribute('NDATA_PREF', 'NDATA_BASE_LETTER', 'TRUE');
  ctx_ddl.set_attribute('NDATA_PREF', 'NDATA_THESAURUS', 'NICKNAMES');
end;

Note:

A sample thesaurus for names can be found in the $ORACLE_HOME/ctx/sample/thes directory. This file is dr0thsnames.txt.
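
The effect of homographic disambiguation on synonym rings, described earlier in this section, can be sketched in Python. This is a simplified model and not how Oracle Text stores a thesaurus: the function names `synonym_rings` and `expand` are hypothetical, and a qualified entry such as `Al (Albert)` is treated as a distinct term from `Al (Alfred)`.

```python
def synonym_rings(entries):
    """Group terms into rings; entries that share any member are merged.

    entries is a list of (term, [synonyms]) pairs.
    """
    rings = []
    for term, synonyms in entries:
        ring = {term, *synonyms}
        # Merge with every existing ring that shares a member.
        overlapping = [existing for existing in rings if existing & ring]
        for existing in overlapping:
            ring |= existing
            rings.remove(existing)
        rings.append(ring)
    return rings


def expand(name, rings):
    """Return the ring that contains name, or just the name itself."""
    for ring in rings:
        if name in ring:
            return ring
    return {name}
```

With the unqualified entries, Albert and Alfred share the synonym Al, so a single ring of five names is formed and expanding Bert also returns Fred. With the qualified entries, two separate rings are formed and the expansion of `Bert (Albert)` no longer reaches `Fred (Alfred)`.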

ndata_join_particles

Specify a list of colon-separated name particles that can be joined with a name that follows them. A name particle, such as da, can be written separately from or joined with the name that follows it, as in da Vinci or daVinci. The indexing engine generates index data for both the separated and joined versions of a name when it finds a name particle specified in this preference. The same happens during query processing, for better recall.

2.5.2 BASIC_WORDLIST Example

The following examples show the use of the BASIC_WORDLIST type.

2.5.2.1 Enabling Fuzzy Matching and Stemming

The following example enables stemming and fuzzy matching for English. The preference STEM_FUZZY_PREF sets the number of expansions to the maximum allowed. This preference also instructs the system to create a substring index to improve the performance of double-truncated searches.

begin 
  ctx_ddl.create_preference('STEM_FUZZY_PREF', 'BASIC_WORDLIST'); 
  ctx_ddl.set_attribute('STEM_FUZZY_PREF','FUZZY_MATCH','ENGLISH');
  ctx_ddl.set_attribute('STEM_FUZZY_PREF','FUZZY_SCORE','0');
  ctx_ddl.set_attribute('STEM_FUZZY_PREF','FUZZY_NUMRESULTS','5000');
  ctx_ddl.set_attribute('STEM_FUZZY_PREF','SUBSTRING_INDEX','TRUE');
  ctx_ddl.set_attribute('STEM_FUZZY_PREF','STEMMER','ENGLISH');
end;

To create the index in SQL, enter the following statement:

create index fuzzy_stem_subst_idx on mytable ( docs ) 
  indextype is ctxsys.context parameters ('Wordlist STEM_FUZZY_PREF');

2.5.2.2 Enabling Sub-string and Prefix Indexing

The following example sets the wordlist preference for prefix and sub-string indexing. For prefix indexing, it specifies that Oracle Text create token prefixes between 3 and 4 characters long:

begin
  ctx_ddl.create_preference('mywordlist', 'BASIC_WORDLIST');
  ctx_ddl.set_attribute('mywordlist','PREFIX_INDEX','TRUE');
  ctx_ddl.set_attribute('mywordlist','PREFIX_MIN_LENGTH', 3);
  ctx_ddl.set_attribute('mywordlist','PREFIX_MAX_LENGTH', 4);
  ctx_ddl.set_attribute('mywordlist','SUBSTRING_INDEX', 'TRUE');
end;

2.5.2.3 Setting Wildcard Expansion Limit

Use the wildcard_maxterms attribute to set the maximum allowed terms in a wildcard expansion.

--- create a sample table
drop table quick ;
create table quick 
  ( 
    quick_id number primary key, 
    text      varchar(80) 
  ); 

--- insert a row with 10 expansions for 'tire%'
insert into quick ( quick_id, text ) 
  values ( 1, 'tire tirea tireb tirec tired tiree tiref tireg tireh tirei tirej');
commit;

--- create an index using wildcard_maxterms=100
begin 
    Ctx_Ddl.Create_Preference('wildcard_pref', 'BASIC_WORDLIST'); 
    ctx_ddl.set_attribute('wildcard_pref', 'wildcard_maxterms', 100) ;
end; 
/
create index wildcard_idx on quick(text)
    indextype is ctxsys.context 
    parameters ('Wordlist wildcard_pref') ;

--- query on 'tire%' - should work fine
select quick_id from quick
  where contains ( text, 'tire%' ) > 0;

--- now re-create the index with wildcard_maxterms=5

drop index wildcard_idx ;

begin 
    Ctx_Ddl.Drop_Preference('wildcard_pref'); 
    Ctx_Ddl.Create_Preference('wildcard_pref', 'BASIC_WORDLIST'); 
    ctx_ddl.set_attribute('wildcard_pref', 'wildcard_maxterms', 5) ;
end; 
/

create index wildcard_idx on quick(text)
    indextype is ctxsys.context 
    parameters ('Wordlist wildcard_pref') ;

--- query on 'tire%' gives "wildcard query expansion resulted in too many terms"
select quick_id from quick
  where contains ( text, 'tire%' ) > 0;

2.6 Storage Types

Use the storage preference to specify tablespace and creation parameters for tables associated with a Text index. The system provides a single storage type called BASIC_STORAGE:

Table 2-41 Storage Types

Type	Description
`BASIC_STORAGE`	Indexing type used to specify the tablespace and creation parameters for the database tables and indexes that constitute a Text index.

2.6.1 BASIC_STORAGE

The BASIC_STORAGE type specifies the tablespace and creation parameters for the database tables and indexes that constitute a Text index.

The clause you specify is added to the internal CREATE TABLE statement (CREATE INDEX for the i_index_clause) at index creation. You can specify most allowable clauses, such as storage, LOB storage, or partitioning. However, you cannot specify an index-organized table clause.

See Also:

For more information about how to specify CREATE TABLE and CREATE INDEX statements, see Oracle Database SQL Language Reference.

BASIC_STORAGE has the following attributes:

Table 2-42 BASIC_STORAGE Attributes

Attribute	Attribute Value
`i_index_clause`	Parameter clause for dr$indexname$X index creation. Specify storage and tablespace clauses to add to the end of the internal `CREATE` `INDEX` statement. The default clause is: `'COMPRESS 2'` which instructs Oracle Text to compress this index table. If you choose to override the default, Oracle recommends including `COMPRESS 2` in your parameter clause to compress this table, because such compression saves disk space and helps query performance.
`i_rowid_index_clause`	Parameter clause to specify the storage clause for the $R index on dr$rowid column of the $I table. Specify storage and tablespace clauses to add to the end of the internal `CREATE` `INDEX` statement. This clause is only used by the `CTXCAT` index type.
`i_table_clause`	Parameter clause for dr$indexname$I table creation. Specify storage and tablespace clauses to add to the end of the internal `CREATE` `TABLE` statement. The I table is the index data table. Note: Oracle strongly recommends that you do not specify "disable storage in row" for $I LOBs, as this will greatly degrade the query performance.
`k_table_clause`	Parameter clause for dr$indexname$K table creation. Specify storage and tablespace clauses to add to the end of the internal `CREATE` `TABLE` statement. The K table is the keymap table.
`r_table_clause`	Parameter clause for dr$indexname$R table creation. Specify storage and tablespace clauses to add to the end of the internal `CREATE` `TABLE` statement. The R table is the rowid table. The default clause is: `'LOB(DATA) STORE AS (CACHE)'.` If you modify this attribute, always include this clause for good performance.
`n_table_clause`	Parameter clause for dr$indexname$N table creation. Specify storage and tablespace clauses to add to the end of the internal `CREATE` `TABLE` statement. The N table is the negative list table.
`p_table_clause`	Parameter clause for the substring index if you have enabled `SUBSTRING_INDEX` in the `BASIC_WORDLIST`. Specify storage and tablespace clauses to add to the end of the internal `CREATE` `INDEX` statement. The P table is an index-organized table so the storage clause you specify must be appropriate to this type of table.
`s_table_clause`	Parameter clause for dr$indexname$S table creation. Specify storage and tablespace clauses to add to the end of the internal `CREATE` `TABLE` statement. The default clause is `nocompress`. For performance reasons, $S table must be created on a tablespace with db block size >= 4K without overflow segment and without a `PCTTHRESHOLD` clause. If $S is created on a tablespace with db block size < 4K, or is created with an overflow segment or with `PCTTHRESHOLD` clause, then appropriate errors will be raised during `CREATE` `INDEX`. The S table is the table that stores `SDATA` section values. If this clause is specified for a storage preference in an index without `SDATA`, then it will have no effect on the index, and index creation will still succeed.

2.6.1.1 Storage Default Behavior

By default, BASIC_STORAGE attributes are not set. In such cases, the Text index tables are created in the index owner's default tablespace. Consider the following statement, entered by user IUSER, with no BASIC_STORAGE attributes set:

create index IOWNER.idx on TOWNER.tab(b) indextype is ctxsys.context;

In this example, the text index is created in IOWNER's default tablespace.
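To check which tablespace that is, a query along the following lines can be used. This is a sketch that assumes the session has access to the DBA_USERS data dictionary view; IOWNER is the index owner from the statement above.

```sql
-- show the default tablespace of the index owner
select username, default_tablespace
  from dba_users
  where username = 'IOWNER';
```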

2.6.1.2 Storage Examples

The following example specifies that the index tables are to be created in the foo tablespace with an initial extent of 1K. The one exception is the $R table, which this example places in the users tablespace:

begin
ctx_ddl.create_preference('mystore', 'BASIC_STORAGE');
ctx_ddl.set_attribute('mystore', 'I_TABLE_CLAUSE',
                        'tablespace foo storage (initial 1K)'); 
ctx_ddl.set_attribute('mystore', 'K_TABLE_CLAUSE',
                        'tablespace foo storage (initial 1K)'); 
ctx_ddl.set_attribute('mystore', 'R_TABLE_CLAUSE',
                        'tablespace users storage (initial 1K) lob
                         (data) store as (disable storage in row cache)');
ctx_ddl.set_attribute('mystore', 'N_TABLE_CLAUSE',
                        'tablespace foo storage (initial 1K)'); 
ctx_ddl.set_attribute('mystore', 'I_INDEX_CLAUSE',
                        'tablespace foo storage (initial 1K) compress 2');
ctx_ddl.set_attribute('mystore', 'P_TABLE_CLAUSE',
                        'tablespace foo storage (initial 1K)'); 
ctx_ddl.set_attribute('mystore', 'S_TABLE_CLAUSE',
                        'tablespace foo storage (initial 1K)');
end;
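The storage preference is applied when it is named in the parameter string of CREATE INDEX. The following sketch assumes that the foo and users tablespaces exist and reuses the mytable table and docs column from earlier examples; the index name mystore_idx is hypothetical. The second statement is one way to inspect where the internal tables were placed. Index-organized tables may show a null tablespace in USER_TABLES.

```sql
-- build the index with the mystore storage preference
create index mystore_idx on mytable ( docs )
  indextype is ctxsys.context parameters ('Storage mystore');

-- list the internal tables of the index and their tablespaces
select table_name, tablespace_name
  from user_tables
  where table_name like 'DR$MYSTORE_IDX$%';
```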

2.7 Section Group Types

To enter WITHIN queries on document sections, you must create a section group before you define your sections. Specify your section group in the parameter clause of CREATE INDEX.

To create a section group, you can specify one of the following group types with the CTX_DDL.CREATE_SECTION_GROUP procedure:

Table 2-43 Section Group Types

Type	Description
`NULL_SECTION_GROUP`	Use this group type when you define no sections or when you define only `SENTENCE` or `PARAGRAPH` sections. This is the default.
`BASIC_SECTION_GROUP`	Use this group type for defining sections where the start and end tags are of the form `<A>` and `</A>`. Note: This group type does not support input such as unbalanced parentheses, comment tags, and attributes. Use `HTML_SECTION_GROUP` for this type of input.
`HTML_SECTION_GROUP`	Use this group type for indexing HTML documents and for defining sections in HTML documents.
`XML_SECTION_GROUP`	Use this group type for indexing XML documents and for defining sections in XML documents. All sections to be indexed must be manually defined for this group.
`AUTO_SECTION_GROUP`	Use this group type to automatically create a zone section for each start-tag/end-tag pair in an XML document. The section names derived from XML tags are case sensitive as in XML. Attribute sections are created automatically for XML tags that have attributes. Attribute sections are named in the form tag@attribute. Special sections can be added to `AUTO_SECTION_GROUP` for `WITHIN` `SENTENCE` and `WITHIN` `PARAGRAPH` searches. Once a sentence or paragraph section is added to the `AUTO_SECTION_GROUP`, sections with corresponding tag names '`sentence`' or '`paragraph`' (case insensitive) are treated as stop sections. Stop sections, empty tags, processing instructions, and comments are not indexed. The following limitations apply to automatic section groups: You cannot add zone, field, sdata, or special sections to an automatic section group. You can define a stop section that applies only to one particular type; that is, if you have two different XML DTDs, both of which use a tag called `FOO`, you can define `(TYPE1)FOO` to be stopped, but `(TYPE2)FOO` to not be stopped. The length of the indexed tags, including prefix and namespace, cannot exceed 64 bytes. Tags longer than this are not indexed.
`PATH_SECTION_GROUP`	Use this group type to index XML documents. Behaves like the `AUTO_SECTION_GROUP`. The difference is that with this section group you can do path searching with the `INPATH` and `HASPATH` operators. Queries are also case-sensitive for tag and attribute names. Stop sections are not allowed.
`NEWS_SECTION_GROUP`	Use this group for defining sections in newsgroup formatted documents according to RFC 1036.

2.7.1 Section Group Examples

The following examples show the use of section groups in both HTML and XML documents.

2.7.1.1 Creating Section Groups in HTML Documents

The following statement creates a section group called htmgroup with the HTML group type.

begin
ctx_ddl.create_section_group('htmgroup', 'HTML_SECTION_GROUP');
end;

You can optionally add sections to this group using the procedures in the CTX_DDL package, such as CTX_DDL.ADD_SPECIAL_SECTION or CTX_DDL.ADD_ZONE_SECTION. To index your documents, enter a statement such as:

create index myindex on docs(htmlfile) indextype is ctxsys.context 
parameters('filter ctxsys.null_filter section group htmgroup');

See Also:

For more information on section groups, see Chapter 7, "CTX_DDL Package"
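As a sketch of the optional step of adding sections, the following block adds a zone section to htmgroup and then queries within it. The section name heading, its mapping to the H1 tag, and the query term are illustrative assumptions. Sections need to be added before the index is created.

```sql
begin
  -- map the HTML <H1> tag to a zone section named heading
  ctx_ddl.add_zone_section('htmgroup', 'heading', 'H1');
end;
/

-- once the index exists with section group htmgroup,
-- find documents where "oracle" appears inside an H1 element
select count(*) from docs
  where contains ( htmlfile, 'oracle WITHIN heading' ) > 0;
```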

2.7.1.2 Creating Section Groups in XML Documents

The following statement creates a section group called xmlgroup with the XML_SECTION_GROUP group type.

begin
ctx_ddl.create_section_group('xmlgroup', 'XML_SECTION_GROUP');
end;

You can optionally add sections to this group using the procedures in the CTX_DDL package, such as CTX_DDL.ADD_ATTR_SECTION or CTX_DDL.ADD_STOP_SECTION. To index your documents, enter a statement such as:

create index myindex on docs(htmlfile) indextype is ctxsys.context 
parameters('filter ctxsys.null_filter section group xmlgroup');

See Also:

For more information on section groups, see Chapter 7, "CTX_DDL Package"
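Because XML_SECTION_GROUP indexes only the sections that you define, a typical setup adds sections before the index is created. The following sketch assumes documents that contain hypothetical book and chapter elements, where book has a title attribute; the section names and the query term are illustrative.

```sql
begin
  -- zone section for the contents of <chapter>...</chapter>
  ctx_ddl.add_zone_section('xmlgroup', 'chapter', 'chapter');
  -- attribute section for the title attribute of <book>
  ctx_ddl.add_attr_section('xmlgroup', 'booktitle', 'book@title');
end;
/

-- once the index exists with section group xmlgroup,
-- find documents whose book title attribute contains "Tale"
select count(*) from docs
  where contains ( htmlfile, 'Tale WITHIN booktitle' ) > 0;
```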

2.7.1.3 Automatic Sectioning in XML Documents

The following statement creates a section group called auto with the AUTO_SECTION_GROUP group type. This section group automatically creates sections from tags in XML documents.

begin

ctx_ddl.create_section_group('auto', 'AUTO_SECTION_GROUP');

end;

CREATE INDEX myindex on docs(htmlfile) INDEXTYPE IS ctxsys.context 
PARAMETERS('filter ctxsys.null_filter section group auto');
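With automatic sectioning, queries can name a tag or a tag@attribute section directly. The following sketch assumes documents that contain hypothetical book elements with a title attribute, and a hypothetical fluff element that should not be indexed. The stop section would have to be added before the CREATE INDEX statement is run.

```sql
begin
  -- do not create a section for <fluff> elements (hypothetical tag)
  ctx_ddl.add_stop_section('auto', 'fluff');
end;
/

-- zone section named after the tag (tag names are case sensitive)
select count(*) from docs
  where contains ( htmlfile, 'dog WITHIN book' ) > 0;

-- attribute section named in the form tag@attribute
select count(*) from docs
  where contains ( htmlfile, 'Tale WITHIN book@title' ) > 0;
```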

2.8 Classifier Types

This section describes the classifier types used to create a preference for CTX_CLS.TRAIN and CTXRULE index creation. The following two classifier types are supported:

RULE_CLASSIFIER
SVM_CLASSIFIER

Note:

In Oracle Database Express Edition (XE), RULE_CLASSIFIER and SVM_CLASSIFIER are not supported because the Data Mining option is not available. This is also true for KMEAN_CLUSTERING.

2.8.1 RULE_CLASSIFIER

Use the RULE_CLASSIFIER type for creating preferences for the query rule generating procedure, CTX_CLS.TRAIN, and for CTXRULE creation. The rules generated with this type are essentially query strings and can be easily examined. The queries generated by this classifier can use the AND, NOT, or ABOUT operators. The WITHIN operator is supported for queries on field sections only.

This type has the following attributes:

Table 2-44 RULE_CLASSIFIER Attributes

Attribute	Data Type	Default	Min Value	Max Value	Description
`THRESHOLD`	I	50	1	99	Specify the threshold (as a percentage) for rule generation. A rule is output only when its confidence level exceeds this threshold.
`MAX_TERMS`	I	100	20	2000	For each class, a list of relevant terms is selected to form rules. Specify the maximum number of terms that can be selected for each class.
`MEMORY_SIZE`	I	500	10	4000	Specify memory usage for training in MB. Larger values improve performance.
`NT_THRESHOLD`	F	0.001	0	0.90	Specify a threshold for term selection. There are two thresholds guiding two steps in selecting relevant terms. This threshold controls the behavior of the first step. At this step, terms are selected as candidate terms for the further consideration in the second step. The term is chosen when the ratio of the occurrence frequency over the number of documents in the training set is larger than this threshold.
`TERM_THRESHOLD`	I	10	0	100	Specify a threshold as a percentage for term selection. This threshold controls the second step term selection. Each candidate term has a numerical quantity calculated to imply its correlation with a given class. The candidate term will be selected for this class only when the ratio of its quantity value over the maximum value for all candidate terms in the class is larger than this threshold.
`PRUNE_LEVEL`	I	75	0	100	Specify how much to prune a built decision tree for better coverage. Higher values mean more aggressive pruning and the generated rules will have larger coverage but less accuracy.
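The attributes in this table are set in the same way as for other preference types. The following is a minimal sketch; the preference name my_rule_classifier and the chosen values are assumptions for illustration, picked to fall within the ranges listed above.

```sql
begin
  ctx_ddl.create_preference('my_rule_classifier', 'RULE_CLASSIFIER');
  -- output a rule only when its confidence exceeds 70 percent
  ctx_ddl.set_attribute('my_rule_classifier', 'THRESHOLD', '70');
  -- select at most 50 relevant terms for each class
  ctx_ddl.set_attribute('my_rule_classifier', 'MAX_TERMS', '50');
  -- prune less aggressively than the default of 75
  ctx_ddl.set_attribute('my_rule_classifier', 'PRUNE_LEVEL', '50');
end;
/
```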

2.8.2 SVM_CLASSIFIER

Use the SVM_CLASSIFIER type for creating preferences for the rule generating procedure, CTX_CLS.TRAIN, and for CTXRULE creation. This classifier type represents the Support Vector Machine method of classification and generates rules in binary format. Use this classifier type when you need high classification accuracy.

This type has the following attributes:

Table 2-45 SVM_CLASSIFIER Attributes

Attribute Name	Data Type	Default	Min Value	Max Value	Description
`MAX_DOCTERMS`	I	50	10	8192	Specify the maximum number of terms representing one document.
`MAX_FEATURES`	I	3,000	1	100,000	Specify the maximum number of distinct features.
`THEME_ON`	B	FALSE	NULL	NULL	Specify `TRUE` to use themes as features. Classification with themes requires an installed knowledge base. A knowledge base may or may not have been installed with Oracle Text. For more information on knowledge bases, see the Oracle Text Application Developer's Guide.
`TOKEN_ON`	B	TRUE	NULL	NULL	Specify `TRUE` to use regular tokens as features.
`STEM_ON`	B	FALSE	NULL	NULL	Specify `TRUE` to use stemmed tokens as features. This only works when turning `INDEX_STEM` on for the lexer.
`MEMORY_SIZE`	I	500	10	4000	Specify approximate memory size in MB.
`SECTION_WEIGHT`	I	2	0	100	Specify the occurrence multiplier for adding a term in a field section as a normal term. For example, by default, the term cat in "<A>cat</A>" is a field section term and is treated as a normal term with occurrence equal to 2, but you can specify that it be treated as a normal term with a weight up to 100. `SECTION_WEIGHT` is only meaningful when the index policy specifies a field section.
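An SVM_CLASSIFIER preference is created in the same way as other preferences. The following is a minimal sketch; the preference name my_svm_classifier and the chosen values are assumptions for illustration, picked to fall within the ranges listed above.

```sql
begin
  ctx_ddl.create_preference('my_svm_classifier', 'SVM_CLASSIFIER');
  -- allow up to 5000 distinct features instead of the default 3000
  ctx_ddl.set_attribute('my_svm_classifier', 'MAX_FEATURES', '5000');
  -- represent each document by at most 100 terms
  ctx_ddl.set_attribute('my_svm_classifier', 'MAX_DOCTERMS', '100');
  -- count a term in a field section as 5 occurrences of a normal term
  ctx_ddl.set_attribute('my_svm_classifier', 'SECTION_WEIGHT', '5');
end;
/
```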

2.9 Cluster Types

This section describes the cluster types used for creating preferences for the CTX_CLS.CLUSTERING procedure.

Note:

In Oracle Database Express Edition (XE), KMEAN_CLUSTERING is not supported because the Data Mining option is not available. This is also true for RULE_CLASSIFIER and SVM_CLASSIFIER.

See Also:

For more information about clustering, see "CLUSTERING" in Chapter 6, "CTX_CLS Package" as well as the Oracle Text Application Developer's Guide.

2.9.1 KMEAN_CLUSTERING

This clustering type has the following attributes:

Table 2-46 KMEAN_CLUSTERING Attributes

Attribute Name	Data Type	Default	Min Value	Max Value	Description
`MAX_DOCTERMS`	I	50	10	8192	Specify the maximum number of distinct terms representing one document.
`MAX_FEATURES`	I	3,000	1	500,000	Specify the maximum number of distinct features.
`THEME_ON`	B	FALSE	NULL	NULL	Specify `TRUE` to use themes as features. Clustering with themes requires an installed knowledge base. A knowledge base may or may not have been installed with Oracle Text. For more information on knowledge bases, see Oracle Text Application Developer's Guide.
`TOKEN_ON`	B	TRUE	NULL	NULL	Specify `TRUE` to use regular tokens as features.
`STEM_ON`	B	FALSE	NULL	NULL	Specify `TRUE` to use stemmed tokens as features. This only works when turning `INDEX_STEM` on for the lexer.
`MEMORY_SIZE`	I	500	10	4000	Specify approximate memory size in MB.
`SECTION_WEIGHT`	I	2	0	100	Specify the occurrence multiplier for adding a term in a field section as a normal term. For example, by default, the term cat in "<A>cat</A>" is a field section term and is treated as a normal term with occurrence equal to 2, but you can specify that it be treated as a normal term with a weight up to 100. `SECTION_WEIGHT` is only meaningful when the index policy specifies a field section.
`CLUSTER_NUM`	I	200	2	20000	Specify the total number of leaf clusters to be generated.
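A clustering preference is created in the same way as other preferences. The following is a minimal sketch; the preference name my_cluster and the chosen values are assumptions for illustration, picked to fall within the ranges listed above.

```sql
begin
  ctx_ddl.create_preference('my_cluster', 'KMEAN_CLUSTERING');
  -- generate 10 leaf clusters instead of the default 200
  ctx_ddl.set_attribute('my_cluster', 'CLUSTER_NUM', '10');
  -- allow up to 5000 distinct features
  ctx_ddl.set_attribute('my_cluster', 'MAX_FEATURES', '5000');
end;
/
```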

2.10 Stoplists

Stoplists identify the words in your language that are not to be indexed. In English, you can also identify stopthemes that are not to be indexed. By default, the system indexes text using the system-supplied stoplist that corresponds to your database language.

Oracle Text provides default stoplists for most common languages including English, French, German, Spanish, Chinese, Dutch, and Danish. These default stoplists contain only stopwords.

See Also:

For more information about the supplied default stoplists, see Appendix E, "Oracle Text Supplied Stoplists"

2.10.1 Multi-Language Stoplists

You can create multi-language stoplists to hold language-specific stopwords. A multi-language stoplist is useful when you use the MULTI_LEXER to index a table that contains documents in different languages, such as English and German.

To create a multi-language stoplist, use the CTX_DDL.CREATE_STOPLIST procedure and specify a stoplist type of MULTI_STOPLIST. Add language-specific stopwords with CTX_DDL.ADD_STOPWORD.

At indexing time, the language column of each document is examined, and only the stopwords for that language are eliminated. At query time, the session language setting determines the active stopwords, just as it determines the active lexer when using the multi-lexer.
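The steps above can be sketched as follows. The stoplist name multistop and the sample stopwords are assumptions for illustration; the third argument of CTX_DDL.ADD_STOPWORD names the language to which the stopword applies.

```sql
begin
  ctx_ddl.create_stoplist('multistop', 'MULTI_STOPLIST');
  -- each stopword is registered for one language only
  ctx_ddl.add_stopword('multistop', 'the', 'english');
  ctx_ddl.add_stopword('multistop', 'die', 'german');
end;
/
```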

2.10.2 Creating Stoplists

Create your own stoplists using the CTX_DDL.CREATE_STOPLIST procedure. With this procedure you can create a BASIC_STOPLIST for a single-language stoplist, or you can create a MULTI_STOPLIST for a multi-language stoplist.

When you create your own stoplist, you must specify it in the parameter clause of CREATE INDEX.

To create stoplists for Chinese or Japanese languages, use the CHINESE_LEXER or JAPANESE_LEXER respectively, and update the appropriate lexicon so that it contains the stopwords.
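For a single-language stoplist, the procedure described in this section can be sketched as follows. The stoplist name mystop, the sample stopwords, and the index name mystop_idx are assumptions for illustration; mytable and docs are the table and column used in earlier examples.

```sql
begin
  ctx_ddl.create_stoplist('mystop', 'BASIC_STOPLIST');
  ctx_ddl.add_stopword('mystop', 'because');
  ctx_ddl.add_stopword('mystop', 'nonetheless');
end;
/

-- name the custom stoplist in the parameter clause
create index mystop_idx on mytable ( docs )
  indextype is ctxsys.context parameters ('Stoplist mystop');
```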

2.10.3 Modifying the Default Stoplist

The default stoplist is always named CTXSYS.DEFAULT_STOPLIST. Use the following procedures to modify this stoplist:

CTX_DDL.ADD_STOPWORD
CTX_DDL.REMOVE_STOPWORD
CTX_DDL.ADD_STOPTHEME
CTX_DDL.ADD_STOPCLASS

When you modify CTXSYS.DEFAULT_STOPLIST with the CTX_DDL package, you must re-create your index for the changes to take effect.

2.10.3.1 Dynamic Addition of Stopwords

You can add stopwords dynamically to a default or custom stoplist with ALTER INDEX. When you add a stopword dynamically, you need not re-index, because the word immediately becomes a stopword and is removed from the index.

Note:

Even though you can dynamically add stopwords to an index, you cannot dynamically remove stopwords. To remove a stopword, you must use CTX_DDL.REMOVE_STOPWORD, and then drop and re-create your index.
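As a sketch of dynamic addition, the following statement adds a stopword to an existing index through the parameter string of ALTER INDEX. The index name myindex and the stopword are assumptions for illustration, and the exact ALTER INDEX form accepted may vary by release, so check the ALTER INDEX reference for your version.

```sql
-- add "therefore" as a stopword without rebuilding the index from scratch
alter index myindex rebuild parameters ('add stopword therefore');
```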

2.11 System-Defined Preferences

When you install Oracle Text, some indexing preferences are created. You can use these preferences in the parameter clause of CREATE INDEX or define your own.

The default index parameters are mapped to some of the system-defined preferences described in this section.

See Also:

For more information about default index parameters, see "Default Index Parameters"

System-defined preferences are divided into the following categories:

Data Storage
Filter
Lexer
Section Group
Stoplist
Storage
Wordlist

2.11.1 Data Storage

This section discusses the types associated with data storage preferences.

2.11.1.1 CTXSYS.DEFAULT_DATASTORE

This preference uses the DIRECT_DATASTORE type. Use this preference to create indexes for text columns in which the text is stored directly in the column.

2.11.1.2 CTXSYS.FILE_DATASTORE

This preference uses the FILE_DATASTORE type.

2.11.1.3 CTXSYS.URL_DATASTORE

This preference uses the URL_DATASTORE type.

2.11.2 Filter

This section discusses the types associated with filtering preferences.

2.11.2.1 CTXSYS.NULL_FILTER

This preference uses the NULL_FILTER type.

2.11.2.2 CTXSYS.AUTO_FILTER

This preference uses the AUTO_FILTER type.

2.11.3 Lexer

This section discusses the types associated with lexer preferences.

2.11.3.1 CTXSYS.DEFAULT_LEXER

The default lexer depends on the language used at install time. The following sections describe the default settings for CTXSYS.DEFAULT_LEXER for each language.

2.11.3.1.1 American and English Language Settings

If your language is English, this preference uses the BASIC_LEXER with the index_themes attribute disabled.

2.11.3.1.2 Danish Language Settings

If your language is Danish, this preference uses the BASIC_LEXER with the following option enabled:

Alternate spelling (alternate_spelling attribute set to DANISH)

2.11.3.1.3 Dutch Language Settings

If your language is Dutch, this preference uses the BASIC_LEXER with the following option enabled:

Composite indexing (composite attribute set to DUTCH)

2.11.3.1.4 German and German DIN Language Settings

If your language is German, then this preference uses the BASIC_LEXER with the following options enabled:

Case-sensitive indexing (mixed_case attribute enabled)
Composite indexing (composite attribute set to GERMAN)
Alternate spelling (alternate_spelling attribute set to GERMAN)

2.11.3.1.5 Finnish, Norwegian, and Swedish Language Settings

If your language is Finnish, Norwegian, or Swedish, this preference uses the BASIC_LEXER with the following option enabled:

Alternate spelling (alternate_spelling attribute set to SWEDISH)

2.11.3.1.6 Japanese Language Settings

If your language is Japanese, this preference uses the JAPANESE_VGRAM_LEXER.

2.11.3.1.7 Korean Language Settings

If your language is Korean, this preference uses the KOREAN_MORPH_LEXER. All attributes for the KOREAN_MORPH_LEXER are enabled.

2.11.3.1.8 Chinese Language Settings

If your language is Simplified or Traditional Chinese, this preference uses the CHINESE_VGRAM_LEXER.

2.11.3.1.9 Other Languages

For all other languages not listed in this section, this preference uses the BASIC_LEXER with no attributes set.

See Also:

To learn more about these options, see "BASIC_LEXER"

2.11.3.2 CTXSYS.DEFAULT_EXTRACT_LEXER

This preference uses AUTO_LEXER with the following options:

alternate_spelling is NONE
base_letter is NO
mixed_case is YES

2.11.3.3 CTXSYS.BASIC_LEXER

This preference uses the BASIC_LEXER.

2.11.4 Section Group

This section discusses the types associated with section group preferences.

2.11.4.1 CTXSYS.NULL_SECTION_GROUP

This preference uses the NULL_SECTION_GROUP type.

2.11.4.2 CTXSYS.HTML_SECTION_GROUP

This preference uses the HTML_SECTION_GROUP type.

2.11.4.3 CTXSYS.AUTO_SECTION_GROUP

This preference uses the AUTO_SECTION_GROUP type.

2.11.4.4 CTXSYS.PATH_SECTION_GROUP

This preference uses the PATH_SECTION_GROUP type.

2.11.5 Stoplist

This section discusses the types associated with stoplist preferences.

2.11.5.1 CTXSYS.DEFAULT_STOPLIST

This stoplist preference defaults to the stoplist of your database language.

See Also:

For a complete list of the stop words in the supplied stoplists, see Appendix E, "Oracle Text Supplied Stoplists"

2.11.5.2 CTXSYS.EMPTY_STOPLIST

This stoplist has no words.

2.11.6 Storage

This section discusses the types associated with storage preferences.

2.11.6.1 CTXSYS.DEFAULT_STORAGE

This storage preference uses the BASIC_STORAGE type.

2.11.7 Wordlist

This section discusses the types associated with wordlist preferences.

2.11.7.1 CTXSYS.DEFAULT_WORDLIST

This preference uses the language stemmer for your database language. If your language is not listed in Table 2-40, then this preference defaults to the NULL stemmer and the GENERIC fuzzy matching attribute.

2.12 System Parameters

This section describes the Oracle Text system parameters, which are divided into the following categories:

General System Parameters
Default Index Parameters

2.12.1 General System Parameters

When you install Oracle Text, in addition to the system-defined preferences, the following system parameters are set:

Table 2-47 General System Parameters

System Parameter	Description
`MAX_INDEX_MEMORY`	This is the maximum indexing memory that can be specified in the parameter clause of `CREATE` `INDEX` and `ALTER` `INDEX`. The maximum value for this parameter is 2 GB -1.
`DEFAULT_INDEX_MEMORY`	This is the default indexing memory used with `CREATE` `INDEX` and `ALTER` `INDEX`.
`LOG_DIRECTORY`	This is the directory for `CTX_OUTPUT` log files.
`CTX_DOC_KEY_TYPE`	This is the default input key type, either `ROWID` or `PRIMARY_KEY`, for the `CTX_DOC` procedures. Set to `ROWID` at install time. See Also: CTX_DOC.SET_KEY_TYPE.

View system defaults by querying the CTX_PARAMETERS view. Change defaults using the CTX_ADM.SET_PARAMETER procedure.
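As an illustration of how the memory parameters relate to index creation, the following sketch first displays them and then requests a specific amount of indexing memory in the parameter string. The index name mem_idx and the 50M figure are assumptions for illustration, and mytable and docs are the table and column used in earlier examples. The requested value must not exceed MAX_INDEX_MEMORY.

```sql
-- show the current memory-related system parameters
select par_name, par_value from ctx_parameters
  where upper(par_name) in ('MAX_INDEX_MEMORY', 'DEFAULT_INDEX_MEMORY');

-- request 50 megabytes of indexing memory instead of DEFAULT_INDEX_MEMORY
create index mem_idx on mytable ( docs )
  indextype is ctxsys.context parameters ('memory 50M');
```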

2.12.2 Default Index Parameters

This section describes the default index parameters that are used when you create CONTEXT, CTXCAT, and CTXRULE indexes.

2.12.2.1 CONTEXT Index Parameters

The following default parameters are used when you create a CONTEXT index and do not specify preferences in the parameter clause of CREATE INDEX. Each default parameter names a system-defined preference to use for data storage, filtering, lexing, and so on.

Table 2-48 Default CONTEXT Index Parameters

Parameter	Used When	Default Value
`DEFAULT_DATASTORE`	No datastore preference specified in parameter clause of `CREATE` `INDEX`.	CTXSYS.DEFAULT_DATASTORE
`DEFAULT_FILTER_FILE`	No filter preference specified in parameter clause of `CREATE` `INDEX`, and either of the following conditions is true: Your documents are stored in external files (BFILEs), or you specify a datastore preference that uses `FILE_DATASTORE`.	CTXSYS.AUTO_FILTER
`DEFAULT_FILTER_BINARY`	No filter preference specified in parameter clause of `CREATE` `INDEX`, and Oracle Text detects that the text column datatype is `RAW`, `LONG` `RAW`, or `BLOB`.	CTXSYS.AUTO_FILTER
`DEFAULT_FILTER_TEXT`	No filter preference specified in parameter clause of `CREATE` `INDEX`, and Oracle Text detects that the text column datatype is either `LONG`, `VARCHAR2`, `VARCHAR`, `CHAR`, or `CLOB`.	CTXSYS.NULL_FILTER
`DEFAULT_SECTION_HTML`	No section group specified in parameter clause of `CREATE` `INDEX`, and when either of the following conditions is true: Your datastore preference uses `URL_DATASTORE` or Your filter preference uses `AUTO_FILTER`.	CTXSYS.HTML_SECTION_GROUP
`DEFAULT_SECTION_TEXT`	No section group specified in parameter clause of `CREATE` `INDEX`, and when you do not use either `URL_DATASTORE` or `AUTO_FILTER`.	CTXSYS.NULL_SECTION_GROUP
`DEFAULT_STORAGE`	No storage preference specified in parameter clause of `CREATE` `INDEX`.	CTXSYS.DEFAULT_STORAGE
`DEFAULT_LEXER`	No lexer preference specified in parameter clause of `CREATE` `INDEX`.	CTXSYS.DEFAULT_LEXER
`DEFAULT_STOPLIST`	No stoplist specified in parameter clause of `CREATE` `INDEX`.	CTXSYS.DEFAULT_STOPLIST
`DEFAULT_WORDLIST`	No wordlist preference specified in parameter clause of `CREATE` `INDEX`.	CTXSYS.DEFAULT_WORDLIST

2.12.2.2 CTXCAT Index Parameters

The following default parameters are used when you create a CTXCAT index with CREATE INDEX and do not specify any parameters in the parameter string. The CTXCAT index supports only the index set, lexer, storage, stoplist, and wordlist parameters. Each default parameter names a system-defined preference.

Table 2-49 Default CTXCAT Index Parameters

Parameter	Used When	Default Value
`DEFAULT_CTXCAT_INDEX_SET`	No index set specified in parameter clause of `CREATE` `INDEX`.	n/a
`DEFAULT_CTXCAT_STORAGE`	No storage preference specified in parameter clause of `CREATE` `INDEX`.	CTXSYS.DEFAULT_STORAGE
`DEFAULT_CTXCAT_LEXER`	No lexer preference specified in parameter clause of `CREATE` `INDEX`.	CTXSYS.DEFAULT_LEXER
`DEFAULT_CTXCAT_STOPLIST`	No stoplist specified in parameter clause of `CREATE` `INDEX`.	CTXSYS.DEFAULT_STOPLIST
`DEFAULT_CTXCAT_WORDLIST`	No wordlist preference specified in parameter clause of `CREATE` `INDEX`. Note that while you can specify a wordlist preference for `CTXCAT` indexes, most of the attributes do not apply, because the catsearch query language does not support wildcarding, fuzzy, and stemming. The only attribute that is useful is `PREFIX_INDEX` for Japanese data.	CTXSYS.DEFAULT_WORDLIST

2.12.2.3 CTXRULE Index Parameters

Table 2-50 lists the default parameters that are used when you create a CTXRULE index with CREATE INDEX and do not specify any parameters in the parameter string. The CTXRULE index supports only the lexer, storage, stoplist, and wordlist parameters. Each default parameter names a system-defined preference.

Table 2-50 Default CTXRULE Index Parameters

Parameter	Used When	Default Value
`DEFAULT_CTXRULE_LEXER`	No lexer preference specified in parameter clause of `CREATE` `INDEX`.	CTXSYS.DEFAULT_LEXER
`DEFAULT_CTXRULE_STORAGE`	No storage preference specified in parameter clause of `CREATE` `INDEX`.	CTXSYS.DEFAULT_STORAGE
`DEFAULT_CTXRULE_STOPLIST`	No stoplist specified in parameter clause of `CREATE` `INDEX`.	CTXSYS.DEFAULT_STOPLIST
`DEFAULT_CTXRULE_WORDLIST`	No wordlist preference specified in parameter clause of `CREATE` `INDEX`.	CTXSYS.DEFAULT_WORDLIST
`DEFAULT_CLASSIFIER`	No classifier preference is specified in parameter clause.	`RULE_CLASSIFIER`

CTXRULE Index Limitations

The CTXRULE index does not support the following query operators:

Fuzzy
Soundex

It also does not support the following BASIC_WORDLIST attributes:

SUBSTRING_INDEX
PREFIX_INDEX

2.12.2.4 Viewing Default Values

View system defaults by querying the CTX_PARAMETERS view. For example, to see all parameters and values, enter the following statement:

SQL> SELECT par_name, par_value from ctx_parameters;

2.12.2.5 Changing Default Values

Change a default value using the CTX_ADM.SET_PARAMETER procedure to name another custom or system-defined preference to use as default.
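For example, the following sketch makes a custom lexer preference the default for new CONTEXT indexes and then confirms the change. The preference name mylexer is hypothetical and is assumed to exist already. The CTX_ADM procedures are normally restricted to the CTXSYS user, so this block is assumed to run with those privileges.

```sql
begin
  -- use the custom preference mylexer when no lexer is specified
  ctx_adm.set_parameter('default_lexer', 'mylexer');
end;
/

-- confirm the new default
select par_name, par_value from ctx_parameters
  where upper(par_name) = 'DEFAULT_LEXER';
```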