Indexing Special Characters

When you use the BASIC_LEXER preference type, you can specify how non-alphanumeric characters such as hyphens and periods are indexed in relation to the tokens that contain them. For example, you can specify that Oracle Text include or exclude hyphen character (-) when indexing a word such as Web-site.

These characters fall into BASIC_LEXER categories according to the behavior you require during indexing. The way you set the lexer to behave for indexing is the way it behaves for query parsing.

Some of the special characters you can set are as follows:

Printjoin Characters

Define a non-alphanumeric character as printjoin when you want this character to be included in the token during indexing.

For example, if you want your index to include hyphens and underscore characters, define them as printjoins. This means that words such as web-site are indexed as web-site. A query on website does not find web-site.

Skipjoin Characters

Define a non-alphanumeric character as a skipjoin when you do not want this character to be indexed with the token that contains it.

For example, with the hyphen (-) character defined as a skipjoin, the word web-site is indexed as website. A query on web-site finds documents containing website and web-site.

Other Characters

Other characters can be specified to control other tokenization behavior such as token separation (startjoins, endjoins, whitespace), punctuation identification (punctuations), number tokenization (numjoins), and word continuation after line-breaks (continuation). These categories of characters have defaults, which you can modify.

See Also:

Oracle Text Reference to learn more about the BASIC_LEXERtype