15 Oracle Text Utilities

This chapter discusses the utilities shipped with Oracle Text. The following topics are included:

Thesaurus Loader (ctxload)
Entity Extraction User Dictionary Loader (ctxload)
Knowledge Base Extension Compiler (ctxkbtc)
Lexical Compiler (ctxlc)

15.1 Thesaurus Loader (ctxload)

Use ctxload to import a thesaurus file into the Oracle Text thesaurus tables.

An import file is an ASCII flat file that contains entries for synonyms, broader terms, narrower terms, or related terms, which can be used to expand queries.

See Also:

For examples of import files for thesaurus importing, see "Structure of ctxload Thesaurus Import File" in Appendix C, "Text Loading Examples for Oracle Text"

15.1.1 Text Loading

The ctxload program no longer supports the loading of text columns. To load files to a text column in batch mode, Oracle recommends that you use SQL*Loader.

15.1.2 ctxload Syntax

ctxload -user username[/password][@sqlnet_address]
        -name object_name 
        -file file_name 
      
       [-thes]
       [-thescase y|n]
       [-thesdump]
       [-log file_name]
       [-trace]
       [-drop]

15.1.2.1 Mandatory Arguments

-user

Specify the user name and password of the user running ctxload.

The user name and password can be followed immediately by @sqlnet_address to permit logging on to remote databases. The value for sqlnet_address is a database connect string. If the TWO_TASK environment variable is set to a remote database, then you do not need to specify a value for sqlnet_address to connect to the database.

-name object_name

When you use ctxload to import a thesaurus, use object_name to specify the name of the thesaurus to be imported.

Use object_name to identify the thesaurus in queries that use thesaurus operators.

Note:

Thesaurus name must be unique. If the name specified for the thesaurus is identical to an existing thesaurus, then ctxload returns an error and does not overwrite the existing thesaurus.

-file file_name

When ctxload is used to import a thesaurus, use file_name to specify the name of the import file that contains the thesaurus entries.

When ctxload is used to export a thesaurus, use file_name to specify the name of the export file created by ctxload.

Note:

If the name specified for the thesaurus dump file is identical to an existing file, then ctxload overwrites the existing file.

15.1.2.2 Optional Arguments

-thes

Import a thesaurus. Specify the source file with the -file argument. Specify the name of the thesaurus to be imported with -name.

-thescase y | n

Specify y to create a case-sensitive thesaurus with the name specified by -name and populate the thesaurus with entries from the thesaurus import file specified by -file. If -thescase is y (the thesaurus is case-sensitive), ctxload enters the terms in the thesaurus exactly as they appear in the import file.

The default for -thescase is n (case-insensitive thesaurus).

Note:

-thescase is valid for use only with the -thes argument.

-thesdump

Export a thesaurus. Specify the name of the thesaurus to be exported with the -name argument. Specify the destination file with the -file argument.

-log

Specify the name of the log file to which ctxload writes any national-language supported (Globalization Support) messages generated during processing. If you do not specify a log file name, the messages appear on the standard output.

-trace

Enables SQL statement tracing using ALTER SESSION SET SQL_TRACE TRUE. This command captures all processed SQL statements in a trace file, which can be used for debugging. The location of the trace file is operating-system dependent and can be modified using the USER_DUMP_DEST initialization parameter.

See Also:

For more information about SQL trace and the USER_DUMP_DEST initialization parameter, see Oracle Database Administrator's Guide

15.1.3 ctxload Examples

This section provides examples for some of the operations that ctxload can perform.

See Also:

For more document loading examples, see Appendix C, "Text Loading Examples for Oracle Text"

15.1.3.1 Thesaurus Import Example

The following example imports a thesaurus named tech_doc from an import file named tech_thesaurus.txt:

ctxload -user jsmith/123abc -thes -name tech_doc -file tech_thesaurus.txt

15.1.3.2 Thesaurus Export Example

The following example dumps the contents of a thesaurus named tech_doc into a file named tech_thesaurus.out:

ctxload -user jsmith/123abc -thesdump -name tech_doc -file tech_thesaurus.out

15.2 Entity Extraction User Dictionary Loader (ctxload)

Use ctxload to import an entity extraction user dictionary into Oracle Text tables.

An import file is an XML flat file containing entries for entities, with their associated types and alternate forms.

15.2.1 ctxload Syntax

ctxload -user username[/password][@sqlnet_address]
        -extract
        -name entity extraction policy name 
        -file user-dictionary file name 
        [-drop] to drop a user-dictionary from a policy

15.2.1.1 Mandatory Arguments

-user

Specify the user name and password of the user running ctxload.

-name entity extraction policy name

When you use ctxload to import an entity extraction dictionary, use object_name to specify the entity extraction policy to associate the dictionary with. An entity extraction policy can have only one user dictionary.

-file user-dictionary file name

Use file to specify the name of the XML file containing the user dictionary.

-drop

Drop the user dictionary currently associated with an entity extraction policy.

15.2.2 Considerations When Creating a User Dictionary

The following are some considerations for when creating a user dictionary:

Entity mentions are case-sensitive. They cannot contain any null characters.
Entity type names are case-insensitive. They cannot contain any null or comma characters.
Customers will be able to assign two or more entity types to a single entity mention. For example, the entity "Washington" could be assigned the type "CITY" and also the type "STATE".
The content of a user's dictionary is invisible to other users.
The maximum byte length of an entity mention is 512 bytes by the server-side database character set.
The maximum byte length of an entity type name is 30 bytes by the server-side database character set.
User-defined entity type names must start with the letter "x".

15.2.3 XML Schema

The entity extraction dictionary follows this XML schema:

<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

<xsd:element name="dictionary">
  <xsd:complexType>
    <xsd:sequence>
      <xsd:element name="entities" type="entityType" maxOccurs="unbounded"/>
    </xsd:sequence>
  <xsd:complexType>
</xsd:element>

<xsd:complexType name="entityType">
  <xsd:sequence>
    <xsd:element name="entity" type="entType" maxOccurs="unbounded"/>
  </xsd:sequence>
  </xsd:attribute name="language" type="xsd:string"/>
</xsd:complexType>

<xsd:complexType name="entType">
  <xsd:sequence>
    <xsd:element name="value" type="xsd:string"/>
    <xsd:element name="type" type="xsd:string" minOccurs="1" maxOccurs="unbounded"/>
    <xsd:element name="alternate" type="xsd:string" minOccurs="0" maxOccurs="unbounded"/>
    </xsd:sequence>
  </xsd:complexType>
</xsd:schema>

The following tables illustrate some aspects of the XML schema for the entity extraction dictionary.

Element Name	Description
dictionary	Collection of entities
entities	Collection of entities per language
entity	Each entity
value	Entity mention
type	Entity type
alternate	Alternate form of entity

Attribute Name	Element Name	Description
language	Entities	Language name of each entity in entities

15.2.4 ctxload Example

The following is an example of an entity extraction user dictionary file that can be loaded using ctxload:

<?xml version="1.0" encoding="utf-8"?>
<dictionary>
  <entities>
    <entity>
      <value>New York</value>
      <type>city</type>
    </entity>
  </entities>
  <entities language="german">
    <entity>
      <value>Deutschland</value>
      <type>country</type>
    <entity>
  </entities>
  <entities language="english">
    <entity>
      <value>Astra</value>
        <type>person</type>
        <type>organization</type>
    </entity>
    <entity>
      <value>George W. Bush<value>
        <type>person</type>
      <alternate>G. W. Bush<alternate>
      <alternate>G. Bush<alternate>
    </entity>
  </entities>
</dictionary>

15.3 Knowledge Base Extension Compiler (ctxkbtc)

The knowledge base is the information source that Oracle Text uses to perform theme analysis, such as theme indexing, processing ABOUT queries, and to document theme extraction with the CTX_DOC package. A knowledge base is supplied for English and French.

With the ctxkbtc compiler, you can:

Extend your knowledge base by compiling one or more thesauri with the Oracle Text knowledge base. The extended information can be application-specific terms and relationships. During theme analysis, the extended portion of the knowledge base overrides any terms and relationships in the knowledge base where there is overlap.
Create a new user-defined knowledge base by compiling one or more thesauri. In languages other than English and French, this feature can be used to create a language-specific knowledge base.

Note:
Only CTXSYS can extend the knowledge base.

See Also:
For more information about the knowledge base packaged with Oracle Text, see http://www.oracle.com/technology/products/text/
For more information about the ABOUT operator, see ABOUT operator in Chapter 3, "Oracle Text CONTAINS Query Operators"

For more information about document services, see Chapter 8, "CTX_DOC Package"

15.3.1 Knowledge Base Character Set

Knowledge bases can be in any single-byte character set. Supplied knowledge bases are in WE8ISO8859P1. You can store an extended knowledge base in another character set such as US7ASCII.

15.3.2 ctxkbtc Syntax

ctxkbtc -user uname/passwd

[-name thesname1 [thesname2 ... thesname16]]
[-revert]
[-stoplist stoplistname]
[-verbose]
[-log filename]

-user: Specify the user name and password for the administrator creating an extended knowledge base. This user must have write permission to the ORACLE_HOME directory.
-name thesname1 [thesname2 ... thesname16]: Specify the names of the thesauri (up to 16) to be compiled with the knowledge base to create the extended knowledge base. The thesauri you specify must already be loaded with ctxload with the "-thescase Y" option
-revert: Reverts the extended knowledge base to the default knowledge base provided by Oracle Text.
-stoplist stoplistname: Specify the name of the stoplist. Stopwords in the stoplist are added to the knowledge base as useless words that are prevented from becoming themes or contributing to themes. Add stopthemes after running this command using CTX_DLL.ADD_STOPTHEME.
-verbose: Displays all warnings and messages, including non-Globalization Support messages, to the standard output.
-log: Specify the log file for storing all messages. When you specify a log file, no messages are reported to standard out.

15.3.3 ctxkbtc Usage Notes

Before running ctxkbtc, you must set the NLS_LANG environment variable to match the database character set.
The user issuing ctxkbtc must have write permission to the ORACLE_HOME, because the program writes files to this directory.
Before being compiled, each thesaurus must be loaded into Oracle Text case sensitive with the "-thescase Y" option in ctxload.
Running ctxkbtc twice removes the previous extension.

15.3.4 ctxkbtc Limitations

The ctxkbtc program has the following limitations:

When upgrading or downgrading your database to a different release, for theme indexing and related features to work correctly, Oracle recommends that you recompile your extended knowledge base in the new environment.
Before extending the knowledge base, you must terminate all server processes that have invoked any knowledge base-related Text functions during their lifetime.
There can be only one user extension for each language for each installation. Because a user extension affects all users at the installation, only the CTXSYS user can extend the knowledge base.
In an Oracle RAC environment, the ORACLE_HOME can either be shared between multiple nodes, or each node can have its own ORACLE_HOME. The following requirements apply:
- Before using any knowledge base-dependent functionality in any of the Oracle RAC nodes, ctxkbtc must be run in every ORACLE_HOME in the Oracle RAC environment.
- When using ctxkbtc, the exact same input thesaurus content must be used in every ORACLE_HOME in the Oracle RAC environment.

15.3.5 ctxkbtc Constraints on Thesaurus Terms

Terms are case sensitive. If a thesaurus has a term in uppercase, for example, the same term present in lowercase form in a document will not be recognized.

The maximum length of a term is 80 characters.

Disambiguated homographs are not supported.

15.3.6 ctxkbtc Constraints on Thesaurus Relations

The following constraints apply to thesaurus relations:

BTG and BTP are the same as BT. NTG and NTP are the same as NT.
Only preferred terms can have a BT, NTs or RTs.
If a term has no USE relation, it will be treated as its own preferred term.
If a set of terms are related by SYN relations, only one of them may be a preferred term.
An existing category cannot be made a top term.
There can be no cycles in BT and NT relations.
A term can have at most one preferred term and at most one BT. A term may have any number of NTs.
An RT of a term cannot be an ancestor or descendent of the term. A preferred term may have any number of RTs up to a maximum of 32.
The maximum height of a tree is 16 including the top term level.
When multiple thesauri are being compiled, a top term in one thesaurus should not have a broader term in another thesaurus.

Note:
The thesaurus compiler tolerates some violations of the preceding rules. For example, if a term has multiple BTs, then the compiler ignores all but the last one it encounters.
Similarly, BTs between existing knowledge base categories result only in a warning message.

Oracle recommends that you do not set up extended storage bases with violations. Using extended storage bases containing violations can produce undesired results.

15.3.7 Extending the Knowledge Base

Extend the supplied knowledge base by compiling one or more thesauri with the Oracle Text knowledge base. The extended information can be application-specific terms and relationships. During theme analysis, the extended portion of the knowledge base overrides any terms and relationships in the knowledge base where there is overlap.

When extending the knowledge base, Oracle recommends that new terms be linked to one of the categories in the knowledge base for best results in theme proving when appropriate.

See Also:

For complete description of the supplied knowledge base, see http://www.oracle.com/technology/products/text/

If new terms are kept completely disjoint from existing categories, fewer themes from new terms will be proven. The result of this is poorer precision and recall with ABOUT queries as well poor quality of gists and theme highlighting.

Link new terms to existing terms by making an existing term the broader term for the new terms.

15.3.7.1 Example for Extending the Knowledge Base

You purchase a medical thesaurus medthes containing a hierarchy of medical terms. The four top terms in the thesaurus are the following:

Anesthesia and Analgesia
Anti-Allergic and Respiratory System Agents
Anti-Inflammatory Agents, Antirheumatic Agents, and Inflammation Mediators
Antineoplastic and Immunosuppressive Agents

To link these terms to the existing knowledge base, add the following entries to the medical thesaurus to map the new terms to the existing health and medicine branch:

health and medicine
 NT Anesthesia and Analgesia
 NT Anti-Allergic and Respiratory System Agents
 NT Anti-Inflamammatory Agents, Antirheumatic Agents, and Inflamation Mediators
 NT Antineoplastic and Immunosuppressive Agents

Set your globalization support language environment variable to match the database character set. For example, if your database character set is WE8ISO8859P1 and you are using American English, set your NLS_LANG as follows:

setenv NLS_LANG AMERICAN_AMERICA.WE8ISO8859P1

Assuming the medical thesaurus is in a file called med.thes, load the thesaurus as medthes with ctxload as follows:

ctxload -thes -thescase y -name medthes -file med.thes -user ctxsys/ctxsys

To link the loaded thesaurus medthes to the knowledge base, use ctxkbtc as follows:

ctxkbtc -user ctxsys/ctxsys -name medthes

15.3.8 Adding a Language-Specific Knowledge Base

Extend theme functionality to languages other than English or French by loading your own knowledge base for any single-byte whitespace delimited language, including Spanish.

Theme functionality includes theme indexing, ABOUT queries, theme highlighting, and the generation of themes, gists, and theme summaries with the CTX_DOC PL/SQL package.

Extend theme functionality by adding a user-defined knowledge base. For example, you can create a Spanish knowledge base from a Spanish thesaurus.

To load your language-specific knowledge base, follow these steps:

Load your custom thesaurus using ctxload.
Set NLS_LANG so that the language portion is the target language. The charset portion must be a single-byte character set.
Compile the loaded thesaurus using ctxkbtc:

ctxkbtc -user ctxsys/ctxsys -name my_lang_thes

This command compiles your language-specific knowledge base from the loaded thesaurus. To use this knowledge base for theme analysis during indexing and ABOUT queries, specify the NLS_LANG language as the THEME_LANGUAGE attribute value for the BASIC_LEXER preference.

15.3.8.1 Limitations for Adding a Knowledge Base

The following limitations hold for adding knowledge bases:

Oracle Text supplies knowledge bases in English and French only. You must provide your own thesaurus for any other language.
You can only add knowledge bases for languages with single-byte character sets. You cannot create a knowledge base for languages which can be expressed only in multibyte character sets. If the database is a multibyte universal character set, such as UTF-8, the NLS_LANG parameter must still be set to a compatible single-byte character set when compiling the thesaurus.
Adding a knowledge base works best for whitespace delimited languages.
You can have at most one knowledge base for each globalization support language.
Obtaining hierarchical query feedback information such as broader terms, narrower terms and related terms does not work in languages other than English and French. In other languages, the knowledge bases are derived entirely from your thesauri. In such cases, Oracle recommends that you obtain hierarchical information directly from your thesauri.

15.3.9 Order of Precedence for Multiple Thesauri

When multiple thesauri are to be compiled, precedence is determined by the order in which thesauri are listed in the arguments to the compiler, assumed to be most preferred first. A user-defined thesaurus always has precedence over the built-in knowledge base.

15.3.10 Size Limits for Extended Knowledge Base

The following table lists the size limits associated with creating and compiling an extended knowledge base.

Table 15-1 Size Limit for the Extended Knowledge Base

Description of Parameter	Limit
Number of RTs (from + to) for each term	32
Number of terms for each single hierarchy (for example, all narrower terms for a given top term)	64000
Number of new terms in an extended knowledge base	1 million
Number of separate thesauri that can be compiled into a user extension to the KB	16

15.4 Lexical Compiler (ctxlc)

The Lexical Compiler (ctxlc) is a command-line utility that enables you to create your own Chinese and Japanese lexicons (dictionaries). Such a lexicon may either be generated from a user-supplied word list or from the merging of a word list with the system lexicon for that language.

ctxlc creates the new lexicon in your current directory. The new lexicon consists of three files, drold.dat, drolk.dat, and droli.dat. To change your system lexicon for Japanese or Chinese, overwrite the system lexicon with these files.

The Lexical Compiler can also generate wordlists from the system lexicons for Japanese and Chinese, enabling you to see their contents. These word lists go to the standard output and thus can be redirected into a file of your choice.

After overwriting the system lexicon, you need to re-create your indexes before querying them.

15.4.1 Syntax of ctxlc

ctxlc has the following syntax:

ctxlc -ja | -zh [ -n ] -ics character_set -i input_file

ctxlc -ja | -zh -ocs character_set [ > output_file ]

15.4.1.1 Mandatory Arguments

-ja | -zh: Specify the language of the lexicon to modify or create. -ja indicates the Japanese lexicon; -zh indicates the Chinese lexicon.
-ics character_set: Specify the character set of the input file denoted by -i input_file. input_file is the list of words, one word to a line, to use in creating the new lexicon.
-i input_file: Specify the file containing words to use in creating a new lexicon.
-ocs character_set: Specify the character set of the text file to be output.

15.4.1.2 Optional Arguments

-n: Specify -n to create a new lexicon that consists only of user-supplied words taken from input_file. If -n is not specified, then the new lexicon consists of a merge of the system lexicon with input_file. Also, when -n is not selected, a text file called drolt.dat, is created in the current directory to enable you to inspect the contents of the merged lexicon without having to enter another ctxlc command.

15.4.2 Performance Considerations

You can add up to 1,000,000 new words to a lexicon. However, creating a very large lexicon can reduce performance in indexing and querying. Performance is best when the lexicon character set is UTF-8. There is no performance impact on the Chinese or Japanese V-gram lexers, as they do not use lexicons.

15.4.3 ctxlc Usage Notes

Oracle recommends the following practices with regard to ctxlc:

Save your plain text dictionary file in your environment for emergency use.
When upgrading or downgrading your database to a different release, recompile your plain text dictionary file in the new environment so that the user lexicon will work correctly.

15.4.4 Example

In this example, you create a new Japanese lexicon from the file jadict.txt, a word list that uses the JA16EUC character set. Because you are not specifying -n, the new lexicon is the result of merging jadict.txt with the system Japanese lexicon. Then replace the existing Japanese lexicon with the new, merged one.

% ctxlc -ja -ics JA16EUC -i jadict.txt

This creates new files in the current directory:

% ls
drold.dat
drolk.dat
droli.dat
drolt.dat

The system lexicon files for Japanese and Chinese are named droldxx.dat drolkxx.dat, and drolixx.dat, where xx is either JA (for Japanese) or ZH (for Chinese). Rename the three new files and copy them to the directory containing the system Japanese lexicon.

% mv drold.dat droldJA.dat
% mv drolk.dat drolkJA.dat
% mv droli.dat droliJA.dat
% cp *dat $ORACLE_HOME/ctx/data/jalx

This replaces the system Japanese lexicon with one that is a merge of the old system lexicon and your wordlist from jadict.txt.

You can also use ctxlc to get a dump of a system lexicon. This example dumps the Chinese lexicon to a file called new_chinese_dict.txt in the current directory:

% ctxlc -zh -ocs UTF8 > new_chinese_dict.txt

This creates a file, new_japanese.dict.txt, using the UTF8 character set, in the current directory.