Creating a Model that Includes Text Mining

Oracle Data Mining supports unstructured text within columns of VARCHAR2, CHAR, CLOB, BLOB, and BFILE, as described in Table 7-1.

Table 7-1 Column Data Types That May Contain Unstructured Text

Data Type	Description
`BFILE` and `BLOB`	Oracle Data Mining interprets `BLOB` and `BFILE` as text only if you identify the columns as text when you create the model. If you do not identify the columns as text, then `CREATE_MODEL` returns an error.
`CLOB`	Oracle Data Mining interprets `CLOB` as text.
`CHAR`	Oracle Data Mining interprets `CHAR` as categorical by default. You can identify columns of `CHAR` as text when you create the model.
`VARCHAR2`	Oracle Data Mining interprets `VARCHAR2` with data length > 4000 as text. Oracle Data Mining interprets `VARCHAR2` with data length <= 4000 as categorical by default. You can identify these columns as text when you create the model.

Note:

Text is not supported in nested columns or as a target in supervised data mining.

The settings described in Table 7-2 control the term extraction process for text attributes in a model. Instructions for specifying model settings are in "Specifying Model Settings".

Table 7-2 Model Settings for Text

Setting Name	Data Type	Setting Value	Description
`ODMS_TEXT_POLICY_NAME`	`VARCHAR2(4000)`	Name of an Oracle Text policy object created with `CTX_DDL.CREATE_POLICY`	Affects how individual tokens are extracted from unstructured text. See "Creating a Text Policy".
`ODMS_TEXT_MAX_FEATURES`	`INTEGER`	1 <= value <= 100000	Maximum number of features to use from the document set (across all documents of each text column) passed to `CREATE_MODEL`. Default is 3000.
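
As an illustration only, the settings in Table 7-2 could be supplied through a settings table along the following lines. The table name `text_model_settings`, the policy name `my_policy`, and the value 5000 are hypothetical choices for this sketch; the two-column layout (`setting_name`, `setting_value`) follows the usual settings table format covered in "Specifying Model Settings".

```sql
-- Hypothetical settings table; the table name is an assumption for this sketch.
CREATE TABLE text_model_settings (
  setting_name  VARCHAR2(30),
  setting_value VARCHAR2(4000)
);

BEGIN
  -- Name of an existing Oracle Text policy (hypothetical name).
  INSERT INTO text_model_settings (setting_name, setting_value)
  VALUES ('ODMS_TEXT_POLICY_NAME', 'my_policy');

  -- Cap on the number of text features; must fall between 1 and 100000.
  -- The value is stored as a string because setting_value is VARCHAR2.
  INSERT INTO text_model_settings (setting_name, setting_value)
  VALUES ('ODMS_TEXT_MAX_FEATURES', '5000');

  COMMIT;
END;
/
```

If `ODMS_TEXT_MAX_FEATURES` is left out of the settings table, the default of 3000 applies.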

A model can include one or more text attributes. A model with text attributes can also include categorical and numerical attributes.

To create a model that includes text attributes:

1. Create an Oracle Text policy object, as described in "Creating a Text Policy".
2. Specify the model configuration settings that are described in Table 7-2.
3. Specify which columns should be treated as text and, optionally, provide text transformation instructions for individual attributes. See "Configuring a Text Attribute".
4. Pass the model settings and text transformation instructions to `DBMS_DATA_MINING.CREATE_MODEL`. See "Embedding Transformations in a Model".
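
The steps above can be sketched in PL/SQL roughly as follows. This is a minimal outline under stated assumptions, not a definitive recipe: the policy name `my_policy`, the build table `customer_reviews`, its columns `review_id`, `comments`, and `rating_class`, the model name `text_demo_model`, and the settings table `text_model_settings` are all hypothetical. The sketch also assumes that the settings table has already been populated as in step 2, and that the attribute specification `TEXT(TOKEN_TYPE:NORMAL)` suits the data.

```sql
-- Step 1: create an Oracle Text policy (hypothetical name, default preferences).
BEGIN
  CTX_DDL.CREATE_POLICY('my_policy');
END;
/

DECLARE
  xformlist DBMS_DATA_MINING_TRANSFORM.TRANSFORM_LIST;
BEGIN
  -- Step 3: identify the COMMENTS column as text. The expression simply
  -- passes the column through; the last argument is the attribute spec.
  DBMS_DATA_MINING_TRANSFORM.SET_TRANSFORM(
    xformlist, 'comments', NULL, 'comments', NULL, 'TEXT(TOKEN_TYPE:NORMAL)');

  -- Step 4: pass the settings table (step 2) and the transformation list
  -- to CREATE_MODEL. The target column is not text, because text is not
  -- supported as a target.
  DBMS_DATA_MINING.CREATE_MODEL(
    model_name          => 'text_demo_model',
    mining_function     => DBMS_DATA_MINING.CLASSIFICATION,
    data_table_name     => 'customer_reviews',
    case_id_column_name => 'review_id',
    target_column_name  => 'rating_class',
    settings_table_name => 'text_model_settings',
    xform_list          => xformlist);
END;
/
```

Other columns in the build table that are not identified as text would be handled as ordinary categorical or numerical attributes alongside the text attribute.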

Note:

All algorithms except O-Cluster can support columns of unstructured text.

The use of unstructured text is not recommended for association rules (Apriori).