Skip Headers
Oracle® Data Mining Application Developer's Guide,
10g Release 2 (10.2)

Part Number B14340-01
Go to Documentation Home
Home
Go to Book List
Book List
Go to Table of Contents
Contents
Go to Index
Index
Go to Master Index
Master Index
Go to Feedback page
Contact Us

Go to previous page
Previous
Go to next page
Next
View PDF

3 Managing Models

Models created by ODM APIs or by Oracle Data Miner are stored in the Database. This chapter provides information about viewing, accessing, configuring, exporting and importing models.

This chapter contains the following topics:

3.1 Models in the Database

A model is identified by its name. Like tables in the database, a model has storage associated with it. But unlike a table, the form, shape, and content of this storage is opaque to the user. A model is not a database schema object.

You can view the contents of a model — that is, the patterns and rules that constitute a mining model — using algorithm-specific GET_MODEL_DETAILS functions in the DBMS_DATA_MINING PL/SQL package. These functions are documented in Oracle Database PL/SQL Packages and Types Reference. See "Exploring Model Details" for information on model details in the Java API.

You can view a list of the models in your schema by querying the DM_USER_MODELS view. The columns of the DM_USER_MODELS view are described in Table 3-1.

Table 3-1 DM_USER_MODELS View

Column Data Type Description

name

VARCHAR2(25)

Name of the model.

function_name

VARCHAR2(30)

The model function. See Chapter 1 for an overview of mining functions.

algorithm_name

VARCHAR2(30)

The algorithm used by the model. See Chapter 1 for algorithms used by the mining functions.

ctime_creation_date

DATE

The date on which the model was created.

build_duration

NUMBER

The duration of the model build process.

target_attribute

VARCHAR2(30)

The attribute designated as the target of a classification model.

model_size

NUMBER

The size of the model in megabytes.


Note:

Metadata about models is stored in system tables whose names have the prefix DM$ or DM. You should not attempt to query or modify these system tables, and you should not use DM$ or DM_ prefixes in the names of any tables used by ODM applications.

The following query lists the demo programs in the DM_USER schema.

SQL> select 'NAME', 'FUNCTION_NAME', 'ALGORITHM_NAME' from DM_USER_MODELS;

NAME                  FUNCTION_NAME         ALGORITHM_NAME          
--------------------------------------------------------------------
ABN_SH_CLAS_SAMPLE    CLASSIFICATION        ADAPTIVE_BAYES_NETWORK  
AI_SH_SAMPLE          ATTRIBUTE_IMPORTANCE  MINIMUM_DESCRIPTION_LENGTH
AR_SH_SAMPLE          ASSOCIATION_RULES     APRIORI_ASSOCIATION_RULES 
DT_SH_CLAS_SAMPLE     CLASSIFICATION        DECISION_TREE             
KM_SH_CLUS_SAMPLE     CLUSTERING            KMEANS                    
NB_SH_CLAS_SAMPLE     CLASSIFICATION        NAIVE_BAYES               
NMF_SH_SAMPLE         FEATURE_EXTRACTION    NONNEGATIVE_MATRIX_FACTOR  
OC_SH_CLUS_SAMPLE     CLUSTERING            O_CLUSTER    
SVMC_SH_CLAS_SAMPLE   CLASSIFICATION        SUPPORT_VECTOR_MACHINES    
SVMO_SH_CLAS_SAMPLE   CLASSIFICATION        SUPPORT_VECTOR_MACHINES  
SVMR_SH_REGR_SAMPLE   REGRESSION            SUPPORT_VECTOR_MACHINES    
T_SVM_CLAS_SAMPLE     CLASSIFICATION        SUPPORT_VECTOR_MACHINES    

See Also:

Oracle Data Mining Administrator's Guide for information about installing, running, and viewing the demo programs.

3.1.1 Model Names

Although ODM models are not stored as Oracle schema objects, their names must conform to Database requirements for nonquoted identifiers. Additionally, model names must be less than 25 bytes long.

Oracle requires that nonquoted identifiers contain only alphanumeric characters, the underscore (_), dollar sign ($), and pound sign (#); the initial character must be alphabetic. Oracle strongly discourages the use of the dollar sign and pound sign in nonquoted literals.

See Also:

Oracle Database SQL Reference for information on schema object naming requirements.

3.1.2 Model Access

Oracle Data Mining does not support a general privilege model that spans multiple users. GRANT and REVOKE of read and update privileges on a mining model across user schemas are not yet supported.

You can only read and update models in your own schema. If you want to modify the settings of a model or view its details, you must be logged in with the identity of the schema that owns the model. Results of all mining operations are generated in the schema that owns the model.

Models in one schema can be exported to other schemas. You can import a model into your own schema once it has been exported to an accessible location.

3.2 Import/Export

Mining models are included when a database or schema is exported or imported with the Oracle Data Pump utility. You can export and import individual models or groups of models using the ODM SQL and Java APIs.

You can use the EXPORT_MODEL procedure in the DBMS_DATA_MINING package to export a model or a group of models to a dump file. Models can be imported from the dump file using IMPORT_MODEL.

The Java API uses the ExportTask and ImportTask standard JDM interfaces to provide the same export/import functionality.

See Also:

3.3 Model Settings

A settings table is a relational table that provides configuration information for a specific model. You must create a settings table if you want a model to have any nondefault characteristics. You will supply the name of the settings table when you create the model.

You must create the settings table in the schema of the model. You can choose the name of the settings table, but the column names and their types must be defined as shown.

Column Name Data Type
setting_name VARCHAR2(30)
setting_value VARCHAR2(128)

The values inserted into the setting_name column are one or more of several constants defined in the DBMS_DATA_MINING package. Depending on what the setting name denotes, the value for the setting_value column can be a predefined constant or the actual numerical value corresponding to the setting itself. The setting_value column is defined to be VARCHAR2. You can explicitly cast numerical inputs to string using the TO_CHAR() function, or you can rely on the implicit type conversion provided by the Database.

The settings described in Table 3-2 apply to a mining function. Use these settings to specify the algorithm that the model will use, the location of cost matrix and prior probabilities tables, and other function-specific characteristics. See Table 1-1, "Predictive Data Mining Functions" and Table 1-2, "Descriptive Data Mining Functions" for information about mining functions.

Table 3-2 Data Mining Function Settings

Algorithm Settings Setting Value (with Permissible Value Ranges)

algo_name

Classification: One of:

  • algo_naive_bayes (Default)

  • algo_support_vector_machines

    (Use this setting for both SVM and One-Class SVM

  • algo_adaptive_bayes_network

  • algo_decision_tree

Regression:

  • algo_support_vector_machines

Association Rules:

  • algo_apriori_association_rules

Clustering:

  • algo_kmeans (Default)

  • algo_o_cluster

Feature Extraction:

  • algo_nonnegative_matrix_factor

Attribute Importance:

  • algo_ai_mdl

clas_cost_table_name

The name of a relational table that specifies a cost matrix. The column requirements for this table are described in "Costs".

This input is applicable only for Decision Tree algorithms, since this is the only algorithm that supports a cost matrix at build time. The cost matrix table must be present in the current user's schema.

clas_priors_table_name

The name of a relational table that specifies prior probabilities. The column requirements for this table are described in "Priors".

This input is applicable only for classification algorithms. Decision Tree is the only classification algorithm that does not use priors. The prior probabilities table must be present in the current user's schema.

For SVM classification, this setting identifies a table of weights.

clus_num_clusters

TO_CHAR(numeric_expr >= 1)

Number of clusters generated by a clustering algorithm.

Default is 10.

feat_num_features

TO_CHAR(numeric_expr > = 1)

Number of features to be extracted.

Default value estimated from the data by the algorithm.

asso_max_rule_length

TO_CHAR(2 <= numeric_expr <= 20)

Maximum rule length for AR algorithm.

Default is 4.

asso_min_confidence

TO_CHAR(0 <= numeric_expr <= 1)

Minimum confidence value for AR algorithm

Default is 0.1.

asso_min_support

TO_CHAR(0 <= numeric_expr <= 1)

Minimum support value for AR algorithm

Default is 0.1.


Table 3-3 through Table 3-9 provide algorithm-specific settings. You can use these settings to tune the behavior of the algorithm.

Table 3-3 Algorithm Settings for Adaptive Bayes Network

Setting Name Setting Value (with Permissible Value Ranges)

abns_model_type

Model type for Adaptive Bayes Network:

  • abns_single_feature

  • abns_multi_feature (Default)

  • abns_naive_bayes)

abns_max_build_minutes

TO_CHAR(numeric_expr >= 0)

The maximum time threshold for completion of model build.

Default is 0, which implies no time limit.

abns_max_nb_predictors

TO_CHAR(numeric_expr > 0)

Maximum number of predictors, measured by their MDL ranking, to be considered for building an ABN model of type abns_naive_bayes.

Default is 10.

abns_max_predictors

TO_CHAR(numeric_expr > 0)

Maximum number of predictors, measured by their MDL ranking, to be considered for building an ABN model of type abns_single_feature or abns_multi_feature.

Default is 25.


Table 3-4 Algorithm Settings for Naive Bayes

Setting Name Setting Value (with Permissible Value Ranges)

nabs_singleton_threshold

TO_CHAR(0 <= numeric_expr <=1)

Value of singleton threshold for NB algorithm

Default value is 0.01

nabs_pairwise_threshold

TO_CHAR(0 <= numeric_expr <=1)

Value of pairwise threshold for NB algorithm

Default is 0.01.


Table 3-5 Algorithm Settings for Decision Tree

Setting Name Setting Value (with Permissible Value Ranges)

tree_impurity_metric

Tree impurity metric for Decision Tree. Tree algorithms seek the best test question for splitting data at each node. The best splitter and split value are those that result in the largest increase in target value homogeneity for the entities in the node. Homogeneity is measured in accordance with a metric. For classification (Binary or multi-class targets), the supported metrics are gini and entropy.

  • tree_impurity_entropy

  • tree_impurity_gini (Default)

tree_term_max_depth

TO_CHAR(2 <= numeric_expr <= 20)

Criteria for splits: maximum tree depth (the maximum number of nodes between the root and any leaf node, including the leaf node).

Default is 7.

tree_term_minpct_node

TO_CHAR(0 <= numeric_expr <= 10)

No child shall have fewer records than this number, which is expressed as a percentage of the training rows.

Default is 0.05, indicating 0.05%.

tree_term_minpct_split

TO_CHAR(0 <= numeric_expr <= 20)

Criteria for splits: minimum number of records in a parent node expressed as a percent of the total number of records used to train the model. No split is attempted if number of records is below this value.

Default is 0.1, indicating 0.1%.

tree_term_minrec_node

TO_CHAR(numeric_expr >= 0)

No child shall have fewer records than this number.

Default is 10.

tree_term_minrec_split

TO_CHAR(numeric_expr >= 0)

Criteria for splits: minimum number of records in a parent node expressed as a value. No split is attempted if number of records is below this value.

Default is 20.


Table 3-6 Algorithm Settings for Support Vector Machines

Setting Name Setting Value (with Permissible Value Ranges)

svms_active_learning

Whether active learning is enabled or disabled:

  • svms_al_disable

  • svms_al_enable (Default)

When active learning is enabled, the SVM algorithm uses active learning to build a reduced size model. When active learning is disabled, the SVM algorithm builds a standard model.

svms_kernel_function

Kernel for Support Vector Machine:

  • svms_linear (Default

  • svms_gaussian

svms_kernel_cache_size

TO_CHAR(numeric_expr > 0)

Value of kernel cache size for SVM algorithm. Applies to Gaussian kernel only.

Default is 50000000 bytes.

svms_conv_tolerance

TO_CHAR(numeric_expr > 0)

Convergence tolerance for SVM algorithm

Default is 0.001.

svms_std_dev

TO_CHAR(numeric_expr > 0)

Value of standard deviation for SVM algorithm

This is applicable only for Gaussian kernel

Default value estimated from the data by the algorithm

svms_complexity_factor

TO_CHAR(numeric_expr > 0)

Value of complexity factor for SVM algorithm (both classification and regression)

Default value estimated from the data by the algorithm

svms_epsilon

TO_CHAR(numeric_expr > 0)

Value of epsilon factor for SVM Regression

Default value estimated from the data by the algorithm

svms_outlier_rate

TO_CHAR(0 <numeric_expr < 1)

The desired rate of outliers in the training data. Valid for One-Class SVM models only. Cannot be used with svms_complexity_factor.

Default is 0.1.


Table 3-7 Algorithm Settings for Non-Negative Matrix Factorization

Setting Name Setting Value (with Permissible Value Ranges)

nmfs_random_seed

TO_CHAR(numeric_expr)

Random seed for NMF algorithm.

Default is –1.

nmfs_num_iterations

TO_CHAR(1 <= numeric_expr <= 500)

Number of iterations for NMF algorithm

Default is 50

nmfs_conv_tolerance

TO_CHAR(0 < numeric_expr <= 0.5)

Convergence tolerance for NMF algorithm

Default is 0.05


Table 3-8 Algorithm Settings for O-Cluster

Setting Name Setting Value (with Permissible Value Ranges)

oclt_max_buffer

TO_CHAR(numeric_expr > 0)

Buffer size for O-Cluster.

Default is 50,000.

oclt_sensitivity

TO_CHAR(0 <=numeric_expr <= 1)

A fraction that specifies the peak density required for separating a new cluster. The fraction is related to the global uniform density.

Default is 0.5.


Table 3-9 Algorithm Settings for k-Means

Setting Name Setting Value (with Permissible Value Ranges)

kmns_distance

Distance Function for k-Means Clustering:

  • kmns_euclidean (Default)

  • kmns_cosine

  • kmns_fast_cosine

kmns_iterations

TO_CHAR(0 < numeric_expr <= 20)

Number of iterations for k-Means algorithm

Default is 3

kmns_conv_tolerance

TO_CHAR(0< numeric_expr <= 0.5)

Convergence tolerance for k-Means algorithm

Default is 0.01

kmns_split_criterion

Split criterion for k-Means Clustering:

  • kmns_variance (Default)

  • kmns_size

kmns_num_bins

TO_CHAR(numeric_expr > 0)

Number of histogram bins. Specifies the number of bins in the attribute histogram produced by k-Means. The bin boundaries for each attribute are computed globally on the entire training data set. The binning method is equi-width. All attributes have the same number of bins with the exception of attributes with a single value that have only one bin.

Default is 10.

kmns_block_growth

TO_CHAR(1 < numeric_expr <= 5)

Growth factor for memory allocated to hold cluster data

Default value is 2

kmns_min_pct_attr_support

TO_CHAR(0<= numeric_expr<=1)

The fraction of attribute values that must be non-null in order for the attribute to be included in the rule description for the cluster.

Setting the parameter value too high in data with missing values can result in very short or even empty rules.

Default is 0.1.


This example creates a settings table for an SVM classification model and edits the individual values using SQL DML.

CREATE TABLE drugstore_settings (
  setting_name VARCHAR2(30),
  setting_value VARCHAR2(128))

BEGIN
-- override the default for convergence tolerance for SVM Classification
INSERT INTO drugstore_model_settings (setting_name, setting_value)
VALUES (dbms_data_mining.svms_conv_tolerance, TO_CHAR(0.081));
COMMIT;
END;

The table function GET_DEFAULT_SETTINGS provides you all the default settings for mining functions and algorithms. If you intend to override all the default settings, you can create a seed settings table and edit them using SQL DML.

BEGIN
CREATE TABLE drug_store_settings AS
SELECT setting_name, setting_value
  FROM TABLE (DBMS_DATA_MINING.GET_DEFAULT_SETTINGS
 WHERE setting_name LIKE 'SVMS_%';
-- update the values using appropriate DML
END;

You can also create a settings table based on another model's settings using GET_MODEL_SETTINGS.

BEGIN
CREATE TABLE my_new_model_settings AS
SELECT setting_name, setting_value 
  FROM TABLE (DBMS_DATA_MINING.GET_MODEL_SETTINGS('my_other_model'));
END;

3.3.1 Costs

In classification models, you can specify a cost matrix to represent the costs associated with false positive and false negative predictions. A cost matrix can be used in testing and scoring most classification models.

The Decision Tree algorithm can use a cost matrix at build time.To specify the cost matrix, you must create a cost matrix table and provide its name in the clas_cost_table_name setting for the Decision Tree model. See "Build Settings" for an example.

If you are using the Java API, instantiate a CostMatrix object and specify the name of the table as a parameter to the dmeConn.saveObject method for the object.

The cost matrix table must have these columns.

Column Name Data Type
actual_target_value VARCHAR2(4000) for categorical targets

NUMBER for numeric targets

predicted_target_value VARCHAR2(4000)

NUMBER for numeric targets

cost NUMBER

ODM enables you to evaluate the cost of predictions from classification models in an iterative manner during the experimental phase of mining, and to eventually apply the optimal cost matrix to predictions on the actual scoring data in a production environment.

The data input to each test computation (a COMPUTE procedure in PL/SQL, or a TestMetrics object in Java) is the result generated from applying the model on test data. In addition, if you also provide a cost matrix as an input, the computation generates test results taking the cost matrix into account. This enables you to experiment with various costs for a given prediction against the same APPLY results, without rebuilding the model and applying it against the same test data for every iteration.

Once you arrive at an optimal cost matrix, you can then input this cost matrix to the RANK_APPLY operation along with the results of APPLY on your scoring data. RANK_APPLY will provide your new data ranked by cost.

See Also:

Oracle Data Mining Concepts for more information on cost matrix.

3.3.2 Priors

In most classification models, you can specify prior probabilities to offset differences in distribution between the build data and the real population (scoring data). Priors can be used in building any classification model that uses a Bayesian algorithm. Priors are not used by the Decision Tree algorithm.

To specify prior probabilities, you must create a priors table and provide its name in the clas_priors_table_name setting for the model. If you are using the Java API, use a setPriorProbabilitiesMap object in the classification function settings for the model.

SVM Classification uses weights to correct for differences in target distribution. Use the priors table to specify weights for SVM Classification models.

The priors table must have these columns.

Column Name Data Type
target_value VARCHAR2 for categorical targets

NUMBER for numeric targets

prior_probability NUMBER

See Also:

Oracle Data Mining Concepts for more information on prior probabilities.