Oracle® Data Mining Application Developer's Guide 11g Release 2 (11.2) Part Number E12218-06 |
|
|
PDF · Mobi · ePub |
This chapter explains how data is interpreted by Oracle Data Mining. It describes the requirements for a data mining case table, and it explains the notion of data attributes and model attributes. Data transformations are discussed in Oracle Data Mining Concepts.
This chapter contains the following sections:
The data that you wish to mine must be defined within a single table or view. The information for each record must be stored in a separate row. The data records are commonly called cases. Each case can be identified by a unique case ID. The table or view itself is referred to as a case table.
The CUSTOMERS
table in the SH
schema is an example of a table that could be used for mining. All the information for each customer is contained in a single row. The case ID is the CUST_ID
column. The rows listed in Example 3-1 are selected from SH.CUSTOMERS
.
SQL> select cust_id, cust_gender, cust_year_of_birth, cust_main_phone_number from sh.customers where cust_id < 11; CUST_ID CUST_GENDER CUST_YEAR_OF_BIRTH CUST_MAIN_PHONE_NUMBER ------- ----------- ---- ------------- ------------------------- 1 M 1946 127-379-8954 2 F 1957 680-327-1419 3 M 1939 115-509-3391 4 M 1934 577-104-2792 5 M 1969 563-667-7731 6 F 1925 682-732-7260 7 F 1986 648-272-6181 8 F 1964 234-693-8728 9 F 1936 697-702-2618 10 F 1947 601-207-4099
Note:
Oracle Data Mining requires single-record case data for all types of models except association models, which can be built on native transactional data. See "Market Basket Data".The columns of the case table hold the attributes that describe each case. In Example 3-1, the attributes are: CUST_GENDER
, CUST_YEAR_OF_BIRTH
, and CUST_MAIN_PHONE_NUMBER
. The attributes are the predictors in a supervised model and the descriptors in an unsupervised model. The case ID, CUST_ID
, can be viewed as a special attribute; it is not a predictor or a descriptor.
Oracle Data Mining accepts the following column data types:
VARCHAR2
, CHAR
NUMBER
, FLOAT
DM_NESTED_CATEGORICALS
DM_NESTED_NUMERICALS
See Oracle Data Mining Concepts for information about converting the data type if necessary.
The nested types are described in "Nested Data". The case ID column cannot have a nested type.
You need two case tables to build a classification or regression model. One set of rows is used for building (training) the model, another set of rows is used for testing the model. It is often convenient to derive the build data and test data from the same data set. For example, you might select 60% of the rows for building the model and 40% for testing the model.
Models that implement other mining functions, such as attribute importance, clustering, association, or feature extraction, do not use separate test data.
Most data mining models can be applied to separate data. The data to which you apply the model is called the apply data or scoring data. Oracle Data Mining supports the scoring operation for classification, regression, anomaly detection, clustering, and feature extraction.
The scoring process matches column names in the scoring data with the names of the columns that were used to build the model. The scoring process does not require all the columns to be present in the scoring data. If the data types do not match, Oracle Data Mining attempts to perform type coercion. For example, if a column called PRODUCT_RATING
is VARCHAR2
in the build data but NUMBER
in the scoring data, Oracle Data Mining will effectively apply a TO_CHAR()
function to convert it.
The column in the test or scoring data must undergo the same transformations as the corresponding column in the build data. For example, if the AGE
column in the build data was transformed from numbers to the values CHILD
, ADULT
, and SENIOR
, then the AGE
column in the scoring data must undergo the same transformation so that the model can properly evaluate it.
Note:
Oracle Data Mining can embed user-specified transformation instructions in the model and reapply them whenever the model is applied. When the transformation instructions are embedded in the model, you do not need to specify them for the test or scoring data sets.Oracle Data Mining also supports Automatic Data Preparation (ADP). When ADP is enabled, the transformations required by the algorithm are performed automatically and embedded in the model along with any user-specified transformations. Mining models that contain embedded transformations are known as supermodels.
Automatic and embedded data transformations are discussed in Oracle Data Mining Concepts.
Attributes are the items of data used in data mining. In predictive models, attributes are the predictors that affect a given outcome. In descriptive models, attributes are the items of information being analyzed for natural groupings or associations. A table of employee data might contain attributes such as job title, date of hire, salary, age, gender, and so on.
Data attributes are columns in the data sets used to build, test, or score a model. Model attributes are the data representations used internally by the model.
Data attributes and model attributes can be the same. For example a column called SIZE
, with values S
, M
, and L
, might be an attribute used by an algorithm to build a model. Internally, the model attribute SIZE
would most likely be the same as the data attribute from which it was derived.
On the other hand, a nested column SALES_PROD
, containing the sales figures for a group of products, would not correspond to a model attribute. The data attribute would be SALES_PROD
, but each product with its corresponding sales figure (each row in the nested column) would be a model attribute.
Transformations also cause a discrepancy between data attributes and model attributes. For example, a transformation could apply a calculation to two data attributes and store the result in a new attribute. The new attribute would be a model attribute that has no corresponding data attribute. Other transformations such as binning, normalization, and outlier treatment, cause the model's representation of an attribute to be different from the data attribute in the case table.
See Also:
Oracle Data Mining Concepts for information on transformations
The target of a supervised model is a special kind of attribute. The target column in the build data contains the historical values used to build (train) the model. The target column in the test data contains the historical values to which the predictions are compared. The target column in the scoring data holds the results when the model is applied.
Clustering, feature extraction, association, and anomaly detection models do not use a target.
You can query the *_MINING_MODEL_ATTRIBUTES
view to find the target for a given model, as shown in Example 3-2.
Model attributes are either numerical or categorical. Data attributes, which are columns in a case table, have Oracle data types.
Oracle Data Mining interprets NUMBER
, FLOAT
, and DM_NESTED_NUMERICALS
as numerical, and CHAR
, VARCHAR2
, and DM_NESTED_CATEGORICALS
as categorical. There is one exception: If the target of a classification model is NUMBER
or FLOAT
, it will be interpreted as categorical.
Numerical attributes can theoretically have an infinite number of values. The values have an implicit order, and the differences between them are also ordered.
Categorical attributes have values that belong to a finite number of discrete categories or classes. There is no implicit order associated with the values. Some categoricals are binary: They have only two possible values, such as yes
or no
, or male
or female
. The term multi-class is used to describe models when the categorical target has more than two values. For example, a target of clothing sizes could have the values small
, medium
, or large
.
The target of a classification model is categorical. The target of a regression model is numerical. The target of an attribute importance model is either categorical or numerical.
The model signature is the set of data attributes used to build a model. Some or all of the attributes in the signature should be present for scoring. If some columns are not present, they are disregarded. If columns with the same names but different data types are present, the model attempts to convert the data type.
The model signature does not necessarily include all the columns in the build data. Algorithm-specific criteria may cause the model to ignore certain columns. Other columns may be eliminated by transformations. Only the data attributes actually used to build the model are included in the signature.
The target and case ID columns are not included in the signature.
The columns in the model signature, plus the target (if the model is supervised), are listed in the data dictionary view, ALL/USER/DBA_MINING_MODEL_ATTRIBUTES
. When used with the ALL
prefix, this view returns the signature and target for all mining models accessible to the current user. When used with the USER
prefix, it returns the model signature and target for all the mining models in the user's schema. The DBA
prefix is only available to DBAs.
The columns in the ALL_MINING_MODEL_ATTRIBUTES
view are described as follows. Details are in Table 3-1.
SQL> describe all_mining_model_attributes Name Null? Type ---------------------------------------------------------- OWNER NOT NULL VARCHAR2(30) MODEL_NAME NOT NULL VARCHAR2(30) ATTRIBUTE_NAME NOT NULL VARCHAR2(30) ATTRIBUTE_TYPE VARCHAR2(11) DATA_TYPE VARCHAR2(12) DATA_LENGTH NUMBER DATA_PRECISION NUMBER DATA_SCALE NUMBER USAGE_TYPE VARCHAR2(8) TARGET VARCHAR2(3)
Table 3-1 ALL_MINING_MODEL_ATTRIBUTES
Column | Description |
---|---|
|
Owner of the mining model. |
|
Name of the mining model. |
|
Name of the data attribute (column). |
|
Type of the model attribute derived by the model from the data attribute. The attribute type can be either numerical or categorical. This information is only meaningful if there is a one-to-one mapping between the data attribute and the model attribute. If the data attribute has undergone transformations that change the way it is used by the model, then the |
|
The Oracle data type of the data attribute (column): NUMBER FLOAT CHAR VARCHAR2 NESTED TABLE If the value is DM_NESTED_NUMERICALS or DM_NESTED_CATEGORICALS If the data type is See "Nested Data" for details. |
|
Length of the data type |
|
Precision of a fixed point number, which is the total number of significant decimal digits, is represented as p in the data type |
|
Scale of a fixed point number. Scale, which is the number of digits from the decimal to the least significant digit, is represented as s in the data type |
|
Indicates that the attribute was used to construct the model. Some attributes may be eliminated by transformations or algorithmic processing. The * |
|
Whether or not the attribute is a target. The value can be either If the attribute is a target, and it has undergone transformations for manipulation by the algorithm, the description in |
The query in Example 3-2 returns information about the data attributes of the model T_SVM_CLAS_SAMPLE
, an SVM model generated by one of the Data Mining sample programs. The query returns the name and data type of each of the data attributes in the model signature, whether the attribute is used as a numerical or as a categorical, and whether or not the attribute is a target.
Example 3-2 ALL_MINING_MODEL_ATTRIBUTES
SQL> select model_name, attribute_name, attribute_type, data_type, target from user_mining_model_attributes where model_name = 'T_SVM_CLAS_SAMPLE'; MODEL_NAME ATTRIBUTE_NAME ATTRIBUTE_TYPE DATA_TYPE TARGET ------------------- --------------------- --------------- ----------- ------ T_SVM_CLAS_SAMPLE COMMENTS NUMERICAL NESTED TABLE NO T_SVM_CLAS_SAMPLE AGE NUMERICAL NUMBER NO T_SVM_CLAS_SAMPLE CUST_MARITAL_STATUS CATEGORICAL VARCHAR2 NO T_SVM_CLAS_SAMPLE COUNTRY_NAME CATEGORICAL VARCHAR2 NO T_SVM_CLAS_SAMPLE CUST_INCOME_LEVEL CATEGORICAL VARCHAR2 NO T_SVM_CLAS_SAMPLE EDUCATION CATEGORICAL VARCHAR2 NO T_SVM_CLAS_SAMPLE OCCUPATION CATEGORICAL VARCHAR2 NO T_SVM_CLAS_SAMPLE HOUSEHOLD_SIZE CATEGORICAL VARCHAR2 NO T_SVM_CLAS_SAMPLE YRS_RESIDENCE NUMERICAL NUMBER NO T_SVM_CLAS_SAMPLE BULK_PACK_DISKETTES NUMERICAL NUMBER NO T_SVM_CLAS_SAMPLE FLAT_PANEL_MONITOR NUMERICAL NUMBER NO T_SVM_CLAS_SAMPLE HOME_THEATER_PACKAGE NUMERICAL NUMBER NO T_SVM_CLAS_SAMPLE BOOKKEEPING_APPLICATION NUMERICAL NUMBER NO T_SVM_CLAS_SAMPLE PRINTER_SUPPLIES NUMERICAL NUMBER NO T_SVM_CLAS_SAMPLE Y_BOX_GAMES NUMERICAL NUMBER NO T_SVM_CLAS_SAMPLE OS_DOC_SET_KANJI NUMERICAL NUMBER NO T_SVM_CLAS_SAMPLE CUST_GENDER CATEGORICAL CHAR NO T_SVM_CLAS_SAMPLE AFFINITY_CARD NUMERICAL NUMBER YES
The model attribute name consists of two parts: a column name, and a subcolumn name.
column_name[.subcolumn_name]
The column_name
component is the name of the data attribute. It is present in all model attribute names. Nested attributes also have a subcolumn_name
component as shown in Example 3-3.
Example 3-3 Model Attributes Derived from a Nested Column
The nested column SALESPROD
has three rows.
SALESPROD(ATTRIBUTE_NAME, VALUE) -------------------------------- ((PROD1, 300), (PROD2, 245), (PROD3, 679))
The name of the data attribute is SALESPROD
. Its associated model attributes are:
SALESPROD.PROD1 SALESPROD.PROD2 SALESPROD.PROD3
Model details reveal information about model attributes and their treatment by the algorithm. There is a separate GET_MODEL_DETAILS
routine for each algorithm.
Transformation and reverse transformation expressions are associated with model attributes. The transformations are applied to the model for algorithmic processing. The reverse transformations are applied for model details. The information returned to the user by GET_MODEL_DETAILS
is expressed in the form of the original data attributes, or as close to it as possible.
Reverse transformations are also applied to the target when a supervised model is scored. Reverse transformations support model transparency. Transparency is discussed in Oracle Data Mining Concepts.
Example 3-4 shows the definition of the GET_MODEL_DETAILS
function for an Attribute Importance model. The PIPELINED
keyword instructs Oracle Database to return the rows as single values instead of returning all the rows as a single value.
Example 3-4 Model Details for an Attribute Importance Model
The syntax of the GET_MODEL_DETAILS
function for Attribute Importance models is shown as follows.
DBMS_DATA_MINING.GET_MODEL_DETAILS_AI ( model_name VARCHAR2) RETURN DM_RANKED_ATTRIBUTES PIPELINED;
The function returns DM_RANKED_ATTRIBUTES
, a virtual table. The columns are the model details. There is one row for each model attribute in the specified model. The columns are described as follows.
(attribute_name VARCHAR2(4000), attribute_subname VARCHAR2(4000), importance_value NUMBER, rank NUMBER(38))
Oracle Data Mining requires a case table in single-record case format, with each record in a separate row. What if some or all of your data is in multi-record case format, with each record in several rows? What if you want one attribute to represent a series or collection of values, such as a student's test scores or the products purchased by a customer?
This kind of one-to-many relationship is usually implemented as a join between tables. For example, you might join your customer table to a sales table and thus associate a list of products purchased with each customer.
Oracle Data Mining supports dimensioned data through nested columns. To include dimensioned data in your case table, create a view and cast the joined data to one of the Data Mining nested table types. Each row in the nested column consists of an attribute name/value pair. Oracle Data Mining internally processes each nested row as a separate attribute.
See Also:
Sample code for converting to a nested table in "Example: Creating a Nested Column for Mining".The algorithms that support nested data are listed in Table 3-2
Table 3-2 Oracle Data Mining Algorithms that Support Nested Data
Algorithm | Mining Function |
---|---|
Apriori |
association rules |
GLM |
classification and regression |
k-Means |
clustering |
MDL |
attribute importance |
Naive Bayes |
classification |
NMF |
feature extraction |
SVM |
classification, regression, and anomaly detection |
Oracle Database supports user-defined data types that make it possible to model real-world entities as objects in the database. Collection types are object data types for modeling multi-valued attributes. Nested tables are collection types. Nested tables can be used anywhere that other data types can be used. You can learn more about collection types in Oracle Database Object-Relational Developer's Guide.
Oracle Data Mining supports two nested object types: one for numerical attributes, the other for categorical attributes.
The DM_NESTED_NUMERICALS
object type is a nested table of numerical attributes. Each row is a single DM_NESTED_NUMERICAL
.
The nested numerical attributes (rows) are described as follows.
SQL> describe dm_nested_numerical
Name Null? Type
----------------------------------------- -------- ----------------------------
ATTRIBUTE_NAME VARCHAR2(4000)
VALUE NUMBER
The collection of numerical attributes (table) is described as follows.
SQL> describe dm_nested_numericals
DM_NESTED_NUMERICALS TABLE OF SYS.DM_NESTED_NUMERICAL
Name Null? Type
----------------------------------------- -------- ----------------------------
ATTRIBUTE_NAME VARCHAR2(4000)
VALUE NUMBER
The DM_NESTED_CATEGORICALS
object type is a nested table of categorical attributes. Each row is a single DM_NESTED_CATEGORICAL
.
The nested categorical attributes (rows) are described as follows.
SQL> describe dm_nested_categorical
Name Null? Type
----------------------------------------- -------- ----------------------------
ATTRIBUTE_NAME VARCHAR2(4000)
VALUE VARCHAR2(4000)
The collection of categorical attributes (table) is described as follows.
SQL> describe dm_nested_categoricals
DM_NESTED_CATEGORICALS TABLE OF SYS.DM_NESTED_CATEGORICAL
Name Null? Type
----------------------------------------- -------- ----------------------------
ATTRIBUTE_NAME VARCHAR2(4000)
VALUE VARCHAR2(4000)
Example 3-5 shows data from a view of a sales table. It includes sales for three of the many products sold in four regions. This data is not suitable for mining at the product level because sales for each case (product), is stored in several rows.
Example 3-5 Product Sales per Region in Multi-Record Case Format
PRODUCT REGION SALES ------- -------- ---------- Prod1 NE 556432 Prod2 NE 670155 Prod3 NE 3111 . . Prod1 NW 90887 Prod2 NW 100999 Prod3 NW 750437 . . Prod1 SE 82153 Prod2 SE 57322 Prod3 SE 28938 . . Prod1 SW 3297551 Prod2 SW 4972019 Prod3 SW 884923 . .
Example 3-6 shows how this data could be transformed for mining. The case ID column would be PRODUCT
. SALES_PER_REGION
, a nested column of type DM_NESTED_NUMERICALS
, would be a data attribute. This table is suitable for mining, because the information for each case is stored in a single row.
Example 3-6 Product Sales per Region in Single-Record Case Format
PRODUCT SALES_PER_REGION (ATTRIBUTE_NAME, VALUE) ------ -------------------------- Prod1 ('NE' , 556432) ('NW' , 90887) ('SE' , 82153) ('SW' , 3297551) Prod2 ('NE' , 670155) ('NW' , 100999) ('SE' , 57322) ('SW' , 4972019) Prod3 ('NE' , 3111) ('NW' , 750437) ('SE' , 28938) ('SW' , 884923) . .
Oracle Data Mining treats each nested row as a separate model attribute, as shown in Example 3-7. (Note that the presentation in this example is conceptual only. The data is not actually pivoted before being processed.)
Example 3-7 Model Attributes Derived From SALES_PER_REGION
PRODUCT SALES_PER_REGION.NE SALES_PER_REGION.NW SALES_PER_REGION.SE SALES_PER_REGION.SW ------- ------------------ ------------------- ------------------ ------------------- Prod1 556432 90887 82153 3297551 Prod2 670155 100999 57322 4972019 Prod3 3111 750437 28938 884923 . .
Example 3-8 shows how to define a nested column for data mining. This example uses transactional market basket data.
Example 3-8 Convert to a Nested Column
The view SALES_TRANS_CUST
provides a list of transaction IDs to identify each market basket and a list of the products in each basket.
SQL> describe sales_trans_cust Name Null? Type ----------------------------------------------------- -------- ---------------- TRANS_ID NOT NULL NUMBER PROD_NAME NOT NULL VARCHAR2(50) QUANTITY NUMBER
The following SQL statement transforms this data to a column of type DM_NESTED_NUMERICALS
in a view called SALES_TRANS_CUST_NESTED
. This view can be used as a case table for mining.
SQL> CREATE VIEW sales_trans_cust_nested AS SELECT trans_id, CAST(COLLECT(DM_NESTED_NUMERICAL( prod_name, quantity)) AS DM_NESTED_NUMERICALS) custprods FROM sales_trans_cust GROUP BY trans_id;
This query returns two rows from the transformed data.
SQL> select * from sales_trans_cust_nested where trans_id < 101000 and trans_id > 100997; TRANS_ID CUSTPRODS(ATTRIBUTE_NAME, VALUE) ------- ------------------------------------------------ 100998 DM_NESTED_NUMERICALS (DM_NESTED_NUMERICAL('O/S Documentation Set - English', 1) 100999 DM_NESTED_NUMERICALS (DM_NESTED_NUMERICAL('CD-RW, High Speed Pack of 5', 2), DM_NESTED_NUMERICAL('External 8X CD-ROM', 1), DM_NESTED_NUMERICAL('SIMM- 16MB PCMCIAII card', 1))
Market basket data identifies the items sold in a set of baskets or transactions. Oracle Data Mining provides the association mining function for market basket analysis.
Association models use the Apriori algorithm to generate association rules that describe how items tend to be purchased in groups. For example, an association rule might assert with 65% confidence that 80% of the people who buy peanut butter also buy jelly.
Market basket data is usually transactional. In transactional data, a case is a transaction and the data for a transaction is stored in multiple rows. Oracle Data Mining association models can be built on transactional data or on single-record case data. The ODMS_ITEM_ID_COLUMN_NAME
and ODMS_ITEM_VALUE_COLUMN_NAME
settings specify whether or not the data for association rules is in transactional format.
Note:
Association models are the only type of model that can be built on native transactional data. For all other types of models, Oracle Data Mining requires that the data be presented in single-record case formatThe Apriori algorithm assumes that the data is transactional and that it has many missing values. Apriori interprets all missing values as sparse data, and it has its own native mechanisms for handling sparse data.
See Also:
See Oracle Database PL/SQL Packages and Types Reference for information on the ODMS_ITEM_ID_COLUMN_NAME
and ODMS_ITEM_VALUE_COLUMN_NAME
settings
Oracle Data Mining distinguishes between sparse data and data that contains random missing values. The latter means that some attribute values are unknown. Sparse data, on the other hand, contains values that are assumed to be known, although they are not represented in the data.
A typical example of sparse data is market basket data. Out of hundreds or thousands of available items, only a few are present in an individual case (the basket or transaction). All the item values are known, but they are not all included in the basket. Present values may have a quantity, while the items that are not represented are sparse (with a known quantity of zero).
Oracle Data Mining interprets missing data as follows:
Missing — Missing values in columns with a simple data type (not nested) are assumed to be missing at random.
Sparse — Missing values in nested columns indicate sparsity.
The examples in this section illustrate how Oracle Data Mining identifies data as either sparse or missing at random.
A sales table contains point-of-sale data for a group of products, sold in several stores to different customers over a period of time. A particular customer will only have bought some of the products. Those products that a customer did not buy will not appear as rows in the sales table.
If you were to figure out the amount of money a customer has spent for each product, the unpurchased products would have an inferred amount of zero. The value is not random or unknown; it is zero, even though no row appears in the table.
Note that the sales data is dimensioned (by product, stores, customers, and time) and would be represented as nested data for mining.
Since missing values in a nested column will always indicate sparsity, you should make sure that this interpretation is appropriate for the data that you wish to mine. For example, when trying to mine a multi-record case data set containing users' movie ratings of a large movie database, the missing ratings would be unknown (missing at random), but Oracle Data Mining would treat the data as sparse and infer a rating of zero for the missing value.
A table of customer data contains demographic data about customers. The case ID column is the customer ID. The attributes are age, education, profession, gender, house-hold size, and so on. Not all the data may be available for each customer. Any missing values are considered to be missing at random. For example, if the age of customer 1 and the profession of customer 2 are not present in the data, that information is simply unknown. It does not indicate sparsity.
Note that the customer data is not dimensioned. There is a one-to-one mapping between the case and each of its attributes. None of the attributes are nested.
Missing value treatment depends on the algorithm and on the nature of the data (categorical or numerical, sparse or missing at random). Missing value treatment is summarized in Table 3-3.
Note:
Oracle Data Mining performs the same missing value treatment whether or not Automatic Data Preparation is being used.Table 3-3 Missing Value Treatment by Algorithm
If you want Oracle Data Mining to treat missing data as sparse instead of missing at random or missing at random instead of sparse, transform it before building the model.
If you want missing values to be treated as sparse, but Oracle Data Mining would interpret them as missing at random, you can use a SQL function like NVL to replace the nulls with a value such as "NA". Oracle Data Mining will not perform missing value treatment if there is a specified value. See Oracle Database SQL Language Reference
If you want missing nested attributes to be treated as missing at random, you can transform the nested rows into physical attributes in separate columns — as long as the case table stays within the 1000 column limitation imposed by the Database. Fill in all of the possible attribute names, and specify them as null.