This chapter describes feature selection, attribute importance, and feature extraction. Oracle Data Mining supports attribute importance as a supervised mining function and feature extraction as an unsupervised mining function. For an overview of supervised and unsupervised data mining, see Chapter 3.
This chapter contains these topics:
Sometimes too much information can reduce the effectiveness of data mining. Some of the columns of data attributes assembled for building and testing a model may not contribute meaningful information to the model. Some may actually detract from the quality and accuracy of the model.
For example, you might collect a great deal of data about a given population because you want to predict the likelihood of a certain illness within this group. Some of this information, perhaps much of it, will have little or no effect on susceptibility to the illness. Attributes such as the number of cars per household may have no effect whatsoever.
Irrelevant attributes add noise to the data and affect model accuracy. Noise increases the size of the model and the time and system resources needed for model building and scoring.
Data sets with many attributes may contain groups of attributes that are correlated. These attributes may actually be measuring the same underlying feature. Their presence together in the build data can skew the logic of the algorithm and affect the accuracy of the model.
Wide data (many attributes) generally presents processing challenges for data mining algorithms. Model attributes are the dimensions of the processing space used by the algorithm. The higher the dimensionality of the processing space, the higher the computation cost involved in algorithmic processing.
To minimize the effects of noise, correlation, and high dimensionality, some form of dimension reduction is sometimes a desirable preprocessing step for data mining. Feature selection and extraction are two approaches to dimension reduction.
Feature selection — Selecting the most relevant attributes
Feature extraction — Combining attributes into a new reduced set of features
Finding the most significant predictors is the goal of some data mining projects. For example, a model might seek to find the principal characteristics of clients who pose a high credit risk.
Oracle Data Mining supports the attribute importance mining function, which ranks attributes according to their importance in predicting a target. Attribute importance does not actually perform feature selection since all the predictors are retained in the model. In true feature selection, the attributes that are ranked below a given threshold of importance are removed from the model.
Feature selection is useful as a preprocessing step to improve computational efficiency in predictive modeling. Oracle Data Mining implements feature selection for optimization within the Decision Tree algorithm and within Naive Bayes when ADP is enabled. GLM models can be configured to perform feature selection as a preprocessing step.
Oracle Data Mining does not support the scoring operation for attribute importance. The results of attribute importance are the attributes of the build data ranked according to their predictive influence. The ranking and the measure of importance can be used in selecting training data for classification models.
Feature extraction is an attribute reduction process. Unlike feature selection, which selects and retains the most significant attributes, feature extraction actually transforms the attributes. The transformed attributes, or features, are linear combinations of the original attributes.
The feature extraction process results in a much smaller and richer set of attributes. The maximum number of features may be user-specified or determined by the algorithm. By default, it is determined by the algorithm.
Models built on extracted features may be of higher quality, because the data is described by fewer, more meaningful attributes.
Feature extraction projects a data set with higher dimensionality onto a smaller number of dimensions. As such it is useful for data visualization, since a complex data set can be effectively visualized when it is reduced to two or three dimensions.
Some applications of feature extraction are latent semantic analysis, data compression, data decomposition and projection, and pattern recognition. Feature extraction can also be used to enhance the speed and effectiveness of supervised learning.
Feature extraction can be used to extract the themes of a document collection, where documents are represented by a set of key words and their frequencies. Each theme (feature) is represented by a combination of keywords. The documents in the collection can then be expressed in terms of the discovered themes.
Oracle Data Mining supports the Minimum Description Length algorithm for attribute importance. See Chapter 15, "Minimum Description Length".
Oracle Data Mining supports these feature extraction algorithms:
Non-Negative Matrix Factorization (NMF). See Chapter 17, "Non-Negative Matrix Factorization".
Singular Value Decomposition (SVD) and Prediction Component Analysis (PCA). See Chapter 19, "Singular Value Decomposition".