Oracle® Data Mining Concepts 11g Release 2 (11.2) Part Number E16808-04
This chapter describes association, the unsupervised mining function for discovering association rules.
See Also:
"Unsupervised Data Mining"
Association is a data mining function that discovers the probability of the co-occurrence of items in a collection. The relationships between co-occurring items are expressed as association rules.
Oracle Data Mining does not support the scoring operation for association modeling. The results of an association model are the rules that identify patterns of association within the data. Association rules can be ranked by support (How often do these items occur together in the data?) and confidence (How likely are these items to occur together in the data?).
Association rules are often used to analyze sales transactions. For example, it might be noted that customers who buy cereal at the grocery store often buy milk at the same time. In fact, association analysis might find that 85% of the checkout sessions that include cereal also include milk. This relationship could be formulated as the following rule.
Cereal implies milk with 85% confidence
This application of association modeling is called market-basket analysis. It is valuable for direct marketing, sales promotions, and for discovering business trends. Market-basket analysis can also be used effectively for store layout, catalog design, and cross-sell.
Association modeling has important applications in other domains as well. For example, in e-commerce applications, association rules may be used for Web page personalization. An association model might find that a user who visits pages A and B is 70% likely to also visit page C in the same session. Based on this rule, a dynamic link could be created for users who are likely to be interested in page C. The association rule could be expressed as follows.
A and B imply C with 70% confidence
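As a rough illustration of how such a rule is scored, the sketch below computes support and confidence using the standard definitions: support is the fraction of transactions that contain all of the items, and confidence is the support of the whole rule divided by the support of its antecedent. The session data and the function names `support` and `confidence` are hypothetical and are not part of Oracle Data Mining.

```python
# Hypothetical Web sessions; each set holds the pages visited in one session.
sessions = [
    {"A", "B", "C"},
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "C"},
    {"B", "D"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Support of the full rule divided by support of the antecedent."""
    both = support(set(antecedent) | set(consequent), transactions)
    return both / support(antecedent, transactions)


# Score the rule "A and B imply C" on the hypothetical sessions.
rule_support = support({"A", "B", "C"}, sessions)
rule_confidence = confidence({"A", "B"}, {"C"}, sessions)
print(f"support={rule_support:.2f} confidence={rule_confidence:.2f}")
```

In this toy data, two of the three sessions that include both A and B also include C, so the confidence is about 67%, somewhat lower than the 70% quoted in the example above.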
See Also:
"Confidence"
Unlike other data mining functions, association is transaction-based. In transaction processing, a case consists of a transaction such as a market basket or Web session. The collection of items in the transaction is an attribute of the transaction. Other attributes might be the date, time, location, or user ID associated with the transaction.
The collection of items in the transaction is a multi-record attribute. Transactional data is said to be in multi-record case format. An example is shown in Figure 8-1.
Association models can be built using either transactional or nontransactional (single-record case) data. For all other types of models, Oracle Data Mining requires nontransactional data. To build any model other than an association model on transactional data, the data must first be transformed to single-record case format.
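The difference between the two formats can be sketched in a few lines. The example below pivots multi-record rows (one row per transaction ID and item) into single-record cases with one 0/1 column per item. It illustrates the idea only and is not the transformation that Oracle Data Mining performs. The item letters are an assumption chosen to match the itemsets in Table 8-1, and `to_single_record` is a hypothetical helper.

```python
# Multi-record (transactional) rows: one row per (transaction ID, item).
# Assumed data, chosen to be consistent with the itemsets in Table 8-1.
rows = [
    (11, "B"), (11, "D"), (11, "E"),
    (12, "A"), (12, "B"), (12, "C"), (12, "E"),
    (13, "B"), (13, "C"), (13, "D"), (13, "E"),
]


def to_single_record(rows):
    """Pivot (transaction ID, item) rows into one 0/1 record per transaction."""
    items = sorted({item for _, item in rows})
    cases = {}
    for tid, item in rows:
        # Create the case with every item absent (0), then mark this item present.
        cases.setdefault(tid, dict.fromkeys(items, 0))[item] = 1
    return items, cases


items, cases = to_single_record(rows)
for tid, record in cases.items():
    print(tid, record)
```

Each resulting case has the same fixed set of columns, which is what the other mining functions expect. The multi-record form stores only the items that are actually present.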
See Also:
Figure 4-3, "Sample Build Data for Regression" and Figure 7-2, "Build Data for Clustering" for examples of single-record case format
Oracle Data Mining Application Developer's Guide for information about transforming transactional data to single-record case
In transactional data, a collection of items is associated with each case. The collection could theoretically include all possible members of the collection. For example, all products could theoretically be purchased in a single market-basket transaction. In practice, however, only a tiny subset of all possible items is present in a given transaction; the items in the market basket represent only a small fraction of the items available for sale in the store.
When an item is not present in a collection, it may have a null value or it may simply be missing. Many of the items may be missing or null, since many of the items that could be in the collection are probably not present in any individual transaction.
Missing rows in a collection indicate sparsity. This means that a high proportion of the nested rows are not populated. The Oracle Data Mining association algorithm is optimized for processing sparse data.
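To make the notion of sparsity concrete, the following sketch measures what proportion of the possible (transaction, item) slots are populated. The ten-item catalog and the helper name `density` are assumptions for illustration. A real store catalog would be far larger, and the data correspondingly sparser.

```python
# Hypothetical catalog of ten items; only a few appear in any one transaction.
catalog = set("ABCDEFGHIJ")

# Assumed transactions, keyed by transaction ID.
transactions = {
    11: {"B", "D", "E"},
    12: {"A", "B", "C", "E"},
    13: {"B", "C", "D", "E"},
}


def density(transactions, catalog):
    """Proportion of (transaction, item) slots that are actually populated."""
    populated = sum(len(items & catalog) for items in transactions.values())
    return populated / (len(transactions) * len(catalog))


d = density(transactions, catalog)
print(f"density={d:.2f} sparsity={1 - d:.2f}")
```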
See Also:
Oracle Data Mining Application Developer's Guide for information about Oracle Data Mining and sparse data
The first step in association analysis is the enumeration of itemsets. An itemset is any combination of two or more items in a transaction.
The maximum number of items in an itemset is user-specified. If the maximum is two, all the item pairs will be counted. If the maximum is greater than two, all the item pairs, all the item triples, and all the item combinations up to the specified maximum will be counted.
The maximum number of items in an itemset is specified by the ASSO_MAX_RULE_LENGTH setting, which also applies to the rules derived from the itemsets.
See Also:
"Association Rules" to learn about the relationship between itemsets and rules
Oracle Database PL/SQL Packages and Types Reference for descriptions of the build settings for association rules
Table 8-1 shows the itemsets derived from the transactions in Figure 8-1, assuming that ASSO_MAX_RULE_LENGTH is set to 3.
Transaction | Itemsets |
---|---|
11 | (B,D) (B,E) (D,E) (B,D,E) |
12 | (A,B) (A,C) (A,E) (B,C) (B,E) (C,E) (A,B,C) (A,B,E) (A,C,E) (B,C,E) |
13 | (B,C) (B,D) (B,E) (C,D) (C,E) (D,E) (B,C,D) (B,C,E) (B,D,E) (C,D,E) |
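The enumeration in Table 8-1 can be reproduced with a short sketch, assuming each transaction contains the items implied by its itemsets (for example, transaction 11 contains B, D, and E). The `max_length` parameter plays the role of ASSO_MAX_RULE_LENGTH here. The function `enumerate_itemsets` is a hypothetical helper, not an Oracle API.

```python
from itertools import combinations

# Items per transaction, inferred from the itemsets listed in Table 8-1.
transactions = {
    11: {"B", "D", "E"},
    12: {"A", "B", "C", "E"},
    13: {"B", "C", "D", "E"},
}


def enumerate_itemsets(items, max_length):
    """List every combination of 2..max_length items from one transaction."""
    ordered = sorted(items)
    result = []
    for size in range(2, max_length + 1):
        result.extend(combinations(ordered, size))
    return result


itemsets = {tid: enumerate_itemsets(items, 3) for tid, items in transactions.items()}
for tid, sets_for_tid in itemsets.items():
    print(tid, sets_for_tid)
```

The number of combinations grows quickly with the maximum length, which is why lowering it shortens the build.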
Tip:
Decrease the maximum rule length if you want to decrease the build time for the model and generate simpler rules.
Association rules are calculated from itemsets. If rules are generated from all possible itemsets, there may be a very high number of rules and the rules may not be very meaningful. Also, the model may take a long time to build. Typically it is desirable to generate rules only from itemsets that are well-represented in the data. Frequent itemsets are those that occur with a minimum frequency specified by the user.
The minimum frequent itemset support is a user-specified percentage that limits the number of itemsets used for association rules. An itemset must appear in at least this percentage of all the transactions if it is to be used as a basis for rules.
The ASSO_MIN_SUPPORT setting specifies the minimum frequent itemset support. It also applies to the rules derived from the frequent itemsets.
See Also:
"Association Rules" to learn about the relationship between frequent itemsets and rules
Oracle Database PL/SQL Packages and Types Reference for descriptions of the build settings for association rules
Table 8-2 shows the itemsets from Table 8-1 that are frequent itemsets with support > 66%.
Frequent Itemset | Transactions | Support |
---|---|---|
(B,C) | 2 of 3 | 67% |
(B,D) | 2 of 3 | 67% |
(B,E) | 3 of 3 | 100% |
(C,E) | 2 of 3 | 67% |
(D,E) | 2 of 3 | 67% |
(B,C,E) | 2 of 3 | 67% |
(B,D,E) | 2 of 3 | 67% |
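The filtering that produces Table 8-2 can be sketched as a brute-force count: enumerate the itemsets in every transaction, count how many transactions contain each one, and keep those whose support exceeds the threshold. This is a simplified illustration, not the Apriori algorithm that Oracle Data Mining uses. The transaction contents are inferred from Table 8-1, and `frequent_itemsets` is a hypothetical helper.

```python
from collections import Counter
from itertools import combinations

# Items per transaction, inferred from the itemsets listed in Table 8-1.
transactions = [
    {"B", "D", "E"},        # transaction 11
    {"A", "B", "C", "E"},   # transaction 12
    {"B", "C", "D", "E"},   # transaction 13
]


def frequent_itemsets(transactions, min_support, max_length):
    """Return {itemset: support} for itemsets whose support exceeds min_support."""
    counts = Counter()
    for items in transactions:
        ordered = sorted(items)
        for size in range(2, max_length + 1):
            # Each itemset is counted at most once per transaction.
            counts.update(combinations(ordered, size))
    total = len(transactions)
    return {
        itemset: count / total
        for itemset, count in counts.items()
        if count / total > min_support
    }


frequent = frequent_itemsets(transactions, min_support=0.66, max_length=3)
for itemset, supp in sorted(frequent.items(), key=lambda kv: (len(kv[0]), kv[0])):
    print(itemset, f"{supp:.0%}")
```

Itemsets such as (C,D) or (A,B) appear in only one of the three transactions (33% support) and are therefore dropped.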
Tip:
Increase the minimum support if you want to decrease the build time for the model and generate fewer rules.
See Also:
Chapter 10, "Apriori" for information about the calculation of association rules
This example shows association rules mined from sales transactions in the SH schema. Sales is a fact table linked to products, customers, and other dimension tables through foreign keys. Oracle Data Miner automatically converts the transactional data to single-record case.
The items in each transaction are products; each transaction is uniquely identified by a customer ID. Figure 8-2 shows the dialog in Oracle Data Miner for selecting transactional data.
Figure 8-3 shows the dialog for selecting the unique transaction identifier.
A model with default settings built on this data generates many rules. One way to limit the number of rules is to raise the support and confidence. Figure 8-4 shows Confidence raised to 65% and Support raised to 75% in the Advanced Settings dialog.
Figure 8-4 Advanced Settings for Association Rules
Figure 8-5 shows the rules that are returned when you increase the confidence and support.
You can filter the rules in a number of different ways. The dialog in Figure 8-6 specifies that only rules with "Mouse Pad" in the antecedent, and "Keyboard Wrist Rest" in the consequent should be returned.
Figure 8-7 shows the three rules that result from the filtering criteria specified in Figure 8-6. The first rule states that a customer who purchases a mouse pad and a 1.44 MB External 3.5 Diskette is likely to also buy a keyboard wrist rest at the same time. The confidence for this rule is 99%. The support is 77%.
Figure 8-7 Display rules with mouse pad in antecedent
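The kind of filter specified in Figure 8-6 can be approximated outside the tool with a simple predicate over a list of rules. In the sketch below, the first rule uses the items, confidence, and support quoted above. The other two rules, the `Rule` class, and the `filter_rules` helper are hypothetical and only illustrate the filtering logic, not Oracle Data Miner's implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    antecedent: frozenset
    consequent: frozenset
    confidence: float
    support: float


rules = [
    # Rule described in the text: 99% confidence, 77% support.
    Rule(frozenset({"Mouse Pad", "1.44 MB External 3.5 Diskette"}),
         frozenset({"Keyboard Wrist Rest"}), 0.99, 0.77),
    # Hypothetical rules, included so the filter has something to exclude.
    Rule(frozenset({"Mouse Pad"}), frozenset({"Hypothetical Item X"}), 0.80, 0.78),
    Rule(frozenset({"Hypothetical Item Y"}), frozenset({"Keyboard Wrist Rest"}), 0.90, 0.76),
]


def filter_rules(rules, antecedent_item, consequent_item):
    """Keep rules with antecedent_item in the antecedent and consequent_item in the consequent."""
    return [
        r for r in rules
        if antecedent_item in r.antecedent and consequent_item in r.consequent
    ]


matches = filter_rules(rules, "Mouse Pad", "Keyboard Wrist Rest")
for r in matches:
    print(sorted(r.antecedent), "->", sorted(r.consequent), r.confidence, r.support)
```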
See Also:
"Confidence" for a discussion of confidence
Oracle Data Mining uses the Apriori algorithm to calculate association rules for items in frequent itemsets.
See Also:
Chapter 10, "Apriori"
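As a simplified picture of the step from frequent itemsets to rules, the sketch below takes the frequent itemsets of Table 8-2, treats each item in turn as the consequent, and keeps the rules whose confidence meets a minimum. It is not Oracle's Apriori implementation. The single-item consequent, the 90% threshold, and the helper names `count` and `generate_rules` are assumptions made for illustration, and the transaction contents are inferred from Table 8-1.

```python
# Items per transaction, inferred from the itemsets listed in Table 8-1.
transactions = [
    {"B", "D", "E"},
    {"A", "B", "C", "E"},
    {"B", "C", "D", "E"},
]

# Frequent itemsets taken from Table 8-2.
frequent = [
    ("B", "C"), ("B", "D"), ("B", "E"), ("C", "E"), ("D", "E"),
    ("B", "C", "E"), ("B", "D", "E"),
]


def count(itemset, transactions):
    """Number of transactions that contain every item in itemset."""
    wanted = set(itemset)
    return sum(1 for t in transactions if wanted <= t)


def generate_rules(frequent, transactions, min_confidence):
    """Derive (antecedent, consequent, confidence) rules with one consequent item."""
    rules = []
    for itemset in frequent:
        for consequent in itemset:
            antecedent = tuple(i for i in itemset if i != consequent)
            # Confidence: transactions with the whole itemset / transactions with the antecedent.
            conf = count(itemset, transactions) / count(antecedent, transactions)
            if conf >= min_confidence:
                rules.append((antecedent, consequent, conf))
    return rules


rules = generate_rules(frequent, transactions, min_confidence=0.9)
for antecedent, consequent, conf in rules:
    print(f"{antecedent} implies {consequent} with {conf:.0%} confidence")
```

On this tiny data set, a rule such as "B and C imply E" has 100% confidence, while "B and E imply C" reaches only 67% and is filtered out.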