Oracle Text enables you to classify documents in the following ways:
Rule-Based Classification. In rule-based classification, you group your documents together, decide on categories, and formulate the rules that define those categories; these rules are actually query phrases. You then index the rules and use the MATCHES operator to classify documents.
Advantage: Rule-based classification is very accurate for small document sets. Results are always based on what you define, because you write the rules.
Disadvantages: Defining rules can be tedious for large document sets with many categories. As your document set grows, you may need to write correspondingly more rules.
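The rule-based workflow can be sketched as follows. This is a minimal illustration, not a complete application: the table doc_rules, its columns, and the sample rules are hypothetical, and the sketch assumes the rules are indexed with the CTXRULE index type, which is the index type the MATCHES operator works against.

```sql
-- Hypothetical table: one rule (a query phrase) per row, tagged with its category.
CREATE TABLE doc_rules (
  rule_id   NUMBER PRIMARY KEY,
  category  VARCHAR2(30),
  rule_text VARCHAR2(2000)
);

INSERT INTO doc_rules VALUES (1, 'Sports',  'football OR baseball OR soccer');
INSERT INTO doc_rules VALUES (2, 'Finance', 'stocks OR bonds OR dividend');
COMMIT;

-- Index the rules themselves (not the documents).
CREATE INDEX doc_rules_idx ON doc_rules (rule_text)
  INDEXTYPE IS CTXSYS.CTXRULE;

-- Classify one document: each returned row is a category whose rule matched.
SELECT category
FROM   doc_rules
WHERE  MATCHES(rule_text, 'The football season opens next week.') > 0;
```

Because every category comes from a rule you wrote, a document that matches no rule returns no rows; covering it means adding or widening a rule.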
Supervised Classification. This method is similar to rule-based classification, but the rule-writing step is automated with CTX_CLS.TRAIN. CTX_CLS.TRAIN formulates a set of classification rules from a sample set of pre-classified documents that you provide. As with rule-based classification, you use the MATCHES operator to classify documents.
Oracle Text offers two versions of supervised classification, one using the RULE_CLASSIFIER preference and one using the SVM_CLASSIFIER preference. These are discussed in "Supervised Classification".
Advantage: Rules are written for you automatically. This is useful for large document sets.
Disadvantages:
You must assign documents to categories before generating the rules.
Rules may not be as specific or accurate as those you write yourself.
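A sketch of supervised classification with the RULE_CLASSIFIER preference might look like the following. All object names here (docs, docs_idx, doc_cats, gen_rules, my_rule_pref) are hypothetical, and the sketch assumes that a table of training documents with a CONTEXT index and a table of your document-to-category assignments already exist. The arguments to CTX_CLS.TRAIN are passed positionally; check the order against the CTX_CLS reference for your release before relying on it.

```sql
-- Assumed to exist already (hypothetical names):
--   docs(doc_id NUMBER, text ...)   training documents, CONTEXT index docs_idx
--   doc_cats(doc_id, cat_id)        your manual assignment of documents to categories

-- Table that receives the generated rules.
CREATE TABLE gen_rules (
  cat_id     NUMBER,
  query      VARCHAR2(4000),
  confidence NUMBER
);

BEGIN
  CTX_DDL.CREATE_PREFERENCE('my_rule_pref', 'RULE_CLASSIFIER');
  CTX_CLS.TRAIN(
    'docs_idx',      -- index on the training documents
    'doc_id',        -- document id column in the document table
    'doc_cats',      -- table of document-to-category assignments
    'doc_id',        -- document id column in that table
    'cat_id',        -- category id column in that table
    'gen_rules',     -- result table for the generated rules
    'cat_id',        -- category id column in the result table
    'query',         -- rule (query) column in the result table
    'confidence',    -- confidence column in the result table
    'my_rule_pref'   -- classifier preference
  );
END;
/

-- From here on the flow matches rule-based classification:
-- index the generated rules, then classify with MATCHES.
CREATE INDEX gen_rules_idx ON gen_rules (query)
  INDEXTYPE IS CTXSYS.CTXRULE;

SELECT cat_id
FROM   gen_rules
WHERE  MATCHES(query, 'Text of a new, unclassified document.') > 0;
```

Inspecting the query column of gen_rules is a practical way to judge how specific the generated rules are compared with rules you would write yourself.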
Unsupervised Classification (Clustering). All steps from grouping your documents to writing the category rules are automated with CTX_CLS.CLUSTERING. Oracle Text statistically analyzes your document set and correlates the documents with clusters according to content.
Advantages:
You do not need to provide either the classification rules or the sample documents as a training set.
Clustering helps you discover patterns and content similarities in your document set that you might otherwise overlook.
In fact, you can use unsupervised classification when you do not have a clear idea of rules or classifications. One possible scenario is to use unsupervised classification to provide an initial set of categories, and to subsequently build on these through supervised classification.
Disadvantages:
Clustering might result in unexpected groupings, because the clustering operation is not user-defined, but based on an internal algorithm.
You do not see the rules that create the clusters.
The clustering operation is CPU-intensive and can take at least as long as indexing.
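A clustering run might be sketched as follows. The names docs_idx, doc_id, doc_assign, cluster_desc, and my_cluster_pref are hypothetical, the choice of three clusters is arbitrary, and the sketch assumes a document table with a CONTEXT index already exists. The KMEAN_CLUSTERING preference type and its CLUSTER_NUM attribute, as well as the positional argument order, should be confirmed against the CTX_CLS reference for your release.

```sql
BEGIN
  -- Clustering preference; the number of clusters is an arbitrary example value.
  CTX_DDL.CREATE_PREFERENCE('my_cluster_pref', 'KMEAN_CLUSTERING');
  CTX_DDL.SET_ATTRIBUTE('my_cluster_pref', 'CLUSTER_NUM', '3');

  CTX_CLS.CLUSTERING(
    'docs_idx',        -- index on the document collection
    'doc_id',          -- document id column
    'doc_assign',      -- output table: document-to-cluster assignments
    'cluster_desc',    -- output table: cluster descriptions
    'my_cluster_pref'  -- clustering preference
  );
END;
/

-- Review which cluster each document was assigned to, and what each cluster looks like.
SELECT * FROM doc_assign;
SELECT * FROM cluster_desc;
```

The output tables describe the resulting clusters but not the rules that produced them, so treat the groupings as a starting point to review rather than as final categories.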