Classification Solutions

Oracle Text enables you to classify documents in the following ways:

  • Rule-Based Classification. In rule-based classification, you group your documents together, decide on categories, and formulate the rules that define those categories; these rules are actually query phrases. You then index the rules and use the MATCHES operator to classify documents.

    Advantage: Rule-based classification is very accurate for small document sets. Results are always based on what you define, because you write the rules.

    Disadvantages: Defining rules can be tedious for large document sets with many categories. As your document set grows, you may need to write correspondingly more rules.
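
    The rule-based workflow can be sketched roughly as follows. This is a minimal illustration, not a complete application: the table name queries, its columns, the index name queryx, and the sample rules are all hypothetical. The rules are indexed with the CTXRULE index type, and MATCHES then returns the rules that a given document satisfies.

    ```sql
    -- Hypothetical rule table: each row defines a category as a query phrase.
    CREATE TABLE queries (
      query_id     NUMBER,
      query_string VARCHAR2(80)
    );

    INSERT INTO queries VALUES (1, 'oracle');
    INSERT INTO queries VALUES (2, 'larry or ellison');
    INSERT INTO queries VALUES (3, 'oracle and text');
    INSERT INTO queries VALUES (4, 'market share');

    -- Index the rules with the CTXRULE index type.
    CREATE INDEX queryx ON queries(query_string) INDEXTYPE IS CTXSYS.CTXRULE;

    -- Classify one document: list every rule that the text matches.
    SELECT query_id, query_string
      FROM queries
     WHERE MATCHES(query_string,
                   'Oracle announced that its market share in databases increased over the last year.') > 0;
    ```

    With these sample rules, the document should match rule 1 (it contains the word oracle) and rule 4 (it contains the phrase market share), but not rules 2 or 3.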

  • Supervised Classification. This method is similar to rule-based classification, but the rule-writing step is automated with CTX_CLS.TRAIN. CTX_CLS.TRAIN formulates a set of classification rules from a sample set of pre-classified documents that you provide. As with rule-based classification, you use the MATCHES operator to classify documents.

    Oracle Text offers two versions of supervised classification, one using the RULE_CLASSIFIER preference and one using the SVM_CLASSIFIER preference. These are discussed in "Supervised Classification".

    Advantage: Rules are written for you automatically. This is useful for large document sets.

    Disadvantages:

    • You must assign documents to categories before generating the rules.

    • Rules may not be as specific or accurate as those you write yourself.
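
    A supervised run with the RULE_CLASSIFIER preference might look roughly like the sketch below. All table, column, index, and preference names (train_docs, train_cats, gen_rules, my_rule_pref, and so on) are hypothetical, the training data load is omitted, and the positional argument order shown for CTX_CLS.TRAIN should be checked against the reference for your release.

    ```sql
    -- Hypothetical schema: training documents, their hand-assigned
    -- categories, and an empty table to receive the generated rules.
    CREATE TABLE train_docs (
      doc_id NUMBER PRIMARY KEY,
      text   CLOB
    );

    CREATE TABLE train_cats (
      doc_id NUMBER,   -- refers to train_docs.doc_id
      cat_id NUMBER    -- category assigned by hand
    );

    CREATE TABLE gen_rules (
      cat_id     NUMBER,
      rule_query VARCHAR2(4000),
      conf       NUMBER
    );

    -- (Load train_docs and train_cats with pre-classified samples here.)

    -- CONTEXT index over the training documents, used by TRAIN.
    CREATE INDEX train_docs_idx ON train_docs(text) INDEXTYPE IS CTXSYS.CONTEXT;

    BEGIN
      CTX_DDL.CREATE_PREFERENCE('my_rule_pref', 'RULE_CLASSIFIER');
      CTX_CLS.TRAIN(
        'train_docs_idx',  -- index on the training documents
        'doc_id',          -- document ID column in train_docs
        'train_cats',      -- category assignment table
        'doc_id',          -- document ID column in train_cats
        'cat_id',          -- category ID column in train_cats
        'gen_rules',       -- result table for the generated rules
        'cat_id',          -- result column: category ID
        'rule_query',      -- result column: generated rule
        'conf',            -- result column: confidence
        'my_rule_pref');   -- classifier preference
    END;
    /

    -- Index the generated rules, then classify new text with MATCHES.
    CREATE INDEX gen_rules_idx ON gen_rules(rule_query) INDEXTYPE IS CTXSYS.CTXRULE;

    SELECT cat_id
      FROM gen_rules
     WHERE MATCHES(rule_query, 'text of a new, unclassified document') > 0;
    ```

    The SVM_CLASSIFIER variant uses a different result table layout and a different CTX_CLS.TRAIN signature, so this sketch does not carry over to it unchanged.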

  • Unsupervised Classification (Clustering). All steps from grouping your documents to writing the category rules are automated with CTX_CLS.CLUSTERING. Oracle Text statistically analyzes your document set and correlates the documents with clusters according to their content.

    Advantages:

    • You do not need to provide either the classification rules or the sample documents as a training set.

    • Clustering helps you discover patterns and content similarities in your document set that you might otherwise overlook.

      In fact, you can use unsupervised classification when you do not have a clear idea of rules or classifications. One possible scenario is to use unsupervised classification to provide an initial set of categories, and then to build on these through supervised classification.

    Disadvantages:

    • Clustering might result in unexpected groupings, because the clustering operation is not user-defined, but based on an internal algorithm.

    • You do not see the rules that create the clusters.

    • The clustering operation is CPU-intensive and can take at least as long as indexing.
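
    A clustering run might be sketched as follows. The table collection, the index collectionx, the preference name my_cluster, the output table names restab and clusters, and the choice of three clusters are all hypothetical. The sketch also assumes that CTX_CLS.CLUSTERING creates the two output tables itself and that the assignment table has docid, clusterid, and score columns; verify both against the reference for your release.

    ```sql
    -- Hypothetical document table; load it before indexing.
    CREATE TABLE collection (
      id   NUMBER PRIMARY KEY,
      text VARCHAR2(4000)
    );

    CREATE INDEX collectionx ON collection(text) INDEXTYPE IS CTXSYS.CONTEXT;

    BEGIN
      -- Request a fixed number of clusters (3 is an arbitrary choice).
      CTX_DDL.CREATE_PREFERENCE('my_cluster', 'KMEAN_CLUSTERING');
      CTX_DDL.SET_ATTRIBUTE('my_cluster', 'CLUSTER_NUM', '3');

      CTX_CLS.CLUSTERING(
        'collectionx',  -- CONTEXT index on the documents
        'id',           -- document ID column
        'restab',       -- output: document-to-cluster assignments
        'clusters',     -- output: cluster descriptions
        'my_cluster');  -- clustering preference
    END;
    /

    -- Inspect which cluster each document was assigned to.
    SELECT docid, clusterid, score
      FROM restab
     ORDER BY clusterid, score DESC;
    ```

    Because you do not see the rules behind the clusters, reviewing the assignment table in this way is one practical means of judging whether the groupings are useful.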