Overview of Document Classification

A major problem facing businesses and institutions today is that of information overload. Sorting out useful documents from documents that are not of interest challenges the ingenuity and resources of both individuals and organizations.

One way to sift through numerous documents is to use keyword search engines. However, keyword searches have limitations. One major drawback is that keyword searches do not discriminate by context. In many languages, a word or phrase may have multiple meanings, so a search may result in many matches that are not on the desired topic. For example, a query on the phrase river bank might return documents about the Hudson River Bank & Trust Company, because the word bank has two meanings.

An alternative strategy is to have human beings sort through documents and classify them by content, but this is not feasible for very large volumes of documents.

Oracle Text offers various approaches to document classification. Under rule-based classification, you write the classification rules yourself. With supervised classification, Oracle Text creates classification rules based on a set of sample documents that you pre-classify. Finally, with unsupervised classification (also known as clustering), Oracle Text performs all the steps, from writing the classification rules to classifying the documents, for you.