Formatted documents such as Microsoft Word and PDF files must be filtered to text before they can be indexed. The type of filtering the system uses is determined by the FILTER preference type. By default, the system uses the AUTO_FILTER filter type, which automatically detects the format of your documents and filters them to text.
Oracle Text can index most formats. Oracle Text can also index columns that contain documents with mixed formats.
See Oracle Text Reference for information about the document and graphics formats that AUTO_FILTER supports.
If you have a mixed-format column, such as one that contains Microsoft Word, plain text, and HTML documents, you can bypass filtering for plain text or HTML by including a format column in your text table. In the format column, you tag each row TEXT or BINARY. Rows that are tagged TEXT are not filtered.
For example, you can tag the HTML and plain text rows as TEXT and the Microsoft Word rows as BINARY. You specify the format column in the CREATE INDEX parameter clause.
A third format column value, IGNORE, is provided for when you do not want a document to be indexed at all. This is useful, for example, when you have a mixed-format table that includes plain-text documents in both Japanese and English but you want to process only the English documents, or when the table includes both plain-text documents and images. Because IGNORE is implemented at the datastore level, it can be used with all filters.
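The format column described above can be sketched as follows. This is a minimal illustration, not a definitive setup: the table name mixed_docs, the column names fmt and doc, and the index name mixed_docs_idx are hypothetical, and the sample rows only show how each tag would be used.

```sql
-- Hypothetical table; all names here are illustrative assumptions.
CREATE TABLE mixed_docs (
  id   NUMBER PRIMARY KEY,
  fmt  VARCHAR2(10),   -- format column: TEXT, BINARY, or IGNORE
  doc  BLOB
);

-- HTML and plain text rows are tagged TEXT, so they bypass filtering.
INSERT INTO mixed_docs (id, fmt, doc)
VALUES (1, 'TEXT', UTL_RAW.CAST_TO_RAW('<html><body>Sample page</body></html>'));

-- Microsoft Word rows would be tagged BINARY so that they are filtered.
-- Loading the actual binary content is omitted from this sketch.
INSERT INTO mixed_docs (id, fmt, doc)
VALUES (2, 'BINARY', EMPTY_BLOB());

-- Rows tagged IGNORE are not indexed at all (for example, images).
INSERT INTO mixed_docs (id, fmt, doc)
VALUES (3, 'IGNORE', EMPTY_BLOB());

-- Name the format column in the CREATE INDEX parameter clause.
CREATE INDEX mixed_docs_idx ON mixed_docs (doc)
  INDEXTYPE IS CTXSYS.CONTEXT
  PARAMETERS ('FILTER CTXSYS.AUTO_FILTER FORMAT COLUMN fmt');
```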
You can create your own custom filter to filter documents for indexing. You can create either an external filter that is executed from the file system or an internal filter as a PL/SQL or Java stored procedure.
For external custom filtering, use the USER_FILTER filter preference type.
For internal filtering, use the PROCEDURE_FILTER filter type.
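As a rough sketch of how the two custom filter types might be declared, the following block creates one preference of each type. The preference names, the executable name my_filter.sh, and the procedure name my_filter_proc are hypothetical; where the executable must reside and which schema must own the procedure depend on your release, so check Oracle Text Reference before relying on this.

```sql
BEGIN
  -- External filter: an executable run from the file system.
  -- 'my_filter.sh' is a placeholder, not a shipped program.
  CTX_DDL.CREATE_PREFERENCE('my_ext_filter', 'USER_FILTER');
  CTX_DDL.SET_ATTRIBUTE('my_ext_filter', 'COMMAND', 'my_filter.sh');

  -- Internal filter: a stored procedure that converts the document to text.
  -- 'my_filter_proc' is a placeholder; it is assumed to take a BLOB as input
  -- and write the filtered text to a CLOB.
  CTX_DDL.CREATE_PREFERENCE('my_proc_filter', 'PROCEDURE_FILTER');
  CTX_DDL.SET_ATTRIBUTE('my_proc_filter', 'PROCEDURE', 'my_filter_proc');
  CTX_DDL.SET_ATTRIBUTE('my_proc_filter', 'INPUT_TYPE', 'BLOB');
  CTX_DDL.SET_ATTRIBUTE('my_proc_filter', 'OUTPUT_TYPE', 'CLOB');
END;
/
```

Either preference would then be named in the CREATE INDEX parameter clause, for example PARAMETERS ('FILTER my_ext_filter').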