MDATA Section

An MDATA section is used to reference user-defined metadata for a document. Using MDATA sections can speed up mixed queries. There is no limit to the number of MDATA sections that can be returned in a query.

Consider the case in which you want to query by both text content and document type (magazine, newspaper, or novel). You could create an index with a column for text and a column for the document type, and then perform a mixed query of the following form. This example searches for all novels that contain the phrase Adam Thorpe (author of the novel Ulverton):

SELECT id FROM documents
   WHERE doctype = 'novel'
      AND CONTAINS(text, 'Adam Thorpe')>0;

However, it is usually faster to incorporate the attribute (in this case, the document type) into a field section, rather than use a separate column, and then use a single CONTAINS query:

SELECT id FROM documents
  WHERE CONTAINS(text, 'Adam Thorpe AND novel WITHIN doctype')>0;

There are two drawbacks to this approach:

  • Each time the attribute is updated, the entire text document must be re-indexed, resulting in increased index fragmentation and slower rates of processing DML.

  • Field sections tokenize the section value. This has several effects. Special characters in metadata, such as decimal points or currency characters, are not easily searchable; value searching (searching for Thurston Howell but not Thurston Howell, Jr.) is difficult; multi-word values are queried by phrase, which is slower than single-token searching; and multi-word values do not show up in browse-words, making author browsing or subject browsing impossible.

For these reasons, using MDATA sections instead of field sections may be worthwhile. MDATA sections are indexed like field sections, but metadata values can be added to and removed from documents without the need to re-index the document text. Unlike field sections, MDATA values are not tokenized. Additionally, MDATA section indexing generally takes up less disk space than field section indexing.

Use CTX_DDL.ADD_MDATA_SECTION to add an MDATA section to a section group. This example adds an MDATA section called AUTHOR and gives it the value Soseki Natsume (author of the novel Kokoro).

BEGIN
   ctx_ddl.create_section_group('htmgroup', 'HTML_SECTION_GROUP');
   ctx_ddl.add_mdata_section('htmgroup', 'author', 'Soseki Natsume');
END;
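
A section group only takes effect once an index is created that references it. The following is a minimal sketch, assuming a CONTEXT index on the text column of the documents table used in the queries above. The index name documents_idx is hypothetical and does not come from the original example.

```sql
-- documents_idx is a hypothetical index name chosen for illustration.
-- htmgroup is the section group created in the preceding example.
CREATE INDEX documents_idx ON documents(text)
   INDEXTYPE IS CTXSYS.CONTEXT
   PARAMETERS ('SECTION GROUP htmgroup');
```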

MDATA values can be added with CTX_DDL.ADD_MDATA and removed with CTX_DDL.REMOVE_MDATA. An MDATA section can hold multiple values. Only the owner of the index can call CTX_DDL.ADD_MDATA and CTX_DDL.REMOVE_MDATA.

Neither CTX_DDL.ADD_MDATA nor CTX_DDL.REMOVE_MDATA is supported for CTXCAT and CTXRULE indexes.
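
As a rough sketch, adding one value and removing another for a single document might look like the following. It assumes the four-argument form of both procedures (index name, section name, value, rowid); check the CTX_DDL chapter of the Oracle Text Reference for the exact signatures. The index name documents_idx, the document id 1, and the old value Natsume Soseki are hypothetical and used only for illustration.

```sql
DECLARE
   v_rowid VARCHAR2(18);
BEGIN
   -- Look up the rowid of the document whose metadata is being changed
   -- (id = 1 is a hypothetical document).
   SELECT ROWIDTOCHAR(rowid) INTO v_rowid
      FROM documents
      WHERE id = 1;

   -- Add a new AUTHOR value; the document text is not re-indexed.
   ctx_ddl.add_mdata('documents_idx', 'author', 'Soseki Natsume', v_rowid);

   -- Remove a previously stored value (hypothetical old value).
   ctx_ddl.remove_mdata('documents_idx', 'author', 'Natsume Soseki', v_rowid);

   -- Neither call commits, so commit explicitly.
   COMMIT;
END;
/
```

Because an MDATA section can hold several values, changing a value amounts to adding the new value and removing the old one.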

MDATA values are not passed through a lexer. Instead, all values undergo a simplified normalization as follows:

  • Leading and trailing whitespace on the value is removed.

  • The value is truncated to 64 bytes.

  • The value is indexed as a single value; if the value consists of multiple words, it is not broken up.

  • Case is preserved. If the document is dynamically generated, you can implement case-insensitivity by uppercasing MDATA values and making sure to search only in uppercase.
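
The case-insensitivity approach described in the last item can be sketched as follows: store the value in uppercase, then query in uppercase. This is an illustration only. The index name documents_idx and the document id 1 are hypothetical, and the four-argument form of CTX_DDL.ADD_MDATA (index name, section name, value, rowid) is assumed.

```sql
DECLARE
   v_rowid VARCHAR2(18);
BEGIN
   -- id = 1 is a hypothetical document.
   SELECT ROWIDTOCHAR(rowid) INTO v_rowid
      FROM documents
      WHERE id = 1;

   -- Store the value in uppercase so that every stored value has one form.
   ctx_ddl.add_mdata('documents_idx', 'author', UPPER('Soseki Natsume'), v_rowid);
   COMMIT;
END;
/

-- Search in uppercase only; the application is responsible for
-- uppercasing the search value before building the query.
SELECT id FROM documents
   WHERE CONTAINS(text, 'MDATA(author, SOSEKI NATSUME)')>0;
```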

After a document has had MDATA metadata added to it, you can query for that metadata using the MDATA CONTAINS query operator:

SELECT id FROM documents
   WHERE CONTAINS(text, 'Tokyo and MDATA(author, Soseki Natsume)')>0;

This query succeeds only if an AUTHOR tag has the exact value Soseki Natsume (after simplified normalization). Searching for Soseki or for Natsume Soseki does not match.
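
To illustrate the exact-match behavior, the following queries assume a document whose only AUTHOR value is Soseki Natsume. The expected results in the comments follow from the rule described above and have not been run against a database.

```sql
-- Matches: the value is identical to the stored value.
SELECT id FROM documents
   WHERE CONTAINS(text, 'MDATA(author, Soseki Natsume)')>0;

-- Does not match: only part of the stored value is given.
SELECT id FROM documents
   WHERE CONTAINS(text, 'MDATA(author, Soseki)')>0;

-- Does not match: the words are in a different order.
SELECT id FROM documents
   WHERE CONTAINS(text, 'MDATA(author, Natsume Soseki)')>0;
```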

Other things to note about MDATA:

  • MDATA values are not highlightable, will not appear in the output of CTX_DOC.TOKENS, and will not show up when FILTER PLAINTEXT is enabled.

  • MDATA sections must be unique within section groups. You cannot have an MDATA section named FOO and a zone or field section of the same name in the same section group.

  • Like field sections, MDATA sections cannot overlap or nest. An MDATA section is implicitly closed by the first tag encountered. For instance, in this example:

    <AUTHOR>Dickens <B>Shelley</B> Keats</AUTHOR>
    

    The <B> tag closes the AUTHOR MDATA section; as a result, this document has an AUTHOR of 'Dickens', but not of 'Shelley' or 'Keats'.

  • To prevent race conditions, each call to ADD_MDATA and REMOVE_MDATA locks out other calls on that rowid for that index for all values and sections. However, since ADD_MDATA and REMOVE_MDATA do not commit, it is possible for an application to deadlock when calling them both. It is the application's responsibility to prevent deadlocking.

See Also:

  • The CONTAINS query operators chapter of the Oracle Text Reference for information on the MDATA operator

  • The CTX_DDL package chapter of the Oracle Text Reference for information on adding and removing MDATA sections