Basic Example of Using Entity Extraction

The example in this section provides a very simple, basic example of entity extraction. The example assumes that you have a clob containing the following text:

New York, United States of America
The Dow Jones Industrial Average climbed by 5% yesterday on news of a new software release from database giant Oracle Corporation.

The example uses CTX_ENTITY.EXTRACT to find all the entities in the CLOB value. (For now, do not worry about how the text got into the clob or how we provide the output clob)

Entity extraction requires a new type of policy, an "extract policy," which enables you to specify options. For now, we will create a default policy as follows:

ctx_entity.create_extract_policy( 'mypolicy' );

We can then call extract to do the work. It needs four arguments: the policy name, the document to process, the language, and the output clob (which must have been initialized, for example, by calling dbms_lob.createtemporary).

ctx_entity.extract( 'mypolicy', mydoc, 'ENGLISH', outclob )

outclob contains the XML identifying extracted entities. If we display the contents (preferably by selecting it as an XMLTYPE so it gets formatted nicely) we will see the following:

<entities>
  <entity id="0" offset="0" length="8" source="SuppliedDictionary">
    <text>New York</text>
    <type>city</type>
  </entity>
  <entity id="1" offset="150" length="18" source="SuppliedRule">
    <text>Oracle Corporation</text>
    <type>company</type>
  </entity>
  <entity id="2" offset="10" length="24" source="SuppliedDictionary">
    <text>United States of America</text>
    <type>country</type>
  </entity>
  <entity id="3" offset="83" length="2" source="SuppliedRule">
    <text>5%</text>
    <type>percent</type>
  </entity>
  <entity id="4" offset="113" length="8" source="SuppliedDictionary">
    <text>software</text>
    <type>product</type>
  </entity>
  <entity id="5" offset="0" length="8" source="SuppliedDictionary">
    <text>New York</text>
    <type>state</type>
  </entity>
</entities>

This is fine if we are going to process it with an XML-aware program. However, if we want it in a more "SQL friendly" view, we can use Oracle XML DB functions to convert it as follows:

select xtab.offset, xtab.text, xtab.type, xtab.source
from xmltable( '/entities/entity'
PASSING xmltype(outclob)
  COLUMNS 
    offset number       PATH '@offset',
    lngth number        PATH '@length',
    text   varchar2(50) PATH 'text/text()',
    type   varchar2(50) PATH 'type/text()',
    source varchar2(50) PATH '@source'
) as xtab order by offset;

This produces the following output:

    OFFSET TEXT                      TYPE                 SOURCE
---------- ------------------------- -------------------- --------------------
         0 New York                  city                 SuppliedDictionary
         0 New York                  state                SuppliedDictionary
        10 United States of America  country              SuppliedDictionary
        83 5%                        percent              SuppliedRule
       113 software                  product              SuppliedDictionary
       150 Oracle Corporation        company              SuppliedRule

If we do not want every different type of entity, we can select which types to fetch. We do this by adding a fourth argument to the "extract" procedure, with a comma-separated list of entity types. For example:

ctx_entity.extract( 'mypolicy', mydoc, 'ENGLISH', outclob, 'city, country' ) 
 
That would give us the XML
 
<entities>
  <entity id="0" offset="0" length="8" source="SuppliedDictionary">
    <text>New York</text>
    <type>city</type>
  </entity>
  <entity id="2" offset="10" length="24" source="SuppliedDictionary">
    <text>United States of America</text>
    <type>country</type>
  </entity>
</entities>