The example in this section provides a very simple, basic example of entity extraction. The example assumes that you have a clob containing the following text:
New York, United States of America The Dow Jones Industrial Average climbed by 5% yesterday on news of a new software release from database giant Oracle Corporation.
The example uses CTX_ENTITY.EXTRACT to find all the entities in the CLOB value. (For now, do not worry about how the text got into the clob or how we provide the output clob)
Entity extraction requires a new type of policy, an "extract policy," which enables you to specify options. For now, we will create a default policy as follows:
ctx_entity.create_extract_policy( 'mypolicy' );
We can then call extract
to do the work. It needs four arguments: the policy name, the document to process, the language, and the output clob (which must have been initialized, for example, by calling dbms_lob.createtemporary
).
ctx_entity.extract( 'mypolicy', mydoc, 'ENGLISH', outclob )
outclob
contains the XML identifying extracted entities. If we display the contents (preferably by selecting it as an XMLTYPE so it gets formatted nicely) we will see the following:
<entities> <entity id="0" offset="0" length="8" source="SuppliedDictionary"> <text>New York</text> <type>city</type> </entity> <entity id="1" offset="150" length="18" source="SuppliedRule"> <text>Oracle Corporation</text> <type>company</type> </entity> <entity id="2" offset="10" length="24" source="SuppliedDictionary"> <text>United States of America</text> <type>country</type> </entity> <entity id="3" offset="83" length="2" source="SuppliedRule"> <text>5%</text> <type>percent</type> </entity> <entity id="4" offset="113" length="8" source="SuppliedDictionary"> <text>software</text> <type>product</type> </entity> <entity id="5" offset="0" length="8" source="SuppliedDictionary"> <text>New York</text> <type>state</type> </entity> </entities>
This is fine if we are going to process it with an XML-aware program. However, if we want it in a more "SQL friendly" view, we can use Oracle XML DB functions to convert it as follows:
select xtab.offset, xtab.text, xtab.type, xtab.source from xmltable( '/entities/entity' PASSING xmltype(outclob) COLUMNS offset number PATH '@offset', lngth number PATH '@length', text varchar2(50) PATH 'text/text()', type varchar2(50) PATH 'type/text()', source varchar2(50) PATH '@source' ) as xtab order by offset;
This produces the following output:
OFFSET TEXT TYPE SOURCE ---------- ------------------------- -------------------- -------------------- 0 New York city SuppliedDictionary 0 New York state SuppliedDictionary 10 United States of America country SuppliedDictionary 83 5% percent SuppliedRule 113 software product SuppliedDictionary 150 Oracle Corporation company SuppliedRule
If we do not want every different type of entity, we can select which types to fetch. We do this by adding a fourth argument to the "extract" procedure, with a comma-separated list of entity types. For example:
ctx_entity.extract( 'mypolicy', mydoc, 'ENGLISH', outclob, 'city, country' ) That would give us the XML <entities> <entity id="0" offset="0" length="8" source="SuppliedDictionary"> <text>New York</text> <type>city</type> </entity> <entity id="2" offset="10" length="24" source="SuppliedDictionary"> <text>United States of America</text> <type>country</type> </entity> </entities>