12 Extracting Entities Using Oracle Text

Entity extraction is the process of locating phrases in an input document and classifying these phrases into a set of categories. This information can then be used to better understand the content of the document. Entities can be concrete objects, such as people or companies, or more abstract objects, such as percentages. Table 12-1, "Entities and Entity Types", shows example entities and the entity types they belong to.

Table 12-1 Entities and Entity Types

Entity	Entity Type
United States of America	Country
25%	Percentage
Oracle Corporation	Organization_Company
John Doe	Person

12.1 How to Use Entity Extraction

Application developers can use entity extraction in a variety of ways. For example, a search application could use entity extraction to find all documents in a given corpus that contain a company name, even if the user does not know the exact name of the company being searched for. Entity extraction can also be used to examine a corpus and see which names are most frequently mentioned (that is, most popular), which cannot be done with simple keyword searches. Finally, entity extraction can be used to mark up documents with useful links. For example, an application could use entity extraction to find all the places mentioned in its documents and link the text of each place to a map showing its location.

Oracle Text's entity extraction feature provides an out-of-the-box solution that enables application developers to use entity extraction on their documents inside the database. In addition, entity extraction allows developers to improve their results by adding user-defined rules and dictionaries to drive the extraction. Oracle provides a large set of supplied entity types, and, in addition, you can define your own.

The examples in this chapter show how to use and tune entity extraction for a given application.

12.1.1 Examples of Entity Extraction

The following examples illustrate different cases of using entity extraction.

Example 12-1 Running Entity Extraction

This example illustrates how to use entity extraction to analyze a set of documents in a database. Suppose that our documents are contained in a table named docs, with the following definition:

create table docs(id number primary key, txt clob);

Then, we create a second table called entities that will contain the extracted entities for the given documents:

create table entities(id number primary key, ents clob);

We now create an entity extraction policy that will govern how the extraction runs. An entity extraction policy represents the various knobs the user has to tune the results of the extraction. For example, the user can set the policy to enable or disable the Oracle-supplied rules or the Oracle-supplied dictionary. Additionally, users can add their own custom rules or dictionary to a given policy.

For now, we will create a policy p1 that uses the default options, which include both the Oracle-supplied rules and the Oracle-supplied dictionary:

exec ctx_entity.create_extract_policy('p1');

We then run entity extraction on our input document in the docs table using the PL/SQL procedure CTX_ENTITY.EXTRACT. We insert the extracted entities back into the entities table for further inspection:

declare
  mydoc clob;
  myresults clob;
begin
  --put input document into mydoc
  select txt into mydoc from docs where id=1;
  --run extraction using policy p1
  ctx_entity.extract('p1', mydoc, null, myresults);
  --save entities to table
  insert into entities values(1, myresults);
  commit;
end;
/

Finally, we can examine the entities by reading the entities table. Below is a sample of entity extraction output from an example document:

<entities>
  <entity id="0" offset="4068" length="7" source="SuppliedDictionary">
  <text>Chicago</text>
  <type>city</type>
</entity>
<entity id="1" offset="1102" length="8" source="SuppliedDictionary">
  <text>New York</text>
  <type>city</type>
</entity>
<entity id="2" offset="4326" length="8" source="SuppliedDictionary">
  <text>New York</text>
  <type>city</type>
</entity>
<entity id="7" offset="49" length="28" source="SuppliedRule">
  <text>American International Group</text>
  <type>company</type>
</entity>
<entity id="8" offset="3494" length="10" source="SuppliedRule">
  <text>Imax Corp.</text>
  <type>company</type>
</entity>
<entity id="9" offset="2020" length="18" source="SuppliedRule">
  <text>Investment Company</text>
  <type>company</type>
</entity>
<entity id="10" offset="509" length="11" source="SuppliedDictionary">
  <text>Pfizer Inc.</text>
  <type>company</type>
</entity>
<entity id="13" offset="23" length="4" source="SuppliedDictionary">
  <text>U.S.</text>
  <type>country</type>
  <normal>
    <value>United States of America</value>
  </normal>
</entity>
<entity id="17" offset="3527" length="6" source="SuppliedRule">
  <text>$14.91</text>
  <type>currency</type>
</entity>
<entity id="19" offset="1970" length="12" source="SuppliedRule">
  <text>$172 billion</text>
  <type>currency</type>
</entity>
<entity id="27" offset="1934" length="12" source="SuppliedRule">
  <text>January 2009</text>
  <type>date</type>
</entity>
<entity id="32" offset="1062" length="11" source="SuppliedRule">
  <text>0.1 percent</text>
  <type>percent</type>
</entity>
<entity id="46" offset="3405" length="23" source="SuppliedDictionary">
  <text>Chief Financial Officer</text>
  <type>person_jobtitle</type>
</entity>
<entity id="47" offset="157" length="9" source="SuppliedDictionary">
  <text>President</text>
  <type>person_jobtitle</type>
</entity>
<entity id="48" offset="751" length="7" source="SuppliedDictionary">
  <text>manager</text>
  <type>person_jobtitle</type>
</entity>
<entity id="49" offset="2599" length="15" source="SuppliedRule">
  <text>Robert W. Baird</text>
  <type>person_name</type>
</entity>
<entity id="56" offset="3482" length="7" source="SuppliedDictionary">
  <text>Florida</text>
  <type>state</type>
</entity>
<entity id="57" offset="842" length="8" source="SuppliedDictionary">
  <text>Michigan</text>
  <type>state</type>
</entity>
</entities>

In all, entity extraction found 64 entities in this document, of which only a small sample is shown here. Entity extraction finds various types of entities, such as cities, states, companies, and currencies. It also reports the location in the document where each entity was found (the offset and length attributes), as well as which source was used to identify it (the source attribute). In this business-related document, there were many examples of companies, currencies, and percentages. In addition, because this was a business news article focusing on the U.S., we see many U.S. cities and states in the entity extraction output.
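
Because the extraction result is stored as XML in the entities table, it can be inspected with ordinary SQL. The following query is a minimal sketch, not part of the entity extraction API: it assumes that the ents column holds a well-formed document with an entities root element and entity children shaped like the sample output, and the column alias names are illustrative. It counts how many entities of each type were found in document 1.

```sql
-- Sketch: count extracted entities per type for document 1.
-- Assumption: entities.ents holds well-formed XML with an <entities> root
-- and <entity> children that each contain a <type> element.
select x.entity_type,
       count(*) as occurrences
from   entities e,
       xmltable('/entities/entity'
                passing xmltype(e.ents)
                columns entity_type varchar2(64) path 'type') x
where  e.id = 1
group  by x.entity_type
order  by occurrences desc;
```

XMLTABLE turns each entity element into one relational row, so the usual GROUP BY and ORDER BY clauses apply. Dropping the WHERE clause would give counts across every document stored in the table.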

This example shows the most basic use of entity extraction. From this starting point, we can start to augment the extraction policy to improve the precision and recall of the results.

Example 12-2 Extracting Only Certain Entity Types

You can limit what types of entities are extracted in the call to CTX_ENTITY.EXTRACT by specifying a comma-delimited list of entity types to extract:

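declare
  mydoc clob;
  myresults clob;
begin
  select txt into mydoc from docs where id=1;
  --extract only the city and company entity types
  ctx_entity.extract('p1', mydoc, null, myresults, 'city,company');
  --entities.id is a primary key: delete any row already saved for id 1 before rerunning
  insert into entities values(1, myresults);
  commit;
end;
/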

This call to CTX_ENTITY.EXTRACT extracts only cities and companies. Limiting the entity types can improve performance, especially when user-defined rules are in use.

Example 12-3 Improving Entity Extraction by Adding User-Supplied Rules

In certain scenarios, the supplied entity types might not be descriptive enough. For example, in a discussion about the stock market, a person might want to identify which stocks are gaining or losing, just from unstructured text. Writing user rules with user-defined types provides a way to accomplish this.

Suppose that stocks that are gaining are described by the pattern "climbed n percent" or "jumped n percent" where n is a number. We can write a regular expression to describe this pattern as:

(climbed|jumped) \d+(\.\d+)? percent

This regular expression will match phrases of the form "climbed 5 percent" and "jumped 5.3 percent".
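
Before attaching a pattern to a policy, it can be useful to try it against a few sample strings. The query below is a rough sanity check, assuming a database release whose SQL regular expression functions accept the \d shorthand. It uses REGEXP_SUBSTR rather than the entity extraction engine, so it only approximates how the rule will behave; the sample sentences and column aliases are made up for illustration.

```sql
-- Sketch: try the pattern against three made-up sentences.
-- REGEXP_SUBSTR returns the matching substring, or NULL when nothing matches.
select regexp_substr('Shares climbed 5 percent today',
                     '(climbed|jumped) \d+(\.\d+)? percent') as sample_1,
       regexp_substr('The index jumped 5.3 percent',
                     '(climbed|jumped) \d+(\.\d+)? percent') as sample_2,
       regexp_substr('Prices fell 2 percent',
                     '(climbed|jumped) \d+(\.\d+)? percent') as sample_3
from   dual;
```

The first two columns should contain "climbed 5 percent" and "jumped 5.3 percent", while the third should be NULL because "fell" is not one of the verbs in the pattern.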

The procedure CTX_ENTITY.ADD_EXTRACT_RULE is used to associate a regular expression with an extraction policy. This procedure takes in a short XML string that contains the regular expression, which portion of the regular expression corresponds to the actual entity, and what type to assign the extracted text. User-defined types must be prefixed with the letter "x".

Suppose that we name our new entity type xPositiveGain. Then the extraction rule will look like the following:

<rule>
  <expression>((climbed|jumped) \d+(\.\d+)? percent)</expression>
  <type refid="1">xPositiveGain</type>
</rule>

Note the refid attribute in the type tag. The refid corresponds to a backreference number and indicates which text to extract. Here, refid="1" selects the first backreference, that is, the text enclosed by the first (outermost) set of parentheses. This text will be extracted and classified as an xPositiveGain entity.
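
To see which text each set of parentheses captures, the subexpression argument of REGEXP_SUBSTR can serve as a stand-in. This is only an illustration under the assumption that the rule engine numbers backreferences by opening parenthesis, as described above; the sample sentence and column aliases are hypothetical, and the subexpression argument requires a release that supports it (Oracle Database 11g or later).

```sql
-- Sketch: show what each parenthesized group of the rule expression captures.
-- The sixth argument of REGEXP_SUBSTR selects a subexpression by number.
select regexp_substr(s, p, 1, 1, null, 1) as group_1,
       regexp_substr(s, p, 1, 1, null, 2) as group_2,
       regexp_substr(s, p, 1, 1, null, 3) as group_3
from  (select 'The stock jumped 8.7 percent' as s,
              '((climbed|jumped) \d+(\.\d+)? percent)' as p
       from   dual);
```

Group 1 is the whole phrase "jumped 8.7 percent", group 2 is the verb "jumped", and group 3 is the optional fractional part ".7". With refid="1", the rule therefore extracts the whole phrase rather than just the verb or the digits.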

Next, we add the extraction rule to the policy, and then recompile the policy:

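begin
  --build the rule XML by concatenation so that no line break ends up inside the regular expression
  ctx_entity.add_extract_rule('p1', 1,
    '<rule><expression>((climbed|jumped) \d+(\.\d+)? percent)</expression>' ||
    '<type refid="1">xPositiveGain</type></rule>');
  ctx_entity.compile('p1');
end;
/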

Now we can rerun the extraction, and will obtain the following additional entities:

<entity id="65" offset="2067" length="19" source="UserRule" ruleid="1">
  <text>climbed 3.8 percent</text>
  <type>xPositiveGain</type>
</entity>
<entity id="66" offset="3505" length="18" source="UserRule" ruleid="1">
  <text>jumped 8.7 percent</text>
  <type>xPositiveGain</type>
</entity>

Example 12-4 Improving Entity Extraction by Adding a User-Supplied Dictionary

In certain scenarios, you might want to add fixed terms to a user-supplied dictionary. For example, in a discussion about the stock market, you might want to extract the stock indexes that are mentioned. These terms can be placed in a user-defined dictionary that is then used in the extraction.

First, we write a dictionary describing the additional terms. The user-defined dictionary is another XML document describing entities and their types:

<dictionary>
  <entities>
    <entity>
      <value>Dow Jones Industrial Average</value>
      <type>xIndex</type>
    </entity>
    <entity>
      <value>S&amp;P 500</value>
      <type>xIndex</type>
    </entity>
  </entities>
</dictionary>

We then save the dictionary to a file (ud.xml in this example) and load it into policy p1 using CTXLOAD, the Oracle Text command-line loading tool:

ctxload -user scott/tiger -extract -name p1 -file ud.xml

Then, we compile the policy and rerun the extraction:

exec ctx_entity.compile('p1');

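declare
  mydoc clob;
  myresults clob;
begin
  select txt into mydoc from docs where id=1;
  --run extraction with the recompiled policy p1
  ctx_entity.extract('p1', mydoc, null, myresults);
  --entities.id is a primary key: delete any row already saved for id 1 before rerunning
  insert into entities values(1, myresults);
  commit;
end;
/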

We now are able to find the following additional entities:

<entity id="67" offset="1231" length="28" source="UserDictionary">
  <text>Dow Jones Industrial Average</text>
  <type>xIndex</type>
</entity>
<entity id="68" offset="1422" length="7" source="UserDictionary">
  <text>S&amp;P 500</text>
  <type>xIndex</type>
</entity>
<entity id="69" offset="1010" length="7" source="UserDictionary">
  <text>S&amp;P 500</text>
  <type>xIndex</type>
</entity>
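
Once user rules and dictionaries are in play, it can help to look only at the entities they contributed. The following query is a sketch along those lines rather than a documented interface: it assumes the stored result is well-formed XML and that user-supplied entries carry a source attribute beginning with "User", as in the sample output. The column alias names are illustrative.

```sql
-- Sketch: list entities for document 1 that came from user-supplied
-- rules or dictionaries, ordered by their position in the document.
-- Assumption: user-supplied entries have a source attribute starting with 'User'.
select x.entity_type,
       x.entity_text,
       x.entity_source,
       x.start_offset
from   entities e,
       xmltable('/entities/entity'
                passing xmltype(e.ents)
                columns entity_type   varchar2(64)  path 'type',
                        entity_text   varchar2(400) path 'text',
                        entity_source varchar2(30)  path '@source',
                        start_offset  number        path '@offset') x
where  e.id = 1
and    x.entity_source like 'User%'
order  by x.start_offset;
```

A listing like this makes it easier to judge whether a new rule or dictionary entry is matching the intended text before tuning the policy further.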