Example of Creating a New Entity Type Using a User-defined Rule

The example in this section shows how to create a new entity type using a user-defined rule. Rules are defined using a regular-expression-based syntax. The rule is added to an extraction policy, and will then be applied whenever that policy is used.

The rule identifies expressions that describe an increase, for example in a stock index. An increase can be expressed in many ways, and we want the rule to match any of the following:

  climbed by 5%
  increased by over 30 percent
  jumped 5.5%

Therefore, we will create a regular expression that matches any of these, and create a new entity type for it. The names of user-defined entity types must start with the letter "x", so we will call our entity type "xPositiveGain". The rule is added as follows:

  ctx_entity.add_extract_rule( 'mypolicy', 1,
    '<rule>'                                                          ||
      '<expression>'                                                  ||
         '((climbed|gained|jumped|increasing|increased|rallied)'      ||
         '( (by|over|nearly|more than))* \d+(\.\d+)?( percent|%))'    ||
      '</expression>'                                                 ||
      '<type refid="1">xPositiveGain</type>'                          ||
    '</rule>');

Note the use of refid in the example. It specifies which part of the regular expression is returned as the entity, by referencing a pair of parentheses within the expression. In our case we want the entire expression, which is enclosed by the outermost (and first-occurring) pair of parentheses, so refid is 1.
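The pattern and the group that refid="1" points at can be tried outside the database. The following is a minimal sketch in Python, assuming that Python's re module reads this particular pattern the same way the rule engine does (the two regular expression dialects are not guaranteed to agree in general). The names PATTERN and find_gains and the sample sentences are illustrative only and are not part of CTX_ENTITY.

```python
import re

# Same pattern as in the rule above. Group 1 is the outermost pair of
# parentheses, the span that refid="1" refers to in the rule.
PATTERN = re.compile(
    r"((climbed|gained|jumped|increasing|increased|rallied)"
    r"( (by|over|nearly|more than))* \d+(\.\d+)?( percent|%))"
)


def find_gains(text):
    """Return the text captured by group 1 for every match in `text`."""
    return [m.group(1) for m in PATTERN.finditer(text)]


# Illustrative sentences built around the three target expressions.
samples = [
    "The index climbed by 5% today.",
    "Shares increased by over 30 percent.",
    "The stock jumped 5.5% at the open.",
]

for sentence in samples:
    print(find_gains(sentence))
```

In this sketch, group 2 would hold only the verb (for example "climbed"), which illustrates why the rule references the first pair of parentheses rather than a later one.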

After adding the rule, you must compile the policy with CTX_ENTITY.COMPILE:

  ctx_entity.compile('mypolicy');

Then we can use it as before:

  ctx_entity.extract('mypolicy', mydoc, null, myresults);

The (abbreviated) output of this is:

<entities>
  ...
  <entity id="6" offset="72" length="18" source="UserRule" ruleid="1">
    <text>climbed by over 5%</text>
    <type>xPositiveGain</type>
  </entity>
</entities>
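Once the result has been fetched into a client program, the entities can be picked out with any XML parser. The following is a minimal sketch in Python, assuming the result has the shape shown above. SAMPLE_RESULT is a hand-written stand-in for the real output, and entities_of_type is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for the extraction result, modelled on the
# abbreviated output above (assumption: real output has the same shape).
SAMPLE_RESULT = """<entities>
  <entity id="6" offset="72" length="18" source="UserRule" ruleid="1">
    <text>climbed by over 5%</text>
    <type>xPositiveGain</type>
  </entity>
</entities>"""


def entities_of_type(result_xml, entity_type):
    """Return text, offset and length for each entity of the given type."""
    root = ET.fromstring(result_xml)
    found = []
    for entity in root.findall("entity"):
        if entity.findtext("type") == entity_type:
            found.append({
                "text": entity.findtext("text"),
                "offset": int(entity.get("offset")),
                "length": int(entity.get("length")),
            })
    return found


gains = entities_of_type(SAMPLE_RESULT, "xPositiveGain")
print(gains)
```

Note that in the sample the length attribute (18) equals the number of characters in the matched text.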

Finally, we will add another user-defined entity type, this time using a dictionary. We want to recognize "Dow Jones Industrial Average" as an entity of type xIndex, and "S&P 500" as well. To do that, we create an XML file containing the following:

<dictionary>
  <entities>
    <entity>
      <value>dow jones industrial average</value>
      <type>xIndex</type>
    </entity>
    <entity>
      <value>S&amp;P 500</value>
      <type>xIndex</type>
    </entity>
  </entities>
</dictionary>

Case is not significant in this file, but note how the "&" in "S&P" must be specified as the XML entity &amp;. Otherwise, the XML would not be valid.
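If the dictionary is generated by a program rather than written by hand, an XML library can take care of the escaping. The following is a minimal sketch in Python, assuming the element layout shown above. The function name build_dictionary is hypothetical, and the output is written on a single line without indentation.

```python
import xml.etree.ElementTree as ET


def build_dictionary(entries):
    """Build dictionary XML from (value, type) pairs.

    ElementTree escapes special characters such as "&" when serializing,
    so values can be passed in as plain text.
    """
    root = ET.Element("dictionary")
    entities = ET.SubElement(root, "entities")
    for value, entity_type in entries:
        entity = ET.SubElement(entities, "entity")
        ET.SubElement(entity, "value").text = value
        ET.SubElement(entity, "type").text = entity_type
    return ET.tostring(root, encoding="unicode")


xml_text = build_dictionary([
    ("dow jones industrial average", "xIndex"),
    ("S&P 500", "xIndex"),  # plain "&" here; escaped on output
])
print(xml_text)
```

Parsing the generated text back with the same library returns the original "S&P 500" value, which is a quick way to confirm that the file is valid XML before loading it.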

This XML file is loaded into the system using the CTXLOAD utility. If the file were called dict.load, we would use the following command:

ctxload -user username/password -extract -name mypolicy -file dict.load

After the dictionary is loaded, you must compile the policy again using CTX_ENTITY.COMPILE.