The example in this section shows how to create a new entity type using a user-defined rule. Rules are defined using a regular-expression-based syntax. The rule is added to an extraction policy, and will then be applied whenever that policy is used.
The rule will identify increases, for example, in a stock index. There are many ways to express an increase. We want our rule to match any of the following expressions:
climbed by 5%
increased by over 30 percent
jumped 5.5%
Therefore, we will create a regular expression that matches any of these, and create a new entity type. The names of user-defined entity types must start with the letter "x", so we will call our entity type "xPositiveGain" as follows:
ctx_entity.add_extract_rule(
  'mypolicy',
  1,
  '<rule>' ||
    '<expression>' ||
      '((climbed|gained|jumped|increasing|increased|rallied)' ||
      '( (by|over|nearly|more than))* \d+(\.\d+)?( percent|%))' ||
    '</expression>' ||
    '<type refid="1">xPositiveGain</type>' ||
  '</rule>');
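Before adding the rule to a policy, it can be useful to try the expression against the three target phrases outside the database. The following is a minimal sketch in Python, not part of Oracle Text. It assumes that Python's re module interprets this particular pattern the same way the rule engine does, which seems reasonable because the pattern uses only grouping, alternation, \d, and simple quantifiers. The names PATTERN, samples, and matched are our own.

```python
import re

# The same pattern as in the <expression> element of the rule, as one string.
PATTERN = re.compile(
    r"((climbed|gained|jumped|increasing|increased|rallied)"
    r"( (by|over|nearly|more than))* \d+(\.\d+)?( percent|%))"
)

# The three phrases the rule is meant to recognize.
samples = [
    "climbed by 5%",
    "increased by over 30 percent",
    "jumped 5.5%",
]

# fullmatch() succeeds only if the pattern covers the entire phrase.
matched = [PATTERN.fullmatch(s) is not None for s in samples]

for sample, ok in zip(samples, matched):
    print(f"{sample!r}: {'match' if ok else 'no match'}")
```

A phrase that uses a verb outside the list, such as "fell by 5%", is not matched, so the rule only reports gains.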
Note the use of the refid attribute in the example. It specifies which part of the regular expression is returned as the entity, by referencing a pair of parentheses within the expression. In our case, we want the entire expression, so we reference the outermost (and first-occurring) pair of parentheses, which is refid="1".
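The numbering of parentheses can be illustrated with an ordinary regular expression library. In the Python sketch below, capturing groups are numbered by the position of their opening parenthesis, counting from the left; we assume refid follows the same convention, as the description above suggests. The sample sentence and variable names are invented for illustration.

```python
import re

# The same pattern as in the <expression> element of the rule.
PATTERN = re.compile(
    r"((climbed|gained|jumped|increasing|increased|rallied)"
    r"( (by|over|nearly|more than))* \d+(\.\d+)?( percent|%))"
)

# Hypothetical input text, not taken from the documentation.
text = "The index climbed by over 5% in early trading."
m = PATTERN.search(text)

# Group 1 is the outermost pair of parentheses: the whole phrase.
whole = m.group(1)
# Group 2 is the next opening parenthesis: the verb only.
verb = m.group(2)

print("group 1:", whole)
print("group 2:", verb)
```

Under that assumption, referencing group 2 instead of group 1 would select only the verb rather than the full phrase.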
In this case, it is necessary to compile the policy with CTX_ENTITY.COMPILE:
ctx_entity.compile('mypolicy');
Then we can use it as before:
ctx_entity.extract('mypolicy', mydoc, null, myresults);
The (abbreviated) output of this is:
<entities>
  ...
  <entity id="6" offset="72" length="18" source="UserRule" ruleid="1">
    <text>climbed by over 5%</text>
    <type>xPositiveGain</type>
  </entity>
</entities>
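Because the result is plain XML, the entities can be picked out with any XML parser. The following Python sketch filters the result for entities of our new type. It uses only the single entity shown above as input; the elided part of the output is left out rather than reconstructed. The variable names result_xml and gains are our own.

```python
import xml.etree.ElementTree as ET

# Sample result containing only the entity shown in the text.
result_xml = """<entities>
  <entity id="6" offset="72" length="18" source="UserRule" ruleid="1">
    <text>climbed by over 5%</text>
    <type>xPositiveGain</type>
  </entity>
</entities>"""

root = ET.fromstring(result_xml)

# Keep only entities produced for the user-defined type.
gains = [
    {
        "text": entity.findtext("text"),
        "offset": int(entity.get("offset")),
        "length": int(entity.get("length")),
    }
    for entity in root.iter("entity")
    if entity.findtext("type") == "xPositiveGain"
]

for gain in gains:
    print(gain)
```

In this sample, the length attribute agrees with the number of characters in the matched text.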
Finally, we will add another user-defined entity type, this time using a dictionary. We want to recognize "Dow Jones Industrial Average" as an entity of type xIndex. We will add "S&P 500" as well. To do that, we create an XML file containing the following:
<dictionary>
  <entities>
    <entity>
      <value>dow jones industrial average</value>
      <type>xIndex</type>
    </entity>
    <entity>
      <value>S&amp;P 500</value>
      <type>xIndex</type>
    </entity>
  </entities>
</dictionary>
Case is not significant in this file, but note how the "&" in "S&P" must be specified as the XML entity &amp;. Otherwise, the XML would not be valid.
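One way to avoid escaping mistakes is to generate the dictionary file with an XML library, which escapes special characters automatically. The following Python sketch builds the same structure as the file above. The helper build_dictionary is hypothetical and is not part of Oracle Text; it only produces the XML text, which would still have to be saved to a file and loaded.

```python
import xml.etree.ElementTree as ET

def build_dictionary(entries):
    """Return dictionary XML for a list of (value, type) pairs."""
    dictionary = ET.Element("dictionary")
    entities = ET.SubElement(dictionary, "entities")
    for value, type_name in entries:
        entity = ET.SubElement(entities, "entity")
        ET.SubElement(entity, "value").text = value
        ET.SubElement(entity, "type").text = type_name
    # Serializing escapes "&" in text content as "&amp;".
    return ET.tostring(dictionary, encoding="unicode")

xml_text = build_dictionary([
    ("dow jones industrial average", "xIndex"),
    ("S&P 500", "xIndex"),
])

print(xml_text)
```

Parsing the generated text again yields the original value "S&P 500", which confirms that the escaping round-trips.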
This XML file is loaded into the system using the CTXLOAD utility. If the file were called dict.load, we would use the following command:
ctxload -user username/password -extract -name mypolicy -file dict.load
After loading the dictionary, you must compile the policy using CTX_ENTITY.COMPILE.