Information regarding named entity tags is stored using the EntityTag and EntityTagset classes. The EntityTag class stores tag information and allows for an arbitrary level of granularity.
A named entity may be marked up as follows:
<ENAMEX TYPE="ORGANIZATION">Free Software Foundation</ENAMEX>
or with an additional level of granularity:
<ENAMEX TYPE="ORGANIZATION:CORPORATION">Microsoft</ENAMEX>
The EntityTag class assumes at least two levels: the root tag (ENAMEX, NUMEX or TIMEX) and the TYPE. Any further levels are given within the TYPE value, separated by ':'.
An EntityTag object can be initialised in one of three ways: with the opening tag string; with the root tag and the sub-tags, given as two strings; or with a vector of strings in which each index corresponds to a level (0 for the root, 1 for the type, 2 for the sub-type, and so on).
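The level structure described above can be illustrated with a short sketch. This is written in Python purely for illustration and is not the library's EntityTag class: the names TAG_PATTERN, parse_levels and levels_from_root_and_type are hypothetical, and the sketch assumes an opening tag carries exactly one TYPE attribute in double quotes.

```python
import re

# Hypothetical pattern (an assumption, not taken from the library): matches an
# opening tag such as <ENAMEX TYPE="ORGANIZATION:CORPORATION">.
TAG_PATTERN = re.compile(r'<\s*(\w+)\s+TYPE\s*=\s*"([^"]+)"\s*>')


def parse_levels(opening_tag):
    """Return [root, type, sub-type, ...] for an opening tag string."""
    match = TAG_PATTERN.fullmatch(opening_tag.strip())
    if match is None:
        raise ValueError(f"not a recognised opening tag: {opening_tag!r}")
    root, type_path = match.groups()
    # Levels after the root are separated by ':' inside the TYPE value.
    return [root] + type_path.split(":")


def levels_from_root_and_type(root, type_path):
    """Build the same level list from the root tag and sub-tags as two strings."""
    return [root] + type_path.split(":")


levels = parse_levels('<ENAMEX TYPE="ORGANIZATION:CORPORATION">')
# Index 0 is the root, 1 the type, 2 the sub-type.
print(levels)
```

Both helpers produce the same list of levels, which mirrors the third form of initialisation, where each index of the vector corresponds to one level.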
The EntityTagset class stores a collection of EntityTag objects and assigns each tag an index, from which a classification can be calculated. Each tag is assumed to have two classifications, Begin and In.
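The text does not say how a classification is calculated from a tag's index. The sketch below shows one plausible scheme, in which Begin maps to 2 * index and In maps to 2 * index + 1. That mapping is an assumption made for illustration, and the class TagsetSketch and its methods are hypothetical names, not the library's EntityTagset interface.

```python
class TagsetSketch:
    """Illustrative stand-in for a tagset; not the library's EntityTagset."""

    def __init__(self, tags):
        # Assign each distinct tag an index in order of first appearance.
        self._index = {}
        for tag in tags:
            self._index.setdefault(tag, len(self._index))

    def index_of(self, tag):
        return self._index[tag]

    def classification(self, tag, begin):
        # ASSUMPTION: Begin -> 2 * index, In -> 2 * index + 1.
        # The real library may compute this differently.
        return 2 * self._index[tag] + (0 if begin else 1)

    def num_classifications(self):
        # Two classifications (Begin and In) per tag.
        return 2 * len(self._index)


tagset = TagsetSketch(["LOCATION", "TIME", "ORGANIZATION"])
begin_time = tagset.classification("TIME", begin=True)
in_time = tagset.classification("TIME", begin=False)
```

Whatever the actual scheme, the classification depends on the index, and the index depends on the contents and order of the tagset.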
An EntityTagset can be initialised with either an input stream or string. Typically, a tagset will be read from a file of the following format:
<ENAMEX TYPE="LOCATION">
<TIMEX TYPE="TIME">
<ENAMEX TYPE="ORGANIZATION">
<TIMEX TYPE="DATE">
<NUMEX TYPE="MONEY">
<ENAMEX TYPE="PERSON">
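A tagset in the format above could be read from a stream along the following lines. This is a minimal sketch, not the library's reader: read_tagset and OPENING_TAG are hypothetical names, and the sketch assumes tags are separated by whitespace or newlines and indexed in order of first appearance.

```python
import io
import re

# Hypothetical pattern for one opening tag with a double-quoted TYPE attribute.
OPENING_TAG = re.compile(r'<(\w+)\s+TYPE="([^"]+)">')


def read_tagset(stream):
    """Return {(root, type): index}, indexed in order of first appearance."""
    tagset = {}
    for root, type_path in OPENING_TAG.findall(stream.read()):
        # Duplicate tags keep the index of their first occurrence.
        tagset.setdefault((root, type_path), len(tagset))
    return tagset


# An in-memory stream stands in for a tagset file here.
sample = io.StringIO(
    '<ENAMEX TYPE="LOCATION">\n'
    '<TIMEX TYPE="TIME">\n'
    '<ENAMEX TYPE="ORGANIZATION">\n'
)
tagset = read_tagset(sample)
```

Under this assumption, reordering the tags in the file changes the indices they receive, so the same file should be reused wherever the indices need to agree.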
Since the classifications returned by the tagset are used by the classifier, it is important to use the same tagset for training and testing.