3.1 Entity Tagset

Named entity tags are represented by the EntityTag and EntityTagset classes. The EntityTag class stores a single tag and allows for an arbitrary level of granularity.

A named entity may be marked up like so:

     <ENAMEX TYPE="ORGANIZATION">Free Software Foundation</ENAMEX>

or with an additional level of granularity:

     <ENAMEX TYPE="ORGANIZATION:CORPORATION">Microsoft</ENAMEX>

The EntityTag class assumes at least two levels: the root tag (ENAMEX, NUMEX or TIMEX) and the TYPE. Further levels are written inside the TYPE attribute, separated by ':'.

An EntityTag object can be initialised in three ways: with the opening tag string; with the root tag and the sub-tags, given as two strings; or with a vector of strings in which each index corresponds to a level (i.e. 0 - root, 1 - type, 2 - sub-type, etc.).
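
The mapping from an opening tag string to its levels can be sketched as follows. This is an illustration in Python, not the library's own code: the helper name parse_levels and the regular expression are assumptions introduced here, and the real EntityTag class may parse its input differently.

```python
import re

# Hypothetical helper, not part of the library: matches an opening tag
# such as <ENAMEX TYPE="ORGANIZATION:CORPORATION">.
_TAG_RE = re.compile(r'<\s*(\w+)\s+TYPE\s*=\s*"([^"]*)"\s*>')


def parse_levels(opening_tag):
    """Split an opening tag string into a list of levels."""
    match = _TAG_RE.fullmatch(opening_tag.strip())
    if match is None:
        raise ValueError(f"not an opening entity tag: {opening_tag!r}")
    root, type_attr = match.groups()
    # Index 0 is the root, 1 the type, 2 the sub-type, and so on.
    return [root] + type_attr.split(":")


levels = parse_levels('<ENAMEX TYPE="ORGANIZATION:CORPORATION">')
print(levels)  # ['ENAMEX', 'ORGANIZATION', 'CORPORATION']
```

The resulting list has the same shape as the vector-of-strings form described above, with one entry per level.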

The EntityTagset class stores a collection of EntityTag objects and assigns each tag an index, from which a classification can be calculated. Each tag is assumed to have two classifications, Begin and In.
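
One way such a calculation could work is sketched below in Python. The text does not say how the classification is derived from the index, so the scheme used here (two consecutive class ids per tag, Begin first and In second) is purely an assumption, as are the names TagsetSketch, classification and decode.

```python
BEGIN, IN = 0, 1  # the two per-tag classifications named in the text


class TagsetSketch:
    """Illustrative stand-in, not the library's EntityTagset."""

    def __init__(self, tags):
        self.tags = list(tags)
        # Each tag receives an index according to its position.
        self.index = {tag: i for i, tag in enumerate(self.tags)}

    def classification(self, tag, position):
        # ASSUMPTION: class id = 2 * tag index + (0 for Begin, 1 for In).
        return 2 * self.index[tag] + position

    def decode(self, class_id):
        # Inverse of classification(): recover the tag and its position.
        tag_index, position = divmod(class_id, 2)
        return self.tags[tag_index], position


tagset = TagsetSketch(['<ENAMEX TYPE="LOCATION">', '<TIMEX TYPE="TIME">'])
class_id = tagset.classification('<TIMEX TYPE="TIME">', IN)
print(class_id, tagset.decode(class_id))
```

Under this assumed scheme a tagset of n tags yields 2n classes, and a classifier's output can be mapped back to a tag and a Begin/In position.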

An EntityTagset can be initialised with either an input stream or a string. Typically, a tagset is read from a file containing one opening tag per line, in the following format:

     <ENAMEX TYPE="LOCATION">
     <TIMEX TYPE="TIME">
     <ENAMEX TYPE="ORGANIZATION">
     <TIMEX TYPE="DATE">
     <NUMEX TYPE="MONEY">
     <ENAMEX TYPE="PERSON">

Since the classifications returned by the tagset are used by the classifier, it is important to use the same tagset for training and testing.
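
The short Python sketch below shows why this matters. It assumes that a tag's index follows its position in the tagset file, which the text does not state explicitly; read_tagset is a hypothetical helper, not a library function.

```python
import io


def read_tagset(stream):
    """Read one opening tag per line, skipping blank lines."""
    return [line.strip() for line in stream if line.strip()]


# Two tagsets holding the same tags in a different order.
training = read_tagset(io.StringIO(
    '<ENAMEX TYPE="LOCATION">\n<TIMEX TYPE="TIME">\n'))
testing = read_tagset(io.StringIO(
    '<TIMEX TYPE="TIME">\n<ENAMEX TYPE="LOCATION">\n'))

tag = '<TIMEX TYPE="TIME">'
# ASSUMPTION: the index of a tag is its position in the file.
print(training.index(tag), testing.index(tag))  # 1 0
```

If indices are assigned this way, a classifier trained with the first file would have its outputs misread under the second, even though both files list the same tags.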