The RegexHandler class handles the regular expressions that are used to create entities and to provide features for matches spanning multiple tokens. It reads and stores the regular expressions and associates each one with a custom EntityTag.
Regular expressions can be specified in an external file. Within this file, expressions can be constructed from variables assigned in the file, and they are mapped to entity tags that are also specified in the file. The tags used need not correspond to those used by the classifier.
See the example file for more information on how to specify regular expressions and map them to entity tags.
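The idea of constructing expressions from variables and mapping them to tags can be sketched as follows. This is only an illustration in Python, not AFNER's RegexHandler: the `${NAME}` placeholder syntax, the `VARIABLES` and `PATTERNS` tables, and the `expand` and `compile_patterns` helpers are all assumptions made for the example, and Python's `re` syntax stands in for the Boost format. The real file syntax is the one shown in the example file.

```python
import re

# Hypothetical stand-ins for what a definitions file might provide.
# The ${NAME} placeholder syntax is an assumption, not AFNER's file format.
VARIABLES = {
    "DAY": r"(?:[12][0-9]|3[01]|0?[1-9])",
    "MONTH": r"(?:January|February|March)",
    "YEAR": r"(?:[12][0-9]{3})",
}

# Each entry pairs an entity tag with an expression built from variables.
PATTERNS = [
    ("DATE", "${DAY} ${MONTH} ${YEAR}"),
]


def expand(template, variables):
    """Replace every ${NAME} placeholder with the variable's definition."""
    def substitute(match):
        return variables[match.group(1)]
    return re.sub(r"\$\{(\w+)\}", substitute, template)


def compile_patterns(patterns, variables):
    """Return (tag, compiled regex) pairs for the expanded expressions."""
    return [(tag, re.compile(expand(template, variables)))
            for tag, template in patterns]
```

Calling `compile_patterns(PATTERNS, VARIABLES)` yields one compiled expression tagged `DATE`. Keeping the tag next to the expression mirrors the point above that the tags are declared alongside the expressions rather than taken from the classifier's tagset.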
Regular expressions used by AFNER must be written in the Boost regular expression format.
AFNER uses regular expressions in two ways. Firstly, long regular expressions are used to match named entities spanning multiple tokens. Each token within such a match receives a feature indicating that it is part of a larger match. Since named entities are created from the matches, an entity tag must be assigned to each regular expression provided. Because these expressions can become complex, they can be constructed from variables. See the example file config/regex_test for details of how to specify which regular expressions should be used.
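A rough sketch of this first use, in Python rather than AFNER's own code: a tagged expression is run over the whole text, every token overlapping a match is flagged, and an entity is recorded for the match. The whitespace `tokenise` function, the `multi_token_matches` helper, and the returned data shapes are assumptions for illustration only.

```python
import re


def tokenise(text):
    """Whitespace tokeniser returning (token, start, end); a stand-in only."""
    return [(m.group(0), m.start(), m.end())
            for m in re.finditer(r"\S+", text)]


def multi_token_matches(text, tagged_patterns):
    """Flag tokens covered by a match and collect (tag, matched text) pairs."""
    tokens = tokenise(text)
    in_match = [False] * len(tokens)
    entities = []
    for tag, pattern in tagged_patterns:
        for m in re.finditer(pattern, text):
            # A token is covered when its character span overlaps the match.
            for i, (_, start, end) in enumerate(tokens):
                if start < m.end() and end > m.start():
                    in_match[i] = True
            entities.append((tag, m.group(0)))
    return tokens, in_match, entities
```

With the pattern `("DATE", r"\d{1,2} (?:January|February|March) \d{4}")` and the text `Signed on 4 March 1998 in Sydney`, the three tokens `4`, `March` and `1998` are flagged and one `DATE` entity is recorded.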
Secondly, regular expression matches are used as features of individual tokens. These smaller regular expressions are used only as features of the tokens they match; no matching is done over more than one token. As such, these expressions cannot be constructed in the same way as the broader multi-token expressions. Additionally, no entity type is needed. See the example file config/feature_regex.
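The second use can be illustrated in the same hedged way. In this Python sketch each expression is tested against one token at a time and yields a boolean feature, with no entity tag involved. The feature names, the `FEATURE_PATTERNS` table, and the choice to require a whole-token match are assumptions for the example; AFNER's actual matching behaviour and file format may differ.

```python
import re

# Hypothetical feature expressions; names and patterns are illustrative only.
FEATURE_PATTERNS = {
    "all_caps": r"[A-Z]+",
    "four_digits": r"[0-9]{4}",
    "capitalised": r"[A-Z][a-z]+",
}


def token_features(token, patterns=FEATURE_PATTERNS):
    """Return one boolean feature per expression for a single token."""
    # fullmatch is an assumption: the whole token must match the expression.
    return {name: re.fullmatch(pattern, token) is not None
            for name, pattern in patterns.items()}
```

For example, `token_features("NASA")` sets only `all_caps`, while `token_features("Sydney")` sets only `capitalised`.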