Lists are handled by the ListHandler class, which utilises Menno van Zaanen's suffixtree implementation. The ListHandler class pairs each list with an EntityTag, allowing Entities to be created from list matches.
The list of tokens is traversed, and each token is searched for in a suffixtree built from a concatenation of all entities in several lists.
A search is conducted for each token in the string, and the largest possible match is found. First, the token's string alone is checked. If this matches, it is recorded as the largest match. Then the token is checked together with the token that follows it. If this matches as a complete string, the largest match is reset to this longer string. The process is repeated until adding the next token no longer produces a match.
Tokens are not concatenated with one another to build the search string. As each token records only its offsets in the string, the search string is the text found in the original string within the bounds of the offsets indicated by the tokens. In this way, only exact matches are found, and coincidental matches are not possible.
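The matching procedure described above can be sketched roughly as follows. This is an illustration only, not the ListHandler code: a plain Python set with a prefix test stands in for the suffixtree, the tokenise function is a hypothetical regular-expression tokeniser, and the stopping rule (stop once no list entry begins with the candidate string) is one reading of "until the next token no longer matches". The point it shows is that the candidate string is always a slice of the original text between token offsets, never a join of token strings.

```python
import re
from typing import List, Optional, Set, Tuple

Span = Tuple[int, int]  # (start, end) character offsets into the original text


def tokenise(text: str) -> List[Span]:
    """Hypothetical tokeniser: each token records only its offsets."""
    return [m.span() for m in re.finditer(r"\w+|[^\w\s]", text)]


def longest_match(text: str, tokens: List[Span], i: int,
                  entries: Set[str]) -> Optional[Span]:
    """Return the offsets of the longest list entry starting at token i, or None."""
    start = tokens[i][0]
    best = None
    for _, end in tokens[i:]:
        # The search string is cut from the original text, so spacing and
        # punctuation between tokens are preserved exactly.
        candidate = text[start:end]
        if candidate in entries:
            best = (start, end)  # complete match: reset the largest match
        # Assumed stopping rule: give up once no entry starts with the candidate.
        if not any(entry.startswith(candidate) for entry in entries):
            break
    return best


text = "He visited New York City, not York."
tokens = tokenise(text)
entries = {"New York", "New York City", "York"}

span = longest_match(text, tokens, 2, entries)  # token 2 is "New"
matched = text[span[0]:span[1]]
```

Here the search starting at "New" first finds "New York", then extends to "New York City", and stops when the comma is added, so the longer entry is the one kept.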
Tokens that match a list are marked up, and the matches are used as features in the recogniser. List matches are also added as NamedEntity objects.
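One way to picture this last step is sketched below. Everything here is an assumption made for illustration: the NamedEntity dataclass is a stand-in whose fields are guessed, a plain string takes the place of an EntityTag, and the "O" label for unmatched tokens is a convention borrowed from common tagging schemes rather than anything the recogniser is known to use.

```python
from dataclasses import dataclass
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets into the original text


@dataclass
class NamedEntity:
    """Hypothetical stand-in; the real class's fields are not described here."""
    start: int
    end: int
    tag: str  # a string standing in for an EntityTag


def mark_up(tokens: List[Span], matches: List[Tuple[Span, str]]):
    """Return one list-feature label per token, plus an entity per match."""
    features = ["O"] * len(tokens)  # "O" marks tokens outside any list match
    entities = []
    for (start, end), tag in matches:
        entities.append(NamedEntity(start, end, tag))
        for k, (t_start, t_end) in enumerate(tokens):
            # A token is marked if it lies wholly inside the matched span.
            if t_start >= start and t_end <= end:
                features[k] = tag
    return features, entities


# Offsets for the tokens of "He visited New York City"
tokens = [(0, 2), (3, 10), (11, 14), (15, 19), (20, 24)]
features, entities = mark_up(tokens, [((11, 24), "LOCATION")])
```

The feature list can then be supplied to the recogniser alongside its other per-token features, while the entity list holds the list matches as entities in their own right.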