Next: , Previous: Regular Expression Handler, Up: Components


3.3 Tokeniser

Each token is stored as an object whose private data members record the offset in the string at which the token occurs. The list of tokens retrieved skips any XML tags and lists punctuation marks as separate tokens.
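
A minimal sketch of what such a token object might look like follows; the member and accessor names here are illustrative assumptions, not the library's actual Token interface.

     #include <string>
     #include <cstddef>

     // Illustrative sketch only: the real Token class's members and
     // accessors may differ.
     class Token {
     public:
         Token(std::string text, std::size_t begin, std::size_t end)
             : text_(std::move(text)), begin_(begin), end_(end) {}

         const std::string& text() const { return text_; }
         std::size_t begin() const { return begin_; }  // offset of first character
         std::size_t end() const { return end_; }      // offset one past the last character

     private:
         std::string text_;   // the token's surface string
         std::size_t begin_;  // offsets into the source string
         std::size_t end_;
     };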

The string is searched for all matches of each regular expression, and each match is added as a named entity to the list of entities stored in the decorator ('NEDeco') object.
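
The sketch below illustrates that flow under stated assumptions: the NEDeco structure and its addEntity method are hypothetical stand-ins for the real decorator interface, and std::regex stands in for the library's own regular expression handler.

     #include <regex>
     #include <string>
     #include <utility>
     #include <vector>

     // Hypothetical decorator interface, for illustration only.
     struct NEDeco {
         struct Entity { std::string name; std::size_t begin, end; };
         std::vector<Entity> entities;
         void addEntity(const std::string& name, std::size_t b, std::size_t e) {
             entities.push_back({name, b, e});
         }
     };

     // Run every named regular expression over the text and record
     // each match as a named entity on the decorator.
     void decorate(const std::string& text,
                   const std::vector<std::pair<std::string, std::regex>>& patterns,
                   NEDeco& deco)
     {
         for (const auto& [name, re] : patterns) {
             for (std::sregex_iterator it(text.begin(), text.end(), re), end;
                  it != end; ++it) {
                 deco.addEntity(name, it->position(),
                                it->position() + it->length());
             }
         }
     }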

For example, the string:

     "Company operated, through a 50%-owned joint venture, 27 warehouses in Mexico. Something something $3.60 and $10."

is tokenised to become:

     'Company','operated',',','through','a','50','%-','owned','joint','venture',',','27','warehouses',
     'in','Mexico','.','Something','something','$','3','.','60','and','$','10','.'

Tokenisation is performed using the function tokenise(beginIterator, endIterator, skipXML=true), which returns a vector<Token>. The text between the two iterators is tokenised, and XML is skipped by default. Each token in the returned vector contains offsets relative to the start of the tokenised range.
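
A usage sketch of this interface is shown below. The header name is an assumption, and the begin()/end() accessors on Token follow the illustrative class above rather than the library's actual API.

     #include <string>
     #include <vector>
     // #include "tokeniser.h"   // assumed header providing tokenise() and Token

     void example()
     {
         std::string text = "Company operated, through a 50%-owned joint "
                            "venture, 27 warehouses in Mexico. "
                            "Something something $3.60 and $10.";

         // Tokenise the whole string; XML tags are skipped by default.
         std::vector<Token> tokens = tokenise(text.begin(), text.end());

         // Each token's offsets are relative to text.begin(), so its
         // surface string can be recovered from the source text.
         for (const Token& t : tokens) {
             std::string surface(text.begin() + t.begin(),
                                 text.begin() + t.end());
         }
     }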

Tokenisation can also retrieve information about marked-up named entities. The function tokeniseWithNEInfo performs this operation, returning a vector of 'NEToken's.
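
A sketch of how that call might be used follows; the isNamedEntity() accessor and the header name are assumptions, since the NEToken interface is not documented here.

     #include <string>
     #include <vector>
     // #include "tokeniser.h"   // assumed header providing tokeniseWithNEInfo() and NEToken

     void exampleNE(const std::string& markedUpText)
     {
         // Tokenise marked-up text, keeping track of which tokens fall
         // inside named-entity markup.
         std::vector<NEToken> neTokens =
             tokeniseWithNEInfo(markedUpText.begin(), markedUpText.end());

         for (const NEToken& t : neTokens) {
             // Hypothetical accessor: the real NEToken class may expose
             // its entity information differently.
             if (t.isNamedEntity()) {
                 // ... handle tokens inside a marked-up named entity ...
             }
         }
     }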