3.5 Classifier

The classifier makes use of classifications assigned to entity tags stored in an EntityTagset object.

The classifier finds named entities using a maximum entropy algorithm adapted from YASMET by Franz Josef Och. First, training data must be produced from a corpus of annotated documents: each document is tokenised, and features are computed from the tokens. The training data has the following format:

     <num categories>
     <classification> @ @ <weight of class> <feature 1> <value 1> <feature 2> <value 2>....# @ <weight of class> <feature 1> <value 1> ....

The feature set is duplicated once per class, and the feature names in each copy are changed to identify the class that the copy belongs to. The actual training data looks like:

     13
     0 @ @ 1 cat0_alphnum 1 cat0_caps 2 .... # @ 0 cat1_alphnum 1 cat1_caps 2 .... # @ 0 cat2_alphnum 1 .... cat12_found6 0 #

This duplication is redundant, because each feature has the same value in every class's copy. It is a legacy of the YASMET code.
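The layout described above can be sketched in a few lines of Python. This is only an illustration, not the code used by the classifier: the function name `format_training_line` is hypothetical, and it assumes that the weight of a class is 1 for the annotated class and 0 for every other class, as the sample line suggests.

```python
def format_training_line(true_class, features, num_classes):
    """Build one training line for a single token.

    true_class  -- index of the annotated class for this token
    features    -- ordered list of (name, value) pairs, e.g. [("caps", 2)]
    num_classes -- number of categories (the first line of the file)
    """
    parts = [str(true_class), "@"]
    for cls in range(num_classes):
        # Assumption: weight 1 marks the annotated class, 0 all others.
        weight = 1 if cls == true_class else 0
        parts.append("@")
        parts.append(str(weight))
        # The same features are repeated for every class; only the
        # class prefix in the feature name changes.
        for name, value in features:
            parts.append(f"cat{cls}_{name}")
            parts.append(str(value))
        parts.append("#")
    return " ".join(parts)


if __name__ == "__main__":
    print(2)  # number of categories
    print(format_training_line(0, [("alphnum", 1), ("caps", 2)], 2))
```

With two classes and two features, the second line printed has the same shape as the sample above, only shorter.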

Once the training data has been generated, a model can be produced by running the original YASMET code on it. The resulting model file is the outcome of a mathematical analysis of the training data, and it is what the classifier uses during classification.

For classification, the model file is given to the decorator class. The MaxEnt class is initialised with the model file, and the model is read so that data can be classified. A feature vector is computed for each token, and the classifier returns a classification based on the feature values in that vector.
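As a rough illustration of the classification step, the sketch below applies the usual maximum entropy decision rule: each class is scored by a weighted sum of its feature values, and the scores are normalised into probabilities. It is not the MaxEnt class itself. The function `classify` and the `weights` dictionary are hypothetical, and the dictionary merely stands in for whatever the model file contains, since the model file layout is not described here.

```python
import math


def classify(features, weights, num_classes):
    """Return (best_class, probabilities) for one token.

    features    -- list of (name, value) pairs for the token
    weights     -- hypothetical mapping from class-specific feature
                   names such as "cat1_caps" to learned weights
    num_classes -- number of categories
    """
    scores = []
    for cls in range(num_classes):
        # Features missing from the model contribute nothing.
        score = sum(
            weights.get(f"cat{cls}_{name}", 0.0) * value
            for name, value in features
        )
        scores.append(score)

    # Subtract the maximum before exponentiating for numerical stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]

    best = max(range(num_classes), key=probs.__getitem__)
    return best, probs


if __name__ == "__main__":
    toy_weights = {"cat0_caps": 0.0, "cat1_caps": 1.0}
    print(classify([("caps", 2)], toy_weights, 2))
```

In this toy model only class 1 gives the `caps` feature a positive weight, so a token with a non-zero `caps` value is assigned to class 1.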