Running AFNER - AFNER Documentation

2 Running AFNER

AFNER has four modes:

--run In the running mode AFNER uses a model to classify an unseen document. This is the default option.
--train In the training mode AFNER builds a new model based on the documents provided. There are several models available. Use this mode only if you want to create a new model for a different corpus.
--dump In the dumping mode AFNER produces a feature file that uses YASMET format. AFNER does not perform any training or classification.
--count In the counting mode the words of the training corpus are counted to build frequency lists. This needs to be done once prior to training if the features PrevClass and ProbClass are used. In the current implementation these features lower the results of AFNER, so it is recommended not to run AFNER in this mode; skipping it effectively disables the PrevClass and ProbClass features.

Program options can be set either on the command line or in a configuration file. The configuration file is specified with the --config-file or -c option. When many options are needed, it may be more convenient to set them in a configuration file.

     afner -c afner_config.cfg

2.1 Program Options

The options available only on the command line are:

-h [--help]: produces a help message
-c [--config-file] <filename>: specifies the file containing settings to use.

The following options can be set either on the command line or in a configuration file, and they can apply to testing, to training, or to both.

2.1.1 Common Options

The common options across all modes are:

-P [--path] <pathname>: a directory containing the data used for training or testing. One or more -P or -f are required.
-f [--file] <filename>: a specific file to use for training or testing. One or more -P or -f are required.
-C [--context] <number>: optional parameter to control the range of contextual features used.
--entity_list <filename>: file containing location of entity lists paired with entity tags. It defaults to config/list_spec.
-g [--tagset] <filename>: file containing list of tags to use. The same tagset should be used for testing and training. Default is config/BBN_tags.
-x [--regex-file] <filename>: file containing regular expressions used by the recogniser. Matches to these regular expressions will be added as entities, and partial matches will be used as features of each token. The same regular expressions should be used for both testing and training. Default is config/regex.
-y [--feature-regex-file] <filename>: file containing regular expressions used only in classification. Each regular expression listed here will have a corresponding feature to reflect a match. Unlike the broad regular expressions given with -x, matches to these regular expressions do not create named entities. Again, the same file should be used for testing and training. Default is config/feature_regex.
--feature-weight <feature name> <int>: sets the weight of a feature. The list of features available and their weights is printed when AFNER is run. This option can be repeated to specify weights to several features.
--default-feature-weight <int>: sets the default weight of all features that are not specified with the option --feature-weight.
--token-frequency-input <filename>: a file containing token frequencies computed during the first training; this file needs to be created once (using the mode for counting) before any training is done.
--prev-token-frequency-input <filename>: a file containing the frequencies of the previous token; this file needs to be created once (using the mode for counting) before any training is done.

2.1.2 Testing (mode `--run`)

The options specific for testing are:

--run: the indicator that sets the testing (aka running) mode. This is the default option.
-M [--model-file] <filename>: the model file to use in classification. Default is config/BBN.mdl.
-F [--format] <NORMAL|SHORT>: The format of output from the recogniser. Either NORMAL or SHORT. Default is NORMAL.
-L [--max-labels] <int> : the maximum number of labels (classifications) that can be assigned to a token.
-S [--single]: allow only single classification per token. Default is to allow multiple classifications.
-e [--threshold] <float>: optional - the minimum probability allowed for named entities, between 0.0 and 1.0. The default is 0.0.
-O [--output-path] <pathname>: location to dump the output. Default is 'afner-output'

2.1.3 Training (mode `--train`)

The options specific for training are:

--train: the indicator that sets the training mode.
-D [--training-data-file] <filename>: the location to print training data for use by YASMET. This is a REQUIRED option.
--output-model-file <filename>: the model file to be generated. This is a REQUIRED option.

2.1.4 Dumping (mode `--dump`)

The dumping mode does not have any specific options.

2.1.5 Counting (mode `--count`)

The options specific for counting are:

--count: the indicator that sets the counting mode.
--token-frequency-output <filename>: the file to which the token frequencies are written; this file needs to be created once (using the mode for counting) before any training is done.
--prev-token-frequency-output <filename>: the file to which the frequencies of the previous token are written; this file needs to be created once (using the mode for counting) before any training is done.

2.2 Run Mode

In the run mode the program expects a model file (there are several model files available in the directory src/data) and a set of files. The output is a set of files with the named entities marked up as offsets of the original files. A typical run would be like this:

     afner -P inputPath -O outputPath

This finds the entities in all files stored in inputPath using the default model config/BBN.mdl, which is based on the BBN corpus adapted to the MUC tags.

It is possible to specify other models by using the option -M. There are several models available in the directory data. Alternatively, a new model can be generated using AFNER in training mode. It can also be generated by running the YASMET code on the data dumped by AFNER in dumping mode (see Classifier).

The resulting named entities are written to files in the directory specified by the ‘-O’ option; each output file has the same name as the corresponding input file. The results directory is relative to the location of the file being tested (see Output). If the directory does not exist, AFNER will attempt to recreate the directory structure.
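
The run-mode options described in this chapter can be combined on one command line. The following is only a sketch: the directory names are placeholders, the threshold 0.5 is an arbitrary illustrative value, and the run helper with its DRY_RUN variable is not part of AFNER. The helper simply prints the command so that it can be checked before it is executed. The model path is the default given in the option list.

```shell
#!/bin/sh
# run() and DRY_RUN are illustrative helpers, not part of AFNER.
# With DRY_RUN unset or 1 the command is printed; with DRY_RUN=0 it is executed.
run() { if [ "${DRY_RUN:-1}" = 1 ]; then echo "$*"; else "$@"; fi; }

# Placeholder directories; replace with real paths.
INPUT_DIR=inputPath
OUTPUT_DIR=outputPath

# Explicit run mode with the default model, SHORT output format,
# a single classification per token, and a minimum probability of 0.5.
run afner --run -P "$INPUT_DIR" -O "$OUTPUT_DIR" \
    -M config/BBN.mdl -F SHORT -S -e 0.5
```

Setting DRY_RUN=0 in the environment makes the same script execute the command instead of printing it.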

2.3 Train Mode

To run in train mode AFNER requires:

either training data files (-f [--file] <filename>) or a directory containing training data files (-P [--path] <pathname>)
a file to store training data (-D [--training-data-file] <filename>)
a file containing the tagset to use (-g [--tagset] <filename>)
a model file (--output-model-file <filename>) that will be generated after calling YASMET.
if the features PrevClass or ProbClass are used, AFNER requires the files with the token frequencies (--token-frequency-input <filename> and --prev-token-frequency-input <filename>). See the section Count mode below for further details on this.

The following is an example of a typical training run:

     afner --train --output-model-file modelfile -D yasmetDataFile \
     -P trainPath

This example uses all the files in trainPath for training the system and produces the model file modelfile. It also produces the raw input data for YASMET yasmetDataFile.

     afner --train --output-model-file modelfile -D yasmetDataFile \
     -f trainFile

This example uses only one file trainFile for training the system and produces the model file modelfile and the raw input data for YASMET yasmetDataFile.

     afner --train --output-model-file modelfile -D yasmetDataFile \
     -P trainPath1 -P trainPath2 -f trainFile1 -f trainFile2

This example uses the files trainFile1 and trainFile2 plus all files from trainPath1 and trainPath2.

2.4 Dump Mode

The dump mode is exactly the same as the train mode, except that no model is generated. Instead, the file specified by the option --training-data-file is generated, containing the features in a format that YASMET understands.
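
A dump-mode call might look like the sketch below. It assumes that input is given with -P or -f as in train mode and that --output-model-file can be left out because no model is generated; this manual does not confirm the second point, so add the option if AFNER asks for it. The file and directory names are placeholders, and the run helper is not part of AFNER: it only prints the command unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# run() and DRY_RUN are illustrative helpers, not part of AFNER.
# With DRY_RUN unset or 1 the command is printed; with DRY_RUN=0 it is executed.
run() { if [ "${DRY_RUN:-1}" = 1 ]; then echo "$*"; else "$@"; fi; }

# Placeholder names; replace with real paths.
TRAIN_DIR=trainPath
DATA_FILE=yasmetDataFile

# Write the YASMET feature file without training a model.
# Assumption: --output-model-file is not needed in this mode.
run afner --dump -D "$DATA_FILE" -P "$TRAIN_DIR"
```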

2.5 Count Mode

Some AFNER features (PrevClass and ProbClass) need information about token frequencies and previous token frequencies. Prior to any training, AFNER needs to be run in counting mode to generate these frequencies (options --token-frequency-output <filename> and --prev-token-frequency-output <filename>). However, in the current implementation the features PrevClass and ProbClass lower the results of AFNER, so it is recommended not to generate token frequencies.
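
If the frequency-based features are wanted despite the caveat above, counting and training could be chained as in the following sketch. It assumes that counting mode takes its input through the common -P option, as the other modes do. All file and directory names are placeholders, and the run helper is not part of AFNER: it only prints each command unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# run() and DRY_RUN are illustrative helpers, not part of AFNER.
# With DRY_RUN unset or 1 the commands are printed; with DRY_RUN=0 they are executed.
run() { if [ "${DRY_RUN:-1}" = 1 ]; then echo "$*"; else "$@"; fi; }

# Placeholder names; replace with real paths.
TRAIN_DIR=trainPath
TOKEN_FREQ=tokenFreqFile
PREV_FREQ=prevTokenFreqFile

# Step 1: counting mode writes the two frequency lists (needed only once).
run afner --count -P "$TRAIN_DIR" \
    --token-frequency-output "$TOKEN_FREQ" \
    --prev-token-frequency-output "$PREV_FREQ"

# Step 2: training mode reads the frequency lists back in.
run afner --train --output-model-file modelfile -D yasmetDataFile \
    -P "$TRAIN_DIR" \
    --token-frequency-input "$TOKEN_FREQ" \
    --prev-token-frequency-input "$PREV_FREQ"
```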

2.6 Evaluating the Results

A Python script is provided in the directory src/utilities that can be used to evaluate the accuracy of AFNER. An example run is:

     utilities/test.py -c RemediaAnnot/level4/ resultsNew/

This example uses the annotated corpus stored in RemediaAnnot/level4/ to evaluate the results that are in resultsNew/; these results are the output of AFNER. The evaluation results are sent to standard output.

The evaluation script assumes that the testing files used by AFNER have had all annotation markup removed before AFNER is called. The script utilities/remove_non_ent_tags.py can be used to remove all markup.