2 Running AFNER
AFNER has four modes:
- --run
In the running mode AFNER uses a model to classify an unseen
document. This is the default option.
- --train
In the training mode AFNER builds a new model based on the documents
provided. There are several models available. Use this mode only if
you want to create a new model for a different corpus.
- --dump
In the dumping mode AFNER produces a feature file that uses YASMET
format. AFNER does not perform any training or classification.
- --count
In the counting mode the words of the training corpus are counted to
build frequency lists. This needs to be done once prior to training if
the features PrevClass and ProbClass are used. In the current
implementation these features lower the results of AFNER, so it is
recommended not to run AFNER in this mode; without the frequency lists
the PrevClass and ProbClass features are effectively disabled.
Program options can be set either on the command line or in a
configuration file. The configuration file is specified with the
--config-file or -c option. When many options are needed, it may be
more convenient to set them in a configuration file.
afner -c afner_config.cfg
2.1 Program Options
At command line the options available are:
- -h [--help]: produces a help message
- -c [--config-file] <filename>: specifies the file containing settings to use.
The following options can be set either at command line or in a
configuration file and they can be applied for testing, for training
or for both.
2.1.1 Common Options
The common options across all modes are:
- -P [--path] <pathname>: a directory containing the data
used for training or testing. At least one -P or -f option
is required.
- -f [--file] <filename>: a specific file to use for training
or testing. At least one -P or -f option
is required.
- -C [--context] <number>: optional parameter to control the
range of contextual features used.
- --entity_list <filename>: file containing location of
entity lists paired with entity tags. It defaults to config/list_spec.
- -g [--tagset] <filename>: file containing list
of tags to use. The same tagset should be used for testing and
training. Default is config/BBN_tags.
- -x [--regex-file] <filename>: file containing regular
expressions used by the recogniser. Matches to these regular
expressions will be added as entities, and partial matches will be
used as features of each token. The same regular expressions should be
used for both testing and training. Default is config/regex.
- -y [--feature-regex-file] <filename>: file containing
regular expressions used only in classification. Each regular
expression listed here will have a corresponding feature to reflect
a match. This differs from the broad regular expressions above in that
named entities will not be created from matches to these regular
expressions. Again, the same file should be used for testing and
training. Default is config/feature_regex.
- --feature-weight <feature name> <int>: sets the weight of
a feature. The list of features available and their weights is printed
when AFNER is run. This option can be repeated to specify weights to
several features.
- --default-feature-weight <int>: sets the default weight of
all features that are not specified with the option
--feature-weight.
- --token-frequency-input <filename>: a file containing token
frequencies computed from the training corpus; this file needs to be
created once (using the counting mode) before any training is done.
- --prev-token-frequency-input <filename>: a file containing
the frequencies of the previous token; this file needs to be created
once (using the counting mode) before any training is done.
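The weight options above can also be assembled programmatically. The
following Python sketch is not part of AFNER: feature_weight_args is a
hypothetical helper, and it assumes that --feature-weight takes the
feature name and the integer weight as two separate arguments, as the
option syntax above suggests. The weights shown are illustrative only.

```python
import shlex


def feature_weight_args(weights, default_weight=None):
    """Expand a {feature name: weight} mapping into AFNER options.

    Hypothetical helper: it only builds the argument list and does not
    check that the feature names exist in AFNER.
    """
    argv = []
    if default_weight is not None:
        argv += ["--default-feature-weight", str(int(default_weight))]
    # --feature-weight may be repeated, once per feature.
    for name, weight in weights.items():
        argv += ["--feature-weight", name, str(int(weight))]
    return argv


# Illustrative values only; inputPath is a placeholder directory.
weight_args = feature_weight_args({"PrevClass": 0, "ProbClass": 0},
                                  default_weight=1)
print(shlex.join(["afner", "-P", "inputPath"] + weight_args))
```

The valid feature names are those printed by AFNER when it is run.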
2.1.2 Testing (mode --run)
The options specific for testing are:
- --run: the indicator that sets the testing (aka running) mode. This
is the default option.
- -M [--model-file] <filename>: the model file to
use in classification. Default is config/BBN.mdl.
- -F [--format] <NORMAL|SHORT>: The format of output from the
recogniser. Either NORMAL or SHORT. Default is NORMAL.
- -L [--max-labels] <int>: the maximum number of labels
(classifications) that can be assigned to a token.
- -S [--single]: allow only single classification per
token. Default is to allow multiple classifications.
- -e [--threshold] <float>: optional - the minimum
probability allowed for named entities, between 0.0 and 1.0. The default is 0.0.
- -O [--output-path] <pathname>: the directory where the output is
written. Default is 'afner-output'.
2.1.3 Training (mode --train)
The options specific for training are:
- --train: the indicator that sets the training mode.
- -D [--training-data-file] <filename>: the location to print
training data for use by YASMET. This is a REQUIRED option.
- --output-model-file <filename>: the
model file to be generated. This is a REQUIRED option.
2.1.4 Dumping (mode --dump)
The dumping mode does not have any specific options.
2.1.5 Counting (mode --count)
The options specific for counting are:
- --count: the indicator that sets the counting mode.
- --token-frequency-output <filename>: the file to which the token
frequencies are written; this file needs to be created once (using the
counting mode) before any training is done.
- --prev-token-frequency-output <filename>: the file to which the
frequencies of the previous token are written; this file needs to be
created once (using the counting mode) before any training is
done.
2.2 Run Mode
In the run mode the program expects a model file (there are several
model files available in the directory src/data) and a set of
files. The output is a set of files with the named entities marked up as
offsets of the original files. A typical run would be like this:
afner -P inputPath -O outputPath
This finds the entities in all files stored in inputPath, using
the default model config/BBN.mdl, which is based on the BBN corpus
adapted to the MUC tags.
Other models can be specified with the option -M. There
are several models available in the directory data. Alternatively, a new
model can be generated using AFNER in training mode, or by
running the YASMET code on the data dumped by AFNER in dumping
mode (see Classifier).
The resulting named entities are written to files in the directory
specified by the -O option; each output file has the same name as the
corresponding input file. The results directory is relative to the
location of the file being tested (see Output). If the directory does
not exist, AFNER will attempt to recreate the directory structure.
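A run-mode command line can be assembled from the options described in
this section. The following Python sketch is not part of AFNER:
build_run_command is a hypothetical helper that only builds the argument
list, and it assumes that the mode indicator may be given together with
the other options in any order.

```python
import shlex


def build_run_command(paths=(), files=(), output_path=None, model_file=None,
                      threshold=None, single=False, executable="afner"):
    """Return an argv list for AFNER's run mode (hypothetical helper)."""
    if not paths and not files:
        raise ValueError("at least one -P or -f input is required")
    if threshold is not None and not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must lie between 0.0 and 1.0")

    argv = [executable, "--run"]
    for path in paths:
        argv += ["-P", path]
    for name in files:
        argv += ["-f", name]
    if model_file is not None:
        argv += ["-M", model_file]       # otherwise the default model is used
    if threshold is not None:
        argv += ["-e", str(threshold)]   # minimum entity probability
    if single:
        argv.append("-S")                # one classification per token
    if output_path is not None:
        argv += ["-O", output_path]
    return argv


# Same inputs as the typical run shown earlier in this section.
command = build_run_command(paths=["inputPath"], output_path="outputPath")
print(shlex.join(command))
```

The resulting list can be passed to subprocess.run(command, check=True)
on a machine where afner is installed.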
2.3 Train Mode
To run in train mode AFNER requires:
- either training data files (-f [--file] <filename>) or
a directory containing training data files (-P [--path]
<pathname>)
- a file to store training data (-D [--training-data-file]
<filename>)
- a file containing the tagset to use (-g [--tagset]
<filename>)
- a model file (--output-model-file <filename>) that will be
generated after calling YASMET.
- if the features PrevClass or ProbClass are used, the
files with the token frequencies (--token-frequency-input
<filename> and --prev-token-frequency-input <filename>). See
the section Count Mode below for further details.
The following is an example of a typical training run:
afner --train --output-model-file modelfile -D yasmetDataFile \
-P trainPath
This example uses all the files in trainPath for training the
system and produces the model file modelfile. It also writes
the raw input data for YASMET to yasmetDataFile.
afner --train --output-model-file modelfile -D yasmetDataFile \
-f trainFile
This example uses only one file, trainFile, for training the system.
It produces the model file modelfile and writes the raw input data for
YASMET to yasmetDataFile.
afner --train --output-model-file modelfile -D yasmetDataFile \
-P trainPath1 -P trainPath2 -f trainFile1 -f trainFile2
This example uses the files trainFile1 and trainFile2 plus
all files from trainPath1 and trainPath2.
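The training examples above differ only in their input options, so the
command line can be generated from lists of directories and files. The
following Python sketch is not part of AFNER: build_train_command is a
hypothetical helper that reproduces the option order used in the
examples and does not run anything.

```python
import shlex


def build_train_command(model_file, training_data_file, paths=(), files=(),
                        tagset=None, token_freq=None, prev_token_freq=None,
                        executable="afner"):
    """Return an argv list for AFNER's train mode (hypothetical helper)."""
    if not paths and not files:
        raise ValueError("at least one -P or -f input is required")

    argv = [executable, "--train",
            "--output-model-file", model_file,
            "-D", training_data_file]
    if tagset is not None:
        argv += ["-g", tagset]
    # Only needed when the PrevClass or ProbClass features are used.
    if token_freq is not None:
        argv += ["--token-frequency-input", token_freq]
    if prev_token_freq is not None:
        argv += ["--prev-token-frequency-input", prev_token_freq]
    for path in paths:
        argv += ["-P", path]
    for name in files:
        argv += ["-f", name]
    return argv


# Rebuilds the last example above: two directories plus two files.
command = build_train_command("modelfile", "yasmetDataFile",
                              paths=["trainPath1", "trainPath2"],
                              files=["trainFile1", "trainFile2"])
print(shlex.join(command))
```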
2.4 Dump Mode
The dump mode is exactly the same as the train mode, except that no
model is generated. Instead, the file specified by the option
--training-data-file is generated with the features in a
format that YASMET understands.
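A dump-mode command line might therefore look like a training command
without the model file. The following Python sketch is not part of
AFNER: build_dump_command is a hypothetical helper, and it assumes that
--output-model-file can be omitted in dump mode because no model is
generated; this section does not state that explicitly.

```python
import shlex


def build_dump_command(training_data_file, paths=(), files=(),
                       executable="afner"):
    """Return an argv list for AFNER's dump mode (hypothetical helper)."""
    if not paths and not files:
        raise ValueError("at least one -P or -f input is required")

    # Assumption: no --output-model-file, since dump mode builds no model.
    argv = [executable, "--dump", "--training-data-file", training_data_file]
    for path in paths:
        argv += ["-P", path]
    for name in files:
        argv += ["-f", name]
    return argv


# yasmetDataFile and trainPath are the placeholder names used in the
# training examples.
command = build_dump_command("yasmetDataFile", paths=["trainPath"])
print(shlex.join(command))
```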
2.5 Count Mode
Some of AFNER's features (PrevClass and ProbClass) need
information about token frequencies and previous token
frequencies. Prior to any training AFNER needs to be run in counting
mode to generate these frequencies (options
--token-frequency-output <filename> and
--prev-token-frequency-output <filename>). However, in the
current implementation the features PrevClass and ProbClass lower the
results of AFNER, so it is recommended not to generate token
frequencies.
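If the frequency-based features are wanted anyway, the counting step
and the later training step have to agree on the two frequency files.
The following Python sketch is not part of AFNER: both helpers and the
file names token.freq and prev_token.freq are hypothetical, and the
sketch assumes that counting mode takes its corpus through the common
-P and -f options.

```python
import shlex


def build_count_command(token_freq_file, prev_token_freq_file,
                        paths=(), files=(), executable="afner"):
    """Return an argv list for AFNER's count mode (hypothetical helper)."""
    if not paths and not files:
        raise ValueError("at least one -P or -f input is required")

    argv = [executable, "--count",
            "--token-frequency-output", token_freq_file,
            "--prev-token-frequency-output", prev_token_freq_file]
    for path in paths:
        argv += ["-P", path]
    for name in files:
        argv += ["-f", name]
    return argv


def frequency_input_args(token_freq_file, prev_token_freq_file):
    """Options that hand the counted frequencies to a later training run."""
    return ["--token-frequency-input", token_freq_file,
            "--prev-token-frequency-input", prev_token_freq_file]


# Hypothetical file names; trainPath is a placeholder directory.
count_command = build_count_command("token.freq", "prev_token.freq",
                                    paths=["trainPath"])
print(shlex.join(count_command))
print(shlex.join(frequency_input_args("token.freq", "prev_token.freq")))
```

The second line of output lists the options to append to a training
command so that it reads the files written by the counting run.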
2.6 Evaluating the Results
A Python script is provided in the directory src/utilities that
can be used to evaluate the accuracy of AFNER. An example run is:
utilities/test.py -c RemediaAnnot/level4/ resultsNew/
This example uses the annotated corpus stored in
RemediaAnnot/level4/ to evaluate the results that are in
resultsNew/; these results are the output of AFNER. The
evaluation results are sent to standard output.
The evaluation script assumes that the testing files used by AFNER have
had all the annotation markup removed before AFNER is called. The script
utilities/remove_non_ent_tags.py can be used to remove all
markup.
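Because the evaluation results go to standard output, they can be
captured from another program. The following Python sketch is not part
of AFNER: evaluation_command and run_evaluation are hypothetical
helpers that reproduce the example invocation above, and the runner
parameter exists only so that the call can be replaced in tests. The
arguments of remove_non_ent_tags.py are not documented here, so that
step is left out.

```python
import subprocess


def evaluation_command(annotated_dir, results_dir,
                       script="utilities/test.py"):
    """Return the argv list matching the example invocation above."""
    return [script, "-c", annotated_dir, results_dir]


def run_evaluation(annotated_dir, results_dir, script="utilities/test.py",
                   runner=subprocess.run):
    """Run the evaluation script and return what it wrote to stdout.

    Hypothetical wrapper: it assumes the script is executable and exits
    with status 0 on success.
    """
    completed = runner(evaluation_command(annotated_dir, results_dir, script),
                       capture_output=True, text=True, check=True)
    return completed.stdout


# Only the command is built here; nothing is executed.
print(evaluation_command("RemediaAnnot/level4/", "resultsNew/"))
```

Calling run_evaluation("RemediaAnnot/level4/", "resultsNew/") from the
src directory would then return the evaluation report as a string.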