The features included were implemented by Alex Chilvers. The following binary features have been included so far:
InitCaps - Whether the token's first character is capitalized
AllCaps - Whether all characters in the token are capitalized
MixedCaps - Whether there is a mix of uppercase and lowercase characters in the token
AlwaysCapped - Whether the token is always capitalized in the text
IsSentEnd - Whether the token is an end-of-sentence character, i.e. '.', '!' or '?'
InitCapPeriod - Whether the token starts with a capital letter and is followed by a period, e.g. 'Mr.'
OneCap - Whether the token is one capital letter
ContainDigit - Whether the token contains a digit
TwoDigits - Whether the token is 2 digits, e.g. '97' or '06'
FourDigits - Whether the token is 4 digits, e.g. '1985'
MonthName - Whether the token is a month name, e.g. 'November'
DayOfTheWeek - Whether the token is a day of the week, e.g. 'monday'
NumberString - Whether the token is a number word, e.g. 'one', 'thousand'
PrepPreceded - Whether the token is preceded by a preposition (within a window of 4 tokens)
PartMatch - Whether a token is part of a match of a regular expression or list item spanning multiple tokens. The printed name changes depending on the match.
FoundInList - Whether the token is found as an element in a list. The printed feature name changes depending on the matching list.
MatchRegex - Whether the token matches a regular expression.
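To make the definitions above concrete, here is a minimal Python sketch of how a few of the purely orthographic features might be computed for a single token. It is not the actual implementation: the function name token_features, the word lists and the exact matching rules (for instance, that AllCaps requires an all-letter token, or that month and day names are matched case-insensitively) are assumptions made for illustration only.

```python
import re

# Hypothetical word lists; the real system may use different resources.
MONTHS = {"january", "february", "march", "april", "may", "june", "july",
          "august", "september", "october", "november", "december"}
DAYS = {"monday", "tuesday", "wednesday", "thursday", "friday",
        "saturday", "sunday"}


def token_features(token):
    """Return the names of the binary features that fire for one token."""
    feats = set()
    if token[:1].isupper():
        feats.add("InitCaps")
    # Assumption: AllCaps only applies to tokens made up entirely of letters.
    if token.isalpha() and token.isupper():
        feats.add("AllCaps")
    if len(token) == 1 and token.isupper():
        feats.add("OneCap")
    # A capital letter, optional further letters, then a final period.
    if re.fullmatch(r"[A-Z][A-Za-z]*\.", token):
        feats.add("InitCapPeriod")
    if token in {".", "!", "?"}:
        feats.add("IsSentEnd")
    if any(c.isdigit() for c in token):
        feats.add("ContainDigit")
    if re.fullmatch(r"[0-9]{2}", token):
        feats.add("TwoDigits")
    if re.fullmatch(r"[0-9]{4}", token):
        feats.add("FourDigits")
    # Assumption: name lookups ignore case.
    if token.lower() in MONTHS:
        feats.add("MonthName")
    if token.lower() in DAYS:
        feats.add("DayOfTheWeek")
    return feats
```

Under these assumptions, token_features("Mr.") yields InitCaps and InitCapPeriod, while token_features("1985") yields ContainDigit and FourDigits. Features that depend on context (AlwaysCapped, PrepPreceded) or on external lists and patterns (FoundInList, MatchRegex, PartMatch) are left out of the sketch because they need more than the token itself.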
Other features may still be implemented, in particular those using global information, e.g. whether a token occurs in a series of capitalised words, or whether an acronym for the capitalised series is found.
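As a rough idea of what such global features could look like, the following Python sketch marks tokens that occur in a run of two or more capitalised words and checks whether the acronym formed from that run appears elsewhere in the text. Nothing here is part of the existing feature set: the function names and the feature names InCapSeries and AcronymFound are hypothetical, and the minimum run length of two is an arbitrary choice.

```python
def capitalised_series(tokens):
    """Return (start, end) index pairs for runs of two or more capitalised tokens."""
    runs = []
    start = None
    # An empty sentinel token closes a run that reaches the end of the text.
    for i, tok in enumerate(list(tokens) + [""]):
        if tok[:1].isupper():
            if start is None:
                start = i
        else:
            if start is not None and i - start >= 2:
                runs.append((start, i))
            start = None
    return runs


def global_features(tokens):
    """Return one set of (hypothetical) global feature names per token."""
    vocab = set(tokens)
    feats = [set() for _ in tokens]
    for start, end in capitalised_series(tokens):
        # Acronym built from the first letter of each word in the series.
        acronym = "".join(t[0] for t in tokens[start:end])
        for i in range(start, end):
            feats[i].add("InCapSeries")
            if acronym in vocab:
                feats[i].add("AcronymFound")
    return feats
```

For the token sequence "the World Health Organization ( WHO ) said", the three tokens of the capitalised series would receive both features, because "WHO" occurs as a token in the same text. A real implementation would also have to decide how to treat sentence-initial capitals, which this sketch counts like any other capitalised token.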