The features were implemented by Alex Chilvers. The following binary features have been included so far:
InitCaps
- Whether the token's first character is capitalized
AllCaps
- Whether all characters in the token are capitalized
MixedCaps
- Whether there is a mix of uppercase and lowercase characters in the token
AlwaysCapped
- Whether the token is always capitalized in the text
IsSentEnd
- Whether the token is an end-of-sentence character, i.e. '.', '!' or '?'
InitCapPeriod
- Whether the token starts with a capital letter and is followed by a period, e.g. 'Mr.'
OneCap
- Whether the token is a single capital letter
ContainDigit
- Whether the token contains a digit
TwoDigits
- Whether the token is 2 digits, e.g. '97' or '06'
FourDigits
- Whether the token is 4 digits, e.g. '1985'
MonthName
- Whether the token is a month name, e.g. 'November'
DayOfTheWeek
- Whether the token is a day of the week, e.g. 'monday'
NumberString
- Whether the token is a number word, e.g. 'one', 'thousand'
PrepPreceded
- Whether the token is preceded by a preposition (within a window of 4 tokens)
PartMatch
- Whether a token is part of a match of a regular expression or list item spanning multiple tokens. The printed name changes depending on the match.
FoundInList
- Whether the token is found as an element in a list. The printed feature name changes depending on the matching list.
MatchRegex
- Whether the token matches a regular expression.
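As an illustration only, several of the features above could be computed along the following lines. This is a minimal Python sketch, not the actual implementation: the function names token_features and context_features, the word lists, and the reading of InitCapPeriod as "the token itself ends in a period" are all assumptions made for the example.

```python
import re

# Illustrative word lists (assumptions; the real lists are not given here).
MONTHS = {"january", "february", "march", "april", "may", "june", "july",
          "august", "september", "october", "november", "december"}
DAYS = {"monday", "tuesday", "wednesday", "thursday", "friday",
        "saturday", "sunday"}
NUMBER_WORDS = {"one", "two", "three", "ten", "hundred", "thousand"}
PREPOSITIONS = {"in", "on", "at", "of", "by", "from", "to", "with"}


def token_features(token):
    """Return the names of the token-local binary features that fire."""
    feats = set()
    if token[:1].isupper():
        feats.add("InitCaps")
    # str.isupper() ignores non-letters, so 'U.S.' counts as AllCaps here.
    if token.isupper():
        feats.add("AllCaps")
    # Literal reading: at least one uppercase and one lowercase character.
    if any(c.isupper() for c in token) and any(c.islower() for c in token):
        feats.add("MixedCaps")
    if token in {".", "!", "?"}:
        feats.add("IsSentEnd")
    # Assumption: the period is part of the token, as in 'Mr.'.
    if re.fullmatch(r"[A-Z][A-Za-z]*\.", token):
        feats.add("InitCapPeriod")
    if len(token) == 1 and token.isupper():
        feats.add("OneCap")
    if any(c.isdigit() for c in token):
        feats.add("ContainDigit")
    if re.fullmatch(r"[0-9]{2}", token):
        feats.add("TwoDigits")
    if re.fullmatch(r"[0-9]{4}", token):
        feats.add("FourDigits")
    lowered = token.lower()
    if lowered in MONTHS:
        feats.add("MonthName")
    if lowered in DAYS:
        feats.add("DayOfTheWeek")
    if lowered in NUMBER_WORDS:
        feats.add("NumberString")
    return feats


def context_features(tokens, i, window=4):
    """Return the features of tokens[i] that depend on the surrounding text."""
    feats = set()
    word = tokens[i].lower()
    # AlwaysCapped: every occurrence of the word starts with a capital.
    occurrences = [t for t in tokens if t.lower() == word]
    if all(t[:1].isupper() for t in occurrences):
        feats.add("AlwaysCapped")
    # PrepPreceded: a preposition occurs in the `window` tokens before i.
    preceding = tokens[max(0, i - window):i]
    if any(t.lower() in PREPOSITIONS for t in preceding):
        feats.add("PrepPreceded")
    return feats
```

For example, with tokens = ["In", "November", "1985", "Mr.", "Smith", "met", "smith", "."], calling context_features(tokens, 1) would report both AlwaysCapped and PrepPreceded for 'November', while 'Smith' at index 4 would only be PrepPreceded because it also appears in lowercase.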
Other features may still be implemented, in particular those using global information, i.e. whether a token occurs in a series of capitalized words, or whether an acronym for the capitalized series is found.
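To make the idea of global features concrete, the two examples mentioned could be sketched as follows. These features are not implemented; the function names capitalized_series and acronym_found, the minimum run length of two tokens, and the rule of building an acronym from initial letters are hypothetical choices for this sketch.

```python
def capitalized_series(tokens, min_len=2):
    """Return (start, end) index pairs for runs of capitalized tokens.

    `end` is exclusive. Runs shorter than `min_len` are ignored.
    """
    runs = []
    start = None
    # The empty sentinel token closes a run that reaches the end of the text.
    for i, tok in enumerate(tokens + [""]):
        if tok[:1].isupper():
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_len:
                runs.append((start, i))
            start = None
    return runs


def acronym_found(tokens, run):
    """Check whether the initials of a capitalized run occur as a token."""
    start, end = run
    acronym = "".join(tok[0] for tok in tokens[start:end])
    return acronym in tokens
```

With tokens = ["Yesterday", ",", "the", "World", "Health", "Organization", "(", "WHO", ")", "met", "."], the only run found is (3, 6), and its acronym 'WHO' is present in the text.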