What is Named Entity Recognition?
Named entities are "atomic elements in text" belonging to "predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc." (Wikipedia, 2006). Named entity recognition (NER) is the task of identifying such named entities.
Even though the categories of named entities are predefined, opinions vary on which categories should count as named entities and how broad those categories should be. A few conventions have emerged, and entities are commonly marked up in the XML-style format defined for the Message Understanding Conference (MUC). "ENAMEX" tags are used for names, "NUMEX" tags for numerical entities, and "TIMEX" tags for temporal entities.
The following example is taken from Wikipedia:
Jim bought 300 shares of Acme Corp. in 2006.
<ENAMEX TYPE="PERSON">Jim</ENAMEX> bought <NUMEX TYPE="QUANTITY">300</NUMEX> shares of <ENAMEX TYPE="ORGANIZATION">Acme Corp.</ENAMEX> in <TIMEX TYPE="DATE">2006</TIMEX>.
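Markup of this kind is straightforward to read back programmatically. The following is a minimal sketch in Python, assuming flat (non-nested) tags with a single TYPE attribute, as in the example above. The function names extract_entities and strip_markup are illustrative and do not come from any MUC tooling.

```python
import re

# Matches one flat MUC-style element, e.g. <ENAMEX TYPE="PERSON">Jim</ENAMEX>.
# Assumption: tags are not nested and carry exactly one TYPE attribute.
TAG_RE = re.compile(r'<(ENAMEX|NUMEX|TIMEX)\s+TYPE="([^"]+)">(.*?)</\1>')


def extract_entities(marked_up):
    """Return (tag, type, text) triples in the order they appear."""
    return [(m.group(1), m.group(2), m.group(3))
            for m in TAG_RE.finditer(marked_up)]


def strip_markup(marked_up):
    """Return the plain sentence with all entity tags removed."""
    return TAG_RE.sub(lambda m: m.group(3), marked_up)


example = (
    '<ENAMEX TYPE="PERSON">Jim</ENAMEX> bought '
    '<NUMEX TYPE="QUANTITY">300</NUMEX> shares of '
    '<ENAMEX TYPE="ORGANIZATION">Acme Corp.</ENAMEX> in '
    '<TIMEX TYPE="DATE">2006</TIMEX>.'
)

for tag, entity_type, text in extract_entities(example):
    print(tag, entity_type, text)
print(strip_markup(example))
```

Run on the example sentence, this lists the four tagged entities and recovers the original untagged sentence. A regular expression is enough only because the tags are flat; nested or malformed markup would call for a real XML parser.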
Basic categories generally agreed upon include the following:
- Names (ENAMEX)
- Times (TIMEX)
- Numbers (NUMEX)
However, the following may also be considered as categories/subcategories:
The categories chosen for a particular NER project may depend on the requirements of the project. If numerical classification is important to a particular field, then the categories dealing with numerical data may need to be more refined. Similarly, if geographical classification is important, it may be necessary to classify each location entity as a particular type of location.
Named entity recognition, although seemingly simple, faces a number of challenges. Entities may be difficult to find in the first place and, once found, difficult to classify. For instance, the same name can refer to either a location or a person (as with "Washington"), and both kinds of name follow similar formatting.
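The overlap between person and location names can be illustrated with a naive list lookup. This is only a sketch: the two name lists below are tiny hypothetical examples, and candidate_types is an illustrative helper rather than part of any NER system.

```python
# Hypothetical toy name lists, for illustration only.
PERSON_NAMES = {"Jim", "Washington", "Paris"}
LOCATION_NAMES = {"London", "Washington", "Paris"}


def candidate_types(token):
    """Return every category whose name list contains the token."""
    types = set()
    if token in PERSON_NAMES:
        types.add("PERSON")
    if token in LOCATION_NAMES:
        types.add("LOCATION")
    return types


for token in ["Jim", "London", "Washington"]:
    print(token, sorted(candidate_types(token)))
```

A token such as "Washington" comes back with both categories, so a lookup alone cannot classify it. Deciding between the two requires the surrounding context.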