J. Teixeira, L. Sarmento and E. Oliveira
The 4th Track on Text Mining and Applications (TeMA 2011) at the 15th Portuguese Conference on Artificial Intelligence (EPIA), October 2011, Lisbon, Portugal
In this paper we present a bootstrapping approach for training a Named Entity Recognition (NER) system. Our method starts by annotating person names in a dataset of 50,000 news items using a simple dictionary-based approach. From this training set we build a classification model based on Conditional Random Fields (CRF). We then use the inferred model to add further annotations to the initial seed corpus, which is in turn used to train a new classification model. This cycle is repeated until the NER model stabilizes. We evaluate each bootstrapping iteration by calculating: (i) the precision and recall of the NER model in annotating a small gold-standard collection (HAREM); (ii) the precision and recall of the CRF bootstrapping annotation method over a small sample of news; and (iii) the correctness and the number of new names identified. Additionally, we compare the NER model with a dictionary-based approach, our baseline method. Results show that our bootstrapping approach stabilizes after 7 iterations, achieving high values of precision (83%) and recall (68%).
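The annotate, train, re-annotate cycle described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `ContextModel` class is a hypothetical toy stand-in for the CRF (it only memorises known names and the word preceding them), the four-sentence corpus is invented for illustration, and the stopping rule (annotations no longer change) is an assumed reading of "until the NER model stabilizes".

```python
from typing import List, Set, Tuple

Sentence = List[str]
Labels = List[bool]  # True where a token is annotated as a person name


def dictionary_annotate(sentence: Sentence, names: Set[str]) -> Labels:
    """Seed step: mark tokens that appear in a name dictionary."""
    return [tok in names for tok in sentence]


class ContextModel:
    """Toy stand-in for the CRF (assumption, not the paper's model).

    It memorises annotated name tokens and the lowercase word that
    immediately precedes them, then labels known names and capitalised
    tokens that follow a memorised preceding word.
    """

    def __init__(self) -> None:
        self.names: Set[str] = set()
        self.triggers: Set[str] = set()

    def fit(self, corpus: List[Sentence], labels: List[Labels]) -> "ContextModel":
        self.names, self.triggers = set(), set()
        for sent, labs in zip(corpus, labels):
            for i, is_name in enumerate(labs):
                if not is_name:
                    continue
                self.names.add(sent[i])
                if i > 0 and not labs[i - 1]:
                    self.triggers.add(sent[i - 1].lower())
        return self

    def predict(self, sentence: Sentence) -> Labels:
        out = []
        for i, tok in enumerate(sentence):
            after_trigger = (
                i > 0
                and tok[:1].isupper()
                and sentence[i - 1].lower() in self.triggers
            )
            out.append(tok in self.names or after_trigger)
        return out


def bootstrap(
    corpus: List[Sentence], seed_names: Set[str], max_iter: int = 10
) -> Tuple[ContextModel, List[Labels], int]:
    """Repeat train / re-annotate until the annotations stop changing."""
    labels = [dictionary_annotate(s, seed_names) for s in corpus]
    model = ContextModel()
    rounds = 0
    for rounds in range(1, max_iter + 1):
        model.fit(corpus, labels)
        # Keep earlier annotations and add whatever the model newly finds.
        new_labels = [
            [old or new for old, new in zip(labs, model.predict(sent))]
            for sent, labs in zip(corpus, labels)
        ]
        if new_labels == labels:  # assumed stabilisation criterion
            break
        labels = new_labels
    return model, labels, rounds


# Invented example corpus; only "Silva" is in the seed dictionary.
corpus = [
    ["minister", "Silva", "spoke", "today"],
    ["minister", "Costa", "arrived"],
    ["coach", "Costa", "resigned"],
    ["coach", "Ramos", "agreed"],
]
model, labels, rounds = bootstrap(corpus, {"Silva"})
found = {tok for s, labs in zip(corpus, labels) for tok, lab in zip(s, labs) if lab}
print(sorted(found), rounds)
```

On this toy corpus the loop starts from one dictionary name and picks up the other two through shared contexts, which mirrors in miniature how each iteration can surface names the seed dictionary lacked. A real system would replace `ContextModel` with a trained CRF and would measure precision and recall against a gold standard at every iteration, as the abstract describes.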