Third Workshop on Intelligent Systems and Applications in 6ª Conferência Ibérica de Sistemas e Tecnologias de Informação (CISTI), June 2011, Chaves, Portugal
In this paper we tackle the problem of creating a reference corpus for the classification of news items in fine-grained multi-label scenarios. These scenarios are particularly challenging for text classification techniques, and the limited availability of reference corpora is an important bottleneck for developing and testing new classification strategies. We propose a semi-automatic approach for creating a reference corpus that uses three auxiliary classification methods (one based on Support Vector Machines, one based on Nearest Neighbor classifiers, and one based on a dictionary-based classification heuristic) to suggest to human annotators topic-related labels that describe different facets of the news item being annotated. Using this approach, we semi-automatically produce a corpus of 1,600 news items with 865 different labels and an average of 3.63 labels per news item. We evaluate the contribution of each auxiliary classification method to the annotation process and conclude that: (i) no single method is capable of suggesting all relevant labels, (ii) the dictionary-based classification heuristic contributes significantly, and (iii) the Nearest Neighbor classifier performs very efficiently in the most extreme multi-label part of the problem and is robust to the highly unbalanced item-to-class distribution.
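The pooling of suggestions described in this abstract could be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the function names `pool_suggestions` and `contribution`, the method keys, and the example labels are all hypothetical, and the three classifiers are represented simply by the label lists they return.

```python
def pool_suggestions(suggestions):
    """Merge label suggestions from several auxiliary methods.

    suggestions: dict mapping a method name to the labels it proposes
    (hypothetical structure, assumed for this sketch).
    Returns a dict mapping each label to the sorted list of methods
    that proposed it.
    """
    pooled = {}
    for method, labels in suggestions.items():
        for label in labels:
            pooled.setdefault(label, set()).add(method)
    return {label: sorted(methods) for label, methods in pooled.items()}


def contribution(pooled, accepted):
    """Count, per method, the accepted labels it suggested and the
    accepted labels that only it suggested."""
    stats = {}
    for label in accepted:
        methods = pooled.get(label, [])
        for method in methods:
            entry = stats.setdefault(method, {"accepted": 0, "unique": 0})
            entry["accepted"] += 1
            if len(methods) == 1:
                entry["unique"] += 1
    return stats


# Illustrative data only: made-up suggestions for a single news item.
suggestions = {
    "svm": ["politics", "economy"],
    "knn": ["economy", "euro"],
    "dictionary": ["euro", "greece"],
}
pooled = pool_suggestions(suggestions)
# Labels the human annotator kept after reviewing the pooled suggestions.
stats = contribution(pooled, accepted=["economy", "euro", "greece"])
```

Counting labels that only one method proposed is one simple way to see why no single method suffices: any method with a non-zero `unique` count supplied labels the others missed.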
L. Sarmento, S. Nunes, J. Teixeira and E. Oliveira. IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT’09), 15–18 September 2009, Milano, Italy
Abstract: We propose an unsupervised method for propagating automatically extracted fine-grained topic labels among news items to improve their topic description for subsequent text classification procedures. The method compares vector representations of news items and assigns to each news item the label of its closest neighbour that has a different topic label. The results show that high precision can be achieved when propagating the top-ranked topic label, and that 2-gram and 3-gram feature representations yield the best precision.
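The propagation step described above might look something like the sketch below. Several details are assumptions made for illustration, since the abstract does not specify them: word-level n-grams, raw counts as vector weights, cosine similarity as the comparison measure, and one topic label per item. The function names and example texts are hypothetical.

```python
from collections import Counter
from math import sqrt


def ngram_vector(text, n=2):
    """Bag of word n-grams with raw counts (assumed representation)."""
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def propagate_labels(items, n=2):
    """For each (text, label) pair, return the label of the most similar
    item whose label differs, or None if no such item shares any n-gram."""
    vectors = [ngram_vector(text, n) for text, _ in items]
    propagated = []
    for i, (_, label) in enumerate(items):
        best_label, best_sim = None, 0.0
        for j, (_, other_label) in enumerate(items):
            if i == j or other_label == label:
                continue  # only neighbours with a different topic label
            sim = cosine(vectors[i], vectors[j])
            if sim > best_sim:
                best_label, best_sim = other_label, sim
        propagated.append(best_label)
    return propagated


# Illustrative data only.
items = [
    ("euro zone debt crisis talks", "economy"),
    ("euro zone debt crisis summit in brussels", "european union"),
    ("benfica wins league title", "football"),
]
new_labels = propagate_labels(items, n=2)
```

In this toy example the first two items share three bigrams, so each receives the other's label as an additional descriptor, while the third item has no overlapping neighbour and receives none.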
verbatim is a tool that automatically extracts quotations from Portuguese news media and presents them in an organized way through a web interface. Every day, verbatim processes dozens of news items collected through SAPO's web services and, without any human intervention, identifies quotations and their speakers and classifies them by topic.