Tokenizing Micro-Blogging Messages using a Text Classification Approach

21 September 2010
G. Laboreiro, L. Sarmento, J. Teixeira and E. Oliveira
The Fourth Workshop on Analytics for Noisy Unstructured Text Data (AND’10), October 26th, 2010, Toronto, Canada

The automatic processing of microblogging messages can be problematic, even for very elementary operations such as tokenization. The problems arise from the use of non-standard language, including media-specific words (e.g. “2day”, “gr8”, “tl;dr”, “loool”), emoticons (e.g. “(ò_ó)”, “(=ˆ-ˆ=)”), non-standard letter casing (e.g. “dr. Fred”) and unusual punctuation (e.g. “…. ..”, “!??!!!?”, “„,”). Additionally, spelling errors are abundant (e.g. “I;m”), and a single short message frequently mixes more than one language, each with different tokenization requirements. To be efficient in such an environment, manually developed rule-based tokenizers have to deal with many conditions and exceptions, which makes them difficult to build and maintain. We present a text classification approach for tokenizing Twitter messages, which address…
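The difficulty described in the abstract can be illustrated with a minimal sketch. The function `naive_tokenize` and its single regular expression are hypothetical and are not taken from the paper; they only stand in for the simplest kind of rule-based tokenizer, to show how quickly one generic rule breaks on microblogging text.

```python
import re

# Hypothetical one-rule tokenizer (an assumption for illustration, not the
# authors' system): a token is either a run of word characters or a single
# non-space, non-word character.
NAIVE_RULE = re.compile(r"\w+|[^\w\s]")


def naive_tokenize(text: str) -> list[str]:
    """Split text using the single generic rule above."""
    return NAIVE_RULE.findall(text)


if __name__ == "__main__":
    # Examples drawn from the abstract: an abbreviation, a misspelling,
    # and a run of expressive punctuation.
    for sample in ["tl;dr", "I;m gr8", "!??!"]:
        print(sample, "->", naive_tokenize(sample))
```

Under this rule, “tl;dr” is split into three tokens and “!??!” into four, although a reader would treat each as a single unit. Handling such cases one at a time means adding an exception per pattern, which is the maintenance burden the abstract attributes to rule-based tokenizers.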

