G. Laboreiro, L. Sarmento, J. Teixeira and E. Oliveira
The Fourth Workshop on Analytics for Noisy Unstructured Text Data (AND’10), October 26th, 2010, Toronto, Canada
Abstract:
The automatic processing of microblogging messages can be problematic, even for very elementary operations such as tokenization. The problems arise from the use of non-standard language, including media-specific words (e.g. “2day”, “gr8”, “tl;dr”, “loool”), emoticons (e.g. “(ò_ó)”, “(=ˆ-ˆ=)”), non-standard letter casing (e.g. “dr. Fred”) and unusual punctuation (e.g. “…. ..”, “!??!!!?”, “„,”). Additionally, spelling errors are abundant (e.g. “I;m”), and a single short message frequently mixes more than one language, each with different tokenization requirements. To be efficient in such an environment, manually developed rule-based tokenizers have to deal with many conditions and exceptions, which makes them difficult to build and maintain. We present a text classification approach for tokenizing Twitter messages, which address…