
Sylvester UGC Tokenizer

12 November 2011
No comments

Sylvester UGC Tokenizer is a simple tool for splitting noisy text
into segments such as words, punctuation blocks, URLs, and smileys.
Most tokenizers were designed for clean text and can corrupt noisy
messages such as Twitter posts.
We use a text classification approach, described in this post,
and achieve significantly better results on such text.
The library and example code are available here.
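
To give a rough feel for the idea, tokenization can be framed as a yes/no decision at every boundary between two characters: split here, or not. The sketch below is only an illustration under stated assumptions and is not the Sylvester library's API. The names `should_split`, `protected_spans`, and `tokenize` are hypothetical, and the decision function is a hand-written stand-in for what would, in a real classification approach, be a trained model.

```python
import re

# Assumed, simplified patterns for this sketch; real noisy text needs more.
URL_RE = re.compile(r"https?://\S+")
SMILEY_RE = re.compile(r"[:;=][-']?[)(DPp/\\|]")


def char_class(ch):
    """Coarse character class, used as the only 'feature' in this sketch."""
    if ch.isalnum() or ch == "_":
        return "word"
    if ch.isspace():
        return "space"
    return "punct"


def protected_spans(text):
    """Return (start, end) spans of URLs and smileys. URLs are listed first."""
    spans = []
    for pattern in (URL_RE, SMILEY_RE):
        for match in pattern.finditer(text):
            spans.append((match.start(), match.end()))
    return spans


def should_split(text, i, spans):
    """Hypothetical stand-in for a trained classifier.

    Decides whether a token boundary lies between text[i - 1] and text[i].
    """
    for start, end in spans:
        if start < i < end:
            return False  # never split inside a URL or smiley
        if i == start or i == end:
            return True   # always split at the edges of a URL or smiley
    # Otherwise split wherever the character class changes.
    return char_class(text[i - 1]) != char_class(text[i])


def tokenize(text):
    """Split text at every boundary the decision function accepts."""
    spans = protected_spans(text)
    tokens, start = [], 0
    for i in range(1, len(text)):
        if should_split(text, i, spans):
            tokens.append(text[start:i])
            start = i
    tokens.append(text[start:])
    # Drop empty and whitespace-only segments.
    return [t for t in tokens if t.strip()]


print(tokenize("omg!!! check http://t.co/x1 :-) lol"))
# ['omg', '!!!', 'check', 'http://t.co/x1', ':-)', 'lol']
```

Replacing `should_split` with a model trained on labeled boundaries would keep the rest of the loop unchanged, which is the main appeal of treating tokenization as classification.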


