Gustavo Laboreiro, Matko Bošnjak, Luís Sarmento, Eduarda Mendes Rodrigues, Eugénio Oliveira (2013). “Determining language variant in microblog messages”, in Proceedings of the 28th Annual ACM Symposium on Applied Computing 2013, Volume I, ACM, ISBN 978-1-4503-1656-9, pp. 902-907.
It is difficult to determine the country of origin of the author of a short message based only on the text. This is an even more complex problem when more than one country uses the same native language. In this paper, we address the specific problem of detecting the two main variants of the Portuguese language (European and Brazilian) in Twitter micro-blogging data, by proposing and evaluating a set of high-precision features. We follow an automatic classification approach using a Naïve Bayes classifier, achieving 95% accuracy. We find that our system is adequate for real-time tweet classification.
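To make the classification approach concrete, the following is a minimal sketch of a multinomial Naïve Bayes classifier with add-one smoothing, written in Python. It is not the authors' implementation: the abstract does not list the high-precision features, so plain lowercase word counts stand in for them here. The class name `NaiveBayes`, the labels `pt-PT` and `pt-BR`, and the four training sentences are all invented for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Assumption: lowercase whitespace tokens stand in for the paper's features.
    return text.lower().split()


class NaiveBayes:
    """Multinomial Naive Bayes with additive (Laplace) smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                        # smoothing strength
        self.class_counts = Counter()             # messages seen per label
        self.token_counts = defaultdict(Counter)  # token counts per label
        self.vocab = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            self.class_counts[label] += 1
            for tok in tokenize(text):
                self.token_counts[label][tok] += 1
                self.vocab.add(tok)
        return self

    def log_scores(self, text):
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        scores = {}
        for label, n_docs in self.class_counts.items():
            # Log prior of the class.
            score = math.log(n_docs / total_docs)
            total_tokens = sum(self.token_counts[label].values())
            for tok in tokenize(text):
                if tok not in self.vocab:
                    continue  # tokens never seen in training carry no evidence
                count = self.token_counts[label][tok]
                # Smoothed log likelihood of the token given the class.
                score += math.log(
                    (count + self.alpha) / (total_tokens + self.alpha * vocab_size)
                )
            scores[label] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Toy training data, invented for this sketch (not taken from the paper).
train_texts = [
    "estou a ver o autocarro",
    "vou apanhar o comboio",
    "estou vendo o ônibus",
    "vou pegar o trem",
]
train_labels = ["pt-PT", "pt-PT", "pt-BR", "pt-BR"]

model = NaiveBayes().fit(train_texts, train_labels)
```

Calling `model.predict("o autocarro chegou")` selects the European label on this toy data, because "autocarro" appears only in the European examples. Prediction amounts to a few dictionary lookups per token, which fits the real-time use described above, although the reported 95% accuracy depends on the paper's own features and data and says nothing about this sketch.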