ABSTRACT
It is difficult to determine the country of origin of the author of a short message based only on the text. The problem becomes harder when more than one country shares the same native language. In this paper, we address the specific problem of distinguishing the two main variants of the Portuguese language, European and Brazilian, in Twitter micro-blogging data, by proposing and evaluating a set of high-precision features. We follow an automatic classification approach using a Naïve Bayes classifier, achieving 95% accuracy. We find that our system is adequate for real-time tweet classification.
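As a rough illustration of the classification approach named above, the following is a minimal sketch of a multinomial Naïve Bayes classifier with add-one smoothing over whitespace-separated word tokens. It is not the system evaluated in the paper: the training sentences are invented toy examples, plain word tokens stand in for the paper's high-precision features, and the names `TRAIN`, `tokenize`, `train`, and `classify` are hypothetical.

```python
import math
from collections import Counter

# Invented toy examples for illustration only (NOT the paper's data set).
TRAIN = [
    ("pt-PT", "estou a ver o autocarro no telemóvel"),
    ("pt-PT", "vou apanhar o comboio para o pequeno-almoço"),
    ("pt-BR", "estou vendo o ônibus no celular"),
    ("pt-BR", "vou pegar o trem para o café da manhã"),
]


def tokenize(text):
    """Lowercase and split on whitespace (an assumed, simplistic tokenizer)."""
    return text.lower().split()


def train(examples):
    """Collect per-label word counts, per-label message counts and the vocabulary."""
    word_counts = {}
    doc_counts = Counter()
    vocab = set()
    for label, text in examples:
        doc_counts[label] += 1
        counts = word_counts.setdefault(label, Counter())
        for token in tokenize(text):
            counts[token] += 1
            vocab.add(token)
    return word_counts, doc_counts, vocab


def classify(text, model):
    """Return the label with the highest log-posterior under Naive Bayes."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, float("-inf")
    for label in sorted(doc_counts):
        # Log prior: share of training messages carrying this label.
        score = math.log(doc_counts[label] / total_docs)
        # Add-one (Laplace) smoothing over the shared vocabulary.
        denom = sum(word_counts[label].values()) + len(vocab)
        for token in tokenize(text):
            if token in vocab:  # tokens never seen in training are ignored
                score += math.log((word_counts[label][token] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


MODEL = train(TRAIN)
print(classify("o ônibus está no celular", MODEL))
print(classify("o autocarro e o telemóvel", MODEL))
```

Scoring a message is a single pass over its tokens, which is consistent with the abstract's remark about real-time use, although this toy version says nothing about the accuracy reported there.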