DOI: 10.1145/2480362.2480535
research-article

Determining language variant in microblog messages

Published: 18 March 2013

ABSTRACT

It is difficult to determine the country of origin of the author of a short message based only on the text. This is an even more complex problem when more than one country uses the same native language. In this paper, we address the specific problem of detecting the two main variants of the Portuguese language, European and Brazilian, in Twitter micro-blogging data, by proposing and evaluating a set of high-precision features. We follow an automatic classification approach using a Naïve Bayes classifier, achieving 95% accuracy. We find that our system is adequate for real-time tweet classification.
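
To make the classification approach concrete, here is a minimal sketch of a multinomial Naïve Bayes text classifier with add-one smoothing. It is not the paper's system: the abstract does not list the high-precision features, so plain lowercase word tokens are assumed in their place, and the class name `NaiveBayes`, the labels `pt-PT` and `pt-BR`, and the four training messages are all invented for illustration.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes over whitespace tokens (illustrative only)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                        # additive smoothing constant
        self.class_counts = Counter()             # messages seen per label
        self.token_counts = defaultdict(Counter)  # token counts per label
        self.vocab = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            self.class_counts[label] += 1
            for tok in text.lower().split():
                self.token_counts[label][tok] += 1
                self.vocab.add(tok)
        return self

    def log_scores(self, text):
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        scores = {}
        for label, n_docs in self.class_counts.items():
            counts = self.token_counts[label]
            total_tokens = sum(counts.values())
            score = math.log(n_docs / total_docs)  # log prior
            for tok in text.lower().split():
                if tok not in self.vocab:
                    continue  # tokens never seen in training carry no evidence
                score += math.log(
                    (counts[tok] + self.alpha)
                    / (total_tokens + self.alpha * vocab_size)
                )
            scores[label] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Toy training messages, made up for this example (not data from the paper).
train_texts = [
    "vou apanhar o autocarro",
    "perdi o telemóvel no comboio",
    "vou pegar o ônibus",
    "perdi o celular no trem",
]
train_labels = ["pt-PT", "pt-PT", "pt-BR", "pt-BR"]

clf = NaiveBayes().fit(train_texts, train_labels)
print(clf.predict("onde fica o autocarro"))  # pt-PT
print(clf.predict("meu celular quebrou"))    # pt-BR
```

Scoring a message is a single pass over its tokens with dictionary lookups, which is consistent with the abstract's point that this kind of classifier can keep up with real-time tweet streams. A real system would need far more training data and variant-specific features than this toy vocabulary.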


  • Published in

    SAC '13: Proceedings of the 28th Annual ACM Symposium on Applied Computing
    March 2013
    2124 pages
    ISBN: 9781450316569
    DOI: 10.1145/2480362

    Copyright © 2013 ACM

    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Acceptance Rates

    SAC '13 Paper Acceptance Rate: 255 of 1,063 submissions, 24%
    Overall Acceptance Rate: 1,650 of 6,669 submissions, 25%
