Determining language variant in microblog messages

Authors:
Gustavo Laboreiro (Universidade do Porto)
Matko Bošnjak (Universidade do Porto)
Luís Sarmento (Universidade do Porto)
Eduarda Mendes Rodrigues (Universidade do Porto)
Eugénio Oliveira (Universidade do Porto)

SAC '13: Proceedings of the 28th Annual ACM Symposium on Applied Computing, March 2013, Pages 902–907. https://doi.org/10.1145/2480362.2480535

Published: 18 March 2013

ABSTRACT

It is difficult to determine the country of origin of the author of a short message based only on the text. This is an even more complex problem when more than one country uses the same native language. In this paper, we address the specific problem of detecting the two main variants of the Portuguese language (European and Brazilian) in Twitter micro-blogging data, by proposing and evaluating a set of high-precision features. We follow an automatic classification approach using a Naïve Bayes classifier, achieving 95% accuracy. We find that our system is adequate for real-time tweet classification.
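The classification approach named in the abstract can be illustrated with a minimal sketch. The abstract does not spell out the paper's feature set, so the code below makes several assumptions: it uses plain lowercase word tokens as features, a multinomial Naïve Bayes model with add-one smoothing, and four made-up training messages built around well-known lexical differences (for example "autocarro" in European Portuguese versus "ônibus" in Brazilian Portuguese). The class name `NaiveBayesVariantClassifier`, the labels `pt-PT` and `pt-BR`, and the toy data are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesVariantClassifier:
    """Multinomial Naive Bayes over lowercase word tokens (illustrative only)."""

    def __init__(self):
        self.token_counts = defaultdict(Counter)  # label -> token -> count
        self.doc_counts = Counter()               # label -> number of training messages
        self.vocabulary = set()

    @staticmethod
    def tokenize(text):
        # Assumption: a deliberately simple tokenizer (lowercase, split on whitespace).
        return text.lower().split()

    def fit(self, messages, labels):
        for text, label in zip(messages, labels):
            tokens = self.tokenize(text)
            self.token_counts[label].update(tokens)
            self.doc_counts[label] += 1
            self.vocabulary.update(tokens)
        return self

    def log_scores(self, text):
        """Return log P(label) + sum of log P(token | label) for each label."""
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        scores = {}
        for label in self.doc_counts:
            score = math.log(self.doc_counts[label] / total_docs)
            label_total = sum(self.token_counts[label].values())
            for token in self.tokenize(text):
                count = self.token_counts[label][token]
                # Add-one (Laplace) smoothing so unseen tokens do not zero out a label.
                score += math.log((count + 1) / (label_total + vocab_size))
            scores[label] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Toy training set, invented for illustration (not data from the paper).
train = [
    ("vou apanhar o autocarro", "pt-PT"),
    ("perdi o meu telemóvel no comboio", "pt-PT"),
    ("vou pegar o ônibus", "pt-BR"),
    ("perdi meu celular no trem", "pt-BR"),
]
clf = NaiveBayesVariantClassifier().fit(
    [text for text, _ in train], [label for _, label in train]
)

print(clf.predict("o autocarro chegou"))
print(clf.predict("meu celular quebrou"))
```

Scoring a message is one pass over its tokens with a dictionary lookup per label, which is consistent with the abstract's remark that this kind of classifier is suitable for real-time use. A realistic system would need far more training data and the paper's own features.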

Published in
SAC '13: Proceedings of the 28th Annual ACM Symposium on Applied Computing
March 2013
2124 pages
ISBN:9781450316569
DOI:10.1145/2480362
Conference Chairs:
Sung Y. Shin (South Dakota State University, United States)
José Carlos Maldonado (ICMC - University of São Paulo, Brazil)
Copyright © 2013 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Acceptance Rates
SAC '13 Paper Acceptance Rate: 255 of 1,063 submissions, 24%
Overall Acceptance Rate: 1,650 of 6,669 submissions, 25%