research-article

Named entity recognition for tweets

Authors:

Shaodian Zhang,

Ming Zhou

ACM Transactions on Intelligent Systems and Technology (TIST), Volume 4, Issue 1

Article No.: 3, Pages 1 - 15

https://doi.org/10.1145/2414425.2414428

Published: 01 February 2013

Abstract

Named Entity Recognition (NER) for tweets faces two main challenges: the insufficient information in a single tweet and the lack of training data. We propose a novel method consisting of three core elements: (1) normalization of tweets; (2) combination of a K-Nearest Neighbors (KNN) classifier with a linear Conditional Random Fields (CRF) model; and (3) a semisupervised learning framework. The tweet normalization preprocessing corrects common ill-formed words using a global linear model. The KNN-based classifier conducts prelabeling to collect global coarse evidence across tweets, while the CRF model conducts sequential labeling to capture fine-grained information encoded in a tweet. Semisupervised learning, together with gazetteers, alleviates the lack of training data. Extensive experiments show the advantages of our method over the baselines as well as the effectiveness of normalization, KNN, and semisupervised learning.
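The pipeline described in the abstract (normalize, prelabel with KNN, then pass the prelabel to a sequence labeler) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the article: the lookup table `NORMALIZATION_LEXICON` stands in for the trained global linear model, the context-window bag-of-words vectors and cosine similarity are one plausible choice for the KNN step, the toy tweets and labels are invented, and the CRF itself is not implemented. The function `crf_features` only shows how a KNN prelabel could be exposed as one feature among others to a linear-chain CRF toolkit.

```python
from collections import Counter
import math

# Hypothetical lookup table; a stand-in for the trained normalization model.
NORMALIZATION_LEXICON = {"2moro": "tomorrow", "luv": "love", "gr8": "great"}


def normalize(tokens):
    """Replace ill-formed tokens found in the toy lexicon; keep the rest as-is."""
    return [NORMALIZATION_LEXICON.get(t.lower(), t) for t in tokens]


def context_vector(tokens, i, window=2):
    """Bag-of-words counts of the tokens within `window` positions of token i."""
    lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
    return Counter(t.lower() for t in tokens[lo:hi])


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def knn_prelabel(vec, memory, k=3):
    """Majority label among the k stored (vector, label) pairs most similar to vec."""
    nearest = sorted(memory, key=lambda item: cosine(vec, item[0]), reverse=True)[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def crf_features(tokens, i, prelabel):
    """Per-token feature dict of the kind a linear-chain CRF toolkit could consume."""
    return {
        "word": tokens[i].lower(),
        "is_capitalized": tokens[i][:1].isupper(),
        "knn_prelabel": prelabel,  # coarse evidence gathered across tweets
    }


# Invented toy training tweets with word-level labels.
train = [
    (["luv", "Seattle", "in", "summer"], ["O", "B-LOC", "O", "O"]),
    (["flying", "to", "Boston", "2moro"], ["O", "O", "B-LOC", "O"]),
    (["met", "Obama", "in", "Chicago"], ["O", "B-PER", "O", "B-LOC"]),
]

# The KNN "model" is just the stored labeled context vectors.
memory = []
for tokens, labels in train:
    tokens = normalize(tokens)
    for i, label in enumerate(labels):
        memory.append((context_vector(tokens, i), label))

# Prelabel a new tweet, then build the features a CRF would be trained on.
tweet = normalize(["flying", "to", "Denver", "2moro"])
prelabels = [knn_prelabel(context_vector(tweet, i), memory) for i in range(len(tweet))]
features = [crf_features(tweet, i, prelabels[i]) for i in range(len(tweet))]

for token, feats in zip(tweet, features):
    print(token, feats)
```

In a semisupervised setting of the kind the abstract mentions, confidently labeled tweets could be appended to `memory` and to the CRF training set before retraining; the confidence criterion is left out here because the abstract does not specify it.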

Index Terms

Named entity recognition for tweets
1. Information systems
  1. Information retrieval
    1. Document representation

Published In

ACM Transactions on Intelligent Systems and Technology Volume 4, Issue 1

Special section on twitter and microblogging services, social recommender systems, and CAMRa2010: Movie recommendation in context

January 2013

357 pages

ISSN:2157-6904

EISSN:2157-6912

DOI:10.1145/2414425

Copyright © 2013 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 February 2013

Accepted: 01 October 2012

Revised: 01 September 2012

Received: 01 June 2011

Published in TIST Volume 4, Issue 1

Qualifiers

Research-article
Research
Refereed

Bibliometrics

Article Metrics

Total Citations: 37
Total Downloads: 1,107
Downloads (Last 12 months): 11
Downloads (Last 6 weeks): 0

Reflects downloads up to 01 Mar 2025

Cited By

Ravi R, Ginde G, Rokne J (2025) PRAGyan - Connecting the Dots in Tweets. Social Networks Analysis and Mining, 338-354. Online publication date: 24-Jan-2025
https://doi.org/10.1007/978-3-031-78548-1_25
Quan H, Li Y, Liu D, Zhou Y (2024) Protection of Guizhou Miao batik culture based on knowledge graph and deep learning. Heritage Science 12:1. Online publication date: 14-Jun-2024
https://doi.org/10.1186/s40494-024-01317-y
Shankar S, Zamfirescu-Pereira J, Hartmann B, Parameswaran A, Arawjo I (2024) Who Validates the Validators? Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences. Proceedings of the 37th Annual ACM Symposium on User Interface Software and Technology, 1-14. Online publication date: 13-Oct-2024
https://dl.acm.org/doi/10.1145/3654777.3676450
El Bahi H (2024) Handwritten text recognition and information extraction from ancient manuscripts using deep convolutional and recurrent neural network. Soft Computing - A Fusion of Foundations, Methodologies and Applications 28:20, 12249-12268. Online publication date: 1-Oct-2024
https://dl.acm.org/doi/10.1007/s00500-024-09930-6
Alsaqer M, Alelyani S, Mohana M, Alreemy K, Alqahtani A (2023) Predicting Location of Tweets Using Machine Learning Approaches. Applied Sciences 13:5, 3025. Online publication date: 26-Feb-2023
https://doi.org/10.3390/app13053025
Simanjuntak L, Mahendra R, Yulianti E (2022) We Know You Are Living in Bali: Location Prediction of Twitter Users Using BERT Language Model. Big Data and Cognitive Computing 6:3, 77. Online publication date: 7-Jul-2022
https://doi.org/10.3390/bdcc6030077
Leng Y, Qiu D (2022) Using Approximately Coupled Tensor Factorization to Model Changing User Preferences for Movie Recommendations. Proceedings of the 8th International Conference on Computing and Artificial Intelligence, 293-300. Online publication date: 18-Mar-2022
https://dl.acm.org/doi/10.1145/3532213.3532257
Alkhalifa R, Zubiaga A (2022) Capturing stance dynamics in social media: open challenges and research directions. International Journal of Digital Humanities 3:1-3, 115-135. Online publication date: 8-Mar-2022
https://doi.org/10.1007/s42803-022-00043-w
Li S, Han L (2022) A Two-Stage NER Method for Online-Sale Comments. Applications of Decision Science in Management, 283-290. Online publication date: 8-Sep-2022
https://doi.org/10.1007/978-981-19-2768-3_26
Li S, Han L (2022) A Two-Stage NER Method for Outstanding Papers in MCM. Artificial Intelligence in Education: Emerging Technologies, Models and Applications, 41-50. Online publication date: 18-Mar-2022
https://doi.org/10.1007/978-981-16-7527-0_3
