Named entity recognition for tweets

Published: 01 February 2013

Abstract

Two main challenges of Named Entity Recognition (NER) for tweets are the insufficient information in a tweet and the lack of training data. We propose a novel method consisting of three core elements: (1) normalization of tweets; (2) combination of a K-Nearest Neighbors (KNN) classifier with a linear Conditional Random Fields (CRF) model; and (3) a semisupervised learning framework. The tweet normalization preprocessing corrects common ill-formed words using a global linear model. The KNN-based classifier conducts prelabeling to collect global coarse evidence across tweets, while the CRF model conducts sequential labeling to capture fine-grained information encoded in a tweet. Semisupervised learning, together with gazetteers, alleviates the lack of training data. Extensive experiments show the advantages of our method over the baselines as well as the effectiveness of normalization, KNN, and semisupervised learning.
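
The KNN prelabeling step described above can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the features or the similarity measure, so the positional word features, the cosine similarity, the toy training tweets, and the function names (`context_features`, `knn_prelabel`, `crf_features`) are all assumptions made for illustration. The sketch only shows how a coarse label voted by similar tokens from other tweets could be passed to a sequence labeler as one more per-token feature; training the CRF itself is left out.

```python
from collections import Counter
from math import sqrt

def context_features(tokens, i, window=1):
    """Positional word features around token i (an assumed, simplistic feature set)."""
    return Counter({
        f"w{d:+d}={tokens[i + d].lower()}": 1
        for d in range(-window, window + 1)
        if 0 <= i + d < len(tokens)
    })

def cosine(a, b):
    """Cosine similarity between two sparse feature vectors."""
    dot = sum(a[key] * b[key] for key in a if key in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def knn_prelabel(vec, memory, k=3):
    """Majority label among the k stored examples most similar to vec."""
    nearest = sorted(memory, key=lambda ex: cosine(vec, ex[0]), reverse=True)[:k]
    votes = Counter(label for _, label in nearest)
    label, count = votes.most_common(1)[0]
    return label, count / len(nearest)

def crf_features(tokens, memory, k=3):
    """Per-token feature dicts, including the KNN prelabel, for a CRF toolkit."""
    feats = []
    for i, tok in enumerate(tokens):
        label, conf = knn_prelabel(context_features(tokens, i), memory, k)
        feats.append({
            "word": tok.lower(),
            "is_title": tok.istitle(),
            "knn_label": label,          # coarse evidence gathered across tweets
            "knn_conf": round(conf, 2),
        })
    return feats

# Toy labeled tweets in BIO format (invented for illustration only).
labeled = [
    (["i", "love", "new", "york", "pizza"], ["O", "O", "B-LOC", "I-LOC", "O"]),
    (["flying", "to", "new", "york", "today"], ["O", "O", "B-LOC", "I-LOC", "O"]),
]
# The KNN "memory": one (feature vector, label) pair per labeled token.
memory = [
    (context_features(toks, i), lab)
    for toks, labs in labeled
    for i, lab in enumerate(labs)
]

feats = crf_features(["back", "to", "new", "york", "tonight"], memory)
for f in feats:
    print(f["word"], f["knn_label"], f["knn_conf"])
```

In a fuller pipeline, each dict in `feats` would be handed to a CRF library together with the usual lexical and gazetteer features, so that the sequence model can weigh the cross-tweet KNN evidence against the local context of the tweet.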

    Published In

    ACM Transactions on Intelligent Systems and Technology, Volume 4, Issue 1
    Special section on twitter and microblogging services, social recommender systems, and CAMRa2010: Movie recommendation in context
    January 2013
    357 pages
    ISSN:2157-6904
    EISSN:2157-6912
    DOI:10.1145/2414425
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from ACM.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 01 February 2013
    Accepted: 01 October 2012
    Revised: 01 September 2012
    Received: 01 June 2011
    Published in TIST Volume 4, Issue 1

    Author Tags

    1. Semisupervised learning
    2. model combination
    3. tweet normalization

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Article Metrics

    • Downloads (Last 12 months): 11
    • Downloads (Last 6 weeks): 0
    Reflects downloads up to 01 Mar 2025

    Cited By

    • (2025) PRAGyan - Connecting the Dots in Tweets. Social Networks Analysis and Mining, 338-354. DOI: 10.1007/978-3-031-78548-1_25. Online publication date: 24-Jan-2025
    • (2024) Protection of Guizhou Miao batik culture based on knowledge graph and deep learning. Heritage Science 12:1. DOI: 10.1186/s40494-024-01317-y. Online publication date: 14-Jun-2024
    • (2024) Who Validates the Validators? Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences. Proceedings of the 37th Annual ACM Symposium on User Interface Software and Technology, 1-14. DOI: 10.1145/3654777.3676450. Online publication date: 13-Oct-2024
    • (2024) Handwritten text recognition and information extraction from ancient manuscripts using deep convolutional and recurrent neural network. Soft Computing - A Fusion of Foundations, Methodologies and Applications 28:20, 12249-12268. DOI: 10.1007/s00500-024-09930-6. Online publication date: 1-Oct-2024
    • (2023) Predicting Location of Tweets Using Machine Learning Approaches. Applied Sciences 13:5, 3025. DOI: 10.3390/app13053025. Online publication date: 26-Feb-2023
    • (2022) We Know You Are Living in Bali: Location Prediction of Twitter Users Using BERT Language Model. Big Data and Cognitive Computing 6:3, 77. DOI: 10.3390/bdcc6030077. Online publication date: 7-Jul-2022
    • (2022) Using Approximately Coupled Tensor Factorization to Model Changing User Preferences for Movie Recommendations. Proceedings of the 8th International Conference on Computing and Artificial Intelligence, 293-300. DOI: 10.1145/3532213.3532257. Online publication date: 18-Mar-2022
    • (2022) Capturing stance dynamics in social media: open challenges and research directions. International Journal of Digital Humanities 3:1-3, 115-135. DOI: 10.1007/s42803-022-00043-w. Online publication date: 8-Mar-2022
    • (2022) A Two-Stage NER Method for Online-Sale Comments. Applications of Decision Science in Management, 283-290. DOI: 10.1007/978-981-19-2768-3_26. Online publication date: 8-Sep-2022
    • (2022) A Two-Stage NER Method for Outstanding Papers in MCM. Artificial Intelligence in Education: Emerging Technologies, Models and Applications, 41-50. DOI: 10.1007/978-981-16-7527-0_3. Online publication date: 18-Mar-2022
