NER in Tweets Using Bagging and a Small Crowdsourced Dataset

Fromreide, Hege; Søgaard, Anders

doi:10.1007/978-3-319-10888-9_5

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 8686)

Included in the following conference series:

International Conference on Natural Language Processing

Abstract

Named entity recognition (NER) systems for Twitter are very sensitive to cross-sample variation, and the performance of off-the-shelf systems varies from reasonable (F₁: 60–70%) to completely useless (F₁: 40–50%) across available Twitter datasets. This paper introduces a semi-supervised wrapper method for robust learning of sequential problems with many negative examples, such as NER, and shows that a simple conditional random fields (CRF) model combined with a small crowdsourced dataset [4] leads to good NER performance across datasets.

References

Breiman, L.: Bagging predictors. Machine Learning 24(2), 123–140 (1996)
Collins, M.: Discriminative training methods for Hidden Markov Models. In: EMNLP (2002)
Eisenstein, J.: What to do about bad language on the internet. In: NAACL (2013)
Finin, T., Murnane, W., Karandikar, A., Keller, N., Martineau, J., Dredze, M.: Annotating named entities in Twitter data with crowdsourcing. In: NAACL Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk (2010)
Finkel, J., Grenager, T., Manning, C.: Incorporating non-local information into information extraction systems by Gibbs sampling. In: ACL (2005)
Hovy, D., Berg-Kirkpatrick, T., Vaswani, A., Hovy, E.: Learning whom to trust with MACE. In: NAACL (2013)
Lafferty, J., McCallum, A., Pereira, F.: Conditional random fields: probabilistic models for segmenting and labeling sequence data. In: ICML (2001)
Liu, X., Zhang, S., Wei, F., Zhou, M.: Recognizing named entities in tweets. In: ACL (2011)
Owoputi, O., O’Connor, B., Dyer, C., Gimpel, K., Schneider, N., Smith, N.A.: Improved part-of-speech tagging for online conversational text with word clusters. In: NAACL (2013)
Piskorski, J., Ehrmann, M.: Named entity recognition in targeted Twitter streams in Polish. In: ACL Workshop on Balto-Slavic NLP (2013)
Poibeau, T., Kosseim, L.: Proper name extraction from non-journalistic texts. In: CLIN (2000)
Ritter, A., Clark, S., Mausam, Etzioni, O.: Named entity recognition in tweets: an experimental study. In: EMNLP (2011)
Rodrigues, F., Pereira, F., Ribeiro, B.: Sequence labeling with multiple annotators. Machine Learning, 1–17 (2013)
Sha, F., Pereira, F.: Shallow parsing with conditional random fields. In: HLT-NAACL (2003)
Suzuki, J., Isozaki, H.: Semi-supervised sequential labeling and segmentation using giga-word scale unlabeled data. In: ACL, Columbus, Ohio, pp. 665–673 (2008)
Turian, J., Ratinov, L., Bengio, Y.: Word representations: a simple and general method for semi-supervised learning. In: ACL (2010)
Wang, C.-K., Hsu, B.-J., Chang, M.-W., Kiciman, E.: Simple and knowledge-intensive generative model for named entity recognition. Technical report, Microsoft Research (2013)

Author information

Authors and Affiliations

Center for Language Technology, University of Copenhagen, Denmark
Hege Fromreide & Anders Søgaard

Editor information

Editors and Affiliations

Institute of Computer Science, Polish Academy of Sciences, ul. Jana Kazimierza 5, 01-248, Warsaw, Poland
Adam Przepiórkowski & Maciej Ogrodniczuk

Cite this paper

Fromreide, H., Søgaard, A. (2014). NER in Tweets Using Bagging and a Small Crowdsourced Dataset. In: Przepiórkowski, A., Ogrodniczuk, M. (eds) Advances in Natural Language Processing. NLP 2014. Lecture Notes in Computer Science, vol 8686. Springer, Cham. https://doi.org/10.1007/978-3-319-10888-9_5

DOI: https://doi.org/10.1007/978-3-319-10888-9_5
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-10887-2
Online ISBN: 978-3-319-10888-9
eBook Packages: Computer Science, Computer Science (R0)
