DOI: 10.1145/2872518.2891065
Tutorial

Automatic Entity Recognition and Typing in Massive Text Corpora

Published: 11 April 2016

Abstract

In today's computerized and information-based society, we are inundated with vast amounts of natural language text data, ranging from news articles, product reviews, and advertisements to a wide range of user-generated content from social media. To turn such massive unstructured text data into actionable knowledge, one of the grand challenges is to gain an understanding of entities and the relationships between them. In this tutorial, we introduce data-driven methods to recognize typed entities of interest in different kinds of text corpora (especially massive, domain-specific text corpora). These methods can automatically identify token spans as entity mentions in text and label their types (e.g., person, product, food) in a scalable way. We demonstrate on real datasets, including news articles and Yelp reviews, how these typed entities aid in knowledge discovery and management.
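
To make the task concrete, the sketch below shows what "identifying token spans as entity mentions and labeling their types" can look like in its simplest form. It is only an illustration under stated assumptions, not the method presented in the tutorial: it uses a greedy longest-match lookup against a tiny hand-written dictionary, and the names `TYPE_DICTIONARY` and `tag_entities` as well as the example sentence are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical toy dictionary: lowercased token sequences mapped to a type.
# A real system would derive such seeds from a knowledge base or learn them.
TYPE_DICTIONARY: Dict[Tuple[str, ...], str] = {
    ("barack", "obama"): "person",
    ("iphone",): "product",
    ("pad", "thai"): "food",
}


def tag_entities(
    tokens: List[str],
    dictionary: Dict[Tuple[str, ...], str],
    max_len: int = 3,
) -> List[Tuple[int, int, str]]:
    """Return (start, end, type) spans, with `end` exclusive.

    At each position the longest dictionary match of up to `max_len`
    tokens is taken (greedy longest match); unmatched tokens are skipped.
    """
    spans: List[Tuple[int, int, str]] = []
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate span first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(t.lower() for t in tokens[i:i + n])
            if key in dictionary:
                spans.append((i, i + n, dictionary[key]))
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return spans


tokens = "Barack Obama ordered pad thai and checked his iPhone".split()
print(tag_entities(tokens, TYPE_DICTIONARY))
# prints [(0, 2, 'person'), (3, 5, 'food'), (8, 9, 'product')]
```

A lookup like this cannot handle mentions missing from the dictionary or ambiguous surface forms, which is the gap that data-driven recognition and typing methods aim to close.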

    Published In

    WWW '16 Companion: Proceedings of the 25th International Conference Companion on World Wide Web
    April 2016
    1094 pages
    ISBN: 9781450341448
    Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

    Sponsors

    • IW3C2: International World Wide Web Conference Committee

    Publisher

    International World Wide Web Conferences Steering Committee

    Republic and Canton of Geneva, Switzerland

    Author Tags

    1. entity recognition and typing
    2. massive text corpora

    Qualifiers

    • Tutorial

    Funding Sources

    • U.S. Army Research Lab.
    • National Science Foundation

    Conference

    WWW '16
    Sponsor:
    • IW3C2
    WWW '16: 25th International World Wide Web Conference
    April 11 - 15, 2016
    Montréal, Québec, Canada

    Acceptance Rates

    WWW '16 Companion Paper Acceptance Rate: 115 of 727 submissions, 16%
    Overall Acceptance Rate: 1,899 of 8,196 submissions, 23%

    Cited By

    • (2022) Mining Structures of Factual Knowledge from Text. Online publication date: 24-Mar-2022
    • (2020) Searching the Web for Cross-lingual Parallel Data. Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 2417-2420. DOI: 10.1145/3397271.3401417. Online publication date: 25-Jul-2020
    • (2020) Fine Grained Named Entity Recognition via Seq2seq Framework. IEEE Access 8, pp. 53953-53961. DOI: 10.1109/ACCESS.2020.2980431. Online publication date: 2020
    • (2019) SANE 2.0. Engineering Applications of Artificial Intelligence 84:C, pp. 11-17. DOI: 10.1016/j.engappai.2019.05.007. Online publication date: 1-Sep-2019
    • (2018) Construction and Applications of TeKnowbase. Companion Proceedings of the The Web Conference 2018, pp. 1023-1030. DOI: 10.1145/3184558.3191532. Online publication date: 23-Apr-2018
    • (2018) Foundations of Temporal Text Networks. Applied Network Science 3:1. DOI: 10.1007/s41109-018-0082-3. Online publication date: 13-Aug-2018
    • (2017) One-shot learning for fine-grained relation extraction via convolutional siamese neural network. 2017 IEEE International Conference on Big Data (Big Data), pp. 2194-2199. DOI: 10.1109/BigData.2017.8258168. Online publication date: Dec-2017
