tutorial

Automatic Entity Recognition and Typing in Massive Text Corpora

Authors:

Ahmed El-Kishky,

Jiawei Han

WWW '16 Companion: Proceedings of the 25th International Conference Companion on World Wide Web

Pages 1025 - 1028

https://doi.org/10.1145/2872518.2891065

Published: 11 April 2016

Abstract

In today's computerized and information-based society, we are inundated with vast amounts of natural language text data, ranging from news articles, product reviews, and advertisements to a wide range of user-generated content from social media. To turn such massive unstructured text data into actionable knowledge, one of the grand challenges is to gain an understanding of entities and the relationships between them. In this tutorial, we introduce data-driven methods to recognize typed entities of interest in different kinds of text corpora (especially in massive, domain-specific text corpora). These methods can automatically identify token spans as entity mentions in text and label their types (e.g., person, product, food) in a scalable way. We demonstrate on real datasets, including news articles and Yelp reviews, how these typed entities aid in knowledge discovery and management.
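To make the task in the abstract concrete, the following is a minimal sketch of what "identify token spans as entity mentions and label their types" looks like as input and output. It is not the tutorial's method: it uses a plain longest-match lookup against a small seed dictionary, and the names `SEED_TYPES` and `tag_mentions`, the example phrases, and the `max_len` parameter are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical seed dictionary: lowercase phrase -> entity type.
# A real system would derive candidates and types from data, not a fixed list.
SEED_TYPES: Dict[str, str] = {
    "barack obama": "person",
    "iphone": "product",
    "pad thai": "food",
}

Mention = Tuple[int, int, str, str]  # (start token, end token exclusive, surface, type)


def tag_mentions(text: str,
                 seeds: Dict[str, str] = SEED_TYPES,
                 max_len: int = 3) -> List[Mention]:
    """Greedy longest-match tagging of token spans found in the seed dictionary."""
    # Whitespace tokenization with edge punctuation stripped (an assumption
    # made for brevity; real corpora need a proper tokenizer).
    clean = [tok.strip(".,!?;:") for tok in text.split()]
    norm = [tok.lower() for tok in clean]

    mentions: List[Mention] = []
    i = 0
    while i < len(norm):
        matched = False
        # Try the longest candidate span first, then shrink.
        for n in range(min(max_len, len(norm) - i), 0, -1):
            phrase = " ".join(norm[i:i + n])
            if phrase in seeds:
                surface = " ".join(clean[i:i + n])
                mentions.append((i, i + n, surface, seeds[phrase]))
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return mentions


if __name__ == "__main__":
    for mention in tag_mentions("Barack Obama ordered pad thai and checked his iPhone."):
        print(mention)
```

Each result is a typed token span. A fixed dictionary like this misses unseen entities and cannot resolve ambiguous names, which is the gap that data-driven recognition and typing methods aim to close.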



Index Terms

Automatic Entity Recognition and Typing in Massive Text Corpora
1. Information systems
  1. Information systems applications
    1. Data mining


Published In


WWW '16 Companion: Proceedings of the 25th International Conference Companion on World Wide Web

April 2016

1094 pages

ISBN:9781450341448

General Chairs:
Jacqueline Bourdeau, Tele-university (TELUQ), Montreal, QC, Canada
Jim A. Hendler, Rensselaer Polytechnic Institute, Troy, NY, USA
Roger Nkambou, Université du Québec à Montréal, Montreal, QC, Canada

Program Chairs:
Ian Horrocks, University of Oxford, UK
Ben Y. Zhao, University of California at Santa Barbara, CA, USA

Copyright © 2016 Copyright is held by the owner/author(s).

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Sponsors

IW3C2: International World Wide Web Conference Committee

In-Cooperation

SIGWEB: ACM Special Interest Group on Hypertext, Hypermedia, and Web

Publisher

International World Wide Web Conferences Steering Committee

Republic and Canton of Geneva, Switzerland


Qualifiers

Tutorial

Funding Sources

U.S. Army Research Lab.
National Science Foundation

Conference

WWW '16

Sponsor:

IW3C2

WWW '16: 25th International World Wide Web Conference

April 11 - 15, 2016

Montréal, Québec, Canada

Acceptance Rates

WWW '16 Companion Paper Acceptance Rate 115 of 727 submissions, 16%;

Overall Acceptance Rate 1,899 of 8,196 submissions, 23%


Cited By

Ren X, Han J (2022). Mining Structures of Factual Knowledge from Text. Online publication date: 24-Mar-2022.
El-Kishky A, Koehn P, Schwenk H, Huang J, Chang Y, Cheng X, Kamps J, Murdock V, Wen J, Liu Y (2020). Searching the Web for Cross-lingual Parallel Data. Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 2417-2420. doi:10.1145/3397271.3401417. Online publication date: 25-Jul-2020.
Zhu H, He C, Fang Y, Xiao W (2020). Fine Grained Named Entity Recognition via Seq2seq Framework. IEEE Access, 8, pages 53953-53961. doi:10.1109/ACCESS.2020.2980431. Online publication date: 2020.
Lal A, C. R (2019). SANE 2.0. Engineering Applications of Artificial Intelligence, 84:C, pages 11-17. doi:10.1016/j.engappai.2019.05.007. Online publication date: 1-Sep-2019.
Upadhyay P, Bindal A, Kumar M, Ramanath M, Champin P, Gandon F, Médini L, Lalmas M, Ipeirotis P (2018). Construction and Applications of TeKnowbase. Companion Proceedings of the The Web Conference 2018, pages 1023-1030. doi:10.1145/3184558.3191532. Online publication date: 23-Apr-2018.
Vega D, Magnani M (2018). Foundations of Temporal Text Networks. Applied Network Science, 3:1. doi:10.1007/s41109-018-0082-3. Online publication date: 13-Aug-2018.
Yuan J, Guo H, Jin Z, Jin H, Zhang X, Luo J (2017). One-shot learning for fine-grained relation extraction via convolutional siamese neural network. 2017 IEEE International Conference on Big Data (Big Data), pages 2194-2199. doi:10.1109/BigData.2017.8258168. Online publication date: Dec-2017.
