Effective named entity recognition for idiosyncratic web collections

Authors: Roman Prokofyev, Gianluca Demartini, Philippe Cudré-Mauroux

WWW '14: Proceedings of the 23rd international conference on World wide web

Pages 397 - 408

https://doi.org/10.1145/2566486.2568013

Published: 07 April 2014

Abstract

Named Entity Recognition (NER) plays an important role in a variety of online information management tasks, including text categorization, document clustering, and faceted search. While recent NER systems can achieve near-human performance on certain documents such as news articles, they remain highly domain-specific and thus cannot effectively identify entities such as original technical concepts in scientific documents. In this work, we propose novel approaches for NER on distinctive document collections (such as scientific articles) based on n-gram inspection and classification. We design and evaluate several entity recognition features, ranging from well-known part-of-speech tags to n-gram co-location statistics and decision trees, to classify candidates. In addition, we show how external knowledge bases (either specific, like DBLP, or generic, like DBpedia) can be leveraged to improve the effectiveness of NER for idiosyncratic collections. We evaluate our system on two test collections created from a set of Computer Science and Physics papers and compare it against state-of-the-art supervised methods. Experimental results show that a careful combination of the features we propose yields up to 85% NER accuracy over scientific collections and substantially outperforms state-of-the-art approaches such as those based on maximum entropy.
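
The abstract names n-gram inspection and n-gram co-location statistics but gives no formulas. The sketch below is therefore only an illustration under stated assumptions, not the authors' implementation: it enumerates contiguous n-grams as entity candidates and scores adjacent word pairs with pointwise mutual information (PMI), which is an assumed stand-in for a co-location statistic. The function names `tokenize`, `ngram_candidates`, and `bigram_pmi` are hypothetical.

```python
from collections import Counter
import math
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def ngram_candidates(tokens, max_n=3):
    """Return every contiguous n-gram (as a tuple) for n = 1..max_n."""
    return [
        tuple(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]


def bigram_pmi(tokens):
    """Score each adjacent token pair with PMI (assumed statistic).

    PMI = log2( P(w1, w2) / (P(w1) * P(w2)) ), with probabilities
    estimated from raw counts in this token sequence only.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)
    scores = {}
    for (w1, w2), count in bigrams.items():
        p_pair = count / n_bi
        p_w1 = unigrams[w1] / n_uni
        p_w2 = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores


if __name__ == "__main__":
    tokens = tokenize("Dark matter halos host galaxies; dark matter dominates.")
    print(len(ngram_candidates(tokens)), "candidate n-grams")
    for pair, score in sorted(bigram_pmi(tokens).items(), key=lambda kv: -kv[1]):
        print(pair, round(score, 3))
```

In a fuller pipeline, scores like these would be one feature among several (for example part-of-speech tags or knowledge-base lookups) passed to a classifier; that combination step is not shown here.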

Index Terms

Effective named entity recognition for idiosyncratic web collections
1. Applied computing
  1. Document management and text processing
2. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
      1. Language resources

Published In

WWW '14: Proceedings of the 23rd international conference on World wide web

April 2014

926 pages

ISBN: 9781450327442

DOI: 10.1145/2566486

General Chair: Chin-Wan Chung (Korea Advanced Institute of Science and Technology, Korea)

Program Chairs: Andrei Broder (Google Inc., USA), Kyuseok Shim (Seoul National University, Korea), Torsten Suel (New York University, USA)

Copyright © 2014 held by the International World Wide Web Conference Committee (IW3C2).

Sponsors

IW3C2: International World Wide Web Conference Committee

In-Cooperation

SIGWEB: ACM Special Interest Group on Hypertext, Hypermedia, and Web

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

Swiss National Science Foundation

Conference

WWW '14

Sponsor:

IW3C2

WWW '14: 23rd International World Wide Web Conference

April 7 - 11, 2014

Seoul, Korea

Acceptance Rates

WWW '14 Paper Acceptance Rate 84 of 645 submissions, 13%;

Overall Acceptance Rate 1,899 of 8,196 submissions, 23%

Bibliometrics & Citations

Bibliometrics

Article Metrics

Total Citations: 15
Total Downloads: 504

Downloads (Last 12 months): 5
Downloads (Last 6 weeks): 3

Reflects downloads up to 18 Feb 2025

Citations

Cited By

Jain N, Sierra-Múnera A, Ehmueller J, Krestel R (2022). Generation of training data for named entity recognition of artworks. Semantic Web 14:2 (239-260). Online publication date: 15-Dec-2022.
https://doi.org/10.3233/SW-223177
Shah S, Ali Masood M, Yasin A (2022). Dark Web: E-Commerce Information Extraction Based on Name Entity Recognition Using Bidirectional-LSTM. IEEE Access 10 (99633-99645). Online publication date: 2022.
https://doi.org/10.1109/ACCESS.2022.3206539
Gerlach M, Miller M, Ho R, Harlan K, Difallah D, Demartini G, Zuccon G, Culpepper J, Huang Z, Tong H (2021). Multilingual Entity Linking System for Wikipedia with a Machine-in-the-Loop Approach. Proceedings of the 30th ACM International Conference on Information & Knowledge Management (3818-3827). Online publication date: 26-Oct-2021.
https://dl.acm.org/doi/10.1145/3459637.3481939
Azzini A, Barbon S, Bellandi V, Catarci T, Ceravolo P, Cudré-Mauroux P, Maghool S, Pokorny J, Scannapieco M, Sedes F, Tavares G, Wrembel R (2021). Advances in Data Management in the Big Data Era. Advancing Research in Information and Communication Technology (99-126). Online publication date: 4-Aug-2021.
https://doi.org/10.1007/978-3-030-81701-5_4
Cudré-Mauroux P (2020). Leveraging Knowledge Graphs for Big Data Integration. Semantic Web 11:1 (13-17). Online publication date: 1-Jan-2020.
https://dl.acm.org/doi/10.3233/SW-190371
Jain N, Krestel R (2019). Who is Mona L.? Identifying Mentions of Artworks in Historical Archives. Digital Libraries for Open Knowledge (115-122). Online publication date: 30-Aug-2019.
https://doi.org/10.1007/978-3-030-30760-8_10
Hoxha K, Baxhaku A (2018). An Automatically Generated Annotated Corpus for Albanian Named Entity Recognition. Cybernetics and Information Technologies 18:1 (95-108). Online publication date: 30-Mar-2018.
https://doi.org/10.2478/cait-2018-0009
Yan J, Wang C, Cheng W, Gao M, Zhou A (2018). A retrospective of knowledge graphs. Frontiers of Computer Science: Selected Publications from Chinese Universities 12:1 (55-74). Online publication date: 1-Feb-2018.
https://dl.acm.org/doi/10.1007/s11704-016-5228-9
Yan E, Williams J, Chen Z (2017). Understanding disciplinary vocabularies using a full-text enabled domain-independent term extraction approach. PLOS ONE 12:11 (e0187762). Online publication date: 29-Nov-2017.
https://doi.org/10.1371/journal.pone.0187762
Prokofyev R, Luggen M, Difallah D, Cudré-Mauroux P (2017). SwissLink. Proceedings of the 13th International Conference on Semantic Systems (65-72). Online publication date: 11-Sep-2017.
https://dl.acm.org/doi/10.1145/3132218.3132234
