DOI: 10.1145/2566486.2568013

Effective named entity recognition for idiosyncratic web collections

Published: 07 April 2014


Named Entity Recognition (NER) plays an important role in a variety of online information management tasks, including text categorization, document clustering, and faceted search. While recent NER systems can achieve near-human performance on certain documents such as news articles, they remain highly domain-specific and thus cannot effectively identify entities such as original technical concepts in scientific documents. In this work, we propose novel approaches for NER on distinctive document collections (such as scientific articles) based on n-gram inspection and classification. We design and evaluate several entity recognition features (ranging from well-known part-of-speech tags to n-gram co-location statistics and decision trees) to classify candidates. In addition, we show how external knowledge bases (either specific, like DBLP, or generic, like DBPedia) can be leveraged to improve the effectiveness of NER for idiosyncratic collections. We evaluate our system on two test collections created from a set of Computer Science and Physics papers and compare it against state-of-the-art supervised methods. Experimental results show that a careful combination of the features we propose yields up to 85% NER accuracy over scientific collections and substantially outperforms state-of-the-art approaches such as those based on maximum entropy.
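The n-gram inspection idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the feature set, the choice to start from bigrams, and the pointwise-mutual-information score used as a co-location statistic are all assumptions made for illustration. The knowledge base is a plain set of strings standing in for a lookup against a resource such as DBLP or DBPedia, and the classification step (for example, a decision tree trained on these features) is left out.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (hyphenated words stay whole)."""
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())


def ngrams(tokens, n):
    """Return all contiguous n-grams of the token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def candidate_features(text, knowledge_base, max_n=3):
    """Map each candidate n-gram (2 <= n <= max_n) to a small feature dictionary.

    Hypothetical feature set, chosen for illustration only:
      frequency - how often the n-gram occurs in the text
      length    - number of tokens in the n-gram
      pmi       - log ratio of the n-gram probability to the product of its
                  unigram probabilities (a simple co-location statistic)
      in_kb     - whether the phrase appears in the stand-in knowledge base
    Simplification: n-grams are allowed to cross sentence boundaries.
    """
    tokens = tokenize(text)
    unigram_counts = Counter(tokens)
    features = {}
    for n in range(2, max_n + 1):
        grams = ngrams(tokens, n)
        for gram, freq in Counter(grams).items():
            p_gram = freq / len(grams)
            p_independent = math.prod(unigram_counts[w] / len(tokens) for w in gram)
            features[gram] = {
                "frequency": freq,
                "length": n,
                "pmi": math.log(p_gram / p_independent),
                "in_kb": " ".join(gram) in knowledge_base,
            }
    return features
```

Calling `candidate_features(document_text, {"named entity recognition"})` returns one feature dictionary per candidate n-gram. In a fuller pipeline these dictionaries would be turned into vectors and passed to a supervised classifier that decides which candidates are entities.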








Published In

WWW '14: Proceedings of the 23rd international conference on World wide web
April 2014
926 pages


  • IW3C2: International World Wide Web Conference Committee



Association for Computing Machinery

New York, NY, United States




Author Tags

  1. named entity recognition
  2. term recognition
  3. text mining


  • Research-article




Acceptance Rates

WWW '14 Paper Acceptance Rate 84 of 645 submissions, 13%;
Overall Acceptance Rate 1,899 of 8,196 submissions, 23%




Article Metrics

  • Downloads (Last 12 months): 5
  • Downloads (Last 6 weeks): 3
Reflects downloads up to 18 Feb 2025



Cited By

  • (2022) Generation of training data for named entity recognition of artworks. Semantic Web 14:2 (239-260). DOI: 10.3233/SW-223177. Online publication date: 15-Dec-2022
  • (2022) Dark Web: E-Commerce Information Extraction Based on Name Entity Recognition Using Bidirectional-LSTM. IEEE Access 10 (99633-99645). DOI: 10.1109/ACCESS.2022.3206539. Online publication date: 2022
  • (2021) Multilingual Entity Linking System for Wikipedia with a Machine-in-the-Loop Approach. Proceedings of the 30th ACM International Conference on Information & Knowledge Management (3818-3827). DOI: 10.1145/3459637.3481939. Online publication date: 26-Oct-2021
  • (2021) Advances in Data Management in the Big Data Era. Advancing Research in Information and Communication Technology (99-126). DOI: 10.1007/978-3-030-81701-5_4. Online publication date: 4-Aug-2021
  • (2020) Leveraging Knowledge Graphs for Big Data Integration. Semantic Web 11:1 (13-17). DOI: 10.3233/SW-190371. Online publication date: 1-Jan-2020
  • (2019) Who is Mona L.? Identifying Mentions of Artworks in Historical Archives. Digital Libraries for Open Knowledge (115-122). DOI: 10.1007/978-3-030-30760-8_10. Online publication date: 30-Aug-2019
  • (2018) An Automatically Generated Annotated Corpus for Albanian Named Entity Recognition. Cybernetics and Information Technologies 18:1 (95-108). DOI: 10.2478/cait-2018-0009. Online publication date: 30-Mar-2018
  • (2018) A retrospective of knowledge graphs. Frontiers of Computer Science: Selected Publications from Chinese Universities 12:1 (55-74). DOI: 10.1007/s11704-016-5228-9. Online publication date: 1-Feb-2018
  • (2017) Understanding disciplinary vocabularies using a full-text enabled domain-independent term extraction approach. PLOS ONE 12:11 (e0187762). DOI: 10.1371/journal.pone.0187762. Online publication date: 29-Nov-2017
  • (2017) SwissLink. Proceedings of the 13th International Conference on Semantic Systems (65-72). DOI: 10.1145/3132218.3132234. Online publication date: 11-Sep-2017
