research-article

Provenance-based dictionary refinement in information extraction

Authors: Laura Chiticariu, Vitaly Feldman, Frederick R. Reiss, Huaiyu Zhu

SIGMOD '13: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data

Pages 457 - 468

https://doi.org/10.1145/2463676.2465284

Published: 22 June 2013

Abstract

Dictionaries of terms and phrases (e.g., common person or organization names) are integral to information extraction systems that extract structured information from unstructured text. Using noisy or unrefined dictionaries may lead to many incorrect results even when highly precise and sophisticated extraction rules are used. In general, the results of the system depend on dictionary entries in arbitrarily complex ways, and removal of a set of entries can remove both correct and incorrect results. Further, any such refinement critically requires laborious manual labeling of the results.

In this paper, we study the dictionary refinement problem and address the above challenges. Using provenance of the outputs in terms of the dictionary entries, we formalize an optimization problem of maximizing the quality of the system with respect to the refined dictionaries, study the complexity of this problem, and give efficient algorithms. We also propose solutions to address incomplete labeling of the results, where we estimate the missing labels assuming a statistical model. We conclude with a detailed experimental evaluation using several real-world extractors and competition datasets to validate our solutions. Beyond information extraction, our provenance-based techniques and solutions may find applications in view maintenance in general relational settings.
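
To make the optimization problem concrete, here is a minimal sketch under simplifying assumptions that are not taken from the paper: each labeled result carries its provenance as a set of dictionary entries, a result disappears as soon as any one of those entries is removed, and quality is measured by F1 with recall taken relative to the correct results of the unrefined dictionary. The names `f_score`, `greedy_refine`, and `RESULTS` are hypothetical, and the greedy loop is only an illustrative heuristic, not the algorithm the authors propose.

```python
from typing import FrozenSet, List, Set, Tuple

# Toy model (an assumption): a result is (provenance entries, is_correct).
Result = Tuple[FrozenSet[str], bool]


def f_score(results: List[Result], removed: Set[str]) -> float:
    """F1 of the results that survive after deleting `removed` entries.

    Recall is measured against the correct results of the original dictionary.
    """
    total_correct = sum(1 for _, ok in results if ok)
    # A result survives only if none of its provenance entries was removed.
    surviving = [ok for prov, ok in results if not (prov & removed)]
    true_pos = sum(surviving)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(surviving)
    recall = true_pos / total_correct
    return 2 * precision * recall / (precision + recall)


def greedy_refine(results: List[Result], max_removals: int) -> Tuple[Set[str], float]:
    """Greedily remove the entry that most improves F1; stop when none helps."""
    removed: Set[str] = set()
    candidates: Set[str] = set().union(*(prov for prov, _ in results))
    best = f_score(results, removed)
    for _ in range(max_removals):
        # Score every single additional removal (sorted for determinism).
        gains = {e: f_score(results, removed | {e})
                 for e in sorted(candidates - removed)}
        if not gains:
            break
        entry = max(gains, key=gains.get)
        if gains[entry] <= best:
            break  # no remaining removal improves quality
        removed.add(entry)
        best = gains[entry]
    return removed, best


# Hypothetical labeled results of a person-name extractor.
RESULTS: List[Result] = [
    (frozenset({"john"}), True),
    (frozenset({"john"}), True),
    (frozenset({"april"}), False),
    (frozenset({"april"}), False),
    (frozenset({"april"}), True),   # removing "april" also loses this one
    (frozenset({"mark"}), False),
    (frozenset({"mark", "john"}), True),
]

removed, score = greedy_refine(RESULTS, max_removals=2)
print(sorted(removed), round(score, 3))
```

In this toy data, removing "april" discards two incorrect results and one correct result, which is the trade-off the abstract describes: a single removal can affect both kinds of output, so the refinement has to be chosen against an overall quality measure rather than entry by entry.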

Index Terms

Provenance-based dictionary refinement in information extraction
1. Applied computing
  1. Document management and text processing
2. Information systems

Published In

SIGMOD '13: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data

June 2013

1322 pages

ISBN:9781450320375

DOI:10.1145/2463676

General Chairs: Kenneth Ross (Columbia University), Divesh Srivastava (AT&T Research)

Program Chair: Dimitris Papadias (HKUST)

Copyright © 2013 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

SIGMOD/PODS'13: International Conference on Management of Data

Sponsor: SIGMOD

June 22 - 27, 2013

New York, New York, USA

Acceptance Rates

SIGMOD '13 Paper Acceptance Rate 76 of 372 submissions, 20%;

Overall Acceptance Rate 785 of 4,003 submissions, 20%

Bibliometrics & Citations

Total Citations: 8

Total Downloads: 457

Downloads (Last 12 months): 6

Downloads (Last 6 weeks): 0

Reflects downloads up to 05 Mar 2025

Cited By

Tzimas G, Zotos N, Mourelatos E, Giotopoulos K, Zervas P (2024). From Data to Insight: Transforming Online Job Postings into Labor-Market Intelligence. Information 15:8 (496). Online publication date: 20-Aug-2024.
https://doi.org/10.3390/info15080496

Kassaie B, Tompa F, Borghoff U, Schimmler S (2019). Predictable and Consistent Information Extraction. Proceedings of the ACM Symposium on Document Engineering 2019 (1-10). Online publication date: 23-Sep-2019.
https://dl.acm.org/doi/10.1145/3342558.3345391

Chiticariu L, Danilevsky M, Ho H, Krishnamurthy R, Li Y, Raghavan S, Reiss F, Vaithyanathan S, Zhu H (2018). Web Information Extraction. Encyclopedia of Database Systems (4620-4629). Online publication date: 7-Dec-2018.
https://doi.org/10.1007/978-1-4614-8265-9_459

Herschel M, Diestelkämper R, Ben Lahmar H (2017). A survey on provenance. The VLDB Journal: The International Journal on Very Large Data Bases 26:6 (881-906). Online publication date: 1-Dec-2017.
https://dl.acm.org/doi/10.1007/s00778-017-0486-1

Chiticariu L, Danilevsky M, Ho H, Krishnamurthy R, Li Y, Raghavan S, Reiss F, Vaithyanathan S, Zhu H (2017). Web Information Extraction. Encyclopedia of Database Systems (1-9). Online publication date: 27-Jan-2017.
https://doi.org/10.1007/978-1-4899-7993-3_459-2

Chiang F, Andritsos P, Miller R (2016). Data Driven Discovery of Attribute Dictionaries. Transactions on Computational Collective Intelligence XXI - Volume 9630 (69-96). Online publication date: 1-Jan-2016.
https://dl.acm.org/doi/10.5555/3090176.3090180

Fagin R, Kimelfeld B, Reiss F, Vansummeren S (2016). Declarative Cleaning of Inconsistencies in Information Extraction. ACM Transactions on Database Systems 41:1 (1-44). Online publication date: 7-Apr-2016.
https://dl.acm.org/doi/10.1145/2877202

Chen Z, Cafarella M, Jagadish H, Bennett P, Josifovski V, Neville J, Radlinski F (2016). Long-tail Vocabulary Dictionary Extraction from the Web. Proceedings of the Ninth ACM International Conference on Web Search and Data Mining (625-634). Online publication date: 8-Feb-2016.
https://dl.acm.org/doi/10.1145/2835776.2835778

Chiang F, Andritsos P, Miller R (2016). Data Driven Discovery of Attribute Dictionaries. Transactions on Computational Collective Intelligence XXI (69-96). Online publication date: 2016.
https://doi.org/10.1007/978-3-662-49521-6_4
