ABSTRACT
Dictionaries of terms and phrases (e.g. common person or organization names) are integral to information extraction systems that extract structured information from unstructured text. Using noisy or unrefined dictionaries may lead to many incorrect results even when highly precise and sophisticated extraction rules are used. In general, the results of the system depend on dictionary entries in arbitrarily complex ways, and removing a set of entries can remove both correct and incorrect results. Further, any such refinement critically requires laborious manual labeling of the results.
In this paper, we study the dictionary refinement problem and address the above challenges. Using the provenance of the outputs in terms of the dictionary entries, we formalize an optimization problem of maximizing the quality of the system with respect to the refined dictionaries, study the complexity of this problem, and give efficient algorithms. We also propose solutions for incomplete labeling of the results, in which we estimate the missing labels under a statistical model. We conclude with a detailed experimental evaluation using several real-world extractors and competition datasets to validate our solutions. Beyond information extraction, our provenance-based techniques and solutions may find applications in view maintenance in general relational settings.
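To make the optimization problem concrete, the following is a minimal sketch of provenance-based refinement under simplifying assumptions that are not taken from the paper: each labeled result carries the set of dictionary entries it depends on, a result disappears as soon as any one of those entries is removed, quality is measured by the F1 score, and entries are chosen by a plain greedy loop. The function names `f1_after_removal` and `greedy_refine` and the sample data are hypothetical, and the greedy heuristic stands in for, rather than reproduces, the algorithms the paper gives.

```python
def f1_after_removal(results, removed):
    """F1 score of the results that survive once `removed` entries are dropped.

    results: list of (frozenset_of_entries, is_correct) pairs.
    Assumption: a result survives only if none of its entries is removed.
    Recall is measured against all correct results in the original output.
    """
    total_correct = sum(1 for _, ok in results if ok)
    kept = [ok for entries, ok in results if not (entries & removed)]
    true_kept = sum(kept)
    # F1 = 2PR / (P + R) simplifies to 2*TP / (|kept| + |all correct|).
    denom = len(kept) + total_correct
    return 2 * true_kept / denom if denom else 0.0


def greedy_refine(results):
    """Greedily remove the entry that most improves F1 until none helps."""
    entries = set().union(*(e for e, _ in results))
    removed = set()
    best = f1_after_removal(results, removed)
    while True:
        scored = [
            (f1_after_removal(results, removed | {c}), c)
            for c in sorted(entries - removed)
        ]
        if not scored:
            break
        top_score, top_entry = max(scored)
        if top_score <= best:
            break  # no single removal improves quality any further
        removed.add(top_entry)
        best = top_score
    return removed, best


# Hypothetical labeled output of a person-name extractor.
sample = [
    (frozenset({"april"}), False),
    (frozenset({"april"}), False),
    (frozenset({"april"}), True),
    (frozenset({"john"}), True),
    (frozenset({"john"}), True),
    (frozenset({"chelsea"}), False),
    (frozenset({"chelsea"}), True),
]

removed, score = greedy_refine(sample)
print(removed, score)  # {'april'} 0.75
```

In this toy input, dropping "april" discards two incorrect results at the cost of one correct one and raises F1 from 8/11 to 0.75, whereas dropping "chelsea" as well would lower it. That is the trade-off the abstract describes: removing an entry can remove both correct and incorrect results.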