DOI: 10.1145/2463676.2465284
research-article

Provenance-based dictionary refinement in information extraction

Published: 22 June 2013

ABSTRACT

Dictionaries of terms and phrases (e.g., common person or organization names) are integral to information extraction systems that extract structured information from unstructured text. Using noisy or unrefined dictionaries may lead to many incorrect results even when highly precise and sophisticated extraction rules are used. In general, the results of the system depend on dictionary entries in arbitrarily complex ways, and removing a set of entries can remove both correct and incorrect results. Further, any such refinement critically requires laborious manual labeling of the results.

In this paper, we study the dictionary refinement problem and address the above challenges. Using provenance of the outputs in terms of the dictionary entries, we formalize an optimization problem of maximizing the quality of the system with respect to the refined dictionaries, study the complexity of this problem, and give efficient algorithms. We also propose solutions to address incomplete labeling of the results, where we estimate the missing labels assuming a statistical model. We conclude with a detailed experimental evaluation using several real-world extractors and competition datasets to validate our solutions. Beyond information extraction, our provenance-based techniques and solutions may find applications in view maintenance in general relational settings.
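The refinement problem described in the abstract can be illustrated with a small toy sketch. This is not the paper's algorithm: it is a brute-force search over a hypothetical, fully labeled set of results, and it assumes a deliberately simple provenance model in which a result survives only if none of the dictionary entries in its provenance is removed. The entry names, the labels, and the choice of F1 as the quality measure are all assumptions made for illustration.

```python
from itertools import combinations

# Hypothetical toy data: each extracted result records the dictionary entries
# it was derived from (its provenance) and a manual label (True = correct).
RESULTS = [
    ({"april"}, False),
    ({"april"}, False),
    ({"smith"}, True),
    ({"smith", "washington"}, True),
    ({"washington"}, False),
    ({"washington"}, True),
    ({"smith"}, True),
]


def f_score(removed, results):
    """F1 of the results that survive after deleting the `removed` entries.

    Assumption: a result survives only if none of its provenance entries is
    removed. Recall is measured against the correct results produced with
    the unrefined dictionary.
    """
    total_correct = sum(1 for _, ok in results if ok)
    surviving = [ok for prov, ok in results if not (prov & removed)]
    true_pos = sum(surviving)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(surviving)
    recall = true_pos / total_correct
    return 2 * precision * recall / (precision + recall)


def best_refinement(results, max_removals):
    """Exhaustively try every removal set of size <= max_removals."""
    entries = sorted(set().union(*(prov for prov, _ in results)))
    best = (f_score(set(), results), frozenset())
    for k in range(1, max_removals + 1):
        for combo in combinations(entries, k):
            score = f_score(set(combo), results)
            if score > best[0]:
                best = (score, frozenset(combo))
    return best


if __name__ == "__main__":
    score, removed = best_refinement(RESULTS, max_removals=2)
    print(sorted(removed), round(score, 3))
```

In this toy data, removing "washington" discards correct and incorrect results alike, while removing "april" discards only incorrect ones, so the search prefers the latter. Exhaustive search is exponential in the number of entries and serves only to make the objective concrete.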


Published in

SIGMOD '13: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
June 2013
1322 pages
ISBN: 9781450320375
DOI: 10.1145/2463676

Copyright © 2013 ACM


Publisher: Association for Computing Machinery, New York, NY, United States


Acceptance Rates

SIGMOD '13 paper acceptance rate: 76 of 372 submissions, 20%
Overall acceptance rate: 785 of 4,003 submissions, 20%
