research-article

Efficiently incorporating user feedback into information extraction and integration programs

Authors:

Jeffrey F. Naughton

SIGMOD '09: Proceedings of the 2009 ACM SIGMOD International Conference on Management of data

Pages 87 - 100

https://doi.org/10.1145/1559845.1559857

Published: 29 June 2009

Abstract

Many applications increasingly employ information extraction and integration (IE/II) programs to infer structure from unstructured data. Automatic IE/II is inherently imprecise, so such programs often make mistakes and can benefit significantly from user feedback. Today, however, there is no good way to provide and process such feedback automatically. On finding an IE/II mistake, users often must alert the development team (e.g., via email or a Web form) and then wait for the team to manually examine the program internals to locate and fix the mistake, which is a slow, error-prone, and frustrating process.

In this paper we propose a solution that lets users provide feedback directly and lets IE/II programs process that feedback automatically. In our solution, a developer U uses hlog, a declarative IE/II language, to write an IE/II program P. Next, U writes declarative user feedback rules that specify which parts of P's data (e.g., input, intermediate, or output data) users can edit, and via which user interfaces. The augmented program P is then executed and enters a loop in which it waits for and incorporates user feedback. Given user feedback F on a portion of P's data, we show how to automatically propagate F to the rest of P and how to seamlessly combine F with prior user feedback. We describe the syntax and semantics of hlog, a baseline execution strategy, and various optimization techniques. Finally, we describe experiments with real-world data that demonstrate the promise of our solution.
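The feedback loop described above can be illustrated with a small Python sketch. It is not hlog and does not reproduce the paper's syntax, semantics, or optimizations. The class `FeedbackProgram`, the stage functions `extract` and `integrate`, and the sample data are all hypothetical names invented for illustration. The sketch assumes the simplest possible strategy: store each user edit against the stage whose data it corrects, then re-run the whole pipeline so the edit propagates downstream, with edits to different rows accumulating across rounds.

```python
from typing import Callable, Dict, List, Optional, Tuple

Row = Tuple[str, ...]
Table = List[Row]
Stage = Tuple[str, Callable[[Table], Table]]


class FeedbackProgram:
    """Hypothetical toy pipeline; an assumption for illustration, not hlog."""

    def __init__(self, stages: List[Stage]) -> None:
        self.stages = stages
        # Per stage: original row -> corrected row, or None to delete the row.
        self.edits: Dict[str, Dict[Row, Optional[Row]]] = {
            name: {} for name, _ in stages
        }

    def add_feedback(self, stage: str, old: Row, new: Optional[Row]) -> None:
        # A later edit to the same row replaces the earlier one;
        # edits to other rows are kept, so feedback accumulates.
        self.edits[stage][old] = new

    def run(self, source: Table) -> Dict[str, Table]:
        # Naive re-execution: recompute every stage, patching each
        # stage's output with stored edits before passing it on.
        results: Dict[str, Table] = {}
        table = source
        for name, fn in self.stages:
            patched: Table = []
            for row in fn(table):
                fixed = self.edits[name].get(row, row)
                if fixed is not None:
                    patched.append(fixed)
            table = patched
            results[name] = table
        return results


def extract(docs: Table) -> Table:
    # Toy extractor: split "<name> works at <org>" into (name, org).
    out: Table = []
    for (text,) in docs:
        name, _, org = text.partition(" works at ")
        out.append((name, org))
    return out


def integrate(people: Table) -> Table:
    # Toy integrator: normalize org names and emit sorted (org, name) pairs.
    return sorted({(org.lower(), name) for name, org in people})


program = FeedbackProgram([("extract", extract), ("integrate", integrate)])
docs = [("Ann works at UW Madison",), ("Bob works at Uw-Madison",)]

first = program.run(docs)
# A user corrects an intermediate (extracted) row; re-running propagates it.
program.add_feedback("extract", ("Bob", "Uw-Madison"), ("Bob", "UW Madison"))
second = program.run(docs)
```

In this sketch the correction to the extracted row changes the integrated output on the next run, because the downstream stage sees the patched data. Re-running everything is only a simple stand-in for a baseline; the incremental techniques the abstract refers to are not modeled here.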

Index Terms

Efficiently incorporating user feedback into information extraction and integration programs
1. Information systems
  1. Data management systems
    1. Database management system engines

Published In

SIGMOD '09: Proceedings of the 2009 ACM SIGMOD International Conference on Management of data

June 2009

1168 pages

ISBN:9781605585512

DOI:10.1145/1559845

Editors: Carsten Binnig, Benoit Dageville

General Chairs: Uğur Çetintemel (Brown University, USA), Stan Zdonik (Brown University, USA)

Program Chair: Donald Kossmann (ETH Zurich, Switzerland)

Copyright © 2009 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

SIGMOD/PODS '09: International Conference on Management of Data

June 29 - July 2, 2009

Providence, Rhode Island, USA

Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%
