DOI: 10.1145/1559845.1559857
research-article

Efficiently incorporating user feedback into information extraction and integration programs

Published: 29 June 2009

Abstract

Many applications increasingly employ information extraction and integration (IE/II) programs to infer structure from unstructured data. Automatic IE/II is inherently imprecise, so such programs often make mistakes and can benefit significantly from user feedback. Today, however, there is no good way to provide and process such feedback automatically. On finding an IE/II mistake, users often must alert the development team (e.g., via email or a Web form) and then wait for the team to manually examine the program internals to locate and fix it, a slow, error-prone, and frustrating process.
In this paper we propose a solution that lets users provide feedback directly and lets IE/II programs process that feedback automatically. In our solution a developer U uses hlog, a declarative IE/II language, to write an IE/II program P. Next, U writes declarative user feedback rules that specify which parts of P's data (e.g., input, intermediate, or output data) users can edit, and via which user interfaces. The augmented program P is then executed and enters a loop of waiting for and incorporating user feedback. Given user feedback F on a data portion of P, we show how to automatically propagate F to the rest of P and to seamlessly combine F with prior user feedback. We describe the syntax and semantics of hlog, a baseline execution strategy, and then various optimization techniques. Finally, we describe experiments with real-world data that demonstrate the promise of our solution.
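
The feedback loop described in the abstract can be illustrated with a small Python sketch. This is not hlog and not the paper's execution strategy: the `Stage` class, the `execute` and `give_feedback` functions, the delete/insert feedback format, and the toy "chairs" extractor are all hypothetical names chosen for illustration. The sketch only shows the general idea of storing user edits alongside each data portion of a program, re-running just the downstream steps after an edit, and keeping earlier edits in effect on later runs.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List, Optional, Set, Tuple

Row = Tuple[str, ...]


@dataclass
class Stage:
    """One step of a toy program P, plus the user edits made to its output."""
    name: str
    run: Callable[[Set[Row]], Set[Row]]
    deleted: Set[Row] = field(default_factory=set)   # rows users removed
    inserted: Set[Row] = field(default_factory=set)  # rows users added

    def output(self, rows: Set[Row]) -> Set[Row]:
        # Recompute the stage, then re-apply all accumulated feedback.
        return (self.run(rows) - self.deleted) | self.inserted


def execute(stages: List[Stage], source: Set[Row],
            cache: Optional[Dict[str, Set[Row]]] = None,
            start: int = 0) -> Dict[str, Set[Row]]:
    """Run stages[start:], reading the input of stage `start` from the cache."""
    cache = {} if cache is None else cache
    rows = source if start == 0 else cache[stages[start - 1].name]
    for stage in stages[start:]:
        rows = stage.output(rows)
        cache[stage.name] = rows
    return cache


def give_feedback(stages: List[Stage], source: Set[Row],
                  cache: Dict[str, Set[Row]], stage_name: str,
                  delete: Iterable[Row] = (),
                  insert: Iterable[Row] = ()) -> Dict[str, Set[Row]]:
    """Record an edit on one stage's output and propagate it downstream."""
    idx = next(i for i, s in enumerate(stages) if s.name == stage_name)
    stage = stages[idx]
    stage.deleted |= set(delete)
    stage.inserted -= set(delete)
    stage.inserted |= set(insert)
    stage.deleted -= set(insert)
    # Only the edited stage and the stages after it are re-executed.
    return execute(stages, source, cache, start=idx)


def extract(docs: Set[Row]) -> Set[Row]:
    # Deliberately naive extractor: "<person> chairs <venue>".
    out = set()
    for _doc_id, text in docs:
        m = re.match(r"(\w+) chairs (\w+)", text)
        if m:
            out.add((m.group(1), m.group(2)))
    return out


def integrate(pairs: Set[Row]) -> Set[Row]:
    # Normalize extracted pairs into a single output relation.
    return {("chairOf", person.lower(), venue.upper()) for person, venue in pairs}


source = {("d1", "Alice chairs SIGMOD"), ("d2", "Bob chairs Monday")}
stages = [Stage("extract", extract), Stage("integrate", integrate)]

cache = execute(stages, source)
initial = dict(cache)

# A user removes a wrong intermediate tuple; the output is updated.
cache = give_feedback(stages, source, cache, "extract",
                      delete=[("Bob", "Monday")])
after_delete = dict(cache)

# A later edit on the output is combined with the earlier one.
cache = give_feedback(stages, source, cache, "integrate",
                      insert=[("chairOf", "carol", "VLDB")])
after_insert = dict(cache)

# A full re-run from scratch still honors all recorded feedback.
rerun = execute(stages, source)
```

In this sketch feedback is kept as per-stage delete and insert sets rather than applied destructively to the data, which is what allows a full re-run to reproduce the edited result. A real system would also need to decide what happens when an upstream change invalidates a row that a user edited downstream; the sketch sidesteps that question.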

    Published In

    SIGMOD '09: Proceedings of the 2009 ACM SIGMOD International Conference on Management of data
    June 2009
    1168 pages
    ISBN:9781605585512
    DOI:10.1145/1559845
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. incremental execution
    2. information extraction
    3. information integration
    4. provenance
    5. user feedback

    Qualifiers

    • Research-article

    Conference

    SIGMOD/PODS '09: International Conference on Management of Data
    June 29 - July 2, 2009
    Providence, Rhode Island, USA

    Acceptance Rates

    Overall Acceptance Rate 785 of 4,003 submissions, 20%
