Self-supervised web search for any-k complete tuples

Authors:

Alexander Löser,

Christoph Nagel,

Stephan Pieper,

Christoph Boden

BEWEB '11: Proceedings of the 2nd International Workshop on Business intelligencE and the WEB

Pages 4 - 11

https://doi.org/10.1145/1966883.1966889

Published: 25 March 2011

Abstract

A common task of Web users is querying structured information from Web pages. In this paper we propose a novel query processor for systematically discovering any-k relations from Web search results with conjunctive queries. The term 'any-k' denotes that retrieved tuples are not ranked by the system.

To realize this scenario, the query processor translates a structured query into keyword queries that are submitted to a search engine, forwards the search results to relation extractors, and then combines the extracted relations into result tuples.
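The three steps of this pipeline (keyword query generation, search, extraction and combination) can be illustrated with a minimal sketch. It is not the authors' implementation: the query template, the function names `to_keyword_queries` and `answer_row`, and the toy search engine and extractor are all assumptions made for illustration only.

```python
from typing import Callable, Dict, Iterable, List, Optional

Row = Dict[str, Optional[str]]


def to_keyword_queries(entity: str, attributes: Iterable[str]) -> Dict[str, str]:
    """Build one keyword query per requested attribute (hypothetical template)."""
    return {attr: f'"{entity}" {attr}' for attr in attributes}


def answer_row(
    entity: str,
    attributes: Iterable[str],
    search: Callable[[str], List[str]],
    extract: Callable[[str, str, str], Optional[str]],
) -> Row:
    """Send keyword queries to `search`, pass pages to `extract`, combine into one row."""
    row: Row = {"entity": entity}
    for attr, keywords in to_keyword_queries(entity, attributes).items():
        row[attr] = None  # stays None if no extractor returns a value
        for page in search(keywords):
            value = extract(page, entity, attr)
            if value is not None:
                row[attr] = value
                break
    return row


# Toy stand-ins for a search engine and a relation extractor (assumptions).
PAGES = {
    '"Acme" ceo': ["Acme ceo: Jane Doe"],
    '"Acme" headquarters': ["Acme news with no facts"],
}


def toy_search(keywords: str) -> List[str]:
    return PAGES.get(keywords, [])


def toy_extract(page: str, entity: str, attr: str) -> Optional[str]:
    prefix = f"{entity} {attr}: "
    return page[len(prefix):] if page.startswith(prefix) else None


row = answer_row("Acme", ["ceo", "headquarters"], toy_search, toy_extract)
print(row)
```

In this toy run the extractor finds a value for `ceo` but none for `headquarters`, so the row stays incomplete. This is the situation the next paragraph of the abstract addresses.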

Unfortunately, relation extractors may fail to return a relation for a result tuple. We propose an information-theoretic approach for retrieving missing attribute values of partially retrieved relations. Moreover, user-defined data sources may return fewer than k complete result tuples. To solve this problem, we extend the Eddy query processing mechanism [14] for our 'querying the Web' scenario with a continuous, adaptive routing model. At any point during query execution, the model determines the most promising incomplete row to process next so that any-k complete result tuples can be returned.
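The idea of repeatedly choosing the most promising incomplete row can be sketched as a greedy loop. The abstract does not specify the routing model, so everything below is an assumption: the `promise` score (the product of estimated per-attribute success rates), the smoothed update rule, the `budget` parameter, and all function names are hypothetical and are not the authors' method.

```python
from typing import Callable, Dict, List, Optional

Row = Dict[str, Optional[str]]


def missing(row: Row) -> List[str]:
    """Attributes of the row that still have no value."""
    return [attr for attr, value in row.items() if value is None]


def promise(row: Row, success_rate: Dict[str, float]) -> float:
    """Assumed score: estimated chance that all missing attributes can be filled."""
    p = 1.0
    for attr in missing(row):
        p *= success_rate.get(attr, 0.5)
    return p


def any_k(
    rows: List[Row],
    k: int,
    fetch: Callable[[Row, str], Optional[str]],
    success_rate: Dict[str, float],
    budget: int,
) -> List[Row]:
    """Greedily complete rows until k are complete or the fetch budget is spent."""
    complete = [r for r in rows if not missing(r)]
    pending = [r for r in rows if missing(r)]
    stats: Dict[str, List[int]] = {}  # attr -> [successes, attempts]
    while len(complete) < k and pending and budget > 0:
        row = max(pending, key=lambda r: promise(r, success_rate))
        attr = missing(row)[0]
        value = fetch(row, attr)
        budget -= 1
        counts = stats.setdefault(attr, [0, 0])
        counts[1] += 1
        if value is not None:
            counts[0] += 1
            row[attr] = value
        # Adapt the estimate from observed outcomes (add-one smoothing).
        success_rate[attr] = (counts[0] + 1) / (counts[1] + 2)
        if not missing(row):
            pending = [r for r in pending if r is not row]
            complete.append(row)
    return complete[:k]


# Usage with a toy fetcher that always succeeds (assumption).
calls: List[str] = []


def toy_fetch(row: Row, attr: str) -> Optional[str]:
    calls.append(f"{row['company']}.{attr}")
    return f"{attr} of {row['company']}"


rows: List[Row] = [
    {"company": "A", "ceo": None, "hq": None},
    {"company": "B", "ceo": "X", "hq": None},
    {"company": "C", "ceo": "Y", "hq": "Z"},
]
result = any_k(rows, k=2, fetch=toy_fetch, success_rate={"ceo": 0.5, "hq": 0.5}, budget=10)
print([r["company"] for r in result], calls)
```

With k = 2, the loop fills the single missing value of row B rather than the two missing values of row A, because B has the higher score under this assumed model. The `budget` parameter bounds the number of fetches, so the loop also ends when rows cannot be completed.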

We report a thorough experimental evaluation over multiple relation extractors. Our experiments demonstrate that our query processor returns complete result tuples while processing only very few Web pages.

References

[1] Markl V., Raman V., Simmen D. E., Lohman G. M., Pirahesh M.: Robust Query Processing through Progressive Optimization. SIGMOD Conference 2004: 659--670

[2] Kasneci G., Ramanath M., Suchanek F. M., Weikum G.: The YAGO-NAGA approach to knowledge discovery. SIGMOD Record 37(4): 41--47 (2008)

[3] Jain, A., Doan, A., and Gravano, L. 2008. Optimizing SQL Queries over Text Databases. ICDE. IEEE Computer Society, Washington, DC, 636--645.

[4] Jain, A. and Srivastava, D. 2009. Exploring a Few Good Tuples from Text Databases. ICDE. IEEE Computer Society, Washington, DC, 616--627.

[5] Jain, A., Ipeirotis, P. G., Doan, A., and Gravano, L. 2009. Join Optimization of Information Extraction Output: Quality Matters! ICDE. IEEE Computer Society, Washington, DC.

[6] Naumann, F. 2002. Quality-Driven Query Answering for Integrated Information Systems. Springer.

[7] Shen, W., DeRose, P., McCann, R., Doan, A., and Ramakrishnan, R. 2008. Toward best-effort information extraction. SIGMOD '08. ACM, New York, NY, 1031--1042.

[8] Agichtein, E. and Gravano, L. 2003. QXtract: a building block for efficient information extraction from Web page collections. SIGMOD '03. ACM, New York, NY, 663--663.

[9] Galhardas, H., Florescu, D., Shasha, D., Simon, E., and Saita, C. 2001. Declarative Data Cleaning: Language, Model, and Algorithms. Very Large Data Bases. Morgan Kaufmann Publishers, Rome, CA, 371--380.

[10] Etzioni, O., Banko M., Soderland S., Weld, D. S.: Open information extraction from the Web. Commun. ACM 51(12): 68--74 (2008)

[11] Ipeirotis, P. G., Agichtein, E., Jain, P., and Gravano, L. 2006. To search or to crawl?: towards a query optimizer for text-centric tasks. SIGMOD '06. ACM, New York, NY

[12] Xin Dong, Alon Y. Halevy, Jayant Madhavan: Reference Reconciliation in Complex Information Spaces. SIGMOD Conference 2005: 85--96

[13] Löser, A, Lutter S., Düssel, P, Markl V. 2009. Ad-hoc Queries over Web page Collections -- a Case Study. BIRTE Workshop at VLDB 2009.

[14] Avnur R., Hellerstein J. M.: Eddies: Continuously Adaptive Query Processing. SIGMOD Conference 2000: 261--272

[15] YahooBoss service. http://developer.yahoo.com/search/boss (Last visited 01/03/10)

[16] OpenCalais. http://www.opencalais.com (Last visited 01/01/11)

[17] Levy A., Mendelzon A. O., Sagiv Y., Srivastava D.: Answering Queries Using Views. PODS 1995: 95--104

[18] Liu J., Dong X., Halevy A. Y.: Answering Structured Queries on Unstructured Data. WebDB 2006

[19] HSQLDB. http://hsqldb.org/ (Last visited 01/01/11)

[20] Fung G., Yu, J., Lu, H.: Discriminative Category Matching: Efficient Text Classification for Huge Document Collections. ICDM 2002: 187--194

[21] Feldman R., Regev Y., Gorodetsky M.: A modular information extraction system. Intell. Data Anal. 12(1): 51--71 (2008)

[22] Löser, A, Hüske F., Markl V. Situational Business Intelligence. BIRTE Workshop at VLDB 2008

[23] Löser, A. Beyond Search: Web-Scale Business Analytics. WISE 2009: 5

[24] Löser, A., Nagel, C., Pieper, S. Augmenting Tables by Self-Supervised Web search. BIRTE Workshop at VLDB 2010

[25] Freebase. www.freebase.com (Last visited 01/01/11)

[26] Bizer C., Heath T., Berners-Lee T.: Linked Data - The Story So Far. Int. J. Semantic Web Inf. Syst. 5(3): 1--22 (2009)

[27] Goolap.info. www.goolap.info (Last visited 01/01/11)

[28] Mitchell T. Machine Learning. McGraw-Hill, 1997.

[29] Ilyas I., Beskales G., and Soliman M. A.: A survey of top-k query processing techniques in relational database systems. ACM Comput. Surv. 40(4): (2008)

[30] Jain A., Pantel P.: FactRank: Random Walks on a Web of Facts. COLING 2010

[31] Boden C., Häfele T., Löser A.: Classification Algorithms for Relation Prediction. DaLi Workshop at ICDE 2011 (forthcoming)

Cited By

Abelló A., Darmont J., Etcheverry L., Golfarelli M., Mazón J., Naumann F., Pedersen T., Rizzi S., Trujillo J., Vassiliadis P., Vossen G. (2013). Fusion Cubes. International Journal of Data Warehousing and Mining 9:2 (66-88). Online publication date: Apr-2013. https://doi.org/10.4018/jdwm.2013040104

Löser A., Arnold S., Fiehn T. (2012). The GoOLAP Fact Retrieval Framework. Business Intelligence (84-97). Online publication date: 2012. https://doi.org/10.1007/978-3-642-27358-2_4

Index Terms

Self-supervised web search for any-k complete tuples
1. Information systems
  1. Data management systems
    1. Database management system engines

Published In

BEWEB '11: Proceedings of the 2nd International Workshop on Business intelligencE and the WEB

March 2011

54 pages

ISBN: 9781450306102

DOI: 10.1145/1966883

Editors:
Jose-Norberto Mazón (University of Alicante, Spain),
Irene Garrigós (University of Alicante, Spain),
Florian Daniel (University of Trento, Italy),
Malu Castellanos (Hewlett-Packard Labs),
Kjell Orsborn (Uppsala University, Sweden),
Silvia Stefanova (Uppsala University, Sweden)

Copyright © 2011 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

Seventh Framework Programme

Conference

EDBT/ICDT '11

EDBT/ICDT '11: EDBT/ICDT '11 joint conference

March 25, 2011

Uppsala, Sweden
