DOI: 10.1145/1966883.1966889
research-article

Self-supervised web search for any-k complete tuples

Published: 25 March 2011

Abstract

A common task of Web users is querying structured information from Web pages. In this paper we propose a novel query processor that systematically discovers any-k relations from Web search results using conjunctive queries. The phrase 'any-k' denotes that the system does not rank the retrieved tuples.
To realize this scenario, the query processor translates a structured query into keyword queries that are submitted to a search engine, forwards the search results to relation extractors, and then combines the extracted relations into result tuples.
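The three stages named above (keyword query generation, relation extraction, and combination into tuples) can be illustrated with a minimal sketch. It is not the authors' implementation: the `Relation` type, the functions `to_keyword_queries` and `run_query`, the query template, and the toy `fake_search` and `fake_extract` stand-ins for a search engine and an extractor are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    """Hypothetical extractor output: one binary relation instance."""
    name: str     # e.g. "ceo_of"
    subject: str  # entity the tuple is about
    obj: str      # attribute value


def to_keyword_queries(relation_names, seed):
    """Assumed template: one keyword query per relation of the conjunctive query."""
    return {n: f'"{seed}" {n.replace("_", " ")}' for n in relation_names}


def run_query(relation_names, seed, search, extract, k):
    """Return up to k unranked tuples that have a value for every relation."""
    found = {}  # subject -> {relation name -> value}
    for name, keywords in to_keyword_queries(relation_names, seed).items():
        for page in search(keywords):          # search engine call (injected)
            for rel in extract(page):          # relation extractor (injected)
                if rel.name == name:
                    found.setdefault(rel.subject, {})[name] = rel.obj
    # Combine: keep only subjects for which every requested relation was found.
    complete = [
        {"subject": s, **vals}
        for s, vals in found.items()
        if all(n in vals for n in relation_names)
    ]
    return complete[:k]


# Toy stand-ins so the sketch runs without network access.
def fake_search(keywords):
    corpus = {
        "ceo of": ["Acme|ceo_of|Ann Lee", "Bolt|ceo_of|Raj Rao"],
        "headquartered in": ["Acme|headquartered_in|Berlin"],
    }
    return [p for key, pages in corpus.items() if key in keywords for p in pages]


def fake_extract(page):
    subject, name, obj = page.split("|")
    return [Relation(name, subject, obj)]


result = run_query(["ceo_of", "headquartered_in"], "company",
                   fake_search, fake_extract, k=5)
```

In this toy run only "Acme" has both relations, so "Bolt" remains an incomplete row and is left out of the result.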
Unfortunately, relation extractors may fail to return a relation for a result tuple. We propose an information-theoretic approach for retrieving the missing attribute values of partially retrieved relations. Moreover, user-defined data sources may return fewer than k complete result tuples. To solve this problem, we extend the Eddy query processing mechanism [14] for our 'querying the Web' scenario with a continuous, adaptive routing model. At any point during query execution, the model determines the most promising incomplete row to process next in order to return any-k complete result tuples.
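The abstract does not specify the routing model, so the following is only a sketch of one way an adaptive policy of this kind could look. It assumes a greedy rule: estimate a per-attribute success rate from observed extraction outcomes, score each incomplete row by the product of the rates of its missing attributes, and process the highest-scoring row next. The class `AdaptiveRouter`, its methods, and the smoothing constants are hypothetical.

```python
from collections import defaultdict


class AdaptiveRouter:
    """Hypothetical greedy router; not the model from the paper."""

    def __init__(self):
        # Smoothed counts per attribute: [successes, attempts], starting at 1/2.
        self.stats = defaultdict(lambda: [1, 2])

    def success_rate(self, attribute):
        hits, tries = self.stats[attribute]
        return hits / tries

    def record(self, attribute, found):
        """Update the estimate after one attempt to retrieve `attribute`."""
        self.stats[attribute][0] += int(found)
        self.stats[attribute][1] += 1

    def score(self, row):
        """Estimated chance that all missing values of `row` can be filled."""
        p = 1.0
        for attribute, value in row.items():
            if value is None:
                p *= self.success_rate(attribute)
        return p

    def next_row(self, rows):
        """Pick the incomplete row with the highest score, or None if all are complete."""
        incomplete = [r for r in rows if any(v is None for v in r.values())]
        return max(incomplete, key=self.score, default=None)


rows = [
    {"company": "Acme", "ceo": None, "hq": None},
    {"company": "Bolt", "ceo": "Raj Rao", "hq": None},
    {"company": "Core", "ceo": "Ann Lee", "hq": "Berlin"},
    {"company": "Dyn", "ceo": None, "hq": "Oslo"},
]
router = AdaptiveRouter()
first_choice = router.next_row(rows)["company"]

# Two failed attempts to find a headquarters lower the estimate for "hq",
# so a row that is missing only "ceo" becomes the more promising one.
router.record("hq", False)
router.record("hq", False)
second_choice = router.next_row(rows)["company"]
```

Because the estimates are updated after every attempt, the choice of the next row can change during execution, which is the adaptive behavior that Eddy-style processing relies on.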
We report an experimental evaluation over multiple relation extractors. Our experiments demonstrate that our query processor returns complete result tuples while processing only a small number of Web pages.

References

[1] Markl V., Raman V., Simmen D. E., Lohman G. M., Pirahesh M.: Robust Query Processing through Progressive Optimization. SIGMOD Conference 2004: 659--670
[2] Kasneci G., Ramanath M., Suchanek F. M., Weikum G.: The YAGO-NAGA approach to knowledge discovery. SIGMOD Record 37(4): 41--47 (2008)
[3] Jain, A., Doan, A., and Gravano, L. 2008. Optimizing SQL Queries over Text Databases. ICDE. IEEE Computer Society, Washington, DC, 636--645.
[4] Jain, A. and Srivastava, D. 2009. Exploring a Few Good Tuples from Text Databases. ICDE. IEEE Computer Society, Washington, DC, 616--627.
[5] Jain, A., Ipeirotis, P. G., Doan, A., and Gravano, L. 2009. Join Optimization of Information Extraction Output: Quality Matters! ICDE. IEEE Computer Society, Washington, DC.
[6] Naumann, F. 2002. Quality-Driven Query Answering for Integrated Information Systems. Springer.
[7] Shen, W., DeRose, P., McCann, R., Doan, A., and Ramakrishnan, R. 2008. Toward best-effort information extraction. SIGMOD '08. ACM, New York, NY, 1031--1042.
[8] Agichtein, E. and Gravano, L. 2003. QXtract: a building block for efficient information extraction from Web page collections. SIGMOD '03. ACM, New York, NY, 663--663.
[9] Galhardas, H., Florescu, D., Shasha, D., Simon, E., and Saita, C. 2001. Declarative Data Cleaning: Language, Model, and Algorithms. Very Large Data Bases. Morgan Kaufmann Publishers, Rome, CA, 371--380.
[10] Etzioni O., Banko M., Soderland S., Weld D. S.: Open information extraction from the Web. Commun. ACM 51(12): 68--74 (2008)
[11] Ipeirotis, P. G., Agichtein, E., Jain, P., and Gravano, L. 2006. To search or to crawl?: towards a query optimizer for text-centric tasks. SIGMOD '06. ACM, New York, NY.
[12] Dong X., Halevy A. Y., Madhavan J.: Reference Reconciliation in Complex Information Spaces. SIGMOD Conference 2005: 85--96
[13] Löser, A., Lutter, S., Düssel, P., and Markl, V. 2009. Ad-hoc Queries over Web page Collections -- a Case Study. BIRTE Workshop at VLDB 2009.
[14] Avnur R., Hellerstein J. M.: Eddies: Continuously Adaptive Query Processing. SIGMOD Conference 2000: 261--272
[15] YahooBoss service. http://developer.yahoo.com/search/boss (Last visited 01/03/10)
[16] OpenCalais. http://www.opencalais.com (Last visited 01/01/11)
[17] Levy A., Mendelzon A. O., Sagiv Y., Srivastava D.: Answering Queries Using Views. PODS 1995: 95--104
[18] Liu J., Dong X., Halevy A. Y.: Answering Structured Queries on Unstructured Data. WebDB 2006
[19] HSQLDB. http://hsqldb.org/ (Last visited 01/01/11)
[20] Fung G., Yu J., Lu H.: Discriminative Category Matching: Efficient Text Classification for Huge Document Collections. ICDM 2002: 187--194
[21] Feldman R., Regev Y., Gorodetsky M.: A modular information extraction system. Intell. Data Anal. 12(1): 51--71 (2008)
[22] Löser, A., Hüske, F., and Markl, V. Situational Business Intelligence. BIRTE Workshop at VLDB 2008.
[23] Löser, A. Beyond Search: Web-Scale Business Analytics. WISE 2009: 5
[24] Löser, A., Nagel, C., and Pieper, S. Augmenting Tables by Self-Supervised Web search. BIRTE Workshop at VLDB 2010.
[25] Freebase. www.freebase.com (Last visited 01/01/11)
[26] Bizer C., Heath T., Berners-Lee T.: Linked Data - The Story So Far. Int. J. Semantic Web Inf. Syst. 5(3): 1--22 (2009)
[27] Goolap.info. www.goolap.info (Last visited 01/01/11)
[28] Mitchell T. Machine Learning. McGraw-Hill, 1997.
[29] Ilyas I., Beskales G., Soliman M. A.: A survey of top-k query processing techniques in relational database systems. ACM Comput. Surv. 40(4) (2008)
[30] Jain A., Pantel P.: FactRank: Random Walks on a Web of Facts. COLING 2010
[31] Boden C., Häfele T., Löser A.: Classification Algorithms for Relation Prediction. DaLi Workshop at ICDE 2011 (forthcoming)

Cited By

  • (2013) Fusion Cubes. International Journal of Data Warehousing and Mining 9:2 (66-88). DOI: 10.4018/jdwm.2013040104. Online publication date: Apr-2013
  • (2012) The GoOLAP Fact Retrieval Framework. Business Intelligence (84-97). DOI: 10.1007/978-3-642-27358-2_4. Online publication date: 2012

Published In

BEWEB '11: Proceedings of the 2nd International Workshop on Business intelligencE and the WEB
March 2011
54 pages
ISBN: 9781450306102
DOI: 10.1145/1966883

Publisher

Association for Computing Machinery
New York, NY, United States

Author Tags

1. join execution on web search results
2. text analytics

Conference

EDBT/ICDT '11
