Using Query Probing to Identify Query Language Features on the Web

Bergholz, André; Chidlovskii, Boris

doi:10.1007/978-3-540-24610-7_2

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2924)

Included in the following conference series:

Workshop on Distributed Information Retrieval

Abstract

We address the problem of automatic discovery of the query language features supported by a Web information resource. We propose a method that automatically probes the resource’s search interface with a set of selected probe queries and analyzes the returned pages to recognize supported query language features. The automatic discovery assumes that the number of matches a server returns for a submitted query is available on the first result page. The method uses these match numbers to train a learner and generate classification rules that distinguish different semantics for specific, predefined model queries. Later these rules are used during automatic probing of new providers to reason about query features they support. We report experiments that demonstrate the suitability of our approach. Our approach has relatively low costs, because only a small set of resources has to be inspected manually to create a training set for the machine learning algorithm.
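The idea of reasoning from match numbers can be illustrated with a minimal sketch. It is not the authors' learner: the probe names ("A", "B", "A B"), the function `classify_two_term_probe`, and the hand-written thresholds are all assumptions made for illustration, standing in for classification rules that the method would learn from a manually labeled training set. The sketch only uses two facts that hold for any collection: a disjunction matches at least as many documents as either term alone, and a conjunction or phrase matches at most as many.

```python
from typing import Dict


def classify_two_term_probe(counts: Dict[str, int]) -> str:
    """Guess how a provider interprets the two-term query "A B".

    `counts` maps each probe query to the match number reported on the
    first result page. The comparisons below are hand-written stand-ins
    (an assumption) for rules a trained learner would produce.
    """
    a, b, ab = counts["A"], counts["B"], counts["A B"]
    if max(a, b) == 0:
        # Without single-term matches the comparison carries no signal.
        raise ValueError("single-term probes returned no matches")
    if ab >= max(a, b):
        # At least as many matches as the more frequent term: looks like OR.
        return "disjunction"
    if ab <= min(a, b):
        # No more matches than the rarer term: looks like AND or a phrase.
        return "conjunction or phrase"
    return "undetermined"


# Hypothetical match numbers for one provider.
probes = {"A": 1000, "B": 400, "A B": 1200}
print(classify_two_term_probe(probes))
```

Separating a conjunction from a phrase would need further probes, for example the same two terms in reversed order. Hand-picked thresholds like these are brittle, which is why the approach trains a learner on match numbers instead.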

Author information

Authors and Affiliations

Xerox Research Centre Europe (Grenoble), 6 chemin de Maupertuis, 38240, Meylan, France
André Bergholz & Boris Chidlovskii

Editor information

Editors and Affiliations

Language Technologies Institute, Carnegie Mellon University, 5000 Forbes Ave, 15213, Pittsburgh, PA, USA
Jamie Callan
Department of Computer and Information Science, University of Strathclyde, Scotland
Fabio Crestani
Department of Information Studies, University of Sheffield, Sheffield, UK
Mark Sanderson

Cite this paper

Bergholz, A., Chidlovskii, B. (2004). Using Query Probing to Identify Query Language Features on the Web. In: Callan, J., Crestani, F., Sanderson, M. (eds) Distributed Multimedia Information Retrieval. DIR 2003. Lecture Notes in Computer Science, vol 2924. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-24610-7_2

DOI: https://doi.org/10.1007/978-3-540-24610-7_2
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-20875-4
Online ISBN: 978-3-540-24610-7
eBook Packages: Springer Book Archive
