Abstract
We address the problem of automatically discovering the query language features supported by a Web information resource. We propose a method that probes the resource’s search interface with a set of selected probe queries and analyzes the returned pages to recognize the supported features. The method assumes that the number of matches a server returns for a submitted query is available on the first result page. It uses these match numbers to train a learner and generate classification rules that distinguish different semantics for specific, predefined model queries. These rules are later used during automatic probing of new providers to infer which query features they support. We report experiments that demonstrate the suitability of our approach. The approach has a relatively low cost, because only a small set of resources has to be inspected manually to create a training set for the machine learning algorithm.
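To make the probing idea concrete, here is a minimal sketch in Python. It is not the method from the paper: the abstract describes rules learned from match numbers, whereas this sketch uses one hand-written rule as a stand-in. It guesses how a resource interprets a two-word query by comparing the match counts for each word alone with the count for both words together. All names (`classify_two_word_query`, `probe_resource`, `and_server`, `or_server`, `DOCS`) are hypothetical, and the two toy "servers" are in-memory functions rather than real search interfaces.

```python
from typing import Callable, List

def classify_two_word_query(n_a: int, n_b: int, n_ab: int) -> str:
    """Hand-written stand-in for a learned rule (an assumption, not the paper's rules).

    n_a, n_b: match counts for the single-word probes.
    n_ab: match count for the two-word probe.
    """
    if n_ab > max(n_a, n_b):
        return "disjunction"   # more matches than either word alone: A OR B
    if n_ab < min(n_a, n_b):
        return "conjunction"   # fewer matches than either word alone: A AND B
    return "undecided"         # counts alone do not separate the two readings

def probe_resource(match_count: Callable[[str], int], a: str, b: str) -> str:
    """Send three probe queries and classify the resource from the match counts."""
    return classify_two_word_query(
        match_count(a), match_count(b), match_count(f"{a} {b}")
    )

# Toy document collection shared by two hypothetical servers.
DOCS: List[str] = ["deep web search", "web crawler", "deep learning", "search engine"]

def and_server(query: str) -> int:
    """Counts documents containing every query word."""
    terms = query.split()
    return sum(all(t in d.split() for t in terms) for d in DOCS)

def or_server(query: str) -> int:
    """Counts documents containing at least one query word."""
    terms = query.split()
    return sum(any(t in d.split() for t in terms) for d in DOCS)

if __name__ == "__main__":
    print(probe_resource(and_server, "deep", "web"))
    print(probe_resource(or_server, "deep", "web"))
```

In a real setting, `match_count` would submit the query to the resource and extract the match number from the first result page, and the fixed thresholds would be replaced by rules induced from a manually labeled training set.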
© 2004 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Bergholz, A., Chidlovskii, B. (2004). Using Query Probing to Identify Query Language Features on the Web. In: Callan, J., Crestani, F., Sanderson, M. (eds) Distributed Multimedia Information Retrieval. DIR 2003. Lecture Notes in Computer Science, vol 2924. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-24610-7_2
DOI: https://doi.org/10.1007/978-3-540-24610-7_2
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-20875-4
Online ISBN: 978-3-540-24610-7
eBook Packages: Springer Book Archive