ABSTRACT
This paper studies the automatic acquisition of the query language supported by a Web information resource. We describe a system that automatically probes the search interface of a resource with a set of test queries and analyses the returned pages to recognize the supported query operators. The approach assumes that the resource reports the number of matches for each submitted query. These match numbers are used to train a learning system and to generate classification rules that recognize the query operators supported by a provider, together with their syntactic encodings. The classification rules are then applied during the automatic probing of new providers to determine which query operators they support. We report the results of experiments with a set of real Web resources.
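The probing idea in the abstract can be illustrated with a small sketch. This is not the authors' system: the paper learns its classification rules from training data, whereas the thresholds below are hand-written assumptions chosen for illustration. The names `classify_composite`, `probe`, `toy_match_count`, `DOCS`, and `TEMPLATES` are all hypothetical, and the "resource" is an in-memory stand-in rather than a real Web interface.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for a Web resource: four tiny "documents".
DOCS: List[str] = [
    "information retrieval on the web",
    "web query languages",
    "query optimization in databases",
    "information extraction",
]

# Candidate syntactic encodings to probe (assumed, not taken from the paper).
TEMPLATES: List[str] = ["{a} {b}", "{a} AND {b}", "{a} OR {b}"]


def toy_match_count(query: str) -> int:
    """Toy resource: whitespace means AND, the keyword OR means disjunction.

    Any other word, including 'and', is treated as a literal search term.
    """
    terms = query.lower().split()
    if "or" in terms:
        wanted = [t for t in terms if t != "or"]
        return sum(any(t in doc.split() for t in wanted) for doc in DOCS)
    return sum(all(t in doc.split() for t in terms) for doc in DOCS)


def classify_composite(count_a: int, count_b: int, count_ab: int) -> str:
    """Guess how a two-term query was interpreted, from match counts alone.

    Hand-written rule (an assumption): a conjunction cannot match more than
    the rarer term, and a disjunction cannot match fewer than the more
    frequent term nor more than both terms combined.
    """
    lo, hi = min(count_a, count_b), max(count_a, count_b)
    if count_ab == 0:
        return "no-match"
    if count_ab > count_a + count_b:
        return "unknown"
    if count_ab >= hi:
        # All three counts equal: both readings fit the evidence.
        return "ambiguous" if count_ab == lo else "disjunction"
    if count_ab <= lo:
        return "conjunction"
    return "unknown"


def probe(match_count: Callable[[str], int], a: str, b: str,
          templates: List[str]) -> Dict[str, str]:
    """Submit the single terms and each composite query, then label each template."""
    count_a, count_b = match_count(a), match_count(b)
    return {
        t: classify_composite(count_a, count_b, match_count(t.format(a=a, b=b)))
        for t in templates
    }


if __name__ == "__main__":
    for template, label in probe(toy_match_count, "web", "query", TEMPLATES).items():
        print(f"{template!r}: {label}")
```

On the toy resource, the plain two-term query is labelled a conjunction, the `OR` form a disjunction, and the `AND` form returns no matches because the stand-in treats `and` as a literal term. A single pair of probe terms can easily be ambiguous, which is one reason a real system would combine many test queries before deciding.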
- Learning query languages of Web interfaces