Abstract
A huge portion of the Web, known as the deep Web, is accessible via search interfaces to the myriad databases on the Web. While relatively good approaches for querying the contents of web databases have recently been proposed, they cannot be fully utilized while most search interfaces remain unlocated. Thus, the automatic recognition of search interfaces to online databases is crucial for any application accessing the deep Web. This paper describes the architecture of the I-Crawler, a system for finding and classifying search interfaces. The I-Crawler is specifically designed for use in deep Web characterization surveys and for constructing directories of deep Web resources.
Copyright information
© 2010 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Shestakov, D. (2010). On Building a Search Interface Discovery System. In: Lacroix, Z. (eds) Resource Discovery. RED 2009. Lecture Notes in Computer Science, vol 6162. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-14415-8_6
DOI: https://doi.org/10.1007/978-3-642-14415-8_6
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-14414-1
Online ISBN: 978-3-642-14415-8
eBook Packages: Computer Science, Computer Science (R0)