On Building a Search Interface Discovery System

Shestakov, Denis

doi:10.1007/978-3-642-14415-8_6

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 6162)

Abstract

A huge portion of the Web, known as the deep Web, is accessible via search interfaces to the myriad databases on the Web. While relatively good approaches for querying the contents of web databases have recently been proposed, they cannot be fully utilized while most search interfaces remain unlocated. Thus, the automatic recognition of search interfaces to online databases is crucial for any application accessing the deep Web. This paper describes the architecture of the I-Crawler, a system for finding and classifying search interfaces. The I-Crawler is designed specifically for use in deep web characterization surveys and for constructing directories of deep web resources.

Author information

Authors and Affiliations

Department of Media Technology, Aalto University, Espoo, Finland, 02150
Denis Shestakov

Editor information

Editors and Affiliations

Scientific Data Management Laboratory, Arizona State University, 85287-5706, Tempe, AZ, USA
Zoé Lacroix

About this paper

Cite this paper

Shestakov, D. (2010). On Building a Search Interface Discovery System. In: Lacroix, Z. (ed.) Resource Discovery. RED 2009. Lecture Notes in Computer Science, vol 6162. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-14415-8_6

DOI: https://doi.org/10.1007/978-3-642-14415-8_6
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-14414-1
Online ISBN: 978-3-642-14415-8
eBook Packages: Computer Science, Computer Science (R0)
