On Building a Search Interface Discovery System

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 6162)

Abstract

A huge portion of the Web, known as the deep Web, is accessible via search interfaces to myriad databases on the Web. Although relatively good approaches for querying the contents of web databases have recently been proposed, they cannot be fully utilized while most search interfaces remain unlocated. The automatic recognition of search interfaces to online databases is therefore crucial for any application that accesses the deep Web. This paper describes the architecture of the I-Crawler, a system for finding and classifying search interfaces. The I-Crawler is specifically designed for use in deep web characterization surveys and for constructing directories of deep web resources.

Copyright information

© 2010 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Shestakov, D. (2010). On Building a Search Interface Discovery System. In: Lacroix, Z. (eds) Resource Discovery. RED 2009. Lecture Notes in Computer Science, vol 6162. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-14415-8_6

  • DOI: https://doi.org/10.1007/978-3-642-14415-8_6

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-14414-1

  • Online ISBN: 978-3-642-14415-8

  • eBook Packages: Computer Science, Computer Science (R0)
