Automatic Identification of Web Query Interfaces

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 7095)

Abstract

The amount of information contained in databases on the Web has grown explosively in recent years. This information, known as the Deep Web, is obtained dynamically by submitting specific queries to these databases through Web Query Interfaces (WQIs). Finding and accessing databases on the Web is a great challenge because Web sites are highly dynamic and the information they hold is heterogeneous. It is therefore necessary to create efficient mechanisms to access, extract and integrate the information contained in these databases. Since WQIs are the only means of accessing databases on the Web, the automatic identification of WQIs plays an important role in helping traditional search engines increase their coverage and reach interesting information that is not available on the indexable Web. In this paper we present a strategy for the automatic identification of WQIs that uses supervised learning and builds the training set from a suitable selection of HTML elements extracted from the WQIs. We present two experimental tests over a corpus of HTML forms containing positive and negative examples. Our proposed strategy achieves better accuracy than previous work reported in the literature.
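To make the idea of building a training set from HTML form elements concrete, the following is a minimal sketch rather than the authors' method. The abstract does not say which elements are selected, so the feature list `FEATURES`, the class `FormFeatureExtractor`, and the function `extract_form_features` are all hypothetical names chosen for illustration. The sketch uses only Python's standard `html.parser` module and turns each form into a vector of element counts, which could then be labeled as a positive or negative example for a supervised classifier.

```python
from html.parser import HTMLParser

# Hypothetical feature set: the abstract does not list the elements it uses.
FEATURES = ["text", "password", "checkbox", "radio", "submit", "select", "textarea"]


class FormFeatureExtractor(HTMLParser):
    """Collect one dictionary of element counts per <form> element."""

    def __init__(self):
        super().__init__()
        self.forms = []       # feature dictionaries of completed forms
        self._current = None  # feature dictionary of the form being parsed

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            self._current = dict.fromkeys(FEATURES, 0)
        elif self._current is not None:
            if tag == "input":
                # An <input> without a type attribute defaults to "text".
                kind = (dict(attrs).get("type") or "text").lower()
            else:
                kind = tag
            if kind in self._current:
                self._current[kind] += 1

    def handle_endtag(self, tag):
        if tag == "form" and self._current is not None:
            self.forms.append(self._current)
            self._current = None


def extract_form_features(html):
    """Return one count vector (ordered as FEATURES) per form in the page."""
    parser = FormFeatureExtractor()
    parser.feed(html)
    parser.close()
    return [[form[name] for name in FEATURES] for form in parser.forms]


# Two illustrative forms: a search interface and a login form.
sample = """
<form action="/search"><input type="text" name="title">
<select name="genre"><option>Any</option></select>
<input type="submit" value="Search"></form>
<form action="/login"><input type="text" name="user">
<input type="password" name="pw"><input type="submit"></form>
"""
vectors = extract_form_features(sample)
```

In this sketch the search form and the login form differ in their `password` and `select` counts, which is the kind of signal a classifier trained on labeled forms could use to separate query interfaces from other forms.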

Copyright information

© 2011 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Marin-Castro, H.M., Sosa-Sosa, V.J., Lopez-Arevalo, I. (2011). Automatic Identification of Web Query Interfaces. In: Batyrshin, I., Sidorov, G. (eds) Advances in Soft Computing. MICAI 2011. Lecture Notes in Computer Science, vol 7095. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-25330-0_26

  • DOI: https://doi.org/10.1007/978-3-642-25330-0_26

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-25329-4

  • Online ISBN: 978-3-642-25330-0

  • eBook Packages: Computer Science, Computer Science (R0)
