Automatic Hierarchical Classification of Structured Deep Web Databases

  • Conference paper
Web Information Systems – WISE 2006 (WISE 2006)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4255)

Abstract

We present a method that automatically classifies structured deep Web databases according to a pre-defined topic hierarchy. We assume that every node of the topic hierarchy contains some manually classified databases, i.e., training databases. Each training database is probed with queries constructed from the node titles of the topic hierarchy, and the query result counts reported by the database are used to represent its content. A new database can therefore be probed with the same set of queries and assigned to the node whose training databases are most similar to it. Specifically, a support vector machine classifier is trained at each internal node of the topic hierarchy on these training databases, and the new database is classified into the hierarchy top-down, level by level. A feature extension method is proposed to create discriminant features. Experiments on real structured Web databases collected from the Internet show that this classification method is quite accurate.
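The probe-and-classify idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the topic hierarchy, the probe queries, and all result counts below are hypothetical toy values, and a nearest-centroid rule with cosine similarity stands in for the per-node support vector machine so that the example needs only the Python standard library. The feature extension method is not reproduced, since the abstract does not describe it.

```python
import math

# Hypothetical toy hierarchy: internal node -> child nodes.
HIERARCHY = {
    "Root": ["Travel", "Shopping"],
    "Travel": ["Flights", "Hotels"],
    "Shopping": ["Books", "Music"],
}

# Probe queries built from node titles (assumed order of the count vectors):
# ["travel", "shopping", "flights", "hotels", "books", "music"]

# Hypothetical result counts reported by manually classified training databases.
TRAINING = {
    "Flights": [[120, 3, 900, 40, 1, 0], [80, 1, 600, 30, 0, 2]],
    "Hotels": [[150, 5, 30, 800, 2, 1], [90, 2, 20, 500, 0, 0]],
    "Books": [[4, 200, 1, 0, 950, 60], [2, 100, 0, 1, 700, 30]],
    "Music": [[1, 180, 0, 2, 50, 900], [3, 90, 1, 0, 40, 600]],
}


def normalize(counts):
    """Scale a count vector to unit length so database size cancels out."""
    norm = math.sqrt(sum(c * c for c in counts))
    return [c / norm for c in counts] if norm else list(counts)


def centroid(vectors):
    """Component-wise mean of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def leaves_under(node):
    """Return all leaf nodes in the subtree rooted at node."""
    children = HIERARCHY.get(node)
    if not children:
        return [node]
    return [leaf for child in children for leaf in leaves_under(child)]


def train():
    """Build one model per internal node: a centroid for each child subtree.

    Stand-in for the SVM trained at each internal node in the paper.
    """
    models = {}
    for node, children in HIERARCHY.items():
        models[node] = {
            child: centroid(
                [normalize(v) for leaf in leaves_under(child) for v in TRAINING[leaf]]
            )
            for child in children
        }
    return models


def classify(counts, models, root="Root"):
    """Descend the hierarchy level by level, picking the most similar child."""
    vec = normalize(counts)
    node, path = root, [root]
    while node in models:
        node = max(models[node], key=lambda child: cosine(vec, models[node][child]))
        path.append(node)
    return path


models = train()
print(classify([60, 2, 10, 400, 1, 0], models))
```

Here a new database whose counts are dominated by the "hotels" probe is routed first to Travel and then to Hotels. Swapping the centroid rule for a trained classifier at each internal node would leave the top-down descent in `classify` unchanged.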

Copyright information

© 2006 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Su, W., Wang, J., Lochovsky, F. (2006). Automatic Hierarchical Classification of Structured Deep Web Databases. In: Aberer, K., Peng, Z., Rundensteiner, E.A., Zhang, Y., Li, X. (eds) Web Information Systems – WISE 2006. WISE 2006. Lecture Notes in Computer Science, vol 4255. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11912873_23

  • DOI: https://doi.org/10.1007/11912873_23

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-48105-8

  • Online ISBN: 978-3-540-48107-2

  • eBook Packages: Computer Science, Computer Science (R0)
