Measuring Similarity of Chinese Web Databases Based on Category Hierarchy

  • Conference paper
Web Technologies and Applications (APWeb 2011)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 6612)

Abstract

The amount of high-quality data in Web databases has been increasing dramatically. To exploit this wealth of information, measuring the similarity between Web databases has been proposed for many applications, such as clustering and top-k recommendation. Most existing methods represent a Web database by the text found either in its query interface or in the Web page where the interface is located. These methods are limited because that text may contain many noisy words, which are rarely discriminative and cannot capture the characteristics of the Web databases. To better measure the similarity between Web databases, we introduce a novel Web database similarity method. We represent each Web database by the categories of its records, which can be automatically extracted from the Web site where the database is located. The record categories are of high quality and capture the characteristics of the corresponding Web databases effectively. To make better use of the record categories, we measure the similarity between Web databases based on a unified category hierarchy, and propose an effective method to construct this hierarchy from the record categories obtained from all the Web databases. We conducted experiments on real Chinese Web databases to evaluate our method. The results show that, compared with the baseline method, our method is effective in clustering and top-k recommendation for Web databases, and can be used in real Web database applications.
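To make the idea of hierarchy-based similarity concrete, the following is a minimal sketch and not the method of the paper, whose formulas are not given in this abstract. It assumes a small, hand-written category hierarchy (`PARENT`), represents each database by the categories of its records, propagates each record's count to its ancestor categories, and compares the resulting profiles with cosine similarity. All names, categories, and the choice of cosine similarity are assumptions made for illustration.

```python
from collections import Counter
from math import sqrt

# Hypothetical unified category hierarchy: child -> parent (None marks the root).
PARENT = {
    "root": None,
    "books": "root",
    "electronics": "root",
    "fiction": "books",
    "textbooks": "books",
    "phones": "electronics",
}


def ancestors_and_self(category):
    """Yield the category itself and every ancestor up to the root."""
    while category is not None:
        yield category
        category = PARENT[category]


def profile(record_categories):
    """Build a weight vector over hierarchy nodes from a list of record categories.

    Each record adds 1 to its own category and to all of its ancestors, so
    databases whose records fall in sibling categories still overlap at the
    shared parent. The root is skipped because every database shares it.
    """
    weights = Counter()
    for category in record_categories:
        for node in ancestors_and_self(category):
            if node != "root":
                weights[node] += 1
    return weights


def similarity(db_a, db_b):
    """Cosine similarity between the hierarchy-aware profiles of two databases."""
    a, b = profile(db_a), profile(db_b)
    dot = sum(a[node] * b[node] for node in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_k(query, databases, k):
    """Return the names of the k databases most similar to the query database."""
    ranked = sorted(
        databases,
        key=lambda name: similarity(query, databases[name]),
        reverse=True,
    )
    return ranked[:k]


# Illustrative data: each database is just the list of its records' categories.
bookstore_a = ["fiction", "fiction", "textbooks"]
bookstore_b = ["textbooks"]
phone_shop = ["phones"]
```

In this sketch the two bookstores score higher with each other than either does with the phone shop, because their records share the `books` ancestor even where the leaf categories differ. A pure leaf-category comparison would miss that overlap, which is one plausible reason to compare databases over a hierarchy rather than a flat category set.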

This work was supported in part by National Natural Science Foundation of China under grant No. 60833003.

Copyright information

© 2011 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Liu, J., Fan, J., Zhou, L. (2011). Measuring Similarity of Chinese Web Databases Based on Category Hierarchy. In: Du, X., Fan, W., Wang, J., Peng, Z., Sharaf, M.A. (eds) Web Technologies and Applications. APWeb 2011. Lecture Notes in Computer Science, vol 6612. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-20291-9_23

  • DOI: https://doi.org/10.1007/978-3-642-20291-9_23

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-20290-2

  • Online ISBN: 978-3-642-20291-9

  • eBook Packages: Computer Science (R0)
