Measuring Similarity of Chinese Web Databases Based on Category Hierarchy

Liu, Juan; Fan, Ju; Zhou, Lizhu

doi:10.1007/978-3-642-20291-9_23

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 6612)

Included in the following conference series:

Asia-Pacific Web Conference

Abstract

The amount of high-quality data in Web databases has been increasing dramatically. To exploit this wealth of information, measuring the similarity between Web databases has been proposed for many applications, such as clustering and top-k recommendation. Most existing methods represent a Web database by the text found either in its interface or in the Web pages where the interface is located. These methods are limited because the text may contain many noisy words, which are rarely discriminative and cannot capture the characteristics of the Web databases. To better measure the similarity between Web databases, we introduce a novel Web database similarity method. We represent each Web database by the categories of its records, which can be automatically extracted from the Web site where the database is located. The record categories are of high quality and capture the characteristics of the corresponding Web databases effectively. To make better use of the record categories, we measure the similarity between Web databases based on a unified category hierarchy, and propose an effective method to construct this hierarchy from the record categories obtained from all the Web databases. We conducted experiments on real Chinese Web databases to evaluate our method. The results show that, compared with the baseline method, our method is effective in clustering and top-k recommendation for Web databases, and can be used in real Web database applications.
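
The idea of comparing Web databases through a shared category hierarchy can be illustrated with a minimal sketch. The abstract does not give the similarity formula or the hierarchy construction, so everything below is an assumption made for illustration: the toy hierarchy `PARENT`, the `decay` parameter, the helper names `expand` and `similarity`, and the choice of cosine similarity over ancestor-propagated category weights. It is not the authors' method.

```python
from collections import Counter
from math import sqrt

# Hypothetical unified category hierarchy: child -> parent (None marks the root).
PARENT = {
    "root": None,
    "books": "root",
    "electronics": "root",
    "novels": "books",
    "textbooks": "books",
    "phones": "electronics",
}


def expand(categories, decay=0.5):
    """Spread each category's record count to its ancestors.

    The weight is multiplied by `decay` at every level going up. The root is
    skipped because every database shares it, so it carries no information.
    """
    vec = Counter()
    for cat, count in categories.items():
        weight = float(count)
        node = cat
        while node is not None and PARENT[node] is not None:
            vec[node] += weight
            weight *= decay
            node = PARENT[node]
    return vec


def similarity(db_a, db_b, decay=0.5):
    """Cosine similarity between two databases' expanded category vectors."""
    a, b = expand(db_a, decay), expand(db_b, decay)
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Hypothetical databases, each described by record counts per category.
novel_store = {"novels": 10}
textbook_store = {"textbooks": 10}
phone_store = {"phones": 10}

print(similarity(novel_store, textbook_store))  # share the ancestor "books"
print(similarity(novel_store, phone_store))     # share only the root
```

In this sketch, two databases with no category in common can still score above zero when their categories share an ancestor, which is one plausible reason for working on a hierarchy rather than on flat category labels.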

This work was supported in part by National Natural Science Foundation of China under grant No. 60833003.

Author information

Authors and Affiliations

Tsinghua National Laboratory for Information Science and Technology, Department of Computer Science and Technology, Tsinghua University, Beijing, 100084, China
Juan Liu, Ju Fan & Lizhu Zhou

Editor information

Editors and Affiliations

School of Information, Renmin University of China, 100872, Beijing, China
Xiaoyong Du
LFCS, School of Informatics, University of Edinburgh, 10 Crichton Street, EH8 9AB, Edinburgh, Scotland, UK
Wenfei Fan
School of Software, Tsinghua University, Room 819, Main Building, 100084, Beijing, China
Jianmin Wang
Computer School, Wuhan University, Luojiashan Road, 430072, Wuhan, Hubei, China
Zhiyong Peng
School of Information Technology and Electrical Engineering, The University of Queensland, QLD 4072, St. Lucia, Australia
Mohamed A. Sharaf

About this paper

Cite this paper

Liu, J., Fan, J., Zhou, L. (2011). Measuring Similarity of Chinese Web Databases Based on Category Hierarchy. In: Du, X., Fan, W., Wang, J., Peng, Z., Sharaf, M.A. (eds) Web Technologies and Applications. APWeb 2011. Lecture Notes in Computer Science, vol 6612. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-20291-9_23

DOI: https://doi.org/10.1007/978-3-642-20291-9_23
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-20290-2
Online ISBN: 978-3-642-20291-9
eBook Packages: Computer Science, Computer Science (R0)
