Abstract
The amount of high-quality data in Web databases has been increasing dramatically. To utilize this wealth of information, measuring the similarity between Web databases has been proposed for many applications, such as clustering and top-k recommendation. Most existing methods represent a Web database by text, taken either from its query interface or from the Web page where the interface is located. These methods are limited because the text may contain many noisy words, which are rarely discriminative and cannot capture the characteristics of the Web databases. To better measure the similarity between Web databases, we introduce a novel Web database similarity method. We employ the categories of the records in the Web databases, which can be automatically extracted from the Web sites where the Web databases are located, to represent the Web databases. The record categories are of high quality and capture the characteristics of the corresponding Web databases effectively. To make better use of the record categories, we measure the similarity between Web databases based on a unified category hierarchy, and propose an effective method to construct the category hierarchy from the record categories obtained from all the Web databases. We conducted experiments on real Chinese Web databases to evaluate our method. The results show that our method is effective in clustering and top-k recommendation for Web databases, compared with the baseline method, and can be used in real Web database applications.
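The idea of comparing Web databases through a shared category hierarchy can be illustrated with a minimal sketch. This is not the method evaluated in the paper, whose details are not given in the abstract; it assumes a hand-written toy hierarchy (`PARENT`), hypothetical record counts per category, and a simple scheme in which each category's weight is propagated to its ancestors with a hypothetical `decay` factor before taking cosine similarity.

```python
from math import sqrt

ROOT = "root"

# Hypothetical toy hierarchy (an assumption for illustration):
# child category -> parent category.
PARENT = {
    "novels": "books",
    "textbooks": "books",
    "laptops": "electronics",
    "phones": "electronics",
    "books": ROOT,
    "electronics": ROOT,
}


def ancestors_and_self(category):
    """Return the category followed by its ancestors, excluding the root."""
    path = []
    while category != ROOT:
        path.append(category)
        # Categories missing from the hierarchy are treated as children of the root.
        category = PARENT.get(category, ROOT)
    return path


def expand(record_counts, decay=0.5):
    """Spread each category's record count up the hierarchy.

    Every step towards the root multiplies the weight by `decay`, so a
    shared ancestor contributes less than a shared leaf category.
    """
    vector = {}
    for category, count in record_counts.items():
        weight = float(count)
        for node in ancestors_and_self(category):
            vector[node] = vector.get(node, 0.0) + weight
            weight *= decay
    return vector


def similarity(db_a, db_b, decay=0.5):
    """Cosine similarity between two hierarchy-expanded category vectors."""
    a, b = expand(db_a, decay), expand(db_b, decay)
    dot = sum(w * b.get(node, 0.0) for node, w in a.items())
    norm_a = sqrt(sum(w * w for w in a.values()))
    norm_b = sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Hypothetical databases described by record counts per category.
    fiction_store = {"novels": 10}
    campus_store = {"textbooks": 10}
    gadget_store = {"laptops": 10}
    print(similarity(fiction_store, campus_store))
    print(similarity(fiction_store, gadget_store))
```

In this sketch, two databases with different leaf categories under the same parent still receive a non-zero score through the shared ancestor, whereas a flat comparison of category labels would treat them as unrelated.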
This work was supported in part by the National Natural Science Foundation of China under Grant No. 60833003.
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Liu, J., Fan, J., Zhou, L. (2011). Measuring Similarity of Chinese Web Databases Based on Category Hierarchy. In: Du, X., Fan, W., Wang, J., Peng, Z., Sharaf, M.A. (eds) Web Technologies and Applications. APWeb 2011. Lecture Notes in Computer Science, vol 6612. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-20291-9_23
DOI: https://doi.org/10.1007/978-3-642-20291-9_23
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-20290-2
Online ISBN: 978-3-642-20291-9
eBook Packages: Computer Science, Computer Science (R0)