Abstract
Deep Web database clustering is a key operation in organizing Deep Web resources. Traditional approaches compute similarity as cosine similarity in the Vector Space Model (VSM), but this measure cannot capture the semantic similarity between the contents of two databases. This paper discusses how to cluster Deep Web databases semantically. First, a fuzzy semantic measure is proposed, which integrates ontology and fuzzy set theory to compute the semantic similarity between the visible features of two Deep Web forms. Then a hybrid Particle Swarm Optimization (PSO) algorithm is provided for Deep Web database clustering. Finally, the clustering results are evaluated by Average Similarity of Document to the Cluster Centroid (ASDC) and Rand Index (RI). Experiments show that: 1) the hybrid PSO approach has higher ASDC values than the PSO and K-Means approaches, which means it achieves higher intra-cluster similarity and lower inter-cluster similarity than the other two; 2) the clustering results based on fuzzy semantic similarity have higher ASDC and RI values than those based on cosine similarity, which suggests that the fuzzy semantic similarity approach can explore latent semantics.
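As an illustration of two quantities named in the abstract, the following is a minimal sketch of VSM cosine similarity and the Rand Index, using their standard textbook definitions. It is not the authors' implementation; the function names, the three-term vocabulary, and the toy labelings are assumptions made only for this example.

```python
from itertools import combinations
from math import sqrt


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here: an all-zero vector matches nothing
    return dot / (norm_u * norm_v)


def rand_index(predicted, truth):
    """Fraction of item pairs on which two labelings agree.

    A pair agrees when both labelings put the two items in the same
    cluster, or both put them in different clusters.
    """
    if len(predicted) != len(truth):
        raise ValueError("labelings must have the same length")
    pairs = list(combinations(range(len(truth)), 2))
    if not pairs:
        return 1.0  # fewer than two items: nothing to disagree on
    agree = sum(
        (predicted[i] == predicted[j]) == (truth[i] == truth[j])
        for i, j in pairs
    )
    return agree / len(pairs)


# Hypothetical vocabulary: ["author", "title", "writer"].
form_a = [1, 1, 0]  # a form exposing "author" and "title"
form_b = [0, 1, 1]  # a form exposing "title" and "writer"
print(cosine_similarity(form_a, form_b))

# Hypothetical cluster assignments for four databases.
print(rand_index([0, 0, 1, 1], [0, 0, 0, 1]))
```

In the toy vectors, "author" and "writer" occupy separate dimensions, so they contribute nothing to the dot product and the two forms score only about 0.5 even though the terms are near-synonyms. This is the limitation of cosine similarity that the abstract points to.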
© 2008 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Song, L., Ma, J., Yan, P., Lian, L., Zhang, D. (2008). Clustering Deep Web Databases Semantically. In: Li, H., Liu, T., Ma, WY., Sakai, T., Wong, KF., Zhou, G. (eds) Information Retrieval Technology. AIRS 2008. Lecture Notes in Computer Science, vol 4993. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-68636-1_35
DOI: https://doi.org/10.1007/978-3-540-68636-1_35
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-68633-0
Online ISBN: 978-3-540-68636-1
eBook Packages: Computer Science (R0)