Abstract
In this paper, we discuss a method for finding useful clusters of web pages that are significant in the sense that their contents are similar or closely related to those of higher-ranked pages. Because users usually pay little attention to lower-ranked pages, such pages are discarded unconditionally even when their contents are similar to those of some high-ranked pages. We try to extract such hidden pages, together with significant higher-ranked pages, as a cluster.
To obtain such clusters, we first extract semantic correlations among terms by applying Singular Value Decomposition (SVD) to the term-document matrix generated from a corpus on a specific topic. Based on these correlations, we can evaluate potential similarities among the web pages from which we try to obtain clusters. The set of web pages is represented as a weighted graph G based on the similarities and the ranks of the pages. Our clusters can be found as pseudo-cliques in G. We present an algorithm for finding Top-N weighted pseudo-cliques. Our experimental results show that valuable clusters can indeed be extracted by our method.
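The first two steps described above (SVD on a term-document matrix, then a similarity graph over pages) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `lsi_similarity` and `similarity_graph`, the toy matrix, the choice of cosine similarity, the rank `k`, and the edge threshold are all hypothetical. The abstract does not specify how ranks enter the weights or how pseudo-cliques are defined, so neither is modelled here.

```python
import numpy as np


def lsi_similarity(term_doc, k):
    """Cosine similarity between documents in a rank-k latent space (assumed measure)."""
    # Thin SVD: term_doc = U @ diag(s) @ Vt, singular values in descending order.
    _, s, vt = np.linalg.svd(term_doc, full_matrices=False)
    # Coordinates of each document in the k-dimensional latent space, one row per document.
    docs = (np.diag(s[:k]) @ vt[:k, :]).T
    norms = np.linalg.norm(docs, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for empty documents
    unit = docs / norms
    return unit @ unit.T


def similarity_graph(sim, threshold):
    """Keep an undirected edge (i, j), i < j, when the similarity reaches the threshold."""
    n = sim.shape[0]
    return {
        (i, j): float(sim[i, j])
        for i in range(n)
        for j in range(i + 1, n)
        if sim[i, j] >= threshold
    }


# Toy term-document matrix (rows: terms, columns: pages); values are illustrative only.
toy = np.array(
    [
        [2.0, 1.0, 0.0, 0.0],
        [1.0, 2.0, 0.0, 0.0],
        [0.0, 0.0, 1.0, 2.0],
        [0.0, 0.0, 2.0, 1.0],
    ]
)
sim = lsi_similarity(toy, k=2)
edges = similarity_graph(sim, threshold=0.5)
```

In this toy example, pages 0 and 1 share one pair of terms and pages 2 and 3 share another, so the graph should link the pages within each pair and not across pairs. A clique-style search would then run on a graph built this way.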
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Okubo, Y., Haraguchi, M., Shi, B. (2005). Finding Significant Web Pages with Lower Ranks by Pseudo-Clique Search. In: Hoffmann, A., Motoda, H., Scheffer, T. (eds) Discovery Science. DS 2005. Lecture Notes in Computer Science, vol 3735. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11563983_30
DOI: https://doi.org/10.1007/11563983_30
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-29230-2
Online ISBN: 978-3-540-31698-5
eBook Packages: Computer Science, Computer Science (R0)