A Method for Pinpoint Clustering of Web Pages with Pseudo-Clique Search

Haraguchi, Makoto; Okubo, Yoshiaki

doi:10.1007/11605126_4


Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3847)


Abstract

This paper presents a method for Pinpoint Clustering of web pages. We aim to find useful clusters of web pages that are significant in the sense that their contents are similar to those of higher-ranked pages. Because users usually pay little attention to lower-ranked pages, such pages are discarded unconditionally even if their contents are similar to those of some highly ranked pages. Our method extracts such hidden pages, together with the significant higher-ranked pages, as a cluster. As a result, our clusters can provide users with new and valuable information.

To obtain such clusters, we first extract semantic correlations among terms by applying Singular Value Decomposition (SVD) to the term-document matrix generated from a corpus. Based on these correlations, we evaluate potential similarities among the web pages to be clustered. The set of web pages is then represented as a weighted graph G based on the similarities and the page ranks, and our clusters are found as pseudo-cliques in G. We present an algorithm for finding the Top-N weighted pseudo-cliques. Our experimental result shows that a valuable cluster can actually be extracted with our method.
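The graph-based step can be illustrated with a small sketch. It is not the authors' algorithm: the abstract does not define pseudo-cliques or the weighting scheme, so the sketch assumes that (a) each page is already represented by a low-dimensional vector, for example from an SVD-reduced term space, (b) two pages are linked when their cosine similarity reaches a threshold, (c) a pseudo-clique is a vertex set whose edge density reaches `min_density`, and (d) a cluster's weight is the sum of hypothetical per-page rank weights. All function and parameter names are illustrative, and the brute-force enumeration is only practical for tiny graphs.

```python
from itertools import combinations
from math import sqrt


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_graph(vectors, sim_threshold):
    """Link pages i < j when their similarity reaches the threshold."""
    return {
        (i, j)
        for i, j in combinations(range(len(vectors)), 2)
        if cosine(vectors[i], vectors[j]) >= sim_threshold
    }


def density(nodes, edges):
    """Fraction of vertex pairs in `nodes` that are linked."""
    pairs = list(combinations(sorted(nodes), 2))
    if not pairs:
        return 1.0
    return sum(1 for p in pairs if p in edges) / len(pairs)


def top_n_pseudo_cliques(vectors, rank_weights, sim_threshold=0.9,
                         min_density=0.8, min_size=3, top_n=2):
    """Brute-force search: maximal dense vertex sets, heaviest first.

    ASSUMPTION: density-based pseudo-cliques and summed rank weights are
    stand-ins chosen for illustration, not definitions from the paper.
    """
    edges = build_graph(vectors, sim_threshold)
    n = len(vectors)
    dense = [
        nodes
        for size in range(min_size, n + 1)
        for nodes in combinations(range(n), size)
        if density(nodes, edges) >= min_density
    ]
    dense_sets = [set(c) for c in dense]
    # Keep only sets that are not strictly contained in another dense set.
    maximal = [c for c in dense if not any(set(c) < s for s in dense_sets)]
    maximal.sort(key=lambda c: sum(rank_weights[i] for i in c), reverse=True)
    return maximal[:top_n]


# Hypothetical data: six pages in a 2-D reduced space, two topical groups.
pages = [(1.0, 0.0), (0.9, 0.1), (0.95, 0.05),
         (0.0, 1.0), (0.1, 0.9), (0.05, 0.95)]
weights = [1.0, 0.2, 0.1, 0.5, 0.4, 0.3]  # higher = better-ranked page
clusters = top_n_pseudo_cliques(pages, weights)
print(clusters)
```

In this toy input, page 0 is highly ranked while pages 1 and 2 are not, yet all three end up in the same cluster because their vectors are nearly parallel. This mirrors the idea of recovering low-ranked pages whose contents resemble those of a high-ranked one.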

We also discuss an idea for making the meanings of clusters clearer. With the help of Formal Concept Analysis, our clusters, called FC-based clusters, can be given clear meanings. Our preliminary experiments suggest that the extended method is a promising approach to finding meaningful clusters.
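To make the role of Formal Concept Analysis concrete, the following sketch enumerates the formal concepts of a tiny page-by-term context. It shows only the standard notion of a formal concept (a set of pages paired with exactly the terms they all share); how the paper derives FC-based clusters from such concepts is not stated in the abstract, so the function names, the context layout, and the example terms are all assumptions made for illustration.

```python
from itertools import combinations


def common_terms(pages, context):
    """Terms shared by every page in `pages` (all terms if `pages` is empty)."""
    shared = set().union(*context.values())
    for page in pages:
        shared &= context[page]
    return shared


def pages_with(terms, context):
    """Pages whose term set contains every term in `terms`."""
    return {page for page, page_terms in context.items() if terms <= page_terms}


def formal_concepts(context):
    """Enumerate all (extent, intent) pairs by closing every subset of pages.

    Exponential in the number of pages; suitable for toy contexts only.
    """
    concepts = set()
    pages = sorted(context)
    for size in range(len(pages) + 1):
        for subset in combinations(pages, size):
            intent = common_terms(subset, context)
            extent = pages_with(intent, context)
            concepts.add((frozenset(extent), frozenset(intent)))
    return concepts


# Hypothetical context: which index terms occur on which page.
context = {
    "p1": {"clique", "graph"},
    "p2": {"clique", "graph", "search"},
    "p3": {"graph", "search"},
    "p4": {"svd"},
}
concepts = formal_concepts(context)
for extent, intent in sorted(concepts, key=lambda c: (len(c[0]), sorted(c[0]))):
    print(sorted(extent), sorted(intent))
```

Each concept's intent can be read as a description of its extent: the pages `p1` and `p2` form a concept whose meaning is "pages about both clique and graph". This is the sense in which a concept-based cluster carries an explicit meaning.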



Author information

Authors and Affiliations

Division of Computer Science, Graduate School of Information Science and Technology, Hokkaido University, N-14 W-9, Sapporo, 060-0814, Japan
Makoto Haraguchi & Yoshiaki Okubo


Editor information

Editors and Affiliations

Meme Media Laboratory, Hokkaido University Sapporo, Kita 13, Nishi 8, Kita-ku, 060-8628, Sapporo, Japan
Klaus P. Jantke
Meme Media Laboratory, Hokkaido University, 060-8628, Sapporo, Japan
Aran Lunzer
Laboratoire de Recherche en Informatique, Université Paris-Sud, Orsay Cedex, France
Nicolas Spyratos
Meme Media Laboratory, Hokkaido University, N13 W8, 0608628, Sapporo, Japan
Yuzuru Tanaka


Cite this paper

Haraguchi, M., Okubo, Y. (2006). A Method for Pinpoint Clustering of Web Pages with Pseudo-Clique Search. In: Jantke, K.P., Lunzer, A., Spyratos, N., Tanaka, Y. (eds) Federation over the Web. Lecture Notes in Computer Science, vol 3847. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11605126_4


DOI: https://doi.org/10.1007/11605126_4
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-31018-1
Online ISBN: 978-3-540-32587-1
eBook Packages: Computer Science, Computer Science (R0)
