A scalable pattern mining approach to web graph compression with communities

Authors:

Gregory Buehrer,

Kumar Chellapilla

WSDM '08: Proceedings of the 2008 International Conference on Web Search and Data Mining

Pages 95 - 106

https://doi.org/10.1145/1341531.1341547

Published: 11 February 2008

Abstract

A link server is a system designed to support efficient implementations of graph computations on the web graph. In this work, we present a compression scheme for the web graph specifically designed to accommodate community queries and other random access algorithms on link servers. We use a frequent pattern mining approach to extract meaningful connectivity formations. Our Virtual Node Miner achieves graph compression without sacrificing random access by generating virtual nodes from frequent itemsets in vertex adjacency lists. The mining phase guarantees scalability by bounding the pattern mining complexity to O(E log E). We facilitate global mining, relaxing the requirement for the graph to be sorted by URL, enabling discovery of both inter-domain and intra-domain patterns. As a consequence, the approach allows incremental graph updates. Further, it not only facilitates but can also expedite graph computations such as PageRank and local random walks by implementing them directly on the compressed graph. We demonstrate the effectiveness of the proposed approach on several publicly available large web graph data sets. Experimental results indicate that the proposed algorithm achieves a 10- to 15-fold compression on most real-world web graph data sets.
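
The virtual-node idea in the abstract can be illustrated with a toy sketch. This is not the authors' Virtual Node Miner: the shared pattern is supplied by hand rather than mined, and the names `add_virtual_node`, `neighbors`, and `edge_count`, as well as the tiny example graph, are hypothetical. It only shows why replacing a set of destinations shared by several adjacency lists with one virtual node reduces the edge count while still allowing a node's original neighbors to be recovered on demand.

```python
from typing import Dict, FrozenSet, Set

Graph = Dict[str, Set[str]]  # node -> set of out-neighbors


def add_virtual_node(graph: Graph, pattern: FrozenSet[str], name: str) -> Graph:
    """Return a copy of `graph` in which every adjacency list that contains
    all of `pattern` points to the virtual node `name` instead.
    The pattern is assumed to be given; a real miner would discover it."""
    out: Graph = {}
    for src, dsts in graph.items():
        if pattern <= dsts:
            out[src] = (dsts - pattern) | {name}
        else:
            out[src] = set(dsts)
    out[name] = set(pattern)  # the virtual node carries the shared edges once
    return out


def neighbors(graph: Graph, node: str, virtual: Set[str]) -> Set[str]:
    """Random access to the original out-neighbors of `node`,
    expanding virtual nodes on the fly (assumes no cycles among them)."""
    result: Set[str] = set()
    stack = list(graph.get(node, ()))
    while stack:
        n = stack.pop()
        if n in virtual:
            stack.extend(graph[n])
        else:
            result.add(n)
    return result


def edge_count(graph: Graph) -> int:
    return sum(len(dsts) for dsts in graph.values())


# Hypothetical example: three pages share the destinations x, y, z.
graph: Graph = {
    "a": {"x", "y", "z"},
    "b": {"x", "y", "z", "w"},
    "c": {"x", "y", "z"},
    "d": {"w"},
}
compressed = add_virtual_node(graph, frozenset({"x", "y", "z"}), "v1")
print(edge_count(graph), "->", edge_count(compressed))
print(sorted(neighbors(compressed, "b", {"v1"})))
```

In this toy case a pattern of m destinations shared by k sources costs k + m edges after the rewrite instead of k × m, which is where the saving comes from; how the patterns are found within the stated O(E log E) bound is the subject of the paper itself and is not reproduced here.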

Index Terms

A scalable pattern mining approach to web graph compression with communities
1. Information systems
  1. Information retrieval
  2. Information systems applications
    1. Data mining

Published In

WSDM '08: Proceedings of the 2008 International Conference on Web Search and Data Mining

February 2008

270 pages

ISBN: 9781595939272

DOI: 10.1145/1341531

General Chair: Marc Najork (Microsoft, USA)

Program Chairs: Andrei Broder (Yahoo!, USA), Soumen Chakrabarti (IIT Bombay, India)

Copyright © 2008 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Acceptance Rates

Overall Acceptance Rate 498 of 2,863 submissions, 17%

Bibliometrics

Total Citations: 167

Total Downloads: 1,584

Downloads (last 12 months): 45

Downloads (last 6 weeks): 1

Reflects downloads up to 13 Feb 2025
