DOI: 10.1145/1341531.1341547
research-article

A scalable pattern mining approach to web graph compression with communities

Published: 11 February 2008

Abstract

A link server is a system designed to support efficient implementations of graph computations on the web graph. In this work, we present a compression scheme for the web graph specifically designed to accommodate community queries and other random access algorithms on link servers. We use a frequent pattern mining approach to extract meaningful connectivity formations. Our Virtual Node Miner achieves graph compression without sacrificing random access by generating virtual nodes from frequent itemsets in vertex adjacency lists. The mining phase guarantees scalability by bounding the pattern mining complexity to O(E log E). We facilitate global mining, relaxing the requirement for the graph to be sorted by URL, which enables discovery of both inter-domain and intra-domain patterns. As a consequence, the approach allows incremental graph updates. Further, it not only facilitates but can also expedite graph computations such as PageRank and local random walks by implementing them directly on the compressed graph. We demonstrate the effectiveness of the proposed approach on several publicly available large web graph data sets. Experimental results indicate that the proposed algorithm achieves a 10- to 15-fold compression on most real-world web graph data sets.
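The virtual-node idea in the abstract can be illustrated with a toy sketch. This is not the authors' Virtual Node Miner: the shared pattern is supplied by hand rather than mined, and the names `add_virtual_node`, `out_neighbors`, and `edge_count` are hypothetical helpers introduced only for this illustration. It assumes a graph stored as a dictionary of out-neighbor sets. When k pages all link to the same m targets, routing them through one virtual node replaces k*m edges with k+m edges, and the original neighbors remain recoverable by expanding the virtual node.

```python
from typing import Dict, Hashable, Iterable, Set

Graph = Dict[Hashable, Set[Hashable]]


def add_virtual_node(adj: Graph, pattern: Iterable[Hashable],
                     virtual_id: Hashable) -> Graph:
    """Return a copy of `adj` in which every adjacency list that contains
    all of `pattern` links to `virtual_id` instead, and `virtual_id`
    links to the pattern's targets."""
    shared = frozenset(pattern)
    out: Graph = {}
    for src, targets in adj.items():
        if shared <= targets:
            # Replace the shared targets with a single edge to the virtual node.
            out[src] = (targets - shared) | {virtual_id}
        else:
            out[src] = set(targets)
    out[virtual_id] = set(shared)
    return out


def out_neighbors(adj: Graph, node: Hashable,
                  virtual_ids: Set[Hashable]) -> Set[Hashable]:
    """Recover the original out-neighbors of `node` by expanding virtual nodes."""
    result: Set[Hashable] = set()
    stack = list(adj.get(node, ()))
    while stack:
        target = stack.pop()
        if target in virtual_ids:
            stack.extend(adj[target])
        else:
            result.add(target)
    return result


def edge_count(adj: Graph) -> int:
    """Total number of directed edges stored in the adjacency lists."""
    return sum(len(targets) for targets in adj.values())


# Toy graph: four pages share the out-links {x, y, z}; page "a" also links to "w".
original: Graph = {
    "a": {"w", "x", "y", "z"},
    "b": {"x", "y", "z"},
    "c": {"x", "y", "z"},
    "d": {"x", "y", "z"},
}
# The pattern is hand-picked here (an assumption); the paper mines such patterns.
compressed = add_virtual_node(original, {"x", "y", "z"}, "V")

edges_before = edge_count(original)
edges_after = edge_count(compressed)
neighbors_of_a = out_neighbors(compressed, "a", {"V"})
```

In this toy graph the edge count drops from 13 to 8, while `out_neighbors` still answers a per-node query by following one extra hop through the virtual node, which is the sense in which compression need not give up random access.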

      Published In

      WSDM '08: Proceedings of the 2008 International Conference on Web Search and Data Mining
      February 2008
      270 pages
      ISBN:9781595939272
      DOI:10.1145/1341531
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. link analysis
      2. log-linear mining
      3. webgraph compression

      Qualifiers

      • Research-article

      Acceptance Rates

      Overall Acceptance Rate 498 of 2,863 submissions, 17%

      Bibliometrics & Citations

      Bibliometrics

      Article Metrics

      • Downloads (Last 12 months): 45
      • Downloads (Last 6 weeks): 1
      Reflects downloads up to 13 Feb 2025

      Cited By

      • (2024) Faster streaming and scalable algorithms for finding directed dense subgraphs in large graphs. Proceedings of the 41st International Conference on Machine Learning, pages 35876-35891. DOI: 10.5555/3692070.3693532. Online publication date: 21-Jul-2024
      • (2024) Improving Graph Compression for Efficient Resource-Constrained Graph Analytics. Proceedings of the VLDB Endowment 17:9, pages 2212-2226. DOI: 10.14778/3665844.3665852. Online publication date: 1-May-2024
      • (2024) Pb-Hash: Partitioned b-bit Hashing. Proceedings of the 2024 ACM SIGIR International Conference on Theory of Information Retrieval, pages 239-246. DOI: 10.1145/3664190.3672523. Online publication date: 2-Aug-2024
      • (2024) Graph Summarization: Compactness Meets Efficiency. Proceedings of the ACM on Management of Data 2:3, pages 1-26. DOI: 10.1145/3654943. Online publication date: 30-May-2024
      • (2024) A Similarity-based Approach for Efficient Large Quasi-clique Detection. Proceedings of the ACM Web Conference 2024, pages 401-409. DOI: 10.1145/3589334.3645374. Online publication date: 13-May-2024
      • (2023) Scaling Up k-Clique Densest Subgraph Detection. Proceedings of the ACM on Management of Data 1:1, pages 1-26. DOI: 10.1145/3588923. Online publication date: 30-May-2023
      • (2023) CompressGraph: Efficient Parallel Graph Analytics with Rule-Based Compression. Proceedings of the ACM on Management of Data 1:1, pages 1-31. DOI: 10.1145/3588684. Online publication date: 30-May-2023
      • (2023) Building K-Anonymous User Cohorts with Consecutive Consistent Weighted Sampling (CCWS). Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 3374-3379. DOI: 10.1145/3539618.3591857. Online publication date: 19-Jul-2023
      • (2023) Graph Summarization via Node Grouping: A Spectral Algorithm. Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining, pages 742-750. DOI: 10.1145/3539597.3570441. Online publication date: 27-Feb-2023
      • (2023) Optimizing GPU-Based Graph Sampling and Random Walk for Efficiency and Scalability. IEEE Transactions on Computers 72:9, pages 2508-2521. DOI: 10.1109/TC.2023.3251860. Online publication date: 1-Sep-2023
