ABSTRACT
The ever-growing need to process and analyze massive amounts of data from diverse sources, such as telecom call data records, telescope imagery, web pages, stock markets, and medical records, has triggered worldwide research in data-intensive computing. A key requirement is removing redundancy from the data, which improves compute efficiency for downstream processing. These application domains need high-throughput deduplication for huge volumes of data flowing at rates of 1 GB/s or more. In this paper, we present the design of a novel parallel data redundancy removal algorithm. We also present a queueing-theoretic analysis to optimize the throughput of our parallel algorithm on multi-core architectures. For 500M records, our parallel algorithm performs complete deduplication in 255 s on a 16-core Intel Xeon 5570 architecture, a throughput of around 2M records/s. For 2048-byte records, we achieve a throughput of 0.81 GB/s. To the best of our knowledge, this is the highest throughput for data redundancy removal on such massive datasets. We also demonstrate strong and weak scalability of our algorithm on both multi-core Power6 and Intel Xeon 5570 architectures.
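The throughput figures quoted above can be checked with simple arithmetic. The sketch below is illustrative only: the helper names are hypothetical, and it assumes 1 GB = 10^9 bytes, which the abstract does not specify. The abstract also does not state the record size used in the 500M-record run, so the two rates are not necessarily comparable.

```python
def records_per_second(num_records: int, seconds: float) -> float:
    """Average record throughput for a complete run."""
    return num_records / seconds


def implied_record_rate(bytes_per_second: float, record_size_bytes: int) -> float:
    """Record rate implied by a byte throughput and a fixed record size."""
    return bytes_per_second / record_size_bytes


GB = 10**9  # assumption: decimal gigabyte; the abstract does not say which unit is meant

# 500M records deduplicated in 255 s (figures from the abstract).
full_run_rate = records_per_second(500_000_000, 255)

# 0.81 GB/s with 2048-byte records (figures from the abstract).
large_record_rate = implied_record_rate(0.81 * GB, 2048)

print(f"{full_run_rate / 1e6:.2f} M records/s")      # 1.96 M records/s
print(f"{large_record_rate / 1e6:.2f} M records/s")  # 0.40 M records/s
```

The first result matches the "around 2M records/s" stated in the abstract. The second shows that, under the decimal-gigabyte assumption, 0.81 GB/s with 2048-byte records corresponds to roughly 0.4M records/s.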
Index Terms
- High throughput data redundancy removal algorithm with scalable performance