ABSTRACT
Removing redundancy from data is an important problem because it improves resource and compute efficiency in downstream processing of massive datasets (10 million to 100 million records). Application domains such as IR, stock markets, and telecom have a strong need for real-time redundancy removal on enormous volumes of data flowing at rates of 1 Gb/s or higher. We consider the problem of finding Range Motifs (clusters) over the records of a large dataset such that records within the same cluster are approximately close to each other. This problem is closely related to approximate nearest neighbour search but is more computationally expensive, and real-time, scalable approximate Range Motif discovery on massive datasets remains challenging. We present the design of novel sequential and parallel algorithms for approximate Range Motif discovery and data de-duplication using Bloom filters. We establish asymptotic upper bounds on the false positive and false negative rates of our algorithm, and we analyze the time complexity of our parallel algorithm on multi-core architectures. For 10 million records, our parallel algorithm performs approximate Range Motif discovery and data de-duplication on 4 sets (clusters) in 59 s on a 16-core Intel Xeon 5570 architecture. This corresponds to a throughput of around 170K records/s, or around 700 Mb/s with records of size 4K bits. To the best of our knowledge, this is the highest real-time throughput reported for approximate Range Motif discovery and data redundancy removal on such massive datasets.
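As a rough illustration of how a Bloom filter supports de-duplication, the following is a minimal sequential sketch in Python. It is not the algorithm of the paper: it detects only exact duplicates rather than approximately close records, it is not parallel, and the names `BloomFilter`, `add_if_new`, and `deduplicate`, the SHA-256 double-hashing scheme, and the default sizes are all assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter with num_bits bits and num_hashes probes (illustrative)."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, record: bytes):
        # Assumed hashing scheme: derive all probe indices from two 64-bit
        # halves of a single SHA-256 digest (double hashing).
        digest = hashlib.sha256(record).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd step
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add_if_new(self, record: bytes) -> bool:
        """Set the record's bits; return True if at least one bit was unset."""
        is_new = False
        for pos in self._positions(record):
            byte, mask = pos >> 3, 1 << (pos & 7)
            if not self.bits[byte] & mask:
                is_new = True
                self.bits[byte] |= mask
        return is_new


def deduplicate(records, num_bits=1 << 20, num_hashes=4):
    """Keep the first occurrence of each record, as judged by the filter.

    A false positive in the filter drops a record that was actually new,
    so the output may omit a small fraction of distinct records.
    """
    bloom = BloomFilter(num_bits, num_hashes)
    return [r for r in records if bloom.add_if_new(r)]


if __name__ == "__main__":
    stream = [b"alpha", b"beta", b"alpha", b"gamma", b"beta"]
    print(deduplicate(stream))
```

In this sketch an exact repeat is always detected, because all of its bits were set on first insertion. Errors arise only when a distinct record happens to hit bits that are already set, which becomes more likely as the filter fills, so the filter size and number of hash functions have to be chosen for the expected number of records.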
Index Terms
- Real-time approximate Range Motif discovery & data redundancy removal algorithm