DOI: 10.1145/1951365.1951422
research-article

Real-time approximate Range Motif discovery & data redundancy removal algorithm

Published: 21 March 2011

ABSTRACT

Removing redundancy from data is an important problem because it improves resource and compute efficiency in the downstream processing of massive datasets (10 million to 100 million records). Application domains such as IR, stock markets, and telecom have a strong need for real-time redundancy removal on data flowing at rates of 1 Gb/s or higher. We consider the problem of finding Range Motifs (clusters) over the records of a large dataset such that records within the same cluster are approximately close to each other. This problem is closely related to approximate nearest neighbour search but is more computationally expensive. Real-time, scalable approximate Range Motif discovery on massive datasets is challenging. We present the design of novel sequential and parallel algorithms for approximate Range Motif discovery and data de-duplication using Bloom filters. We establish asymptotic upper bounds on the false positive and false negative rates of our algorithm, and we present a time complexity analysis of the parallel algorithm on multi-core architectures. For 10 million records, our parallel algorithm performs approximate Range Motif discovery and data de-duplication on 4 sets (clusters) in 59 s on a 16-core Intel Xeon 5570 architecture. This corresponds to a throughput of around 170K records/s, or around 700 Mb/s with records of size 4K bits. To the best of our knowledge, this is the highest real-time throughput for approximate Range Motif discovery and data redundancy removal on such massive datasets.
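
The abstract does not spell out the algorithm itself, so the following is only a minimal sketch of the underlying building block: a plain Bloom filter used to drop exact duplicates from a stream of records. It is not the paper's Range Motif method, which handles approximately close records and runs in parallel. The names `BloomFilter` and `deduplicate`, the SHA-256 double-hashing scheme, and the default sizes are all assumptions made for illustration.

```python
import hashlib


class BloomFilter:
    """Plain Bloom filter over byte strings (illustrative, not the paper's design)."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: bytes):
        # Assumed hashing scheme: derive k positions as h1 + i * h2 (mod m)
        # from two 64-bit halves of a SHA-256 digest.
        digest = hashlib.sha256(item).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd step
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: bytes) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: bytes) -> bool:
        # True means "possibly seen"; False means "definitely not seen".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def deduplicate(records, num_bits=1 << 20, num_hashes=4):
    """Yield each record the filter has not seen before.

    A false positive in the filter causes a genuinely new record to be
    dropped, which is the error the filter size and hash count trade off.
    """
    seen = BloomFilter(num_bits, num_hashes)
    for record in records:
        if record in seen:
            continue
        seen.add(record)
        yield record


if __name__ == "__main__":
    stream = [b"rec-1", b"rec-2", b"rec-1", b"rec-3", b"rec-2"]
    for record in deduplicate(stream):
        print(record)
```

A plain Bloom filter of this kind never reports an inserted item as absent, so exact repeats are always removed; the only error it introduces is the occasional loss of a new record through a false positive. Matching records that are merely close to each other, as the Range Motif formulation requires, needs additional machinery that this sketch does not attempt.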

Published in

EDBT/ICDT '11: Proceedings of the 14th International Conference on Extending Database Technology
March 2011
587 pages
ISBN: 9781450305280
DOI: 10.1145/1951365

Copyright © 2011 ACM

Publisher: Association for Computing Machinery, New York, NY, United States

Overall acceptance rate: 7 of 10 submissions, 70%
