Knowledge-Based Systems, Volume 188, 5 January 2020, 104987

Probabilistic data structures for big data analytics: A comprehensive review

https://doi.org/10.1016/j.knosys.2019.104987

Abstract

An exponential increase in data generation sources has been widely observed over the last decade, owing to the evolution of technologies such as cloud computing, IoT, and social networking. This enormous and seemingly unlimited growth of data has led to a paradigm shift in storage and retrieval patterns, from traditional data structures to Probabilistic Data Structures (PDS). PDS are a group of data structures that are extremely useful for Big data and streaming applications because they avoid high-latency analytical processing. These data structures use hash functions to compactly represent a set of items in stream-based computing while providing approximations with error bounds, so that well-formed approximations are built directly into the data collections. Compared to traditional data structures, PDS use much less memory and answer complex queries in constant time. This paper provides a detailed discussion of the issues normally encountered with massive data sets, such as storage, retrieval, and querying. Further, the role of PDS in solving these issues is discussed, where these data structures are used as temporary accumulators in query processing. Several variants of existing PDS, along with their application areas, are also explored, giving a holistic view of the domains where these data structures can be applied for efficient storage and retrieval of massive data sets. Mathematical proofs of the various parameters considered in the PDS are also presented, and a relative comparison of the PDS with respect to these parameters is provided.

Introduction

Over the last few years, there has been an exponential increase in data. The amount of data produced every day by sources such as IoT sensors and social networks like Twitter, Instagram, and WhatsApp has grown from terabytes to petabytes. This voluminous data growth, combined with the need for efficient storage and retrieval, poses a big challenge for industry as well as academia [1]. To handle this large volume of data, traditional algorithms cannot go beyond linear processing. Moreover, traditional approaches demand that the entire data set be stored in a formatted manner. These massive data sets require architectures and tools for data storage, processing, mining, handling and leveraging of the information to offer better services.

In the age of in-stream data [2] and the Internet of Things (IoT) [3], there is no limit on the amount of data arriving from varied sources. Moreover, the complexity of the data and the amount of noise associated with it are not predefined. Since the size of the data is unknown, one cannot determine how much memory is required to store it. Furthermore, the amount of data to be analyzed is in exabytes, which is too large to fit in the available memory, so linear processing and exact storage of the data become challenging. Thus, it is difficult to capture, store and process the incoming data within the stipulated time [4]. Data sets with such characteristics are typically referred to as Big data. Various definitions have been used to describe Big data from different perspectives. Machine learning is used in a number of applications for optimization [5]. Further, the trend in traditional data mining is shifting towards more complex tasks, i.e., correlated utility-based pattern mining [6]. In this paper we define Big data’s most relevant characteristics from the data analytics view, referred to as the 9 V’s model. An illustrative description of these V’s is depicted in Fig. 1.

Big data technologies are important in providing accurate analysis, leading to more concrete decision-making, which in turn results in greater operational efficiency, cost reduction, and reduced risk for the business. To cope with Big data efficiently, new technologies have appeared that enable distributed data storage and parallel data processing. These technologies include MapReduce by Google, which provides a new method of analyzing data that can be scaled from a single server to thousands of high- and low-end machines; NoSQL Big data systems, which are designed to take advantage of new cloud computing architectures so that massive computations can be run inexpensively and efficiently; and cloud platforms such as Amazon AWS and Microsoft Azure, which provide various tools to handle Big data. Alongside the above-mentioned technologies, Apache Hadoop (with its HDFS and MapReduce components) was a pioneering technology. Hadoop, developed by Apache, is an open source tool, and its most commonly used component, Hadoop MapReduce, is based on Google’s MapReduce model. Hadoop is a package of many components, which come in various forms, including Apache Hive, an infrastructure for data warehousing; Apache Oozie, for scheduling Hadoop jobs; Apache Pig, a data flow platform responsible for the execution of MapReduce jobs; Apache Spark, an open source framework used for cluster computing; etc.

Although Hadoop provides an overall package for Big data analytics that requires little technical background to operate, some issues still need optimized solutions. In Hadoop, MapReduce processes large data sets with a parallel and distributed algorithm. Data is distributed and processed over the cluster in MapReduce, leading to an increase in processing time and a decrease in processing speed. Further, Hadoop supports batch processing only; it does not process streamed data, and hence its overall performance is slower (Apache Spark, by contrast, supports stream processing). Another major issue in Hadoop is that its programming model is quite restrictive, which makes it difficult to modify the built-in algorithms. The efficient analysis of in-stream data often requires powerful tools such as Apache Spark, Google BigQuery, High-Performance Computing Cluster (HPCC), etc. However, these tools are not suitable for real-time use cases where a fast response is required, such as processing data in a specific application domain, implementing interactive jobs and models, etc. Recent research directions in the area of Big data processing, analysis and visualization clearly indicate the importance of Probabilistic Data Structures (PDS).

Using deterministic data structures to analyze in-stream data often incurs considerable computational, space and time complexity. Probabilistic alternatives to deterministic data structures, i.e., Probabilistic Data Structures (PDS), are better in terms of simplicity and the constant factors involved in actual run time. They are suitable for large-scale data processing, approximate predictions, fast retrieval and storage of unstructured data, and thus play an important role in Big data processing.

PDS are, tautologically speaking, data structures having a probabilistic component [7]. This probabilistic component is used to trade a small amount of accuracy for large savings in time or space. PDS cannot give a definite answer; instead, they provide a reasonable approximation of the answer together with a way to bound the estimation error. They are useful for Big data and streaming applications because they decrease the amount of memory needed in comparison to data structures that give exact answers [8]. Different variants of PDS are highlighted in Fig. 2. In the majority of cases, these data structures use hash functions to randomize the items (a minimal sketch of this hashing step is given after the list below). Because they tolerate hash collisions, they can keep their size constant, but this is also the reason they cannot give exact values. Moreover, PDS offer several advantages, as given below:

  • They use a small amount of memory (one can control how much).

  • They are easily parallelizable (hashes are independent).

  • They have constant query time.
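
To make the hashing idea above concrete, the following is a minimal sketch (not taken from the paper) of how a PDS can derive k slot positions for an item from just two base hashes, in the spirit of the “less hashing, same performance” construction of Kirsch and Mitzenmacher cited in the references. The function name, the use of SHA-256 and the parameter values are illustrative assumptions.

```python
# Illustrative sketch: simulate k hash functions with two base hashes.
# k_hash_positions, SHA-256 and the sizes below are assumptions for illustration.
import hashlib

def k_hash_positions(item: str, k: int, m: int):
    """Map an item to k slot indices in a table of size m."""
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    h1 = int.from_bytes(digest[:8], "big")
    h2 = int.from_bytes(digest[8:16], "big") | 1  # keep h2 odd
    # g_i(x) = h1(x) + i * h2(x) mod m behaves like k independent hash functions.
    return [(h1 + i * h2) % m for i in range(k)]

print(k_hash_positions("user-42", k=4, m=1024))  # four pseudo-independent slots
```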

The major focus of this paper is on the role of Probabilistic Data Structures (PDS) in the following scenarios:

  • Approximate Membership Query: Store bulk data in a small space and respond to a user’s membership query efficiently within the given space S.

  • Frequency Count: Count the number of times a data item has appeared in the massive data set.

  • Cardinality Estimate: Find the cardinality, i.e., the number of distinct members of a set, in the massive data set.

  • Similarity Search: Identify similar items, i.e., find the approximate nearest neighbors (most similar items) to the query in the available dataset.

Organization of paper: Section 2 provides a detailed discussion of the approximate membership query using the most frequently used PDS, the Bloom Filter (BF), and its variant, the Quotient Filter (QF). Section 3 discusses how the frequency count problem is solved efficiently by the PDS named Count-Min Sketch (CMS). Section 4 provides insight into cardinality estimation using the HyperLogLog (HLL) counter, along with a relative comparison and review of various variants of HLL. Section 5 discusses the PDS used for similarity search over massive Big data and provides a detailed discussion of MinHash and the family of Locality Sensitive Hashing (LSH) (the various LSH families, based on the distance metric used, are discussed in Appendix A). Section 6 summarizes the role of all the above-mentioned PDS with respect to various parameters. Finally, Section 7 concludes the paper.

Section snippets

Approximate membership query

Given millions or even billions of data elements, developing efficient solutions for storing, updating, and querying them becomes difficult, especially when the data is queried by some real-time application. Traditional database approaches, which perform filtering and analysis only after storing the data, are not efficient for real-time processing. Since the data is bulky and requires large data structures, the retrieval cost of even a small query is very high. Above
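
The most common PDS for this problem is the Bloom filter discussed in this section. As a hedged illustration only (the class name, bit-array size m and hash count k are assumptions, not taken from the paper), a minimal Bloom filter might look as follows; it can report false positives but never false negatives.

```python
# A minimal Bloom filter sketch for approximate membership queries.
# BloomFilter, m = 8192 bits and k = 4 hashes are illustrative choices.
import hashlib

class BloomFilter:
    def __init__(self, m: int = 8192, k: int = 4):
        self.m, self.k = m, k
        self.bits = bytearray(m // 8)

    def _positions(self, item: str):
        d = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(d[:8], "big")
        h2 = int.from_bytes(d[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item: str):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, item: str) -> bool:
        # May return a false positive, never a false negative.
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

bf = BloomFilter()
bf.add("alice@example.com")
print(bf.might_contain("alice@example.com"))  # True
print(bf.might_contain("bob@example.com"))    # almost certainly False
```

With m bits, k hash functions and n inserted items, the false positive rate of such a filter is approximately (1 - e^(-kn/m))^k, which is the quantity analyzed mathematically later in this section.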

Frequency count

Given a set of duplicated values, one needs to estimate the frequency of each value. The estimates for relatively rare values can be imprecise; however, frequent values and their absolute frequencies can be determined accurately. When the frequency count problem needs to be solved in sub-linear space, some approximation in the result is tolerable provided the processing is fast. In streaming data, frequent item counting is sometimes called ϵ-approximate frequent item counting, which is defined in the next
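
For reference, a minimal Count-Min Sketch in the style of Cormode and Muthukrishnan’s construction (cited in the references) is sketched below; the class name and the width/depth values are illustrative assumptions, and the returned estimate never underestimates the true count.

```python
# A minimal Count-Min Sketch: depth rows of counters, one hash per row.
# CountMinSketch and the width/depth defaults are illustrative assumptions.
import hashlib

class CountMinSketch:
    def __init__(self, width: int = 2048, depth: int = 5):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item: str, row: int) -> int:
        d = hashlib.sha256(f"{row}:{item}".encode()).digest()
        return int.from_bytes(d[:8], "big") % self.width

    def update(self, item: str, count: int = 1):
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item: str) -> int:
        # Collisions only inflate counters, so the minimum over rows
        # gives the tightest (over-)estimate of the true frequency.
        return min(self.table[row][self._index(item, row)] for row in range(self.depth))

cms = CountMinSketch()
for word in ["cat", "dog", "cat", "cat", "fish"]:
    cms.update(word)
print(cms.estimate("cat"), cms.estimate("dog"))  # 3 1 (up to collision error)
```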

Cardinality estimate

As the amount of data to be analyzed increases, determining the cardinality becomes an important factor, especially when the incoming data is dynamic and its volume is unknown. In multi-sets, determining the exact cardinality is a highly computation-intensive process, since its cost is proportional to the number of elements in the large data set.

Probabilistic cardinality estimators used for determining approximate cardinality include LogLog [73], HyperLogLog [75], MinCount [74], Probabilistic counting,
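
To illustrate the register-based idea behind LogLog and HyperLogLog, the following is a simplified sketch rather than the full algorithm: it omits HyperLogLog’s small- and large-range corrections, and the function name and precision p = 10 (1024 registers) are illustrative assumptions.

```python
# Simplified HyperLogLog-style cardinality estimate (no range corrections).
# hll_estimate and p = 10 are illustrative assumptions.
import hashlib

def hll_estimate(items, p: int = 10):
    m = 1 << p                      # number of registers
    registers = [0] * m
    for item in items:
        x = int.from_bytes(hashlib.sha256(str(item).encode()).digest()[:8], "big")
        idx = x >> (64 - p)                        # first p bits pick a register
        rest = x & ((1 << (64 - p)) - 1)
        rank = (64 - p) - rest.bit_length() + 1    # leading-zero count + 1
        registers[idx] = max(registers[idx], rank)
    alpha = 0.7213 / (1 + 1.079 / m)               # standard bias constant
    return alpha * m * m / sum(2.0 ** -r for r in registers)

print(round(hll_estimate(range(100_000))))  # roughly 100,000, within a few percent
```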

Similarity search

Finding similar items in a set is a process of checking all items and identifying the closest one. To categorize a data set into a particular class, one needs to find how similar two items of the data set are to each other. Problems related to finding similar items are often solved by identifying the nearest neighbors of an object. Such problems have a number of mathematical solutions in terms of distance measures like Hamming distance, cosine and sine similarity measures, Jaccard’s
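
As a hedged illustration of the MinHash idea discussed in this section (the function names and the number of hash functions are assumptions), the sketch below estimates the Jaccard similarity of two sets from the fraction of matching per-hash minima.

```python
# Minimal MinHash sketch for estimating Jaccard similarity.
# minhash_signature, estimated_jaccard and num_hashes = 128 are illustrative.
import hashlib

def minhash_signature(items, num_hashes: int = 128):
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int.from_bytes(hashlib.sha256(f"{seed}:{x}".encode()).digest()[:8], "big")
            for x in items
        ))
    return sig

def estimated_jaccard(sig_a, sig_b) -> float:
    # Under a random hash, the probability that two sets share the same minimum
    # equals their Jaccard similarity, so the match fraction estimates it.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = {"big", "data", "stream", "hash", "filter"}
b = {"big", "data", "stream", "sketch", "count"}
print(estimated_jaccard(minhash_signature(a), minhash_signature(b)))  # close to 3/7
```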

Discussion

In today’s world, data originates from heterogeneous sources, and current real-world databases are severely susceptible to inconsistent, incomplete and noisy data [93]. In order to support data applications in different domains, data processing must be as efficient and automated as possible. With an exponential increase in data, the extraction of useful information from massive data, particularly for analytics, is a daunting task [1]. Some of the applications which need special attention

Conclusion and future scope

This paper provides a comprehensive view of the various prevalent PDS which can be used for the storage, retrieval and mining of massive data sets. The data structures discussed in the paper can be used to store bulk data in minimal space, find the cardinality of data sets, identify similar data sets in Big data and find the frequency of elements in massive data. All the PDS are supported with mathematical proofs, i.e., the mathematical analysis of BF and QF is provided in Sections 2.1, 2.2

References (118)

  • Cormode, G., et al., An improved data stream summary: the count-min sketch and its applications, J. Algorithms (2005)
  • Flajolet, P., et al., Probabilistic counting algorithms for data base applications, J. Comput. System Sci. (1985)
  • Zhou, Z., et al., Per-flow cardinality estimation based on virtual loglog sketching
  • García, S., et al., Big data preprocessing: methods and prospects, Big Data Anal. (2016)
  • Rutkowski, L., et al., Basic concepts of data stream mining
  • Srinivasan, C., et al., A review on the different types of internet of things (IoT), J. Adv. Res. Dyn. Control Syst. (2019)
  • Singh, M.P., et al., Analysis of systems to process massive data stream, CoRR (2016)
  • Gakhov, A., Probabilistic Data Structures and Algorithms for Big Data Applications (2019)
  • Katsov, I., Probabilistic data structures for web analytics and data mining (2012)
  • Bloom, B.H., Space/time trade-offs in hash coding with allowable errors, Commun. ACM (1970)
  • Bender, M.A., et al., Don’t thrash: How to cache your hash on flash, Proc. VLDB Endow. (2012)
  • Tarkoma, S., et al., Theory and practice of bloom filters for distributed systems, IEEE Commun. Surv. Tutor. (2012)
  • Kirsch, A., et al., Distance-sensitive bloom filters
  • Bruck, J., et al., Weighted bloom filter
  • Fan, L., et al., Summary cache: A scalable wide-area web cache sharing protocol, IEEE/ACM Trans. Netw. (2000)
  • Bonomi, F., et al., An improved construction for counting bloom filters
  • Guo, D., et al., The dynamic bloom filters, IEEE Trans. Knowl. Data Eng. (2010)
  • Deng, F., et al., Approximately detecting duplicates for streaming data using stable bloom filters
  • Kirsch, A., et al., Less hashing, same performance: Building a better bloom filter, Random Struct. Algorithms (2008)
  • Choi, K.W., et al., Discovering mobile applications in cellular device-to-device communications: Hash function and bloom filter-based approach, IEEE Trans. Mob. Comput. (2016)
  • Verma, K., et al., Bloom-filter based IP-CHOCK detection scheme for denial of service attacks in VANET, Secur. Commun. Netw. (2015)
  • Groza, B., et al., Efficient intrusion detection with bloom filtering in controller area networks, IEEE Trans. Inf. Forensics Secur. (2019)
  • Cheng, K., Hot spot tracking by time-decaying bloom filters and reservoir sampling
  • Najam, M., et al., Pattern matching for DNA sequencing data using multiple bloom filters, Biomed. Res. Int. (2019)
  • Quora, M., What are the best applications of Bloom filters? (2014)
  • Singh, A., et al., Fuzzy-folded bloom filter-as-a-service for big data storage on cloud, IEEE Trans. Ind. Inf. (2018)
  • Liu, P., et al., ID bloom filter: Achieving faster multi-set membership query in network applications
  • Lu, J., et al., Ultra-fast bloom filters using SIMD techniques, IEEE Trans. Parallel Distrib. Syst. (2019)
  • Patgiri, R., et al., rdbf: A r-Dimensional Bloom Filter for massive scale membership query, J. Netw. Comput. Appl. (2019)
  • Mitzenmacher, M., Compressed bloom filters, IEEE/ACM Trans. Netw. (2002)
  • Cohen, S., et al., Spectral bloom filters
  • Kumar, A., et al., Space-code bloom filter for efficient traffic flow measurement
  • Goh, E.-J., Secure Indexes (2003)
  • Shanmugasundaram, K., et al., Payload attribution via hierarchical bloom filters
  • Chazelle, B., et al., The bloomier filter: An efficient data structure for static support lookup tables
  • Xiao, M.-Z., et al., Split bloom filter, Tien Tzu Hsueh Pao/Acta Electron. Sin. (2004)
  • F. Chang, W. chang Feng, K. Li, Approximate caches for packet classification, in: Twenty-Third Annual Joint Conference...
  • Y. Lu, B. Prabhakar, F. Bonomi, Bloom Filters: Design Innovations and Novel Applications, in: Proc. of the Forty-Third...
  • Donnet, B., et al., Retouched bloom filters: Allowing networked applications to trade off selected false positives against false negatives
  • Bruck, J., et al., Adaptive Bloom Filter (2006)
No author associated with this paper has disclosed any potential or pertinent conflicts which may be perceived to have impending conflict with this work. For full disclosure statements refer to https://doi.org/10.1016/j.knosys.2019.104987.