Sampling strategies for extracting information from large data sets

https://doi.org/10.1016/j.datak.2018.01.002

Abstract

Extracting information from large volumes of data is very expensive in terms of resources such as CPU and memory, as well as computation time. Analyzing a small data set extracted from the original one is therefore preferred. From this small set, called a sample, approximate results can be obtained. The errors are acceptable given the reduced cost of processing the data. Using sampling algorithms with small errors saves execution time and resources. This paper compares sampling algorithms in order to determine which one performs better for set operations such as intersection, union, and difference. The comparison focuses on the errors introduced by each algorithm for different sample sizes and on the execution times.

Introduction

With the tremendous and continuous generation of new data each day from different sources, e.g. the web, sensors, mobile operators, etc., data mining has become a pivotal trend that still poses many challenges today. In the literature, large data sets, either human or computer generated, have been classified into three classes [1]: i) structured data, which refers to data with a predefined data model, ii) unstructured data, which generally uses a flexible, schema-free model, and iii) hybrid data, which combines structured and unstructured data to correlate the information from both sources.

When dealing with large volumes of data, gathered from different sources, either structured or unstructured, human or computer generated, one of the main challenges is to determine patterns and extract knowledge. Different techniques have been proposed in the literature: processing the entire data set, which is an expensive operation even with recent breakthroughs in high-performance computing [2,3], or data sampling, a technique that extracts subsets from the entire data set and analyzes them [4,5]. Moreover, a good sampling method tries to select the best instances so as to achieve good performance with a small amount of memory and time [6]. In this paper, the emphasis is on determining which sampling algorithm has the best performance and accuracy on a large structured data set.

Databases are frequently used to store large structured data sets. In order to extract data from the database, specific operations implemented in the database management system (DBMS) are performed. The execution time of retrieving information increases proportionally with the volume of data stored [7]. In such cases, it is preferable to reduce costs and lose as little precision as possible by estimating the query results.

Consider the following scenario. An advertiser wants to maximize the impact of commercials posted on websites. The data gathered for a specific user from a single website - email, birthday, country, sex, time spent on a page, clicked items, most viewed elements on the page - is not specific enough. Commercials should be even better suited to the user's profile. The goal is to identify which users visit multiple websites, in order to narrow down their interests. Consequently, the commercials would be well targeted and the user would be more likely to click the advertisements. The reason an advertiser wants to target users so precisely is that the advertiser gets paid whenever a user clicks an advertisement. From an implementation perspective, finding out which users visit multiple websites means computing a set intersection - the users who are interested in both website A and website B. If the intersection of the sets is small and the sets are large, computing it would waste computation power for little profitable result - the cost of advertising to so few users is not justified by the amount of computing resources used. It would therefore be useful to know which sets have an intersection above a defined threshold. Intersecting all the sets would consume many resources, so a solution is to approximate the intersection of these sets using sampling and then intersect only the sets of interest. The accuracy of the sampling algorithm must therefore be as high as possible so that the approximation of the intersection is close to the real intersection. A business that deals with large volumes of data and needs to approximate intersections, unions, and differences before doing the actual computation could benefit from this research. A sketch of the idea is given below.
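Hash-based sampling fits this scenario because the same elements are selected from every set, so the samples themselves can be intersected and the result scaled back up. The following Python sketch only illustrates the idea; the hash function, the modulus m, and the threshold are assumptions for the example, not the paper's exact setup:

    import hashlib

    def hash_mod_sample(items, m):
        """Keep only items whose hash value is divisible by m (roughly a 1/m sample)."""
        return {x for x in items
                if int(hashlib.md5(x.encode()).hexdigest(), 16) % m == 0}

    def estimate_intersection_size(set_a, set_b, m=100):
        """Estimate |A ∩ B| by intersecting the two hash-based samples and scaling by m."""
        sample_a = hash_mod_sample(set_a, m)
        sample_b = hash_mod_sample(set_b, m)
        return m * len(sample_a & sample_b)

    # Usage: only compute the exact intersection for set pairs whose estimated
    # overlap exceeds a (hypothetical) threshold.
    # if estimate_intersection_size(users_site_a, users_site_b) > THRESHOLD:
    #     common_users = users_site_a & users_site_b

Because an element's membership in the sample depends only on its hash, the intersection of the two samples is itself a hash-based sample of the true intersection, which is what makes the scaling step valid.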

A solution to reduce costs is to apply the query to a representative sample of the data instead of the entire data set. The results of operations performed on samples approximate the real results, but they introduce a calculable error. In this paper, different sampling algorithms are compared in order to determine the best approach for minimizing both the error and the execution time when dealing with large volumes of data.

This paper is structured as follows. Section 2 presents related work on sampling techniques. Section 3 describes the algorithms for sampling. Section 4 reviews the approximation techniques used for evaluating the algorithms. Section 5 presents the test environment, the experiments and discusses the results. Finally, the last section concludes with a summary and hints at future work.


Related work

Many sampling algorithms have been proposed so far in the literature [8,9]. These algorithms have different applications in computer science, such as databases, data mining, and randomized algorithms [10].

In general, random sampling is fundamental to many practical problems, including market surveys, online advertising, and statistics [11].

Through comparison, T. Wang et al. have found that the performance of sampling algorithms differs considerably across different kinds of data sets [12].

Algorithm description

This section presents the investigated sampling algorithms. Their time complexity varies, while their space complexity is the same. It is interesting to see whether the differences in running time also affect the quality of the sample.

When discussing the algorithms and their complexities, the following notations are used: n is the sample size and N is the size of the set.
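As a point of reference, Algorithm R (classic reservoir sampling, one of the algorithms compared later) draws a uniform sample of size n in a single pass over a set of size N. The following is a minimal Python sketch of the textbook version, not necessarily the implementation used in the experiments:

    import random

    def reservoir_sample_r(stream, n):
        """Algorithm R: uniform random sample of n items from a stream of size N.

        Runs in O(N) time and O(n) space; each item ends up in the sample
        with probability n/N.
        """
        reservoir = []
        for i, item in enumerate(stream):
            if i < n:
                reservoir.append(item)      # fill the reservoir with the first n items
            else:
                j = random.randint(0, i)    # pick a position in [0, i]
                if j < n:
                    reservoir[j] = item     # replace an item with decreasing probability
        return reservoir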

Single set use case

The computation of different metrics on a column (e.g. the mean, standard deviation, etc.) of a table with millions of records is expensive in terms of CPU processing power and memory usage. These costs can be decreased using sampling, and the metrics, although no longer exact, can be approximated with small error rates.

The mean μ is used in order to compute the error ɛ. The error introduced by the sampling can be calculated by comparing the mean of the set with the mean of the sample (Equation (1)).
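The exact form of Equation (1) is not reproduced in this snippet. Assuming the error is the relative deviation of the sample mean from the set mean, it could be computed along these lines:

    import statistics

    def relative_mean_error(full_column, sample_column):
        """Assumed reading of the error: relative deviation of the sample mean from the set mean."""
        mu_set = statistics.mean(full_column)       # μ of the full column
        mu_sample = statistics.mean(sample_column)  # μ of the sample
        return abs(mu_set - mu_sample) / abs(mu_set)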

Experimental results

Input data sets consist of randomly generated UUIDs – version 4, RFC 4122. Each UUID identifies a user's visit to a website - email, birthday, country, sex, time spent on a page, clicked items, most viewed elements on a page. First, we generated an initial set of UUIDs. From this set, we randomly took UUIDs and used them to populate the second set. Then, we completed the second set with more random UUIDs. The procedure was repeated for the third set by taking random UUIDs from both
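A minimal Python sketch of this generation procedure for the first two sets is shown below; the set sizes and overlap counts are illustrative assumptions, since the snippet does not state the exact values used:

    import random
    import uuid

    def generate_sets(initial_size=1_000_000, overlap=200_000, extra=800_000):
        """Build two UUID (version 4) sets with a controlled overlap."""
        first = {str(uuid.uuid4()) for _ in range(initial_size)}
        # seed the second set with UUIDs drawn at random from the first one ...
        second = set(random.sample(list(first), overlap))
        # ... then complete it with fresh random UUIDs
        second.update(str(uuid.uuid4()) for _ in range(extra))
        return first, second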

Conclusions

In this paper, the execution time and accuracy of the sampling algorithms R, Z, D, Hash Mod, Random Sort, and Systematic Sampling are compared. The comparison is done for four case studies: single set sampling, set intersection sampling, set union sampling, and set difference sampling. The Hash Mod algorithm has the smallest error rates, although its execution time is the worst among the studied algorithms.

For the single set sampling use case, it is advisable to use D algorithm, which has a low

References (31)

  • P.S. Efraimidis et al., Weighted random sampling with a reservoir, Inf. Process. Lett. (2006)
  • R.J. Lipton et al., Query size estimation by adaptive sampling, J. Comput. Syst. Sci. (1995)
  • M. Chen et al., Big data: a survey, Mobile Network. Appl. (2014)
  • P. Zikopoulos et al., Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data (2011)
  • S. Garion et al., Big data analysis of cloud storage logs using Spark
  • X. Wu et al., Data mining with big data, IEEE Trans. Knowl. Data Eng. (2014)
  • J.A.R. Rojas et al., Sampling techniques to improve big data exploration
  • A. Bifet, Mining big data in real time, Informatica (2013)
  • R. Singhal et al., Predicting SQL query execution time for large data volume
  • Y. Tillé, Sampling Algorithms (2006)
  • F. Yates, A review of recent statistical developments in sampling and sampling surveys, J. R. Stat. Soc. (1946)
  • P.S. Efraimidis, Weighted random sampling over data streams
  • T. Wang et al., Understanding graph sampling algorithms for social network analysis
  • A. Rezvanian et al., Sampling algorithms for weighted networks, Social Network Anal. Mining (2016)
  • J.S. Vitter, Faster methods for random sampling, Commun. ACM (1984)