Sampling strategies for extracting information from large data sets
Introduction
With the tremendous and continuous generation of new data each day from different sources, e.g. web, sensors, mobile operators, etc., data mining has become a pivotal trend that still poses many challenges. In the literature, large data sets, either human or computer generated, have been classified into three classes [1]: i) structured data, which refers to data with a predefined data model, ii) unstructured data, which generally uses a flexible, schema-free model, and iii) hybrid data, which combines structured and unstructured data to correlate the information from both sources.
When dealing with large volumes of data gathered from different sources, either structured or unstructured, human or computer generated, one of the main challenges is to determine patterns and extract knowledge. In the literature, different techniques have been proposed: processing the entire data set, which is an expensive operation even with the recent breakthroughs in the field of high performance computing [2,3], or data sampling, a technique that extracts subsets from the entire data set and analyzes them [4,5]. Moreover, a good sampling method will try to select the best instances, achieving good performance while using little memory and time [6]. In this paper, the emphasis is on determining which sampling algorithm has the best performance and accuracy when used on a large structured data set.
Databases are frequently used to store large structured data sets. In order to extract data from the database, specific operations implemented in the database management system (DBMS) are performed. The execution time of retrieving information increases proportionally with the volume of data stored [7]. In such cases, it is preferable to reduce costs and lose as little precision as possible by estimating the query results.
Consider the following scenario. An advertiser wants to maximize the impact of commercials posted on websites. The data gathered for a specific user from a single website - email, birthday, country, sex, time spent on a page, clicked items, most viewed elements on the page - is not specific enough, and commercials should be even better suited to the user's profile. The goal is therefore to identify which users visit multiple websites, in order to narrow down their interests. Consequently, the commercials would be well targeted and the user would be more likely to click the advertisements. The reason an advertiser wants to target users so precisely is that he gets paid whenever a user clicks an advertisement. From an implementation perspective, finding out which users visit multiple websites means determining a set intersection - the users who are interested in both website A and website B. If the intersection is small and the sets are large, computing it would waste computation power for little profit - the cost of advertising to so few users is not justified by the amount of computation resources used. It would therefore be useful to know which sets have an intersection above a defined threshold. Intersecting all the sets would consume many resources, so a solution is to approximate the intersection of these sets using sampling and then intersect only the sets of interest. Consequently, the accuracy of the sampling algorithm must be as high as possible so that the approximated intersection is close to the real one. A business which deals with large volumes of data and needs to approximate intersections, unions, and differences before doing the actual computation could benefit from this research.
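To illustrate the idea, the intersection size of two large sets can be approximated from samples. The sketch below uses hash-based sampling (in the spirit of the Hash Mod algorithm studied later in the paper, though the function names and the modulus parameter m are our own illustrative choices): because both sets are sampled with the same deterministic hash function, an element kept in one sample is also kept in the other, so the intersection of the samples scales up to an estimate of the true intersection.

```python
import hashlib

def hash_mod_sample(items, m):
    """Keep the items whose hash value is divisible by m
    (roughly a 1/m fraction of the set, chosen deterministically)."""
    return {x for x in items
            if int(hashlib.md5(x.encode()).hexdigest(), 16) % m == 0}

def estimate_intersection(a, b, m=100):
    """Estimate |a & b| by intersecting the coordinated samples
    and scaling the count back up by m."""
    return len(hash_mod_sample(a, m) & hash_mod_sample(b, m)) * m
```

Sets whose estimated intersection falls below the advertiser's threshold can then be skipped, and the exact (expensive) intersection computed only for the promising pairs.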
A solution to reduce costs is to apply the query to a representative sample of the data instead of the entire data set. The results generated by operations performed on samples approximate the real results, but they introduce a calculable error. In this paper, different sampling algorithms are compared in order to determine the best approach to minimize both the error and the execution time when dealing with large volumes of data.
This paper is structured as follows. Section 2 presents related work on sampling techniques. Section 3 describes the algorithms for sampling. Section 4 reviews the approximation techniques used for evaluating the algorithms. Section 5 presents the test environment, the experiments and discusses the results. Finally, the last section concludes with a summary and hints at future work.
Related work
There are many sampling algorithms proposed so far in the literature [8,9]. These algorithms have different applications in computer science, in fields like databases, data mining, and randomized algorithms [10].
In general, random sampling is a fundamental problem in many practical problems including market surveys, online advertising, and statistics [11].
Through comparison, T. Wang et al. have found that the performance of sampling algorithms differs significantly across different kinds of data sets [12].
Algorithm description
This section presents the investigated sampling algorithms. Their time complexity varies, while the space complexity is the same for all algorithms. It is interesting to see whether the differences in execution time also affect the quality of the sample.
When discussing the algorithms and their complexities, the following notations are used: n is the sample size and N is the size of the set.
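As an example of the algorithms compared here, algorithm R (Vitter's classic reservoir sampling) draws a uniform sample of n items from a set of N items in a single pass, using O(n) space. The following is a minimal sketch; the paper's actual implementations may differ in details such as the random number generator used.

```python
import random

def reservoir_sample(stream, n):
    """Algorithm R: uniform random sample of n items from a stream,
    without knowing the stream's size N in advance."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < n:
            # Fill the reservoir with the first n items.
            reservoir.append(item)
        else:
            # Replace a reservoir slot with probability n / (i + 1).
            j = random.randint(0, i)
            if j < n:
                reservoir[j] = item
    return reservoir
```

Each item in the stream ends up in the final sample with probability n/N, and the algorithm runs in O(N) time regardless of n.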
Single set use case
The computation of different metrics on a column (e.g. the mean, standard deviation, etc.) of a table with millions of records is expensive in terms of CPU processing power and memory usage. These costs can be decreased using sampling, and the metrics, although not accurate, can be approximated with small error rates.
The mean μ is used in order to compute the error ɛ. The error introduced by the sampling can be calculated by comparing the mean of the set with the mean of the sample. Equation (1)
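The error measurement described above can be sketched as follows, assuming the relative-error form ε = |μ − μs| / μ, where μ is the mean of the full set and μs the mean of the sample (the function name and the choice of a uniform random sample are illustrative; the paper evaluates several sampling algorithms in this role).

```python
import random

def relative_error(data, sample_size):
    """Compare the mean of a uniform random sample with the true mean
    of the full data set, returning the relative error."""
    mu = sum(data) / len(data)          # true mean of the set
    sample = random.sample(data, sample_size)
    mu_s = sum(sample) / sample_size    # mean of the sample
    return abs(mu - mu_s) / mu
```

Larger samples drive this error toward zero, at the cost of processing more records.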
Experimental results
Input data sets consist of randomly generated UUIDs (version 4, RFC 4122). Each UUID identifies a user's visit on a website - email, birthday, country, sex, time spent on a page, clicked items, most viewed elements on a page. First, we generated an initial set of UUIDs. We then populated a second set with UUIDs randomly drawn from the initial set and completed it with additional random UUIDs. The procedure was repeated for the third set by taking random UUIDs from both
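A plausible sketch of this generation procedure for two sets is shown below (the function name and the exact overlap parameter are our own; the paper does not specify them): the shared elements are drawn from the first set, and the remainder of the second set is filled with fresh random UUIDs.

```python
import random
import uuid

def generate_sets(size, overlap):
    """Build two sets of version-4 UUIDs sharing roughly
    `overlap` elements, mirroring the described procedure."""
    first = {str(uuid.uuid4()) for _ in range(size)}
    # Draw the shared elements from the first set.
    shared = set(random.sample(sorted(first), overlap))
    second = set(shared)
    # Complete the second set with fresh random UUIDs.
    while len(second) < size:
        second.add(str(uuid.uuid4()))
    return first, second
```

Controlling the overlap this way makes the true intersection size known, so the sampling algorithms' intersection estimates can be checked against ground truth.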
Conclusions
In this paper, the execution time and accuracy of the sampling algorithms R, Z, D, Hash Mod, Random Sort, and Systematic Sampling are compared. The comparison is done for four case studies: single set sampling, set intersection sampling, set union sampling, and set difference sampling. The Hash Mod algorithm has the smallest error rates, although its execution time is the worst among the studied algorithms.
For the single set sampling use case, it is advisable to use D algorithm, which has a low
References (31)
- et al., Weighted random sampling with a reservoir, Inf. Process. Lett. (2006)
- et al., Query size estimation by adaptive sampling, J. Comput. Syst. Sci. (1995)
- et al., Big data: a survey, Mobile Network. Appl. (2014)
- et al., Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data (2011)
- et al., Big data analysis of cloud storage logs using Spark
- et al., Data mining with big data, IEEE Trans. Knowl. Data Eng. (2014)
- et al., Sampling techniques to improve big data exploration
- Mining big data in real time, Informatica (2013)
- et al., Predicting SQL query execution time for large data volume
- Sampling Algorithms (2006)