
1 Introduction

1.1 Related Works

Selecting a representative subset of objects from a large data set has become a popular topic in the era of big data. Many publications address it for specific machine learning problems and specific domains. In the literature this topic is usually called 'sampling'. The well-known report [6] presents variants of sampling. Two principal types of random sampling are domain independent: case-based sampling and group-based sampling.

The typical problem of case-based sampling of big data consists in finding the minimal number of objects needed to estimate the number of case-similar objects in the whole data set. Algorithms of case-based sampling calculate the proximity between objects and a given case (pattern), estimate the distribution of case-similar objects inside the whole data collection, and apply statistical formulas that take into account the number of objects in the subset and in the whole object set. A survey of case-based sampling techniques is presented in [2].

Group-based sampling is the most popular type of sampling. The typical problem of group-based sampling of big data consists in finding the minimal number of objects whose structure is close (similar) to the structure of the whole data set. One of the first algorithms of group-based sampling is described in [3]. There the number of group centers is fixed, and only these centers are considered as the representative objects. The search algorithm determines the location of these centers taking into account the density of the object distribution in a given parameter space. The same approach was used in our work [5]. There the whole object set was grouped with the MajorClust method [7]. Unlike the previous case, this method determines the number of clusters automatically, which makes it one of the most effective methods for cluster analysis. Each cluster is represented by 3 objects (semantic descriptors) reflecting its relation to objects inside and outside the cluster. These descriptors from all of the clusters are considered as the representative subset. Finally, we mention a recently published paper [9]. Here, the k-means algorithm forms k clusters from n objects and finds the m objects nearest to each center. Besides, the algorithm randomly selects d×k other objects. The values k, m and d are assigned by a user, where m ≪ n and d ≪ n. These (m + d)×k objects form the so-called T-representative subset if the quality of clustering this subset exceeds a given threshold T; otherwise m and d are increased.

The main disadvantage of group-based sampling, unlike case-based sampling, is the necessity to calculate pairwise distances, which costs time and memory when working with big data. Indeed, for N = 10³ objects we have ~10⁶/2 calculations, for N = 10⁶ objects we already have ~10¹²/2 calculations, etc.

The sampling presented in this paper aims to select the minimal number of objects whose parameter distributions are close (similar) to those of the whole data set. With such a sample it would be easy to find given cases and easy to reveal the data structure. This problem reduces to the well-known problem of testing the hypothesis of uniformity (homogeneity) of two data sets [4]. The corresponding algorithm forms an increasing subset of objects until the hypothesis of uniformity is accepted for all of the object parameters. However, such a simple algorithm takes into account neither the independence of the data nor the ambiguity related to the acceptance of a hypothesis. The proposed method tries to avoid these disadvantages.

1.2 Problem Setting

According to the goal formulated above, we need to form a representative subset of objects under the following conditions:

  (1) The subset should reflect the parameter distributions of the whole data set;

  (2) The subset should not depend on the objects' location in the data set;

  (3) The subset should reduce the ambiguity related to the acceptance of statistical hypotheses;

  (4) The subset should have minimal volume.

The solution of the problem consists in testing several statistical hypotheses covering requirements (1), (2) and (3). A step-by-step procedure with an increasing number of objects allows us to satisfy requirement (4). The proposed method is tested on two data sets related to mobile communication companies in Russia and to the intercity autobus communication in Peru.

The rest of the paper is organized as follows. Section 2 describes the complex hypotheses. Section 3 presents the steps of the algorithm. Section 4 demonstrates the results of the experiments. Section 5 concludes the paper.

2 Testing Complex Hypotheses

2.1 Testing Hypothesis in Hard and Soft Modes

The procedure for testing the similarity of two sets of random values is well known in statistics, and we only recall it here [4].

Suppose we have two sets of objects, each described by one parameter. If both sets come from a normal distribution with unknown mean and dispersion, then we form the so-called z-statistic and compare it with the critical value z_crit(α), where α is the accepted level of the statistical type I error. If the law of distribution is unknown, then one should use a non-parametric test. In this paper we apply the Kolmogorov-Smirnov test, where the tabulated α-dependent critical value is used.
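For illustration, the following is a minimal sketch (not the authors' code) of the two one-parameter tests just described, using SciPy; the function names and the significance level shown are ours.

```python
# A minimal sketch of the one-parameter tests from Sect. 2.1: a two-sample
# z-test for normally distributed data and the two-sample Kolmogorov-Smirnov
# test when the distribution law is unknown.
import numpy as np
from scipy import stats


def z_test_similar(x, y, alpha=0.05):
    """Accept similarity if |z| does not exceed the critical value z_crit(alpha)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    z = (x.mean() - y.mean()) / np.sqrt(x.var(ddof=1) / len(x) + y.var(ddof=1) / len(y))
    z_crit = stats.norm.ppf(1.0 - alpha / 2.0)   # two-sided critical value
    return abs(z) <= z_crit


def ks_test_similar(x, y, alpha=0.05):
    """Accept similarity if the two-sample KS test does not reject at level alpha."""
    return stats.ks_2samp(x, y).pvalue >= alpha


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a = rng.normal(0.0, 1.0, 500)
    b = rng.normal(0.1, 1.0, 500)
    print(z_test_similar(a, b), ks_test_similar(a, b))
```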

Suppose now we have two sets of objects described by n parameters each. We consider two modes: hard mode and soft mode. The hard mode considers two multi-dimensional sets as similar if all n parameters of the 1-st set have the same distribution as the corresponding parameters of the 2-nd set. The similarity of each parameter can be tested using the z-statistic or the Kolmogorov-Smirnov test mentioned above. The soft mode considers two multi-dimensional sets as similar if at least one parameter of the 1-st set has the same distribution as the corresponding parameter of the 2-nd set.
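The hard and soft modes can be sketched as follows (again a sketch, not the authors' implementation): each parameter is tested with the two-sample Kolmogorov-Smirnov test, and the per-parameter decisions are combined with "all" or "any".

```python
# A minimal sketch of the hard and soft modes for an n-parameter comparison.
import numpy as np
from scipy import stats


def similar_parameters(X, Y, alpha=0.05):
    """Return a boolean per parameter (column): True if similarity is not rejected."""
    X, Y = np.asarray(X, float), np.asarray(Y, float)
    return np.array([stats.ks_2samp(X[:, j], Y[:, j]).pvalue >= alpha
                     for j in range(X.shape[1])])


def similar_hard(X, Y, alpha=0.05):
    """Hard mode: all parameters must have the same distribution."""
    return similar_parameters(X, Y, alpha).all()


def similar_soft(X, Y, alpha=0.05):
    """Soft mode: at least one parameter must have the same distribution."""
    return similar_parameters(X, Y, alpha).any()


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    A = rng.normal(size=(300, 4))
    B = rng.normal(size=(200, 4))
    print(similar_hard(A, B), similar_soft(A, B))
```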

2.2 Null-Hypothesis and Artificial Contrary Hypothesis

Classical statistics says that the acceptance of a null hypothesis means only the absence of a significant contradiction between the given hypothesis and the existing data; it does not mean that the hypothesis is correct [4].

In our practice such a situation occurred for the first time almost 15 years ago, when objects were classified using a statistical measure of proximity between objects and given classes [1]. This measure was based on testing the null-hypothesis, and it turned out that objects often belonged to several classes simultaneously.

In our case, the experiments show the following: with a limited volume of data, any hypothesis concerning the data distribution, even the most unlikely one, can be accepted. Then, as the volume of data increases, the discrepancies between the data and the hypothesis become more and more apparent, and finally the hypothesis may be rejected. Therefore, at the initial stage of selection, with a very small number of objects, there is a high probability of accepting any hypothesis, which leads to a statistical type II error. To reduce this effect we propose to use a contrary artificial hypothesis: when this hypothesis is accepted, the acceptance of the basic null-hypothesis should be reevaluated.

The proposed artificial hypothesis concerns the uniform distribution of parameters in the subset under consideration. Such a choice can be supported by the observation that this distribution rarely occurs in practically useful applications and is excluded from consideration in algorithms with object selection. It is necessary to emphasize that the uniform distribution may not be the best fit for our case, but at least it serves as a filter for type II errors of the null-hypothesis. To test the hypothesis about the uniform distribution we use the Kolmogorov-Smirnov test.
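As an illustration, here is a minimal sketch of the artificial-hypothesis check for one parameter. Fitting the uniform law to the observed range of the normalized data and using the one-sample Kolmogorov-Smirnov test are our assumptions; the paper only states that the Kolmogorov-Smirnov test is applied.

```python
# A minimal sketch of testing the artificial hypothesis: concordance of one
# parameter with a uniform distribution over its observed range (assumption).
import numpy as np
from scipy import stats


def uniform_accepted(x, beta=0.01):
    """Accept the uniform hypothesis if the one-sample KS test does not reject at level beta."""
    x = np.asarray(x, float)
    lo, hi = x.min(), x.max()
    return stats.kstest(x, "uniform", args=(lo, hi - lo)).pvalue >= beta


if __name__ == "__main__":
    rng = np.random.default_rng(2)
    print(uniform_accepted(rng.random(200)),            # uniform data: likely accepted
          uniform_accepted(rng.normal(0.5, 0.1, 200)))  # normal data: likely rejected
```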

3 Method

3.1 Steps of Algorithm

We build the algorithm according to the problem setting described in Sect. 1.2. The algorithm includes two stages: stage 1 (preprocessing) and stage 2 (processing). Beforehand, we fix the total number of objects N, the initial number of representative objects M, the step m for increasing the number of representative objects, and the number of parameters to be considered k. We also fix the probability of the type I error for the main hypotheses α and the probability of the type I error for the artificial hypothesis β. In each step we use 3 sets of objects: the increasing representative subset, an additional random subset of the same volume, and the original object set.

Stage 1: preprocessing

  (1) Calculation of statistics for the whole data collection, such as the mean, dispersion or empirical distribution function used in the formulae from Sect. 2.1.

  (2) The data collection is randomly shuffled to avoid any influence of the initial arrangement of objects.

Stage 2: processing

  (1) Selection of the initial representative subset of M objects. It can be the first M objects of the whole object set.

  (2) Testing the complex hypothesis of similarity between the representative subset and the whole object set (hypothesis-1). Here the formulae from Sect. 2 are used. The null-hypothesis is accepted in hard mode.

  (3) Testing the hypothesis of similarity between the representative subset and the additional random subset (hypothesis-2). We use here the formulae from Sect. 2. The volumes of both subsets must be equal. The null-hypothesis is accepted in hard mode.

  (4) Testing the artificial hypothesis. For this, the concordance between the representative subset and the uniform distribution is checked. We use here the formulae from Sect. 2. The volumes of the subset and the theoretical sample must be equal. The null-hypothesis is accepted in soft mode.

  (5) If the hypotheses about similarity between the representative subset, the additional subset and the whole data set are accepted and the artificial hypothesis is rejected, then the process stops: the current representative subset can be considered as the final solution. Otherwise a new portion of objects is added to the current representative subset, M = M + m, and the algorithm returns to step (2). A sketch of this loop is given below.
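The following sketch (ours, not the authors' code) assembles Stages 1 and 2 under the assumptions stated above: two-sample Kolmogorov-Smirnov tests for hypothesis-1 and hypothesis-2 in hard mode, and a one-sample Kolmogorov-Smirnov test against a uniform law fitted to the observed range for the artificial hypothesis in soft mode (the paper itself compares against a theoretical sample of equal volume).

```python
# A minimal sketch of the selection algorithm from Sect. 3.1.
import numpy as np
from scipy import stats


def ks_similar(X, Y, alpha):              # per-parameter two-sample KS decisions
    return np.array([stats.ks_2samp(X[:, j], Y[:, j]).pvalue >= alpha
                     for j in range(X.shape[1])])


def uniform_ok(X, beta):                   # per-parameter one-sample KS vs. uniform
    out = []
    for j in range(X.shape[1]):
        x = X[:, j]
        out.append(stats.kstest(x, "uniform", args=(x.min(), x.max() - x.min())).pvalue >= beta)
    return np.array(out)


def select_representative(data, M=5, m=5, alpha=0.05, beta=0.01, seed=0):
    rng = np.random.default_rng(seed)
    data = np.asarray(data, float)
    N = len(data)
    data = data[rng.permutation(N)]                         # Stage 1: random shuffle
    while M <= N:
        subset = data[:M]                                   # current representative subset
        extra = data[rng.choice(N, size=M, replace=False)]  # additional random subset of equal volume
        h1 = ks_similar(subset, data, alpha).all()          # hypothesis-1, hard mode
        h2 = ks_similar(subset, extra, alpha).all()         # hypothesis-2, hard mode
        art = uniform_ok(subset, beta).any()                # artificial hypothesis, soft mode
        if h1 and h2 and not art:
            return subset                                   # step (5): stop condition met
        M += m                                              # otherwise add m more objects
    return data
```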

3.2 Evaluation of Results

We use the approach proposed in [9]. This approach can be presented in the following simplified form. Let D_n and D_ν be the whole data set and a sample from this data set containing n and ν objects respectively, where ν < n. Let P, Q(P, D_ν) and T be the problem, the quality of the problem solution with D_ν, and the quality threshold, respectively.

Definition 1.

The value ν is the T-Critical Sampling Size if (a) for any k ≥ ν the quality Q(P, D_k) ≥ T, and (b) there exists k < ν such that Q(P, D_k) < T.

Definition 2.

The value Q_s = ν/n is the Sample Quality of D_n, where ν is the Critical Sampling Size of D_n.

So, to determine the T-Critical Sampling Size for a given problem P in practice, one needs to fix T and repeat several experiments with different subsets D_k for different k. Then one selects the cases with Q ≥ T and takes the maximal value of k among them. We used this method in our experiments.
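A minimal sketch of Definition 1 restricted to a tested grid of sizes is given below: the critical size is taken as the smallest tested k from which the quality stays at or above T. The sizes and quality values in the example are illustrative only.

```python
# A minimal sketch of finding the T-Critical Sampling Size on a grid of
# tested sizes k with measured qualities Q(P, D_k).
def critical_sampling_size(sizes, qualities, T):
    pairs = sorted(zip(sizes, qualities))
    nu = None
    for k, q in pairs:
        if q >= T:
            if nu is None:
                nu = k            # candidate critical size
        else:
            nu = None             # quality dropped below T, restart the search
    return nu


# Example: with T = 0.8 the critical size is 40, and the sample quality is 40 / n.
print(critical_sampling_size([10, 20, 40, 80], [0.5, 0.7, 0.85, 0.92], 0.8))
```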

4 Experiments

4.1 Selecting Companies of Mobile Communication (Russia)

The Russian mobile communication market includes about 600 companies (data of 2015). The problem consists in selecting a representative subset from the list of companies in order to approximately evaluate their attractiveness for investment. Such a problem was considered by master's students of the Russian Presidential Academy of National Economy and Public Administration. This research was supported by the Academy.

For the analysis the following 4 parameters of company activity were taken: Profitability of Production (PP), Return on Assets (RA), Return on Equity (RE), Coefficient of Financial Stability (FS). The normalized data of the first 3 companies are presented in Table 1.

Table 1. Characteristics of companies (first 3 of 600)

In the experiment we varied: (a) the probability of the type I error α for hypothesis-1 and hypothesis-2, and (b) the probability of the type I error β for the artificial hypothesis. The initial subset included 5 objects. The step was also equal to 5 objects. We completed 10 experiments for each combination (α, β) and fixed the maximal number of selected objects for these combinations. Thereby we revealed the so-called (α, β)-critical sampling sizes in terms of Sect. 3.2. The results are presented in Table 2. The sign '−' means that the artificial hypothesis was not used. The most significant combination here is (0.05, 0.01), for which the sample quality equals Q_s = 55/600 ≈ 0.1.

Table 2. The critical sampling sizes (companies).

We evaluated the quality of the sampling on the basis of clustering results. For this we considered the mentioned (0.05, 0.01)-combination and completed the following procedures:

  (1) Clustering the whole data set; the obtained clusters were considered as classes for the cluster F-measure (see below);

  (2) Clustering 55 random objects taken from the whole data set; this procedure was repeated 10 times;

  (3) Comparison of both results using the cluster F-measure introduced in [8]. This measure reflects the correspondence between the classes and clusters revealed in steps (1) and (2). Theoretically, the best and the worst values of the cluster F-measure are equal to 1 and 0, respectively.

In this experiment we used the k-means method with k = 3. The value of k was assigned by experts. The resulting F-measure proved to belong to the interval [0.45, 0.65].
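For illustration, a sketch of this evaluation follows, using one common formulation of the cluster F-measure (the exact definition in [8] may differ); the data array here is a random placeholder for the normalized company parameters.

```python
# A minimal sketch of steps (1)-(3): classes come from clustering the whole
# set, clusters from clustering a 55-object sample, both with k-means, k = 3.
import numpy as np
from sklearn.cluster import KMeans


def cluster_f_measure(classes, clusters):
    """Class-size-weighted best-match F-measure between reference classes and found clusters."""
    classes, clusters = np.asarray(classes), np.asarray(clusters)
    n, total = len(classes), 0.0
    for c in np.unique(classes):
        in_class = classes == c
        best = 0.0
        for k in np.unique(clusters):
            in_cluster = clusters == k
            common = np.sum(in_class & in_cluster)
            if common == 0:
                continue
            p, r = common / in_cluster.sum(), common / in_class.sum()
            best = max(best, 2 * p * r / (p + r))
        total += in_class.sum() / n * best
    return total


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    data = rng.random((600, 4))                           # placeholder for the normalized company data
    classes = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(data)
    idx = rng.choice(len(data), size=55, replace=False)   # one of the 10 random 55-object samples
    clusters = KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(data[idx])
    print(cluster_f_measure(classes[idx], clusters))
```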

4.2 Selecting Modes of Movements in Intercity Autobus Communication (Peru)

Daily there are approximately 6000 intercity bus trips in Peru, connecting about 100 destinations (data of 2015). All movements of the autobuses are registered by GPS in special databases; the frequency of registration is 1 record per minute. We introduce the notion of an 'object of movement': the distance covered by a vehicle in 10 min. Therefore, 10 sequential records form one object of movement. We collected about 10000 records reflecting 24 bus trips in Peru and collocated them sequentially, one bus trip after another. These records can be considered as 10000/10 = 1000 objects of movement. The problem consists in selecting a representative subset of this object set for data mining of the Peruvian intercity autobus communication. The work was completed in the Computer Science Department of the Catholic University of San Pablo in the framework of a project with a Peruvian company responsible for monitoring the intercity autobus communication.

For the analysis we used the following 3 parameters for each object: average velocity (AV, km/h), variation of velocity (VV), and deviation of direction (DD, degrees). The normalized data of the first 3 objects on the route Lima-Arequipa are presented in Table 3.

Table 3. Characteristics of objects of movement (first 3 of 1000).
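A minimal sketch of forming the objects of movement and their three parameters is given below; the field names and the exact definitions of VV (standard deviation of speed) and DD (range of heading) are our assumptions, since the paper does not specify them.

```python
# A minimal sketch of forming 'objects of movement' from GPS records:
# every 10 sequential records (10 min at 1 record/min) yield one object
# with average velocity (AV), variation of velocity (VV) and deviation
# of direction (DD). VV and DD definitions here are assumptions.
import numpy as np


def movement_objects(speed_kmh, heading_deg, window=10):
    """Group sequential records into movement objects and compute AV, VV, DD."""
    speed = np.asarray(speed_kmh, float)
    heading = np.asarray(heading_deg, float)
    n = len(speed) // window * window                  # drop an incomplete tail window
    speed = speed[:n].reshape(-1, window)
    heading = heading[:n].reshape(-1, window)
    av = speed.mean(axis=1)                            # average velocity, km/h
    vv = speed.std(axis=1)                             # variation of velocity (std, an assumption)
    dd = heading.max(axis=1) - heading.min(axis=1)     # deviation of direction (range, an assumption)
    return np.column_stack([av, vv, dd])
```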

We repeated the experiment with various combinations (α, β) for the objects of movement. The results are presented in Table 4. The most significant combination here is also (0.05, 0.01), for which the sample quality equals Q_s = 20/1000 = 0.02.

Table 4. The critical sampling sizes (objects of movement).

We evaluated the quality of the sampling on the basis of clustering results, as we did for the companies. In this procedure we used 20 and 1000 objects. The number of clusters was equal to 4, which corresponded to the best value of the Dunn index [8]. The resulting F-measure proved to belong to the interval [0.85, 1.00].

5 Conclusions

In this paper, we propose a method for selecting a representative subset of objects from a large object set, which provides:

  • forecast of the parameter distributions in the whole object set on the basis of their distributions in the subset;

  • independence of the selection from the arrangement of objects in the object set;

  • increased accuracy by reducing the ambiguity related to the acceptance of null-hypotheses.

These advantages are achieved by using complex statistical hypotheses. The method has linear complexity and can be used with distributions unknown in advance. The method was checked on two real data sets and demonstrated promising results.

In the future we intend to consider the following applications of the method: selection of representative subsets of universities and colleges in given countries, selection of representative subsets of students in given universities and colleges, and other applications concerning science and education.