Elsevier

Knowledge-Based Systems

Volume 189, 15 February 2020, 105140

Active learning through label error statistical methods

https://doi.org/10.1016/j.knosys.2019.105140

Highlights

  • We define two label error statistics functions and build clustering-based practical statistical models to guide block splitting.

  • We propose a center-and-edge instance selection strategy to choose critical instances.

  • We design an algorithm called active learning through label error statistical methods (ALSE).

  • Results of significance tests verify the superiority of ALSE over state-of-the-art algorithms.

Abstract

Clustering-based active learning splits data into a number of blocks and queries the labels of the most critical instances. An active learner must decide how to choose these critical instances and how to split the blocks. In this paper, we present theoretical and practical statistical methods for analyzing the relationship between the label error and the neighbor radius, and design new split and selection strategies to handle these two issues. First, we define statistical functions for the label error based on a single instance and instance pairs. Second, we build practical statistical models, calculate empirical label errors, and guide the block splitting process. Third, using these practical models, we develop a center-and-edge instance selection strategy for choosing critical instances. Fourth, we design a new algorithm called active learning through label error statistical methods (ALSE). Learning experiments were performed with 20 datasets from various domains. The results of significance tests verify the effectiveness of ALSE and its superiority over state-of-the-art active learning algorithms.

Introduction

Active learning [1], [2] is a subfield of machine learning in which the algorithm interactively queries an oracle to obtain the desired data. The objective is to train an accurate prediction model at minimum cost by labeling the most informative instances [3]. As obtaining class labels is expensive and time-consuming, it is reasonable to select instances whose labels will shrink the version space as fast as possible [4], [5], [6]. The most popular approach is to query the most informative instances, as in query-by-committee [5], uncertainty sampling [6], and optimal experimental design [7]. The main weakness of these approaches is that they are unable to exploit the abundance of unlabeled data and are prone to sampling bias [8].
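As a concrete illustration of the informativeness criterion, uncertainty sampling can be sketched in a few lines. The least-confidence rule and the toy probabilities below are illustrative and not taken from any of the cited systems:

```python
import numpy as np

def uncertainty_sampling(proba, k):
    """Pick the k unlabeled instances whose top predicted class
    probability is lowest, i.e. where the model is least confident."""
    confidence = proba.max(axis=1)        # best-class probability per instance
    return np.argsort(confidence)[:k]     # least-confident instances first

# Toy posterior probabilities for four unlabeled instances, two classes.
proba = np.array([[0.90, 0.10],
                  [0.55, 0.45],   # most uncertain: probabilities nearly tied
                  [0.70, 0.30],
                  [0.60, 0.40]])
picked = uncertainty_sampling(proba, 2)   # indices of the two least-confident
```

Such a criterion considers each instance in isolation, which is exactly why it cannot exploit the structure of the unlabeled pool.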

Clustering-based active learning [8], [9], [10] explores the clustering structure of unlabeled data and designs appropriate clustering strategies. One approach is to pre-cluster the data and query the cluster centroids to form the initial training set [11]. This is a form of "warm start" active learning, instead of beginning with a random sample. Another approach is to iteratively cluster the most informative instances and query the most representative among them [9], [12]. During the active learning process, the clustering is adjusted using a coarse-to-fine strategy. For example, Dasgupta et al. [9] constructed a hierarchical clustering tree and adopted a pruning strategy to iteratively refine the clustering. However, the clusters may not actually correspond to the hidden class labels. Moreover, it is difficult to determine the block granularity that is suitable for active learning. To explore a more appropriate cluster structure that is as consistent as possible with the class labels, we should consider the following questions: (1) How can we determine the relationship between the cluster distribution and the label errors? (2) How can we determine the most appropriate diameter for the cluster block? and (3) How can we develop a suitable critical instance selection strategy?
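The "warm start" pre-clustering idea of [11] can be sketched as follows; the data, the cluster assignments, and the function name are hypothetical, and the clustering itself is assumed to have been computed already:

```python
import numpy as np

def centroid_queries(X, assignments):
    """For each cluster, return the index of the instance closest to the
    cluster mean -- these are the 'warm start' instances to query first."""
    queries = []
    for c in np.unique(assignments):
        idx = np.where(assignments == c)[0]
        center = X[idx].mean(axis=0)
        dists = np.linalg.norm(X[idx] - center, axis=1)
        queries.append(idx[np.argmin(dists)])
    return queries

# Two toy clusters of three points each along one axis.
X = np.array([[0.0, 0.0], [0.2, 0.0], [1.0, 0.0],
              [5.0, 5.0], [5.0, 5.2], [5.0, 6.0]])
assignments = np.array([0, 0, 0, 1, 1, 1])
picked = centroid_queries(X, assignments)   # one representative per cluster
```

Querying these representatives first gives the learner one labeled point per cluster before any informativeness-driven selection begins.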

In this paper, we present theoretical and practical statistical methods for analyzing the relationship between the label error and the neighbor radius, and design new split and selection strategies to handle these two issues. There are four main contributions of this study. First, we propose a theoretical method for analyzing the relationship between the statistical label error and the neighbor radius. The closer any two instances are, the less likely they are to have different labels. With this intrinsic feature of the data distribution, we define two new label error statistical functions based on single instances and instance pairs. These functions are used to quantitatively calculate label inconsistencies according to the neighborhood radius.
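A minimal empirical version of the instance-pair statistic can be sketched as follows, assuming Euclidean distance; the exact functional form defined in Section 3 may differ:

```python
import numpy as np

def pair_label_error(X, y, radius):
    """Fraction of instance pairs within `radius` of each other whose
    labels disagree -- an empirical stand-in for the instance-pair
    label error statistic e_p(lambda_p)."""
    n = len(X)
    close = disagree = 0
    for i in range(n):
        for j in range(i + 1, n):
            if np.linalg.norm(X[i] - X[j]) <= radius:
                close += 1
                disagree += int(y[i] != y[j])
    return disagree / close if close else 0.0

# Two tight same-label groups: small radii see no disagreement,
# large radii mix the groups and the error rises.
X = np.array([[0.0], [0.1], [0.9], [1.0]])
y = np.array([0, 0, 1, 1])
```

On this toy data the statistic is 0 at radius 0.2 and 2/3 at radius 1.0, matching the intuition that the closer two instances are, the less likely their labels differ.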

Second, we propose a practical statistical method for obtaining the empirical label error of a cluster block. Combined with any clustering algorithm, this method can guide the splitting of blocks. We use the clustering by fast search and find of density peaks (CFDP) algorithm to cluster the training datasets. For each cluster block, we use the single-instance and instance-pair statistical functions to calculate the label error. Finally, we use curve-fitting techniques to obtain the corresponding empirical label error curves. Some existing work addresses statistical active learning. For example, Gaussian statistical active learning [13], [14] explores the spatial connectivity of unlabeled data to build an instance subset, from which critical instances are selected. In contrast, we take advantage of clustering algorithms to form blocks and explore the label distribution in each block to guide the splitting process.
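Under the simplifying assumption that the empirical curve is fitted with a low-degree polynomial (the paper's actual fitting procedure and measurements may differ), the resulting splitting rule can be sketched as:

```python
import numpy as np

# Hypothetical (radius, empirical label error) measurements collected
# from cluster blocks; in the paper these come from CFDP clustering.
radii = np.array([0.1, 0.2, 0.3, 0.4, 0.5])
errors = np.array([0.01, 0.05, 0.11, 0.20, 0.30])

# Fit a low-degree polynomial as the empirical curve phi(lambda).
coeffs = np.polyfit(radii, errors, deg=2)
phi = np.poly1d(coeffs)

def should_split(block_diameter, epsilon):
    """Split a block when its predicted label error exceeds the threshold."""
    return phi(block_diameter) > epsilon
```

A wide block with a high predicted error is split, while a small block whose predicted error stays under the threshold is kept and probed for purity instead.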

Third, we propose a new center-and-edge instance selection strategy for choosing the critical instances. This strategy takes into account both the representativeness and the informativeness of each instance. Edge instances are used to reduce the uncertainty of the model, and central instances are used to represent the overall features of all unlabeled data. The strategy is based on the principle of maximum error, thereby helping to guarantee classification accuracy.
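A minimal sketch of such a strategy, assuming the center is the instance nearest the block mean and the edges are the instances farthest from it (the paper's exact definitions, based on CFDP density peaks, may differ):

```python
import numpy as np

def center_and_edge(X, idx, n_edges=2):
    """Within one cluster block (rows X[idx]), pick the instance nearest
    the block mean as the center and the n_edges farthest ones as edges."""
    block = X[idx]
    center_point = block.mean(axis=0)
    dists = np.linalg.norm(block - center_point, axis=1)
    center = idx[np.argmin(dists)]
    edges = idx[np.argsort(dists)[::-1][:n_edges]]
    return center, list(edges)

# One toy block: point 0 sits near the middle; points 3 and 4 are outliers.
X = np.array([[0.0, 0.0], [0.1, 0.0], [2.0, 0.0], [-2.0, 0.0], [0.0, 3.0]])
center, edges = center_and_edge(X, np.arange(5))
```

Querying the center tests whether the block's dominant label holds at its core; querying the edges tests whether it still holds near the block boundary.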

Fourth, we design a new algorithm called active learning through label error statistical methods (ALSE). Fig. 1 illustrates the ALSE process using a running example. Part 1 is the input, which contains two types of datasets: Iris (DB<1.2) and Sonar (DB>1.2). Part 2 comprises the theoretical and practical label error statistical methods. The theoretical methods present the single-instance label error statistical function es(λs) and the instance-pair statistical function ep(λp). The practical statistical models provide two empirical label error functions φ(λs) and φ(λp) obtained using statistical methods. Part 3 is an example of iterative querying, splitting, and prediction on the Iris dataset. Three cluster sub-blocks of different sizes are obtained by clustering, with diameters λ1, λ2, and λ3, respectively. For block 1, φ(λ1)<ε, so we select the representative instances 5, 23, and 41. Since l(5)=l(23)=l(41)=1, block 1 is pure and we predict the labels of all remaining instances. For block 2, φ(λ2)<ε, but the block is judged to be impure, so we split it. For block 3, φ(λ3)>ε, so we split the block directly. In this way, the ALSE algorithm iteratively queries, splits, and predicts until all instances are labeled. Part 4 is the output.
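The iterative loop of Part 3 can be sketched as follows; the block representation, the three-probe purity check, and the halving split are simplified stand-ins for the paper's CFDP-based machinery:

```python
def run_alse(blocks, phi, epsilon, query_label):
    """Toy rendition of the query/split/predict loop. A block is a
    (tuple of instance ids, diameter) pair; phi is the fitted empirical
    label error curve and epsilon the error threshold."""
    predicted = {}
    work = list(blocks)
    while work:
        ids, diameter = work.pop()
        # query a "center" and two "edge" instances of the block
        probes = {query_label(ids[0]), query_label(ids[len(ids) // 2]),
                  query_label(ids[-1])}
        if phi(diameter) < epsilon and len(probes) == 1:
            label = next(iter(probes))
            for i in ids:                     # block judged pure: predict all
                predicted[i] = label
        elif len(ids) > 1:
            mid = len(ids) // 2               # impure or error too high: split
            work.append((ids[:mid], diameter / 2))
            work.append((ids[mid:], diameter / 2))
        else:
            predicted[ids[0]] = query_label(ids[0])  # singleton: query directly
    return predicted

oracle = lambda i: 0 if i < 2 else 1          # hypothetical ground-truth labels
result = run_alse([((0, 1, 2, 3), 0.2)], lambda d: d, 0.5, oracle)
```

Here the initial block is judged impure (its probes disagree), is split into two halves, and each half is then predicted as pure, so every instance ends up labeled.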

Experiments are undertaken on 20 datasets with 150–1,025,010 instances, 2–16,833 attributes, and 2–31 classes. These datasets include continuous, discrete, and mixed attribute types. We compare the ALSE algorithm with popular classifiers and state-of-the-art active learning algorithms. The Friedman test and Nemenyi post-hoc test are used to verify the significance of the differences between ALSE and the other algorithms. The results show that ALSE outperforms all of these competing algorithms in terms of classification accuracy.
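The Friedman statistic used in these significance tests can be computed directly. The accuracy matrix below is hypothetical and much smaller than the paper's 20-dataset comparison:

```python
import numpy as np

def friedman_statistic(scores):
    """Friedman chi-square statistic for a (datasets x algorithms)
    accuracy matrix: rank the algorithms on each dataset, then test
    whether the rank sums differ. Assumes no ties; the paper follows
    this test with the Nemenyi post-hoc procedure."""
    scores = np.asarray(scores, dtype=float)
    n, k = scores.shape
    ranks = scores.argsort(axis=1).argsort(axis=1) + 1   # 1 = worst, k = best
    rank_sums = ranks.sum(axis=0)
    return 12.0 / (n * k * (k + 1)) * (rank_sums ** 2).sum() - 3 * n * (k + 1)

# Hypothetical accuracies (rows = datasets, columns = algorithms).
scores = [[0.95, 0.90, 0.89],
          [0.91, 0.85, 0.87],
          [0.88, 0.84, 0.83],
          [0.93, 0.89, 0.90],
          [0.90, 0.86, 0.85]]
stat = friedman_statistic(scores)
significant = stat > 5.991   # chi-square critical value, df = 2, alpha = 0.05
```

When the statistic exceeds the critical value, the null hypothesis that all algorithms perform equally is rejected, and a post-hoc test such as Nemenyi's identifies which pairs differ.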

The remainder of this paper is organized as follows. In Section 2, we briefly review active learning and cluster-based active learning. Section 3 presents the theoretical label error statistical methods. Section 4 illustrates the construction of practical label error statistical models. Section 5 introduces the pseudocode of the ALSE algorithm and analyzes its two key issues. Section 6 discusses the experimental results and Section 7 presents our conclusions.


Preliminaries

In this section, we describe several preliminary concepts, including active learning, clustering-based active learning, and the probabilistic Lipschitzness condition.

Theoretical label error statistical method

In this section, we define two new label error statistical functions. These functions are based on single instances and instance pairs, respectively. Using these statistical functions, we analyze the relationship between the statistical label error and the neighbor radius.

Practical label error statistical models

In this section, we present a statistical method for obtaining the cluster block empirical label error. The empirical label error is data- and algorithm-dependent. Therefore, we use a statistical method to obtain it from actual data.

Algorithm description

This section presents the ALSE algorithm, which uses CFDP clustering and the two empirical label error functions. First, we describe the ALSE algorithm. Second, we elaborate on two key sub-problems. Finally, we analyze the time complexity.

Experiments

We conducted experiments to analyze the effectiveness of the ALSE algorithm and to answer the following questions:

  • (1) Is the ALSE algorithm more accurate than other supervised classification algorithms, such as C4.5, Naïve Bayes, and Bagging?

  • (2) Is the ALSE algorithm more accurate than state-of-the-art active learning algorithms such as QBC, KQBC, QUIRE, and ALEC?

The computations were performed on a Windows 10 64-bit operating system with 8 GB RAM and an Intel(R) Core 2 Quad processor.

Conclusions and future work

Finding a cluster structure that is as consistent as possible with the real label distribution is a challenge for cluster-based active learning. In this paper, we have introduced reasonable assumptions and conducted theoretical and practical research. In terms of the theory, we defined two label error statistical functions; in practice, we obtained the empirical label error distribution. Using the empirical label error distribution, we obtained the appropriate cluster block diameter. Under the

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (61773208); the Sichuan Province Youth Science and Technology Innovation Team, China (2019JDTD0017); and the Ministry of Education Innovation Project, China (201801140013, 201801006094).

References (62)

  • C.-S. Zhang et al., Multi-imbalance: an open-source software for multi-class imbalance learning, Knowl.-Based Syst. (2019)

  • O. Reyes et al., Statistical comparisons of active learning strategies over multiple datasets, Knowl.-Based Syst. (2018)

  • D.A. Cohn et al., Active learning with statistical models, Comput. Sci. (1996)

  • B. Settles, Active learning literature survey (2009)

  • S. Dasgupta et al., A general agnostic active learning algorithm

  • S. Tong et al.

  • V. Fedorov, Optimal experimental design, Wiley Interdiscip. Rev. Comput. Stat. (2010)

  • S.-J. Huang et al., Active learning by querying informative and representative examples, IEEE Trans. Pattern Anal. Mach. Intell. (2014)

  • S. Dasgupta et al., Hierarchical sampling for active learning

  • J. Kang et al., Using cluster-based sampling to select initial training set for active learning in text classification

  • H.T. Nguyen et al., Active learning using pre-clustering

  • J. Zhou et al., Active learning of Gaussian processes with manifold-preserving graph reduction, Neural Comput. Appl. (2014)

  • H.S. Seung et al., Query by committee

  • S. Tong et al., Support vector machine active learning with applications to text classification, J. Mach. Learn. Res. (2002)

  • R. Gilad-Bachrach, A. Navot, N. Tishby, Kernel query by committee (KQBC)

  • K. Yu et al., Active learning via transductive experimental design

  • X.-Z. Wang et al., Discovering the relationship between generalization and uncertainty by incorporating complexity of classification, IEEE Trans. Syst. Man Cybern. (2018)

  • B. Du et al., Exploring representativeness and informativeness for active learning, IEEE Trans. Syst. Man Cybern. (2017)

  • X.-Y. Jia et al., Similarity-based attribute reduction in rough set theory: a clustering perspective, Int. J. Mach. Learn. Cybern. (2019)

  • R. Wang et al., Incorporating diversity and informativeness in multiple-instance active learning, IEEE Trans. Fuzzy Syst. (2017)

  • Z.-B. Xu et al., Incorporating diversity and density in active learning for relevance feedback