Active learning through label error statistical methods☆
Introduction
Active learning [1], [2] is a subfield of machine learning in which the algorithm interactively queries an oracle to obtain labels for selected data. The objective is to train an accurate prediction model at minimum cost by labeling the most informative instances [3]. Because obtaining class labels is expensive and time-consuming, it is reasonable to select instances whose labels will shrink the version space as quickly as possible [4], [5], [6]. The most popular approach is to query the most informative instances, using techniques such as query-by-committee [5], uncertainty sampling [6], and optimal experimental design [7]. The main weakness of these approaches is that they cannot exploit the abundance of unlabeled data and are prone to sampling bias [8].
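As a concrete illustration of informativeness-based querying (not the method proposed in this paper), a minimal least-confidence uncertainty sampling step can be sketched as follows; the pool, the probabilities, and the query size are all hypothetical:

```python
import numpy as np

def uncertainty_query(proba, k=1):
    """Least-confidence uncertainty sampling: pick the k pool instances
    whose most probable class has the lowest predicted probability.
    `proba` is an (n_pool, n_classes) array of model probabilities."""
    confidence = proba.max(axis=1)      # confidence in the top class
    return np.argsort(confidence)[:k]   # least confident first

# Toy pool: the second instance is the most uncertain (0.5 vs 0.5).
proba = np.array([[0.9, 0.1],
                  [0.5, 0.5],
                  [0.7, 0.3]])
print(uncertainty_query(proba, k=1))  # -> [1]
```

The queried indices would then be labeled by the oracle and added to the training set before the model is retrained.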
Clustering-based active learning [8], [9], [10] explores the clustering structure of unlabeled data and designs appropriate clustering strategies. One approach is to pre-cluster the data and query the cluster centroids to form the initial training set [11]. This is a form of “warm start” active learning, instead of beginning with a random sample. Another approach is to iteratively cluster the most informative instances and query the most representative among them [9], [12]. During the active learning process, the clustering is adjusted using a coarse-to-fine strategy. For example, Dasgupta et al. [9] constructed a hierarchical clustering tree and adopted a pruning strategy to iteratively refine the clustering. However, the clusters may not actually correspond to the hidden class labels. Moreover, it is difficult to determine the block granularity that is suitable for active learning. To explore a cluster structure that is as consistent as possible with the class labels, we should consider the following questions: (1) How can we determine the relationship between the cluster distribution and the label errors? (2) How can we determine the most appropriate diameter for the cluster block? and (3) How can we develop a suitable critical instance selection strategy?
In this paper, we present theoretical and practical statistical methods for analyzing the relationship between the label error and the neighbor radius, and design new split and selection strategies to address these questions. This study makes four main contributions. First, we propose a theoretical method for analyzing the relationship between the statistical label error and the neighbor radius. The closer any two instances are, the less likely they are to have different labels. Based on this intrinsic feature of the data distribution, we define two new label error statistical functions, one over single instances and one over instance pairs. These functions quantitatively measure label inconsistency as a function of the neighborhood radius.
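The instance-pair idea can be illustrated with a plausible empirical analogue: among all pairs of instances within a given radius of each other, count the fraction whose labels disagree. The exact statistical functions are defined later in the paper; the data here is a toy construction:

```python
import numpy as np

def pairwise_label_error(X, y, radius):
    """Among all ordered pairs of distinct instances within `radius` of
    each other, the fraction whose labels disagree -- a plausible
    empirical analogue of the instance-pair statistic."""
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1))
    mask = (dist <= radius) & ~np.eye(len(X), dtype=bool)
    if mask.sum() == 0:
        return 0.0
    disagree = (y[:, None] != y[None, :]) & mask
    return disagree.sum() / mask.sum()

# Two well-separated clusters: a small radius keeps pairs label-pure,
# while a large radius mixes the classes and the statistic rises.
X = np.array([[0.0], [0.1], [0.2], [10.0], [10.1], [10.2]])
y = np.array([0, 0, 0, 1, 1, 1])
print(pairwise_label_error(X, y, 0.5))   # -> 0.0
print(pairwise_label_error(X, y, 20.0))  # -> 0.6
```

This matches the stated intuition: the smaller the radius, the less likely two neighbors are to carry different labels.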
Second, we propose a practical statistical method for obtaining the empirical label error of a cluster block. Combined with any clustering algorithm, this method can guide the splitting of blocks. We use the clustering by fast search and find of density peaks (CFDP) algorithm to cluster the training datasets. For each cluster block, we apply the single-instance and instance-pair statistical functions to calculate the label error. Finally, we use curve-fitting techniques to obtain the corresponding empirical label error curves. Some existing work has addressed statistical active learning. For example, Gaussian statistical active learning [13], [14] exploits the spatial connectivity of unlabeled data to build an instance subset from which critical instances are selected. In contrast, we take advantage of clustering algorithms to form blocks and explore the label distribution in each block to guide the splitting process.
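The curve-fitting step can be sketched as follows. The diameter/error measurements below are hypothetical (the real points would come from CFDP cluster blocks), a quadratic fit stands in for the paper's fitted form, and the 0.05 purity threshold is an arbitrary illustration:

```python
import numpy as np

# Hypothetical measurements: block diameter r vs. the label error
# observed inside blocks of that size.
r = np.array([0.2, 0.5, 1.0, 2.0, 4.0, 8.0])
e = np.array([0.01, 0.03, 0.08, 0.18, 0.33, 0.45])

# Fit a smooth empirical label error curve (quadratic, for illustration).
coef = np.polyfit(r, e, deg=2)
curve = lambda d: np.polyval(coef, d)

# The largest diameter whose predicted error stays under a purity
# threshold is a candidate block size for the split strategy.
grid = np.linspace(r.min(), r.max(), 200)
ok = grid[curve(grid) <= 0.05]
print(round(float(ok.max()), 2) if ok.size else "no pure diameter")
```

Once fitted, the curve lets the algorithm judge whether a block of a given diameter is likely pure without querying every instance inside it.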
Third, we propose a new center-and-edge instance selection strategy for choosing critical instances. This strategy accounts for both the representativeness and the informativeness of each instance: edge instances reduce the uncertainty of the model, while central instances capture the overall characteristics of the unlabeled data. The strategy is based on the principle of maximum error, which helps guarantee classification accuracy.
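One simple reading of the center-and-edge idea, sketched under the assumption that "center" means nearest the block mean and "edge" means farthest from it (the paper's actual criterion may differ), is:

```python
import numpy as np

def center_and_edge(X_block, n_edge=2):
    """Pick critical instances from one cluster block: the instance
    nearest the block mean (representative) plus the n_edge instances
    farthest from it (informative)."""
    center = X_block.mean(axis=0)
    dist = np.linalg.norm(X_block - center, axis=1)
    central = int(np.argmin(dist))               # most representative
    edges = np.argsort(dist)[::-1][:n_edge].tolist()  # most informative
    return central, edges

# A toy block: three points near the origin plus two outliers.
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [-4.0, 4.0], [0.2, 0.1]])
c, e = center_and_edge(X)
print(c, sorted(e))  # -> 4 [2, 3]
```

Querying both kinds of instance lets a single batch cover the block's typical region and its boundary at the same time.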
Fourth, we design a new algorithm called active learning through label error statistical methods (ALSE). Fig. 1 illustrates the ALSE process using a running example. Part 1 is the input, which contains two example datasets: Iris and Sonar. Part 2 comprises the theoretical and practical label error statistical methods: the theoretical methods provide the single-instance and instance-pair label error statistical functions, and the practical models provide the two empirical label error functions obtained with the statistical method. Part 3 is an example of iterative querying, splitting, and prediction on the Iris dataset. Clustering yields three cluster sub-blocks of different diameters. For block 1, we select the representative instances 5, 23, and 41; because their queried labels are consistent, block 1 is judged pure, and we predict all remaining instances in it. For block 2, the queried labels are inconsistent, so the block is judged impure and must be split. For block 3, the diameter is too large, so we split the block directly. In this way, the ALSE algorithm iteratively queries, splits, and predicts until every instance obtains a label. Part 4 is the output.
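The query-split-predict loop can be caricatured in a few lines. This is only a sketch: `y_oracle` stands in for the human annotator, a median cut replaces the CFDP-based split, and the three-probe purity check is a stand-in for the paper's empirical label error criterion:

```python
import numpy as np

def alse_sketch(X, y_oracle, d_threshold):
    """Simplified query-split-predict loop: for each block, query a few
    instances; if the block is narrow and the queried labels agree,
    predict the rest; otherwise split the block and recurse."""
    pred = np.full(len(X), -1)
    blocks = [np.arange(len(X))]
    queries = 0
    while blocks:
        idx = blocks.pop()
        if len(idx) <= 3:                    # tiny block: query everything
            pred[idx] = y_oracle[idx]
            queries += len(idx)
            continue
        diam = np.ptp(X[idx], axis=0).max()  # block diameter (coord range)
        probe = idx[[0, len(idx) // 2, -1]]  # three spread-out queries
        labels = y_oracle[probe]
        queries += 3
        if diam <= d_threshold and len(set(labels)) == 1:
            pred[idx] = labels[0]            # judged pure: predict the rest
        else:                                # impure or too wide: split
            order = idx[np.argsort(X[idx, 0])]
            blocks += [order[:len(order) // 2], order[len(order) // 2:]]
    return pred, queries

# Two 1-D clusters; the loop recovers all labels with few queries.
rng = np.random.default_rng(1)
X = np.sort(np.r_[rng.normal(0, 0.3, 20), rng.normal(5, 0.3, 20)])[:, None]
y = (X[:, 0] > 2.5).astype(int)
pred, q = alse_sketch(X, y, d_threshold=2.0)
print((pred == y).all(), q)
```

Even in this crude form, the loop labels the whole pool while querying only a fraction of it, which is the economy ALSE aims for.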
Experiments are undertaken on 20 datasets with 150–1,025,010 instances, 2–16,833 attributes, and 2–31 classes. These datasets include continuous, discrete, and mixed attribute types. We compare the ALSE algorithm with popular classifiers and state-of-the-art active learning algorithms. The Friedman test and Nemenyi post-hoc test are used to verify the significance of the differences between ALSE and the other algorithms. The results show that ALSE outperforms all of these competing algorithms in terms of classification accuracy.
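The significance-testing setup can be reproduced with standard tools. The accuracy table below is hypothetical (the real study uses 20 datasets and more competitors); the Nemenyi post-hoc step is commonly run afterwards with, e.g., scikit-posthocs' `posthoc_nemenyi_friedman`:

```python
import numpy as np
from scipy.stats import friedmanchisquare

# Hypothetical accuracies of three algorithms (columns) on five
# datasets (rows).
acc = np.array([[0.91, 0.88, 0.84],
                [0.87, 0.85, 0.80],
                [0.95, 0.93, 0.90],
                [0.78, 0.76, 0.70],
                [0.89, 0.84, 0.83]])

# The Friedman test checks whether the algorithms' rank orderings
# differ significantly across datasets.
stat, p = friedmanchisquare(acc[:, 0], acc[:, 1], acc[:, 2])
print(p < 0.05)  # -> True: the ranking is consistent across datasets
```

A significant Friedman result licenses the pairwise Nemenyi comparison used to separate ALSE from the individual competitors.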
The remainder of this paper is organized as follows. In Section 2, we briefly review active learning and cluster-based active learning. Section 3 presents the theoretical label error statistical methods. Section 4 illustrates the construction of practical label error statistical models. Section 5 introduces the pseudocode of the ALSE algorithm and analyzes its two key issues. Section 6 discusses the experimental results and Section 7 presents our conclusions.
Section snippets
Preliminaries
In this section, we describe several preliminary concepts, including active learning, clustering-based active learning, and the probabilistic Lipschitzness condition.
Theoretical label error statistical method
In this section, we define two new label error statistical functions. These functions are based on single instances and instance pairs, respectively. Using these statistical functions, we analyze the relationship between the statistical label error and the neighbor radius.
Practical label error statistical models
In this section, we present a statistical method for obtaining the cluster block empirical label error. The empirical label error is data- and algorithm-dependent. Therefore, we use a statistical method to obtain it from actual data.
Algorithm description
This section presents the ALSE algorithm, which uses CFDP clustering and the two empirical label error functions. First, we describe the ALSE algorithm. Second, we elaborate on two key sub-problems. Finally, we analyze the time complexity.
Experiments
We conducted experiments to analyze the effectiveness of the ALSE algorithm and to answer the following questions:
- (1) Is the ALSE algorithm more accurate than other supervised classification algorithms, such as C4.5, Naïve Bayes, and Bagging?
- (2) Is the ALSE algorithm more accurate than state-of-the-art active learning algorithms such as QBC, KQBC, QUIRE, and ALEC?
The computations were performed on a Windows 10 64-bit operating system with 8 GB RAM and an Intel(R) Core 2 Quad processor.
Conclusions and future work
Finding a cluster structure that is as consistent as possible with the real label distribution is a challenge for cluster-based active learning. In this paper, we have introduced reasonable assumptions and conducted theoretical and practical research. In terms of the theory, we defined two label error statistical functions; in practice, we obtained the empirical label error distribution. Using the empirical label error distribution, we obtained the appropriate cluster block diameter. Under the
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This work was supported by the National Natural Science Foundation of China (61773208); the Sichuan Province Youth Science and Technology Innovation Team, China (2019JDTD0017); and the Ministry of Education Innovation Project, China (201801140013, 201801006094).
References (62)
- On-line active learning: a new paradigm to improve practical useability of data stream modeling methods. Inform. Sci. (2017)
- Active learning with extremely sparse labeled examples. Neurocomputing (2010)
- Active learning through density clustering. Expert Syst. Appl. (2017)
- Gaussian process versus margin sampling active learning. Neurocomputing (2015)
- A combination of active learning and self-learning for named entity recognition on twitter using conditional random fields. Knowl.-Based Syst. (2017)
- Relational granulation method based on quotient space theory for maximum flow problem. Inform. Sci. (2020)
- Effective active learning strategy for multi-label learning. Neurocomputing (2018)
- Hybrid active learning for reducing the annotation effort of operators in classification systems. Pattern Recognit. (2012)
- Cost-sensitive active learning through statistical methods. Inform. Sci. (2019)
- Cost-sensitive sequential three-way decision modeling using a deep neural network. Internat. J. Approx. Reason. (2017)
- Multi-imbalance: an open-source software for multi-class imbalance learning. Knowl.-Based Syst.
- Statistical comparisons of active learning strategies over multiple datasets. Knowl.-Based Syst.
- Active learning with statistical models. Comput. Sci.
- Active learning literature survey
- A general agnostic active learning algorithm
- Optimal experimental design. Wiley Interdiscip. Rev. Comput. Stat.
- Active learning by querying informative and representative examples. IEEE Trans. Pattern Anal. Mach. Intell.
- Hierarchical sampling for active learning
- Using cluster-based sampling to select initial training set for active learning in text classification
- Active learning using pre-clustering
- Active learning of Gaussian processes with manifold-preserving graph reduction. Neural Comput. Appl.
- Query by committee
- Support vector machine active learning with applications to text classification. J. Mach. Learn. Res.
- Active learning via transductive experimental design
- Discovering the relationship between generalization and uncertainty by incorporating complexity of classification. IEEE Trans. Syst. Man Cybern.
- Exploring representativeness and informativeness for active learning. IEEE Trans. Syst. Man Cybern.
- Similarity-based attribute reduction in rough set theory: a clustering perspective. Int. J. Mach. Learn. Cybern.
- Incorporating diversity and informativeness in multiple-instance active learning. IEEE Trans. Fuzzy Syst.
- Incorporating diversity and density in active learning for relevance feedback
☆ No author associated with this paper has disclosed any potential or pertinent conflicts that may be perceived to have an impending conflict with this work. For full disclosure statements refer to https://doi.org/10.1016/j.knosys.2019.105140.