Active learning through label error statistical methods☆
Introduction
Active learning [1], [2] is a subfield of machine learning in which the algorithm interactively queries an oracle to obtain labels for selected data. The objective is to train an accurate prediction model at minimum cost by labeling the most informative instances [3]. Because obtaining class labels is expensive and time-consuming, it is reasonable to select instances whose labels will shrink the version space as quickly as possible [4], [5], [6]. The most popular approach is to query the most informative instances, using techniques such as query-by-committee [5], uncertainty sampling [6], and optimal experimental design [7]. The main weakness of these approaches is that they cannot exploit the abundance of unlabeled data and are prone to sampling bias [8].
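As a concrete illustration of informativeness-based querying (not the method proposed in this paper), a minimal least-confidence uncertainty sampling step can be sketched as follows; the pool, the probabilities, and the query size are all hypothetical:

```python
import numpy as np

def uncertainty_query(proba, k=1):
    """Least-confidence uncertainty sampling: pick the k pool instances
    whose most probable class has the lowest predicted probability.
    `proba` is an (n_pool, n_classes) array of model probabilities."""
    confidence = proba.max(axis=1)      # confidence in the top class
    return np.argsort(confidence)[:k]   # least confident first

# Toy pool: the second instance is the most uncertain (0.5 vs 0.5).
proba = np.array([[0.9, 0.1],
                  [0.5, 0.5],
                  [0.7, 0.3]])
print(uncertainty_query(proba, k=1))  # -> [1]
```

The queried indices would then be labeled by the oracle and added to the training set before the model is retrained.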
Clustering-based active learning [8], [9], [10] explores the clustering structure of unlabeled data and designs appropriate clustering strategies. One approach is to pre-cluster the data and query the cluster centroids to form the initial training set [11]. This is a form of “warm start” active learning, instead of beginning with a random sample. Another approach is to iteratively cluster the most informative instances and query the most representative among them [9], [12]. During the active learning process, the clustering is adjusted using a coarse-to-fine strategy. For example, Dasgupta et al. [9] constructed a hierarchical clustering tree and adopted a pruning strategy to iteratively refine the clustering. However, the clusters may not actually correspond to the hidden class labels. Moreover, it is difficult to determine the block granularity that is suitable for active learning. To explore a cluster structure that is as consistent as possible with the class labels, we should consider the following questions: (1) How can we determine the relationship between the cluster distribution and the label errors? (2) How can we determine the most appropriate diameter for the cluster block? and (3) How can we develop a suitable critical instance selection strategy?
In this paper, we present theoretical and practical statistical methods for analyzing the relationship between the label error and the neighbor radius, and design new split and selection strategies to address these questions. This study makes four main contributions. First, we propose a theoretical method for analyzing the relationship between the statistical label error and the neighbor radius. The closer any two instances are, the less likely they are to have different labels. Based on this intrinsic feature of the data distribution, we define two new label error statistical functions, one over single instances and one over instance pairs. These functions quantitatively measure label inconsistency as a function of the neighborhood radius.
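The instance-pair idea can be illustrated with a plausible empirical analogue: among all pairs of instances within a given radius of each other, count the fraction whose labels disagree. The exact statistical functions are defined later in the paper; the data here is a toy construction:

```python
import numpy as np

def pairwise_label_error(X, y, radius):
    """Among all ordered pairs of distinct instances within `radius` of
    each other, the fraction whose labels disagree -- a plausible
    empirical analogue of the instance-pair statistic."""
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1))
    mask = (dist <= radius) & ~np.eye(len(X), dtype=bool)
    if mask.sum() == 0:
        return 0.0
    disagree = (y[:, None] != y[None, :]) & mask
    return disagree.sum() / mask.sum()

# Two well-separated clusters: a small radius keeps pairs label-pure,
# while a large radius mixes the classes and the statistic rises.
X = np.array([[0.0], [0.1], [0.2], [10.0], [10.1], [10.2]])
y = np.array([0, 0, 0, 1, 1, 1])
print(pairwise_label_error(X, y, 0.5))   # -> 0.0
print(pairwise_label_error(X, y, 20.0))  # -> 0.6
```

This matches the stated intuition: the smaller the radius, the less likely two neighbors are to carry different labels.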
Second, we propose a practical statistical method for obtaining the empirical label error of a cluster block. Combined with any clustering algorithm, this method can guide the splitting of blocks. We use the clustering by fast search and find of density peaks (CFDP) algorithm to cluster the training datasets. For each cluster block, we apply the single-instance and instance-pair statistical functions to calculate the label error. Finally, we use curve-fitting techniques to obtain the corresponding empirical label error curves. Some existing work has addressed statistical active learning. For example, Gaussian statistical active learning [13], [14] exploits the spatial connectivity of unlabeled data to build an instance subset from which critical instances are selected. In contrast, we take advantage of clustering algorithms to form blocks and explore the label distribution in each block to guide the splitting process.
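The curve-fitting step can be sketched as follows. The diameter/error measurements below are hypothetical (the real points would come from CFDP cluster blocks), a quadratic fit stands in for the paper's fitted form, and the 0.05 purity threshold is an arbitrary illustration:

```python
import numpy as np

# Hypothetical measurements: block diameter r vs. the label error
# observed inside blocks of that size.
r = np.array([0.2, 0.5, 1.0, 2.0, 4.0, 8.0])
e = np.array([0.01, 0.03, 0.08, 0.18, 0.33, 0.45])

# Fit a smooth empirical label error curve (quadratic, for illustration).
coef = np.polyfit(r, e, deg=2)
curve = lambda d: np.polyval(coef, d)

# The largest diameter whose predicted error stays under a purity
# threshold is a candidate block size for the split strategy.
grid = np.linspace(r.min(), r.max(), 200)
ok = grid[curve(grid) <= 0.05]
print(round(float(ok.max()), 2) if ok.size else "no pure diameter")
```

Once fitted, the curve lets the algorithm judge whether a block of a given diameter is likely pure without querying every instance inside it.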
Third, we propose a new center-and-edge instance selection strategy for choosing critical instances. This strategy accounts for both the representativeness and the informativeness of each instance: edge instances reduce the uncertainty of the model, while central instances capture the overall characteristics of the unlabeled data. The strategy is based on the principle of maximum error, which helps guarantee classification accuracy.
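One simple reading of the center-and-edge idea, sketched under the assumption that "center" means nearest the block mean and "edge" means farthest from it (the paper's actual criterion may differ), is:

```python
import numpy as np

def center_and_edge(X_block, n_edge=2):
    """Pick critical instances from one cluster block: the instance
    nearest the block mean (representative) plus the n_edge instances
    farthest from it (informative)."""
    center = X_block.mean(axis=0)
    dist = np.linalg.norm(X_block - center, axis=1)
    central = int(np.argmin(dist))               # most representative
    edges = np.argsort(dist)[::-1][:n_edge].tolist()  # most informative
    return central, edges

# A toy block: three points near the origin plus two outliers.
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [-4.0, 4.0], [0.2, 0.1]])
c, e = center_and_edge(X)
print(c, sorted(e))  # -> 4 [2, 3]
```

Querying both kinds of instance lets a single batch cover the block's typical region and its boundary at the same time.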
Fourth, we design a new algorithm called active learning through label error statistical methods (ALSE). Fig. 1 illustrates the ALSE process using a running example. Part 1 is the input, which contains two example datasets: Iris and Sonar. Part 2 comprises the theoretical and practical label error statistical methods: the theoretical methods provide the single-instance and instance-pair label error statistical functions, and the practical models provide the two empirical label error functions obtained with the statistical method. Part 3 is an example of iterative querying, splitting, and prediction on the Iris dataset. Clustering yields three cluster sub-blocks of different diameters. For block 1, we select the representative instances 5, 23, and 41; because their queried labels are consistent, block 1 is judged pure, and we predict all remaining instances in it. For block 2, the queried labels are inconsistent, so the block is judged impure and must be split. For block 3, the diameter is too large, so we split the block directly. In this way, the ALSE algorithm iteratively queries, splits, and predicts until every instance obtains a label. Part 4 is the output.
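The query-split-predict loop can be caricatured in a few lines. This is only a sketch: `y_oracle` stands in for the human annotator, a median cut replaces the CFDP-based split, and the three-probe purity check is a stand-in for the paper's empirical label error criterion:

```python
import numpy as np

def alse_sketch(X, y_oracle, d_threshold):
    """Simplified query-split-predict loop: for each block, query a few
    instances; if the block is narrow and the queried labels agree,
    predict the rest; otherwise split the block and recurse."""
    pred = np.full(len(X), -1)
    blocks = [np.arange(len(X))]
    queries = 0
    while blocks:
        idx = blocks.pop()
        if len(idx) <= 3:                    # tiny block: query everything
            pred[idx] = y_oracle[idx]
            queries += len(idx)
            continue
        diam = np.ptp(X[idx], axis=0).max()  # block diameter (coord range)
        probe = idx[[0, len(idx) // 2, -1]]  # three spread-out queries
        labels = y_oracle[probe]
        queries += 3
        if diam <= d_threshold and len(set(labels)) == 1:
            pred[idx] = labels[0]            # judged pure: predict the rest
        else:                                # impure or too wide: split
            order = idx[np.argsort(X[idx, 0])]
            blocks += [order[:len(order) // 2], order[len(order) // 2:]]
    return pred, queries

# Two 1-D clusters; the loop recovers all labels with few queries.
rng = np.random.default_rng(1)
X = np.sort(np.r_[rng.normal(0, 0.3, 20), rng.normal(5, 0.3, 20)])[:, None]
y = (X[:, 0] > 2.5).astype(int)
pred, q = alse_sketch(X, y, d_threshold=2.0)
print((pred == y).all(), q)
```

Even in this crude form, the loop labels the whole pool while querying only a fraction of it, which is the economy ALSE aims for.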
Experiments are undertaken on 20 datasets with 150–1,025,010 instances, 2–16,833 attributes, and 2–31 classes. These datasets include continuous, discrete, and mixed attribute types. We compare the ALSE algorithm with popular classifiers and state-of-the-art active learning algorithms. The Friedman test and Nemenyi post-hoc test are used to verify the significance of the differences between ALSE and the other algorithms. The results show that ALSE outperforms all of these competing algorithms in terms of classification accuracy.
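The significance-testing setup can be reproduced with standard tools. The accuracy table below is hypothetical (the real study uses 20 datasets and more competitors); the Nemenyi post-hoc step is commonly run afterwards with, e.g., scikit-posthocs' `posthoc_nemenyi_friedman`:

```python
import numpy as np
from scipy.stats import friedmanchisquare

# Hypothetical accuracies of three algorithms (columns) on five
# datasets (rows).
acc = np.array([[0.91, 0.88, 0.84],
                [0.87, 0.85, 0.80],
                [0.95, 0.93, 0.90],
                [0.78, 0.76, 0.70],
                [0.89, 0.84, 0.83]])

# The Friedman test checks whether the algorithms' rank orderings
# differ significantly across datasets.
stat, p = friedmanchisquare(acc[:, 0], acc[:, 1], acc[:, 2])
print(p < 0.05)  # -> True: the ranking is consistent across datasets
```

A significant Friedman result licenses the pairwise Nemenyi comparison used to separate ALSE from the individual competitors.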
The remainder of this paper is organized as follows. In Section 2, we briefly review active learning and cluster-based active learning. Section 3 presents the theoretical label error statistical methods. Section 4 illustrates the construction of practical label error statistical models. Section 5 introduces the pseudocode of the ALSE algorithm and analyzes its two key issues. Section 6 discusses the experimental results and Section 7 presents our conclusions.
Section snippets
Preliminaries
In this section, we describe several preliminary concepts, including active learning, clustering-based active learning, and the probabilistic Lipschitzness condition.
Theoretical label error statistical method
In this section, we define two new label error statistical functions. These functions are based on single instances and instance pairs, respectively. Using these statistical functions, we analyze the relationship between the statistical label error and the neighbor radius.
Practical label error statistical models
In this section, we present a statistical method for obtaining the cluster block empirical label error. The empirical label error is data- and algorithm-dependent. Therefore, we use a statistical method to obtain it from actual data.
Algorithm description
This section presents the ALSE algorithm, which uses CFDP clustering and the two empirical label error functions. First, we describe the ALSE algorithm. Second, we elaborate on two key sub-problems. Finally, we analyze the time complexity.
Experiments
We conducted experiments to analyze the effectiveness of the ALSE algorithm and to answer the following questions:
- (1) Is the ALSE algorithm more accurate than other supervised classification algorithms, such as C4.5, Naïve Bayes, and Bagging?
- (2) Is the ALSE algorithm more accurate than state-of-the-art active learning algorithms such as QBC, KQBC, QUIRE, and ALEC?
The computations were performed on a Windows 10 64-bit operating system with 8 GB RAM and an Intel(R) Core 2 Quad processor.
Conclusions and future work
Finding a cluster structure that is as consistent as possible with the real label distribution is a challenge for cluster-based active learning. In this paper, we have introduced reasonable assumptions and conducted theoretical and practical research. In terms of the theory, we defined two label error statistical functions; in practice, we obtained the empirical label error distribution. Using the empirical label error distribution, we obtained the appropriate cluster block diameter. Under the
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This work was supported by the National Natural Science Foundation of China (61773208); the Sichuan Province Youth Science and Technology Innovation Team, China (2019JDTD0017); and the Ministry of Education Innovation Project, China (201801140013, 201801006094).
References (62)
- On-line active learning: a new paradigm to improve practical useability of data stream modeling methods. Inform. Sci. (2017)
- Active learning with extremely sparse labeled examples. Neurocomputing (2010)
- Active learning through density clustering. Expert Syst. Appl. (2017)
- Gaussian process versus margin sampling active learning. Neurocomputing (2015)
- A combination of active learning and self-learning for named entity recognition on twitter using conditional random fields. Knowl.-Based Syst. (2017)
- Relational granulation method based on quotient space theory for maximum flow problem. Inform. Sci. (2020)
- Effective active learning strategy for multi-label learning. Neurocomputing (2018)
- Hybrid active learning for reducing the annotation effort of operators in classification systems. Pattern Recognit. (2012)
- Cost-sensitive active learning through statistical methods. Inform. Sci. (2019)
- Cost-sensitive sequential three-way decision modeling using a deep neural network. Internat. J. Approx. Reason. (2017)
- Multi-imbalance: an open-source software for multi-class imbalance learning. Knowl.-Based Syst.
- Statistical comparisons of active learning strategies over multiple datasets. Knowl.-Based Syst.
- Active learning with statistical models. Comput. Sci.
- Active learning literature survey
- A general agnostic active learning algorithm
- Optimal experimental design. Wiley Interdiscip. Rev. Comput. Stat.
- Active learning by querying informative and representative examples. IEEE Trans. Pattern Anal. Mach. Intell.
- Hierarchical sampling for active learning
- Using cluster-based sampling to select initial training set for active learning in text classification
- Active learning using pre-clustering
- Active learning of Gaussian processes with manifold-preserving graph reduction. Neural Comput. Appl.
- Query by committee
- Support vector machine active learning with applications to text classification. J. Mach. Learn. Res.
- Active learning via transductive experimental design
- Discovering the relationship between generalization and uncertainty by incorporating complexity of classification. IEEE Trans. Syst. Man Cybern.
- Exploring representativeness and informativeness for active learning. IEEE Trans. Syst. Man Cybern.
- Similarity-based attribute reduction in rough set theory: a clustering perspective. Int. J. Mach. Learn. Cybern.
- Incorporating diversity and informativeness in multiple-instance active learning. IEEE Trans. Fuzzy Syst.
- Incorporating diversity and density in active learning for relevance feedback
☆ No author associated with this paper has disclosed any potential or pertinent conflicts that may be perceived to have an impending conflict with this work. For full disclosure statements refer to https://doi.org/10.1016/j.knosys.2019.105140.