Bag encoding strategies in multiple instance learning problems
Introduction
Classification, one of the most important classes of supervised learning problems, arises frequently in data mining tasks. In traditional classification, each object is represented by a feature vector, and the aim is to predict the label of the object given some training data. However, this representation is not flexible when the data has a certain structure. For example, in image classification, images are segmented into patches and, instead of a single feature vector, each image is represented by a set of feature vectors derived from the patches. This way, important invariances such as location and scale can be taken into account [4]. This change of object representation benefits a wide range of applications such as bioinformatics [15], document retrieval [3] and computer vision [18]. Such applications fit well into the Multiple Instance Learning (MIL) setting, where each object is referred to as a bag and each bag contains a certain number of instances.
Most MIL approaches solve the binary classification problem, where bags are labeled as either positive or negative [15], [27], [45], [47]. The first formally described MIL problem is a drug activity prediction problem, which considers molecules as bags and distinct shapes of the same molecule as instances [15]. A molecule is labeled positive if it includes at least one effective shape; otherwise it is labeled negative. In text categorization problems [50], each document can be considered a bag whose instances are the relevant passages inside it. In all these applications, the training bags are labeled, while the instances belonging to each bag do not necessarily have labels. The aim of MIL is to learn a classifier on the training bags to predict the label of a test bag.
Ambiguity about the instance labels has led researchers to focus on certain assumptions regarding them. The so-called standard MIL assumption states: if a bag is labeled positive, then at least one instance in that bag is positive; otherwise, all instances in the bag are negative [15]. Clearly, when a bag is known to be positive, the labels of its instances are not completely known. The standard MIL assumption is too restrictive for many real-life problems. For example, optimal combination therapy is used in cancer treatment to overcome drug resistance: an optimal combination of drugs is expected to circumvent drug resistance in individual patients. Since there is an enormous number of possible drug combinations, predicting the optimal combinatorial therapy can be modeled as a MIL problem where drugs are the instances and collections of drugs are the bags. A bag is positive if a subset of its instances forms an effective drug combination; otherwise the bag is negative. Optimal combination therapy seeks an effective combination of drugs rather than a single drug that supports the treatment. Instead of a single positive instance, this MIL problem searches for a combination of multiple instances corresponding to different drugs.
Criticizing the potential problems with the standard MIL assumption, MIL problems are categorized as presence-based, threshold-based and count-based [43]. Weidmann et al. [43] refer to a specific region of the feature space where positive instances are located as a concept. Presence-based MIL extends the standard MIL assumption to multiple concepts, whereas threshold-based MIL imposes a lower bound on the number of required instances from each concept. In addition to this lower bound, count-based MIL requires an upper bound on the number of instances from each concept. Extensions and variations of this categorization of generalized MIL problems are presented in [1], [2], [19]. Experiments on synthetic and real datasets following different assumptions indicate that bag-level classification is successful on datasets from different categories [1]. These approaches require each bag to be represented by a feature vector that summarizes the instance-level information. Since bag-level methods are competitive, this study focuses on bag classification by representing each bag with a single feature vector.
Earlier, many approaches in the computer vision literature used the well-known Bag-of-Features (BoF) or Bag-of-Words (BoW) representations to perform similar tasks. After clustering the patches (i.e., instances), an image (i.e., bag) is represented by the frequencies of the cluster assignments of its instances in the simple BoW setting [12]. These approaches implicitly transform instance-level distributional information into a bag-level summary [27], [42]. Recently, [10] approached the problem from a geometric view of the instance space, obtaining a bag-level summary from the similarities between instances. Motivated by the success of bag-level representations and their robustness to MIL assumptions, this study proposes bag encoding strategies for MIL problems. Fig. 2 presents a summary of the bag representation algorithms, each of which is discussed in detail in Section 4.
Most existing proposals for obtaining a bag-level summary require numerical features as input, since they involve transformations such as principal component analysis [42], density estimation [27] or distance calculations [10], [42]. However, a MIL dataset may contain non-numeric features. Categorical features require the introduction of dummy variables; moreover, standardization or normalization is required, but standardizing the dummy variables introduced for categorical features is not well defined. Hence, certain applications call for an approach that can handle each variable without modification. Considering this, our approach uses tree-based ensembles to partition the instance feature space: a tree learner trained on the raw data assigns each instance to a terminal node of the tree.
The use of trees for feature induction is a relatively new research direction, also referred to as hashing [40]. This method turns each node of the tree into a feature, and the resulting representation is easily modified by changing the tree parameters. Each level of the tree provides a different partition of the instance feature space, since the nodes imply simple splitting rules on the features. An instance traverses the tree according to these splitting rules (i.e., follows a path in the constructed tree). The path followed by an instance identifies the regions of the feature space it belongs to and provides hierarchical information about where the instance resides. Tree-based encoding of the feature space does not require scaling of the data, unlike approaches based on distance calculations or density estimation. Fig. 1 illustrates the path-encoding of an instance: next to each tree in Fig. 1, the path traversed by the instance is detected, and a binary vector encodes whether each node lies on that path. These paths can therefore be used to learn a BoW-type representation. Our approach inherits the properties of tree-based learners: it can handle numerical or categorical data, and tree-based encoding is scale invariant and robust to missing values. The same tree can also be used to encode the instances based only on terminal nodes. Moosmann et al. [29] earlier used a similar strategy for image classification problems with supervised randomized trees and reported successful results. However, encoding with terminal nodes alone can lose information: two instances residing at sibling terminal nodes share almost the entire path and are therefore closely similar to each other, yet their terminal-node encodings are entirely different.
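The path-encoding idea can be sketched with scikit-learn. The snippet below is a minimal illustration, not the paper's exact procedure: it uses a single `DecisionTreeRegressor` fitted on random targets as a stand-in for one unsupervised randomized tree, and `decision_path` to obtain the binary node-indicator vector of each instance.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))  # toy instances

# Stand-in for an unsupervised randomized tree: random split
# thresholds and random targets, so splits ignore any labels.
tree = DecisionTreeRegressor(max_depth=3, splitter="random", random_state=0)
tree.fit(X, rng.normal(size=200))

# decision_path returns a sparse binary matrix: entry (i, k) is 1
# iff node k lies on the root-to-leaf path of instance i.
paths = tree.decision_path(X).toarray()
print(paths.shape)  # (200, number of nodes in the tree)
```

Each row is the path-encoding of one instance; aggregating these rows over the instances of a bag yields a BoW-type bag-level vector.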
The first use of tree-based encoding to generate bag representations in the MIL setting is TLC [43]. However, TLC builds a supervised tree on all instances in the training data, assuming that instances share the label of their owner bags. This strong assumption on the instance labels, together with the greediness of a single tree, is problematic. To avoid these issues, we propose randomized tree ensembles to convert the MIL problem into a supervised learning problem. To the best of our knowledge, this is the first study exploring the use of multiple unsupervised trees together with path-encoding to solve MIL problems. As shown in the seminal work of Criminisi et al. [13], unsupervised randomized trees are a generalization of Gaussian Mixture Models (GMMs), where each leaf of a randomized clustering tree is considered a Gaussian component. Hence, our representation implicitly takes the density information into account, avoiding the parametrized optimization procedures common in generative learning models. The proposed approach scales well to large datasets and is embarrassingly parallel. Once the bags are encoded, any supervised learning algorithm can be trained on the new representation. Our experiments on a large database of MIL problems show that the performance of the proposed representations is competitive with state-of-the-art algorithms. Classifying bags instead of individual instances is reasonable when solving MIL problems on large datasets. We also present experimental results on the PASCAL Visual Object Classes (VOC) 2007 dataset [18] to verify the scalability of the proposed bag encoding algorithms.
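The terminal-node variant of this pipeline can be approximated with scikit-learn's `RandomTreesEmbedding`, which fits an ensemble of unsupervised randomized trees and one-hot encodes the leaf reached by each instance. The averaging step below is one plausible instance-to-bag aggregation chosen for illustration; the paper's own aggregation may differ.

```python
import numpy as np
from sklearn.ensemble import RandomTreesEmbedding

rng = np.random.default_rng(0)
# Three toy bags with different numbers of 5-dimensional instances.
bags = [rng.normal(size=(n, 5)) for n in (4, 7, 3)]

# Unsupervised randomized trees partition the instance feature space;
# transform() one-hot encodes the terminal node of every instance.
embed = RandomTreesEmbedding(n_estimators=10, max_depth=3, random_state=0)
embed.fit(np.vstack(bags))

# Terminal-node encoding of a bag: average the leaf indicators of its
# instances, giving one fixed-length feature vector per bag.
bag_vectors = np.array([
    embed.transform(bag).toarray().mean(axis=0) for bag in bags
])
print(bag_vectors.shape)  # (3, total number of terminal nodes)
```

A standard supervised classifier can then be trained directly on `bag_vectors`, which is the sense in which the MIL problem becomes an ordinary supervised learning problem.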
The remainder of this paper is organized as follows. Section 2 summarizes and compares the MIL methods in the literature. Section 3 formalizes the MIL problem and provides the necessary background. Bag representation schemes and the proposed solution algorithms are introduced in Section 4. Section 5 discusses the density modeling success of randomized trees. Section 6 describes the real-world datasets and presents the experimental results, followed by parametric and computational analyses. Finally, Section 7 draws the conclusions.
Related work
The earliest MIL algorithm [15] maximizes the number of positive instances residing in a single axis-parallel rectangle (APR) while minimizing the number of negative instances inside it. The Diverse Density (DD) algorithm was then proposed in [27], where positive instances are assumed to follow a Gaussian distribution. In DD and its variant EM-DD [47], gradient descent with multiple starts is employed to maximize the diverse density, which is the aggregation of closeness of instances to every
Background
Let $x_i$ be a $d$-dimensional feature vector of instance $i$. Then, $X = \{x_1, \ldots, x_n\}$ forms the set of instances. A bag $B_j$ is a set of $x_i$'s, where $n_j$ is the number of instances in $B_j$. Therefore, $\mathcal{B} = \{(B_j, l_j)\}_{j=1}^{m}$ is a training bag set containing the instances in the bags and a single discrete-valued feature, namely the label $l_j$ of each bag $B_j$. Let $g(B_j) \rightarrow l_j$ be the bag-based classifier that we are looking for. Another representation of bag $B_j$ is formed by a $k$-dimensional
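The notation above maps directly onto simple data structures: a bag is a variable-length collection of fixed-length instance vectors, and labels attach only at the bag level. A minimal sketch with invented toy dimensions:

```python
import numpy as np

rng = np.random.default_rng(1)

# A bag B_j is a set of d-dimensional instance vectors x_i; the bag
# size n_j may differ across bags. Labels l_j exist per bag only.
d = 4
bags = [rng.normal(size=(n_j, d)) for n_j in (3, 6, 2, 5)]
labels = np.array([1, 0, 1, 0])  # l_j for each B_j

n = sum(B.shape[0] for B in bags)  # total number of instances in X
print(n, len(bags))  # 16 instances across 4 bags
```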
Multiple instance learning with bag encoding
The proposed approach has two main stages: bag encoding and bag-level classification. The underlying technical stages of bag encoding are illustrated in Fig. 2, consisting of k-means-encoding, path-encoding and terminal node-encoding. Initially, all of our proposals for bag encoding simply partition the feature space of instances as shown in stage one of Fig. 2. In k-means-encoding, k-means algorithm is utilized to cluster the instances. In RT-based encoding, multiple randomized decision trees
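The k-means-encoding stage can be sketched as follows. This is a simplified illustration with invented toy data: instances from all bags are clustered, and each bag is summarized by the (normalized) histogram of its instances' cluster assignments, in the BoW spirit described in the introduction.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
bags = [rng.normal(size=(n, 5)) for n in (6, 9, 4)]  # toy bags

# Stage one: partition the instance feature space with k-means,
# fitted on the pooled instances of all training bags.
k = 8
km = KMeans(n_clusters=k, n_init=10, random_state=0)
km.fit(np.vstack(bags))

# Stage two: encode each bag by the normalized histogram of the
# cluster assignments of its instances.
bag_vectors = np.array([
    np.bincount(km.predict(bag), minlength=k) / bag.shape[0]
    for bag in bags
])
print(bag_vectors.shape)  # (3, 8): one k-dimensional vector per bag
```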
Comparison of miFV and RT-encoding in density estimation
In miFV [42], density of the instance feature space is modeled by a GMM. The parameters of Gaussian components are estimated to obtain the Fisher vector (FV) representation. Similarly, unsupervised RTs perform instance partitioning to represent the data, where each leaf implies a Gaussian component. We provide a discussion on parameter insensitivity and density modeling success of randomized trees on a toy example in this section. As mentioned in [13], density forests are generalizations of GMM
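The GMM side of this comparison can be illustrated with a toy example of the kind discussed in this section. The data and component count below are invented for illustration; the point is only that EM recovers one Gaussian component per mode, which is the density model that the leaves of unsupervised RTs approximate piecewise.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Toy 1-D instance data drawn from two well-separated modes.
X = np.concatenate([
    rng.normal(-3.0, 0.5, 300),
    rng.normal(3.0, 0.5, 300),
]).reshape(-1, 1)

# miFV-style density model: a GMM fitted by EM; each component plays
# the role that a leaf plays in a randomized clustering tree.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
means = np.sort(gmm.means_.ravel())
print(means)  # approximately [-3, 3]
```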
Experiments
We test our proposals on a wide range of MIL datasets from different categories to avoid application bias. Unfortunately, most MIL studies follow different experimentation strategies, which complicates comparisons. For instance, some studies do not report performance on certain datasets because of their computational requirements; others use different cross-validation settings or split the training data randomly and report test performance. Due to this fact, a comprehensive
Conclusions and future work
The use of bag-level representations for MIL has been shown to provide successful results in the literature. This paper proposes a robust framework that uses random trees to partition the feature space, together with either a terminal-node or a path-based representation. Our encoding implicitly learns a generalized Gaussian Mixture Model (GMM) on the instance feature space, and this information is transformed into a bag-level summary. The proposed representations provide very fast and competitive results
References (50)
- Single- vs. multiple-instance classification, Pattern Recognit., 2015
- Multiple instance classification: review, taxonomy and comparative study, Artif. Intell., 2013
- Robust multiple-instance learning ensembles using random subspace instance selection, Pattern Recognit., 2016
- Multiple instance learning with bag dissimilarities, Pattern Recognit., 2015
- Solving the multiple instance problem with axis-parallel rectangles, Artif. Intell., 1997
- Diversified dictionaries for multi-instance learning, Pattern Recognit., 2017
- Support vector machines for multiple-instance learning, Advances in Neural Information Processing Systems 15, 2003
- A bag-of-features framework to classify time series, IEEE Trans. Pattern Anal. Mach. Intell., 2013
- Random forests, Mach. Learn., 2001
- Classification and Regression Trees, 1984
- Acoustic classification of multiple simultaneous bird species: a multi-instance multi-label approach, J. Acoust. Soc. Am.
- MILES: multiple-instance learning via embedded instance selection, IEEE Trans. Pattern Anal. Mach. Intell.
- Dissimilarity-based ensembles for multiple instance learning, IEEE Trans. Neural Netw. Learn. Syst.
- Learning feature representations with k-means, Neural Networks: Tricks of the Trade, 2nd ed.
- Decision forests: a unified framework for classification, regression, density estimation, manifold learning and semi-supervised learning, Found. Trends Comput. Graph. Vision
- Statistical comparisons of classifiers over multiple data sets, J. Mach. Learn. Res.
- Multiple-instance learning with instance selection via dominant sets, Similarity-Based Pattern Recognition
- The PASCAL Visual Object Classes (VOC) challenge, Int. J. Comput. Vis.
- A review of multi-instance learning assumptions, Knowl. Eng. Rev.
- A comparison of alternative tests of significance for the problem of m rankings, Ann. Math. Stat.
- MILIS: multiple instance learning with instance selection, IEEE Trans. Pattern Anal. Mach. Intell.
- Variational Bayesian multiple instance learning with Gaussian processes, CVPR
- Empowering multiple instance histopathology cancer diagnosis by cell graphs, Medical Image Computing and Computer-Assisted Intervention (MICCAI), 2014