
Information Sciences

Volume 467, October 2018, Pages 559-578

Bag encoding strategies in multiple instance learning problems

https://doi.org/10.1016/j.ins.2018.08.020

Highlights

  • We propose novel bag encoding strategies that are robust to MIL assumptions.

  • We compare our proposals with recent MIL algorithms on 71 benchmark MIL datasets.

  • Our approach for MIL provides fast and effective bag classification, as confirmed by the experiments on the PASCAL VOC 2007 dataset.

  • The proposed algorithms are insensitive to both method and problem parameters.

  • To promote reproducible research, our implementations and those of the competitors, the datasets, the results and the information regarding the cross-validation indices are made available on our supporting page. To the best of our knowledge, this study considers the largest database in the evaluation of such approaches.

Abstract

Multiple instance learning (MIL) deals with supervised learning tasks where the aim is to learn from a set of labeled bags, each containing a certain number of instances. In the MIL setting, instance label information is unavailable, which makes it difficult to apply regular supervised learning. To resolve this problem, researchers have devised methods focusing on certain assumptions regarding the instance labels. However, it is not a trivial task to determine which assumption holds for a new type of MIL problem. A bag-level representation based on instance characteristics does not require assumptions about the instance labels and has been shown to be successful in MIL tasks. These approaches mainly encode bag vectors using bag-of-features representations. In this paper, we propose tree-based encoding strategies that partition the instance feature space and represent the bags using the frequency of instances residing in each partition. Our encoding implicitly learns a generalized Gaussian Mixture Model (GMM) on the instance feature space and transforms this information into a bag-level summary. We show that bag representation using tree ensembles provides fast, accurate and robust representations. Our experiments on a large database of MIL problems show that tree-based encoding is highly scalable and that its performance is competitive with state-of-the-art algorithms.

Introduction

Classification, one of the most important classes of supervised learning problems, arises widely in data mining tasks. In traditional classification tasks, each object is represented by a feature vector, and the aim is to predict the label of the object given some training data. However, this representation is not flexible when the data has a certain structure. For example, in image classification, images are segmented into patches and, instead of a single feature vector, each image is represented by a set of feature vectors derived from the patches. This way, important information regarding certain invariances such as location and scale can be taken into account [4]. This change of object representation provides benefits for a wide range of applications such as bioinformatics [15], document retrieval [3] and computer vision [18]. This type of application fits well into the Multiple Instance Learning (MIL) setting, where each object is referred to as a bag and each bag contains a certain number of instances.

Most MIL approaches solve the binary classification problem, where bags are labeled as either positive or negative [15], [27], [45], [47]. The first formally described MIL problem is a drug activity prediction problem, which considers molecules as bags and distinct shapes of the same molecule as instances [15]. A molecule is positively labeled if it includes at least one effective shape; otherwise, it is negatively labeled. In text categorization problems [50], each document can be considered as a bag whose instances are the relevant passages inside it. In all these applications, training bags are labeled, but the instances belonging to each bag do not necessarily have labels. The aim of MIL is to learn a classifier on the training bags to predict the label of a test bag.

Ambiguity about the instance labels has made researchers focus on certain assumptions regarding them. The so-called standard MIL assumption is given as: if a bag is labeled positive, then at least one instance in that bag is labeled positive; otherwise, the labels of all instances in negative bags are negative [15]. Clearly, when a bag is known to be positively labeled, the labels of its instances are not completely known. The standard MIL assumption is too restrictive to handle real-life problems. For example, optimal combination therapy is used in cancer treatment to overcome drug resistance: an optimal combination of drugs is considered capable of circumventing drug resistance in individual patients. Since there exists an enormous number of possible drug combinations, the prediction of optimal combinatorial therapy can be modeled as a MIL problem where drugs are the instances and collections of drugs are the bags. A bag is positive if a subset of its instances forms an effective drug combination; otherwise, the bag is negative. Optimal combination therapy discovers an effective combination of drugs rather than identifying a single type of drug that supports the treatment. Instead of a single positive instance, this MIL problem searches for a combination of multiple instances corresponding to various drugs.

Pointing out the potential problems with the standard MIL assumption, Weidmann et al. [43] categorize MIL problems as presence-based, threshold-based and count-based. A specific region of the feature space where the positive instances are located is referred to as a concept [43]. Presence-based MIL applies the standard MIL assumption to multiple concepts, whereas threshold-based MIL enforces a lower bound on the number of necessary instances of each concept. Finally, in addition to this lower bound, count-based MIL also requires an upper bound on the number of necessary instances from each concept. Extensions and variations of this categorization of generalized MIL problems are presented in [1], [2], [19]. Based on experiments on synthetic and real datasets following different assumptions, bag-level classification has been shown to be successful on datasets from different categories [1]. These approaches require each bag to be represented with a feature vector that summarizes the instance-level information. Since bag-level methods are competitive, we focus in this study on bag classification by representing each bag with a single feature vector.

Earlier, many approaches from the computer vision literature utilized the well-known Bag-of-Features (BoF) or Bag-of-Words (BoW) representations to perform similar tasks. After clustering the patches (i.e., instances), the image (i.e., bag) is represented by the frequency of the cluster assignments of the corresponding instances in the simple BoW setting [12]. These approaches implicitly transform the instance-level probabilistic distribution information into a bag-level summary [27], [42]. Recently, [10] approached the problem by considering the geometric view of the instance space and obtained a bag-level summary using the similarities between the instances. Motivated by the success of bag-level representations and their robustness to MIL assumptions, this study proposes bag encoding strategies for MIL problems. Fig. 2 presents a summary of the bag representation algorithms, each of which is discussed in detail in Section 4.
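As a concrete illustration of this simple BoW setting, the following minimal sketch (our own construction, not the implementation of [12]; the codebook size and toy data are arbitrary) learns a k-means codebook on the pooled instances and encodes each bag as the normalized histogram of its instances' cluster assignments.

```python
# Minimal BoW bag-encoding sketch (illustrative, not the authors' code).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Toy MIL data: 30 bags, each a (n_j x d) array of instances with d = 4.
bags = [rng.normal(size=(rng.integers(5, 15), 4)) for _ in range(30)]

k = 8                                          # codebook size (assumed)
codebook = KMeans(n_clusters=k, n_init=10, random_state=0)
codebook.fit(np.vstack(bags))                  # cluster the pooled instances

def bow_encode(bag):
    """Bag vector = normalized frequency of cluster assignments."""
    assignments = codebook.predict(bag)
    return np.bincount(assignments, minlength=k) / len(bag)

Z = np.vstack([bow_encode(B) for B in bags])   # one k-dimensional row per bag
```

Any standard supervised learner can then be trained on the rows of Z together with the bag labels.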

Most of the existing proposals for obtaining a bag-level summary require numerical features as input, since they involve transformations such as principal component analysis [42], density estimation [27] or distance calculations [10], [42]. However, a MIL dataset can have non-numeric features. When there are categorical features, dummy variables must be introduced. Moreover, standardization/normalization is required, but the standardization of dummy variables introduced to represent categorical variables is not well-defined. Hence, an approach that can treat each variable without any modification may be required for certain applications. Considering this fact, our approach utilizes tree-based ensembles to partition the instance feature space. A tree learner trained on the raw data assigns each instance to a terminal node of the tree.

The use of trees for feature induction is a relatively new research direction, also referred to as hashing [40]. This method transforms each node of the tree into a feature. Moreover, the new representation is easily modified by changing the tree parameters. Each level of the tree provides a different partition of the instance feature space, as the levels imply simple splitting rules on the features. An instance traverses the tree based on the splitting rules (i.e., it follows a path in the constructed tree). The path followed by an instance indicates the regions of the feature space the instance belongs to, providing hierarchical information about where in the feature space the instance resides. Tree-based encoding of the feature space does not require scaling of the data, as opposed to approaches that require distance calculations or density estimation. Fig. 1 illustrates the path-encoding of an instance. Next to each tree in Fig. 1, the path traversed by an instance is identified, and a binary vector is encoded indicating whether each node is on that path or not. Thus, these paths can be used to learn a BoW-type representation. Our approach inherits the properties of tree-based learners: it can handle numerical or categorical data, and tree-based encoding is scale invariant and robust to missing values. The same tree can also be used to encode the instances based only on terminal nodes. Earlier, Moosmann et al. [29] used a similar strategy for image classification problems using supervised randomized trees and showed that it provides successful results. However, this strategy has the potential to lose information: two instances residing at sibling terminal nodes share almost the same path and are therefore closely similar to each other, yet a terminal-node-only encoding treats them as entirely distinct.
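A minimal sketch of path-encoding for a single tree, assuming scikit-learn as the toolbox (this mirrors the idea in Fig. 1 rather than our exact implementation): an unsupervised randomized tree is grown by fitting an extremely randomized regressor on uniform random targets, and decision_path yields the binary node-indicator vector for each instance.

```python
# Single-tree path-encoding sketch (assumed setup, not the paper's code).
import numpy as np
from sklearn.tree import ExtraTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))               # instances with d = 5 features

# Random targets make the splits unsupervised, as in density forests.
tree = ExtraTreeRegressor(max_depth=4, random_state=0)
tree.fit(X, rng.uniform(size=X.shape[0]))

# decision_path returns a binary (n_instances x n_nodes) indicator matrix:
# entry (i, v) is 1 iff node v lies on the root-to-leaf path of instance i.
path_code = tree.decision_path(X).toarray()
print(path_code.shape)                      # (200, tree.tree_.node_count)
```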

The first implementation of tree-based encoding to generate bag representations in the MIL setting was proposed in TLC [43]. However, TLC builds a supervised tree on all instances in the training data, assuming that the instances share the same label as their owner bags. This strong assumption regarding the instance labels, together with the greediness of a single tree, is problematic. To avoid these potential problems, we propose randomized tree ensembles to convert the MIL problem into a supervised learning problem. To the best of our knowledge, this is the first study exploring the use of multiple unsupervised trees together with path-encoding to solve MIL problems. As shown in the seminal work by Criminisi et al. [13], unsupervised randomized trees are a generalization of Gaussian Mixture Models (GMM), where each leaf of a randomized clustering tree is considered a Gaussian component. Hence, our representation implicitly takes the density information into account. This way, the parametrized optimization processes that are common in generative learning models are avoided in tree-based encoding. The proposed approach scales well with large datasets and is embarrassingly parallel. Once the bags are encoded, a supervised learning algorithm can be trained on the new representation; a sketch of this end-to-end idea is given below. Our experiments on a large database of MIL problems show that the performance of the proposed representations is competitive with state-of-the-art algorithms. Classifying bags instead of individual instances is reasonable when solving MIL problems on large datasets. We also present experimental results on the PASCAL Visual Object Classes (VOC) 2007 dataset [18] to verify the scalability of the proposed bag encoding algorithms.
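The sketch below illustrates this pipeline under our reading of the approach (the estimator, the aggregation by node frequencies and the final classifier are illustrative assumptions): unsupervised randomized trees are built on the pooled instances, each bag is summarized by the frequencies of its instances over the tree nodes, and a standard supervised learner is trained on the resulting bag vectors.

```python
# End-to-end bag-encoding pipeline sketch (illustrative assumptions).
import numpy as np
from sklearn.ensemble import RandomTreesEmbedding
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
bags = [rng.normal(size=(rng.integers(3, 10), 5)) for _ in range(40)]
labels = np.repeat([0, 1], 20)               # toy bag labels

# Unsupervised randomized trees built on the pooled instances.
forest = RandomTreesEmbedding(n_estimators=10, max_depth=3, random_state=0)
forest.fit(np.vstack(bags))

def encode_bags(bag_list, forest):
    """Bag vector = column sums of the instance-level node indicators."""
    codes = []
    for B in bag_list:                       # B: (n_j x d) array of instances
        indicator, _ = forest.decision_path(B)
        codes.append(np.asarray(indicator.sum(axis=0)).ravel() / len(B))
    return np.vstack(codes)

Z = encode_bags(bags, forest)                # one row per bag
clf = LinearSVC().fit(Z, labels)             # any supervised learner works here
```

Because each tree is grown independently, both the forest construction and the per-bag encoding parallelize trivially, which is what makes the approach embarrassingly parallel.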

The remainder of this paper is organized as follows: Section 2 summarizes and compares the MIL methods in the literature. Section 3 formalizes the MIL problem and provides the necessary background. Bag representation schemes and the proposed solution algorithms are introduced in Section 4. The density modeling success of randomized trees is discussed in Section 5. Descriptions of the real-world datasets and the results of the experiments, followed by parametric and computational analyses, are given in Section 6. Finally, Section 7 draws the conclusions.

Section snippets

Related work

The earliest MIL algorithm [15] maximizes the number of positive instances residing in a single axis-parallel rectangle (APR) and minimizes the number of negative instances inside the APR. Later, the Diverse Density (DD) algorithm was proposed in [27], where positive instances are assumed to follow a Gaussian distribution. In DD and its variant EM-DD [47], gradient descent with multiple starts is employed to maximize the diverse density, which is the aggregation of the closeness of instances to every…

Background

Let $x_i$ be a $d$-dimensional feature vector of instance $i$. Then, $X=\{x_i : i=1,\dots,n\}$ forms the set of instances. A bag $B_j$ is a set of $x_i$'s, where $n_j$ is the number of instances in $B_j$. Therefore, $\chi=\{(B_j,l_j) : j=1,\dots,m\}$ is a training bag set containing the instances in the bags and a single discrete-valued feature, namely a label $l_j$ for each bag $B_j$. Let $g(B_j) \rightarrow l_j$ be the bag-based single classifier function we are looking for. Another representation of bag $B_j$ is formed by a $k$-dimensional…

Multiple instance learning with bag encoding

The proposed approach has two main stages: bag encoding and bag-level classification. The underlying technical stages of bag encoding are illustrated in Fig. 2, consisting of k-means-encoding, path-encoding and terminal node-encoding. Initially, all of our proposals for bag encoding simply partition the feature space of the instances, as shown in stage one of Fig. 2. In k-means-encoding, the k-means algorithm is utilized to cluster the instances. In RT-based encoding, multiple randomized decision trees…
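To make the terminal node-encoding variant concrete, here is a minimal sketch under stated assumptions (RandomTreesEmbedding stands in for the randomized-tree ensemble; the normalization choice is ours): each instance is one-hot encoded by its leaf memberships across the trees, and summing over a bag gives the per-leaf frequency of that bag's instances.

```python
# Terminal node-encoding sketch (illustrative construction).
import numpy as np
from sklearn.ensemble import RandomTreesEmbedding

rng = np.random.default_rng(1)
bags = [rng.normal(size=(rng.integers(4, 12), 5)) for _ in range(25)]

# Unsupervised randomized trees, fit on the pooled instances.
rt = RandomTreesEmbedding(n_estimators=20, max_depth=3, random_state=0)
rt.fit(np.vstack(bags))

def terminal_node_encode(bag):
    """Bag vector = per-leaf frequency of the bag's instances, all trees."""
    leaf_onehot = rt.transform(bag)          # sparse (n_j x total_leaves)
    return np.asarray(leaf_onehot.sum(axis=0)).ravel() / bag.shape[0]

Z = np.vstack([terminal_node_encode(B) for B in bags])
```

Path-encoding differs only in that the indicator covers all nodes on the root-to-leaf path rather than the leaf alone, so sibling leaves yield nearly identical codes.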

Comparison of miFV and RT-encoding in density estimation

In miFV [42], the density of the instance feature space is modeled by a GMM. The parameters of the Gaussian components are estimated to obtain the Fisher vector (FV) representation. Similarly, unsupervised RTs perform instance partitioning to represent the data, where each leaf implies a Gaussian component. In this section, we provide a discussion on the parameter insensitivity and density modeling success of randomized trees on a toy example. As mentioned in [13], density forests are generalizations of GMM…
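A toy construction of the two density views compared in this section (ours, not the paper's experiment): a two-component GaussianMixture is fit to a bimodal sample, and an unsupervised randomized tree partitions the same sample, with leaf occupancy frequencies playing the role of mixture weights in the density-forest view.

```python
# Toy comparison of GMM fitting vs. the piecewise density implied by an
# unsupervised randomized tree's leaves (illustrative assumptions throughout).
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.tree import ExtraTreeRegressor

rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(-2.0, 0.5, size=(300, 1)),
                    rng.normal(2.0, 1.0, size=(300, 1))])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

tree = ExtraTreeRegressor(max_leaf_nodes=8, random_state=0)
tree.fit(X, rng.uniform(size=len(X)))        # random targets -> unsupervised splits
leaf_id = tree.apply(X)                      # leaf index per instance
occupancy = np.bincount(leaf_id, minlength=tree.tree_.node_count) / len(X)

print(gmm.weights_)                          # mixture weights of the GMM
print(occupancy[occupancy > 0])              # leaf frequencies, the tree analogue
```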

Experiments

We test our proposals on a wide range of MIL datasets from different categories to avoid application bias. Unfortunately, most MIL studies follow different strategies for experimentation, which complicates comparisons. For instance, some studies do not report performance on certain datasets because of their computational requirements; others use different settings for the cross-validation or split the training data randomly and report test performance. Due to this fact, a comprehensive…

Conclusions and future work

The use of bag-level representations for MIL has been shown to provide successful results in the literature. This paper proposes a robust framework that uses random trees to partition the feature space together with either a terminal node-based or a path-based representation. Our encoding implicitly learns a generalized Gaussian Mixture Model (GMM) on the instance feature space, and this information is transformed into a bag-level summary. The proposed representations provide very fast and competitive results…

References (50)

  • F. Briggs et al.

    Acoustic classification of multiple simultaneous bird species: a multi-instance multi-label approach

    J. Acoust. Soc. Am.

    (2012)
  • Y. Chen et al.

MILES: multiple-instance learning via embedded instance selection

IEEE Trans. Pattern Anal. Mach. Intell.

    (2006)
  • V. Cheplygina et al.

    Dissimilarity-based ensembles for multiple instance learning

    IEEE Trans. Neural Netw. Learn. Syst.

    (2016)
  • A. Coates et al.

    Learning feature representations with k-means

    Neural Networks: Tricks of the Trade: Second Edition

    (2012)
  • A. Criminisi et al.

    Decision forests: a unified framework for classification, regression, density estimation, manifold learning and semi-supervised learning

    Found. Trends Comput. Graph. Vision

    (2012)
  • J. Demšar

    Statistical comparisons of classifiers over multiple data sets

    J. Mach. Learn. Res.

    (2006)
  • R.P. Duin, PRtools. Version 4.1.5., 2009,...
  • A. Erdem et al.

    Multiple-instance learning with instance selection via dominant sets

    Similarity-Based Pattern Recognition

    (2011)
  • M. Everingham et al.

The PASCAL visual object classes (VOC) challenge

    Int. J. Comput. Vis.

    (2010)
  • J. Foulds et al.

    A review of multi-instance learning assumptions

    Knowl. Eng. Rev.

    (2010)
  • M. Friedman

    A comparison of alternative tests of significance for the problem of m rankings

    Ann. Math. Stat.

    (1940)
  • Z. Fu et al.

MILIS: multiple instance learning with instance selection

IEEE Trans. Pattern Anal. Mach. Intell.

    (2011)
  • M. Haußmann et al.

Variational Bayesian multiple instance learning with Gaussian processes

    CVPR

    (2017)
  • M. Kandemir et al.

    Empowering multiple instance histopathology cancer diagnosis by cell graphs

    Medical Image Computing and Computer-Assisted Intervention–MICCAI 2014

    (2014)
  • E.S. Kucukasci, M.G. Baydogan, Multiple instance learning bag encoding strategies, 2018,...