
Knowledge-Based Systems

Volume 120, 15 March 2017, Pages 198-210

A multiple-instance stream learning framework for adaptive document categorization

https://doi.org/10.1016/j.knosys.2017.01.001

Abstract

The task of document categorization is to classify documents from a stream as relevant or non-relevant to a particular user interest so as to reduce information overload. Existing solutions typically perform classification at the document level, i.e., a document is returned as relevant if at least a part of the document is of interest to the user. In this paper, we propose a novel multiple-instance stream learning framework for adaptive document categorization, named MIS-DC. Our proposed approach makes accurate predictions at both the document level and the block level, while requiring the training documents to be labeled only at the document level. In addition, our proposed approach provides adaptive document categorization by detecting and handling concept drift at a finer granularity as data streams evolve over time, thereby yielding higher prediction accuracy than existing data stream algorithms. Experiments on benchmark and real-world datasets demonstrate the effectiveness of our proposed approach.

Introduction

With the tremendous growth of resources on the Internet, the need to provide useful information to end users has become increasingly critical in the development of document-based recommendation systems. Nowadays, users are confronted with huge amounts of information, such as real-time news, blogs and documents arriving in a streaming manner, which are referred to as document data streams [1], [2], [3]. However, only a small fraction of this information is actually relevant to the interests of a particular user. To reduce the effort a user has to spend determining which information is relevant to his or her interests, the primary task of document categorization is to automatically filter out all non-relevant documents from an incoming stream, so that only relevant documents are presented to the end user.

Despite much progress in traditional data mining, classifying time-evolving document data streams remains a difficult task [4], [5], [6]. One reason is that the content of a document may cover multiple topics and possibly include other undesirable parts, e.g., text advertisements. This is especially the case for long, multi-topic documents, in which each block corresponds to a different topic [7], [8]. Therefore, it is likely that only some parts of a document are relevant to a particular user interest. If we treat each document as a whole and build a classifier using features extracted from the entire document, we inevitably introduce a great deal of noise into the learning process, which degrades the prediction accuracy of the classifier. Another source of difficulty is that both user interests and the information space are complex and dynamic. On the one hand, a user is typically interested in a variety of topics, which are fluid and interrelated. Such interests may vary rapidly over time, with new topics of interest emerging and previously interesting topics waning or even becoming obsolete. For example, in a news categorization application, a user might be interested in news articles about the “death of Michael Jackson” in July 2009, and then his interest might drift to “Avatar” when that movie was released in late 2009. On the other hand, the information space itself also changes dynamically over time, filled with new material such as new combinations of concepts, entirely new concepts, and the occurrence of new events. Therefore, it is a challenging task to design an adaptive classifier that not only identifies targeted content but also adapts to changing concepts in the stream [9], [10], [11].

In this paper, we propose a novel multiple-instance stream learning framework for document categorization, termed MIS-DC. To the best of our knowledge, our work is the first to propose a multiple-instance learning framework for classifying time-evolving document data streams. In document data stream categorization, a document is classified as relevant to a particular user interest as long as a certain part of the document is on that topic; it is not required that every part of the document be about that topic. Therefore, it is more appropriate to formulate this task from a multiple-instance angle. Following the formulation of multiple-instance learning, we consider a document as a bag, and the paragraph blocks in the document as instances in the bag. A document is classified as relevant (positive) if it contains at least one block of content related to the targeted subject of interest. Otherwise, if a document contains only negative (non-relevant) blocks, it is classified as non-relevant. Based on this formulation, the classifier can not only determine whether a document contains relevant content, but also indicate, at a finer granularity, which parts of the document are of interest to the user.
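To make the bag-of-blocks formulation concrete, the following sketch (our own illustration in Python, not code from the paper) represents a document as a bag of block-level feature vectors and derives the document label from block labels under the standard multiple-instance assumption. The Bag container, the feature values, and the block labels are hypothetical; in actual training only the document-level label would be observed.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Bag:
    """A document viewed as a bag of paragraph blocks (instances)."""
    doc_id: str
    instances: List[List[float]]   # one feature vector per paragraph block
    block_labels: List[int]        # +1 relevant, -1 non-relevant (illustrative only;
                                   # at training time only the bag label is given)


def bag_label(block_labels: List[int]) -> int:
    """Standard MIL assumption: a bag is positive iff at least one instance is positive."""
    return 1 if any(y == 1 for y in block_labels) else -1


# A document with a single relevant block is labeled relevant as a whole.
doc = Bag(doc_id="news_001",
          instances=[[0.1, 0.7], [0.9, 0.2], [0.0, 0.3]],
          block_labels=[-1, 1, -1])
print(bag_label(doc.block_labels))   # -> 1 (relevant)
```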

In our proposed framework, we also design a new mechanism to handle concept drift in user interests. Our method builds on chunk-based approaches [12], [13], which typically divide a data stream into a number of chunks and combine the base classifiers learnt on individual data chunks to form an ensemble classifier for prediction. The basic assumption made by these works is that the data within the same chunk share an identical distribution, so that concept drift only happens at the boundaries between chunks. However, this assumption may not hold in real-world applications, where we have no prior knowledge of when the drift will happen. Therefore, we propose a novel approach to detect and handle concept drift that may occur within a data chunk. Specifically, our proposed approach works in three steps. First, we decompose each data chunk in the stream into a sequence of small portions at a lower granularity, and for each portion, an instance-level multiple-instance learning algorithm is used to remove irrelevant paragraphs from positive documents, so that only paragraphs of interest to the user remain. Second, we perform core vocabulary analysis to extract positive features from the relevant paragraphs of positive documents and to detect the occurrence of concept drift between portions. Finally, when concept drift is detected between portions, we construct a transfer learning model to make predictions for incoming data chunks at both the document level and the block level.
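As a rough illustration of how these three steps could fit together, the sketch below splits a chunk into portions, builds a per-portion core vocabulary from relevant blocks, and flags drift between consecutive portions when the vocabularies diverge. The portion-splitting rule, the document-frequency cutoff, and the Jaccard-overlap drift test with a 0.5 threshold are simplifications of our own choosing, not the exact procedures of MIS-DC.

```python
from typing import Dict, List, Set


def split_into_portions(chunk: List[dict], n_portions: int) -> List[List[dict]]:
    """Step 1 (simplified): decompose a data chunk into smaller portions by arrival order."""
    size = max(1, len(chunk) // n_portions)
    return [chunk[i:i + size] for i in range(0, len(chunk), size)]


def core_vocabulary(relevant_blocks: List[Set[str]], min_df: int = 2) -> Set[str]:
    """Step 2 (simplified): keep terms appearing in at least `min_df` relevant blocks
    of the positive documents in a portion."""
    counts: Dict[str, int] = {}
    for block in relevant_blocks:
        for term in block:
            counts[term] = counts.get(term, 0) + 1
    return {term for term, c in counts.items() if c >= min_df}


def drift_detected(vocab_prev: Set[str], vocab_curr: Set[str], threshold: float = 0.5) -> bool:
    """Flag drift when the Jaccard overlap of consecutive core vocabularies falls below
    a threshold (an illustrative criterion, not the paper's exact test)."""
    if not vocab_prev or not vocab_curr:
        return False
    overlap = len(vocab_prev & vocab_curr) / len(vocab_prev | vocab_curr)
    return overlap < threshold
```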

The advantages of our proposed approach can be summarized as follows.

  • Our proposed approach can effectively detect and handle concept drift that occurs within a data chunk, thereby yielding higher prediction accuracy than existing data stream algorithms when data streams evolve over time.

  • Our proposed approach enables accurate prediction at both the document level and the block level, while requiring the training documents to be labeled only at the document level.

The rest of the paper is organized as follows. Section 2 discusses previous work related to our study. Section 3 gives a formal definition of the problem addressed in this paper. Section 4 presents the details of our proposed approach. Section 5 reports experimental results on benchmark and real-world datasets. Section 6 concludes the paper and discusses possible directions for future work. For clarity, the basic notation used in this paper is defined in Table 1.

Section snippets

Related work

In this section, we review related work in two research areas: multiple-instance learning and learning under concept drift from data streams.

Problem definition

Suppose we have a series of data chunks, denoted as $D_1, D_2, \ldots, D_m$, where $D_i$ ($i = 1, 2, \ldots, m$) contains the data that arrived in the time period between $T_{i-1}$ and $T_i$. The chunks $D_i$ ($i = 1, 2, \ldots, m-1$) are the historical chunks, and $D_m$ is the current chunk, namely the training chunk. The yet-to-come data chunk $D_{m+1}$ is considered the target chunk. The objective of data stream learning is to learn a classifier on the historical data chunks and the current data chunk, and to utilize the learnt classifier to predict the yet-to-come target chunk $D_{m+1}$.
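For reference, the sketch below shows the conventional chunk-based setup implied by this definition, written in scikit-learn style with hypothetical class and method names: one base classifier is trained per observed chunk $D_1, \ldots, D_m$, and their weighted votes predict the target chunk $D_{m+1}$. It is a baseline illustration, not the MIS-DC algorithm itself.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


class ChunkEnsemble:
    """Generic chunk-based ensemble: one base classifier per data chunk D_1, ..., D_m."""

    def __init__(self):
        self.members = []   # list of (classifier, weight) pairs

    def add_chunk(self, X_chunk: np.ndarray, y_chunk: np.ndarray, weight: float = 1.0) -> None:
        """Train a base classifier on a newly arrived chunk (labels in {-1, +1})."""
        clf = LogisticRegression(max_iter=1000).fit(X_chunk, y_chunk)
        self.members.append((clf, weight))

    def predict(self, X_target: np.ndarray) -> np.ndarray:
        """Weighted vote over the per-chunk classifiers to predict the target chunk D_{m+1}."""
        votes = sum(w * clf.predict(X_target) for clf, w in self.members)
        return np.where(votes >= 0, 1, -1)
```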

Our proposed approach

We propose a novel multiple-instance stream learning framework for adaptive document categorization. In contrast to existing data stream learning works, which assume that concept drift happens only at the boundaries of data chunks, our approach learns from a multiple-instance stream in which concept drift may occur within a data chunk. Our approach consists of three steps, as illustrated in Algorithm 1. In the following, we introduce each step in detail.

In multiple-instance

Experiments

To evaluate the effectiveness of the proposed approach MIS-DC, we perform experiments on two benchmark datasets, 20 Newsgroups and BankSearch, and on the real-world spam filtering dataset TREC05. The objectives of our experiments are

  • to evaluate the effectiveness of MIS-DC in handling concept drift that may occur at a chunk boundary or within a data chunk;

  • to evaluate the performance of MIS-DC on not only document-level classification but also block-level prediction;

  • to evaluate the

Conclusion and future work

In this paper, we propose a novel multiple-instance stream learning framework for document categorization, named MIS-DC. MIS-DC makes accurate predictions at both the document level and the block level, while requiring the training documents to be labeled only at the document level. In addition, MIS-DC provides adaptive document categorization by detecting and handling concept drift at a finer granularity as data streams evolve over time, thereby yielding higher prediction accuracy than existing data stream algorithms.

Acknowledgments

This work was supported in part by the Natural Science Foundation of China under Grant 61472090, Grant 61672169, and Grant 61472089, in part by the NSFC-Guangdong Joint Fund under Grant U1501254, in part by the Guangdong Natural Science Funds for Distinguished Young Scholar under Grant S2013050014133, in part by the Natural Science Foundation of Guangdong under Grant 2015A030313486, Grant 2014A030306004, and Grant 2014A030308008, in part by the Science and Technology Planning Project of

References (52)

  • R. Elwell et al.

    Incremental learning of concept drift in nonstationary environments

    IEEE Trans. Neural Netw. Learn. Syst.

    (2011)
  • P. Zhang et al.

    E-Tree: an efficient indexing structure for ensemble models on data streams

    IEEE Trans. Knowl. Data Eng.

    (2015)
  • C.C. Aggarwal et al.

    A framework for on-demand classification of evolving data streams

    IEEE Trans. Knowl. Data Eng.

    (2006)
  • Q. Zhang et al.

    Online learning from trapezoidal data streams

    IEEE Trans. Knowl. Data Eng.

    (2016)
  • J. Shao et al.

    Prototype-based learning on concept-drifting data streams

    Proceedings of ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD)

    (2014)
  • R.D. Rosa et al.

    The ABACOC algorithm: A novel approach for nonparametric classification of data streams

    Proceedings of IEEE International Conference on Data Mining (ICDM)

    (2015)
  • J. Gao et al.

    A general framework for mining concept-drifting data streams with skewed distributions

    Proceedings of the SIAM International Conference on Data Mining (SDM)

    (2007)
  • Y. Lim et al.

    MASCOT: memory-efficient and accurate sampling for counting local triangles in graph streams

    Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD)

    (2015)
  • Y. Xiao et al.

    A similarity-based classification framework for multiple-instance learning

    IEEE Trans. Cybern.

    (2014)
  • V. Cheplygina et al.

    Dissimilarity-based ensembles for multiple instance learning

    IEEE Trans. Neural Netw. Learn. Syst.

    (2016)
  • D.T. Nguyen et al.

    mi-DS: multiple-instance learning algorithm

    IEEE Trans. Cybern.

    (2013)
  • Y. Chen et al.

    Image categorization by learning and reasoning with regions

    J. Mach. Learn. Res.

    (2004)
  • Y. Chen et al.

    MILES: multiple-instance learning via embedded instance selection

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2006)
  • S. Andrews et al.

    Support vector machines for multiple-instance learning

    Proceedings of the Advances in Neural Information Processing Systems (NIPS)

    (2003)
  • P. Viola et al.

    Multiple instance boosting for object detection

    Proceedings of the Advances in Neural Information Processing Systems (NIPS)

    (2006)
  • A. Bifet et al.

    Efficient online evaluation of big data stream classifiers

    Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD)

    (2015)
Cited by (11)

    • Transfer learning-based one-class dictionary learning for recommendation data stream

      2021, Information Sciences
      Citation Excerpt:

      Evolving systems act in a sample-wise, single-pass manner to update parameters and adjust intrinsic structural components [13], while online incremental machine learning only updates parameters. The embed-based methods [1,18] can combine the classifiers built on the historical chunks to form a classifier for prediction. The merit of the incremental methods is that they can update the machine learning tool automatically; however, sometimes it is not easy to formulate the learning tool into the incremental form.

    • Bag-of-Concepts representation for document classification based on automatic knowledge acquisition from probabilistic knowledge base

      2020, Knowledge-Based Systems
      Citation Excerpt:

      Among various text analysis tasks, document classification is the most classical one, whose goal is to allocate documents into different categories. It has been widely applied in various applications such as spam detection [1], ontology mapping [2], document recommendation [3], topic labeling [4], and sentiment classification [5,6]. In this paper, we focus on text representation learning for document classification.

    • A recent overview of the state-of-the-art elements of text classification

      2018, Expert Systems with Applications
      Citation Excerpt:

      There are two strategies for data labelling: labelling groups of texts and assigning a label or labels to each text part. The first strategy is called multi-instance learning (Alpaydın, Cheplygina, Loog, & Tax, 2015; Foulds & Frank, 2010; Herrera et al., 2016; Liu, Xiao, & Hao, 2018; Ray, Scott, & Blockeel, 2010; Xiao, Liu, Yin, & Hao, 2017; Yan, Li, & Zhang, 2016; Yan, Zhu, Liu, & Wu, 2017), whereas the second one includes different supervised methods (Sammut & Webb, 2017). The results of this phase are employed in the succeeding stages.

    • An evidential dynamical model to predict the interference effect of categorization on decision making results

      2018, Knowledge-Based Systems
      Citation Excerpt:

      For instance, doctors must classify the tumor before performing surgery; judges need to categorize the defendant before making a judgement; the commander needs to categorize an unexpected aircraft before making a command. Additionally, categorization is also an important task in knowledge-based systems [4–6], which is the basis of decision making. However, many practical examples and experiments show that categorization may result in the disjunction fallacy, which violates the law of total probability [7,8].
