A multiple-instance stream learning framework for adaptive document categorization
Introduction
With the tremendous growth of resources on the Internet, the need to provide useful information to end users has become increasingly critical in the development of document-based recommendation systems. Nowadays, users are confronted with huge amounts of information, such as real-time news, blogs and documents, arriving in a streaming manner; these are referred to as document data streams [1], [2], [3]. However, only a small fraction of this information is actually relevant to the interests of a particular user. To reduce the effort a user must spend determining which information is relevant, the primary task of document categorization is to automatically filter out all non-relevant documents from an incoming stream, so that only relevant documents are presented to the end user.
Despite much progress in traditional data mining, classifying time-evolving document data streams remains a difficult task [4], [5], [6]. One reason is that a document may contain information about multiple topics, and possibly other undesirable parts, e.g., text advertisements. This is especially the case for long, multiple-topic documents, in which each block corresponds to a different topic [7], [8]. Therefore, it is likely that only some parts of a document are relevant to a particular user interest. If we treat each document as a whole and build a classifier using features extracted from the entire document, we inevitably introduce a lot of noise into the learning process, which degrades the prediction accuracy of the classifier. Another source of difficulty is that both user interests and the information space are complex and dynamic. On the one hand, a user is typically interested in a variety of topics, which are fluid and interrelated. Such interests may vary rapidly over time, with new topics of interest emerging, and previously interesting topics waning or even becoming obsolete. For example, in a news categorization application, a user might be interested in news articles about “death of Michael Jackson” in July 2009, and then his interest might drift to “Avatar” when the new movie was released in late 2009. On the other hand, the information space itself also changes dynamically over time, filling with new material such as new combinations of concepts, entirely new concepts, and new events. Therefore, it is challenging to design an adaptive classifier that not only identifies targeted content, but also adapts to changing concepts in the stream [9], [10], [11].
In this paper, we propose a novel multiple-instance stream learning framework for document categorization, termed MIS-DC. To the best of our knowledge, our work is the first to propose a multiple-instance learning framework for classifying time-evolving document data streams. In document data stream categorization, a document is classified as relevant to a particular user interest as long as some part of the document is on that topic. It is not required that every part of the document be about that topic. Therefore, it is more appropriate to formulate this task from a multiple-instance angle. Following the description of multiple-instance learning, we consider a document as a bag, and the paragraph blocks in the document as instances in the bag. A document is classified as relevant (positive) if it contains at least one block of content related to the targeted subject of interest. Otherwise, if all blocks in a document are negative (non-relevant), it is classified as non-relevant. Based on this formulation, the classifier can not only determine whether a document contains some relevant content, but also label which part of the document is of interest to the user at a finer granularity.
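The bag-level rule described above can be illustrated with a minimal Python sketch. It is not the MIS-DC classifier: the `Document` class, the `classify_bag` function and the keyword test `mentions_avatar` are hypothetical names introduced only for illustration, and the keyword test stands in for a learnt block-level model.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Document:
    """A bag: one document made of paragraph blocks (the instances)."""
    blocks: List[str]


def classify_bag(
    doc: Document, block_is_relevant: Callable[[str], bool]
) -> Tuple[bool, List[int]]:
    """Apply the multiple-instance rule to one document.

    Returns the document-level label (positive if at least one block is
    relevant) together with the indices of the relevant blocks.
    """
    relevant = [i for i, block in enumerate(doc.blocks) if block_is_relevant(block)]
    return len(relevant) > 0, relevant


def mentions_avatar(block: str) -> bool:
    """Hypothetical block-level test; an assumption, not a learnt model."""
    return "avatar" in block.lower()


doc = Document(blocks=[
    "Buy one, get one free this weekend only.",
    "Avatar opened in cinemas worldwide in late 2009.",
    "Local weather: light rain expected.",
])
label, relevant_blocks = classify_bag(doc, mentions_avatar)
```

Returning the block indices alongside the document label mirrors the two levels of prediction discussed above: the label answers whether the document is relevant, and the indices indicate which parts are.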
In our proposed framework, we also design a new mechanism to handle concept drift in user interests. Our method builds on chunk-based approaches [12], [13], which typically divide a data stream into a number of chunks and combine the base classifiers learnt on individual data chunks to form an ensemble classifier for prediction. The basic assumption made by these works is that the data within the same chunk share an identical distribution, so that concept drift only happens at the boundaries between chunks. However, this assumption may not hold in real-world applications, where we have no prior knowledge of when drift will happen. Therefore, we propose a novel approach to detect and handle concept drift that may happen within a data chunk. Specifically, our proposed approach works in three steps. Firstly, we decompose each data chunk in the stream into a sequence of small portions at a lower granularity, and for each portion, an instance-level multiple-instance learning algorithm is used to remove irrelevant paragraphs from positive documents so that the remaining paragraphs are of interest to the user. Secondly, we perform core vocabulary analysis to extract positive features from relevant paragraphs in positive documents and detect the occurrence of concept drift between portions. Finally, when concept drift is detected between portions, we construct a transfer learning model to make predictions for incoming data chunks at both the document level and the block level.
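The idea of looking for drift between portions inside a chunk can be sketched as follows. This is only an illustration under stated assumptions, not the core vocabulary analysis used by MIS-DC: here the core vocabulary of a portion is assumed to be its `top_k` most frequent words, drift is assumed to be flagged when the Jaccard overlap of consecutive vocabularies falls below `min_overlap`, and the input is assumed to be a list of paragraphs already judged relevant. All function and parameter names are hypothetical.

```python
from collections import Counter
from typing import List, Set


def split_into_portions(chunk: List[str], portion_size: int) -> List[List[str]]:
    """Decompose one data chunk into consecutive portions of fixed size."""
    return [chunk[i:i + portion_size] for i in range(0, len(chunk), portion_size)]


def core_vocabulary(paragraphs: List[str], top_k: int) -> Set[str]:
    """Assumed definition: the top_k most frequent words in a portion."""
    counts = Counter(word for p in paragraphs for word in p.lower().split())
    return {word for word, _ in counts.most_common(top_k)}


def drift_points(
    chunk: List[str], portion_size: int, top_k: int, min_overlap: float
) -> List[int]:
    """Return indices of portions whose vocabulary differs from the previous one."""
    portions = split_into_portions(chunk, portion_size)
    vocabularies = [core_vocabulary(p, top_k) for p in portions]
    drifts = []
    for i in range(1, len(vocabularies)):
        prev, curr = vocabularies[i - 1], vocabularies[i]
        union = prev | curr
        # Jaccard overlap; two empty vocabularies are treated as identical.
        overlap = len(prev & curr) / len(union) if union else 1.0
        if overlap < min_overlap:
            drifts.append(i)
    return drifts


# Toy chunk of relevant paragraphs whose topic changes halfway through.
chunk = [
    "jackson music tribute",
    "jackson memorial concert",
    "avatar movie release",
    "avatar movie box office",
]
detected = drift_points(chunk, portion_size=2, top_k=3, min_overlap=0.2)
```

In this toy chunk the two portions share no words, so the second portion (index 1) is flagged even though the change happens inside a single chunk, which is the situation chunk-boundary methods cannot see.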
The advantages of our proposed approach can be summarized as follows.
- Our proposed approach can effectively detect and handle concept drift that occurs within a data chunk, thereby yielding higher prediction accuracy than existing data stream algorithms when data streams evolve over time.
- Our proposed approach enables accurate prediction at both the document level and the block level, while requiring training documents to be labelled only at the document level.
The rest of the paper is organized as follows. Section 2 discusses previous work related to our study. Section 3 gives a formal definition of the problem addressed in this paper. Section 4 presents the details of our proposed approach. Section 5 reports the experimental results on real-world datasets. Section 6 concludes the paper and discusses possible directions for future work. For clarity, the basic notations used in this paper are defined in Table 1.
Related work
In this section, we review related work in two research areas: multiple-instance learning and learning under concept drift in data streams.
Problem definition
Suppose we have a series of data chunks, denoted as $D_1, D_2, \ldots, D_m$, where $D_i$ contains the data that arrived between time $T_{i-1}$ and $T_i$. $D_1, \ldots, D_{m-1}$ are the historical chunks, and $D_m$ is the current chunk, namely the training chunk. The yet-to-come data chunk is considered as the target chunk. The objective of data stream learning is to learn a classifier on the historical data chunks and the current data chunk, and utilize the learnt classifier to predict the yet-to-come
Our proposed approach
We propose a novel multiple-instance stream learning framework for adaptive document categorization. Unlike existing data stream learning methods, which assume that concept drift only happens on the boundaries of data chunks, our approach is designed to learn from a multiple-instance stream in which concept drift may occur within a data chunk. Our approach consists of three steps, as illustrated in Algorithm 1. In the following, we introduce each step in detail.
In multiple-instance
Experiments
To evaluate the effectiveness of the proposed approach MIS-DC, we perform experiments on two benchmark datasets, 20 Newsgroups and BankSearch, and on a real-world spam filtering dataset, TREC05. The objectives of our experiments are:
- to evaluate the effectiveness of MIS-DC for handling concept drift which could occur on the chunk boundary or within the data chunk;
- to evaluate the performance of MIS-DC on not only the document-level classification, but also the block-level prediction;
- to evaluate the
Conclusion and future work
In this paper, we propose a novel multiple-instance stream learning framework for document categorization, named MIS-DC. MIS-DC can make accurate predictions at both the document level and the block level, while requiring the training documents to be labelled only at the document level. In addition, MIS-DC can also provide adaptive document categorization by detecting and handling concept drift at a finer granularity when data streams evolve over time, thereby yielding higher
Acknowledgments
This work was supported in part by the Natural Science Foundation of China under Grant 61472090, Grant 61672169, and Grant 61472089, in part by the NSFC-Guangdong Joint Fund under Grant U1501254, in part by the Guangdong Natural Science Funds for Distinguished Young Scholar under Grant S2013050014133, in part by the Natural Science Foundation of Guangdong under Grant 2015A030313486, Grant 2014A030306004, and Grant 2014A030308008, in part by the Science and Technology Planning Project of
References (52)
- et al. Combining block-based and online methods in learning ensembles from concept drifting data streams. Inf. Sci. (2014)
- et al. Learning concept-drifting data streams with random ensemble decision trees. Neurocomputing (2015)
- et al. Recognizing input space and target concept drifts in data streams with scarcely labeled and unlabelled instances. Inf. Sci. (2016)
- et al. Trend analysis of categorical data streams with a concept change method. Inf. Sci. (2014)
- et al. Real-time stream data mining based on cantree and gtree. Inf. Sci. (2016)
- et al. Solving the multiple instance problem with axis-parallel rectangles. Artif. Intell. (1997)
- Multiple-instance learning based decision neural networks for image retrieval and classification. Neurocomputing (2016)
- et al. Robust multiple-instance learning ensembles using random subspace instance selection. Pattern Recognit. (2016)
- et al. A case-based technique for tracking concept drift in spam filtering. Knowl. Based Syst. (2005)
- et al. Applying lazy learning algorithms to tackle concept drift in spam filtering. Expert Syst. Appl. (2007)
- Incremental learning of concept drift in nonstationary environments. IEEE Trans. Neural Netw. Learn. Syst.
- E-Tree: an efficient indexing structure for ensemble models on data streams. IEEE Trans. Knowl. Data Eng.
- A framework for on-demand classification of evolving data streams. IEEE Trans. Knowl. Data Eng.
- Online learning from trapezoidal data streams. IEEE Trans. Knowl. Data Eng.
- Prototype-based learning on concept-drifting data streams. Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD)
- The ABACOC algorithm: A novel approach for nonparametric classification of data streams. Proceedings of the IEEE International Conference on Data Mining (ICDM)
- A general framework for mining concept-drifting data streams with skewed distributions. Proceedings of the SIAM International Conference on Data Mining (SDM)
- Mascot: Memory-efficient and accurate sampling for counting local triangles in graph streams. Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD)
- A similarity-based classification framework for multiple-instance learning. IEEE Trans. Cybern.
- Dissimilarity-based ensembles for multiple instance learning. IEEE Trans. Neural Netw. Learn. Syst.
- Mi-ds: multiple-instance learning algorithm. IEEE Trans. Cybern.
- Image categorization by learning and reasoning with regions. J. Mach. Learn. Res.
- MILES: multiple-instance learning via embedded instance selection. IEEE Trans. Pattern Anal. Mach. Intell.
- Support vector machines for multiple-instance learning. Proceedings of the Advances in Neural Information Processing Systems (NIPS)
- Multiple instance boosting for object detection. Proceedings of the Advances in Neural Information Processing Systems (NIPS)
- Efficient online evaluation of big data stream classifiers. Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD)