Learning from data streams with only positive and unlabeled data

Qin, Xiangju; Zhang, Yang; Li, Chen; Li, Xue

doi:10.1007/s10844-012-0231-6


Published: 05 January 2013

Volume 40, pages 405–430, (2013)

Journal of Intelligent Information Systems



Abstract

Many studies on streaming data classification have been based on a paradigm in which a fully labeled stream is available for learning purposes. However, it is often too labor-intensive and time-consuming to manually label a data stream for training. This difficulty may make conventional supervised learning approaches infeasible in many real-world applications, such as credit fraud detection, intrusion detection, and rare event prediction. In previous work, Li et al. suggested that these applications be treated as a Positive and Unlabeled learning problem, and proposed a learning algorithm, OcVFDT, as a solution (Li et al. 2009). Their method requires only a set of positive examples and a set of unlabeled examples, which is easily obtainable in a streaming environment, making it widely applicable to real-life applications. Here, we enhance Li et al.’s solution by adding three features: an efficient method to estimate the percentage of positive examples in the training stream, the ability to handle numeric attributes, and the use of more appropriate classification methods at tree leaves. Experimental results on synthetic and real-life datasets show that our enhanced solution (called PUVFDT) has very good classification performance and a strong ability to learn from data streams with only positive and unlabeled examples. Furthermore, our enhanced solution reduces the learning time of OcVFDT by about an order of magnitude. Even with 80% of the examples in the training data stream unlabeled, PUVFDT can still achieve classification performance competitive with that of VFDTcNB (Gama et al. 2003), a supervised learning algorithm.


Notes

Note that the VFDTcNB of Gama et al. (2003) constructs a binary tree structure to deal with numeric attributes in a Hoeffding tree. In our paper, by contrast, VFDTcNB denotes VFDT (Domingos and Hulten 2000) extended to handle numeric attributes by using a Gaussian (i.e., normal) distribution to approximate the numeric distribution on a per-class basis, which requires only small constant space.
http://jmlr.csail.mit.edu/papers/volume5/lewis04a/lyrl2004_rcv1v2_README.htm.
http://www.sigkdd.org/kddcup/index.php?section=1999&method=data.
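
The first note above describes approximating each numeric attribute with a per-class Gaussian in small constant space. The following is a minimal sketch of that idea, not the authors' implementation: it assumes Welford's online update for the running mean and variance, and the names `GaussianEstimator` and `PerClassGaussian` are hypothetical, chosen only for illustration.

```python
import math


class GaussianEstimator:
    """Running mean and variance of one numeric attribute (Welford's method).

    Only three numbers are stored, so memory use is constant regardless of
    how many stream examples have been seen.
    """

    def __init__(self):
        self.n = 0        # number of values observed so far
        self.mean = 0.0   # running mean
        self.m2 = 0.0     # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def variance(self):
        # Sample variance; defined as 0.0 until two values have been seen.
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0

    def pdf(self, x):
        if self.n == 0:
            return 0.0
        var = self.variance()
        if var <= 0.0:
            # Degenerate case (assumed simplification): all mass at the mean.
            return 1.0 if x == self.mean else 0.0
        coeff = 1.0 / math.sqrt(2.0 * math.pi * var)
        return coeff * math.exp(-((x - self.mean) ** 2) / (2.0 * var))


class PerClassGaussian:
    """One GaussianEstimator per class label for a single numeric attribute."""

    def __init__(self):
        self.by_class = {}

    def update(self, x, label):
        self.by_class.setdefault(label, GaussianEstimator()).update(x)

    def likelihood(self, x, label):
        est = self.by_class.get(label)
        return est.pdf(x) if est is not None else 0.0


# Usage: feed labeled values one at a time, then query class-conditional densities.
attr = PerClassGaussian()
for value, label in [(1.0, "pos"), (2.0, "pos"), (3.0, "pos"),
                     (10.0, "neg"), (11.0, "neg"), (12.0, "neg")]:
    attr.update(value, label)
```

In a Hoeffding-tree setting, densities of this kind could feed a naive Bayes prediction at a leaf or the evaluation of candidate split points; how PUVFDT does either is not specified on this page.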

References

Al-Kateb, M., Lee, B.S., Wang, X.S. (2007). Adaptive-size reservoir sampling over data streams. In Proc. of the 19th international conference on scientific and statistical database management (SSDBM’07) (pp. 22–31).
Bifet, A., & Gavaldà, R. (2009). Adaptive learning from evolving data streams. In Proc. of the 8th international symposium on intelligent data analysis: advances in intelligent data analysis VIII (pp. 249–260).
Bifet, A., Holmes, G., Pfahringer, B., Kirkby, R., Gavaldà, R. (2009). New ensemble methods for evolving data streams. In Proc. of the 15th ACM SIGKDD international conference on knowledge discovery and data mining (pp. 139–148).
Bifet, A., Holmes, G., Kirkby, R., Pfahringer, B. (2011). Data stream mining: A practical approach. http://heanet.dl.sourceforge.net/project/moa-datastream/documentation/StreamMining.pdf.
Calvo, B., Larranaga, P., Lozano, J.A. (2007). Learning Bayesian classifiers from positive and unlabeled examples. Pattern Recognition Letters, 28(16), 2375–2384.
Denis, F. (1998). PAC learning from positive statistical queries. In Proc. of the 9th international conference on algorithmic learning theory (pp. 112–126).
Denis, F., Gilleron, R., Letouzey, F. (2005). Learning from positive and unlabeled examples. Theoretical Computer Science, 348(1), 70–83.
Denis, F., Gilleron, R., Tommasi, M. (2002). Text classification from positive and unlabeled examples. In Proc. of the 9th international conference on Information Processing and Management of Uncertainty in knowledge-based systems (IPMU 2002).
Domingos, P., & Hulten, G. (2000). Mining high-speed data streams. In Proc. of the 6th ACM SIGKDD international conference on knowledge discovery and data mining (pp. 71–80).
Elkan, C., & Noto, K. (2008). Learning classifiers from only positive and unlabeled data. In Proc. of the 14th ACM SIGKDD international conference on knowledge discovery and data mining (pp. 213–220).
Fung, G.P.C., Yu, J.X., Lu, H., Yu, P.S. (2006). Text classification without negative examples revisit. IEEE Transactions on Knowledge and Data Engineering, 18(1), 6–20.
Gama, J. (2004). Functional trees. Machine Learning, 55(3), 219–250.
Gama, J., & Medas, P. (2005). Learning decision trees from dynamic data streams. Journal of Universal Computer Science, 11(8), 1353–1366.
Gama, J., Rocha, R., Medas, P. (2003). Accurate decision trees for mining high-speed data streams. In Proc. of the 9th ACM SIGKDD international conference on knowledge discovery and data mining (pp. 523–528).
Gama, J., Fernandes, R., Rocha, R. (2006). Decision trees for mining data streams. Intelligent Data Analysis, 10(1), 23–45.
Greenwald, M., & Khanna, S. (2011). Space-efficient online computation of quantile summaries. In Proc. of the ACM SIGMOD international conference on management of data (pp. 58–66).
Han, J., & Kamber, M. (2000). Data mining: Concepts and techniques. Morgan Kaufmann.
Hoeffding, W. (1963). Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301), 13–30.
Hulten, G., & Domingos, P. (2003). VFML-a toolkit for mining high-speed time-changing data streams. http://www.cs.washington.edu/dm/vfml.
Hulten, G., Spencer, L., Domingos, P. (2001). Mining time-changing data streams. In Proc. of the 7th ACM SIGKDD international conference on knowledge discovery and data mining (pp. 97–106).
Jin, R., & Agrawal, G. (2003). Efficient decision tree construction on streaming data. In Proc. of the 9th ACM SIGKDD international conference on knowledge discovery and data mining (pp. 571–576).
Kohavi, R. (1996). Scaling up the accuracy of naive-Bayes classifiers: A decision-tree hybrid. In Proc. of the 2nd international conference on knowledge discovery and data mining (pp. 202–207).
Lee, W.S., & Liu, B. (2003). Learning with positive and unlabeled examples using weighted logistic regression. In Proc. of the 12th international conference on machine learning (pp. 239–248).
Lewis, D.D., Yang, Y., Rose, T.G., Li, F. (2004). RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5, 361–397.
Li, C., Zhang, Y., Li, X. (2009). OcVFDT: One-class very fast decision tree for one-class classification of data streams. In Proc. of the 3rd international workshop on knowledge discovery from sensor data, held in conjunction with SIGKDD’09 (pp. 79–86).
Li, P., Wu, X., Hu, X. (2010). Learning from concept drifting data streams with unlabeled data. In Proc. of the 24th AAAI conference on artificial intelligence (pp. 1945–1946).
Li, X.L., Yu, P.S., Liu, B., Ng, S.K. (2009). Positive unlabeled learning for data stream classification. In Proc. of the SIAM international conference on data mining (pp. 257–268).
Liang, C., Zhang, Y., Shi, P., Hu, Z. (2012). Learning very fast decision tree from uncertain data streams with positive and unlabeled samples. Information Sciences, 213, 50–67. doi:10.1016/j.ins.2012.05.023.
Liu, B., Dai, Y., Li, X., Lee, W.S., Yu, P.S. (2003). Building text classifiers using positive and unlabeled examples. In Proc. of the 3rd IEEE international conference on data mining (pp. 179–186).
Pan, S., Zhang, Y., Li, X. (2011). Dynamic classifier ensemble for positive unlabeled text stream classification. Knowledge and Information Systems, 1–21. doi:10.1007/s10115-011-0469-2.
Pfahringer, B., Holmes, G., Kirkby, R. (2008). Handling numeric attributes in Hoeffding trees. In Proc. of the 12th Pacific-Asia conference on knowledge discovery and data mining (pp. 296–307).
Quinlan, J.R. (1993). C4.5: Programs for machine learning. Morgan Kaufmann.
Scholkopf, B., Platt, J.C., Shawe-Taylor, J., Smola, A.J., Williamson, R.C. (2001). Estimating the support of a high-dimensional distribution. Neural Computation, 13(7), 1443–1471.
Street, W.N., & Kim, Y.S. (2001). A streaming ensemble algorithm (SEA) for large-scale classification. In Proc. of the 7th ACM SIGKDD international conference on knowledge discovery and data mining (pp. 377–382).
Utgoff, P.E. (1988). Perceptron trees: A case study in hybrid concept representations. In Proc. of the 7th national conference on artificial intelligence (pp. 601–606).
Vitter, J.S. (1985). Random sampling with a reservoir. ACM transactions on mathematical software, 11(1), 37–57.
Wang, H., Fan, W., Yu, P.S., Han, J. (2003). Mining concept-drifting data streams using ensemble classifiers. In Proc. of the 9th ACM SIGKDD international conference on knowledge discovery and data mining (pp. 226–235).
Witten, I., & Frank, E. (2005). Data mining: Practical machine learning tools and techniques. Morgan Kaufmann.
Yu, H. (2003). General MC: Estimating boundary of positive class from small positive data. In Proc. of the 3rd IEEE international conference on data mining (pp. 693–696). Melbourne, Florida, USA.
Yu, H. (2005). Single-class classification with mapping convergence. Machine Learning, 61(1–3), 49–69.
Yu, H., Han, J., Chang, K.C.C. (2002). PEBL: Positive example based learning for web page classification using SVM. In Proc. of the 8th ACM SIGKDD international conference on knowledge discovery and data mining (pp. 239–248).
Yu, H., Han, J., Chang, K.C.C. (2004). PEBL: Web page classification without negative examples. IEEE Transactions on Knowledge and Data Engineering, 16(1), 70–81.
Yu, H., Zhai, C.X., Han, J. (2003). Text classification from positive and unlabeled documents. In Proc. of the 12th international conference on information and knowledge management (pp. 232–239).
Zhang, P., Zhu, X., Guo, L. (2009). Mining data streams with labeled and unlabeled training examples. In Proc. of the 9th IEEE international conference on data mining (pp. 627–636).
Zhang, P., Zhu, X., Shi, Y. (2008). Categorizing and mining concept-drifting data streams. In Proc. of the 14th ACM SIGKDD international conference on knowledge discovery and data mining (pp. 812–820).
Zhang, Y., & Jin, X. (2006). An automatic construction and organization strategy for ensemble learning on data streams. SIGMOD Record, 35(3), 28–33.
Zhang, Y., Li, X., Orlowska, M. (2008). One-class classification of text streams with concept drift. In Proc. of the 2008 IEEE international conference on data mining workshops (pp. 116–125).
Zhu, X., Wu, X., Zhu, Y. (2004). Dynamic classifier selection for effective mining from noisy data streams. In Proc. of the 4th IEEE international conference on data mining (pp. 305–312).
Zhu, X., Zhang, P., Lin, X., Shi, Y. (2007). Active learning from data streams. In Proc. of the 7th IEEE international conference on data mining (pp. 757–762).
Zhu, X., Ding, W., Yu, P.S., Zhang, C. (2011). One-class learning and concept summarization for data streams. Knowledge and Information Systems, 28(3), 523–553.

Acknowledgements

This research is partially supported by the National Natural Science Foundation of China (60873196) and Chinese Universities Scientific Fund (QN2009092). The authors would like to thank the associate editor and anonymous reviewers for their constructive comments and suggestions to improve the quality of the paper.

Author information

Authors and Affiliations

College of Information Engineering, Northwest A&F University, Yangling, China
Xiangju Qin, Yang Zhang & Chen Li
State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China
Yang Zhang
School of Information Technology and Electrical Engineering, The University of Queensland, Brisbane, QLD, Australia
Xue Li


Corresponding author

Correspondence to Yang Zhang.


About this article

Cite this article

Qin, X., Zhang, Y., Li, C. et al. Learning from data streams with only positive and unlabeled data. J Intell Inf Syst 40, 405–430 (2013). https://doi.org/10.1007/s10844-012-0231-6


Received: 08 May 2012
Revised: 27 November 2012
Accepted: 30 November 2012
Published: 05 January 2013
Issue Date: June 2013
DOI: https://doi.org/10.1007/s10844-012-0231-6
