Learning from data streams with only positive and unlabeled data

Journal of Intelligent Information Systems

Abstract

Many studies on streaming data classification have been based on a paradigm in which a fully labeled stream is available for learning purposes. However, it is often too labor-intensive and time-consuming to manually label a data stream for training. This difficulty can make conventional supervised learning approaches infeasible in many real-world applications, such as credit fraud detection, intrusion detection, and rare event prediction. In previous work, Li et al. suggested that these applications be treated as Positive and Unlabeled learning problems, and proposed a learning algorithm, OcVFDT, as a solution (Li et al. 2009). Their method requires only a set of positive examples and a set of unlabeled examples, which are easily obtainable in a streaming environment, making it widely applicable to real-life applications. Here, we enhance Li et al.’s solution by adding three features: an efficient method to estimate the percentage of positive examples in the training stream, the ability to handle numeric attributes, and the use of more appropriate classification methods at tree leaves. Experimental results on synthetic and real-life datasets show that our enhanced solution (called PUVFDT) has very good classification performance and a strong ability to learn from data streams with only positive and unlabeled examples. Furthermore, our enhanced solution reduces the learning time of OcVFDT by about an order of magnitude. Even with 80% of the examples in the training data stream unlabeled, PUVFDT can still achieve classification performance competitive with that of VFDTcNB (Gama et al. 2003), a supervised learning algorithm.

Notes

  1. Note that, unlike the VFDTcNB of Gama et al. (2003), which constructs a binary tree structure to deal with numeric attributes in a Hoeffding tree, VFDTcNB in our paper denotes VFDT (Domingos and Hulten 2000) extended to handle numeric attributes by using a Gaussian (i.e., normal) distribution to approximate the numeric distribution on a per-class basis in small constant space.

  2. http://jmlr.csail.mit.edu/papers/volume5/lewis04a/lyrl2004_rcv1v2_README.htm.

  3. http://www.sigkdd.org/kddcup/index.php?section=1999&method=data.
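
As an illustration of the per-class Gaussian approximation described in Note 1, the following is a minimal sketch, not the authors' implementation. It assumes a running mean and variance per class maintained with Welford's update, which is one common way to keep such a summary in constant space; the source does not state which update rule the paper uses. The class names `GaussianEstimator` and `PerClassGaussian` are hypothetical.

```python
import math
from collections import defaultdict


class GaussianEstimator:
    """Constant-space summary of one numeric attribute (assumed: Welford's update)."""

    def __init__(self):
        self.n = 0        # number of values seen
        self.mean = 0.0   # running mean
        self.m2 = 0.0     # running sum of squared deviations from the mean

    def update(self, x):
        # Fold a single value into the summary without storing it.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def variance(self):
        # Sample variance; defined as 0.0 until two values have been seen.
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0

    def pdf(self, x, min_std=1e-9):
        # Normal density at x; min_std guards against a zero variance.
        std = max(math.sqrt(self.variance()), min_std)
        z = (x - self.mean) / std
        return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


class PerClassGaussian:
    """One GaussianEstimator per class label for a single numeric attribute."""

    def __init__(self):
        self.by_class = defaultdict(GaussianEstimator)

    def update(self, x, label):
        self.by_class[label].update(x)

    def likelihood(self, x, label):
        # Approximate p(x | label) under the per-class normal assumption.
        return self.by_class[label].pdf(x)


# Hypothetical usage on a tiny hand-made stream.
est = PerClassGaussian()
for value, label in [(1.0, "pos"), (2.0, "pos"), (3.0, "pos"), (10.0, "neg"), (12.0, "neg")]:
    est.update(value, label)
print(est.likelihood(2.0, "pos") > est.likelihood(2.0, "neg"))
```

Each class needs only three numbers per attribute regardless of stream length, which is what makes this kind of summary suitable for a leaf of a streaming decision tree.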

References

  • Al-Kateb, M., Lee, B.S., Wang, X.S. (2007). Adaptive-size reservoir sampling over data streams. In Proc. of the 19th international conference on scientific and statistical database management (SSDBM’07) (pp. 22–31).

  • Bifet, A., & Gavaldà, R. (2009). Adaptive learning from evolving data streams. In Proc. of the 8th international symposium on intelligent data analysis: advances in intelligent data analysis VIII (pp. 249–260).

  • Bifet, A., Holmes, G., Pfahringer, B., Kirkby, R., Gavaldà, R. (2009). New ensemble methods for evolving data streams. In Proc. of the 15th ACM SIGKDD international conference on knowledge discovery and data mining (pp. 139–148).

  • Bifet, A., Holmes, G., Kirkby, R., Pfahringer, B. (2011). Data stream mining: A practical approach. http://heanet.dl.sourceforge.net/project/moa-datastream/documentation/StreamMining.pdf.

  • Calvo, B., Larranaga, P., Lozano, J.A. (2007). Learning Bayesian classifiers from positive and unlabeled examples. Pattern Recognition Letters, 28(16), 2375–2384.

  • Denis, F. (1998). PAC learning from positive statistical queries. In Proc. of the 9th international conference on algorithmic learning theory (pp. 112–126).

  • Denis, F., Gilleron, R., Letouzey, F. (2005). Learning from positive and unlabeled examples. Theoretical Computer Science, 348(1), 70–83.

  • Denis, F., Gilleron, R., Tommasi, M. (2002). Text classification from positive and unlabeled examples. In Proc. of the 9th international conference on information processing and management of uncertainty in knowledge-based systems (IPMU 2002).

  • Domingos, P., & Hulten, G. (2000). Mining high-speed data streams. In Proc. of the 6th ACM SIGKDD international conference on knowledge discovery and data mining (pp. 71–80).

  • Elkan, C., & Noto, K. (2008). Learning classifiers from only positive and unlabeled data. In Proc. of the 14th ACM SIGKDD international conference on knowledge discovery and data mining (pp. 213–220).

  • Fung, G.P.C., Yu, J.X., Lu, H., Yu, P.S. (2006). Text classification without negative examples revisit. IEEE Transactions on Knowledge and Data Engineering, 18(1), 6–20.

  • Gama, J. (2004). Functional trees. Machine Learning, 55(3), 219–250.

  • Gama, J., & Medas, P. (2005). Learning decision trees from dynamic data streams. Journal of Universal Computer Science, 11(8), 1353–1366.

  • Gama, J., Rocha, R., Medas, P. (2003). Accurate decision trees for mining high-speed data streams. In Proc. of the 9th ACM SIGKDD international conference on knowledge discovery and data mining (pp. 523–528).

  • Gama, J., Fernandes, R., Rocha, R. (2006). Decision trees for mining data streams. Intelligent Data Analysis, 10(1), 23–45.

  • Greenwald, M., & Khanna, S. (2001). Space-efficient online computation of quantile summaries. In Proc. of the ACM SIGMOD international conference on management of data (pp. 58–66).

  • Han, J., & Kamber, M. (2000). Data mining: Concepts and techniques. Morgan Kaufmann.

  • Hoeffding, W. (1963). Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301), 13–30.

  • Hulten, G., & Domingos, P. (2003). VFML-a toolkit for mining high-speed time-changing data streams. http://www.cs.washington.edu/dm/vfml.

  • Hulten, G., Spencer, L., Domingos, P. (2001). Mining time-changing data streams. In Proc. of the 7th ACM SIGKDD international conference on knowledge discovery and data mining (pp. 97–106).

  • Jin, R., & Agrawal, G. (2003). Efficient decision tree construction on streaming data. In Proc. of the 9th ACM SIGKDD international conference on knowledge discovery and data mining (pp. 571–576).

  • Kohavi, R. (1996). Scaling up the accuracy of naive-Bayes classifiers: A decision-tree hybrid. In Proc. of the 2nd international conference on knowledge discovery and data mining (pp. 202–207).

  • Lee, W.S., & Liu, B. (2003). Learning with positive and unlabeled examples using weighted logistic regression. In Proc. of the 12th international conference on machine learning (pp. 239–248).

  • Lewis, D.D., Yang, Y., Rose, T.G., Li, F. (2004). RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5, 361–397.

  • Li, C., Zhang, Y., Li, X. (2009). OcVFDT: One-class very fast decision tree for one-class classification of data streams. In Proc. of the 3rd international workshop on knowledge discovery from sensor data, held in conjunction with SIGKDD’09 (pp. 79–86).

  • Li, P., Wu, X., Hu, X. (2010). Learning from concept drifting data streams with unlabeled data. In Proc. of the 24th AAAI conference on artificial intelligence (pp. 1945–1946).

  • Li, X.L., Yu, P.S., Liu, B., Ng, S.K. (2009). Positive unlabeled learning for data stream classification. In Proc. of the SIAM international conference on data mining (pp. 257–268).

  • Liang, C., Zhang, Y., Shi, P., Hu, Z. (2012). Learning very fast decision tree from uncertain data streams with positive and unlabeled samples. Information Sciences, 213, 50–67. doi:10.1016/j.ins.2012.05.023.

  • Liu, B., Dai, Y., Li, X., Lee, W.S., Yu, P.S. (2003). Building text classifiers using positive and unlabeled examples. In Proc. of the 3rd IEEE international conference on data mining (pp. 179–186).

  • Pan, S., Zhang, Y., Li, X. (2011). Dynamic classifier ensemble for positive unlabeled text stream classification. Knowledge and Information Systems, 1–21. doi:10.1007/s10115-011-0469-2.

  • Pfahringer, B., Holmes, G., Kirkby, R. (2008). Handling numeric attributes in Hoeffding trees. In Proc. of the 12th Pacific-Asia conference on knowledge discovery and data mining (pp. 296–307).

  • Quinlan, J.R. (1993). C4.5: Programs for machine learning. Morgan Kaufmann.

  • Scholkopf, B., Platt, J.C., Shawe-Taylor, J., Smola, A.J., Williamson, R.C. (2001). Estimating the support of a high-dimensional distribution. Neural Computation, 13(7), 1443–1471.

  • Street, W.N., & Kim, Y.S. (2001). A streaming ensemble algorithm (SEA) for large-scale classification. In Proc. of the 7th ACM SIGKDD international conference on knowledge discovery and data mining (pp. 377–382).

  • Utgoff, P.E. (1988). Perceptron trees: A case study in hybrid concept representations. In Proc. of the 7th national conference on artificial intelligence (pp. 601–606).

  • Vitter, J.S. (1985). Random sampling with a reservoir. ACM Transactions on Mathematical Software, 11(1), 37–57.

  • Wang, H., Fan, W., Yu, P.S., Han, J. (2003). Mining concept-drifting data streams using ensemble classifiers. In Proc. of the 9th ACM SIGKDD international conference on knowledge discovery and data mining (pp. 226–235).

  • Witten, I., & Frank, E. (2005). Data mining: Practical machine learning tools and techniques. Morgan Kaufmann.

  • Yu, H. (2003). General MC: Estimating boundary of positive class from small positive data. In Proc. of the 3rd IEEE international conference on data mining (pp. 693–696). Melbourne, Florida, USA.

  • Yu, H. (2005). Single-class classification with mapping convergence. Machine Learning, 61(1–3), 49–69.

  • Yu, H., Han, J., Chang, K.C.C. (2002). PEBL: Positive example based learning for web page classification using SVM. In Proc. of the 8th ACM SIGKDD international conference on knowledge discovery and data mining (pp. 239–248).

  • Yu, H., Han, J., Chang, K.C.C. (2004). PEBL: Web page classification without negative examples. IEEE Transactions on Knowledge and Data Engineering, 16(1), 70–81.

  • Yu, H., Zhai, C.X., Han, J. (2003). Text classification from positive and unlabeled documents. In Proc. of the 12th international conference on information and knowledge management (pp. 232–239).

  • Zhang, P., Zhu, X., Guo, L. (2009). Mining data streams with labeled and unlabeled training examples. In Proc. of the 9th IEEE international conference on data mining (pp. 627–636).

  • Zhang, P., Zhu, X., Shi, Y. (2008). Categorizing and mining concept-drifting data streams. In Proc. of the 14th ACM SIGKDD international conference on knowledge discovery and data mining (pp. 812–820).

  • Zhang, Y., & Jin, X. (2006). An automatic construction and organization strategy for ensemble learning on data streams. SIGMOD Record, 35(3), 28–33.

  • Zhang, Y., Li, X., Orlowska, M. (2008). One-class classification of text streams with concept drift. In Proc. of the 2008 IEEE international conference on data mining workshops (pp. 116–125).

  • Zhu, X., Wu, X., Zhu, Y. (2004). Dynamic classifier selection for effective mining from noisy data streams. In Proc. of the 4th IEEE international conference on data mining (pp. 305–312).

  • Zhu, X., Zhang, P., Lin, X., Shi, Y. (2007). Active learning from data streams. In Proc. of the 7th IEEE international conference on data mining (pp. 757–762).

  • Zhu, X., Ding, W., Yu, P.S., Zhang, C. (2011). One-class learning and concept summarization for data streams. Knowledge and Information Systems, 28(3), 523–553.

Acknowledgements

This research is partially supported by the National Natural Science Foundation of China (60873196) and Chinese Universities Scientific Fund (QN2009092). The authors would like to thank the associate editor and anonymous reviewers for their constructive comments and suggestions to improve the quality of the paper.

Author information

Correspondence to Yang Zhang.

About this article

Qin, X., Zhang, Y., Li, C. et al. Learning from data streams with only positive and unlabeled data. J Intell Inf Syst 40, 405–430 (2013). https://doi.org/10.1007/s10844-012-0231-6

  • DOI: https://doi.org/10.1007/s10844-012-0231-6
