
Towards Positive Unlabeled Learning for Parallel Data Mining: A Random Forest Framework

  • Conference paper
Advanced Data Mining and Applications (ADMA 2014)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 8933)


Abstract

Parallel computing techniques can greatly accelerate traditional data mining algorithms, allowing them to efficiently tackle learning tasks characterized by high computational complexity and huge amounts of data, as real-world applications require. However, most of these techniques require fully labeled training sets, which is a challenging requirement to meet. To address this problem, we investigate widely used Positive and Unlabeled (PU) learning algorithms, including PU information gain and a newly developed PU Gini index, combined with the popular parallel computing framework Random Forest (RF), thereby enabling parallel data mining to learn from only positive and unlabeled samples. The proposed framework, termed PURF (Positive Unlabeled Random Forest), learns from positive and unlabeled instances and, in experiments on both synthetic and real-world UCI datasets, achieves classification performance comparable to that of RF trained on fully labeled data. PURF is a promising framework that facilitates PU learning in parallel data mining and is anticipated to be useful in many real-world parallel computing applications with large amounts of unlabeled data.
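The paper's exact PU Gini index is not reproduced in this preview, but the general idea behind such criteria can be sketched. Under the common "selected completely at random" assumption from the PU-learning literature (Elkan and Noto), each true positive appears labeled with some known labeling frequency, so a node's true positive fraction can be estimated from its labeled-positive fraction and plugged into the standard binary Gini formula. The function name `pu_gini` and the `labeling_freq` parameter below are illustrative assumptions, not the authors' definitions:

```python
def pu_gini(n_pos_labeled: int, n_unlabeled: int, labeling_freq: float) -> float:
    """Estimate the Gini impurity of a tree node from PU counts.

    Assumes each true positive is labeled with probability
    `labeling_freq` (the "selected completely at random" setting),
    so the true positive fraction is estimated as the labeled-positive
    fraction divided by `labeling_freq`, clipped to [0, 1].
    """
    n = n_pos_labeled + n_unlabeled
    if n == 0:
        return 0.0
    # Estimated P(y = +1) in this node under the labeling-frequency assumption.
    q = min(1.0, (n_pos_labeled / n) / labeling_freq)
    # Standard binary Gini impurity: 2 * q * (1 - q).
    return 2.0 * q * (1.0 - q)
```

A PU random forest would use an estimate of this kind in place of the fully supervised Gini when scoring candidate splits, which is what lets each tree be grown from positive and unlabeled samples only.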





Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Li, C., Hua, XL. (2014). Towards Positive Unlabeled Learning for Parallel Data Mining: A Random Forest Framework. In: Luo, X., Yu, J.X., Li, Z. (eds) Advanced Data Mining and Applications. ADMA 2014. Lecture Notes in Computer Science, vol 8933. Springer, Cham. https://doi.org/10.1007/978-3-319-14717-8_45


  • DOI: https://doi.org/10.1007/978-3-319-14717-8_45

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-14716-1

  • Online ISBN: 978-3-319-14717-8

  • eBook Packages: Computer Science (R0)
