
Towards Positive Unlabeled Learning for Parallel Data Mining: A Random Forest Framework

  • Conference paper
Advanced Data Mining and Applications (ADMA 2014)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 8933)


Abstract

Parallel computing techniques can greatly accelerate traditional data mining algorithms, allowing them to efficiently tackle learning tasks characterized by high computational complexity and huge amounts of data, as real-world applications require. However, most of these techniques require fully labeled training sets, which is a challenging requirement to meet. To address this problem, we investigate widely used Positive and Unlabeled (PU) learning algorithms, including PU information gain and a newly developed PU Gini index, combined with the popular parallel computing framework Random Forest (RF), thereby enabling parallel data mining to learn from only positive and unlabeled samples. The proposed framework, termed PURF (Positive Unlabeled Random Forest), learns from positive and unlabeled instances and, in experiments on both synthetic and real-world UCI datasets, achieves classification performance comparable to that of RF trained on fully labeled data. PURF is a promising framework that facilitates PU learning in parallel data mining and is anticipated to be useful in many real-world parallel computing applications with large amounts of unlabeled data.
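The paper's exact PU Gini index is not reproduced in this preview, but the general idea behind such criteria can be sketched. Under the common "selected completely at random" assumption from the PU-learning literature (Elkan and Noto), each true positive appears labeled with some known labeling frequency, so a node's true positive fraction can be estimated from its labeled-positive fraction and plugged into the standard binary Gini formula. The function name `pu_gini` and the `labeling_freq` parameter below are illustrative assumptions, not the authors' definitions:

```python
def pu_gini(n_pos_labeled: int, n_unlabeled: int, labeling_freq: float) -> float:
    """Estimate the Gini impurity of a tree node from PU counts.

    Assumes each true positive is labeled with probability
    `labeling_freq` (the "selected completely at random" setting),
    so the true positive fraction is estimated as the labeled-positive
    fraction divided by `labeling_freq`, clipped to [0, 1].
    """
    n = n_pos_labeled + n_unlabeled
    if n == 0:
        return 0.0
    # Estimated P(y = +1) in this node under the labeling-frequency assumption.
    q = min(1.0, (n_pos_labeled / n) / labeling_freq)
    # Standard binary Gini impurity: 2 * q * (1 - q).
    return 2.0 * q * (1.0 - q)
```

A PU random forest would use an estimate of this kind in place of the fully supervised Gini when scoring candidate splits, which is what lets each tree be grown from positive and unlabeled samples only.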





Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Li, C., Hua, XL. (2014). Towards Positive Unlabeled Learning for Parallel Data Mining: A Random Forest Framework. In: Luo, X., Yu, J.X., Li, Z. (eds) Advanced Data Mining and Applications. ADMA 2014. Lecture Notes in Computer Science, vol 8933. Springer, Cham. https://doi.org/10.1007/978-3-319-14717-8_45


  • DOI: https://doi.org/10.1007/978-3-319-14717-8_45

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-14716-1

  • Online ISBN: 978-3-319-14717-8

  • eBook Packages: Computer Science (R0)
