
Multiple Bayesian discriminant functions for high-dimensional massive data classification


Abstract

Complex distributions of samples concealed in high-dimensional, massive sample-size data challenge current classification methods for data mining. Samples within a class usually do not uniformly fill a single (sub)space but are instead concentrated in different regions of diverse feature subspaces, revealing class dispersion. Applied to such complex data, current classifiers inherently suffer from either high complexity or weak classification ability, owing to the imbalance between the flexibility and the generalization ability of their discriminant functions. To address this concern, we propose a novel representation of discriminant functions in Bayesian inference, which allows multiple Bayesian decision boundaries per class, each in its own subspace. For this purpose, we design a learning algorithm that incorporates the naive Bayes and feature weighting approaches into structural risk minimization to learn multiple Bayesian discriminant functions for each class, thus combining the simplicity and effectiveness of naive Bayes with the benefits of feature weighting in handling high-dimensional data. The proposed learning scheme affords a recursive algorithm for exploring the class density distribution for Bayesian estimation, and an automated approach for selecting powerful discriminant functions while keeping the complexity of the classifier low. Experimental results on real-world data comprising millions of samples and features demonstrate the promising performance of our approach.
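
As a concrete illustration of the idea of multiple subspace-specific discriminant functions, here is a minimal Python sketch (not the algorithm proposed in the paper): each class owns one or more feature-weighted Gaussian naive Bayes discriminant functions, and a sample is labeled with the class owning the highest-scoring function. The toy parameters and fixed weights below are assumptions made for the example; in the paper, both the weights and the number of functions per class are learned via structural risk minimization.

    import numpy as np

    def discriminant(x, log_prior, mu, var, w):
        # g(x) = log prior + sum_d w_d * log N(x_d; mu_d, var_d).
        # A weight w_d near 0 effectively drops feature d, so each
        # discriminant function lives in its own "soft" feature subspace.
        log_pdf = -0.5 * (np.log(2.0 * np.pi * var) + (x - mu) ** 2 / var)
        return log_prior + w @ log_pdf

    def predict(x, components):
        # components: list of (class_label, log_prior, mu, var, w) tuples.
        # A class may appear several times, once per discriminant function;
        # the class owning the highest-scoring function wins.
        best_score, best_label = max(
            (discriminant(x, lp, mu, var, w), label)
            for label, lp, mu, var, w in components)
        return best_label

    # Toy usage: class "A" holds two functions in different 2-feature
    # subspaces of a 4-feature space; class "B" holds one full-space function.
    components = [
        ("A", np.log(0.25), np.zeros(4), np.ones(4),
         np.array([1.0, 1.0, 0.0, 0.0])),
        ("A", np.log(0.25), 3.0 * np.ones(4), np.ones(4),
         np.array([0.0, 0.0, 1.0, 1.0])),
        ("B", np.log(0.50), -2.0 * np.ones(4), np.ones(4), np.ones(4)),
    ]
    print(predict(np.array([0.1, -0.2, 5.0, 7.0]), components))  # -> "A"

Zeroing a weight removes the corresponding feature from that function's score, which is how each discriminant function comes to operate in its own feature subspace while the class as a whole can cover several such regions.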


Notes

  1. They are available at: McAfee http://www.mcafee.com/ca/mcafee-labs.aspx; Symantec http://www.symantec.com/security_response/publications/threatreport.jsp; Panda http://www.pandasecurity.com/mediacenter/reports/.

  2. The terms “feature”, “attribute” and “dimension” are used interchangeably throughout the paper.

  3. We borrow the term “class dispersion” from Vilalta and Rish (2003); the dispersion in our case is much more complicated, however, because it is subspace-related.

  4. http://datam.i2r.a-star.edu.sg/datasets/krbd/OvarianCancer/OvarianCancer-NCI-PBSII.html.

  5. http://qwone.com/~jason/20Newsgroups.

  6. http://vxheaven.org.

  7. http://info.usherbrooke.ca/Prospectus/Members/JZhang/project.

References

  • Aggarwal CC, Han J, Wang J, Yu PS (2004) A framework for projected clustering of high dimensional data streams. In: Proceedings of the thirtieth international conference on very large data bases (VLDB), pp 852–863

  • Aha D, Kibler D (1991) Instance-based learning algorithms. Mach Learn 6:37–66

  • Albert MK, Aha DW (1991) Analyses of instance-based learning algorithms. In: Proceedings of the ninth national conference on artificial intelligence (AAAI), pp 553–558

  • Atashpaz-Gargari E, Sima C, Braga-Neto UM, Dougherty ER (2013) Relationship between the accuracy of classifier error estimation and complexity of decision boundary. Pattern Recognit 46(5):1315–1322

  • Bengio Y, Bengio S (1999) Modeling high-dimensional discrete data with multi-layer neural networks. In: Proceedings of the twelfth annual conference on neural information processing systems (NIPS), pp 400–406

  • Boyd S, Vandenberghe L (2004) Convex optimization. Cambridge University Press, Cambridge

  • Breiman L (2001) Random forests. Mach Learn 45(1):5–32

  • Chang CC, Lin CJ (2011) LIBSVM: a library for support vector machines. ACM Trans Intell Syst Technol 2:27

  • Chen L, Wang S (2012a) Automated feature weighting in naive Bayes for high-dimensional data classification. In: Proceedings of the twenty-first ACM international conference on information and knowledge management (CIKM), pp 1243–1252

  • Chen L, Wang S (2012b) Semi-naive Bayesian classification by weighted kernel density estimation. In: Proceedings of the eighth international conference on advanced data mining and applications (ADMA), pp 260–270

  • Chen L, Jiang Q, Wang S (2012) Model-based method for projective clustering. IEEE Trans Knowl Data Eng 24(7):1291–1305

  • Dahl GE, Stokes JW, Deng L, Yu D (2013) Large-scale malware classification using random projections and neural networks. In: Proceedings of the IEEE international conference on acoustics, speech and signal processing (ICASSP), pp 3422–3426

  • Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc Ser B 39(1):1–38

  • Domeniconi C, Gunopulos D, Ma S, Yan B, Al-Razgan M, Papadopoulos D (2007) Locally adaptive metrics for clustering high dimensional data. Data Min Knowl Discov 14(1):63–97

  • Domingos P, Pazzani M (1997) On the optimality of the simple Bayesian classifier under zero-one loss. Mach Learn 29(2–3):103–130

  • Fan W, Bifet A (2013) Mining big data: current status, and forecast to the future. ACM SIGKDD Explor Newslett 14(2):1–5. doi:10.1145/2481244.2481246

  • Fawcett T (2006) An introduction to ROC analysis. Pattern Recognit Lett 27(8):861–874

  • Fayyad UM, Irani KB (1993) Multi-interval discretization of continuous-valued attributes for classification learning. In: Proceedings of the thirteenth international joint conference on artificial intelligence (IJCAI), pp 1022–1029

  • Fortuny EJ, Martens D, Provost F (2013) Predictive modeling with big data: is bigger really better? Big Data 1(4):215–226

  • Frank E, Hall M, Pfahringer B (2003) Locally weighted naive Bayes. In: Proceedings of the nineteenth annual conference on uncertainty in artificial intelligence (UAI), pp 249–256

  • Friedman JH (1997) On bias, variance, 0/1-loss, and the curse-of-dimensionality. Data Min Knowl Discov 1(1):55–77

  • Gopal S, Yang Y, Bai B, Niculescu-Mizil A (2012) Bayesian models for large-scale hierarchical classification. In: Proceedings of the twenty-sixth annual conference on neural information processing systems (NIPS), pp 2420–2428

  • Han EHS, Karypis G (2000) Centroid-based document classification: analysis and experimental results. In: Proceedings of the fourth European conference on principles of knowledge discovery and data mining, pp 424–431

  • Hinneburg A, Aggarwal CC, Keim DA (2000) What is the nearest neighbor in high dimensional spaces? In: Proceedings of the twenty-sixth international conference on very large data bases (VLDB), pp 506–515

  • Hsu CN, Huang HJ, Wong TT (2003) Implications of the Dirichlet assumption for discretization of continuous variables in naive Bayesian classifiers. Mach Learn 53(3):235–263

  • Ifrim G, Bakır G, Weikum G (2008) Fast logistic regression for text categorization with variable-length n-grams. In: Proceedings of the fourteenth ACM SIGKDD international conference on knowledge discovery and data mining (KDD), pp 354–362

  • Indyk P, Motwani R (1998) Approximate nearest neighbors: towards removing the curse of dimensionality. In: Proceedings of the thirtieth annual ACM symposium on theory of computing (STOC), pp 604–613

  • Jing L, Ng MK, Huang JZ (2007) An entropy weighting k-means algorithm for subspace clustering of high-dimensional sparse data. IEEE Trans Knowl Data Eng 19(8):1026–1041

  • Joachims T (1996) A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization. Technical report, DTIC Document

  • Joachims T (1998) Text categorization with support vector machines: learning with many relevant features. Springer, Berlin

  • John GH, Langley P (1995) Estimating continuous distributions in Bayesian classifiers. In: Proceedings of the eleventh conference on uncertainty in artificial intelligence (UAI), pp 338–345

  • Kang DK, Silvescu A, Honavar V (2006) RNBL-MN: a recursive naive Bayes learner for sequence classification. In: Proceedings of the eighteenth Pacific-Asia conference on knowledge discovery and data mining (PAKDD), pp 45–54

  • Kim J, Le DX, Thoma GR (2008) Naive Bayes classifier for extracting bibliographic information from biomedical online articles. In: Proceedings of the international conference on data mining (DMIN), pp 373–378

  • Kohavi R, Langley P, Yun Y (1997) The utility of feature weighting in nearest-neighbor algorithms. In: Proceedings of the ninth European conference on machine learning (ECML), pp 85–92

  • Kooij AJ et al (2007) Regularization with ridge penalties, the lasso, and the elastic net for regression with optimal scaling transformations. In: Prediction accuracy and stability of regression with optimal scaling transformations, chapter 4, pp 65–90

  • Lee CH, Gutierrez F, Dou D (2011) Calculating feature weights in naive Bayes with Kullback–Leibler measure. In: Proceedings of the eleventh IEEE international conference on data mining (ICDM), pp 1146–1151

  • Li P, Shrivastava A, Moore JL, König AC (2011) Hashing algorithms for large-scale learning. In: Proceedings of the twenty-fifth annual conference on neural information processing systems (NIPS), pp 2672–2680

  • Lin CJ, Weng RC, Keerthi SS (2007) Trust region Newton methods for large-scale logistic regression. In: Proceedings of the twenty-fourth international conference on machine learning (ICML), pp 561–568

  • Lin G, Shen C, Shi Q, van den Hengel A, Suter D (2014) Fast supervised hashing with decision trees for high-dimensional data. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pp 1963–1970

  • Liu S, Trenkler G (2008) Hadamard, Khatri–Rao, Kronecker and other matrix products. Int J Inf Syst Sci 4(1):160–177

  • Manevitz LM, Yousef M (2002) One-class SVMs for document classification. J Mach Learn Res 2:139–154

  • Marchiori E (2013) Class dependent feature weighting and k-nearest neighbor classification. In: Proceedings of the eighth IAPR international conference on pattern recognition in bioinformatics (PRIB), pp 69–78

  • Martens D, Provost F (2011) Pseudo-social network targeting from consumer transaction data. Faculty of Applied Economics, University of Antwerp, Belgium

  • Martínez AM, Kak AC (2001) PCA versus LDA. IEEE Trans Pattern Anal Mach Intell 23(2):228–233

  • Masud MM, Khan L, Thuraisingham B (2008) A scalable multi-level feature extraction technique to detect malicious executables. Inf Syst Front 10(1):33–45

  • Nakajima S, Watanabe S (2005) Generalization error of linear neural networks in an empirical Bayes approach. In: Proceedings of the nineteenth international joint conference on artificial intelligence (IJCAI), pp 804–810

  • Navon IM, Phua PK, Ramamurthy M (1988) Vectorization of conjugate-gradient methods for large-scale minimization. In: Proceedings of the second ACM/IEEE conference on supercomputing, pp 410–418

  • Parsons L, Haque E, Liu H (2004) Subspace clustering for high dimensional data: a review. ACM SIGKDD Explor Newslett 6(1):90–105

  • Rani P, Pudi V (2008) RBNBC: repeat-based naive Bayes classifier for biological sequences. In: Proceedings of the eighth IEEE international conference on data mining (ICDM), pp 989–994

  • Rao J, Wu C (2010) Bayesian pseudo-empirical-likelihood intervals for complex surveys. J R Stat Soc Ser B 72(4):533–544

  • Seeger M (2006) Bayesian modelling in machine learning: a tutorial review. Technical report

  • Shabtai A, Moskovitch R, Elovici Y, Glezer C (2009) Detection of malicious code by applying machine learning classifiers on static features: a state-of-the-art survey. Inf Secur Tech Rep 14(1):16–29

  • Straub WO (2009) A brief look at Gaussian integrals. Pasadena, CA. http://www.weylmann.com/gaussian.pdf

  • Su J, Shirab JS, Matwin S (2011) Large scale text classification using semisupervised multinomial naive Bayes. In: Proceedings of the twenty-eighth international conference on machine learning (ICML), pp 97–104

  • Tan M, Wang L, Tsang IW (2010) Learning sparse SVM for feature selection on very high dimensional datasets. In: Proceedings of the twenty-seventh international conference on machine learning (ICML), pp 1047–1054

  • Tan M, Tsang IW, Wang L (2014) Towards ultrahigh dimensional feature selection for big data. J Mach Learn Res 15(1):1371–1429

  • Tan S (2008) An improved centroid classifier for text categorization. Expert Syst Appl 35(1–2):279–285

  • Teh YW, Jordan MI, Beal MJ, Blei DM (2006) Hierarchical Dirichlet processes. J Am Stat Assoc 101(476):1566–1581

  • Vapnik VN (1992) Principles of risk minimization for learning theory. In: Proceedings of the fifth annual conference on neural information processing systems (NIPS), pp 831–838

  • Vapnik VN (1999) An overview of statistical learning theory. IEEE Trans Neural Netw Learn Syst 10(5):988–999

  • Vapnik VN (2000) The nature of statistical learning theory. Springer, Berlin

  • Vapnik VN, Levin E, Le Cun Y (1994) Measuring the VC-dimension of a learning machine. Neural Comput 6(5):851–876

  • Verweij PJ, Van Houwelingen HC (1994) Penalized likelihood in Cox regression. Stat Med 13(23–24):2427–2436

  • Vilalta R, Rish I (2003) A decomposition of classes via clustering to explain and improve naive Bayes. In: Proceedings of the fourteenth European conference on machine learning (ECML), pp 444–455

  • Vilalta R, Achari MK, Eick CF (2003) Class decomposition via clustering: a new framework for low-variance classifiers. In: Proceedings of the third IEEE international conference on data mining (ICDM), pp 673–676

  • Weinberger K, Dasgupta A, Langford J, Smola A, Attenberg J (2009) Feature hashing for large scale multitask learning. In: Proceedings of the twenty-sixth annual international conference on machine learning (ICML), pp 1113–1120

  • Xu B, Huang JZ, Williams G, Wang Q, Ye Y (2012) Classifying very high-dimensional data with random forests built from small subspaces. Int J Data Warehous Min 8(2):44–63

  • Yu S, Fung G, Rosales R, Krishnan S, Rao RB, Dehing-Oberije C, Lambin P (2008) Privacy-preserving Cox regression for survival analysis. In: Proceedings of the fourteenth ACM SIGKDD international conference on knowledge discovery and data mining (KDD), pp 1034–1042

  • Yuan YC (2010) Multiple imputation for missing data: concepts and new development (version 9.0). SAS Institute Inc, Rockville

  • Zaidi NA, Cerquides J, Carman MJ, Webb GI (2013) Alleviating naive Bayes attribute independence assumption by attribute weighting. J Mach Learn Res 14(1):1947–1988

  • Zhang J, Chen L, Guo G (2013) Projected-prototype based classifier for text categorization. Knowl Based Syst 49:179–189

  • Zhou Z, Chen Z (2002) Hybrid decision tree. Knowl Based Syst 15(8):515–528

Acknowledgments

We would like to thank Carol Harris for significantly improving this paper. This work was supported by funding from the Natural Sciences and Engineering Research Council of Canada (NSERC) to Shengrui Wang under Grant No. 396097-2010, and from the Natural Science Foundation of Fujian Province of China to Lifei Chen under Grant No. 2015J01238. We also thank UPMC (Université Pierre et Marie Curie) for financial support that enabled the collaboration between Shengrui Wang and Patrick Gallinari. Shengrui Wang was partly supported by the Natural Science Foundation of China (NSFC) under Grant No. 61170130.

Author information


Corresponding author

Correspondence to Shengrui Wang.

Additional information

Responsible editor: Kristian Kersting.


About this article

Cite this article

Zhang, J., Wang, S., Chen, L. et al. Multiple Bayesian discriminant functions for high-dimensional massive data classification. Data Min Knowl Disc 31, 465–501 (2017). https://doi.org/10.1007/s10618-016-0481-y

