Abstract
Data clustering is a fundamental unsupervised learning task in several domains such as data mining, computer vision, information retrieval, and pattern recognition. In this paper, we propose and analyze a new clustering approach based on both hierarchical Dirichlet processes and the generalized Dirichlet distribution, which leads to an interesting statistical framework for data analysis and modelling. Our approach can be viewed as a hierarchical extension of the infinite generalized Dirichlet mixture model previously proposed in Bouguila and Ziou (IEEE Trans Neural Netw 21(1):107–122, 2010). The proposed clustering approach tackles the problem of modelling grouped data, where observations are organized into groups that we allow to remain statistically linked by sharing mixture components. The resulting clustering model is learned using a principled variational Bayes inference algorithm that we have developed. Extensive experiments and simulations, based on two challenging applications, namely image categorization and web service intrusion detection, demonstrate our model's usefulness and merits.
Notes
PCA-SIFT: http://www.cs.cmu.edu/~yke/pcasift.
Available at: http://www.robots.ox.ac.uk/~vgg/data/pets/.
References
Agarwal S, Roth D (2002) Learning a sparse representation for object detection. In: Heyden A, Sparr G, Nielsen M, Johansen P (eds) ECCV (4), Lecture notes in computer science vol 2353. Springer, Berlin, Heidelberg, pp 113–130
Attias H (1999) A variational Bayes framework for graphical models. In: Proceedings of advances in neural information processing systems (NIPS), pp 209–215
Banerjee A, Merugu S, Dhillon IS, Ghosh J (2004) Clustering with Bregman divergences. In: Proceedings of the 4th SIAM international conference on data mining (SDM), pp 234–245
Bishop CM (2006) Pattern recognition and machine learning. Springer, New York
Blei DM, Jordan MI (2005) Variational inference for Dirichlet process mixtures. Bayesian Anal 1:121–144
Bouguila N, Ziou D (2005) Using unsupervised learning of a finite Dirichlet mixture model to improve pattern recognition applications. Pattern Recognit Lett 26(12):1916–1925
Bouguila N, Ziou D (2006) A hybrid SEM algorithm for high-dimensional unsupervised learning using a finite generalized Dirichlet mixture. IEEE Trans Image Process 15(9):2657–2668
Bouguila N, Ziou D (2007) High-dimensional unsupervised selection and estimation of a finite generalized Dirichlet mixture model based on minimum message length. IEEE Trans Pattern Anal Mach Intell 29(10):1716–1731
Bouguila N, Ziou D (2010) A Dirichlet process mixture of generalized Dirichlet distributions for proportional data modeling. IEEE Trans Neural Netw 21(1):107–122
Boutemedjet S, Bouguila N, Ziou D (2009) A hybrid feature extraction selection approach for high-dimensional non-Gaussian data clustering. IEEE Trans Pattern Anal Mach Intell 31(8):1429–1443
Corona I, Giacinto G (2010) Detection of server-side web attacks. In: Diethe T, Cristianini N, Shawe-Taylor J (eds) JMLR Proceedings, WAPA, vol 11, JMLR.org, pp 160–166
Dagdee N, Thakar U (2008) Intrusion attack pattern analysis and signature extraction for web services using honeypots. In: Proceedings of the first international conference on emerging trends in engineering and technology (ICETET), pp 1232–1237
Desmet L, Jacobs B, Piessens F, Joosen W (2005) Threat modelling for web services based web applications. In: Chadwick D, Preneel B (eds) Communications and multimedia security, vol 175. IFIP The International Federation for Information Processing. Springer, US, pp 131–144
Fan W, Bouguila N, Ziou D (2011) Unsupervised anomaly intrusion detection via localized Bayesian feature selection. In: Proceedings of the IEEE international conference on data mining (ICDM), pp 1032–1037
Fan W, Bouguila N (2013) Variational learning of a Dirichlet process of generalized Dirichlet distributions for simultaneous clustering and feature selection. Pattern Recognit 46(10):2754–2769
Fan W, Bouguila N, Ziou D (2013) Unsupervised hybrid feature extraction selection for high-dimensional non-Gaussian data clustering with variational inference. IEEE Trans Knowl Data Eng 25(7):1670–1685
Ferguson TS (1983) Bayesian density estimation by mixtures of normal distributions. Recent Adv Stat 24:287–302
Gruschka N, Luttenberger N (2006) Protecting web services from DoS attacks by SOAP message validation. In: Fischer-Hübner S, Rannenberg K, Yngström L, Lindskog S (eds) Security and privacy in dynamic environments, vol 201. IFIP International Federation for Information Processing. Springer, US, pp 171–182
Horng S-J, Su M-Y, Chen Y-H, Kao T-W, Chen R-J, Lai J-L, Perkasa CD (2011) A novel intrusion detection system based on hierarchical clustering and support vector machines. Expert Syst Appl 38(1):306–313
Ishwaran H, James LF (2001) Gibbs sampling methods for stick-breaking priors. J Am Stat Assoc 96:161–173
Jain AK, Topchy A, Law MHC, Buhmann JM (2004) Landscape of clustering algorithms. In: Proceedings of the 17th international conference on pattern recognition (ICPR), vol 1. pp 260–263
Jensen M, Gruschka N, Herkenhöner R (2009) A survey of attacks on web services. Comput Sci Res Dev 24(4):185–197
Jensen M, Gruschka N, Herkenhöner R, Luttenberger N (2007) SOA and web services: new technologies, new standards - new attacks. In: Proceedings of the fifth European conference on web services (ECOWS), pp 35–44
Kahn JM (2004) A generative Bayesian model for aggregating experts’ probabilities. In: Proceedings of the 20th conference in uncertainty in artificial intelligence (UAI), AUAI Press, pp 301–308
Ke Y, Sukthankar R (2004) PCA-SIFT: A more distinctive representation for local image descriptors. In: Proceedings of IEEE conference on computer vision and pattern recognition (CVPR), pp 506–513
Khan L, Awad M, Thuraisingham B (2007) A new intrusion detection system using support vector machines and hierarchical clustering. VLDB J 16(4):507–521
Kirchner M (2010) A framework for detecting anomalies in HTTP traffic using instance-based learning and k-nearest neighbor classification. In: Proceedings of the 2nd international workshop on security and communication networks (IWSCN), pp 1–8
Korwar RM, Hollander M (1973) Contributions to the theory of Dirichlet processes. Ann Probab 1:705–711
Lamdan Y, Schwartz JT, Wolfson HJ (1988) Object recognition by affine invariant matching. In: Proceedings of IEEE conference on computer vision and pattern recognition (CVPR), pp 335–344
Laskov P, Düssel P, Schäfer C, Rieck K (2005) Learning intrusion detection: supervised or unsupervised? In: Roli F, Vitulano S (eds) Image analysis and processing (ICIAP), Lecture notes in computer science vol 3617. Springer, Berlin, pp 50–57
Law MHC, Topchy AP, Jain AK (2005) Model-based clustering with probabilistic constraints. In: Proceedings of the SIAM international conference on data mining (SDM), pp 641–645
Lazebnik S, Schmid C, Ponce J (2004) Semi-local affine parts for object recognition. In: Proceedings of the British machine vision conference (BMVC), BMVA Press, pp 1–10
Li B, Zhong R-T, Wang X-J, Zhuang Z-Q (2006) Continuous optimization based-on boosting Gaussian mixture model. In: Proceedings of the 18th international conference on pattern recognition (ICPR), vol 1. pp 1192–1195
Lowd D, Meek C (2005) Adversarial learning. In: Proceedings of the Eleventh ACM SIGKDD international conference on knowledge discovery and data mining (KDD), pp 641–647
Lu Q, Yao X (2005) Clustering and learning Gaussian distribution for continuous optimization. IEEE Trans Syst Man Cybern Part C Appl Rev 35(2):195–204
Matas J, Koubaroulis D, Kittler J (2002) The multimodal neighborhood signature for modeling object color appearance and applications in object recognition and image retrieval. Comput Vis Image Underst 88(1):1–23
McLachlan GJ, Peel D (2000) Finite mixture models. Wiley, New York
Mehdi M, Bouguila N, Bentahar J (2012) Trustworthy web service selection using probabilistic models. In: Proceedings of the IEEE 19th international conference on web services (ICWS), pp 17–24
Mikolajczyk K, Schmid C (2004) Scale and affine invariant interest point detectors. Int J Comput Vis 60:63–86
Northcutt S, Novak J (2002) Network intrusion detection: an analyst’s handbook. New Riders Publishing, UK
Parkhi OM, Vedaldi A, Zisserman A, Jawahar CV (2013) Cats and dogs. In: Proceedings of IEEE conference on computer vision and pattern recognition (CVPR), pp 3498–3505
Patcha A, Park J-M (2007) An overview of anomaly detection techniques: existing solutions and latest technological trends. Comput Netw 51(12):3448–3470
Pearce C, Bertok P, Schyndel R (2005) Protecting consumer data in composite web services. In: Sasaki R, Qing S, Okamoto E, Yoshiura H (eds) Security and privacy in the age of ubiquitous computing, vol 181. IFIP Advances in Information and Communication Technology. Springer, US, pp 19–34
Pereira H, Jamhour E (2013) A clustering-based method for intrusion detection in web servers. In: Proceedings of the 20th international conference on telecommunications (ICT), pp 1–5
Pinzón C, Paz JF, Zato C, Pérez J (2010) Protecting web services against DoS attacks: a case-based reasoning approach. In: Romay M, Corchado E, Garcia Sebastian MT (eds) Hybrid artificial intelligence systems, Lecture notes in computer science, vol 6076. Springer, Berlin, pp 229–236
Rasiwasia N, Vasconcelos N (2008) Scene classification with low-dimensional semantic spaces and weak supervision. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pp 1–6
Sethuraman J (1994) A constructive definition of Dirichlet priors. Statistica Sin 4:639–650
Shoham S, Fellows MR, Normann RA (2003) Robust, automatic spike sorting using mixtures of multivariate t-distributions. J Neurosci Methods 127(2):111–122
Teh Y-W, Jordan MI, Beal MJ, Blei DM (2006) Hierarchical Dirichlet processes. J Am Stat Assoc 101(476):1566–1581
Teh YW, Jordan MI (2010) Hierarchical Bayesian nonparametric models with applications. In: Hjort N, Holmes C, Müller P, Walker S (eds) Bayesian nonparametrics: principles and practice. Cambridge University Press, London
Tsai C-F, Hsu Y-F, Lin C-Y, Lin W-Y (2009) Review: intrusion detection by machine learning: a review. Expert Syst Appl 36(10):11994–12000
Wang C, Paisley JW, Blei DM (2011) Online variational inference for the hierarchical Dirichlet process. J Mach Learn Res Proc Track 15:752–760
Xiang S, Nie F, Zhang C (2008) Learning a Mahalanobis distance metric for data clustering and classification. Pattern Recognit 41(12):3600–3612
Yamanishi K, Takeuchi J-I, Williams GJ, Milne P (2004) On-line unsupervised outlier detection using finite mixtures with discounting learning algorithms. Data Min Knowl Discov 8(3):275–300
Yee CG, Shin WH, Rao G (2007) An adaptive intrusion detection and prevention (ID/IP) framework for web services. In: Proceedings of the international conference on convergence information technology (ICCIT), pp 528–534
Zanero S, Savaresi SM (2004) Unsupervised learning techniques for an intrusion detection system. In: Proceedings of the ACM symposium on applied computing (SAC), ACM, pp 412–419
Zhou CV, Leckie C, Karunasekera S (2010) A survey of coordinated attacks and collaborative intrusion detection. Comput Secur 29(1):124–140
Zolotukhin M, Hamalainen T (2013) Detection of anomalous HTTP requests based on advanced n-gram model and clustering techniques. In: Balandin S, Andreev S, Koucheryavy Y (eds) Internet of things, smart spaces, and next generation networking, Lecture notes in computer science, vol 8121. Springer, Berlin, pp 371–382
Zolotukhin M, Hamalainen T, Juvonen A (2013) Growing hierarchical self-organizing maps and statistical distribution models for online detection of web attacks. In: Cordeiro J, Krempels KH (eds) Web information systems and technologies, Lecture notes in business information processing vol 140. Springer, Berlin, pp 281–295
Acknowledgments
The second author would like to thank King Abdulaziz City for Science and Technology (KACST), Kingdom of Saudi Arabia, for their funding support under grant number 11-INF1787-08. The authors would like to thank the anonymous referees and the associate editor for their comments.
Additional information
Communicated by V. Loia.
Cite this article
Fan, W., Sallay, H., Bouguila, N. et al. Variational learning of hierarchical infinite generalized Dirichlet mixture models and applications. Soft Comput 20, 979–990 (2016). https://doi.org/10.1007/s00500-014-1557-5
DOI: https://doi.org/10.1007/s00500-014-1557-5