Abstract
This paper proposes an unsupervised clustering technique for data classification based on the K-means algorithm. The K-means algorithm is well known for its simplicity and low time complexity. However, it has three main drawbacks: dependency on the initial centroids, dependency on the number of clusters, and degeneracy. Our solution addresses all three issues by automatically detecting a semi-optimal number of clusters according to the statistical nature of the data. As a side effect, the choice of initial centroid seeds becomes non-critical to the clustering results. The experimental results show the robustness of the proposed Y-means algorithm, as well as its good performance against a set of other well-known unsupervised clustering techniques. Furthermore, we study the performance of our proposed solution with different distance and outlier-detection functions and recommend the best combinations.
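For context, the drawbacks named above can be seen in a minimal sketch of standard K-means. This is not the Y-means method, whose details the abstract does not give; it is only the textbook baseline. The function name `kmeans`, the toy `points`, and the seed values are illustrative assumptions. The caller must supply both the number of clusters (implicitly, through the number of seeds) and the seeds themselves, and a poorly placed seed can leave a cluster empty, which is the degeneracy case.

```python
import math

def kmeans(points, seeds, max_iter=100):
    """Plain K-means (Lloyd iteration) on tuples of floats.

    Returns (centroids, labels, empty), where `empty` lists the indices
    of clusters that ended up with no members (degenerate clusters).
    """
    centroids = [tuple(s) for s in seeds]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid (Euclidean).
        labels = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                mean = tuple(sum(axis) / len(members) for axis in zip(*members))
                new_centroids.append(mean)
            else:
                # Empty cluster: plain K-means has no remedy, so keep it as is.
                new_centroids.append(old)
        if new_centroids == centroids:
            break
        centroids = new_centroids
    empty = [j for j in range(len(centroids)) if j not in labels]
    return centroids, labels, empty

# Toy data (an assumption for illustration): two well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11)]

# k = 2 with one seed in each group recovers both groups.
good_centroids, good_labels, good_empty = kmeans(points, [(0, 0), (10, 10)])

# k = 3 with one seed far from all data leaves the third cluster empty.
bad_centroids, bad_labels, bad_empty = kmeans(
    points, [(0, 0), (10, 10), (100, 100)]
)
```

In the second call the third seed never attracts a point, so the result carries a cluster that describes no data. An approach that adjusts the number of clusters from the data, as the abstract describes, is aimed at exactly this kind of outcome.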
Copyright information
© 2010 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Ghorbani, A.A., Onut, IV. (2010). Y-Means: An Autonomous Clustering Algorithm. In: Graña Romay, M., Corchado, E., Garcia Sebastian, M.T. (eds) Hybrid Artificial Intelligence Systems. HAIS 2010. Lecture Notes in Computer Science(), vol 6076. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-13769-3_1
DOI: https://doi.org/10.1007/978-3-642-13769-3_1
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-13768-6
Online ISBN: 978-3-642-13769-3
eBook Packages: Computer Science, Computer Science (R0)