Abstract
This paper presents TOBAE, a novel density-based agglomerative clustering algorithm that is parameter-free and filters noise automatically. It finds the appropriate number of clusters while maintaining a competitive running time. TOBAE works by tracking the cumulative density distribution of the data points on a grid and requires only the original data set as input. The clustering problem is solved by automatically finding the optimal density threshold for the clusters. The algorithm is applicable to any N-dimensional data set, which makes it highly relevant to real-world scenarios. It improves on state-of-the-art clustering algorithms through the additional feature of automatic noise filtering around clusters. The concept behind the algorithm is explained using the analogy of puddles ('tobae'), which inspired it. The paper provides a detailed description of TOBAE along with an analysis of its time and space complexity. Experimental results on known data sets show how TOBAE competes with the best algorithms in the field while providing its own set of advantages.
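The abstract does not spell out the algorithm's steps, so the following is only a minimal sketch of the general idea of grid-based density clustering, not the TOBAE procedure itself. It bins points into grid cells, keeps the cells whose point count reaches a threshold, merges adjacent dense cells into clusters, and labels points in sparse cells as noise. The function name `grid_density_clusters` and the parameters `cell_size` and `min_count` are assumptions made for illustration; TOBAE is described as finding its density threshold automatically, whereas this sketch takes it as an input.

```python
from collections import Counter, deque
from itertools import product


def grid_density_clusters(points, cell_size, min_count):
    """Label points by connected groups of dense grid cells; -1 marks noise.

    `cell_size` and `min_count` are hypothetical inputs for this sketch;
    TOBAE itself is described as choosing its density threshold automatically.
    """
    # Map every point to the integer coordinates of its grid cell.
    cells = [tuple(int(c // cell_size) for c in p) for p in points]
    density = Counter(cells)
    # A cell is "dense" when it holds at least `min_count` points.
    dense = {cell for cell, n in density.items() if n >= min_count}

    dims = len(points[0]) if points else 0
    # All neighbour offsets in N dimensions, excluding the zero offset.
    offsets = [o for o in product((-1, 0, 1), repeat=dims) if any(o)]

    cell_label = {}
    next_label = 0
    for start in sorted(dense):
        if start in cell_label:
            continue
        # Breadth-first flood fill over adjacent dense cells.
        cell_label[start] = next_label
        queue = deque([start])
        while queue:
            cur = queue.popleft()
            for off in offsets:
                nb = tuple(a + b for a, b in zip(cur, off))
                if nb in dense and nb not in cell_label:
                    cell_label[nb] = next_label
                    queue.append(nb)
        next_label += 1

    # Points in sparse cells receive the noise label -1.
    return [cell_label.get(c, -1) for c in cells]
```

Note that the number of neighbour offsets grows as 3^N - 1, so this naive neighbourhood scan is only practical in low dimensions; it says nothing about how TOBAE handles N-dimensional data.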
Cite this article
Khalid, S., Razzaq, S. TOBAE: A Density-based Agglomerative Clustering Algorithm. J Classif 32, 241–267 (2015). https://doi.org/10.1007/s00357-015-9166-2
DOI: https://doi.org/10.1007/s00357-015-9166-2