tutorial

Systematic Review of Clustering High-Dimensional and Large Datasets

Published: 23 January 2018

Abstract

Technological advancement has enabled us to store and process huge amounts of data in relatively short spans of time. The nature of data is also changing rapidly; in particular, data is increasingly multi- and high-dimensional. There is an immediate need to expand our focus to include the analysis of high-dimensional and large datasets. Data analysis is becoming a mammoth task because of the steady increase in data volume and in complexity, in terms of the heterogeneity of data. Because of this dynamic computing environment, existing techniques must be either modified or discarded to handle new data in multiple high dimensions. Data clustering is a tool used in many disciplines, including data mining, to extract meaningful knowledge from seemingly unstructured data. The aim of this article is to understand the problem of clustering and the various approaches that address it. The article discusses the process of clustering from both microviews (data treating) and macroviews (overall clustering process). Different distance and similarity measures, which form the cornerstone of effective data clustering, are also identified. Further, an in-depth analysis is given of different clustering approaches that focus on data mining and deal with large-scale datasets. These approaches are comprehensively compared to bring out clear differences among them. This article also surveys the problem of high-dimensional data and the existing approaches to it, which makes the survey more relevant. It also explores the latest trends in cluster analysis and real-life applications of the concept. This survey is exhaustive in that it tries to cover all aspects of clustering in the field of data mining.

References

[1]
Elke Achtert, Christian Bohm, Hans-Peter Kriegel, Peer Kroger, and Arthur Zimek. 2007b. On exploring complex relationships of correlation clusters. In Proceedings of the 19th International Conference on Scientific and Statistical Database Management (SSBDM’07). IEEE, 7--7.
[2]
Elke Achtert, Christian Böhm, Hans-Peter Kriegel, Peer Kröger, and Arthur Zimek. 2007a. Robust, complete, and efficient correlation clustering. In Proceedings of the 2007 SIAM International Conference on Data Mining. SIAM, 413--418.
[3]
Elke Achtert, Christian Bohm, Peer Kroger, and Arthur Zimek. 2006. Mining hierarchies of correlation clusters. In Proceedings of the 18th International Conference on Scientific and Statistical Database Management (SSDBM’06). IEEE, 119--128.
[4]
Charu C. Aggarwal, Jiawei Han, Jianyong Wang, and Philip S. Yu. 2003. A framework for clustering evolving data streams. In Proceedings of the 29th International Conference on Very Large Data Bases-Volume 29 (VLDB’03). 81--92.
[5]
Charu C. Aggarwal and S. Yu Philip. 2004. A condensation approach to privacy preserving data mining. In Proceedings of the International Conference on Extending Database Technology, Advances in Database Technology (EDBT’04). Springer, 183--199.
[6]
Charu C. Aggarwal, Joel L. Wolf, Philip S. Yu, Cecilia Procopiuc, and Jong Soo Park. 1999. Fast algorithms for projected clustering. In Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, Vol. 28. ACM, 61--72.
[7]
Charu C. Aggarwal and Philip S. Yu. 2000. Finding Generalized Projected Clusters in High Dimensional Spaces, Vol. 29. ACM.
[8]
Charu C. Aggarwal and Philip S. Yu. 2002. Redefining clustering for high-dimensional applications. IEEE Transactions on Knowledge and Data Engineering 14, 2 (2002), 210--225.
[9]
Charu C. Aggarwal and ChengXiang Zhai. 2012a. Mining Text Data. Springer Science 8 Business Media.
[10]
Charu C. Aggarwal and ChengXiang Zhai. 2012b. A survey of text clustering algorithms. In Mining Text Data. Springer, 77--128.
[11]
Rakesh Agrawal, Johannes Gehrke, Dimitrios Gunopulos, and Prabhakar Raghavan. 1998. Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications, Vol. 27. ACM.
[12]
Rakesh Agrawal, Johannes Ernst Gehrke, Dimitrios Gunopulos, and Prabhakar Raghavan. 1999. Automatic subspace clustering of high dimensional data for data mining applications. U.S. Patent 6,003,029, issued December 14, 1999.
[13]
Enrique Amigó, Julio Gonzalo, Javier Artiles, and Felisa Verdejo. 2009. A comparison of extrinsic clustering evaluation metrics based on formal constraints. Information Retrieval 12, 4 (2009), 461--486.
[14]
Amineh Amini, Teh Ying Wah, and Hadi Saboohi. 2014. On density-based data streams clustering algorithms: a survey. Journal of Computer Science and Technology 29, 1 (2014), 116--141.
[15]
Rajaraman Anand and D. U. Jeffrey. 2012. Mining of Massive Datasets.
[16]
S. Aranganayagi and K. Thangavel. 2007. Clustering categorical data using silhouette coefficient as a relocating measure. In Proceedings of the 2007 International Conference on Computational Intelligence and Multimedia Applications, Vol. 2. IEEE, 13--17.
[17]
Saurabh Arora and Inderveer Chana. 2014. A survey of clustering techniques for big data analysis. In Proceedings of the 2014 5th International Conference - Confluence The Next Generation Information Technology Summit (Confluence). IEEE, 59--65.
[18]
Ira Assent. 2012. Clustering high dimensional data. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 2, 4 (2012), 340--350.
[19]
Ira Assent, Ralph Krieger, Emmanuel Muller, and Thomas Seidl. 2007. DUSC: Dimensionality unbiased subspace clustering. In Proceedings of the 7th IEEE International Conference on Data Mining (ICDM’07). IEEE, 409--414.
[20]
Abdelkarim Ben Ayed, Mohamed Ben Halima, and Adel M. Alimi. 2014. Survey on clustering methods: Towards fuzzy clustering for big data. In Proceedings of the 6th International Conference of Soft Computing and Pattern Recognition (SoCPaR’14). IEEE, 331--336.
[21]
Pierre Baldi and Kurt Hornik. 1989. Neural networks and principal component analysis: Learning from examples without local minima. Neural Networks 2, 1 (1989), 53--58.
[22]
Daniel Barbará and Ping Chen. 2000. Using the fractal dimension to cluster datasets. In Proceedings of the 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 260--264.
[23]
Mikhail Belkin and Partha Niyogi. 2001. Laplacian eigenmaps and spectral techniques for embedding and clustering. In Proceedings of the 14th International Conference on Neural Information Processing Systems: Natural and Synthetic (NIPS’01), Vol. 14. 585--591.
[24]
Michael W. Berry and Malu Castellanos. 2004. Survey of text mining. Computing Reviews 45, 9 (2004), 548.
[25]
Kevin Beyer, Jonathan Goldstein, Raghu Ramakrishnan, and Uri Shaft. 1999. When is nearest neighbor meaningful? In Proceedings of the International Conference on Database Theory (ICDT’99). Springer, 217--235.
[26]
Christophe Biernacki, Gilles Celeux, and Gérard Govaert. 2000. Assessing a mixture model for clustering with the integrated completed likelihood. IEEE Transactions on Pattern Analysis and Machine Intelligence 22, 7 (2000), 719--725.
[27]
Christopher M. Bishop. 1995. Neural Networks for Pattern Recognition. Oxford University Press.
[28]
Leon Bobrowski and James C. Bezdek. 1991. c-means clustering with the l l and l norms. IEEE Transactions on Systems, Man and Cybernetics 21, 3 (1991), 545--554.
[29]
Christian Böhm, Karin Kailing, Peer Kröger, and Arthur Zimek. 2004. Computing clusters of correlation connected objects. In Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data. ACM, 455--466.
[30]
Urszula Boryczka. 2009. Finding groups in data: Cluster analysis with ants. Applied Soft Computing 9, 1 (2009), 61--70.
[31]
Olutayo Boyinbode, Hanh Le, and Makoto Takizawa. 2011. A survey on clustering algorithms for wireless sensor networks. International Journal of Space-Based and Situated Computing 1, 2--3 (2011), 130--136.
[32]
Ulrik Brandes, Marco Gaertler, and Dorothea Wagner. 2003. Experiments on graph clustering algorithms. In Proceedings of the European Symposium on Algorithms. Springer, 568--579.
[33]
Janez Brank, Marko Grobelnik, and Dunja Mladenic. 2005. A survey of ontology evaluation techniques. In Proceedings of the Conference on Data Mining and Data Warehouses (SiKDD’05). 166--170.
[34]
Ryan P. Browne, Paul D. McNicholas, and Matthew D. Sparling. 2012. Model-based learning using a mixture of mixtures of Gaussian and uniform distributions. IEEE Transactions on Pattern Analysis and Machine Intelligence 34, 4 (2012), 814--817.
[35]
Peter Brucker. 1978. On the complexity of clustering problems. In Optimization and Operations Research. Springer, 45--54.
[36]
Joachim Buhmann. 1995. Data clustering and learning. In The Handbook of Brain Theory and Neural Networks, Michael A. Arbib (Ed.). MIT Press, 278--281.
[37]
Gail A. Carpenter, Stephen Grossberg, and David B. Rosen. 1991. Fuzzy ART: Fast stable learning and categorization of analog patterns by an adaptive resonance system. Neural Networks 4, 6 (1991), 759--771.
[38]
Umit V. Catalyiirek, Kamer Kaya, Johannes Langguth, and Bora Uçar. 2013. A partitioning-based divisive clustering technique for maximizing the modularity. Graph Partitioning and Graph Clustering 588 (2013), 171.
[39]
Kaushik Chakrabarti and Sharad Mehrotra. 2000. Local dimensionality reduction: A new approach to indexing high dimensional spaces. In Proceedings of the 26th VLDB Conference. 89--100.
[40]
Asis Kumar Chattopadhyay, Tanuka Chattyopadhyay, Tuli De, and Saptarshi Mondal. 2013. Independent component analysis for dimension reduction classification: Hough transform and CASH algorithm. In Astrostatistical Challenges for the New Astronomy. Springer, 185--202.
[41]
C. L. Philip Chen and Chun-Yang Zhang. 2014. Data-intensive applications, challenges, techniques and technologies: A survey on big data. Information Sciences 275 (2014), 314--347.
[42]
Min Chen, Shiwen Mao, and Yunhao Liu. 2014. Big data: A survey. Mobile Networks and Applications 19, 2 (2014), 171--209.
[43]
Yixin Chen, Guozhu Dong, Jiawei Han, Benjamin W. Wah, and Jianyong Wang. 2002. Multi-dimensional regression analysis of time-series data streams. In Proceedings of the 28th International Conference on Very Large Data Bases. 323--334.
[44]
Chun-Hung Cheng, Ada Waichee Fu, and Yi Zhang. 1999. Entropy-based subspace clustering for mining numerical data. In Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 84--93.
[45]
Yizong Cheng and George M. Church. 2000. Biclustering of expression data. ISMB 8 (2000), 93--103.
[46]
Vladimir Cherkassky and Filip M. Mulier. 2007. Learning from Data: Concepts, Theory, and Methods. John Wiley 8 Sons.
[47]
Michael Cochez and Hao Mou. 2015. Twister tries: Approximate hierarchical agglomerative clustering for average distance in linear time. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. ACM, 505--517.
[48]
Ronan Collobert and Samy Bengio. 2001. SVMTorch: Support vector machines for large-scale regression problems. The Journal of Machine Learning Research 1 (2001), 143--160.
[49]
Nick Craswell and Martin Szummer. 2007. Random walks on the click graph. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 239--246.
[50]
Jeffrey Dean and Sanjay Ghemawat. 2008. MapReduce: Simplified data processing on large clusters. Communications of the ACM 51, 1 (2008), 107--113.
[51]
Hongbo Deng and Jiawei Han. 2013. Probabilistic models for clustering. In Data Clustering: Algorithms and Applications, Charu C. Aggarwal and Chandan K. Reddy (Eds.). CRC Press, 61.
[52]
Inderjit S. Dhillon. 2001. Co-clustering documents and words using bipartite spectral graph partitioning. In Proceedings of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 269--274.
[53]
Chris Ding and Xiaofeng He. 2004. K-means clustering via principal component analysis. In Proceedings of the 21st International Conference on Machine Learning. ACM, 29.
[54]
Chris Ding, Xiaofeng He, Hongyuan Zha, and Horst D. Simon. 2002. Adaptive dimension reduction for clustering high dimensional data. In Proceedings of the 2002 IEEE International Conference on Data Mining (ICDM’02). IEEE, 147--154.
[55]
Chris H. Q. Ding, Xiaofeng He, Hongyuan Zha, Ming Gu, and Horst D. Simon. 2001. A min-max cut algorithm for graph partitioning and data clustering. In Proceedings of the IEEE International Conference on Data Mining (ICDM’01). IEEE, 107--114.
[56]
Hristo N. Djidjev and Melih Onus. 2013. Scalable and accurate graph clustering and community structure detection. IEEE Transactions on Parallel and Distributed Systems 24, 5 (2013), 1022--1029.
[57]
Chuong B. Do and Serafim Batzoglou. 2008. What is the expectation maximization algorithm? Nature Biotechnology 26, 8 (2008), 897--899.
[58]
Richard C. Dubes. 1993. Cluster analysis and related issues. In Handbook of Pattern Recognition 8 Computer Vision, C. H. Chen, L. F. Pau, and P. S. P. Wang (Eds.). World Scientific Publishing Co., Inc., 3--32.
[59]
Jordi Duch and Alex Arenas. 2005. Community detection in complex networks using extremal optimization. Physical Review E 72, 2 (2005), 027104.
[60]
Richard O. Duda, Peter E. Hart, and David G. Stork. 2001. Pattern Classification (2nd ed.). Wiley.
[61]
Jack Edmonds. 1965. Paths, trees, and flowers. Canadian Journal of Mathematics 17, 3 (1965), 449--467.
[62]
Michael B. Eisen, Paul T. Spellman, Patrick O. Brown, and David Botstein. 1998. Cluster analysis and display of genome-wide expression patterns. Proceedings of the National Academy of Sciences of the United States of America 95, 25 (1998), 14863--14868.
[63]
Stefano Ermon, Carla Gomes, Ashish Sabharwal, and Bart Selman. 2013. Taming the curse of dimensionality: Discrete integration by hashing and optimization. In Proceedings of the 30th International Conference on Machine Learning (ICML’13), 334--342.
[64]
Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. 1996. A density-based algorithm for discovering clusters in large spatial databases with noise. In Kdd, Vol. 96. 226--231.
[65]
Brian Everitt and Torsten Hothorn. 2011. Cluster analysis. In An Introduction to Applied Multivariate Analysis with R, Robert Gentleman, Kurt Hornik, and Giovanni Parmigiani (Eds.). Springer, 163--200.
[66]
Adil Fahad, Najlaa Alshatri, Zahir Tari, Abdullah Alamri, Ibrahim Khalil, Albert Y. Zomaya, Sebti Foufou, and Abdelaziz Bouras. 2014. A survey of clustering algorithms for big data: Taxonomy and empirical analysis. IEEE Transactions on Emerging Topics in Computing 2, 3 (2014), 267--279.
[67]
Gary William Flake, Robert E. Tarjan, and Kostas Tsioutsiouliklis. 2004. Graph clustering and minimum cut trees. Internet Mathematics 1, 4 (2004), 385--408.
[68]
Chris Fraley and Adrian E. Raftery. 1998. How many clusters? Which clustering method? Answers via model-based cluster analysis. The Computer Journal 41, 8 (1998), 578--588.
[69]
Laurent Galluccio, Olivier Michel, Pierre Comon, Mark Kliger, and Alfred O. Hero. 2013. Clustering with a new distance measure based on a dual-rooted tree. Information Sciences 251 (2013), 96--113.
[70]
Laurent Galluccio, Olivier Michel, Pierre Comon, Mark Kliger, and Alfred O. Hero. 2013. Hybrid clustering algorithm with modifications enhanced K-means and hierarchal clustering. International Journal of Advanced Research in Computer Science and Software Engineering 3, 5 (2013), 166--170.
[71]
Junhao Gan and Yufei Tao. 2015. DBSCAN revisited: Mis-claim, un-fixability, and approximation. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. ACM, 519--530.
[72]
Esther Garcia, Francisco Pedroche, and Miguel Romance. 2013. On the localization of the personalized PageRank of complex networks. Linear Algebra and Its Applications 439, 3 (2013), 640--652.
[73]
Andreas Geyer-Schulz and Michael Ovelgönne. 2014. The randomized greedy modularity clustering algorithm and the core groups graph clustering scheme. In German-Japanese Interchange of Data Analysis Results, Wolfgang Gaul, Andreas Geyer-Schulz, Yasumasa Baba, and Akinori Okada (Eds.). Springer, 17--36.
[74]
K. Chidananda Gowda and Edwin Diday. 1991. Symbolic clustering using a new dissimilarity measure. Pattern Recognition 24, 6 (1991), 567--578.
[75]
Sudipto Guha, Nina Mishra, Rajeev Motwani, and Liadan O’Callaghan. 2000. Clustering data streams. In Proceedings of the 41st Annual Symposium on Foundations of Computer Science. IEEE, 359--366.
[76]
Sudipto Guha, Rajeev Rastogi, and Kyuseok Shim. 1998. CURE: An efficient clustering algorithm for large databases. In Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data (SIGMOD’98), Vol. 27. ACM, 73--84.
[77]
Michael Hahsler and Matthew Bolaños. 2016. Clustering data streams based on shared density between micro-clusters. IEEE Transactions on Knowledge and Data Engineering 28, 6 (2016), 1449--1461.
[78]
Maria Halkidi, Yannis Batistakis, and Michalis Vazirgiannis. 2001. On clustering validation techniques. Journal of Intelligent Information Systems 17, 2 (2001), 107--145.
[79]
Greg Hamerly and Charles Elkan. 2002. Alternatives to the k-means algorithm that find better clusterings. In Proceedings of the 11th International Conference on Information and Knowledge Management. ACM, 600--607.
[80]
Jiawei Han, Micheline Kamber, and Jian Pei. 2011. Data Mining: Concepts and Techniques. Elsevier.
[81]
Ibrahim Abaker Targio Hashem, Ibrar Yaqoob, Nor Badrul Anuar, Salimah Mokhtar, Abdullah Gani, and Samee Ullah Khan. 2015. The rise of big data on cloud computing: Review and open research issues. Information Systems 47 (2015), 98--115.
[82]
Richard J. Hathaway, James C. Bezdek, and Yingkang Hu. 2000. Generalized fuzzy c-means clustering strategies using L p norm distances. IEEE Transactions on Fuzzy Systems 8, 5 (2000), 576--582.
[83]
Yaobin He, Haoyu Tan, Wuman Luo, Shengzhong Feng, and Jianping Fan. 2014. MR-DBSCAN: A scalable MapReduce-based DBSCAN algorithm for heavily skewed data. Frontiers of Computer Science 8, 1 (2014), 83--99.
[84]
Zengyou He, Xiaofei Xu, and Shengchun Deng. 2008. k-ANMI: A mutual information based clustering algorithm for categorical data. Information Fusion 9, 2 (2008), 223--233.
[85]
Monika Henzinger. 2006. Finding near-duplicate web pages: A large-scale evaluation of algorithms. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 284--291.
[86]
Alexander Hinneburg and Daniel A. Keim. 1998. An efficient approach to clustering in large multimedia databases with noise. In Proceedings of the 4th International Conference on Knowledge Discovery and Data Mining (KDD’98), Vol. 98. 58--65.
[87]
Alexander Hinneburg and Daniel A. Keim. 1999. Optimal grid-clustering: Towards breaking the curse of dimensionality in high-dimensional clustering. In Proceedings of the 25th VLDB Conference.
[88]
Chenping Hou, Feiping Nie, Dongyun Yi, and Dacheng Tao. 2015. Discriminative embedded clustering: A framework for grouping high-dimensional data. IEEE Transactions on Neural Networks and Learning Systems 26, 6 (2015), 1287--1299.
[89]
Shengsheng Huang, Jie Huang, Jinquan Dai, Tao Xie, and Bo Huang. 2010. The HiBench benchmark suite: Characterization of the MapReduce-based data analysis. In Proceedings of the 2010 IEEE 26th International Conference on Data Engineering Workshops (ICDEW’10). IEEE, 41--51.
[90]
Aapo Hyvarinen. 1999. Survey on independent component analysis. Neural Computing Surveys 2, 4 (1999), 94--128.
[91]
Anil K. Jain and Richard C. Dubes. 1988. Algorithms for Clustering Data. Prentice-Hall, Inc.
[92]
Anil K. Jain, Robert P. W. Duin, and Jianchang Mao. 2000. Statistical pattern recognition: A review. IEEE Transactions on Pattern Analysis and Machine Intelligence 22, 1 (2000), 4--37.
[93]
Anil K. Jain, M. Narasimha Murty, and Patrick J. Flynn. 1999. Data clustering: A review. ACM Computing Surveys 31, 3 (1999), 264--323.
[94]
Glen Jeh and Jennifer Widom. 2002. SimRank: A measure of structural-context similarity. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 538--543.
[95]
Huidong Jin, Jie Chen, Hongxing He, Graham J. Williams, Chris Kelman, and Christine M. O’Keefe. 2008. Mining unexpected temporal associations: Applications in detecting adverse drug reactions. IEEE Transactions on Information Technology in Biomedicine 12, 4 (2008), 488--500.
[96]
Karin Kailing, Hans-Peter Kriegel, and Peer Kröger. 2004. Density-connected subspace clustering for high-dimensional data. In Proceedings of the 2004 SIAM International Conference on Data Mining, Vol. 4. SIAM.
[97]
George Karypis, Eui-Hong Han, and Vipin Kumar. 1999. Chameleon: Hierarchical clustering using dynamic modeling. Computer 32, 8 (1999), 68--75.
[98]
Leonard Kaufman and Peter J. Rousseeuw. 2009. Finding Groups in Data: An Introduction to Cluster Analysis, Vol. 344. John Wiley 8 Sons.
[99]
Yoonsoo Kim and Mehran Mesbahi. 2006. On maximizing the second smallest eigenvalue of a state-dependent graph Laplacian. IEEE Transactions on Automatic Control 51, 1 (2006), 116--120.
[100]
Jon Kleinberg. 2003. An impossibility theorem for clustering. In Proceedings of the 15th International Conference on Neural Information Processing Systems. 463--470.
[101]
Teuvo Kohonen. 1990. The self-organizing map. Proceedings of the IEEE 78, 9 (1990), 1464--1480.
[102]
Teuvo Kohonen, Samuel Kaski, Krista Lagus, Jarkko Salojärvi, Jukka Honkela, Vesa Paatero, and Antti Saarela. 2000. Self organization of a massive document collection. IEEE Transactions on Neural Networks 11, 3 (2000), 574--585.
[103]
Teuvo Kohonen, M. R. Schroeder, and T. S. Huang. 2001. Self-Organizing Maps. Springer-Verlag, New York, Inc., Secaucus, NJ, 43.
[104]
Hans-Peter Kriegel, Peer Kröger, and Arthur Zimek. 2008. Detecting clusters in moderate-to-high dimensional data: Subspace clustering, pattern-based clustering, and correlation clustering. Proceedings of the VLDB Endowment 1, 2 (2008), 1528--1529.
[105]
Hans-Peter Kriegel, Peer Kröger, and Arthur Zimek. 2009. Clustering high-dimensional data: A survey on subspace clustering, pattern-based clustering, and correlation clustering. ACM Transactions on Knowledge Discovery from Data 3, 1 (2009), 1.
[106]
G. N. Lance and W. T. Williams. 1967. A general theory of classification sorting strategies: 1= hierarchical systems, 2= clustering systems. Computer Journal 10, 3 (1967), 271--277.
[107]
Peter Langfelder, Bin Zhang, and Steve Horvath. 2008. Defining clusters from a hierarchical cluster tree: The dynamic tree cut package for R. Bioinformatics 24, 5 (2008), 719--720.
[108]
Kyong-Ha Lee, Yoon-Joon Lee, Hyunsik Choi, Yon Dohn Chung, and Bongki Moon. 2012. Parallel data processing with MapReduce: A survey. ACM SIGMOD Record 40, 4 (2012), 11--20.
[109]
Deyi Li, Shuliang Wang, Wenyan Gan, and Deren Li. 2012. Data field for hierarchical clustering. Developments in Data Extraction, Management, and Analysis (2012), 303.
[110]
Jiuyong Li, Xiaodi Huang, Clinton Selke, and Jianming Yong. 2007. A fast algorithm for finding correlation clusters in noise data. In Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer, 639--647.
[111]
Ning Li, Li Zeng, Qing He, and Zhongzhi Shi. 2012. Parallel implementation of Apriori algorithm based on MapReduce. In Proceeding of the 2012 13th ACIS International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel 8 Distributed Computing (SNPD). IEEE, 236--241.
[112]
Xin-Ye Li and Li-jie Guo. 2012. Constructing affinity matrix in spectral clustering based on neighbor propagation. Neurocomputing 97 (2012), 125--130.
[113]
Greg Linden, Brent Smith, and Jeremy York. 2003. Amazon. com recommendations: Item-to-item collaborative filtering. IEEE Internet Computing 7, 1 (2003), 76--80.
[114]
Bing Liu, Yiyuan Xia, and Philip S. Yu. 2000. Clustering through decision tree construction. In Proceedings of the 9th International Conference on Information and Knowledge Management. ACM, 20--29.
[115]
Chung Laung Liu. 1968. Introduction to Combinatorial Mathematics, Vol. 181. McGraw-Hill, New York.
[116]
Yuechang Liu and Yong Tang. 2015. Network based framework for author name disambiguation applications. International Journal of u-and e-Service, Science and Technology 8, 9 (2015), 75--82.
[117]
Stuart P. Lloyd. 1982. Least squares quantization in PCM. IEEE Transactions on Information Theory 28, 2 (1982), 129--137.
[118]
James MacQueen and others. 1967. Some methods for classification and analysis of multivariate observations. In Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1. Oakland, CA, 281--297.
[119]
Sara C. Madeira and Arlindo L. Oliveira. 2004. Biclustering algorithms for biological data analysis: A survey. IEEE/ACM Transactions on Computational Biology and Bioinformatics 1, 1 (2004), 24--45.
[120]
Jianchang Mao and Anil K. Jain. 1996. A self-organizing network for hyperellipsoidal clustering (HEC). IEEE Transactions on Neural Networks 7, 1 (1996), 16--29.
[121]
Marie-Hélène Masson and Thierry Denoeux. 2011. Ensemble clustering in the belief functions framework. International Journal of Approximate Reasoning 52, 1 (2011), 92--109.
[122]
Geoffrey J. McLachlan and Kaye E. Basford. 1988. Mixture Models: Inference and Applications to Clustering. Statistics: Textbooks and Monographs. Dekker, New York, Dekker.
[123]
Ryszard S. Michalski and Robert E. Stepp. 1983. Automated construction of classifications: Conceptual clustering versus numerical taxonomy. IEEE Transactions on Pattern Analysis and Machine Intelligence 4 (1983), 396--410.
[124]
Zijian Ming, Chunjie Luo, Wanling Gao, Rui Han, Qiang Yang, Lei Wang, and Jianfeng Zhan. 2013. BDGS: A scalable big data generator suite in big data benchmarking. In Workshop on Big Data Benchmarks, Tilmann Rabl, Nambiar Raghunath, Meikel Poess, Milind Bhandarkar, Hans-Arno Jacobsen, and Chaitanya Baru (Eds.). Springer, 138--154.
[125]
Jiawei Han and Micheline Kamber. 2001. Data Mining: Concepts and Techniques. Elsevier.
[126]
Priyanka Mukhopadhyay and Bidyut B. Chaudhuri. 2015. A survey of hough transform. Pattern Recognition 48, 3 (2015), 993--1010.
[127]
T. M. Murali and Simon Kasif. 2003. Extracting conserved gene expression motifs from gene expression data. In Pacific Symposium on Biocomputing, Vol. 8. 77--88.
[128]
Fionn Murtagh. 1983. A survey of recent advances in hierarchical clustering algorithms. Computer Journal 26, 4 (1983), 354--359.
[129]
Mor Naaman. 2012. Social multimedia: Highlighting opportunities for search and mining of multimedia data in social media applications. Multimedia Tools and Applications 56, 1 (2012), 9--34.
[130]
Mohammad Hossein Nadimi and Mostafa Mosakhani. 2015. A more accurate clustering method by using co-author social networks for author name disambiguation. Journal of Computing and Security 1, 4 (2015), 307--317.
[131]
Mark E. J. Newman. 2004. Detecting community structure in networks. The European Physical Journal B-Condensed Matter and Complex Systems 38, 2 (2004), 321--330.
[132]
Andrew Y. Ng, Michael I. Jordan, Yair Weiss, and others. 2001. On spectral clustering: Analysis and an algorithm. In Proceedings of the 14th International Conference on Neural Information Processing Systems: Natural and Synthetic (NIPS’01), Vol. 14. 849--856.
[133]
R. Ng and J. Han. Efficient and effective clustering method for spatial data mining. In Proceedings of the 20th International Conference on Very Large Data Bases (VLDB’94). Santiago, Chile, 144--155.
[134]
Raymond T. Ng and Jiawei Han. 2002. CLARANS: A method for clustering objects for spatial data mining. IEEE Transactions on Knowledge and Data Engineering 14, 5 (2002), 1003--1016.
[135]
Feiping Nie, Chris Ding, Dijun Luo, and Heng Huang. 2010. Improved minmax cut graph clustering with nonnegative relaxation. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 451--466.
[136]
Feiping Nie, Xiaoqian Wang, and Heng Huang. 2014. Clustering and projected clustering with adaptive neighbors. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 977--986.
[137]
Feiping Nie, Xiaoqian Wang, Michael I. Jordan, and Heng Huang. 2016. The constrained Laplacian rank algorithm for graph-based clustering. In Proceedings of the 30th AAAI Conference on Artificial Intelligence (AAAI’16). Citeseer, 1969--1976.
[138]
Feiping Nie, Dong Xu, and Xuelong Li. 2012. Initialization independent clustering with actively self-training method. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 42, 1 (2012), 17--27.
[139]
Feiping Nie, Zinan Zeng, Ivor W. Tsang, Dong Xu, and Changshui Zhang. 2011. Spectral embedded clustering: A framework for in-sample and out-of-sample spectral clustering. IEEE Transactions on Neural Networks 22, 11 (2011), 1796--1808.
[140]
Liadan O’callaghan, Adam Meyerson, Rajeev Motwani, Nina Mishra, and Sudipto Guha. 2002. Streaming-data algorithms for high-quality clustering. In Proceedings of the 18th International Conference on Data Engineering. IEEE, 0685.
[141]
Erkki Oja. 1992. Principal components, minor components, and linear neural networks. Neural Networks 5, 6 (1992), 927--935.
[142]
Nikhil R. Pal, James C. Bezdek, and Eric C. K. Tsao. 1993. Generalized clustering networks and Kohonen’s self-organizing scheme. IEEE Transactions on Neural Networks 4, 4 (1993), 549--557.
[143]
Divya Pandove and Shivani Goel. 2015. A comprehensive study on clustering approaches for big data mining. In Proceedings of the 2015 2nd International Conference on Electronics and Communication Systems (ICECS). IEEE, 1333--1338.
[144]
Divya Pandove and Shivani Goel. 2015. Prototyping and in-depth analysis of big data benchmarking. In Proceedings of the 2015 IEEE International Conference on Computer and Information Technology; Ubiquitous Computing and Communications; Dependable, Autonomic and Secure Computing; Pervasive Intelligence and Computing (CIT/IUCC/DASC/PICOM). IEEE, 1222--1229.
[145]
Lance Parsons, Ehtesham Haque, and Huan Liu. 2004. Subspace clustering for high dimensional data: A review. ACM SIGKDD Explorations Newsletter 6, 1 (2004), 90--105.
[146]
Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel J. Abadi, David J. DeWitt, Samuel Madden, and Michael Stonebraker. 2009. A comparison of approaches to large-scale data analysis. In Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data. ACM, 165--178.
[147]
Adriano Pereira, Leonardo Rocha, Fernando Mourão, Paulo Góes, and Wagner Meira Jr. 2009. Reactivity based model to study online auctions dynamics. Information Technology and Management 10, 1 (2009), 21--37.
[148]
K. Rajendra Prasad and B. Eswara Reddy. 2013. Assessment of clustering tendency through progressive random sampling and graph-based clustering results. In Proceedings of the 2013 IEEE 3rd International Advance Computing Conference (IACC). IEEE, 726--731.
[149]
Aaron Quigley and Peter Eades. 2000. FADE: Graph drawing, clustering, and visual abstraction. In International Symposium on Graph Drawing. Springer, 197--210.
[150]
M. Kuchaki Rafsanjani, Z. Asghari Varzaneh, and N. Emami Chukanlo. 2012. A survey of hierarchical clustering algorithms. The Journal of Mathematics and Computer Science 5, 3 (2012), 229--240.
[151]
Anand Rajaraman, Jeffrey D. Ullman, Jeffrey David Ullman, and Jeffrey David Ullman. 2012. Mining of Massive Datasets, Vol. 1. Cambridge University Press, Cambridge.
[152]
I. K. Ravichandra Rao. 2003. Data mining and clustering techniques. In Proceedings of DRTC Workshop on Semantic Web, Vol. 8.
[153]
Bellman Richard. 1961. Adaptive Control Processes: A Guided Tour. Princeton University Press.
[154]
Andrew Rosenberg and Julia Hirschberg. 2007. V-Measure: A conditional entropy-based external cluster evaluation measure. In Proceedings of EMNLP-CoNLL, Vol. 7. 410--420.
[155]
Satu Elisa Schaeffer. 2007. Graph clustering. Computer Science Review 1, 1 (2007), 27--64.
[156]
John Scott. 2012. Social Network Analysis. Sage.
[157]
M. Omair Shafiq and Eric Torunski. 2016. A parallel K-medoids algorithm for clustering based on MapReduce. In Proceedings of 15th IEEE International Conference on Machine Learning and Applications (ICMLA’16). IEEE, 502--507.
[158]
B. A. Shboul and Sung-Hyon Myaeng. 2009. Initializing k-means using genetic algorithms.
[159]
Gholamhosein Sheikholeslami, Surojit Chatterjee, and Aidong Zhang. 1998. Wavecluster: A multi-resolution clustering approach for very large spatial databases. In Proceedings of VLDB, Vol. 98. 428--439.
[160]
Peter H. A. Sneath. 1957. The application of computers to taxonomy. Microbiology 17, 1 (1957), 201--226.
[161]
Thorvald Sørensen. 1948. A method of establishing groups of equal amplitude in plant sociology based on similarity of species and its application to analyses of the vegetation on Danish commons. Biologiske skrifter 5 (1948), 1--34.
[162]
Michael Steinbach, George Karypis, Vipin Kumar, and others. 2000. A comparison of document clustering techniques. In Proceedings of KDD Workshop on Text Mining, Vol. 400. Boston, 525--526.
[163]
Mark Steyvers and Tom Griffiths. 2007. Probabilistic topic models. Handbook of Latent Semantic Analysis 427, 7 (2007), 424--440.
[164]
Eric A. Stone and Julien F. Ayroles. 2009. Modulated modularity clustering as an exploratory tool for functional genomic inference. PLoS Genetics 5, 5 (2009), e1000479.
[165]
Mu-Chun Su and Chien-Hsing Chou. 2001. A modified version of the K-means algorithm with a distance based on cluster symmetry. IEEE Transactions on Pattern Analysis and Machine Intelligence 6 (2001), 674--680.
[166]
Yizhou Sun and Jiawei Han. 2013. Meta-path-based search and mining in heterogeneous information networks. Tsinghua Science and Technology 18, 4 (2013), 329--338.
[167]
Yizhou Sun and Jiawei Han. 2013. Mining heterogeneous information networks: A structural analysis approach. ACM SIGKDD Explorations Newsletter 14, 2 (2013), 20--28.
[168]
Yizhou Sun, Jiawei Han, Xifeng Yan, Philip S. Yu, and Tianyi Wu. 2011. Pathsim: Meta path-based top-k similarity search in heterogeneous information networks. In Proceedings of the VLDB Endowment. 11.
[169]
Yizhou Sun, Jiawei Han, Peixiang Zhao, Zhijun Yin, Hong Cheng, and Tianyi Wu. 2009. Rankclus: Integrating clustering with ranking for heterogeneous information network analysis. In Proceedings of the 12th International Conference on Extending Database Technology: Advances in Database Technology. ACM, 565--576.
[170]
Yizhou Sun, Brandon Norick, Jiawei Han, Xifeng Yan, Philip S. Yu, and Xiao Yu. 2013. Pathselclus: Integrating meta-path selection with user-guided object clustering in heterogeneous information networks. ACM Transactions on Knowledge Discovery from Data (TKDD) 7, 3 (2013), 11.
[171]
Yizhou Sun, Yintao Yu, and Jiawei Han. 2009. Ranking-based clustering of heterogeneous information networks with star network schema. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 797--806.
[172]
Rashish Tandon and Suvrit Sra. 2010. Sparse nonnegative matrix approximation: New formulations and algorithms. Rapport Technique 193 (2010), 38--42.
[173]
Zhuo Tang, Kunkun Liu, Jinbo Xiao, Li Yang, and Zheng Xiao. 2017. A parallel k-means clustering algorithm based on redundance elimination and extreme points optimization employing MapReduce. Concurrency and Computation: Practice and Experience 29, 20 (2017), 1--18.
[174]
Joshua B. Tenenbaum, Vin De Silva, and John C. Langford. 2000. A global geometric framework for nonlinear dimensionality reduction. Science 290, 5500 (2000), 2319--2323.
[175]
Anthony K. H. Tung, Xin Xu, and Beng Chin Ooi. 2005. Curler: Finding and visualizing nonlinear correlation clusters. In Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data. ACM, 467--478.
[176]
Lei Wang, Jianfeng Zhan, Chunjie Luo, Yuqing Zhu, Qiang Yang, Yongqiang He, Wanling Gao, Zhen Jia, Yingjie Shi, Shujie Zhang, and others. 2014. Bigdatabench: A big data benchmark suite from internet services. In Proceedings of the IEEE 20th International Symposium on High Performance Computer Architecture (HPCA’14). IEEE, 488--499.
[177]
Wei Wang, Jiong Yang, Richard Muntz, and others. 1997. STING: A statistical information grid approach to spatial data mining. In Proceedings of VLDB, Vol. 97. 186--195.
[178]
William J. Welch. 1982. Algorithmic complexity: Three NP-hard problems in computational statistics. Journal of Statistical Computation and Simulation 15, 1 (1982), 17--25.
[179]
Douglas Brent West and others. 2001. Introduction to Graph Theory, Vol. 2. Prentice Hall, Upper Saddle River.
[180]
Tom White. 2012. Hadoop: The Definitive Guide. O’Reilly Media, Inc.
[181]
Rui Xu and Donald Wunsch. 2005. Survey of clustering algorithms. IEEE Transactions on Neural Networks 16, 3 (2005), 645--678.
[182]
Rui Xu, Donald Wunsch, and others. 2005. Survey of clustering algorithms. IEEE Transactions on Neural Networks 16, 3 (2005), 645--678.
[183]
Hung-chih Yang, Ali Dasdan, Ruey-Lung Hsiao, and D. Stott Parker. 2007. Map-reduce-merge: Simplified relational data processing on large clusters. In Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data. ACM, 1029--1040.
[184]
Forrest W. Young. 2013. Multidimensional Scaling: History, Theory, and Applications. Psychology Press.
[185]
Jane Yang Yu and Peter Han Joo Chong. 2005. A survey of clustering schemes for mobile ad hoc networks. IEEE Communications Surveys & Tutorials 7, 1 (2005), 32--48.
[186]
Btissam Zerhari, Ayoub Ait Lahcen, and Salma Mouline. 2015. Big data clustering: Algorithms and challenges. In Proceedings of the International Conference on Big Data, Cloud and Applications (BDCA’15).
[187]
Tian Zhang, Raghu Ramakrishnan, and Miron Livny. 1996. BIRCH: An efficient data clustering method for very large databases. In Proceedings of ACM SIGMOD Record, Vol. 25. ACM, 103--114.
[188]
Weizhong Zhao, Huifang Ma, and Qing He. 2009. Parallel k-means clustering based on MapReduce. In Proceedings of IEEE International Conference on Cloud Computing. Springer, 674--679.
[189]
Ding Zhou, Sergey A. Orshanskiy, Hongyuan Zha, and C. Lee Giles. 2007. Co-ranking authors and documents in a heterogeneous network. In Proceedings of the 7th IEEE International Conference on Data Mining (ICDM’07). IEEE, 739--744.
[190]
Yang Zhou, Hong Cheng, and Jeffrey Xu Yu. 2009. Graph clustering based on structural/attribute similarities. Proceedings of the VLDB Endowment 2, 1 (2009), 718--729.
[191]
Xinhua Zhuang, Yan Huang, Kannappan Palaniappan, and Yunxin Zhao. 1996. Gaussian mixture density modeling, decomposition, and applications. IEEE Transactions on Image Processing 5, 9 (1996), 1293--1302.
[192]
Arthur Zimek. 2009. Correlation clustering. ACM SIGKDD Explorations Newsletter 11, 1 (2009), 53--54.

    Published In

    ACM Transactions on Knowledge Discovery from Data, Volume 12, Issue 2
    Survey Papers and Regular Papers
    April 2018
    376 pages
    ISSN:1556-4681
    EISSN:1556-472X
    DOI:10.1145/3178544
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 23 January 2018
    Accepted: 01 August 2017
    Revised: 01 June 2017
    Received: 01 September 2016
    Published in TKDD Volume 12, Issue 2

    Author Tags

    1. Cluster analysis
    2. clustering tendency
    3. data clustering applications
    4. data clustering process
    5. dimensionality reduction
    6. large scale data mining

    Qualifiers

    • Tutorial
    • Research
    • Refereed
