Abstract
Cluster extraction is a vital part of data mining; however, humans and computers perform it very differently. Humans tend to estimate, perceive or visualize clusters cognitively, while digital computers either perform an exact extraction, follow a fuzzy approach, or organize the clusters in a hierarchical tree. In real data sets, clusters not only differ in density but also contain embedded noise and are nested, which makes their extraction more challenging. In this paper, we propose a density-based technique for extracting connected rectangular clusters that may go undetected by traditional cluster extraction techniques. The proposed technique is inspired by the human cognitive approach of appropriately scaling the level of detail, moving from a low level of detail, i.e., one-way clustering, to a high level of detail, i.e., biclustering, in the dimension of interest, as in online analytical processing. A number of experiments were performed using simulated and real data sets, and the proposed technique was compared with four popular cluster extraction techniques (DBSCAN, CLIQUE, k-medoids and k-means), with promising results.
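To make the notion of a density-based, connected rectangular cluster more concrete, the following is a minimal illustrative sketch. It is not the technique proposed in the paper: it is a generic grid-based density heuristic in which points are binned into square cells, cells holding at least a threshold number of points are treated as dense, and each 4-connected group of dense cells is summarized by its bounding rectangle. The function name `dense_rectangles` and the parameters `cell` and `min_pts` are hypothetical and chosen only for this example.

```python
from collections import deque


def dense_rectangles(points, cell=1.0, min_pts=3):
    """Return bounding rectangles (in grid-cell coordinates) of connected dense regions.

    Illustrative assumption: a cell is "dense" when it holds at least
    `min_pts` points; sparse cells are treated as noise and ignored.
    """
    # Count how many points fall into each grid cell.
    counts = {}
    for x, y in points:
        key = (int(x // cell), int(y // cell))
        counts[key] = counts.get(key, 0) + 1

    dense = {key for key, count in counts.items() if count >= min_pts}

    seen = set()
    rects = []
    for start in sorted(dense):
        if start in seen:
            continue
        # Breadth-first search over 4-connected dense cells.
        queue = deque([start])
        seen.add(start)
        component = []
        while queue:
            cx, cy = queue.popleft()
            component.append((cx, cy))
            for nb in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
                if nb in dense and nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
        xs = [c[0] for c in component]
        ys = [c[1] for c in component]
        # Rectangle as (min_col, min_row, max_col, max_row).
        rects.append((min(xs), min(ys), max(xs), max(ys)))
    return rects


points = [
    (0.1, 0.1), (0.2, 0.5), (0.9, 0.9),   # dense cell (0, 0)
    (1.1, 0.1), (1.5, 0.5), (1.9, 0.2),   # dense cell (1, 0), adjacent to (0, 0)
    (5.1, 5.1), (5.5, 5.5), (5.9, 5.2),   # dense cell (5, 5), isolated
    (3.3, 3.3),                           # single noise point
]
rectangles = dense_rectangles(points, cell=1.0, min_pts=3)
```

In this toy input the two adjacent dense cells merge into one rectangle, the isolated dense cell forms its own, and the lone point is discarded as noise. A bounding rectangle can overstate a non-rectangular region, which is one reason a dedicated technique for rectangular clusters is of interest.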














Acknowledgments
This project was supported by the NSTIP strategic technologies program in the Kingdom of Saudi Arabia (Project No. 12-AGR2709-3). The authors also acknowledge, with thanks, the Science and Technology Unit, King Abdulaziz University, for technical support.
Cite this article
Abdullah, A., Hussain, A. A Cognitively Inspired Approach to Two-Way Cluster Extraction from One-Way Clustered Data. Cogn Comput 7, 161–182 (2015). https://doi.org/10.1007/s12559-014-9281-0
DOI: https://doi.org/10.1007/s12559-014-9281-0