Abstract
Large-scale data are any data that cannot be loaded into the main memory of an ordinary computer. This is not an objective definition, but it makes clear what is meant by large-scale data. We first review some existing algorithms for clustering large-scale data, including several data stream clustering algorithms based on the fuzzy c-means (FCM) algorithm. We then propose a new structure for clustering large-scale data, together with two new data stream clustering algorithms based on that structure, which are presented in Sects. 3 and 4. In our method, the objects in the dataset are loaded one by one. We set a membership threshold: if the membership of an object in a cluster is larger than the threshold, the object is assigned to that cluster and the location of the nearest cluster center is updated; otherwise the object is put into a temporary matrix, which we call the pool. When the pool is full, we cluster the data in the pool and update the locations of the cluster centers. Both algorithms are based on this data stream structure; they differ in how the objects in the data are weighted. We test our algorithms on a handwritten digit image dataset and several large-scale UCI datasets and compare them with some existing algorithms. The experiments show that our algorithms are better suited to clustering large-scale datasets.
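The threshold-and-pool structure described in the abstract can be sketched roughly as follows. This is only an illustrative sketch under stated assumptions, not the authors' algorithm: the abstract does not give the update rules, so the running weighted-mean update, the per-center `mass` bookkeeping, the weighted FCM pass over the pool, and all names and default values (`stream_cluster`, `flush_pool`, `threshold=0.8`, `pool_size=50`) are hypothetical choices. Only the membership formula is the standard FCM one.

```python
import math


def memberships(x, centers, m=2.0):
    """Standard FCM membership of point x in each cluster center."""
    d = [math.dist(x, c) for c in centers]
    if any(di == 0.0 for di in d):
        # A point lying exactly on a center belongs fully to it.
        hits = [1.0 if di == 0.0 else 0.0 for di in d]
        return [h / sum(hits) for h in hits]
    p = 2.0 / (m - 1.0)
    return [1.0 / sum((di / dj) ** p for dj in d) for di in d]


def weighted_fcm(points, weights, centers, m=2.0, iters=20):
    """A fixed number of weighted FCM iterations from the given centers."""
    dim = len(points[0])
    for _ in range(iters):
        num = [[0.0] * dim for _ in centers]
        den = [0.0] * len(centers)
        for x, w in zip(points, weights):
            for i, ui in enumerate(memberships(x, centers, m)):
                g = w * ui ** m
                den[i] += g
                for k in range(dim):
                    num[i][k] += g * x[k]
        centers = [
            [n / den[i] for n in num[i]] if den[i] > 0 else list(centers[i])
            for i in range(len(centers))
        ]
    return centers


def flush_pool(pool, centers, mass, m):
    """Assumed pool step: cluster the pooled points together with the
    current centers, each center weighted by its accumulated mass."""
    points = pool + centers
    weights = [1.0] * len(pool) + mass
    new_centers = weighted_fcm(points, weights, centers, m)
    new_mass = list(mass)
    for x in pool:
        for i, ui in enumerate(memberships(x, new_centers, m)):
            new_mass[i] += ui ** m
    return new_centers, new_mass


def stream_cluster(stream, centers, threshold=0.8, pool_size=50, m=2.0):
    """One pass over the stream with a membership threshold and a pool."""
    centers = [list(c) for c in centers]
    mass = [1.0] * len(centers)  # assumed initial weight of each center
    pool = []
    for x in stream:
        u = memberships(x, centers, m)
        # The highest membership always belongs to the nearest center.
        i = max(range(len(u)), key=u.__getitem__)
        if u[i] > threshold:
            # Confident assignment: assumed running weighted-mean update.
            w = u[i] ** m
            centers[i] = [(mass[i] * c + w * xk) / (mass[i] + w)
                          for c, xk in zip(centers[i], x)]
            mass[i] += w
        else:
            # Ambiguous object: defer it to the pool.
            pool.append(list(x))
            if len(pool) >= pool_size:
                centers, mass = flush_pool(pool, centers, mass, m)
                pool = []
    if pool:  # cluster whatever is left when the stream ends
        centers, mass = flush_pool(pool, centers, mass, m)
    return centers
```

A call such as `stream_cluster(data, initial_centers, threshold=0.8)` makes a single pass over `data` and returns the final centers. Only the current centers, their accumulated weights, and at most `pool_size` deferred objects are held in memory at any time, which is the point of the structure.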
Acknowledgments
This work was supported by the Program for New Century Excellent Talents in University (No. NCET-12-0920), the Program for New Scientific and Technological Star of Shaanxi Province (No. 2014KJXX-45), the National Natural Science Foundation of China (Nos. 61272279, 61272282, 61371201, and 61203303), the Fundamental Research Funds for the Central Universities (Nos. K5051302049, K5051302023, K50511020011, K5051302002 and K5051302028), the Provincial Natural Science Foundation of Shaanxi of China (No. 2011JQ8020), the Fund for Foreign Scholars in University Research and Teaching Programs (the 111 Project) (No. B07048) and EU IRSES project (No. 247619).
Additional information
Communicated by V. Loia.
About this article
Cite this article
Li, Y., Yang, G., He, H. et al. A study of large-scale data clustering based on fuzzy clustering. Soft Comput 20, 3231–3242 (2016). https://doi.org/10.1007/s00500-015-1698-1
DOI: https://doi.org/10.1007/s00500-015-1698-1