Research on large data set clustering method based on MapReduce

  • Brain-Inspired Computing and Machine Learning for Brain Health
Neural Computing and Applications

Abstract

This paper describes in detail the similarities and differences between the MapReduce implementations of the K-means algorithm and the Canopy algorithm, and analyzes the possibility of combining the two into a better algorithm suited to clustering analysis of large data sets. Unlike the improvements to K-means proposed in earlier literature, it proposes new ideas for sampling and analyzes the selection of the relevant thresholds. It then introduces a MapReduce implementation framework for a K-means algorithm based on Canopy partitioning and filtering, and analyzes some of the pseudocode. Finally, it briefly analyzes the time complexity of the algorithm.
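The general idea of seeding K-means with Canopy centers can be illustrated with a minimal single-machine sketch. It is not the paper's algorithm: the sampling scheme, threshold selection, and MapReduce framework are not reproduced here. The function names, the thresholds `t1` (loose) and `t2` (tight), and the toy data are all assumptions made for illustration; the assignment and update steps are only marked as the parts that would typically become the map and reduce phases.

```python
import math


def canopy_centers(points, t1, t2):
    """Pick canopy centers with a loose threshold t1 and a tight threshold t2.

    Hypothetical helper: t1 and t2 are illustrative parameters, with t1 > t2 > 0.
    """
    if not (t1 > t2 > 0):
        raise ValueError("thresholds must satisfy t1 > t2 > 0")
    remaining = list(points)
    centers = []
    while remaining:
        center = remaining[0]
        centers.append(center)
        # Points within t2 of the center can no longer become centers.
        # Points between t2 and t1 stay in the list and may seed or join
        # other canopies (membership within t1 is not tracked in this sketch).
        remaining = [p for p in remaining if math.dist(center, p) >= t2]
    return centers


def kmeans(points, centers, iterations=10):
    """Plain K-means started from the given centers."""
    centers = [tuple(c) for c in centers]
    for _ in range(iterations):
        # Assignment step (the part a mapper would typically perform).
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: math.dist(p, centers[i]))
            groups[nearest].append(p)
        # Update step (the part a reducer would typically perform).
        new_centers = [
            tuple(sum(axis) / len(g) for axis in zip(*g)) if g else c
            for g, c in zip(groups, centers)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers


# Toy data: two well-separated groups of 2-D points (made up for this sketch).
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
seeds = canopy_centers(points, t1=5.0, t2=3.0)
final_centers = kmeans(points, seeds)
print(seeds)
print(final_centers)
```

In this sketch the number of clusters is not fixed in advance; it falls out of the tight threshold, which is one reason threshold selection matters when the two algorithms are combined.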

Acknowledgements

This work was supported by the Chongqing Big Data Engineering Laboratory for Children, the Chongqing Electronics Engineering Technology Research Center for Interactive Learning, the Science and Technology Research Project of Chongqing Municipal Education Commission of China (No. KJ1601401), the Science and Technology Research Project of Chongqing University of Education (No. KY201725C), Basic Research and Frontier Exploration of Chongqing Science and Technology Commission (No. CSTC2014jcyjA40019), and the Project of Science and Technology Research Program of Chongqing Education Commission of China (No. KJZD-K201801601).

Author information

Corresponding author

Correspondence to Fangcheng He.

About this article

Cite this article

Wei, P., He, F., Li, L. et al. Research on large data set clustering method based on MapReduce. Neural Comput & Applic 32, 93–99 (2020). https://doi.org/10.1007/s00521-018-3780-y

  • DOI: https://doi.org/10.1007/s00521-018-3780-y
