
Efficient Approaches for Solving the Large-Scale k-Medoids Problem: Towards Structured Data

  • Conference paper
  • In: Computational Intelligence (IJCCI 2017)

Abstract

Clustering objects represented by structured data with possibly non-trivial geometry is an interesting task in pattern recognition. Moreover, in the Big Data era, clustering huge amounts of (structured) data challenges computer science and pattern recognition researchers alike. The aim of this paper is to bridge the gap in large-scale structured data clustering. Specifically, following a previous work, a parallel and distributed k-medoids clustering implementation is proposed and tested on real-world biological structured data, namely pathway maps (graphs) and primary structures of proteins (sequences). Furthermore, two methods for medoids’ evaluation, based on exact and approximate procedures respectively, are proposed and compared in terms of scalability. Computational results show that the proposed implementation is flexible with respect to the dissimilarity measure and the input space adopted, with satisfactory results in terms of scalability.


Notes

  1.

    In Spark, workers are the elementary computational units, which can be either single cores on a CPU or entire computers, depending on the environment configuration (i.e. the Spark Context).
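
    As a minimal illustration of how the worker granularity is fixed by the environment configuration, the following PySpark sketch sets up four local cores as workers; the application name, master URL and core count are illustrative assumptions, not taken from the paper.

```python
# Minimal PySpark sketch: the Spark Context fixes the worker granularity.
# The application name, master URL and core count are illustrative assumptions.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("kmedoids-demo")
        .setMaster("local[4]"))       # 4 local cores act as 4 workers
sc = SparkContext(conf=conf)

print(sc.defaultParallelism)          # default number of partitions/tasks
sc.stop()
```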

  2.

    For a more extensive list, we refer the interested reader to the official Apache Spark RDD API at https://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.

  3.

    In the following subsections and pseudocodes, every variable whose name ends with RDD shall be considered an RDD, i.e. a data structure whose shards are distributed across the workers, with no guarantee that any single worker holds its entire content in memory.
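
    As an illustrative sketch of this convention (the data, ID scheme and partition count below are assumptions, not taken from the paper), a variable such as datasetRDD would be created and sharded as follows.

```python
# Illustrative sketch of the *RDD naming convention: datasetRDD is a
# distributed collection of (pattern ID, pattern) pairs whose shards live
# on the workers. Data, ID scheme and partition count are assumptions.
from pyspark import SparkContext

sc = SparkContext.getOrCreate()
patterns = [(i, [float(i), float(i) + 1.0]) for i in range(1000)]
datasetRDD = sc.parallelize(patterns, numSlices=8)   # 8 shards across the workers

print(datasetRDD.getNumPartitions())   # 8
print(datasetRDD.take(2))              # only a small sample returns to the driver
```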

  4.

    Not to be confused with i in datasetRDD, distancesRDD and nearestClusterRDD, which is a within-dataset pattern ID.

  5.

    Indeed, in many cases the pattern-to-medoid assignments, rather than the WCSoD, are compared between consecutive iterations.
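
    A hedged sketch of this assignment-based stopping criterion, assuming both RDDs hold (pattern ID, medoid ID) pairs; the function and variable names are illustrative, not the paper's implementation.

```python
# Hedged sketch of the assignment-based stopping criterion: stop when no
# pattern changed its medoid between two consecutive iterations.
# RDD contents and names are illustrative: (pattern ID, medoid ID) pairs.
def assignments_unchanged(prevAssignmentRDD, currAssignmentRDD):
    changed = (prevAssignmentRDD
               .join(currAssignmentRDD)                  # (ID, (old medoid, new medoid))
               .filter(lambda kv: kv[1][0] != kv[1][1])  # keep patterns that moved
               .count())
    return changed == 0
```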

  6.

    In the following, we suppose that the dissimilarity measure \(d(\cdot ,\cdot )\), although it might not be metric, is at least symmetric, i.e. \(d(a,b)=d(b,a)\). For symmetric dissimilarity measures the pairwise distance matrix is symmetric by definition, and summing by rows or by columns leads to the same result. For non-symmetric dissimilarity measures, symmetry can be ‘forced’ by letting (for example) \(\bar{d}(a,b)=\frac{1}{2}(d(a,b)+d(b,a))\).
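
    A minimal NumPy sketch of this symmetrisation and of the row-sum medoid selection it supports; this is an illustration under the footnote's assumptions, not the paper's implementation, and the function names are hypothetical.

```python
# NumPy sketch of the symmetrisation above and of the exact medoid update it
# supports: the medoid minimises the sum of its row (equivalently, its column)
# of the symmetric pairwise distance matrix. Function names are illustrative.
import numpy as np

def symmetrise(D):
    """Force symmetry of a possibly non-symmetric dissimilarity matrix."""
    return 0.5 * (D + D.T)

def exact_medoid_index(D):
    """Index of the element with minimum sum of dissimilarities to the others."""
    return int(np.argmin(symmetrise(D).sum(axis=1)))
```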

  7.

    Indeed, for very small clusters a Spark-driven parallelisation is discouraged, as the majority of the execution time would be spent on thread scheduling and worker communication rather than on processing.
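
    One possible way to act on this observation is sketched below: a size-driven switch between a local (driver-side) and a distributed medoid update. The threshold value and all names are illustrative assumptions, not taken from the paper.

```python
# Sketch of a size-driven switch: small clusters are processed locally on the
# driver, avoiding Spark scheduling/communication overhead; larger ones are
# distributed. The threshold value and names are illustrative assumptions.
SMALL_CLUSTER_THRESHOLD = 200   # assumption, not taken from the paper

def update_medoid(cluster, dissimilarity, sc=None):
    """Return the element of `cluster` minimising the sum of dissimilarities."""
    if sc is None or len(cluster) <= SMALL_CLUSTER_THRESHOLD:
        # local (driver-side) evaluation
        sums = [sum(dissimilarity(a, b) for b in cluster) for a in cluster]
        return cluster[sums.index(min(sums))]
    # otherwise, distribute one row sum per task across the workers
    rows = sc.parallelize(list(enumerate(cluster)))
    sums = rows.map(lambda ia: (ia[0],
                                sum(dissimilarity(ia[1], b) for b in cluster))).collect()
    best_index = min(sums, key=lambda t: t[1])[0]
    return cluster[best_index]
```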

  8.

    http://www.kegg.jp/kegg/rest/keggapi.html.

  9.

    http://www.kegg.jp/kegg/xml/docs/.

  10.

    It is sufficient for a node to appear in a single network for it to be included.

  11.

    For the sake of comparison, it is worth remarking that both the Euclidean distance and the Hamming distance have complexity \(\mathcal {O}(n)\), where n is the number of attributes. However, in this work, the number of attributes ranges from \(2136^2\) to \(4431^2\) whereas in [12] it ranged from 4 to 28102.

References

  1. Lloyd, S.: Least squares quantization in PCM. IEEE Trans. Inf. Theory 28, 129–137 (1982)

  2. MacQueen, J.B.: Some methods for classification and analysis of multivariate observations. In Cam, L.M.L., Neyman, J. (eds.) Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, pp. 281–297. University of California Press (1967)

  3. Bradley, P.S., Mangasarian, O.L., Street, W.N.: Clustering via concave minimization. In: Proceedings of the 9th International Conference on Neural Information Processing Systems, NIPS’96, pp. 368–374. MIT Press, Cambridge, MA, USA (1996)

  4. Kaufman, L., Rousseeuw, P.J.: Clustering by means of medoids. In: Statistical Data Analysis Based on the L1-Norm and Related Methods (1987)

  5. Ester, M., Kriegel, H.P., Sander, J., Xu, X., et al.: A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, vol. 96, pp. 226–231 (1996)

  6. Ankerst, M., Breunig, M.M., Kriegel, H.P., Sander, J.: Optics: ordering points to identify the clustering structure. In: Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, SIGMOD ’99, pp. 49–60. ACM, New York, NY, USA (1999)

  7. Zhang, T., Ramakrishnan, R., Livny, M.: BIRCH: an efficient data clustering method for very large databases. In: Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, SIGMOD ’96, pp. 103–114. ACM, New York, NY, USA (1996)

  8. Guha, S., Rastogi, R., Shim, K.: CURE: an efficient clustering algorithm for large databases. SIGMOD Rec. 27, 73–84 (1998)

  9. Dean, J., Ghemawat, S.: Mapreduce: simplified data processing on large clusters. Commun. ACM 51, 107–113 (2008)

  10. Zaharia, M., Chowdhury, M., Franklin, M.J., Shenker, S., Stoica, I.: Spark: cluster computing with working sets. HotCloud 10, 95 (2010)

  11. Meng, X., Bradley, J., Yavuz, B., Sparks, E., Venkataraman, S., Liu, D., Freeman, J., Tsai, D., Amde, M., Owen, S., et al.: MLlib: machine learning in apache spark. J. Mach. Learn. Res. 17, 1–7 (2016)

  12. Martino, A., Rizzi, A., Frattale Mascioli, F.M.: Efficient approaches for solving the large-scale k-medoids problem. In: Proceedings of the 9th International Joint Conference on Computational Intelligence—Volume 1: IJCCI, INSTICC, pp. 338–347. SciTePress (2017)

  13. Zhao, W., Ma, H., He, Q.: Parallel k-means clustering based on mapreduce. In: Jaatun, M.G., Zhao, G., Rong, C. (eds.) Cloud Computing, pp. 674–679. Springer, Berlin, Heidelberg (2009)

  14. Yue, X., Man, W., Yue, J., Liu, G.: Parallel k-medoids++ spatial clustering algorithm based on mapreduce (2016). arXiv:1608.06861

  15. Arbelaez, A., Quesada, L.: Parallelising the k-medoids clustering problem using space-partitioning. In: Sixth Annual Symposium on Combinatorial Search (2013)

  16. Jiang, Y., Zhang, J.: Parallel k-medoids clustering algorithm based on Hadoop. In: 2014 5th IEEE International Conference on Software Engineering and Service Science (ICSESS), pp. 649–652. IEEE (2014)

  17. Kaufman, L., Rousseeuw, P.J.: Finding Groups in Data: An Introduction to Cluster Analysis, vol. 344. Wiley (2009)

  18. Martino, A., Rizzi, A., Mascioli, F. M. F.: Distance matrix pre-caching and distributed computation of internal validation indices in k-medoids clustering. In: 2018 International Joint Conference on Neural Networks (IJCNN), pp. 1–8. IEEE (2018)

  19. Park, H.S., Jun, C.H.: A simple and fast algorithm for k-medoids clustering. Expert Syst. Appl. 36, 3336–3341 (2009)

  20. Del Vescovo, G., Livi, L., Frattale Mascioli, F.M., Rizzi, A.: On the problem of modeling structured data with the minsod representative. Int. J. Comput. Theory Eng. 6, 9 (2014)

  21. Martino, A., Giuliani, A., Rizzi, A.: Granular computing techniques for bioinformatics pattern recognition problems in non-metric spaces. In: Pedrycz, W., Chen, S.M. (eds.) Computational Intelligence for Pattern Recognition, pp. 53–81. Springer International Publishing, Cham (2018)

  22. Aloise, D., Deshpande, A., Hansen, P., Popat, P.: NP-hardness of euclidean sum-of-squares clustering. Mach. Learn. 75, 245–248 (2009)

  23. Arthur, D., Vassilvitskii, S.: k-means++: the advantages of careful seeding. In: Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1027–1035. Society for Industrial and Applied Mathematics (2007)

  24. Bianchi, F.M., Livi, L., Rizzi, A.: Two density-based k-means initialization algorithms for non-metric data clustering. Pattern Anal. Appl. 3, 745–763 (2016)

  25. Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauley, M., Franklin, M.J., Shenker, S., Stoica, I.: Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation, pp. 2–2. USENIX Association (2012)

  26. Thorndike, R.L.: Who belongs in the family? Psychometrika 18, 267–276 (1953)

  27. Livi, L., Rizzi, A.: The graph matching problem. Pattern Anal. Appl. 16, 253–283 (2013)

  28. Livi, L., Del Vescovo, G., Rizzi, A.: Graph recognition by seriation and frequent substructures mining. In: ICPRAM 2012—Proceedings of the 1st International Conference on Pattern Recognition Applications and Methods, vol. 1, pp. 186–191 (2012)

  29. van der Walt, S., Colbert, S.C., Varoquaux, G.: The NumPy array: a structure for efficient numerical computation. Comput. Sci. Eng. 13, 22–30 (2011)

  30. Jones, E., Oliphant, T., Peterson, P., et al.: SciPy: open source scientific tools for Python (2001). Accessed 13 Mar 2018

  31. Millman, K.J., Aivazis, M.: Python for scientists and engineers. Comput. Sci. Eng. 13, 9–12 (2011)

  32. Oliphant, T.E.: Python for scientific computing. Comput. Sci. Eng. 9 (2007)

  33. Cokelaer, T., Pultz, D., Harder, L.M., Serra-Musach, J., Saez-Rodriguez, J.: Bioservices: a common Python package to access biological web services programmatically. Bioinformatics 29, 3241–3242 (2013)

  34. Kanehisa, M., Furumichi, M., Tanabe, M., Sato, Y., Morishima, K.: KEGG: new perspectives on genomes, pathways, diseases and drugs. Nucleic Acids Res. 45, D353–D361 (2016)

  35. Kanehisa, M., Goto, S.: KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 28, 27–30 (2000)

  36. Kanehisa, M., Sato, Y., Kawashima, M., Furumichi, M., Tanabe, M.: KEGG as a reference resource for gene and protein annotation. Nucleic Acids Res. 44, D457–D462 (2015)

  37. Tun, K., Dhar, P.K., Palumbo, M.C., Giuliani, A.: Metabolic pathways variability and sequence/networks comparisons. BMC Bioinform. 7, 24 (2006)

  38. Hamming, R.W.: Error detecting and error correcting codes. Bell Labs Tech. J. 29, 147–160 (1950)

  39. Giuliani, A., Krishnan, A., Zbilut, J.P., Tomita, M.: Proteins as networks: usefulness of graph theory in protein science. Curr. Protein Pept. Sci. 9, 28–38 (2008)

  40. Di Paola, L., De Ruvo, M., Paci, P., Santoni, D., Giuliani, A.: Protein contact networks: an emerging paradigm in chemistry. Chem. Rev. 113, 1598–1613 (2012)

  41. Livi, L., Giuliani, A., Sadeghian, A.: Characterization of graphs for protein structure modeling and recognition of solubility. Curr. Bioinform. 11, 106–114 (2016)

  42. Livi, L., Maiorino, E., Giuliani, A., Rizzi, A., Sadeghian, A.: A generative model for protein contact networks. J. Biomol. Struct. Dyn. 34, 1441–1454 (2016)

  43. Maiorino, E., Rizzi, A., Sadeghian, A., Giuliani, A.: Spectral reconstruction of protein contact networks. Phys. A Stat. Mech. Appl. 471, 804–817 (2017)

  44. Martino, A., Maiorino, E., Giuliani, A., Giampieri, M., Rizzi, A.: Supervised approaches for function prediction of proteins contact networks from topological structure information. In: Sharma, P., Bianchi, F.M. (eds.) Image Analysis, pp. 285–296. Springer International Publishing, Cham (2017)

  45. Martino, A., Rizzi, A., Mascioli, F. M. F.: Supervised approaches for protein function prediction by topological data analysis. In: 2018 International Joint Conference on Neural Networks (IJCNN), pp. 1–8. IEEE (2018)

  46. De Santis, E., Martino, A., Rizzi, A., Mascioli, F. M. F.: Dissimilarity space representations and automatic feature selection for protein function prediction. In: 2018 International Joint Conference on Neural Networks (IJCNN), pp. 1–8. IEEE (2018)

  47. The UniProt Consortium: UniProt: the universal protein knowledgebase. Nucleic Acids Res. 45, D158–D169 (2017)

  48. Levenshtein, V.I.: Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady 10, 707–710 (1966)

  49. Cinti, A., Bianchi, F.M., Martino, A., Rizzi, A.: A novel algorithm for online inexact string matching and its FPGA implementation (2017). arXiv:1712.03560

Author information

Corresponding author

Correspondence to Alessio Martino.


Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Martino, A., Rizzi, A., Frattale Mascioli, F.M. (2019). Efficient Approaches for Solving the Large-Scale k-Medoids Problem: Towards Structured Data. In: Sabourin, C., Merelo, J.J., Madani, K., Warwick, K. (eds) Computational Intelligence. IJCCI 2017. Studies in Computational Intelligence, vol 829. Springer, Cham. https://doi.org/10.1007/978-3-030-16469-0_11
