
Efficient Approaches for Solving the Large-Scale k-Medoids Problem: Towards Structured Data

  • Conference paper
  • In: Computational Intelligence (IJCCI 2017)

Abstract

Clustering objects represented by structured data with possibly non-trivial geometry is an interesting task in pattern recognition. Moreover, in the Big Data era, clustering huge amounts of (structured) data challenges computer science and pattern recognition researchers alike. The aim of this paper is to bridge the gap in large-scale structured data clustering. Specifically, following a previous work, a parallel and distributed k-medoids clustering implementation is proposed and tested on real-world biological structured data, namely pathway maps (graphs) and primary structures of proteins (sequences). Furthermore, two methods for medoids’ evaluation, based on exact and approximate procedures respectively, are proposed and compared in terms of scalability. Computational results show that the proposed implementation is flexible with respect to the dissimilarity measure and the input space adopted, with satisfactory results in terms of scalability.


Notes

  1.

    In Spark, workers are the elementary computational units, which can be either single cores on a CPU or entire computers, depending on the environment configuration (i.e. the Spark Context).
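
    As a minimal illustration of how the worker granularity is fixed by the environment configuration, the following PySpark sketch sets up four local cores as workers; the application name, master URL and core count are illustrative assumptions, not taken from the paper.

```python
# Minimal PySpark sketch: the Spark Context fixes the worker granularity.
# The application name, master URL and core count are illustrative assumptions.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("kmedoids-demo")
        .setMaster("local[4]"))       # 4 local cores act as 4 workers
sc = SparkContext(conf=conf)

print(sc.defaultParallelism)          # default number of partitions/tasks
sc.stop()
```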

  2.

    For a more extensive list, we refer the interested reader to the official Apache Spark RDD API at https://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.

  3.

    In the following subsections and pseudocodes, every variable whose name ends with RDD shall be considered an RDD, i.e. a data structure whose shards are distributed across the workers, with no guarantee that any single worker holds its entire content in memory.
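
    As an illustrative sketch of this convention (the data, ID scheme and partition count below are assumptions, not taken from the paper), a variable such as datasetRDD would be created and sharded as follows.

```python
# Illustrative sketch of the *RDD naming convention: datasetRDD is a
# distributed collection of (pattern ID, pattern) pairs whose shards live
# on the workers. Data, ID scheme and partition count are assumptions.
from pyspark import SparkContext

sc = SparkContext.getOrCreate()
patterns = [(i, [float(i), float(i) + 1.0]) for i in range(1000)]
datasetRDD = sc.parallelize(patterns, numSlices=8)   # 8 shards across the workers

print(datasetRDD.getNumPartitions())   # 8
print(datasetRDD.take(2))              # only a small sample returns to the driver
```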

  4.

    Not to be confused with i in datasetRDD, distancesRDD and nearestClusterRDD, which is a within-dataset pattern ID.

  5.

    Indeed, in many cases the pattern-to-medoid assignments, rather than the WCSoD, are compared between consecutive iterations.
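
    A hedged sketch of this assignment-based stopping criterion, assuming both RDDs hold (pattern ID, medoid ID) pairs; the function and variable names are illustrative, not the paper's implementation.

```python
# Hedged sketch of the assignment-based stopping criterion: stop when no
# pattern changed its medoid between two consecutive iterations.
# RDD contents and names are illustrative: (pattern ID, medoid ID) pairs.
def assignments_unchanged(prevAssignmentRDD, currAssignmentRDD):
    changed = (prevAssignmentRDD
               .join(currAssignmentRDD)                  # (ID, (old medoid, new medoid))
               .filter(lambda kv: kv[1][0] != kv[1][1])  # keep patterns that moved
               .count())
    return changed == 0
```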

  6.

    In the following, we suppose that the dissimilarity measure \(d(\cdot ,\cdot )\), although it might not be metric, is at least symmetric, i.e. \(d(a,b)=d(b,a)\). For symmetric dissimilarity measures the pairwise distance matrix is symmetric by definition, and summing by rows or by columns leads to the same result. For non-symmetric dissimilarity measures, symmetry can be ‘forced’ by letting (for example) \(\bar{d}(a,b)=\frac{1}{2}(d(a,b)+d(b,a))\).
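
    A minimal NumPy sketch of this symmetrisation and of the row-sum medoid selection it supports; this is an illustration under the footnote's assumptions, not the paper's implementation, and the function names are hypothetical.

```python
# NumPy sketch of the symmetrisation above and of the exact medoid update it
# supports: the medoid minimises the sum of its row (equivalently, its column)
# of the symmetric pairwise distance matrix. Function names are illustrative.
import numpy as np

def symmetrise(D):
    """Force symmetry of a possibly non-symmetric dissimilarity matrix."""
    return 0.5 * (D + D.T)

def exact_medoid_index(D):
    """Index of the element with minimum sum of dissimilarities to the others."""
    return int(np.argmin(symmetrise(D).sum(axis=1)))
```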

  7.

    Indeed, for very small clusters a Spark-driven parallelisation is discouraged, as the majority of the execution time would be spent on thread scheduling and worker communication rather than on processing.
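
    One possible way to act on this observation is sketched below: a size-driven switch between a local (driver-side) and a distributed medoid update. The threshold value and all names are illustrative assumptions, not taken from the paper.

```python
# Sketch of a size-driven switch: small clusters are processed locally on the
# driver, avoiding Spark scheduling/communication overhead; larger ones are
# distributed. The threshold value and names are illustrative assumptions.
SMALL_CLUSTER_THRESHOLD = 200   # assumption, not taken from the paper

def update_medoid(cluster, dissimilarity, sc=None):
    """Return the element of `cluster` minimising the sum of dissimilarities."""
    if sc is None or len(cluster) <= SMALL_CLUSTER_THRESHOLD:
        # local (driver-side) evaluation
        sums = [sum(dissimilarity(a, b) for b in cluster) for a in cluster]
        return cluster[sums.index(min(sums))]
    # otherwise, distribute one row sum per task across the workers
    rows = sc.parallelize(list(enumerate(cluster)))
    sums = rows.map(lambda ia: (ia[0],
                                sum(dissimilarity(ia[1], b) for b in cluster))).collect()
    best_index = min(sums, key=lambda t: t[1])[0]
    return cluster[best_index]
```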

  8.

    http://www.kegg.jp/kegg/rest/keggapi.html.

  9.

    http://www.kegg.jp/kegg/xml/docs/.

  10.

    It is sufficient for a node to appear in a single network for it to be included.

  11.

    For the sake of comparison, it is worth remarking that both the Euclidean distance and the Hamming distance have complexity \(\mathcal {O}(n)\), where n is the number of attributes. However, in this work, the number of attributes ranges from \(2136^2\) to \(4431^2\) whereas in [12] it ranged from 4 to 28102.

References

  1. Lloyd, S.: Least squares quantization in PCM. IEEE Trans. Inf. Theory 28, 129–137 (1982)

  2. MacQueen, J.B.: Some methods for classification and analysis of multivariate observations. In Cam, L.M.L., Neyman, J. (eds.) Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, pp. 281–297. University of California Press (1967)

  3. Bradley, P.S., Mangasarian, O.L., Street, W.N.: Clustering via concave minimization. In: Proceedings of the 9th International Conference on Neural Information Processing Systems, NIPS’96, pp. 368–374. MIT Press, Cambridge, MA, USA (1996)

  4. Kaufman, L., Rousseeuw, P.J.: Clustering by means of medoids. In: Statistical Data Analysis Based on the L1-Norm and Related Methods (1987)

  5. Ester, M., Kriegel, H.P., Sander, J., Xu, X., et al.: A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, vol. 96, pp. 226–231 (1996)

  6. Ankerst, M., Breunig, M.M., Kriegel, H.P., Sander, J.: Optics: ordering points to identify the clustering structure. In: Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, SIGMOD ’99, pp. 49–60. ACM, New York, NY, USA (1999)

  7. Zhang, T., Ramakrishnan, R., Livny, M.: BIRCH: an efficient data clustering method for very large databases. In: Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, SIGMOD ’96, pp. 103–114. ACM, New York, NY, USA (1996)

  8. Guha, S., Rastogi, R., Shim, K.: CURE: an efficient clustering algorithm for large databases. SIGMOD Rec. 27, 73–84 (1998)

  9. Dean, J., Ghemawat, S.: Mapreduce: simplified data processing on large clusters. Commun. ACM 51, 107–113 (2008)

  10. Zaharia, M., Chowdhury, M., Franklin, M.J., Shenker, S., Stoica, I.: Spark: cluster computing with working sets. HotCloud 10, 95 (2010)

  11. Meng, X., Bradley, J., Yavuz, B., Sparks, E., Venkataraman, S., Liu, D., Freeman, J., Tsai, D., Amde, M., Owen, S., et al.: MLlib: machine learning in apache spark. J. Mach. Learn. Res. 17, 1–7 (2016)

  12. Martino, A., Rizzi, A., Frattale Mascioli, F.M.: Efficient approaches for solving the large-scale k-medoids problem. In: Proceedings of the 9th International Joint Conference on Computational Intelligence—Volume 1: IJCCI, INSTICC, pp. 338–347. SciTePress (2017)

  13. Zhao, W., Ma, H., He, Q.: Parallel k-means clustering based on mapreduce. In: Jaatun, M.G., Zhao, G., Rong, C. (eds.) Cloud Computing, pp. 674–679. Springer, Berlin, Heidelberg (2009)

  14. Yue, X., Man, W., Yue, J., Liu, G.: Parallel k-medoids++ spatial clustering algorithm based on mapreduce (2016). arXiv:1608.06861

  15. Arbelaez, A., Quesada, L.: Parallelising the k-medoids clustering problem using space-partitioning. In: Sixth Annual Symposium on Combinatorial Search (2013)

  16. Jiang, Y., Zhang, J.: Parallel k-medoids clustering algorithm based on Hadoop. In: 2014 5th IEEE International Conference on Software Engineering and Service Science (ICSESS), pp. 649–652. IEEE (2014)

  17. Kaufman, L., Rousseeuw, P.J.: Finding Groups in Data: An Introduction to Cluster Analysis, vol. 344. Wiley (2009)

  18. Martino, A., Rizzi, A., Mascioli, F. M. F.: Distance matrix pre-caching and distributed computation of internal validation indices in k-medoids clustering. In: 2018 International Joint Conference on Neural Networks (IJCNN), pp. 1–8. IEEE (2018)

  19. Park, H.S., Jun, C.H.: A simple and fast algorithm for k-medoids clustering. Expert Syst. Appl. 36, 3336–3341 (2009)

  20. Del Vescovo, G., Livi, L., Frattale Mascioli, F.M., Rizzi, A.: On the problem of modeling structured data with the minsod representative. Int. J. Comput. Theory Eng. 6, 9 (2014)

  21. Martino, A., Giuliani, A., Rizzi, A.: Granular computing techniques for bioinformatics pattern recognition problems in non-metric spaces. In: Pedrycz, W., Chen, S.M. (eds.) Computational Intelligence for Pattern Recognition, pp. 53–81. Springer International Publishing, Cham (2018)

  22. Aloise, D., Deshpande, A., Hansen, P., Popat, P.: NP-hardness of euclidean sum-of-squares clustering. Mach. Learn. 75, 245–248 (2009)

  23. Arthur, D., Vassilvitskii, S.: k-means++: the advantages of careful seeding. In: Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1027–1035. Society for Industrial and Applied Mathematics (2007)

  24. Bianchi, F.M., Livi, L., Rizzi, A.: Two density-based k-means initialization algorithms for non-metric data clustering. Pattern Anal. Appl. 3, 745–763 (2016)

  25. Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauley, M., Franklin, M.J., Shenker, S., Stoica, I.: Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation, pp. 2–2. USENIX Association (2012)

  26. Thorndike, R.L.: Who belongs in the family? Psychometrika 18, 267–276 (1953)

  27. Livi, L., Rizzi, A.: The graph matching problem. Pattern Anal. Appl. 16, 253–283 (2013)

  28. Livi, L., Del Vescovo, G., Rizzi, A.: Graph recognition by seriation and frequent substructures mining. In: ICPRAM 2012—Proceedings of the 1st International Conference on Pattern Recognition Applications and Methods, vol. 1, pp. 186–191 (2012)

  29. van der Walt, S., Colbert, S.C., Varoquaux, G.: The NumPy array: a structure for efficient numerical computation. Comput. Sci. Eng. 13, 22–30 (2011)

  30. Jones, E., Oliphant, T., Peterson, P., et al.: SciPy: open source scientific tools for Python (2001). Accessed 13 Mar 2018

  31. Millman, K.J., Aivazis, M.: Python for scientists and engineers. Comput. Sci. Eng. 13, 9–12 (2011)

  32. Oliphant, T.E.: Python for scientific computing. Comput. Sci. Eng. 9 (2007)

  33. Cokelaer, T., Pultz, D., Harder, L.M., Serra-Musach, J., Saez-Rodriguez, J.: Bioservices: a common Python package to access biological web services programmatically. Bioinformatics 29, 3241–3242 (2013)

  34. Kanehisa, M., Furumichi, M., Tanabe, M., Sato, Y., Morishima, K.: KEGG: new perspectives on genomes, pathways, diseases and drugs. Nucleic Acids Res. 45, D353–D361 (2016)

  35. Kanehisa, M., Goto, S.: KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 28, 27–30 (2000)

  36. Kanehisa, M., Sato, Y., Kawashima, M., Furumichi, M., Tanabe, M.: KEGG as a reference resource for gene and protein annotation. Nucleic Acids Res. 44, D457–D462 (2015)

  37. Tun, K., Dhar, P.K., Palumbo, M.C., Giuliani, A.: Metabolic pathways variability and sequence/networks comparisons. BMC Bioinform. 7, 24 (2006)

  38. Hamming, R.W.: Error detecting and error correcting codes. Bell Labs Tech. J. 29, 147–160 (1950)

  39. Giuliani, A., Krishnan, A., Zbilut, J.P., Tomita, M.: Proteins as networks: usefulness of graph theory in protein science. Curr. Protein Pept. Sci. 9, 28–38 (2008)

  40. Di Paola, L., De Ruvo, M., Paci, P., Santoni, D., Giuliani, A.: Protein contact networks: an emerging paradigm in chemistry. Chem. Rev. 113, 1598–1613 (2012)

  41. Livi, L., Giuliani, A., Sadeghian, A.: Characterization of graphs for protein structure modeling and recognition of solubility. Curr. Bioinform. 11, 106–114 (2016)

  42. Livi, L., Maiorino, E., Giuliani, A., Rizzi, A., Sadeghian, A.: A generative model for protein contact networks. J. Biomol. Struct. Dyn. 34, 1441–1454 (2016)

  43. Maiorino, E., Rizzi, A., Sadeghian, A., Giuliani, A.: Spectral reconstruction of protein contact networks. Phys. A Stat. Mech. Appl. 471, 804–817 (2017)

  44. Martino, A., Maiorino, E., Giuliani, A., Giampieri, M., Rizzi, A.: Supervised approaches for function prediction of proteins contact networks from topological structure information. In: Sharma, P., Bianchi, F.M. (eds.) Image Analysis, pp. 285–296. Springer International Publishing, Cham (2017)

  45. Martino, A., Rizzi, A., Mascioli, F. M. F.: Supervised approaches for protein function prediction by topological data analysis. In: 2018 International Joint Conference on Neural Networks (IJCNN), pp. 1–8. IEEE (2018)

  46. De Santis, E., Martino, A., Rizzi, A., Mascioli, F. M. F.: Dissimilarity space representations and automatic feature selection for protein function prediction. In: 2018 International Joint Conference on Neural Networks (IJCNN), pp. 1–8. IEEE (2018)

  47. The UniProt Consortium: UniProt: the universal protein knowledgebase. Nucleic Acids Res. 45, D158–D169 (2017)

  48. Levenshtein, V.I.: Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady 10, 707–710 (1966)

  49. Cinti, A., Bianchi, F.M., Martino, A., Rizzi, A.: A novel algorithm for online inexact string matching and its FPGA implementation (2017). arXiv:1712.03560

Author information

Corresponding author

Correspondence to Alessio Martino.


Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Martino, A., Rizzi, A., Frattale Mascioli, F.M. (2019). Efficient Approaches for Solving the Large-Scale k-Medoids Problem: Towards Structured Data. In: Sabourin, C., Merelo, J.J., Madani, K., Warwick, K. (eds) Computational Intelligence. IJCCI 2017. Studies in Computational Intelligence, vol 829. Springer, Cham. https://doi.org/10.1007/978-3-030-16469-0_11
