Skip to main content
Log in

Clustering based on median and closest string via rank distance with applications on DNA

  • ICONIP 2012
  • Published:
Neural Computing and Applications Aims and scope Submit manuscript

Abstract

This paper aims to present several clustering methods based on rank distance. Rank distance has applications in many different fields such as computational linguistics, biology and computer science. The K-means algorithm represents each cluster by a single mean vector. The mean vector is computed with respect to a distance measure. Two K-means algorithms based on rank distance are described in this paper. Hierarchical clustering builds models based on distance connectivity. This paper describes two hierarchical clustering techniques that use rank distance. Experiments using mitochondrial DNA sequences extracted from several mammals are performed to compare the results of the clustering methods. Results demonstrate the clustering performance and the utility of the proposed algorithms.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1

Similar content being viewed by others

References

  1. Chimani M, Woste M, Bocker S (2011) A closer look at the closest string and closest substring problem. In: Proceedings of ALENEX, pp 13–24

  2. de la Higuera C, Casacuberta F (2000) Topology of strings: median string is np-complete. Theor Comput Sci 230:39–48

    Article  MATH  Google Scholar 

  3. Diaconis P, Graham RL (1977) Spearman footrule as a measure of disarray. J R Stat Soc Ser B (Methodological) 39(2):262–268

    MATH  MathSciNet  Google Scholar 

  4. Dinu LP (2003) On the classification and aggregation of hierarchies with different constitutive elements. Fundamenta Informaticae 55(1):39–50

    MATH  MathSciNet  Google Scholar 

  5. Dinu A, Dinu LP (2005) On the syllabic similarities of romance languages. In: Proceedings of CICLing 3406, pp 785–788

  6. Dinu LP, Ionescu RT (2012) An efficient rank based approach for closest string and closest substring. PLoS One 7(6):e37576

    Article  Google Scholar 

  7. Dinu LP, Ionescu RT (2012a) Clustering based on rank distance with applications on DNA. In: Proceedings of ICONIP 7667

  8. Dinu LP, Ionescu RT (2012b) Clustering methods based on closest string via rank distance. In: Proceedings of SYNASC, pp 207–214

  9. Dinu LP, Manea F (2006) An efficient approach for the rank aggregation problem. Theor Comput Sci 359(1–3):455–461

    Article  MATH  MathSciNet  Google Scholar 

  10. Dinu LP, Popa A (2012) On the closest string via rank distance. In: Proceedings of CPM 7354, pp 413–426

  11. Dinu LP, Sgarro A (2006) A low-complexity distance for DNA strings. Fundamenta Informaticae 73(3):361–372

    MATH  MathSciNet  Google Scholar 

  12. Frances M, Litman A (1997) On covering problems of codes. Theory Comput Syst 30(2):113–119

    MATH  MathSciNet  Google Scholar 

  13. Huang Z (1998) Extensions to the K-means algorithm for clustering large data sets with categorical values. Data Min Knowl Discov 2(3):283–304

    Article  Google Scholar 

  14. Kailing K, Kriegel HP, Kroger P (2004) Density-connected subspace clustering for high-dimensional data. In Proceedings of the 4th SIAM international conference on data mining

  15. Koonin EV (1999) The emerging paradigm and open problems in comparative genomics. Bioinformatics 15:265–266

    Article  Google Scholar 

  16. Lanctot KJ, Li M, Ma B, Wang S, Zhang L (2003) Distinguishing string selection problems. Inf Comput 185(1):41–55

    Article  MATH  MathSciNet  Google Scholar 

  17. Li M, Chen X, Li X, Ma B, Vitanyi PMB (2004) The similarity metric. IEEE Trans Inf Theory 50(12):3250–3264

    Article  MathSciNet  Google Scholar 

  18. Liew AW, Yan H, Yang M (2005) Pattern recognition techniques for the emerging field of bioinformatics: a review. Pattern Recognit 38(11):2055–2073

    Article  Google Scholar 

  19. McCallum A, Nigam K, Ungar LH (2000) Efficient clustering of high-dimensional data sets with application to reference matching. In: Proceedings of ACM SIGKDD, pp 169–178

  20. Nicolas F, Rivals E (2003) Complexities of centre and median string 2676:315–327

    Google Scholar 

  21. Nicolas F, Rivals E (2005) Hardness results for the center and median string problems under the weighted and unweighted edit distances. J Discret Algorithms 3(2–4):390–415

    MATH  MathSciNet  Google Scholar 

  22. Palmer J, Herbon L (1988) Plant mitochondrial DNA evolves rapidly in structure, but slowly in sequence. J Mol Evolut 28:87–89

    Article  Google Scholar 

  23. Popov YV (2007) Multiple genome rearrangement by swaps and by element duplications. Theor Comput Sci 385(1–3):115–126

    Article  MATH  Google Scholar 

  24. Reyes A, Gissi C, Pesole G, Catzeflis FM, Saccone C (2000) Where do rodents fit? Evidence from the complete mitochondrial genome of Sciurus vulgaris. Mol Biol Evol 17(6):979–983

    Article  Google Scholar 

  25. Selim SZ, Ismail MA (1984) K-means-type algorithms: a generalized convergence theorem and characterization of local optimality. IEEE Trans Pattern Anal Mach Intell PAMI 6(1):81–87

    Google Scholar 

  26. Smith T, Waterman M (1981) Comparison of biosequences. Adv Appl Math 2(4):482–489

    Article  MATH  MathSciNet  Google Scholar 

  27. States DJ, Agarwal P (1996) Compact encoding strategies for DNA sequence similarity search. In: Proceedings of the 4th international conference on intelligent systems for molecular biology, pp 211–217

  28. Tian TZ, Ramakrishnan R, Livny M (1996) Birch: an efficient data clustering method for very large databases. SIGMOD Rec 25(2):103–114

    Article  Google Scholar 

  29. Wooley JC (1999) Trends in computational biology: a summary based on a recomb plenary lecture. J Comput Biol 6:459–474

    Article  Google Scholar 

  30. Yin C, Zhao X, Mu S, Tian S (2013) A fast multiclass classification algorithm based on cooperative clustering. Neural Process Lett 1–14. doi:10.1007/s11063-013-9278-9

Download references

Acknowledgments

The contribution of the authors to this paper is equal. Authors thank anonymous reviewers for helpful comments. The research of Liviu P. Dinu was supported by a grant of the Romanian National Authority for Scientific Research, CNCS UEFISCDI, project number PN-II-ID-PCE-2011-3-0959. Radu Tudor Ionescu thanks his Ph.D. supervisor Denis Enachescu from the University of Bucharest, for helpful discussions.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Radu Tudor Ionescu.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Dinu, L.P., Ionescu, R.T. Clustering based on median and closest string via rank distance with applications on DNA. Neural Comput & Applic 24, 77–84 (2014). https://doi.org/10.1007/s00521-013-1468-x

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s00521-013-1468-x

Keywords

Navigation