Evaluation of Descriptor Algorithms of Biological Sequences and Distance Measures for the Intelligent Cluster Index (ICIx)

  • Conference paper
Beyond Databases, Architectures and Structures. Advanced Technologies for Data Mining and Knowledge Discovery (BDAS 2015, BDAS 2016)

Abstract

Over the previous decades, rapid growth of data has become apparent in all fields of the life sciences. Most notable is the general tendency to retain well-established techniques tied to specific biological requirements and common taxonomies for data classification. A change in perspective towards advanced technological concepts for persisting, organizing and analyzing these huge amounts of data is therefore essential. The Intelligent Cluster Index (ICIx) is a modern technology capable of indexing multidimensional data by semantic criteria, and is thus qualified for this challenge. In this paper, methodical approaches for indexing biological sequences with the ICIx are discussed and evaluated. This includes examining established methods for transforming sequences into vectors, as well as outlining the efficiency of different distance measures applied to these vectors. Our results show that position-conserving methods are superior to other approaches and that the applied distance measures heavily influence performance and quality.
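To make the two building blocks named in the abstract concrete, the following is a minimal sketch of a sequence-to-vector transformation and three common distance measures. It is an illustration under stated assumptions, not the descriptor algorithms or measures evaluated in the paper: the function names, the 20-letter amino acid alphabet, and the choice of a relative n-gram frequency descriptor are all hypothetical. Note that a plain n-gram composition vector of this kind discards position information, so it is not one of the position-conserving methods the abstract reports as superior.

```python
import math
from collections import Counter
from itertools import product

# Assumption: the 20 standard amino acids as the alphabet.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def ngram_vector(sequence, n=2, alphabet=AMINO_ACIDS):
    """Map a sequence to a fixed-length vector of relative n-gram frequencies.

    The vector has len(alphabet) ** n components, one per possible n-gram,
    ordered as produced by itertools.product.
    """
    counts = Counter(sequence[i:i + n] for i in range(len(sequence) - n + 1))
    total = sum(counts.values()) or 1  # avoid division by zero for short input
    return [counts["".join(gram)] / total for gram in product(alphabet, repeat=n)]


def euclidean(u, v):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def manhattan(u, v):
    """Manhattan (L1) distance between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


def cosine_distance(u, v):
    """One minus the cosine similarity; 1.0 by convention if a vector is all zeros."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    dot = sum(a * b for a, b in zip(u, v))
    return 1.0 - dot / (norm_u * norm_v)


if __name__ == "__main__":
    # Two made-up example sequences, used only to exercise the functions.
    x = ngram_vector("MKVLAAGIVGL")
    y = ngram_vector("MKVLGAGIVAL")
    print(euclidean(x, y), manhattan(x, y), cosine_distance(x, y))
```

Because every sequence maps to a vector of the same length regardless of its own length, such vectors can be compared directly with any vector distance, which is the property a multidimensional index needs.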


Notes

  1. Other commonly used terms for n-grams are k-, t- or n-tuples and k-, t- or n-mers.


Acknowledgement

The study has been supported by the Free State of Saxony, the University of Applied Sciences Mittweida and Chemnitz University of Technology.

Author information

Corresponding author

Correspondence to Stefan Schildbach.


Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Schildbach, S., Heinke, F., Benn, W., Labudde, D. (2016). Evaluation of Descriptor Algorithms of Biological Sequences and Distance Measures for the Intelligent Cluster Index (ICIx). In: Kozielski, S., Mrozek, D., Kasprowski, P., Małysiak-Mrozek, B., Kostrzewa, D. (eds) Beyond Databases, Architectures and Structures. Advanced Technologies for Data Mining and Knowledge Discovery. BDAS 2015, BDAS 2016. Communications in Computer and Information Science, vol 613. Springer, Cham. https://doi.org/10.1007/978-3-319-34099-9_33


  • DOI: https://doi.org/10.1007/978-3-319-34099-9_33


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-34098-2

  • Online ISBN: 978-3-319-34099-9

  • eBook Packages: Computer Science, Computer Science (R0)
