A bioinformatics approach to 2D shape classification

https://doi.org/10.1016/j.cviu.2015.11.011

Highlights

  • An alternative interaction between Pattern Recognition and Bioinformatics is studied.

  • 2D shape classification is addressed using biological sequence analysis approaches.

  • Classification results are competitive with the literature.

  • Other bioinformatics tools are used for understanding and interpretation.

Abstract

In the past, the extensive and profitable interaction between Pattern Recognition (PR) and biology/bioinformatics was mainly unidirectional, namely targeted at applying PR tools and ideas to analyse biological data. In this paper we investigate an alternative approach, which exploits bioinformatics solutions to solve PR problems: in particular, we address the 2D shape classification problem using classical biological sequence analysis approaches – for which a vast number of tools and solutions have been developed and improved in more than 40 years of research. First, we highlight the similarities between 2D shapes and biological sequences, then we propose three methods to encode a shape as a biological sequence. Given the encoding, we can employ standard biological sequence analysis tools to derive a similarity, which can be exploited in a nearest neighbor framework. Classification results, obtained on 5 standard datasets, confirm the potential of the proposed unconventional interaction between PR and bioinformatics. Moreover, we provide some evidence of how other bioinformatics concepts and tools can be exploited to interpret data and results, confirming the flexibility of the proposed framework.

Introduction

Research in Computational Biology and Bioinformatics has experienced unprecedented growth in recent years, mainly due to the fruitful interaction with many disciplines and fields of computer science. Among others, Pattern Recognition/Machine Learning techniques have been successfully exploited in this context [1], for many different reasons: it is possible to “learn from examples”, derive quantitative models, handle non-vectorial data, and deal with many classification, clustering and detection problems commonly encountered in life sciences. In many cases the particular Pattern Recognition model has not been applied “as is”, but has been adapted and modified to take into account biological constraints and needs. Sometimes, this produced approaches that are very different from the original methodology – a clear example is profile HMMs [2].

To some extent, it can be stated that this tight interaction has been mainly unidirectional, with biology/life science gaining the largest benefit. In this paper, we explore an alternative direction, trying to answer the following question: can we reverse the typical direction of interaction between Pattern Recognition and Bioinformatics? Or, in other words, can we exploit advanced bioinformatics models and solutions to solve pattern recognition tasks?

To the best of our knowledge, this perspective is rather new in the literature – the only relevant example is the video-genome project [4] – and it seems a promising direction for two different reasons. First, if we are able to encode the Pattern Recognition problem in biological terms, then we can exploit the huge range of effective, optimized, and interpretable bioinformatics tools developed over more than 40 years of research. These tools heavily rely on the solution of general pattern recognition tasks such as matching, classification, retrieval, clustering, distance computation and so on. For example, in the video-genome project [4], the authors established an analogy between biological sequences and videos, defining the so-called “video-DNA”, a way to map features extracted from video frames into nucleotide sequences. Having encoded the problem in biological terms, the authors were then able to address the video retrieval task by using the well-known BLAST [5] – an extremely fast and effective heuristic-driven algorithm for biological sequence retrieval. Second, and more important, the main goal in bioinformatics research is to derive knowledge from biological data: therefore, the interpretability of methods and solutions is a key feature, and many visualization, inspection and interpretation tools are available in the literature. These tools may also be very useful in Pattern Recognition scenarios, to better understand the different aspects of the data for a given problem: indeed, in recent years interpretability has become a stringent need in Pattern Recognition [6].

This paper takes another step in this direction, providing further evidence of the effectiveness and interpretability of bioinformatics approaches for Pattern Recognition problems. In particular, we propose and discuss a bioinformatics approach to 2D shape classification. Analysis of 2D shapes represents an important and vibrant research area (often paving the way for 3D object classification). Many approaches have appeared in the literature (see for example the reviews [7], [8]): very often, the 2D shape is encoded by the contour, which has proved to be an effective and natural choice in many applications. Here we propose some methods to encode the shape contour as a biological sequence, employing tailored bioinformatics tools to perform classification. In the huge literature related to 2D shape analysis, many approaches exploit sequence alignment tools to perform shape matching ([9], [10], [11], [12], [13], just to cite a few) – some sequence matching-based approaches which start from shape skeletons have also been proposed [14], [15], [16]. Focusing on our main target, i.e. the use of biological sequence alignment tools, it should be noted that only a few approaches employ techniques developed for biological sequences to perform shape classification or matching [17], [18]. Nevertheless, these approaches adopt a very different perspective from ours (and from that of the video-genome project), where the main goal is to encode the PR problem in biological terms, hence exploiting tools developed for biological sequence analysis. In other words, to exploit Bioinformatics tools for Pattern Recognition, one can consider two main steps: (i) encoding the PR problem in biological terms; (ii) applying bioinformatics tools to solve the problem. From this point of view, the approaches in [17], [18] are rather limited: they employ one particular technique for one particular purpose, and do not consider a biological encoding, which would allow the use of a wide class of algorithms for sequence analysis.

In this paper we do explicitly consider this aspect: first, we establish an analogy between 2D shapes and biological sequences, which motivates the employment of bioinformatics tools. Then we propose three ways of transforming a silhouette, encoded with the 8-directional chain code [19], into an amino acid sequence; given that, we can compute the similarity between shapes by using established biological sequence alignment tools. Such similarity is then exploited for classification in a K-nearest-neighbor setting. Finally, we show that other biological tools and concepts (such as multiple sequence alignment, conserved domains, and locality and quality of alignment) can be used for a deeper analysis of the results. We performed different experiments with five standard shape datasets; on the one hand, we show that classification results are very competitive with the state of the art. On the other hand, we show that the poor results we obtained on a retrieval case can be analysed in a deeper way by exploiting other biological sequence mining tools.
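
To make the encoding step more concrete, the following is a minimal sketch of an 8-directional chain code followed by a mapping to amino acid letters. It is only an illustration under stated assumptions: the function names, the direction numbering (code k for a step at k × 45 degrees, with y pointing upward) and, above all, the letter table `CODE_TO_RESIDUE` are hypothetical choices made here, and the mapping is not one of the three encodings proposed in the paper.

```python
# Freeman-style 8-directional chain code: code k is a unit step at k * 45 degrees
# (x to the right, y upward). This numbering is an assumption of the sketch.
DIRECTIONS = {
    (1, 0): 0, (1, 1): 1, (0, 1): 2, (-1, 1): 3,
    (-1, 0): 4, (-1, -1): 5, (0, -1): 6, (1, -1): 7,
}

# Hypothetical one-to-one table from the 8 chain-code symbols to amino acid
# letters; chosen arbitrarily for illustration, NOT the paper's encoding.
CODE_TO_RESIDUE = "ACDEFGHI"


def chain_code(contour):
    """Return the chain code of a closed contour.

    `contour` is an ordered list of (x, y) pixel coordinates in which
    consecutive points (and the last and first point) are 8-connected.
    """
    codes = []
    n = len(contour)
    for i in range(n):
        x0, y0 = contour[i]
        x1, y1 = contour[(i + 1) % n]  # wrap around to close the contour
        step = (x1 - x0, y1 - y0)
        if step not in DIRECTIONS:
            raise ValueError(f"points {contour[i]} and {contour[(i + 1) % n]} are not 8-connected")
        codes.append(DIRECTIONS[step])
    return codes


def encode_as_sequence(codes):
    """Map each chain-code symbol to a letter, giving a protein-like string."""
    return "".join(CODE_TO_RESIDUE[c] for c in codes)


# A 2x2 square traversed counter-clockwise, starting at the bottom-left corner.
square = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2), (1, 2), (0, 2), (0, 1)]
square_codes = chain_code(square)
square_sequence = encode_as_sequence(square_codes)
print(square_codes)     # [0, 0, 2, 2, 4, 4, 6, 6]
print(square_sequence)  # AADDFFHH
```

Note that the resulting string depends on the starting point and on the orientation of the shape; this sketch does nothing to compensate for either.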

Section snippets

Background

This section briefly summarizes the bioinformatics tools exploited in our analysis. First, we present a preliminary overview of biological sequence alignment, so as to clarify notation and terminology. Then, we present the tools employed for pairwise sequence alignment and multiple sequence alignment, trying to highlight specific aspects which are useful for our task.
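
As a reminder of what a pairwise alignment score is, the sketch below computes the optimal global alignment score with the classical Needleman-Wunsch dynamic programming recurrence. It is a simplified illustration, not the tool used in the paper: the function name is hypothetical, and the scoring scheme (+1 for a match, -1 for a mismatch, a linear gap penalty of -2) is an assumption made here in place of a substitution matrix and an affine gap model.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Optimal global alignment score of strings a and b (Needleman-Wunsch).

    Uses a linear gap penalty and keeps only two rows of the dynamic
    programming table, since only the final score is needed.
    """
    # Row 0: aligning the empty prefix of `a` with the first j symbols of `b`.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # first i symbols of `a` against the empty prefix of `b`
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # a[i-1] aligned to a gap
            left = curr[j - 1] + gap  # b[j-1] aligned to a gap
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]


print(global_alignment_score("AADDFFHH", "AADDFFHH"))  # 8: eight matches
print(global_alignment_score("ACD", "AD"))             # 0: two matches and one gap
```

A local alignment (Smith-Waterman) differs mainly in clamping each cell at zero and taking the maximum over the whole table, which makes the score focus on the best matching fragment.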

The proposed approach

In this section we present our approach: in particular, we first link 2D shapes and biological sequences, which may motivate the employment of bioinformatics tools in this context. Then we introduce the three methods used to encode shapes into biological sequences; finally, we detail how to transform alignments into a classification scheme.
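
The last step, turning pairwise similarities into a classifier, can be sketched as a K-nearest-neighbor vote. The code below is an illustrative sketch only: `knn_classify` and `identity_similarity` are hypothetical names, the toy similarity (number of equal symbols at equal positions) merely stands in for a real alignment score, and the training strings and labels are invented for the example.

```python
from collections import Counter


def identity_similarity(a, b):
    """Toy stand-in for an alignment score: count of positions with equal symbols."""
    return sum(x == y for x, y in zip(a, b))


def knn_classify(query, training, similarity, k=1):
    """Label `query` by majority vote among its k most similar training sequences.

    `training` is a list of (sequence, label) pairs and `similarity` is any
    function returning larger values for more similar sequences. Ties in the
    vote go to the label of the most similar neighbor among the tied ones.
    """
    ranked = sorted(training, key=lambda item: similarity(query, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Invented toy data: sequences as produced by some shape encoding.
training_set = [
    ("AADDFFHH", "square"),
    ("AADDFFHI", "square"),
    ("CCEEGGII", "diamond"),
    ("CCEEGGIA", "diamond"),
]
query_sequence = "AADDFFHA"
print(knn_classify(query_sequence, training_set, identity_similarity, k=1))  # square
print(knn_classify(query_sequence, training_set, identity_similarity, k=3))  # square
```

Because the similarity is passed in as a function, any alignment-based score can be plugged in without changing the classifier.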

Classification results and discussion

In this section we evaluate the proposed framework in the context of shape classification. In particular, we first describe the datasets we used and the corresponding evaluation protocols; then we provide some details on the parameters of the proposed framework; finally we present and discuss our classification results, putting them in perspective with respect to the state of the art.

Deeper analysis

In this part we provide an example of how the large number of available bioinformatics tools can be exploited to gain a deeper understanding of the results. To do that, we evaluated our framework on a slightly different task (the retrieval task), trying to exploit bioinformatics tools and concepts to better understand results that were not satisfactory. Even if related, the retrieval task is slightly different from classification: given a testing object, the goal is to retrieve as many shapes as

Conclusions

In this paper we explored the possibility of exploiting bioinformatics concepts, tools and solutions to address the 2D shape classification problem. In our framework, the contour of a 2D shape is encoded using the chain code, and then transformed into biological sequences through three encoding strategies. We then employ biological sequence alignment tools to compute a similarity measure between sequences/shapes, and we use a KNN classification approach. We also proposed some tailoring of the

Acknowledgments

The authors would like to thank Nebojsa Jojic and Alessandro Farinelli for helpful discussions and suggestions. The authors are also grateful to the anonymous reviewers for their valuable comments.

References (70)

  • M. Bicego et al.

    Component-based discriminative classification for hidden Markov models

    Pattern Recogn.

    (2009)
  • M. Daliri et al.

    Classification of silhouettes using contour fragments

    Comput. Vis. Image Underst.

    (2009)
  • M. Daliri et al.

    Shape recognition based on kernel-edit distance

    Comput. Vis. Image Underst.

    (2010)
  • P. Baldi et al.

    Bioinformatics: The Machine Learning Approach

    (2001)
  • S. Eddy

    Profile hidden Markov models

    Bioinformatics

    (1998)
  • S. Madeira et al.

    Biclustering algorithms for biological data analysis: a survey

    IEEE/ACM Trans. Comput. Biol. Bioinf.

    (2004)
  • A. Bronstein, M. Bronstein, R. Kimmel, The video genome, CoRR...
  • J. Chang et al.

    Reading tea leaves: How humans interpret topic models

    NIPS

    (2009)
  • H. Ling et al.

    Shape classification using the inner-distance

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2007)
  • J. Wang et al.

    Shape matching and classification using height functions

    Pattern Recogn. Lett.

    (2011)
  • M. Daliri et al.

    Robust symbolic representation for shape recognition and retrieval

    Pattern Recogn.

    (2008)
  • P. Felzenszwalb et al.

    Hierarchical matching of deformable shapes

    Proceedings of the International Conference on Computer Vision and Pattern Recognition

    (2007)
  • A. Torsello et al.

    Discovering shape classes using tree edit-distance and pairwise clustering

    Int. J. Comput. Vis.

    (2007)
  • X. Bai et al.

    Path similarity skeleton graph matching

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2008)
  • L. Chen et al.

    Efficient partial shape matching using Smith-Waterman algorithm

    CVPR workshop on Non-Rigid Shape Analysis and Deformable Image Alignment

    (2008)
  • R. Huang et al.

    A profile hidden Markov model framework for modeling and analysis of shape

    Proceedings of the International Conference on Image Processing

    (2006)
  • R. Gonzalez et al.

    Digital Image Processing

    (2002)
  • R. Durbin

    Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids

    (1998)
  • M. Dayhoff et al.

    A model of evolutionary change in proteins

    Atlas of Protein Sequence and Structure

    (1978)
  • S. Henikoff et al.

    Amino acid substitution matrices from protein blocks

    Proc. Natl. Acad. Sci.

    (1992)
  • J. Pevsner

    Bioinformatics and Functional Genomics

    (2003)
  • M. Larkin

    Clustal W and Clustal X version 2.0

    Bioinformatics

    (2007)
  • O. Poirot et al.

    3DCoffee@igs: a web server for combining sequences and structures into a multiple sequence alignment

    Nucl. Acids Res.

    (2004)
  • J. Smagala et al.

    Confind: a robust tool for conserved sequence identification

    Bioinformatics

    (2005)
  • M. Bicego et al.

    2D shape recognition using information theoretic kernels

    Proceedings of the International Conference on Pattern Recognition

    (2010)
This paper has been recommended for acceptance by Sven Dickinson.
