Wavelet-based multifractal analysis of C.elegans sequences based on FCGS signal

https://doi.org/10.1016/j.bspc.2021.102915

Highlights

  • We performed a multifractal analysis of the C.elegans genome based on the FCGS2 signal.

  • We used the WTMM method for the multifractal analysis.

  • We observed that multifractality is variable according to the chromosome region.

  • The multifractality depends on the repetitive DNA content of the sequences.

  • This approach will later help to automatically explore unidentified sequences.

Abstract

The frequency chaos game signal (FCGS) is a new technique for mapping DNA sequences, inspired by Chaos Game theory. Its particularity is that it exploits the statistical properties of the composition of genomic sequences, which may help detect interesting structures within them. Unlike classical DNA coding techniques, where nucleotides are assigned numerical values according to their chemical and structural characteristics, the FCGS coding method is invariant with respect to the assignment of numerical values to nucleotides. Mapping nucleotide sequences with the FCGS technique produces a multifractal landscape that can be studied quantitatively with the wavelet transform modulus maxima (WTMM) method. This method provides a natural generalization of the classical box-counting techniques for multifractal signal analysis, with the wavelets playing the role of generalized oscillating boxes. In this paper, we use the WTMM method to perform a multifractal analysis of the C.elegans genome using the FCGS signal of order two. First, we generate the FCGS signal of particular C.elegans genome regions such as exons, introns, Helitron, CERP3, and CEREP55. Next, we apply the WTMM method to calculate the singularity spectrum. Finally, the results demonstrate the multifractal nature of this genome and show that this multifractal character varies with the region studied. We also find that this variability depends mainly on differences in the repetitive DNA content of each sequence. This approach will be used to characterize sequences so that they can be classified automatically. The technique aims to characterize structural and functional regions of chromosomes and will later help to automatically explore and study unidentified sequences.

Introduction

Deoxyribonucleic acid (DNA) is a complex molecule found inside every cell of an organism. It contains the biological instructions that make each species unique [1]. At the molecular level, DNA is a chain of characters made up of the bases adenine (A), thymine (T), cytosine (C) and guanine (G). The order of these bases determines DNA’s instructions, or genetic code. The genius of DNA lies not only in its complex coded instructions for life but also in its remarkably well-organized architecture, which allows it to hold billions of detailed instructions within a microscopic molecule. In fact, DNA shares a characteristic property of fractals: an extremely long length packed into a bounded and rather small space. This is why DNA is classified as a fractal [2], [3].

Fractal geometry is a useful approach for studying the regularities observed in DNA sequences [3], [4], [5], [6]. However, because DNA is complex and heterogeneous, a single fractal dimension describes only its overall fractal character [3], [4]. The multifractal formalism is better suited to fractal objects that are complex and heterogeneous [7].

Multifractal analysis first appeared with Mandelbrot's multiplicative cascade models for the study of energy dissipation in fully developed turbulence [7]. It was introduced to better characterize the spatial inhomogeneity of theoretical and experimental fractal patterns, and it applies when many fractal subsets coexist [7], [8]. In this case, the object under analysis is divided into several fractal sets, each with its own fractal dimension; together these dimensions form a continuous spectrum of exponents called the singularity spectrum [9]. Multifractal analysis has been applied extensively to medical signals [10], [11], [12], such as electrocardiogram (ECG) and electroencephalogram (EEG) signals [13], [14], mammography [15] and DNA sequences [16].

Multifractal analysis is useful for studying various problems involving DNA sequences. The first study of DNA multifractality was carried out in pre-genomic times by Berthelsen et al., who performed a spectral and multifractal analysis of measurements [17]. This work was later used to reconstruct phylogeny from mitochondrial DNA [18], to analyse complete bacterial genomes [19] and to distinguish coding from non-coding sequences in DNA [20], [21]. Later, multifractal analysis was applied to understand how genetic information is structured in the Caenorhabditis elegans (C. elegans) genome [22] and in the human genome [23]. The analysis was then extended to study the multifractal cross-correlation behavior of coding and non-coding DNA sequences of unequal length [16].

Calculating the multifractal spectrum is an important step in the multifractal analysis of a signal. Several numerical methods exist for estimating it, such as the box-counting method (BC) [20], [22], [23], the wavelet leaders method (WLMF) [24], [25], detrended fluctuation analysis (DFA) [25], [26], [27], detrending moving average (DMA) [28], and wavelet transform modulus maxima (WTMM) [29], [30], [31], [32]. The BC, DFA, and WTMM methods are the most widely used for the multifractal analysis of DNA sequences [20], [22], [23], [29], [32], [33]. The BC method has intrinsic limitations and fails to fully characterize the corresponding singularity spectrum [23], [29], [34], [35]. The DFA method is a physically appealing ad hoc technique that is easy to implement [33]. The WTMM method is mathematically well established; it is based on constructing chains of wavelet transform modulus maxima [32]. Both the DFA and WTMM methods are more numerically stable than the box-counting method for calculating the multifractal spectrum [23], [29], [33]. This is why we opt for the WTMM method in this paper.
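For orientation, the quantities that the WTMM method estimates can be written compactly as below. This is the standard form of the formalism found in the wavelet literature, not a transcription of the authors' Section 2 (which is truncated here); the symbols \psi (analyzing wavelet), \mathcal{L}(a) (set of maxima lines existing at scale a), \tau(q) and D(h) are notational assumptions.

```latex
\begin{align}
W_\psi[f](b,a) &= \frac{1}{a}\int_{-\infty}^{+\infty} \psi\!\left(\frac{x-b}{a}\right) f(x)\,dx \\
Z(q,a) &= \sum_{l \in \mathcal{L}(a)} \Big( \sup_{(x,a') \in l,\; a' \le a} \big|W_\psi[f](x,a')\big| \Big)^{q} \;\sim\; a^{\tau(q)}, \qquad a \to 0^{+} \\
h(q) &= \frac{d\tau}{dq}, \qquad D(h) = \min_{q}\big(qh - \tau(q)\big)
\end{align}
```

A signal is monofractal when \tau(q) is linear in q, so that D(h) collapses to a single point; a nonlinear \tau(q) gives a spectrum D(h) of finite width, which is the signature of multifractality.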

To apply multifractal analysis to a DNA sequence, the ATCG string must first be converted into a numerical signal [36]. This is done by assigning a fixed, specific numerical value to each nucleotide of the sequence, according to the user’s choice. This operation is called a coding technique. The objective of a coding technique is to bring out hidden information so that it can be investigated. Several coding techniques have been proposed; they fall into two broad families: experimental coding and synthetic coding [36]. Experimental coding techniques use experimentally derived tables so that the resulting signal reflects the chemical and structural properties of DNA. Examples are the electron–ion interaction pseudo-potential (EIIP) mapping [37] and the structural bending trinucleotide coding (PNUC) [38]. PNUC coding is based on measurements of the bending (curvature) associated with trinucleotides during nucleosome positioning, so its characteristics derive from the coiled structure of DNA. EIIP mapping, on the other hand, is based on the measured energy of the electrons delocalized in nucleotides and amino acids. Various studies have used these coding techniques and have shown their effectiveness in characterizing the coding regions of DNA [36].
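As an illustration of an experimental coding, the sketch below maps a sequence to an EIIP signal. The four numerical values are the ones commonly tabulated for nucleotides in the EIIP literature; they are not given in this text and should be checked against [37] before use. The function name `eiip_signal` is hypothetical.

```python
import numpy as np

# Commonly tabulated EIIP values for nucleotides (assumption: not taken from
# this paper; verify against the original EIIP reference before relying on them).
EIIP = {"A": 0.1260, "C": 0.1340, "G": 0.0806, "T": 0.1335}

def eiip_signal(seq):
    """Replace each nucleotide by its EIIP value (raises KeyError on non-ACGT)."""
    return np.array([EIIP[s] for s in seq.upper()])

signal = eiip_signal("acgt")
print(signal)
```

Each position receives a value fixed in advance by a table, which is exactly the dependence on a numerical assignment that the FCGS coding avoids.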

Binary coding and the random walk are two linear synthetic codings. Binary coding simply assigns 1 or 0 to indicate the presence or absence of a given nucleotide base at each position of the original sequence. The random walk assigns the value -1 if the base is a pyrimidine (C or T) and 1 if the base is a purine (A or G).
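The two synthetic codings described above can be sketched in a few lines. This is a minimal illustration that assumes the sequence contains only A, C, G and T; the function names are hypothetical, and the cumulative sum is included because the "walk" is usually understood as the running sum of the ±1 steps.

```python
import numpy as np

def binary_indicator(seq, base):
    """1 where `seq` contains `base`, 0 elsewhere."""
    return np.array([1 if s == base else 0 for s in seq.upper()], dtype=int)

def purine_pyrimidine_walk(seq):
    """Return the +1 (purine A/G) / -1 (pyrimidine C/T) steps and their running sum.

    Assumes only A, C, G, T are present; any other symbol would be counted
    as a pyrimidine step here.
    """
    steps = np.array([1 if s in "AG" else -1 for s in seq.upper()], dtype=int)
    return steps, np.cumsum(steps)

seq = "ATGCGA"
indicator_a = binary_indicator(seq, "A")
steps, walk = purine_pyrimidine_walk(seq)
print(indicator_a, steps, walk)
```

Binary coding yields one indicator signal per base, so a full description of a sequence needs four such signals, whereas the walk compresses the sequence into a single purine/pyrimidine signal.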

The Frequency Chaos Game Signal (FCGSk) [39], [40], [41], [42] is also a linear synthetic coding. It is a new way of representing a genomic DNA sequence as a one-dimensional signal, inspired by Chaos Game theory [39], [40], [43]. The idea of FCGSk coding is to assign to each group of k nucleotides in the DNA sequence the frequency of occurrence of the corresponding sub-pattern. Several signals can therefore be produced for the same input sequence, depending on the size k of the sub-patterns considered. The particularity of this technique is that it uses the statistical properties of the genome itself, which may strongly reflect the main features of specific DNA structures.
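One possible reading of this description is sketched below: slide a window of length k along the sequence and replace each window by the relative frequency of that k-mer in the whole sequence. The overlapping window with a step of one nucleotide, the normalization by the total number of windows, and the name `fcgs_signal` are assumptions; the exact construction is defined in [39], [40].

```python
from collections import Counter
import numpy as np

def fcgs_signal(seq, k=2):
    """Sketch of an FCGS_k-style signal.

    Each overlapping k-mer is replaced by its relative frequency of
    occurrence in `seq` (assumed windowing and normalization).
    """
    seq = seq.upper()
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    counts = Counter(kmers)
    total = len(kmers)
    return np.array([counts[w] / total for w in kmers])

# "ATATGC" contains the 2-mers AT, TA, AT, TG, GC: AT occurs twice out of five.
fcgs2 = fcgs_signal("ATATGC", k=2)
print(fcgs2)
```

Changing k gives a different signal for the same sequence, which is what allows the sequence to be examined at several pattern sizes.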

An advantage of the FCGSk coding method is its invariance with respect to the assignment of numerical values to nucleotides, unlike DNA walks, PNUC and EIIP, where nucleotides are assigned numerical values according to their chemical and structural characteristics (purine, pyrimidine, weak-strong hydrogen bonds, keto-amino). I. Messaoudi et al. applied the continuous wavelet transform (CWT) to the FCGSk signal [40]. They showed that this coding method characterizes different genomic sites in the C.elegans genome efficiently, independently of their size or biological function. This study was later used to analyze and classify transposable elements (TEs) of the C. elegans genome [41], [42].
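One way to read this invariance claim is that a frequency-based signal does not change when the four letters are consistently relabeled, whereas a purine/pyrimidine walk does. The small check below illustrates that reading; the relabeling table, the helper names, and the frequency construction (overlapping k-mers, relative frequency) are all assumptions made for the example.

```python
from collections import Counter

def fcgs(seq, k=2):
    """Relative frequency of each overlapping k-mer (assumed construction)."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    counts = Counter(kmers)
    return [counts[w] / len(kmers) for w in kmers]

def walk_steps(seq):
    """+1 for purines (A, G), -1 for pyrimidines (C, T)."""
    return [1 if s in "AG" else -1 for s in seq]

seq = "ATATGCGCTA"
# Hypothetical relabeling that swaps purines with pyrimidines: A<->C, G<->T.
relabel = str.maketrans("ACGT", "CATG")
relabeled = seq.translate(relabel)

same_fcgs = fcgs(seq) == fcgs(relabeled)
same_walk = walk_steps(seq) == walk_steps(relabeled)
print(same_fcgs, same_walk)
```

The frequency signal is unchanged because a one-to-one relabeling maps every k-mer to a distinct k-mer with the same count, while the walk flips sign at every step.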

Mapping nucleotide sequences with the FCGS technique produces a multifractal landscape. In the present paper, we perform a multifractal analysis of particular regions, exon sequences and intron sequences of the C. elegans genome based on the FCGS signal of order two (FCGS2). We use the WTMM method to calculate the multifractal spectrum. This method has been developed and used in several works; in particular, it was applied to the multifractal analysis of DNA walks, where the authors showed its robustness [29], [32]. The novelty of this work lies not only in the multifractal analysis of DNA sequences based on this new coding technique, but also in studying the relationship between the degree of multifractality of a sequence and its composition in repetitive patterns, and in creating a new methodological approach for the classification and exploration of unknown DNA sequences for use in our future work.

This paper is divided into four sections. In Section 2, we review concepts related to multifractal analysis based on the WTMM method. In Section 3, we present the particular sequences of the C.elegans genome used in our multifractal analysis, detail the steps required to generate the frequency chaos game signals and describe the methodology adopted for the multifractal analysis of the C.elegans genome. We present and discuss the results in Section 4. Finally, in Section 5, we conclude the paper.
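To give a feel for how a WTMM-style estimate is computed from a one-dimensional signal, here is a deliberately simplified sketch. It is not the authors' implementation: it assumes a Mexican-hat analyzing wavelet, sums over the local maxima of the modulus at each scale without chaining them into maxima lines (so the supremum along each line is omitted and negative q values are less reliable), and uses a synthetic random walk as input instead of an FCGS2 signal. All function names are hypothetical.

```python
import numpy as np

def mexican_hat(t):
    """Second derivative of a Gaussian, up to sign and normalization."""
    return (1.0 - t**2) * np.exp(-t**2 / 2.0)

def cwt(signal, scales):
    """Wavelet transform by direct convolution, one row per scale (1/a normalization)."""
    signal = np.asarray(signal, dtype=float)
    out = np.empty((len(scales), len(signal)))
    for i, a in enumerate(scales):
        half = int(np.ceil(5 * a))  # wavelet support truncated at +/- 5a
        if 2 * half + 1 > len(signal):
            raise ValueError("scale too large for this signal length")
        t = np.arange(-half, half + 1) / a
        out[i] = np.convolve(signal, mexican_hat(t) / a, mode="same")
    return out

def partition_function(coeffs, q_values):
    """Z(q, a): sum of |W|^q over the local maxima of |W| at each scale."""
    mod = np.abs(coeffs)
    Z = np.empty((len(q_values), mod.shape[0]))
    for i, row in enumerate(mod):
        inner = row[1:-1]
        is_max = (inner > row[:-2]) & (inner >= row[2:]) & (inner > 1e-12)
        maxima = inner[is_max]
        for j, q in enumerate(q_values):
            Z[j, i] = np.sum(maxima ** q)
    return Z

def tau_spectrum(signal, scales, q_values):
    """Slope of log Z(q, a) against log a, for each q."""
    Z = partition_function(cwt(signal, scales), q_values)
    log_a = np.log(scales)
    return np.array([np.polyfit(log_a, np.log(z), 1)[0] for z in Z])

# Synthetic test input: a Gaussian random walk (stand-in for a coded DNA signal).
rng = np.random.default_rng(0)
signal = np.cumsum(rng.standard_normal(4096))
scales = 2.0 ** np.arange(1, 6)       # 2, 4, 8, 16, 32
q_values = np.arange(-2.0, 5.0)       # -2, -1, ..., 4

tau = tau_spectrum(signal, scales, q_values)
h = np.gradient(tau, q_values)        # h(q) = d tau / d q
D = q_values * h - tau                # Legendre transform evaluated at each q
print(tau, h, D)
```

In a full WTMM analysis the maxima are linked across scales and the supremum along each line is used, which is what stabilizes the estimate for negative q; the width of the resulting D(h) curve is then read as the degree of multifractality.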

A wavelet-based multifractal formalism

The continuous wavelet transform (CWT) was introduced by Morlet in 1983 to study seismic signals [44]. It is a powerful mathematical technique for studying complex signals because of its ability to filter out low-frequency trends in the analyzed signal. It was recognized early on as a mathematical microscope that is well adapted to revealing the hierarchy that governs the spatial distribution of the singularities of multifractal measures [32]. In this section, we focus on the wavelet

Materials

In this work, we present a multifractal analysis of particular DNA sequences of the C.elegans genome. The DNA sequences of the C.elegans genome are available in the NCBI database [46]. We chose some particular DNA sequences of the C.elegans genome: HELITRON, CERP3 and CEREP55. We also selected a DNA segment from C.elegans chromosome I and extracted its exon and intron sequences.

Helitrons are a specific class of DNA. They belong to the category of transposable elements (TEs). They first

Results and discussion

This work aims to characterize the particular DNA sequences of the C.elegans genome described in Table 1 by a multifractal analysis of the FCGS2 signal based on the WTMM method. We also look for a relationship between the number of different repetitive DNA elements in these particular sequences and the multifractal parameters.

For each DNA sequence, we first generate the FCGS2 signal. As an example, Fig. 3 illustrates the FCGS2 of one exon and one intron sequence of the T01A4.2 gene, HELITRON

Conclusion

The FCGSk is a new coding technique for DNA sequences. It consists of assigning to each group of nucleotides in the DNA sequence the frequency of occurrence of the corresponding sub-pattern. It is a simple and powerful visualization method that amplifies the differences between DNA sequences. It provides a multitude of signals, which makes it possible to treat the DNA sequence from different points of view, taking into account the statistical properties of the resident oligomers. This mapping shows a

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Zeineb Chebbi Babachia: Ph.D. in electrical engineering from the National Engineering School of Tunisia (ENIT). She received her master's degree in applied mathematics from the National Engineering School of Tunisia (ENIT). Her research interests include issues related to signal and image processing applied in biomedical and genomic fields.

References (51)

  • L.Q. Zhou et al.

    A fractal method to distinguish coding and non-coding sequences in a complete genome based on a number sequence representation

    J. Theoret. Biol.

    (2005)
  • G. Xiong et al.

    Time-singularity multifractal spectrum distribution based on detrended fluctuation analysis

    Physica A

    (2015)
  • G. Xiong et al.

    Multifractal spectrum distribution based on detrending moving average

    Chaos Solitons Fractals

    (2014)
  • A. Arneodo et al.

    Wavelet based fractal analysis of DNA sequences

    Physica D

    (1996)
  • A. Arneodo et al.

    What can we learn with wavelets about DNA sequences?

    Physica A

    (1998)
  • P. Venkatakrishnan et al.

    Singularity detection in human EEG signal using wavelet leaders

    Biomed. Signal Process. Control

    (2014)
  • A. Arneodo et al.

    Multi-scale coding of genomic information: From DNA sequence to genome structure and function

    Phys. Rep.

    (2011)
  • V.V. Kapitonov et al.

    Helitrons on a roll: eukaryotic rolling-circle transposons

    Trends Genet.

    (2007)
  • J. Watson et al.

    A structure for deoxyribose nucleic acid

    Nature

    (1953)
  • R.F. Voss, Evolution of long-range fractal correlations and 1/f noise in DNA base sequences, Phys. Rev. Lett., 68...
  • L.G.A. Alves, P.B. Winter, L.N. Ferreira, R.M. Brielmann, R.I. Morimoto and L.A.N. Amaral, Long-range correlations and...
  • B.B. Mandelbrot, Intermittent turbulence in self-similar cascades: divergence of high moments and dimension of the...
  • D. Schertzer, S. Lovejoy, F. Schmitt, Y. Chigirinskaya and D. Marsan, Multifractal cascade dynamics and turbulent...
  • E. Gerasimova et al.
    (2014)
  • C.L. Berthelsen, J.A. Glazier and S. Raghavachari, Effective multifractal spectrum of a random walk, Phys. Rev. E, 49...

Afef Elloumi Oueslati: Ph.D. in electrical engineering from the National Engineering School of Tunisia (ENIT). She is Associate Professor at the National School of Engineers of Carthage (ENICarthage). Her research interests include issues related to signal and image processing applied in biomedical and genomic fields.