Wavelet-based multifractal analysis of C.elegans sequences based on FCGS signal
Introduction
Deoxyribonucleic acid (DNA) is a complex molecule found inside every cell of an organism. It contains the biological instructions that make each species unique [1]. At the molecular level, DNA consists of chains of four bases: adenine (A), thymine (T), cytosine (C) and guanine (G). The order of these bases determines DNA’s instructions, or genetic code. The genius of DNA lies not only in its complex coded instructions for life but also in its remarkably compact architecture, which allows it to hold billions of detailed instructions within a microscopic molecule. In fact, DNA has a characteristic fractal property: an extremely long length packed into a bounded and rather small space. This is why DNA is classified as a fractal [2], [3].
Fractal geometry is a useful approach for studying the regularities observed in DNA sequences [3], [4], [5], [6]. However, because DNA is complex and heterogeneous, a single fractal dimension describes only its overall fractal character [3], [4]. The multifractal formalism is better suited to fractal objects that are complex and heterogeneous [7].
Multifractal analysis first appeared with Mandelbrot's multiplicative cascade models for studying energy dissipation in fully developed turbulence [7]. It was introduced to better characterize the spatial inhomogeneity of theoretical and experimental fractal patterns, and it applies when many fractal subsets coexist [7], [8]. In this case, the analyzed object is divided into several fractal sets, each with its own fractal dimension; together these dimensions form a continuous spectrum of exponents called the singularity spectrum [9]. Multifractal analysis has been applied extensively in medical signal and image analysis [10], [11], [12], for example to electrocardiogram (ECG) and electroencephalogram (EEG) signals [13], [14], mammography [15] and DNA sequences [16].
Multifractal analysis is useful for studying a range of problems on DNA sequences. The first study of DNA multifractality was carried out in pre-genomic times by Berthelsen et al., who performed spectral and multifractal analyses of measurements [17]. This work was later used to reconstruct phylogeny from mitochondrial DNA [18], to analyse complete bacterial genomes [19] and to distinguish coding from non-coding sequences in DNA [20], [21]. Later, multifractal analysis was applied to understand how genetic information is structured in the Caenorhabditis elegans (C. elegans) model genome [22] and in the human genome [23]. The analysis was then extended to study the multifractal cross-correlation behavior of coding and non-coding DNA sequences of unequal length [16].
Calculating the multifractal spectrum is a key step in the multifractal analysis of a signal. Several numerical methods estimate this spectrum, such as the box-counting method (BC) [20], [22], [23], the wavelet leaders method (WLMF) [24], [25], detrended fluctuation analysis (DFA) [25], [26], [27], detrending moving average (DMA) [28], and wavelet transform modulus maxima (WTMM) [29], [30], [31], [32]. The BC, DFA, and WTMM methods are the most widely used for the multifractal analysis of DNA sequences [20], [22], [23], [29], [32], [33]. The BC method has intrinsic limitations and fails to fully characterize the corresponding singularity spectrum [23], [29], [34], [35]. The DFA method is a physically appealing technique that is easy to implement [33]. The WTMM method is mathematically well established; it is based on constructing chains of wavelet transform modulus maxima [32]. Both DFA and WTMM are more numerically stable than box-counting for multifractal spectrum calculation [23], [29], [33]. This is why we opt for the WTMM method in this paper.
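To make the idea of a wavelet-based spectrum estimate concrete, the following is a simplified sketch, not the full WTMM procedure. It computes a continuous wavelet transform, sums the q-th powers of the local modulus maxima at each scale to form a partition function Z(q, a), fits the scaling exponent τ(q) from log Z versus log a, and applies a numerical Legendre transform to obtain h and D(h). Several choices are assumptions made here for illustration: the Mexican hat wavelet, the 1/a normalisation, the function names (`cwt`, `tau_spectrum`, `singularity_spectrum`), and the omission of maxima-line chaining across scales. Because the supremum along maxima lines is skipped, the sketch should only be trusted for q ≥ 0.

```python
import numpy as np

def mexican_hat(t):
    """Mexican hat wavelet (zero mean, symmetric)."""
    return (1.0 - t**2) * np.exp(-0.5 * t**2)

def cwt(signal, scales):
    """Continuous wavelet transform with 1/a normalisation, one row per scale."""
    rows = []
    for a in scales:
        t = np.arange(-5 * a, 5 * a + 1) / a
        kernel = mexican_hat(t) / a
        kernel -= kernel.mean()  # enforce zero mean after sampling and truncation
        rows.append(np.convolve(signal, kernel, mode="same"))
    return np.array(rows)

def tau_spectrum(signal, scales, qs):
    """Estimate tau(q) from the scaling of Z(q, a) = sum over maxima of |W|^q."""
    modulus = np.abs(cwt(signal, scales))
    edge = int(5 * max(scales))  # discard samples affected by the borders
    log_z = np.empty((len(qs), len(scales)))
    for j, row in enumerate(modulus):
        m = row[edge:-edge]
        # Local maxima of the modulus at this scale (no chaining across scales).
        is_max = (m[1:-1] > m[:-2]) & (m[1:-1] >= m[2:])
        maxima = m[1:-1][is_max]
        for i, q in enumerate(qs):
            log_z[i, j] = np.log(np.sum(maxima ** q))
    log_a = np.log(scales)
    # Slope of log Z(q, a) versus log a gives tau(q).
    return np.array([np.polyfit(log_a, lz, 1)[0] for lz in log_z])

def singularity_spectrum(qs, tau):
    """Numerical Legendre transform: h = d tau / d q, D(h) = q h - tau(q)."""
    qs = np.asarray(qs, dtype=float)
    h = np.gradient(tau, qs)
    return h, qs * h - tau
```

For a monofractal signal such as a Brownian walk, τ(q) obtained this way should be close to a straight line in q, whereas a multifractal signal yields a curved τ(q) and a wider range of h values.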
To apply multifractal analysis to a DNA sequence, the ATCG string must first be converted into a numerical signal [36]. This is done by assigning a fixed numerical value to each nucleotide of the DNA sequence, according to the user’s choice. This operation is called a coding technique. Its objective is to bring out hidden information and so facilitate analysis. Several coding techniques have been proposed. They fall into two broad families: experimental coding and synthetic coding [36]. Experimental coding techniques use experimental tables to reflect the chemical and structural properties of DNA in the resulting signal. Examples are the electron–ion interaction pseudo-potential (EIIP) mapping [37] and the structural bending trinucleotide coding (PNUC) [38]. The PNUC coding is based on measurements of the curvature associated with trinucleotides during nucleosome positioning, so it carries characteristics derived from the coiled structure of DNA. The EIIP mapping, on the other hand, is based on the energy of delocalized electrons in nucleotides and amino acids. Various studies have used these coding techniques and have shown their effectiveness in characterizing DNA coding regions [36].
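As an illustration of an experimental coding, the sketch below maps each nucleotide to an EIIP value. The numerical values are the ones commonly tabulated in the literature and are not given in this paper, so they should be read as an assumption; the function name `eiip_signal` is hypothetical.

```python
import numpy as np

# Commonly tabulated EIIP values (assumption: not listed in the source text).
EIIP = {"A": 0.1260, "C": 0.1340, "G": 0.0806, "T": 0.1335}

def eiip_signal(seq):
    """Map a DNA string to its EIIP numerical signal, one value per nucleotide."""
    return np.array([EIIP[base] for base in seq.upper()])
```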
Binary coding and the random walk are two linear synthetic codings. Binary coding simply assigns 1 or 0 to indicate the presence or absence of a given nucleotide base at each position of the original sequence. The random walk assigns the value −1 if the base is a pyrimidine (C or T) and 1 if the base is a purine (A or G).
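These two synthetic codings can be sketched as follows. The function names are hypothetical, and the step values of the walk (+1 for purines, −1 for pyrimidines) are an assumed convention; some authors use the opposite sign.

```python
import numpy as np

def binary_indicator(seq, base):
    """Binary coding: 1 where `base` is present, 0 elsewhere."""
    return np.array([1 if s == base else 0 for s in seq.upper()])

def dna_walk(seq):
    """Random-walk coding: cumulative sum of +1 (purine) and -1 (pyrimidine) steps."""
    steps = [1 if s in "AG" else -1 for s in seq.upper()]
    return np.cumsum(steps)
```

A full binary representation uses four indicator signals, one per base, while the walk condenses the sequence into a single trajectory.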
The Frequency Chaos Game Signal (FCGS) [39], [40], [41], [42] is also a linear synthetic coding. It is a new way to represent a genomic DNA sequence as a one-dimensional signal, inspired by Chaos Game theory [39], [40], [43]. The idea is to assign to each group of k nucleotides in the DNA sequence the occurrence frequency of the corresponding sub-pattern. This makes it possible to produce several signals for the same input sequence, depending on the size k of the sub-patterns considered. The particularity of this technique is that it uses the statistical properties of the genome itself, which may strongly reflect the main features of specific DNA structures.
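A minimal sketch of this coding, under stated assumptions: each position is assigned the relative frequency of the k-mer that starts there, with overlapping k-mers and normalisation by the total number of k-mers. The exact windowing and normalisation used by the authors are not specified in this excerpt, and the function name `fcgs` is hypothetical.

```python
from collections import Counter

import numpy as np

def fcgs(seq, k=2):
    """Frequency signal of order k: frequency of the k-mer starting at each position."""
    seq = seq.upper()
    if k < 1 or k > len(seq):
        raise ValueError("k must be between 1 and the sequence length")
    # Overlapping k-mers, one per starting position (assumed windowing).
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    counts = Counter(kmers)
    total = len(kmers)
    return np.array([counts[w] / total for w in kmers])
```

Calling `fcgs(seq, k)` with different values of k yields the family of signals mentioned above for a single input sequence.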
An advantage of the FCGS coding method is its invariance with respect to the assignment of numerical values to nucleotides, unlike DNA walks, PNUC and EIIP, where nucleotides are assigned numerical values depending on their chemical and structural characteristics (purine, pyrimidine, weak-strong hydrogen bonds, keto-amino). Messaoudi et al. applied the continuous wavelet transform (CWT) to the FCGS signal [40]. They demonstrated the efficiency of this coding method in characterizing different genomic sites in the C. elegans genome, regardless of their size or biological function. This study was later used to analyze and classify transposable elements (TEs) of the C. elegans genome [41], [42].
Mapping nucleotide sequences with the FCGS technique produces a multifractal landscape. In the present paper, we perform a multifractal analysis of particular regions, exon sequences and intron sequences of the C. elegans genome, based on the FCGS signal of order two (FCGS2). We use the WTMM method to calculate the multifractal spectrum. This method has been developed and used in several works; it was applied to the multifractal analysis of DNA walks, where the authors showed its robustness [29], [32]. The novelty of this work lies not only in the multifractal analysis of DNA sequences based on this new coding technique, but also in studying the relationship between the multifractal degree of a sequence and its composition in repetitive patterns, and in creating a new methodological approach for the classification and exploration of unknown DNA sequences, for use in our future work.
The remainder of this paper is organized as follows. In Section 2, we review concepts related to multifractal analysis based on the WTMM method. In Section 3, we present the particular sequences of the C. elegans genome used in our multifractal analysis, detail the steps required to generate the frequency chaos game signals, and describe the methodology adopted for the multifractal analysis of the C. elegans genome. We present and discuss the results in Section 4. Finally, we conclude the paper in Section 5.
A wavelet-based multifractal formalism
The continuous wavelet transform (CWT) was introduced by Morlet in 1983 to study seismic signals [44]. It is a powerful mathematical technique for studying complex signals, owing to its ability to filter out low-frequency trends in the analyzed signal. It was recognized early on as a mathematical microscope well adapted to revealing the hierarchy that governs the spatial distribution of the singularities of multifractal measures [32]. In this section, we focus on the wavelet
Materials
In this work, we present a multifractal analysis of particular DNA sequences of the C. elegans genome. The DNA sequences of the C. elegans genome are available in the NCBI database [46]. We chose some particular DNA sequences of the C. elegans genome: HELITRON, CERP3 and CEREP55. We also selected a DNA segment from C. elegans chromosome I and extracted its exon and intron sequences.
Helitrons are a specific class of DNA. They belong to the category of transposable elements (TEs). They first
Results and discussion
This work aims to characterize the particular DNA sequences of the C. elegans genome described in Table 1, using multifractal analysis of the FCGS signal based on the WTMM method. We also look for a relationship between the number of different repetitive DNA elements in these particular DNA sequences and the multifractal parameters.
For each DNA sequence, we first generate the FCGS signal. Fig. 3 illustrates, for example, the FCGS signal of one exon and intron sequence of gene, HELITRON
Conclusion
The FCGS is a new coding technique for DNA sequences. It consists of assigning the frequency of occurrence of each sub-pattern to the same group of nucleotides in the DNA sequence. It is a simple and powerful visualization method that amplifies the differences between DNA sequences. It provides a multitude of signals, which makes it possible to treat the DNA sequence from different views, taking into account the statistical properties of the resident oligomers. This mapping shows a
Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Zeineb Chebbi Babachia: Ph.D. in electrical engineering from the National Engineering School of Tunisia (ENIT). She received her master's degree in applied mathematics from the National Engineering School of Tunisia (ENIT). Her research interests include issues related to signal and image processing applied in biomedical and genomic fields.
References (51)
- et al., Fractal landscape analysis of DNA walks, Physica A (1992)
- et al., Fractals related to long DNA sequences and complete genomes, Chaos Solitons Fractals (2000)
- et al., Dimensions of fractals related to languages defined by tagged strings in complete genomes, Chaos Solitons Fractals (2000)
- et al. (2009)
- et al., Differentiation of early mild cognitive impairment in brainstem MR images using multifractal detrended moving average singularity spectral features, Biomed. Signal Process. Control (2020)
- et al., Multifractal characterization of healing process after bone loss, Biomed. Signal Process. Control (2019)
- et al., Mutifractals based multimodal 3D image registration, Biomed. Signal Process. Control (2019)
- et al., Multifractal analysis of electronic cardiogram taken from healthy and unhealthy adult subjects, Physica A (2003)
- et al., Epilepsy and seizure characterisation by multifractal analysis of EEG subbands, Biomed. Signal Process. Control (2018)
- et al., Multifractal detrended cross-correlation analysis of coding and non-coding DNA sequences through chaos-game representation, Physica A (2015)
- A fractal method to distinguish coding and non-coding sequences in a complete genome based on a number sequence representation, J. Theoret. Biol.
- Time-singularity multifractal spectrum distribution based on detrended fluctuation analysis, Physica A
- Multifractal spectrum distribution based on detrending moving average, Chaos Solitons Fractals
- Wavelet based fractal analysis of DNA sequences, Physica D
- What can we learn with wavelets about DNA sequences?, Physica A
- Singularity detection in human EEG signal using wavelet leaders, Biomed. Signal Process. Control
- Multi-scale coding of genomic information: From DNA sequence to genome structure and function, Phys. Rep.
- Helitrons on a roll: eukaryotic rolling-circle transposons, TRENDS in Genetics
- A structure for ADN, Nature
Afef Elloumi Oueslati: Ph.D. in electrical engineering from the National Engineering School of Tunisia (ENIT). She is Associate Professor at the National School of Engineers of Carthage (ENICarthage). Her research interests include issues related to signal and image processing applied in biomedical and genomic fields.