Spectral domain characterization of genome sequences

An enormous amount of genomic and proteomic data is available in the public domain, and these rapidly growing volumes of genomic sequences call for faster and more sensitive algorithms for genomic study. Genomic data falls under the category of ‘Big Data’, and processing such large data sets has become a difficult task. In this regard, researchers working in data mining and pattern recognition have recognized the importance of ‘Machine Learning’ techniques for analyzing such data. In this context, this paper proposes a novel technique, ‘Spectral characterization of genomic sequences’, for DNA analysis.


I. INTRODUCTION
Genome data is now treated as 'Big Data'. Parallel programming would then be the most important tool for achieving supercomputer-like computation economically. The power spectrum of a sequence is a transformation of that sequence into frequency space. Its significant advantage is that periodic patterns in the sequence, whether hidden or latent, become evident after transformation. That is, hidden periodic information shows up as peaks in the spectrum. The periodicity of DNA sequences has already been examined using various methods such as autocorrelation and auto convolution. Various discrete transforms, such as the Discrete Fourier Transform and the Wavelet Transform, have been tried on genomic sequences and their power spectra examined. One such transform is the Rajan Transform, which, unlike other transforms, exhibits a homomorphism property in addition to being an isomorphic map. Here, homomorphism is understood in the sense of classification or pattern recognition. A case study was made on a genome sequence of Brucella suis 1330. This sequence was obtained from NCBI. Discrete Fourier Transform based power spectra of the adjoint sequences were obtained, and the spectral data were analyzed to extract hidden information pertaining to various codons. Rajan Transform based power spectra of the adjoint sequences were also obtained and examined for codon information. The results are compared and the analysis is presented in this paper.

II. POWER SPECTRA OF GENOME STRANDS
A genome strand has to be converted into a number sequence so that signal processing methods can be applied to it, both for analyzing genomic data and for extracting chromosome features that would be difficult to obtain using standard statistical methods. The question that arises at this juncture is whether spectral analysis is useful for DNA sequence analysis. The answer is affirmative. Spectral analysis plays an important role in detecting latent periodicities and in distinguishing them from other tandem repeats. Latent periodicities usually carry biological meaning. One can filter a genome sequence by single nucleotide, base type or dinucleotide, to name a few, and thereby enhance the interpretation of the signal. Spectral analysis can also be used to characterize large-scale fluctuations of base composition, and thus to analyze genome-wide patterns.

Let X(n), 0 ≤ n ≤ N−1, denote a DNA sequence of length N, where X(l) represents a sequence residue. A DNA sequence may be thought of as comprising m different residues, which come from an alphabet set A_m = {r; r = 1, 2, ..., m}. Typical alphabets are (i) for m = 4, the nucleotide alphabet A_4 = {A, C, G, T} and (ii) for m = 20, the amino acid set A_20. This generalized designation allows additional sequence residues, such as codons or other functional groups, to be studied.

Let us consider the nucleotide sequence X(n) = GCCAAAAATCAGCTAATCGC. The adjoint of A is defined as

$$A(n) = \begin{cases} 1, & X(n) = \mathrm{A} \\ 0, & X(n) \neq \mathrm{A} \end{cases}$$

Table 1 shows this coding pattern.

Triplets of nucleotides are called CODONS. Given a genome character string over A, T, G and C, that is, a strand, one can obtain a codon adjoint by searching for a specific codon and replacing each match by 1 and everything else by 0.

Four of the 64 codon adjoints obtained from X

This amounts to saying that one can construct 64 triplet codon adjoints for any given genome strand. From these 64 codon adjoints, one can construct a 'character sequence' using the procedure given below.
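The adjoint construction described in this section can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `nucleotide_adjoint` and `codon_adjoint` are hypothetical, and the codon search is assumed to slide over every starting position (overlapping windows, step 1), which the text does not state explicitly.

```python
import numpy as np

def nucleotide_adjoint(strand: str, base: str) -> np.ndarray:
    """Binary indicator sequence: 1 where strand[n] == base, 0 elsewhere."""
    return np.array([1 if s == base else 0 for s in strand], dtype=int)

def codon_adjoint(strand: str, codon: str) -> np.ndarray:
    """Binary indicator over window start positions.

    Assumption: windows overlap and advance by one position, so the
    result has length len(strand) - len(codon) + 1.
    """
    k = len(codon)
    return np.array(
        [1 if strand[i:i + k] == codon else 0
         for i in range(len(strand) - k + 1)],
        dtype=int,
    )

# Example strand from the text (the stray space in the source is removed).
X = "GCCAAAAATCAGCTAATCGC"

adjoint_A = nucleotide_adjoint(X, "A")
adjoint_AAA = codon_adjoint(X, "AAA")
print(adjoint_A)    # indicator of A along the strand
print(adjoint_AAA)  # indicator of the codon AAA along the strand
```

With a non-overlapping reading frame instead, the window loop would advance by three positions; which convention the authors used is not recoverable from this section.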

Procedure for obtaining character sequence
1. Find all 64 codon adjoints.
2. Find the sum of each adjoint.
3. Collect all 64 sum values into a sequence.

This sequence is called the 'character sequence' of the strand.

The character sequence for X

After obtaining the given strand's character sequence of length 64, one can apply signal processing tools such as autocorrelation, auto convolution, the Fourier Transform, the Wavelet Transform and the Rajan Transform, to name a few, to analyze the sequence. This paper discusses the spectral analysis of genome character sequences using the Discrete Fourier Transform and the Rajan Transform. The basic definitions of these two transforms are given below.
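The procedure for the character sequence can be illustrated as follows. Since the sum of a codon adjoint is simply the number of times that codon matches, the sketch counts matches directly. The function name `character_sequence`, the lexicographic codon order over A, C, G, T, and the overlapping step-1 window are all assumptions made for illustration; the paper does not specify the ordering or the window step.

```python
from itertools import product

def character_sequence(strand: str, alphabet: str = "ACGT"):
    """Return (codons, sums): the 64 codons and the sum of each codon adjoint.

    Assumptions: codons are listed in lexicographic order of `alphabet`,
    and matches are searched at every start position (overlapping windows).
    Windows containing symbols outside the alphabet are ignored.
    """
    codons = ["".join(p) for p in product(alphabet, repeat=3)]
    counts = {c: 0 for c in codons}
    for i in range(len(strand) - 2):
        window = strand[i:i + 3]
        if window in counts:
            counts[window] += 1
    return codons, [counts[c] for c in codons]

X = "GCCAAAAATCAGCTAATCGC"
codons, char_seq = character_sequence(X)
print(len(char_seq))                 # 64 values, one per codon
print(dict(zip(codons, char_seq)))   # codon -> adjoint sum
```

Because the result always has length 64, a power of two, it can be fed directly to fast transforms that require power-of-two lengths.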

Discrete Fourier Transform
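Consider a sequence x(n) of length N. Its Discrete Fourier Transform (DFT) pair is given, in its standard form, by the expressions

$$X(k) = \sum_{n=0}^{N-1} x(n)\, e^{-j 2\pi k n / N}, \qquad k = 0, 1, \ldots, N-1 \tag{1}$$

$$x(n) = \frac{1}{N} \sum_{k=0}^{N-1} X(k)\, e^{j 2\pi k n / N}, \qquad n = 0, 1, \ldots, N-1 \tag{2}$$

All details relevant to this transform can be found in the standard literature.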
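As a small illustration of how a periodicity appears as peaks in a DFT power spectrum, the following sketch uses NumPy's `numpy.fft.fft`. The helper name `power_spectrum` and the toy period-3 indicator sequence are assumptions chosen for illustration; they are not data from the paper.

```python
import numpy as np

def power_spectrum(x) -> np.ndarray:
    """Squared magnitude of the DFT of x (no normalization applied)."""
    spectrum = np.fft.fft(np.asarray(x, dtype=float))
    return np.abs(spectrum) ** 2

# Toy binary sequence with a 1 at every third position (N = 24).
toy = [1, 0, 0] * 8
P = power_spectrum(toy)

# Indices where the power is clearly non-zero.
peaks = np.flatnonzero(P > 1e-9)
print(peaks)      # frequency bins carrying power
print(P[peaks])   # power at those bins
```

For this toy input the power is concentrated at k = 0 (the mean) and at k = N/3 and 2N/3, which is how a period-3 pattern in an adjoint sequence would show up in its spectrum.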

Fig. 1: Signal flow graph for computing RT of x(n) = 3, 8, 5, 6, 0, 2, 9, 6

Adjoints of A, T, G and C of the genome and the 64 codon adjoints were obtained from the genome sequence. The sums of all 64 codon adjoints were calculated and the character sequence of the genome was formed. For brevity, we call this character sequence the 'codon sequence'. The codon sequence of the genome sequence of length 5806 is given below.

Codon sequence:

Fig. 2: Graphs of adjoints of X(n)

Auto correlation of a sequence y(n) of length N is calculated using the formula

Fig. 3: Auto correlation of adjoints of X(n)

Auto convolution of a sequence y(n) of length N is calculated using the formula $R_{yy}(t) = \sum_{n} y(n)\, y(t-n)$. The auto convolution of y(n) is of length 2N−1. The auto convolutions of the adjoints of X(n) are shown in Fig. 4.
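The two operations can be sketched with NumPy as below. The function names are hypothetical, and the autocorrelation is assumed to be the plain unnormalized linear (non-circular) form, sum over n of y(n) y(n+t), since its formula is not reproduced in the text; the authors may have used a normalized or circular variant.

```python
import numpy as np

def auto_convolution(y) -> np.ndarray:
    """Linear convolution of y with itself; output length is 2N - 1."""
    y = np.asarray(y, dtype=float)
    return np.convolve(y, y)  # default mode is "full"

def auto_correlation(y) -> np.ndarray:
    """Unnormalized linear autocorrelation for lags -(N-1) .. (N-1).

    Assumption: no mean removal and no normalization.
    """
    y = np.asarray(y, dtype=float)
    return np.correlate(y, y, mode="full")

y = [1, 2, 3]
print(auto_convolution(y))  # 2N - 1 = 5 values
print(auto_correlation(y))  # 5 values, symmetric about the zero lag
```

Applied to a binary adjoint, the zero-lag autocorrelation value equals the number of ones in the adjoint, that is, the sum used in the character sequence.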

Fig. 4: Auto convolution of adjoints of X(n)

Cross correlation of sequences y(n) and z(n), each of length N, is calculated using the formula

The cross

Fig. 5: Fourier Power Spectra of R_A( ), R_G( ), R_T( ), R_C( )

Similar to what has been discussed so far, one can apply the Rajan Transform to the adjoints of X(n); the resulting spectra may be called Rajan Spectra. Fig. 6 shows all such Rajan Spectra of the adjoints of X(n).
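For readers unfamiliar with the Rajan Transform, a minimal recursive sketch is given here. It assumes the definition commonly given in the literature, which is not reproduced in this section: a sequence whose length is a power of two is split into halves, the element-wise sums and absolute differences of the halves are formed, and the same step is applied recursively to each result. The function name `rajan_transform` is hypothetical.

```python
def rajan_transform(x):
    """Rajan Transform of a sequence whose length is a power of two.

    Assumed definition: with halves a and b of x,
    g = a + b and h = |a - b| element-wise, then recurse on g and h
    and concatenate the two results.
    """
    n = len(x)
    if n < 1 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return list(x)
    half = n // 2
    g = [x[i] + x[i + half] for i in range(half)]
    h = [abs(x[i] - x[i + half]) for i in range(half)]
    return rajan_transform(g) + rajan_transform(h)

# Input sequence used in the caption of Fig. 1.
x = [3, 8, 5, 6, 0, 2, 9, 6]
print(rajan_transform(x))  # [39, 5, 13, 9, 13, 1, 7, 5]

# A cyclic shift of the same sequence.
shifted = x[1:] + x[:1]
print(rajan_transform(shifted))  # [39, 5, 13, 9, 13, 1, 7, 5]
```

For this input, the cyclically shifted sequence yields the same spectrum, which is consistent with the many-to-one (homomorphic) behaviour mentioned in the Introduction. The 64-point character sequence has a power-of-two length, so the same function applies to it unchanged.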

Fig. 8: Fourier and Rajan spectra of the codon sequence of X(n) with auto correlation and auto convolution

Fig. 9 shows the Fourier and Rajan power spectra of the auto correlation and auto convolution of the codon sequence.