As mentioned earlier, sequences of digital symbols are transformed representations of biopolymers. Indirectly, sequence data stand for the structure of the biopolymer, and structure expresses function. This is a reductionist approach, so sequence data can be treated as context free.

1. The IUPAC Symbols :

The International Union of Pure and Applied Chemistry (IUPAC) has made certain recommendations. The nomenclature system in bioinformatics is based on these recommendations.

i. Laboratories around the world follow the IUPAC nomenclature system so that their data sets can be compared uniformly and easily.

ii. For rapid reproducibility and uniformity, database institutions and editors (who publish journals and research findings) also follow the IUPAC recommendations.

For routine work, the basic IUPAC nomenclature system for nucleic acids and proteins is discussed in this section. For details, consult the IUPAC web site.

2. Nomenclature of DNA Sequences :

Nucleotides are the building blocks of DNA, and each nucleotide carries one of four bases (A, G, T and C). The symbols for these four bases are the first letters of their names: adenine, guanine, thymine and cytosine.

Often the identity of the base at a specific position cannot be determined unambiguously when sequence data are obtained experimentally.

This happens because of problems related to secondary structure, known as ‘compression’ artifacts. In a compression, secondary structure in DNA fragments alters their mobility in the gel, so that fragments of more than one size may migrate to the same position.

This problem can generally be solved by repeating the experiment and sequencing the complementary strand.

However, if ambiguities persist, the most probable base can be deduced from the gel reads, because forward and reverse reads give data from opposite strands of the DNA. They also provide information about the relative orientation of the read pairs (i.e. pairs of reads) from the same template fragment.

The symbol ‘S’ is used when it is doubtful whether a position holds G or C, but certain that it does not hold A or T. Apart from some viruses, all organisms have double stranded DNA.

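The two strands are complementary and antiparallel (one runs in the 5′-3′ direction and the other in the 3′-5′ direction). The pairing between them is called Watson and Crick base pairing. When an ambiguity symbol is encountered, the difficulty is that more than one base is possible at that position, so the symbol on the complementary strand must cover the complements of all of those bases.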

These problems are resolved by following the IUPAC system of nomenclature. At certain positions the same symbol is used in the strand and in its complement, because the symbol and its complement stand for the same set of bases (for example, S means G or C, and its complement is again S).
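The rule that an ambiguity symbol is complemented by complementing every base it can stand for can be sketched in Python. This is a minimal illustration, assuming the standard IUPAC nucleotide symbols; the names `IUPAC_BASES`, `SYMBOL_FOR` and `complement_symbol` are hypothetical helpers and do not come from any particular package.

```python
# IUPAC nucleotide symbols and the bases each one can stand for.
IUPAC_BASES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}

# Reverse lookup: a set of bases -> the symbol that represents it.
SYMBOL_FOR = {frozenset(bases): sym for sym, bases in IUPAC_BASES.items()}

# Watson and Crick pairing for the four unambiguous bases.
BASE_COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement_symbol(symbol: str) -> str:
    """Complement a symbol by complementing every base it can stand for."""
    bases = IUPAC_BASES[symbol.upper()]
    return SYMBOL_FOR[frozenset(BASE_COMPLEMENT[b] for b in bases)]


# Symbols that are identical in a strand and in its complement.
self_complementary = sorted(s for s in IUPAC_BASES if complement_symbol(s) == s)
print(self_complementary)
```

Under these assumptions only S (G or C), W (A or T) and N (any base) map onto themselves, while symbols such as R (A or G) are replaced by a different symbol, here Y (C or T).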

3. Nomenclature of Protein Sequences :

Proteins are built from 20 amino acids, each with its own symbol. In addition, a few symbols represent more than one amino acid.
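As a small illustration, the one letter symbols that cover more than one amino acid can be written as a lookup table. The sketch assumes the commonly used IUPAC-IUB one letter code, in which B means aspartic acid or asparagine, Z means glutamic acid or glutamine, and X means any or an unknown residue; `possible_residues` is a hypothetical helper name.

```python
# The 20 standard amino acids in one letter code.
STANDARD_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")

# Symbols that stand for more than one amino acid.
AMBIGUOUS_RESIDUES = {
    "B": {"D", "N"},  # Asx: aspartic acid or asparagine
    "Z": {"E", "Q"},  # Glx: glutamic acid or glutamine
}


def possible_residues(symbol: str) -> set:
    """Return the set of amino acids a one letter symbol may represent."""
    symbol = symbol.upper()
    if symbol in STANDARD_RESIDUES:
        return {symbol}
    if symbol in AMBIGUOUS_RESIDUES:
        return set(AMBIGUOUS_RESIDUES[symbol])
    if symbol == "X":  # any or unknown amino acid
        return set(STANDARD_RESIDUES)
    raise ValueError(f"unrecognised amino acid symbol: {symbol!r}")


print(sorted(possible_residues("B")))
```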

4. Directionality of Sequences :

In nucleic acids (DNA and RNA), nucleotide chains are synthesised in the 5′-3′ direction. The 5′ (five prime) end carries a phosphate group on the 5th carbon of the sugar, and the 3′ (three prime) end carries a hydroxyl group on the 3rd carbon of the sugar.

This is a universal phenomenon, so the convention is used when data are collected and stored in sequence databases. Nucleotide sequence data are deposited in the database in the form in which they were submitted or published.

Nucleotide sequences are always listed in the 5′-3′ direction, irrespective of the published order. The bases are numbered sequentially starting from the 5′ end, i.e. in the 5′ to 3′ direction. The letter ‘C’ marks the complementary strand, which is also written in the 5′-3′ orientation.

The two chains run antiparallel, i.e. one in the 5′-3′ direction and the other in the 3′-5′ direction. When sequence data are deposited, the nucleotide sequence of only one strand is submitted to the database. The sequence of the complementary strand can be deduced using various web sites or the programs in different packages.
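Deducing the complementary strand from the deposited one is a short computation: complement every base, then reverse the result so that it is again written 5′ to 3′. The following is a minimal sketch using only the Python standard library, not the method of any particular package; `reverse_complement` is a hypothetical name, and the translation table assumes the IUPAC nucleotide symbols.

```python
# Each symbol in the first string maps to the symbol at the same
# position in the second string (A<->T, C<->G, R<->Y, K<->M, B<->V, D<->H).
_COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(seq: str) -> str:
    """Return the complementary strand, written in the 5'-3' direction."""
    complemented = seq.upper().translate(_COMPLEMENT)
    return complemented[::-1]  # reverse so the result also reads 5' to 3'


forward = "ATGGCC"
print(forward, "->", reverse_complement(forward))
```

Because both strands are reported 5′ to 3′, base 1 of the deduced strand pairs with the last base of the deposited strand.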

Three letter words of nucleotides (codons) act as codes, and each code specifies an amino acid or a stop signal. In nature, every cell synthesises proteins from the N-terminus to the C-terminus (N-C), where N represents the -NH2 group and C represents the -COOH group of the amino acids.

These fundamental phenomena are universal in all organisms. Hence, protein sequences are entered in databases in this conventional N to C order. Directionality is thus a universal convention followed by the different database institutions.
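The two conventions fit together: reading a coding sequence codon by codon from its 5′ end yields the protein from its N-terminus to its C-terminus. Here is a minimal sketch, assuming the standard genetic code and a sequence that is already in the correct reading frame; `translate` and `CODON_TABLE` are hypothetical names chosen for illustration.

```python
from itertools import product

# Standard genetic code with codons enumerated in T, C, A, G order
# (TTT, TTC, TTA, TTG, TCT, ...); '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds: str) -> str:
    """Translate a coding sequence 5'-3' into a protein written N to C."""
    cds = cds.upper().replace("U", "T")  # accept RNA as well as DNA
    protein = []
    # Step through complete codons only; a trailing partial codon is ignored.
    for i in range(0, len(cds) - len(cds) % 3, 3):
        aa = CODON_TABLE.get(cds[i:i + 3], "X")  # X for ambiguous codons
        if aa == "*":  # stop codon ends translation
            break
        protein.append(aa)
    return "".join(protein)


print(translate("ATGGCCAAATAA"))
```

The first codon read (ATG) gives the N-terminal residue, and translation stops at the first stop codon.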

5. Types of Sequences used in Bioinformatics :

Several types of sequences are known to carry genetic information and are therefore used in bioinformatics. They are described in this section.

(i) Genomic DNA:

Genomic DNA is the reservoir of genetic information in all organisms. In recent years it has been routinely sequenced in many molecular biology laboratories. The genomic DNA of prokaryotes differs from that of eukaryotes: the latter is located in the nucleus and contains introns.

(ii) cDNA:

Double stranded molecules prepared by using mRNA as the template for reverse transcriptase are called cDNA. They represent the expressed genes of the genomic DNA. A substantial number of sequences have been determined from cDNA molecules and deposited in databases.

When filling in a sequence entry form, you have to tick the appropriate box to show that the sequence being deposited is cDNA. This information also needs to be provided when you want to retrieve the sequence.

(iii) Organellar DNA:

Eukaryotic cells contain different types of organelles, e.g. chloroplasts, mitochondria, the Golgi complex, the nucleus, etc. In eukaryotes, genomic DNA is found in the nucleus, and organellar DNA molecules are located in mitochondria and chloroplasts.

Organellar DNA stores its own information. It contains only a few genes, hence only a few proteins are expressed from it.

(iv) ESTs:

Craig Venter was the first to initiate the sequencing of cDNA molecules derived from mRNA. The cDNA is cloned into a vector and a cDNA library is constructed. To prepare expressed sequence tags (ESTs), individual clones are picked from the cDNA library and one sequence is generated from each end of the cDNA insert.

Normally each clone has a 5′ and a 3′ EST associated with it. The average sequence length is about 400 bases. ESTs are short and represent only fragments of genes, not complete coding sequences. Many sequencing centres have automated EST production, so ESTs are produced rapidly.

Contaminating vector, mitochondrial and bacterial sequences are removed before the ESTs are deposited in the public database (dbEST). In the database, ESTs are identified by their clone number and their 5′ or 3′ orientation.

The ESTs submitted to the public sequence databases so far were created from thousands of different cDNA libraries representing over 250 organisms.
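Since each EST is identified by its clone number and its 5′ or 3′ orientation, the two reads of a clone can be paired up with a simple record type. This is only an illustrative sketch: the `EstRead` layout, the `group_by_clone` helper and the clone identifiers are hypothetical and do not reflect the actual dbEST record format.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class EstRead:
    clone_id: str   # identifier of the cDNA clone (hypothetical format)
    end: str        # "5'" or "3'": which end of the insert was read
    sequence: str   # the read itself, written 5' to 3'


def group_by_clone(reads):
    """Collect the reads of each clone as {clone_id: {end: sequence}}."""
    clones = defaultdict(dict)
    for read in reads:
        clones[read.clone_id][read.end] = read.sequence
    return dict(clones)


reads = [
    EstRead("clone001", "5'", "ATGGCCAAA"),
    EstRead("clone001", "3'", "TTTGGCCAT"),
    EstRead("clone002", "5'", "GGGTTTAAA"),
]
paired = group_by_clone(reads)
print(sorted(paired["clone001"]))
```

In this toy data set, clone001 has both a 5′ and a 3′ read, while clone002 has only a 5′ read.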

(v) Gene Sequencing Tags (GSTs):

The enzyme mung bean nuclease (Mnase) has been found to cleave between the genes of Plasmodium falciparum. A genomic library was therefore established by digesting the P. falciparum genome with this enzyme.

It helps in identifying the genes of P. falciparum. The approach for constructing GSTs is similar to that used for ESTs: one sequence read is obtained from either end (5′ or 3′) of a clone. The sequences obtained through this approach are called GSTs.

(vi) Other Biomolecules:

The databases also contain sequences of tRNAs and small rRNAs. For example, 16S rRNA sequencing is used to trace phylogenetic relationships among species. A similar approach can be taken with other molecules. Like mRNA, rRNA can be copied into DNA, but this is rarely done.