Molecular biology went through an initial slow phase of growth and progress following the elucidation of the molecular structure of DNA by J. D. Watson, F. H. C. Crick and M. H. F. Wilkins in 1953.

Following this landmark discovery, DNA became the main focus of intense research in the biological sciences over the subsequent three decades. Consequently, the intricate mechanism by which DNA functions was understood. The fundamental dogma that DNA contains, in coded form, all the information necessary for the synthesis of the proteins (polypeptides) of a living organism was established.

DNA does this in two steps. In the first, it transfers the coded information to its recruited agents, the messenger RNAs; in the second, the messenger RNA is decoded, or translated, into the language of a polypeptide (a specific sequence of amino acids) on a platform, the ribosome.

The first step is known as transcription and the second as translation. This flow of information from DNA to RNA to protein has been referred to as the central dogma of molecular biology. Proteins finally carry out the necessary functions of the body. This dogma operates as though it is regulated by a sense of time (when) and space (where).


This function is executed so faithfully that biologists designated DNA as the blueprint of life. The part of the DNA responsible for a polypeptide was referred to as a gene. There will be at least as many genes as the number of polypeptides.

Each gene carries a specific coded message written in an alphabet of four letters (A, T, G and C), which stand for the nucleotides. Next, biologists set out to ascertain the number and the order (sequence) of the nucleotides in each code or gene. In molecular biology, this is known as DNA sequencing.
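Once a gene has been sequenced, its "number and order" of nucleotides can be handled as an ordinary text string. The following is a minimal Python sketch of that idea; the function name `base_composition` and the example sequence are made up for illustration and do not come from any particular tool.

```python
from collections import Counter


def base_composition(sequence):
    """Count how often each of the four nucleotides occurs in a DNA string."""
    seq = sequence.upper()
    counts = Counter(seq)
    # Reject anything outside the four-letter DNA alphabet.
    unexpected = set(counts) - set("ATGC")
    if unexpected:
        raise ValueError(f"unexpected symbols: {sorted(unexpected)}")
    return {base: counts.get(base, 0) for base in "ATGC"}


example = "ATGCGCTA"  # hypothetical 8-nucleotide sequence
print(len(example))               # number of nucleotides
print(base_composition(example))  # how many of each letter
```

The order of the nucleotides is simply the order of the characters in the string, so `example[0]` is the first nucleotide and `example[-1]` the last.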

Hand in hand with DNA sequencing, proteins were also sequenced, and the amino acid sequences of a wide variety of proteins became known.

The amino acid sequence of a protein was correlated with the nucleotide sequence of its gene. Each protein has a specific amino acid sequence. The amino acid sequence determines the three-dimensional (3-D) structure, and the 3-D structure determines the function. These points may be summarized as follows:


1. A DNA sequence determines the amino acid sequence of a protein.

2. The amino acid sequence determines its 3-D structure.

3. The 3-D structure determines its biological function.
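The first relationship, together with the two steps of transcription and translation, can be illustrated with a short Python sketch. It rests on stated simplifications: the input is assumed to be the coding (sense) strand, so transcription reduces to replacing T with U; the codon table is a small illustrative subset of the 64-codon genetic code; and the function names and the example gene are hypothetical.

```python
# Illustrative subset of the standard genetic code (not the full 64 codons).
CODON_TABLE = {
    "AUG": "Met",
    "UUU": "Phe", "UUC": "Phe",
    "GGU": "Gly", "GGC": "Gly",
    "GCU": "Ala",
    "AAA": "Lys",
    "UGG": "Trp",
    "UAA": "Stop", "UAG": "Stop", "UGA": "Stop",
}


def transcribe(coding_dna):
    """Step 1 (simplified): coding-strand DNA -> mRNA by replacing T with U."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna):
    """Step 2: read codons from the first AUG until a stop codon."""
    start = mrna.find("AUG")
    if start == -1:
        return []  # no start codon, no polypeptide
    peptide = []
    for i in range(start, len(mrna) - 2, 3):
        # Codons missing from the subset table are reported as "Xaa".
        residue = CODON_TABLE.get(mrna[i:i + 3], "Xaa")
        if residue == "Stop":
            break
        peptide.append(residue)
    return peptide


gene = "ATGTTTGGCAAATGGTAA"  # hypothetical gene, coding strand
mrna = transcribe(gene)
print(mrna)                        # AUGUUUGGCAAAUGGUAA
print("-".join(translate(mrna)))   # Met-Phe-Gly-Lys-Trp
```

Changing a single nucleotide in `gene` can change the resulting amino acid sequence, which is the sense in which the DNA sequence determines the protein sequence.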

The discovery of restriction endonucleases and plasmids, and the cloning of genes, in the last quarter of the last century set the stage for an exponential growth of molecular biology and consequently gave birth to recombinant DNA technology.

The benefits of this technology were soon applied to human society and, consequently, an integrated branch of science emerged: biotechnology. Not just gene functions but also gene sequences were important in the search for the secrets of the blueprint of life. Then the idea of sequencing the entire human genome was hatched. Finally, the human genome sequencing project took shape.

Prior to the execution of this project, the genomes of some model organisms like Escherichia coli, Saccharomyces cerevisiae (yeast), Caenorhabditis elegans (nematode), Drosophila melanogaster (fruit fly), Arabidopsis thaliana (weed) and Mus musculus (mouse) were taken up separately to understand the sequencing strategy of the entire genome. The nucleotide sequence of the entire genome of these organisms was determined.

Following this success, the roughly three billion base pairs of the entire human genome were sequenced with utmost precision by two rival groups of molecular biologists. Over time, new genomes were also sequenced and the sequences were added to the large repertoire of existing base pair sequences.


The same was the case with proteins. Thousands of known proteins were sequenced and their amino acid sequences presented. The figure depicts an exponential growth of the sequence data of nucleic acids and proteins over a period of 15 years. The amino acid sequences were also used to predict the 3-D structures of these proteins. Corresponding to the number and sequence of amino acids, the 3-D structures were also projected.

Figure: Growth of GenBank and Protein Data Bank data over a period of 15 years: (a) GenBank nucleic acid data; and (b) Protein Data Bank data.

Storing and managing this huge body of data was tremendously difficult. Application of information technology was the obvious answer.


A computer could store, manage and analyze a very large volume of data in a short span of time, which a human could not. Moreover, these data could be shared by many people by networking several computers through the Internet. Thus there was a marriage between biological science and information technology.

This gave birth to a new integrated discipline known as "bioinformatics". The term was coined in the mid-1980s to encompass the application of information technology to storing, managing and analyzing biological data. It is synonymous with the storage, management and analysis of DNA, RNA and protein sequence data.

In essence, it is all about the sequence analysis of DNA, RNA and proteins. Computational tools are required for analysis of the data. Such tools had been available since the 1960s, but were of little interest for biological applications until advances in sequencing technology led to an increase in the number of sequences and databases.