Complete Information on Historical Background of Protein Databases

Historically, protein databases were compiled before nucleotide databases. In 1959, V.M. Ingram first attempted to compare sickle cell haemoglobin with normal haemoglobin and demonstrated their homology. In due course, other proteins with similar biological functions were also compared.

This led to more protein sequencing and the accumulation of a vast amount of information. Hence, the need was realised for databases so that proteins could be compared quickly using computer software.

In 1962, using sequence variability, Zuckerkandl and Pauling proposed a new strategy for studying evolutionary relationships between organisms, called ‘molecular evolution’. This approach was based on the fact that similarity exists among functionally related (homologous) protein sequences.

Margaret O. Dayhoff found that protein sequences change during evolution according to certain patterns: (i) amino acids are replaced preferentially, not randomly, by amino acids with similar physico-chemical characteristics; (ii) some amino acids (e.g. tryptophan) are rarely, if ever, replaced by any other amino acid; and (iii) these observations, drawn from several sets of homologous sequences, led to the development of the point accepted mutation (PAM) model.
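The counting step behind such substitution patterns can be illustrated with a minimal Python sketch. It assumes the homologous sequences have already been aligned to equal length, and it only tallies which amino acid pairs replace one another. The function name and the toy sequences are hypothetical and are not taken from Dayhoff's data; a real PAM matrix also requires mutability estimates and normalisation, which are omitted here.

```python
from collections import Counter


def count_accepted_substitutions(aligned_pairs):
    """Tally unordered amino acid replacements across pre-aligned sequence pairs."""
    counts = Counter()
    for seq_a, seq_b in aligned_pairs:
        if len(seq_a) != len(seq_b):
            raise ValueError("aligned sequences must have equal length")
        for a, b in zip(seq_a, seq_b):
            # Skip gap columns and positions where the residue is unchanged.
            if a == "-" or b == "-" or a == b:
                continue
            # Store the pair in sorted order so K->R and R->K share one key.
            counts[tuple(sorted((a, b)))] += 1
    return counts


# Hypothetical toy alignments (one-letter amino acid codes), for illustration only.
toy_pairs = [("MKVLW", "MRVIW"), ("MKVLW", "MKILW")]
substitutions = count_accepted_substitutions(toy_pairs)
for pair, n in sorted(substitutions.items()):
    print(pair, n)
```

In this toy input the replacements occur between chemically similar residues (K/R, I/L, I/V), while tryptophan (W) is never replaced, mirroring patterns (i) and (ii) above.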

Further work on quantitative sequence comparison followed. In 1965, Dayhoff and co-workers collected all the protein sequences known at that time and catalogued them as the Atlas of Protein Sequence and Structure, first published by the National Biomedical Research Foundation (Silver Spring, MD).

Collections of such macromolecular sequences were published under this title from 1965 to 1978. The printed atlas laid the foundation for the resources on which the entire biotechnology community now depends for day-to-day work in computational biology.

The computer methods pioneered by Dayhoff and her research group are applicable to: (i) comparing protein sequences, (ii) detecting distantly related sequences and duplications within sequences, and (iii) deducing evolutionary histories from alignments of protein sequences.
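As an illustration of what comparing two protein sequences by computer involves, here is a minimal sketch of a global alignment score computed by dynamic programming, in the style of the Needleman-Wunsch algorithm. It is not Dayhoff's own method. The scoring values (match = 1, mismatch = -1, gap = -2) and the function name are assumptions chosen for the example; in practice a substitution matrix such as PAM would replace the simple match/mismatch scores.

```python
def global_alignment_score(seq_a, seq_b, match=1, mismatch=-1, gap=-2):
    """Return the best global alignment score of two sequences.

    Uses a linear gap penalty and keeps only two rows of the
    dynamic-programming table to save memory.
    """
    # Row 0: aligning an empty prefix of seq_a against prefixes of seq_b.
    prev = [j * gap for j in range(len(seq_b) + 1)]
    for i, a in enumerate(seq_a, start=1):
        curr = [i * gap]  # Column 0: prefix of seq_a against an empty seq_b.
        for j, b in enumerate(seq_b, start=1):
            diagonal = prev[j - 1] + (match if a == b else mismatch)
            up = prev[j] + gap        # gap inserted in seq_b
            left = curr[j - 1] + gap  # gap inserted in seq_a
            curr.append(max(diagonal, up, left))
        prev = curr
    return prev[-1]


print(global_alignment_score("ACDE", "ACDE"))
print(global_alignment_score("ACDE", "ACE"))
```

Higher scores indicate more similar sequences. Ranking database entries by such a score is one simple way to look for related proteins, although detecting distant relatives generally requires better scoring schemes than the one assumed here.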

In 1980, the advent of DNA sequence databases led to the next phase in sequence information, with the establishment of a data library by the European Molecular Biology Laboratory (EMBL).

The purpose of the data library was to collect, organise and distribute nucleotide sequence data and related information. Its successor is the European Bioinformatics Institute (EBI), situated at Hinxton, Cambridge, United Kingdom.

In 1984, the National Biomedical Research Foundation (NBRF) established the Protein Information Resource (PIR). Through this resource, the NBRF helps scientists identify and interpret protein sequence information.

In 1988, the National Institutes of Health (NIH), U.S.A., established the National Center for Biotechnology Information (NCBI) as a division of the National Library of Medicine (NLM) to develop information systems in molecular biology. The DNA Databank of Japan (DDBJ) at Mishima joined the data collecting collaboration a few years later.

The NCBI maintains GenBank, the NIH genetic sequence database. GenBank is an annotated collection of all publicly available nucleotide and protein sequences. Each record within GenBank represents a single contiguous (contig) stretch of DNA or RNA with annotations.

In 1988, the three partners (DDBJ, EMBL and GenBank) of the International Nucleotide Sequence Database Collaboration met and agreed to use a common format. The three centres provide separate points of data submission, yet exchange this information daily, making the same database available to the community at large.

All three centres collect direct submissions and distribute them, so that each centre holds copies of all the sequences. Hence, each can act as a primary distribution centre for these sequences.

Sequence data now accumulate day by day. Therefore, powerful software is needed to analyse the sequences. Developing algorithms (an algorithm is any sequence of actions, e.g. computational steps, that performs a particular task) requires a firm basis in mathematics.

Mathematicians, biologists and computer scientists are now taking a keen interest in bioinformatics. Moreover, biologists are eager to query repositories of all such information because they are widely interconnected through networks.

Thus bioinformatics aims at (i) developing powerful software for data analysis, and (ii) benefiting researchers by disseminating scientifically investigated knowledge. Nucleotide and amino acid monomers are each represented by a limited alphabet.

The properties of biopolymers, i.e. macromolecules (e.g. DNA, RNA and proteins), are such that they can be transformed into sequences of digital symbols. This digital nature distinguishes genetic data from other biological data and has driven the progress of bioinformatics.
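To make the idea of digital symbols concrete, the following Python sketch packs a DNA sequence into an integer using two bits per base, since the four-letter nucleotide alphabet needs only two bits per symbol. The particular bit assignment and the function names are assumptions made for illustration; they are not a format used by any of the databases mentioned above.

```python
# Assumed 2-bit code for the four DNA bases (any fixed assignment would work).
BASE_TO_BITS = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BITS_TO_BASE = {bits: base for base, bits in BASE_TO_BITS.items()}


def encode_dna(seq):
    """Pack a DNA string into (integer, length), two bits per base."""
    value = 0
    for base in seq.upper():
        value = (value << 2) | BASE_TO_BITS[base]
    # The length is returned too, because leading 'A' bases encode as zero bits.
    return value, len(seq)


def decode_dna(value, length):
    """Recover the DNA string from its packed integer and original length."""
    bases = []
    for shift in range(2 * (length - 1), -1, -2):
        bases.append(BITS_TO_BASE[(value >> shift) & 0b11])
    return "".join(bases)


packed, n = encode_dna("GATTACA")
print(bin(packed), n)
print(decode_dna(packed, n))
```

Protein sequences can be treated the same way with a larger code, since the 20 standard amino acids fit into five bits per residue.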