Genome annotation is the transformation of raw genomic data into organised knowledge that provides a new and improved understanding of genome organisation and regulation. For the computational biologist, genome annotation refers to the process of assigning ‘features’ or ‘labels’ to raw DNA sequences.
It is done by integrating information from the sequence with computational tools, auxiliary data and biological knowledge. Gene prediction requires combining algorithms with different types of biological databases.
Since the early 1980s, in silico gene prediction has evolved from simple methods based on coding region statistics to sophisticated methodologies that incorporate biological constraints into computational algorithms.
In silico gene prediction developed largely because of the Human Genome Project. The term refers to the computational tools and algorithms that are used in this step of genome annotation. Gene prediction remains the most important and most widely used of all genome annotation tasks.
The various algorithms carry out gene prediction using known genes as training data. Most of this information is gathered from genes that have been identified experimentally. You know that genes are present in the genome, but you cannot count their exact number.
It is unclear how they should be counted. However, you can predict the number of genes that an organism possesses, and the final figure is based on a count of the predicted genes.
Such counts show that the human genome contains relatively few genes (about 30,000) for its large size of about 3×10^9 bp, whereas the worm C. elegans has about 18,000 genes in a genome of about 1×10^8 bp.
Functional genes make up less than 5% of the human genome, while in C. elegans about 27% of the genome is functional. Scientists are of the opinion that the number of genes in humans should be around 40,000 to 50,000.
In microbial genomes, 40-50% of genes may code for proteins of unknown function, and 20-30% of genes may encode unknown proteins that are unique to the species.
1. Gene Prediction Algorithms
There are several algorithms for gene prediction as given below:
(a) Homology-based Gene Prediction:
It is traditionally the first and most commonly used approach for discovering new genes. Homology-based gene prediction falls into two categories:
(i) Gene Prediction through Detection of Homology to Known Proteins:
This method translates the DNA sequence in all six possible reading frames (three on each strand) and aligns the translations with databases of known proteins.
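The six-frame translation step can be illustrated with a minimal Python sketch. It assumes the standard genetic code and plain A/C/G/T input. The function names (`reverse_complement`, `translate`, `six_frame_translations`) are chosen here for illustration and are not taken from any particular tool. The alignment against a protein database is not shown.

```python
from itertools import product

# Standard genetic code, built from the conventional T, C, A, G ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna):
    """Return the reverse complement of an uppercase A/C/G/T string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons; codons with unknown bases become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna):
    """Return translations for frames +1, +2, +3, -1, -2 and -3."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    for name, protein in six_frame_translations("ATGGCCTAA").items():
        print(name, protein)
```

Each of the six protein strings would then be searched against a protein database. A stop codon appears as `*` in the output.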
(ii) Gene Prediction through Comparison with Expressed Sequence Tags (EST) Database:
ESTs have been described earlier. With appropriate sequence alignment parameters, about 90% of the genes annotated on human genomic DNA are detected by ESTs.
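How such a detection rate might be computed can be sketched as follows, assuming the EST alignments to the genomic sequence have already been produced by some alignment tool. The gene names, the coordinates and the function names are hypothetical and serve only to show the overlap count.

```python
def overlaps(a, b):
    """True if two closed intervals (start, end) share at least one position."""
    return a[0] <= b[1] and b[0] <= a[1]


def fraction_detected(genes, est_hits):
    """Fraction of annotated genes overlapped by at least one EST alignment.

    genes    : dict mapping gene name -> (start, end) on the genomic sequence
    est_hits : list of (start, end) intervals where ESTs aligned
    """
    if not genes:
        return 0.0
    detected = sum(
        1 for span in genes.values() if any(overlaps(span, hit) for hit in est_hits)
    )
    return detected / len(genes)


if __name__ == "__main__":
    # Hypothetical annotation and EST alignment coordinates.
    genes = {
        "geneA": (100, 500),
        "geneB": (900, 1200),
        "geneC": (2000, 2500),
        "geneD": (3000, 3300),
    }
    est_hits = [(450, 600), (1000, 1100), (2600, 2700)]
    print(f"{fraction_detected(genes, est_hits):.0%} of genes detected")
```

A real analysis would also have to decide how much overlap counts as detection. Here a single shared position is enough, which is a simplifying assumption.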
(b) Ab Initio Gene Prediction:
It comprises the class of ‘statistical learning’ algorithms that are used for in silico gene recognition. There are several strategies of ab initio gene prediction, based on oligonucleotide usage, Markov models, statistical pattern recognition and classification, and neural networks.
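As a rough illustration of a statistical approach based on oligonucleotide usage, the sketch below trains two first-order Markov chains (dinucleotide transition probabilities), one on known coding sequences and one on non-coding sequences, and scores a query by its log-odds ratio. This is a deliberately simplified model. Real gene finders use higher-order and frame-aware models. The training sequences and function names are invented for the example.

```python
import math

ALPHABET = "ACGT"


def train_markov(seqs, pseudocount=1.0):
    """Estimate first-order transition probabilities P(next base | base)."""
    counts = {a: {b: pseudocount for b in ALPHABET} for a in ALPHABET}
    for seq in seqs:
        seq = seq.upper()
        for a, b in zip(seq, seq[1:]):
            if a in counts and b in counts[a]:  # skip ambiguous bases such as N
                counts[a][b] += 1
    return {
        a: {b: counts[a][b] / sum(counts[a].values()) for b in ALPHABET}
        for a in ALPHABET
    }


def log_odds(seq, coding, noncoding):
    """Sum of log(P_coding / P_noncoding) over adjacent base pairs.

    A positive score means the sequence looks more like the coding set.
    """
    seq = seq.upper()
    return sum(
        math.log(coding[a][b] / noncoding[a][b])
        for a, b in zip(seq, seq[1:])
        if a in coding and b in coding[a]
    )


if __name__ == "__main__":
    # Toy training data, invented for illustration only.
    coding_model = train_markov(["GCGCGCGCGC"])
    noncoding_model = train_markov(["ATATATATAT"])
    for query in ("GCGCGC", "ATATAT"):
        print(query, round(log_odds(query, coding_model, noncoding_model), 3))
```

The pseudocount keeps every transition probability above zero, so the logarithm is always defined even for transitions never seen in training.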
(c) Syntenic Gene Prediction:
Syntenic gene prediction is gene recognition that uses cross-species sequence comparisons to identify and align relevant regions. The presence of exonic features at corresponding positions is searched for in both species simultaneously. The reasoning behind syntenic gene prediction is simple.
During evolution, exons (i.e. functional regions of the DNA sequence) tend to be more highly conserved than non-functional regions. Hence local conservation identified through comparison of the genomes of related species indicates biological function. Fig. 4.3 shows the genes in a human chromosome that are syntenic to a mouse chromosome.
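The idea that local conservation between related species hints at function can be sketched with a sliding-window identity scan over two sequences that have already been aligned to equal length (gaps written as `-`). The window size, the threshold and the toy sequences are assumptions made for this example, and the function names are illustrative only.

```python
def window_identity(aln_a, aln_b, window=10, step=5):
    """Return (start, identity) for each window of a pairwise alignment.

    Identity is the fraction of columns where both sequences carry the
    same base; gap-to-gap columns are not counted as matches.
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    results = []
    for start in range(0, len(aln_a) - window + 1, step):
        cols = zip(aln_a[start:start + window], aln_b[start:start + window])
        matches = sum(1 for x, y in cols if x == y and x != "-")
        results.append((start, matches / window))
    return results


def conserved_windows(aln_a, aln_b, window=10, step=5, threshold=0.8):
    """Start positions of windows whose identity reaches the threshold."""
    return [
        start
        for start, identity in window_identity(aln_a, aln_b, window, step)
        if identity >= threshold
    ]


if __name__ == "__main__":
    # Toy aligned sequences: the first half is conserved, the second is not.
    species_1 = "ATGGCCAAGTTTTTTTTTTT"
    species_2 = "ATGGCCAAGTACGACGACGA"
    print(window_identity(species_1, species_2, window=10, step=10))
    print(conserved_windows(species_1, species_2, window=10, step=10))
```

Windows that pass the threshold are only candidates for exons. Other functional elements are conserved as well, so this kind of scan is a filter rather than a gene call.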
2. Accuracy and Validity of Gene Prediction Algorithms
Inaccuracies in in silico gene prediction algorithms propagate down the line. They result in errors at the transcript level and the proteome level, and could ultimately affect, or at least hinder, our understanding of the biology of a species.