There are several well-developed data repositories that have facilitated the dissemination of genome and protein resources for humans and other organisms. Some of the major biological databases are given in. The most comprehensive resources are the Genome Database (GDB), the NCBI and the Mouse Genome Database (MGD).
1. The GDB :
It is the official central repository for genome mapping data created by the Human Genome Project. Its central node is located at the Hospital for Sick Children. The GDB holds a vast quantity of data submitted by hundreds of investigators. It also lists many useful genome resource web links on its Resource Page.
2. The MGD :
It is the primary public catalogue of mouse genomic resources. The MGD includes information on mouse genetic markers and nomenclature, molecular segments, phenotypes and comparative mapping data, together with graphical displays of linkage, cytogenetic and physical maps.
3. The National Center for Biotechnology Information (NCBI) :
The NCBI was established on November 4, 1988, at the National Institutes of Health (USA) to develop information systems for molecular biology. The NCBI is the foremost repository of publicly available genomic and proteomic data.
Since its establishment, the services offered by the NCBI have expanded greatly. The NCBI makes available different kinds of biological data, computational resources for the analysis of GenBank data, and data retrieval systems.
The NCBI has developed many useful resources and tools. They may be grouped into the following types:
1. Database retrieval tools
2. BLAST family – for searching DNA sequences
4. Gene level sequences
5. Chromosomal sequences
6. Genome analysis
7. Analysis of gene expression patterns
8. Molecular structure
9. LocusLink: a catalogue of information about genes and gene-based markers.
These tools have their own web sites, which may be used free of cost. You will learn the practical aspects of some of these tools in your practical sessions.
Of the above resources and tools, three sets are discussed here. Most routine bioinformatics tasks can be carried out using these three; the other resources may be used in more advanced studies.
4. Data Retrieval Tools
GenBank contains about 7 million sequence records covering about 9 billion nucleotide bases. Unless the databases can be searched easily and their entries retrieved in a usable and meaningful format, biological databases serve little purpose.
Moreover, the effort spent on sequencing will not be meaningful if the biological community as a whole cannot make use of the information hidden within millions of bases and amino acids. There are several database retrieval tools, such as Entrez, Locus Link and the Taxonomy Browser.
(a) Entrez:
The integrated database retrieval system of the NCBI is called Entrez. It is the most widely used of all biological database systems. Using the Entrez system you can access literature, sequences (both protein and nucleotide) and 3D structures. To be very clear, Entrez is not itself a database; it is the interface through which all of its component databases can be accessed and traversed.
The information available through Entrez includes PubMed records, nucleotide and protein sequence data, 3D structure information and mapping information. Hard links relate entries in one database to entries in the others, so all of this information can be reached by issuing a single query.
For a complete review of its features and complexities, you may refer to the tutorial on the Entrez system at http://www.ncbi.nlm.nih.gov:80/entrez/query/static/help/helpdoc.html.
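The idea of reaching several component databases with one query term can be sketched in a few lines of Python. The sketch below only builds search URLs; it does not contact the NCBI. It assumes NCBI's E-utilities `esearch` endpoint and its `db`, `term` and `retmax` parameters, none of which are described in this chapter, and the helper name `entrez_search_url` is hypothetical.

```python
from urllib.parse import urlencode

# Assumption: Entrez is queried through the public E-utilities "esearch" endpoint.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def entrez_search_url(db, term, retmax=20):
    """Build an esearch URL for one Entrez component database (no request is sent)."""
    params = {"db": db, "term": term, "retmax": retmax}
    return ESEARCH + "?" + urlencode(params)

# The same query term is pointed at several component databases in turn.
for db in ("pubmed", "nucleotide", "protein", "structure"):
    print(entrez_search_url(db, "BRCA1"))
```

Only the `db` value changes between the four URLs, which mirrors the way Entrez presents literature, sequence and structure records behind a single query box.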
(b) Taxonomy Browsers:
The diversity of organisms is such that millions of species are already known, and it is believed that millions more are still unknown. Once a species is known, its various features are studied and the information is stored in a database. So far, information on over 79,000 organisms has been stored.
(c) Locus Link:
Locus Link is an NCBI project that links information about specific genetic loci from several disparate databases. It provides a single query interface to various types of information regarding a given genetic locus, such as phenotypes, map locations and homologies to other genes.
Currently, the Locus Link search space includes information from human, mouse, rat, fruit fly and zebrafish. It also carries information, such as the mouse homologue of a given human gene, that is otherwise difficult to obtain from a single source.
To begin a Locus Link query, simply type the name of the gene into the query box at the top of the Locus Link home page. You can then select the gene of interest from an alphabetical list.
(d) Sequence Retrieval System (SRS):
The SRS was created by the Swiss Institute of Bioinformatics and the European Bioinformatics Institute, which also created the SWISS-PROT database. SRS allows retrieval from an extensive catalogue of more than 75 public biological databases.
The Link button in SRS retrieves all the entries in one databank that are linked to an entry (or entries) in another databank. The links between entries are presented as hyperlinks.
5. Similarity-based Database Searching :
(a) Basic Local Alignment Search Tool (BLAST):
Owing to large-scale genome sequencing projects, the flood of DNA sequence data coming into public databases is staggering. Scientists rely on deducing the function of putative genes from their similarity to well-characterised proteins.
The BLAST family of similarity search programmes offers several tools for analysing sequence information. Sequence similarity searches use alignments to determine a ‘match’. The basic operation in database searching is to align a query sequence sequentially to each subject sequence in the database.
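The basic operation just described, aligning one query against each subject sequence in turn, can be illustrated with a small Python sketch. It uses a plain dynamic-programming local alignment score (the Smith-Waterman recurrence) rather than the heuristics that BLAST and FASTA actually use. The scoring values (+2 match, -1 mismatch, -2 gap), the function names and the toy database are assumptions chosen only for illustration.

```python
def local_alignment_score(query, subject, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between two sequences (Smith-Waterman recurrence)."""
    prev = [0] * (len(subject) + 1)   # previous row of the scoring matrix
    best = 0
    for q in query:
        curr = [0]
        for j, s in enumerate(subject, start=1):
            diag = prev[j - 1] + (match if q == s else mismatch)
            # A local alignment may restart anywhere, so scores never drop below 0.
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best

def search(query, database):
    """Align the query to every subject in turn and rank subjects by score."""
    hits = [(local_alignment_score(query, seq), name) for name, seq in database.items()]
    return sorted(hits, reverse=True)

# Hypothetical toy database of subject sequences.
toy_db = {"s1": "TTACGTTT", "s2": "GGGGGG"}
print(search("ACGT", toy_db))
```

Filling one full matrix per subject is what makes an exhaustive search slow on a large database, and it is the cost that heuristic programmes try to avoid.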
Most users prefer BLAST or FASTA, both of which rely on heuristic strategies to speed up alignment searches. The theory behind BLAST is rather complex and beyond the scope of this book.
FASTA was the first widely used programme for database similarity searching. FASTA performs an optimised search for local alignments using a substitution matrix. The programme uses the observed pattern of word hits to identify potential matches before attempting the optimised search.
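The notion of a pattern of word hits can be illustrated with a short Python sketch. It is a simplified illustration and not the FASTA programme itself: it finds every short word shared by the query and the subject and counts the hits per diagonal (subject position minus query position). A diagonal with many hits suggests a region worth aligning carefully. The word length `k=3` and the function names are assumptions made for the example.

```python
from collections import Counter, defaultdict

def word_hits_by_diagonal(query, subject, k=3):
    """Count shared k-letter words, grouped by diagonal (subject pos - query pos)."""
    positions = defaultdict(list)          # word -> start positions in the query
    for i in range(len(query) - k + 1):
        positions[query[i:i + k]].append(i)
    diagonals = Counter()
    for j in range(len(subject) - k + 1):
        for i in positions.get(subject[j:j + k], ()):
            diagonals[j - i] += 1
    return diagonals

def best_diagonal(query, subject, k=3):
    """Return (diagonal, hit count) for the diagonal with most word hits, or None."""
    hits = word_hits_by_diagonal(query, subject, k)
    return hits.most_common(1)[0] if hits else None

print(best_diagonal("ACGTAC", "TTACGTACTT"))
```

Subjects with no diagonal rich in word hits can be set aside cheaply, so that the slower optimised alignment is attempted only on the promising candidates.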
A sequence in FASTA format consists of a definition line, which begins with the ‘>’ character, followed by lines of sequence characters. It may be used as input to many analysis programmes, and the format is supported by a variety of molecular biology software suites.
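A minimal Python sketch of reading FASTA-formatted text is given below. It assumes the common convention that each definition line starts with ‘>’ and that the sequence may be wrapped over several lines. The function name `parse_fasta` and the two records are made up for the example.

```python
def parse_fasta(text):
    """Return a list of (definition line, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                       # skip blank lines
        if line.startswith(">"):
            if header is not None:         # close the previous record
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)            # sequence lines are joined together
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

# Hypothetical example records (not real database entries).
example = """>seq1 hypothetical example record
ACGTACGT
ACGT
>seq2 another made-up record
MKVL
"""

for definition, sequence in parse_fasta(example):
    print(definition, len(sequence))
```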
In general, BLAST tends to be faster and detects more alignments, whereas FASTA returns fewer false hits.
6. Resources for Gene Level Sequences
The resources for gene-level sequences include several tools, e.g. UniGene, HomoloGene and RefSeq.
(a) UniGene Database:
ESTs were described in the previous section. Many redundant ESTs are generated in the course of their production, because several cDNA clones may represent the same gene (Fig. 5.4). The UniGene (one gene) database was therefore developed at NCBI to control redundancy in EST data.
UniGene clusters ESTs and other mRNA sequences, along with coding sequences (CDSs) annotated on genomic DNA, into subsets of related sequences. The clusters are specific to organisms. At present, clusters are available for human, mouse, rat, zebrafish and cattle.
The scheme for clustering ESTs is shown in Fig. 5.4 and steps are given below:
1. First search the sequences for contaminants e.g. ribosomal, mitochondrial, repetitive and vector sequences.
2. Enter the sequences (those that contain about 100 bases) into UniGene. The mRNA and genomic DNA sequences are clustered into gene links.
3. A second sequence comparison links ESTs to each other and to the gene links. All clusters are anchored, i.e. each contains either a sequence with a polyadenylation site (poly A) or two ESTs labelled as coming from the 3′ end of a clone.
4. Clone-based edges are added by linking the 5′ and 3′ ESTs that derive from the same clone.
5. Finally, unanchored ESTs and gene clusters of size 1 are compared with the UniGene clusters at lower stringency. The UniGene build is updated weekly, so the sequences that make up a cluster may change.
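The linking steps above amount to grouping sequences that are connected by similarity links or by clone-based edges. The Python sketch below shows that grouping idea with a simple union-find structure. It is not the UniGene build procedure: contaminant screening, anchoring and the lower-stringency pass are left out, and the sequence identifiers and links are hypothetical.

```python
def cluster(ids, links):
    """Group ids into clusters, where each link (a, b) places a and b together."""
    parent = {x: x for x in ids}

    def find(x):
        # Follow parent pointers to the cluster representative, shortening the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in links:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra                # merge the two clusters

    groups = {}
    for x in ids:
        groups.setdefault(find(x), []).append(x)
    return sorted(sorted(g) for g in groups.values())

# Hypothetical input: one mRNA, the 5' and 3' ESTs of one clone, and two other ESTs.
ids = ["mRNA1", "est1_5p", "est1_3p", "est2", "est3"]
similarity_links = [("mRNA1", "est1_3p"), ("est2", "est3")]   # from sequence comparison
clone_links = [("est1_5p", "est1_3p")]                        # same clone, step 4

print(cluster(ids, similarity_links + clone_links))
```

In this toy input the clone-based edge pulls the 5′ EST into the cluster of the mRNA even though the two share no direct similarity link, which is the purpose of step 4.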
(b) HomoloGene Database:
A newer UniGene resource called ‘HomoloGene’ (homologous gene) has been created. The HomoloGene database includes curated and calculated orthologs and homologs for genes from human, mouse, rat, zebrafish and cow.
This database is also available through Locus Link. Using HomoloGene, homologous relations can easily be inferred. Homologs are identified as the best match between a UniGene cluster in one organism and a cluster in a second organism.
When two sequences in different organisms are each other's best match (a reciprocal best match), the UniGene clusters corresponding to the pair of sequences are considered putative orthologs.
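The reciprocal best match rule can be written out in a few lines of Python. In the sketch below the similarity scores are invented, the cluster identifiers are hypothetical, and the function names are not part of any NCBI tool. A pair is reported as putative orthologs only when each member is the other's best match.

```python
def best_matches(scores):
    """scores maps (a, b) -> similarity score; return the best-scoring b for each a."""
    best = {}
    for (a, b), s in scores.items():
        if a not in best or s > best[a][1]:
            best[a] = (b, s)
    return {a: b for a, (b, _) in best.items()}

def putative_orthologs(scores_ab):
    """Return pairs (a, b) that are each other's best match."""
    a_to_b = best_matches(scores_ab)
    # Swap the key order to find the best match in the opposite direction.
    b_to_a = best_matches({(b, a): s for (a, b), s in scores_ab.items()})
    return sorted((a, b) for a, b in a_to_b.items() if b_to_a.get(b) == a)

# Hypothetical similarity scores between clusters of two organisms.
scores = {
    ("Hs.1", "Mm.1"): 95, ("Hs.1", "Mm.2"): 80,
    ("Hs.2", "Mm.1"): 90, ("Hs.2", "Mm.2"): 60,
}
print(putative_orthologs(scores))
```

Here Hs.2 has Mm.1 as its best match, but Mm.1 matches Hs.1 better, so only the pair Hs.1 and Mm.1 satisfies the reciprocal condition.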
(c) RefSeq Database:
RefSeq is an NCBI database that provides a curated set of reference sequence standards for naturally occurring biological molecules, ranging from chromosomes to transcripts to proteins. The RefSeq (reference sequence) project provides a stable reference sequence for all the molecules corresponding to the central dogma of biology, i.e. the flow of information from DNA -> RNA -> protein.
Did you know that curators are appointed in botanical gardens, archives and museums to look after the collections held at these centres?
Similarly, a curator or annotator is appointed to carry out the curation process in bioinformatics. Curators have extensive training in biology and are thoroughly familiar with the databases. They ensure that no sequence data is lost during the process of submission.
Here, the curator reviews and checks the newly submitted data and ensures that: (i) the biological features are described adequately, (ii) the conceptual translation of coding regions follows the universal translation rules, and (iii) all mandatory information has been given.
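Check (ii) lends itself to automation. The Python sketch below translates a coding region with the standard genetic code and compares the result with the protein supplied by the submitter. It is only an illustration of the idea, not the procedure used by database curators. The function names are hypothetical, and the sequence is assumed to contain only the letters A, C, G and T.

```python
# Standard genetic code, built from the conventional T, C, A, G codon ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}

def translate(cds):
    """Conceptually translate a coding sequence; '*' marks a stop codon."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3       # ignore a trailing partial codon
    return "".join(CODON_TABLE[cds[i:i + 3]] for i in range(0, usable, 3))

def check_cds(cds, claimed_protein):
    """Return a list of problems found in a submitted coding region."""
    problems = []
    if len(cds) % 3 != 0:
        problems.append("length is not a multiple of 3")
    protein = translate(cds)
    if "*" in protein[:-1]:
        problems.append("internal stop codon")
    if not protein.endswith("*"):
        problems.append("no terminal stop codon")
    if protein.rstrip("*") != claimed_protein:
        problems.append("conceptual translation differs from submitted protein")
    return problems

print(check_cds("ATGAAATGA", "MK"))        # a consistent toy submission
print(check_cds("ATGTAAAAATGA", "MK"))     # contains a premature stop codon
```

An empty list means that none of these particular checks failed; it does not by itself show that the biological annotation is correct.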
The NCBI provides many database resources. It is neither desirable nor meaningful to discuss all of them at this stage of the book. However, some of the important databases used in bioinformatics work are given in.