RegTransBase is a database of regulatory sequences and regulatory interactions on the transcriptional and posttranscriptional levels in prokaryotic genomes. RegTransBase contains experimental data and predicted sites published in scientific journals. We also plan to include our own unpublished predictions in the near future.
Each article in RegTransBase was annotated independently. Each annotation includes a list of sequence and regulatory elements as well as a number of experiments (with a short description).
There are five main classes of sequence elements in RegTransBase: site, gene, transcript, operon and locus.
Site. Any sequence fragment can be defined as a site, independently of its real function.
Gene. For a protein-coding gene, the element is its CDS. For RNAs (rRNA, tRNA etc.), "gene" is the region from the promoter to the terminator (where they are known).
Transcript. It is a single transcription unit starting at a promoter and ending at a terminator.
Operon. This element includes one or more overlapping transcriptional units.
Locus. A locus is a sequence region that can include elements of any class mentioned above, located in a continuous fragment.
Sequence elements can be linked positionally (like genes inside an operon) or logically (like sites that regulate the expression of one gene). To depict such links, we set "parent-to-child" relations where the "child" is a "subelement". The types of elements form a hierarchy locus->operon->transcript->gene->site, so that a lower-level element can be a subelement of any higher-level element. See the table below for a summary.
There are three additional classes of sequence elements but they are rarely used: regulon, RNA secondary structure and helix.
Regulon. A regulon is the set of all sequence elements regulated by one transcriptional regulator. An element of any class can be its subelement.
RNA secondary structure. As follows from its name, this element describes the secondary structures of terminators, regulatory regions of mRNA etc. It also can be used for any other RNA. This element can be a subelement of any class mentioned above.
Helix. An RNA double helix is the basic unit of RNA secondary structure. This element corresponds to a single continuous stem in an RNA structure, so it can be a subelement of an RNA secondary structure.
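The parent-to-child rule for the five main classes can be illustrated with a short Python sketch. It only encodes the stated hierarchy (locus->operon->transcript->gene->site); the names HIERARCHY, RANK and can_be_subelement are hypothetical and are not part of RegTransBase, and the three rarely used classes (regulon, RNA secondary structure, helix) are deliberately left out.

```python
# Ranks follow the hierarchy stated in the text:
# locus -> operon -> transcript -> gene -> site.
# All names here are illustrative, not RegTransBase identifiers.
HIERARCHY = ["locus", "operon", "transcript", "gene", "site"]
RANK = {name: level for level, name in enumerate(HIERARCHY)}


def can_be_subelement(child: str, parent: str) -> bool:
    """Return True if `child` sits strictly below `parent` in the hierarchy."""
    return RANK[child] > RANK[parent]


print(can_be_subelement("site", "operon"))  # a site may belong to an operon
print(can_be_subelement("operon", "gene"))  # an operon may not belong to a gene
```

Because the rule compares ranks rather than adjacent levels, a site can be attached directly to a locus without intermediate gene or operon elements, which matches "a subelement of any higher-level element".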
We use regulatory elements to describe regulatory interactions controlling expression of bacterial genes and operons. There are two classes of regulatory elements in RegTransBase: regulators and effectors.
Regulator. Regulator is a protein or RNA molecule directly binding the regulated gene.
Effector. Effector is a molecule (or a physical effect, like heat shock) affecting gene expression that is not a regulator. It can be a small molecule (like a metal ion) or a protein (an indirect regulator, like sensor kinase) or even a macromolecular complex (RNA polymerase, if it stabilizes binding of a sigma factor with a promoter region).
Usually, an experiment includes experimental results describing the regulation of a gene or an operon studied with one technique. However, some exceptions exist, such as site predictions or array experiments, where multiple genes are studied in a single experiment.
Annotation of each experiment includes formal description of the aim of the experiment, the list of experimental techniques used in the experiment and the list of regulatory elements studied. A short textual description is also provided.
We collect the following types of experimental data:
We are not interested in the following types of experimental data:
Data in RegTransBase is obtained from a variety of methods. These methods are described here.
Articles are obtained and then examined by a "Curator". This curator will read an article and determine the experiments, techniques, results and different elements involved in the experiment. All of this information is then entered using an application called the "Curators Interface" and then deposited in the dbRegulation schema.
Genome sequence and annotations are taken directly from GenBank genome files. Little attempt is made to alter the annotations from the GenBank files unless a manual inspection is done and a particular gene was determined to be missing from the annotation. This data is imported into a BioSQL schema.
There are three separate storage schemas for RegTransBase. They are individually described below.
The dbRegulation schema is the original schema in which the article annotators deposited their information. It includes separate tables for each regulatory element type as well as tables for experiments, articles, genomes, and the relationships between all of these elements. A document explaining the dbRegulation schema can be found here and another document with a more detailed description of the fields can be found here.
Elements include genes, sites, regulators, transcripts, operons, loci, regulons, and effectors. For each article annotated, a new element is created, such that a particular gene could be described multiple times if it was mentioned in multiple articles. Each of these elements carries a unique id (guid) and a type, as well as other descriptors depending on the element type.
Genes and sites carry a signature sequence, which the annotator entered into the Curator Interface in order to uniquely identify a particular location on a genome. For genes, this can be as few as 10 or as many as 255 amino acid or nucleotide residues. For sites, if the actual site sequence is too short to be located uniquely, a sequence longer than the initial site (but still containing the site) is annotated. Some site elements instead include positional information based on the positions of other site or gene elements in order to locate them on the genome (e.g. -230bp from gene tauA).
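The idea of a signature that must identify exactly one genomic location can be sketched in Python as follows. This is a simplified stand-in that uses exact string matching on both strands of a nucleotide sequence; it is not the procedure used by RegTransBase, and the function names are hypothetical. Amino acid signatures are not handled.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a nucleotide sequence (A, C, G, T, N)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq.upper()))


def find_all(genome: str, pattern: str) -> list:
    """Return 0-based start positions of every (possibly overlapping) exact match."""
    hits = []
    start = genome.find(pattern)
    while start != -1:
        hits.append(start)
        start = genome.find(pattern, start + 1)
    return hits


def locate_signature(genome: str, signature: str):
    """Return (start, end, strand) if the signature matches exactly once, else None.

    Coordinates are 0-based and half-open on the forward strand; strand is
    1 for a forward match and -1 for a reverse-complement match.
    """
    genome, signature = genome.upper(), signature.upper()
    hits = [(p, p + len(signature), 1) for p in find_all(genome, signature)]
    rc = reverse_complement(signature)
    if rc != signature:  # a palindromic signature would otherwise be counted twice
        hits += [(p, p + len(rc), -1) for p in find_all(genome, rc)]
    return hits[0] if len(hits) == 1 else None


genome = "ATGCGTACCTGA"  # toy sequence, invented for illustration
print(locate_signature(genome, "CGTACC"))  # unique forward-strand match
print(locate_signature(genome, "T"))       # too short to be unique, so None
```

A signature that returns None for being ambiguous is the case the text describes for short sites: the annotated sequence is extended until it matches only once.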
Each element may also describe a parent/child relationship between itself and any other element.
Articles and Experiments
Each article has a single record that is linked to multiple experiment records. Each experiment will include a number of elements associated with it, as well as a short descriptor of the experiment result and technique used. The experiment result can be any of the following (multiple selections are allowed): Gene/operon activation, Gene/operon repression, Operon structure characterization, Promoter mapping, Regulatory site mapping, Terminator mapping, Regulatory site prediction, or Plasmid replication. Techniques can be chosen from a large list (multiple selections are allowed) and include qualifiers such as Southern blot, Array Analysis, Chemical cross-linking, DNA immunoprecipitation assay, as well as many other technique types.
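The one-article-to-many-experiments layout can be pictured with a small Python sketch. The eight result types are taken from the list above; the class and field names (Result, Experiment, Article, element_guids and so on) are hypothetical and do not reflect the actual dbRegulation table or column names.

```python
from dataclasses import dataclass, field
from enum import Enum


class Result(Enum):
    """Experiment result types as listed in the text."""
    GENE_OPERON_ACTIVATION = "Gene/operon activation"
    GENE_OPERON_REPRESSION = "Gene/operon repression"
    OPERON_STRUCTURE_CHARACTERIZATION = "Operon structure characterization"
    PROMOTER_MAPPING = "Promoter mapping"
    REGULATORY_SITE_MAPPING = "Regulatory site mapping"
    TERMINATOR_MAPPING = "Terminator mapping"
    REGULATORY_SITE_PREDICTION = "Regulatory site prediction"
    PLASMID_REPLICATION = "Plasmid replication"


@dataclass
class Experiment:
    description: str                  # short textual description
    results: set                      # one or more Result values
    techniques: set                   # technique names, e.g. "Southern blot"
    element_guids: list = field(default_factory=list)  # elements studied


@dataclass
class Article:
    title: str
    experiments: list = field(default_factory=list)    # one article, many experiments


# Invented example record, for illustration only.
article = Article(title="Hypothetical article")
article.experiments.append(
    Experiment(
        description="Invented example: repression shown and binding site mapped",
        results={Result.GENE_OPERON_REPRESSION, Result.REGULATORY_SITE_MAPPING},
        techniques={"Southern blot"},
        element_guids=["guid-gene-1", "guid-site-1"],
    )
)
print(len(article.experiments), sorted(r.value for r in article.experiments[0].results))
```

Sets are used for results and techniques because multiple selections are allowed and their order carries no meaning.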
We use a BioSQL database to hold GenBank genome sequences and annotations. BioSQL is a part of the OBDA standard and was developed as a common sequence database schema for the different language projects within the Open Bioinformatics Foundation. Full genome GenBank files are imported into the database. We then place our annotations on top of the initial import. All information that differs from the original GenBank annotations is recorded in the seqfeature table, given a location in the location table, and stored with a source qualifier of RegTransBase. This database also contains the NCBI taxonomy database, which includes the relationships between the various species. Some additional information on BioSQL can be obtained here.
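As an illustration of how features added on top of the GenBank import might be retrieved, the following Python sketch builds a heavily reduced, in-memory subset of the BioSQL seqfeature, location and term tables with sqlite3 and selects the features whose source is RegTransBase. It assumes that the "source qualifier" corresponds to seqfeature.source_term_id pointing at a term named RegTransBase; that mapping, the sample rows and the function name are assumptions made for the example.

```python
import sqlite3

# Reduced subset of three BioSQL tables; the real schema has more columns
# and constraints. The rows are invented sample data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE term (term_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE seqfeature (seqfeature_id INTEGER PRIMARY KEY, bioentry_id INTEGER,
                         type_term_id INTEGER, source_term_id INTEGER,
                         display_name TEXT);
CREATE TABLE location (location_id INTEGER PRIMARY KEY, seqfeature_id INTEGER,
                       start_pos INTEGER, end_pos INTEGER, strand INTEGER);
INSERT INTO term VALUES (1, 'gene'), (2, 'site'), (3, 'GenBank'), (4, 'RegTransBase');
INSERT INTO seqfeature VALUES (10, 1, 1, 3, 'geneA'), (11, 1, 2, 4, 'siteA');
INSERT INTO location VALUES (100, 10, 500, 1400, 1), (101, 11, 430, 450, 1);
""")

QUERY = """
SELECT sf.display_name, ty.name, loc.start_pos, loc.end_pos, loc.strand
FROM seqfeature AS sf
JOIN term AS src ON src.term_id = sf.source_term_id
JOIN term AS ty ON ty.term_id = sf.type_term_id
JOIN location AS loc ON loc.seqfeature_id = sf.seqfeature_id
WHERE src.name = ?
ORDER BY loc.start_pos
"""


def features_from_source(connection, source_name):
    """Return (name, type, start, end, strand) rows for one feature source."""
    return connection.execute(QUERY, (source_name,)).fetchall()


print(features_from_source(conn, "RegTransBase"))
```

Filtering on the source term keeps the original GenBank features and the added annotations in the same tables while still allowing them to be told apart.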
The RegTransBase schema is the glue between the dbRegulation schema and the BioSQL database. It includes tables that link the following:
- dbRegulation genome id to NCBI taxonomy id (genome_guid2ncbi_taxid).
- dbRegulation elements (genes, sites, transcripts, operons, loci, regulators) to BioSQL seqfeatures (regulatory2genomeloc). In the case of genes, this is usually the already annotated gene from the GenBank genome file (unless it was not annotated, in which case a new gene was added manually). For all other elements, a new seqfeature is created in the BioSQL database and a link is created in the regulatoryguid2genomeloc table. Genes are mapped to locations on the genome using a signature sequence, and then an appropriate gene is picked at the location of the mapping (see here for more information on this process).
- Blast Reports for each of the gene and site signature sequences (search_results_full, search_results).
For each gene in the dbRegulation database, the following actions are performed.
For each site in the dbRegulation database, the following actions are performed.
Other elements are assigned locations based on their child elements. Following the hierarchy of sites and genes -> transcripts -> operons -> loci, each element is assigned its location based on the upper and lower bounds of its child elements. The following algorithm is used in determining correct locations of the regulatory elements.
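As a rough illustration of the bounding rule only (not the RegTransBase algorithm itself), the Python sketch below derives the span of a parent element from the lowest start and highest end of its children, working bottom-up. The Element class, its fields and the sample coordinates are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Element:
    name: str
    kind: str                          # "site", "gene", "transcript", "operon", "locus"
    span: Optional[tuple] = None       # (start, end) when mapped directly, else None
    children: list = field(default_factory=list)


def assign_location(element: Element):
    """Return the element's span, deriving it from its children when unmapped."""
    if element.span is not None:       # genes and sites are mapped directly
        return element.span
    child_spans = [assign_location(child) for child in element.children]
    child_spans = [s for s in child_spans if s is not None]
    if not child_spans:                # nothing to derive a location from
        return None
    element.span = (min(start for start, _ in child_spans),
                    max(end for _, end in child_spans))
    return element.span


# Invented coordinates, for illustration only.
site = Element("siteA", "site", span=(430, 450))
gene_a = Element("geneA", "gene", span=(500, 1400))
gene_b = Element("geneB", "gene", span=(1420, 2300))
transcript = Element("t1", "transcript", children=[site, gene_a, gene_b])
operon = Element("op1", "operon", children=[transcript])
locus = Element("loc1", "locus", children=[operon])
print(assign_location(locus))
```

The recursion resolves transcripts before operons and operons before loci, so each level only ever reads spans that have already been fixed below it.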
Each record in the database comprises a TFBS training set (alignment) created by an expert curator. The curator first gathered information about a known transcription factor where a set of binding sites was known, summarized a description of this factor by scanning published articles, and recorded its genomic location. The curator then annotated binding sites and their sequence, downstream gene, location in a published genome, and any published experimental evidence.
In addition, curators supplied groups of organisms that they believe could be used when searching for homologous binding sites, based on the phylogenetic distance between organisms and the presence of a conserved transcription factor. Lastly, the curator recorded default scores and the expected distance of a binding site from the start of a gene, based on examination of the existing binding sites.
With each record, we provide the binding site location with reference to a published sequence, the sequence, the gene which is affected by the binding site, the evidence for the binding, any relevant articles pertaining to that site, and the transcription factor which binds the site. We also provide for download the sequence logo for the alignment and profiles in many different formats, and we suggest recommended options for using the profiles (cut-off scores, distance from gene, taxonomy).
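To show how an alignment of binding sites can be turned into a profile and used with a cut-off score, here is a minimal Python sketch of a log-odds position weight matrix. It assumes a uniform background (0.25 per base), a pseudocount of 1 and sequences containing only A, C, G and T; the example sites, the cut-off and the function names are invented, and the profile formats and default scores distributed with the records may be computed differently.

```python
import math


def build_pwm(sites, pseudocount=1.0):
    """Build a log-odds matrix (one dict per column) from equal-length aligned sites."""
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all aligned sites must have the same length")
    total = len(sites) + 4 * pseudocount
    pwm = []
    for i in range(width):
        column = [site[i].upper() for site in sites]
        # log2 of (smoothed base frequency / uniform background of 0.25)
        pwm.append({base: math.log2(((column.count(base) + pseudocount) / total) / 0.25)
                    for base in "ACGT"})
    return pwm


def score(pwm, window):
    """Sum the column scores for one window of the same width as the matrix."""
    return sum(column[base] for column, base in zip(pwm, window.upper()))


def scan(pwm, sequence, cutoff):
    """Return (position, score) for every window scoring at or above the cut-off."""
    width = len(pwm)
    hits = []
    for i in range(len(sequence) - width + 1):
        s = score(pwm, sequence[i:i + width])
        if s >= cutoff:
            hits.append((i, s))
    return hits


training_sites = ["TTGACA", "TTGACT", "TTGATA"]  # invented alignment
pwm = build_pwm(training_sites)
for position, value in scan(pwm, "GGGTTGACAGGG", cutoff=5.0):
    print(position, round(value, 2))
```

Raising the cut-off trades sensitivity for specificity, which is why a recommended value per profile is useful alongside the expected distance from the gene start.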