Help

What is RegTransBase?

RegTransBase is a database of regulatory sequences and regulatory interactions on the transcriptional and posttranscriptional levels in prokaryotic genomes. RegTransBase contains experimental data and predicted sites published in scientific journals. We also plan to include our own unpublished predictions in the near future.

Each article in RegTransBase was annotated independently. Each annotation includes a list of sequence and regulatory elements as well as a number of experiments (with a short description).

How to use RegTransBase PWMs to search other genomes

  1. Open a list of the binding site alignments (http://regtransbase.lbl.gov/cgi-bin/regtransbase?page=alignment_browse).
  2. Find a regulator of interest (for example, ABC0302). Open the page with the ABC0302 binding sites alignment (http://regtransbase.lbl.gov/cgi-bin/regtransbase?page=show_alignment&matrix_id=95).
  3. Download the alignment in FASTA format (the first option in the Download section at the bottom of the page).
  4. Go to the RegPredict website (http://regpredict.lbl.gov).
  5. Start RegPredict (click Start Application).
  6. Click "Select genomes".
  7. Find the recommended taxonomic group (Bacillales; see the Recommended options section on the ABC0302 page in RegTransBase) and add all genomes from that group (or as many genomes as possible).
  8. Click "Run Profile".
  9. Select the "Sequences" tab and paste your alignment of binding sites in the FASTA format.
  10. Click "Generate profile".
  11. Set search parameters "Position from" and "Position to" (see Recommended options section on ABC0302 page in RegTransBase).
  12. Click "Run".
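To make the idea of a profile concrete, here is a minimal sketch of how a position weight matrix could be derived from a downloaded FASTA alignment of binding sites. It is not RegPredict's implementation. The pseudocount, the uniform background frequency and the three toy sites are assumptions chosen for illustration, and the function names are hypothetical.

```python
import math
from collections import Counter


def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def build_pwm(sites, pseudocount=0.5, background=0.25):
    """Return one dict of log2-odds scores per alignment column.

    Assumes equal-length sites over A, C, G, T; the pseudocount and the
    uniform background are illustrative choices, not RegPredict defaults.
    """
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")
    total = len(sites) + 4 * pseudocount
    pwm = []
    for col in range(length):
        counts = Counter(s[col] for s in sites)
        pwm.append({
            base: math.log2((counts.get(base, 0) + pseudocount) / total / background)
            for base in "ACGT"
        })
    return pwm


def score(pwm, candidate):
    """Sum the per-column scores of a candidate site of matching length."""
    return sum(column[base] for column, base in zip(pwm, candidate))


# Toy alignment (made-up sequences, not taken from RegTransBase).
toy_fasta = ">site1\nTTGACA\n>site2\nTTGACT\n>site3\nTTGATA\n"
toy_sites = [seq for _, seq in read_fasta(toy_fasta)]
toy_pwm = build_pwm(toy_sites)
```

A candidate that resembles the aligned sites, such as `score(toy_pwm, "TTGACA")`, scores higher than an unrelated one such as `score(toy_pwm, "CCCCCC")`.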

How to use RegTransBase Putative Regulons to find binding sites

  1. Find the genome of interest on the Putative Regulons page.
  2. Find the regulon of interest based on the regulator name.
  3. Get a set of upstream sequences by clicking the "Download" link in the "Upstream sequences" column of the regulons table.
  4. Start RegPredict and select the genomes of interest.
  5. Open "Discover Profiles" and paste the upstream sequences (at least three sequences).
  6. Select profile parameters (palindrome recommended) and start the search.
  7. Select the profile with the highest information content and run a search for sites in the selected genomes.
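The "information content" used to rank profiles in the last step can be illustrated with the standard Shannon-entropy definition for DNA alignments. This is a sketch that assumes equal-length sites and no small-sample correction; RegPredict may compute the value differently, and the function name is hypothetical.

```python
import math
from collections import Counter


def information_content(sites):
    """Total information content, in bits, of equal-length DNA sites.

    Each column contributes 2 - H bits, where H is the Shannon entropy of
    the observed base frequencies (2 bits is the maximum for 4 bases).
    """
    n = len(sites)
    total = 0.0
    for col in range(len(sites[0])):
        counts = Counter(s[col] for s in sites)
        entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
        total += 2.0 - entropy
    return total
```

A fully conserved column contributes 2 bits and a column with all four bases equally frequent contributes 0 bits, so a profile with higher total information content is more specific.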

Sequence elements and their hierarchy

There are five main classes of sequence elements in RegTransBase: site, gene, transcript, operon and locus.

Site. Any sequence fragment can be defined as a site, independently of its real function.
Gene. For a protein-coding gene it is its CDS. For RNAs (rRNA, tRNA, etc.) "gene" is a region from the promoter to the terminator (where they are known).
Transcript. It is a single transcription unit starting at a promoter and ending at a terminator.
Operon. This element includes one or more overlapping transcriptional units.
Locus. Locus is a sequence region that can include elements of any class mentioned above, located in a continuous fragment.

Sequence elements can be linked positionally (like genes that are inside an operon) or logically (like sites that regulate expression of one gene). To depict such links, we set "parent-to-child" relations where the "child" is a "subelement". The types of elements form a hierarchy locus -> operon -> transcript -> gene -> site, so that an element can be a subelement of any element of the same or a higher level. See the table below for a summary.

                          ELEMENTS
SUBELEMENTS   Site  Gene  Transcript  Operon  Locus
Site           +     +        +         +       +
Gene           -     +        +         +       +
Transcript     -     -        +         +       +
Operon         -     -        -         +       +
Locus          -     -        -         -       +
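The parent-to-child rule for the five main classes reduces to a comparison of hierarchy levels. The following sketch encodes it; the function and constant names are hypothetical and are not part of any RegTransBase interface.

```python
# Main element classes ordered from lowest to highest level.
LEVELS = ["site", "gene", "transcript", "operon", "locus"]


def can_be_subelement(child, parent):
    """True if `child` may be a subelement of `parent`.

    An element may be a child of any element of the same or a higher level.
    """
    return LEVELS.index(child) <= LEVELS.index(parent)
```

For example, `can_be_subelement("site", "operon")` is true, while `can_be_subelement("operon", "gene")` is false.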

There are three additional classes of sequence elements but they are rarely used: regulon, RNA secondary structure and helix.

Regulon. Regulon is a set of all sequence elements regulated by one transcriptional regulator. An element of any class can be its subelement.
RNA secondary structure. As follows from its name, this element describes the secondary structures of terminators, regulatory regions of mRNA etc. It also can be used for any other RNA. This element can be a subelement of any class mentioned above.
Helix. RNA double helix is a basic unit of the RNA secondary structure. This element corresponds to a single continuous stem in an RNA structure, so it can be a subelement of RNA secondary structure.

Regulatory elements in RegTransBase

We use regulatory elements to describe regulatory interactions controlling expression of bacterial genes and operons. There are two classes of regulatory elements in RegTransBase: regulators and effectors.

Regulator. Regulator is a protein or RNA molecule directly binding the regulated gene.

Effector. Effector is a molecule (or a physical effect, like heat shock) affecting gene expression that is not a regulator. It can be a small molecule (like a metal ion), a protein (an indirect regulator, like a sensor kinase) or even a macromolecular complex (RNA polymerase, if it stabilizes binding of a sigma factor to a promoter region).

What is "experiment" in RegTransBase?

Usually, an experiment includes experimental results describing regulation of a gene or an operon studied with one technique. However, some exceptions exist, such as site predictions or array experiments where multiple genes are studied in a single experiment.
Annotation of each experiment includes a formal description of the aim of the experiment, the list of experimental techniques used in the experiment and the list of regulatory elements studied. A short textual description is also provided.

What can I find in RegTransBase?

We collect the following types of experimental data:

  • Experiments investigating the activation or repression of a gene's (or operon's) transcription by an identified direct regulator.
  • Regulation of the gene's (or operon's) expression on the posttranscriptional level.
  • Promoter or terminator mapping.
  • Characterization of an operon's structure (cotranscription, complementation etc.).
  • Experimental evidence for the transcriptional regulatory function of a protein (or RNA) directly binding to DNA (RNA).
  • Mapping of a binding site of a regulatory protein.
  • Characterization of a regulatory mutation if the regulated gene was identified.
  • Prediction of binding sites of a regulatory protein (including alternative sigma factors).
  • Experiments investigating the RNA secondary structure of terminators and mRNA regulatory regions.

 

What will I not find in RegTransBase?

We are not interested in the following types of experimental data:

  • Experiments investigating regulation of the gene (or operon) transcription where the regulator was not identified.
  • Regulation of the protein function on the posttranslational level.
  • Regulation of the gene (or operon) expression by a regulator not binding DNA (RNA) directly (e.g. a sensor kinase).
  • Study of the protein function (except direct transcriptional regulators).
  • Experiments on cloning and sequencing.
  • Study of regulatory mutations if the regulated gene was not identified.
  • Mapping of translation starts.
  • Mutations affecting well-known regulatory proteins.
  • Experiments on regulation of an unknown gene, even if the regulator was identified (for instance, regulation of a biochemical function).
  • Prediction of promoters (except promoters of alternative sigma factors) or terminating stem-loops by sequence analysis.

Technical Questions

How is the data obtained for RegTransBase?

Data in RegTransBase is obtained by a variety of methods, which are described below.

Manual Curation

Articles are obtained and then examined by a "Curator". This curator will read an article and determine the experiments, techniques, results and different elements involved in the experiment. All of this information is then entered using an application called the "Curators Interface" and then deposited in the dbRegulation schema.

Data Import

Genome sequence and annotations are taken directly from GenBank genome files. Little attempt is made to alter the annotations from the GenBank files unless a manual inspection is done and a particular gene was determined to be missing from the annotation. This data is imported into a BioSQL schema.

How is data stored in RegTransBase?

There are three separate storage schemas for RegTransBase. They are individually described below.

dbRegulation

This is the original schema in which the article annotators deposited their information. It includes separate tables for each regulatory element type as well as tables for experiments, articles, genomes, and the relationships between all of these elements. A document explaining the dbRegulation schema can be found here and another document with a more detailed description of the fields can be found here.

Elements

Elements include genes, sites, regulators, transcripts, operons, loci, regulons, and effectors. For each article annotated, a new element is created, so a particular gene can be described multiple times if it was mentioned in multiple articles. Each of these elements carries a unique id (guid) and a type, as well as other descriptors depending on the element type.

Genes and sites carry a signature sequence which the annotator entered into the Curator Interface in order to uniquely identify a particular location on a genome. For genes, this can range from as few as 10 amino acid or nucleotide residues to as many as 255 residues. For sites, if the actual site sequence is too short to be located uniquely, a sequence longer than the initial site (but still containing the site) is annotated. Some site elements instead include positional information based on the positions of other site or gene elements in order to locate them on the genome (e.g., -230bp from gene tauA).

Each element may also describe a parent/child relationship between itself and any other element.
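As a rough picture of what such an element record holds, here is a sketch of a per-article element with a guid, a type, an optional signature sequence and child links. The class and field names are hypothetical; they do not reproduce the actual dbRegulation tables.

```python
import uuid
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Element:
    element_type: str                 # e.g. "gene", "site", "operon"
    name: str
    article_id: str                   # hypothetical identifier of the annotated article
    signature: Optional[str] = None   # signature sequence (genes and sites only)
    guid: str = field(default_factory=lambda: str(uuid.uuid4()))
    children: List["Element"] = field(default_factory=list)

    def add_child(self, child: "Element") -> None:
        """Record a parent/child relationship to another element."""
        self.children.append(child)


# The same gene mentioned in two articles yields two separate elements.
gene_in_article_1 = Element("gene", "tauA", article_id="article-1")
gene_in_article_2 = Element("gene", "tauA", article_id="article-2")
```

Because one element is created per annotated article, the two `tauA` records above have different guids even though they describe the same gene.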

Articles and Experiments

Each article has a single record that is linked to multiple experiment records. Each experiment will include a number of elements associated with it, as well as a short descriptor of the experiment result and technique used. The experiment result can be any of the following (multiple selections are allowed): Gene/operon activation, Gene/operon repression, Operon structure characterization, Promoter mapping, Regulatory site mapping, Terminator mapping, Regulatory site prediction, or Plasmid replication. Techniques can be chosen from a large list (multiple selections are allowed) and include qualifiers such as Southern blot, Array Analysis, Chemical cross-linking, DNA immunoprecipitation assay, as well as many other technique types.

BioSQL

We use a BioSQL database to hold GenBank genome sequences and annotations. BioSQL is a part of the OBDA standard and was developed as a common sequence database schema for the different language projects within the Open Bioinformatics Foundation. Full-genome GenBank files are imported into the database. We then place our annotations on top of the initial import. All information that differs from the original GenBank annotations is recorded in the seqfeature table, given a location in the location table, and stored with a source qualifier of RegTransBase. This database also contains the NCBI taxonomy database, which includes the relationships between the various species. Some additional information on BioSQL can be obtained here.

RegTransBase

The RegTransBase schema is the glue between the dbRegulation schema and the BioSQL database. It includes tables that link the following:

  • dbRegulation genome id to NCBI taxonomy id (genome_guid2ncbi_taxid).
  • dbRegulation elements (genes, sites, transcripts, operons, loci, regulators) to BioSQL seqfeatures (regulatory2genomeloc). In the case of genes, this is usually the already annotated gene from the GenBank genome file (unless it was not annotated, in which case a new gene was added manually). In the case of the rest of the elements, a new seqfeature is created in the BioSQL database and a link is created in the regulatoryguid2genomeloc table. Genes are mapped to locations on the genome using a signature sequence, and then an appropriate gene is picked at the location of the mapping (see here for more information on this process).
  • Blast Reports for each of the gene and site signature sequences (search_results_full, search_results).

How are gene regulatory elements assigned to genome locations in RegTransBase?

For each gene in the dbRegulation database, the following actions are performed.

  • The gene's signature sequence is obtained from the database. If there is no signature sequence, it is reported as an error and the next gene is obtained.
  • If the signature sequence is a nucleotide sequence, the sequence is blasted using the NCBI blastall program against a collection of sequenced bacterial genomes downloaded from the NCBI RefSeq database (options p=blastn, e=10, I=T, F=F). If the sequence is an amino acid sequence, different options are used (p=tblastn, F=F, e=1000, I=T).
  • In addition to blasting against the subset of bacterial genomes, the sequence is blasted against the full NCBI non-redundant nucleotide database (nt).
  • For each result obtained from the bacterial database search, the following are checked:
    • Is the species of the organism the same as the species described in the experiment? (To decide this, we walk up the taxonomy tree to the 'species' entry for both the query and the hit and compare the two. This allows us, for example, to map all E. coli experiments to all E. coli strains.)
    • Are there no gaps in the match alignment?
    • Is the percent identity at least 97% over the full length of the query sequence?
  • If the answer to all of the above questions is yes, the hit is marked as valid. We then attempt to find the candidate gene in the genome.
  • Each gene that overlaps the query sequence and is annotated as a 'gene' in the GenBank genome file is pulled out and set aside. From that set of genes, the overlap is calculated, and the gene with the highest overlap is marked as the correct gene. If there is a tie, the gene that is in the same orientation as the signature is used as the correct hit. If no 'gene' sequences are found, 'CDS' sequences are searched and processed with the same algorithm as described for the 'gene' sequences. The marked gene's seqfeature_id, location_id, and bioentry_id (BioSQL unique ids) are then stored in the database.
  • If no candidate gene is found, this hit is marked for manual inspection.
  • If upon manual inspection no gene is found in the location of the hit, the inspector will investigate the situation. If it is an annotation error on the part of the GenBank file, a new gene sequence is added in place of the empty space and that gene is assigned to the experiment gene id.
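The hit validation and gene selection steps above can be sketched as follows. This is an illustration, not the production pipeline: the class and function names are hypothetical, coordinates are assumed to be 1-based and inclusive, and the fallback when no tied gene shares the hit's orientation (return the first tied gene) is an assumption, since the text does not specify it.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Hit:
    start: int
    end: int
    strand: int          # +1 or -1, orientation of the signature on the genome
    identity: float      # percent identity over the full query length
    gaps: int
    same_species: bool   # result of the species-level taxonomy comparison


@dataclass
class Gene:
    name: str
    start: int
    end: int
    strand: int


def is_valid_hit(hit: Hit, min_identity: float = 97.0) -> bool:
    """Apply the three checks: same species, no gaps, identity >= 97%."""
    return hit.same_species and hit.gaps == 0 and hit.identity >= min_identity


def overlap(hit: Hit, gene: Gene) -> int:
    """Number of positions shared by the hit and the gene (0 if disjoint)."""
    return max(0, min(hit.end, gene.end) - max(hit.start, gene.start) + 1)


def pick_gene(hit: Hit, genes: List[Gene]) -> Optional[Gene]:
    """Return the gene with the largest overlap, or None for manual inspection."""
    candidates = [g for g in genes if overlap(hit, g) > 0]
    if not candidates:
        return None
    best = max(overlap(hit, g) for g in candidates)
    tied = [g for g in candidates if overlap(hit, g) == best]
    for gene in tied:
        if gene.strand == hit.strand:   # tie-break on orientation
            return gene
    return tied[0]                      # assumption: not specified in the text
```

In the pipeline described above, `pick_gene` would be called first with the features annotated as 'gene' and, if that returns nothing, again with the 'CDS' features.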

How are site regulatory elements assigned to genome locations in RegTransBase?

For each site in the dbRegulation database, the following actions are performed.

  • The site's signature sequence is obtained from the database. If there is no signature sequence, the actual site sequence is used. If there is neither a site sequence nor a signature sequence, the site is checked for relative mapping information, which describes the location of the site using distances from other reported elements. If this is also blank, it is reported as an error and the next site is obtained.
  • The signature sequence can contain ambiguities for stretches of unknown sequence. Each is marked in the signature sequence with a number that represents a run of N's of that length. If there are ambiguities in the middle of the sequence (e.g., ACTGACTAG 260 ATCAGCTAGC), the sequence is searched against the bacterial genome database using WuBlast with a modified nucleotide substitution matrix that applies no penalty for matches against an N in a sequence (p=WuBlast-blastn, E=10, gi).
  • If there are no ambiguities, the sequence is searched using NCBI blastall against the bacterial genome database (p=blastn, F=F, I=T, e=10).
  • In addition to blasting against the subset of bacterial genomes, the sequence is blasted against the full NCBI non-redundant nucleotide database (nt, same options as above).
  • For each result obtained from the bacterial database search, the following are checked:
    • Is the species of the organism the same as the species described in the experiment? (To decide this, we walk up the taxonomy tree to the 'species' entry for both the query and the hit and compare the two. This allows us, for example, to map all E. coli experiments to all E. coli strains.)
    • Are there no gaps in the match alignment except ones that line up with an ambiguity?
    • Is the percent identity at least 97% over the full length of the query sequence?
  • If the answer to all of the above questions is yes, the hit is marked as valid. If a signature sequence was used, we then attempt to find the site sequence within the signature sequence. We search both the forward and reverse orientation of the hit for the actual site sequence. If more than one is found, the first one that is in the same orientation as the signature sequence is marked as the correct one.
  • If the site contained relative mapping information, the site is mapped after all other sites and genes are mapped.
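Two of the steps above lend themselves to a short sketch: expanding the numeric ambiguity markers in a signature into runs of N, and locating the actual site inside a validated hit in either orientation. The function names are hypothetical, and `region` is assumed to be the genomic sequence of the hit given in the orientation of the signature.

```python
import re

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def expand_signature(signature):
    """Replace each number with that many N's and drop whitespace.

    'ACTGACTAG 260 ATCAGCTAGC' -> 'ACTGACTAG' + 'N' * 260 + 'ATCAGCTAGC'
    """
    expanded = re.sub(r"\d+", lambda m: "N" * int(m.group()), signature)
    return re.sub(r"\s+", "", expanded).upper()


def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string (N stays N)."""
    return seq.translate(_COMPLEMENT)[::-1]


def locate_site(region, site):
    """Return (offset, strand) of the site within the hit region, or None.

    The orientation of the signature ('+') is tried first, so a forward
    match is preferred when the site occurs in both orientations.
    """
    site = site.upper()
    pos = region.find(site)
    if pos != -1:
        return pos, "+"
    pos = region.find(reverse_complement(site))
    if pos != -1:
        return pos, "-"
    return None
```

The returned offset is 0-based and relative to the start of `region`; converting it to genome coordinates requires the hit's own start position.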

How are other (transcript, operon, locus) regulatory elements assigned to genome locations in RegTransBase?

Other elements are assigned locations based on their child elements. Following the hierarchy of sites and genes -> transcripts -> operons -> loci, each element is assigned its location based on the upper and lower bounds of its child elements. The following algorithm is used to determine the correct locations of the regulatory elements.

  • An element is obtained. All children of that element are obtained. All locations in all genomes where any of the children are located are obtained (this could include multiple genomes if multiple strains were sequenced, or if there are multiple entries for this organism).
  • The locations are grouped based on the distance from each element to its closest neighbor. If that neighbor is within 10,000 base pairs, it is brought into the group. If it is farther than 10,000 base pairs away, it is placed in its own group.
  • For each of these groups, the number of child elements is counted. All genomes where all children are present are marked as a possible genome for this element. If not all children are present in one genome, while they are all present in other similar genomes, the locations in the genome where they are not all present are discarded.
  • If no genome contains all of the children, the locations with the largest number of child elements are marked as the correct element. These are also logged and subject to manual examination.
  • In genomes where all child elements are present, the locations with the largest number of child elements are used as the correct location. If there is a tie, then all locations that are tied are used.
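The distance-based grouping of child locations can be sketched as a single pass over sorted coordinates. This assumes the intended rule is that a neighbor within 10,000 base pairs joins the current group, and that the distance is measured from the end of the group to the start of the next location; the function name and the dictionary layout are hypothetical.

```python
def group_locations(locations, max_gap=10_000):
    """Group (start, end) locations on one genome sequence.

    A location joins the current group when it starts no more than
    `max_gap` base pairs after the group's current end; otherwise it
    opens a new group. Each group records its lower and upper bounds,
    which would become the bounds of the parent element.
    """
    groups = []
    for start, end in sorted(locations):
        if groups and start - groups[-1]["end"] <= max_gap:
            groups[-1]["members"].append((start, end))
            groups[-1]["end"] = max(groups[-1]["end"], end)
        else:
            groups.append({"start": start, "end": end, "members": [(start, end)]})
    return groups
```

The group with the largest number of members would then be the preferred location for the parent element, with ties keeping all tied groups, as described in the last step.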

Information on how the precomputed alignments and position weight matrices were created

Each record in the database comprises a TFBS training set (alignment) created by an expert curator. The curator first gathered information about a known transcription factor where a set of binding sites was known, summarized a description of this factor by scanning published articles, and recorded its genomic location. The curator then annotated binding sites and their sequence, downstream gene, location in a published genome, and any published experimental evidence.

In addition, curators supplied groups of organisms that they believe could be used when searching for homologous binding sites, based on the phylogenetic distance between organisms and the presence of a conserved transcription factor. Lastly, the curator recorded default scores and the expected distance of a binding site from the start of a gene, based on examination of the existing binding sites.

With each record, we provide the binding site location with reference to a published sequence, the sequence, the gene which is affected by the binding site, the evidence for the binding, any relevant articles pertaining to that site, and the transcription factor which binds the site. We also provide for download the sequence logo for the alignment, and profiles in many different formats as well as suggest recommended options in using the profiles (cut-off scores, distance from gene, taxonomy).