CisView: Browser of regulatory regions

Help on CisView          

1. General Information

The goal of CisView is to provide visualization and query of cis-regulatory regions in the genome. Transcription factor binding sites were identified in the entire mouse genome using 134 matrices and 219 patterns from various sources. We identified 26,611 promoters, 690,044 potential distal transcription regulators, and 22,419 modules associated with 3'UTR. Potential distal cis-regulatory modules were defined as clusters of conserved transcription factor binding sites. The CisView browser provides detailed information on regulatory sequences at various scales, and supports queries on gene features, and location of transcription factor binding sites.

Terminology

Cis-regulatory element
Binding site (BS) on the DNA for transcription factor (TF) protein which is a trans-element
Cis-regulatory module (CRM)
Cluster of TFBSs that controls complex genetic programs (Aerts et al. 2003)
Distal CRM (DCRM)
CRM located upstream or downstream of the promoter (e.g., enhancer, silencer, insulator).

2. Feedback

Use the Feedback webpage to report problems with the website or to suggest additional TFBS and/or CRMs. We will consider your suggestions during the next revision/assembly of the web site.

3. Legend for the browser

1. Seven scales for gene location in the chromosome: whole chromosome, 3-Mb, 300-Kb, 60-Kb, 4-Kb, 500-bp, and 80-bp windows. You can navigate in all windows by clicking on a gene or location. Navigation in the whole chromosome, 3-Mb, and 300-Kb windows will lead to a new 60-Kb region that is centered at a TSS of another gene or an alternative TSS of the same gene. Navigation in all other windows will only change the zoom area (yellow band). At the 500-bp and 80-bp scales, click on the sequence to change the position.
2. Magenta triangle shows the location of the gene/position.
3. The upper line in each pair of lines represents the positive strand, and the lower line represents the negative strand. Magenta boxes in the 3-Mb window are individual genes. In other window scales, genes are shown together with their exon-intron structure. Coding regions of genes are shown by blue boxes, non-coding regions by magenta boxes. Projected transcription start sites (TSS) are shown by small circles colored red, light green, or light blue depending on their quality (high, medium and low, respectively). In addition, TSS identified using FirstEF software are shown as small black vertical lines.
4. Transcription start positions from DBTSS database.
5. Cis regulatory modules (CRMs). Promoters are colored dark yellow, distal CRMs are colored red, and 3'UTR CRMs are colored light blue. In high-resolution windows, the name of the CRM is shown below. Click on the CRM to get the sequence and a list of transcription factor binding sites.
6. Selected transcription factor binding sites (TFBS). In 60- and 4-Kb windows, selected TFBS are shifted up if they match to the positive DNA strand, and down if they match to the negative DNA strand. In the 500 bp window, the strand of selected TFBS is indicated by an arrow. In the 80 bp window selected TFBS have a color border, and the strand is shown as (+) or (-). To select a particular TFBS or a class of TFBS, use the form at the bottom of the screen. This form also gives an option to hide a group of TFBS or change conservation and mismatch thresholds.
7. Yellow area indicates the region that is zoomed-in below.
8. Names of assembled transcripts. The first part (e.g., U000006) indicates a U-cluster (gene or transcribed non-gene), and the second part (after dash) is the transcript number. To see details of assembly go to the gene index by clicking on the U-cluster name in the header of the page.
9. Conservation scores compared with other mammals (from UCSC).
10. Abundance of specific sequence patterns/motifs: CpG pairs (CG), G-stretches, AT/TA, and A-stretches.
11. Transcription factor binding sites (TFBS). Selected TFBS are shown in color, non-selected TFBS are black. Color bars are shifted up if TFBS matched to the positive DNA strand, and down if they matched to the negative DNA strand. To select another TFBS (to be shown in a different color) or change other viewing options use the form at the bottom of the screen.
12. DNA sequence in the 500 bp and 80 bp windows is color-coded: A-magenta, T-blue, C-yellow, G-green. CpG pairs are shown by vertical black lines.
13. Transcription factor binding sites (TFBS). Selected TFBS are shown in color. Their strand/orientation is shown by arrow. Non-selected TFBS are black. To select another TFBS (to be shown in a different color) or change other viewing options use the form at the bottom of the screen.
14. Transcription factor binding sites (TFBS) are shown as boxes with a color-coded position-weight matrix (or pattern). Click on it to get information on this particular type of TFBS. Below the box there is a name of the TFBS, orientation in parenthesis, and mismatch score (e.g. D=0.056). If no mismatches, then the mismatch score (D=0) is not shown. Selected TFBS have a thick color border (e.g., TF_OCT has a magenta border in the picture).
15. Transcription start site (TSS) is shown by a small circle colored red, light green, or light blue depending on the quality of TSS (high, medium and low, respectively). In addition, TSS identified using FirstEF software are shown as small black vertical lines.
16. Repeats identified using Repeat Masker program (results downloaded from UCSC). The color of repeats is gray for SINE, dark yellow for LINE, green-blue for LTR, light green-blue for DNA, and light-red for simple repeats.

4. Methods: TSS and promoters

Analysis of regulatory regions is based on the mouse genome sequence assembled in March 2005 (mm6). Transcription start sites (TSS) were compiled from several databases in an attempt to cover main and alternative transcripts of protein-coding genes in the mouse genome. TSS from the DBTSS database ver. 5.2 (N = 18,503) were considered high-quality because they were identified using a large set of full-length cDNA. Because the DBTSS database was built on an older version of the mouse genome (mm5), we used BLAT to remap TSS coordinates to genome mm6. Medium-quality TSS were identified as matches between independent data sources which were >500 bp away from high-quality TSS. The first subset (N = 4,712) of medium-quality TSS was taken from protein-coding transcripts (ORF >= 100 aa, or known function) in the NIA Mouse Gene Index, ver. mm6, if they matched FirstEF software predictions within 300 bp. We used a 300 bp distance threshold as the matching criterion because it corresponds to a false discovery rate (FDR) of ca. 1% according to the following estimation. If the 52,503 TSS predicted by FirstEF were randomly distributed in the entire genome (3 Gb), then on average 387 of them would appear within 300 bp of the 36,829 TSS identified by aligning mRNA and EST sequences to the genome. Thus, the FDR = 387/36,829 = 1%. The FirstEF software uses discriminant functions to identify potential donor splice sites and TSS based on frequency distributions of short motifs in the DNA sequence. The second subset (N = 4,219) of medium-quality TSS was taken from protein-coding transcripts in the NIA Mouse Gene Index if they started within a CpG island but did not match FirstEF predictions. CpG islands were detected as regions with a minimum of 8 CpG pairs within 250 bp. This threshold was selected based on the frequency distribution of CpG pairs in promoters.
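
The FDR estimate for the 300 bp matching threshold can be reproduced with a few lines of arithmetic. The sketch below is an illustration only: the function name is hypothetical, and reading "within 300 bp" as a window of 300 bp on either side of each TSS is an assumption, chosen because it reproduces the 387 expected random matches quoted in the text.

```python
def expected_random_matches(n_predicted, n_reference, distance_bp, genome_bp):
    """Expected number of random prediction/reference pairs closer than distance_bp.

    Assumes predictions are placed uniformly at random in the genome and that
    a match means a prediction within distance_bp on either side of a TSS.
    """
    p_hit = 2 * distance_bp / genome_bp  # chance that one prediction hits one TSS
    return n_predicted * n_reference * p_hit


# Numbers quoted in the text: 52,503 FirstEF predictions, 36,829 aligned TSS, 3 Gb genome.
expected = expected_random_matches(52_503, 36_829, 300, 3_000_000_000)
fdr = expected / 36_829

print(round(expected))   # 387
print(f"{fdr:.2%}")      # 1.05%, i.e. ca. 1%
```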

The third subset (N = 27) of medium-quality TSS was taken from RefSeq sequences if they matched FirstEF software predictions. Finally, low-quality TSS (N = 12,960) were taken from the NIA Mouse Gene Index if they did not match other data sources. Recent experimental data with CAGE tags showed that many promoters had a cluster of transcription starts rather than a single TSS (ref. 20). However, in the current version of CisView we use only one TSS per promoter, as identified by DBTSS, the NIA Mouse Gene Index, or FirstEF, unless TSS have opposite orientations or are separated by >500 bp. Considering all possible transcription starts within a promoter is not currently feasible because it would make the analysis too slow for interactive web-based software. Most functions of CisView (e.g., finding binding sites within 1 kb upstream of TSS) are not critically affected by an uncertainty in TSS position of up to 100 bp.

Tentative promoter boundaries for high- and medium-quality TSS were set to the bounds of a CpG island if one was present at the TSS; otherwise they were assumed to span from -200 to +100 bp. The promoter boundaries were then adjusted by excluding transposon-related repeats and CDS, followed by merging with potential CRMs (see below). Promoters for low-quality TSS were considered only if they coincided with a potential CRM.
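
The CpG island criterion (at least 8 CpG pairs within 250 bp) and the tentative promoter boundaries can be sketched as follows. This is a minimal illustration, not the code used by CisView: the function names, the merging of overlapping qualifying windows into one island, the 0-based half-open coordinates, and the restriction to the positive strand are all assumptions. The later adjustment steps (excluding repeats and CDS, merging with CRMs) are not shown.

```python
def cpg_islands(seq, min_cpg=8, window=250):
    """Return merged (start, end) regions containing min_cpg CpG pairs within window bp."""
    seq = seq.upper()
    pos = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "CG"]
    islands = []
    for i in range(len(pos) - min_cpg + 1):
        # Span from the first to the last CpG of min_cpg consecutive CpG pairs.
        start, end = pos[i], pos[i + min_cpg - 1] + 2
        if end - start <= window:
            if islands and start <= islands[-1][1]:
                # Overlaps the previous island: extend it (assumed merging rule).
                islands[-1] = (islands[-1][0], max(islands[-1][1], end))
            else:
                islands.append((start, end))
    return islands


def tentative_promoter(tss, islands, upstream=200, downstream=100):
    """Island bounds if the TSS lies in a CpG island, else TSS-200 to TSS+100."""
    for start, end in islands:
        if start <= tss < end:
            return start, end
    return tss - upstream, tss + downstream


islands = cpg_islands("A" * 50 + "CG" * 10 + "A" * 50)
print(islands)                            # [(50, 70)]
print(tentative_promoter(60, islands))    # (50, 70)
print(tentative_promoter(500, islands))   # (300, 600)
```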

5. Methods: TFBS

Transcription factor binding sites (TFBS) were identified in the entire mouse genome using either patterns or position-weight matrices that were compiled from various sources including the TRANSFAC database, public version 7.0 (Matys et al. 2003). Because the TRANSFAC database has many redundant entries, we combined 291 vertebrate matrices into 115 groups. We also trimmed regions with low information or with inconsistencies between various versions of the same TFBS, as documented on the web site. Then one matrix was built for each group. The second major source of TFBS was the set of 174 patterns over-represented in conserved regions of mammalian promoters (Xie et al. 2005). Out of these, 69 patterns corresponded to known TFBS. References to additional TFBSs are available from the web site. In total, 134 matrices and 219 patterns were used for identification of TFBS.

Search for patterns allowed no mismatches, although patterns themselves were degenerate (i.e., contained symbols R, Y, N, etc.). Patterns with >=18 bit information content (N = 19) had too few hits, thus we treated them as matrices and allowed mismatches. Matrix-based search was implemented in 2 steps: (1) search for the exact match of the core pattern, and (2) estimate the similarity measure using the matrix. The core pattern consisted of 3 or 4 elements characterized by the 2 most dramatic changes in nucleotide frequency between positions, measured by

[equation for c_j]

where c_j is the degree of change from position j to position j+1, p_ij is the frequency of nucleotide i at position j, and I_j is the information measure at position j. For example, the core pattern for the SP1 binding site was GCG. Core patterns were allowed to be degenerate and included nucleotides that occurred at frequencies greater than 50% of the maximum frequency at that position. Some core patterns had 2 pairs of nucleotides separated by some distance. For example, the TF_CDP binding site had the core ATNNAT. Exact match of the core ensured the proper position of the matrix and reduced the number of false positives. The similarity score is equal to the sum of character heights in a sequence logo divided by the sum of maximum heights at all positions:

\[
s = \frac{\sum_{j} p_{n(j),j}\, I_j}{\sum_{j} \max_{i}\left(p_{ij}\right) I_j},
\]

where n(j) is the nucleotide in the sequence at position j. This score is equivalent to the one used in MatInspector (Quandt 1995). The minimum allowed similarity threshold was 0.8 (i.e., 20% mismatch); however, for abundant BSs we used higher similarity thresholds, adjusted so that the frequency of matches in CpG-rich and CpG-poor semi-random sequences did not exceed 1 per 500 bp and 1 per 2000 bp, respectively. We used different thresholds for CpG-rich and CpG-poor sequences because CpG-rich sequences are usually richer in functional TFBSs. Semi-random sequences were generated using 3rd-order Markov models with transition probabilities estimated from CpG-rich and CpG-poor mouse promoters.
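
The similarity score described above (sum of logo heights of the observed nucleotides divided by the sum of maximum heights) can be sketched in a few lines. This is an illustration under stated assumptions: the function names are hypothetical, the matrix is a list of per-position nucleotide frequency dictionaries, and the information measure is taken to be the usual sequence-logo value, 2 + sum of p*log2(p) bits, without small-sample correction. The preceding core-pattern match is not shown.

```python
import math


def information(col):
    """Information content in bits of one matrix column (assumed logo definition)."""
    return 2.0 + sum(p * math.log2(p) for p in col.values() if p > 0)


def similarity(matrix, site):
    """Sum of logo heights of the observed nucleotides over the sum of maximum heights."""
    observed = sum(col[nt] * information(col) for col, nt in zip(matrix, site))
    best = sum(max(col.values()) * information(col) for col in matrix)
    return observed / best


# Toy 3-position matrix (made-up frequencies, not a CisView matrix).
toy_matrix = [
    {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0},      # 2 bits
    {"A": 0.5, "C": 0.5, "G": 0.0, "T": 0.0},      # 1 bit
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},  # 0 bits
]

print(similarity(toy_matrix, "AAT"))  # 1.0, best possible site
print(similarity(toy_matrix, "CAT"))  # 0.2, below the 0.8 threshold
```

Positions with low information contribute little to either sum, so mismatches there barely lower the score, while a mismatch at a fully conserved position is penalized heavily.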

6. Methods: cis-regulatory modules

A potential cis-regulatory module (referred to below simply as a CRM) was defined as a genomic region that had at least 4 conserved TFBSs within each 200 bp of its length and did not overlap with transposable repeats and/or CDS. Evolutionary conservation is a reliable indicator of functionality of TFBSs (Zhang and Gerstein 2003). If a CRM overlapped with a promoter, it was merged with the promoter; if it overlapped with the 3'UTR of a gene, we considered it a 3'UTR-associated CRM; all other CRMs were considered DCRMs. 3'UTR-associated CRMs most likely regulate post-transcriptional processes (mRNA stability, translation, etc.) (Xie et al. 2005), thus we distinguished them from DCRMs, which are mostly involved in the regulation of transcription. Genome conservation scores and repeat coordinates were downloaded from the UCSC database (Siepel et al. 2005). A conservation score of 0.5 was used as the threshold for considering a TFBS conserved. DCRMs were considered high quality if they contained at least one 150 bp region with 6 conserved TFBS.
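
The three-way classification of CRMs described above can be sketched as a simple interval-overlap test. This is an illustrative sketch only: the function names, the half-open (start, end) coordinates, and the choice to test promoters before 3'UTRs (following the order of the sentence) are assumptions.

```python
def overlaps(a, b):
    """True if half-open intervals a=(start, end) and b=(start, end) share a base."""
    return a[0] < b[1] and b[0] < a[1]


def classify_crm(crm, promoters, utr3_regions):
    """Label a CRM as merged with a promoter, 3'UTR-associated, or distal."""
    if any(overlaps(crm, p) for p in promoters):
        return "promoter"
    if any(overlaps(crm, u) for u in utr3_regions):
        return "3'UTR CRM"
    return "DCRM"


# Made-up coordinates on one chromosome.
promoters = [(1000, 1300)]
utr3_regions = [(5000, 5600)]

print(classify_crm((1200, 1500), promoters, utr3_regions))  # promoter
print(classify_crm((5500, 5700), promoters, utr3_regions))  # 3'UTR CRM
print(classify_crm((3000, 3200), promoters, utr3_regions))  # DCRM
```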

The presence of high-quality TFBS, as well as of multiple TFBS of the same kind, in a CRM is considered an indicator of its function as a transcription regulator (Blanchette et al. 2006). Thus, we evaluated the regulatory potential of a CRM by a score, RPS, which was the sum of scores for individual TFBS and scores for multiple TFBS of the same kind. Our method of estimating RPS differs from that of Elnitski et al. (2003): we used only one genome (mouse), the evolutionary conservation score, and matches of known TFBS patterns, whereas Elnitski et al. (2003) used multiple genomes without considering known TFBS patterns. The probability of accidental occurrence of a TFBS within a CRM of length L was estimated as p = D(s)*L, where s is the similarity score of the binding site and D(s) is the density of binding sites with that similarity score in a semi-random sequence generated using a 3rd-order Markov process. Depending on whether a TFBS was in a CpG-rich or CpG-poor region, we used a semi-random sequence generated with transition probabilities estimated from CpG-rich or CpG-poor regions in the mouse genome, respectively. The regulatory score for a TFBS was estimated as -log10(p) - 2 if p < 0.01 and set to 0 otherwise. The probability of accidental occurrence of multiple binding sites of the same kind, pm, was estimated as the product of the probabilities of their individual occurrences, p. The regulatory score for multiple TFBS was estimated as -log10(pm) - 2 if pm < 0.01 and set to 0 otherwise. The regulatory potential score, RPS, which is the sum of scores for individual TFBS and multiple TFBS, was then estimated for all CRMs in the mouse genome.
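
A minimal sketch of the RPS calculation described above. The function names and the input layout (one (TFBS name, D(s)) pair per site, with D(s) in hits per bp) are hypothetical, and applying the "multiple TFBS" term only to kinds with two or more sites in the CRM is an assumption.

```python
import math
from collections import defaultdict


def regulatory_score(p):
    """-log10(p) - 2 if p < 0.01, otherwise 0."""
    return -math.log10(p) - 2 if p < 0.01 else 0.0


def rps(sites, length):
    """Regulatory potential score of a CRM of the given length in bp."""
    total = 0.0
    by_kind = defaultdict(list)
    for name, density in sites:
        p = density * length              # p = D(s) * L
        total += regulatory_score(p)      # score for the individual TFBS
        by_kind[name].append(p)
    for probs in by_kind.values():
        if len(probs) > 1:                # assumed: only repeated kinds count here
            total += regulatory_score(math.prod(probs))  # pm = product of p
    return total


# Made-up densities for a 100 bp CRM with two SP1 sites and one OCT site.
example_sites = [("SP1", 1e-5), ("SP1", 1e-5), ("OCT", 1e-3)]
example_rps = rps(example_sites, 100)
print(round(example_rps, 6))  # 6.0 = 1 + 1 (individual SP1) + 0 (OCT) + 4 (SP1 pair)
```
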
The probability distribution of RPS within CRMs of each size class (from 50 to 150; from 150 to 250; from 250 to 350; ...; >1950 bp) was then compared with the probability distribution of RPS estimated for semi-random sequences of size 100, 200, ..., 1900, >1900 bp (the last class included sequence sizes from 2000 to 3000 bp) generated using a 3rd-order Markov process with transition probabilities from CpG-rich or CpG-poor regions. Probability distributions of RPS were very similar for CpG-rich and CpG-poor semi-random sequences (see the Figure below), thus we averaged them and used the average to estimate p-values and the false discovery rate (FDR) of RPS in CRMs of the same size class. After sorting all CRMs by increasing p-value, we estimated the false discovery rate for the i-th CRM as FDRi = pi*N/i, where pi is the p-value for the i-th CRM and N is the total number of CRMs. We considered a CRM to have a significantly higher RPS than semi-random sequences if its FDR was ≤0.1.
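
The rank-based estimate FDRi = pi*N/i can be sketched as follows. The function name is hypothetical and the p-values are made up; the code applies the formula exactly as stated in the text, without capping values at 1 or enforcing monotonicity.

```python
def fdr_by_rank(p_values):
    """Return FDR_i = p_i * N / i for each p-value, where i is its 1-based rank."""
    n = len(p_values)
    order = sorted(range(n), key=lambda k: p_values[k])  # indices by increasing p
    fdr = [0.0] * n
    for rank, k in enumerate(order, start=1):
        fdr[k] = p_values[k] * n / rank
    return fdr


p_values = [0.04, 0.001, 0.5, 0.01]          # made-up p-values for 4 CRMs
fdr = fdr_by_rank(p_values)
significant = [k for k, q in enumerate(fdr) if q <= 0.1]
print(significant)                           # [0, 1, 3]
```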


Figure: Examples of the cumulative distribution of the regulatory potential score (RPS) in cis-regulatory modules (CRMs) in the mouse genome and in semi-random sequences of the same size generated with a 3rd-order Markov process with transition probabilities specific to CpG-rich and CpG-poor genome regions. (A) CRM size from 50 to 150 bp. (B) CRM size from 1450 to 1550 bp.

7. Methods: browser

The browser of mouse regulatory regions uses CGI scripts (Perl) for generating pictures and web pages. To accelerate data processing, we created data files that include all information on genes, sequence, and TFBS for each 60 Kb region. Query tools include search for specific TFBSs or their combinations in promoters or in DCRMs; search for specific genes based on symbols, annotations, gene ontology (GO) terms or protein domains; and search for promoters of different quality and/or containing a TATA box. Any list of promoters that resulted from a query or was uploaded by a user can be further analyzed for over-represented TFBSs (both singles and pairs), GO terms and protein domains that are preferentially associated with the list. Over-representation of promoters with specific TFBSs or genes with specific GO annotation was evaluated statistically using z scores estimated from the hypergeometric distribution and FDR<0.05.
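
A z score based on the hypergeometric distribution can be computed from the mean and variance of that distribution. The sketch below shows one common way to do this; the function name and argument names are hypothetical, the counts are made up, and the exact procedure used by CisView may differ.

```python
import math


def hypergeom_z(k, n, K, N):
    """z score for seeing k promoters with a TFBS in a list of n promoters,
    when K of all N promoters carry that TFBS (sampling without replacement)."""
    frac = K / N
    mean = n * frac
    var = n * frac * (1 - frac) * (N - n) / (N - 1)
    return (k - mean) / math.sqrt(var)


# Made-up example: 15 of 50 listed promoters have the TFBS, versus 100 of 1000 overall.
z = hypergeom_z(k=15, n=50, K=100, N=1000)
print(z > 0)  # True: the TFBS is over-represented in the list
```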

8. How to use CisView

Search for a specific gene
1. To search for a specific gene, enter its symbol (e.g., Hoxb3) at the front page and select search for gene symbols.
2. Enter a GenBank, RefSeq, or Ensembl sequence name and select search for mRNA/EST member.
3. Enter a Unigene, TIGR, or DOTs sequence name and select search for Unigene,TIGR,DOTs.
4. Enter a U-cluster name (from the NIA Mouse Gene Index), TSS name, or CRM name and select search for Uname,Rname,CMname.
5. Enter an NIA oligo name and select search for NIA oligo.
6. If you have a gene sequence, go to the BLAT page and enter it there, then select the best match at the top of the table.
Search for a transcription factor
1. Enter gene symbol (e.g. Hox), matrix name from TRANSFAC (e.g. V$SP1_Q6), or U-cluster name (e.g., U032117) at the front page and select search for transcription factor.
2. Enter TFBS motif (e.g. CCCGCCC) and select search for TFBS pattern.
Search for DNA sequence (e.g., enhancer)
1. Go to BLAT page.
2. Enter DNA sequence (at least 20 bp) in a fasta format:
>sequence_name
TTCCCTTAATCTCTAGAACTCCCAGCAGTGTTGGCTACT
Sequence may occupy multiple lines.
3. BLAT search will return a table with all matches for this sequence.
4. Select the top (best) match, or any other match, to display it in the browser.
Upload a list of genes
1. Uploading a list of genes can be used for the following tasks: (a) view a sequential list of genes/promoters/CRMs; (b) find GO-annotations and protein domains that are over-represented in the list of genes; (c) find TFBS and pairs of TFBS that are over-represented in the list of genes; (d) generate a table of TFBS abundance in promoters; (e) plot the frequency distribution of a selected TFBS in given promoters; (f) search within a given list of promoters/genes.
2. List of genes can contain one of the following: (a) gene symbols, (b) GenBank names, (c) Unigene names, (d) TIGR names, (e) DOTs names, (f) NIA EST names, (g) NIA oligo names, (h) U-clusters, (i) promoter names (=Rnames), (j) CRM names (=CMnames).
3. Specify the data type using the pull-down menu.
Advanced search
1. There are 3 sections in "advanced search": gene features, TFBS features, and promoter features. Specify search criteria in at least one of the sections and click on the "Search" button.
2. Search is done either in promoters from -1000 to +1000 relative to TSS; or in conserved regions (conservation score > 0.5) from -30000 to +30000.
3. You can search up to 3 TFBS using options "And", "Or", or "Distance". If "Distance" is selected, then the search targets pairs of TFBS separated by a specified distance.
4. Search can be tuned up by specifying distance from TSS, conservation score, and mismatch ratio.
5. You can select specific type of promoters, e.g. CpG-rich promoters of high quality, and limit search to this set of promoters.
6. You can select to return a list of promoters or a list of TFBS. In the latter case, the list will contain multiple entries for each promoter and a location of each TFBS in it. The TFBS option can be useful if you want to view all hits of TFBS. However, there are more options to analyze a list of promoters than for a list of TFBS.
7. After you receive search results, you can do the following tasks: (a) view a sequential list of genes/promoters/TFBS; (b) find GO-annotations and protein domains that are over-represented in the list of genes; (c) find TFBS and pairs of TFBS that are over-represented in the list of genes; (d) generate a table of TFBS abundance in promoters; (e) plot the frequency distribution of a selected TFBS in given promoters; (f) search within a given list of promoters/genes.
Change viewing options
1. Viewing options include (a) selecting TFBS of interest, (b) selecting thresholds for conservation and/or mismatch, (c) hiding a group of TFBS from specific source (e.g., from TRANSFAC), (d) adding user-specified motifs for TFBS. All viewing options are stored as cookies. Thus you need to enable cookies in your web browser for full functionality.
2. User can select up to 5 different TFBS. Selected TFBS can be un-selected individually or as a group.
3. User can select up to 3 additional patterns. Patterns should consist of the following upper-case characters A, T, C, G, Y=[TC], R=[AG], M=[CA], K=[TG], W=[TA], S=[CG], B=[CTG], D=[ATG], H=[ATC], V=[ACG], N=[ATGC]. Selected patterns can be un-selected individually or as a group.
4. By default we do not show nested TFBS with lower scores unless these TFBS are highlighted. However, you can uncheck the box "Hide nested TFBS with lower score" and hit the "Change" button. After that, all TFBS will be shown.
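
The degenerate characters accepted for user-specified patterns map directly onto regular-expression character classes. The sketch below shows one way to search a sequence with such a pattern; the function name is hypothetical, only the given strand is searched, and this is not the code used by the CisView server.

```python
import re

# Degenerate nucleotide codes as listed in the viewing options above.
IUPAC = {
    "A": "A", "T": "T", "C": "C", "G": "G",
    "Y": "[TC]", "R": "[AG]", "M": "[CA]", "K": "[TG]", "W": "[TA]", "S": "[CG]",
    "B": "[CTG]", "D": "[ATG]", "H": "[ATC]", "V": "[ACG]", "N": "[ATGC]",
}


def find_pattern(pattern, seq):
    """Return 0-based start positions of all (possibly overlapping) exact matches."""
    body = "".join(IUPAC[c] for c in pattern)  # KeyError on unsupported characters
    regex = re.compile("(?=(" + body + "))")   # lookahead keeps overlapping hits
    return [m.start() for m in regex.finditer(seq.upper())]


print(find_pattern("ATNNAT", "GGATCGATTTATGCAT"))  # [2, 6, 10]
print(find_pattern("CCCGCCC", "TTCCCGCCCAA"))      # [2]
```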

Frequently asked questions (FAQ)

Why are some 3'UTR cis-regulatory modules (CRMs) not marked as 3'UTR?
We identified 3'UTR regions based on main transcripts only, because alternative transcripts often have a truncated ORF and their 3'UTR may be identified incorrectly. In some cases, the 3'UTR was not identified in the main transcript (e.g., if it was projected from a protein sequence) but was present in alternative transcripts. In this case, CRMs that happen to be located there are not marked as 3'UTR-specific.
If you have questions, please send a note to the webmaster.