NIA Mouse Gene Index

Gene Index Help          

Legend for gene view

1. Three scales for gene location in the chromosome: whole chromosome, 3-Mb window, and a 300-Kb window. You can navigate in all 3 windows by clicking on any gene or location.
2. Magenta triangle shows the location of the gene in each scale.
3. The upper line in each pair of lines represents the positive strand; the lower line represents the negative strand.
4. Red-bordered bar is a gene or gene candidate. Click on it to see the gene structure.
5. Green-bordered bar is a non-gene (ORF<100 aa and single exon).
6. Blue '+' indicates TSS identified using the FirstRF software. The TSS is strand-specific.
7. Gray area indicates a CpG island identified using CpGproD software.
8. Oligos used in NIA mouse microarrays (60-mers manufactured by Agilent). Click on the oligo to get more information.
9. Names of assembled transcripts. The first part (e.g., U000006) indicates a U-cluster (gene or transcribed non-gene), and the second part (after dash) is the transcript number. Click on the graph of a transcript to see aligned sequences.
10. Additional information on transcripts. The first column shows ORF length (aa). The second column is the first amino acid (M for methionine, ATG, and L for leucine, CTG) and the Kozak consensus, shown by the thickness of the green bar. The third column indicates the source of sequences: N=NIA, E=Ensembl, Rf=RefSeq, Gb=GenBank, Est=dbEST.
11. U-cluster start.
12. U-cluster end.
13. U-cluster end.
14. Exons. Blue color = ORF; magenta = untranslated regions (UTR).
15. Introns. Black color = correct splice sites (canonical, GT-AG, as well as two major non-canonical, GC-AG and AT-AC). Gray color = at least one splice site is incorrect.
16. Transcripts that belong to other U-clusters in the same strand.
17. Transcription start.
18. Transcription end.
19. Exon number.
20. Intron length, bp (not in-scale with exons).

Navigation in the Gene Index

You can either browse genes based on their genome location, search by annotation term or sequence name (type in a search term and click on "Search"), or BLAT your sequence against the genome (click on "BLAT").
Major data sets can be downloaded from here.

Gene Index web pages belong to the following levels or "views"

Genome view
provides information on the genome location and exon-intron structure of U-clusters. Three upper bars present the location of the gene at 3 scales: whole chromosome, 3Mbp window, and 300Kbp window. Positive and negative strands are shown separately. A user can click on a chromosome location or gene box to get to a different gene. The interface is designed for viewing a single U-cluster (gene). Links to genome browsers that have a zoom-in/zoom-out option (NCBI, UCSC, Ensembl) are provided to view the same region of the chromosome. Click on any transcript sequence to get to the transcript view.

Transcript view
provides information on a particular transcript. A link to the U-cluster returns to the genome view. At the top there is a plot of the transcript and its open reading frame (ORF). The character "M" or "L" at the start of the ORF indicates that the first amino acid is methionine or leucine, respectively. A green bar next to "M" indicates the presence of a Kozak consensus. Below the transcript there are members of the transcript plotted on a white background: RefSeq, Ensembl, Riken, and NIA clusters. Individual ESTs are plotted on a gray background (NIA library name is indicated on the right). At the bottom of the page there are lists of protein domains and a list of GO-terms associated with the gene symbol. Click on the transcript sequence to get to the sequence view.

Sequence view
provides information on the nucleotide and protein sequence of a transcript. In addition, it lists protein domains, GO-terms, and repeat regions. There are links to several sequence analysis tools: BLAST, BLAT, ORF finder.

Methods

Filtering

Genome alignments were selected if at least 30% length matched to the genome but not less than 40 bp, and the ratio of the total alignment length to the best alignment was at least 0.9. Sequences that had >100 alignments and satisfied the above listed conditions were considered repeats and removed. None of them had an ORF with known protein domains according to searches using RPS-BLAST and CDD database ver 2.02 (Marchler-Bauer and Bryant 2004). For other sequences we considered not more than 50 alignments. At the next filtering step we checked the quality of the alignment using the percent identity (PID). Short non-intronic gaps as well as short inserts (<30 bp) in the sequence were treated as mismatches for estimating PID. The threshold of PID = 70% was used for filtering best alignments, and PID = 85% for additional alignments.
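The selection rules above can be sketched in Python as follows. This is only an illustration, not the pipeline's code: the `Hit` record and its field names are hypothetical, and the 0.9 ratio is read here as "matched length relative to the best alignment of the same sequence", which is an assumption about an ambiguous sentence.

```python
from dataclasses import dataclass

@dataclass
class Hit:
    matched_bp: int   # bases of the query matched to the genome (hypothetical field)
    query_len: int    # full length of the query sequence
    pid: float        # percent identity, short gaps already counted as mismatches

def select_hits(hits):
    """Apply the length, repeat, and PID filters to all hits of one sequence."""
    if not hits:
        return []
    best_bp = max(h.matched_bp for h in hits)
    kept = [h for h in hits
            if h.matched_bp >= 40                      # not less than 40 bp
            and h.matched_bp >= 0.30 * h.query_len     # at least 30% of length
            and h.matched_bp >= 0.9 * best_bp]         # assumed meaning of the 0.9 ratio
    if len(kept) > 100:                                # >100 alignments: treated as a repeat
        return []
    kept.sort(key=lambda h: h.matched_bp, reverse=True)
    kept = kept[:50]                                   # consider at most 50 alignments
    # PID threshold: 70% for the best alignment, 85% for additional ones
    return [h for i, h in enumerate(kept) if h.pid >= (70.0 if i == 0 else 85.0)]
```

In this sketch the "best" alignment is simply the one with the most matched bases; the original text does not say how the best alignment was ranked.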

Because the BLAT algorithm attempted to find a genomic match for as many nucleotides as possible, it created artificial small exons within an intron to avoid mismatches that are close to exon boundaries (Volfovsky et al. 2003). Small exons (<6 bp) without splicing consensus were considered artifacts and were either merged with neighboring exons (if the number of mismatches was <50%) or removed from the alignment. Most real micro-exons (80%) were well detected with BLAT (Volfovsky et al. 2003); thus we did not use any additional correction procedures for micro-exon detection. Short initial and final blocks (<40bp) in the alignment that were separated by an intron without splice consensus were removed because most of them were random matches. Short alignments (<70 bp) were removed if either genome span was >500000, or there were introns without splice sites, or PID was <95%. Because sequence quality was usually lower in ESTs than in full mRNAs, we applied more stringent criteria for the filtering of ESTs. The best EST alignment was deleted if its genome span was >200000 bp and there were no introns with splicing consensus, or there were >1 intron without splicing consensus, or PID was <90%. Other EST alignments were deleted if the genome span was >100000 bp and there were <2 introns with splice consensus, or there were >1 intron without splice consensus, or PID was <90%. These criteria were determined iteratively by examining the results of gene assembly and identifying sequences that caused problems.
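The EST-specific rules in the last sentences above can be written as a single predicate. The sketch below is illustrative: the function and parameter names are hypothetical, and the grouping of "and"/"or" (the span condition is combined with the consensus-intron count, and the other two conditions stand alone) is an assumption about how the sentence should be read.

```python
def drop_est_alignment(span_bp, consensus_introns, nonconsensus_introns, pid, is_best):
    """Return True if an EST alignment should be deleted under the stated criteria."""
    if is_best:
        # best alignment: long genome span with no consensus-supported intron
        weak_long_span = span_bp > 200_000 and consensus_introns == 0
    else:
        # additional alignments: stricter span and consensus requirements
        weak_long_span = span_bp > 100_000 and consensus_introns < 2
    return weak_long_span or nonconsensus_introns > 1 or pid < 90.0
```

For example, under this reading an additional EST alignment spanning 150,000 bp with one consensus intron is deleted, while the same alignment would be kept if it were the best alignment of that EST.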

Sequence orientation was validated in three steps. First, we identified multi-exon alignments in which orientation was unambiguously determined by the direction of splicing consensus. If the orientation determined from splicing consensus did not match the original sequence orientation, then the sequence was labeled as "wrong-strand". At the second step, we assembled alignments with validated orientation and checked their overlap with other alignments. If an alignment with unclear orientation overlapped by >20% of its length with some assembly with known orientation, and the overlap length with this best matching assembly was at least twice as large as the overlap with any assembly in the opposite strand, then the orientation of the tested alignment was considered valid. This procedure might have re-oriented some naturally occurring antisense RNAs if they are unspliced. The orientation of spliced antisense RNAs was determined based on splice sites and was not changed. We did not intend to assemble unspliced antisense transcripts, because they could not be effectively distinguished from the genomic contamination, and their biological function was unclear. At the third step, we assembled all sequences with unclear orientation and determined the orientation of each consensus based on the majority rule.
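The overlap test used in the second step can be illustrated with a short Python sketch. It is a simplification under stated assumptions: alignments and assemblies are reduced to half-open genome spans `(start, end)`, overlap is measured on whole spans rather than on exons, and the function names are hypothetical.

```python
def overlap(a, b):
    """Length of the overlap between two half-open intervals (start, end)."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))

def infer_strand(aln, plus_assemblies, minus_assemblies):
    """Return '+' or '-' if the overlap rule assigns a strand, otherwise None."""
    length = aln[1] - aln[0]
    best_plus = max((overlap(aln, x) for x in plus_assemblies), default=0)
    best_minus = max((overlap(aln, x) for x in minus_assemblies), default=0)
    for strand, same, opposite in (("+", best_plus, best_minus),
                                   ("-", best_minus, best_plus)):
        # >20% of the alignment length, and at least twice the opposite-strand overlap
        if same > 0.2 * length and same >= 2 * opposite:
            return strand
    return None
```

An alignment for which the function returns `None` would be left for the third step (majority rule among sequences with unclear orientation).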

Additional filtering was done in the groups of partially overlapping alignments. If a sequence had multiple partially overlapping alignments in the same group, then only the best one was retained. Alignments of gene models that had no support for any of the introns from alignments of expressed sequences in the same group were removed. If an intron joined two distinct sets of alignments or contained several multi-exon alignments and had insufficient evidence (i.e., supported by one mRNA/EST sequence or by only gene models), then the alignments were truncated at that intron. In addition, we truncated alignments at introns with insufficient evidence that either had length >30 kb and no splicing consensus, or included a promoter or a start or end of a RefSeq sequence alignment.

Assembly

The proposed All Alignment Assembly (AAA) algorithm assembled the set of all longest transcripts from EST/mRNA sequences aligned to the genome. Each transcript consisted of partially overlapping compatible alignments. Two alignments were considered compatible if each sequence had no elements mapped to an intron of another sequence. For practical purposes, we relaxed this condition so that two alignments were considered compatible if all non-compatible fragments were shorter than 15 bp. This made the assembly less sensitive to sequencing and alignment errors. The compatibility relationship among alignments was non-transitive. This means that if alignments A and B were compatible, and B and C were compatible, then A and C were not always compatible. However, in a set of sequences that extended each other from right to left (or from left to right), the compatibility relation became transitive and could be chained to produce longer transcripts (Haas et al. 2003, Eyras et al. 2004).
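The relaxed compatibility test and its non-transitivity can be illustrated with a small Python sketch. It assumes each alignment is a sorted list of exons given as half-open `(start, end)` genome intervals; the function names and the coordinates in the example are hypothetical.

```python
def overlap(a, b):
    """Length of the overlap between two half-open intervals (start, end)."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))

def introns(exons):
    """Gaps between consecutive exons of one alignment."""
    return [(left[1], right[0]) for left, right in zip(exons, exons[1:])]

def compatible(a, b, tolerance=15):
    """True if no exon of one alignment covers >= tolerance bp of an intron of the other."""
    for x, y in ((a, b), (b, a)):
        for exon in x:
            for intron in introns(y):
                if overlap(exon, intron) >= tolerance:
                    return False
    return True

# Non-transitivity: A~B and B~C, yet A and C conflict.
A = [(0, 100), (200, 300)]       # has an intron at 100..200
B = [(250, 300), (400, 500)]
C = [(0, 300)]                   # unspliced, covers A's intron
print(compatible(A, B), compatible(B, C), compatible(A, C))
```

With these coordinates A and B are compatible, B and C are compatible, but A and C are not, because the single exon of C covers the whole intron of A.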

Alignments in each chromosome and each strand were grouped into non-overlapping clusters, and then each cluster was processed sequentially by the AAA algorithm. The proposed AAA algorithm consisted of four steps: (1) find all non-redundant left (towards 5'-end of gene) extensions for each alignment; (2) identify all right-end alignments that cannot be extended to the right (towards 3'-end of gene); (3) assemble transcripts starting from right to left by branching the extension of each alignment to the left; (4) remove redundant and low-quality transcripts.

The algorithm started with sorting all alignments by their starting position, and sequence extensions were determined from left to right. An alignment B extended alignment A to the left if it partially overlapped with A, was compatible with A, and was strictly to the left of A. From the set of all left extensions we removed redundant left extensions using Algorithm 1. Extension B of alignment A was defined as non-redundant if for any other extension C of alignment A either (1) C was non-compatible with B, or (2) B was longer, or (3) B was shorter, but it had left extensions (direct or chained) that were non-compatible with C. For example, in the figure below, extension B of alignment A was compatible with C and shorter than C. However, it was non-redundant because it had a left extension D, which was incompatible with C.

To eliminate redundant left extensions for each alignment A, the set S of all left extensions was sorted by increasing left boundary position. Then for each subsequent extension s we checked if it was compatible with any longer non-redundant extension n. If it was not compatible with any, then s was added to the list of non-redundant extensions of A. If s was compatible with a longer left extension n, we checked if any left extension of s was compatible with n. A stack was initialized with alignment s and then it accumulated assemblies that started from s and extended transitively to the left. When an assembly was extracted from the stack, it was extended to the left with all non-redundant left extensions determined for its left-most element. If extension Q was compatible with n, and its left boundary was to the right of the left end of n, then it was combined with the assembly and added back to the stack. If Q was compatible with n but its left boundary was equal to or left of the left end of n, then the next left extension Q of the assembly was tried. If Q was not compatible with n, then s was not redundant compared to n; in this case we went to the next non-redundant extension n. If all non-redundant extensions were tested and s was non-redundant to all of them, then s was added to the set of non-redundant left extensions of A, and the algorithm was repeated for the next s.

Algorithm 1. Filtering non-redundant left extensions for alignment A

  1    Initialize empty set N of non-redundant extensions of A.
  2    Sort the set S of all left extensions for A by increasing left boundary
  3    For each extension s in S{
  4        For each non-redundant extension n in N{
  5            If s is compatible with n{
  6                Initialize stack T with alignment s
  7                While T is non-empty{
  8                    Extract last assembly [q0,q1,q2, ... ,qm] from stack T.
  9                    For each non-redundant extension Q of the last element (qm){
10                        If Q is compatible with n{
11                            If left end of Q is equal or left to the left end of n{
12                                Next Q (line 9)
13                            }
14                            else{
15                                Push assembly [q0,q1,q2, ..., qm,Q] into stack T.
16                            }
17                        }
18                        else{
19                            s is non-redundant compared to n; try next n (line 4)
20                        }
21                    }
22                }
23                s is redundant; go to next s (line 3)
24            }
25        }
26        s is non-redundant; push s into set N; go to next s (line 3)
27    }
28    Return N
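Algorithm 1 can be expressed as a short Python sketch. This is one possible rendering, not the original implementation: alignments are treated as opaque objects, and the three callables `left_end`, `compatible`, and `left_exts` (returning the left boundary, the pairwise compatibility, and the already computed non-redundant left extensions of an alignment) are assumed to be supplied by the caller. Comments refer to the line numbers of Algorithm 1.

```python
def has_escaping_chain(s, n, left_end, compatible, left_exts):
    """True if a chain of left extensions starting at s reaches an alignment
    that is incompatible with n (lines 6-22)."""
    stack = [[s]]                                  # line 6
    while stack:                                   # line 7
        assembly = stack.pop()                     # line 8
        for q in left_exts(assembly[-1]):          # line 9
            if not compatible(q, n):
                return True                        # line 19: s is non-redundant vs n
            if left_end(q) > left_end(n):
                stack.append(assembly + [q])       # line 15: keep extending this chain
            # otherwise q reached the left end of n while compatible (line 12)
    return False                                   # line 23: s is redundant vs n

def nonredundant_left_extensions(extensions, left_end, compatible, left_exts):
    kept = []                                      # line 1
    for s in sorted(extensions, key=left_end):     # lines 2-3
        redundant = any(
            compatible(s, n)
            and not has_escaping_chain(s, n, left_end, compatible, left_exts)
            for n in kept                          # line 4
        )
        if not redundant:
            kept.append(s)                         # line 26
    return kept                                    # line 28

# Example from the text: B is shorter than C and compatible with it,
# but B has a left extension D that is incompatible with C.
left = {"B": 200, "C": 100, "D": 150}              # hypothetical left boundaries
incompatible = {frozenset(("C", "D"))}
result = nonredundant_left_extensions(
    ["B", "C"],
    left.get,
    lambda x, y: frozenset((x, y)) not in incompatible,
    lambda x: {"B": ["D"]}.get(x, []),
)
print(result)  # ['C', 'B']
```

If D is removed from the example, B becomes redundant and only C is returned.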

Transcripts were assembled (step 3) starting from the rightmost alignments, which were then combined with all possible non-redundant left extensions. Because the assembly could branch, we used a stack to store incomplete transcripts.
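The branching assembly of step 3 can be sketched with an explicit stack. This is a minimal illustration: `right_ends` (alignments that cannot be extended to the right) and `left_exts` (a callable returning the non-redundant left extensions of an alignment) are hypothetical inputs, and the sketch relies on the chained transitivity of compatibility described earlier rather than re-checking compatibility along each chain.

```python
def assemble_transcripts(right_ends, left_exts):
    """Enumerate all chains from each rightmost alignment to an alignment
    that has no further left extension."""
    transcripts = []
    stack = [[r] for r in right_ends]          # incomplete transcripts, right to left
    while stack:
        chain = stack.pop()
        extensions = left_exts(chain[-1])
        if not extensions:
            transcripts.append(chain[::-1])    # report members in left-to-right order
        for ext in extensions:
            stack.append(chain + [ext])        # branch: one new chain per extension
    return transcripts

# Hypothetical extension graph: A is extended by C or B, and B by D.
graph = {"A": ["C", "B"], "B": ["D"]}
print(assemble_transcripts(["A"], lambda x: graph.get(x, [])))
```

In this example the alignment A yields two transcripts, one built from D, B, A and one built from C, A.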

It can be proven that all possible full transcripts are generated by the algorithm. A full transcript is one that can be extended neither to the right nor to the left. We define a frame of a transcript assembly as a set of member alignments that were not included into any other alignment. In a frame, all alignments are linearly ordered by the strictly left relation. If alignment B in the frame is a redundant left extension of the previous alignment A, then it can be removed without breaking the transcript frame. According to the definition of redundancy, there is another longer non-redundant left extension C that extends A beyond B and is compatible with all elements in the frame. If alignment C is not in the frame itself, then it is included into another alignment D in the frame. If B is removed from the frame, the transcript will remain joined either by C or D. After removal of all redundant left extensions, the frame of the transcript should be constructed via our algorithm starting from the rightmost alignment.

At step 4, we removed redundant transcripts with the same composition of exons or shorter if they had fewer introns with a splicing consensus. Transcripts with an unspliced alternative first exon were removed if there was no promoter within 1 kb of the transcription start. Transcripts with an unspliced alternative last exon were removed if the last exon had no polyA signal.

The AAA algorithm was a part of a gene and transcript assembly system that included pre-processing and post-processing of data. The first step in pre-processing was a temporary removal of redundant alignments that were exact copies or slightly shorter copies (by 15 bp) of other alignments. To increase computation speed, all alignments that were included into other ones were considered redundant if the total number of sequences was >150. Small gaps (<15 bp) in alignments were removed and intron boundaries were adjusted to neighboring splice sites within 15 bp distance. Unspliced alternative first and last exons in ESTs were truncated unless they matched to promoters or polyA signals.

The last pre-processing step was the grouping of alignments into U-clusters (=potential genes) based on alignment overlap and clone-linking. We distinguished gross overlap on the level of whole alignments, and fine overlap, on the exon level. All alignments on the same strand of a chromosome were sorted by starting position and then subdivided into gross-overlapping groups. The starting position of alignments was adjusted according to the clone-linking information. If a clone was sequenced from the 3' and 5' ends, then two resulting EST sequences were assumed to represent the same transcript. After these ESTs were aligned to the genome, we considered alignments clone-linked if they were on the same strand of the same chromosome, 3' alignment was on the 3' side relative to the 5' alignment, and the distance between alignments was <800,000 bp. The starting position of the EST located farther from the chromosome start was set to the starting position of another EST with which it was clone-linked. Thus, clone-linked ESTs always appeared in the same gross-overlapping group together with all other alignments between them.
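The effect of clone-linking on gross-overlap grouping can be sketched as follows. The sketch makes several simplifying assumptions: all alignments are on one strand of one chromosome, spans are half-open `(start, end)` intervals, the 3'/5' ordering of each pair has already been checked, and the "distance between alignments" is taken as the gap between the end of the nearer alignment and the start of the farther one. All names are hypothetical.

```python
MAX_LINK_DISTANCE = 800_000  # bp, threshold stated in the text

def gross_overlap_groups(spans, clone_pairs):
    """spans: dict name -> (start, end); clone_pairs: list of (name, name)."""
    start = {name: s for name, (s, _) in spans.items()}
    for a, b in clone_pairs:
        near, far = sorted((a, b), key=lambda n: spans[n][0])
        if spans[far][0] - spans[near][1] < MAX_LINK_DISTANCE:
            # the EST farther from the chromosome start inherits its partner's start
            start[far] = min(start[far], start[near])
    groups, group_end = [], None
    for name in sorted(spans, key=lambda n: start[n]):
        if group_end is None or start[name] >= group_end:
            groups.append([name])              # no overlap: open a new group
            group_end = spans[name][1]
        else:
            groups[-1].append(name)            # overlaps the running group
            group_end = max(group_end, spans[name][1])
    return groups

spans = {"est5": (1000, 1500), "mid": (3000, 3500),
         "est3": (6000, 6500), "other": (9000, 9500)}
print(gross_overlap_groups(spans, [("est5", "est3")]))
```

Because the adjusted start of `est3` stretches it back to the start of `est5`, the alignment `mid` lying between them falls into the same group, as the text describes.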

Each gross-overlapping group was then subdivided into U-clusters assuming that alignments in different U-clusters had fine overlap <5% of alignment length. If 2 clone-linked EST pairs appeared in different U-clusters within the same gross-overlapping group and each alignment of one U-cluster was compatible with all alignments in the second U-cluster, then these U-clusters were merged. U-clusters containing copies of the same sequence were not clone-linked to avoid merging gene tandems. A U-cluster located entirely within an intron of another U-cluster was considered intronic. Many intronic U-clusters did not seem to be real genes but rather cloning artifacts. However, some of them were real single-exon genes (e.g., Rpl12 was within Acadl, and Cks2 was within Sntg1). It is very unlikely for a multi-exon gene to be located within an intron of another gene because the splicing mechanism of an outer gene would not work properly. Although we found several instances of intronic multi-exon U-clusters, we believe that most of them were artifacts resulting from genome or alignment errors. All alignments within the same U-cluster were submitted to the AAA algorithm to generate transcripts.

Post-processing of U-clusters and transcripts included clone-linking of transcripts, generating genome alignments of transcripts, mending genomic gaps based on alignments of expressed sequences, generating alignments of transcript members to transcripts, and compiling a graph of exons. Transcripts of the same U-cluster were merged if they contained clone-linked EST pairs and all member alignments were mutually compatible. Gaps in the genome sequence were identified if transcripts from 2 independent sources indicated the same gap. These gaps were mended using expressed sequence information. Alignments of transcript members to transcripts were generated as a composition of two alignments: the alignment of a member sequence to the genome, and a reverse alignment of the transcript to the genome.

The exon graph has become a standard representation of possible transcript alteration (Xing et al. 2004). Some exons are represented by multiple exon forms which differ in their starting and ending coordinates. We constructed exon graphs for all U-clusters using preferentially introns with splicing consensus. Introns without consensus appeared in the graph only if no better intron was known. A retained intron was a special case of alternative splicing that was difficult to distinguish from a splicing error (Zhou et al. 2003). Thus, retained introns were included into the exon graph only if their length was ≤500 bp.

Inheritance of U-cluster and transcript names from previous program runs was important for the consistency of results. First, we matched U-clusters in the nearest neighborhood (4 Mbp) if they shared at least some members and the number of exons in the new assembly was close to the number of exons in the old assembly. At the second step we identified key sequences that matched to only one old U-cluster. Then U-clusters were matched if they shared any key members and the number of exons was close. Finally, we matched U-clusters that shared key sequences without considering the number of exons. Non-matched old U-clusters were deleted, and non-matched new U-clusters were created. Then we found matching transcripts within matching U-clusters using key members that were found in only one old transcript.

Analysis of transcripts

Analysis of transcripts included identification of (1) the longest open reading frame (ORF), (2) repeat regions, (3) main transcript for each U-cluster, (4) duplicated U-clusters, (5) U-clusters with suspicious orientation, and (6) generating annotations for transcripts and U-clusters. The ORF was detected using the ORF Finder software (Wheeler, 2004) with both standard and alternative genetic code options. Because generated transcripts might have contained ORF shifts resulting from single nucleotide insertions/deletions, we analyzed not just individual ORFs but also composite ORFs consisting of a pair of overlapping ORFs if each portion was longer than 100 aa. The threshold of 100 aa was selected because ORFs of this length are highly unlikely (P<0.01) to appear in random sequences. If the difference in length between a single ORF and composite ORF was <100 aa, then a single ORF was selected. As a result, only ca. 5% of transcripts appeared to have composite ORFs. Genomic repeat sequences were already masked in the mouse genome database (mm4), thus we simply projected them onto transcript sequences. Main transcripts for each U-cluster were identified based on the score

S = L(1 + 0.25N/Nmax),    if N ≥ 10
S = L,            if N < 10,

where L is ORF length, N is the average number of supporting mRNA/EST sequences for each intron (RefSeq sequences were weighted as 10), and Nmax is the maximum value for N among all transcripts of the gene.
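The score and the choice of a main transcript can be illustrated with a few lines of Python. The function names and the numbers in the example are hypothetical; only the formula itself comes from the text.

```python
def transcript_score(orf_len, n_support, n_max):
    """S = L(1 + 0.25 N/Nmax) if N >= 10, otherwise S = L."""
    if n_support >= 10:
        return orf_len * (1 + 0.25 * n_support / n_max)
    return orf_len

def main_transcript(transcripts):
    """transcripts: dict name -> (ORF length L, intron support N)."""
    n_max = max(n for _, n in transcripts.values())
    return max(transcripts,
               key=lambda t: transcript_score(*transcripts[t], n_max))

# Hypothetical U-cluster with two transcripts: (ORF length in aa, support N)
cluster = {"U000006-1": (400, 20), "U000006-2": (430, 5)}
print(main_transcript(cluster))  # U000006-1
```

Here the first transcript has a shorter ORF but scores 400 × 1.25 = 500 because of its intron support, so it is preferred over the second transcript, which scores 430.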

A U-cluster was considered a copy of another U-cluster if <30% of its members were best matches. Cross-links were established between primary genes and their copies based on member copies. U-clusters had a suspicious orientation if they fine-overlapped by >50% with a better supported U-cluster in the opposite strand.

Annotations for transcripts were generated from annotations of member sequences. Preference was given to member sequences from RefSeq and GenBank, and to sequences with a valid symbol.

References

Eyras, E., M. Caccamo, V. Curwen, and M. Clamp. 2004. ESTGenes: alternative splicing from ESTs in Ensembl. Genome Res 14: 976-987.

Haas, B.J., A.L. Delcher, S.M. Mount, J.R. Wortman, R.K. Smith, Jr., L.I. Hannick, R. Maiti, C.M. Ronning, D.B. Rusch, C.D. Town, S.L. Salzberg, and O. White. 2003. Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic Acids Res 31: 5654-5666.

Thierry-Mieg, D. et al. http://www.aceview.org/: Danielle and Jean Thierry-Mieg, Michel Potdevin, Mark Sienkiewicz. Identification and functional annotation of cDNA-supported genes in higher organisms using AceView, unpublished. 2004.

Volfovsky, N., B.J. Haas, and S.L. Salzberg. 2003. Computational discovery of internal micro-exons. Genome Res 13: 1216-1221.

Xing, Y., A. Resch, and C. Lee. 2004. The multiassembly problem: reconstructing multiple transcript isoforms from EST fragment mixtures. Genome Res 14: 426-441.

Zhou, Y., C. Zhou, L. Ye, J. Dong, H. Xu, L. Cai, L. Zhang, and L. Wei. 2003. Database and analyses of known alternatively spliced genes in plants. Genomics 82: 584-595.