Information Theoretic Analysis of Sequences:
Drosophila Splice Site Database
IGS 350/550 Computer Laboratory
M. Rice / M. Weir
The Weir-Rice research group has constructed a relational database containing nucleotide sequences in the vicinity of 11,161 donor and acceptor splice sites in 3,375 Drosophila cDNAs (Weir and Rice, 2004). In particular, the nucleotides at positions -32 to 32 are stored (where the splice site position is 0), and the sequences at each type of splice site are aligned.
Several procedures are available for analyzing the data set through the web interface at http://igs.wesleyan.edu/. Click "Databases & Tools" in the left-hand menu and choose the option "Use WesQL to run stored procedures on public IGS Splice Site and RNA Databases"; then click the stored procedures tab to see the list of stored procedures.
1. Select the stored procedure Compute Splice Site Information and click the Continue button. You will see a number of pre-set parameters, including the cDNA Table (Wesleyan Known cDNA), Intron Table (Wesleyan Known Introns), Splice Site Table (Wesleyan Known Splice Sites), and Minimum Splice Element Length (20).
Change the default parameters by setting the Type of Site to Donor Sites and the Start and Finish positions to -4 and 8.
Click the Execute Stored Procedure button. Executing the procedure generates several HTML tables that contain the following entries:
SpliceElements - number of introns
Sitetype - donor (D) or acceptor (A)
Nuclposition - nucleotide position with respect to aligned donor or acceptor sites (with splice sites at position 0)
Information - at each nucleotide position, this is calculated using the formula
information = 2 - [-fA * log2(fA) - fC * log2(fC) - fG * log2(fG) - fT * log2(fT)] - g
where the quantity in brackets is the uncertainty (Shannon entropy) at the given position, based on the frequencies of occurrence fA, ..., fT of the nucleotides A, ..., T at that position, and the correction factor g depends on the number of splice sites that are being aligned. The constant 2 is the maximum uncertainty, log2(4) bits, which is reached when the four nucleotides are equally likely.
nA, nC, nG, nT - the numbers of each nucleotide at the given position
pA, pC, pG, pT - the probability of each nucleotide at the given position (the frequencies fA, ..., fT in the formula above)
[Note:
The T in the cDNA corresponds to the U in the RNA.]
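The Information formula above can be checked by hand from the nA, ..., nT columns. The following is a minimal sketch, not the stored procedure itself. The function name and its parameters are hypothetical, and the form of the correction factor g is an assumption: the handout only says that g depends on the number of aligned sites, so the sketch uses a common large-sample approximation for four symbols, 3 / (2 * ln(2) * n), and lets you switch it off.

```python
import math


def position_information(counts, correct_small_sample=True):
    """Information (bits) at one aligned position from nucleotide counts.

    counts maps 'A', 'C', 'G', 'T' to the number of aligned sequences with
    that nucleotide at the position (the nA, ..., nT columns).
    """
    n = sum(counts.values())
    if n == 0:
        raise ValueError("no sequences at this position")

    # Shannon uncertainty: -sum(f * log2(f)); terms with f = 0 contribute 0.
    uncertainty = 0.0
    for base in "ACGT":
        f = counts.get(base, 0) / n
        if f > 0:
            uncertainty -= f * math.log2(f)

    # ASSUMPTION: g approximated as 3 / (2 * ln(2) * n). The correction
    # actually used by the database may differ.
    g = 3.0 / (2.0 * math.log(2) * n) if correct_small_sample else 0.0

    return 2.0 - uncertainty - g
```

With equal counts of all four nucleotides and the correction switched off, the function returns 0 bits; with a single nucleotide at every site it returns 2 bits minus the correction.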
(a) How many cDNAs were used? (3,090)
(b) How many introns were analyzed? (10,057)
(c) What are the consensus nucleotide values for positions 1 and 2 at the donor sites (D+1, D+2)? (GT)
(d) What are the percentages of occurrence of the predominant nucleotides at these positions? (99.8%, 99.2%)
Store the HTML tables in Microsoft Excel; we will use this data in part 3.
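Questions (c) and (d) can also be answered directly from the count columns rather than by eye. The sketch below is illustrative only: the function name is hypothetical and the counts are made-up numbers, not values from the database.

```python
def predominant_nucleotide(counts):
    """Return (nucleotide, percentage) for the most frequent nucleotide."""
    total = sum(counts.values())
    base = max(counts, key=counts.get)
    return base, 100.0 * counts[base] / total


# Hypothetical counts for one position (NOT taken from the database).
example_counts = {"A": 2, "C": 3, "G": 990, "T": 5}
base, percent = predominant_nucleotide(example_counts)
print(f"{base}: {percent:.1f}%")  # prints "G: 99.0%"
```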
2. Repeat part 1 without the restriction that excludes cDNAs containing splice elements (introns or exons) shorter than 20, i.e. with no restriction on the lengths of the introns and exons.
(a) What are the percentages of occurrence of the predominant nucleotides at positions D+1, D+2?
(b) What are some possible reasons why the GT consensus is not as well represented as in part 1? [consider the algorithm used to compute the splice sites - see Weir and Rice (2004)]
(c) For the set of cDNAs that contain either an intron or an exon with length less than 20, calculate the frequency of the canonical GT nucleotides at each of the two positions [hint: compare nucleotide counts from 1(d) and 2(a)].
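The hint in 2(c) amounts to a subtraction: the sites that appear only in the unrestricted run are the ones that were excluded in part 1. A minimal sketch of that arithmetic follows. It assumes the filtered set from part 1 is a strict subset of the unrestricted set from part 2, and the function name and the numbers in the example are hypothetical.

```python
def excluded_subset_frequency(count_all, total_all, count_filtered, total_filtered):
    """Frequency of a nucleotide among sites present only in the unrestricted run.

    count_all, total_all: nucleotide count and number of sites from part 2.
    count_filtered, total_filtered: the same quantities from part 1.
    """
    n_excluded = total_all - total_filtered
    if n_excluded <= 0:
        raise ValueError("the unrestricted run must contain more sites")
    return (count_all - count_filtered) / n_excluded


# Hypothetical numbers (NOT from the database): 1,100 sites with 1,050 G
# in the unrestricted run, 1,000 sites with 998 G in the filtered run.
print(excluded_subset_frequency(1050, 1100, 998, 1000))  # prints 0.52
```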
3. Let's reconsider the higher quality data set (where each intron and exon has length at least 20). Identify a consensus nucleotide sequence at all nucleotide positions with greater than 0.5 bits of information.
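Once the part 1 tables are saved, the consensus can be read off by keeping only the positions above the 0.5-bit threshold and taking the most frequent nucleotide at each. The sketch below assumes the rows have been exported as (position, information, counts) tuples. That layout, the function name, and the example values are hypothetical and purely illustrative.

```python
def consensus_above_threshold(rows, min_bits=0.5):
    """Return [(position, nucleotide)] for positions with more than min_bits."""
    consensus = []
    for position, bits, counts in sorted(rows, key=lambda row: row[0]):
        if bits > min_bits:
            consensus.append((position, max(counts, key=counts.get)))
    return consensus


# Made-up rows for illustration only (NOT values from the database).
rows = [
    (-1, 0.30, {"A": 30, "C": 10, "G": 50, "T": 10}),
    (1, 1.90, {"A": 0, "C": 1, "G": 98, "T": 1}),
    (2, 1.85, {"A": 1, "C": 1, "G": 0, "T": 98}),
]
print(consensus_above_threshold(rows))  # prints [(1, 'G'), (2, 'T')]
```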
The sequence [3' UCCAUUCA 5'] in the Drosophila U1 snRNA is thought to bind to the consensus sequence near the donor site.
Draw a diagram of the predicted RNA/RNA base pairing.
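A text version of the base-pairing diagram can be generated to check a hand drawing. The sketch below is one possible approach, with hypothetical function and variable names. It marks Watson-Crick pairs only (G-U wobble pairs are deliberately not counted) and assumes you supply an 8-nucleotide stretch of your consensus, written 5' to 3', already aligned against the U1 sequence from the handout.

```python
WATSON_CRICK = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}

# U1 snRNA sequence from the handout, written 3' -> 5'.
U1_3_TO_5 = "UCCAUUCA"


def pairing_diagram(pre_mrna_5_to_3, u1_3_to_5=U1_3_TO_5):
    """Return a three-line diagram of antiparallel pairing with U1 snRNA.

    pre_mrna_5_to_3 may be given as cDNA (T) or RNA (U); T is converted to U.
    """
    rna = pre_mrna_5_to_3.upper().replace("T", "U")
    if len(rna) != len(u1_3_to_5):
        raise ValueError("sequences must have the same length")
    bonds = "".join(
        "|" if (a, b) in WATSON_CRICK else " " for a, b in zip(rna, u1_3_to_5)
    )
    return "\n".join(
        [
            f"5' {rna} 3'  pre-mRNA",
            f"   {bonds}",
            f"3' {u1_3_to_5} 5'  U1 snRNA",
        ]
    )


# Arbitrary 8-nucleotide example; substitute the consensus you identified.
print(pairing_diagram("GGGTAAGC"))
```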
4. To study the effects of intron length on information values, first set a range of small intron lengths (e.g. 64-80 using the Minimum and Maximum Intron Length parameters) and compute the information for positions -10 to 10. Compute information for both donors and acceptors. Store the output tables in Excel.
Next, compute the information for positions -10 to 10 using a Minimum Intron Length of 8192. (Note: you will also need to set the Maximum Intron Length to 0.)
(a) Compare the total information at the donor and acceptor sites for the longer and shorter introns.
(b) Compare the amounts of information at each nucleotide position for the longer and shorter introns.
(c) For the positions with large information differences, compare the nucleotide content. How does your result relate to the U1 snRNA binding sequence discussed in part 3?
(d) Do you notice any other differences between the longer and shorter intron data sets? For example, compare the A content in each data set. What conclusions can you draw?
See Weir and Rice (2004) for a more complete analysis of the two datasets.
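The comparisons in 4(a) and 4(b) can be done in Excel, or with a few lines of code once the Information columns are exported. The sketch below assumes each data set is held as a mapping from nucleotide position to information in bits. The function name and the example values are hypothetical.

```python
def compare_information(short_bits, long_bits):
    """Compare two position -> information (bits) mappings.

    Returns (total_short, total_long, differences), where differences maps
    each shared position to long minus short.
    """
    positions = sorted(set(short_bits) & set(long_bits))
    total_short = sum(short_bits[p] for p in positions)
    total_long = sum(long_bits[p] for p in positions)
    differences = {p: long_bits[p] - short_bits[p] for p in positions}
    return total_short, total_long, differences


# Made-up values for illustration only (NOT from the database).
short_introns = {-1: 0.5, 1: 2.0}
long_introns = {-1: 1.0, 1: 2.0}
print(compare_information(short_introns, long_introns))
# prints (2.5, 3.0, {-1: 0.5, 1: 0.0})
```

Positions with the largest absolute differences are the ones to examine for nucleotide content in 4(c).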
Conclusion: This session has provided examples of the power of
relational databases in allowing partitioning of large datasets into
substantial subsets which can be compared to provide useful insights into the
data.
(a) Answer the questions in part 2.
(b) Summarize at least two general conclusions that you can draw from the analysis of splice sites.
(c) What possible molecular-based hypotheses are suggested by the analysis?
(d) Answer the questions in parts 3 and 4.
Stephens, R.M. and Schneider, T.D. 1992. Features of spliceosome evolution and function inferred from an analysis of the information at human splice sites. J Mol Biol 228:1124-1136.
Weir, M.P. and Rice, M.D. 2004. Ordered Partitioning Reveals Extended Splice Site Consensus Information. Genome Research 14:67-78.
Weir, M., Eaton, M. and Rice, M. 2006. Challenging the spliceosome machine. Genome Biology 7:R3.
Copyright Wesleyan University 2006