Information Theoretic Analysis of Sequences:

Drosophila Splice Site Database

IGS 350/550 Computer Laboratory

M. Rice / M. Weir

 


 

The Weir-Rice research group has constructed a relational database containing nucleotide sequences in the vicinity of 11,161 donor and acceptor splice sites in 3,375 Drosophila cDNAs (Weir and Rice, 2004).  In particular, the nucleotides at positions -32 to 32 are stored (where the splice site position is 0) and the sequences at each type of splice site are aligned. 

 

Several stored procedures are available for analyzing the data set through the web interface at http://igs.wesleyan.edu/. Click "Databases & Tools" in the left-hand menu, choose the option "Use WesQL to run stored procedures on public IGS Splice Site and RNA Databases", and then click the stored procedures tab to see the list of stored procedures.

 

 

1.  Select the stored procedure Compute Splice Site Information and click the Continue button.  You will see a number of pre-set parameters, including the cDNA table (Wesleyan Known cDNA), Intron Table (Wesleyan Known Introns), Splice Site Table (Wesleyan Known Splice Sites), and Minimum Splice Element Length (20).

 

Change the default parameters by setting the Type of Site to Donor Sites and the Start and Finish positions to -4 and 8.

 

Click the Execute Stored Procedure button.  Executing the procedure generates several HTML tables that contain the following entries:

 

Transcripts - number of transcripts meeting filter criteria

 

SpliceElements - number of introns

 

Sitetype - donor (D) or acceptor (A)

 

Nuclposition - nucleotide position with respect to aligned donor or acceptor sites (with splice sites at position 0)

 

Information - at each nucleotide position, this is calculated using the formula

 

information = 2 - [-fA * log2(fA) - fC  * log2(fC) - fG  * log2(fG) - fT  * log2(fT)] - g

 

where the quantity in brackets is the uncertainty (Shannon entropy) at the given position based on the frequency of occurrence fA, ..., fT of the nucleotides A, ..., T, at the given position, and the correction factor g depends on the number of splice sites that are being aligned.

 

nA, nC, nG, nT - the number of occurrences of each nucleotide at the given position

 

pA, pC, pG, pT - the proportion (estimated probability) of each nucleotide at the given position, i.e. the frequencies fA, fC, fG, fT in the formula above

 

[Note: The T in the cDNA corresponds to the U in the RNA.]
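The Information entry can be checked by hand from the nA, nC, nG, nT columns. The following is a minimal Python sketch of the formula above, not the stored procedure's actual code. The function name and the dictionary input are illustrative, and the form of the correction factor g is an assumption: it uses the common approximate small-sample correction 3 / (2 ln(2) n) for n aligned sites, which may differ from the value used by the database.

```python
import math

def position_information(counts, small_sample_correction=True):
    """Information (in bits) at one aligned position.

    counts: dict mapping 'A', 'C', 'G', 'T' to the observed counts
            (the nA, nC, nG, nT columns); missing keys count as 0.
    """
    n = sum(counts.values())
    if n == 0:
        raise ValueError("no sequences at this position")

    # Shannon uncertainty: -sum of f * log2(f); a frequency of 0 contributes 0.
    uncertainty = 0.0
    for base in "ACGT":
        f = counts.get(base, 0) / n
        if f > 0:
            uncertainty -= f * math.log2(f)

    # ASSUMPTION: approximate small-sample correction g = 3 / (2 * ln(2) * n).
    # The text only says that g depends on the number of aligned splice sites.
    g = 3.0 / (2.0 * math.log(2) * n) if small_sample_correction else 0.0

    return 2.0 - uncertainty - g

# Two limiting cases (correction switched off so the values are exact):
# an invariant position carries 2 bits, a uniform position carries 0 bits.
invariant = position_information({"A": 0, "C": 0, "G": 100, "T": 0}, False)
uniform = position_information({"A": 25, "C": 25, "G": 25, "T": 25}, False)
print(invariant, uniform)
```

With thousands of aligned sites the assumed correction is very small, so it mainly matters when a subset contains few introns.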

 

(a) How many cDNAs were used? (3,090)

 

(b) How many introns were analyzed? (10,057)

 

(c) What are the consensus nucleotide values for positions 1 and 2 at the donor sites (D+1, D+2)? (GT)

 

(d) What are the percentages of occurrence of the predominant nucleotides at these positions? (99.8%, 99.2%)

 

Store the HTML tables in Microsoft Excel - we will use this data in part 3.

 

 

2.  Repeat part 1 without excluding cDNAs that contain splice elements (introns or exons) shorter than 20 nucleotides - i.e. with no restriction on the lengths of the introns and exons.

 

(a) What are the percentages of occurrence of the predominant nucleotides at positions D+1, D+2?

 

(b) What are some possible reasons why the GT consensus is not as well represented as in part 1? [consider the algorithm used to compute the splice sites - see Weir and Rice (2004)]

 

(c) For the set of cDNAs that contain either an intron or an exon with length less than 20, calculate the frequency of the canonical G and T at positions D+1 and D+2, respectively [hint: compare nucleotide counts from 1(d) and 2(a)].
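The hint in (c) amounts to a subtraction: the sites that appear only in the unrestricted run are those of part 2 minus those of part 1. A minimal Python sketch of that arithmetic follows. The function name is illustrative and the counts are made-up placeholders, not values from the database; substitute the counts from your own output tables.

```python
def subset_frequency(count_all, total_all, count_filtered, total_filtered):
    """Frequency of a nucleotide among the sites that are present in the
    unrestricted run (part 2) but absent from the filtered run (part 1)."""
    subset_total = total_all - total_filtered
    if subset_total <= 0:
        raise ValueError("the unrestricted set must be larger than the filtered set")
    return (count_all - count_filtered) / subset_total

# HYPOTHETICAL counts for the consensus nucleotide at one position.
g_all, n_all = 10900, 11000            # part 2: no length restriction
g_filtered, n_filtered = 9980, 10000   # part 1: minimum splice element length 20

freq = subset_frequency(g_all, n_all, g_filtered, n_filtered)
print(f"{freq:.3f}")  # prints 0.920 for these placeholder counts
```

Repeat the calculation separately for D+1 and D+2, since the two positions have different counts.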

 

 

3.  Let's reconsider the higher-quality data set from part 1 (where each intron and exon has length at least 20). Identify a consensus nucleotide sequence at all nucleotide positions with greater than 0.5 bits of information.

 

The sequence [3' UCCAUUCA 5'] in the Drosophila U1 snRNA is thought to bind to the consensus sequence near the donor site.

 

Draw a diagram of the predicted RNA/RNA base pairing.
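As a starting point for the diagram, the following minimal Python sketch prints the strand that would pair perfectly with the U1 sequence quoted above, so that it can be compared position by position with the consensus you identified. It is an illustration only: the function name is hypothetical, and it considers Watson-Crick pairs only (G-U wobble pairs are ignored).

```python
# Watson-Crick partners for RNA.
PAIR = {"A": "U", "U": "A", "G": "C", "C": "G"}

def complement_strand(rna_3to5):
    """Partner strand, read 5'->3', of an RNA sequence written 3'->5'."""
    return "".join(PAIR[base] for base in rna_3to5)

u1 = "UCCAUUCA"                  # written 3' -> 5', as in the text
partner = complement_strand(u1)  # read 5' -> 3'

print("5' " + partner + " 3'   perfectly complementary strand")
print("   " + "|" * len(u1))
print("3' " + u1 + " 5'   U1 snRNA")
```

Positions where your consensus differs from the printed partner strand are positions where the predicted pairing is imperfect.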

 

 

4.  To study the effects of intron length on information values, first set a range of small intron lengths (e.g. 64-80 using the Minimum and Maximum Intron Length parameters) and compute the information for positions -10 to 10.  Compute information for both donors and acceptors.  Store the output tables in Excel.

 

Next, compute the information for positions -10 to 10 using a Minimum Intron Length of 8192. (Note - you will also need to set the Maximum Intron Length to 0.)

 

(a) Compare the total information at the donor and acceptor sites for the longer and shorter introns.

 

(b) Compare the amounts of information at each nucleotide position for the longer and shorter introns.

 

(c) For the positions with large information differences, compare the nucleotide content.  How does your result relate to the U1 snRNA binding sequence discussed in part 3?

 

(d) Do you notice any other differences between the longer and shorter intron data sets?  For example, compare the A content in each data set.  What conclusions can you draw?

 

See Weir and Rice (2004) for a more complete analysis of the two datasets.
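The comparisons in 4(a) and 4(b) can be done in Excel, or with a few lines of Python once the Information column has been copied out for each run. The sketch below is one possible way to do it. The function names are illustrative and the information values are made-up placeholders, not results from the database.

```python
def total_information(info_by_position):
    """Sum of the per-position information values (bits) over a window."""
    return sum(info_by_position.values())

def information_differences(short_introns, long_introns):
    """Per-position difference (long minus short) at positions in both tables."""
    shared = sorted(set(short_introns) & set(long_introns))
    return {pos: long_introns[pos] - short_introns[pos] for pos in shared}

# HYPOTHETICAL values: position -> information in bits, for one site type.
short_info = {-1: 0.5, 1: 2.0, 2: 2.0, 3: 0.25}
long_info = {-1: 0.75, 1: 2.0, 2: 2.0, 3: 0.75}

print(total_information(short_info), total_information(long_info))
for pos, diff in information_differences(short_info, long_info).items():
    print(pos, diff)
```

Run the comparison separately for donor and acceptor sites, and then look at the nucleotide counts at the positions with the largest differences.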

 

 

Conclusion: This session has illustrated how relational databases make it possible to partition a large dataset into substantial subsets, which can then be compared to provide useful insights into the data.

 

 

Assignment

 

(a) Answer the questions in part 2.

 

(b) Summarize at least two general conclusions that you can draw from the analysis of splice sites.

 

(c) What possible molecular-based hypotheses are suggested by the analysis?

 

(d) Answer the questions in parts 3 and 4.

 

References:

 

Stephens, R.M. and Schneider, T.D. 1992. Features of spliceosome evolution and function inferred from an analysis of the information at human splice sites. J Mol Biol 228:1124-1136.

 

Weir, M.P. and Rice, M.D. 2004. Ordered Partitioning Reveals Extended Splice Site Consensus Information. Genome Research 14:67-78.

 

Weir, M., Eaton, M. and Rice, M. 2006. Challenging the spliceosome machine. Genome Biology 7:R3.

 


Copyright Wesleyan University 2006