EST Sequence Assembly
IGS 350/550 Computer Laboratory
M. Rice / M. Weir
An important component of the genome projects is to assemble mRNA
sequence information for all genes. One approach towards this goal is
to systematically sequence large numbers of cDNAs from cDNA
libraries.
High-quality sequence runs are typically about 500 bp, whereas most mRNAs are longer.
Therefore it is necessary to perform analysis of cDNA sequences to
identify overlaps and thereby predict larger sequence fragments of
mRNAs. This analysis can be complicated by issues including the
possibility of alternative splicing.
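The idea of joining two reads on a shared end can be illustrated with a minimal Python sketch. It is not part of the original exercise and makes several simplifying assumptions: the function name merge_on_overlap and the min_overlap parameter are hypothetical, only exact matches are considered, and both reads are assumed to be on the same strand. Real EST data, with sequencing errors or polymorphisms, would need a mismatch-tolerant comparison.

```python
from typing import Optional


def merge_on_overlap(left: str, right: str, min_overlap: int = 20) -> Optional[str]:
    """Merge two reads if a suffix of `left` exactly matches a prefix of `right`.

    Returns the combined sequence using the longest such overlap, or None if
    no overlap of at least `min_overlap` bases exists. (Illustrative only:
    exact matching, same-strand reads.)
    """
    min_overlap = max(min_overlap, 1)  # a zero-length overlap is not meaningful
    longest_possible = min(len(left), len(right))
    # Try the longest candidate overlap first, then progressively shorter ones.
    for k in range(longest_possible, min_overlap - 1, -1):
        if left[-k:] == right[:k]:
            return left + right[k:]
    return None


if __name__ == "__main__":
    # Toy sequences, not taken from S1-S4.
    print(merge_on_overlap("ATGGCGTACGTTAGC", "GTTAGCCTAGGA", min_overlap=4))
```

Trying the longest overlap first keeps the sketch simple, but it can be misled by repeated sequence; a BLAST alignment gives a more reliable picture of where two reads really overlap.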
We have assembled several Drosophila sequence segments with
overlapping sequence -- an example of the kind of data that might
emerge from this type of analysis.
S1
S2
S3
S4
1. Use the "BLAST 2 sequences" server to determine the regions of overlap between these four sequences. Enter each sequence in FASTA format, with >sequencename on the line above the sequence.
2. Based on the overlaps, assemble a predicted composite cDNA segment. You may find it useful to use the OLIGO program to work with each sequence.
3. Notice that the overlap between two of the sequences contains some mismatches. What are three possible explanations for this?
4. To resolve this issue, and to assess whether the composite cDNA represents a real mRNA, it is useful to compare the composite cDNA with Drosophila genomic sequence. Go to the Drosophila FlyBase BLAST server. Use your composite cDNA as a query against the whole euchromatic genome sequence (i.e. choose the Genome Section "Genome Assembly (NT)"). Use the program Blastn nt->NT.
5. Does your output allow you to distinguish between the possible explanations for the mismatches (step 3 above)? Notice the orientations of your cDNA fragments. (Assume that all the cDNA sequences correspond to the mRNA single-strand sequences, not the antisense sequences.) Develop a model to explain all the results of your BLAST search.
6. To test your model, perform a BLAST search with the composite cDNA as input against the dataset of predicted Drosophila genes (using Feature Type "Annotated Genes (NT)" on the FlyBase BLAST server). Does the BLAST search confirm your model?
7. Use the BLAST result to link to the matching predicted gene(s).
8. View the genes using the "Map(GBrowse)" link (on the left-hand side of the gene report). This will help you assess your model. You may find it useful to zoom out on the map (e.g. change to "show 100 kbp") in order to see neighboring genes. Notice that the Gene Region Map can show several maps based on your choices at the bottom of the page:
- DNA sequence map (we are looking at 7M on chromosome 2R)
- cytologic map showing chromosome band names ("cytologic band"; we are in the region of band 47F17)
- mutation map ("point_mutation")
- gene model map ("Gene span"; notice the genes en and inv)
- predicted gene map (e.g. "Genescan prediction")
- mRNA map ("mRNA")
- protein coding sequence map ("CDS")
- DNA maps referring to DNA clones used in the sequence assembly ("Tiling BAC")
- sequenced cDNA clones ("cDNA and other aligned sequences")
- microarray probes ("Affymetrix v1 or v2")
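Two small helpers may be handy when preparing BLAST input and when interpreting hits that align to the opposite strand. The Python sketch below is an illustration only and is not part of the original exercise. The function names to_fasta and reverse_complement and the 60-character line width are assumptions chosen for this example.

```python
# Translation table for the four standard bases; other symbols such as N
# are left unchanged by str.translate.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the sequence of the opposite DNA strand, read 5' to 3'."""
    return seq.translate(_COMPLEMENT)[::-1]


def to_fasta(name: str, seq: str, width: int = 60) -> str:
    """Format a sequence as a FASTA record: a '>name' line, then wrapped sequence."""
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return ">" + name + "\n" + "\n".join(lines)


if __name__ == "__main__":
    toy = "ATGCCGTA"  # toy sequence, not taken from S1-S4
    print(to_fasta("toy_forward", toy))
    print(to_fasta("toy_reverse", reverse_complement(toy)))
```

If a cDNA fragment matches the genome on the minus strand, comparing its reverse complement with the genomic sequence shows the same alignment in the forward orientation.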
This analysis illustrates the kinds of issues that arise during the
assembly of both genomic and mRNA sequences. It is wise to try to
confirm interpretations using independent data -- in this case, by
comparing cDNA and genomic sequences.
Assignment:
- Answer question 3 above.
- Answer question 5 above and explain your conclusion.
- Using artificial sequence constructs (using S1, S2, S3, S4), determine a way to deduce ALL overlaps between these four sequences using a SINGLE call of the "BLAST 2 sequences" server. Provide your input and output.
- (optional) Using pseudocode or an actual programming language, consider how you would design a prefix-suffix overlap detection function that takes two strings as input and, if the two strings overlap, outputs the inferred combined sequence. You may consider an exact-match or an imprecise-match version.
Copyright 2007 Wesleyan University