EST Sequence Assembly

IGS 350/550 Computer Laboratory

M. Rice / M. Weir


An important component of the genome projects is to assemble mRNA sequence information for all genes. One approach towards this goal is to systematically sequence large numbers of cDNAs from cDNA libraries.

High quality sequence runs are typically about 500 bp, whereas mRNAs are typically longer.

Therefore, cDNA sequences must be analyzed to identify overlaps and thereby predict longer stretches of mRNA sequence. This analysis can be complicated by several issues, including the possibility of alternative splicing.


We have assembled several Drosophila sequence segments with overlapping sequence -- an example of the kind of data that might emerge from this type of analysis.

S1

S2

S3

S4


  1. Use the "BLAST 2 sequences" server to determine the regions of overlap between these four sequences. Enter each sequence in FASTA format, with >sequencename on the line above the sequence.

  2. Based on the overlaps, assemble a predicted composite cDNA segment. You may find it useful to use the OLIGO program to work with each sequence.

  3. Notice that the overlap between two of the sequences contains some mismatches: what are three possible explanations for this?

  4. To resolve this issue, and assess whether the composite cDNA represents a real mRNA, it is useful to compare the composite cDNA with Drosophila genomic sequence. Go to the Drosophila Flybase BLAST server. Use your composite cDNA as a query against the whole euchromatic genome sequence (i.e. choose the Genome Section "Genome Assembly (NT)"). Use the program Blastn nt->NT.

  5. Does your output allow you to distinguish between the possible explanations for the mismatches (step 3 above)? Notice the orientations of your cDNA fragments. (Assume that all the cDNA sequences correspond to the mRNA single-strand sequences, not the antisense sequences.) Develop a model to explain all the results of your BLAST search.

  6. To test your model, perform a BLAST search with the composite cDNA (input) against the dataset of predicted Drosophila genes (using Feature Type "Annotated Genes (NT)" on Flybase BLAST server). Does the BLAST search confirm your model?

  7. Use the BLAST result to link to matching predicted gene(s).

  8. View the genes using the "Map(GBrowse)" link (on the left-hand side of the gene report). This will help you assess your model. You may find it useful to zoom out on the map (e.g. change to "show 100 kbp") in order to see neighboring genes. Notice that the Gene Region Map can show several maps based on your choices at the bottom of the page.

    This analysis illustrates the kinds of issues that arise during the assembly of both genomic and mRNA sequences. It is wise to confirm interpretations using independent data -- in this case, by comparing cDNA and genomic sequences.
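
For steps 1 and 5 above, it can help to prepare the input programmatically. The sketch below is only an illustration: the helper names to_fasta and reverse_complement are hypothetical, and the short sequence is a placeholder rather than any of S1-S4. It formats a sequence as a FASTA record and produces the reverse complement, which is what you would compare against if a fragment matches the genome in the opposite orientation.

```python
def to_fasta(name: str, seq: str, width: int = 60) -> str:
    """Format one sequence as a FASTA record: '>name', then wrapped sequence lines."""
    lines = [f">{name}"]
    lines += [seq[i:i + width] for i in range(0, len(seq), width)]
    return "\n".join(lines)


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A<->T, C<->G, N kept)."""
    table = str.maketrans("ACGTNacgtn", "TGCANtgcan")
    return seq.translate(table)[::-1]


if __name__ == "__main__":
    placeholder = "ATGCATGCAATT"  # placeholder only, not one of S1-S4
    print(to_fasta("example", placeholder, width=6))
    print(to_fasta("example_rc", reverse_complement(placeholder), width=6))
```

The line width of 60 is a common convention, not a requirement of the FASTA format.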


    Assignment:

    1. Answer question 3 above.

    2. Answer question 5 above and explain your conclusion.

    3. Using artificial sequence constructs (using S1, S2, S3, S4), determine a way to deduce ALL overlaps between these four sequences using a SINGLE call of the "BLAST 2 sequences" server. Provide your input and output.

    4. (optional) Using pseudocode or an actual programming language, consider how you would design a prefix-suffix overlap detection function that takes two strings as input and, if the two strings overlap, outputs the inferred combined sequence. You may consider an exact-match or an imprecise-match version.
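
As one possible starting point for item 4, here is a minimal Python sketch. It is not a reference solution. The function names (merge_exact, merge_with_mismatches) and the parameters min_overlap and max_mismatches are illustrative assumptions. Both functions look for the longest suffix of the first string that matches a prefix of the second. The imprecise version tolerates a fixed number of substitutions only. It does not handle insertions, deletions, or sequences in the opposite orientation.

```python
from typing import Optional


def merge_exact(a: str, b: str, min_overlap: int = 10) -> Optional[str]:
    """If a suffix of `a` exactly equals a prefix of `b`, return the merged string.

    The longest overlap of at least `min_overlap` characters is used.
    Returns None when no such overlap exists.
    """
    shortest = max(min_overlap, 1)  # avoid the k == 0 slicing edge case
    for k in range(min(len(a), len(b)), shortest - 1, -1):
        if a[-k:] == b[:k]:
            return a + b[k:]
    return None


def merge_with_mismatches(a: str, b: str, min_overlap: int = 10,
                          max_mismatches: int = 2) -> Optional[str]:
    """Like merge_exact, but allow up to `max_mismatches` substitutions.

    Where the two strings disagree inside the overlap, the bases from `a`
    are kept. That is an arbitrary choice made for this sketch.
    """
    shortest = max(min_overlap, 1)
    for k in range(min(len(a), len(b)), shortest - 1, -1):
        mismatches = sum(x != y for x, y in zip(a[-k:], b[:k]))
        if mismatches <= max_mismatches:
            return a + b[k:]
    return None


if __name__ == "__main__":
    # Toy strings for illustration only, not S1-S4.
    left = "ACGTACGTTAGC"
    right = "TTAGCGGATCCA"
    print(merge_exact(left, right, min_overlap=4))
    print(merge_with_mismatches(left, "TTCGCGGATCCA",
                                min_overlap=4, max_mismatches=1))
```

Note that a small min_overlap combined with a generous max_mismatches will report chance overlaps, so real sequence data would need a more careful threshold (for example, a limit on the fraction of mismatched positions).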


    Copyright 2007 Wesleyan University