DNA as a Language

IGS 350/550 Computer Laboratory

M. Rice / M. Weir


Illustrated again are the cDNA and genomic DNA sequences of the Drosophila engrailed gene, together with an alignment of the two sequences. [engrailed is a gene required for embryo development.]

  1. What is the definition of an open reading frame (ORF)? [Think of a definition involving the start codon ATG, and the stop codons TAA, TAG, and TGA.]

  2. [Homework due next Friday] Write pseudocode for a program that identifies all ORFs of a DNA string. Convert your pseudocode to a program, e.g. in C++ or Python. Run your program on the engrailed cDNA and genomic sequences [you might initially work with the first 200 bases of the sequence]. Your group should hand in a print-out of the output for engrailed cDNA.

    For comparison, an example of a function that identifies all the ATG codons in a DNA sequence is illustrated in
    C++ and in Python.
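
    As a rough idea of what such an ATG-finding function might look like in Python, here is a minimal sketch. The function name find_atg_positions and the toy sequence are made up for illustration; this is not necessarily the example referred to above. It reports every ATG in all three reading frames of the given strand and does not look for stop codons, so it is only a starting point for the ORF program.

```python
def find_atg_positions(dna):
    """Return the 0-based index of every ATG in dna, in any reading frame."""
    dna = dna.upper()
    positions = []
    start = dna.find("ATG")
    while start != -1:
        positions.append(start)
        # Resume the search one base further along so no later ATG is skipped.
        start = dna.find("ATG", start + 1)
    return positions


if __name__ == "__main__":
    toy = "CCATGAAATGATGTAA"  # made-up sequence, NOT the engrailed cDNA
    for pos in find_atg_positions(toy):
        # pos % 3 tells you which of the three reading frames the ATG is in.
        print("ATG at index", pos, "reading frame", pos % 3)
```

    The index modulo 3 gives the reading frame, which is useful later when pairing each start codon with the first in-frame stop codon.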

  3. Compare your output with the outputs from the ORF finder at NCBI using the same sequences. You can click on the ORF boxes to display the ORF amino acids.

  4. Here is the actual ORF used for the Engrailed protein. Use the codon tables to work out by hand the first ten amino acids of Engrailed. (See jkimball for a codon table and description of translation.)
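
    Once you have worked out the amino acids by hand, you could check your answer with a short script. The sketch below builds the standard genetic code from a compact 64-letter string and translates an ORF codon by codon; the names CODON_TABLE and translate, and the toy ORF, are illustrative assumptions and not part of the course materials.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G at each position.
# "*" marks the stop codons TAA, TAG and TGA.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(orf, max_residues=None):
    """Translate orf (starting at its first base) into one-letter amino acids.

    Stops at the first stop codon, or after max_residues amino acids if given.
    Raises KeyError on codons containing letters other than A, C, G, T/U.
    """
    orf = orf.upper().replace("U", "T")  # accept RNA as well as DNA
    protein = []
    for i in range(0, len(orf) - 2, 3):
        aa = CODON_TABLE[orf[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
        if max_residues is not None and len(protein) == max_residues:
            break
    return "".join(protein)


if __name__ == "__main__":
    toy_orf = "ATGAAATGGTGA"  # made-up ORF, NOT the engrailed ORF
    print(translate(toy_orf, max_residues=10))
```

    Calling translate with max_residues=10 on the Engrailed ORF would give the first ten amino acids to compare with your hand translation.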

  5. Why are other ORFs of the engrailed cDNA not actually used to make the Engrailed protein? For example, there are many copies of the start codon ATG (corresponding to methionine) within the long ORF.

    Kozak (for many organisms) and Cavener (for Drosophila) [see assigned background reference: Cavener, D.R. (1987) Comparison of the consensus sequence flanking translational start sites in Drosophila and vertebrates. Nucleic Acids Res. 15:1353-1361] have examined the frequencies of the different bases at positions near the known start (ATG) codons of large numbers of proteins. In particular, Cavener collected the sequences that occur in the 10 positions upstream (5') (-10 to -1) of the AUG that initiates translation for many fly genes. Here is a partial listing of these sequences.

  6. [Optional] Calculate the frequency of occurrence of each nucleotide at each position using the Cavener data. Then compare the 10 bp upstream of the actual translation start (AUG) of engrailed with the 10 bp upstream of each internal AUG codon. (The 10 bp immediately upstream of the actual AUG used for initiating translation are: GTCGAAACCA.)
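
    One possible way to set up the frequency calculation is sketched below in Python. The function name position_frequencies is hypothetical, and the list toy_upstream is placeholder data: its first entry is the engrailed upstream sequence quoted above, while the other three are invented and are NOT taken from the Cavener listing. Substitute the real sequences before drawing any conclusions.

```python
from collections import Counter


def position_frequencies(sequences):
    """Return one dict per position mapping each base to its frequency.

    All sequences must have the same length (10 for positions -10 to -1).
    """
    sequences = [s.upper().replace("U", "T") for s in sequences]
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("all upstream sequences must have the same length")
    table = []
    # zip(*sequences) walks through the sequences column by column.
    for column in zip(*sequences):
        counts = Counter(column)
        table.append({base: counts[base] / len(sequences) for base in "ACGT"})
    return table


if __name__ == "__main__":
    toy_upstream = [
        "GTCGAAACCA",  # engrailed, from the handout
        "ATCAAAACAA",  # invented placeholder
        "GTTCAAAAAA",  # invented placeholder
        "CACGAAATCA",  # invented placeholder
    ]
    freqs = position_frequencies(toy_upstream)
    for offset, row in zip(range(-len(freqs), 0), freqs):
        print(offset, row)
```

    The same table can then be used to ask how typical the 10 bases upstream of each internal AUG are, for example by looking up the frequency of the observed base at each position.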


Copyright 2005 Wesleyan University