BIOL/MBB 210
Michael Weir
Introduction
In this genomics laboratory, we will examine the human gene Aniridia and its Drosophila homolog, eyeless. The two genes are described as homologs because their protein sequences are very similar: the Drosophila sequence most similar to human Aniridia is Drosophila eyeless.
The human gene has several names:
Pax 6
Aniridia
Paired box gene 6
Aniridia is named after a human disease that affects development of the eye. The Drosophila eyeless mutant phenotype, as its name suggests, also affects eye development. This similarity of function is striking since vertebrate and insect eyes are morphologically very different. This is discussed on pages 6-7 of the Hartwell et al. (2004) text.
Drosophila eyeless
Let us start by
looking at the Drosophila homolog, eyeless (ey).
Using Flybase, go to the eyeless gene of Drosophila melanogaster.
How many
different transcripts are there?
What is a likely
explanation for there being more than one Drosophila ey transcript?
Compare transcript ey-RA with ey-RB:
a. Describe the splicing events that give rise to these two transcripts.
b. Identify the sequences of the mRNA segment(s) that differ.
c. Compare the protein sequences encoded by ey-RA and ey-RB (ey-PA and ey-PB).
You will find it useful to use several sequence analysis programs to address these questions.
--BLAST two sequences allows you to align two sequences
--ORF finder shows all possible protein open reading frames of a DNA sequence
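To make concrete what an ORF finder reports, here is a minimal Python sketch. It is not the NCBI ORF finder itself: it assumes the standard genetic code, treats an ORF as a stretch running from an ATG to the first in-frame stop codon, and ignores ORFs that run off the end of the sequence. The function names and the short example sequence are made up for illustration.

```python
# Standard genetic code, built from the conventional T, C, A, G codon ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def reverse_complement(dna):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_orfs(dna, min_length=2):
    """Scan all six reading frames and return (strand, frame, protein) tuples.

    An ORF here starts at ATG and ends at the first in-frame stop codon.
    min_length is the minimum number of amino acids to report (hypothetical
    parameter for this sketch).
    """
    dna = dna.upper()
    orfs = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            protein = None  # None means we are not currently inside an ORF
            for pos in range(frame, len(seq) - 2, 3):
                aa = CODON_TABLE.get(seq[pos:pos + 3], "X")  # X = unknown codon
                if protein is None:
                    if aa == "M":
                        protein = "M"
                elif aa == "*":
                    if len(protein) >= min_length:
                        orfs.append((strand, frame, protein))
                    protein = None
                else:
                    protein += aa
    return orfs


# Made-up example sequence, not taken from eyeless.
for strand, frame, protein in find_orfs("CCATGGCTTAAGG"):
    print(strand, frame, protein)
```

Running this on a real transcript sequence copied from Flybase should give you a feel for why only one of the six frames usually contains a long open reading frame.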
It is also often convenient to paste sequences and screen images of results into a document (e.g. Microsoft Word) so that you have a record of them.
As we make use
of the information reported by the Drosophila Genome Project, it is worth
considering issues of accuracy:
(i) What criteria might a genome project use to decide whether to report multiple transcripts and proteins for a given gene? Are these criteria based on known data or predictions?
(ii) How much confidence do you have in the accuracy of information reported on a genome project site?
We tend to assume that information reported in genome projects is accurate. However, we should always keep in mind the basis for the reported information, and consider possible sources of error. The information is often the best available at the time, but it is not always completely accurate, and it is often corrected over time.
Human Aniridia
Now let us move
to the human gene.
Use ey-PA as the query for a BLAST search at NCBI BLAST. Use protein-protein BLAST (blastp), and limit your search to human proteins.
The BLAST search immediately indicates that eyeless has two highly conserved motifs -- the paired domain and homeodomain. Indeed, these regions of the protein products are where there is most sequence conservation when comparing the human Aniridia and Drosophila Eyeless proteins. Both motifs have DNA-binding activity (consistent with the proteins being transcription factors). We will come back to these motifs at the end of this session.
Notice that the highest scoring matches from your BLAST search are:
--paired box gene 6 (aniridia, keratitis)
--Paired box protein Pax-6 (Oculorhombin) (Aniridia, type II protein)
At NCBI PubMed,
look up the protein for Aniridia.
Initially, use the "gene" search option.
Notice
that the coding region of paired box gene 6 has two isoforms: a and b.
Compare
these two protein sequences using BLAST two sequences.
[Try
the comparison with and without the filter -- how does this affect the
output?]
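The kind of pairwise comparison described above can also be sketched in a few lines of Python. This is not how BLAST works: it uses the standard-library `difflib.SequenceMatcher`, which only finds exact matching blocks, so it is suitable only when two sequences are nearly identical apart from an inserted, deleted, or substituted segment (as is typical for splice isoforms). The function name and the two toy sequences are invented for illustration and are not the real isoform sequences.

```python
from difflib import SequenceMatcher


def differing_segments(seq_a, seq_b):
    """List the segments where two nearly identical sequences differ.

    Returns tuples of (operation, start_in_a, segment_of_a,
    start_in_b, segment_of_b), with 0-based start positions.
    """
    # autojunk=False disables difflib's heuristic that ignores very
    # frequent characters in long sequences.
    matcher = SequenceMatcher(None, seq_a, seq_b, autojunk=False)
    diffs = []
    for tag, a0, a1, b0, b1 in matcher.get_opcodes():
        if tag != "equal":
            diffs.append((tag, a0, seq_a[a0:a1], b0, seq_b[b0:b1]))
    return diffs


# Toy sequences (hypothetical): the second carries a four-residue insertion.
isoform_1 = "MKTAYLLPW"
isoform_2 = "MKTAYQQQQLLPW"
for diff in differing_segments(isoform_1, isoform_2):
    print(diff)
```

If you paste in the two real isoform sequences, the position and length of any reported segment are worth comparing with what BLAST two sequences shows, and with the exon structure of the gene.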
Provide
a likely explanation for the difference between the two protein isoforms.
If you use the "protein" search at NCBI PubMed and enter "Aniridia", you will see multiple individual entries -- Aniridia proteins in several organisms. Indeed, we should note that NCBI PubMed has a wealth of information available from a large number of different sources. For example, the "PubMed" search links to publications in the scientific literature; the "books" search links to a set of online books in molecular biology and related areas.
Conserved motifs are important for function
Go back to the initial BLAST results page ("Formatting BLAST" page). We will now discuss how to follow the links for each conserved motif and view the conserved protein structures using Cn3D.

Click on the red PAX box or the blue Homeodomain in the BLAST results page; these links give you listings of the conserved motifs -- the paired box (gnl|CDD|16534) and the homeobox (gnl|CDD|16525). The paired box structure is that of the Drosophila protein called Paired; the homeobox structure is that of yeast MAT-alpha2. Both structures were determined by X-ray crystallography.
You can click on "Show structure" and view the structures in Cn3D [you may wish to save the structure file as a text file on your hard drive before importing it into Cn3D]. You can change the rendering style to "worms" -- this shows the helix-turn-helix structure of the homeodomain. The identity of the ninth residue of the third helix (known as the "recognition" helix) determines in part the DNA-binding specificity of the homeodomain. In 1AKHB (the MAT-alpha2 homeodomain), this residue is a serine (the 11th amino acid from the end of the homeodomain). You can view this residue by highlighting it on the sequence.
How might you
design experimental tests of the statement above that the ninth residue of the
recognition helix plays a crucial role in the functional specificity of
homeodomains? Consider what kinds
of experiments might be helpful.
For example, would it be helpful to make altered versions of the
homeodomain and test them by making transgenic lines? Can you imagine the steps to do this?
[The image below is taken from Alberts et al. (the book is available online) and shows a homeodomain and a POU domain, another protein domain often associated with homeodomains.]

We started this
session describing Human Aniridia
and Drosophila eyeless
as homologs, and we noted that most of the sequence conservation between the
Aniridia and Eyeless proteins is in the paired domain and homeodomain regions. These proteins are transcription
factors (they regulate the transcription of other genes) and the specificity of
their DNA-binding activities depends upon these domains. Hence, it is not surprising that these
are the most highly conserved regions of the proteins.
Sequence conservation between different proteins can provide important insights into shared functions. Hence, analysis of sequence conservation is an important part of our analysis of the sequences made available by the genome projects. Sometimes, testing the functions of conserved protein motifs is easier in some systems than in others (e.g. Drosophila compared to human). But often, the insights gained in one system apply to the equivalent motifs in other systems.
Copyright 2005
Wesleyan University