IGS 350/550 Computer Laboratory
M. Rice / M. Weir
Microarray expression data can provide important information regarding regulatory relationships between genes. If genes are regulated equivalently, then their expression profiles in different conditions (experiments) will tend to correlate. Various methods for clustering analysis can reveal these correlations. These methods will be explored in this lab.
Please be sure to record your results as you proceed through this lab - an effective approach is to record screen images and paste them into Microsoft Word, carefully annotating how results were obtained (which application, clustering method, genes, filters, transformations). At the end of the lab, you will then be able to review your records and evaluate the approaches you have tried. Since microarray analysis is a new field, it is particularly important for you to critically assess different methods for analyzing microarray data to determine which are most useful in different contexts.
Go to the Integrative Genomic Sciences (IGS) home page and select IGS Databases -> Access IGS Microarray & Slide Database. Select one of the Affymetrix microchip data sets (e.g. the Tamayo et al. (1999a) data) and click on the Access Data Set button. [We will consider slide data later.] Select a portion of the data set (e.g. expression values for the first 500 genes of the Tamayo (1999a) data for all 4 experiments) by entering the appropriate gene range and experiment range. Then click the Continue button at the bottom of the page. Once your chosen data set is displayed, click the Format Data button -- this will give you the option of outputting the data set in your format of choice. Click the Genesis button. This generates a tab-delimited text file which includes the gene identifier names in the first column of each line (row). Save the file in the directory C:\Program Files\Genesis\Samples\. [Note: In Internet Explorer, before saving the file, it may be necessary to switch to "view source" using the right mouse button. Save the file under a systematic name so that you can identify it later.]
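If you want to inspect the exported file outside Genesis, the short Python sketch below reads a tab-delimited table with the gene identifier in the first column. It is only an illustration: the function name read_expression_table, the has_header switch, and the assumption that the first row holds column labels are ours, and the example rows are made up rather than taken from the IGS database.

```python
import csv
import io

def read_expression_table(handle, has_header=True):
    """Parse tab-delimited rows: gene identifier first, then one value per experiment."""
    reader = csv.reader(handle, delimiter="\t")
    rows = [row for row in reader if row]  # skip blank lines
    # Assumption: the first row, if present, labels the columns.
    header = rows.pop(0)[1:] if has_header and rows else []
    genes = [row[0] for row in rows]
    values = [[float(x) for x in row[1:]] for row in rows]
    return header, genes, values

# Made-up example data, not values from any IGS data set.
example = "gene\texp1\texp2\ng1\t10.5\t-3\ng2\t200\t150\n"
header, genes, values = read_expression_table(io.StringIO(example))
print(header, genes, values)
```

To read a saved file instead, open it with open(path, newline="") and pass the file object to the same function.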
2. Clustering Algorithms
Clustering methods use a distance measure (e.g. the Euclidean metric) to compare the expression values of pairs of genes across the experiments. When the distance between a pair of genes is small, the two genes might be placed in the same cluster. We will use the Genesis application to cluster the raw data, starting with the K-means algorithm. Open the Genesis program, and import your data set into the program. The option "Expression Images" allows you to visualize the data on a red (high expression) to green (low expression) scale. Different genes are represented in different rows, and different experiments (microarrays) are found in different columns. Before running a clustering algorithm, use the "Distance" pull-down menu to choose a distance measure (e.g. Euclidean). Then choose a clustering algorithm (e.g. the K-means option from the "Cluster" pull-down menu) and select the number of clusters (K) into which you wish to partition your genes (e.g. 20). Choosing an appropriate number is an important issue.
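To make the idea concrete, here is a minimal Python sketch of K-means with a Euclidean distance. It is not the Genesis implementation: the function names, the random choice of starting centroids, the seed parameter and the tiny made-up data set are all assumptions chosen for illustration.

```python
import math
import random

def euclidean(a, b):
    """Euclidean distance between two expression profiles of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(profiles, k, iterations=100, seed=0):
    """Partition profiles into k clusters; returns (assignment, centroids)."""
    rng = random.Random(seed)
    # Assumption: start from k profiles picked at random.
    centroids = [list(p) for p in rng.sample(profiles, k)]
    assignment = [None] * len(profiles)
    for _ in range(iterations):
        # Assign each gene to the nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: euclidean(p, centroids[c]))
            for p in profiles
        ]
        if new_assignment == assignment:
            break  # no gene changed cluster, so we have converged
        assignment = new_assignment
        # Recompute each centroid as the mean profile of its members.
        for c in range(k):
            members = [p for p, a in zip(profiles, assignment) if a == c]
            if members:  # an empty cluster keeps its old centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids

# Made-up profiles: two genes with low values, two with high values.
profiles = [[1.0, 1.0], [1.2, 0.9], [10.0, 10.0], [9.8, 10.1]]
assignment, centroids = kmeans(profiles, k=2)
print(assignment, centroids)
```

Because the result depends on the starting centroids and on K, it is worth rerunning with different values, just as you will do in Genesis.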
Your clustering results can be viewed in several ways. Take a look at the (mean) expression profiles of your clusters by looking at "Centroid views" -- first choose the "All clusters" option to see the profiles of all clusters with the numbers of genes in each cluster. Note that in order to see the profiles, you may need to select "adjust to maximum" under the "View" drop-down menu in Genesis. To obtain a better indication of how the individual genes contribute to the clusters, choose "Expression views" to display the expression profiles of all genes in a cluster, with the centroid highlighted. To save files listing the genes (and their expression values) in individual clusters, right-click and choose "save cluster" (or choose "save all clusters" to save a file for each cluster). You can also right-click to save images of the cluster profiles. [You might also save the screen image for the all-clusters image of centroid views.] [output from previous run]
Notice that many of the centroid profiles are rather flat, and they often are not particularly representative of the profiles of the individual genes in the cluster. This indicates that we need to consider preprocessing the data set before running clustering algorithms. For example, when we preprocess the first 500 genes in the same way as Tamayo et al. (discussed below), the following profiles result from K-means clustering (K = 10).
In the middle of the Tamayo data set page, select the "Click here to see percentiles..." link (in the "Expression Levels (percentiles)" box). The entire data set of expression values from the microarray experiments is divided into 20 bins each representing 5 percent increments.
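The same kind of summary can be sketched in Python: 19 cut points divide a list of values into 20 bins of 5 percent each. The function names and the made-up values below are assumptions for illustration; the IGS page may compute its percentiles with a different interpolation rule.

```python
import bisect
import statistics

def five_percent_cut_points(values):
    """Return the 19 cut points that split the values into 20 bins of 5 percent."""
    return statistics.quantiles(values, n=20, method="inclusive")

def bin_index(value, cut_points):
    """Return 0 for the lowest 5 percent bin and 19 for the highest."""
    return bisect.bisect_right(cut_points, value)

# Made-up expression values running from -30 to 69.
values = [x - 30 for x in range(100)]
cuts = five_percent_cut_points(values)
print(cuts)
print(bin_index(0, cuts))  # the bin that contains the value zero
```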
Notice that a large percentage of the data set values are negative. Processing of the microarrays includes estimating non-specific background signal which is then subtracted from all expression values. The Affymetrix algorithms for calculating gene expression values compare the signals obtained with perfect-match and one-base-mismatch hybridization oligonucleotides on the microarrays (oligonucleotide "probe pairs"). Because the one-base-mismatch oligonucleotides can sometimes hybridize to other mRNAs, they do not always give a good representation of the non-specific background signal. Hence, apparent negative expression values for genes can result. Noise in the data can also produce negative values.
On Affymetrix microarrays, the expression of each gene is measured using several different spots (the oligonucleotide probe pairs) -- each probe pair corresponds to a different region of gene mRNA sequence. The "gene calls" of present (P) or absent (A) depend upon whether the signals for the different oligos are internally consistent. Also, expression values considered too low to be measurable are given a call of "A". Indeed, notice that many of the low expression values, and (virtually) all the negative values are scored as "A". To assess this, scroll down the data set in the "Query Results" window (each column number represents a different microarray experiment -- in the case of Tamayo et al. (1999a), experiment 1, 2, 3 or 4).
An important difference between the Affymetrix and Slide microarray approaches is that with each slide, we compare the expression of two different mRNA or cDNA populations, each labeled with a different color of dye. This gives a ratio of expression values for each gene.
Typically, a log transformation is applied to the expression ratios so that we can compare the fold changes in the expression of genes in different populations. This notion applies to the Eisen (1998) data set in the IGS database -- i.e. the values in the database are logs of expression ratios.
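A short Python sketch shows why the log transformation is convenient: a two-fold increase and a two-fold decrease become values of equal size and opposite sign. The base of 2 and the function name log2_ratio are assumptions for this illustration (the handout only says "log"), and the signal values are made up.

```python
import math

def log2_ratio(sample_signal, reference_signal):
    """Log (base 2) of the expression ratio of two positive signals."""
    return math.log2(sample_signal / reference_signal)

print(log2_ratio(200, 100))  # two-fold increase: 1.0
print(log2_ratio(100, 200))  # two-fold decrease: -1.0
print(log2_ratio(100, 100))  # no change: 0.0
```

On the raw ratio scale the same two changes would be 2.0 and 0.5, which are not symmetric about the "no change" value of 1.0.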
Using the idea we discussed above, the mean log(ratio) values for each gene can be subtracted from the log(ratio) values so that all genes have the same mean (zero). Apply this transformation to the Eisen (1998) data set and then run the K-means algorithm on 500 or 1000 genes with the first 18 experiments. How do your clustering results compare to the clustering runs above? [Note: Since the data set is already stored as log ratios, use the "value - mean" transformation option, not "Log2(value) - Log2(mean)"].
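The "value - mean" transformation can be sketched in a few lines of Python. The function name mean_center and the example profile are made up for illustration; the sketch assumes the values are already log ratios, as they are for the Eisen (1998) data.

```python
def mean_center(profile):
    """Subtract the gene's mean from each of its values."""
    mean = sum(profile) / len(profile)
    return [value - mean for value in profile]

# Made-up log(ratio) profile for one gene across three experiments.
print(mean_center([1.0, 2.0, 3.0]))  # [-1.0, 0.0, 1.0]
```

After this step every gene has a mean of zero, so clustering responds to the shape of each profile rather than to its overall level.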
The first 18 experiments (columns) of the Eisen (1998) data represent wild type yeast cells with synchronized cell cycle. The cells were fixed at specific times in the cycle [e.g. in experiment 2 cells were fixed at 7 min (alpha 7)]. See which of your cluster profiles are periodic. Does the choice of K (in K-means) determine how many clusters have periodic profiles? What wavelengths do you see? Do different filtering or transformation choices reveal periodic expression of different genes?
You can also examine the annotations of genes which have periodic profiles. Do any of the annotations implicate them in cell cycle events? [Note: You can retrieve gene annotations by selecting the "Excel" output file format from the IGS database and in Genesis you can store the identities of genes in clusters by saving the cluster with a mouse right click.]
Another approach is to try clustering runs in which you select different ranges of experiments from the set of 18 (e.g. 1-5, 6-12). Is one cell cycle of expression data sufficient to give the same clusters? What happens if you use only a portion of each cycle?
The McDonald and Rosbash data set also contains data for timed experiments (Drosophila adults at different times of the day -- 0, 4, 8, 12 hours etc.). You might also analyze this Affymetrix data set to identify clusters of genes with periodic expression using appropriate filters and transformations for your analysis.
Review the annotated results in your Word document. For each analysis, record your assessment of the implications of your results. At the end of the document, record your general conclusions regarding the different clustering approaches that you have explored.
In this lab, we have introduced a number of techniques for analyzing microarray data. There are a number of additional issues.
(a) How should the value of K be chosen for the K-means algorithm?
(b) What extra information does hierarchical clustering provide (compared to K-means)?
(c) What criteria would you use to decide on the clustering approach?
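When thinking about these questions, it may help to see what a hierarchical method returns. The Python sketch below is a minimal agglomerative clustering with average linkage and Euclidean distance. It is not the Genesis implementation; the linkage rule, the function names and the four made-up profiles are assumptions for illustration.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def average_linkage(profiles):
    """Merge the closest clusters repeatedly; returns (members_a, members_b, distance)."""
    clusters = [[i] for i in range(len(profiles))]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Average distance over all pairs of genes in the two clusters.
                d = sum(
                    euclidean(profiles[a], profiles[b])
                    for a in clusters[i] for b in clusters[j]
                ) / (len(clusters[i]) * len(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)] + [merged]
    return merges

# Made-up profiles for four genes.
profiles = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 12.0]]
for members_a, members_b, distance in average_linkage(profiles):
    print(members_a, members_b, distance)
```

The output is an ordered list of merges with the distance at which each occurred, which is the information a tree (dendrogram) displays. No value of K has to be chosen in advance.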
1. Using the first 500 genes of the Tamayo (1999a) data set, perform K-means clustering for several choices of K and various filters and transformations. Use the resulting profiles to illustrate the analysis. Discuss how the results change for the various choices of parameters.
2. What additional output information would you add if you were re-designing the Genesis implementations of K-means and hierarchical clustering?
Another common algorithm for clustering gene expression profiles makes use of Self Organizing Maps (SOM). Clusters are represented as an array of cells in two-dimensional space, and the expression vectors of each cell (cluster) are updated in each iteration of the SOM based on the expression values of the genes assigned to that cell, as well as genes in the neighboring cells -- but the influence of neighboring cells falls off with distance.
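The neighborhood idea can be sketched in Python. The code below applies a single update for one gene: the best-matching cell moves toward the gene's profile, and nearby cells move less, with a Gaussian fall-off over grid distance. This is the per-gene (online) form of the SOM update rather than the per-iteration description above, and it is not the Genesis implementation; the function name, the Gaussian fall-off, the learning_rate and radius parameters, and the tiny grid are all assumptions for illustration.

```python
import math

def som_update(grid, profile, learning_rate=0.5, radius=1.0):
    """Apply one SOM update in place; grid maps (row, col) to an expression vector."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    # The winning cell is the one whose vector is closest to the profile.
    winner = min(grid, key=lambda cell: dist(grid[cell], profile))
    for cell, vector in list(grid.items()):
        grid_distance = math.hypot(cell[0] - winner[0], cell[1] - winner[1])
        # Influence is 1 at the winner and falls off with distance on the grid.
        influence = math.exp(-(grid_distance ** 2) / (2 * radius ** 2))
        grid[cell] = [
            v + learning_rate * influence * (p - v)
            for v, p in zip(vector, profile)
        ]
    return winner

# Made-up 1 x 3 grid of cells and one made-up gene profile.
grid = {(0, 0): [1.0, 1.0], (0, 1): [0.0, 0.0], (0, 2): [0.0, 0.0]}
winner = som_update(grid, [2.0, 2.0])
print(winner, grid)
```

A full SOM run repeats this step over all genes for many iterations, usually while shrinking the learning rate and the radius.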
In Genesis, try applying the SOM algorithm to the same data set (Tamayo et al. -- filtered and transformed) using the default settings. Compare the centroid values of clusters (which are influenced by neighboring clusters) with the expression of genes assigned to the clusters. Compare your results with those from K-means and hierarchical clustering.