This section briefly addresses some of the issues you need to consider. However, a full discussion of the issues involved is beyond the scope of this manual. These issues are covered in other more focused statistical methods literature and you might also address them in consultation with biostatisticians. The Internet has vast resources for microarrays. A few to get you started might include: a microarray citation electronic library http://arrayit.com/e-library/, the National Library of Medicine PubMed journal search engine, a general microarray Listserv GENE-ARRAYS@ITSSRV1.UCSF.EDU. The MGED group (Brazma, 2001) has published the MIAME standard which specifies (Minimum Information About a Microarray Experiment). This information is useful in doing an analysis. Also try searching using general Internet search engines. There are a number of public microarray data repositories. One that we find useful is NCBI's GEO (Gene Expression Omnibus), that contains array data and MIAME compliant information about the arrays.
A good and appropriate experimental design (i.e. the design and
setting up of experiments to subsequently be analyzed) is critical for
resolving significant differences in gene expression between
experimental conditions. We touch on some of the issues here. (Simon, 2001), (Dudoit,2000), and Kerr and
Churchill (2001a, 2001b) discuss some of the issues
of experimental design for microarrays. We do not currently implement
the Kerr-Churchill method. However, some of the issues involved in experimental design based on the types
of arrays are discussed in Section 3.1.1 for (Cy3/Cy5)-labeled as well
as 33P-labeled samples.
If users are comparing two different types of samples, the analysis
would be different than if they were comparing an ordered sequence of
samples (e.g. time series, cell cycle, dose-response, tumor-stage,
etc.). MAExplorer gives users the ability to:
Briefly, data mining is the discovery of potentially interesting
patterns in the data that were previously unknown. One approaches the
analysis of a set of data with minimal expectations. However, some
idea of what you are interested in helps focus the search. But beware
of the trap of mining the data until you get the results you hope
for. The following figure helps illustrate this process.
Figure 3.1 Flow chart of a typical data mining session. The
user makes some initial decisions on the experimental design such as
which hybridized samples to compare, the type and numbers of
replicates. They then make initial guesses as to the normalization
method to use, and the gene subset (the gene class) to concentrate on
when setting the data filter. The data is viewed in various modalities
to get a feeling for its inherent dynamic range and where interesting
outliers might appear. Clustering and plots helps bring these
differences into view. The results are then evaluated and either the
process is finished or the views are refined by adjusting data
normalization and filter parameters, data subsets to be investigated,
clustering methods, plots etc. and the process repeated until the user
is able to see the differences between gene subsets more clearly or no
significant differences appear to be found.
Obviously, this approach is a first approximation to what is
eventually required. But it does capture the flavor of the data-mining
process. Typically the user would refine the search using variations
of the data filters and might contrast (using gene sets and hybridized
sample condition lists operations) results found under one set of
conditions with those found under another set of conditions.
Proper experimental design of microarray experiments is critical to
successful use of microarray data. Several recent reports discuss some
of the key issues involved in various aspects of statistical analysis
of microarrays: (Radmacher,
2001), (McShane, 2001),
(Korn, 2001), (Simon, 2001), (Dudoit,2000).
An alternative method would be to compute
(Cy3X/Cy5Y) directly. However, this too
has its own sources of error and other problems, namely that not all
genes are labeled symmetrically with the two dyes since different dyes
may have different sequence specific affinities due to a variety of
causes. For that reason, dye-swap experiments are often
done. I.e. the two samples would be run as
(Cy3X/Cy5Y) as well as
(Cy3Y/Cy5X). If one were to plot
(Cy3X/Cy5Y) against
1.0/(Cy3Y/Cy5X) and the data were
perfectly symmetric (which they are not) then one would expect
a straight line. That is generally not what you get in practice.
Another issue is that when you have a number of samples A, B, C, D,
..., N and wish to compare them, there are a number of alternate
experimental designs you can use with different resulting sets of
advantages and problems. If a common pooled Cy5P
sample P were used, then the following experiments would be done:
MAExplorer is currently not oriented to handling these large
combinatoric types of non-pooled sets of experiments. However, you do
have the ability to swap (Cy3,Cy5) data on an individual basis so you
could compute an average of data from dye-swap experiments - but
with the caveats or non-uniform labeling mentioned above.
The major focus of the MAExplorer is interactive data mining with an
emphasis on direct graphical and tabular manipulation of the data.
The investigator is able to interact with the system by clicking on
spots in the array image, points in graphic plots, cells in
spreadsheets, by manipulating threshold sliders or typing in gene
names/clone Ids. This level of interaction allows investigators to
search for and identify patterns of differences with greater ease than
with a more static graphic system since it is easier to test ideas by
"grabbing onto the data". For example, "what" is the identity of
"this" outlier I am pointing to in a scatter plot; "which" genes are
best clustered with "this" gene in this clustergram and are perhaps
co-regulated; "which" genes have expression ratios within the range of
the histogram bins to that I am pointing?
Direct user manipulation of data, as incorporated in MAExplorer, was
defined by (Schneiderman,
1997) who defends the position that the direct manipulation of data
in data mining is an extremely effective means to amplify human
creativity in understanding patterns. Schneiderman's dogma states
"overview first, zoom, and then filter details on demand" and favors
the use of "shallow search trees, slide controllers, and
information-right screens with tightly coordinated panel view of
data", (Beardsly,
1999). MAExplorer also uses many of these direct manipulation
principles. It was designed to run on the desktop computers with data
residing on the same computer and loaded into its memory for rapid
direct manipulation - for both the Web browser and stand-alone
versions.
Part of the Flicker system allows comparison of user 2D gel images
with standard images from SWISS-2DPROT for putative identification of
unknown spots in the user gels. The user would select a standard
2D gel image from over 20 tissue types, enter their own 2D gel image
and align them at spots of interest. They could then switch to a
database access mode, click on those spots and generate popup
SWISS-2DPROT Web pages for those proteins - similar to Clone reports
in MAExplorer. That is accessed at
http://www.lecb.ncifcrf.gov/flicker/swissProtIdFlkPair.html.
MAExplorer will have a groupware facility similar to what we
have done with our
WebGel (http://www.lecb.ncifcrf.gov/webgel/) system described in
(Lemkin et al., 1999b). It is a
two-dimensional electrophoresis system for sharing data analyses. In
WebGel, users may perform a data-mining analysis and leave the state
of the their analysis and accompanying notes to share with their
collaborators on a login-protected basis.
We now discuss using these tools for analyzing ones data.
Table 3.2 Steps in a data-mining analysis.
3.1 Objectives in data mining, discovery and analysis
There are a number of objectives an investigator has when analyzing a
set of data. The types of analyses and how useful they are depends on
what they wish to get out of the analyses as well as the type of
data.
Recording the analysis steps during your data mining session -
command history
Because of the iterative nature of this process, you might want to
keep a record of the commands you have used or the messages and
measurements you have made. To do this you need to enable message and
command history logging. Go to the View pull-down menu and then select
the type of logging you want using the Show log of messages or the
Show log of command
history commands.
3.1.1 Some experimental design issues of microarray experiments
*** THIS SUBSECTION IS IN THE PROCESS OF BEING UPDATED *** Comparing HP-X/HP-Y for Cy3/Cy5 data as 'ratio of ratios'
If we have two samples HP-X and HP-Y with a common reference sample P
(e.g. Cy5P), then we would be comparing the HP-X
"intensity" Cy3X/Cy5X against the HP-Y
"intensity" Cy3Y/Cy5Y. Alternatively,
you can label Cy3 as the common reference sample P in which case just
swap Cy3 and Cy5 in these equations. If you are using a common
reference standard (i.e. Cy5X1) is the same sample
as Cy5Y1 eg. a pooled sample
Cy5P, then
a) (Cy3X/Cy5X1) / (Cy3Y/Cy5Y1)
becomes
b) (Cy3X/Cy3Y)
However, this new comparison is accompanied by additional noise
because of use of the two Cy5P intermediaries.
(Cy3A/Cy5P), (Cy3B/Cy5P), ... , (Cy3N/Cy5P)
This assumes that there is enough of the pooled sample P to be used
for all of the experiments - otherwise additional sources of
error would be introduced. MAExplorer is ideally used with this
common reference sample P.
It a common pooled sample is not used, then the experimental
design becomes more complicated - especially if dye-swap experiments
are performed for all samples. For N samples taken 2 at a time
(i.e. Cy3 and Cy5), then the number of experiments may be impossibly
large to perform for other than a very small N. Eg. for N of 3, the
number of experiments is 3 and 6 if dye swap experiments are also
performed. For N of 4, the number of experiments is 6 and 12. And
this is without doing any replicate experiments. If a reasonable
number of replicates is added, then this set of experiments becomes
even difficult to perform.
[(Cy3X/Cy5Y) + 1.0/(Cy3Y/Cy5X)]/2
In general, this is probably not a very good estimate.
3.1.2 Design philosophy of MAExplorer methodology
There are several ways to implement a data mining system on moderate
size databases. The first is that all computations are performed on a
Web server and the user's Web browser displays the results. The second
is download an applet from the Web server, get the data from the
Web server and do computations in the Web browser. A third way is do
download data from a Web server and run a local stand-alone program on
the data. MAExplorer can be run using both the second and third
ways. However, we encourage the use of the stand-alone paradigm as
having the best bandwidth and being the most robust.
The browser-based computation paradigm (as opposed to server-based) is
somewhat unusual. It keeps both the program and data on the server,
making user maintenance of the latest versions easier than if they had
to constantly upgrade the program or data. This also has the distinct
advantage of giving the user instantaneous feedback through rapid
visual and tabular views and the ability to more effectively navigate
the data since the analysis is done on their desktop computer. Because
it is easy to access reference data from other genomic sources
(e.g. UniGene, GenBank, NCI/CIT's mAdb clone DB, dbEST, GeneCard,
etc.), it can be accessed from their respective Web servers as needed.
Complex browser-based computations are used in other data mining or
intensive computation domains. With the increased bandwidth of the
Internet and compute power and memory of PCs approaching the Cray
supercomputers of the previous decade, this paradigm becomes even more
feasible. However there are limits to how well it scales because of
Web browser limitations. Appendix E.2
discusses these issues in more detail3.1.3 Evolution of MAExplorer from earlier proteomic data mining systems
MAExplorer was designed to do flexible exploratory quantitative data
analysis of gene data from microarray hybridized sample experiments.
Many of the data-mining concepts are derived from a system called GELLAB-II
(http://www.lecb.ncifcrf.gov/lemkin/gellab.html) that is a UNIX-based
stand-alone exploratory data analysis system for 2D protein gels over
multiple experiments (Lipkin and
Lemkin, 1981), a review (Lemkin
and Lester, 1989) and examples of graphical representations of
this type of data (Lemkin,
1995). An on-line
GELLAB-II Web-Poster
(http://www.lecb.ncifcrf.gov/lemkin/gellab-ep93wd.html) is available
showing various screen shots of GELLAB-II in action. Whereas GELLAB
works with sets of corresponding spots (i.e. proteins) across sets of
2D gel samples, MAExplorer works with sets of genes (spots in the
microarray) across sets of hybridized sample microarrays. With
protein gels, one typically has spot alignment problems since gels are
generally not superimposable. This is often called the rubber-sheet
distortion problem and requires localized alignment of spots based of
neighboring spot constellation morphology. We have used Web-based
visual methods to visually compare gels including the Flicker
(http://www.lecb.ncifcrf.gov/flicker/) image comparison system a Java
applet, (Lemkin, 1997), and the 2DWG
(http://www.lecb.ncifcrf.gov/2dwgDB/) meta-database of 2D gel images,
(Lemkin, 1999a). Since the genes
are precisely spotted on the arrays, aligning spots between arrays is
not required and greatly simplifies that the data analysis problem.3.1.4 Concepts used in data mining with MAExplorer
This section introduces some of the concepts used in data mining
microarrays with an emphasis on how they are used with MAExplorer.
Gene data filters - a Boolean AND of gene set tests
A primary MAExplorer concept is that of gene data filter that selects
a working set of genes by the conjunction (Boolean AND) of user
selectable tests. Each test further restricts the working set of genes
to those meeting the test. These criteria include gene membership in
particular gene classes, membership in particular user defined or
computed gene subsets, and meeting a variety of statistical
constraints. Statistics include intra- and inter-array CV, X-Y sets
t-tests. Range test criteria include X/Y ratio ranges and histogram
bins, intensity ranges and histogram bins. Membership criteria
include test if genes are in the current-cluster (derived from
cluster-analysis), gene set membership, etc. By selectively including
one or more of these filter restrictions, the user can home in on the
data that appears to be interest. Of course as in real mining, what
appears interesting may not be interesting based on further
investigation.Set operations on gene subsets
Because of the complexity of comparing many different replicated
samples, it may be difficult to manually organize the resulting
comparisons. MAExplorer offers set-theoretic operations on sets of genes and sets of hybridized samples
(i.e. intersection, union, difference) to help with this organization
(step 9 in Table 3.2). The results
of set operations may be saved and used in subsequent set operations,
normalization, as well as with the data filter. This is useful when
comparing and documenting procedures, methods, and analyses from
several subsets of experiments. User exploration states
Users needs to be able to save and restore the current state of their
explorations of the data and option settings to document and continue
at later times. When running in stand-alone mode, the user
may save their data mining session on the local disk as in named (.mae
file extension) startup files. Clicking on one of a startup file will
restart MAExplorer and restore the state to that of the time it was
saved. In addition to filter and parameter status, the HP-X, HP-Y,
HP-X and HP-Y 'sets', HP-E 'list', the named gene sets and HP
condition sets are saved as part of the state User groupware sharing of exploration states with collaborators
[In the future], these could be saved on a public Web server using
multiple named state files. These are protected for the user using a
login procedure. A groupware sharing of these intermediate
exploratory results is available when they allow another user to
access selected states. User states and groupware sharing complete
step 11 in the analysis described in Table 3.2. 3.2 Steps in data mining, discover, and analysis
An analysis scenario may use many methods for viewing the data. A
typical sequence of analysis steps is listed below in Table 3.2 in the
order they might be performed. Note that this is a rough guide
for a possible analysis and the iteration and backing up for
of some of these steps is required for data mining complex sets of
conditions, especially in the setting of constraints for the "data
filter" (step 4) when the user focuses on subtle patterns of interest
(c.f. Figure 3.1).
|
In designing a data mining experiment, the first decision to be made is selecting the set of hybridized samples to be compared (steps 1 and 2). This is accomplished by setting the current hybridized sample-X (HP-X) and hybridized sample-Y (HP-Y). In Figure 2.4.4.2 for the scatter plot we selected a single C57B6 pregnancy day 13 and a single Stat5a (-,-) pregnancy day 13 as current HP-X and current HP-Y samples. Changing the normalization changes the view in the scatter plot so that hidden differences may be more apparent (see Figure 2.4.2.3)
The names of the current HP-X and HP-Y samples are displayed at the top of the main window. The current HP-X and HP-Y samples may be changed at any time by clicking on a new sample from a list of samples shown on the left side of the main window or from lists of samples organized by sample population in the Samples menu.
The next decision to be made is selection of the genes to be studied by choosing a subset from the gene class menu list (step 4). Further selection occurs throughout the analysis by clicking on spots in microarray images, points in graphic plots or cells in spreadsheets, by adjusting threshold sliders, or using the text-entry "guesser" to type in gene names, clone IDs, genomic IDs, samples, etc.
The next decision the user must make is to set the intensity data normalization mode (step 3). Normalization of quantitative data is crucial when comparing data between different hybridized microarrays because of spotting, hybridization efficiency, uniformity, and other systematic errors.
Genes of interest may be separated for all of the genes in the database using a cascade of data filters (step 4). Additional filtering options are easily accessible in the (data) Filter menu. Some of the filters require additional parameters. These parameters are set by state scroll bars that pop-up on the screen when data filters requiring them are added to the filter cascade. Changing scroller values causes the data filter to be automatically be reapplied and a new set of genes to be computed.
It is desirable to reduce false-positives found by the data filter by eliminating genes with high quantification variability between duplicate spots on the same sample or spot duplicated in replicate samples. If duplicate genes are available on the array (denoted by Field 1 and Field 2 or F1 and F2 spots), this allows the computation of a coefficient of variation (CV) for the duplicates. This CV may be used in a data filter to reduce potential false-positives. CV is computed as 2|F1-F2|/(F1+F2) using those spot values for each gene, as StdDevHP/MeanHP for a set of replicate hybridized samples.
Graphical views of the data give the user additional insights into the data. These include spot intensity and ratio or Zdiff pseudoarray images, scatter plots, histogram plots, expression profile plots, cluster plots showing genes similar to a specified gene, the number of clustered genes for each clone, divisive clusters for K-means clustering, and clustergrams and dendrogams for hierarchical clustering.
When there are too many EP-plots to be viewed simultaneously, you
might use a scrollable list of expression profile plots that lets you
scroll through an arbitrarily large list of genes. However, it is
difficult to compare genes that are not sorted in some way
(i.e. clustered). Therefore, these are most useful when used after
clustering the data and displaying the scrollable EP-plots of
the cluster-order data.
Clustering is one way of possibly finding co-expressed genes that
exhibit similar expression changes in a set of samples. Genes may show
similar co-expression, but that does not prove they are co-regulated
at the same point in a pathway - merely that measurements of those
genes in a particular set of experiments show similar
expression. However, identifying genes with similar expression for
which some information is already known about some of the genes may be
useful as a starting point to help figure out gene function and
pathway using additional experiments and analysis.
There are many methods for doing clustering - each with advantages and
disadvantages. We present three methods in MAExplorer and plan on
adding a variety of more powerful methods through the MAEPlugin
facility under development.
The first cluster method finds a cluster of genes whose expression
profiles are similar to that of the currently selected gene. This list
of genes is restricted by the constraint that the cluster distance
between each of these genes to the selected gene is less than the
"Cluster threshold" distance set by the user with a scroll bar. It
displays genes that are found both with blue boxes (the larger the
box, the higher the similarity) and in a text report window showing
the genes and their distances to the current gene. By varying the
threshold and observing the results, the user can find a set of highly
correlated genes. If the threshold is set to 0.0, no genes are
found. If it is set too high, all data filtered genes are found. So
it is critical to adjust the threshold to a reasonable level
commensurate with the type of data being analyzed and the approximate
number of genes expected.
A second cluster method draws blue circles in the array image around
all filtered genes meeting the threshold criteria, where the larger
the circle the larger the number of similar genes (i.e. passing the
threshold) are found to be clustered with that gene. Clicking on a
gene toggles between the first and second methods. For both of these
methods, it will pop-up a "Cluster Distance" threshold scroller and
recomputes the clusters if you change the scroller value or the current
gene. It also shows a text report that displays the number of genes
similar to each data filtered gene.
A third method called "K-means" clustering K genes (we call primary
nodes) whose expression profiles are most orthogonal to each other. It
uses the current gene as the first or "seed" node. It then finds the
gene furthest from this and assigns it as node 2. Then the gene
furthest from both nodes 1 and 2 is assigned to node 3, etc. This
process is repeated until all K nodes are assigned. Then the
remaining genes are assigned to the closest node. Having defined the
initial cluster centers, it recomputes the centroid of each of the
clusters. The centroid can alternatively be computed using a median
instead of a mean in which case we would be doing K-median clustering
(Bickel, 2001). K genes are
then reassigned to the nearest new centroids as the new K-means node
instances. Finally, the remaining genes are assigned to the nearest
centroid. A scrollable K-means cluster text window report pops up
with genes sorted by cluster. Clicking on a gene in either the array
image or scatter plot assigns all genes in the cluster to which that
gene belongs to the "current cluster". Genes in the current cluster
are labeled in the array and scatter plot with a small number of the
cluster. In addition, genes in the current cluster are copied to the
E.G.L. where they can be used in a report, saved in a named gene set,
or used for additional filtering. It also pops up a "N-clusters"
scroll bar window to let you dynamically adjust the number of
clusters. Changing N will recompute the clusters. When the K-means
is recomputed, it uses the current gene as the initial seed gene.
The fourth method is a hierarchical clustering method that generates a
clustergram and dendrogram similar to that of Eisen's red-black-green
clustergram (Eisen, 1998). This
was derived from the clustered correlation map (ClusCor) of Weinstein
et al. (Weinstein, 1997). The
MAExplorer clustergram and dendrogram are dynamic and may be
interrogated and used to set the current gene. This means that it may
also position a corresponding ordered list of expression profile plots
to the same gene so you may view the data as a plot as well. The
dendrogram may be zoomed in to explore a part of the dendrogram in
more detail. As with the K-means clustering, a report can be made of
the ordered genes.
Filtering genes by histogram plots of ratios, Zdiffs or intensities
Histogram plots may be generated from either X/Y ratios or (X-Y)
Zdiffs of two different hybridized samples (single samples or X and Y
replicates) or from the F1/F2 intensities of a single hybridized
sample. Selecting a bin in a histogram restricts filtered genes to
those that are contained in that histogram bin. As an alternate
method, data filtering by ratio (Zdiff) or intensity range may
be used with adjustable range scrollers independent of the
histograms. However, histograms and scrollers may be used together.
For example, one could filter by the ratio histogram after filtering
out genes with low-intensity values that may be considered noise using
the intensity sliders. That might help eliminate falsely high ratios
resulting from dividing high X values by a very small noisy Y values.
Histograms are useful for getting a better understanding of the range
and distribution of the gene intensities or ratios. Expression profile plots (EP-plot) of N conditions for viewing
time series, etc.
List HP-E is an ordered list of samples - as different from HP-X and
HP-Y that are unordered sets of samples. The expression profile (step
5) of a gene is the plot of its normalized intensity as a function of
the samples in the ordered HP-E list. It may be plotted for the
current gene in a pop-up window. Selecting a different current gene
causes the EP-plot to be displayed for that gene. Multiple EP-plots
may be created to view the differences between a few genes you are
investigating further. The HP name button pops up a window with
the ordered list of samples so you can see the details of the sample
names being plotted. Selecting a line in a plot displays the intensity
data and sample name for that hybridized sample. The data may be
plotted as a bar, point or continuous curve and error bars may be
turned off to better compare multiple plots.Finding clusters of genes with similar expression profiles:
similar, cluster counts, K-means, and hierarchical methods
We may define a cluster of genes as a set of genes whose expression
profiles are found to be similar (step 6). The samples used in
computing the expression profiles are specified by the HP-E ordered
list. You can scale the list of normalized intensity data for each
gene to 1.0 (resulting in finding genes with similar shaped
EP-plots). Alternatively, if you don't scale this data it will
cluster more on magnitude changes. You can select either the Euclidean
distance or the correlation coefficient of the EP lists between two
genes as the measure of gene-gene distance. Similarity is 1.0 -
normalized distance.Gene reports: dynamic spreadsheets for Web access or
tab-delimited for exporting data to Excel
Pop-up report windows (step 7) may be generated for either individual
genes or a global array sample data. Instances of the latter include
experimental information and Web links, global statistics, correlation
coefficients between array samples, etc. Gene reports may present
this data in a number of ways. These include: highest/lowest gene
ratios, profiles, parametric and cluster statistics, etc. Reports may
be presented as either dynamic Web-interactive spreadsheet tables or
as static tab-delimited tables. The latter is useful for exporting
data using cut and paste into Excel (step 10). If the user clicks on
a blue hyperlinked cell in a dynamic spreadsheet table, it pops up
another Web browser window and loads it with data (step 8) from the
respective Internet genomic database such as mAdb Clone DB, UniGene,
GenBank, dbEST, GeneCard, and MGAP model and histology Web pages.Collaborative groupware environment
Having immediate access to collaborator's data is a powerful research
tool. A collaborative environment is being implemented for MAExplorer
that allows groups of users to share data and intermediate
results. These capabilities include: 1) the ability to save and
restore exploratory data mining sessions (states) through the Web
server including named sets of genes, and 2) to selectively share
these states with collaborators. The latter process is sometimes
called a groupware environment because if offers a collaborative group
the ability to share and interact. These capabilities are modeled
after our WebGel system (Lemkin
et al., 1999b). In addition, users can create a custom database
Web page as a subset of samples from the entire database. This may be
saved on their own computer through their Web browser's "File/Save as"
command. This hypertext file could then be used at a later time to
access the database or be E-mailed to a collaborator to do the same.
Then, the expression profile is expressed as a list of values:
For scaled data such that dpq has a maximum value of 1.0
ovger all samples. A similarity measure could be computed as
1.0 - distance or
3.2.1 Definition of expression profile
It is helpful to define an expression profile. There may be alternate
definitions, but the following is useful for getting an understanding
of how it might be computed. An expression profile
ej of an ordered list of N samples (k=1
to N) for a particular gene j is a vector of scaled
expression values vjk.
ej = (vj1, vj2, vj3, ..., vjN)
A difference between two genes p and q may be estimated as a
N-dimensional metric "distance" between ep and
eq. The Euclidean distance is then defined as
dpq = (1/N SUMj=1:N (vjp - vjp)2 )1/2
Other distance measures may include correlation coefficient,
city-block (or manhatten distance) etc.
spq = 1 - dpq
3.2.2 Clustering Methods
Clusters represent one way to identify similar gene expression across
a set of experiment samples. There are many ways to cluster the data,
some of which are available in MAExplorer. These include:
Other methods include Self Organizing Memory (SOM), fuzzy clustering,
Support Vector Machines (SVM), etc.
3.2.2.1 Clustering similar genes
If we have a particular gene s (the "seed" gene), we may want
to find a set of all genes {gj} similar to
gs. We can find this set of genes by testing
We define a particular gene gj as similar to seed
gene if the distance between genes s and j meets
the following criteria.
djs < T
The threshold T is set by the investigator and in MAExplorer
is changed using a slider. Typically, the set of all genes
{gj} found is sorted by similarity before being
viewed.
3.2.2.2 K-means clustering
K-means clustering finds K clusters of genes with similar expression
profiles to a given gene (see
Sneath and Sokol, 1973). Given the number of clusters K,
we could use high variance of clusters to determine if they should
split into sub-clusters. K-means clustering does not need a distance
matrix (see Hierarchical clustering which follows), so it is faster
and may cluster large numbers of N genes. However, it is
highly dependent on seed selection. It may be useful for getting an
initial estimate - especially if other techniques (such as silhouette
plots) are also used. The following is a simplified definition of one
way to compute a set of K-means clusters of gene expression profile
data.
Algorithm:
|
3.2.2.3 Hierarchical clustering
Hierarchical clustering of a set of genes will generate a binary tree
of clusters with the genes at the terminal ends of the tree and a
single cluster of the entire tree at the top (also called the root) of
the tree. See (Sneath and Sokol,
1973) for a discussion on hierarchical clustering. There are many
other variants of hierarchical clustering. Hierarchical clustering
requires a distance matrix or the equivalent of one [there are more
efficient ways to compute it]. ForN genes (terminal
clusters), it generates 2N-1 clusters. Distance matrix is
upper diagonal matrix D of dpq of size
N(N-1)/2.
D can get quite large for clustering a large number of genes N [for N=5000, this is > 50 Mbytes!]
The following is a simplified definition of one way to compute a hierarchical clustering of gene expression profile data.
Algorithm:
|
If there is only one field in the array, it will appear as field 1. In
the above example, [1-A4,3] is field 1 grid A
row 4 and column 3. Note that the pseudoarray coordinates are for
visualization purposes in MAExplorer and may or may not be the same as
the coordinates on the actual array. That depends on how the
MAExplorer database was defined in the configuration file described in
Appendix C.
When the current gene is defined, it will draw a yellow (green) circle
around the spot in the ratio (intensity) pseudoarray image and display
other features of the gene in the three-line status area near the top
of the main window. If background correction is enabled (the "Use
background intensity correction" in the Normalization menu), then spot
intensity values will appear as intensity' (with background
intensity subtraction) and intensity (without background
subtraction).
There are a number of different reporting formats available depending
on the array display mode and particular normalization method
selected. These include: the pseudoarray image of the intensity of a
single sample, the pseudocolor ratio X/Y or Zdiff (X-Y) image (using
either HP 'sets' or single samples), or the ratio of Cy3/Cy5 for
dual-labeled dyes or F1/F2 for replicate spots for a single sample.
In addition, the normalization mode is also displayed in the reporting
line. We will present examples of each of these different reporting
formats.
You may show the intensity data for a particular spot in the currently
displayed pseudoarray image. First select the "Pseudograyscale image"
option in the "Show Microarray" submenu in the "Plot menu". If your
data has duplicate grids (i.e. fields F1 and F2) then you may look at
F1, F2 and mean (F1+F2)/2 data in the reports when you click on a
spot. If the "Gang F1-F2 scrolling" switch is disabled in the "View
menu", then the intensity value is the intensity data value
for the gene at that location. If the "Gang F1-F2 scrolling" switch is
enabled, then it reports intensity[F1], intensity[F2], and the F1/F2
ratio. These two formats are shown in the following two examples for a
C57B6 pregnancy day 13 samples in the MGAP database:
a) Field F1 spot for a single spot in a single sample with the median
intensity selected.
c) Ratio data for two samples X and Y in separate hybridized
arrays. Ratio data for the field F1 and F2 spot data as well as the
mnX/mnY ratio is reported. The median normalization was used in this
example.
f) Multiple HP-XY 'sets' using median normalization for the pseudoarray
image display for the HP-X 'set' of three C57B6 samples.
j) Multiple HP-XY 'sets' p-value using median normalization for ratio
(HP-X/HP-Y) data for the "Pseudocolor (HP-X,HP-Y) 'sets' p-value
display.
For the intensity and ratio threshold filters, the range
interpretation may be inside, or outside the specified range. The
ratio range [R1:R2] is between 0.01 and 100.0. The Zdiff range [Z1:Z2]
and [CZ1:CZ2] are between -4.0 and +4.0. The intensity threshold range
[I1:I2] is set to the dynamic range of the min and max intensity for
the current normalization method.
A list of possible threshold sliders is shown in the following table.
When a Filter is enabled that requires a slider, it pops up the
State Scrollers window that contains one or more
slides. When you disable all filters that use these sliders, the
popup window will disappear. The corresponding Ratio R1[R2] or
Zdiff Z1[Z2] sliders are used if you are using a ratio or
Zscore normalization - and will change if the normalization changes
while the filter is active.
Some of the sliders are implemented with a non-linear scale so that
you have more resolution at the low end (eg. p-Value, Spot CV, Diff
HP-XY).
Depending on the set of data Filters selected, there may be multiple
sliders present in the State Slider popup window (eg. see Figure
2.4.3).
Table 3.3.1. List of threshold sliders. Sliders are enabled in
the State-Scroller popup window when the corresponding data filters
are enabled.
3.3 Display of gene spot intensity and identification data measurements
You may select the current gene by clicking in the pseudoarray image or in the X-Y scatter plot and
MAExplorer reports. The
microarray grid coordinates, normalized quantified spot intensity
data, plate coordinates, gene name (if known) and associated data for
that gene. If you are displaying a pseudocolor ratio (X/Y) or Zdiff
(X-Y) image, it will report HP-X/HP-Y or (HP-X - HP-Y) data
respectively. It also sets the gene as the current gene. The
pseudoarray image coordinates are reported as:
[<field>-<grid name><row#>,<col#>].
e.g.
[1-A4,3]
[1-A4,5] intensity=4.5267, (Norm.: median intensity)
CloneID: 1248228, dbEST3': 2279072, GenBankAcc3': AI463183, UniGene: Mm.13859, plate[5,A,5]
GeneName: Mus musculus ribosomal protein L41 mRNA, complete cds
b) Field F1 and F2 replicate spots for a single sample. The top line
is shown for each of the different normalization methods.
[1-A4,5] intensity[F1]=-0.3067, intensity[F2]=-0.2312, F1-F2=-0.0755, (Norm.: Zscore intensity)
[1-A4,5] intensity[F1]=4.5267, intensity[F2]=6.2408, F1/F2=0.7253, (Norm.: median intensity)
[1-A4,5] intensity[F1]=0.8755, intensity[F2]=1.1457, F1-F2=-0.2701, (Norm.: log median intensity)
[1-A4,5] intensity[F1]=-0.1442, intensity[F2]=-0.0945, F1-F2=-0.0497, (Norm.: Z-score, stdDev, log intensity)
[1-A4,5] intensity[F1]=-0.1533, intensity[F2]=-0.1004, F1-F2=-0.0528, (Norm.: Z-score, mean abs.deviation, log intensity)
[1-A4,5] intensity[F1]=630.9911, intensity[F2]=869.9273, F1/F2=0.7253, (Norm.: calibration DNA intensity)
[1-A4,5] intensity[F1]=1919.9376, intensity[F2]=2646.957, F1/F2=0.7253, (Norm.: scale to max. (65K) intensity)
CloneID: 1248228, dbEST3': 2279072, GenBankAcc3': AI463183, UniGene: Mm.13859, plate[5,A,5]
GeneName: Mus musculus ribosomal protein L41 mRNA, complete cds
If the "Pseudocolor HP-X/HP-Y ratio or Zdiff" option is selected in
the "Show Microarray" submenu, data is reported as either Ratio or
Zdiff data depending on the normalization method selected. The data
used in the following examples is for C57B6 pregnancy day 13 (HP-X)
compared with Stat5a (-,-) pregnancy day 13 (HP-Y).
[1-A4,5] HP-XY: mn(X,Y)=(5.383,6.834) (X/Y)(F1,F2,mean)=(0.651,0.928,0.787), (Norm.: median intensity)
CloneID: 1248228, dbEST3': 2279072, GenBankAcc3': AI463183, UniGene: Mm.13859, plate[5,A,5]
GeneName: Mus musculus ribosomal protein L41 mRNA, complete cds
d) Zdiff data for two separate samples X and Y. Ratio data for the
field F1 and F2 spot data as well as the mnX-mnY Zscore difference is
reported. The three Zscore, ZscoreLog, and logMean normalizations were
used in this example (first lines are shown).
[1-A4,5] HP-XY: mn(X,Y)=(-0.269,0.151) (X-Y)(F1,F2,mean)=(-0.470,-0.370,-0.420), (Norm.: Zscore intensity)
[1-A4,5] HP-XY: mn(X,Y)=(-0.119,0.051) (X-Y)(F1,F2,mean)=(-0.199,-0.142,-0.170), (Norm.: Z-score, stdDev, log intensity)
[1-A4,5] HP-XY: mn(X,Y)=(1.010,1.224) (X-Y)(F1,F2,mean)=(-0.362,-0.064,-0.213), (Norm.: log median intensity)
CloneID: 1248228, dbEST3': 2279072, GenBankAcc3': AI463183, UniGene: Mm.13859, plate[5,A,5]
GeneName: Mus musculus ribosomal protein L41 mRNA, complete cds
e) Example of when the "Use dual HP-X & HP-Y Pseudoimage" mode is
enabled in the "Show Microarray" submenu of the "Plot" menu. This
displays mean data for the HP-X and HP-Y data side-by-side. The median
normalization was selected.
[1-A4,5] intensity[X]=5.3837, intensity[Y]=6.8342, X/Y=0.7877, (Norm.: median intensity)
CloneID: 1248228, dbEST3': 2279072, GenBankAcc3': AI463183, UniGene: Mm.13859, plate[5,A,5]
GeneName: Mus musculus ribosomal protein L41 mRNA, complete cds
Reporting for multiple hybridized samples when using HP-X/-Y 'sets'
If you have enabled MAExplorer to "use HP-X and HP-Y 'sets' of
multiple samples" rather than single samples" in the Samples menu, it
will report a spot differently using the means (mn), standard
deviations (S.D.), coefficient of variations (CV) for the samples in
the HP-X and HP-Y 'sets'. For duplicate fields, these are computed
using the normalized average of F1 and F2 spots for each gene in each
samples. The data used in the following examples is for three C57B6
pregnancy day 13 (HP-X) samples, and five Stat5a (-,-) pregnancy day
13 (HP-Y) samples.
[1-A4,5] HP-X 'set' mean intensity=3.295 stdDev=1.482 CV=0.449 n=3, (Norm.: median intensity)
CloneID: 1248228, dbEST3': 2279072, GenBankAcc3': AI463183, UniGene: Mm.13859, plate[5,A,5]
GeneName: Mus musculus ribosomal protein L41 mRNA, complete cds
g) Multiple HP-XY 'sets' using median normalization for the pseudoarray image
display for the HP-Y 'set' of five Stat5a (-,-) samples.
[1-A4,5] HP-Y 'set' mean intensity=8.180 stdDev=0.986 CV=0.120 n=5, (Norm.: median intensity)
CloneID: 1248228, dbEST3': 2279072, GenBankAcc3': AI463183, UniGene: Mm.13859, plate[5,A,5]
GeneName: Mus musculus ribosomal protein L41 mRNA, complete cds
h) Multiple HP-XY 'sets' using median normalization for the
pseudoarray image display for the HP-X and HP-Y 'sets' when the "Use dual
HP-X & HP-Y Pseudoimage" mode is enabled in the "Show Microarray"
submenu of the "Plot" menu.
[1-A4,5] HP-XY 'sets': mn(X,Y)=(3.295,8.180) mnX/mnY=0.402 SD(X,Y)=(1.482,0.986) CV(X,Y)=(0.449,0.120)\
n(X,Y)=(3,5), (Norm.: median intensity)
CloneID: 1248228, dbEST3': 2279072, GenBankAcc3': AI463183, UniGene: Mm.13859, plate[5,A,5]
GeneName: Mus musculus ribosomal protein L41 mRNA, complete cds
i) Multiple HP-XY 'sets' using median normalization for ratio (HP-X/HP-Y) data
for the "Pseudocolor HP-X/HP-Y Ratio or Zdiff" display.
[1-A4,5] HP-XY 'sets': mn(X,Y)=(3.295,8.180) mnX/mnY=0.402 SD(X,Y)=(1.482,0.986) CV(X,Y)=(0.449,0.120) \
n(X,Y)=(3,5), (Norm.: median intensity)
CloneID: 1248228, dbEST3': 2279072, GenBankAcc3': AI463183, UniGene: Mm.13859, platey[5,A,5]
GeneName: Mus musculus ribosomal protein L41 mRNA, complete cds
[1-A7,20] HP-XY: mn(X,Y)=(3.449,0.853) (X/Y)(F1,F2,mean)=(4.09,4.008,4.041), (Norm.: median intensity)
CloneID: 1382656, dbEST5': 1775754, GenBank 5': AI036495, UniGene: Mm.300, plate[12,A,8]
GeneName: Carbonic anhydrase 3
Reporting for Cy3 and Cy5 channels for a single hybridized sample
k) If you have Cy3/Cy5 data, then you can look at the two channels for
a single sample ( the current HP sample). For median normalization and
the display set to "Pseudocolor Red(Cy5)-Yellow-Green(Cy3) Cy3/Cy5
data" display.
[1-A6,11] Cy5/Cy3=0.3588, Cy5=67.324, Cy3=187.622, (Norm.: median intensity)
CloneID: IMAGE:1054189,
GeneName: expressed sequence AW213287
Reporting HP-X (Cy3 or Cy5) vs HP-Y (Cy3 or Cy5) data for 2 samples
l) If you want to compare Cy3 or Cy5 in the HP-X sample with a Cy3 or Cy5
value in the HP-Y sample, you do it through the special Cy3,Cy5
scatter plots. There are four types of plots:
After the plot is started, clicking on a scatter plot will report data
from the point in that plot will print the following data as shown in
the following example where HP-X Cy3 is plotted against HP-Y Cy3.
[1-A5,16] intensX=4.695, intensY=5.923, (X-Y)=-1.2275, (Norm.: log median intensity)
CloneID: IMAGE:963758,
GeneName: RIKEN cDNA 2410114O14 gene
3.4 Selecting subsets of genes using the data Filter
Genes may be selected on a number of criteria specified by the data Filter (Section 2.4.3)
that is a cascade of data tests. The first might be the gene class (Section 2.4.1) to
restrict the set of all genes to a particular subset. Various numeric
and statistical data tests might be applied on the remaining genes to
exclude those not meeting these tests. For example, genes having a
high coefficient of variation between duplicate spots on the same
sample or duplicate samples could be eliminated. Then one could
select genes that had a (HP-X/HP-Y) ratio greater than 4.0 but less
than 8.0, etc. The latter could be done using either the ratio
scrollers or by clicking on that bin in the ratio histogram plot. See
the Filter menu options and look at one of the tutorials (Appendix A) for ideas on
adjusting the Filter to close in on a particular subset of genes.3.5 Selecting subsets of hybridized sample conditions
Sets and lists of hybridized sample conditions (HP-X and HP-Y sets,
HP-E list) may be selected using various commands from the Samples Menu (Section 2.2)
including pull-down menus, guessing by name or part of a name, or
using a
"Chooser" (see Figure 2.2.1) to design your settings for the
current (HP-X and HP-Y sets, HP-E list). The Chooser is the easiest
way to select these entries. In addition, if you want to change the
current HP-X or HP-Y individual sample, you can do this directly from the array
pseudoarray image by clicking on the [X] or [Y] part of the image
and then selecting the particular sample to use. Note that if the
mouse-over checkbox is enabled, then moving the mouse over the sample
names gives you the full sample name. Otherwise, the sample name may
be truncated.
3.6 Setting threshold values using the state-scroller sliders
You may filter genes using a variety of thresholding operations (see
the Filter menu (Section
2.4.3) to select any of these). For example, these include a spot
intensity (per channel) range [SI1:SI2], gene intensity range [I1:I2],
ratio range [R1:R2], Zdiff range [Z1:Z2], Coefficient Of Variation
(CV) range [0:1.0], p-value range [0:1.0] for the
t-test, etc. Additional threshold scrollers are used with the
clustering methods including the number of clusters (default 6), the
maximum cluster distance from a gene to a another gene for the latter
to be considered in the same cluster, and the absolute difference
between HP-X and HP-Y.
Slider name | Associated with operation |
---|---|
Spot Intensity SI1 | Filter by spot intensity range per channel |
Spot Intensity SI2 | Filter by spot intensity range per channel |
Percent SI OK | Filter by percent of spots whose spot intensity is in threshold range criteria meets the AT LEAST or AT MOST criteria |
Intensity I1 | Filter by gene intensity range |
Intensity I2 | Filter by gene intensity range |
Ratio R1 | Filter by gene X/Y ratio range |
Ratio R2 | Filter by gene X/Y ratio range |
Zdiff Z1 | Filter by gene X-Y Zdiff range |
Zdiff Z2 | Filter by gene X-Y Zdiff range |
Ratio CR1 | Filter by Cy3/Cy5 gene X/Y ratio range |
Ratio CR2 | Filter by Cy3/Cy5 gene X/Y ratio range |
Zdiff CZ1 | Filter by gene (Cy3-Cy5) X-Y Zdiff range |
Zdiff CZ2 | Filter by gene (Cy3-Cy5) X-Y Zdiff range |
p-Value | Filter by t-Test |
Spot CV | Filter by Coefficient of Variation |
Cluster Distance | Plot - cluster by expression similarity |
# of Clusters | Plot - K-means clustering |
Diff HP-XY | Filter by absolute difference (HP-X,HP-Y) |
Spot Quality | Filter by continuous spot quality (If data available) |