Note: This hypertext manual is divided into chapter and appendix Web pages. These may be printed individually from your Web browser by (1) clicking in the text window to be printed, and (2) using "Print Frame" in Netscape or "Print" in Internet Explorer. Some of the chapters (e.g., 2) have many images. The entire manual may be downloaded at one time with low-resolution figures and is suitable for printing in the Web browser. You may also download an Adobe Acrobat PDF version of the entire manual with the lower-resolution figures (~5Mb). The Unix script for creating the full reference manual from the individual HTML pages is CreateMaeFullRefManual.do.
MAExplorer is a Java-based bioinformatics exploratory data-analysis and data-mining program for analyzing sets of quantitative spotted cDNA or oligonucleotide microarray data (Lemkin et al., 2000); see (Schulze, 2001) for a review of microarray technology.
Prior to its release on SourceForge, MAExplorer was developed by Dr. Peter Lemkin (LECB/NCI-Frederick) with help from Gregory Thornwall (SAIC) and Jai Evans (DECA/CIT, NIH). It was initially created for analyzing 33P-labeled membrane array data from mouse mammary tissue for the Mammary Genome Anatomy Project (MGAP) http://mammary.nih.gov/ with the help of many researchers in the Laboratory of Genetics and Physiology, NIDDK, under Dr. Lothar Hennighausen. Since the early work with MGAP, it has been extended to work with other types of cDNA and oligo arrays and various nucleotide labeling methods. These include spotted Cy3/Cy5 glass slides, spotted membranes, non-geometric chip data, other chip supports with different geometries and numbers of duplicate spots per gene, and clone as well as oligo chip data such as Affymetrix. A wizard tool called Cvt2Mae was developed to make it easier for other researchers to convert their data to the format required by MAExplorer. Cvt2Mae was developed by Peter Lemkin, Greg Thornwall and Bob Stephens (ABCC/SAIC). You may extend the set of built-in analysis methods by writing Java plugins called MAEPlugins.
This document describes MAExplorer's functionality, provides tutorials, and contains documentation for using it with various types of arrays.
With this program, you may: 1) analyze expression of individual genes; 2) analyze expression of gene families and clusters; 3) compare expression patterns for multiple hybridized samples.
MAExplorer is written in Java and runs as a stand-alone application that you download to your computer. Although MAExplorer began as a Java applet for use with Web browsers for the MGAP Web database ( http://www.lecb.ncifcrf.gov/mae ), we have deprecated its use as an applet because of many problems with running large Java applets in some Web browsers. Instead, we recommend downloading MAExplorer, which includes the public MGAP array data as a demonstration data set, and then running MAExplorer on this data after you have installed it on your computer.
Notation: MAExplorer uses the notation that the sample
probe total mRNA is labeled and then hybridized against the
known cDNA targets tethered to the microarray. Because of this
notation, we refer to a hybridized sample as an HP. An alternative
notation that reverses these terms is also commonly used (see
"Chipping Forecast", Nature Genetics supplement, Jan, 1999,
pg 1). Also, because arrays may be constructed from either spotted
clones or oligonucleotides, we refer to hybridized chip DNA from any
of these sources generically as "genes".
Throughout this document we use the abbreviations HP for hybridized sample and GC for gene class. These and other terms are explained in the Glossary and Index. There are a number of figures and tables illustrating various features of MAExplorer throughout this manual. Figures are presented at low resolution. Clicking on a low-resolution figure displays the high-resolution version.
NOTES: Because MAExplorer is under development, there may be occasional problems with some of its functionality. There may also be some problems (mostly bad HTML links) resulting from the migration from LECB/NCI to the SourceForge Web site. Some operations that are under development are labeled with "[Future]" in this manual. We welcome your suggestions for improvements as well as reports of problems that you encounter. Occasionally the manual or the figures in the manual may not be quite in phase with the software. Please notify us of problems or suggestions by E-mail so we can try to fix or implement them. If you are a bioinformatics developer and would be interested in working with the MAExplorer project, consider joining the MAExplorer development team on SourceForge.net.
2. MAExplorer menus
3. Exploratory Data Analysis - Data Mining
4. Status and Bugs of MAExplorer
References to related exploratory data analysis methods
Appendices
C. Use of MAExplorer with user's microarray data
D. Use of MAExplorer as a stand-alone application
E. Design issues
Download Installers
MAExplorer Open Source
List of Figures
Icon Legend
Data from a 38 sample subset of hybridized samples from the MGAP mouse microarray
database. This screen illustrates a synthetic pseudoarray image
showing the ratios of duplicated grids of genes comparing day 13
pregnancy in C57B6 mouse (sample HP-X 'set') with Lactation day 1
(sample HP-Y 'set'). The color scale of the spots is indicated on the
left as is the current data normalization mode (median). Genes with
white circles are named genes and were selected by the data filter. A
scatter plot of this data is shown on the right with genes passing the
data filter indicated as red + and those not passing the filter
(i.e. ESTs, calibration DNA, user's genes) shown as gray + symbols. A
single gene was selected by clicking on it in the array image and has
a yellow circle (grid 1-D) and a corresponding green circle in the
scatter plot. Information on that gene is indicated above the array
and at the top of the scatter plot. MAExplorer can also be used to
view mean data from sets of samples e.g. Day 13 pregnancies
from C57B6 (3 HP-X samples) vs. Day 1 Lactation (4 HP-Y samples) at low or high resolution.
Table of Contents
Menu summary
Quick start
1.1 Microarrays and notation used with MAExplorer
1.2 Microarray image quantification
1.2.1 Ratio and Zscore comparison of data from different hybridized samples
1.3 Microarray image and plot display
1.4 Exploratory data analysis - overview
1.4.1 Saving the state of a data-mining session in stand-alone mode
1.4.2 Logging messages and command history
1.5 Quick start - demonstration of MAExplorer
1.6 Tutorials for using MAExplorer
2.1 File menu
2.1.1 Databases menu
2.1.2 Exploratory state menu
2.1.3 Groupware facility for sharing user states menu
2.2 Samples menu
2.2.1 Selecting sample HP with chooser or menu sample lists
2.2.2 Swapping selected samples' (Cy3,Cy5) channels in ratio data dye-swap experiments
2.2.3 Viewing sample HP-X, HP-Y, and HP-E partitions
2.2.4 Defining sample condition 'class' names
2.2.5 Toggling between single HP-X (-Y) samples and HP-X (-Y) sets
2.3 Edit menu
2.3.1 User edited gene list - the 'Edited Gene List' menu
2.3.2 Sets of genes menu
2.3.3 Sets of Sample Conditions menu
2.3.4 Setting user preferences menu
2.4 Analysis
2.4.1 GeneClass menu
2.4.1.1 GeneClass ontology subsets
2.4.1.2 Simulating Gene Class ontologies using Gene Set operations
2.4.2 Normalization menu
2.4.2.1 Intensity background correction
2.4.2.2 Normalization between microarrays to allow comparison
2.4.2.3 Using different normalizations to 'see' different data views
2.4.3 Filter menu
2.4.3.1 Data filtering using multiple gene data filters
2.4.4 Plot menu
2.4.4.1 Show microarray pseudoarray images menu
2.4.4.2 Scatter plots menu
2.4.4.3 Histogram plots menu
2.4.4.4 Expression profile plots menu
2.4.5 Cluster menu
2.4.5.1 Cluster genes with expression profiles similar to current gene
2.4.5.2 Cluster counts of similar filtered genes by expression profiles
2.4.5.3 K-means clustering of gene expression profiles for filtered genes
2.4.5.4 Hierarchical clustering of expression profiles
2.4.6 Report menu
2.4.6.1 Array report menu - hybridized samples global data
2.4.6.2 Gene reports menu
2.4.6.3 Table format menu
2.4.6.4 Table font size menu
2.5 View menu
2.5.1 Logging MAExplorer messages
2.5.2 Logging command history
2.6 Plugins menu
2.7 Help menu
3.1 Analysis objectives
3.1.1 Some experimental design issues of microarray experiments
3.1.2 Design philosophy of MAExplorer methodology
3.1.3 Evolution of MAExplorer from earlier proteomic data mining systems
3.1.4 Concepts used in data mining with MAExplorer
3.2 Steps in an analysis
3.2.1 Definition of expression profile
3.2.2 Clustering Methods
3.2.2.1 Clustering similar genes
3.2.2.2 K-means clustering
3.2.2.3 Hierarchical clustering
3.3 Display gene intensity and identification data measurements
3.4 Selecting subsets of genes using the data Filter
3.5 Selecting subsets of hybridized sample conditions
3.6 Setting threshold values using the state-scroller sliders
3.7 Exporting report and plot data
4.1 Known Bugs in MAExplorer
4.1.1 Browser Applet Bugs
4.1.2 Downloading and Installer Bugs
4.1.3 Computation speed and display Bugs
4.1.4 User state and login Status
4.1.5 Data file names Bug
4.1.6 Gene Sets Bugs
4.1.7 Clustering Bugs
4.1.8 Expression profile Bugs
4.1.9 Data conversion problems
4.1.10 Java Plugins bugs
4.2 Revision notes
4.3 Web Browser problems when running MAExplorer as an applet
4.4 Handling fatal error reporting (i.e. DRYROT errors)
R.1 Nucleic Acids Res. paper (PDF)
R.2 Overview (PDF)
R.3 Examples (PDF)
R.4 Using mAdb data with MAExplorer (PDF)
R.5 Introduction to Data Mining with MAExplorer (PDF) or (PPT)
R.6 Using Cvt2Mae to convert array data for use with MAExplorer (PDF)
R.7 Statistics in Functional Genomics workshop paper (PDF)
R.8 Software design of the MAExplorer data mining tool (PDF) or (PPT)
A. Short tutorial for MAExplorer
A.1 Demonstration data
A.2 General instructions
A.3 Self-guided tutorial of MAExplorer - notation and examples
C.1 Creating quantified spot data files from hybridized sample arrays
C.2 Table of samples that can be loaded into MAExplorer
C.3 Quantified spot data file format
C.4 GIPO table database file format
C.5 Configuring MAExplorer for use with other arrays
C.6 Using the Cvt2Mae 'wizard' tool to convert array data for use with MAExplorer
D.1 Installing MAExplorer as stand-alone application
D.2 Downloading MAExplorer for stand-alone use with other arrays
D.3 Starting MAExplorer by clicking on a .mae file
D.4 The data file format for .mae files
D.5 Using MAExplorer as an Applet on your computer
D.6 List of startup .mae files included in the download installation
E.1 Internal data structures design to facilitate direct manipulation
E.2 Approaches to data mining: client-centric and server-centric models
E.3 Conversion of microarray data files to MAExplorer format using Cvt2Mae
E.4 Extending MAExplorer functionality using Java Plugins
E.5 Web database server design
Installer information
Download source
javadocs for source
MPL1.1 Public License
Legal
List of Tables
Glossary of terms used in MAExplorer
Index
MAExplorer - Overview
MAExplorer is a bioinformatics microarray data mining Java application
that may help in the discovery of genes regulated in cancer and other
diseases. MAExplorer is generally run as a stand-alone application on a
local computer. By running as a local application, it is able to
access your local disk to save the state of your data mining session
as well as plots and reports. Using the previously saved data mining
state, you can continue a data-mining session at a later date after
exiting the program.
Recommended Hardware
Because data mining is a computationally and graphically intensive
activity, a reasonable level of computational resources is required for
adequate response. The same Java program runs on a variety of
operating systems including Windows 95/98/Me/NT/2000, Macintosh
OS8/9/X, Solaris, Linux, etc. so the choice of computer is not that
critical. We recommend the following hardware:
Addition of user defined analysis methods using Java Plugins
We have provided the ability for users to add their own Java Plugin
Extensions to MAExplorer. These extend the capabilities of the core
MAExplorer program to other more sophisticated analysis methods
created by users and allow interaction with specialized genomic
servers. This is described in Appendix E, Section 2.6, and in the MAExplorer Plugins Web page.

1. Introduction
This hyperlinked manual provides a detailed description of the
MAExplorer conventions (Section 1) and operation (Section 2). The latter contains
many figures of computer screens showing the operations described in
the Section. Section 3
discusses typical scenarios in using MAExplorer for data-mining
microarrays and contains a brief introduction to the process of data
mining. Section 4 lists currently
known bugs and the revision history. Appendix A is a short
tutorial. Appendix B is a more
advanced tutorial. Appendix
C describes the files required by MAExplorer and how they may be
created for using MAExplorer with other array data. It also describes
the data conversion tool Cvt2Mae. Appendix D covers downloading,
installing and running MAExplorer as a stand-alone Java application on
a local computer. Appendix E
discusses design issues for the MAExplorer Java program and supporting
Web servers. Users may create new analytic methods and add them to
MAExplorer as Java plugin extensions (MAEPlugins). There is a
glossary of terms used in MAExplorer. There is also a List of Figures, a List of Tables, and an Index to help find material of interest.
MAExplorer is normally used as a stand-alone program
Figure 1 gives an overview of the
system. Note that MAExplorer does not perform spot
quantification from raw scanned images - it is used for the
subsequent data mining analysis of quantified spot data. Figures 1.1.1 through
1.1.3 describe this in more detail.
Figure 1. Overview of MAExplorer exploratory data analysis
system. Initial data preparation steps are performed prior
to analysis by MAExplorer and are indicated by cyan
italics at the top of the figure. The primary data consists of
quantified microarray image data as well as corresponding qualitative
clone ID, gene-in-plate-order (GIPO or print-table, etc.), gene name,
hypertext base references and related information. After the
microarrays are hybridized, they are scanned and spots quantified
using image spot quantification programs. These lists are then saved
for each array in a tab-delimited file. Microarray image
quantification may be performed by various software such as Axon's
GenePix(TM), Scanalyze, Molecular Dynamics
ImageQuant(TM), Research Genetics' Pathways(TM),
etc. When used as a stand-alone application, data may be saved on the
local computer for local off-line use, and direct access to other
Internet genomic databases may be made without using a proxy server.
[DEPRECATED: When used as an applet, the auxiliary
databases and the MAExplorer Jar files are copied to the Web server or
local file system (in the case of the stand-alone version) where they
are then available to be downloaded by users. When a user invokes a
Web page containing the Java applet, the browser first downloads the applet,
which then downloads auxiliary databases including a configuration file
that describes the array data. It then downloads the subset of
quantified microarray spot data files requested for the set of
hybridized samples being investigated. Additional samples may be
downloaded at any time. When the user selects an operation that
requires access to Web databases not residing on the MAExplorer Web
server, implicit Java security restrictions prevent the applet from
going directly to these other Web servers. Instead, it asks the
MAExplorer proxy server to request the data from the foreign Web server
and return it to the user's Web browser.]
Figure 1.1.1 Overview of MAExplorer exploratory data analysis
system. MAExplorer is used as a stand-alone application on local
data. [Its use as a Web browser applet has been DEPRECATED. In
the case of the applet, it may only access quantified array data from
the Web server that launched the applet.]
Figure 1.1.2 Overview of data preparation for quantified spot data
used by MAExplorer. MAExplorer handles quantified spot data as
shown in this figure. Arrays hybridized against labeled samples
are scanned, and the spots are quantified into spot data files. Quantified
spot data is represented as tab-delimited data with one
spot per row. Each spot is identified in this file by its grid
coordinates (grid, grid row, grid column) with image (X,Y) coordinates
being optional. Quantified spot data includes the raw spot intensity
for each channel (in the case of multiple channels such as Cy3, Cy5,
etc.). If the original data has background spot intensity values, then
these may be included as well; otherwise no background data will be
available for background correction. The spot data is discussed in
more detail in Section 1.1, Appendix C.1, and Appendix C.3.
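As a rough illustration of the one-spot-per-row layout described in this caption, the following Python sketch reads a small tab-delimited spot table. The column names (`grid`, `grid_row`, `grid_col`, `raw_intensity`, `background`) are hypothetical and chosen only for this example; the actual field names and file layout are documented in Appendix C.3.

```python
import csv
import io

def read_spots(tsv_text):
    """Parse tab-delimited spot rows (hypothetical column names) into dicts."""
    spots = []
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        spots.append({
            "grid": int(row["grid"]),
            "grid_row": int(row["grid_row"]),
            "grid_col": int(row["grid_col"]),
            "raw_intensity": float(row["raw_intensity"]),
            # Background is optional; None means no background correction is possible.
            "background": float(row["background"]) if row.get("background") else None,
        })
    return spots

# Two made-up spots; the second has no background value.
SAMPLE = (
    "grid\tgrid_row\tgrid_col\traw_intensity\tbackground\n"
    "1\t1\t1\t1520.0\t110.0\n"
    "1\t1\t2\t310.5\t\n"
)
spots = read_spots(SAMPLE)
```

A spot whose background field is empty is kept, but it cannot take part in background correction.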
Figure 1.1.3 Overview of running MAExplorer as a stand-alone
application. The preferred way of running MAExplorer is as a
stand-alone application. There are distinct advantages in running
MAExplorer as an application: data and the exploration state
may be saved on the user's local computer, and direct access to genomic
servers is easier (no proxy server required - see Figure 1.4). MAExplorer plugin
extensions (MAEPlugins) may only be used with the stand-alone
version. Since MAExplorer is packaged for
download for a variety of operating systems, using this method is
not difficult to set up and the MAEPlugins should run on a variety of
operating systems.
Figure 1.1.4 [DEPRECATED] Overview of running MAExplorer as a Web
browser applet. An alternative way of running MAExplorer on
existing databases is as a Web-browser applet. The advantage of this
method is that no software installation is required on the user's
computer. However, the user may not save data and the exploration
state on their local computer. Furthermore, direct access to genomic
servers requires a proxy server. MAExplorer plugin extensions
(MAEPlugins) may not be used with the applet version. The Mammary Genome Anatomy Program
(MGAP) originally used the MAExplorer applet.
In MAExplorer we refer to grids by letter names (A,B,C,...) and fields
by F1 and F2. If you are using Cy3/Cy5 ratio data and the Cy3 and Cy5
data is available as independent channels for each HP sample, then
operations that use F1 and F2 will use the Cy3 and Cy5 data for
various operations such as scatter plots (Cy3 vs Cy5), etc. If there
is only one field in an array (i.e. no duplicate grids), then when
MAExplorer is run, operations and menus describing F1 and F2
operations will not be available.
Using duplicate (F1 and F2) spots allows us to get an estimate of the
hybridization variance within an array and is used to compute the
(F1,F2) gene coefficient
of variation (CV) used in the gene data Filter to remove noisy
data before looking for additional differences. Note that if Cy3/Cy5
data is used, then F1 and F2 duplicates are not allowed, as MAExplorer
uses the (F1,F2) data to hold the (Cy3,Cy5) data for a hybridized
sample.
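To make the duplicate-spot idea concrete, here is a minimal Python sketch of a coefficient-of-variation filter on (F1,F2) pairs. The exact CV formula MAExplorer uses is not stated in this section, so the sketch assumes the conventional definition (sample standard deviation divided by the mean); the function names and the threshold are illustrative only.

```python
import statistics

def duplicate_cv(f1, f2):
    """Conventional CV (sample standard deviation / mean) of two duplicate spots.

    Assumption: this is the textbook definition, not necessarily MAExplorer's.
    """
    mean = (f1 + f2) / 2.0
    if mean == 0:
        return float("inf")  # CV is undefined for a zero mean; treat as maximally noisy
    return statistics.stdev([f1, f2]) / mean

def passes_cv_filter(f1, f2, max_cv):
    """Keep a gene only if its duplicate spots agree to within max_cv."""
    return duplicate_cv(f1, f2) <= max_cv
```

With a hypothetical threshold of 0.2, duplicates of 90 and 110 would be kept while duplicates of 50 and 150 would be removed as noisy.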
Example of a MAExplorer database -
http://www.lecb.ncifcrf.gov/mae - the public MGAP DB
The Mammary Genome Anatomy Program (MGAP) microarrays of cDNA clones
from mouse mammary tissue (collaboration with Research Genetics) were
hybridized with 33P radio-labeled samples. These were then
used to charge fluorescing plates. See the MGAP site for more
documentation on the database and preparation procedures. The
hybridized arrays were scanned on a phospho-imager scanner at high
resolution. Spot data was quantified from these images using the
Research Genetics' "Pathways 2.01" program which generated
tab-delimited data files. This data also includes the microarray grid
point locations (field, grid, grid row, grid col) from the associated
microarray description data files (gene-in-plate-order data). When you
download MAExplorer, you will also
download the public MGAP dataset.
1.1 Microarrays and notation used with MAExplorer
In general, microarrays are hybridized using cDNA samples derived from
mRNA labeled with either radio-label, biotin, fluorescent dyes, or
other methods (see Schulze,
2001, for a review of the technology). MAExplorer may be used to
construct databases using single-labeled sample intensity (e.g.,
Affymetrix, 33P radio-labeled, etc.) and double-labeled
ratio fluorescent (i.e. Cy3/Cy5) data arrays with different GIPO
geometries.
Definition of "Condition list of samples"
Samples are organized into Condition Lists of samples (generally
replicate samples). These may be used in various statistical and
clustering tests. There are three built-in lists of samples called
the HP-X 'set', the HP-Y 'set' and the HP-E list. The X and Y sets are
used in various 2 condition tests such as the t-Test between the X and Y
sets (Section 2.4.3). The HP-E list is an ordered expression list
of samples used in clustering and in displaying expression
profiles. You may interactively define new named condition lists or edit
existing ones using a graphical wizard (Section 2.6), manipulate them, and assign them to the HP-X
'set', HP-Y 'set' and HP-E list. Some examples of condition lists
might be (assuming you have the data available in your database):
Virgin = ( V.1, V.2, V.3 )
Pregnancy = ( P13.1, P13.2, P13.3 )
Lactation = ( L3.1, L3.2, L3.3 )
Involution = ( I4.1, I4.2, I4.3 )
Definition of "Ordered Condition list" of multiple condition lists
We further extend this paradigm by defining a meta-data structure
called the "Ordered Condition List" or OCL. This is a list of
multiple conditions that you have previously defined. The OCL
may be sorted if you wish and if the data lends itself to
sorting. For example, a time series of conditions lends itself to sorting, whereas
different types of diagnoses may not. The OCL may be used in various
statistical tests (e.g., the F-test applied to the current OCL; see Section 2.4.3). You may interactively
define new named Ordered Condition Lists or edit existing ones using a graphical wizard
(Section 2.7).
An example of an ordered condition list might be:
Parturition = ( Virgin, Pregnancy, Lactation, Involution )
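The condition lists and the ordered condition list above can be pictured as plain data structures. The Python sketch below is only an illustration of that relationship and says nothing about how MAExplorer stores these lists internally; the helper name `samples_in_ocl` is made up for this example.

```python
# Named condition lists: each maps a condition name to its replicate samples.
conditions = {
    "Virgin": ["V.1", "V.2", "V.3"],
    "Pregnancy": ["P13.1", "P13.2", "P13.3"],
    "Lactation": ["L3.1", "L3.2", "L3.3"],
    "Involution": ["I4.1", "I4.2", "I4.3"],
}

# An ordered condition list (OCL) is an ordered list of condition names.
parturition_ocl = ["Virgin", "Pregnancy", "Lactation", "Involution"]

def samples_in_ocl(ocl, condition_lists):
    """Flatten an OCL into its samples, preserving condition order."""
    return [sample for name in ocl for sample in condition_lists[name]]

all_samples = samples_in_ocl(parturition_ocl, conditions)
```

Keeping the OCL as a list of names, not of samples, means that editing a condition list automatically changes what the OCL refers to.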
Definition of "intensity" for single-labeled samples
MAExplorer uses the term "intensity" in slightly different ways
depending on whether you are using single-labeled or fluorescent
double-labeled data. For single-labeled data, "intensity" is the raw
quantified data value as measured by the image scanner. Raw data must
be normalized before it can be compared between samples. Therefore, to compare N samples, you must first normalize the data and then
compare them.
Definition of "intensity" for fluorescent double-labeled samples
For fluorescent double-labeled data, the Cy3 and Cy5 dye-labeled (for
example) measurements are the raw quantified data values as measured
by the image scanner. In this case, "intensity" is defined as the
ratio of Cy3 to Cy5 (i.e. Cy3/Cy5). If you wish to look at the ratio
as Cy5/Cy3, you may flip the two channels on a per-sample basis (see
Section 2.2.2 for more
details).
Issues of experimental design of microarray experiments
Some of the issues involved in experimental design
(setting up experiments) based on the types of arrays are discussed in
Section 3.1.1 for (Cy3/Cy5)-labeled as well as 33P-labeled
samples. Poorly designed experiments will not yield statistically
significant results, so attention should be paid to developing an
adequate and robust design for your data given the costs of doing
experiments as well as statistical constraints on analyzing the data.
Actual and "Pseudoarray" image geometry
The main MAExplorer window contains a pseudoarray image for
visualization purposes. It may or may not correspond to the spot
positions on the actual array. This array geometry is defined by the
number of replicate Fields (normally 1) each of which contains a
number of grids (also called "blocks") containing a number of
rows/grid and columns/grid of spots. If there is no explicit array
geometry or spot (X,Y) coordinate data available but simply gene
identifiers and intensity data, then an arbitrary pseudoarray geometry
is generated. If there is an explicit array geometry, then it will
draw the pseudoarray using this geometry. The database configuration
determines which method will be used and is discussed in Appendix C.5. If there is no
explicit grid geometry, the number of spot Locations (e.g., IncyteID,
Affymetrix probe_set) may be used to synthesize a set of grids of a
size that is reasonable for viewing with MAExplorer. This is done in
the Cvt2Mae array data
conversion program when the array
geometry (#grids, #rows/grid, #columns/grid) is not known. This
conversion is not done in MAExplorer itself.
In Cvt2Mae we generate a visually appealing pseudoarray image geometry
if no array geometry is specified with the data (e.g. Affymetrix data,
etc). It maps the N spot data entries to a
(#grids, #grid-rows, #grid-columns) geometry. The algorithm is given in Appendix C.6 as well as
a suggestion for handling
non-standard geometries using Cvt2Mae.
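The actual Cvt2Mae algorithm is given in Appendix C.6 and is not reproduced here. Purely to illustrate what "mapping N spots to a (#grids, #grid-rows, #grid-columns) geometry" can mean, the following Python sketch uses one simple, assumed strategy: cap the number of spots per grid, then make each grid roughly square. The function name and the cap of 400 spots per grid are hypothetical.

```python
import math

def pseudo_geometry(n_spots, max_spots_per_grid=400):
    """Return (grids, rows, cols) such that grids * rows * cols >= n_spots.

    Illustrative only; this is not the Cvt2Mae algorithm from Appendix C.6.
    """
    if n_spots <= 0:
        raise ValueError("n_spots must be positive")
    grids = math.ceil(n_spots / max_spots_per_grid)   # enough grids to respect the cap
    per_grid = math.ceil(n_spots / grids)             # spots each grid must hold
    cols = math.ceil(math.sqrt(per_grid))             # roughly square grid
    rows = math.ceil(per_grid / cols)
    return grids, rows, cols
```

Any unused positions at the end of the last grid would simply be left empty in the pseudoarray image.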
Gene coordinate numbering on the microarray
A gene coordinate
numbering is a mapping of gene identifiers to locations on the
array for a particular array geometry. These are described by
grids (or blocks), each consisting of grid rows by
grid columns of spots. The grids may be repeated on the array
and constitute duplicate fields. Some arrays group subsets of
grids into meta-grids which are specified by meta-grid
rows by meta-grid columns of grids. MAExplorer can handle
grids but not meta-grids. In the case where there is no
array grid geometry specified or meta-grids are used, an arbitrary
pseudoarray geometry can be constructed to serve as a basis to display
the microarray pseudoimage (see the Algorithm for constructing the
pseudo array from a list of spots in Appendix C.6).
Example: special array spot coordinate numbering for the MGAP array
As an example of this coordinate system, the following describes the array geometry for the array used in the NIDDK MGAP database. The general principle is the same for other arrays with different sizes and numbers of fields. The MGAP array was spotted by Research Genetics for MGAP. Clones in the array are laid down in grids consisting of 8 rows and 24 columns per grid. There are 8 grids (named A through H or 1 to 8) to a field with a space between grids. Finally, there are two fields (left and right, named 1 and 2 or F1 and F2) that are duplicates.
Note: we currently present the MGAP arrays with grids A through H oriented from top to bottom, whereas Research Genetics orients them rotated +90 degrees with grid H to the left and grid A to the right. This occurred when the images were scanned with a -90 degree change in the orientation. Therefore, we have swapped rows and columns in our relative orientations so that they meet users' normal expectations of row-column orientation. This could easily be changed to the Research Genetics convention using a parameter in the configuration file. Since the actual plate coordinates are tracked with each clone and reported when it is accessed in MAExplorer, the image coordinate system is not that critical, although a close correspondence between the actual array layout and the data-mining layout can be useful.
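The MGAP geometry described above (2 duplicate fields, 8 grids per field, 8 rows by 24 columns per grid) gives 8 x 8 x 24 = 1536 spots per field and 3072 spots per array. The Python sketch below works through that arithmetic. The linear ordering used (field, then grid, then row, then column) is an assumption made for illustration and is not taken from the MAExplorer source.

```python
FIELDS, GRIDS, ROWS, COLS = 2, 8, 8, 24   # MGAP array geometry from the text

def grid_name(grid):
    """Map grid number 1..8 to its letter name A..H."""
    return "ABCDEFGH"[grid - 1]

def spot_index(field, grid, row, col):
    """Zero-based linear index for 1-based (field, grid, row, col).

    The field-major ordering here is an illustrative assumption.
    """
    for value, limit in ((field, FIELDS), (grid, GRIDS), (row, ROWS), (col, COLS)):
        if not 1 <= value <= limit:
            raise ValueError("coordinate out of range")
    return (((field - 1) * GRIDS + (grid - 1)) * ROWS + (row - 1)) * COLS + (col - 1)

SPOTS_PER_FIELD = GRIDS * ROWS * COLS      # 1536
TOTAL_SPOTS = FIELDS * SPOTS_PER_FIELD     # 3072
```

Under this ordering, the F2 duplicate of any spot lies exactly `SPOTS_PER_FIELD` positions after its F1 counterpart.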
Various gene identifiers may be present in the GIPO data file
associated with the array. One of these is selected as the unique
identifier to represent genes in the MAExplorer database. Normally,
the Master gene ID is defined as the Clone ID. However, if the
Clone ID is not present but the GenBank ID is, it will use the latter
as the identifier. If neither GenBank nor Clone ID is present, it
will use GenBank5' then GenBank3' if present. If those are not present,
it will use the UniGene ID if it is present. If that is not present, it
will use dbEST5' then dbEST3' if present. If those are not present, it
will use the LocusLink LocusID if present. Finally, if none of those
identifiers are present, you can specify a 'Generic ID' that is
related to some other database gene identifier such as a 'Location'
identifier.
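The fallback order in the paragraph above amounts to a first-match search over a fixed precedence list. The Python sketch below expresses that rule for a single record. The dictionary keys are hypothetical field labels chosen for this example and are not the actual GIPO column names.

```python
# Precedence order taken from the text; the key spellings are hypothetical.
MASTER_ID_PRECEDENCE = [
    "CloneID", "GenBank", "GenBank5'", "GenBank3'",
    "UniGene", "dbEST5'", "dbEST3'", "LocusID", "GenericID",
]

def master_gene_id(record):
    """Return (field, value) for the first identifier present, or None."""
    for field in MASTER_ID_PRECEDENCE:
        value = record.get(field)
        if value:                      # skip missing or empty identifiers
            return field, value
    return None

# Made-up record with no Clone ID or GenBank ID.
example = {"CloneID": "", "UniGene": "Mm.0000", "LocusID": "12345"}
chosen = master_gene_id(example)
```

The text describes the choice as one made for the database as a whole; the sketch applies the same rule to one record only to keep the example short.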
The current gene may be specified by clicking on a spot in the
microarray image or on a point in the popup scatter plot, or a gene
ID cell in a report.
Setting the "current gene" to a specific gene by "Master gene ID"
MAExplorer uses the concept of the "current gene" to indicate a
particular gene to be analyzed. You may interrogate the microarray
database or Internet databases for data on the current gene or use
it in one of the operations. For example, you might cluster genes by
expression profiles to find other genes with profiles similar
to the current gene.
Setting the "current gene" by Gene Name Guesser
In addition, the user may type a specific gene name or clone ID into a
popup Gene Name Guesser dialog text window. This is invoked by
clicking on the blue button "Enter gene name or clone ID" at the top
right in the control panel. When the "guesser" window pops up, start
typing the gene name or clone ID in the blue text entry field. You
select either the Gene Names, Clone ID, UniGene ID, GenBank, GenBank
3', GenBank 5', dbEST 3', dbEST 5', or LocusID identifier. Then you
may start typing letters and it will match all names or identifiers
that are prefixed with the sub-string you have typed so far. As you
type more characters, it will limit the list of possible completions
of what you are typing. After selecting the gene you want, you then
press the "Done" button to use this entry to set the current gene and
remove the guesser popup window. You may press the "Clear" button to
clear what you have typed and the "Cancel" button to cancel the
current gene selection process.
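The narrowing behavior of the guesser described above is ordinary prefix completion. A minimal Python sketch follows. Whether the real guesser ignores case is not stated in this section, so case-insensitive matching is an assumption, and the gene names are sample input only.

```python
def complete(prefix, names):
    """Return the sorted names that start with the typed prefix.

    Assumption: matching ignores case; the manual does not say either way.
    """
    typed = prefix.lower()
    return sorted(name for name in names if name.lower().startswith(typed))

# Sample input for illustration only.
names = ["Brca1", "Actg1", "Actb"]
after_two_letters = complete("ac", names)
after_four_letters = complete("actb", names)
```

Each additional character can only shrink the candidate list, which is the behavior the text describes.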
Setting the "Edited Gene List" subset of genes using wildcard
names
You may also define a set of genes from the guesser window using
wildcard names where the character '*' matches zero or more
characters. First you specify a sub-string common to gene names. Then
press the "Set E.G.L." (set 'Edited Gene List') button. For example (see Figure
2.3.1), you could find all oncogenes and proto-oncogenes by typing
"*ONCO*" in the guesser. It automatically enables the View 'Edited
Gene List' in the array that shows genes in the E.G.L. enclosed in
magenta boxes.
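The wildcard rule above ('*' matches zero or more characters) can be sketched in Python by translating the pattern into a regular expression. Case-insensitive matching is an assumption here, the helper names are made up for this example, and the gene names are invented sample input.

```python
import re

def wildcard_to_regex(pattern):
    """Translate a pattern in which '*' matches zero or more characters."""
    literal_parts = [re.escape(part) for part in pattern.split("*")]
    # Assumption: matching ignores case; the manual does not say either way.
    return re.compile(".*".join(literal_parts), re.IGNORECASE)

def match_genes(pattern, names):
    """Return the names that match the whole wildcard pattern."""
    regex = wildcard_to_regex(pattern)
    return [name for name in names if regex.fullmatch(name)]

# Invented gene names for illustration only.
names = ["Ras oncogene", "proto-oncogene Jun", "Casein beta"]
edited_gene_list = match_genes("*ONCO*", names)
```

Escaping the literal parts keeps characters such as '.' or '(' in a gene name from being read as regular-expression syntax.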
The current gene cluster
Some operations involving clustering will automatically assign the
gene cluster to the E.G.L. This includes clustering of genes similar
to a selected (i.e. current) gene and K-means clustering. In the case
of K-means clustering, the cluster you select by picking a gene
belonging to that cluster will cause it to be defined as the current
cluster and also assigned to the E.G.L. This will be discussed in more
detail in the section on clustering.
The current Condition List of samples
The current condition list of samples
is the last condition list edited with the interactive graphical wizard
(Section 2.6) used to define new condition lists or edit existing ones.
The current Ordered Condition List (OCL) of multiple conditions
The current ordered condition
list (a possibly ordered list of multiple condition
lists) is the last one edited with the interactive graphical
wizard (Section 2.7) used to define new ordered condition lists or edit existing ones.
Saving full resolution plots as GIF files in stand-alone mode
The various plots may be saved as full resolution GIF files when
running MAExplorer in stand-alone mode. The various plots have
"SaveAs" buttons which appear in stand-alone mode. Saving your
intermediate results may be useful for documenting your data mining
session or for subsequent publication. (Here is an example of a full
resolution
clustergram of 38 MGAP hybridized samples for 1076 named and EST
genes).
Saving Text windows as .txt files in stand-alone mode
The various text windows may be saved as .txt files when running
MAExplorer in stand-alone mode. The various text windows have
"SaveAs" buttons which appear in stand-alone mode. Saving your
intermediate results may be useful for documenting your data mining
session or for subsequent publication.
1.2 Microarray image quantification
Quantification data for all genes in a hybridized sample (x and y
coordinates, intensity, background density) is obtained by reading
data from a quantification file for that hybridized sample. The
quantification file for each hybridized sample resides on the local
file system (for stand-alone) or MAExplorer Web server (for applet
use) and is derived from image quantification programs such as Axon's
GenePix(TM) program, Scanalyze, Molecular Dynamics'
ImageQuant(TM) program, Research Genetics'
Pathways(TM) program, etc. These programs are
independent of MAExplorer and are not part of our downloadable
software distribution.
Normalization between hybridized samples must be performed to
allow comparison between different hybridized array samples. (File
formats are discussed in Appendix C.)
1.2.1 Ratio and Zscore comparison of data from different
hybridized samples
Because of variation between hybridized samples, data is normalized.
Methods that are pure scaling transformations (such as Median, Scale to 65K, By Calibration DNA, By Use Gene Set,
etc.) allow you to compare data using the ratio between two normalized
sets of data. We define the ratio for two samples as follows:
ratio(x,y,c) = Ixc / Iyc
where:
samples x,y have values Ixc and Iyc for the same
gene c in samples HP-X and HP-Y
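As a small worked illustration of the ratio definition, the following Java sketch computes ratio(x,y,c) for each gene c from two arrays of normalized intensities. The class name NormalizedRatio is hypothetical, and returning NaN for a zero denominator is an assumption rather than documented MAExplorer behavior.

```java
/** Hypothetical helper (not part of MAExplorer): per-gene ratio of two normalized samples. */
final class NormalizedRatio {

    /** ratio(x,y,c) = Ixc / Iyc for one gene c. */
    static double ratio(double ixc, double iyc) {
        // Assumption: an undefined ratio is reported as NaN.
        if (iyc == 0.0) {
            return Double.NaN;
        }
        return ixc / iyc;
    }

    /** Applies the ratio gene by gene; index c refers to the same gene in both samples. */
    static double[] ratios(double[] sampleX, double[] sampleY) {
        if (sampleX.length != sampleY.length) {
            throw new IllegalArgumentException("samples must cover the same genes");
        }
        double[] result = new double[sampleX.length];
        for (int c = 0; c < result.length; c++) {
            result[c] = ratio(sampleX[c], sampleY[c]);
        }
        return result;
    }
}
```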
The Zscore method transforms the data in a way that is not a pure
scaling, so it cannot be used with the ratio comparison. Instead we use
the Zdiff(x,y) method for comparing Zscores, developed by Mark Vawter
(Vawter, 2000). Zscores typically cover the range of -3.0 to +3.0
(standard deviations) with a transformed mean of 0.0. Therefore the
Zdiff will typically cover the range of -6.0 to +6.0.
Let
Zscore(p,c) = (Ipc - meanp)/stdDevp
where:
Ipc is the intensity of gene c for sample p, and meanp and stdDevp
are the mean and standard deviation of the intensities in sample p
Then,
Zdiff(x,y,c) = Zscore(x,c) - Zscore(y,c),
where:
samples x,y have Zscore(x,c) and Zscore(y,c) normalized values for the
same gene c in samples HP-X and HP-Y, or HP-X 'sets' and HP-Y 'sets'.
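The Zscore and Zdiff definitions can be illustrated with a short Java sketch. The class ZscoreCompare is a hypothetical name, and the use of the sample standard deviation (dividing by n - 1) is an assumption, since the text does not say which form is used.

```java
/** Hypothetical helper (not part of MAExplorer): Zscore normalization and Zdiff comparison. */
final class ZscoreCompare {

    static double mean(double[] values) {
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    /** Assumption: sample standard deviation (n - 1 in the denominator). */
    static double stdDev(double[] values) {
        double m = mean(values);
        double sumSquares = 0.0;
        for (double v : values) {
            sumSquares += (v - m) * (v - m);
        }
        return Math.sqrt(sumSquares / (values.length - 1));
    }

    /** Zscore(p,c) = (Ipc - meanp) / stdDevp for every gene c of sample p. */
    static double[] zscores(double[] intensities) {
        double m = mean(intensities);
        double sd = stdDev(intensities);
        double[] z = new double[intensities.length];
        for (int c = 0; c < z.length; c++) {
            z[c] = (intensities[c] - m) / sd;
        }
        return z;
    }

    /** Zdiff(x,y,c) = Zscore(x,c) - Zscore(y,c). */
    static double zdiff(double[] zscoresX, double[] zscoresY, int c) {
        return zscoresX[c] - zscoresY[c];
    }
}
```

Because each Zscore typically lies between -3.0 and +3.0, the difference returned by zdiff typically lies between -6.0 and +6.0.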
1.3 Microarray image and plot display
The MAExplorer displays one microarray pseudoarray image of the
hybridized samples. This is either for a single sample, the ratio of
two samples, the average of replicate samples or the ratio of two sets
of replicate samples, the ratio Cy3/Cy5 or Cy5/Cy3, or other
mappings. Section 2.4.4.1
Show microarray pseudoarray images menu describes these options and
shows some examples.
The Filter menu is used to select a set of data filters that determines which genes are selected. The selected genes are highlighted in the array image in different ways: in the intensity (ratio) pseudoarray image, a red (white) circle is drawn around each spot meeting the range threshold criteria. How they are highlighted depends on which Plot menu Show Microarray method and View menu modes were selected. If the Show 'Edited Gene List' (EGL) option is set in the View menu, genes in the EGL will appear as magenta squares. The "Filter mode" is always present and shows genes meeting various Filter criteria (to be discussed). The user may interactively define a list of genes by clicking on them when the Click to add gene to edited gene list option is set in the Edit menu. Alternatively, you can click on a gene with the Control key pressed to add a gene to the EGL or with the Shift key pressed to delete a gene from the EGL.
In all of the pseudoarray images, the grids in the image are labeled
field#-GridLetter (e.g. 1-C, 2-B, etc). This allows them to be
clearly identified as the user scrolls over the image that is larger
than the visible computer window.
There is also a popup alert message window for better informing
users of conditions that prevent them from doing the operation they
requested. You must press the Close button to pop down the message,
although you may first press the SaveAs button to save the message to a
file. For complex problems, some of the messages may suggest what you
need to do to correct the problem.
Hybridized samples are selected from a list of all of the
samples in the database. To make it easier to select an HP, they may
be selected from submenus by their developmental stage (if supported
by your particular database) or from a list of all samples in the
database located on the left side of the pseudoarray image. If a sample
has never been loaded during a session, it will be loaded when you
request it.
The last sample selected is called the current sample or
current HP. That is the sample that is displayed in the pseudoarray
image in the primary MAExplorer window when using display modes
requiring a single sample.
Figure 1.3 Data Filter Venn diagram. This illustrates some of
the logical, data range and statistical tests criteria available using
the MAExplorer data Filter paradigm. Note that multiple criteria
may be selected from each of these categories. The extreme case,
probably never used, could use all tests.
A first-approximation approach to data-mining might be to sequentially
constrain the data of interest to find some changes and then to report
on those changes. We have arranged these commonly performed first-pass
operations as submenu entries in the Analysis Menu. The submenus are:
Figure 1.4 Screen view of MAExplorer main window with Analysis
Menu. The menu structure of MAExplorer was designed to allow users
to quickly perform commonly used data-mining operations. Other menus
are used for modifying the data (File, Samples, Edit, and View menus)
or accessing on-line Help menu information in a separate Web browser
popup window. MAExplorer menus are similar to most Windows PC
applications where pull-down menu selections are used to invoke
operations. The current hybridized array sample is displayed as a
pseudocolor ratio image of median normalized spot intensities.
Clicking on a spot assigns it as the current gene with data being
reported in the top most message area. The names of the current HP-X
and HP-Y samples are listed above that area. In general, clicking on
spots, points in plots or cells in spreadsheet reports will assign the
corresponding gene as the current gene and access Web genomic databases
if enabled.
In addition to displaying the hybridized sample pseudoarray images,
derived data may be viewed in various types of plots. These include
scatter plots, histograms, ratio-histograms, expression profiles, gene
clustering, etc. Data may also be presented as table reports, either
as active spreadsheets that can access genomic databases by
clicking on cells or as tab-delimited Excel-compatible tables that may
be cut (if your windowing system supports this) and pasted into an
Excel spreadsheet.
The selected HP-X and HP-Y samples are used when generating scatter
plots, ratio histograms and other graphics. Scatter plots and ratio
histograms may also be performed on the left and right sides of the
currently displayed HP array (fields F1 and F2 respectively if array
data has duplicate spots for the same genes).
A MAExplorer database contains a table identifying genes, so data is
accessible by gene name as well as by sub-strings identifying a set of
genes (e.g. "onco", which could be used to find any oncogene or
proto-oncogene in the database).
When the program starts, it displays the microarray image of the first
hybridized sample in the HP-X set of samples initially specified. If
you specify a new HP-X or HP-Y sample, then it changes the pseudoarray
image to correspond to that array. You may change the current HP-X or
HP-Y sample from either the Samples pull-down menu or by clicking on a
sample in the Active Sample list in the left of the pseudoarray
image. If you click the mouse on or near a spot, it will latch
onto that spot and define it as the current gene.
Note: In Figure 1.4,
genes that pass the MAExplorer data Filters are indicated by red
(white) circles around spots in the pseudograyscale (pseudocolor)
intensity (ratio) image. The pseudoarray image shows the gene data as
replicate grids of spots if there are two fields, Field 1 (left set of
gridded spots) and Field 2 (right set of gridded spots). If there is no
duplicate spot data, then only Field 1 is shown.
If background correction is enabled in the Normalization menu, then
intensity is reported in the message displays as intensity' (primed);
otherwise it is reported as intensity.
Normalization should also be used between hybridized samples -
whether the data is ratio data (i.e. Cy3/Cy5) or single sample
intensity arrays.
Setting up MAExplorer to work with user-specific data is
discussed later in this manual.
Figure 1.5.1 The MicroArray Explorer home page at
http://maexplorer.sourceforge.net/. The table of contents in
the left panel lists an introduction, a short tutorial and several
demonstration databases. Below that are links to documentation
including this reference manual, glossary and index. The Export
version discusses running MAExplorer with other arrays and as a
stand-alone version. The Download
application is a Web page for downloading and installing the
stand-alone Java application on your computer.
You may start MAExplorer in your Web browser from the MGAP
Startup DB. This offers several preset public databases consisting
of sets of hybridized samples as well as the empty database. After
you have clicked on a particular startup database, it will begin
loading MAExplorer - indicated by a red box with a
"Loading..." message in the top window of your browser. After
MAExplorer starts, this message changes to a white box with "Reading DB" while it downloads the data files
required. Finally, when it is ready for your interaction, it displays
a white box with a green "Ready".
NOTE: for Web browser invocation, the MAExplorer applet works with
Netscape 4.7, Internet Explorer 5.0, and HotJava on a Windows
(95/98/NT/2000/XP) system or a Solaris Unix system. Macintosh and SGI
systems seem to hang at times because of Web browser
problems. However, it works on all other systems as a stand-alone Java
application that you may download and
install on your computer. You might want to review these Web browser restrictions.
After the MAExplorer is started and the menus become active, you may
switch the preset hybridized samples to other samples using the
Samples pull-down menu. The last hybridized sample loaded
becomes the "current hybridized sample" and its image is the one
displayed.
The following Sections 2.1 through 2.7 describe the pull-down menus in
detail.
In stand-alone mode, the user may select the database subset to be
loaded from either a Web server or a local file system. When used as
an applet, this is pre-determined by the Web page where MAExplorer is
started. Opening a disk DB, 'Open disk DB', also restores any user defined gene sets
and other parts of the exploratory
state that were present when the 'Save ... disk DB' was
invoked.
In the following menus, selections that are sub-menus are
indicated by a '
When used as an applet connected to a Web database server, databases
may be divided into public and collaborator projects. Users accessing
protected collaborator projects will be required to log-in to the
server and a popup login request will appear.
[In the future], each user will be able to save the state of their
exploration into a password protected directory of named states on a
Web server (e.g. by doing a 'Save ... Web DB' command). Later, they could
restore that state from the Web server by doing an 'Open Web DB'
command. Users would be required to register with that server to set
up a unique state-saving area. Once this facility is set up, users
may selectively allow other users to view selected data, implementing
a groupware environment for improving collaboration.
Figure 2.1.1 Example of the "Open file DB" command. The file
browser is opened in the current project directory with the name of
the currently opened file. You may select another .mae startup
database file to load in the current project. You may also "cruise"
the file system and load an .mae file from a different project directory.
The "Set project" command makes this easier since it gives you a list
of available projects that you may change directly. The projects must have
been set up on your computer previously. The "New project" command can
be used for setting up new projects.
Figure 2.1.2 Example of saving a user session in a new startup file
using the "SaveAs DB" command. The file browser is opened in the
current project directory with the name of the currently opened
file. You may enter another .mae file name to save your current
session. Then when you restart MAExplorer using this new file, it will
restore the data mining state to where you left off (except that no
popup windows are opened).
A registered user may access another registered user's
state or states (using the Open another user's state command)
if the user owning the data has granted them permission. The Share
user state and Unshare user state commands control these
permissions. There are two special share-users defined: public
to allow unlimited read-only access to the state they specify, and
private to disallow all access to a user state.
The first menu command, "Choose HP-X, HP-Y and HP-E samples",
lets you change the current working HP-X 'set', HP-Y 'set', and
HP-E 'list' hybridized samples.
The second menu command, "Choose named condition lists of samples",
lets you define or edit new named lists of hybridized samples. This is
useful for defining sets of replicate samples. These may be further
manipulated using the (Edit menu | Sets of Conditions (samples))
commands.
The third menu command, "Choose ordered lists of conditions", lets you
define or edit new named Ordered Condition Lists (OCL) of named
condition lists. This is useful for defining a sub-experiment
consisting of N conditions each with replicate samples. The last OCL
manipulated is defined as the "current OCL". The current OCL is used
in the OCL F-test Filter.
The fourth menu command, "Set Samples from lists", lets you change the
current HP-X and HP-Y samples as well as the HP-X 'set',
HP-Y 'set', and HP-E 'list' samples. This is similar to using the
"Choose HP-X, HP-Y and HP-E samples" command, but is more difficult to
use. You may change the current HP-X or HP-Y sample by clicking on
the sample name directly in the list of sample names on the left side
of the pseudoarray image (see Figure 2.2.3 legend).
The fifth menu entry, "Edit use (Cy5/Cy3) else (Cy3/Cy5) for each HP",
lets you swap data channels for Cy3/Cy5 data for individual samples.
Other menu commands list the status of the current HP-X 'set', HP-Y
'set', or HP-E 'list', and define condition class names that are
associated with the HP-X 'set' and HP-Y 'set'. The last menu entry,
"Use HP-X & HP-Y 'sets' else single samples", lets you switch
between using HP-X and HP-Y as single samples or as sets of multiple
samples. For example, if you are using a scatter plot of X and Y, it
will switch the data being plotted from a comparison of single samples
to a comparison of means of sets of samples depending on the status of
the switch. Sets of samples are used extensively in data explorations.
Figure 2.2.1 Samples menu - selecting lists of samples by using the
"chooser". The hybridized samples assigned to the current HP-X,
current HP-Y, set of HP-X, set of HP-Y and expression profile list
HP-E may be changed from the Samples pull-down menu. The
Choose HP-X, HP-Y and HP-E option lets you graphically change
the currently active sample HP-X, HP-Y sets and E-list.
Figure 2.2.2 Samples menu - selecting samples by source
characteristics. The hybridized samples assigned to the current
HP-X, current HP-Y, set of HP-X, set of HP-Y and expression profile
list HP-E may be changed from the Samples pull down menu. The
specific "By Source" menus shown here are from the MGAP database.
This figure shows the user changing the current X sample from the
developmental stages submenu that is part of the "By Source" submenu.
Alternatively, samples containing a keyword or part of a keyword can
be found using a "guesser" popup window that allows the use of wild
cards. This is invoked using the "From list of all H.P.s" submenu. For
example, you could specify "*pregnancy*" to find all samples
containing that word.
Figure 2.2.3 Changing the current sample to either the HP-X or
HP-Y sample by clicking on a sample name at the left edge in the
microarray pseudoarray image. The current sample is indicated in
magenta. Click on the magenta "*" adjacent to the new name you want to
select and it will change the HP-X sample. To switch between setting
HP-X and HP-Y, click on the [X] Current
Sample box to change the sample to HP-Y. You can click on
[Y] Current Sample box to change it
back to HP-X. Then clicking on a sample name will set it to the
current HP-X or HP-Y that was selected. This figure shows that the
user had selected [Y] and C57B6-L10-29hrs for the new HP-Y sample.
The Set current HP-X sample and Set current HP-Y sample
commands offer another way to set the single current X and Y sample
(see Figure 2.2.3 above for
the preferred way using the "Chooser").
The Edit HP-X & HP-Y 'sets' of samples by source menu allows
the user to define HP-X and HP-Y as sets having multiple
hybridized samples. Then, the mean values of the genes are used
when comparing HP-X with HP-Y.
For example, the By Source database-specific entries for the
MGAP database include the following submenus.
The From list of all samples selection pops up a hybridized
sample guesser dialog window. As with the gene name guesser, you can
start typing in the name of a sample and it will give you a list of HPs
that match that initial string. You then click on the sample you want
and then press the Done button.
Figure 2.2.4 Samples menu - selectively swapping (Cy3,Cy5) data
channels for particular samples. This is only operative if your
database contains Cy3/Cy5 ratio labeling data. This is useful in
databases containing subsets of dye-swap experiments mixed in with
other samples that are not dye-swapped.
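The per-sample channel swap can be pictured with the following Java sketch. It is an illustration under assumptions: the class DyeSwap, the method sampleRatio and the idea of keeping the swapped sample names in a Set are all hypothetical, and the default orientation (Cy3/Cy5 unless a sample is marked) simply follows the wording of the menu entry.

```java
import java.util.Set;

/** Hypothetical helper (not part of MAExplorer): per-sample choice of Cy5/Cy3 or Cy3/Cy5. */
final class DyeSwap {

    /**
     * Returns Cy5/Cy3 for samples marked as swapped and Cy3/Cy5 for all other samples.
     * The set of swapped sample names stands in for the per-sample setting edited in the menu.
     */
    static double sampleRatio(String sampleName, double cy3, double cy5, Set<String> swappedSamples) {
        if (swappedSamples.contains(sampleName)) {
            return cy5 / cy3;
        }
        return cy3 / cy5;
    }
}
```

Marking only the dye-swapped samples lets them be compared on the same footing as the samples that were not dye-swapped.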
Figure 2.2.5 shows a screen illustrating a popup condition chooser
session. The set of all samples in the database is in the
scrollable "Remainder Samples" window in the upper left. The samples
you have selected for the condition list being edited are shown in the
upper right "Selected Samples in current condition" window. The list
of all conditions in the database is in the lower left "List of
Conditions" window. The current condition list that is selected is
highlighted and its contents displayed in the "Selected Samples"
window. User defined annotation associated with the current condition
is displayed in the right "Current Conditioned Annotation" window.
To add a new condition, click on the Add Cond button to define
the new condition name. The Remove Cond button is used to
delete a named condition list. The List Cond button pops up a
report listing the samples and annotation for the current
condition. The List All button pops up a report listing the
names of all of the conditions and the annotation names. You may add
or remove annotation names for all of the conditions. The Add
Ann button will add the new annotation you enter into all
conditions - you must enter the data for each condition that requires
it. You may press the Save button to save the current status of all
of the conditions into your working database. If you press
Cancel before saving, then your edits will not be
saved. Pressing the Done button will save the changes and
pop down the window.
Figure 2.2.6 shows a screen illustrating a popup ordered condition
list (OCL) chooser session. The set of all conditions in the
database is in the scrollable "Remainder Conditions" window in the
upper left. The conditions you have selected for the OCL being edited
are shown in the upper right "Selected Conditions in current OCL"
window. The list of all conditions in the database is in the lower
left "List of Conditions" window. The current OCL that is
selected is highlighted and its contents displayed in the "Selected
Conditions" window. User defined annotation associated with the
current OCL is displayed in the right "Current OCL Annotation"
window. To add a new OCL, click on the Add OCL button to
define the new OCL name. The Remove OCL button is used
to delete a named OCL. The List OCL button pops up a
report listing the conditions and annotation for the current OCL. The
List All button pops up a report listing the names of all
of the OCLs and the annotation names. You may add or remove
annotation names for all of the OCLs. The Add Ann button will
add the new annotation you enter into all OCLs - you must enter
the data for each OCL that requires it. You may press the Save button
to save the current status of all of the OCLs into your working
database. If you press Cancel before saving, then your edits will not
be saved. Pressing the Done button will save the
changes and pop down the window.
Sets of genes or HP condition lists are very useful for tracking
complex data-mining sequences of analysis. For example, derived named
gene sets may be used in successive data filters and for reports. As a
further example, one could do the following experiment given four
different types of HPs (e.g. virgin, pregnancy, lactation, and
involution)
The Edit menu contains the following main selections. All of these entities
and preferences are saved as part of the startup state when you
do a (File | Databases | SaveAs ... DB).
Figure 2.3.1 Edited Gene List defined from the Gene Name Guesser
using wildcards. The Edited Gene List was defined as the set of
genes whose names contain the sub-string "onco". The sub-string was
specified to the popup guesser window as "*onco*" using '*' characters
as wildcard symbols indicating that it should match any or no
characters. The button Gene Name may be toggled through a set
of other identifiers including Clone ID, UniGene ID, dbEST 3', dbEST
5', GenBank 3', and GenBank 5', LocusID, etc. depending on what
identifiers are available in your database. The user then pressed the
Set E.G.L. button on the guesser window that sets the E.G.L. to
those genes. If you have enabled the View menu "Show 'edited gene
list'" option, then the genes in the E.G.L. are shown as magenta squares
in the pseudoarray image. You may want to do additional editing to
manually add or remove genes that you want to change in the set. If a
2D scatter plot was being used, EGL labeled genes would appear there
as well. To select a particular gene as the current gene, click on the
gene you want in the list, then press the Done button.
If you are running MAExplorer in stand-alone mode, the current named
gene sets are saved when you save the DB using the Save disk DB
or Save as disk DB selections in the Databases submenu of the
File menu. The gene sets are saved in a State sub directory as
".cbs" files and are used to restore the gene sets when restarting
MAExplorer on a .mae startup file. The .mae startup file saves the
names of the .cbs files that are shared among the various startup
files for a given project. The implication then is that if you change
and save a gene set in one startup database, it will change in other
startup databases when they load that gene set. The advantage is that
different startup databases may view a gene set produced by another
database.
The Sets of genes operations in the Edit menu include:
The following is an example of List saved gene sets state
listing the catalog of named gene subsets in some of the MGAP
data. Note that sets #1 to #11 are fixed by the data in the GIPO file
and may not be changed by the user. Sets #12 to #14 are assignable
from other sets or in the case of the E.G.L, by various MAExplorer
operations. Sets #1 through #14 may not be removed whereas #15 and
higher may be removed.
The following figure illustrates selecting sets by name for gene set
operations.
Figure 2.3.2 Selection of gene sets for binary gene set
operations. This example computes the Boolean AND of two sets "ALL
NAMED GENES" and "60 genes closest to CA-III from Named and Ests", and
then the AND of the "Replicates" with the previous result. The first
result is saved in the set called "The 60 genes closest to Carbonic
Anhydrase-III". The second result is saved in the set called "Named
genes in the 60 genes closest to CA-III". Finally, the third result
is saved in the set named "Replicate genes in the 60 genes closest to
CA-III".
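The Boolean AND of two gene sets is a set intersection, and chaining it with a third set narrows the result further. The Java sketch below illustrates this with the standard collections classes; the class GeneSetOps is a hypothetical name, and representing genes by name strings is an assumption made only for the example.

```java
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical helper (not part of MAExplorer): Boolean AND of named gene sets. */
final class GeneSetOps {

    /** Returns a new set holding the genes present in both inputs; the inputs are not modified. */
    static Set<String> and(Set<String> first, Set<String> second) {
        Set<String> result = new LinkedHashSet<>(first);
        result.retainAll(second);
        return result;
    }
}
```

Chaining mirrors the figure: and(and(allNamedGenes, closestGenes), replicates) first keeps the named genes among the closest genes and then keeps only those that are also replicates, where the three variable names are placeholders for saved gene sets.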
The following is an example of List saved HP condition lists
state listing the catalog of named HP condition lists.
The following is an example of List contents of saved HP condition
list state.
The Font Family submenu is used to set
the text font family. This may be useful if your computer is missing
some fonts or some fonts are easier to read than others. Note: some
fonts may not work well on your computer. If this is the case, try
another font. When you save the data mining session with the "SaveAs
file DB", it also saves the font you have set. For some plots or popup
text-windows, you may have to regenerate the popup window to see the
font changes.
Figure 2.3.4.1 Popup window allowing you to adjust all threshold
slider values. The Adjust all Filter threshold scrollers
command allows you to pre-adjust all threshold slider values used in
data filtering and in clustering. It may be easier to set the
approximate range before invoking the clustering operation because
changing a parameter will recluster your data.
The Define HP-X (HP-Y) class name command may be used to change
the names of the HP-X (HP-Y) experimental condition sets. These names
are used in various labels in the main window, popup plots and
reports, etc. The commands to change various names of database
components are in the Preferences submenu in the Edit menu.
Figure 2.4 MAExplorer main window with Analysis Menu. The menu
structure of MAExplorer was designed to allow users to quickly perform
commonly used data-mining operations as a first approximation
analysis.
Figure 2.4.1.1 Gene Class menu. The user may select a subset of
genes that belong to one of the classes of genes. This shows the
user selecting the set of "All named genes" that are indicated with
red (white) circle over the spots in the array intensity (ratio)
pseudoarray image.
Figure 2.4.1.2 Example of all replicated genes occurring more than
once in the array. This was selected by using the GeneClass
'Replicate genes'. You may use the data Filter "Filter by genes with
replicates" instead of the GeneClass. This has the advantage that you
may use other GeneClasses (e.g. ESTs, or All named genes, etc.).
Alternatively, you can find all of the replicates for a particular
gene by 1) using the Gene Guesser to find the particular gene you want;
2) pressing "Set E.G.L." to save it as an Edited Gene List; 3) enabling
the Filter "Filter by E.G.L." at the same time. This will show all
occurrences of that gene.
The set of all genes is made up of a number of different gene
classes. It is possible to restrict the subsequent analysis to a
particular subset of these genes called a gene class. The
GeneClass menu operations include operations to select the
current set of genes to analyze from the set of all genes by their
membership in a gene class.
Some of the above gene classes are deduced from the gene name supplied
with the Gene In Plate Order (GIPO) file for the array. We use the
following automatic classification rules shown in Table 2.4.1.
Table 2.4.1 Rules for the automatic classification of gene names
into the default Gene Class sets. The gene name is analyzed
alphabetic-case independently.
Some quantification software (e.g. Research Genetics'
Pathways 2.01) measures background globally as: BGLow (low
background), BGAvg (average background), BGRms (root mean square
background). For MGAP, MAExplorer uses the BGLow value when you
request background subtraction. These values are read from the
MAExplorer Samples DB file (see Appendix
Table C.2.1.1). For other quantification programs, background may be
available on a per-spot basis in the quantification files. If the
latter is available in your data, it will be used if background
correction is enabled (see
Appendix C.3).
The background corrected intensity I'ij is
computed from the raw intensity Iij and background
intensity bkgrdHPi for H.P. i and spot j as
follows:
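The formula itself is not reproduced at this point in the text. As a hedged sketch only, the Java code below assumes that the correction is a plain subtraction of the per-sample background from each raw spot intensity; that assumption, the class name BackgroundCorrection and its methods are not taken from MAExplorer.

```java
/** Hypothetical helper (not part of MAExplorer): background correction by subtraction. */
final class BackgroundCorrection {

    /** Assumption: corrected intensity = raw intensity - background intensity. */
    static double corrected(double rawIntensity, double background) {
        return rawIntensity - background;
    }

    /** Applies one global background value (for example a BGLow value) to every spot j of a sample. */
    static double[] correctSample(double[] rawIntensities, double sampleBackground) {
        double[] result = new double[rawIntensities.length];
        for (int j = 0; j < result.length; j++) {
            result[j] = corrected(rawIntensities[j], sampleBackground);
        }
        return result;
    }
}
```

Under this assumption a spot whose raw intensity is below the background becomes negative, which is the situation the positive intensity filter is meant to handle.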
Figure 2.4.2.3 Scatter plot of HP-X and HP-Y 'sets' data. HP-X
is C57B6 pregnancy day 13 and HP-Y is Stat5a (-,-) pregnancy day 13
filtered by "All named genes and ESTs". A) A scatter plot
using the Median normalization. B) A scatter plot using the
Zscore of the logs normalization. Notice how the Casein alpha outlier
is more apparent in the case of the Zscore log normalization. The
skewed plot is characteristic of much microarray data. Some
normalization methods (not currently included in MAExplorer) can
compensate for some of these artifacts (Dutoit, 2000) and are planned for
future MAEPlugins.
Figure 2.4.3 Filter menu. The Filter menu is a cascade of data
filters: only genes passing all of the filters that
have been enabled, using whatever criteria were set for those
filters, are retained. This figure shows the GeneClass filter set to "All genes and
ESTs", the spot CV filter and Ratio (X/Y) range filters being set
interactively by the scroll bars on the right. The genes that pass
the filter are indicated with a red (white) circle in the array
intensity (ratio) pseudoarray image.
The Filter menu options are used to restrict the set of genes
by pre-filtering the data with a series of cascaded filter criteria
and tests. The resulting subset of genes passing the filter are then
used in the plots, reports and other data analysis methods. Some of
the filters require additional parameters that are set by the State
scrollers. The user will automatically be prompted for changes to
these scrollers (a threshold scrollers window will pop up) when the
filter is activated or changed. These values may also be set from the
Adjust all Filter threshold scrollers entry in the
Preferences submenu in the Edit menu. The filters are
broken up into subgroups in the following menu, with the grouping
based on the type of criteria (i.e. gene set membership, data
range, or statistical tests).
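The cascade idea, in which a gene is kept only if it passes every enabled filter, can be sketched in Java with java.util.function.Predicate. The class FilterCascade and its apply method are hypothetical names used for illustration and do not describe how MAExplorer implements its filters.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical helper (not part of MAExplorer): cascade of enabled data filters. */
final class FilterCascade {

    /** Returns the genes that pass all enabled filters, preserving their original order. */
    static <G> List<G> apply(List<G> genes, List<Predicate<G>> enabledFilters) {
        List<G> passing = new ArrayList<>();
        for (G gene : genes) {
            boolean passesAll = true;
            for (Predicate<G> filter : enabledFilters) {
                if (!filter.test(gene)) {
                    passesAll = false; // one failed test is enough to reject the gene
                    break;
                }
            }
            if (passesAll) {
                passing.add(gene);
            }
        }
        return passing;
    }
}
```

With no filters enabled every gene passes, and each additional filter can only shrink the resulting subset, which is then what the plots and reports would operate on.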
The Filter by positive intensity
data submenu filter contains options that specify which spot
intensity values are to be considered when excluding negative
quantified spot data. Note: this filter only makes sense if your data
might have negative values (e.g. Affymetrix chip "Avg Diff" data) or a
background corrected value that is less than 0.0. The filter is
enabled by setting the "Filter by spots with positive intensity"
checkbox. Negative intensity values may occur with some types of
array quantification programs. In the "Check spots for positive
values mode" submenu, you may set the samples where the test may be
applied to spots from the current HP, the single (HP-X,HP-Y) samples,
(HP-X,HP-Y) 'sets' (replicated spots), or samples in the HP-E list
selected to be used in the filter. If there are (F1,F2) or (Cy3/Cy5)
data, then each spot must meet the threshold criteria.
The Filter by Good Spot data submenu filter contains options
that specify spots based on their quality. It filters out genes that
do not have "Good Spot" values defined by the optional
QualCheck spot data (see the list of codes in Appendix C.4). If there is no
such spot quality data, then all spots are considered "good". The
filter is enabled by setting the "Filter by spots with Good Spot
values" checkbox. All spots for the specified samples must meet the
criteria. In the "Check spots for Good Spot mode" submenu, you may set
the samples where the test may be applied to spots from the current
HP, the single (HP-X,HP-Y) samples, (HP-X,HP-Y) 'sets' (replicated
spots), or samples in the HP-E list selected to be used in the filter.
The Filter by Spot Detection Value data submenu filter
contains options that specify spots based on their spot detection
value quality metric over the range of [0.0 : 1.0]. The filter is
available only if the data exists for your database and is ignored
otherwise. If active, it pops up a "Spot Detection Value" slider in
the range of [0.0 : 1.0]. Only spots greater than the slider value
pass the filter. This data could be the Affymetrix MAS5.0 "Detection
p-value" or some other metric correlated with spot detection quality.
The filter is enabled by setting the "Filter by per-sample Spot
Detection Value" checkbox. All spots for the specified samples must
meet the criteria. In the "Check spots for Spot Detection Value mode"
submenu, you may set the samples where the test may be applied to
spots from the current HP, the single (HP-X,HP-Y) samples, (HP-X,HP-Y)
'sets' (replicated spots), or samples in the HP-E list selected to be
used in the filter.
The Filter by spot intensity [SI1:SI2] sliders submenu contains
options that determine how individual spot intensity thresholding is
to be applied in the Filter.
The Filter by [I1:I2] sliders submenu contains options that
determine how spot expression (intensity or (Cy3/Cy5) ratio value)
thresholding is to be applied in the Filter:
The Filter by ratio or Zdiff sliders submenu contains options
that determine how spot-ratio thresholding is to be applied in the
Filter. The spot ratio is mean HP-X / mean HP-Y for sets of
samples. The spot Zdiff is used if one of the Zscore normalization
methods is active and is computed as (mean HP-X - mean HP-Y) for sets
of samples.
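The two quantities above may be sketched in Python as follows. This is
a minimal illustration, not MAExplorer code: the function names and the
sample values are hypothetical, and the inside-range test assumes the
sliders define a closed [low, high] interval.

```python
from statistics import mean

def spot_ratio(hp_x_values, hp_y_values):
    """mean HP-X / mean HP-Y for one gene over the two sets of samples."""
    return mean(hp_x_values) / mean(hp_y_values)

def spot_zdiff(hp_x_zscores, hp_y_zscores):
    """(mean HP-X - mean HP-Y) for Zscore-normalized values."""
    return mean(hp_x_zscores) - mean(hp_y_zscores)

def in_range(value, low, high):
    """Assumed slider test: keep the gene if low <= value <= high."""
    return low <= value <= high

ratio = spot_ratio([200.0, 220.0, 180.0], [100.0, 90.0, 110.0])
zdiff = spot_zdiff([1.2, 0.8, 1.0], [-0.5, -0.3, -0.4])
print(ratio, round(zdiff, 3), in_range(ratio, 1.5, 4.0))
```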
The Filter by Cy3/Cy5 HP-X ratio or Zdiff sliders submenu
contains options that determine how spot Cy3/Cy5 HP-X ratio
thresholding is to be applied in the Filter. The spot ratio is Cy3/Cy5
for normalized data unless one of the Zscore methods is used. In that
case, the Zdiff is used and is computed as (Cy3 - Cy5) for sets of
samples. If HP-X 'sets' is used, then it computes the mean Cy3 value
and the mean Cy5 value and uses those values in the above
computations.
The Filter by spot CV submenu filter contains options that
specify how the Coefficient Of Variation of the (F1,F2) or (HP-X,HP-Y)
'sets' (replicated spots) is to be used in the filter. The (F1,F2) CV
is available only if there are duplicate spots on the HPs.
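A CV filter of this kind may be sketched as follows. The sketch uses
the conventional definition of the Coefficient Of Variation (sample
standard deviation divided by the mean); this section does not spell
out the exact formula MAExplorer applies, so the definition, the
function names, and the 0.2 threshold are all assumptions for
illustration.

```python
from statistics import mean, stdev

def coefficient_of_variation(values):
    """Conventional CV: sample standard deviation divided by |mean|."""
    m = mean(values)
    if len(values) < 2 or m == 0:
        return None  # undefined for a single value or a zero mean
    return stdev(values) / abs(m)

def passes_cv_filter(values, max_cv):
    """Keep a gene only if its CV is defined and does not exceed max_cv."""
    cv = coefficient_of_variation(values)
    return cv is not None and cv <= max_cv

# Hypothetical replicate intensities for two genes
replicates = {
    "geneA": [100.0, 104.0, 96.0],   # tight replicates
    "geneB": [100.0, 180.0, 20.0],   # noisy replicates
}
kept = [g for g, v in replicates.items() if passes_cv_filter(v, 0.2)]
print(kept)
```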
Figure 2.4.3.1 Filtering using multiple scrollers. This example
is of Cy3/Cy5 time series data. It filters normalized spot intensity
of the Cy3 and Cy5 channels independently ([SI1:SI2] inside range)
where low intensity spots are eliminated. It then filters out genes
outside of the [R1:R2] ratio range.
Figure 2.4.3.2 Using the Positive Intensity data Filter.
This allows removing negative data if the data contains negative
intensity values (e.g. Some Affymetrix data has negative Average Difference
values which could be read as Intensity for MAExplorer).
You may switch between different representations of the microarray
spot pseudoarray image. It may be viewed as several different types of
pseudo images including an intensity gray value and a pseudo-color
Red/Black/Green image for ratio (HP-X/HP-Y) and Zscore (HP-X - HP-Y)
data. The p-Value results of comparing a HP-X 'set' with a HP-Y 'set'
of samples, or the CV of the HP-E 'list' can be displayed as a color
spectrum pseudoarray image.
Depending on the origin of the array data, it may have the same
verisimilitude as the original arrays. Otherwise, it is displayed in
a generic pseudoarray image containing grids that will fit the window
- these are not the same as the original array image. However, the
pseudoarrays are useful for getting a rough idea of the global changes
in the data between arrays and how many genes pass the data filter.
When enabled using one of the commands in the Clustering menu
(Section 2.4.5), cluster data appears as
blue circles or squares drawn as overlays on
the pseudoarray image. These options are discussed in the section on
clustering. If you are doing K-means clustering, the
current cluster is displayed in the scatter plot if the latter is
active.
Scatter plots, ratio and intensity histograms of the mean (HP-X/HP-Y)
or (HP-X/HP-Y) 'set' data, or the F1/F2 or Cy3/Cy5 data. F1/F2
or Cy3/Cy5 plots are available if the data exists in your particular
database. That might be the case with replicate spots or with Cy3/Cy5
data. If the normalization is set to a Zscore or log mean mode, it
will compute Zscore scatter plots and histograms.
Clicking on spots in an array image or points in scatter plots sets
the current gene and will bring up data on the gene or (optionally)
access corresponding data from GenBank, UniGene, mAdb Clone, etc. databases in a
popup Web browser. Clicking on a bin in a ratio or intensity
histogram plot filters out all genes except for those in the range of
that bin.
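Filtering by a selected histogram bin may be sketched as follows. This
is only an illustration under stated assumptions: equal-width bins, a
half-open bin range [low, high) except for the last bin, and the
hypothetical names bin_edges and genes_in_bin are not taken from
MAExplorer.

```python
def bin_edges(low, high, n_bins):
    """Split [low, high] into n_bins equal-width (low, high) ranges."""
    width = (high - low) / n_bins
    return [(low + i * width, low + (i + 1) * width) for i in range(n_bins)]

def genes_in_bin(values_by_gene, edges, bin_index):
    """Return the genes whose value falls in the selected bin."""
    lo, hi = edges[bin_index]
    last = bin_index == len(edges) - 1   # last bin also includes its upper edge
    return sorted(g for g, v in values_by_gene.items()
                  if lo <= v < hi or (last and v == hi))

# Hypothetical HP-X/HP-Y ratios
ratios = {"geneA": 0.4, "geneB": 1.1, "geneC": 1.9, "geneD": 3.6, "geneE": 4.0}
edges = bin_edges(0.0, 4.0, 4)
selected = genes_in_bin(ratios, edges, 1)   # the user "clicks" the second bin
print(selected)
```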
Expression profiles plots of selected genes or subsets of genes for
all samples in the HP-E list. These are active plots with data reported
when the user clicks in the plot.
Clicking on a spot (i.e. gene) in the microarray pseudo image or on a
point (i.e. gene) in the scatter plot defines that gene as the
"current gene" that is used in other operations. The current gene is
indicated in both plots with a green circle
around it. Similarly, you may modify
the 'Edited Gene List' from either the pseudoarray image or the
scatter plot. When viewing is enabled, it overlays those genes with
magenta squares.
Figure 2.4.4 Plot menu - selecting Ratio Pseudoarray
image. This displays a pseudocolor, shown in the scale on the left, that
indicates the ratio of the value of the HP-X sample / HP-Y sample (or
'sets' if the option to use HP-X and HP-Y 'sets' is enabled). If the
data is Cy3/Cy5 data, then this displays the ratio of the ratios using
the current normalization. Various other pseudoarray image
representations could be used.
If the database that was loaded contains only one sample, the
pseudoarray image display defaults to the pseudograyscale spot
intensity mode. If there is at least one HP-X and one HP-Y sample,
then the Pseudocolor HP-X/Y ratio or Zdiff mode is the initial
default display. If there are duplicate spots for each gene, you may
generate a Pseudocolor F1F2 ratio or Zdiff mode image. If you
are using Cy3/Cy5 ratio data and the data is available as independent
channels for each HP, then you may plot Cy3 vs Cy5 for individual
samples.
When available on the database server, the original image may be
displayed in a separate popup Web browser.
Table 2.4.4.1. Pseudocolors assigned to spots to represent data in
the X/Y ratios or X-Y Zdiffs pseudocolor array images. Each color
represents the normalized X/Y ratio or X-Y Zdiff depending on
Normalization mode. The 9 colors of the boxes represent the normalized
expression ranges.
The same data is shown in a variety of normalization and display formats.
Figure 2.4.4.1.1.1 Pseudoarray intensity image of median normalized
intensities of the current HP sample (C57B6 virgin 10 weeks from MGAP
database). The graylevel scale on the left edge of the pseudoarray
image indicates the spot intensity. All pseudoarray images have scales
that vary depending on the type of pseudoarray being displayed.
Figure 2.4.4.1.1.2 Pseudoarray intensity image of Zscore normalized
intensities of the current HP (C57B6 virgin 10 weeks from MGAP
database).
Figure 2.4.4.1.1.3 Pseudoarray intensity image of ZscoreLog normalized
intensities of the current HP (C57B6 virgin 10 weeks from MGAP database).
Figure 2.4.4.1.1.4 Pseudoarray intensity image of ZscoreLog
normalized intensities of the dual HP-X and HP-Y individual
samples. The Plot menu Show Microarray submenu toggle "Use dual
HP-X & HP-Y samples" option is set. HP-X is a C57B6 pregnancy day
13 and HP-Y is a Stat5a (-,-) pregnancy day 13.
Figure 2.4.4.1.1.5 Pseudoarray intensity image of ZscoreLog
normalized intensities of the dual HP-X and HP-Y sample 'sets'. The
Plot menu Show Microarray submenu toggle "Use dual HP-X & HP-Y
samples" option is set, as is the "Use HP-X & HP-Y 'sets'" option in the
Samples menu. HP-X is the mean of three 'C57B6 pregnancy day 13' and
HP-Y is the mean of three 'Stat5a (-,-) pregnancy day 13'.
Figure 2.4.4.1.2.1 Pseudocolor array image of median normalized X/Y
ratios. HP-X is C57B6 pregnancy day 13 and HP-Y is Stat5a (-,-)
pregnancy day 13. Each spot's color represents the normalized X/Y
ratio depending on Normalization mode. The color of the box is one of
9 colors representing the normalized expression ranges and assigned
according to the table "Ratio
mode".
Figure 2.4.4.1.2.2 Pseudoarray color image of normalized X/Y 'set'
mean value ratios. Mean of three HP-X C57B6 pregnancy day 13
samples and mean of three HP-Y Stat5a (-,-) pregnancy day 13 samples.
Each spot's color represents the normalized X/Y 'set' ratios depending
on Normalization mode. The color of the box is one of 9 colors
representing the normalized expression ranges and assigned according
to the table "Ratio mode".
Figure 2.4.4.1.2.3 Pseudoarray color image of X-Y Zdiffs. HP-X
is C57B6 pregnancy day 13 and HP-Y is Stat5a (-,-) pregnancy day 13.
Each spot's color represents the normalized X-Y Zdiff using
the Zdiff normalization mode. The color of the box is one of 9 colors
representing the normalized expression ranges and assigned according
to the table "Zdiff mode".
Figure 2.4.4.1.2.4 Pseudoarray color image of X-Y Zdiff of log
data. HP-X C57B6 pregnancy day 13 sample and HP-Y Stat5a (-,-)
pregnancy day 13 sample. Each spot's color represents the normalized
X/Y ratio depending on ZdiffLog with StdDev normalization mode. The
color of the box is one of 9 colors representing the normalized
expression ranges and assigned according to the table "ZdiffLog mode".
Figure 2.4.4.1.2.5 Pseudoarray image showing color-coded p-values
for t-test comparison of HP-X and HP-Y 'set' samples. The HP-X and
HP-Y sets both have 2 samples each (more is obviously much better).
The data was normalized using the Median and a spot intensity
[SI1:SI2] data filter was applied to eliminate some of the noisy data.
Each spot's color represents a p-value in the range indicated in the
scale in the left edge of the image. Note that although all spots are
assigned a p-Value, many may not be very significant without adequate
preprocessing of the data (such as normalization and low intensity
spot removal). So use this display with care.
Figure 2.4.4.2 Scatter plot of HP-X and HP-Y single sample
data. HP-X is C57B6 pregnancy day 13 and HP-Y is lactation day 1.
A) An active scatter plot may be generated for the current HP-X
and HP-Y samples filtered by "All named genes". B) similar
plot for HP-X and HP-Y 'sets' of replicate samples (3 pregnancy and 4
lactation samples in the sets respectively). Clicking on a point in
the plot sets the current gene. C) Zoomed up region (of
B) at the bottom of the plot showing more detail and filtered
by just "All named genes". Zooming is performed by adjusting the X or
Y axes limits scroll bars. Note the points enclosed in magenta boxes
indicate genes in the E.G.L. gene list.
Figure 2.4.4.2.1 Scatter plot of multiple channel data from a
single sample. A) F1 Vs F2 data for a C57B6 pregnancy day 13
sample. B) Cy3 vs Cy5 data for a NCI mAdb mouse array sample.
C) Scatter plot of
individual Cy3 channel of HP-X compared with Cy3 channel of HP-Y for
ratio Cy3/Cy5 data hybridized samples. D) Scatter plot of
individual Cy3 channel of HP-X compared with Cy5 channel of HP-Y for
ratio Cy3/Cy5 data hybridized samples.
The Intensity selection plots a histogram of the gene intensity
data values for each Filtered spot (gene) in the current hybridized
sample.
The Histograms submenu includes:
If Cy3/Cy5 ratio data is being analyzed, then the F1F2 histogram menu
entry becomes
Figure 2.4.4.3 Histogram plots. A) Ratio histogram of
HP-X/HP-Y data with particular histogram bin selected with the
constraint set to filter all genes > that bin. HP-X is 13 day
pregnancy C57B6 and HP-Y is day 1 lactation. The selected bin
thresholds are then used in the Filter with the resulting Filtered
genes shown in the array image. B) Zdiff histogram of HP-X -
HP-Y 'sets' for same data as (A) but with the ><
threshold constraint set to find genes outside of the symmetric
histogram range. C) Intensity histogram of HP-X data filtered
by [I1:I2] intensity range. As with ratio histograms, you can do
additional filtering by selecting a particular histogram bin that is
then used in the Filter. Filtering was disabled for the intensity
histogram. To apply the filter, the "Don't re-Filter" button would be
toggled to the "Re-Filter" state. The threshold constraints include:
=, >, <, >, <>, and ><. Note that each time
you click on the "Thr:" button, it cycles to the next option in the
threshold constraints list.
You may generate as many individual expression profile plots as you
want using the Display a gene's expr. profile for HP-E
command. However, only the last one will be active and will be updated
with different genes as you click on them in the microarray image
or scatter plot. This could be used to compare the EP plots for several
different genes. First view the EP plot for one gene, then create a
new EP plot for the second gene, etc.
If you use the Display Filtered genes expr. profiles
for HP-E command, it will generate a scrollable list of
expression profile plots for all of the genes passing the Filter. If
the number of genes is very large, it may take a while.
You may interrogate a line corresponding to a particular HP sample in
an EP plot by moving the mouse over the line and then selecting the
line. This will cause the name of the HP, its intensity and CV to
appear in the plot. If the Err check box is set, then the mean
of the intensity is indicated by a short horizontal bar and the +- CV
by red vertical error bars above and below the mean. Pressing the
plot style Line button repeatedly cycles the plot style through
Line (vertical bars for each point), Circle (small circles at each
point), and Curve (circles connected into a continuous curve over all
samples). In the case of mean expression profiles
used in K-means clustering, the standard deviation is used in
place of the CV value. The various clustering methods have EP
plots buttons. When they are invoked, the scrollable list of EP
plots is sorted by the clustering method ordered list of genes. This
enables you to view the data in the same order as that produced by the
cluster analysis. If the zoom nnX button is pressed,
then all of the plots are magnified by nn-fold to make low intensity
plots more visible. Pressing the button repeatedly cycles through:
1X, 2X, 5X, 10X and 20X. It does not change the data itself. The
Show HP names button pops up a numbered list of all HP entries
used in the expression profile. If you are in stand-alone mode, a
SaveAs GIF button will also be available for the EP overlay
mode (Figure 2.4.4.4.1) or individual EP plot. This saves the current
plot as a full resolution GIF file specified by the user in a popup
file browser window.
The Expression profile plots submenu contains:
Figure 2.4.4.4 Expression profile plots. A) Individual
expression profile plots may be created by clicking on any
gene. Multiple instances may also be created. Here we show some of the
presentation options for the 38 sample MGAP database. Error bars are
computed for the standard error for that sample. There are three
different plotting options: line, circle and curve. #1 is the default
line plot with error bars. #2 is the line plot without the error bars
but clicking on line 7 to find out which sample it is and what the
intensity value is. #3 is the circle plot with error bars, and #4 is
the curve plot without error bars. Window #5 shows the list of samples
corresponding to the 38 points in the EP plots. B) List of
EPplots of the oncogenes and proto-oncogenes in the database (set by
the guesser with "onco" and "Set E.G.L." and the Edited Gene List
Filter). The list would become scrollable if there were more than 10
profiles. Setting the current gene would scroll the list to the EPplot
for the current gene.
Figure 2.4.4.4.1 Expression profile plots.
A) Scrollable list of EP plots of Filtered named genes centered at
Carbonic anhydrase III.
B) Overlay plot of all named Filtered genes.
C) Overlay plot of all ONCO or PROTO-ONCO genes with the
draw EGL option active so the graphs are drawn for these genes.
When enabled, cluster data appears as blue circles
or squares drawn as overlays on the pseudoarray image. These
options are discussed in the section on clustering.
Cluster analysis plots include finding a subset of genes or subsets of
samples based on cluster analysis of expression profile similarity
measures. These show genes belonging to particular clusters, or genes
that cluster well with specified genes. Cluster methods include:
finding genes similar to the current selected gene within a "distance"
threshold; K-means-like clustering where you specify a seed gene and
the number of clusters; and hierarchical clustering with clustergram
and dendrogram graphics.
Figure 2.4.5 Cluster Menu options. The hierarchical clustering
option is being selected.
There are many methods for doing clustering - each with advantages and
disadvantages. We present three methods in MAExplorer and plan on
adding a variety of more powerful methods through the MAEPlugin
facility under development.
These methods may find genes belonging to particular clusters or genes
that cluster well with particular genes. Gene clusters are sets of
genes whose expression profiles are found to be similar according to a
particular metric. We now define what we mean by "similar". The ordered
list of hybridized samples used in computing the expression profiles
is the HP-E list. MAExplorer has two different dissimilarity
measures for Cij: Euclidean distance LSQdistij and Pearson correlation
coefficient rij. These are computed as
follows and are tested against the cluster distance threshold (set by
the slider in the preferences sliders). Let n= |HP-E|, the number of
samples in the expression profile. We define similarity as (1.0 -
normalized dissimilarity).
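The two measures may be sketched in Python as follows, using the
standard textbook definitions of Euclidean distance and the Pearson
correlation coefficient over the n samples of two expression profiles.
This is an illustration only and not MAExplorer's code. In particular,
the step that maps r into a [0.0 : 1.0] dissimilarity, (1 - r)/2, is an
assumption made here so that the similarity = 1.0 - normalized
dissimilarity relation can be shown. The function names are
hypothetical.

```python
import math

def euclidean_distance(p, q):
    """Euclidean distance between two expression profiles of length n."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def pearson_r(p, q):
    """Pearson correlation coefficient between two expression profiles."""
    n = len(p)
    mp, mq = sum(p) / n, sum(q) / n
    cov = sum((a - mp) * (b - mq) for a, b in zip(p, q))
    var_p = sum((a - mp) ** 2 for a in p)
    var_q = sum((b - mq) ** 2 for b in q)
    return cov / math.sqrt(var_p * var_q)

def similarity(normalized_dissimilarity):
    """Similarity as defined in the text: 1.0 - normalized dissimilarity."""
    return 1.0 - normalized_dissimilarity

# Hypothetical profiles over n = 4 samples in the HP-E list
gene_i = [1.0, 2.0, 3.0, 4.0]
gene_j = [2.0, 4.0, 6.0, 8.0]   # same shape as gene_i, larger scale
gene_k = [4.0, 3.0, 2.0, 1.0]   # opposite trend to gene_i

r_ij = pearson_r(gene_i, gene_j)
r_ik = pearson_r(gene_i, gene_k)
# Assumed normalization of the correlation into a [0, 1] dissimilarity
print(similarity((1.0 - r_ij) / 2.0), similarity((1.0 - r_ik) / 2.0))
print(euclidean_distance(gene_i, gene_j))
```

Note that gene_i and gene_j are perfectly correlated even though their
Euclidean distance is not zero, which is one reason the choice of
measure matters.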
The Cluster plots submenu contains a number of clustering
methods. Pressing the Escape key during a long cluster operation will
abort the operation. If you are in stand-alone mode using the
ClusterGram, a SaveAs GIF button will also be available for
saving the current plot as a full resolution GIF file specified by the
user in a popup file browser window.
The Hierarchical Cluster plots submenu contains:
Figure 2.4.5.1 Similar genes clustered to the current gene.
This method finds all genes that are similar to the current gene,
defined as those whose expression profile distance from it is less
than the threshold set by the user. Each gene that passes the cluster
distance threshold test is indicated in the image with a blue square where the size of the square is
proportional to its similarity. This data is from the 38 samples in
the MGAP database containing duplicated spots. A) Main windows
with popup cluster similarity report and cluster distance threshold
slider. B) Scrollable list of EPplots of similar genes with the
red error bars indicating the variation for duplicated spots for each
HP sample. The Err checkbox may turn the error bar overlays on
and off.
For both of these commands, if you want to view the expression profile
plots, click on the EP plot button in the cluster window and it
pops up the scrollable expression profiles window. If you click on a
gene in the image, it will select it as the new current gene and seed
gene and recompute the cluster of genes most similar to the new seed
gene.
For both of these commands, if you want a permanent report, click on
the "Cluster Report" button in the cluster window and it will generate
a report in the current modality (i.e. scrollable spreadsheet or
tab-delimited). You may switch between these two modes by pressing
the "Go '...'" button in the report.
Figure 2.4.5.2 Display of cluster counts for all genes less than
the cluster threshold from MGAP 38 sample database. The algorithm
counts the number of similar genes for each Filtered gene and draws
a blue circle whose size is proportional to
the number of genes similar to that gene. That is why there are a larger
number of the larger circles.
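The similar genes search and the cluster counts display described in
this section may be sketched as follows. The sketch assumes Euclidean
distance between expression profiles and a strict "less than the
threshold" test. The function names, profiles, and threshold value are
hypothetical and chosen only for illustration.

```python
import math

def distance(p, q):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def similar_genes(profiles, seed, threshold):
    """Return (gene, distance) pairs closer than threshold to the seed gene,
    nearest first."""
    seed_profile = profiles[seed]
    hits = [(g, distance(seed_profile, p))
            for g, p in profiles.items() if g != seed]
    return sorted((h for h in hits if h[1] < threshold), key=lambda h: h[1])

# Hypothetical expression profiles over three samples
profiles = {
    "seed": [1.0, 2.0, 3.0],
    "near": [1.0, 2.0, 3.5],
    "mid":  [2.0, 2.0, 3.0],
    "far":  [9.0, 0.0, 1.0],
}

hits = similar_genes(profiles, "seed", 1.5)

# Cluster counts: the number of similar genes found for each gene in turn
counts = {g: len(similar_genes(profiles, g, 1.5)) for g in profiles}
print(hits, counts)
```

Selecting a different gene as the seed simply repeats the search with
that gene's profile, which corresponds to recomputing the cluster when
you click on a new gene in the image.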
Figure 2.4.5.3 Genes clustered using the K-means cluster
method. A) Using the current gene as the initial cluster,
MAExplorer finds N orthogonal clusters assigning the set of filtered
genes to these clusters using the HP-E expression profiles. All
genes are iteratively assigned to these clusters. Genes belonging to
the current cluster are labeled with a green cluster number both in
the array and in the scatter plot. The slider determines the number of
clusters (set to 6 here). A 2D scatter plot shows the genes belonging
to cluster 6. The K-means cluster report on the right contains a sorted
list of the genes in each cluster and has buttons to generate EP
plots and reports as well as summary mean EP plots (shown) and mean
cluster reports. The detailed list is shown below. B) Part of
the scrollable EP plots for this data showing genes belonging to both
clusters #5 and #6. C) The mean EP plots for the 6 clusters.
We call the genes closest to the "center" of the K clusters primary
genes and they are reported with additional information. The "Cluster
[# genes]" entries in the distance-to-cluster fields indicate that
these genes are the center of the clusters (i.e. primary genes). The
distNext is the distance from this cluster center to the next nearest
K-means cluster center. The number of clusters N (6 in this example)
is set in the popup state scroller. If you change the value of N, it
will recompute the clusters and the primary genes.
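A K-means-like procedure of this general kind may be sketched as
follows. This is not MAExplorer's algorithm: the way the initial
centers are chosen (start at the seed gene, then repeatedly take the
gene farthest from all centers chosen so far), the use of Euclidean
distance, and all names are assumptions for illustration. The sketch
does show the quantities named above: cluster assignment, the primary
gene nearest each center, and distNext as the distance from a center to
the nearest other center.

```python
import math

def dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def mean_profile(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]

def kmeans_like(profiles, seed_gene, n_clusters, max_iter=20):
    genes = sorted(profiles)
    # Assumed seeding: the seed gene, then farthest-first selection.
    centers = [profiles[seed_gene]]
    while len(centers) < n_clusters:
        far = max(genes, key=lambda g: min(dist(profiles[g], c) for c in centers))
        centers.append(profiles[far])
    assign = {}
    for _ in range(max_iter):
        # Assign every gene to its nearest center.
        new_assign = {g: min(range(n_clusters),
                             key=lambda k: dist(profiles[g], centers[k]))
                      for g in genes}
        if new_assign == assign:
            break  # assignments are stable
        assign = new_assign
        # Move each center to the mean profile of its members.
        for k in range(n_clusters):
            members = [profiles[g] for g in genes if assign[g] == k]
            if members:
                centers[k] = mean_profile(members)
    # Primary gene: the member closest to its cluster center.
    primary = {}
    for k in range(n_clusters):
        members = [g for g in genes if assign[g] == k]
        if members:
            primary[k] = min(members, key=lambda g: dist(profiles[g], centers[k]))
    # distNext: distance from each center to the nearest other center.
    dist_next = {}
    if n_clusters > 1:
        dist_next = {k: min(dist(centers[k], centers[j])
                            for j in range(n_clusters) if j != k)
                     for k in range(n_clusters)}
    return assign, primary, dist_next

# Hypothetical two-sample profiles forming three obvious groups
profiles = {"g1": [0.0, 0.0], "g2": [0.0, 1.0], "g3": [10.0, 10.0],
            "g4": [10.0, 11.0], "g5": [20.0, 0.0], "g6": [21.0, 0.0]}
assign, primary, dist_next = kmeans_like(profiles, "g1", 3)
print(assign, primary, dist_next)
```

Changing n_clusters and calling the function again corresponds to
moving the slider, which recomputes both the clusters and the primary
genes.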
It draws magenta circles around the
primary genes in the microarray and the cluster number to the right of
the circle. The size of a circle corresponds to the number of genes
clustered with that circle. If you click on a gene belonging to any
cluster, it defines that cluster as the "current cluster". It will
change the labels of the subset of genes that belong to the current
cluster from red (white) circle to a green (yellow) cluster number of the
current cluster in the intensity (ratio) pseudoarray image. In addition,
the 'edited gene list' is set to the subset of genes that belong to
the current cluster. If you are also displaying a scatter plot, genes
in the current cluster have their red '+' characters changed to the
cluster number.
You can click on that gene in the array image to determine its
identity. You may also popup an ordered (same as the above report)
plot of the clusters expression profiles by clicking on the EP
plot button. You may plot the mean expression profiles of the N
clusters using the Mean EP plot button. You may generate a
report of all of the clustered genes or of the mean clusters using the
Cluster-Report or Mn-Cluster-Report buttons
respectively. If you change the Filter conditions, you may recompute
the clusters using the Recompute Clusters button. Closing the
text window will remove the magenta
circles. If you selected the current cluster, the genes that
belong to it will still be available in the 'edited gene list' for
making reports, saving as a gene subset or for additional gene
filtering. If you press the SaveAs GeneSets button, then K gene
sets are created with the names "Cluster#1", "Cluster#2", ...,
"Cluster#K". You can then save or rename the clusters you want and
delete the rest. If you press the ClusterGram button, it
displays the gene sets as a clustergram ordered the same way as
the cluster report.
Clustering is represented by a binary tree and is visualized as an
ordered gene clustergram and optional dendrogram sub-plot. This is
similar to the methods of (DeRisi, 1996), (Eisen, 1998), and
(White, 1999). Currently, MAExplorer does 1-way clustering - not the
2-way clustering of (Weinstein, 1998) and (Eisen, 1998). Each row of the
clustergram represents a gene and each column represents a HP in the
HP-E list of samples. Each box in a row represents the normalized
expression of that gene for the HP represented in that column. The
color of the box is one of 9 colors representing the normalized
expression ranges and assigned according to the following table:
Table 2.4.5.4. ClusterGram pseudocolor assignments. The
colors are assigned to "box" entries in the clustergram corresponding
to genes. The color represents data as either the X/Y ratio or X-Y
Zdiff relative to the normalizing HP.
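The idea of building a binary tree and reading the clustergram row
order from it may be sketched as follows. The sketch uses
average-linkage agglomerative clustering on Euclidean distance, which
is an assumption: this section does not state which linkage MAExplorer
uses. All names and data are hypothetical.

```python
import math

def dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def average_linkage(c1, c2, profiles):
    """Mean pairwise distance between the members of two clusters."""
    pairs = [(a, b) for a in c1 for b in c2]
    return sum(dist(profiles[a], profiles[b]) for a, b in pairs) / len(pairs)

def hierarchical_cluster(profiles):
    """Return (tree, ordered_genes).

    A leaf is a gene name; an internal node is (left, right, distance).
    ordered_genes is the left-to-right leaf order, i.e. the row order
    a clustergram would use.
    """
    clusters = [(g, [g]) for g in sorted(profiles)]
    while len(clusters) > 1:
        best = None
        # Find the closest pair of clusters.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = average_linkage(clusters[i][1], clusters[j][1], profiles)
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merged = ((clusters[i][0], clusters[j][0], d),
                  clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0]

# Hypothetical profiles: two tight pairs that are far from each other
profiles = {"g1": [0.0, 0.0], "g2": [0.0, 1.0],
            "g3": [10.0, 10.0], "g4": [10.0, 12.0]}
tree, order = hierarchical_cluster(profiles)
print(order)
print(tree)
```

The distance stored at each internal node is what a dendrogram draws as
the height of the join, so a cluster distance threshold selects the
subtrees whose joins fall below it.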
Figure 2.4.5.4 Hierarchical clustering clustergram of genes
filtered by ratio histogram bins for 19 samples from the MGAP data
set. The hybridized samples are drawn as colored boxes in the 19
columns. Rows of boxes correspond to gene expression profiles. In
A), the set of all genes and ESTs was filtered by the CV filter
set to 0.387 and the normalization was the Zscore. The gene "Mus
musculus D9 splice variant 2 mRNA, complete cds" was selected as the
current gene in the clustergram. Data for this gene and the selected
HP column is indicated at the top of the clustergram. The list of the
19 samples is shown on the left. B) Details of clustergram and
dendrogram are shown where the user had selected a cluster distance
threshold at "Mouse mRNA for mitochondrial cytochrome c oxidase
subunit Vb" in the dendrogram part of the plot (zoomed by 2X). This
selection draws in red all parts of the dendrogram tree that are less
than this distance. C) shows the manual selection
of genes from the ClusterGram or Dendrogram by clicking on the gene
names you wish to capture in the Edited Gene List (EGL) while the
Control key is pressed. The zoomed subregion shows three genes in the
same cluster that were selected (magenta stars in the right edge of
the ClusterGram).
Figure 2.4.6 Reports menu. You may create either dynamic or
tab-delimited text reports of either Samples or of subsets of genes.
These may be presented as interactive dynamic tables as well as
scrollable text windows capable of being exported to Excel. If Web DB
access is enabled, clicking on an entry will bring up a Web browser
with access to GenBank data. If the report contains Clone ID as one
of the fields, you can click on it to have it define that gene as the
current gene and highlight it in the microarray image or scatter plot
(if it is being used). The reports are divided into two types - those
dealing with lists of arrays (i.e. the sample experimental condition)
and those dealing with lists of genes.
The Report menu includes:
The "Samples vs Samples correlation coefficients" computes the correlation
coefficients in an upper diagonal matrix for the current set of
Filtered genes showing HP samples similarity. The entries are of the
following form where HP:1 and HP:2 correspond to samples listed in the
field names of the table and the data is the intensity values using
the current normalization method.
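An upper diagonal table of this kind may be sketched as follows. The
sketch uses the standard Pearson correlation coefficient over the
Filtered genes and stores only the entries above the diagonal. The
function names and the three small samples are hypothetical, and
whether MAExplorer also reports the diagonal is not assumed here.

```python
import math

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def sample_correlations(intensities):
    """intensities maps a sample name to its normalized intensities, one
    per Filtered gene, in the same gene order for every sample."""
    names = list(intensities)
    table = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:          # only pairs above the diagonal
            table[(a, b)] = pearson_r(intensities[a], intensities[b])
    return table

# Hypothetical normalized intensities for four Filtered genes
samples = {
    "HP:1": [1.0, 2.0, 3.0, 4.0],
    "HP:2": [2.0, 4.0, 6.0, 8.0],
    "HP:3": [4.0, 3.0, 2.0, 1.0],
}
table = sample_correlations(samples)
for (a, b), r in table.items():
    print(a, b, round(r, 3))
```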
The "Calibration DNA summary" table contains the computed means,
std-dev, and computed normalization scale factor for all active
hybridized samples. The scale factors are used if the 'Calibration
DNA' normalization is used.
You must set the Web access checkbox if you want to click on a blue
hyperlink in the resulting report to access an associated Web
database.
Figure 2.4.6.1 Hybridized samples dynamic Report windows. A)
Samples Info report. B) Sample Web links. Clicking on a blue
hypertext link brings up the corresponding genomic Web database entry
in a separate Web browser window if the Web access is enabled. The
tab-delimited version of the same reports (not shown) may be cut and
then pasted into other programs such as an Excel spreadsheet.
C) HP vs HP correlation table on genes passing the data
Filter for all samples in the HP-E list.
If Cy3/Cy5 ratio data is being analyzed, then the Highest (Lowest)
F1/F2 entries become
Figure 2.4.6.2 Gene Report windows of 50 named genes with highest
HP-X/HP-Y 'set' ratios. A) Dynamic gene report of 50
genes with highest HP-X/HP-Y 'set' ratios. A similar report may be
generated for the lowest ratios or for single HP-X/HP-Y samples. This
type of report may be generated for the highest or lowest Zdiff values
when the Zscore normalizations are used. Clicking on a blue hypertext
link brings up the corresponding genomic Web database entry in a
separate Web browser window if the Web access is enabled. It also sets
the current gene to the gene for that row. B) The
tab-delimited version of the same report may be cut and then pasted
into other programs such as an Excel spreadsheet.
Figure 2.5 View Menu options. These are divided into various
options for modifying the presentation as well as recording activity
such as the messages or history popup scrollable log windows.
Figure 2.5. Popup genomic browser database page. A) The
UniGene Web page pops up in a new Web browser window when the user
clicks on a gene in the array image, 2D scatter plot or Report and the
"Display current gene in Unigene Web Browser" toggle
is enabled in the View menu. The current gene was "Jun-B oncogene".
Alternatively, the B) mAdb Gene DB may be selected - as well as
GenBank or dbEST genomic databases. C) Alternatively, data from
the NCBI LocusLink database may be accessed if either the GenBank ID or
LocusID is available.
Figure 2.5.2 Examples of messages and command history popup log
windows. Measurements and other activity are shown in more detail
in the messages window whereas the command history indicates commands
(numbered in the order they are executed) in the command history window.
Data from either of these windows may be saved in text log files.
Figure 2.6 MAEPlugins paradigm. If you have a MAEPlugin .jar
file, then it may be specified using the "Load plugin" command. When
you invoke the command from the menus (or other methods), it accesses
any data it may need from the current MAExplorer database through the
Open Java API.
The Save RLO reports in time-stamped Report/ folder [CB]
option puts files generated by R from successive executions of the
same RLO into separate sub-folders in the Report/ folder with names
"RLOname-YYMMDD-HHMMSS/" to keep the data separate. This is
useful when you want to compare results from the same RLO method but
with different MAExplorer preprocessing.
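The folder naming pattern may be illustrated with a short Python
sketch. The function name and the RLO name "myRLO" are hypothetical,
and the sketch only reproduces the "RLOname-YYMMDD-HHMMSS/" pattern
quoted above, not MAExplorer's own code.

```python
from datetime import datetime

def rlo_report_folder(rlo_name, when):
    """Build a Report/ sub-folder name of the form RLOname-YYMMDD-HHMMSS/."""
    return f"Report/{rlo_name}-{when.strftime('%y%m%d-%H%M%S')}/"

# Two runs of the same (hypothetical) RLO land in different sub-folders.
first = rlo_report_folder("myRLO", datetime(2004, 3, 9, 14, 5, 30))
second = rlo_report_folder("myRLO", datetime(2004, 3, 9, 14, 20, 0))
print(first)
print(second)
```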
You may download the latest versions of all plugins using the (File |
Update Plugins from
maexplorer.sourceforge.net) menu command. Similarly, you can
update your versions of the RLO methods using the (File |
Update RLO methods from maexplorer.sourceforge.net) menu command.
Figure 2.6.1 Loading a MAEPlugin from your file system using
the Load Plugins command in the Plugins pull down menu. If you
have a plugin .jar or .class file, it may be specified using the "Load
plugin" command. This pops up a file browser to let you specify the
plugin file.
Figure 2.6.2 Executing the new command previously loaded in the
Plugin menu. Selecting the new "Show List Active Filters" command
that now appears in the Plugins menu invokes the plugin. This pops up
a report shown in the next figure.
Figure 2.6.3 Popup window from executing the MAEPlugin.
This plugin gives a full report on the data Filter status in a new
pop up window.
Figure 2.6.2 Plugins menu - executing a previously loaded
plugin. Plugins that do not go into particular MAExplorer
submenus go into the Plugins menu. Selecting the command will
invoke that MAEPlugin.
The Help menu includes:
This section briefly addresses some of the issues you need to
consider. However, a full discussion of the issues involved is beyond
the scope of this manual. These issues are covered in other more
focused statistical methods literature and you might also address them
in consultation with biostatisticians. The Internet has vast resources
for microarrays. A few to get you started might include: a microarray
citation electronic library
http://arrayit.com/e-library/, the National Library of Medicine PubMed
journal search engine, and a general microarray Listserv
GENE-ARRAYS@ITSSRV1.UCSF.EDU. The MGED group (Brazma, 2001) has published the MIAME
(Minimum Information About a Microarray
Experiment) standard. This information is useful in doing an analysis. Also
try searching using general Internet search engines. There are a
number of public microarray data repositories. One that we find useful
is NCBI's GEO (Gene
Expression Omnibus), that contains array data and MIAME compliant
information about the arrays.
A good and appropriate experimental design (i.e. the design and
setting up of experiments to subsequently be analyzed) is critical for
resolving significant differences in gene expression between
experimental conditions. We touch on some of the issues here. (Simon, 2001), (Dudoit, 2000), and Kerr and
Churchill (2001a, 2001b) discuss some of the issues
of experimental design for microarrays. We do not currently implement
the Kerr-Churchill method. However, some of the issues involved in experimental design based on the types
of arrays are discussed in Section 3.1.1 for (Cy3/Cy5)-labeled as well
as 33P-labeled samples.
If users are comparing two different types of samples, the analysis
would be different than if they were comparing an ordered sequence of
samples (e.g. time series, cell cycle, dose-response, tumor-stage,
etc.). MAExplorer gives users the ability to:
Briefly, data mining is the discovery of potentially interesting
patterns in the data that were previously unknown. One approaches the
analysis of a set of data with minimal expectations. However, some
idea of what you are interested in helps focus the search. But beware
of the trap of mining the data until you get the results you hope
for. The following figure helps illustrate this process.
Figure 3.1 Flow chart of a typical data mining session. The
user makes some initial decisions on the experimental design such as
which hybridized samples to compare, the type and numbers of
replicates. They then make initial guesses as to the normalization
method to use, and the gene subset (the gene class) to concentrate on
when setting the data filter. The data is viewed in various modalities
to get a feeling for its inherent dynamic range and where interesting
outliers might appear. Clustering and plots help bring these
differences into view. The results are then evaluated and either the
process is finished or the views are refined by adjusting data
normalization and filter parameters, data subsets to be investigated,
clustering methods, plots etc. and the process repeated until the user
is able to see the differences between gene subsets more clearly or no
significant differences appear to be found.
Obviously, this approach is a first approximation to what is
eventually required. But it does capture the flavor of the data-mining
process. Typically the user would refine the search using variations
of the data filters and might contrast (using gene sets and hybridized
sample condition lists operations) results found under one set of
conditions with those found under another set of conditions.
Proper experimental design of microarray experiments is critical to
successful use of microarray data. Several recent reports discuss some
of the key issues involved in various aspects of statistical analysis
of microarrays: (Radmacher,
2001), (McShane, 2001),
(Korn, 2001), (Simon, 2001), (Dudoit, 2000).
An alternative method would be to compute
(Cy3X/Cy5Y) directly. However, this too
has its own sources of error and other problems, namely that not all
genes are labeled symmetrically with the two dyes since different dyes
may have different sequence specific affinities due to a variety of
causes. For that reason, dye-swap experiments are often
done, i.e. the two samples would be run as
(Cy3X/Cy5Y) as well as
(Cy3Y/Cy5X). If one were to plot
(Cy3X/Cy5Y) against
1.0/(Cy3Y/Cy5X) and the data were
perfectly symmetric (which they are not) then one would expect
a straight line. That is generally not what you get in practice.
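The comparison described above can be sketched in Python as follows. This is only an illustration: the function names, the gene names, and the ratio values are made up, and the summary statistic (mean absolute log2 difference) is an assumption, not something MAExplorer computes. For each gene, the forward ratio is paired with the reciprocal of the dye-swapped ratio. Perfectly symmetric labeling would give identical values in each pair.

```python
import math


def dye_swap_pairs(forward, swapped):
    """Pair each forward ratio with the reciprocal of the swapped ratio.

    forward[gene] is the ratio from the first labeling and swapped[gene]
    is the ratio from the dye-swapped labeling. Genes missing from either
    experiment, or with non-positive ratios, are skipped.
    """
    pairs = []
    for gene in sorted(forward):
        f = forward[gene]
        s = swapped.get(gene)
        if s is None or f <= 0 or s <= 0:
            continue
        pairs.append((gene, f, 1.0 / s))
    return pairs


def mean_abs_log2_difference(pairs):
    """Average |log2(forward) - log2(1/swapped)|; 0.0 means perfect symmetry."""
    diffs = [abs(math.log2(f) - math.log2(r)) for _, f, r in pairs]
    return sum(diffs) / len(diffs)


# Made-up ratios: g1 is symmetric, g2 is not.
forward = {"g1": 2.0, "g2": 0.5}
swapped = {"g1": 0.5, "g2": 4.0}
pairs = dye_swap_pairs(forward, swapped)
print(pairs)
print(mean_abs_log2_difference(pairs))
```

Plotting the second element of each pair against the third gives the scatter described above, where departures from the diagonal indicate asymmetric labeling.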
Another issue is that when you have a number of samples A, B, C, D,
..., N and wish to compare them, there are a number of alternate
experimental designs you can use with different resulting sets of
advantages and problems. If a common pooled Cy5P
sample P were used, then the following experiments would be done:
MAExplorer is currently not oriented to handling these large
combinatoric types of non-pooled sets of experiments. However, you do
have the ability to swap (Cy3,Cy5) data on an individual basis so you
could compute an average of data from dye-swap experiments - but
with the caveats of non-uniform labeling mentioned above.
The major focus of the MAExplorer is interactive data mining with an
emphasis on direct graphical and tabular manipulation of the data.
The investigator is able to interact with the system by clicking on
spots in the array image, points in graphic plots, cells in
spreadsheets, by manipulating threshold sliders or typing in gene
names/clone Ids. This level of interaction allows investigators to
search for and identify patterns of differences with greater ease than
with a more static graphic system since it is easier to test ideas by
"grabbing onto the data". For example, "what" is the identity of
"this" outlier I am pointing to in a scatter plot; "which" genes are
best clustered with "this" gene in this clustergram and are perhaps
co-regulated; "which" genes have expression ratios within the range of
the histogram bin that I am pointing to?
Direct user manipulation of data, as incorporated in MAExplorer, was
defined by (Schneiderman,
1997) who defends the position that the direct manipulation of data
in data mining is an extremely effective means to amplify human
creativity in understanding patterns. Schneiderman's dogma states
"overview first, zoom, and then filter details on demand" and favors
the use of "shallow search trees, slide controllers, and
information-rich screens with tightly coordinated panel view of
data", (Beardsly,
1999). MAExplorer also uses many of these direct manipulation
principles. It was designed to run on desktop computers with data
residing on the same computer and loaded into its memory for rapid
direct manipulation - for both the Web browser and stand-alone
versions.
Part of the Flicker system allows comparison of user 2D gel images
with standard images from SWISS-2DPROT for putative identification of
unknown spots in the user gels. The user would select a standard
2D gel image from over 20 tissue types, enter their own 2D gel image
and align them at spots of interest. They could then switch to a
database access mode, click on those spots and generate popup
SWISS-2DPROT Web pages for those proteins - similar to Clone reports
in MAExplorer. That is accessed at
http://www.lecb.ncifcrf.gov/flicker/swissProtIdFlkPair.html.
MAExplorer will have a groupware facility similar to what we
have done with our
WebGel (http://www.lecb.ncifcrf.gov/webgel/) system described in
(Lemkin et al., 1999b). It is a
two-dimensional electrophoresis system for sharing data analyses. In
WebGel, users may perform a data-mining analysis and leave the state
of their analysis and accompanying notes to share with their
collaborators on a login-protected basis.
We now discuss using these tools for analyzing one's data.
Table 3.2 Steps in a data-mining analysis.
In designing a data mining experiment, the first decision to be made
is selecting the set of hybridized samples to be compared (steps
1 and 2). This is accomplished by setting the current hybridized
sample-X (HP-X) and hybridized sample-Y (HP-Y). In Figure 2.4.4.2 for the
scatter plot we selected a single C57B6 pregnancy day 13 and a single
Stat5a (-,-) pregnancy day 13 as current HP-X and current HP-Y
samples. Changing the normalization changes the view in the scatter
plot so that hidden differences may be more apparent (see Figure 2.4.2.3).
The names of the current HP-X and HP-Y samples are displayed at the top
of the main window. The current HP-X and HP-Y samples may be changed
at any time by clicking on a new sample from a list of samples shown on
the left side of the main window or from lists of samples organized by
sample population in the Samples menu.
The next decision to be made is selection of the genes to be studied
by choosing a subset from the gene class menu list
(step 4). Further selection occurs throughout the analysis by
clicking on spots in microarray images, points in graphic plots or
cells in spreadsheets, by adjusting threshold sliders, or using the
text-entry "guesser" to type in gene names, clone IDs, genomic IDs,
samples, etc.
The next decision the user must make is to set the intensity data
normalization mode (step 3). Normalization of quantitative data is
crucial when comparing data between different hybridized microarrays
because of spotting, hybridization efficiency, uniformity, and
other systematic errors.
Genes of interest may be separated from all of the genes in the
database using a cascade of data filters (step 4). Additional
filtering options are easily accessible in the (data) Filter menu. Some of the filters
require additional parameters. These parameters are set by state
scroll bars that pop up on the screen when data filters requiring them
are added to the filter cascade. Changing scroller values causes the
data filter to be automatically reapplied and a new set of genes
to be computed.
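A filter cascade of this kind can be thought of as the intersection of the gene sets that pass each individual test. The Python sketch below illustrates the idea with made-up data; the function name apply_filter_cascade, the gene names, and the two example tests are hypothetical and do not correspond to the MAExplorer API.

```python
def apply_filter_cascade(all_genes, tests):
    """Return the genes that pass every test (intersection of the test sets)."""
    passing = set(all_genes)
    for test in tests:
        # Each test defines its own gene subset; intersect it with the rest.
        passing &= {gene for gene in all_genes if test(gene)}
    return passing


# Made-up intensity and ratio data for three genes.
intensity = {"geneA": 120.0, "geneB": 30.0, "geneC": 500.0}
ratio = {"geneA": 2.5, "geneB": 3.0, "geneC": 1.1}

tests = [
    lambda g: intensity[g] >= 100.0,  # intensity threshold test
    lambda g: ratio[g] >= 2.0,        # ratio threshold test
]
print(apply_filter_cascade(intensity, tests))
```

Changing a threshold changes one of the subsets, so the whole intersection is simply recomputed, which mirrors the behavior of the state scroll bars.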
It is desirable to reduce false-positives found by the data filter by
eliminating genes with high quantification variability between
duplicate spots on the same sample or spots duplicated in replicate
samples. If duplicate genes are available on the array (denoted by
Field 1 and Field 2 or F1 and F2 spots), this allows the computation
of a coefficient of variation (CV) for the duplicates. This CV may be
used in a data filter to reduce potential false-positives. CV is
computed as 2|F1-F2|/(F1+F2) using those spot values for each gene,
or as StdDevHP/MeanHP for a set of replicate
hybridized samples.
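The two CV formulas can be written out as a short Python sketch. The function names are hypothetical, and the use of the sample (n-1) standard deviation for replicate samples is an assumption, since the manual does not say which form of standard deviation is used.

```python
from statistics import mean, stdev


def spot_cv(f1, f2):
    """CV of duplicate spots on one sample: 2|F1-F2|/(F1+F2)."""
    total = f1 + f2
    if total == 0:
        raise ValueError("CV is undefined when F1 + F2 is zero")
    return 2.0 * abs(f1 - f2) / total


def replicate_cv(values):
    """CV over replicate hybridized samples: StdDev/Mean.

    Uses the sample standard deviation (an assumption).
    """
    m = mean(values)
    if m == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return stdev(values) / m


print(spot_cv(100.0, 150.0))                # 2*50/250
print(replicate_cv([90.0, 100.0, 110.0]))   # 10/100
```

A gene would then pass a CV filter when its CV is below the threshold chosen by the user.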
Graphical views of the data give the user additional insights into the
data. These include spot intensity and ratio or Zdiff pseudoarray
images, and scatter plots.
When there are too many EP-plots to be viewed simultaneously, you
might use a scrollable list of expression profile plots that lets you
scroll through an arbitrarily large list of genes. However, it is
difficult to compare genes that are not sorted in some way
(i.e. clustered). Therefore, these are most useful when used after
clustering the data and displaying the scrollable EP-plots of
the cluster-order data.
Clustering is one way of possibly finding co-expressed genes that
exhibit similar expression changes in a set of samples. Genes may show
similar co-expression, but that does not prove they are co-regulated
at the same point in a pathway - merely that measurements of those
genes in a particular set of experiments show similar
expression. However, identifying genes with similar expression for
which some information is already known about some of the genes may be
useful as a starting point to help figure out gene function and
pathway using additional experiments and analysis.
There are many methods for doing clustering - each with advantages and
disadvantages. We present several methods in MAExplorer and plan on
adding a variety of more powerful methods through the MAEPlugin
facility under development.
The first cluster method finds a cluster of genes whose expression
profiles are similar to that of the currently selected gene. This list
of genes is restricted by the constraint that the cluster distance
between each of these genes to the selected gene is less than the
"Cluster threshold" distance set by the user with a scroll bar. It
displays genes that are found both with blue boxes (the larger the
box, the higher the similarity) and in a text report window showing
the genes and their distances to the current gene. By varying the
threshold and observing the results, the user can find a set of highly
correlated genes. If the threshold is set to 0.0, no genes are
found. If it is set too high, all data filtered genes are found. So
it is critical to adjust the threshold to a reasonable level
commensurate with the type of data being analyzed and the approximate
number of genes expected.
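The first cluster method can be sketched in Python as follows. This is an illustration under stated assumptions: Euclidean distance stands in for the cluster distance, and the function names and expression profiles are made up. A strict "less than" comparison is used so that a threshold of 0.0 finds no genes, as described above.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two expression profiles (an assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def similar_genes(profiles, current_gene, threshold):
    """Return (gene, distance) pairs closer than threshold to current_gene,
    sorted by increasing distance."""
    reference = profiles[current_gene]
    hits = []
    for gene, profile in profiles.items():
        if gene == current_gene:
            continue
        d = euclidean(reference, profile)
        if d < threshold:
            hits.append((gene, d))
    return sorted(hits, key=lambda hit: hit[1])


# Made-up expression profiles over three samples.
profiles = {
    "g1": [0.1, 0.5, 0.9],
    "g2": [0.1, 0.5, 0.8],
    "g3": [0.9, 0.5, 0.1],
    "g4": [0.2, 0.5, 0.9],
}
for gene, d in similar_genes(profiles, "g1", 0.5):
    print(gene, round(d, 3))
```

Raising the threshold far enough returns every other gene, which is why the threshold needs to be tuned to the data.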
A second cluster method draws blue circles in the array image around
all filtered genes meeting the threshold criteria, where the larger
the circle, the larger the number of similar genes (i.e. genes passing the
threshold) found to be clustered with that gene. Clicking on a
gene toggles between the first and second methods. For both of these
methods, it pops up a "Cluster Distance" threshold scroller and
recomputes the clusters if you change the scroller value or the current
gene. It also shows a text report that displays the number of genes
similar to each data filtered gene.
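The counts behind the second method can be sketched in Python. As before, this is only an illustration: Euclidean distance is an assumed metric, and the function names and profiles are made up. For every filtered gene, the sketch counts how many other genes lie within the threshold distance.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two expression profiles (an assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def cluster_counts(profiles, threshold):
    """For each gene, count the other genes closer than threshold."""
    counts = {}
    for gene, profile in profiles.items():
        counts[gene] = sum(
            1
            for other, other_profile in profiles.items()
            if other != gene and euclidean(profile, other_profile) < threshold
        )
    return counts


# Made-up expression profiles over three samples.
profiles = {
    "g1": [0.1, 0.5, 0.9],
    "g2": [0.1, 0.5, 0.8],
    "g3": [0.9, 0.5, 0.1],
    "g4": [0.2, 0.5, 0.9],
}
# Report the genes sorted by decreasing count, as in a sorted count report.
for gene, n in sorted(cluster_counts(profiles, 0.5).items(), key=lambda kv: -kv[1]):
    print(gene, n)
```

A larger count would correspond to a larger circle in the array image.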
A third method, called "K-means" clustering, finds K genes (which we call primary
nodes) whose expression profiles are most orthogonal to each other. It
uses the current gene as the first or "seed" node. It then finds the
gene furthest from this and assigns it as node 2. Then the gene
furthest from both nodes 1 and 2 is assigned to node 3, etc. This
process is repeated until all K nodes are assigned. Then the
remaining genes are assigned to the closest node. Having defined the
initial cluster centers, it recomputes the centroid of each of the
clusters. The centroid can alternatively be computed using a median
instead of a mean in which case we would be doing K-median clustering
(Bickel, 2001). K genes are
then reassigned to the nearest new centroids as the new K-means node
instances. Finally, the remaining genes are assigned to the nearest
centroid. A scrollable K-means cluster text window report pops up
with genes sorted by cluster. Clicking on a gene in either the array
image or scatter plot assigns all genes in the cluster to which that
gene belongs to the "current cluster". Genes in the current cluster
are labeled in the array and scatter plot with a small number of the
cluster. In addition, genes in the current cluster are copied to the
E.G.L. where they can be used in a report, saved in a named gene set,
or used for additional filtering. It also pops up a "N-clusters"
scroll bar window to let you dynamically adjust the number of
clusters. Changing N will recompute the clusters. When the K-means
is recomputed, it uses the current gene as the initial seed gene.
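The seeding and assignment steps described above can be sketched in Python. This is a simplified illustration and not the MAExplorer implementation: Euclidean distance is assumed, "furthest from all existing nodes" is interpreted as maximizing the distance to the nearest existing node, only a single centroid update is performed, and all names and data are made up.

```python
import math


def dist(p, q):
    """Euclidean distance between two profiles (an assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def seed_nodes(profiles, seed_gene, k):
    """Pick K primary nodes: the seed gene, then repeatedly the gene
    furthest from the nodes chosen so far."""
    nodes = [seed_gene]
    while len(nodes) < k:
        candidates = [g for g in profiles if g not in nodes]
        furthest = max(
            candidates,
            key=lambda g: min(dist(profiles[g], profiles[n]) for n in nodes),
        )
        nodes.append(furthest)
    return nodes


def assign(profiles, centers):
    """Assign every gene to the index of its nearest center."""
    clusters = [[] for _ in centers]
    for gene, profile in profiles.items():
        nearest = min(range(len(centers)), key=lambda i: dist(profile, centers[i]))
        clusters[nearest].append(gene)
    return clusters


def centroid(profiles, members):
    """Mean profile of the member genes."""
    columns = zip(*(profiles[g] for g in members))
    return [sum(column) / len(members) for column in columns]


def k_means_pass(profiles, seed_gene, k):
    """Seed K nodes, assign genes, recompute centroids, and reassign once."""
    nodes = seed_nodes(profiles, seed_gene, k)
    clusters = assign(profiles, [profiles[n] for n in nodes])
    centers = [
        centroid(profiles, members) if members else profiles[node]
        for node, members in zip(nodes, clusters)
    ]
    return assign(profiles, centers)


# Made-up two-sample profiles forming two obvious groups.
profiles = {"a": [0.0, 0.0], "b": [1.0, 0.0], "c": [10.0, 0.0], "d": [11.0, 0.0]}
print(seed_nodes(profiles, "a", 2))
print(k_means_pass(profiles, "a", 2))
```

Replacing the mean in centroid with a median would give the K-median variant mentioned above.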
The fourth method is a hierarchical clustering method that generates a
clustergram and dendrogram similar to that of Eisen's red-black-green
clustergram (Eisen, 1998). This
was derived from the clustered correlation map (ClusCor) of Weinstein
et al. (Weinstein, 1997). The
MAExplorer clustergram and dendrogram are dynamic and may be
interrogated and used to set the current gene. This means that it may
also position a corresponding ordered list of expression profile plots
to the same gene so you may view the data as a plot as well. The
dendrogram may be zoomed in to explore a part of the dendrogram in
more detail. As with the K-means clustering, a report can be made of
the ordered genes.
Then, the expression profile is expressed as a list of values:
For scaled data such that dpq has a maximum value of 1.0
over all samples, a similarity measure could be computed as
1.0 - distance or
D can get quite large for clustering a large number of genes
N [for N=5000, this is > 50 Mbytes!]
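The size estimate can be checked with a little arithmetic. The Python sketch below assumes 4-byte floating point distances and a triangular layout that stores each gene pair once, including the diagonal. Both are assumptions made for illustration, since the storage layout is not stated here, and the function name is hypothetical.

```python
def distance_matrix_bytes(n_genes, bytes_per_value=4, triangular=True):
    """Estimate the memory needed for a gene-by-gene distance matrix.

    triangular=True stores each pair once, including the diagonal
    (N*(N+1)/2 values); triangular=False stores the full N*N matrix.
    """
    if triangular:
        n_values = n_genes * (n_genes + 1) // 2
    else:
        n_values = n_genes * n_genes
    return n_values * bytes_per_value


print(distance_matrix_bytes(5000))                    # triangular layout
print(distance_matrix_bytes(5000, triangular=False))  # full matrix
```

Under these assumptions the triangular layout for N=5000 needs slightly more than 50 million bytes, and a full matrix needs twice that, so memory grows with the square of the number of genes.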
The following is a simplified definition of one way to compute a
hierarchical clustering of gene expression profile data.
If there is only one field in the array, it will appear as field 1. In
the above example, [1-A4,3] is field 1 grid A
row 4 and column 3. Note that the pseudoarray coordinates are for
visualization purposes in MAExplorer and may or may not be the same as
the coordinates on the actual array. That depends on how the
MAExplorer database was defined in the configuration file described in
Appendix C.
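A coordinate written in the form [1-A4,3] can be split into its parts with a small parser. The Python sketch below is an illustration only: the function name and the returned dictionary keys are hypothetical, and the pattern assumes the field, row, and column are decimal numbers and the grid is one or more letters.

```python
import re

# [field-GridRow,Column], e.g. [1-A4,3]
_COORD = re.compile(r"\[(\d+)-([A-Za-z]+)(\d+),(\d+)\]")


def parse_pseudoarray_coord(text):
    """Split a pseudoarray coordinate into field, grid, row and column."""
    match = _COORD.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"not a pseudoarray coordinate: {text!r}")
    field, grid, row, column = match.groups()
    return {"field": int(field), "grid": grid, "row": int(row), "column": int(column)}


print(parse_pseudoarray_coord("[1-A4,3]"))
```

These values describe the position in the pseudoarray display, which may differ from the position on the physical array.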
When the current gene is defined, it will draw a yellow (green) circle
around the spot in the ratio (intensity) pseudoarray image and display
other features of the gene in the three-line status area near the top
of the main window. If background correction is enabled (the "Use
background intensity correction" in the Normalization menu), then spot
intensity values will appear as intensity' (with background
intensity subtraction) and intensity (without background
subtraction).
There are a number of different reporting formats available depending
on the array display mode and particular normalization method
selected. These include: the pseudoarray image of the intensity of a
single sample, the pseudocolor ratio X/Y or Zdiff (X-Y) image (using
either HP 'sets' or single samples), or the ratio of Cy3/Cy5 for
dual-labeled dyes or F1/F2 for replicate spots for a single sample.
In addition, the normalization mode is also displayed in the reporting
line. We will present examples of each of these different reporting
formats.
You may show the intensity data for a particular spot in the currently
displayed pseudoarray image. First select the "Pseudograyscale image"
option in the "Show Microarray" submenu in the "Plot menu". If your
data has duplicate grids (i.e. fields F1 and F2) then you may look at
F1, F2 and mean (F1+F2)/2 data in the reports when you click on a
spot. If the "Gang F1-F2 scrolling" switch is disabled in the "View
menu", then the intensity value is the intensity data value
for the gene at that location. If the "Gang F1-F2 scrolling" switch is
enabled, then it reports intensity[F1], intensity[F2], and the F1/F2
ratio. These two formats are shown in the following two examples for a
C57B6 pregnancy day 13 sample in the MGAP database:
a) Field F1 spot for a single spot in a single sample with the median
intensity selected.
c) Ratio data for two samples X and Y in separate hybridized
arrays. Ratio data for the field F1 and F2 spot data as well as the
mnX/mnY ratio is reported. The median normalization was used in this
example.
f) Multiple HP-XY 'sets' using median normalization for the pseudoarray
image display for the HP-X 'set' of three C57B6 samples.
j) Multiple HP-XY 'sets' p-value using median normalization for ratio
(HP-X/HP-Y) data for the "Pseudocolor (HP-X,HP-Y) 'sets' p-value
display.
For the intensity and ratio threshold filters, the range
interpretation may be inside or outside the specified range. The
ratio range [R1:R2] is between 0.01 and 100.0. The Zdiff ranges [Z1:Z2]
and [CZ1:CZ2] are between -4.0 and +4.0. The intensity threshold range
[I1:I2] is set to the dynamic range of the min and max intensity for
the current normalization method.
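The inside and outside interpretations of a range can be sketched in Python as follows. The function name, the gene names, and the ratio values are made up, and treating the range end points as inside the range is an assumption.

```python
def range_filter(values, low, high, mode="inside"):
    """Return the genes whose value is inside or outside [low:high].

    The end points are counted as inside the range (an assumption).
    """
    if mode not in ("inside", "outside"):
        raise ValueError("mode must be 'inside' or 'outside'")
    passing = set()
    for gene, value in values.items():
        is_inside = low <= value <= high
        if is_inside == (mode == "inside"):
            passing.add(gene)
    return passing


# Made-up ratios, filtered with a ratio range [R1:R2] of [0.5:2.0].
ratios = {"g1": 0.2, "g2": 1.0, "g3": 4.0}
print(range_filter(ratios, 0.5, 2.0, mode="inside"))
print(range_filter(ratios, 0.5, 2.0, mode="outside"))
```

The outside interpretation is the one that selects genes with large changes in either direction.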
A list of possible threshold sliders is shown in the following table.
When a Filter is enabled that requires a slider, it pops up the
State Scrollers window that contains one or more
sliders. When you disable all filters that use these sliders, the
popup window will disappear. The corresponding Ratio R1[R2] or
Zdiff Z1[Z2] sliders are used if you are using a ratio or
Zscore normalization - and will change if the normalization changes
while the filter is active.
Some of the sliders are implemented with a non-linear scale so that
you have more resolution at the low end (eg. p-Value, Spot CV, Diff
HP-XY).
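One common way to get more resolution at the low end of a slider is a logarithmic scale. The Python sketch below shows that idea only; the actual mapping used by MAExplorer is not described here, so the logarithmic form, the function name, and the example range are all assumptions.

```python
def slider_value(position, steps, vmin, vmax):
    """Map a slider position in 0..steps to a value in [vmin, vmax]
    on a logarithmic scale (vmin and vmax must be positive)."""
    if not 0 <= position <= steps:
        raise ValueError("position must be between 0 and steps")
    if vmin <= 0 or vmax <= vmin:
        raise ValueError("need 0 < vmin < vmax")
    return vmin * (vmax / vmin) ** (position / steps)


# A hypothetical threshold slider with 100 steps from 0.0001 to 1.0.
for position in (0, 25, 50, 75, 100):
    print(position, slider_value(position, 100, 0.0001, 1.0))
```

With this mapping, the first half of the slider travel covers only the values up to about 0.01, so small thresholds can be set precisely.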
Depending on the set of data Filters selected, there may be multiple
sliders present in the State Slider popup window (eg. see Figure
2.4.3).
Table 3.3.1. List of threshold sliders. Sliders are enabled in
the State-Scroller popup window when the corresponding data filters
are enabled.
If you are running on a windowing system supporting cut and paste,
then you may cut and paste data from reports and plots into
applications on your system that allow you to save or print this
data. Set the Report menu table-format to "Tab-delimited". Then, in
Windows 95/98/NT/2000/XP, cut data from the popup tables (or other text
reports) and paste it into Microsoft Excel. In Windows, you can
capture (i.e. "cut") the entire screen by pressing the "Prt Sc" or
print screen button. To capture a specific window (e.g. a scatter
plot), hold the "Alt" key when pressing the "Prt Sc" key. Then go
into a Windows imaging application (such as PhotoShop) and paste it
into the application. In PhotoShop, in the File menu, select New (or
type Control/N). Then when the window is opened, click on the window
and paste the MAExplorer screen you had cut into the image window by
typing Control/V. In both Excel and PhotoShop you may print the data
or save it in a file.
Section 4.1 discusses known bugs, and Section 4.2
lists the revision notes for older versions. If you have experienced bugs
with an older version of MAExplorer, you might check the revision notes
to see if the bug was fixed and download a new version. Section 4.3
discusses problems in using MAExplorer
as an applet with Web browsers. Section 4.4 describes handling fatal "DRYROT" errors.
If you encounter a fatal error that is detected by MAExplorer, it will
pop up an error reporting
window. Please E-mail this data to us so we can try to resolve the
problem.
In the meantime, partially implemented commands are disabled to keep
you out of trouble :-) ...
You can help us and get MAExplorer to do more of the things you
would like to see. Let us know of problems that you encounter as well
as suggestions for changes or new methods you would like to see -
send us E-mail.
If you are experiencing Web browser problems using the MAExplorer
applet, you might check the discussion of possible solutions.
Figure 4.4 Example of a fatal Dryrot Error window. This may occur for
a variety of reasons. This window lists the main reason and also lists
some of the MAExplorer state information. If you wish, you may save this
window (press the "SaveAs" button) and mail it to us. We may try to
correct the problem in the next release if it is a problem with MAExplorer.
Alternatively, it could be a user data error.
Figure 4.4.1 Example of a fatal Dryrot Error window after SaveAs.
This tells you where the saved error message file was saved and
the email address to send it to if you wish.
Primary contributors to Cvt2Mae were Peter Lemkin (LECB/NCI), Greg
Thornwall (SAIC/FCRDC), Bob Stephens (ABCC/NIH).
We wish to thank the many members of Lothar Hennighausen's
Laboratory of Genetics and Physiology (NIDDK) who inspired the
initial development of MAExplorer and its continued
development. Thanks also to:
Greg Alvord (SAIC/FCRDC),
Thanks also to Jeff Thomas, Charmaine Richman, and Tom Stackhouse
(NCI) for helping with the MAExplorer Open Source process.
This tutorial lets you
There is also a
pre-computed example of an Ordered Condition List using 4
conditions of replicates of C57B6 (pregnancy day 13, lactation days 1
and 10, and stat5a(-,-)) with 15 samples. The database also includes 4
additional condition sets of this data and an Ordered Condition List
of the 4 conditions (in the State/ directory). This may be used to
demo the OCL F-test filter.
If you have access to another MAExplorer database, you can use it
instead since the tutorials are fairly generic.
Using the stand-alone application for the tutorial
These same subsets as well as other subsets of the MGAP data are
available in the set of .mae startup files distributed with
MAExplorer. To access these files,
When it starts, a main window will pop up. It then downloads the gene
database tables and the particular hybridized samples you specified.
When it is ready for you to begin interaction, the menu bar will
become active and it will display a green Ready -
click on a gene to query database message. Depending on your
Internet connection speed, it may take a few minutes to set up. If you
are running MAExplorer as a stand-alone application and it is getting
data from your local disk, startup will be much faster.
HINT: print this tutorial page and then read
the following instructions from the printout rather than trying to
keep this window visible. You might also print the parts of the
MAExplorer Reference Manual for the same reason.
HINT: You might want to keep a record of the
commands you have used or the messages and measurements you have
made. To do this you need to enable message and command history
logging. Go to the View pull-down menu and then select the type of
logging you want using the
Show log of messages or the Show log of command history
commands.
NOTE: On computers with low resolution
(i.e. less than 1024 X 780) you may need to resize the windows and
move them to different parts of the screen to view them
simultaneously.
step 1: go to Analysis: Plot: Scatter plots: HP-X vs. HP-Y.
step 1: go to Analysis: Plot: Scatter plots: Cy3 vs. Cy5
If you are working with Cy3/Cy5 dye-swap data, you may swap the
Cy3/Cy5 channel data to Cy5/Cy3 for any selected subset of
samples. This may make it easier to use the data in various ways when
data mining. If you do not have this type of data, go to step 7.
step 5': go to Samples: Edit (Cy5/Cy3) else use (Cy3/Cy5) menu
Note of caution: if the signal is close to background the X/Y ratio
may be bogus.
In the Filter menu, add the "Filter by ratio or Zdiff sliders". Then
the [R1:R2] ratio range sliders are added to the state slider window
and may be used for filtering genes. If the normalization method is
one of the Zscore methods, it filters by the difference of the Zscores
and the [Z1:Z2] range is used; otherwise it filters by the ratio. Note that the
genes that pass the filter will appear to have a red (white) circle in
the pseudoarray intensity (ratio) grayscale (pseudocolor) image, or a red "+"
in the scatter plots so you might try moving the controls while in
those plot modes. Try some of the other filters. The spot CV test
removes genes where replicate spot values (F1 and F2 in the case of a
single sample or replicate samples in the case of HP-X and HP-Y 'sets'
or the HP-E 'list') are not well correlated. The t-Test filter
may be used with sets of X and Y samples to find genes with a
p-value less than the specified threshold.
Turn on one or more Filters
to reduce the number of genes to say under 100 (e.g. t-test or spot CV
filters). Then press the "Go 'Cluster all genes'" button in the
cluster window. This is equivalent to invoking the "Cluster
counts of Filtered genes by expression profiles" command from the "Cluster plots" submenu.
Notice the Filtered genes have blue circles of different sizes. The
larger the circle, the more genes there are that are similar to that
gene. Move the cluster threshold slider and note that as the number of
similar genes changes, the size of the blue circles will change. As
with the other cluster mode, you may generate a report of sorted
cluster counts. Click on a gene with the largest
blue circle. This will then switch you back to single gene
clustering mode where you can investigate that gene in more detail.
Types of pseudoarray image displays
There are several different types of pseudoarray images that may be
displayed. The current type is set in the Show Microarray
submenu in the Plot menu. The selections include
Pseudograyscale intensity, which approximates the intensity of a
single sample or average of samples. The Pseudocolor
Red(X)-Yellow-Green(Y) HP-X/HP-Y ratio or Zdiff and Pseudocolor
Red(Cy5)-Yellow-Green(Cy3) Cy3/Cy5 (or F1/F2) ratio or Zdiff displays add
the two samples or channels together as separate Red+Green channels to
give a color spectrum. The Pseudocolor HP-X/HP-Y ratio or Zdiff and
Pseudocolor Cy3/Cy5 (or F1/F2) ratio or Zdiff displays give a color
spectrum from a low ratio (Zdiff) value (Green) to a high value (Red)
with a value of 1.0 (0.0) shown as Black. The Pseudocolor (HP-X,HP-Y)
'sets' p-value display shows the p-value between two X and Y sets in a
color spectrum. If the Original image is set and the image
file is in the database, it will pop up a separate Web browser window
to display it. The Pseudograyscale display is a grayscale image, with
higher concentration genes appearing darker, on a light blue
background. The pseudocolor HP-X/HP-Y ratio of spots image is
constructed using a color scale going from bright green (<1) to
black (=1) to bright red (>1) on a black background. For the
pseudocolor Zdiff of (X-Y), the color scale goes from bright green
(<0) to black (=0) to bright red (>0). If the dichromasy
switch is set in the View menu, then a different set of colors is
selected that may be easier for some people to differentiate. If the
Use dual HP-X & HP-Y 'sets' else single samples toggle in
the Samples menu is set, it displays the mean HP-X data on the left
and HP-Y on the right for doing a side by side comparison.
Popup windows
MAExplorer starts with the main pseudoarray image window. This window
contains the pull-down menus where you may issue commands. As you
perform various operations, new windows may pop up for some of these
commands. For most of these windows, you may click on the "Close"
button or click on the close window icon associated with your
operating system (generally one of the buttons at the top of the popup
window). However, some windows were designed to not close when you do
this. In particular, the "State sliders" window cannot be
closed unless the associated data filtering or clustering operation is
closed. Closing the associated operation will automatically
close the state slider window.
The current sample, HP-X, and HP-Y
In MAExplorer, a hybridized array sample is abbreviated HP. The
underlying data comparison model assumes, as a minimum, the comparison
of two different experimental conditions represented by samples HP-X
and HP-Y. A good way to think about this is that these variables are
the two axes of a scatter plot (one of the displays you may
generate). The HP-X and HP-Y may be thought of as containing data from
either single hybridized samples or containing mean data from multiple
replicate sets of samples. The HP-X and HP-Y are assigned using the
Set current HP-X and Set current HP-Y commands in the
Samples menu. The sets are most easily changed using Choose HP-X,
HP-Y and HP-E to select the currently active samples. The
contents of the multiple sample HP-X and HP-Y 'sets' may
alternatively be changed using the Edit HP-X & HP-Y 'sets' of
samples by source submenu, and the HP-E list of samples using the
Edit HP-E list of samples by source. Assigning single samples
to either HP-X or HP-Y may be done from the Samples menu. However, it
is easier to do it by clicking on the pseudoarray image. First click
on the magenta "[X]" or "[Y]" Current Sample box at the top of the
list to switch between HP-X and HP-Y. Whichever is visible ([X] or
[Y]) is the one that will be the HP sample assigned. Then simply click
on the magenta "*" to the left of the sample name for the sample you
wish to assign.
Using 'sets' of HP-X and sets of HP-Y
Multiple samples may be assigned to the HP-X or HP-Y
sets. These are assigned using the Edit HP-X and HP-Y 'sets'
of microarrays in the Samples menu. The multiple sets are
enabled by setting the Use HP-X and HP-Y 'sets' else single
samples checkbox in the Samples menu. Then, when
statistical calculations are performed on that data, it will use the
means, std-deviations, etc. from each of these sets rather than
individual samples.
The HP-E sample list for computing expression profiles
You may cluster sets of genes with similar expression profiles across
a set of hybridized samples. The set of HP samples used in doing these
profiles is specified by Edit expression profile 'list' (HP-E)
in the Samples menu. The Choose HP-X, HP-Y, and HP-E
command may also be used for defining the members and order of the
samples in the HP-E 'list'. Then, gene intensity expression profiles
may be created in a popup window for hybridized samples in the HP-E
set by using the Expression profile plot commands in the Plot menu.
Several of these plots may be created on the screen at the same
time. Clicking on a vertical data line in the plot will show the name
of the HP, its intensity and coefficient of variation (CV) of the
(F1,F2) data for this gene. Note that you can order the hybridized
samples in the HP-E set by the order in which they are added.Data 'Filters' - the intersection of one or more data tests
A set of genes may be computed by taking the intersection of selected
gene sets. These sets are determined by various logical, data range
and statistical tests. Genes passing each test are assigned to a
gene subset, and these subsets are in turn used in the gene intersection
computation. The final gene subset is used in the array image, plots,
reports, and subsequent data filtering. Changing any test parameter
causes the data filter to be re-computed.
1.4 Exploratory data analysis - overview
MAExplorer may be used to perform various data explorations by looking
for patterns correlated with different sets of hybridized samples or
with expression profiles of genes. This is discussed in more detail
throughout this manual and later in Section 3 on Exploratory Data
Analysis. Detailed descriptions of all commands are given in Section 2 Menus.
1.4.1 Saving the state of a data-mining session in stand-alone mode
If you are running MAExplorer in stand-alone mode, you may save the
state of your session for later use with the "Save DB" or "SaveAs DB"
commands. The checkpointed database can then be reopened using the
"Open file DB" command. The saved state currently includes: the gene
sets, condition (HP) lists, current HP-X, HP-Y and HP-E lists, data
Filter options and slider value settings, display options, clustering
options, normalization options, etc. We recommend using "SaveAs DB" so
you can save the state under a different name rather than overwriting
the original state. This way you can go back to the original state if
you want to. The "SaveAs DB" and "Open file DB" commands are
described in the File menu.
1.4.2 Logging messages and command history
Often a user would like to review measurements of particular genes and
to review the list of commands they issued (also called the command
history). Various data measurements, as well as many other types of
information shown in the three text lines in the status area of the
main window, may optionally be recorded in a popup message log
(Section 2.5.1). The command history may also be reviewed in a
separate popup message log (Section 2.5.2). If you are running the
stand-alone version, the logs may be saved. Otherwise, you can cut and
paste the log data into other word processing applications.
1.5 Quick start - demonstration of MAExplorer
MAExplorer is used as a stand-alone application, which you may
download (see Appendix D). This download also includes a demo data set
of 50 hybridized samples from the public MGAP database. You can also
download the data separately at any time at
http://www.lecb.ncifcrf.gov/mae/MGAP-Array-database.zip or
http://prdownloads.sourceforge.net/maexplorer/MGAP-Array-database.tar.gz?download
Exiting MAExplorer
If you are in MAExplorer and want to close the program and exit, you may
use the Quit command in the File menu or click on the
"close application" button (placed by your operating system in the
upper right-hand corner of MAExplorer windows).
1.6 Tutorials for using MAExplorer
There are a number of things you may do in this data mining facility.
We wrote two tutorials to help you understand its capabilities. We
recommend you first try the
short tutorial before attempting the advanced tutorial. The latter
demonstrates some of the more advanced capabilities.
2. MAExplorer menus
A MAExplorer analysis is performed using various interactive
controls. These commands are selected from pull-down menus located on
the "menu bar" at the top of the main MAExplorer window. The primary
menus are:
2.4.1 GeneClass - select gene subset for gene class data Filter
2.4.2 Normalization - select gene intensity normalization mode
2.4.3 Filter - select data filters to compute gene subset of interest
2.4.4 Plot - pseudoarray image, scatter, histograms, expression profile popup plots
2.4.5 Cluster - perform cluster analysis on data filtered genes
2.4.6 Report - generate popup spreadsheet reports of genes and samples
Menu notation
In the following menus, each selection is prefaced by a marker icon
indicating its type: a sub-menu, a checkbox that is enabled or
disabled, or a multiple choice "radio button" that is enabled or
disabled. Checkbox menu items have a "[CB]" at the end of the
command. Radio button menu items have a "[RB]" at the end of the
command, and only one member of a radio button group is allowed to be
on at a time. The default values set for an initial database are
shown in the menus. Selections prefaced with a '#' are available only
when MAExplorer is run in the stand-alone mode. Selections prefaced
with a '*' require access to the backend Web server [Future].
Selections that are not currently available will be grayed out in the
menus of the running program.

2.1 File menu
The File menu includes options and submenus
providing access to database data from disk and Web servers, state
saving, and groupware to share states between collaborators.
Databases - open and save databases of hybridized samples
Exploratory 'state' - save and restore the user's data-mining explorations
Groupware - share some exploration states with collaborators [Future]
2.1.1 Databases menu
The Databases submenu is currently available only in stand-alone
mode and contains the following selections for opening and saving
databases.
2.1.2 Exploratory state menu
The Exploratory 'state' submenu contains the following
selections for saving and later using the user's state of the
exploration. If MAExplorer is being run on a local computer, no login
is required.
2.1.3 Groupware facility for sharing user states menu [Future]
The groupware facility allows users to share their state data with
other users. However, if MAExplorer is being run on a local computer,
then groupware may not be available since it depends on specific
Web servers that provide MAExplorer-specific groupware services.
[We first developed these concepts in the context of 2D protein
gels. They include: WebGel (Lemkin et al., 1999b) for Internet
exploration of 2DE databases; Xconf (Lemkin et al., 1993), an early
X-windows based image conferencing system, similar to CU-SeeMe or AOL
Instant Messenger, for sharing images in a conference over the net;
Flicker (Lemkin, 1997), an image comparison tool for use over the
Internet using a Java applet; and GELLAB-II (Lipkin and Lemkin, 1981),
a system for data mining (see the GELLAB-II Poster on the Web). These
systems embody many of the concepts used in MAExplorer.]
The Groupware submenu contains the following selections:
Saving the state and databases on the local file system
The current state of an exploratory data analysis session may be saved
on the local file system. The state consists of: gene sets;
HP-X and HP-Y hybridized sample sets; HP-E hybridized sample
lists; thresholds and switch setting preferences; etc. The user may
save their current state and restore a previous state at any
time. Restoring the state will override the current database.
Groupware sharing of intermediate exploratory results [FUTURE]
Each registered user will be able to save the current state of their
exploration of the data in named User State files on the
back-end Web server using the Save user's state command. The
user may keep multiple named states on the back-end secure server
where they may be accessed to restore the state at a future time using the
Open user's state command. The user can request a list of their
states with the Directory of user's states command. They may
remove a particular state with Delete user's state. 
2.2 Samples menu
Each experimental condition sample is represented by a hybridized
sample (abbreviated HP in MAExplorer). The Samples menu
includes operations to select the current hybridized sample
or samples. The simplest model of a MAExplorer analysis assumes (at
least) two hybridized-sample variables, HP-X and HP-Y, whose
data may be plotted against one another or compared. The default is a
single HP-X sample and a single HP-Y sample.
Figure 2.2.1 shows setting the HP-X, HP-Y, HP-E lists of samples using
the "Chooser" - the preferred method. Figure 2.2.2 shows setting the
HP-X sample from the menus. Figure 2.2.3 shows changing the current
HP-X sample by clicking on a sample name in the microarray image.
Set Samples from lists - define HP-X, HP-Y, HP-X 'set', HP-Y 'set',
or HP-E 'list' from lists of samples.
Use HP-X & HP-Y 'sets' else single samples [CB] - toggle
between using HP-X and HP-Y 'sets' of multiple samples or
single HP-X and HP-Y samples.
2.2.1 Selecting hybridized samples with Chooser or pull-down menu sample lists
The Set Samples from lists submenu lets you define HP-X, HP-Y,
HP-X 'set', HP-Y 'set', or HP-E 'list' from lists of samples. It
contains four submenus:
- i.e. single HP-X sample from the list of H.P.s
- i.e. single HP-Y sample from the list of H.P.s
- edit sets of X and Y samples by
source for advanced statistics comparisons.
- edit samples by source in the HP-E
ordered list of samples for expression profile statistics.
The Edit HP-E expr. profile 'list' by source menu allows the user to
define an ordered list of samples for use in expression profile
statistics. Then, an expression vector of normalized quantification
values (one for each sample in the HP-E list) is computed for each
gene. Note: to place the samples in a particular order, start with an
empty HP-E set and then add them in the order you desire.
- add the selected sample to HP-X set.
- add the selected sample to HP-Y set.
- remove the selected sample from HP-X set.
- remove the selected sample from HP-Y set.
- add the selected H.P. to HP-E list.
- remove the selected sample from HP-E list.
The Convention for pull-down menu sample selection lists
We use a common sample selection scheme when selecting a sample from a
pull-down menu list. This sub-sample "By Source" option is only
available if your database was set up to allow sub-sample source names
in the Samples database.
- selects the sample from one of additional submenus. These
database-specific menu entries might be categories such as
developmental stage, tumor models, time series, etc and are
set up for a specific database in its configuration.
2.2.2 Swapping selected samples' (Cy3,Cy5) channels in ratio data dye-swap experiments
The Edit use (Cy5/Cy3) else (Cy3/Cy5) for each HP command may
be used to selectively swap the (Cy3,Cy5) data entries of individual
samples so that dye-swap samples may be used as replicates (carefully,
since gene labeling efficiency is not always symmetric!). This is only
available for ratio data. It swaps the data held in MAExplorer (in
memory only) so that the Cy3 data is exchanged with the Cy5 data. For
example, consider the case where there are two materials A and B
hybridized in two experiments and labeled as follows: E1 (A=Cy3,B=Cy5)
and E2 (A=Cy5,B=Cy3). Assuming uniform symmetric labeling (which is
generally not the case, although it might be true for a subset
of genes), one might average data from E1 and E2 if the data from
E1 (or E2) were swapped. This is shown in the following figure.
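The dye-swap averaging described above can be illustrated with a minimal Python sketch. It is not MAExplorer code: the function names, the dictionary layout and the intensity values are hypothetical, and the sketch assumes the symmetric labeling that the text warns is often not the case.

```python
def swap_channels(sample):
    """Return a copy of a ratio sample with its Cy3 and Cy5 values exchanged."""
    return {"cy3": list(sample["cy5"]), "cy5": list(sample["cy3"])}


def average_replicates(a, b):
    """Average two samples channel by channel, spot by spot."""
    return {
        ch: [(x + y) / 2.0 for x, y in zip(a[ch], b[ch])]
        for ch in ("cy3", "cy5")
    }


# E1: material A labeled with Cy3, material B with Cy5 (made-up values).
e1 = {"cy3": [100.0, 200.0], "cy5": [50.0, 400.0]}
# E2: the dye swap, material A labeled with Cy5, material B with Cy3.
e2 = {"cy3": [60.0, 380.0], "cy5": [110.0, 190.0]}

# After swapping E2, "cy3" again holds material A and "cy5" material B,
# so the two experiments can be averaged as replicates.
mean_sample = average_replicates(e1, swap_channels(e2))
ratios = [a / b for a, b in zip(mean_sample["cy3"], mean_sample["cy5"])]
```

The swap is applied to only one of the two experiments, which mirrors the "E1 (or E2)" wording of the text.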
2.2.3 Viewing sample HP-X, HP-Y, and HP-E partitions
You set up the HP-X and HP-Y sample sets and the HP-E expression list
of samples using the Chooser (above). The current contents of the HP-X
and HP-Y 'sets' may be viewed using the List HP-X & HP-Y sample 'sets'
command. The List HP-E sample 'list' command lists the
samples in the ordered HP-E 'list'.
2.2.4 Defining sample condition 'class' names
When using sets of conditions, the HP-X and HP-Y 'sets', you will
probably want to assign meaningful names to these sets. The commands
Define HP-X class name and
2.2.5 Toggling between single HP-X (-Y) samples and HP-X (-Y) sets
When MAExplorer first starts up, it assumes that you wish to treat the
data as single samples so that HP-X and HP-Y are assigned to single
samples. However, if you want to work with sets of multiple samples
then you must toggle the state using the Use HP-X & HP-Y 'sets'
else single samples [CB] check box command. This toggles the state
between treating the data as multiple samples (HP-X and HP-Y 'sets')
or as single HP-X and HP-Y samples.
2.2.6 Create and edit named condition lists of samples
The command Choose named condition lists of samples lets you
define new or edit existing named lists of hybridized samples called
"Condition lists". Associated with each condition list is a set of
annotation parameters to document the condition. The condition lists
may be used in the (Edit | Sets of conditions) operations. Among other
operations, you may assign any condition list to the working HP-X
'set', HP-Y 'set', or HP-E list of samples used throughout
MAExplorer. The last condition list that was edited with the (Samples
menu | Choose named condition lists of samples) command is called the
"current condition" and may be used in various operations. Figure
2.2.5 shows a popup condition chooser session; its legend
describes the options.
2.2.7 Create and edit named ordered condition lists (OCL) of conditions
The command Choose ordered condition lists of conditions lets
you define new or edit existing named ordered lists of conditions
called "Ordered Condition Lists" (OCL). Associated with each ordered
condition list is a set of annotation (name, value) pairs to document
the condition. The last ordered condition list that was edited with the
(Samples menu | Choose ordered lists of conditions) command is called
the "current OCL". The current OCL is used by the (Filter menu |
Filter by current Ordered Condition List (OCL) F-Test [p-Value] slider
[RB]) test. Figure 2.2.6 shows a popup ordered condition list chooser
session.

2.3 Edit menu
The Edit menu includes operations to modify the
'edited gene list', which is set by a variety of Filters as well as
manually from this menu. The user may perform set operations (union,
intersection, and difference) on named sets of genes and on sets of
sample experiments (conditions). User preferences are also set in this
menu. For example, first compare two HPs using a statistical test such as a
t-test. Then save the resulting set of genes under the name "virgin
vs. pregnancy". Then compare the next two HPs and save the resulting
genes under the name "lactation vs. involution". Finally, compute
the difference of genes found in "virgin vs. pregnancy" that are not
found in "lactation vs. involution". This resulting gene set could
then be saved (e.g. with a name "Genes found in virgin
vs. pregnancy, but not in lactation vs. involution"). Similarly,
taking the intersection of these two named sets shows genes that are
common between the two sets. Taking the union shows genes found in
either of the two named sets.
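The three set operations in this example correspond directly to ordinary set algebra. The following Python sketch uses hypothetical gene names and is only meant to show what each operation returns; it is not how MAExplorer stores gene sets.

```python
# Genes passing each comparison (hypothetical gene names).
virgin_vs_pregnancy = {"GeneA", "GeneB", "GeneC"}
lactation_vs_involution = {"GeneB", "GeneD"}

# DIFFERENCE: genes in the first set that are not in the second.
difference = virgin_vs_pregnancy - lactation_vs_involution

# AND (intersection): genes common to both sets.
intersection = virgin_vs_pregnancy & lactation_vs_involution

# OR (union): genes found in either set.
union = virgin_vs_pregnancy | lactation_vs_involution
```

Note that the difference is not symmetric: "GeneD", which appears only in the second set, is ignored.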
User edited gene list - edit the user defined 'Edited Gene List' or E.G.L.
Sets of genes - operations on named sets of genes.
Sets of conditions - operations on lists of hybridized samples
(i.e. experimental conditions).
Preferences - set various statistical limits and other parameters.
2.3.1 User edited gene list - the 'Edited Gene List' menu
You may define and edit arbitrary sets of genes using the User
edited gene list submenu to modify the 'Edited Gene List'
(EGL). This has sub-modes of operation for adding or removing genes
from the image by clicking on spots. If the Show 'Edited Gene
List' mode is set, you may see exactly which genes you have
defined by the magenta squares drawn around each gene in the EGL. Many
of the clustering operations will leave the current cluster in the
EGL. The commands include:
This gives you the functionality of adding and deleting genes from a
user defined list of genes to be analyzed. The EGL may be used with
the gene-set operations discussed in Section
2.3.2. You may also define genes in the EGL using the "Gene Name
Guesser" shown in Figure 2.3.1.
Show 'Edited Gene List' [CB] - toggle showing the EGL as
magenta boxes in the pseudoarray image. If enabled, genes set by
manual selection or as the result of some filtering operations
Don't edit [RB] - clicking on a spot does nothing
(i.e. disables 'click to add (remove) genes to (from) the E.G.L.').
Click to add gene to E.G.L. (Ctrl/click) [RB] -
clicking on a spot adds the corresponding gene to the 'Edited
Gene List'.
Click to remove gene from E.G.L. (Shift/click) [RB] -
clicking on a spot removes the corresponding gene from the
'Edited Gene List'.
2.3.2 Sets of genes menu
These commands let you compare sets of genes generated under
different criteria. In addition, you may compute derived gene sets
from existing gene sets using set operations (OR, AND,
DIFFERENCE). You may also normalize the data by a gene subset. The
user may save the genes defined by: 1) the Filter, or 2) the
manually defined 'Edited Gene List'. The gene set resulting
from a binary gene set operation OR (union), AND (intersection), or
DIFFERENCE is saved in a new named gene set. The set difference (A-B) is
defined as the genes in set A that are not in set B. Genes in set B
that are not in set A are ignored. The 'User Filter Gene Set' may be
set to any gene set and may then be used as part of the gene Filter
cascade. The 'User Normalization Gene Set' may be set to any gene set
and may then be used to normalize gene intensity values across
hybridized samples. (See the normalization
algorithm for more information on this method.)
User Gene Sets
Set# |#genes| title
=======================
#1 |1727| ALL GENES
#2 |394| ALL NAMED GENES
#3 |246| ESTs similar to genes
#4 |456| ESTs
#5 |1096| All genes and ESTs
#6 |1681| Good genes
#7 |40| Replicate genes
#8 |0| HousekeepingGenes
#9 |96| Calibration DNA
#10 |77| Your plates
#11 |46| Empty wells
--------- User Assignable ----------
#12 |0| User Filter Gene Set
#13 |60| Edited Gene List
#14 |0| Normalization Gene Set
--------- User definable------------
#15 |60| The 60 genes closest to Carbonic Anhydrase-III
#16 |30| Named genes in the 60 genes closest to CA-III
#17 |4| Replicate genes in the 60 genes closest to CA-III
2.3.3 Sets of sample conditions menu
In addition, MAExplorer can operate on sets of hybridized samples. For
example, a sample set might be replicate hybridized samples from the
same biological experimental sample, or it could be repeated
experiments on different samples of the same type. (One must be
careful in mixing data between the two cases because of the different
expected sources of variance.) This means you can treat multiple
replicate samples as a distribution and compare the mean values for
each gene in one set of samples with the mean values for another set
of samples. We call these sets of hybridized samples condition lists
or HP lists. You may put one or more HP samples into a condition
set. These sets in turn can be used for computing statistics on
clonal differences between different condition sets. Note that each
condition set may have multiple (i.e. different) samples. These
condition sets are saved with the user state when doing a
(File | Databases | SaveAs DB). As with sets of genes, there are a
number of operations to manipulate HP condition sets in the Sets of
Conditions menu, including:
Condition Lists
===============
Condition[1] #HPs 2, [Initial HP-X: C57B6 pregnancy day 13]
Condition[2] #HPs 2, [Initial HP-Y: Stat5a (-,-) pregnancy day 13]
Condition[3] #HPs 4, [Initial HP-E expression list]
Condition List #1 [Initial HP-X: C57B6 pregnancy day 13]
====================================
HP[1] Pregnancy 13 (1 hr) [C57B6-p13-totalRNA5ug]
HP[2] Pregnancy 13 (1 hr) [C57B6-p13.2poly-A]
2.3.4 Setting user preferences menu
The Preferences submenu is used to set various data labels,
statistical limits and other parameters. These include:
#Use Web DB [CB] - if Web DB was defined, get data from the Web
#Web DB data caching [CB] - if Web DB was defined, cache data
on local computer if getting data from the Web
Cluster on Filtered genes, else all genes [CB] -
genes to use when clustering from current gene [Future]

2.4 Analysis menu
The Analysis menu (see Figure 1.4) contains an ordered list of six
primary menus that may be used, in that order, to perform an initial
analysis. In more complex analyses, the sequence of operations will
vary and include commands selected from other menus, or will use these
menus in a different order. The Analysis submenus are as follows:

2.4.1 GeneClass menu
A gene class (e.g. all named genes, ESTs, oncogenes, etc.) is a set of
genes that belongs to the class of genes in the universe of genes in
the particular microarray database. MAExplorer may restrict the set
of genes by "Gene Class" membership (currently includes All Genes, All
named Genes, ESTs similar to genes, Unknown ESTs, All genes and ESTs,
Good genes, Replicate
genes (i.e. with more than one copy of the gene in the array),
Calibration DNA, genes from user's plates). The additional Gene Class
list of names depends on its availability in a specific database.
- select one of the gene class subsets [Future]
Gene class            | Rule for class membership
===============================================================================
All genes             | all genes on the array
All named             | genes not starting with "EST"
ESTs similar to genes | genes starting with "EST,"
ESTs                  | genes with the name "EST"
Replicate genes       | genes with multiple copies
Calibration DNA       | genes using the configuration file name "calibDNAname"
                      | (optional - see Appendix Table C.4.1)
Your plates           | clones using the configuration file name "yourPlates"
                      | (optional - see Appendix Table C.5.1-C)
Empty Wells           | empty wells where no spot exists on the array, indicated
                      | by keywords "empty", "empty well" or "EmptyWell"
                      | (optional - see Appendix Table C.5.1-C)
Good Genes            | spots on the array where the GIPO QualCheck data was used
                      | and was valid. If it was not used, then it assumes all
                      | spots are good. (optional - see Appendix Table C.4.1)
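The three name-based rules in this table ("All named", "ESTs similar to genes" and "ESTs") can be read as simple string tests. The Python sketch below is an illustration under assumptions: the function name is hypothetical, matching is assumed to be case-sensitive and exact, and the configuration-dependent classes (calibration DNA, user plates, empty wells, good genes) are deliberately left out.

```python
def name_based_class(gene_name):
    """Classify a gene name using only the three name-based rules."""
    if gene_name == "EST":
        return "ESTs"                   # the name is exactly "EST"
    if gene_name.startswith("EST,"):
        return "ESTs similar to genes"  # the name starts with "EST,"
    if not gene_name.startswith("EST"):
        return "All named"              # the name does not start with "EST"
    return None  # starts with "EST" but matches neither EST rule


# Hypothetical gene names used only for illustration.
examples = ["Carbonic Anhydrase-III", "EST, similar to a known gene", "EST"]
classes = [name_based_class(name) for name in examples]
```

Under these rules the three classes are mutually exclusive, which is consistent with the "All genes and ESTs" count in the User Gene Sets listing being the sum of the other three counts (394 + 246 + 456 = 1096).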
2.4.1.1 GeneClass ontology subsets [Future]
If the Set Gene Class subset were activated, it might
include categories such as the following. If the categories exist and
the data is made available to MAExplorer, then it is possible to
specify gene subsets by Gene Class name. It is the responsibility of
the database creator to define a mapping table supporting these named
subsets of named genes.
2.4.1.2 Simulating Gene Class ontologies using Gene Set operations
You can effectively simulate Gene Class ontology subsets using the
following procedure. The trick is to repeatedly define an E.G.L. gene
subset, using the gene name guesser to find the genes of interest,
edit out genes you don't want, and save it as a named gene subset.
Then repeatedly take the OR of the gene sets of interest, saving the
result as a new named set, then take the OR of another gene set with
the set you just created, and so on.
Procedure

2.4.2 Normalization menu
The Normalization menu includes operations to
normalize gene intensity data between hybridized samples. This is
critical in being able to compare samples because of differences in
amount of sample, labeling efficiency and variations in scanner
operation including gain and baseline settings. There are several
methods available including normalizing by Zscore, median, log median,
Zscore of logs, calibration DNA, housekeeping genes, etc. The specific
microarray image quantification is determined by the image analysis
program being used to pre-process the arrays.
Note: although this set of normalization methods is limited, it is
adequate for some analyses of the data. We are in the process of
adding more normalization methods through MAEPlugin methods.
Zscore of intensity [RB] - normalize by the (intensity-mean)/stdDev
of raw intensities for all spots in each sample.
Median intensity [RB] - normalize by the median of the raw
intensities for all spots in each sample (the default normalization).
Log median intensity [RB] - normalize by the log of median scaled raw
intensities for all spots in each sample.
Zscore log intensity, stdDev [RB] - normalize by the Zscore of the
log intensity using (log(intensity)-meanlog)/stdDevlog, with the
standard deviation computed for all spots in each sample.
Zscore log intensity, mnAbsDev [RB] - normalize by the Zscore of the
log intensity using (log(intensity)-meanlog)/meanAbsDevlog, with the
mean absolute deviation computed for all spots in each sample.
By Calibration DNA set of genes [RB] - normalize by the sum of the
'Calibration DNA' genes for each sample (if it exists in your database).
By 'User Normalization Gene Set' [RB] - normalize by the sum of the
genes in a user defined gene set in each sample. You assign this gene
set using the (Edit menu | Gene sets | Assign 'User Normalization
Gene Set') operation.
By housekeeping gene set [RB] - normalize each HP data set by the sum
of the intensity values for known housekeeping genes in each sample
(if it exists in your database).
Scale intensity data to 65K [RB] - scale the data for the microarray
by 65535/maxIntensity for each sample.
Unnormalized [RB] - do not scale data between samples, i.e. use the
raw data.
Use background intensity correction [CB] - enable/disable background
correction to gene intensity measurements.
Use ratio median intensity correction [CB] - enable/disable ratio
median correction to clone intensity measurements by multiplying the
ratio (Cy3/Cy5) by the medianCy5/medianCy3 intensities. If background
correction is enabled, correct by
(medianCy5-medianBkgdCy5)/(medianCy3-medianBkgdCy3).
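The ratio median correction in the last menu item can be sketched in Python as follows. This is an illustration, not MAExplorer's implementation: the function and parameter names are hypothetical, and the sketch assumes that only the correction factor (not the per-spot ratio itself) uses the background medians.

```python
from statistics import median


def ratio_median_correction(cy3, cy5, bkgd_cy3=None, bkgd_cy5=None):
    """Return per-spot Cy3/Cy5 ratios multiplied by the median correction factor."""
    if bkgd_cy3 is None or bkgd_cy5 is None:
        # No background correction: medianCy5 / medianCy3.
        factor = median(cy5) / median(cy3)
    else:
        # Background correction enabled:
        # (medianCy5 - medianBkgdCy5) / (medianCy3 - medianBkgdCy3).
        factor = (median(cy5) - median(bkgd_cy5)) / (median(cy3) - median(bkgd_cy3))
    return [(c3 / c5) * factor for c3, c5 in zip(cy3, cy5)]


# Made-up data in which the Cy3 channel is uniformly twice as bright as Cy5.
cy3 = [100.0, 200.0, 400.0]
cy5 = [50.0, 100.0, 200.0]
corrected = ratio_median_correction(cy3, cy5)
corrected_bkgd = ratio_median_correction(
    cy3, cy5, bkgd_cy3=[20.0, 20.0, 20.0], bkgd_cy5=[10.0, 10.0, 10.0]
)
```

In this example the correction removes the overall channel imbalance, so every corrected ratio becomes 1.0.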
2.4.2.1 Intensity background correction
The background intensity data from the spot quantification programs
may be used to correct spot intensity. Background may be specified as
either a global value or on a per-spot basis. If no background values
are available but the array images have low background, this may not
be much of a problem. For HP i and spot j, the background-corrected
intensity is
$$I'_{ij} = I_{ij} - \mathrm{bkgrdHP}_{i}$$
Ratio computation for Cy3 and Cy5 data
For most MAExplorer operations, the intensity of a gene is generally
computed as the mean intensity of the spots (background corrected or
not) which duplicate that gene on the microarray. When working with
dual hybridized samples using Cy3- and Cy5-dUTP labeling, which results
in green and red fluorescence, the Cy3/Cy5 ratio can be used to
self-normalize the intensity for each hybridized clone array. If
local background is available, then the ratio can be computed for HP h
and spot j as
$$\frac{\mathrm{Cy3}_{hj} - \mathrm{BkgrdCy3}_{hj}}{\mathrm{Cy5}_{hj} - \mathrm{BkgrdCy5}_{hj}}$$
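As a small worked example of the background-corrected ratio for one spot, consider the Python sketch below. The function name and the values are hypothetical, and returning None when the corrected Cy5 value is not positive is an assumption of this sketch; the manual does not say how such spots are handled.

```python
def background_corrected_ratio(cy3, cy5, bkgrd_cy3=0.0, bkgrd_cy5=0.0):
    """Return (Cy3 - BkgrdCy3) / (Cy5 - BkgrdCy5) for a single spot."""
    denominator = cy5 - bkgrd_cy5
    if denominator <= 0.0:
        return None  # assumption: the ratio is treated as undefined here
    return (cy3 - bkgrd_cy3) / denominator


# Spot with local background 20 in both channels: (120-20)/(70-20).
ratio = background_corrected_ratio(120.0, 70.0, bkgrd_cy3=20.0, bkgrd_cy5=20.0)
# Spot whose Cy5 signal equals its background.
undefined = background_corrected_ratio(120.0, 20.0, bkgrd_cy3=20.0, bkgrd_cy5=20.0)
```

With the default background of 0.0 the function reduces to the plain Cy3/Cy5 ratio.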
2.4.2.2 Normalization between microarrays to allow comparison
The normalization of quantitative data is crucial when comparing data
between different microarray samples. There are a number of different schemes
possible. One is to normalize by the sum of known calibration,
housekeeping genes or other "constant expression" genes in the
microarray. Another is to sum the background corrected integrated
density for all spots in an array and to normalize individual gene
measurements by that sum. These methods are now described in more
detail. As the MAEPlugins
facility becomes available, we will be adding a number of more
sophisticated gene-specific normalization methods that take many of
the problems specific to microarrays into account.
Normalizing by scaled Zscore of intensity
The "normalized Zscore of intensity" method normalizes each hybridized
sample by the mean and standard deviation of the raw intensities for
all of the spots in that sample. The mean intensity
$\mathrm{mnI}_i$ and the standard deviation
$\mathrm{sdI}_i$ are computed for the raw intensity of 'Good
genes'. It is useful for standardizing the mean (to 0.0) and the range
of data between hybridized samples to about -4.0 to +4.0. When using the
Zscore, you compute Zdiff(erences) not ratios. The Zscore intensity
$\mathrm{Zscore}_{ij}$ for intensity $I_{ij}$
for HP i and spot j is computed as
$$\mathrm{Zscore}_{ij} = (I_{ij} - \mathrm{mnI}_i)/\mathrm{sdI}_i,$$
and
$$\mathrm{Zdiff}_j(x,y) = \mathrm{Zscore}_{xj} - \mathrm{Zscore}_{yj}.$$
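The Zscore and Zdiff computations can be sketched in Python as shown below. The names are hypothetical, and the use of the sample standard deviation (n-1 denominator) is an assumption, since the manual does not state which form is used. The optional `good` argument stands in for the intensities of the 'Good genes' from which the mean and standard deviation are computed.

```python
from statistics import mean, stdev


def zscore_normalize(intensities, good=None):
    """Return (I - mean) / stdDev for every spot of one sample.

    The mean and standard deviation come from `good` (the 'Good genes'
    intensities) when it is given, otherwise from all spots.
    """
    reference = intensities if good is None else good
    mn, sd = mean(reference), stdev(reference)  # sample stdDev: an assumption
    return [(x - mn) / sd for x in intensities]


def zdiff(zscores_x, zscores_y):
    """Return the per-spot Zscore difference between HP-X and HP-Y."""
    return [zx - zy for zx, zy in zip(zscores_x, zscores_y)]


# Made-up intensities for three spots in two samples.
z_x = zscore_normalize([1.0, 2.0, 3.0])
z_y = zscore_normalize([3.0, 2.0, 1.0])
differences = zdiff(z_x, z_y)
```

Because each sample is centered and scaled separately, the two samples are compared by subtraction (Zdiff) rather than by a ratio.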
Normalizing by the median of intensity
The "Median intensity" method normalizes each hybridized sample by
the median of the raw intensities of 'Good genes' for all of the
spots in that sample. It is a useful normalization to use when you want
to compute X/Y ratios between hybridized samples.
$$\mathrm{Im}_{ij} = I_{ij}/\mathrm{medianI}_i$$
Normalizing by the log of median of intensity
The "Log median intensity" method normalizes each hybridized sample
by the log of median scaled raw intensities of 'Good genes' for all
of the spots in that sample. The value 1.0 is added to the intensity
value to avoid taking the log(0.0) when intensity has zero value. This
is a useful normalization to use when you want to compute X/Y ratios
between hybridized samples and compress the scale. Because we are computing
a log, we report the difference between HP-X and HP-Y as (X-Y) instead
of a ratio (X/Y).
$$\mathrm{Im}_{ij} = \log(1.0 + I_{ij}/\mathrm{medianI}_i)$$
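The two median-based normalizations can be sketched in Python as follows. The function names are hypothetical, the median is taken over all values passed in (standing in for the 'Good genes'), and the natural logarithm is an assumption because the manual does not state the base of the log.

```python
import math
from statistics import median


def median_normalize(intensities):
    """Divide every spot intensity of one sample by that sample's median."""
    med = median(intensities)
    return [x / med for x in intensities]


def log_median_normalize(intensities):
    """Return log(1.0 + I / median); the 1.0 avoids log(0.0) for zero intensity."""
    return [math.log(1.0 + v) for v in median_normalize(intensities)]


# Made-up intensities for three spots.
scaled = median_normalize([10.0, 20.0, 40.0])
log_scaled = log_median_normalize([0.0, 20.0, 40.0])
```

A spot with zero intensity maps to 0.0 after the log transform rather than causing an error, which is the purpose of adding 1.0.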
Normalizing by scaled Zscore of log intensity, standard deviation
The "Normalize by Zscore of log intensity, stdDev" method normalizes
each hybridized sample by the mean and standard deviation of the
logs of the raw intensities for all of the spots in that sample. The
mean log intensity $\mathrm{mnLI}_i$ and the standard
deviation of the log intensity $\mathrm{sdLI}_i$ are computed for the
log of raw intensity of 'Good genes'. Then the Zscore intensity
$\mathrm{ZlogS}_{ij}$ for HP i and spot j is
$$\mathrm{ZlogS}_{ij} = (\log(I_{ij}) - \mathrm{mnLI}_i)/\mathrm{sdLI}_i$$
Normalizing by scaled Zscore mean absolute deviation of log intensity
The "Normalize by Zscore of log intensity, mean absolute deviation"
method normalizes each hybridized sample by the mean and mean
absolute deviation of the logs of the raw intensities for all of the
spots in that sample. The mean log intensity $\mathrm{mnLI}_i$
and the mean absolute deviation of the log intensity
$\mathrm{madLI}_i$ are computed for the log of raw intensity
of 'Good genes'. Then the Zscore intensity
$\mathrm{ZlogA}_{ij}$ for HP i and spot j is
$$\mathrm{ZlogA}_{ij} = (\log(I_{ij}) - \mathrm{mnLI}_i)/\mathrm{madLI}_i$$
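Both Zscore-of-log variants differ only in the measure of spread used in the denominator. The Python sketch below illustrates this with hypothetical names; it assumes strictly positive intensities (the formulas take log(I) without adding 1.0) and the sample standard deviation. The base of the logarithm does not affect the result, because changing the base rescales the numerator and denominator by the same factor.

```python
import math
from statistics import mean, stdev


def zscore_log_normalize(intensities, spread="stdDev"):
    """Return (log(I) - meanLog) / spreadLog for every spot of one sample.

    spread="stdDev" uses the standard deviation of the logs; any other
    value uses the mean absolute deviation of the logs.
    """
    logs = [math.log(x) for x in intensities]  # requires intensities > 0
    mn = mean(logs)
    if spread == "stdDev":
        scale = stdev(logs)  # sample stdDev: an assumption
    else:
        scale = mean([abs(v - mn) for v in logs])
    return [(v - mn) / scale for v in logs]


# Made-up intensities spaced evenly on a log scale.
zlog_s = zscore_log_normalize([1.0, 10.0, 100.0], spread="stdDev")
zlog_a = zscore_log_normalize([1.0, 10.0, 100.0], spread="mnAbsDev")
```

For the same data the mean absolute deviation is smaller than the standard deviation, so the mnAbsDev variant gives values of larger magnitude.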
By 'User Normalization Gene Set'
This method is useful when a subset of genes has been determined to have
relatively constant expression across the set of samples. It
normalizes by the sum of intensities for a subset of genes defined by
the user in the 'User
Normalization Gene Set' (Section 2.3.2) using the gene set editing
commands. Normalizing by the sum of genes uses the sum
$\mathrm{Igs}_i$ that is computed for microarray $HP_i$
from the intensities $I_{ij}$ of all genes j in
the gene subset:
$$\mathrm{Igs}_i = \sum_{j \,\in\, \text{gene subset}} I_{ij}$$
Then, the normalized intensity $I'_{ij}$ is computed as:
$$I'_{ij} = I_{ij}/\mathrm{Igs}_i$$
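Normalizing by the sum of a gene subset can be sketched in Python as follows. The function name, the dictionary layout and the gene names are hypothetical; the sketch only illustrates the two formulas above for a single sample.

```python
def normalize_by_gene_set(intensities, gene_set):
    """Divide every gene intensity of one sample by the summed intensity
    of the genes in the normalization gene set.

    intensities: dict mapping gene name -> intensity for one sample.
    gene_set:    the genes assumed to have relatively constant expression.
    """
    gene_set_sum = sum(intensities[gene] for gene in gene_set)
    return {gene: value / gene_set_sum for gene, value in intensities.items()}


# Made-up sample; g1 and g2 form the normalization gene set (sum = 40).
sample = {"g1": 10.0, "g2": 30.0, "g3": 20.0}
normalized = normalize_by_gene_set(sample, {"g1", "g2"})
```

All genes in the sample are normalized, not only the members of the gene set; the gene set only determines the divisor.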
By 'Calibration DNA' set
If a predefined set of calibration DNA genes are available on the
array, they may be used to normalize density values between the
samples. The calibration DNA genes are defined by special gene names
that are declared in the Configuration file using the 'calibDNAname'
parameter (see Appendix C Table C.5.1(C)). If
there is no calibration DNA, this entry is not used. The algorithm is
the same as "User Normalization Gene Set" (above), but the set is
predefined as the genes flagged as calibration DNA. For example, in
the MGAP database, these spots are the "mouse genomic DNA" spots so
the Configuration file entry would be calibDNAname="m.g. DNA".
Scaling intensity data to 65K
Another method, "Scale intensity data to 65K", scales the maximum
intensity of each sample to 65K (the maximum intensity). Since the
raw scanned data is often 16-bit, it can have a maximum value of
65535 (2^16 - 1), and so this does minimal scaling. This method
may make it easier to view the data initially using the
pseudoarray image. However, it may not properly scale the data between
arrays and should probably not be used in quantitative comparisons.
No normalization
You may also want to look at the raw intensity (or Cy3 and Cy5
channel) data. Turning off normalization gives you the raw data read
into MAExplorer.
2.4.2.3 Using different normalizations to 'see' different data views
Changing the normalization method will sometimes make differences
between data sets more apparent. The following figure shows the same
data in two different scatter plots but with two different
normalizations.

2.4.3 Filter menu
The final set of genes presented for display, plotting, reports, etc.
is determined by a cascade of gene "data filters" that generate a
restricted gene set. The cascade is computed in real-time using the intersection of individual
criteria and tests selected by the user. Examples of Filter
criteria include: membership in a particular gene set, ratio
(HP-X/HP-Y) within a range, passing statistical tests such as t-tests
or F-test, etc.
Filter by GeneClass membership [CB] - only include genes
that are members of the current GeneClass.
Filter by 'User Filter Gene Set' membership [CB] - only
include genes that are members of the current 'User Filter Gene
Set'.
Filter by 'Edited Genes List' membership [CB] - only include
genes that are members of the 'Edited Gene List'.
Filter by global 'Good Genes List' membership [CB] - only
include genes that are members of the list of good genes. [These
genes are identified by a QualCheck entry in the GIPO database
file.]
Filter by 'Genes with replicates' [CB] - only include
genes that have at least 2 copies of the gene
replicated on the array. Note: duplicated
genes (i.e. F1, F2, etc) are not considered replicates
for this purpose.
Filter by ratio or Zdiff histogram bin [CB] - only include
genes that are in the range of the ratio or Zdiff histogram
bin you have clicked on (should be set from histogram plot, but may be
turned off here)
Filter by intensity or (Cy3/Cy5) histogram bin [CB] - only
include genes that are in the range of the intensity histogram
bin you have clicked on (should be set from histogram plot, but may be
turned off here)
- filter by positive intensity data
if the data may contain negative numbers. Otherwise it will use
both positive and negative data. If the database has 2 channels
(F1, F2) or (Cy3,Cy5) each channel is checked. If the background
correction is enabled, the background corrected values are
tested to see if any of them are negative.
Filter by genes with non-zero intensity [CB] - only
include genes that have non-zero density. This protects against
zero data that may be present in the database when taking logs of
the data.
- filter out genes that do
not have "Good Spot" values (defined by the optional QualCheck
spot data on a per-sample (i.e. HP) basis. See the list of
codes in Appendix
C.4). If there is no such spot quality data, then all spots
are considered "good".
- filter out genes that do not have
"Detection Value" values (defined by the optional DetValue or
CorrCoef spot data on a per-sample (i.e. HP) basis). Typical
Detection Values could be the Affymetrix MAS5.0 "Detection
p-value" or other continuous value of spot detection quality.
- filter by individual spot intensity
(Cy3 and Cy5 channels if ratio data) within [SI1:SI2] threshold
ranges
- filter by gene expression (or
Cy3/Cy5 if ratio data) within [I1:I2] threshold ranges
- filter by gene ratios or Zdiff values
within [R1:R2] or [Z1:Z2] threshold ranges (depending on the
normalization method)
- filter by gene ratios or Zdiff values
within [CR1:CR2] or [CZ1:CZ2] threshold ranges (depending on the
normalization method). This is useful for filtering data from a
single sample.
- filter
out genes that do not meet minimum Coefficient of Variation
(CV) values of spot replicates (F1 and F2 for the same HP,
replicates in HP-X and HP-Y 'sets' of samples etc.).
Filter by HP-X,HP-Y t-Test
[p-value] slider [RB] - only include genes that meet the
HP-X,HP-Y t-Test criteria if they have (F1,F2) duplicate spots
(this is a weak form of the t-Test).
Filter by HP-X,HP-Y 'sets'
t-Test [p-value] slider [RB] - only include genes that meet the
HP-X,HP-Y 'sets' t-Test criteria (only works if using HP-X and
HP-Y 'sets' mode where there are replicate samples).
Filter by HP-X,HP-Y 'sets'
Kolmogorov-Smirnov test [p-value] slider [RB] - only include
genes that meet the HP-X,HP-Y 'sets' KS-Test criteria (only
works if using HP-X and HP-Y 'sets' mode where there are
replicate samples).
Filter by current Ordered
Condition List (OCL) F-Test [p-Value] slider [RB] - only
include genes that meet the F-test criteria on the current
OCL. This only works if there are at least 2 (replicate)
samples/condition for each of the condition sets in the OCL.
See info on defining
the OCL and
using the OCL data.
Filter by HP-E clustering
[Cluster dist] slider [CB] - only include genes that meet the
clustering criteria (alternatively, see the Cluster menu
commands).
Filter by Diff(HP-X,HP-Y) [Abs.Diff.] slider [CB] - only
include genes whose absolute difference between mean HP-X and
HP-Y (single or 'sets') is < threshold.
Filter N genes with highest
X/Y ratio or X-Y Zdiff [CB] - look at highest ratios or Zdiff
values. The value of N is set in the Edit menu preferences.
Filter N genes with lowest
X/Y ratio or X-Y Zdiff [CB] - look at lowest ratios or Zdiff
values. The value of N is set in the Edit menu preferences.
Current HP [RB] - spots in the current sample
HP-X & HP-Y [RB] - spots in X and Y single samples
HP-X or HP-Y 'sets' [RB] - spots in the HP-X set or HP-Y set
HP-X & HP-Y 'sets' [RB] - spots in both the HP-X set and
HP-Y set
HP-E [RB] - spots in HPs in expression profile list
Current HP [RB] - spots in the current sample
HP-X and HP-Y [RB] - spots in X and Y single samples
HP-X or HP-Y 'sets' [RB] - spots in HP-X set or HP-Y set
HP-X and HP-Y 'sets' [RB] - spots in HP-X set and HP-Y set
HP-E [RB] - spots in HPs in expression profile list
Current HP [RB] - spots in the current sample
HP-X and HP-Y [RB] - spots in X and Y single samples
HP-X or HP-Y 'sets' [RB] - spots in HP-X set or HP-Y set
HP-X and HP-Y 'sets' [RB] - spots in HP-X set and HP-Y set
HP-E [RB] - spots in HPs in expression profile list
Use spot intensity [SI1:SI2] sliders [CB] - use spot
intensity thresholding
Inside [RB] - test inside of [SI1:SI2] range
Outside [RB] - test outside of [SI1:SI2] range
-
specify which samples are tested
-
specify which additional constraints are used. This is useful
for finding genes with high or low expression that have some
samples with the opposite expression.
Current HP [RB] - spots in the current sample
HP-X & HP-Y [RB] - spots in X and Y single samples
HP-X & HP-Y 'sets' [RB] - spots in HP-X set and HP-Y set
HP-E [RB] - spots in HPs in expression profile list
ALL channels [RB] - ALL channels must meet the range specification
ANY channels [RB] - ANY channels may meet the range specification
AT MOST channels [RB] - AT MOST Percent SI OK channels
may meet the range specification
AT LEAST channels [RB] - AT LEAST Percent SI OK channels
may meet the range specification
PRODUCT of channels [RB] - the PRODUCT of all channels must meet
the range specification
SUM of channels [RB] - the SUM of all channels must meet the
range specification
Use intensity [I1:I2] sliders [CB] - use spot intensity
thresholds I1 (lower) and I2 (upper)
Inside [RB] - test for intensity inside of [I1:I2]
Outside [RB] - test for intensity outside of [I1:I2]
Use ratio [R1:R2] or Zdiff [Z1:Z2] sliders [CB] - use
spot ratio [R1:R2] or Zdiff [Z1:Z2] range thresholds
Inside [RB] - test inside of [R1:R2] or [Z1:Z2] range
Outside [RB] - test outside of [R1:R2] or [Z1:Z2] range
Use spot [CV] slider [CB] - apply one of the spot CV filter
modes as a Filter and popup a CV slider to set the threshold
- select
samples to be used in computing the CV
Use mean else max of CVs [CB] - compute the CV as the maximum
or the mean of the CVs of the samples selected
Filtering using a statistical test by selecting a p-value
These tests pass only those genes for which
the resulting p-value of that test is <= the
value specified by the p-Value state slider. Only one test may be
active at a time. If you switch to a new p-value test, it will disable
the previous p-value test. If any of these tests are selected, it
will pop up the p-Value state slider window for you to set the
p-Value. There are two t-tests: one operating on duplicate (F1,F2)
data if available, and one operating on the HP-X,HP-Y 'sets' if they
are defined. The
Kolmogorov-Smirnov test operates on HP-X,HP-Y 'sets' if they are
defined. The F-test operates on the current Ordered
Condition List (OCL) consisting of any number of condition lists each
containing at least 2 (replicate) samples/condition.
Filtering out genes with high replicate spot variation
The Spot CV filter mode submenu contains options to select how
the spot CV filter is to be applied. It computes the maximum value of
CV for all of the samples in the particular sample set specified. That
maximum value is then used for the spot CV filter test. Genes may be
filtered out having a large difference between spot quantification
values of corresponding duplicate spots. You may compute the
coefficient of variation $CV_j$ for the two values
($f1_j$ and $f2_j$) for a particular
gene j.
$CV_j = 2|f1_j - f2_j| / (f1_j + f2_j)$
If the database only has one field but replicate HPs, then you may use
the HP-X & HP-Y 'sets' CVj to filter the
genes. Then CVj values are tested against a CV
threshold slider value to eliminate genes with a high coefficient of
variation.
Current HP [RB] - CV of (F1,F2) for each gene in current
sample [if duplicate spots are available on each sample]
HP-X or HP-Y [RB] -
CV of (F1,F2) for HP-X and HP-Y single samples [if duplicate
spots are available on each sample]
HP-X 'set' [RB] - CV of spots in HP-X set
HP-Y 'set' [RB] - CV of spots in HP-Y set
HP-X or HP-Y 'sets' [RB] - CV of spots in the HP-X set or HP-Y
set
HP-X and HP-Y 'sets' [RB] - CV of spots in both the HP-X
set and HP-Y set
HP-E [RB] - CV of HPs in expression profile list
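The duplicate-spot CV test described in this section can be sketched in Python as follows. This is an illustration under stated assumptions, not MAExplorer's implementation: the function names are hypothetical, the threshold comparison is assumed to be inclusive, and a gene whose two spot values sum to zero is assumed to fail the test.

```python
def spot_cv(f1, f2):
    """CV_j = 2|f1_j - f2_j| / (f1_j + f2_j) for one gene's duplicate spots."""
    total = f1 + f2
    if total == 0:
        return float("inf")  # assumption: undefined CV fails any threshold
    return 2.0 * abs(f1 - f2) / total

def filter_by_cv(duplicates, threshold):
    """Keep genes whose duplicate-spot CV does not exceed the threshold."""
    return {gene for gene, (f1, f2) in duplicates.items()
            if spot_cv(f1, f2) <= threshold}

# Hypothetical (F1, F2) values for three genes in one sample.
dups = {"geneA": (100.0, 110.0), "geneB": (100.0, 300.0), "geneC": (50.0, 50.0)}
kept = filter_by_cv(dups, 0.25)
```

Here geneB is dropped because its two spots differ by as much as their mean (CV = 1.0), while geneA and geneC pass.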
2.4.3.1 Data filtering using multiple gene data filters
Any or all of the data filters may be selected simultaneously. In
particular, if you select filters that use parameter threshold
scrollers, they will be added to a state scroller window (see Figure
2.3.4.1 for details) to allow adjustment of all sliders
simultaneously. You may change various thresholds and see the effect
in real time. Note: some of the scrollers are more sensitive to low
values. Therefore, we set them to respond non-linearly with a more
precise vernier at the low end.
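Because the working gene set is the intersection of all active filters, the cascade can be pictured as below. This Python sketch is only a model of the idea: the function name, the predicate representation of a filter, and the example thresholds are assumptions, not MAExplorer internals.

```python
def apply_filters(all_genes, filters):
    """Return the genes that pass every active filter (set intersection)."""
    result = set(all_genes)
    for passes in filters:
        result &= {gene for gene in result if passes(gene)}
    return result

# Hypothetical per-gene data for the current samples.
intensity = {"geneA": 1200.0, "geneB": 0.0, "geneC": 300.0, "geneD": 5000.0}
ratio = {"geneA": 2.4, "geneB": 1.0, "geneC": 0.3, "geneD": 1.1}

active_filters = [
    lambda g: intensity[g] != 0,             # non-zero intensity filter
    lambda g: not (0.5 <= ratio[g] <= 2.0),  # ratio outside [R1:R2] = [0.5:2.0]
]
working_set = apply_filters(intensity, active_filters)
```

Changing a slider corresponds to replacing one predicate and recomputing the intersection, which is why the order of the filters does not affect the result.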

2.4.4 Plot menu
The Plot menu lets you display a pseudoarray image, scatter plots,
ratio and intensity histograms, and expression profile plots. The
pseudoarray image is displayed in the main MAExplorer window. All of
the other plots are displayed in popup windows. Depending on the
particular plot, multiple instances may be allowed. The Plot
submenus are:
- display the pseudoarray image
for the current HP sample
- display various scatter plots
of selected pairs of HP samples
- display both ratio and
intensity histograms of HP sample data
- display expression profile
plots of genes or gene subsets
2.4.4.1 Show microarray pseudoarray images menu
You may show the pseudoarray
image of the current hybridized samples using several
modalities. The grayscale pseudoarray image is generated from the
quantified spot data. If the data contains the actual spot positions
of the genes (as generated by the various array image quantification
programs), the spots may be drawn using a scaled version of those
coordinates. Otherwise, a generic set of grids (and fields, if there
are multiple fields) is synthesized to represent the spot
positions. Pseudoarray images may also be useful as an alternative
modality for displaying X/Y ratio or X-Y Zdiff data. If the
normalized intensities are the same, then the spot will appear as
black with the overall spot intensity depending on the spot
concentrations. High ratios and Zdiffs will be red and low values
green as shown in Table
2.4.4.1. The p-Value results of comparing a HP-X 'set' with a
HP-Y 'set' of samples can be displayed as a color spectrum pseudoarray
image.
Pseudograyscale intensity [RB] - display a pseudograyscale
microarray pseudoarray image for the current H.P. where higher
intensity spots are mapped to black.
Pseudocolor Red(X)-Yellow-Green(Y) HP-XY ratio or Zdiff [RB] -
display the HP-X and HP-Y microarrays as a pseudocolor image as
either a ratio (HP-X/HP-Y) or Zdiff (HP-X - HP-Y) if Zscore
normalization is in effect. The HP-X (red) and HP-Y (green)
components are added together so that high total
intensity values are yellow and low values are black. High
HP-X/HP-Y ratio values are more red and low values are more
green.
Pseudocolor Red(Cy5)-Yellow-Green(Cy3) Cy3/Cy5 data [RB] -
display the Cy3 and Cy5 microarray channels as a pseudocolor
array image. The Cy5 (red) and Cy3 (green) components are
added together so that high total intensity values are
yellow and low values are black. High Cy3/Cy5 ratio values are
more green and low values are more red. Note: this is only
available with Cy3, Cy5 ratio data. Note: if the "Use ratio
median corrections" option is enabled, the Cy3 channel is scaled
to the domain of the Cy5 channel by (median Cy5/median Cy3).
Pseudocolor HP-X/Y ratio or Zdiff [RB] - display the HP-X and
HP-Y data as a pseudocolor array image as either a ratio
(HP-X/HP-Y) or Zdiff (HP-X - HP-Y) if Zscore normalization is in
effect. The high ratios or Zscores are red and low values are
green. Values in the middle are black.
Pseudocolor (HP-X,HP-Y) 'sets' p-value [RB] - displays each
spot as a pseudocolor proportional to the p-Value in a t-Test of
the HP-X 'set' vs HP-Y 'set'. This can only be used when the
"Use HP 'sets' ..." option is enabled and there are at least 2
samples in both the HP-X 'set' and the HP-Y 'set'. Note that
unless proper normalization and filtering is used to remove poor
quality data, some of the p-Values will not be significant and
there will be a high false-positive rate. Use this display with
that in mind.
Pseudocolor HP-EP 'list' CV (Coefficient of Variation)
[RB] - displays each spot as a pseudocolor proportional to
the CV of the spots within the subset of samples in the current
HP-EP 'list'. It is useful to look at the variation in a set of
replicate samples. Use the (Samples | Choose HP-X, HP-Y and HP-E
samples) command to define the HP-EP list of samples you want to
investigate.
Pseudocolor Cy3/Cy5 (or F1/F2) ratio or Zdiff [RB] - display
the Cy3 and Cy5 (or F1 and F2 for duplicate spotted grids)
corresponding spots for the same genes as a pseudocolor array
image as either a ratio (Cy3/Cy5) or Zdiff (Cy3-Cy5) if Zscore
normalization is in effect. The high ratios or Zscores are red
and low values are green. Values in the middle are black.
Flicker HP-X & HP-Y Pseudoimages [RB] - toggle flickering the
HP-X and HP-Y pseudograyscale intensity array images.
Use dual HP-X & HP-Y Pseudoimage [CB] - If you have replicate
spots (F1,F2), toggle the display between F1 data in the left grids
and F2 data in the right grids, and mean HP-X data in the left
grids and HP-Y data in the right grids, for side by side comparisons.
Scale pseudoarray image by
1/100 to zoom low-range values [CB] - rescales intensity and
(Cy3+Cy5) and (HP-X + HP-Y) (Red-Yellow-Green) plots so that
mid values are easier to visualize.
Scale box                        1             2       3       4           5       6            7      8      9
Normalization mode - RGB         bright green  .       .       dark green  Black   dark red     .      .      bright red
Normalization mode - dichromasy  bright blue   .       .       dark blue   Black   dark orange  .      .      bright orange
Ratio data                       <0.250X       0.307X  0.400X  0.571X      1.000X  1.75X        2.50X  3.25X  >4.00X
Zscore data                      <-3.0         -2.25   -1.50   -0.75       0.00    0.75         1.50   2.75   >3.0
Zscore Log data                  <-0.99        -0.742  -0.495  -0.247      0.000   0.247        0.495  0.742  >0.99
Clicking on a particular gene will report its specific quantification
and identification values (See
Section 3.3 on gene quantification). If the Enable display current
gene in popup genomic DB Web Browser option is set in the
View menu, then it will also pop up a Web browser with the
data corresponding to that gene in the particular genomic DB,
if it exists.
2.4.4.1.1 Examples of microarray intensity data pseudoarray image
The relative intensity may be displayed for the current sample (last
HP-X or HP-Y selected) or two samples (HP-X and HP-Y samples or HP-X and
HP-Y 'sets' of samples). To show two samples side by side, enable the
Use dual HP-X & HP-Y Pseudoimage option in the Show Microarray
submenu. To show averaged set data in the dual mode, enable the "Use
HP-X & HP-Y 'sets'" option in the Samples menu. The grayscale
value reflects the current normalization mode.
2.4.4.1.2 Example of microarray ratio or Zdiff data pseudocolor image
The ratio (HP-X/HP-Y) or Zdiff (HP-X - HP-Y) normalized intensity data
may be displayed as a pseudoarray image for HP-X and HP-Y, or HP-X and
HP-Y 'sets' of samples. To show averaged set data in the dual mode,
enable the "Use HP-X & HP-Y 'sets'" option in the Samples menu.
The colors of the 9 scale boxes represent the normalized expression
ranges and are assigned according to the current normalization mode
listed in the table.
2.4.4.2 Scatter plots menu
Scatter plots include HP-X vs HP-Y intensity for comparing data
between HP-X and HP-Y samples (or sets if the HP-X and -Y sample
'sets' mode is enabled in the Samples menu - as is shown in Figure 2.2.1). You may
zoom into any area of the scatter plot as is shown in Figure
2.4.4.4.2(C). If there are duplicate spots for each gene, you may
plot F1 vs F2 intensity (or Cy3 vs Cy5 if using ratio data) for
comparing replicate data (or Cy3 and Cy5 ratio data channels) within
the same sample. It will also compute the correlation coefficient for
the data and display it in the plot and in the message panel. The data
is the intensity values using the current normalization method. If you
are analyzing ratio Cy3/Cy5 data, you may compare Cy3 or Cy5 of the
HP-X sample against Cy3 or Cy5 of the HP-Y sample. If you are in
stand-alone mode, a SaveAs GIF button will also be
available. This saves the current plot as a full resolution GIF file
specified by the user in a popup file browser window.
rSq=0.974, n=1728, X(mn+-sd)=(4.477+-7.845), Y(mn+-sd)=(12.379+-24.810)
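The quantities in a status line like the one above can be computed roughly as in the following Python sketch. It assumes that rSq is the squared Pearson correlation coefficient and that sd is the sample (n-1) standard deviation; the manual does not state either, and the function name and example data are hypothetical.

```python
import math
import statistics

def scatter_stats(xs, ys):
    """Summary statistics for paired X and Y intensities of the same genes."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    r = sxy / math.sqrt(sxx * syy)  # Pearson correlation coefficient
    return {
        "rSq": r * r,
        "n": len(xs),
        "X": (mx, statistics.stdev(xs)),
        "Y": (my, statistics.stdev(ys)),
    }

# Hypothetical normalized intensities for four genes in HP-X and HP-Y.
stats = scatter_stats([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0])
line = ("rSq={:.3f}, n={}, X(mn+-sd)=({:.3f}+-{:.3f}), "
        "Y(mn+-sd)=({:.3f}+-{:.3f})").format(
            stats["rSq"], stats["n"], *stats["X"], *stats["Y"])
```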
The Scatter plots submenu includes:
Scatter plots of data from multiple channels on the same sample
It is also possible to plot the separate channels within a single sample
against each other. For example F1 vs F2 in samples with replicate
data and Cy3 vs Cy5 in samples with separate ratio data channels.
2.4.4.3 Histogram plots menu
You may compare ratios or Zdiffs of data using the HP-XY ratios or
Zdiff command to display a ratio histogram of Filtered intensity
data from two samples selected from the Samples menu. The HP-XY
'set' ratio or Zdiff is used if there are multiple samples in the
HP-X or HP-Y sets; the mean values in each of the sets are then used in
the calculations. If there are duplicate spots for each gene, you may
plot the F1F2 ratio or Zdiff histogram of the F1/F2 ratios or
F1-F2 Zdiff values for normalized data for each spot in the currently
displayed sample. If you are in stand-alone mode, a SaveAs GIF
button will also be available to save the current plot as a full
resolution GIF file specified by the user in a popup file browser
window.
2.4.4.4 Expression profile plots menu
You may generate an individual expression profile plot (EP plot) or a
scrollable list of EP plots. The ordered list of hybridized samples to
plot is specified by the HP-E set. In the latter case, the genes are
specified by the data Filter.
Use EP overlay else EP list [CB] - display Filtered genes
as an overlay plot of expression profiles, else as a scrollable
list of EP plots.

2.4.5 Cluster menu
The Clustering menu lets you perform various types of gene and condition
clustering operations. When you invoke a clustering operation it will
popup one or more windows and may modify the pseudoarray image. Some
of the popup windows include clustergram and dendrogram analysis plots
used with the hierarchical clustering.
Use of clustering to find patterns of similar gene expression
Clustering is a way of possibly finding co-expressed genes that
exhibit similar expression changes in a set of samples. Genes may show
similar co-expression, but that does not prove they are co-regulated
at the same point in a pathway - merely that measurements of those
genes in a particular set of experiments show similar
expression. However, identifying genes with similar expression for
which some information is already known about some of the genes may be
useful as a starting point to help figure out gene function and
possibly aspects of its pathways in cell function using additional
experiments and analysis.
Hint: when working with very large data sets with many samples, it
may be useful to pre-adjust the distance and/or number of clusters
threshold sliders to an approximate range using the (Edit Menu |
Preferences | Adjust all Filter threshold scrollers). This is because
once the clustering starts, it does not (currently) let you abort the
clustering to change the threshold value.
$LSQdist_{ij} = \sqrt{\sum_{h} (D'_{hj} - D'_{hi})^2} \, / \, n$
where h is in HP-E, and i, j are in the Filtered genes with i not equal to j.
Let,
$sum_{ij} = \sum_{h} D'_{hj} D'_{hi}$,
$mn_i = (1/n) \sum_{h} D'_{hi}$,
$mn_j = (1/n) \sum_{h} D'_{hj}$,
$sumSq_i = \sum_{h} D'_{hi} D'_{hi}$,
$sumSq_j = \sum_{h} D'_{hj} D'_{hj}$,
then,
$r_{ij} = \frac{sum_{ij} - n \, mn_i \, mn_j}{\sqrt{sumSq_i - n \, mn_i^2} \; \sqrt{sumSq_j - n \, mn_j^2}}$
where h is in HP-E, and i, j are in the Filtered genes with i not equal to j.
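The two similarity measures can be sketched in Python as follows. The function names are hypothetical, n is assumed to be the number of samples in HP-E, and the correlation is written in the standard Pearson form (each denominator term is sumSq - n*mn*mn, so that a profile correlates perfectly with itself). Using 1.0 - r as the clustering distance follows the "Use correlation-coefficient else Euclidian-distance" option.

```python
import math

def lsq_distance(ep_i, ep_j):
    """Euclidean distance between two expression profiles, divided by n."""
    n = len(ep_i)
    return math.sqrt(sum((b - a) ** 2 for a, b in zip(ep_i, ep_j))) / n

def correlation(ep_i, ep_j):
    """Pearson correlation coefficient r_ij of two expression profiles."""
    n = len(ep_i)
    sum_ij = sum(a * b for a, b in zip(ep_i, ep_j))
    mn_i = sum(ep_i) / n
    mn_j = sum(ep_j) / n
    sum_sq_i = sum(a * a for a in ep_i)
    sum_sq_j = sum(b * b for b in ep_j)
    numerator = sum_ij - n * mn_i * mn_j
    denominator = (math.sqrt(sum_sq_i - n * mn_i * mn_i)
                   * math.sqrt(sum_sq_j - n * mn_j * mn_j))
    return numerator / denominator

def correlation_distance(ep_i, ep_j):
    """Distance used when the correlation-coefficient option is selected."""
    return 1.0 - correlation(ep_i, ep_j)

# Hypothetical normalized expression profiles over four HP-E samples.
ep_a = [1.0, 2.0, 3.0, 4.0]
ep_b = [2.0, 4.0, 6.0, 8.0]
ep_c = [4.0, 3.0, 2.0, 1.0]
```

Note how ep_a and ep_b have the same shape and so a correlation distance near 0, although their Euclidean distance is not small; this is the practical difference between the two metrics.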
Cluster genes with
expression profiles similar to current gene [RB] - click on
gene in image to find other genes with similar HP-E expression
profiles whose cluster distance is less than the cluster
distance threshold. The larger the blue
box, the higher the similarity.
Cluster counts of
similar Filtered genes by expression profiles [RB] - draw
blue circles around filtered genes
indicating the number of other genes whose cluster similarity is
less than the cluster distance threshold. The larger the circle,
the more similar genes were found. Clicking on a gene switches
to the above mode.
K-means clustering of
gene expression profiles [RB] - draw
magenta circles around the N primary-node genes, each being
the gene closest to the center of its
cluster. Each of the nodes is a maximum distance from all other
nodes in the recursive definition of nodes. N is determined by
the State Scroller "# of Clusters". Changing N will recompute
the clusters. It then pops up a scrollable text window with the
clusters and indicates which genes belong to each. If you select
the EP plot button, it will draw the expression profiles
for the clustered genes. The Mn-Cluster-Report button
will generate a report for all genes sorted by K-means
cluster. Summaries can be generated using the Mean EP
plot and Mn-Cluster-Report buttons. The SaveAs
GeneSets button saves all of the clusters as named Gene Sets
("Cluster #1", "Cluster #2", etc). If you change the filter or
current gene, you should explicitly use the Recompute
Clusters button to regenerate the new set of clustered
genes. When you recompute the K-means clusters, it uses the
current gene as the initial node.
- this computes the hierarchical
clustering of the expression profiles (normalized by HP-X sample
data for each gene) of Filtered genes. The hierarchical
clusters are displayed in an ordered gene clustergram and
optional dendrogram. Sub-regions of the clustergram may be
explored in more detail using the EP-subset plot button,
or a report of the ordered genes can be created using the
ClustGram Report button. Note: you may add (remove) genes you
select from the Clustergram to the E.G.L. by holding the
Control (Shift) key while clicking on the gene name.
S.O.M. gene clusters by
expr profiles [RB] - [Future MAEPlugin]
Multi-Dimensional
Scaling of genes by expr profiles [RB] - [Future MAEPlugin]
Clusters of (HP-E)
samples as fct of Filtered genes [RB] - [Future]
Use correlation-coefficient
else Euclidian-distance [CB] - use the (1.0 - correlation
coefficient) as the distance metric instead of the default
Euclidean distance.
Scale EP vector by max
magnitude prior to clustering [CB] - scale each sample in
the EP by the max magnitude for all sample values in the EP.
Normalize by HP-X sample else
HP max intensities [CB] - normalize data by the
corresponding HP-X sample data for each gene or the maximum raw
intensity for each HP in the expression profile.
Use median instead of mean
for K-means clustering [CB] - compute each cluster center as a
median instead of a mean (see (Bickel, 2001)).
Display ClusterGram of gene
expr profiles [CB] - compute the hierarchical clustering of
the expression profiles (normalized by HP-X sample data for each
gene) of Filtered genes. Then display the hierarchical clusters
in an ordered gene clustergram and optional dendrogram when the
dendrogram checkbox is selected. Expression profile
plots of the clustergram may be explored in more detail using
the EP plot button that generates a scrollable list of
all EP plots ordered by the same order as the clustergram. A
full report of the ordered genes expression profiles may be
created using the ClustGram Report button.
Use avg-arithmetic-linkage [RB] - set the hierarchical
clustering linkage method to the average arithmetic linkage of
sub-clusters. [Future]
Use avg-centroid-linkage
[RB] - set the hierarchical clustering linkage method to
average centroid linkage of sub-clusters (default).
Use next-min-linkage
[RB] - set the hierarchical clustering linkage method to the
next minimum distance sub-cluster linkage in random order.
Use cluster-distance matrix
cache [CB] - if you do not have enough memory for clustering
large gene sets, disable the cache. It will take MUCH longer
without the cache. When clustering, if there is
not enough memory available for the cache, it will warn you and
suggest you either reduce the number of genes being clustered or
use a computer with more memory.
Use short else float
cluster-distance matrix cache [CB] - if there is not enough
memory for the set of genes you wish to cluster and you still
want to use the cache, you can use 16-bit (i.e. short) data
instead of the 32-bit (i.e. float) data. The results will be
less precise.
Use un-weighted else
weighted average [CB] - set the hierarchical clustering vector
averaging to un-weighted (the default weights it by the number
of genes in that sub-cluster). Otherwise using weighted gives
equal (0.50) weighting to each sub-cluster.
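To get a feel for the memory trade-off between the float and short cache options, the following Python sketch estimates the cache size. It assumes the cache holds one entry per gene pair in a full N x N matrix and ignores any Java object overhead; the function name and the gene count are hypothetical.

```python
def cache_bytes(n_genes, bytes_per_entry):
    """Approximate size of a full N x N cluster-distance matrix cache."""
    return n_genes * n_genes * bytes_per_entry

n = 10000                        # hypothetical number of Filtered genes
float_cache = cache_bytes(n, 4)  # 32-bit float entries
short_cache = cache_bytes(n, 2)  # 16-bit short entries
```

Under these assumptions the short cache needs half the memory of the float cache, and halving the number of genes reduces either cache to one quarter.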
Handling of hierarchical clustering of large numbers of genes -
problem with slow response
The hierarchical clustering algorithm uses a gene-gene floating
point (i.e. 32-bit) distance matrix of order N^2 (for N data
filtered genes). This means that if you are experiencing a slow
response, this may be due to several factors some of which you may not
be able to control. You might:
2.4.5.1 Cluster genes with expression profiles similar to current gene
The Cluster genes with expression profiles similar to current
gene command is used to find genes with similar HP-E expression profiles,
as measured by a least-square error that is less than the cluster
distance threshold. It pops up the "Cluster Distance" threshold
scroller. Then click on a gene in the microarray image. It then pops
up a window with a list of the similar genes and their expression
profile distances to the current gene. Each gene that passes the
cluster distance threshold test is indicated in the image with a blue square where the size of the square is
proportional to its similarity. It also displays a sorted list of the
genes with the cluster distance in the cluster panel that was popped
up. On each line is a series of '*****' - the more stars, the higher
the similarity to the seed gene. This is a silhouette plot, which
is used to display a sorted list of similar objects and is similar
to that described in (Kaufman and
Rousseeuw, 1990).
Larger squares indicate that more genes are similar. You may
change the cluster distance threshold and it will update the display
and the list. In addition, the 'edited gene list' is set to the
subset of genes that belong to the current cluster.
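The search for genes similar to the current gene can be sketched in Python as below. This is an illustration only: the function names and example profiles are hypothetical, and the distance is assumed to be the Euclidean distance between expression profiles divided by the number of samples.

```python
import math

def ep_distance(ep_a, ep_b):
    """Assumed metric: Euclidean distance between profiles, divided by n."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(ep_a, ep_b))) / len(ep_a)

def similar_genes(seed, profiles, threshold):
    """Genes whose distance to the seed gene is below the threshold,
    sorted from most to least similar."""
    seed_ep = profiles[seed]
    hits = [(gene, ep_distance(seed_ep, ep))
            for gene, ep in profiles.items() if gene != seed]
    return sorted((h for h in hits if h[1] < threshold), key=lambda h: h[1])

# Hypothetical expression profiles over three HP-E samples.
profiles = {
    "seed": [1.0, 2.0, 3.0],
    "near": [1.1, 2.1, 2.9],
    "mid": [1.5, 2.5, 3.5],
    "far": [9.0, 1.0, 0.0],
}
hits = similar_genes("seed", profiles, 0.5)
```

Lowering the threshold shortens the list, which mirrors what happens when the "Cluster Distance" scroller is moved.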
2.4.5.2 Cluster counts of similar filtered genes by expression profiles
The Cluster counts of similar Filtered genes by expression
profiles command analyzes the set of all Filtered genes for the
expression profile defined by the HP-E samples. It counts the number
of similar genes for each Filtered gene and draws a
blue circle whose size is proportional to the number of genes
similar to that gene. After it analyses these genes it lists the
genes and their counts in the cluster panel. You may change the
cluster distance threshold and/or Filter parameters and it will update
the display and the list. If you click on a gene with a green circle, it will switch to single gene
cluster mode (with the blue squares).
2.4.5.3 K-means clustering of gene expression profiles for filtered genes
The K-means cluster gene expression profiles for Filtered genes
command searches the data Filtered gene list for the genes
(i.e. primary genes) with the N most orthogonal expression
profiles. It will start this recursive computation from the gene with
minimum distance to all other genes unless you have selected a
"current gene" with the mouse. All Filtered genes are assigned to the
nearest K-means primary node. The mean cluster vector is computed and
used as the new definition of the cluster center. If you set the "Use
median instead of mean for K-means clustering" option in the
Clustering submenu, it will compute the center as a median instead of
a mean (Bickel, 2001). K-means
clustering is described in (Sneath
and Sokal, 1973). A new K-means primary gene (i.e. gene for the
cluster center) is found that is closest to this new center. Then all
of the data Filtered genes are reassigned to the new cluster
centers. The mean+-stdDev of the within-cluster distance to its center
is computed. It then pops up a text window with an ordered report of
the Filtered genes illustrated by part of a report shown below. [This
is part of a report from a 38 sample MGAP database subset of 141 genes
from the set of named genes restricted by the CV data filter.] The
"Similarity" data is plotted as a silhouette plot using
variable-length strings of '****'. Note that clusters where the
similarity is about the same for the entire cluster (e.g. cluster #4)
contain genes that probably belong together in the same cluster.
Clusters where it is not (e.g. Cluster 6) probably contain two
smaller, more robust clusters.
Cluster report for 6 K-means clusters with 141 genes being clustered.
The seed gene is [1248564] Jun-B oncogene.
Clone ID Similarity Cluster-# Distance-to-cluster Gene-Name
-------- -------------- --------- ------------------- ----------------
1248411 ************** 1 Cluster [26 genes] in cluster [distNext: 1.035] wiCdist:mn+-sd=1.223+-0.453 CV=0.371 Calpactin I light chain
1381592 ********** 1 0.448 Surfeit gene 4
1247956 ********* 1 0.706 Protein kinase, cAMP dependent, catalytic, beta
1381836 ******** 1 0.761 Prohibitin
1382325 ******** 1 0.771 M.musculus mRNA for C1D protein
1248270 ******** 1 0.775 Seven in absentia 1A
1247716 ******** 1 0.794 Lipoprotein lipase
1248184 ******** 1 0.847 Mus musculus bromodomain-containing protein BP75 mRNA, complete cds
1248564 ******* 1 0.864 Jun-B oncogene
1382667 ******* 1 0.888 SERINE/THREONINE PROTEIN PHOSPHATASE PP2A-BETA, CATALYTIC SUBUNIT
1382561 ******* 1 0.931 Mus musculus GTP-specific succinyl-CoA synthetase beta subunit (Scs) mRNA, partial cds
1248089 ****** 1 1.013 M.musculus RPS3a gene
1247780 ****** 1 1.088 Proprotein convertase subtilisin/kexin type 7
1247557 ****** 1 1.104 M.musculus L28 mRNA for ribosomal protein L28
1248321 ***** 1 1.278 Decay accelerating factor 1
1382751 **** 1 1.311 Clusterin
1382007 **** 1 1.357 Murine mRNA with homology to yeast L29 ribosomal protein gene
1382074 **** 1 1.390 Orosomucoid 1
1381963 **** 1 1.417 M.musculus mRNA for ribosomal protein L36
1248278 ** 1 1.658 HISTONE H3.3
1247630 ** 1 1.675 Procollagen, type I, alpha 2
1247865 * 1 1.837 Mouse beta-D-galactosidase fusion protein mRNA, complete cds
1382236 * 1 1.85 Caspase 7
1247833 1 1.882 Mus musculus radio-resistance/chemo-resistance/cell cycle checkpoint control protein (Rad9) mRNA, complete cds
1248535 1 1.953 M.musculus mRNA for selenoprotein P
1247702 1 2.157 Cytochrome C oxidase, subunit Va
1382282 ************** 2 Cluster [13 genes] in cluster [distNext: 24.199] wiCdist:mn+-sd=16.184+-6.667 CV=0.412 Max interacting protein 1
1382159 ********** 2 9.086 TRANSPLANTATION ANTIGEN P35B
1247854 ********* 2 11.002 Prolyl 4-hydroxylase, beta polypeptide
1247970 ******** 2 11.786 Mouse mRNA for osteoblast specific factor 2 (OSF-2)
1381663 ******** 2 12.948 Mus musculus vacuolar adenosine triphosphatase subunit A gene, complete cds
1382100 ******** 2 13.34 T-complex protein 1, related sequence 1
1248366 ******** 2 13.541 Mus musculus cytochrome c oxidase subunit VIIa-L precursor (Cox7al) mRNA, nuclear gene encoding mitochondrial protein, complete cds
1247568 ******** 2 13.762 Cathepsin D
1247872 ******* 2 14.015 Mus musculus endothelial monocyte-activating polypeptide I mRNA, complete cds
1382333 ******* 2 14.065 Stromal cell derived factor 5
1382008 ******* 2 15.985 Mus musculus FK-506 binding protein homolog (SAM11) mRNA, complete cds
1247724 **** 2 21.964 Glutathione-S-transferase, alpha 3
1247846 2 34.704 House mouse; Musculus domesticus kidney mRNA for Phosphatidic acid phosphatase, complete cds
1247945 ************** 3 Cluster [22 genes] in cluster [distNext: 11.979] wiCdist:mn+-sd=7.559+-3.347 CV=0.443 Mus musculus mRNA for DEDD protein
1247797 ********** 3 4.159 Mus musculus Btk locus, alpha-D-galactosidase A (Ags), ribosomal protein (L44L), and Bruton's tyrosine kinase (Btk) genes, complete cds
1382087 ********** 3 4.494 Cell division cycle 42
1247539 ********** 3 4.511 EST
1248212 ********** 3 5.009 Murine mRNA for integrin beta subunit
1248470 ********** 3 5.044 EST
1247521 ********* 3 5.299 Mus musculus mRNA for peroxisomal integral membrane protein PMP34
1381808 ********* 3 5.924 Mus musculus UDP-GalNAc:polypeptide N-acetylgalactosaminyltransferase-T3 mRNA, complete cds
1381970 ********* 3 6.285 Mus musculus thioredoxin mRNA, nuclear gene encoding mitochondrial protein, complete cds
1382168 ********* 3 6.343 N-terminal Asn amidase
1382704 ********* 3 6.36 Mus musculus N-myristoyltransferase 1 mRNA, complete cds
1248548 ********* 3 6.378 Mus musculus WDR protein mRNA, complete cds
1247564 ******** 3 6.652 Erythrocyte protein band 7.2
1248588 ******** 3 6.67 M.musculus BAP31 mRNA
1247541 ******** 3 6.690 Apolipoprotein D
1248462 ******** 3 7.322 Sterol O-acyltransferase 1
1248462 ******** 3 7.42 Sterol O-acyltransferase 1
1248521 ****** 3 9.121 Mus domesticus nuclear binding factor NF2d9 mRNA, complete cds
1382212 ****** 3 10.137 Thyroid autoantigen 70 kDa
1382270 ***** 3 10.529 Voltage-dependent anion channel 2
1248152 ***** 3 10.541 M. musculus mRNA for MAP kinase-activated protein kinase 2
1247678 3 19.431 Casein alpha
1247543 ************** 4 Cluster [44 genes] in cluster [distNext: 1.035] wiCdist:mn+-sd=0.439+-0.266 CV=0.606 RAS-related C3 botulinum substrate 1
1381923 ************ 4 0.158 Prolyl 4-hydroxylase, beta polypeptide
1382052 ************ 4 0.209 Trans-acting transcription factor 1
1247882 *********** 4 0.237 Mus musculus AMP activated protein kinase mRNA, complete cds
1248099 *********** 4 0.246 Mus musculus mitogen-responsive 96 kDa phosphoprotein p96 mRNA, alternatively spliced p67 mRNA, and alternatively spliced p93 mRNA, complete cds
1248351 *********** 4 0.251 Abl-interactor 1
1247540 *********** 4 0.255 Mus musculus mRNA for ZIP-kinase, complete cds
1248316 *********** 4 0.26 Mus musculus proteasome alpha7/C8 subunit mRNA, complete cds
1382671 *********** 4 0.264 Mouse MA-3 (apoptosis-related gene) mRNA, complete cds
1382014 *********** 4 0.277 Transcription elongation factor B (SIII), polypeptide 1 (15 kDa),-like
1247885 *********** 4 0.289 Mus musculus mRNA for ryudocan core protein, complete cds
1248294 *********** 4 0.292 Mus musculus thioredoxin-related protein mRNA, complete cds
1382066 *********** 4 0.306 Inhibitor of DNA binding 2
1248597 *********** 4 0.307 Lipocortin 1
1248591 *********** 4 0.324 Interferon beta, fibroblast
1248445 ********** 4 0.333 Mus musculus beta prime coatomer protein mRNA, partial cds
1247775 ********** 4 0.34 House mouse; Musculus domesticus male brain mRNA for ARF1, complete cds
1382750 ********** 4 0.340 Thymoma viral proto-oncogene
1247905 ********** 4 0.341 Monokine induced by gamma interferon
1381668 ********** 4 0.351 Mus musculus mitogen-activated protein kinase-activated protein kinase mRNA, complete cds
1381811 ********** 4 0.356 Protein tyrosine phosphatase, receptor type, D
1382031 ********** 4 0.358 Protease (prosome, macropain) 28 subunit, beta
1248345 ********** 4 0.363 Mus musculus alpha-methylacyl-CoA racemase mRNA, complete cds
1382555 ********** 4 0.364 Lysosomal membrane glycoprotein 1
1247820 ********** 4 0.367 Tight junction protein 1
1247598 ********** 4 0.374 Retinoblastoma 1
1247595 ********** 4 0.378 PROBABLE CALCIUM-BINDING PROTEIN PMP41
1381928 ********** 4 0.379 Mus musculus MRJ (Mrj) mRNA, complete cds
1248196 ********** 4 0.399 Max protein
1381691 ********** 4 0.423 SRY-box containing gene 17
1248225 ********** 4 0.434 Mus musculus heat shock transcription factor 1 (Hsf1) gene, partial cds
1248084 ********** 4 0.442 Mus musculus Supl15h gene
1247941 ********* 4 0.453 Fibroblast growth factor inducible 14
1381623 ********* 4 0.468 Stearoyl-coenzyme A desaturase 1
1248202 ********* 4 0.473 Mouse mRNA for PAP-1, complete cds
1382115 ********* 4 0.512 GLUTATHIONE S-TRANSFERASE GT8.7
1382044 ********* 4 0.515 Cartilage derived retinoic acid sensitive protein
1381636 ******** 4 0.567 Lymphotoxin B
1381920 ******** 4 0.569 Mus musculus mRNA for NEFA protein, complete cds
1247757 ******** 4 0.596 Granzyme B
1382094 ******** 4 0.609 High mobility group protein 1
1247545 ******** 4 0.638 Carbon catabolite repression 4 homolog (S. cerevisiae)
1247607 *** 4 1.188 POLYADENYLATE-BINDING PROTEIN
1247727 4 1.667 Malate dehydrogenase, mitochondrial
1248244 ************** 5 Cluster [19 genes] in cluster [distNext: 3.473] wiCdist:mn+-sd=4.273+-2.059 CV=0.482 CD80 antigen
1248534 ********** 5 1.648 Carbonyl reductase
1247764 ********** 5 1.776 H-2 CLASS II HISTOCOMPATIBILITY ANTIGEN, GAMMA CHAIN
1381933 ********* 5 2.345 Mouse rpS17 mRNA for ribosomal protein S17, complete cds
1381616 ********* 5 2.42 Mus musculus oral tumor suppressor homolog (Doc-1) mRNA, partial cds
1248232 ********* 5 2.486 Mus musculus putative glycogen storage disease type 1b protein mRNA, complete cds
1382644 ******** 5 2.717 Cyclin G
1248125 ******** 5 2.791 Histocompatibility 2, class II, locus Mb2
1247799 ******** 5 2.869 Mus musculus signal recognition particle receptor beta subunit mRNA, complete cds
1247708 ******** 5 3.024 Ephrin A1
1247932 ****** 5 4.235 Mus musculus (clone: pMAT1) mRNA, complete cds
1382515 ***** 5 4.668 ATPase, Na+/K+ beta 3 polypeptide
1248586 ***** 5 4.838 Mus musculus viral envelope like protein (G7e) gene, complete cds
1248198 *** 5 5.874 Mus musculus D9 splice variant 2 mRNA, complete cds
1381623 ** 5 6.224 Stearoyl-coenzyme A desaturase 1
1382086 * 5 6.885 Mus musculus (strain C57Bl/6) mRNA sequence
1247887 * 5 7.014 Mouse chromosome 6 BAC-284H12 (Research Genetics mouse BAC library) complete sequence
1247886 5 7.810 Cut (Drosophila)-like 1
1248303 5 8.094 Lipopolysaccharide response
1247621 ************** 6 Cluster [17 genes] in cluster [distNext: 19.157] wiCdist:mn+-sd=12.410+-3.024 CV=0.244 Mus musculus Lsc (lsc) oncogene mRNA, complete cds
1248050 ******* 6 7.407 Mus musculus C57BL/6J ribosomal protein S28 mRNA, complete cds
1247698 ******* 6 7.571 Adipocyte protein aP2
1248240 ***** 6 9.198 Mus musculus mRNA, complete cds
1247862 **** 6 9.844 Mus musculus Nmi mRNA, complete cds
1382162 **** 6 10.330 CAMP responsive element modulator
1248398 *** 6 11.007 Mouse mRNA for ribosomal protein S12
1248281 *** 6 11.143 M.musculus mRNA for histone H3.3A
1247852 *** 6 11.576 Twist gene homolog, (Drosophila)
1381991 ** 6 12.809 Prolyl 4-hydroxylase, beta polypeptide
1382753 ** 6 13.019 Mus musculus cleavage and polyadenylation specificity factor (MCPSF) mRNA, complete cds
1248368 * 6 13.639 Mus musculus ribosomal protein S26 (RPS26) mRNA, complete cds
1247639 * 6 13.692 SRY-box containing gene 4
1248435 6 14.262 Thymus cell antigen 1, theta
1247961 6 14.75 ATP SYNTHASE ALPHA CHAIN, MITOCHONDRIAL PRECURSOR
1248344 6 15.217 Gut enriched Kruppel-like factor
1382234 6 16.351 CD8 antigen, beta chain
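The K-means steps described in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not MAExplorer's Java implementation: the Euclidean distance, the function names (dist, pick_seeds, assign, kmeans_pass), the choice to reassign genes to the recomputed mean (or median) centers, and the toy profiles are all hypothetical.

```python
import math
import statistics


def dist(a, b):
    """Euclidean distance between two expression profiles (an assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pick_seeds(profiles, k, first=None):
    """Pick k primary genes: start from `first` (the "current gene") or the gene
    with minimum total distance to all others, then repeatedly add the gene
    that is farthest from the seeds chosen so far."""
    names = list(profiles)
    if first is None:
        first = min(names, key=lambda g: sum(dist(profiles[g], profiles[h]) for h in names))
    seeds = [first]
    while len(seeds) < k:
        rest = [g for g in names if g not in seeds]
        seeds.append(max(rest, key=lambda g: min(dist(profiles[g], profiles[s]) for s in seeds)))
    return seeds


def assign(profiles, centers):
    """Map every gene to the index of its nearest cluster center."""
    return {g: min(range(len(centers)), key=lambda i: dist(p, centers[i]))
            for g, p in profiles.items()}


def kmeans_pass(profiles, k, first=None, use_median=False):
    """One pass: seed, assign, recompute centers, pick new primaries, reassign."""
    avg = statistics.median if use_median else statistics.fmean
    seeds = pick_seeds(profiles, k, first)
    labels = assign(profiles, [profiles[s] for s in seeds])
    centers = []
    for i in range(k):
        members = [profiles[g] for g in labels if labels[g] == i]
        centers.append([avg(col) for col in zip(*members)])
    # New primary gene of each cluster: the gene closest to the new center.
    primaries = [min(profiles, key=lambda g: dist(profiles[g], c)) for c in centers]
    labels = assign(profiles, centers)
    report = []
    for i in range(k):
        d = [dist(profiles[g], centers[i]) for g in labels if labels[g] == i]
        mean = statistics.fmean(d) if d else 0.0
        sd = statistics.stdev(d) if len(d) > 1 else 0.0
        # (primary gene, cluster size, mean distance, stdDev, CV = stdDev/mean)
        report.append((primaries[i], len(d), mean, sd, sd / mean if mean else 0.0))
    return labels, report


# Toy data: four hypothetical genes measured on three samples.
profiles = {
    "g1": [1.0, 1.0, 1.0],
    "g2": [1.1, 0.9, 1.0],
    "g3": [5.0, 5.0, 5.0],
    "g4": [5.2, 4.9, 5.1],
}
labels, report = kmeans_pass(profiles, 2)
for primary, size, mean, sd, cv in report:
    print(f"{primary}: {size} genes, wiCdist mn+-sd={mean:.3f}+-{sd:.3f} CV={cv:.3f}")
```

Passing use_median=True mimics the "Use median instead of mean for K-means clustering" option, and passing first="g1" mimics starting from a selected current gene.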
2.4.5.4 Hierarchical clustering of expression profiles
The Hierarchical clustering of expression profiles computes the
hierarchical clustering of the expression profiles of data Filtered
genes and displays a clustergram and optional dendrogram.
Hierarchical clustering is described in (Sneath and Sokal, 1973). The gene
data is normalized either by the corresponding HP-X sample data for
each gene or by the maximum raw intensities for each HP sample in the
expression profile set; the choice is made with the Normalize by HP-X
else HP's max intensities menu toggle. There are three types of
clustering linkages: average-arithmetic-linkage,
average-centroid-linkage, and next-minimum-linkage. These may be
modified using the weighted average, which gives equal weighting to
the child clusters in computing the mean of a new cluster, or the
un-weighted average, which weights them by
the number of non-terminal clusters. Average-linkage clustering is
computationally intensive and takes a while. Next-minimum-linkage is
much faster and may result in adequate clustering for some
situations.
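To show why average linkage is expensive, here is a naive Python sketch of agglomerative clustering. It is an assumption-laden illustration rather than MAExplorer's code: "average-arithmetic-linkage" is interpreted here as the mean of all pairwise distances between the members of two clusters, the distance is Euclidean, and the names (dist, linkage, average_linkage) and toy profiles are hypothetical.

```python
import math
from itertools import combinations


def dist(a, b):
    """Euclidean distance between two expression profiles (an assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def linkage(profiles, members_a, members_b):
    """Mean of all pairwise distances between the members of two clusters."""
    d = [dist(profiles[g], profiles[h]) for g in members_a for h in members_b]
    return sum(d) / len(d)


def average_linkage(profiles):
    """Merge the two closest clusters until one remains.
    Returns the merges as (left, right, distance) in the order they were made."""
    clusters = {name: [name] for name in profiles}   # cluster id -> member genes
    merges = []
    while len(clusters) > 1:
        # Every pair of clusters is re-examined on every step, which is why
        # this naive form becomes slow for large gene lists.
        a, b = min(combinations(clusters, 2),
                   key=lambda ab: linkage(profiles, clusters[ab[0]], clusters[ab[1]]))
        d = linkage(profiles, clusters[a], clusters[b])
        merges.append((a, b, d))
        clusters[f"({a},{b})"] = clusters.pop(a) + clusters.pop(b)
    return merges


# Toy data: four hypothetical genes measured on three samples.
profiles = {
    "g1": [1.0, 1.0, 1.0],
    "g2": [1.1, 0.9, 1.0],
    "g3": [5.0, 5.0, 5.0],
    "g4": [5.2, 4.9, 5.1],
}
merges = average_linkage(profiles)
for left, right, d in merges:
    print(f"merge {left} + {right} at distance {d:.3f}")
```

The merge order and distances are the information a dendrogram draws; the leaf order of the final tree gives the row order of a clustergram.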
[Figure: color scale of the clustergram boxes, in nine steps: bright green (<1/8X), 1/6X, 1/4X, dark green (1/2X), black (1X), dark red (2X), 4X, 6X, bright red (>8X).]
The current gene may be set by clicking on a row, which is then
highlighted in green. If you click on a colored box, it will also
report the HP name for that column and its normalized expression value
(highlighting that box with a white circle). If the Web genomic
databases are enabled (through the View menu), then it will also pop up
a Web page for that gene. If you set the current gene in any of the
array, scatter plot, gene guesser, etc. displays, it will set it for
the clustergram and position the clustergram at that gene. If the Dendrogram
checkbox is enabled, then a dendrogram is drawn to the left of the
clustergram boxes. Clicking on a region in the dendrogram sets a
distance threshold (displayed at the top) and displays in red all parts of
the dendrogram tree that have a cluster distance less than the
threshold you defined. If the zoom nnX button is pressed, then the
dendrogram drawing is magnified nn-fold to make highly similar
clusters more visible. Pressing the button repeatedly cycles through
1X, 2X, 5X, 10X, 20X. Sub-regions of the clustergram may be explored
in more detail using the EP plot button, which pops up a
scrollable window of the ordered gene list. You may generate
multiple EP-subset plots so as to compare different parts of the
clustergram. A report of all of the ordered genes may be created
using the ClustGram Report button. This report has all of the normalized
expression profiles on the right side of the report. The Show HP names
button pops up a numbered list of all samples used in the expression
profiles and clustergram.

2.4.6 Report menu
Various reports summarizing gene or sample data may be generated and
appear in popup tables. These include:
- Array report menu - show tables of sample array information.
- Gene reports menu - show tables of gene set information.
- Table format menu - generates the
table as 'Spreadsheet', 'Tab delimited', or 'Name=Value list'.
- Table font size menu - changes
the report font size from the default 10 point.
2.4.6.1 Array report menu - hybridized samples global data
You may generate reports of sample array information. The first two
menu selections contain descriptive information about specific
hybridized microarray samples. The "Extra Samples info" contains
quantitative and extra descriptive information (if available for your
database).
rSq=0.748, n=1656, HP:1(mn+-sd)=(28991+-19564), HP:2(mn+-sd)=(5044+-9766)
2.4.6.2 Gene reports menu
You may generate gene reports with various additional options. You
must set the Web access checkbox if you want to click on a blue
hyperlink in the resulting report to access an associated Web
database. In addition, specialized gene reports may be generated from
some of the cluster plot command windows. These include lists of
genes sorted by cluster (K-means cluster #), by hierarchical cluster order,
by similarity to a gene, etc. The mean cluster expression values may
be reported for K-means clustering.
- Filtered gene reports menu - show genes
that meet the data Filter criteria.
2.4.6.2.1 Filtered gene reports menu
You may generate gene reports of Filtered genes with various
additional presentation options. In the highest/lowest N genes reports, N
defaults to 100 and is set by the (Report | Table format | Set max # genes
in highest/lowest report) command.
2.4.6.3 Table format menu
The report is presented as a table. However, it may be viewed in
several different ways. The scrollable spreadsheet includes the
ability to click on blue hypertext items and have a Web browser pop up
for that item on a Web database (e.g. GenBank, dbEST, UniGene,
LocusLink, mAdb Genes, GeneCard, etc). The tab-delimited option
enables you to cut the table and paste it into a separate spreadsheet
program such as Excel. You may also extend the data in the table by
'Adding' expression profile ratios and statistics from the HP-X and
HP-Y 'set' comparisons.
Spreadsheet [RB] - some cells (indicated by blue) are
connected to Web databases that are accessed by clicking on them (default).
Tab delimited [RB] - suitable for export to Excel, etc.
Set max # genes in highest/lowest report or filter [CB] -
sets the number of genes N to report or use in the data Filter.
Add EP data to Gene-Reports [CB] - for each of the
samples in HP-E.
Use EP data in Gene-Reports [CB] - use raw EP data (under
the current normalization), else use data normalized to
0.0 to 1.0 by the maximum value for all genes being displayed.
Add HP-X/-Y 'set' statistics data to Gene-Reports [CB] -
includes (mean,stdDev,CV,n) for both the HP-X and
HP-Y sets.
2.4.6.4 Table font size menu
For wider tables, you can see more information if you use a smaller
font to display the table. The font sizes available are:
12pt [RB]
10pt [RB] - (the default size)
8pt [RB]

2.5 View menu
The View menu options are used to modify the view of genes
visible in the pseudoarray image. Genes may be displayed with
additional properties or capabilities including access to Web-based
genomic database entries for specific genes. Note that depending on
your particular database, if some genomic identifiers are not
available then the corresponding "Enable display current gene in
genomic DB Web browser" will not appear in the menu.
Show 'Edited Gene List' [CB]
- toggle showing the EGL as magenta boxes in the pseudoarray
image. If enabled, it shows genes set by manual selection or as the
result of some filtering operations (see the Edit menu and the
Filter and Clustering menus).
Enable display current
gene in GenBank Web Browser [RB] - when you click on a gene in the
microarray image or scatter plot, access NCBI data. It will
use RefSeqID data if available.
Enable display current
gene in dbEST Web Browser [RB] - when you click on a gene in the
microarray image or scatter plot, access NCBI data.
Enable display current
gene in Unigene Web Browser [RB] - when you click on a gene in the
microarray image or scatter plot, access NCBI data.
Enable display current
gene in mAdb Web Browser [RB] - when you click on a gene in the
microarray image or scatter plot, access NCI/CIT data.
Enable display current
gene in LocusLink Web Browser [RB] - when you click on a gene in the
microarray image or scatter plot, access NCBI LocusLink data
(by Locus ID if available, and by GenBank ID if it is not).
Enable display current
gene in OMIM Web Browser [RB] - when you click on a gene in the
microarray image or scatter plot, access NCBI OMIM data by
OMIM ID if available.
Show Filtered spots in array
[CB] - if on, show Filtered genes in the microarray image
using circle overlays just outside of each spot. Genes not
passing the Filter have no circle.
Gang F1-F2 scrolling
[CB] - toggle gang scrolling. If on, measure both F1 and F2
when you click on one of them. If off, measure only the spot
that you click on.
Show mouse-over info
[CB] - toggle mouse over. If enabled, report HP or gene
details when the mouse is moved over the sample names or spots
in the array image or genes in the scatterplot or expression profile
overlay plot.
Presentation view mode
[CB] - toggle presentation mode. If enabled, increase the
fonts to 12pt and draw darker circles, squares, and "+" so the
display details are easier to see when projecting the display or
making slides.
Color scheme (red-green) or
dichromasy [CB] - toggle between two color schemes: red-green,
and orange-blue for dichromasy, which may be easier for some
people to see.
Show log of messages
[CB] - toggle message logging mode. If on, it pops up a
scrollable log of all messages sent to the three line status
area. This is useful for recording measurements and other
activity. The messages may be saved in a log file.
Show log of command history
[CB] - toggle command history logging mode. If enabled, it
pops up a scrollable history of all commands issued to
MAExplorer. The commands are automatically numbered. This is
useful for recording what steps you took during an analysis.
The history may be saved in a log file.
2.5.1 Logging MAExplorer messages
MAExplorer shows various data measurements as well as many other types
of information in the three text lines in the status area of the main
window. The Show log of messages pops up a scrollable log of
all messages to the three line status area. This is useful for
recording measurements and other activity. The messages may be saved
in a log file (typically maeMessages.log). Figure 2.5.2 shows an example
of the messages popup log window. Clicking on genes in the pseudoarray
image or in plots will log the gene data (see Section 3.3) given the
current normalization, Samples use (single or multiple), and
pseudoarray display mode. The current values of all of the State
Threshold scrollers are saved in the message log when the (Edit menu |
Preferences | Adjust all Filter threshold scrollers) State
Thresholds popup window is closed. This is useful for capturing the
current settings at any time.
2.5.2 Logging command history
During a datamining session, the user will typically execute many
commands from the menu as well as clicking on genes in the pseudoarray
image or in plots. It is useful to record the steps you took during
this analysis. The Show log of command history pops up a
scrollable history of all commands issued to MAExplorer. The commands
are automatically numbered. The history may be saved in a log file
(typically maeHistory.log). Figure 2.5.2 shows an example
of the command history popup log window.

2.6 Plugins menu
MAExplorer may be extended by users with new analysis methods using
Java plugins. We call these new methods MAEPlugins; they are small
Java programs written by users that may be dynamically loaded into
MAExplorer and then applied to their data. These plugins will include
plugins written by LECB and those written by academic or commercial
groups. See MAEPlugins for
details. If you have a compiled Java plugin in the form of either a
Java .class or .jar file, you may load it at run time using the "Load
plugin" command in the Plugins menu. If specified in the MAEPlugin, it
will be added to the appropriate menu in the MAExplorer menu tree at
the end of the specified submenu (see Appendix C. Table C.5.7). If
this submenu "stub" is not specified, it will be placed in the list of
plugins in the Plugins menu (e.g. plugin #1, ..., plugin
#n).
The Plugins menu includes:
- RLO methods menu - list of
executable R methods (R LayOuts). These may be added by the user
using the RtestPlugin. An RLO
analysis allows you to export data from MAExplorer, execute it
with the R program, and import the R results back into
MAExplorer. [This is under development and is alpha-level.]
RLO methods menu
This contains a list of executable R
analysis methods (called R LayOuts or RLOs, created with the
RtestPlugin) for evaluating MAExplorer data with R analysis scripts.
It is only available if you have installed the R language program (www.r-project.org) on your
computer. An RLO analysis allows you to automatically export data from
MAExplorer, execute it with the associated R program, and import the R
results back into MAExplorer. [This is under development and is
alpha-level.] A recent poster on
Extending MAExplorer with R is available as a PDF file.
2.6.1 Example of using a Plugin
This shows a short demonstration of what is involved in using a
MAEPlugin. The user first loads the plugin from the disk. Generally the
plugin .jar or .class files are stored in the Plugins/ directory
where you have installed MAExplorer. Loading a particular
plugin installs it in the Plugins pull-down menu. The user then
revisits that menu to invoke the particular plugin. You may load any
number of plugins (until you run out of computer memory if that should
occur).

2.7 Help menu
Various on-line help and documents are available if you are connected
to the Internet. These will appear in a separate pop-up Web browser
window so you may view them while working with MAExplorer. This
includes on-line documentation (including this reference manual),
tutorials, and other information. There may also be links to other Web pages
describing key areas of specific databases. For example, for the MGAP
database, they point back to key areas of MGAP including the MGAP Animal Models, Histology
atlas, etc. You can then use the browser's "Save as" and "Print"
options to save the data to a file or print it.
Database-specific help menu entries -
entries defined for a particular database (see below)
2.7.1 Adding custom help links to your database to the Help menu
These Database-specific help menu
entries are keyed to the database you are
using and may be
customized by the database maintainer in the configuration file
(Section C.5.6) to link to pages relating to the particular database. For
example, database-specific help for the MGAP database is:

3. Exploratory Data Analysis - Introduction to Data Mining
Data mining is the uncovering of relevant patterns of interest in data
from a particular problem domain (Tukey, 1977). Typically this
involves using various statistical techniques, including cluster
analysis, to identify the patterns. See
StatSoft Inc's 2002 on-line statistics textbook for definitions of
clustering and other statistical terms. Researchers across a wide
range of fields, such as (Tufte, 1997) and (Cleveland, 1985), have
suggested that a major aspect of this problem is
finding the correct means of graphical presentation to allow humans to
be a part of the pattern recognition process. Tufte argues that the
proper display of quantitative data in the context of the problem
domain can aid in the understanding of complex sets of data. This
carries over to the analysis of microarrays, where successful data
mining involves statistical, genomic knowledge database, and graphical
components. (Jagota, 2001) discusses a number of methods and
applications for
microarray data analysis and visualization. Other useful resources are
the sets of papers in ("Chipping Forecast", Nature Genetics
supplement, Jan, 1999) and ("Chipping Forecast II", Nature Genetics
supplement, Dec, 2002).
Organization of Sections in this Chapter
3.1 Objectives in data mining, discovery and analysis
There are a number of objectives an investigator has when analyzing a
set of data. The types of analyses and how useful they are depend on
what the investigator wishes to get out of the analyses as well as
the type of data.
Recording the analysis steps during your data mining session - command history
Because of the iterative nature of this process, you might want to
keep a record of the commands you have used or the messages and
measurements you have made. To do this you need to enable message and
command history logging. Go to the View pull-down menu and then select
the type of logging you want using the Show log of messages or the
Show log of command
history commands.
3.1.1 Some experimental design issues of microarray experiments
*** THIS SUBSECTION IS IN THE PROCESS OF BEING UPDATED ***
Comparing HP-X/HP-Y for Cy3/Cy5 data as 'ratio of ratios'
If we have two samples HP-X and HP-Y with a common reference sample P
(e.g. Cy5P), then we would be comparing the HP-X
"intensity" Cy3X/Cy5X against the HP-Y
"intensity" Cy3Y/Cy5Y. Alternatively,
you can label the common reference sample P with Cy3, in which case just
swap Cy3 and Cy5 in these equations. If you are using a common
reference standard (i.e. Cy5X1 is the same sample
as Cy5Y1, e.g. a pooled sample
Cy5P), then
a) (Cy3X/Cy5X1) / (Cy3Y/Cy5Y1)
becomes
b) (Cy3X/Cy3Y)
However, this new comparison is accompanied by additional noise
because of the use of the two Cy5P intermediaries.
(Cy3A/Cy5P), (Cy3B/Cy5P), ... , (Cy3N/Cy5P)
This assumes that there is enough of the pooled sample P to be used
for all of the experiments - otherwise additional sources of
error would be introduced. MAExplorer is ideally used with this
common reference sample P.
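The cancellation of the common reference can be checked with a small numeric example. This Python sketch uses hypothetical intensities for a single gene; the function name ratio_of_ratios and all numbers are assumptions made for illustration only.

```python
def ratio_of_ratios(cy3_x, cy5_x, cy3_y, cy5_y):
    """Compare HP-X against HP-Y using each array's Cy3/Cy5 ratio."""
    return (cy3_x / cy5_x) / (cy3_y / cy5_y)


# Hypothetical Cy3 intensities of one gene in samples X and Y.
cy3_x, cy3_y = 1200.0, 300.0

# Ideal case: the pooled reference P gives the same Cy5 signal on both arrays,
# so the reference cancels and the result equals Cy3X/Cy3Y.
cy5_p = 600.0
ideal = ratio_of_ratios(cy3_x, cy5_p, cy3_y, cy5_p)

# In practice the two reference measurements differ slightly, so the
# comparison carries extra noise from the two Cy5P intermediaries.
noisy = ratio_of_ratios(cy3_x, 630.0, cy3_y, 570.0)

print(f"ideal={ideal:.3f} direct={cy3_x / cy3_y:.3f} noisy={noisy:.3f}")
```

The ideal value matches the direct Cy3X/Cy3Y ratio, while a 5% disagreement between the two reference readings already shifts the estimate noticeably.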
If a common pooled sample is not used, then the experimental
design becomes more complicated - especially if dye-swap experiments
are performed for all samples. For N samples taken 2 at a time
(i.e. Cy3 and Cy5), the number of experiments may be impossibly
large to perform for other than a very small N. E.g. for N of 3, the
number of experiments is 3, and 6 if dye swap experiments are also
performed. For N of 4, the number of experiments is 6 and 12. And
this is without doing any replicate experiments. If a reasonable
number of replicates is added, then this set of experiments becomes
even more difficult to perform.
[(Cy3X/Cy5Y) + 1.0/(Cy3Y/Cy5X)]/2
In general, this is probably not a very good estimate.
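The experiment counts quoted above follow from choosing 2 of N samples per array, doubled when every pair is also dye-swapped. The Python sketch below reproduces those counts and evaluates the bracketed dye-swap average, which appears to combine the two dye-swap measurements into one estimate of the X/Y ratio. The function names and the replicates parameter are hypothetical.

```python
from math import comb


def n_experiments(n_samples, dye_swap=False, replicates=1):
    """Number of two-color arrays needed to compare every pair of samples."""
    pairs = comb(n_samples, 2)                 # N samples taken 2 at a time
    return pairs * (2 if dye_swap else 1) * replicates


def dye_swap_estimate(cy3x_over_cy5y, cy3y_over_cy5x):
    """[(Cy3X/Cy5Y) + 1.0/(Cy3Y/Cy5X)]/2 from the text above."""
    return (cy3x_over_cy5y + 1.0 / cy3y_over_cy5x) / 2


for n in (3, 4, 10):
    print(n, n_experiments(n), n_experiments(n, dye_swap=True))

# With perfectly consistent measurements both terms agree.
print(dye_swap_estimate(2.0, 0.5))
```

For 10 samples the same formula already gives 45 arrays, or 90 with dye swaps, before any replicates are added.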
3.1.2 Design philosophy of MAExplorer methodology
There are several ways to implement a data mining system on moderate
size databases. The first is to perform all computations on a
Web server and have the user's Web browser display the results. The second
is to download an applet from the Web server, get the data from the
Web server, and do the computations in the Web browser. A third way is to
download data from a Web server and run a local stand-alone program on
the data. MAExplorer can be run using both the second and third
ways. However, we encourage the use of the stand-alone paradigm as
having the best bandwidth and being the most robust.
The browser-based computation paradigm (as opposed to server-based) is
somewhat unusual. It keeps both the program and data on the server,
making user maintenance of the latest versions easier than if they had
to constantly upgrade the program or data. This also has the distinct
advantage of giving the user instantaneous feedback through rapid
visual and tabular views and the ability to more effectively navigate
the data since the analysis is done on their desktop computer. Because
it is easy to access reference data from other genomic sources
(e.g. UniGene, GenBank, NCI/CIT's mAdb clone DB, dbEST, GeneCard,
etc.), it can be accessed from their respective Web servers as needed.
Complex browser-based computations are used in other data mining or
intensive computation domains. With the increased bandwidth of the
Internet and compute power and memory of PCs approaching the Cray
supercomputers of the previous decade, this paradigm becomes even more
feasible. However, there are limits to how well it scales because of
Web browser limitations. Appendix E.2
discusses these issues in more detail.
3.1.3 Evolution of MAExplorer from earlier proteomic data mining systems
MAExplorer was designed to do flexible exploratory quantitative data
analysis of gene data from microarray hybridized sample experiments.
Many of the data-mining concepts are derived from a system called GELLAB-II
(http://www.lecb.ncifcrf.gov/lemkin/gellab.html), a UNIX-based
stand-alone exploratory data analysis system for 2D protein gels over
multiple experiments (Lipkin and
Lemkin, 1981). See also a review (Lemkin
and Lester, 1989) and examples of graphical representations of
this type of data (Lemkin,
1995). An on-line
GELLAB-II Web-Poster
(http://www.lecb.ncifcrf.gov/lemkin/gellab-ep93wd.html) is available
showing various screen shots of GELLAB-II in action. Whereas GELLAB
works with sets of corresponding spots (i.e. proteins) across sets of
2D gel samples, MAExplorer works with sets of genes (spots in the
microarray) across sets of hybridized sample microarrays. With
protein gels, one typically has spot alignment problems since gels are
generally not superimposable. This is often called the rubber-sheet
distortion problem and requires localized alignment of spots based on
neighboring spot constellation morphology. We have used Web-based
methods to visually compare gels, including the Flicker
(http://www.lecb.ncifcrf.gov/flicker/) image comparison system, a Java
applet (Lemkin, 1997), and the 2DWG
(http://www.lecb.ncifcrf.gov/2dwgDB/) meta-database of 2D gel images
(Lemkin, 1999a). Since the genes
are precisely spotted on the arrays, aligning spots between arrays is
not required, which greatly simplifies the data analysis problem.
3.1.4 Concepts used in data mining with MAExplorer
This section introduces some of the concepts used in data mining
microarrays with an emphasis on how they are used with MAExplorer.
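Two of the concepts introduced below, the gene data filter (a Boolean AND of tests) and set operations on gene subsets, can be previewed with a short Python sketch. Everything in it is hypothetical: the gene names, the per-gene values, the helper names (ratio_test, cv_test, apply_filter), and the thresholds are assumptions chosen for illustration and do not reflect MAExplorer's internal data structures.

```python
# Hypothetical per-gene data: name -> (X/Y ratio, CV, gene class)
genes = {
    "geneA": (4.2, 0.10, "oncogene"),
    "geneB": (0.9, 0.05, "oncogene"),
    "geneC": (3.1, 0.60, "kinase"),
    "geneD": (2.5, 0.15, "kinase"),
}


def ratio_test(lo, hi):
    """Range test: keep genes whose X/Y ratio lies in [lo, hi]."""
    return lambda g: lo <= genes[g][0] <= hi


def cv_test(max_cv):
    """Statistical test: keep genes whose CV does not exceed max_cv."""
    return lambda g: genes[g][1] <= max_cv


def apply_filter(candidates, tests):
    """Boolean AND: a gene stays in the working set only if it passes every test."""
    return {g for g in candidates if all(t(g) for t in tests)}


# Each added test can only shrink the working set.
working = apply_filter(genes, [ratio_test(2.0, 10.0), cv_test(0.2)])

# A saved gene subset can then be combined with the filter result.
kinases = {g for g, v in genes.items() if v[2] == "kinase"}
both = working & kinases          # intersection
either = working | kinases        # union
only_working = working - kinases  # difference

print(sorted(working), sorted(both), sorted(either), sorted(only_working))
```

Representing each test as a predicate makes it easy to switch individual tests on or off, and saving results as plain sets lets them feed later set operations or another filter pass.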
Gene data filters - a Boolean AND of gene set tests
A primary MAExplorer concept is that of the gene data filter, which selects
a working set of genes by the conjunction (Boolean AND) of user
selectable tests. Each test further restricts the working set of genes
to those meeting the test. These criteria include gene membership in
particular gene classes, membership in particular user defined or
computed gene subsets, and meeting a variety of statistical
constraints. Statistics include intra- and inter-array CV and X-Y sets
t-tests. Range test criteria include X/Y ratio ranges and histogram
bins, and intensity ranges and histogram bins. Membership criteria
include testing whether genes are in the current cluster (derived from
cluster analysis), gene set membership, etc. By selectively including
one or more of these filter restrictions, the user can home in on the
data that appears to be of interest. Of course, as in real mining, what
appears interesting may turn out not to be interesting on further
investigation.
Set operations on gene subsets
Because of the complexity of comparing many different replicated
samples, it may be difficult to manually organize the resulting
comparisons. MAExplorer offers set-theoretic operations on sets of genes and sets of hybridized samples
(i.e. intersection, union, difference) to help with this organization
(step 9 in Table 3.2). The results
of set operations may be saved and used in subsequent set operations,
normalization, as well as with the data filter. This is useful when
comparing and documenting procedures, methods, and analyses from
several subsets of experiments.
User exploration states
Users need to be able to save and restore the current state of their
explorations of the data and option settings to document and continue
at later times. When running in stand-alone mode, the user
may save their data mining session on the local disk in named startup
files (.mae file extension). Clicking on one of the startup files will
restart MAExplorer and restore the state to that of the time it was
saved. In addition to filter and parameter status, the HP-X, HP-Y,
HP-X and HP-Y 'sets', HP-E 'list', the named gene sets and HP
condition sets are saved as part of the state.
User groupware sharing of exploration states with collaborators
[In the future], these could be saved on a public Web server using
multiple named state files. These would be protected for the user by a
login procedure. Groupware sharing of these intermediate
exploratory results becomes available when a user allows another user to
access selected states. User states and groupware sharing complete
step 11 in the analysis described in Table 3.2.
3.2 Steps in data mining, discovery, and analysis
An analysis scenario may use many methods for viewing the data. A
typical sequence of analysis steps is listed below in Table 3.2 in the
order they might be performed. Note that this is a rough guide
for a possible analysis. Iterating and backing up over
some of these steps is required for data mining complex sets of
conditions, especially when setting constraints for the "data
filter" (step 4) as the user focuses on subtle patterns of interest
(c.f. Figure 3.1).
Scatter plots are useful for visualizing data from two conditions
The scatter plot method (step 5) allows the user to plot the intensity
data between two samples, the X-sample and the Y-sample. Gene data
may be spot data for two different samples (HP-X and HP-Y), means of
two different sets of replicate hybridized samples (sets of
HP-X and sets of HP-Y), or the left and right normalized replicate
data (F1 vs. F2) for the current hybridized sample. If Cy3/Cy5 data is
used, then each sample is the ratio of data from two different
hybridized samples. So if we have sample Cy3a and Cy3b then HP-X could
be Cy3a/Cy5 and HP-Y could be Cy3b/Cy5 such that we are scaling the
Cy3a and Cy3b samples using a common Cy5 normalization sample.
Scatter plots are useful for obtaining a better understanding of the
outliers when comparing different hybridized samples and determining
the reproducibility of spotting when comparing F1 vs. F2 data or
replicate sample data.
Filtering genes by histogram plots of ratios, Zdiffs or intensities
Histogram plots may be generated from either X/Y ratios or (X-Y)
Zdiffs of two different hybridized samples (single samples or X and Y
replicates) or from the F1/F2 intensities of a single hybridized
sample. Selecting a bin in a histogram restricts filtered genes to
those that are contained in that histogram bin. As an alternate
method, data filtering by ratio (Zdiff) or intensity range may
be used with adjustable range scrollers independent of the
histograms. However, histograms and scrollers may be used together.
For example, one could filter by the ratio histogram after filtering
out genes with low-intensity values that may be considered noise using
the intensity sliders. That might help eliminate falsely high ratios
resulting from dividing high X values by very small, noisy Y values.
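The effect of applying an intensity threshold before a ratio test can be sketched in a few lines of Python. This is only an illustration and not MAExplorer code: the function name `filtered_ratios`, the gene names, and the intensity values are hypothetical.

```python
def filtered_ratios(x, y, min_intensity):
    """Return gene -> X/Y ratio, keeping only genes whose X and Y
    intensities both reach min_intensity (hypothetical threshold)."""
    return {
        gene: x[gene] / y[gene]
        for gene in x
        if gene in y and x[gene] >= min_intensity and y[gene] >= min_intensity
    }

# Hypothetical normalized intensities for three genes in samples X and Y.
x = {"g1": 5.0, "g2": 4.0, "g3": 6.0}
y = {"g1": 2.5, "g2": 0.01, "g3": 3.0}

# g2 would have a ratio of 400 because its Y value is close to zero;
# the intensity threshold removes it before the ratio is examined.
ratios = filtered_ratios(x, y, min_intensity=0.5)
```

Without the threshold, g2 would dominate the high end of the ratio histogram even though its Y value is likely noise.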
Histograms are useful for getting a better understanding of the range
and distribution of the gene intensities or ratios.
Expression profile plots (EP-plot) of N conditions for viewing time series, etc.
List HP-E is an ordered list of samples, in contrast to HP-X and
HP-Y, which are unordered sets of samples. The expression profile (step
5) of a gene is the plot of its normalized intensity as a function of
the samples in the ordered HP-E list. It may be plotted for the
current gene in a pop-up window. Selecting a different current gene
causes the EP-plot to be displayed for that gene. Multiple EP-plots
may be created to view the differences between a few genes you are
investigating further. The HP name button pops up a window with
the ordered list of samples so you can see the details of the sample
names being plotted. Selecting a line in a plot displays the intensity
data and sample name for that hybridized sample. The data may be
plotted as a bar, point or continuous curve and error bars may be
turned off to better compare multiple plots.
Finding clusters of genes with similar expression profiles: similar, cluster counts, K-means, and hierarchical methods
We may define a cluster of genes as a set of genes whose expression
profiles are found to be similar (step 6). The samples used in
computing the expression profiles are specified by the HP-E ordered
list. You can scale the list of normalized intensity data for each
gene to 1.0 (resulting in finding genes with similar shaped
EP-plots). Alternatively, if you don't scale this data it will
cluster more on magnitude changes. You can select either the Euclidean
distance or the correlation coefficient of the EP lists between two
genes as the measure of gene-gene distance. Similarity is 1.0 -
normalized distance.
Gene reports: dynamic spreadsheets for Web access or tab-delimited for exporting data to Excel
Pop-up report windows (step 7) may be generated for either individual
genes or global array sample data. Instances of the latter include
experimental information and Web links, global statistics, correlation
coefficients between array samples, etc. Gene reports may present
this data in a number of ways. These include: highest/lowest gene
ratios, profiles, parametric and cluster statistics, etc. Reports may
be presented as either dynamic Web-interactive spreadsheet tables or
as static tab-delimited tables. The latter is useful for exporting
data using cut and paste into Excel (step 10). If the user clicks on
a blue hyperlinked cell in a dynamic spreadsheet table, it pops up
another Web browser window and loads it with data (step 8) from the
respective Internet genomic database such as mAdb Clone DB, UniGene,
GenBank, dbEST, GeneCard, and MGAP model and histology Web pages.
Collaborative groupware environment
Having immediate access to collaborators' data is a powerful research
tool. A collaborative environment is being implemented for MAExplorer
that allows groups of users to share data and intermediate
results. These capabilities include: 1) the ability to save and
restore exploratory data mining sessions (states) through the Web
server including named sets of genes, and 2) to selectively share
these states with collaborators. The latter process is sometimes
called a groupware environment because it offers a collaborative group
the ability to share and interact. These capabilities are modeled
after our WebGel system (Lemkin
et al., 1999b). In addition, users can create a custom database
Web page as a subset of samples from the entire database. This may be
saved on their own computer through their Web browser's "File/Save as"
command. This hypertext file could then be used at a later time to
access the database or be E-mailed to a collaborator to do the same.
3.2.1 Definition of expression profile
It is helpful to define an expression profile. There may be alternate
definitions, but the following is useful for getting an understanding
of how it might be computed. An expression profile
$e_j$ of an ordered list of $N$ samples ($k = 1$
to $N$) for a particular gene $j$ is a vector of scaled
expression values $v_{jk}$.
$$e_j = (v_{j1}, v_{j2}, v_{j3}, \ldots, v_{jN})$$
A difference between two genes $p$ and $q$ may be estimated as an
$N$-dimensional metric "distance" between $e_p$ and
$e_q$. The Euclidean distance is then defined as
$$d_{pq} = \left( \frac{1}{N} \sum_{k=1}^{N} (v_{pk} - v_{qk})^2 \right)^{1/2}$$
Other distance measures may include correlation coefficient,
city-block (or Manhattan) distance, etc. With distances normalized
to the range 0 to 1, the similarity of the two genes is
$$s_{pq} = 1 - d_{pq}$$
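As a worked illustration of this definition, the following Python sketch computes the Euclidean distance between two expression profiles and the corresponding similarity. It is not taken from MAExplorer (which is written in Java); the function names and the two sample profiles are hypothetical, and the profiles are assumed to be already scaled so that the distance falls between 0 and 1.

```python
import math

def euclidean_distance(ep, eq):
    """Root-mean-square difference between two profiles measured over
    the same ordered list of N samples."""
    if len(ep) != len(eq):
        raise ValueError("profiles must cover the same ordered sample list")
    n = len(ep)
    return math.sqrt(sum((vp - vq) ** 2 for vp, vq in zip(ep, eq)) / n)

def similarity(ep, eq):
    """Similarity as 1 - distance; assumes the distance is normalized to [0, 1]."""
    return 1.0 - euclidean_distance(ep, eq)

# Hypothetical scaled profiles for genes p and q over N = 4 ordered samples.
e_p = [0.2, 0.4, 0.6, 0.8]
e_q = [0.2, 0.4, 0.6, 0.4]

d_pq = euclidean_distance(e_p, e_q)  # only the last sample differs (by 0.4)
s_pq = similarity(e_p, e_q)
```

Here only one of the four samples differs, by 0.4, so the distance is sqrt(0.16 / 4) = 0.2 and the similarity is 0.8.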
3.2.2 Clustering Methods
Clusters represent one way to identify similar gene expression across
a set of experiment samples. There are many ways to cluster the data,
some of which are available in MAExplorer. These include clustering
genes similar to a seed gene (3.2.2.1), K-means clustering (3.2.2.2),
and hierarchical clustering (3.2.2.3).
Other methods include Self-Organizing Maps (SOM), fuzzy clustering,
Support Vector Machines (SVM), etc.
3.2.2.1 Clustering similar genes
If we have a particular gene $s$ (the "seed" gene), we may want
to find the set of all genes $\{g_j\}$ similar to
$g_s$. We can find this set by testing each gene against the seed.
We define a particular gene $g_j$ as similar to the seed
gene if the distance between genes $s$ and $j$ meets
the following criterion.
$$d_{js} < T$$
The threshold $T$ is set by the investigator and in MAExplorer
is changed using a slider. Typically, the set of all genes
$\{g_j\}$ found is sorted by similarity before being
viewed.
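The seed-gene test can be sketched as follows. This is a minimal Python illustration, not MAExplorer's implementation: the function names, gene names, profile values, and the threshold of 0.2 are all hypothetical, and the Euclidean distance of Section 3.2.1 is assumed as the distance measure.

```python
import math

def distance(a, b):
    """Euclidean (root-mean-square) distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

def similar_genes(profiles, seed, threshold):
    """Return (gene, distance) pairs whose distance to the seed gene is
    below the threshold, sorted with the most similar gene first."""
    seed_profile = profiles[seed]
    hits = []
    for gene, profile in profiles.items():
        if gene == seed:
            continue
        d = distance(profile, seed_profile)
        if d < threshold:
            hits.append((gene, d))
    return sorted(hits, key=lambda hit: hit[1])

# Hypothetical scaled expression profiles over three ordered samples.
profiles = {
    "geneS": [0.1, 0.5, 0.9],   # the seed gene
    "geneA": [0.1, 0.5, 0.8],
    "geneB": [0.9, 0.5, 0.1],   # opposite trend, should be rejected
    "geneC": [0.3, 0.5, 0.9],
}

result = similar_genes(profiles, "geneS", threshold=0.2)
```

Raising the threshold (moving the slider) admits more genes into the cluster; with a threshold of 0.2 only geneA and geneC qualify in this example.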
3.2.2.2 K-means clustering
K-means clustering finds K clusters of genes with similar expression
profiles to a given gene (see
Sneath and Sokal, 1973). Given the number of clusters K,
we could use high variance of clusters to determine if they should
split into sub-clusters. K-means clustering does not need a distance
matrix (see Hierarchical clustering which follows), so it is faster
and may cluster large numbers (N) of genes. However, it is
highly dependent on seed selection. It may be useful for getting an
initial estimate - especially if other techniques (such as silhouette
plots) are also used. The following is a simplified definition of one
way to compute a set of K-means clusters of gene expression profile
data.
Algorithm:
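As an illustration, the assignment and update steps common to K-means formulations can be sketched in Python. This is a generic sketch and not necessarily the procedure MAExplorer uses: the function names and sample profiles are hypothetical, the initial seeds are passed in explicitly (the result depends on them, as noted above), and the Euclidean distance of Section 3.2.1 is assumed.

```python
import math

def dist(a, b):
    """Euclidean (root-mean-square) distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

def mean_profile(rows):
    """Component-wise mean of a non-empty list of profiles."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def k_means(profiles, seeds, max_iter=100):
    """Cluster profiles around K = len(seeds) centers.
    Returns (labels, centers), where labels[i] is the cluster of profiles[i]."""
    centers = [list(seed) for seed in seeds]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each gene joins the nearest cluster center.
        new_labels = [
            min(range(len(centers)), key=lambda c: dist(p, centers[c]))
            for p in profiles
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centers will not move
        labels = new_labels
        # Update step: each center becomes the mean of its member profiles.
        for c in range(len(centers)):
            members = [p for p, lab in zip(profiles, labels) if lab == c]
            if members:
                centers[c] = mean_profile(members)
    return labels, centers

# Hypothetical scaled profiles for four genes over two samples.
profiles = [[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9]]
labels, centers = k_means(profiles, seeds=[profiles[0], profiles[2]])
```

No N x N distance matrix is built: each pass computes only N x K gene-to-center distances, which is why the method scales to large numbers of genes.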
3.2.2.3 Hierarchical clustering
Hierarchical clustering of a set of genes will generate a binary tree
of clusters with the genes at the terminal ends of the tree and a
single cluster of the entire tree at the top (also called the root) of
the tree. See (Sneath and Sokal,
1973) for a discussion on hierarchical clustering. There are many
other variants of hierarchical clustering. Hierarchical clustering
requires a distance matrix or the equivalent of one [there are more
efficient ways to compute it]. For $N$ genes (terminal
clusters), it generates $2N-1$ clusters. The distance matrix is an
upper triangular matrix $D$ of $d_{pq}$ with
$N(N-1)/2$ entries.
Algorithm:
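As an illustration, one common variant is agglomerative clustering with average linkage, sketched below in Python. The choice of average linkage is an assumption made for this example, not a statement about which variant MAExplorer implements; the function names and sample profiles are hypothetical.

```python
import math
from itertools import combinations

def dist(a, b):
    """Euclidean (root-mean-square) distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

def hierarchical_clusters(profiles):
    """Average-linkage agglomerative clustering.
    Genes are terminal clusters 0..N-1; each merge creates a new cluster
    with the next free id. Returns the list of merges as
    (left_id, right_id, linkage_distance)."""
    n = len(profiles)
    # Upper triangular distance matrix: one entry per gene pair p < q.
    d = {(p, q): dist(profiles[p], profiles[q])
         for p, q in combinations(range(n), 2)}
    members = {i: [i] for i in range(n)}  # active cluster id -> gene indices
    merges = []
    next_id = n

    def linkage(a, b):
        # Mean gene-to-gene distance between the two clusters.
        pairs = [(min(p, q), max(p, q)) for p in members[a] for q in members[b]]
        return sum(d[pair] for pair in pairs) / len(pairs)

    while len(members) > 1:
        # Merge the closest pair of active clusters.
        a, b = min(combinations(sorted(members), 2),
                   key=lambda ab: linkage(*ab))
        merges.append((a, b, linkage(a, b)))
        members[next_id] = members.pop(a) + members.pop(b)
        next_id += 1
    return merges

# Hypothetical scaled profiles for four genes over two samples.
profiles = [[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.7, 0.9]]
merges = hierarchical_clusters(profiles)
```

With N = 4 genes there are N - 1 = 3 merges, so the tree contains 4 + 3 = 7 = 2N - 1 clusters, and the distance dictionary holds N(N-1)/2 = 6 entries.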
3.3 Display of gene spot intensity and identification data measurements
You may select the current gene by clicking in the pseudoarray image or in the X-Y scatter plot.
MAExplorer then reports the
microarray grid coordinates, normalized quantified spot intensity
data, plate coordinates, gene name (if known) and associated data for
that gene. If you are displaying a pseudocolor ratio (X/Y) or Zdiff
(X-Y) image, it will report HP-X/HP-Y or (HP-X - HP-Y) data
respectively. It also sets the gene as the current gene. The
pseudoarray image coordinates are reported as:
[<field>-<grid name><row#>,<col#>].
e.g.
[1-A4,3]
[1-A4,5] intensity=4.5267, (Norm.: median intensity)
CloneID: 1248228, dbEST3': 2279072, GenBankAcc3': AI463183, UniGene: Mm.13859, plate[5,A,5]
GeneName: Mus musculus ribosomal protein L41 mRNA, complete cds
b) Field F1 and F2 replicate spots for a single sample. The top line
is shown for each of the different normalization methods.
[1-A4,5] intensity[F1]=-0.3067, intensity[F2]=-0.2312, F1-F2=-0.0755, (Norm.: Zscore intensity)
[1-A4,5] intensity[F1]=4.5267, intensity[F2]=6.2408, F1/F2=0.7253, (Norm.: median intensity)
[1-A4,5] intensity[F1]=0.8755, intensity[F2]=1.1457, F1-F2=-0.2701, (Norm.: log median intensity)
[1-A4,5] intensity[F1]=-0.1442, intensity[F2]=-0.0945, F1-F2=-0.0497, (Norm.: Z-score, stdDev, log intensity)
[1-A4,5] intensity[F1]=-0.1533, intensity[F2]=-0.1004, F1-F2=-0.0528, (Norm.: Z-score, mean abs.deviation, log intensity)
[1-A4,5] intensity[F1]=630.9911, intensity[F2]=869.9273, F1/F2=0.7253, (Norm.: calibration DNA intensity)
[1-A4,5] intensity[F1]=1919.9376, intensity[F2]=2646.957, F1/F2=0.7253, (Norm.: scale to max. (65K) intensity)
CloneID: 1248228, dbEST3': 2279072, GenBankAcc3': AI463183, UniGene: Mm.13859, plate[5,A,5]
GeneName: Mus musculus ribosomal protein L41 mRNA, complete cds
If the "Pseudocolor HP-X/HP-Y ratio or Zdiff" option is selected in
the "Show Microarray" submenu, data is reported as either Ratio or
Zdiff data depending on the normalization method selected. The data
used in the following examples is for C57B6 pregnancy day 13 (HP-X)
compared with Stat5a (-,-) pregnancy day 13 (HP-Y).
[1-A4,5] HP-XY: mn(X,Y)=(5.383,6.834) (X/Y)(F1,F2,mean)=(0.651,0.928,0.787), (Norm.: median intensity)
CloneID: 1248228, dbEST3': 2279072, GenBankAcc3': AI463183, UniGene: Mm.13859, plate[5,A,5]
GeneName: Mus musculus ribosomal protein L41 mRNA, complete cds
d) Zdiff data for two separate samples X and Y. Ratio data for the
field F1 and F2 spot data as well as the mnX-mnY Zscore difference is
reported. The three Zscore, ZscoreLog, and logMean normalizations were
used in this example (first lines are shown).
[1-A4,5] HP-XY: mn(X,Y)=(-0.269,0.151) (X-Y)(F1,F2,mean)=(-0.470,-0.370,-0.420), (Norm.: Zscore intensity)
[1-A4,5] HP-XY: mn(X,Y)=(-0.119,0.051) (X-Y)(F1,F2,mean)=(-0.199,-0.142,-0.170), (Norm.: Z-score, stdDev, log intensity)
[1-A4,5] HP-XY: mn(X,Y)=(1.010,1.224) (X-Y)(F1,F2,mean)=(-0.362,-0.064,-0.213), (Norm.: log median intensity)
CloneID: 1248228, dbEST3': 2279072, GenBankAcc3': AI463183, UniGene: Mm.13859, plate[5,A,5]
GeneName: Mus musculus ribosomal protein L41 mRNA, complete cds
e) Example of when the "Use dual HP-X & HP-Y Pseudoimage" mode is
enabled in the "Show Microarray" submenu of the "Plot" menu. This
displays mean data for the HP-X and HP-Y data side-by-side. The median
normalization was selected.
[1-A4,5] intensity[X]=5.3837, intensity[Y]=6.8342, X/Y=0.7877, (Norm.: median intensity)
CloneID: 1248228, dbEST3': 2279072, GenBankAcc3': AI463183, UniGene: Mm.13859, plate[5,A,5]
GeneName: Mus musculus ribosomal protein L41 mRNA, complete cds
Reporting for multiple hybridized samples when using HP-X/-Y 'sets'
If you have enabled MAExplorer to "use HP-X and HP-Y 'sets' of
multiple samples rather than single samples" in the Samples menu, it
will report a spot differently using the means (mn), standard
deviations (S.D.), and coefficients of variation (CV) for the samples in
the HP-X and HP-Y 'sets'. For duplicate fields, these are computed
using the normalized average of F1 and F2 spots for each gene in each
sample. The data used in the following examples is for three C57B6
pregnancy day 13 (HP-X) samples, and five Stat5a (-,-) pregnancy day
13 (HP-Y) samples.
[1-A4,5] HP-X 'set' mean intensity=3.295 stdDev=1.482 CV=0.449 n=3, (Norm.: median intensity)
CloneID: 1248228, dbEST3': 2279072, GenBankAcc3': AI463183, UniGene: Mm.13859, plate[5,A,5]
GeneName: Mus musculus ribosomal protein L41 mRNA, complete cds
g) Multiple HP-XY 'sets' using median normalization for the pseudoarray image
display for the HP-Y 'set' of five Stat5a (-,-) samples.
[1-A4,5] HP-Y 'set' mean intensity=8.180 stdDev=0.986 CV=0.120 n=5, (Norm.: median intensity)
CloneID: 1248228, dbEST3': 2279072, GenBankAcc3': AI463183, UniGene: Mm.13859, plate[5,A,5]
GeneName: Mus musculus ribosomal protein L41 mRNA, complete cds
h) Multiple HP-XY 'sets' using median normalization for the
pseudoarray image display for the HP-X and HP-Y 'sets' when the "Use dual
HP-X & HP-Y Pseudoimage" mode is enabled in the "Show Microarray"
submenu of the "Plot" menu.
[1-A4,5] HP-XY 'sets': mn(X,Y)=(3.295,8.180) mnX/mnY=0.402 SD(X,Y)=(1.482,0.986) CV(X,Y)=(0.449,0.120)\
n(X,Y)=(3,5), (Norm.: median intensity)
CloneID: 1248228, dbEST3': 2279072, GenBankAcc3': AI463183, UniGene: Mm.13859, plate[5,A,5]
GeneName: Mus musculus ribosomal protein L41 mRNA, complete cds
i) Multiple HP-XY 'sets' using median normalization for ratio (HP-X/HP-Y) data
for the "Pseudocolor HP-X/HP-Y Ratio or Zdiff" display.
[1-A4,5] HP-XY 'sets': mn(X,Y)=(3.295,8.180) mnX/mnY=0.402 SD(X,Y)=(1.482,0.986) CV(X,Y)=(0.449,0.120) \
n(X,Y)=(3,5), (Norm.: median intensity)
CloneID: 1248228, dbEST3': 2279072, GenBankAcc3': AI463183, UniGene: Mm.13859, plate[5,A,5]
GeneName: Mus musculus ribosomal protein L41 mRNA, complete cds
[1-A7,20] HP-XY: mn(X,Y)=(3.449,0.853) (X/Y)(F1,F2,mean)=(4.09,4.008,4.041), (Norm.: median intensity)
CloneID: 1382656, dbEST5': 1775754, GenBank 5': AI036495, UniGene: Mm.300, plate[12,A,8]
GeneName: Carbonic anhydrase 3
Reporting for Cy3 and Cy5 channels for a single hybridized sample
k) If you have Cy3/Cy5 data, then you can look at the two channels for
a single sample (the current HP sample). The following example uses median normalization with
the display set to "Pseudocolor Red(Cy5)-Yellow-Green(Cy3) Cy3/Cy5
data".
[1-A6,11] Cy5/Cy3=0.3588, Cy5=67.324, Cy3=187.622, (Norm.: median intensity)
CloneID: IMAGE:1054189,
GeneName: expressed sequence AW213287
Reporting HP-X (Cy3 or Cy5) vs HP-Y (Cy3 or Cy5) data for 2 samples
l) If you want to compare Cy3 or Cy5 in the HP-X sample with a Cy3 or Cy5
value in the HP-Y sample, you do it through the special Cy3,Cy5
scatter plots. There are four types of plots:
After the plot is started, clicking on a point in the scatter plot will report data
for that point, as shown in
the following example where HP-X Cy3 is plotted against HP-Y Cy3.
[1-A5,16] intensX=4.695, intensY=5.923, (X-Y)=-1.2275, (Norm.: log median intensity)
CloneID: IMAGE:963758,
GeneName: RIKEN cDNA 2410114O14 gene
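If report lines like those in this section are saved to a text file, the simpler ones can be picked apart with regular expressions. The Python sketch below is an illustration only: the function name and patterns are hypothetical, it assumes the grid name consists of letters, and it handles only plain `key=value` entries (not the parenthesized `mn(X,Y)=(...)` form).

```python
import re

# [<field>-<grid name><row#>,<col#>], assuming the grid name is alphabetic.
COORD = re.compile(r"\[(\d+)-([A-Za-z]+)(\d+),(\d+)\]")
# Simple key=value entries such as intensity[F1]=4.5267 or F1/F2=0.7253.
PAIR = re.compile(r"([A-Za-z][\w/\-]*(?:\[\w+\])?)=(-?\d+(?:\.\d+)?)")
# Trailing normalization note, e.g. (Norm.: median intensity).
NORM = re.compile(r"\(Norm\.:\s*(.*)\)\s*$")

def parse_report_line(line):
    """Parse the first line of a spot report into its components."""
    line = line.strip()
    coord = COORD.match(line)
    if coord is None:
        raise ValueError("line does not start with a pseudoarray coordinate")
    norm = NORM.search(line)
    return {
        "field": int(coord.group(1)),
        "grid": coord.group(2),
        "row": int(coord.group(3)),
        "col": int(coord.group(4)),
        "values": {key: float(val) for key, val in PAIR.findall(line)},
        "norm": norm.group(1) if norm else None,
    }

# Example b) above, median normalization.
line = ("[1-A4,5] intensity[F1]=4.5267, intensity[F2]=6.2408, "
        "F1/F2=0.7253, (Norm.: median intensity)")
record = parse_report_line(line)
```

The parsed values can be cross-checked against each other: dividing the F1 intensity by the F2 intensity reproduces the reported F1/F2 ratio to the printed precision.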
3.4 Selecting subsets of genes using the data Filter
Genes may be selected on a number of criteria specified by the data Filter (Section 2.4.3)
that is a cascade of data tests. The first might be the gene class (Section 2.4.1) to
restrict the set of all genes to a particular subset. Various numeric
and statistical data tests might be applied on the remaining genes to
exclude those not meeting these tests. For example, genes having a
high coefficient of variation between duplicate spots on the same
sample or duplicate samples could be eliminated. Then one could
select genes that had a (HP-X/HP-Y) ratio greater than 4.0 but less
than 8.0, etc. The latter could be done using either the ratio
scrollers or by clicking on that bin in the ratio histogram plot. See
the Filter menu options and look at one of the tutorials (Appendix A) for ideas on
adjusting the Filter to close in on a particular subset of genes.
3.5 Selecting subsets of hybridized sample conditions
Sets and lists of hybridized sample conditions (HP-X and HP-Y sets,
HP-E list) may be selected using various commands from the Samples Menu (Section 2.2)
including pull-down menus, guessing by name or part of a name, or
using a
"Chooser" (see Figure 2.2.1) to design your settings for the
current HP-X and HP-Y sets and HP-E list. The Chooser is the easiest
way to select these entries. In addition, if you want to change the
current HP-X or HP-Y individual sample, you can do this directly from the
pseudoarray image by clicking on the [X] or [Y] part of the image
and then selecting the particular sample to use. Note that if the
mouse-over checkbox is enabled, then moving the mouse over the sample
names gives you the full sample name. Otherwise, the sample name may
be truncated.
3.6 Setting threshold values using the state-scroller sliders
You may filter genes using a variety of thresholding operations (see
the Filter menu, Section
2.4.3, to select any of these). These include a spot
intensity (per channel) range [SI1:SI2], gene intensity range [I1:I2],
ratio range [R1:R2], Zdiff range [Z1:Z2], Coefficient Of Variation
(CV) range [0:1.0], p-value range [0:1.0] for the
t-test, etc. Additional threshold scrollers are used with the
clustering methods including the number of clusters (default 6), the
maximum cluster distance from one gene to another gene for the latter
to be considered in the same cluster, and the absolute difference
between HP-X and HP-Y.
Slider name        | Associated with operation
-------------------|--------------------------------------------------
Spot Intensity SI1 | Filter by spot intensity range per channel
Spot Intensity SI2 | Filter by spot intensity range per channel
Percent SI OK      | Filter by percent of spots whose spot intensity is in threshold range criteria meets the AT LEAST or AT MOST criteria
Intensity I1       | Filter by gene intensity range
Intensity I2       | Filter by gene intensity range
Ratio R1           | Filter by gene X/Y ratio range
Ratio R2           | Filter by gene X/Y ratio range
Zdiff Z1           | Filter by gene X-Y Zdiff range
Zdiff Z2           | Filter by gene X-Y Zdiff range
Ratio CR1          | Filter by Cy3/Cy5 gene X/Y ratio range
Ratio CR2          | Filter by Cy3/Cy5 gene X/Y ratio range
Zdiff CZ1          | Filter by gene (Cy3-Cy5) X-Y Zdiff range
Zdiff CZ2          | Filter by gene (Cy3-Cy5) X-Y Zdiff range
p-Value            | Filter by t-Test
Spot CV            | Filter by Coefficient of Variation
Cluster Distance   | Plot - cluster by expression similarity
# of Clusters      | Plot - K-means clustering
Diff HP-XY         | Filter by absolute difference (HP-X,HP-Y)
Spot Quality       | Filter by continuous spot quality (If data available)
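Each range slider contributes one test to the data Filter, and a gene stays in the working set only if it passes every enabled test (a Boolean AND). The following Python sketch illustrates that idea; it is not MAExplorer code, and the function names, measurement keys, gene data, and threshold values are hypothetical.

```python
def in_range(lo, hi):
    """Build a test that passes when lo <= value <= hi."""
    return lambda value: lo <= value <= hi

def apply_filter(genes, tests):
    """Return the set of genes that pass ALL tests (Boolean AND).
    genes: gene name -> dict of measurements
    tests: list of (measurement name, predicate) pairs"""
    return {
        name
        for name, data in genes.items()
        if all(predicate(data[field]) for field, predicate in tests)
    }

# Hypothetical per-gene measurements.
genes = {
    "g1": {"ratio": 5.0, "cv": 0.10, "intensity": 3.2},
    "g2": {"ratio": 6.5, "cv": 0.60, "intensity": 2.8},   # fails CV test
    "g3": {"ratio": 2.0, "cv": 0.05, "intensity": 4.1},   # fails ratio test
    "g4": {"ratio": 7.9, "cv": 0.20, "intensity": 0.05},  # fails intensity test
}

# Sliders set to: ratio range [R1:R2] = [4.0:8.0], CV at most 0.25,
# and gene intensity range [I1:I2] = [0.5:10.0].
tests = [
    ("ratio", in_range(4.0, 8.0)),
    ("cv", in_range(0.0, 0.25)),
    ("intensity", in_range(0.5, 10.0)),
]

working_set = apply_filter(genes, tests)
```

Because the tests are combined with AND, adding a test can only shrink the working set, and removing one can only enlarge it.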
3.7 Exporting report and plot data
Data is typically reported in MAExplorer in report and plot
windows. These may be saved using cut and paste if you are using
MAExplorer as an Applet or with "SaveAs" buttons on the popup
windows if you are running it as a stand-alone application. Reports
are then saved as text (.txt extension) files, and plots are saved as
GIF (.gif extension) files.

4. Status and Bugs of MAExplorer
This section discusses the status and known bugs in MAExplorer. It
also discusses dealing with the reporting of fatal errors so we can
resolve them.
Disclaimer: none of our code ever has bugs... :-). So despite
this, we are working on resolving these bugs and implementing planned
functionality. Here is a short non-inclusive list of known problems
that we are resolving. We welcome and encourage you to E-mail us with
any bugs that you find do exist as well as suggestions for
capabilities you would like to see. As the new open-source MAEPlugins facility evolves, most
new (and some old) functionality will migrate to these plugins. Then
the user community can help maintain these analytic methods.
4.1 Known Bugs in MAExplorer
4.1.1 Browser Applet Bugs
4.1.3 Downloading and Installer Bugs
"Recommended version for your computer
Download installer for ...your OS..."?
Occasionally, we have seen instances where you can not install
MAExplorer from within the Web browser. The solution is to explicitly
download the particular Platform for your OS in the Available
Installers list. And then to follow the instructions on running it.
limit stacksize unlimited
4.1.4 Computation speed and display Bugs
4.1.5 User state and login Status
4.1.6 Data file names Bug
4.1.7 Gene Sets Bugs
4.1.8 Clustering Bugs
4.1.9 Expression profile plots
4.1.10 Data conversion problems
4.1.11 Java Plugins bugs
4.2 Revision notes
This section lists the revision history and is useful for deciding
whether to upgrade to the most recent release. You may want to check
for the current
"Stable release" available on the MAExplorer Web site. That may be
different from the Stable
release listed in this copy of the Reference Manual. The "Beta
release" listed below the Stable release in the previous links is
experimental but may generally be downloaded as it has more
functionality. If you experience problems, you can just reinstall the
Stable version.
Note: An archive of some of the
stable older releases is available on the NCI/LECB Web site for a
limited period.
Cvt2Mae Version 0.66: A bug has been fixed, making it easier
to automatically find the first row of spot data when scanning the
user's input data file. Previously, the user might have to
manually enter that starting data row number in the Edit Layout
wizard.
Renamed all previous references in the program to "hybridization probe" or
"hybridization sample probe" to the new term "hybridized sample" for
clarity. Changed many "HybProbe Menu" to "Samples Menu" and also many
menu selections as well as plot and report labels to reflect this
change. We are in the process of updating the manual computer screen
figures and PDF slide presentations so that "hybridization sample
probe" is shown as "hybridized sample". Also fixed other minor
problems including fixing the inverted color scale for the
"Pseudocolor Red(Cy5)-Yellow-Green(Cy3) Cy5+Cy3 ratio or Zdiff"
command.
Version 0.94.01:
Major version release.
Moved "Cluster Plots" submenu of the "Plots" submenu
up one level in the "Analysis" menu.
Version 0.93.01: Major version release.
Optimized colors in grayscale display for
"Pseudocolor (HP-X,HP-Y) 'sets' p-value". Fixed error and optimized
t-test computation.
Version 0.92.22: Last Stable Release.
This corrects a few minor bugs including crashing when starting up
an empty database (that bug was introduced sometime in the last month).
It is the first release with some reorganized code.
Version 0.91.01: Major version release.
(very Beta) Changed convention to use "MasterID" as the master gene
index. This makes it more flexible than when the CloneID was used as the
master gene index. Added LocusLink.
Version 0.90.01: Major version release.
Because MAExplorer can be used with both spotted clone arrays and
oligo arrays, we renamed clones as genes (except where Clone ID is
used) in both the MAExplorer program and in the Reference Manual.
Version 0.89.01: Major version release.
There is not enough memory to cluster current filtered clones. Options:
1. reduce the number of filtered clones and try again, or,
2. disable cluster-cache (Clustering menu) - will be VERY slow.
4.3 Web Browser problems when running MAExplorer as an applet
Because MAExplorer is a large system, there may be occasional
problems running it in some Web browsers on some operating systems. We
recommend you run MAExplorer as a stand-alone application as it is
more robust.
You can download
HotJava for the Windows-95/-98/-NT/-2000/-XP and for Solaris
(Unix).
4.4 Handling fatal error reporting (i.e. DRYROT errors)
If you encounter a fatal error that is detected by MAExplorer, it will
pop up an error reporting window. We call this a "DRYROT" error (thanks
to "S.A.I.L." - Stanford AI Lab) because something is wrong in the
program or in the user's data files from which MAExplorer cannot
recover. This type of error should not have happened. Please save and
e-mail the report to us so we can try to fix the bug or diagnose the
problem. The following figure shows an example of part of a DRYROT
error report.

Release Archive for stand-alone MicroArray Explorer on NCI/LECB
This is an archive of some of the older stable versions of the
MAExplorer stand-alone application program. These are the full
installers which include the Java JDKs for all operating systems as
well as the MGAP database. The user reference manual (available as a
zip file) specific for that version is also included. After a while,
we will remove some of the older releases. To find what the current
and beta releases are, see the Install home page. The
changes between releases are listed in Section 4.2 Revision Notes.
Release | Release Date | Manual (.zip) for Release
--------|--------------|--------------------------
0.96.02 | 07-02-2002   | -
0.95.20 | 05-31-2002   | MaeRefMan.zip (10Mb)
0.95.16 | 05-24-2002   | MaeRefMan.zip (10Mb)
0.95.04 | 03-22-2002   | MaeRefMan.zip (10Mb)
Acknowledgements
Primary contributors to MAExplorer were Peter Lemkin
(LECB/NCI), Greg Thornwall (SAIC/FCRDC) in the Laboratory of Experimental and
Computational Biology, NCI/NIH, and Jai Evans (DECA/CIT/NIH).
Kevin Becker and Chris Cheadle (NIA/NIH),
Breast Cancer Think Tank (NCI),
Damien Chaussabel (NIAID),
Terry Clark and Josef Jurrek (U. Chicago),
Mitko Dimitrov (LECB/NCI),
Jai Evans and Chris Santos (DECA/CIT/NIH),
Troy Moore (Research Genetics),
Peter Munson (CIT/NIH),
Alan Li (SourceForge),
Quang Tri Nguyen (LECB+LCRC/NCI),
John Powell and Esther Asaki (CIT/NIH),
Eric Shen (U. Arizona),
Moshe Shani (Agr. U. Israel),
Richard Simon (NCI/NIH),
Bob Stephens and Gary Smithers (ABCC/FCRDC),
Ron Taylor (U. Colorado),
Mark Vawter (NIDA/NIH, UC-Irvine),
John Weinstein (LMP/NCI), David Kane (SRA/NCI), Ajay (LMP/NCI),
and to many others for useful discussions and suggestions that have
helped improve MAExplorer's capabilities and usability.

References to related exploratory data analysis methods and MAExplorer
This short list of references is limited to a few related to
exploratory data analysis methods for microarrays as they relate to
MAExplorer. It is not meant to be inclusive. More extensive
lists of references to many of the array preparation and data mining
methods can be found in some of these papers and on the Internet.
Appendix A. Short tutorial for MAExplorer
This tutorial is for use with MAExplorer, an exploratory data analysis
facility for microarray DNA databases. It may be used with any
MAExplorer database. As with all tutorials, this one is only a starting
point for getting you into understanding the
data mining analysis environment. Try out new options on your own,
you can't break anything :-).
NOTE: THIS APPENDIX IS BEING REVISED AND EXPANDED...
A.1 Demonstration data
Note that the downloadable MAExplorer stand-alone application includes
a subset of 50 hybridized samples from the MGAP database including a
number of startup files for that data (see the list of startup .mae
files included in the download
installation).
A.2 General instructions:
Throughout this tutorial we refer to condition X and condition Y.
These are different hybridized samples in the particular database you
have loaded. For example, in the MGAP database X might be lactation
and Y might be pregnancy. X and Y 'sets' are multiple samples of these
two conditions.
First, select one of the start up databases.
If the particular samples you want to analyze are not listed in that
example, after it starts you will be able to add samples you do want
and remove samples you don't want, regardless of which example was
initially used, provided the "Samples" database contains additional
hybridized samples.
Second, go to the A.3 instructions for
self-guided tutorial below for instructions on what to do next.
A.3 Self-guided tutorial of MAExplorer - notation and examples
The following is a self-guided tutorial
(you issue the commands) that illustrates some of the data
analysis capabilities. In the following examples, the notation "go
to A:B:C" means go to pull-down menu A, then submenu B, and then make
selection C. "Selecting a gene" from the microarray image or scatter
plot means clicking on a spot in the pseudoarray image or a point in
any of the plots.
A.3.1 Review of types of gene data available in the database
step 1: go to Analysis: GeneClass: All genes
the array shows all genes with white circles.
step 2: go to Analysis: GeneClass: All named genes
the array shows named genes with white circles.
step 3: go to Analysis: GeneClass: ESTs similar to genes
the array shows ESTs similar to named genes with white circles.
step 4: go to Analysis: GeneClass: ESTs
the array shows unknown ESTs with white circles.
step 5: go to Analysis: GeneClass: All genes and ESTs
the array shows all named genes and all ESTs with white circles.
step 6: go to Analysis: GeneClass: Replicate genes
the array shows replicate genes having at least 2 copies in the
array with white circles.
step 7: go to Analysis: GeneClass: Calibration DNA
the array shows calibration DNA (if present) with white circles.
step 8: go to Analysis: GeneClass: Your plates
the array shows clones from user's plates (if present) with white circles.
A.3.1.1 Analysis of the expression of a single known gene
ratio between two conditions X and Y (HP-X, HP-Y)
expression profile of a set of conditions (HP-E) (see
Example A.3.1.7)
step 1: click on the blue "Enter gene name" button to pop up a name
entry window
step 2: start typing gene name into blue text entry window
step 3: once gene names appear, click on gene of choice
step 4: press "Done" button in pop up window
A yellow circle will mark the gene as the "current gene" in the microarray
pseudoarray image (information on the gene is also provided in the status area above the array).
If there are replicate grids in the array (HP), the left and right fields of repeated
genes are denoted by F1 and F2. The mean(HP-X,HP-Y) values and the (HP-X/HP-Y)
values for the specified gene are reported.
step 5: alternatively, click on an array spot of choice to define any gene
in the array as the new current gene
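The two values reported for the current gene can be sketched as follows. This is only an illustration: it assumes that replicate spots (F1, F2) are averaged within each condition before the mean and ratio are computed, which this tutorial does not state, and the function name and intensity values are hypothetical.

```python
def summarize_gene(x_spots, y_spots):
    """Return (mean of HP-X and HP-Y, HP-X/HP-Y ratio) for one gene.

    x_spots, y_spots: replicate spot intensities (e.g. [F1, F2]) of the
    gene in condition X and condition Y. Averaging the replicates first
    is an assumption made for this sketch.
    """
    hp_x = sum(x_spots) / len(x_spots)
    hp_y = sum(y_spots) / len(y_spots)
    mean_xy = (hp_x + hp_y) / 2.0
    # Guard against a zero denominator for spots with no signal.
    ratio = hp_x / hp_y if hp_y != 0 else float("inf")
    return mean_xy, ratio


# Made-up intensities: F1 and F2 for condition X, then for condition Y.
mean_xy, ratio = summarize_gene([1200.0, 1000.0], [500.0, 600.0])
print(mean_xy, ratio)
```

A ratio above 1 means the gene is more strongly expressed in condition X than in condition Y.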
A.3.1.2 Find a subset of genes with a common substring (e.g. *ONCO*)
step 1: click on the blue "Enter gene name" button to pop up a name
entry window
step 2: start typing "*ONCO*" (without the quotes) into blue
text entry window
step 3: once gene names appear, press "Set E.G.L." button in pop up
window
Magenta squares will indicate these genes in the pseudoarray image.
These include the 'onco'genes and the proto-'onco'genes
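The wildcard search can be pictured with a short sketch. It assumes that "*" matches any run of characters and that matching ignores case. The tutorial does not spell out either rule, and the gene names below are made-up examples.

```python
from fnmatch import fnmatchcase


def match_gene_names(pattern, names):
    """Return the names matching a '*' wildcard pattern, ignoring case."""
    pat = pattern.upper()
    return [name for name in names if fnmatchcase(name.upper(), pat)]


# Hypothetical gene names, for illustration only.
genes = ["Proto-oncogene c-Fos", "Oncostatin M", "Beta-casein", "Keratin 18"]
hits = match_gene_names("*ONCO*", genes)
print(hits)
```

The matching names play the role of the genes that are copied to the E.G.L. and marked with magenta squares.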
A.3.1.3 Two conditions - scatter plots:
Create a scatter plot of two hybridized samples where condition X data
is on the X axis and condition Y data on the Y axis.
step 1: go to Analysis: Plot: Scatter plots: HP-X vs. HP-Y
then click on the yellow circle in the scatter plot to get the HP-X/HP-Y
ratio for the gene
step 2: click on any point in the scatter plot
this also alternatively defines any gene in the plot as the new
current gene
step 3: zoom in on a region of the plot using the vertical or
horizontal scroll bars
step 4: click on another point in the scatter plot to get the
HP-X/HP-Y ratio for another gene
step 5: press "Close" button to remove pop up window
A.3.1.4 Scatter plot of Cy3 vs Cy5 or replicate spots (F1 vs F2) of one sample
Create a scatter plot of the Cy3 vs Cy5 channels or of the replicate spot
F1, F2 data if your database contains (Cy3,Cy5) ratio data or
replicate spot fields (F1,F2).
or go to Analysis: Plot: Scatter plots: F1 vs. F2
Then, click on the green circle in the scatter plot to get the Cy3/Cy5
ratio for the gene
or the F1/F2 ratio for replicate spots for that gene
step 2: click on any point in the scatter plot
this also alternatively defines any gene in the plot as the new
current gene
step 3: zoom in on a region of the plot using the vertical or
horizontal scroll bars
step 4: click on another point in the scatter plot to get the
Cy3/Cy5 (or F1/F2) ratio for another gene
step 6': select the samples you wish to swap and press "Done". This
enables you to see the swapped results in the scatter plot
step 7: press "Close" button to remove pop up window
A.3.1.5 Filter by expression ratio between two conditions X and Y
step 1: go to Analysis: Plot: Histograms: HP-X/HP-Y
the histogram shows the ratios
step 2: move pop up plot so you can see it and the array
simultaneously
step 3: choose (click on) a ratio bin
genes filtered by the ratio range of the bin will light up on the array ('+'s)
step 4: click on different bin in the histogram to select
another bin
step 5: click on word "Freq" on left in histogram to remove the
histogram bin filter
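The histogram bin filter can be thought of as grouping genes by ratio and then keeping only one group. The sketch below assumes bins of equal width on a log2 scale. The bin layout MAExplorer actually uses is not given in this tutorial, and the gene names and ratios are made up.

```python
import math
from collections import defaultdict


def bin_genes_by_log_ratio(ratios, bin_width=1.0):
    """Group genes into histogram bins of their log2(HP-X/HP-Y) ratio.

    ratios: dict mapping gene name -> HP-X/HP-Y ratio.
    Returns a dict mapping bin index -> list of gene names, where bin i
    covers log2 ratios in [i * bin_width, (i + 1) * bin_width).
    """
    bins = defaultdict(list)
    for gene, r in ratios.items():
        if r <= 0:
            continue  # a log ratio is undefined for non-positive values
        bins[math.floor(math.log2(r) / bin_width)].append(gene)
    return dict(bins)


# Made-up ratios for four hypothetical genes.
ratios = {"geneA": 0.25, "geneB": 1.0, "geneC": 2.0, "geneD": 3.0}
bins = bin_genes_by_log_ratio(ratios)
selected = bins[1]  # "clicking" the bin that holds ratios from 2 up to 4
print(bins, selected)
```

Selecting a different bin index corresponds to step 4, and ignoring the bins altogether corresponds to removing the filter in step 5.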
You can filter out low intensity genes by using the spot intensity filter described in A.3.1.6.
A.3.1.6 Filter by spot intensity range
step 1: go to Analysis: Filter: Filter by spot intensity [SI1:SI2]
sliders: Use spot intensity [SI1:SI2] sliders
step 2: adjust the intensity lower bound (SI1) to remove low intensity
genes
step 3: when done, remove the 'Filter by intensity sliders' by
toggling it off (redo step 1 to toggle it off)
step 4: repeat steps 1-3, but this time use Filter : Filter
by [I1:I2] sliders :
Use spot intensity (or Cy3/Cy5) [I1:I2] sliders
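The effect of the [SI1:SI2] sliders amounts to a range test on each spot. A minimal sketch, assuming the bounds are inclusive and using made-up intensities:

```python
def filter_by_intensity(intensities, si1, si2):
    """Keep genes whose spot intensity lies in the range [si1, si2].

    intensities: dict mapping gene name -> spot intensity.
    Treating both bounds as inclusive is an assumption of this sketch.
    """
    return {g: v for g, v in intensities.items() if si1 <= v <= si2}


# Made-up intensities for three hypothetical genes.
intensities = {"geneA": 50.0, "geneB": 400.0, "geneC": 5000.0}
kept = filter_by_intensity(intensities, si1=100.0, si2=10000.0)
print(sorted(kept))
```

Raising the lower bound removes the weakest spots first, which is what step 2 demonstrates interactively.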
A.3.1.7 Multiple conditions - expression profile plots of HP-E data:
step 1: go to Analysis: Plot: Expression profile: Display a gene's
expression profile
step 2: after the expression profile window pops up, click on a
gene in array to see its profile
step 3: click on a line in the profile plot to see its intensity
step 4: click on a different gene in the array to see its
profile
step 5: press "Show HPs" button to see the list of samples used
step 6: press "Close" button to remove pop up windows
A.3.2 Changing the normalization between hybridized samples
You may change the normalization method used to scale data between
hybridized samples so they may be compared.
A.3.2.1 Set normalization
step 1: go to Analysis: Normalization: Median intensity
step 2: go to Analysis: Plot: Scatter plots: HP-X vs. HP-Y
to see the effect of normalization on the scatter plot.
Note how outliers appear.
step 3: go to Analysis: Normalization: Zscore of intensity
step 4: go to Analysis: Normalization: Zscore of log intensity, stdDev
step 5: go to Analysis: Normalization: Unnormalized
this does not scale data between samples.
step 6: go to Analysis: Normalization: Median intensity
this leaves the normalization method in Median mode.
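To see why the normalization choice changes the scatter plot, it can help to look at the arithmetic. The functions below are the textbook forms of median scaling and Z-scores. They are an assumption for illustration, not the exact formulas used by MAExplorer, and the intensities are made up.

```python
import math
import statistics


def normalize_median(values):
    """Scale intensities so the sample median becomes 1.0."""
    med = statistics.median(values)
    return [v / med for v in values]


def normalize_zscore(values):
    """Express each intensity in standard deviations from the sample mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


def normalize_zscore_log(values):
    """Z-score of the log intensities (values must be positive)."""
    return normalize_zscore([math.log(v) for v in values])


# Made-up intensities of one hybridized sample.
sample = [100.0, 200.0, 400.0, 800.0, 1600.0]
by_median = normalize_median(sample)
z_log = normalize_zscore_log(sample)
print(by_median)
print([round(z, 3) for z in z_log])
```

Because each sample is rescaled by its own statistics, two samples hybridized at different overall intensities end up on a comparable scale.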
A.3.3 Analysis of the expression profiles of gene classes
You may restrict the set of genes by Gene Class. Several built
in gene classes are defined. You may also set up additional ones and
filter by those (not covered in this short tutorial).
A.3.3.1 Filter by gene class membership
step 1: go to Analysis: GeneClass: All known genes
the array only shows named genes (additional gene subclasses
are being added)
step 2: go to Analysis: Plot: Scatter plots: HP-X vs. HP-Y
to see the two condition expression of just these genes
step 3: go to Analysis: Plot: Expression profiles: Display Filtered
genes expression profiles
to see the multiple condition expression of just these genes. This may take a
while if there are many genes
step 4: you can click on a line in any of the plots to see the
samples' intensity value for that gene
step 5: when done, press "Close" button in all pop up plot windows
A.3.3.2 Gene Reports
step 1: go to Analysis: Report: Gene reports: Filtered genes: Genes passing Filter
Clicking on a blue entry will bring up the I.M.A.G.E., dbEST, UniGene, GenBank,
LocusLink, or mAdb Clone database entry in a pop up Web page
step 2: press "Close" button in report, and close this pop up Web page
step 3: go to Analysis: Report: Table format: Tab-delimited
to enable creating Excel-compatible reports
A.3.3.3 Exporting Gene Reports to Excel
step 1: repeat step 1 of the Gene Reports example (A.3.3.2), this time
to make a text-formatted report
step 2: cut the text from this window and paste it into an Excel window.
This is useful for exporting data if you are on a Windows PC
step 3: go to Analysis: Filter: all genes
to restore the Filter from all named genes to all of the genes
step 4: go to Analysis: Report: Table format: Spreadsheet
step 5: press "Close" button in report
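A tab-delimited report is plain text with one gene per line and a tab between columns, which is why it pastes cleanly into a spreadsheet. The sketch below shows that layout. The column names and values are hypothetical and do not reproduce the actual MAExplorer report columns.

```python
import csv
import io


def gene_report_tsv(rows, columns):
    """Render a list of row dicts as tab-delimited text with a header line."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow(columns)
    for row in rows:
        writer.writerow([row[c] for c in columns])
    return buf.getvalue()


# Hypothetical report content, for illustration only.
columns = ["Gene", "HP-X", "HP-Y"]
rows = [{"Gene": "Beta-casein", "HP-X": 1200.0, "HP-Y": 300.0}]
report = gene_report_tsv(rows, columns)
print(report)
```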
A.3.4 Analysis of the expression profile of multiple hand picked genes
Users can manually define a set of genes which are kept in the
Edited Gene List (E.G.L.). Various operations can then use the
EGL to restrict the set of data being analyzed.
A.3.4.1 Define a list of edited genes, then plot all their expression
profiles at one time
step 1: go to View: Show 'Edited Gene List'
this turns on the 'Edited Gene List' magenta square box overlays
step 2: hold CONTROL key and click on genes in array to add a gene
step 2': hold SHIFT key and click on genes in array to delete a gene.
This lets you edit a list of genes. It also works when clicking in a scatter plot
step 3: go to Analysis: Plot: Scatter plots: HP-X vs. HP-Y
to see the Edited Gene List in the scatter plot
step 4: try defining (or removing) E.G.L. genes in the scatter
plot by holding the CONTROL (or SHIFT) key when clicking on points
in the scatter plot
A.3.4.2 Filtering by edited gene list
step 1: go to Analysis: Filter: Filter by 'Edited Gene List'
step 2: go to Analysis: Plot: Expression profiles: Display Filtered genes expression profiles
scroll through the plots to see all of the profiles
step 3: go to Analysis: Filter: Filter by 'edited gene list'
this turns off the 'edited gene list' filter
step 4: press "Close" button in expression profiles window
A.3.4.3 Report of edited gene list
step 1: go to Analysis: Report: Gene report: genes in 'edited gene list'
reports edited genes
step 2: press "Close" button in report
step 3: go to Analysis: Filter: Filter by 'edited gene list'
this turns off the 'edited gene list' filter
step 4: go to View: Show 'edited gene list'
this turns off the 'edited gene list' squares overlay
A.3.5 Identify a cluster of genes with similar expression
profile to the current selected gene
step 1: go to GeneClasses: All named genes and ESTs
step 2: go to Analysis: Plot: Cluster plots: Cluster genes with expression profiles similar to current gene
this will pop up a cluster summary and cluster distance slider control window.
Move the summary and slider windows so you can see all 3 windows. The size of
the cyan boxes on similar genes in the pseudoarray is proportional to the similarity.
Adjust the cluster distance slider to smaller values and note how
the number of genes clustered decreases.
It should be set for a reasonable number considering the material you
are analyzing.
step 3: select (click on) a new current gene
the genes which belong to that cluster are labeled in the array with cyan boxes
and are defined as the "current cluster". The current gene you click on has
a green circle around it
step 4: press "Cluster Report" button in the cluster summary
this pops up a Gene Report for the clustered genes
step 5: press "Close" button in the report
step 6: press "EP plot" button in the cluster summary
this pops up a scrollable list of expression profile plots sorted by similarity
to the current selected gene.
step 7: press "Close" button in the report
step 8: press "Close" button in the cluster summary
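The role of the cluster distance slider can be illustrated with a small sketch: genes are kept when the distance between their expression profile and that of the current gene is at or below a threshold. Euclidean distance is assumed here. The tutorial does not say which metric is used, and the profiles are made up.

```python
import math


def profile_distance(p, q):
    """Euclidean distance between two expression profiles of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def similar_genes(profiles, current, threshold):
    """Return genes within `threshold` of the current gene, nearest first."""
    ref = profiles[current]
    scored = [
        (profile_distance(ref, prof), gene)
        for gene, prof in profiles.items()
        if gene != current
    ]
    return [gene for d, gene in sorted(scored) if d <= threshold]


# Made-up expression profiles over three conditions.
profiles = {
    "geneA": [1.0, 2.0, 3.0],
    "geneB": [1.0, 2.0, 4.0],
    "geneC": [5.0, 5.0, 5.0],
    "geneD": [1.0, 3.0, 3.0],
}
cluster = similar_genes(profiles, "geneA", threshold=2.0)
print(cluster)
```

Lowering the threshold shrinks the list, which mirrors what happens when the slider is moved to smaller values in step 2.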
A.3.6 Identify clusters of genes with similar expression under
various conditions using data mining filters
step 1: go to GeneClasses: ESTs similar to genes
step 2: go to Analysis: Plot: Cluster plots: K-means clustering of
gene expression profiles
this will pop up a cluster summary and slider control window. Move the
summary and slider windows so you can see all 3 windows. The size of the
magenta circles in the array is proportional to # genes/cluster
step 3: select (click on) a new current gene
the genes which belong to that cluster are labeled in the array with tiny green
numbers and are defined as the "current cluster". The current gene you click
on has a green circle around it
step 4: go to View: Show 'edited gene list'
genes in the current cluster were also copied to the edited gene
list
step 5: go to Analysis: Report: Gene report: genes in 'edited gene list'
reports genes in the current cluster
step 6: press "Close" button in report
step 7: go to View: Show 'edited gene list'
this turns off the 'edited gene list' squares overlay
A.3.6.1 Varying the number of clusters
step 1: vary the "# of clusters" slider value from 6 to 10, then 20
note the number of clusters changes and the gene cluster composition also
changes
A.3.6.2 Defining a new cluster "seed" to recluster the genes
step 1: select a new current gene in array and press the
"Recompute clusters" button
this recomputes the clusters using the current gene as the new seed gene
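The interplay between the "# of clusters" value and the seed gene can be sketched with a small K-means routine. This is a generic illustration, not MAExplorer's implementation: it assumes Euclidean distance, takes the seed gene as the first cluster center, and picks each further initial center as the gene farthest from the centers chosen so far. The profiles are made up.

```python
import math


def dist(p, q):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def kmeans(profiles, k, seed_gene, iterations=20):
    """Cluster profiles into k groups; returns gene -> cluster index.

    The seed gene's profile is the first center (index 0). Remaining
    initial centers are chosen farthest-first, an assumption of this sketch.
    """
    genes = sorted(profiles)
    centers = [list(profiles[seed_gene])]
    while len(centers) < k:
        far = max(genes, key=lambda g: min(dist(profiles[g], c) for c in centers))
        centers.append(list(profiles[far]))
    assignment = {}
    for _ in range(iterations):
        # Assign every gene to its nearest center.
        new_assignment = {
            g: min(range(k), key=lambda i: dist(profiles[g], centers[i]))
            for g in genes
        }
        if new_assignment == assignment:
            break  # converged
        assignment = new_assignment
        # Move each center to the mean profile of its members.
        for i in range(k):
            members = [profiles[g] for g in genes if assignment[g] == i]
            if members:
                centers[i] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


# Made-up two-condition profiles for five hypothetical genes.
profiles = {
    "g1": [1.0, 1.0], "g2": [1.2, 0.9],
    "g3": [5.0, 5.0], "g4": [5.1, 4.8],
    "g5": [9.0, 0.0],
}
clusters = kmeans(profiles, k=3, seed_gene="g1")
print(clusters)
```

Calling the function again with a different `k` or `seed_gene` corresponds to moving the slider or pressing "Recompute clusters".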
A.3.6.3 Cluster expression profile plots
step 1: press "EP plot" button and scroll down the list after they appear
the primary nodes for each cluster are indicated with red labels
in the set of profiles, and the other genes are labeled with their
cluster number
step 2: press "Mean EP plot" button and scroll down the list after they appear
these are the mean expression plots of the clusters of the primary nodes.
A.3.6.4 Report of all clusters
step 1: press the "Cluster-Report" button to get a sorted cluster list
scroll the spreadsheet to the right to see the cluster statistics
step 2: press the "Mn-Cluster-Report" button to get a sorted cluster list
scroll the spreadsheet to the right to see the mean expression profiles
step 3: press "Close" button in pop up windows
A.3.6.5 Current cluster in scatter plot
step 1: go to Analysis: Plot: Scatter plots: HP-X vs. HP-Y
step 2: move the plot so you can see both scatter plot and array
step 3: click on a gene in the cluster or on spots in the scatter plot
note that the green cluster numbers are drawn in the scatter plot
step 4: go to Edit: Sets of genes : Save 'Edited gene list' as gene sets
this will pop up a dialog box requesting "Enter new gene set name"
step 5: type "Genes in current cluster class"
this will save the current cluster in a gene set.
This gene set will be used in the next example
step 6: press "Close" button in pop up windows
step 7: (optionally) investigate hierarchical clustering with clustergrams and
dendrograms by going to Plot : Cluster Plots : Hierarchical clustering plot for HP-E
A.3.7 User Gene Set operations
You may manipulate sets of genes. Some of these are predefined for you
by the database (eg. All named genes, ESTs, etc.). Others are defined
by particular operations (E.G.L., clustering, etc.), and lastly others
may be defined by you using logical operations on these sets (OR, AND,
DIFFERENCE).
A.3.7.1 List of the current gene sets
step 1: go to Edit: Sets of genes : List saved gene sets
this lists the current list of gene sets
step 2: change the E.G.L. set of genes and note how the # of E.G.L. genes
changes in the list.
You can add (remove) genes to the E.G.L. by clicking on a spot in the array while the
CONTROL (SHIFT) key is held down.
A.3.7.2 Filter by user defined gene set
step 1: go to Edit: Sets of genes : Set 'User Filter Gene Set' (for Filter)
this will request a gene set to use with the Filter in a pop up dialog box.
Enter gene set # for the set for "Genes in current cluster class" which you saved
in the previous example.
then press "Ok" in the dialog box.
step 2: go to Analysis: GeneClass: All genes and ESTs
this resets the filter to look at all genes and ESTs
step 3: go to Analysis: Filter: Filter by 'User Gene Set' membership
this restricts the genes to the saved current cluster in the previous
example
A.3.7.3 Gene set operations
step 1: go to Edit: Sets of genes : OR (Union) of 2 gene sets
this will request 3 gene set names in a pop up dialog box.
Enter set # for (All known genes) for the 1st gene set name,
Enter set # for (Genes in current cluster class) for the 2nd gene set name,
Enter "Union of known genes and genes in current cluster" for new gene set name.
then press "Ok" in the dialog box.
this computes the union of the two gene sets into a new gene set
step 2: go to Edit: Sets of genes : Set 'User Filter Gene Set'
this will reset the 'User Filter Gene Set' for the Filter in a pop up dialog box.
Enter the set number or the beginning of the set name 'Union' that is the
set for "Union of known genes and genes in current cluster" just saved.
step 3: try saving other Filtered genes sets and doing other
gene set operations.
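The logical operations on gene sets (OR, AND, DIFFERENCE) behave like ordinary set operations. A brief sketch follows, in which the set contents are made up and the set names follow the examples above:

```python
# Named gene sets, keyed by the names used in the examples above.
# The gene identifiers are hypothetical.
gene_sets = {
    "All known genes": {"geneA", "geneB", "geneC"},
    "Genes in current cluster class": {"geneC", "geneD"},
}

known = gene_sets["All known genes"]
cluster = gene_sets["Genes in current cluster class"]

# OR (Union): genes in either set, saved under a new gene set name.
gene_sets["Union of known genes and genes in current cluster"] = known | cluster

# AND (Intersection): genes in both sets.
both = known & cluster

# DIFFERENCE: genes in the first set but not in the second.
only_known = known - cluster

print(sorted(gene_sets["Union of known genes and genes in current cluster"]))
print(sorted(both), sorted(only_known))
```

Filtering by the saved union then restricts the analysis to any gene that is either known or in the current cluster.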
A.4 Additional tutorials
If you wish to investigate MAExplorer in more detail, try some of the
suggested examples in the advanced
tutorial (Appendix B) in the reference manual.


Appendix B. Advanced tutorial for MAExplorer
There are a number of things you may do in this facility. We wrote
this advanced tutorial to help demonstrate some of its capabilities. A
short tutorial (Appendix A)
is also available and we recommend doing it before attempting the
advanced tutorial. Sources of startup data to use with the tutorials
are listed in the short tutorial. As with all tutorials, they are only
starting points for getting you into the analysis environment - try
out new options on your own, you can't break anything :-).
NOTE: THIS APPENDIX IS BEING REVISED...
When first started, MAExplorer loads some initial data it needs as well as the
particular hybridized samples you specified. After MAExplorer starts,
it displays "Ready - click on a gene to query
database" and the menus become active. Here are some things to
try.