MAExplorer - Microarray Exploratory Data Analysis

Reference Manual++ - MAExplorer Microarray
Exploratory Data Analysis

The MicroArray Explorer Java tool for data mining microarrays **Legend
(Click on figures to show higher resolution versions)

 


*** DRAFT ***
$Date: 2003/11/23 10:29 $ MicroArray Explorer-V0.96.34.01 "Millenium" edition

Peter F. Lemkin
Laboratory of Experimental and Computational Biology
Center for Cancer Research, National Cancer Institute - Frederick, Frederick, MD 21702


SourceForge     http://maexplorer.sourceforge.net/
  or
LECB/NCI       http://www.lecb.ncifcrf.gov/MAExplorer

++ Note: This hypertext manual is divided into chapter and appendix Web pages. These may be printed individually from your Web browser by (1) clicking in the text window to be printed, and (2) using "Print Frame" in Netscape or "Print" in Internet Explorer. Some of the chapters (e.g., Chapter 2) have many images. The entire manual may be downloaded at one time with low-resolution figures and is suitable for printing in the Web browser. You may also download an Adobe Acrobat PDF version of the entire manual with the lower-resolution figures (~5Mb). The Unix script for creating the full reference manual from the individual HTML pages is CreateMaeFullRefManual.do.

MAExplorer is a Java-based bioinformatics exploratory data-analysis and data-mining program for analyzing sets of quantitative spotted cDNA or oligonucleotide microarray data (Lemkin et al., 2000). See (Schulze, 2001) for a review of microarray technology.

Prior to its release on SourceForge, MAExplorer was developed by Dr. Peter Lemkin (LECB/NCI-Frederick) with help from Gregory Thornwall (SAIC) and Jai Evans (DECA/CIT, NIH). It was initially created for analyzing 33P-labeled membrane array data from mouse mammary tissue for the Mammary Genome Anatomy Project (MGAP) http://mammary.nih.gov/ with the help of many researchers in the Laboratory of Genetics and Physiology, NIDDK under Dr. Lothar Hennighausen. Since the early work with MGAP, it has been extended to work with other types of cDNA and oligo arrays and various nucleotide labeling methods. These include spotted Cy3/Cy5 glass slides, spotted membranes, non-geometric chip data, and other chip supports with different geometries and numbers of duplicate spots per gene, covering both spotted clones and oligo chip data such as Affymetrix. A wizard tool called Cvt2Mae was developed to make it easier for other researchers to convert their data to the format required by MAExplorer. Cvt2Mae was developed by Peter Lemkin, Greg Thornwall and Bob Stephens (ABCC/SAIC). You may extend the set of built-in analysis methods by writing Java plugins called MAEPlugins.

This document describes MAExplorer's functionality, provides tutorials, and contains documentation for using it with various types of arrays.

With this program, you may: 1) analyze expression of individual genes; 2) analyze expression of gene families and clusters; 3) compare expression patterns for multiple hybridized samples.

MAExplorer is written in Java and runs as a stand-alone application that you download to your computer. Although MAExplorer began as a Java applet for use with Web browsers for the MGAP Web database ( http://www.lecb.ncifcrf.gov/mae ), we have deprecated its use as an applet because of many problems with running large Java applets in some Web browsers. Instead, we recommend downloading MAExplorer, which includes the public MGAP array data as a demonstration data set, and then running it on this data after you have installed it on your computer.

Notation: MAExplorer uses the notation that the sample probe total mRNA is labeled and then hybridized against the known cDNA targets tethered to the microarray. Because of this notation, we refer to a hybridized sample as an HP. An alternative notation that reverses these terms is also commonly used (see "Chipping Forecast", Nature Genetics supplement, Jan, 1999, pg 1). Also, because arrays may be constructed from either spotted clones or oligonucleotides, we refer to hybridized chip DNA from any of these sources generically as "genes".

Throughout this document we use the abbreviations HP for hybridized sample and GC for gene class. These and other terms are explained in the Glossary and Index. There are a number of figures and tables illustrating various features of MAExplorer throughout this manual. Figures are presented at low resolution; click on a figure to view its high-resolution version.

NOTES: Because MAExplorer is under development, there may be occasional problems with some of its functionality. There may also be some problems (mostly bad HTML links) with migrating from LECB/NCI to the SourceForge Web site. Some operations that are under development are labeled with "[Future]" in this manual. We welcome your suggestions for improvements as well as reports of problems that you encounter. Occasionally the manual or the figures in the manual may not be quite in phase with the software. Please notify us of problems or suggestions by E-mail so we can try to fix or implement them. If you are a bioinformatics developer and would be interested in working with the MAExplorer project, consider joining the MAExplorer development team on SourceForge.net.


**Icon Legend

Data from a 38 sample subset of hybridized samples from the MGAP mouse microarray database. This screen illustrates a synthetic pseudoarray image showing the ratios of duplicated grids of genes comparing day 13 pregnancy in C57B6 mouse (sample HP-X 'set') with Lactation day 1 (sample HP-Y 'set'). The color scale of the spots is indicated on the left as is the current data normalization mode (median). Genes with white circles are named genes and were selected by the data filter. A scatter plot of this data is shown on the right with genes passing the data filter indicated as red + and those not passing the filter (i.e. ESTs, calibration DNA, user's genes) shown as gray + symbols. A single gene was selected by clicking on it in the array image and has a yellow circle (grid 1-D) and a corresponding green circle in the scatter plot. Information on that gene is indicated above the array and at the top of the scatter plot. MAExplorer can also be used to view mean data from sets of samples e.g. Day 13 pregnancies from C57B6 (3 HP-X samples) vs. Day 1 Lactation (4 HP-Y samples) at low or high resolution.


Table of Contents

Overview
Menu summary
Quick start

1. Introduction
1.1 Microarrays and notation used with MAExplorer
1.2 Microarray image quantification
   1.2.1 Ratio and Zscore comparison of data from different hybridized samples
1.3 Microarray image and plot display
1.4 Exploratory data analysis - overview
   1.4.1 Saving the state of a data-mining session in stand-alone mode
   1.4.2 Logging messages and command history
1.5 Quick start - demonstration of MAExplorer
1.6 Tutorials for using MAExplorer

2. MAExplorer menus
2.1 File menu
   2.1.1 Databases menu
   2.1.2 Exploratory state menu
   2.1.3 Groupware facility for sharing user states menu
2.2 Samples menu
   2.2.1 Selecting sample HP with chooser or menu sample lists
   2.2.2 Swapping selected samples' (Cy3,Cy5) channels in ratio data dye-swap experiments
   2.2.3 Viewing sample HP-X, HP-Y, and HP-E partitions
   2.2.4 Defining sample condition 'class' names
   2.2.5 Toggling between single HP-X (-Y) samples and HP-X (-Y) sets
2.3 Edit menu
   2.3.1 User edited gene list - the 'Edited Gene List' menu
   2.3.2 Sets of genes menu
   2.3.3 Sets of Sample Conditions menu
   2.3.4 Setting user preferences menu
2.4 Analysis
   2.4.1 GeneClass menu
      2.4.1.1 GeneClass ontology subsets
      2.4.1.2 Simulating Gene Class ontologies using Gene Set operations
   2.4.2 Normalization menu
      2.4.2.1 Intensity background correction
      2.4.2.2 Normalization between microarrays to allow comparison
      2.4.2.3 Using different normalizations to 'see' different data views
   2.4.3 Filter menu
      2.4.3.1 Data filtering using multiple gene data filters
   2.4.4 Plot menu
      2.4.4.1 Show microarray pseudoarray images menu
      2.4.4.2 Scatter plots menu
      2.4.4.3 Histogram plots menu
      2.4.4.4 Expression profile plots menu
   2.4.5 Cluster menu
      2.4.5.1 Cluster genes with expression profiles similar to current gene
      2.4.5.2 Cluster counts of similar filtered genes by expression profiles
      2.4.5.3 K-means clustering of gene expression profiles for filtered genes
      2.4.5.4 Hierarchical clustering of expression profiles
   2.4.6 Report menu
      2.4.6.1 Array report menu - hybridized samples global data
      2.4.6.2 Gene reports menu
      2.4.6.3 Table format menu
      2.4.6.4 Table font size menu
2.5 View menu
   2.5.1 Logging MAExplorer messages
   2.5.2 Logging command history
2.6 Plugins menu
2.7 Help menu

3. Exploratory Data Analysis - Data Mining
3.1 Analysis objectives
   3.1.1 Some experimental design issues of microarray experiments
   3.1.2 Design philosophy of MAExplorer methodology
   3.1.3 Evolution of MAExplorer from earlier proteomic data mining systems
   3.1.4 Concepts used in data mining with MAExplorer
3.2 Steps in an analysis
   3.2.1 Definition of expression profile
   3.2.2 Clustering Methods
      3.2.2.1 Clustering similar genes
      3.2.2.2 K-means clustering
      3.2.2.3 Hierarchical clustering
3.3 Display gene intensity and identification data measurements
3.4 Selecting subsets of genes using the data Filter
3.5 Selecting subsets of hybridized sample conditions
3.6 Setting threshold values using the state-scroller sliders
3.7 Exporting report and plot data

4. Status and Bugs of MAExplorer
4.1 Known Bugs in MAExplorer
   4.1.1 Browser Applet Bugs
   4.1.2 Downloading and Installer Bugs
   4.1.3 Computation speed and display Bugs
   4.1.4 User state and login Status
   4.1.5 Data file names Bug
   4.1.6 Gene Sets Bugs
   4.1.7 Clustering Bugs
   4.1.8 Expression profile Bugs
   4.1.9 Data conversion problems
   4.1.10 Java Plugins bugs
4.2 Revision notes
4.3 Web Browser problems when running MAExplorer as an applet
4.4 Handling fatal error reporting (i.e. DRYROT errors)

Release archive

Acknowledgments

References to related exploratory data analysis methods
R.1 Nucleic Acids Res. paper (PDF)
R.2 Overview (PDF)
R.3 Examples (PDF)
R.4 Using mAdb data with MAExplorer (PDF)
R.5 Introduction to Data Mining with MAExplorer(PDF) or (PPT)
R.6 Using Cvt2Mae to convert array data for use with MAExplorer.(PDF)
R.7 Statistics in Functional Genomics workshop paper (PDF)
R.8 Software design of the MAExplorer data mining tool (PDF) or (PPT)

Newsletters

Appendices
A. Short tutorial for MAExplorer
A.1 Demonstration data
A.2 General instructions
A.3 Self-guided tutorial of MAExplorer - notation and examples

B. Advanced tutorial

C. Use of MAExplorer with user's microarray data
C.1 Creating quantified spot data files from hybridized sample arrays
C.2 Table of samples that can be loaded into MAExplorer
C.3 Quantified spot data file format
C.4 GIPO table database file format
C.5 Configuring MAExplorer for use with other arrays
C.6 Using the Cvt2Mae 'wizard' tool to convert array data for use with MAExplorer

D. Use of MAExplorer as a stand-alone application
D.1 Installing MAExplorer as stand-alone application
D.2 Downloading MAExplorer for stand-alone use with other arrays
D.3 Starting MAExplorer by clicking on a .mae file
D.4 The data file format for .mae files
D.5 Using MAExplorer as an Applet on your computer
D.6 List of startup .mae files included in the download installation

E. Design issues
E.1 Internal data structures design to facilitate direct manipulation
E.2 Approaches to data mining: client-centric and server-centric models
E.3 Conversion of microarray data files to MAExplorer format using Cvt2Mae
E.4 Extending MAExplorer functionality using Java Plugins
E.5 Web database server design

Download Installers
Installer information

MAExplorer Plugins

Cvt2Mae wizard

MAExplorer Open Source
Download source
javadocs for source
MPL1.1 Public License
Legal

List of Figures
List of Tables
Glossary of terms used in MAExplorer
Index

Help desk


MAExplorer - Overview

MAExplorer is a bioinformatics microarray data mining Java application that may help in the discovery of genes regulated in cancer and other diseases. MAExplorer is generally run as a stand-alone application on a local computer. By running as a local application, it is able to access your local disk to save the state of your data mining session as well as plots and reports. Using the previously saved data mining state, you can continue a data-mining session at a later date after exiting the program.

Recommended Hardware

Because data mining is a computationally and graphically intensive activity, a reasonable level of computational resources is required for adequate response. The same Java program runs on a variety of operating systems including Windows 95/98/Me/NT/2000, Macintosh OS8/9/X, Solaris, Linux, etc., so the choice of computer is not that critical. We recommend the following hardware:

Addition of user defined analysis methods using Java Plugins

We have provided the ability for users to add their own Java Plugin Extensions to MAExplorer. These extend the capabilities of the core MAExplorer program to other more sophisticated analysis methods created by users and allow interaction with specialized genomic servers. This is described in Appendix E, Section 2.6, and in the MAExplorer Plugins Web page.


1. Introduction

This hyperlinked manual provides a detailed description of the MAExplorer conventions (Section 1) and operation (Section 2). The latter contains many figures of computer screens showing the operations described in the Section. Section 3 discusses typical scenarios in using MAExplorer for data-mining microarrays and contains a brief introduction to the process of data mining. Section 4 lists currently known bugs and the revision history. Appendix A is a short tutorial. Appendix B is a more advanced tutorial. Appendix C describes the files required by MAExplorer and how they may be created for using MAExplorer with other array data. It also describes the data conversion tool Cvt2Mae. Appendix D covers downloading, installing and running MAExplorer as a stand-alone Java application on a local computer. Appendix E discusses design issues for the MAExplorer Java program and supporting Web servers. Users may create new analytic methods and add them to MAExplorer as Java extensions called MAExplorer Plugins. There is a glossary of terms used in MAExplorer. There is also a List of Figures, a List of Tables, and an Index to help find material of interest.

MAExplorer is normally used as a stand-alone program

Figure 1 gives an overview of the system. Note that MAExplorer does not perform spot quantification from raw scanned images - it is used for the subsequent data mining analysis of quantified spot data. Figures 1.1.1 through 1.1.3 describe this in more detail.

Overview of MAExplorer showing its use as either a stand-alone application or Web applet

Figure 1. Overview of MAExplorer exploratory data analysis system. Initial data preparation steps are performed prior to analysis by MAExplorer and are indicated by cyan italics at the top of the figure. The primary data consists of quantified microarray image data as well as corresponding qualitative clone ID, gene-in-plate-order (GIPO or print-table, etc.), gene name, hypertext base references and related information. After the microarrays are hybridized, they are scanned and spots quantified using image spot quantification programs. These lists are then saved for each array in a tab-delimited file. Microarray image quantification may be performed by various software such as Axon's GenePix(TM), Scanalyze, Molecular Dynamics ImageQuant(TM), Research Genetics' Pathways(TM), etc. When used as a stand-alone application, data may be saved on the local computer for local off-line use, and direct access to other Internet genomic databases may be made without using a proxy server.

[DEPRECATED: When used as an applet, the auxiliary databases and the MAExplorer Jar files are copied to the Web server or local file system (in the case of the stand-alone version) where they are then available to be downloaded by users. When a user invokes a Web page containing the Java applet, it first downloads the applet, which then downloads auxiliary databases including a configuration file that describes the array data. It then downloads the subset of quantified microarray spot data files requested for the set of hybridized samples being investigated. Additional samples may be downloaded at any time. When the user selects an operation that requires access to Web databases not residing on the MAExplorer Web server, implicit Java security restrictions prevent the applet from going directly to these other Web servers. Instead, it asks the MAExplorer proxy server to request the data from the foreign Web server, which is then returned to the user's Web browser. ]




Figure 1.1.1 Overview of MAExplorer exploratory data analysis system. MAExplorer is used as a stand-alone application on local data. [Its use as a Web browser applet has been DEPRECATED. In the case of the applet, it may only access quantified array data from the Web server that launched the applet.]




Figure 1.1.2 Overview of data preparation for quantified spot data used by MAExplorer. MAExplorer handles quantified spot data as shown in this figure. Arrays hybridized against labeled samples are scanned, and the spots are quantified into spot data files. Quantified spot data is represented as tab-delimited data with one spot per row. Each spot is identified in this file by its grid coordinates (grid, grid row, grid column) with image (X,Y) coordinates being optional. Quantified spot data includes the raw spot intensity for each channel (in the case of multiple channels such as Cy3, Cy5, etc.). If the original data has background spot intensity values, then these may be included as well - otherwise no background data will be available for background correction. The spot data is discussed in more detail in Section 1.1, Appendix C.1, and Appendix C.3.
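
The one-spot-per-row layout described in this caption can be illustrated with a small Java sketch. The class name SpotRow and the column order used here (grid, grid row, grid column, raw intensity, optional background) are assumptions made only for illustration; the actual quantified spot data file format is the one defined in Appendix C.3.

```java
// Hypothetical sketch: parse one tab-delimited row of quantified spot data.
// The column order is an assumption for illustration, not the MAExplorer file format.
final class SpotRow {
    final int grid;
    final int gridRow;
    final int gridCol;
    final double rawIntensity;
    final Double background; // null when the row carries no background value

    SpotRow(int grid, int gridRow, int gridCol, double rawIntensity, Double background) {
        this.grid = grid;
        this.gridRow = gridRow;
        this.gridCol = gridCol;
        this.rawIntensity = rawIntensity;
        this.background = background;
    }

    // Split on tabs, keeping trailing empty fields so an empty background column is detected.
    static SpotRow parse(String line) {
        String[] f = line.split("\t", -1);
        if (f.length < 4) {
            throw new IllegalArgumentException("expected at least 4 tab-delimited fields");
        }
        Double bkg = (f.length > 4 && !f[4].trim().isEmpty())
                ? Double.valueOf(f[4].trim())
                : null;
        return new SpotRow(
                Integer.parseInt(f[0].trim()),
                Integer.parseInt(f[1].trim()),
                Integer.parseInt(f[2].trim()),
                Double.parseDouble(f[3].trim()),
                bkg);
    }

    // True only if a background value was present, so background correction is possible.
    boolean hasBackground() {
        return background != null;
    }
}
```

Keeping the background as a nullable value mirrors the caption's point that background correction is only possible when the original data supplied background intensities.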




Figure 1.1.3 Overview of running MAExplorer as a stand-alone application. The preferred way of running MAExplorer is as a stand-alone application. There are distinct advantages in running MAExplorer as an application: data and the exploration state may be saved on the user's local computer, and direct access to genomic servers is easier (no proxy server required - see Figure 1.4). MAExplorer plugin extensions (MAEPlugins) may only be used with the stand-alone version. Since MAExplorer is packaged for download for a variety of operating systems, this method is not difficult to set up, and the MAEPlugins should run on a variety of operating systems.




Figure 1.1.4 [DEPRECATED] Overview of running MAExplorer as a Web browser applet. An alternative way of running MAExplorer on existing databases is as a Web-browser applet. The advantage of this method is that no software installation is required on the user's computer. However, the user may not save data and the exploration state on their local computer. Furthermore, direct access to genomic servers requires a proxy server. MAExplorer plugin extensions (MAEPlugins) may not be used with the applet version. The Mammary Genome Anatomy Program (MGAP) originally used the MAExplorer applet.



Example of a MAExplorer database - http://www.lecb.ncifcrf.gov/mae - the public MGAP DB

The Mammary Genome Anatomy Program (MGAP) microarrays of cDNA clones from mouse mammary tissue (collaboration with Research Genetics) were hybridized with 33P radio-labeled samples. These were then used to charge fluorescing plates. See the MGAP site for more documentation on the database and preparation procedures. The hybridized arrays are scanned on a phospho-imager scanner at high resolution. Spot data was quantified from these images using the Research Genetics' "Pathways 2.01" program which generated tab-delimited data files. This data also includes the microarray grid point locations (field, grid, grid row, grid col) from the associated microarray description data files (grid-in-plate-order data). When you download MAExplorer, you will also download the public MGAP dataset.

1.1 Microarrays and notation used with MAExplorer

In general, microarrays are hybridized using cDNA samples derived from mRNA labeled with either radio-label, biotin, fluorescent dyes, or other methods (see (Schulze, 2001) for a review of the technology). MAExplorer may be used to construct databases using single-labeled sample intensity (e.g., Affymetrix, 33P radio-labeled, etc.) and double-labeled ratio fluorescent (i.e. Cy3/Cy5) data arrays with different GIPO geometries.

Definition of "Condition list of samples"

Samples are organized into Condition Lists of samples (generally replicate samples). These may be used in various statistical and clustering tests. There are three built-in lists of samples called the HP-X 'set', the HP-Y 'set' and the HP-E list. The X and Y sets are used in various two-condition tests such as the t-Test between the X and Y sets (Section 2.4.3). The HP-E list is an ordered expression list of samples used in clustering and in displaying expression profiles. You may interactively define new named condition lists or edit existing ones using a graphical wizard (Section 2.6), manipulate them, and assign them to the HP-X 'set', HP-Y 'set' and HP-E list. Some examples of condition lists might be (assuming you have the data available in your database):
  Virgin     = ( V.1, V.2, V.3 )
  Pregnancy  = ( P13.1, P13.2, P13.3 )
  Lactation  = ( L3.1, L3.2, L3.3 )
  Involution = ( I4.1, I4.2, I4.3 )

Definition of "Ordered Condition list" of multiple condition lists

We further extend this paradigm by defining a meta-data structure called the "Ordered Condition List" or OCL. This is a list of multiple conditions that you have previously defined. The OCL may be sorted if you want and if the data lends itself to sorting. E.g., a time series of conditions lends itself to sorting - different types of diagnoses may not. The OCL may be used in various statistical tests (e.g., the F-test applied to the current OCL - see Section 2.4.3). You may interactively define new named Ordered Condition Lists or edit existing ones using a graphical wizard (Section 2.7). An example of an ordered condition list might be:
  Parturition = ( Virgin, Pregnancy, Lactation, Involution )
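
The relationship between condition lists and an ordered condition list can be sketched as a pair of simple Java collections. This is only an illustration of the concept: the class ConditionLists and its methods are hypothetical and are not MAExplorer's internal data structures.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: named condition lists and an ordered condition list (OCL).
// Illustrative only; not MAExplorer's internal representation.
final class ConditionLists {
    // condition name -> replicate sample names, kept in definition order
    private final Map<String, List<String>> conditions = new LinkedHashMap<>();

    // Define (or redefine) a named condition list of samples.
    void define(String name, String... samples) {
        conditions.put(name, new ArrayList<>(Arrays.asList(samples)));
    }

    List<String> samplesOf(String name) {
        return conditions.get(name);
    }

    // An OCL is an ordered list of the names of previously defined conditions.
    List<String> orderedConditionList(String... names) {
        List<String> ocl = new ArrayList<>();
        for (String n : names) {
            if (!conditions.containsKey(n)) {
                throw new IllegalArgumentException("undefined condition: " + n);
            }
            ocl.add(n);
        }
        return ocl;
    }

    // Flatten an OCL into the sample names it covers, in OCL order.
    List<String> flatten(List<String> ocl) {
        List<String> samples = new ArrayList<>();
        for (String n : ocl) {
            samples.addAll(conditions.get(n));
        }
        return samples;
    }
}
```

Storing only condition names in the OCL, rather than copies of the sample lists, means that editing a condition list is reflected in every OCL that refers to it.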

Definition of "intensity" for single-labeled samples

MAExplorer uses the term "intensity" in slightly different ways depending on whether you are using single-labeled or fluorescent double-labeled data. For single-labeled data, "intensity" is the raw quantified data value as measured by the image scanner. Raw data is not directly comparable between samples. Therefore, to compare N samples, you must first normalize the data and then compare them.

Definition of "intensity" for fluorescent double-labeled samples

For fluorescent double-labeled data, the Cy3 and Cy5 dye-labeled (for example) measurements are the raw quantified data values as measured by the image scanner. In this case, "intensity" is defined as the ratio of Cy3 to Cy5 (i.e. Cy3/Cy5). If you wish to look at the ratio as Cy5/Cy3, you may flip the two channels on a per-sample basis (see Section 2.2.2 for more details).
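
The ratio definition and the per-sample channel flip can be written as a short Java sketch. The class and method names are hypothetical, and returning NaN for a zero denominator is an assumption made for this example, not documented MAExplorer behavior.

```java
// Hypothetical sketch of the double-labeled "intensity" definition: Cy3/Cy5,
// or Cy5/Cy3 when the two channels of a sample have been swapped.
final class RatioIntensity {
    static double intensity(double cy3, double cy5, boolean swapChannels) {
        double numerator = swapChannels ? cy5 : cy3;
        double denominator = swapChannels ? cy3 : cy5;
        if (denominator == 0.0) {
            return Double.NaN; // assumption: treat an undefined ratio as missing data
        }
        return numerator / denominator;
    }
}
```

For a spot with Cy3 = 200 and Cy5 = 100, the unswapped intensity is 2.0 and the swapped intensity is 0.5.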

Issues of experimental design of microarray experiments

Some of the issues involved in experimental design (setting up experiments) based on the types of arrays are discussed in Section 3.1.1 for (Cy3/Cy5)-labeled as well as 33P-labeled samples. Poorly designed experiments will not yield significant statistical results, so attention should be paid to developing an adequate and robust design for your data given costs of doing experiments as well as statistical constraints on analyzing the data.

Actual and "Pseudoarray" image geometry

The main MAExplorer window contains a pseudoarray image for visualization purposes. It may or may not correspond to the spot positions on the actual array. This array geometry is defined by the number of replicate Fields (normally 1), each of which contains a number of grids (also called "blocks") containing a number of rows/grid and columns/grid of spots. If there is no explicit array geometry or spot (X,Y) coordinate data available but simply gene identifiers and intensity data, then an arbitrary pseudoarray geometry is generated. If there is an explicit array geometry, then it will draw the pseudoarray using this geometry. The database configuration determines which method will be used and is discussed in Appendix C.5. If there is no explicit grid geometry, the number of spot Locations (e.g., IncyteID, Affymetrix probe_set) may be used to synthesize a set of grids of a size that is reasonable for viewing with MAExplorer. This is done in the Cvt2Mae array data conversion program when the array geometry (#grids, #rows/grid, #columns/grid) is not known. This conversion is not done in MAExplorer itself. In Cvt2Mae we generate a visually appealing pseudoarray image geometry if no array geometry is specified with the data (e.g. Affymetrix data, etc). It maps the N spot data entries to a (#grids, #grid-rows, #grid-columns) geometry. The algorithm is given in Appendix C.6 as well as a suggestion for handling non-standard geometries using Cvt2Mae.
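
To make the idea of mapping N spot entries onto a (#grids, #grid-rows, #grid-columns) geometry concrete, here is a minimal Java sketch. It is not the Cvt2Mae algorithm (that is given in Appendix C.6 and is not reproduced here); it is only one simple way such a mapping could work, assuming the caller chooses the grid size. The class name PseudoArrayGeometry is hypothetical.

```java
// Hypothetical sketch: lay N spots out in fixed-size grids, filling row by row.
// This is NOT the Cvt2Mae algorithm; it only illustrates the kind of mapping involved.
final class PseudoArrayGeometry {
    final int grids;
    final int rowsPerGrid;
    final int colsPerGrid;

    PseudoArrayGeometry(int nSpots, int rowsPerGrid, int colsPerGrid) {
        if (nSpots <= 0 || rowsPerGrid <= 0 || colsPerGrid <= 0) {
            throw new IllegalArgumentException("all sizes must be positive");
        }
        int perGrid = rowsPerGrid * colsPerGrid;
        this.grids = (nSpots + perGrid - 1) / perGrid; // ceiling division
        this.rowsPerGrid = rowsPerGrid;
        this.colsPerGrid = colsPerGrid;
    }

    // Map a 0-based spot index to 1-based {grid, grid row, grid column}.
    int[] locate(int spotIndex) {
        int perGrid = rowsPerGrid * colsPerGrid;
        int within = spotIndex % perGrid;
        return new int[] {
            spotIndex / perGrid + 1,
            within / colsPerGrid + 1,
            within % colsPerGrid + 1
        };
    }
}
```

With 1000 spots and grids of 10 rows by 20 columns, this layout needs 5 grids, and the last grid is completely filled.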

Gene coordinate numbering on the microarray

A gene coordinate numbering is a mapping of gene identifiers to locations on the array for a particular array geometry. These are described by grids (or blocks), each consisting of grid rows by grid columns of spots. The grids may be repeated on the array and constitute duplicate fields. Some arrays group subsets of grids into meta-grids which are specified by meta-grid rows by meta-grid columns of grids. MAExplorer can handle grids but not meta-grids. In the case where there is no array grid geometry specified or meta-grids are used, an arbitrary pseudoarray geometry can be constructed to serve as a basis to display the microarray pseudoimage (see the Algorithm for constructing the pseudo array from a list of spots in Appendix C.6).

In MAExplorer we refer to grids by letter names (A,B,C,...) and fields by F1 and F2. If you are using Cy3/Cy5 ratio data and the Cy3 and Cy5 data is available as independent channels for each HP sample, then operations that use F1 and F2 will use the Cy3 and Cy5 data for various operations such as scatter plots (Cy3 vs Cy5), etc. If there is only one field in an array (i.e. no duplicate grids), then when MAExplorer is run, operations and menus describing F1 and F2 operations will not be available.

Using duplicate (F1 and F2) spots allows us to get an estimate of the hybridization variance within an array. This estimate is used to compute the (F1,F2) gene coefficient of variation (CV) used in the gene data Filter to remove noisy data before looking for additional differences. Note that if Cy3/Cy5 data is used, then F1 and F2 duplicates are not allowed, as MAExplorer uses the (F1,F2) data to hold the (Cy3,Cy5) data for a hybridized sample.
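
As an illustration of how a coefficient of variation can be computed from a pair of duplicate spots, the sketch below uses the textbook definition CV = (sample standard deviation) / mean. This section does not state the exact formula MAExplorer uses, so the formula, the class name DuplicateSpotCV, and the threshold parameter are all assumptions.

```java
// Hypothetical sketch: coefficient of variation for duplicate (F1,F2) spot intensities.
// Uses the textbook sample-SD / mean definition; MAExplorer's exact formula may differ.
final class DuplicateSpotCV {
    static double cv(double f1, double f2) {
        double mean = (f1 + f2) / 2.0;
        if (mean == 0.0) {
            return Double.NaN; // CV is undefined when the mean is zero
        }
        // For two values, the sample standard deviation reduces to |f1 - f2| / sqrt(2).
        double sd = Math.abs(f1 - f2) / Math.sqrt(2.0);
        return sd / mean;
    }

    // A gene passes the filter when its CV does not exceed the chosen threshold.
    // A NaN CV never passes, because NaN comparisons are always false.
    static boolean passes(double f1, double f2, double maxCv) {
        return cv(f1, f2) <= maxCv;
    }
}
```

Under this definition, duplicates of 90 and 110 have a CV of about 0.14 and would pass a threshold of 0.2, whereas duplicates of 50 and 150 (CV about 0.71) would be filtered out.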

Example: special array spot coordinate numbering for the MGAP array

As an example of this coordinate system, the following describes the array geometry for the array used in the NIDDK MGAP database. The general principle is the same for other arrays with different sizes and numbers of fields. The MGAP array was spotted by Research Genetics for MGAP. Clones in the array are laid down in grids consisting of 8 rows and 24 columns per grid. There are 8 grids (named A through H or 1 to 8) to a field with a space between grids. Finally, there are two fields (left and right named 1 and 2 or F1 and F2) that are duplicates.
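
The MGAP geometry just described (2 fields, 8 grids per field, 8 rows by 24 columns per grid) works out to 1536 spots per field and 3072 spots in total. The following Java sketch encodes those numbers and formats a spot coordinate using grid letters. The class name MgapGeometry and the label format are assumptions for illustration only.

```java
// Hypothetical sketch of the MGAP array geometry described in the text.
final class MgapGeometry {
    static final int FIELDS = 2;
    static final int GRIDS_PER_FIELD = 8;
    static final int ROWS_PER_GRID = 8;
    static final int COLS_PER_GRID = 24;

    static int spotsPerField() {
        return GRIDS_PER_FIELD * ROWS_PER_GRID * COLS_PER_GRID;
    }

    static int totalSpots() {
        return FIELDS * spotsPerField();
    }

    // Format 1-based coordinates as, e.g., "F1-A(3,12)". The label format is an assumption.
    static String label(int field, int grid, int row, int col) {
        if (field < 1 || field > FIELDS || grid < 1 || grid > GRIDS_PER_FIELD
                || row < 1 || row > ROWS_PER_GRID || col < 1 || col > COLS_PER_GRID) {
            throw new IllegalArgumentException("coordinate out of range");
        }
        char gridLetter = (char) ('A' + grid - 1); // grids 1..8 map to A..H
        return "F" + field + "-" + gridLetter + "(" + row + "," + col + ")";
    }
}
```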

Note: we currently present the MGAP arrays with grids A through H oriented from top to bottom - whereas Research Genetics orients them rotated +90 degrees with grid H to the left and grid A to the right. This occurred when the images were scanned with a -90 degree change in the orientation. Therefore, we have swapped rows and columns in our relative orientations so that it meets users' normal expectations of row-column orientation. This could be easily changed to the Research Genetics convention using a parameter in the configuration file. Since the actual plate coordinates are tracked with each clone and reported when it is accessed in MAExplorer, the image coordinate system is not that critical - although the verisimilitude of actual array layout and the data-mining layout can be useful.

Setting the "current gene" to a specific gene by "Master gene ID"

MAExplorer uses the concept of the "current gene" to indicate a particular gene to be analyzed. You may interrogate the microarray database or Internet databases for data on the current gene or use it in one of the operations. For example, you might cluster genes by expression profiles to find other genes with profiles similar to the current gene.

Various gene identifiers may be present in the GIPO data file associated with the array. One of these is selected as the unique identifier to represent genes in the MAExplorer database. Normally, the Master gene ID is defined as the Clone ID. However, if the Clone ID is not present but the GenBank ID is, it will use the latter as the identifier. If neither GenBank nor Clone ID is present, it will use GenBank5' then GenBank3' if present. If that is not present, it will use the UniGene ID if it is present. If that is not present, it will use dbEST5' then dbEST3' if present. If that is not present, it will use LocusLink LocusID if present. Finally, if none of those identifiers are present, you can specify a 'Generic ID' that is related to some other database gene identifier such as a 'Location' identifier.
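
The fallback order described above amounts to picking the first available identifier type from a fixed priority list. A minimal Java sketch follows; the class name MasterGeneId and the string keys used for the identifier types are hypothetical and are not the actual GIPO column names.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: choose the Master gene ID type by the priority order in the text.
// The key strings are illustrative, not actual GIPO field names.
final class MasterGeneId {
    static final List<String> PRIORITY = Arrays.asList(
            "CloneID", "GenBank", "GenBank5'", "GenBank3'",
            "UniGene", "dbEST5'", "dbEST3'", "LocusID", "GenericID");

    // Return the highest-priority identifier type that is available, or null if none is.
    static String choose(Set<String> availableIdTypes) {
        for (String type : PRIORITY) {
            if (availableIdTypes.contains(type)) {
                return type;
            }
        }
        return null;
    }
}
```

For example, if a GIPO file supplied only UniGene, GenBank and LocusID identifiers, this sketch would select GenBank as the Master gene ID.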

The current gene may be specified by clicking on a spot in the microarray image or on a point in the popup scatter plot, or a gene ID cell in a report.

Setting the "current gene" by Gene Name Guesser

In addition, the user may type a specific gene name or clone ID into a popup Gene Name Guesser dialog text window. This is invoked by clicking on the blue button "Enter gene name or clone ID" at the top right in the control panel. When the "guesser" window pops up, first select the identifier type: Gene Name, Clone ID, UniGene ID, GenBank, GenBank 3', GenBank 5', dbEST 3', dbEST 5', or LocusID. Then start typing the gene name or identifier in the blue text entry field, and it will match all names or identifiers which are prefixed with the sub-string you have typed so far. As you type more characters, it will limit the list of possible completions of what you are typing. After selecting the gene you want, press the "Done" button to use this entry to set the current gene and remove the guesser popup window. You may press the "Clear" button to clear what you have typed and the "Cancel" button to cancel the current gene selection process.
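
The prefix matching performed by the guesser can be sketched in a few lines of Java. The class NameGuesser is hypothetical, and matching without regard to letter case is an assumption; the manual does not say here whether the guesser is case-sensitive.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch of prefix completion as described for the Gene Name Guesser.
// Case-insensitive matching is an assumption.
final class NameGuesser {
    // Return all names that begin with the text typed so far, in their original order.
    static List<String> completions(List<String> names, String typedSoFar) {
        String prefix = typedSoFar.toLowerCase(Locale.ROOT);
        List<String> matches = new ArrayList<>();
        for (String name : names) {
            if (name.toLowerCase(Locale.ROOT).startsWith(prefix)) {
                matches.add(name);
            }
        }
        return matches;
    }
}
```

Each additional typed character can only shrink the list of completions, which is the narrowing behavior described above.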

Setting the "Edited Gene List" subset of genes using wildcard names

You may also define a set of genes from the guesser window using wildcard names where the character '*' matches zero or more characters. First you specify a sub-string common to gene names. Then press the "Set E.G.L." (set 'Edited Gene List') button. For example (see Figure 2.3.1), you could find all oncogenes and proto-oncogenes by typing "*ONCO*" in the guesser. It automatically enables the View 'Edited Gene List' in the array that shows genes in the E.G.L. enclosed in magenta boxes.
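
The wildcard rule ('*' matches zero or more characters) can be illustrated by translating a wildcard name into a regular expression. This Java sketch is not MAExplorer's implementation; the class WildcardMatcher is hypothetical, and case-insensitive matching is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch: select gene names with a '*' wildcard, as in "*ONCO*".
// Case-insensitive matching is an assumption.
final class WildcardMatcher {
    // Translate the wildcard into a regex: '*' becomes ".*", everything else is literal.
    static Pattern compile(String wildcard) {
        StringBuilder regex = new StringBuilder();
        StringBuilder literal = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '*') {
                if (literal.length() > 0) {
                    regex.append(Pattern.quote(literal.toString()));
                    literal.setLength(0);
                }
                regex.append(".*");
            } else {
                literal.append(c);
            }
        }
        if (literal.length() > 0) {
            regex.append(Pattern.quote(literal.toString()));
        }
        return Pattern.compile(regex.toString(), Pattern.CASE_INSENSITIVE);
    }

    // Return the gene names that match the whole wildcard expression.
    static List<String> select(List<String> geneNames, String wildcard) {
        Pattern p = compile(wildcard);
        List<String> selected = new ArrayList<>();
        for (String name : geneNames) {
            if (p.matcher(name).matches()) {
                selected.add(name);
            }
        }
        return selected;
    }
}
```

Quoting the literal parts with Pattern.quote keeps characters such as '.' or '(' in gene names from being interpreted as regular-expression syntax.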

The current gene cluster

Some operations involving clustering will automatically assign the gene cluster to the E.G.L. This includes clustering of genes similar to a selected (i.e. current) gene and K-means clustering. In the case of K-means clustering, picking a gene that belongs to a cluster defines that cluster as the current cluster and also assigns it to the E.G.L. This will be discussed in more detail in the section on clustering.

The current Condition List of samples

The current condition list of samples is the last condition list edited with the interactive graphical wizard (Section 2.6) used to define new condition lists or edit existing ones.

The current Ordered Condition List (OCL) of multiple conditions

The current ordered condition list (a possibly ordered list of multiple condition lists) is the last ordered condition list edited with the interactive graphical wizard (Section 2.7) used to define new ordered condition lists or edit existing ones.

Saving full resolution plots as GIF files in stand-alone mode

The various plots may be saved as full resolution GIF files when running MAExplorer in stand-alone mode. The various plots have "SaveAs" buttons which appear in stand-alone mode. Saving your intermediate results may be useful for documenting your data mining session or for subsequent publication. (Here is an example of a full resolution clustergram of 38 MGAP hybridized samples for 1076 named and EST genes).

Saving Text windows as .txt files in stand-alone mode

The various text windows may be saved as .txt files when running MAExplorer in stand-alone mode. The various text windows have "SaveAs" buttons which appear in stand-alone mode. Saving your intermediate results may be useful for documenting your data mining session or for subsequent publication.

1.2 Microarray image quantification

Quantification data for all genes in a hybridized sample (x and y coordinates, intensity, background density) is obtained by reading data from a quantification file for that hybridized sample. The quantification file for each hybridized sample resides on the local file system (for stand-alone) or MAExplorer Web server (for applet use) and is derived from image quantification programs such as Axon's GenePix(TM) program, Scanalyze, Molecular Dynamics' ImageQuant(TM) program, Research Genetics' Pathways(TM) program, etc. These programs are independent of MAExplorer and are not part of our downloadable software distribution. Normalization between hybridized samples must be performed to allow comparison between different hybridized array samples. File formats are discussed in Appendix C.

1.2.1 Ratio and Zscore comparison of data from different hybridized samples

Because of variation between hybridized samples, data is normalized. Methods that are pure scaling transformations (such as Median, Scale to 65K, By Calibration DNA, By Use Gene Set, etc.) allow you to compare data using the ratio between two normalized sets of data. We define the ratio for two samples as follows:
    ratio(x,y,c) = Ixc / Iyc
where: 
    samples x,y have values Ixc and Iyc for the same 
    gene c in samples HP-X and HP-Y
The Zscore method transforms the data such that it cannot be used with the ratio comparison. Instead we use the Zdiff(x,y) method for comparing Zscores, developed by Mark Vawter (Vawter, 2000). Zscores typically cover the range of -3.0 to +3.0 (standard deviations) with a transformed mean of 0.0. Therefore the Zdiff will typically cover the range of -6.0 to +6.0.
Let
    Zscore(p,c) = (Ipc - meanp)/stdDevp
where:
    Ipc is the intensity of gene c for sample p. Sample p has meanp 
    and stdDevp

Then,
    Zdiff(x,y,c) = Zscore(x,c) - Zscore(y,c),
where: 
    samples x,y have Zscore(x,c) and Zscore(y,c) normalized values for the 
    same gene c in samples HP-X and HP-Y, or HP-X 'sets' and HP-Y 'sets'.
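
The ratio, Zscore and Zdiff definitions above can be sketched in Python as follows. This is an illustration, not MAExplorer's own code: the function names are hypothetical, the intensities are made-up values, and the use of the sample standard deviation is an assumption because the manual does not say which estimator is used.

```python
from statistics import mean, stdev


def ratio(ix, iy):
    """ratio(x,y,c) = Ixc / Iyc for each gene c."""
    return [x / y for x, y in zip(ix, iy)]


def zscores(intensities):
    """Zscore(p,c) = (Ipc - meanp) / stdDevp for each gene c of sample p."""
    m = mean(intensities)
    s = stdev(intensities)  # assumption: sample standard deviation
    return [(v - m) / s for v in intensities]


def zdiff(ix, iy):
    """Zdiff(x,y,c) = Zscore(x,c) - Zscore(y,c) for each gene c."""
    return [zx - zy for zx, zy in zip(zscores(ix), zscores(iy))]


# Made-up intensities for three genes in samples HP-X and HP-Y.
hp_x = [100.0, 200.0, 300.0]
hp_y = [50.0, 100.0, 300.0]

print(ratio(hp_x, hp_y))
print(zdiff(hp_x, hp_y))
```

Note that each sample is transformed with its own mean and standard deviation before the difference is taken, which is why the Zdiff cannot be expressed as a ratio.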

Table 1.2.1 Displays affected by the normalization mode. When comparing two hybridized samples or sets of hybridized samples, the metric used is either ratio or Zdiff depending on whether the Zscore normalization was selected in the Normalization menu. This will affect a variety of data displays and some of the data Filter methods listed here. In addition, all of the other graphics (EP plots, intensity histogram plot, cluster plots including clustergrams and dendrograms) are also affected by the normalizations.

  1. Pseudocolor X/Y ratio, X-Y Z-diff, and other pseudoarray images
  2. the 3-line gene data displayed in the main MAExplorer window
  3. the gene data display at the top of the scatter plot when clicking on a point in the scatter plot
  4. report tables of genes with the highest/lowest X/Y (Cy3/Cy5) (F1/F2) ratio or X-Y (F1-F2) Zdiff
  5. Ratio histogram plot of X/Y (Cy3/Cy5) (F1/F2) ratios or X-Y (Cy3-Cy5) (F1-F2) Zdiff data
  6. data Filter: Spot Intensity [SI1:SI2] range or Zdiff [Z1:Z2] range for HP data
  7. data Filter: Intensity [I1:I2] range or Zdiff [Z1:Z2] range for (HP-X/HP-Y) or (HP-X - HP-Y) data
  8. data Filter: Ratio [R1:R2] range or Zdiff [Z1:Z2] range for (HP-X/HP-Y) or (HP-X - HP-Y) data
  9. data Filter: Ratio [CR1:CR2] range or Zdiff [CZ1:CZ2] range for (Cy3/Cy5) or (Cy3-Cy5) data of a single sample
  10. Range and scale of data in EP plots, cluster plots, clustergrams, dendrograms, etc.
  11. Statistical tests: t-tests (X and Y sets), Kolmogorov-Smirnov test (X and Y sets), ANOVA F-test (OCL), etc.


1.3 Microarray image and plot display

The MAExplorer displays one microarray pseudoarray image of the hybridized samples. This is either for a single sample, the ratio of two samples, the average of replicate samples or the ratio of two sets of replicate samples, the ratio Cy3/Cy5 or Cy5/Cy3, or other mappings. Section 2.4.4.1 Show microarray pseudoarray images menu describes these options and shows some examples.

The Filter menu is used to select a set of data filters that determines which genes are selected. These genes are highlighted in the array image by a circle around each spot meeting the range threshold criteria: a red circle in the intensity pseudoarray image or a white circle in the ratio pseudoarray image. How they are highlighted depends on which Plot menu Show Microarray method and View menu modes were selected. If the Show 'Edited Gene List' (EGL) option is set in the View menu, genes in the EGL will appear as magenta squares. The "Filter mode" is always present and shows genes meeting various Filter criteria (to be discussed). The user may interactively define a list of genes by clicking on them when the Click to add gene to edited gene list option is set in the Edit menu. Alternatively, you can click on a gene with the Control key pressed to add a gene to the EGL or with the Shift key pressed to delete a gene from the EGL.

Types of pseudoarray image displays

There are several different types of pseudoarray images that may be displayed. The current type is set in the Show Microarray submenu in the Plot menu. The selections include Pseudograyscale intensity, which approximates the intensity of a single sample or average of samples. The Pseudocolor Red(X)-Yellow-Green(Y) HP-X/HP-Y ratio or Zdiff and Pseudocolor Red(Cy5)-Yellow-Green(Cy3) Cy3/Cy5 (or F1/F2) ratio or Zdiff selections add the two samples or channels together as separate Red+Green channels to give a color spectrum. The Pseudocolor HP-X/HP-Y ratio or Zdiff and Pseudocolor Cy3/Cy5 (or F1/F2) ratio or Zdiff selections give a color spectrum from a low ratio (Zdiff) value (Green) to a high value (Red), with a value of 1.0 (0.0) shown as Black. The Pseudocolor (HP-X,HP-Y) 'sets' p-value selection shows the p-value between the X and Y sets in a color spectrum. If the Original image is set and the image file is in the database, it will pop up a separate Web browser window to display it.

The Pseudograyscale display is a grayscale image, with higher concentration genes appearing darker, on a light blue background. The pseudocolor HP-X/HP-Y ratio of spots image is constructed using a color scale going from bright green (<1) to black (=1) to bright red (>1) on a black background. For the pseudocolor Zdiff of (X-Y), the color scale goes from bright green (<0) to black (=0) to bright red (>0). If the dichromasy switch is set in the View menu, then a different set of colors is selected that may be easier for some people to differentiate. If the Use dual HP-X & HP-Y 'sets' else single samples toggle in the Samples menu is set, it displays the mean HP-X data on the left and HP-Y on the right for a side by side comparison.
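
A green-black-red ratio scale of the kind described above could be computed along the following lines. This Python sketch is not MAExplorer's actual color mapping: the function name ratio_to_rgb, the use of a log2 scale, and the max_fold saturation limit are all assumptions made for illustration.

```python
import math


def ratio_to_rgb(ratio, max_fold=8.0):
    """Map a positive ratio to an (R, G, B) color.

    Ratios below 1 shade toward green, a ratio of 1 is black, and ratios
    above 1 shade toward red. The log2 scale and the max_fold saturation
    limit are illustrative assumptions. The ratio must be greater than 0.
    """
    t = math.log2(ratio) / math.log2(max_fold)
    t = max(-1.0, min(1.0, t))  # clamp to the displayable range
    level = round(255 * abs(t))
    if t > 0:
        return (level, 0, 0)   # up in X relative to Y: red
    if t < 0:
        return (0, level, 0)   # down in X relative to Y: green
    return (0, 0, 0)           # no change: black


for r in (0.125, 1.0, 8.0):
    print(r, ratio_to_rgb(r))
```

A Zdiff scale could follow the same pattern with the neutral point at 0.0 instead of 1.0 and no logarithm.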

In all of the pseudoarray images, the grids in the image are labeled field#-GridLetter (e.g. 1-C, 2-B, etc). This allows them to be clearly identified as the user scrolls over the image that is larger than the visible computer window.

Popup windows

MAExplorer starts with the main pseudoarray image window. This window contains the pull-down menus where you may issue commands. As you perform various operations, new windows may pop up for some of these commands. For most of these windows, you may click on the "Close" button or click on the close window icon associated with your operating system (generally one of the buttons at the top of the popup window). However, some windows were designed to not close when you do this. In particular, the "State sliders" cannot be closed unless the associated data filtering or clustering operation is closed. Closing the associated operation will automatically close the state slider window.

There is also a popup alert message window for better informing users of conditions that prevent them from doing the operation they requested. You must press the Close button to pop down the message, although you may first press the SaveAs button to save the message to a file. For complex problems, some of the messages may suggest what you need to do to correct the problem.

The current sample, HP-X, and HP-Y

In MAExplorer, a hybridized array sample is abbreviated HP. The underlying data comparison model assumes, as a minimum, the comparison of two different experimental conditions represented by samples HP-X and HP-Y. A good way to think about this is that these variables are the two axes of a scatter plot (one of the displays you may generate). The HP-X and HP-Y may be thought of as containing data from either single hybridized samples or mean data from multiple replicate sets of samples. The HP-X and HP-Y are assigned using the Set current HP-X and Set current HP-Y commands in the Samples menu. The sets are most easily changed using Choose HP-X, HP-Y and HP-E to select the currently active samples. The contents of the multiple-sample HP-X and HP-Y 'sets' may alternatively be changed using the Edit HP-X & HP-Y 'sets' of samples by source submenu, and the HP-E list of samples using the Edit HP-E list of samples by source submenu. Assigning single samples to either HP-X or HP-Y may be done from the Samples menu. However, it is easier to do it by clicking on the pseudoarray image. First click on the magenta "[X]" or "[Y]" Current Sample box at the top of the list to switch between HP-X and HP-Y. Whichever is visible ([X] or [Y]) is the one that will be assigned. Then simply click on the magenta "*" to the left of the sample name for the sample you wish to assign.

Hybridized samples are selected from a list of all of the samples in the database. To make it easier to select a HP, they may be selected from submenus by their developmental stage (if supported by your particular database) or from a list of all samples in the database located on the left side of the pseudoarray image. If a sample has never been loaded during a session, it will be loaded when you request it.

The last sample selected is called the current sample or current HP. That is the sample that is displayed in the pseudoarray image in the primary MAExplorer window when using display modes requiring a single sample.

Using 'sets' of HP-X and sets of HP-Y

Multiple samples may be assigned to the HP-X or HP-Y sets. These are assigned using the Edit HP-X and HP-Y 'sets' of microarrays in the Samples menu. The multiple sets are enabled by setting the Use HP-X and HP-Y 'sets' else single samples checkbox in the Samples menu. Then, when statistical calculations are performed on that data, it will use the means, std-deviations, etc. from each of these sets rather than individual samples.
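
The per-gene summary of a set of replicate samples can be sketched as follows. This Python illustration is not MAExplorer's own code: the function name set_summary and the data layout (one list of intensities per sample, with genes in the same order) are hypothetical, and the sample standard deviation is an assumption.

```python
from statistics import mean, stdev


def set_summary(samples):
    """Return per-gene (means, std_devs) across the samples of one set.

    `samples` holds one list of normalized intensities per hybridized
    sample, with genes in the same order in every list.
    """
    per_gene = list(zip(*samples))  # regroup the values gene by gene
    means = [mean(values) for values in per_gene]
    # A single-sample set has no spread, so report 0.0 in that case.
    std_devs = [stdev(values) if len(values) > 1 else 0.0 for values in per_gene]
    return means, std_devs


# Made-up data: an HP-X 'set' of two replicate samples, two genes each.
hp_x_set = [[10.0, 20.0], [14.0, 20.0]]
means, std_devs = set_summary(hp_x_set)
print(means, std_devs)
```

When the 'sets' option is on, comparisons such as the X/Y ratio would use these per-gene means in place of single-sample values.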

The HP-E sample list for computing expression profiles

You may cluster sets of genes with similar expression profiles across a set of hybridized samples. The set of HP samples used in doing these profiles is specified by Edit expression profile 'list' (HP-E) in the Samples menu. The Choose HP-X, HP-Y, and HP-E command may also be used for defining the members and order of the samples in the HP-E 'list'. Then, gene intensity expression profiles may be created in a popup window for hybridized samples in the HP-E set by using the Expression profile plot commands in the Plot menu. Several of these plots may be created on the screen at the same time. Clicking on a vertical data line in the plot will show the name of the HP, its intensity and coefficient of variation (CV) of the (F1,F2) data for this gene. Note that you can order the hybridized samples in the HP-E set by the order in which they are added.
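
An expression profile for one gene over the ordered HP-E list could be assembled as in the Python sketch below. The function name, the nested dictionary layout and the sample names are hypothetical. The CV is computed here as the standard deviation of the (F1,F2) pair divided by its mean, which is the usual definition but is an assumption about how MAExplorer computes it.

```python
from statistics import mean, stdev


def expression_profile(gene_id, hp_e, data):
    """Return (sample name, mean intensity, CV) for each sample in HP-E order.

    `data[sample][gene_id]` is assumed to hold the (F1, F2) duplicate-spot
    intensities of that gene in that sample.
    """
    profile = []
    for sample in hp_e:
        f1, f2 = data[sample][gene_id]
        m = mean([f1, f2])
        cv = stdev([f1, f2]) / m if m else float("nan")
        profile.append((sample, m, cv))
    return profile


# Made-up duplicate-spot data for one gene in two samples.
data = {
    "HP-A": {"g1": (100.0, 100.0)},
    "HP-B": {"g1": (90.0, 110.0)},
}
for row in expression_profile("g1", ["HP-B", "HP-A"], data):
    print(row)
```

The order of the returned rows follows the order of the HP-E list, not the order in which the samples are stored.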

Data 'Filters' - the intersection of one or more data tests

A set of genes may be computed by taking the intersection of selected gene sets. These sets are determined by various logical, data range and statistical tests. Genes passing each test are assigned to a gene subset, and these subsets in turn are used in the gene intersection computation. The final gene subset is used in the array, plots, and reports, and in subsequent data filtering. Changing any test parameters causes the data filter to be re-computed.
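
The intersection of filter results can be sketched with Python sets. The function name run_filters, the gene records and the two example tests are hypothetical; they only illustrate the intersection step, not MAExplorer's actual filters.

```python
def run_filters(genes, tests):
    """Return the IDs of genes that pass every active test.

    `genes` maps a gene ID to its data record; each test is a function
    that takes a record and returns True when the gene passes.
    """
    selected = set(genes)  # with no active tests, every gene passes
    for test in tests:
        subset = {gene_id for gene_id, record in genes.items() if test(record)}
        selected &= subset  # intersect with this test's gene subset
    return selected


# Made-up gene records.
genes = {
    "g1": {"intensity": 900.0, "ratio": 2.5},
    "g2": {"intensity": 120.0, "ratio": 3.0},
    "g3": {"intensity": 1500.0, "ratio": 1.1},
}
tests = [
    lambda rec: 500.0 <= rec["intensity"] <= 5000.0,  # intensity range test
    lambda rec: rec["ratio"] >= 2.0,                  # ratio threshold test
]
print(run_filters(genes, tests))
```

Changing a threshold simply means rebuilding the list of tests and calling the function again, mirroring the re-computation described above.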

List of data filters for MAExplorer - any number of filters may be used simultaneously

Figure 1.3 Data Filter Venn diagram. This illustrates some of the logical, data range and statistical tests criteria available using the MAExplorer data Filter paradigm. Note that multiple criteria may be selected from each of these categories. The extreme case, probably never used, could use all tests.


1.4 Exploratory data analysis - overview

MAExplorer may be used to perform various data explorations by looking for patterns correlated with different sets of hybridized samples or with expression profiles of genes. This is discussed in more detail throughout this manual and later in Section 3 on Exploratory Data Analysis. Detailed descriptions of all commands are given in Section 2 Menus.

A first-approximation approach to data-mining might be to sequentially constrain the data of interest to find some changes and then to report on those changes. We have arranged these commonly performed first-pass operations as submenu entries in the Analysis Menu. The submenus are:

  1. GeneClass Menu
  2. Normalization Menu
  3. Filter Menu
  4. Plot Menu
  5. Cluster Menu
  6. Report Menu

The primary 'Analysis' menu of MAExplorer - most operations are found in this menu

Figure 1.4 Screen view of MAExplorer main window with Analysis Menu. The menu structure of MAExplorer was designed to allow users to quickly perform commonly used data-mining operations. Other menus are used for modifying the data (File, Samples, Edit, and View menus) or accessing on-line Help menu information in a separate Web browser popup window. MAExplorer menus are similar to most Windows PC applications where pull-down menu selections are used to invoke operations. The current hybridized array sample is displayed as a pseudocolor ratio image of median normalized spot intensities. Clicking on a spot assigns it as the current gene with data being reported in the top most message area. The names of the current HP-X and HP-Y samples are listed above that area. In general, clicking on spots, points in plots or cells in spreadsheet reports will assign the corresponding gene as the current gene and access Web genomic databases if enabled.


In addition to displaying the hybridized sample pseudoarray images, derived data may be viewed in various types of plots. These include scatter plots, histograms, ratio-histograms, expression profiles, gene clustering, etc. Data may also be presented in table reports, either as active spreadsheets that can access genomic databases by clicking on cells or as tab-delimited Excel-compatible tables that may be cut (if your windowing system supports this) and pasted into an Excel spreadsheet.
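
A tab-delimited table of the kind mentioned above is easy to produce with standard tools. The Python sketch below uses the standard csv module; the function name report_to_tsv and the column names are hypothetical and do not reflect MAExplorer's actual report layout.

```python
import csv
import io


def report_to_tsv(header, rows):
    """Format a report as tab-delimited text suitable for pasting into Excel."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up report rows.
text = report_to_tsv(["Clone ID", "Gene name", "X/Y ratio"],
                     [["C1", "actin beta", 2.5], ["C2", "c-fos", 0.4]])
print(text)
```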

The selected HP-X and HP-Y samples are used when generating scatter plots, ratio histograms and other graphics. Scatter plots and ratio histograms may also be performed on the left and right sides of the currently displayed HP array (fields F1 and F2 respectively if array data has duplicate spots for the same genes).

A MAExplorer database contains a table identifying genes, so data is accessible by gene name as well as by sub-strings identifying a set of genes (e.g. "onco", which could be used to find any oncogene or proto-oncogene in the database).

When the program starts, it displays the microarray image of the first hybridized sample in the HP-X set of samples initially specified. If you specify a new HP-X or HP-Y sample, then it changes the pseudoarray image to correspond to that array. You may change the current HP-X or HP-Y sample from either the Samples pull-down menu or by clicking on a sample in the Active Sample list in the left of the pseudoarray image. If you click the mouse on or near a spot, it will latch onto that spot and define it as the current gene.
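
The "latch onto the nearest spot" behavior can be illustrated with a nearest-neighbor lookup. This Python sketch is not MAExplorer's implementation: the function name nearest_spot, the coordinate dictionary and the max_distance cutoff (the meaning of "near") are assumptions.

```python
import math


def nearest_spot(click_x, click_y, spots, max_distance=10.0):
    """Return the ID of the spot closest to the click, or None if none is near.

    `spots` maps a gene ID to the (x, y) center of its spot. The
    max_distance cutoff is an illustrative assumption.
    """
    best_id, best_dist = None, max_distance
    for gene_id, (x, y) in spots.items():
        dist = math.hypot(x - click_x, y - click_y)
        if dist <= best_dist:
            best_id, best_dist = gene_id, dist
    return best_id


spots = {"g1": (10.0, 10.0), "g2": (50.0, 10.0)}
print(nearest_spot(12.0, 11.0, spots))   # close to g1
print(nearest_spot(30.0, 30.0, spots))   # too far from any spot
```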

Note: In Figure 1.4, genes that pass the MAExplorer data Filters are indicated by red (white) circles around spots in the pseudograyscale (pseudocolor) intensity (ratio) image. The pseudoarray image shows the gene data as replicate grids of spots if there are two fields: Field 1 (left set of gridded spots) and Field 2 (right set of gridded spots). If there is no duplicate spot data, then only Field 1 is shown.

If background correction is enabled in the Normalization menu, then intensity is reported in the message displays as intensity', otherwise as intensity. Normalization should also be used between hybridized samples, whether the data is ratio data (i.e. Cy3/Cy5) or single sample intensity arrays.

1.4.1 Saving the state of a data-mining session in stand-alone mode

If you are running MAExplorer in stand-alone mode, you may save the state of your session for later use using the "Save DB" or "SaveAs DB" commands. Then, the checkpointed database can be accessed using the "Open file DB" command. It currently saves: the gene sets, condition (HP) lists, current HP-X, HP-Y and HP-E lists, data Filter options and slider value settings, display options, clustering options, normalization options, etc. We recommend using the "SaveAs ... DB" command so you can save the state under a different name rather than overwriting the original state. This way you can go back to the original state if you want to. The "SaveAs DB" and "Open file DB" commands are described in the File menu.

1.4.2 Logging messages and command history

Often a user would like to review measurements of particular genes and to review the list of commands they issued (also called the command history). Various data measurements as well as many other types of information in the three text lines in the status area of the main window may optionally be recorded in a popup message log (Section 2.5.1) and the command history may also be reviewed in a separate popup message log (Section 2.5.2). If you are running the stand-alone version, the logs may be saved. Otherwise, you could cut and paste the log data into other word processing applications.

1.5 Quick start - demonstration of MAExplorer

MAExplorer is used as a stand-alone application. You may download the stand-alone application (see Appendix D). This download also includes a demo data set of 50 hybridized samples from the public MGAP database. In any case, you can explicitly download the data at any time at http://www.lecb.ncifcrf.gov/mae/MGAP-Array-database.zip or http://prdownloads.sourceforge.net/maexplorer/MGAP-Array-database.tar.gz?download

Setting up MAExplorer to work with user-specific data is discussed later in this manual in Appendix C.

MAExplorer Home page

Figure 1.5.1 The MicroArray Explorer home page at http://maexplorer.sourceforge.net/. The table of contents in the left panel lists an introduction, a short tutorial, and several demonstration databases. Below that are links to documentation including this reference manual, glossary and index. The Export version discusses running MAExplorer with other arrays and as a stand-alone version. The Download application is a Web page for downloading and installing the stand-alone Java application on your computer.


You may start MAExplorer in your Web browser from the MGAP Startup DB. This offers several preset public databases consisting of sets of hybridized samples as well as the empty database. After you have clicked on a particular startup database, it will begin loading MAExplorer - indicated by a red box with a "Loading..." message in the top window of your browser. After MAExplorer starts, this message changes to a white box with "Reading DB" while it downloads the data files required. Finally, when it is ready for your interaction, it displays a white box with a green "Ready".

NOTE: for Web browser invocation, the MAExplorer applet works with Netscape 4.7, Internet Explorer 5.0, and HotJava on a Windows (95/98/NT/2000/XP) system or a Solaris Unix system. Macintosh and SGI systems seem to hang at times because of Web browser problems. However, it works on all other systems as a stand-alone Java application that you may download and install on your computer. You might want to review these Web browser restrictions.

After the MAExplorer is started and the menus become active, you may switch the preset hybridized samples to other samples using the Samples pull-down menu. The last hybridized sample loaded becomes the "current hybridized sample" and its image is the one displayed.

Exiting MAExplorer

If you are in MAExplorer and want to close the program and exit, you may use the Quit command in the File menu or click on the "close application" button (found in the upper right hand corners of MAExplorer windows, put there by your operating system).

1.6 Tutorials for using MAExplorer

There are a number of things you may do in this data mining facility.
  1. Analyze expression of individual genes
  2. Analyze expression of gene families and clusters
  3. Compare expression patterns in multiple hybridizations
We wrote two tutorials to help you understand its capabilities. We recommend you first try the short tutorial before attempting the advanced tutorial. The latter demonstrates some of the more advanced capabilities.

2. MAExplorer menus

A MAExplorer analysis is performed using various interactive controls. These commands are selected from pull-down menus located on the "menu bar" at the top of the main MAExplorer window. The primary menus are:

Menu notation

In the following menus, selections that are sub-menus are marked with an icon indicating that the menu has a submenu. Checkbox commands are marked with an icon showing whether they are enabled or disabled, and have a "[CB]" at the end of the command. Multiple choice "radio button" commands are likewise marked with an icon showing whether they are enabled or disabled; only one member of a radio button group is allowed to be on at a time, and these menu items have a "[RB]" at the end of the command. The default values set for an initial database are shown in the menus. Selections prefaced with a '#' indicate commands that are available only when MAExplorer is run in the stand-alone mode. Selections prefaced with a '*' indicate commands that require access to the backend Web server [Future]. Selections that are not currently available will be grayed out in the menus of the running program.

The following Sections 2.1 through 2.7 describe the pull-down menus in detail.

2.1 File menu

The File menu includes options and submenus providing access to database data from disk and Web servers, state saving, and groupware to share states between collaborators.

In stand-alone mode, the user may select the database subset to be loaded from either a Web server or a local file system. When used as an applet, this is pre-determined by the Web page where MAExplorer is started. Opening a disk DB, 'Open disk DB', also restores any user defined gene sets and other parts of the exploratory state that were present when the 'Save ... disk DB' was invoked.

When used as an applet connected to a Web database server, databases may be divided into public and collaborator projects. Users accessing protected collaborator projects will be required to log-in to the server and a popup login request will appear.

2.1.1 Databases menu

The Databases submenu is currently available only in stand-alone mode and contains the following selections for opening and saving databases.

[In the future], each user will be able to save the state of their exploration into a password protected directory of named states on a Web server (e.g. by doing a 'Save ... Web DB' command). Later, they could restore that state from the Web server by doing an 'Open Web DB' command. Users would be required to register with that server to set up a unique state-saving area. Once this facility is set up, users may selectively allow other users to view selected data, implementing a groupware environment for improving collaboration.

Starting an existing MAExplorer database using the file browser with 'Open file DB'

Figure 2.1.1 Example of the "Open file DB" command. The file browser is opened in the current project directory with the name of the currently opened file. You may select another .mae startup database file to load in the current project. You may also "cruise" the file system and load an .mae file from a different project directory. The "Set project" command makes this easier since it gives you a list of available projects that you may change directly. The projects must have been set up on your computer previously. The "New project" command can be used for setting up new projects.


Saving the current session in a new MAExplorer startup database file using the 'SaveAs file DB'

Figure 2.1.2 Example of saving a user session in a new startup file using the "SaveAs DB" command. The file browser is opened in the current project directory with the name of the currently opened file. You may enter another .mae file name to save your current session. Then when you restart MAExplorer using this new file, it will restore the data mining state to where you left off (except that no popup windows are opened).


2.1.2 Exploratory state menu

The Exploratory 'state' submenu contains the following selections for saving and later using the user's state of the exploration. If MAExplorer is being run on a local computer, no login is required.

2.1.3 Groupware facility for sharing user states menu [Future]

The groupware facility allows users to share their state data with other users. However, if MAExplorer is being run on a local computer, then groupware may not be available since it depends on using specific Web servers with MAExplorer-specific groupware services available. [We first developed these concepts in the context of 2D protein gels. They include: WebGel (Lemkin et al., 1999b) for Internet exploration of 2DE databases; Xconf (Lemkin et al., 1993), an early X-windows based image conferencing system, similar to CU-See-Me or AOL Instant Messenger, for sharing images in a conference over the net; Flicker (Lemkin, 1997), an image comparison tool used over the Internet as a Java applet; and GELLAB-II (Lipkin and Lemkin, 1981), a system for data mining (see the GELLAB-II Poster on the Web). These embody many of the concepts used in MAExplorer.] The Groupware submenu contains the following selections:

Saving the state and databases on the local file system

The current state of an exploratory data analysis session may be saved on the local file system. The state consists of: gene sets; HP-X and HP-Y hybridized sample sets; HP-E hybridized sample lists; thresholds and switch setting preferences; etc. The user may save their current state and restore a previous state at any time. Restoring the state will override the current database.

Groupware sharing of intermediate exploratory results [FUTURE]

Each registered user will be able to save the current state of their exploration of the data in named User State files on the back-end Web server using the Save user's state command. The user may keep multiple named states on the back-end secure server, where they may be accessed to restore the state at a future time using the Open user's state command. The user can request a list of their states with the Directory of user's states command. They may remove a particular state with Delete user's state.

A registered user may allow another registered user to access their state or states (using the Open another user's state command) if the user owning the data had granted them permission. The Share user state and Unshare user state commands control these permissions. There are two special share-users defined: public to allow unlimited read-only access to the state they specify, and private to disallow all access to a user state.

2.2 Samples menu

Each experimental condition sample is represented by a hybridized sample (abbreviated HP in MAExplorer). The Samples menu includes operations to select the current hybridized sample or samples. The simplest model of a MAExplorer analysis assumes (at least) two microarray hybridized sample variables, HP-X and HP-Y, whose data may be plotted against one another or compared. The default is a single HP-X sample and a single HP-Y sample.

The first menu command, "Choose HP-X, HP-Y and HP-E samples", lets you change the current working HP-X 'set', HP-Y 'set', and HP-E 'list' of hybridized samples.

The second menu command, "Choose named condition lists of samples", lets you define or edit new named lists of hybridized samples. This is useful for defining sets of replicate samples. These may be further manipulated using the (Edit menu | Sets of Conditions (samples)) commands.

The third menu command, "Choose ordered lists of conditions", lets you define or edit new named Ordered Condition Lists (OCL) of named condition lists. This is useful for defining a sub-experiment consisting of N conditions each with replicate samples. The last OCL manipulated is defined as the "current OCL". The current OCL is used in the OCL F-test Filter.

The fourth menu command, "Set Samples from lists", lets you change the current HP-X and HP-Y samples as well as the HP-X 'set', HP-Y 'set', and HP-E 'list' samples. This is similar to using the "Choose HP-X, HP-Y and HP-E samples" command, but is more difficult to use. You may also change the current HP-X or HP-Y sample by clicking on the sample name directly in the list of sample names on the left side of the pseudoarray image (see Figure 2.2.3 legend).

The fifth menu entry, "Edit use (Cy5/Cy3) else (Cy3/Cy5) for each HP", lets you swap data channels for Cy3/Cy5 data for individual samples.

Other menu commands list the status of the current HP-X 'set', HP-Y 'set', or HP-E 'list', and define condition class names that are associated with the HP-X 'set' and HP-Y 'set'. The last menu entry, "Use HP-X & HP-Y 'sets' else single samples", lets you switch between using HP-X and HP-Y as single samples or as sets of multiple samples. For example, if you are using a scatter plot of X and Y, the switch determines whether the data being plotted is a comparison of single samples or a comparison of means of sets of samples. Sets of samples are used extensively in data explorations.

Figure 2.2.1 shows setting the HP-X, HP-Y, HP-E lists of samples using the "Chooser" - the preferred method. Figure 2.2.2 shows setting the HP-X sample from the menus. Figure 2.2.3 shows changing the current HP-X sample by clicking on a sample name in the microarray image.

Choosing the X and Y sets of samples and the E ordered list of samples

Figure 2.2.1 Samples menu - selecting lists of samples by using the "chooser". The hybridized samples assigned to the current HP-X, current HP-Y, set of HP-X, set of HP-Y and expression profile list HP-E may be changed from the Samples pull down menu. The Choose HP-X, HP-Y and HP-E option lets you graphically change the currently active HP-X and HP-Y sample sets and the HP-E list.


Selecting HP X, Y, or E samples using the pull-down menus

Figure 2.2.2 Samples menu - selecting samples by source characteristics. The hybridized samples assigned to the current HP-X, current HP-Y, set of HP-X, set of HP-Y and expression profile list HP-E may be changed from the Samples pull down menu. The specific "By Source" menus shown here are from the MGAP database. This figure shows the user changing the current X sample from the developmental stages submenu that is part of the "By Source" submenu. Alternatively, samples containing a keyword or part of a keyword can be found using a "guesser" popup window that allows the use of wild cards. This is invoked using the "From list of all H.P.s" submenu. For example, you could specify "*pregnancy*" to find all samples containing that word.


Switching current X or Y sample by clicking on desired sample 
in list at left edge of pseudoarray image. Switch between 
X and Y by toggling '[X]' and '[Y]' of the Current Sample box.

Figure 2.2.3 Changing the current sample to either the HP-X or HP-Y sample by clicking on a sample name at the left edge in the microarray pseudoarray image. The current sample is indicated in magenta. Click on the magenta "*" adjacent to the new name you want to select and it will change the HP-X sample. To switch between setting HP-X and HP-Y, click on the [X] Current Sample box to change the sample to HP-Y. You can click on [Y] Current Sample box to change it back to HP-X. Then clicking on a sample name will set it to the current HP-X or HP-Y that was selected. This figure shows that the user had selected [Y] and C57B6-L10-29hrs for the new HP-Y sample.


2.2.1 Selecting hybridized samples with Chooser or pull-down menu sample lists

The Set Samples from lists submenu lets you define HP-X, HP-Y, HP-X 'set', HP-Y 'set', or HP-E 'list' from lists of samples. It contains four submenus:

The Set current HP-X sample and Set current HP-Y sample commands offer another way to set the single current X and Y sample (see Figure 2.2.1 above for the preferred way using the "Chooser").

The Edit HP-X & HP-Y 'sets' of samples by source menu allows the user to define HP-X and HP-Y as sets having multiple hybridized samples. Then, the mean values of the genes are used when comparing HP-X with HP-Y.
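The set-based comparison described here can be sketched in a few lines of Java. This is only an illustration under stated assumptions, not MAExplorer's own code: the class `SampleSetMeans`, its method names, and the `double[sample][gene]` data layout are hypothetical, and a ratio of means is just one of the ways the two sets might be compared.

```java
/** Hypothetical helper for illustration; not part of the MAExplorer API. */
final class SampleSetMeans {

    /**
     * Per-gene mean over a set of samples.
     * data[s][g] is the (normalized) intensity of gene g in sample s.
     */
    static double[] meanPerGene(double[][] data) {
        if (data.length == 0) {
            throw new IllegalArgumentException("sample set is empty");
        }
        int nGenes = data[0].length;
        double[] mean = new double[nGenes];
        for (double[] sample : data) {
            if (sample.length != nGenes) {
                throw new IllegalArgumentException("samples differ in gene count");
            }
            for (int g = 0; g < nGenes; g++) {
                mean[g] += sample[g];
            }
        }
        for (int g = 0; g < nGenes; g++) {
            mean[g] /= data.length;
        }
        return mean;
    }

    /** Per-gene ratio of the HP-X set mean to the HP-Y set mean. */
    static double[] ratioOfMeans(double[][] xSet, double[][] ySet) {
        double[] mx = meanPerGene(xSet);
        double[] my = meanPerGene(ySet);
        if (mx.length != my.length) {
            throw new IllegalArgumentException("sets differ in gene count");
        }
        double[] ratio = new double[mx.length];
        for (int g = 0; g < mx.length; g++) {
            ratio[g] = mx[g] / my[g];
        }
        return ratio;
    }
}
```

With a single sample in each set, `ratioOfMeans` reduces to the plain single-sample X/Y ratio, which mirrors the single-sample versus 'set' switch in the Samples menu.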

The Edit HP-E expr. profile 'list' by source menu allows the user to define an ordered list of samples for use in expression profile statistics. Then, an expression vector of normalized quantification values (one for each sample in the HP-E list) is computed for each gene. Note: to place the samples in a particular order, start with an empty HP-E set and then add them in the order you desire.

The Convention for pull-down menu sample selection lists

We use a common sample selection scheme when selecting a sample from a pull-down menu list. This sub-sample "By Source" option is only available if your database was set up to allow sub-sample source names in the Samples database.

For example, the By Source database-specific entries for the MGAP database include the following submenus.

The From list of all samples selection pops up a hybridized sample guesser dialog window. As with the gene name guesser, you can start typing in the name of a sample and it will give you a list of HPs that match that initial string. You then click on the sample you want and then press the Done button.

2.2.2 Swapping selected samples' (Cy3,Cy5) channels in ratio data dye-swap experiments

The Edit use (Cy5/Cy3) else (Cy3/Cy5) for each HP command may be used to selectively swap (Cy3,Cy5) data entries so that dye-swap samples may be used as replicates. Do this with care, since gene labeling efficiency is not always symmetric. The command is only available for ratio data. It swaps the data contained in MAExplorer (memory only) so that the Cy3 data is exchanged with the Cy5 data. For example, consider the case where there are two materials A and B hybridized in two experiments and labeled as follows: E1 (A=Cy3,B=Cy5) and E2 (A=Cy5,B=Cy3). Assuming uniform symmetric labeling (which is generally not the case, although it might be true for a subset of genes), one might average data from E1 and E2 if the data from E1 (or E2) were swapped. This is shown in the following figure.

Selectively swapping (Cy3,Cy5) for particular samples

Figure 2.2.4 Samples menu - selectively swapping (Cy3,Cy5) data channels for particular samples. This is only operative if your database contains Cy3/Cy5 ratio labeling data. This is useful in databases containing subsets of dye-swap experiments mixed in with other samples that are not dye-swapped.
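The E1/E2 example above can be written out as a short Java sketch. It is illustrative only: the class `DyeSwap` and its methods are hypothetical names, not MAExplorer code, and averaging the two ratios assumes the symmetric labeling that the text warns is generally not the case.

```java
/** Hypothetical illustration of dye-swap handling; not MAExplorer code. */
final class DyeSwap {

    /** Returns {newCy3, newCy5}: copies of the inputs with the channels exchanged. */
    static double[][] swapChannels(double[] cy3, double[] cy5) {
        return new double[][] { cy5.clone(), cy3.clone() };
    }

    /** Per-spot Cy3/Cy5 ratio. */
    static double[] ratio(double[] cy3, double[] cy5) {
        double[] r = new double[cy3.length];
        for (int j = 0; j < cy3.length; j++) {
            r[j] = cy3[j] / cy5[j];
        }
        return r;
    }

    /** Per-spot mean of the ratios from two experiments. */
    static double[] meanRatio(double[] r1, double[] r2) {
        double[] m = new double[r1.length];
        for (int j = 0; j < r1.length; j++) {
            m[j] = (r1[j] + r2[j]) / 2.0;
        }
        return m;
    }
}
```

For E1 (A=Cy3, B=Cy5) the ratio is already A/B. For E2 (A=Cy5, B=Cy3), calling `swapChannels` first puts material A back in the Cy3 position, so `ratio` again yields A/B and the two experiments can be passed to `meanRatio`.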


2.2.3 Viewing sample HP-X, HP-Y, and HP-E partitions

You set up the HP-X and HP-Y sample sets and the HP-E expression list of samples using the Chooser (above). The List HP-X & HP-Y sample 'sets' command lists the samples in the HP-X and HP-Y 'sets'. The List HP-E sample 'list' command lists the samples in the ordered HP-E 'list'.

2.2.4 Defining sample condition 'class' names

When using sets of conditions, the HP-X and HP-Y 'sets', you will probably want to assign meaningful names to these sets. The commands Define HP-X class name and Define HP-Y class name assign a free text line name for the sets.

    2.2.5 Toggling between single HP-X (-Y) samples and HP-X (-Y) sets

    When MAExplorer first starts up, it assumes that you wish to treat the data as single samples so that HP-X and HP-Y are assigned to single samples. However, if you want to work with sets of multiple samples then you must toggle the state using the Use HP-X & HP-Y 'sets' else single samples [CB] check box command. This toggles the state between treating the data as multiple samples (HP-X and HP-Y 'sets') or as single HP-X and HP-Y samples.

    2.2.6 Create and edit named condition lists of samples

    The command Choose named condition lists of samples lets you define new or edit existing named lists of hybridized samples called "Condition lists". Associated with each condition list is a set of annotation parameters to document the condition. The condition lists may be used in the (Edit | Sets of conditions) operations. Among other operations, you may assign any condition list to the working HP-X 'set', HP-Y 'set', or HP-E list of samples used throughout MAExplorer. The last condition list that was edited with the (Samples menu | Choose named condition lists of samples) command is called the "current condition" and may be used in various operations. Figure 2.2.5 shows a popup condition chooser session; its legend describes the options.

    Editing named condition lists of samples

    Figure 2.2.5 shows a screen illustrating a popup condition chooser session. The set of all samples in the database is in the scrollable "Remainder Samples" window in the upper left. The samples you have selected for the condition list being edited are shown in the upper right "Selected Samples in current condition" window. The list of all conditions in the database is in the lower left "List of Conditions" window. The current condition list that is selected is highlighted and its contents are displayed in the "Selected Samples" window. User-defined annotation associated with the current condition is displayed in the right "Current Conditioned Annotation" window. To add a new condition, click on the Add Cond button to define the new condition name. The Remove Cond button is used to delete a named condition list. The List Cond button pops up a report listing the samples and annotation for the current condition. The List All button pops up a report listing the names of all of the conditions and the annotation names. You may add or remove annotation names for all of the conditions. The Add Ann button will add the new annotation you enter into all conditions - you must enter the data for each condition that requires it. You may Save the current status of all of the conditions into your working database. If you press Cancel before saving, your edits will not be saved. Pressing the Done button will save the changes and pop down the window.

    2.2.7 Create and edit named ordered condition lists (OCL) of conditions

    The command Choose ordered condition lists of conditions lets you define new or edit existing named ordered lists of conditions called "Ordered Condition Lists" (OCL). Associated with each ordered condition list is a set of annotation (name, value) pairs to document the OCL. The last OCL that was edited with the (Samples menu | Choose ordered lists of conditions) command is called the "current OCL". The current OCL is used by the (Filter menu | Filter by current Ordered Condition List (OCL) F-Test [p-Value] slider [RB]) test. Figure 2.2.6 shows a screen illustrating a popup ordered condition list chooser session.

    Editing named ordered condition lists (OCL) of samples

    Figure 2.2.6 shows a screen illustrating a popup ordered condition list (OCL) chooser session. The set of all conditions in the database is in the scrollable "Remainder Conditions" window in the upper left. The conditions you have selected for the OCL being edited are shown in the upper right "Selected Conditions in current OCL" window. The list of all conditions in the database is in the lower left "List of Conditions" window. The current OCL that is selected is highlighted and its contents are displayed in the "Selected Conditions" window. User-defined annotation associated with the current OCL is displayed in the right "Current OCL Annotation" window. To add a new OCL, click on the Add OCL button to define the new OCL name. The Remove OCL button is used to delete a named OCL. The List OCL button pops up a report listing the conditions and annotation for the current OCL. The List All button pops up a report listing the names of all of the OCLs and the annotation names. You may add or remove annotation names for all of the OCLs. The Add Ann button will add the new annotation you enter into all OCLs - you must enter the data for each OCL that requires it. You may Save the current status of all of the OCLs into your working database. If you press Cancel before saving, your edits will not be saved. Pressing the Done button will save the changes and pop down the window.


    2.3 Edit menu

    The Edit menu includes operations to modify the 'edited gene list', which may be set from a variety of Filters as well as manually from this menu. The user may perform set operations (union, intersection, and difference) on named sets of genes and on sets of sample experiments (conditions). User preferences are also set in this menu.

    Sets of genes or HP condition lists are very useful for tracking complex data-mining sequences of analysis. For example, derived named gene sets may be used in successive data filters and for reports. As an illustration, one could do the following experiment given four different types of HPs (e.g. virgin, pregnancy, lactation, and involution).
    First compare two HPs using a statistical test such as a t-test. Then save the resulting set of genes under the name "virgin vs. pregnancy". Then compare the next two HPs and save the resulting genes under the name "lactation vs. involution". Finally, compute the difference: the genes found in "virgin vs. pregnancy" that are not found in "lactation vs. involution". This resulting gene set could then be saved (e.g. with a name "Genes found in virgin vs. pregnancy, but not in lactation vs. involution"). Similarly, taking the intersection of these two named sets shows genes that are common to the two sets. Taking the union shows genes found in either of the two named sets.

    The Edit menu contains the following main selections. All of these entities and preferences are saved as part of the startup state when you do a (File | Databases | SaveAs ... DB).

    2.3.1 User edited gene list - the 'Edited Gene List' menu

    You may define and edit arbitrary sets of genes using the User edited gene list submenu to modify the 'Edited Gene List' (EGL). This has sub-modes of operation for adding or removing genes from the image by clicking on spots. If the Show 'Edited Gene List' mode is set, you may see exactly which genes you have defined by the magenta squares drawn around each gene in the EGL. Many of the clustering operations will leave the current cluster in the EGL. The commands include:

    This gives you the functionality of adding and deleting genes from a user defined list of genes to be analyzed. The EGL may be used with the gene-set operations discussed in Section 2.3.2. You may also define genes in the EGL using the "Gene Name Guesser" shown in Figure 2.3.1.

    Gene 'Guesser' used to find a particular gene or subset of genes by key phrase

    Figure 2.3.1 Edited Gene List defined from the Gene Name Guesser using wildcards. The Edited Gene List was defined as the set of genes whose names contain the sub-string "onco". The sub-string was specified to the popup guesser window as "*onco*" using '*' characters as wildcard symbols indicating that it should match any or no characters. The button Gene Name may be toggled through a set of other identifiers including Clone ID, UniGene ID, dbEST 3', dbEST 5', GenBank 3', GenBank 5', LocusID, etc. depending on what identifiers are available in your database. The user then pressed the Set E.G.L. button on the guesser window, which sets the E.G.L. to those genes. If you have enabled the View menu "Show 'edited gene list'" option, then the genes in the EGL are shown as magenta squares in the pseudoarray image. You may do additional editing to manually add or remove genes that you want to change in the set. If a 2D scatter plot was being used, EGL labeled genes would appear there as well. To select a particular gene as the current gene, click on the gene you want in the list, then press the Done button.
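The wildcard behavior described in this legend ('*' matching any or no characters) can be sketched in Java with the standard `java.util.regex` package. This is a minimal illustration rather than the guesser's actual implementation: the class `WildcardGuesser` is a hypothetical name, and case-insensitive matching is an assumption made here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical sketch of '*' wildcard matching; not the MAExplorer guesser itself. */
final class WildcardGuesser {

    /** Converts a pattern such as "*onco*" to a regex; '*' matches zero or more characters. */
    static Pattern toRegex(String wildcard) {
        String[] literals = wildcard.split("\\*", -1); // -1 keeps leading/trailing empty pieces
        StringBuilder regex = new StringBuilder();
        for (int i = 0; i < literals.length; i++) {
            if (i > 0) {
                regex.append(".*");
            }
            regex.append(Pattern.quote(literals[i])); // treat everything else literally
        }
        // Assumption: matching ignores alphabetic case.
        return Pattern.compile(regex.toString(), Pattern.CASE_INSENSITIVE);
    }

    /** Returns the names that match the whole wildcard pattern, in input order. */
    static List<String> match(List<String> names, String wildcard) {
        Pattern p = toRegex(wildcard);
        List<String> hits = new ArrayList<>();
        for (String name : names) {
            if (p.matcher(name).matches()) {
                hits.add(name);
            }
        }
        return hits;
    }
}
```

Under these assumptions, `match(names, "*onco*")` keeps every name containing "onco" anywhere, while a pattern without wildcards must equal the whole name.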


    2.3.2 Sets of genes menu

    These commands let you do comparisons of sets of genes generated under different criteria. In addition, you may compute derived gene sets from existing gene sets using set operations (OR, AND, DIFFERENCE). You may also normalize the data by a gene subset. The user may save the genes defined by: 1) the Filter, or 2) the manually defined 'Edited Gene List'. The gene set resulting from a binary gene set operation OR (union), AND (intersection), or DIFFERENCE is saved in a new named gene set. The set difference (A-B) is defined as the genes in set A that are not in set B. Genes in set B that are not in set A are ignored. The 'User Filter Gene Set' may be set to any gene set and may then be used as part of the gene Filter cascade. The 'User Normalization Gene Set' may be set to any gene set and may then be used to normalize gene intensity values across hybridized samples. (See normalization algorithm for more information on this method.)
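The three binary operations map directly onto the standard Java `java.util.Set` methods. The sketch below is illustrative only: the class `GeneSetOps` is a hypothetical name, and representing a gene set as a set of name strings is an assumption, not a description of how MAExplorer stores gene sets.

```java
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical illustration of gene set operations; not MAExplorer code. */
final class GeneSetOps {

    /** OR (union): genes found in either set. */
    static Set<String> union(Set<String> a, Set<String> b) {
        Set<String> result = new TreeSet<>(a); // copy so the inputs are not modified
        result.addAll(b);
        return result;
    }

    /** AND (intersection): genes found in both sets. */
    static Set<String> intersection(Set<String> a, Set<String> b) {
        Set<String> result = new TreeSet<>(a);
        result.retainAll(b);
        return result;
    }

    /** DIFFERENCE (A-B): genes in A that are not in B; genes only in B are ignored. */
    static Set<String> difference(Set<String> a, Set<String> b) {
        Set<String> result = new TreeSet<>(a);
        result.removeAll(b);
        return result;
    }
}
```

Each method returns a new set, which parallels saving the result of an operation under a new gene set name while leaving the operand sets unchanged.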

    If you are running MAExplorer in stand-alone mode, the current named gene sets are saved when you save the DB using the Save disk DB or Save as disk DB selections in the Databases submenu of the File menu. The gene sets are saved in a State sub directory as ".cbs" files and are used to restore the gene sets when restarting MAExplorer on a .mae startup file. The .mae startup file saves the names of the .cbs files that are shared among the various startup files for a given project. The implication then is that if you change and save a gene set in one startup database, it will change in other startup databases when they load that gene set. The advantage is that different startup databases may view a gene set produced by another database. The Sets of genes operations in the Edit menu include:

    The following is an example of List saved gene sets state listing the catalog of named gene subsets in some of the MGAP data. Note that sets #1 to #11 are fixed by the data in the GIPO file and may not be changed by the user. Sets #12 to #14 are assignable from other sets or in the case of the E.G.L, by various MAExplorer operations. Sets #1 through #14 may not be removed whereas #15 and higher may be removed.

       User Gene Sets
       Set# |#genes| title
       =======================
        #1 |1727| ALL GENES
        #2 |394|  ALL NAMED GENES
        #3 |246|  ESTs similar to genes
        #4 |456|  ESTs
        #5 |1096| All genes and ESTs
        #6 |1681| Good genes
        #7 |40|   Replicate genes
        #8 |0|    HousekeepingGenes
        #9 |96|   Calibration DNA
        #10 |77|  Your plates
        #11 |46|  Empty wells
       --------- User Assignable ----------
        #12 |0|   User Filter Gene Set
        #13 |60|  Edited Gene List
        #14 |0|   Normalization Gene Set
       --------- User definable------------
        #15 |60|  The 60 genes closest to Carbonic Anhydrase-III
        #16 |30|  Named genes in the 60 genes closest to CA-III
        #17 |4|   Replicate genes in the 60 genes closes to CA-III
    

    The following figure illustrates selecting sets by name for gene set operations.

    Gene subsets may be computed from other gene subsets using Boolean operations

    Figure 2.3.2 Selection of gene sets for binary gene set operations. This example computes the Boolean AND of two sets "ALL NAMED GENES" and "60 genes closest to CA-III from Named and Ests", and then the AND of the "Replicates" with the previous result. The first result is saved in the set called "The 60 genes closest to Carbonic Anhydrase-III". The second result is saved in the set called "Named genes in the 60 genes closest to CA-III". Finally, the third result is saved in the set named "Replicate genes in the 60 genes closes to CA-III".


    2.3.3 Sets of sample conditions menu

    In addition, MAExplorer can operate on sets of hybridized samples. For example, a sample set might consist of replicate hybridized samples from the same biological sample, or of repeated experiments on different samples of the same type. (One must be careful in mixing data between the two cases because of the different expected sources of variance.) This means you can treat multiple replicate samples as a distribution and compare the mean values for each gene in one set of samples with the mean values for another set of samples. We call these sets of hybridized samples condition lists or HP lists. You may put one or more HP samples into a condition set. These sets in turn can be used for computing statistics on clonal differences between different condition sets. Note that each condition set may have multiple (i.e. different) samples. These condition sets are saved with the user state when doing a (File | Databases | SaveAs DB). As with sets of genes, there are a number of operations to manipulate HP condition sets in the Sets of Conditions menu, which includes:

    The following is an example of List saved HP condition lists state listing the catalog of named HP condition lists.

       Condition Lists
       ===============
       Condition[1] #HPs 2, [Initial HP-X: C57B6 pregnancy day 13]
       Condition[2] #HPs 2, [Initial HP-Y: Stat5a (-,-) pregnancy day 13]
       Condition[3] #HPs 4, [Initial HP-E expression list]
    
    

    The following is an example of List contents of saved HP condition list state.

       Condition List #1 [Initial HP-X: C57B6 pregnancy day 13]
       ====================================
       HP[1] Pregnancy 13  (1 hr) [C57B6-p13-totalRNA5ug]
       HP[2] Pregnancy 13  (1 hr) [C57B6-p13.2poly-A]
    
    

    2.3.4 Setting user preferences menu

    The Preferences submenu is used to set various data labels, statistical limits and other parameters. These include:

    The Font Family submenu is used to set the text font family. This may be useful if your computer is missing some fonts or some fonts are easier to read than others. Note: some fonts may not work well on your computer. If this is the case, try another font. When you save the data mining session with the "SaveAs file DB", it also saves the font you have set. For some plots or popup text-windows, you may have to regenerate the popup window to see the font changes.

    
    

    Popup window allowing you to adjust all threshold slider values

    Figure 2.3.4.1 Popup window allowing you to adjust all threshold slider values. The Adjust all Filter threshold scrollers command allows you to pre-adjust all threshold slider values used in data filtering and in clustering. It may be easier to set the approximate range before invoking the clustering operation because changing a parameter will recluster your data.


    
    

    Dialog query to change HP-X Condition set name

    The Define HP-X (HP-Y) class name command may be used to change the names of the HP-X (HP-Y) experimental condition sets. These names are used in various labels in the main window, popup plots and reports, etc. The commands to change various names of database components are in the Preferences submenu in the Edit menu.


    
    


    2.4 Analysis menu

    The Analysis menu (see Figure 1.4) contains an ordered list of six primary menus that may be used, in that order, to perform an initial analysis. In more complex analyses, the sequence of operations will vary and include commands selected from other menus, or will use these menus in a different order. The Analysis submenus are as follows:

    2.4.1 GeneClass Menu
    2.4.2 Normalization Menu
    2.4.3 Filter Menu
    2.4.4 Plot Menu
    2.4.5 Cluster Menu
    2.4.6 Report Menu
    
    
    

    The 'Analysis' submenus are used for most data-mining operations

    Figure 2.4 MAExplorer main window with Analysis Menu. The menu structure of MAExplorer was designed to allow users to quickly perform commonly used data-mining operations as a first approximation analysis.



    2.4.1 GeneClass menu

    A gene class (e.g. all named genes, ESTs, oncogenes, etc.) is a set of genes that belongs to the class of genes in the universe of genes in the particular microarray database. MAExplorer may restrict the set of genes by "Gene Class" membership (currently includes All Genes, All named Genes, ESTs similar to genes, Unknown ESTs, All genes and ESTs, Good genes, Replicate genes (i.e. with more than one copy of the gene in the array), Calibration DNA, genes from user's plates). The additional Gene Class list of names depends on its availability in a specific database.

    GeneClass menu used for selecting a particular subset of genes by gene class membership

    Figure 2.4.1.1 Gene Class menu. The user may select a subset of genes that belong to one of the classes of genes. This shows the user selecting the set of "All named genes", which are indicated with red (white) circles over the spots in the array intensity (ratio) pseudoarray image.


    The GeneClass 'Replicate genes' showing all genes occurring more than once in the array

    Figure 2.4.1.2 Example of all replicated genes occurring more than once in the array. This was selected by using the GeneClass 'Replicate genes'. You may use the data Filter "Filter by genes with replicates" instead of the GeneClass. This has the advantage that you may use other GeneClasses (e.g. ESTs, or All named genes, etc.). Alternatively, you can find all of the replicates for a particular gene as follows: 1) use the Gene Guesser to find the particular gene you want; 2) press "Set E.G.L." to save it as an Edited Gene List; 3) enable the Filter "Filter by E.G.L." at the same time. This will show all occurrences of that gene.


    The set of all genes comprises a number of different gene classes. It is possible to restrict the subsequent analysis to a particular subset of these genes called a gene class. The GeneClass menu includes operations to select the current set of genes to analyze from the set of all genes by their membership in a gene class.

    Some of the above gene classes are deduced from the gene name supplied with the Gene In Plate Order (GIPO) file for the array. We use the following automatic classification rules shown in Table 2.4.1.

    
    
    

    Table 2.4.1 Rules for the automatic classification of gene names into the default Gene Class sets. The gene name is analyzed alphabetic-case independently.

       Gene class            | Rule for class membership
       ======================================================================
       All genes             | all genes on the array
       All named genes       | not starting with "EST"
       ESTs similar to genes | genes starting with "EST,"
       ESTs                  | genes with the name "EST"
       Replicate genes       | genes with multiple copies
       Calibration DNA       | genes using the configuration file name "calibDNAname" (optional - see Appendix Table C.4.1)
       Your plates           | clones using the configuration file name "yourPlates" (optional - see Appendix Table C.5.1-C)
       Empty Wells           | empty wells where no spot exists on the array, indicated by keywords "empty", "empty well" or "EmptyWell" (optional - see Appendix Table C.5.1-C)
       Good Genes            | spots on the array where the GIPO QualCheck data was used and was valid. If it was not used, then it assumes all spots are good. (optional - see Appendix Table C.4.1)
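The name-based rules in Table 2.4.1 can be sketched as a small Java classifier. This is a hedged illustration, not MAExplorer's code: the class `GeneNameClassifier`, the enum values, and the decision to return a single class per name (with the empty-well check first) are assumptions made for the example. The table does not say how names that start with "EST" but match neither EST rule are handled, so the sketch reports them as unclassified.

```java
import java.util.Locale;

/** Hypothetical sketch of the name-based rules in Table 2.4.1; not MAExplorer code. */
final class GeneNameClassifier {

    enum GeneClass { EMPTY_WELL, EST, EST_SIMILAR_TO_GENE, NAMED_GENE, UNCLASSIFIED }

    /** Classifies a gene name without regard to alphabetic case. */
    static GeneClass classify(String geneName) {
        String n = geneName.trim().toLowerCase(Locale.ROOT);
        // Assumption: empty-well keywords are checked before the EST rules.
        if (n.equals("empty") || n.equals("empty well") || n.equals("emptywell")) {
            return GeneClass.EMPTY_WELL;
        }
        if (n.equals("est")) {
            return GeneClass.EST;                 // name is exactly "EST"
        }
        if (n.startsWith("est,")) {
            return GeneClass.EST_SIMILAR_TO_GENE; // name starts with "EST,"
        }
        if (!n.startsWith("est")) {
            return GeneClass.NAMED_GENE;          // name does not start with "EST"
        }
        // Starts with "EST" but matches neither EST rule; the table does not cover this case.
        return GeneClass.UNCLASSIFIED;
    }
}
```

The remaining classes in the table (replicates, calibration DNA, your plates, good genes) depend on array layout or configuration data rather than the name alone, so they are left out of this sketch.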

    
    

    2.4.1.1 GeneClass ontology subsets [Future]

    If the Set Gene Class subset were activated, it might include categories such as the following. If the categories exist and the data is made available to MAExplorer, then it is possible to specify gene subsets by Gene Class name. It is the responsibility of the database creator to define a mapping table supporting these named subsets of named genes.

    • Apoptosis
    • Bcl-2 family
    • Cdk inhibition
    • Cell adhesion
    • Cell cycle
    • Cell surface
    • Chemokines
    • Cyclins
    • Cytokines
    • Cytoskeleton
    • DNA binding
    • DNA recomb
    • DNA repair
    • DNA synthesis
    • G-proteins
    • Growth factors
    • Heat shock
    • Interferons
    • Interleukins
    • Ion channel
    • Milk Genes
    • Motility
    • Oncogenes
    • Proteases
    • Protein turnover
    • Receptors
    • Signal transduction
    • Stress response
    • Transcription
    • Transport
    • Tumor suppressors

    
    

    2.4.1.2 Simulating Gene Class ontologies using Gene Set operations

    You can effectively implement ontology subsets for Gene Class subsets using the following procedure. The trick is to repeatedly define an E.G.L. gene subset using the gene name guesser to find the genes of interest, edit out genes you don't want, and save the result as a named gene subset. Then repeatedly take the OR of gene sets of interest, saving the result as a new named set, take the OR of another gene set with the set you just created, and so on.

    Procedure

    
    
    1. Use the Gene Name Guesser (see Figure 2.3.1) to collect genes belonging to a particular ontology. Pressing "Set E.G.L" puts the genes from the guesser into the EGL set. You can then delete any genes that don't belong by clicking on the gene in the pseudoarray image with the SHIFT key pressed.
    2. Save the EGL as a named gene set with a meaningful name representing the ontology construct using the (Edit | Save 'Edited Gene List' as named gene set) command (see Section 2.3.2 for gene set editing commands).
    3. Repeat steps [1] and [2] to gather additional sets.
    4. Merge appropriate gene sets to get the complete ontology constructs using the (Edit | OR (Union) of 2 gene sets) as required.
    5. Finally, set the (Edit | Assign 'User Filter Gene Set') to select a particular gene class to use, and then
    6. Enable the data Filter by setting (Filter | Filter by 'User Filter Gene Set' membership) (Section 2.4.3).
    
    
    
    
    
    


    2.4.2 Normalization menu

    The Normalization menu includes operations to normalize gene intensity data between hybridized samples. This is critical for comparing samples because of differences in the amount of sample, labeling efficiency and variations in scanner operation including gain and baseline settings. There are several methods available including normalizing by Zscore, median, log mean, Zscore of logs, calibration DNA, housekeeping genes, etc. The specific microarray image quantification is determined by the image analysis program being used to pre-process the arrays.

    Note: although this set of normalization methods is limited, it is adequate for some analyses of the data. We are in the process of adding more normalization methods through MAEPlugin methods.

    2.4.2.1 Intensity background correction

    The background intensity data from the spot quantification programs may be used to correct spot intensity. Background may be specified as either a global value or on a per-spot basis. If the array images have low background, then the lack of background values may not be a serious problem.

    Some quantification software (e.g. Research Genetics' Pathways 2.01) measures background globally as: BGLow (low background), BGAvg (average background), BGRms (root mean square background). For MGAP, MAExplorer uses the BGLow value when you request background subtraction. These values are read from the MAExplorer Samples DB file (see Appendix Table C.2.1.1). For other quantification programs, background may be available on a per-spot basis in the quantification files. If the latter is available in your data, it will be used if background correction is enabled (see Appendix C.3).

    The background corrected intensity I'ij is computed from the raw intensity Iij and background intensity bkgrdHPi for H.P. i and spot j as follows:

          I'ij = Iij - bkgrdHPi
    

    Ratio computation for Cy3 and Cy5 data

    For most MAExplorer operations, the intensity of a gene is generally computed as the mean intensity of the spots (background corrected or not) which duplicate that gene on the microarray. When working with dual hybridized samples using Cy3- and Cy5-dUTP labeling, which results in green and red fluorescence, the Cy3/Cy5 ratio can be used to self-normalize the intensity for each hybridized clone array. If local background is available, then the ratio can be computed for HP h and spot j as
      (Cy3hj - BkgrdCy3hj) / (Cy5hj - BkgrdCy5hj) 
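
    The background correction and the background corrected Cy3/Cy5 ratio described above can be sketched in Java as follows. This is only an illustration under stated assumptions: the class and method names are hypothetical and are not part of the MAExplorer API, and returning NaN for a non-positive corrected Cy5 value is a choice made for this sketch, since the manual does not say how such spots are handled.

```java
// Illustrative sketch only: BackgroundSketch is a hypothetical name, not MAExplorer API.
final class BackgroundSketch {

    /** Background corrected intensity: raw spot intensity minus background. */
    static double correct(double rawIntensity, double background) {
        return rawIntensity - background;
    }

    /** Cy3/Cy5 ratio of the background corrected channels for one spot. */
    static double cy3Cy5Ratio(double cy3, double bkgrdCy3, double cy5, double bkgrdCy5) {
        double denominator = correct(cy5, bkgrdCy5);
        if (denominator <= 0.0) {
            // Assumption of this sketch: the ratio is undefined for such spots.
            return Double.NaN;
        }
        return correct(cy3, bkgrdCy3) / denominator;
    }

    public static void main(String[] args) {
        System.out.println(correct(1200.0, 200.0));
        System.out.println(cy3Cy5Ratio(1200.0, 200.0, 700.0, 200.0));
    }
}
```

    The same correct() helper works for a global background (pass the same value for every spot) or a per-spot background (pass the value read for that spot).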
      

    2.4.2.2 Normalization between microarrays to allow comparison

    The normalization of quantitative data is crucial when comparing data between different microarray samples. There are a number of different schemes possible. One is to normalize by the sum of known calibration, housekeeping genes or other "constant expression" genes in the microarray. Another is to sum the background corrected integrated density for all spots in an array and to normalize individual gene measurements by that sum. These methods are now described in more detail. As the MAEPlugins facility becomes available, we will be adding a number of more sophisticated gene-specific normalization methods that take many of the problems specific to microarrays into account.

    Normalizing by scaled Zscore of intensity

    The "normalized Zscore of intensity" method normalizes each hybridized sample by the mean and standard deviation of the raw intensities for all of the spots in that sample. The mean intensity mnI_i and the standard deviation sdI_i are computed for the raw intensity of 'Good genes'. It is useful for standardizing the mean (to 0.0) and the range of data between hybridized samples to about -4.0 to +4.0. When using the Zscore, you compute Zdiff(erences) not ratios. The Zscore intensity Zscore_{ij} for intensity I_{ij} for HP i and spot j is computed as
         Zscore_{ij} = (I_{ij} - mnI_i) / sdI_i,
    and
         Zdiff_j(x,y) = Zscore_{xj} - Zscore_{yj}.
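
    A minimal Java sketch of the Zscore and Zdiff computation is given here. It assumes that the array passed in already holds only the 'Good genes' intensities of one sample and that the sample standard deviation (dividing by n - 1) is wanted; the manual does not state which form MAExplorer uses. The class and method names are hypothetical.

```java
// Illustrative sketch only: ZscoreSketch is a hypothetical name, not MAExplorer API.
final class ZscoreSketch {

    /** Zscore_ij = (I_ij - mnI_i) / sdI_i for every spot j of one sample i. */
    static double[] zscores(double[] intensities) {
        int n = intensities.length;
        double sum = 0.0;
        for (double v : intensities) {
            sum += v;
        }
        double mean = sum / n;

        double sumSq = 0.0;
        for (double v : intensities) {
            sumSq += (v - mean) * (v - mean);
        }
        // Assumption: sample standard deviation (divide by n - 1).
        double sd = Math.sqrt(sumSq / (n - 1));

        double[] z = new double[n];
        for (int j = 0; j < n; j++) {
            z[j] = (intensities[j] - mean) / sd;
        }
        return z;
    }

    /** Zdiff_j(x,y) = Zscore_xj - Zscore_yj for spot j. */
    static double zdiff(double[] zscoresX, double[] zscoresY, int j) {
        return zscoresX[j] - zscoresY[j];
    }

    public static void main(String[] args) {
        double[] zx = zscores(new double[] {1.0, 2.0, 3.0});
        double[] zy = zscores(new double[] {30.0, 20.0, 10.0});
        System.out.println(zdiff(zx, zy, 0));
    }
}
```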
    

    Normalizing by the median of intensity

    The "Median intensity" method normalizes each hybridized sample by the median of the raw intensities of 'Good genes' for all of the spots in that sample. It is a useful normalization to use when you want to compute X/Y ratios between hybridized samples.
         Im_{ij} = I_{ij} / medianI_i
    

    Normalizing by the log of median of intensity

    The "Log median intensity" method normalizes each hybridized sample by the log of median scaled raw intensities of 'Good genes' for all of the spots in that sample. The value 1.0 is added to the intensity value to avoid taking the log(0.0) when intensity has zero value. This is a useful normalization to use when you want to compute X/Y ratios between hybridized samples and compress the scale. Because we are computing a log, we report the difference between HP-X and HP-Y as (X-Y) instead of a ratio (X/Y).
         Im_{ij} = log(1.0 + (I_{ij} / medianI_i))
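
    The median and log median normalizations can be sketched together in Java, since the second only wraps the first in log(1.0 + x). The sketch assumes the natural logarithm (the manual does not name the base), assumes the input holds only the 'Good genes' intensities of one sample, and uses hypothetical class and method names.

```java
import java.util.Arrays;

// Illustrative sketch only: MedianNormSketch is a hypothetical name, not MAExplorer API.
final class MedianNormSketch {

    /** Median of the values; the mean of the two middle values when the count is even. */
    static double median(double[] values) {
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        int n = sorted.length;
        return (n % 2 == 1) ? sorted[n / 2] : 0.5 * (sorted[n / 2 - 1] + sorted[n / 2]);
    }

    /** Im_ij = I_ij / medianI_i */
    static double[] medianNormalize(double[] intensities) {
        double med = median(intensities);
        double[] out = new double[intensities.length];
        for (int j = 0; j < out.length; j++) {
            out[j] = intensities[j] / med;
        }
        return out;
    }

    /** Im_ij = log(1.0 + I_ij / medianI_i); natural log is an assumption. */
    static double[] logMedianNormalize(double[] intensities) {
        double med = median(intensities);
        double[] out = new double[intensities.length];
        for (int j = 0; j < out.length; j++) {
            out[j] = Math.log(1.0 + intensities[j] / med);
        }
        return out;
    }

    public static void main(String[] args) {
        double[] sample = {10.0, 40.0, 20.0};
        System.out.println(Arrays.toString(medianNormalize(sample)));
        System.out.println(Arrays.toString(logMedianNormalize(sample)));
    }
}
```

    Adding 1.0 inside the log keeps a zero intensity at 0.0 after normalization instead of producing negative infinity.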
    

    Normalizing by scaled Zscore of log intensity, standard deviation

    The "Normalize by Zscore of log intensity, stdDev" method normalizes each hybridized sample by the mean and standard deviation of the logs of the raw intensities for all of the spots in that sample. The mean log intensity mnLI_i and the standard deviation log intensity sdLI_i are computed for the log of raw intensity of 'Good genes'. Then the Zscore intensity ZlogS_{ij} for HP i and spot j is
         ZlogS_{ij} = (log(I_{ij}) - mnLI_i) / sdLI_i
    

    Normalizing by scaled Zscore mean absolute deviation of log intensity

    The "Normalize by Zscore of log intensity, mean absolute deviation" method normalizes each hybridized sample by the mean and mean absolute deviation of the logs of the raw intensities for all of the spots in that sample. The mean log intensity mnLI_i and the mean absolute deviation log intensity madLI_i are computed for the log of raw intensity of 'Good genes'. Then the Zscore intensity ZlogA_{ij} for HP i and spot j is
         ZlogA_{ij} = (log(I_{ij}) - mnLI_i) / madLI_i
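
    The two Zscore-of-log variants differ only in the spread estimate used as the divisor, which the following Java sketch makes explicit. It assumes natural logs, strictly positive 'Good genes' intensities, the sample standard deviation (n - 1) for the stdDev variant, and the mean of absolute deviations from the mean for the other variant; none of these details are specified in the manual. The names are hypothetical.

```java
// Illustrative sketch only: ZlogSketch is a hypothetical name, not MAExplorer API.
final class ZlogSketch {

    /**
     * Returns (log(I_ij) - mnLI_i) / spread for every spot j of one sample,
     * where spread is madLI_i if useMeanAbsDev is true and sdLI_i otherwise.
     */
    static double[] zlog(double[] intensities, boolean useMeanAbsDev) {
        int n = intensities.length;
        double[] logs = new double[n];
        double sum = 0.0;
        for (int j = 0; j < n; j++) {
            logs[j] = Math.log(intensities[j]); // assumes intensity > 0
            sum += logs[j];
        }
        double mean = sum / n;

        double acc = 0.0;
        for (double v : logs) {
            double d = v - mean;
            acc += useMeanAbsDev ? Math.abs(d) : d * d;
        }
        // Mean absolute deviation, or sample standard deviation (n - 1).
        double spread = useMeanAbsDev ? acc / n : Math.sqrt(acc / (n - 1));

        double[] z = new double[n];
        for (int j = 0; j < n; j++) {
            z[j] = (logs[j] - mean) / spread;
        }
        return z;
    }

    public static void main(String[] args) {
        double[] sample = {1.0, Math.exp(1.0), Math.exp(2.0)};
        System.out.println(zlog(sample, false)[2]);
        System.out.println(zlog(sample, true)[2]);
    }
}
```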
    

    By 'User Normalization Gene Set'

    This method is useful when a subset of genes has been determined to have relatively constant expression across the set of samples. It normalizes by the sum of intensities for a subset of genes defined by the user in the 'User Normalization Gene Set' (Section 2.3.2) using the gene set editing commands. Normalizing by the sum of genes uses the sum Igs_i that is computed for microarray HP_i from the intensities I_{ij} for all genes j in the gene subset.
         Igs_i = Sum_{genes j in the gene subset} I_{ij}

    Then, the normalized intensity I'_{ij} is computed as:
          I'_{ij} = I_{ij} / Igs_i
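
    As a rough illustration, the gene set normalization amounts to one sum and one division per spot. In the Java sketch below, the gene subset is represented as an array of spot indices, which is an assumption made for the example; the class and method names are hypothetical.

```java
import java.util.Arrays;

// Illustrative sketch only: GeneSetNormSketch is a hypothetical name, not MAExplorer API.
final class GeneSetNormSketch {

    /**
     * Divides every intensity of one sample by Igs_i, the sum of the
     * intensities of the genes in the normalization gene subset.
     */
    static double[] normalizeByGeneSet(double[] intensities, int[] geneSetIndices) {
        double geneSetSum = 0.0;
        for (int j : geneSetIndices) {
            geneSetSum += intensities[j];
        }
        double[] out = new double[intensities.length];
        for (int j = 0; j < out.length; j++) {
            out[j] = intensities[j] / geneSetSum;
        }
        return out;
    }

    public static void main(String[] args) {
        double[] sample = {100.0, 300.0, 600.0, 50.0};
        int[] geneSet = {0, 1}; // hypothetical "constant expression" genes
        System.out.println(Arrays.toString(normalizeByGeneSet(sample, geneSet)));
    }
}
```

    Because all genes in the sample are divided by the same sum, ratios between genes within a sample are unchanged; only the scale between samples is adjusted.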
    

    By 'Calibration DNA' set

    If a predefined set of calibration DNA genes is available on the array, it may be used to normalize density values between the samples. The calibration DNA genes are defined by special gene names that are declared in the Configuration file using the 'calibDNAname' parameter (see Appendix C Table C.5.1(C)). If there is no calibration DNA, this entry is not used. The algorithm is the same as "User Normalization Gene Set" (above), but the set is predefined as the genes flagged as calibration DNA. For example, in the MGAP database, these spots are the "mouse genomic DNA" spots so the Configuration file entry would be calibDNAname="m.g. DNA".

    Scaling intensity data to 65K

    Another method, "Scale intensity data to 65K", scales the maximum intensity of each sample to 65K (the maximum intensity). Since the raw scanned data is often 16 bits, it can have a maximum value of 65535 (2^16 - 1), so this does minimal scaling. This method may make it easier to view the data initially using the pseudoarray image. However, it may not properly scale the data between arrays and should probably not be used in quantitative comparisons.
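
    A possible reading of this scaling, sketched in Java, is a linear rescale so that the largest intensity in the sample becomes 65535. The linear form and the exact target value of 65535 are assumptions, as are the hypothetical class and method names; the sketch also assumes that the maximum intensity is positive.

```java
import java.util.Arrays;

// Illustrative sketch only: Scale65kSketch is a hypothetical name, not MAExplorer API.
final class Scale65kSketch {

    static final double MAX_16_BIT = 65535.0; // 2^16 - 1

    /** Linearly rescales one sample so that its maximum intensity maps to 65535. */
    static double[] scaleTo65k(double[] intensities) {
        double max = Double.NEGATIVE_INFINITY;
        for (double v : intensities) {
            max = Math.max(max, v);
        }
        double[] out = new double[intensities.length];
        for (int j = 0; j < out.length; j++) {
            out[j] = intensities[j] * MAX_16_BIT / max; // assumes max > 0
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(scaleTo65k(new double[] {100.0, 500.0, 1000.0})));
    }
}
```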

    No normalization

    You may also want to look at the raw intensity (or Cy3 and Cy5 channel) data. Turning off normalization gives you the raw data read into MAExplorer.

    2.4.2.3 Using different normalizations to 'see' different data views

    Changing the normalization method will sometimes make differences between data sets more apparent. The following figure shows the same data in two different scatter plots but with two different normalizations.

    A) Scatter plot using the Median normalization

    B) Scatter plot using the Log Zscore normalization

    Figure 2.4.2.3 Scatter plot of HP-X and HP-Y 'sets' data. HP-X is C57B6 pregnancy day 13 and HP-Y is Stat5a (-,-) pregnancy day 13 filtered by "All named genes and ESTs". A) A scatter plot using the Median normalization. B) A scatter plot using the Zscore of the logs normalization. Notice how the Casein alpha outlier is more apparent in the case of the Zscore log normalization. The skewed plot is characteristic of much microarray data. Some normalization methods (not currently included in MAExplorer) can compensate for some of these artifacts (Dutoit, 2000) and are planned for future MAEPlugins.


    
    


    2.4.3 Filter menu

    The final set of genes presented for display, plotting, reports, etc. is determined by a cascade of gene "data filters" that generate a restricted gene set. The cascade is computed in real-time using the intersection of individual criteria and tests selected by the user. Examples of Filter criteria include: membership in a particular gene set, ratio (HP-X/HP-Y) within a range, passing statistical tests such as t-tests or F-test, etc.

    Filter menu and Ratio filter submenu showing options and Preference State sliders of active filters

    Figure 2.4.3 Filter menu. The Filter menu is a cascade of data filters that restricts the set of genes to those passing all enabled filters under the criteria set for those filters. This figure shows the GeneClass filter set to "All genes and ESTs", with the spot CV filter and Ratio (X/Y) range filters being set interactively by the scroll bars on the right. The genes that pass the filter are indicated with a red (white) circle in the array intensity (ratio) pseudoarray image.


    The Filter menu options are used to restrict the set of genes by pre-filtering the data with a series of cascaded filter criteria and tests. The resulting subset of genes passing the filter is then used in the plots, reports and other data analysis methods. Some of the filters require additional parameters that are set by the State scrollers. The user is automatically prompted for changes to these scrollers (a threshold scrollers window pops up) when the filter is activated or changed. These values may also be set from the Adjust all Filter threshold scrollers entry in the Preferences submenu in the Edit menu. The filters are broken up into subgroups in the following menu, with the grouping based on the type of criteria (i.e. gene set membership, data range, or statistical tests).
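
    The idea of a filter cascade computed as the intersection of the active criteria can be sketched in Java with a list of predicates. This is a conceptual illustration only: the Gene record, its fields, the thresholds, and all names are hypothetical and do not reflect how MAExplorer stores genes or implements its filters.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Illustrative sketch only: these names are hypothetical, not MAExplorer API.
final class FilterCascadeSketch {

    /** Minimal per-gene record with mean normalized intensities for HP-X and HP-Y. */
    static final class Gene {
        final String name;
        final double meanX;
        final double meanY;

        Gene(String name, double meanX, double meanY) {
            this.name = name;
            this.meanX = meanX;
            this.meanY = meanY;
        }

        double ratio() {
            return meanX / meanY;
        }
    }

    /** Keeps only the genes that pass every active filter (intersection of criteria). */
    static List<Gene> applyCascade(List<Gene> genes, List<Predicate<Gene>> activeFilters) {
        List<Gene> passed = new ArrayList<>();
        for (Gene gene : genes) {
            boolean passesAll = true;
            for (Predicate<Gene> filter : activeFilters) {
                if (!filter.test(gene)) {
                    passesAll = false;
                    break;
                }
            }
            if (passesAll) {
                passed.add(gene);
            }
        }
        return passed;
    }

    public static void main(String[] args) {
        List<Gene> genes = new ArrayList<>();
        genes.add(new Gene("geneA", 400.0, 100.0));
        genes.add(new Gene("geneB", -5.0, 100.0));
        genes.add(new Gene("geneC", 120.0, 100.0));

        List<Predicate<Gene>> filters = new ArrayList<>();
        filters.add(g -> g.meanX > 0.0 && g.meanY > 0.0);        // positive intensity
        filters.add(g -> g.ratio() >= 2.0 && g.ratio() <= 10.0); // ratio inside a range

        for (Gene g : applyCascade(genes, filters)) {
            System.out.println(g.name);
        }
    }
}
```

    Enabling or disabling a filter corresponds to adding or removing one predicate, and moving a threshold scroller corresponds to rebuilding that predicate with a new threshold before re-running the cascade.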

    The Filter by positive intensity data submenu filter contains options that specify which spot intensity values are to be considered when excluding negative quantified spot data. Note: this filter only makes sense if your data might have negative values (e.g. Affymetrix chip "Avg Diff" data) or a background corrected value that is less than 0.0. The filter is enabled by setting the "Filter by spots with positive intensity" checkbox. Negative intensity values may occur with some types of array quantification programs. In the "Check spots for positive values mode" submenu, you may choose whether the test is applied to spots from the current HP, the single (HP-X,HP-Y) samples, (HP-X,HP-Y) 'sets' (replicated spots), or samples in the HP-E list selected to be used in the filter. If there are (F1,F2) or (Cy3/Cy5) data, then each spot must meet the threshold criteria.

    The Filter by Good Spot data submenu filter contains options that specify spots based on their quality. It filters out genes that do not have "Good Spot" values defined by the optional QualCheck spot data (see the list of codes in Appendix C.4). If there is no such spot quality data, then all spots are considered "good". The filter is enabled by setting the "Filter by spots with Good Spot values" checkbox. All spots for the specified samples must meet the criteria. In the "Check spots for Good Spot mode" submenu, you may choose whether the test is applied to spots from the current HP, the single (HP-X,HP-Y) samples, (HP-X,HP-Y) 'sets' (replicated spots), or samples in the HP-E list selected to be used in the filter.

    The Filter by Spot Detection Value data submenu filter contains options that specify spots based on their spot detection value quality metric over the range of [0.0 : 1.0]. The filter is available only if the data exists for your database and is ignored otherwise. If active, it pops up a "Spot Detection Value" slider in the range of [0.0 : 1.0]. Only spots greater than the slider value pass the filter. This data could be the Affymetrix MAS5.0 "Detection p-value" or some other metric correlated with spot detection quality. The filter is enabled by setting the "Filter by per-sample Spot Detection Value" checkbox. All spots for the specified samples must meet the criteria. In the "Check spots for Spot Detection Value mode" submenu, you may set the samples where the test may be applied to spots from the current HP, the single (HP-X,HP-Y) samples, (HP-X,HP-Y) 'sets' (replicated spots), or samples in the HP-E list selected to be used in the filter.

    The Filter by spot intensity [SI1:SI2] sliders submenu contains options that determine how individual spot intensity thresholding is to be applied in the Filter.

    The Use data mode submenu filter contains options that specify whether the spot intensity values of the single sample (F1 and F2 replicated spot intensity data, or Cy3/Cy5 for ratio data) or of the (HP-X,HP-Y) 'sets' of replicated samples are to be used in the filter. If there are single sample (F1,F2) or (Cy3/Cy5) data, then each spot must meet the threshold criteria.

    The Compare channels meeting range submenu specifies which additional constraints are to be used. If required by the (AT MOST channels, AT LEAST channels, PRODUCT OF channels, SUM OF channels) commands, the Percent SI OK scroll bar will appear, covering the range of 0% to 100%.

    The Filter by [I1:I2] sliders submenu contains options that determine how spot expression (intensity or (Cy3/Cy5) ratio value) thresholding is to be applied in the Filter.

    The Filter by ratio or Zdiff sliders submenu contains options that determine how spot-ratio thresholding is to be applied in the Filter. The spot ratio is mean HP-X / mean HP-Y for sets of samples. The spot Zdiff is used if one of the Zscore normalization methods is active and is computed as (mean HP-X - mean HP-Y) for sets of samples.

    The Filter by Cy3/Cy5 HP-X ratio or Zdiff sliders submenu contains options that determine how spot Cy3/Cy5 HP-X ratio thresholding is to be applied in the Filter. The spot ratio is Cy3/Cy5 for normalized data unless one of the Zscore methods is used. In that case, the Zdiff is used and is computed as (Cy3 - Cy5) for sets of samples. If HP-X 'sets' is used, then it computes the mean Cy3 value and the mean Cy5 value and uses those values in the above computations.

    The Filter by spot CV submenu filter contains options that specify how the Coefficient Of Variation of the (F1,F2) or (HP-X,HP-Y) 'sets' (replicated spots) is to be used in the filter. The (F1,F2) CV is available only if there are duplicate spots on the HPs.

    Filtering using a statistical test with a selected p-value

    These tests will filter genes meeting the test criteria if the resulting p-value of that test is <= the value specified by the p-Value state slider. Only one test may be active at a time. If you switch to a new p-value test, it will disable the previous p-value test. If any of these tests is selected, it will pop up the p-Value state slider window for you to set the p-Value. There are two t-tests: one operating on duplicate (F1,F2) data if available, and one operating on the HP-X,HP-Y 'sets' if they are defined. The Kolmogorov-Smirnov test operates on HP-X,HP-Y 'sets' if they are defined. The F-test operates on the current Ordered Condition List (OCL) consisting of any number of condition lists each containing at least 2 (replicate) samples/condition.
  • Filter by current Ordered Condition List (OCL) F-Test [p-Value] slider [RB] - only include genes that meet the F-test criteria on the current OCL. This only works if there are at least 2 (replicate) samples/condition for each of the condition sets in the OCL. See info on defining the OCL and using the OCL data.

    Filtering out genes with high replicate spot variation

    The Spot CV filter mode submenu contains options to select how the spot CV filter is to be applied. It computes the maximum value of CV for all of the samples in the particular sample set specified. That maximum value is then used for the spot CV filter test. Genes having a large difference between the spot quantification values of corresponding duplicate spots may be filtered out. You may compute the coefficient of variation CV_j for the two values (f1_j and f2_j) for a particular gene j.
        CV_j = 2 |f1_j - f2_j| / (f1_j + f2_j)

    If the database only has one field but replicate HPs, then you may use the HP-X & HP-Y 'sets' CV_j to filter the genes. The CV_j values are then tested against a CV threshold slider value to eliminate genes with a high coefficient of variation.

    • Current HP [RB] - CV of (F1,F2) for each gene in current sample [if duplicate spots are available on each sample]
    • HP-X or HP-Y [RB] - CV of (F1,F2) for HP-X and HP-Y single samples [if duplicate spots are available on each sample]
    • HP-X 'set' [RB] - CV of spots in HP-X set
    • HP-Y 'set' [RB] - CV of spots in HP-Y set
    • HP-X or HP-Y 'sets' [RB] - CV of spots in the HP-X set or HP-Y set
    • HP-X and HP-Y 'sets' [RB] - CV of spots in both the HP-X set and HP-Y set
    • HP-E [RB] - CV of HPs in expression profile list
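
    The duplicate spot CV test can be sketched in Java as follows. The formula and the use of the maximum CV over the selected samples follow the description above; treating a CV equal to the threshold as passing, assuming f1 + f2 is positive, and all class and method names are assumptions made for this example.

```java
// Illustrative sketch only: SpotCvSketch is a hypothetical name, not MAExplorer API.
final class SpotCvSketch {

    /** CV_j = 2 |f1_j - f2_j| / (f1_j + f2_j) for the duplicate spots of one gene. */
    static double cv(double f1, double f2) {
        return 2.0 * Math.abs(f1 - f2) / (f1 + f2); // assumes f1 + f2 > 0
    }

    /** Maximum CV of one gene over all samples in the selected sample set. */
    static double maxCv(double[] f1BySample, double[] f2BySample) {
        double max = 0.0;
        for (int s = 0; s < f1BySample.length; s++) {
            max = Math.max(max, cv(f1BySample[s], f2BySample[s]));
        }
        return max;
    }

    /** A gene passes when its maximum CV does not exceed the slider threshold. */
    static boolean passesCvFilter(double[] f1BySample, double[] f2BySample, double cvThreshold) {
        return maxCv(f1BySample, f2BySample) <= cvThreshold;
    }

    public static void main(String[] args) {
        double[] f1 = {90.0, 100.0};
        double[] f2 = {110.0, 300.0};
        System.out.println(maxCv(f1, f2));
        System.out.println(passesCvFilter(f1, f2, 0.5));
    }
}
```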

    2.4.3.1 Data filtering using multiple gene data filters

    Any or all of the data filters may be selected simultaneously. In particular, if you select filters that use parameter threshold scrollers, they will be added to a state scroller window (see Figure 2.3.4.1 for details) to allow adjustment of all sliders simultaneously. You may change various thresholds and see the effect in real time. Note: some of the scrollers are more sensitive to low values. Therefore, we set them to respond non-linearly with a more precise vernier at the low end.

    Filtering using multiple filters (Spot Intensity and Ratios of samples) and multiple thresholds

    Figure 2.4.3.1 Filtering using multiple scrollers. This example is of Cy3/Cy5 time series data. It filters normalized spot intensity of the Cy3 and Cy5 channels independently ([SI1:SI2] inside range) where low intensity spots are eliminated. It then filters out genes outside of the [R1:R2] ratio range.

    Filtering using positive intensity data - ignoring negative data

    Figure 2.4.3.2 Using the Positive Intensity data Filter. This allows removing negative data if the data contains negative intensity values (e.g. some Affymetrix data has negative Average Difference values, which could be read as Intensity for MAExplorer).


    
    


    2.4.4 Plot menu

    The Plot menu lets you display a pseudoarray image, scatter plots, ratio and intensity histograms, and expression profile plots. The pseudoarray image is displayed in the main MAExplorer window. All of the other plots are displayed in popup windows. Depending on the particular plot, multiple instances may be allowed. The Plot submenus are:

    You may switch between different representations of the microarray spot pseudoarray image. It may be viewed as several different types of pseudo images including an intensity gray value and a pseudo-color Red/Black/Green image for ratio (HP-X/HP-Y) and Zscore (HP-X - HP-Y) data. The p-Value results of comparing a HP-X 'set' with a HP-Y 'set' of samples, or the CV of the HP-E 'list', can be displayed as a color spectrum pseudoarray image.

    Depending on the origin of the array data, it may have the same verisimilitude as the original arrays. Otherwise, it is displayed in a generic pseudoarray image containing grids that will fit the window - these are not the same as the original array image (see . However, the pseudoarrays are useful for getting a rough idea of the global changes in the data between arrays and how many genes pass the data filter.

    When enabled using one of the commands in the Section 2.4.5 Clustering menu, cluster data appears as blue circles or squares drawn as overlays on the pseudoarray image. These options are discussed in the section on clustering. If you are doing K-means clustering, the current cluster is displayed in the scatter plot if the latter is active.

    Scatter plots, ratio and intensity histograms of the mean (HP-X/HP-Y) or (HP-X/HP-Y) 'set' data, or the F1/F2 or Cy3/Cy5 data. F1/F2 or Cy3/Cy5 plots are available if the data exists in your particular database. That might be the case with replicate spots or with Cy3/Cy5 data. If the normalization is set to a Zscore or log mean mode, it will compute Zscore scatter plots and histograms.

    Clicking on spots in an array image or points in scatter plots sets the current gene and will bring up data on the gene or (optionally) access corresponding data from GenBank, UniGene, mAdb Clone, etc. databases in a popup Web browser. Clicking on a bin in a ratio or intensity histogram plot filters out all genes except for those in the range of that bin.

    Expression profiles plots of selected genes or subsets of genes for all samples in the HP-E list. These are active plots with data reported when the user clicks in the plot.

    Clicking on a spot (i.e. gene) in the microarray pseudo image or on a point (i.e. gene) in the scatter plot defines that gene as the "current gene" that is used in other operations. The current gene is indicated in both plots with a green circle around it. Similarly, you may modify the 'Edited Gene List' from either the pseudoarray image or the scatter plot. When viewing is enabled, it overlays those genes with magenta squares.

    Plot menu - selecting Ratio Pseudoarray image

    Figure 2.4.4 Plot menu - selecting Ratio Pseudoarray image. This displays a pseudocolor, shown in the scale on the left, that indicates the ratio of the value of the HP-X sample / HP-Y sample (or 'sets' if the option to use HP-X and HP-Y 'sets' is enabled). If the data is Cy3/Cy5 data, then this displays the ratio of the ratios using the current normalization. Various other pseudoarray image representations could be used.

    2.4.4.1 Show microarray pseudoarray images menu

    You may show the pseudoarray image of the current hybridized samples using several modalities. The grayscale pseudoarray image is generated from the quantified spot data. If the data contains the actual spot positions of the genes (as generated by the various array image quantification programs), the spots may be drawn using a scaled version of those coordinates. Otherwise, a generic set of grids (and fields if there are multiple fields) is synthesized to represent the spot positions. Pseudoarray images may also be useful as an alternative modality for displaying X/Y ratio or X-Y Zdiff data. If the normalized intensities are the same, then the spot will appear as black with the overall spot intensity depending on the spot concentrations. High ratios and Zdiffs will be red and low values green as shown in Table 2.4.4.1. The p-Value results of comparing a HP-X 'set' with a HP-Y 'set' of samples can be displayed as a color spectrum pseudoarray image.

    If the database that was loaded contains only one sample, the pseudoarray image display defaults to the pseudograyscale spot intensity mode. If there is at least one HP-X and one HP-Y sample, then the Pseudocolor HP-X/Y ratio or Zdiff mode is the initial default display. If there are duplicate spots for each gene, you may generate a Pseudocolor F1F2 ratio or Zdiff mode image. If you are using Cy3/Cy5 ratio data and the data is available as independent channels for each HP, then you may plot Cy3 vs Cy5 for individual samples.

    When available on the database server, the original image may be displayed in a separate popup Web browser.

    
    

    Table 2.4.4.1. Pseudocolors assigned to spots to represent data in the X/Y ratios or X-Y Zdiffs pseudocolor array images. Each color represents the normalized X/Y ratio or X-Y Zdiff depending on Normalization mode. The 9 colors of the boxes represent the normalized expression ranges.


    Normalization mode - RBG         bright green  .       .       dark green  Black   dark red     .      .      bright red
    Normalization mode - dichromasy  bright blue   .       .       dark blue   Black   dark orange  .      .      bright orange
    Ratio data                       <0.250X       0.307X  0.400X  0.571X      1.000X  1.75X        2.50X  3.25X  >4.00X
    Zscore data                      <-3.0         -2.25   -1.50   -0.75       0.00    0.75         1.50   2.75   >3.0
    Zscore Log data                  <-0.99        -0.742  -0.495  -0.247      0.000   0.247        0.495  0.742  >0.99
    
    
    Clicking on a particular gene will report its specific quantification and identification values (see Section 3.3 on gene quantification). If the Enable display current gene in popup genomic DB Web Browser option is set in the View menu, then it will also pop up a Web browser with the data corresponding to that gene from the particular genomic database, if it exists.

    The same data is shown in a variety of normalization and display formats.

    2.4.4.1.1 Examples of microarray intensity data pseudoarray image

    The relative intensity may be displayed for the current sample (last HP-X or HP-Y selected) or two samples (HP-X or HP-Y samples or HP-X or HP-Y 'sets' of samples). To show two samples side by side, enable the Use dual HP-X & HP-Y Pseudoimage in the Show Microarray submenu. To show averaged set data in the dual mode, enable the "Use HP-X & HP-Y 'sets'" option in the Samples menu. The grayscale value reflects the current normalization mode.

    Pseudoarray image show intensity plot for median normalized data for a single sample

    Figure 2.4.4.1.1.1 Pseudoarray intensity image of median normalized intensities of the current HP sample (C57B6 virgin 10 weeks from MGAP database). The graylevel scale on the left edge of the pseudoarray image indicates the spot intensity. All pseudoarray images have scales that vary depending on the type of pseudoarray being displayed.

    Pseudoarray image show intensity plot for Zscore normalized data for single sample

    Figure 2.4.4.1.1.2 Pseudoarray intensity image of Zscore normalized intensities of the current HP (C57B6 virgin 10 weeks from MGAP database).

    Pseudoarray image show intensity plot for Zscore Log normalized data for single sample

    Figure 2.4.4.1.1.3 Pseudoarray intensity image of ZscoreLog normalized intensities of the current HP (C57B6 virgin 10 weeks from MGAP database).

    Pseudoarray image show intensity plot for Zscore Log normalized data for HP-X and HP-Y single samples

    Figure 2.4.4.1.1.4 Pseudoarray intensity image of ZscoreLog normalized intensities of the dual HP-X and HP-Y individual samples. The Plot menu Show Microarray submenu toggle "Use dual HP-X & HP-Y samples" option is set. HP-X is a C57B6 pregnancy day 13 and HP-Y is a Stat5a (-,-) pregnancy day 13.

    Pseudoarray image show intensity plot for Zscore normalized data for HP-X and HP-Y 'set' samples

    Figure 2.4.4.1.1.5 Pseudoarray intensity image of ZscoreLog normalized intensities of the dual HP-X and HP-Y sample 'sets'. The Plot menu Show Microarray submenu toggle "Use dual HP-X & HP-Y samples" option is set, as is the "Use HP-X & HP-Y 'sets'" option in the Samples menu. HP-X is the mean of three 'C57B6 pregnancy day 13' and HP-Y is the mean of three 'Stat5a (-,-) pregnancy day 13'.

    2.4.4.1.2 Example of microarray ratio or Zdiff data pseudocolor image

    The ratio (HP-X/HP-Y) or Zdiff (HP-X - HP-Y) normalized intensity data may be displayed as a pseudoarray image for HP-X and HP-Y, or HP-X and HP-Y 'sets' of samples. To show averaged set data in the dual mode, enable the "Use HP-X & HP-Y 'sets'" option in the Samples menu. The colors of the 9 scale boxes represent the normalized expression ranges and are assigned according to the current normalization mode listed in the table.

    Pseudoarray image show color ratio plot for median normalized data for HP-X and HP-Y single samples

    Figure 2.4.4.1.2.1 Pseudocolor array image of median normalized X/Y ratios. HP-X is C57B6 pregnancy day 13 and HP-Y is Stat5a (-,-) pregnancy day 13. Each spot's color represents the normalized X/Y ratio depending on Normalization mode. The color of the box is one of 9 colors representing the normalized expression ranges and assigned according to the table "Ratio mode".

    Pseudoarray image show color ratio plot for median normalized data for HP-X and HP-Y 'set' samples

    Figure 2.4.4.1.2.2 Pseudoarray color image of normalized X/Y 'set' mean value ratios. Mean of three HP-X C57B6 pregnancy day 13 samples and mean of three HP-Y Stat5a (-,-) pregnancy day 13 samples. Each spot's color represents the normalized X/Y 'set' ratios depending on Normalization mode. The color of the box is one of 9 colors representing the normalized expression ranges and assigned according to the table "Ratio mode".

    Pseudoarray image show color ratio plot for |X-Y| Zdiffs normalized data for HP-X and HP-Y 'set' samples

    Figure 2.4.4.1.2.3 Pseudoarray color image of X-Y Zdiffs. HP-X is C57B6 pregnancy day 13 and HP-Y is Stat5a (-,-) pregnancy day 13. Each spot's color represents the normalized X-Y Zdiff using the Zdiff normalization mode. The color of the box is one of 9 colors representing the normalized expression ranges and assigned according to the table "Zdiff mode".

    Pseudoarray image show color ratio plot for |X-Y| Zdiff of Log normalized data for HP-X and HP-Y samples

    Figure 2.4.4.1.2.4 Pseudoarray color image of X-Y Zdiff of log data. HP-X C57B6 pregnancy day 13 sample and HP-Y Stat5a (-,-) pregnancy day 13 sample. Each spot's color represents the normalized X-Y Zdiff using the ZdiffLog with StdDev normalization mode. The color of the box is one of 9 colors representing the normalized expression ranges and assigned according to the table "ZdiffLog mode".

    Pseudoarray image showing color coded p-values for t-test comparison of HP-X and HP-Y 'set' samples

    Figure 2.4.4.1.2.5 Pseudoarray image showing color-coded p-values for t-test comparison of HP-X and HP-Y 'set' samples. The HP-X and HP-Y sets both have 2 samples each (more is obviously much better). The data was normalized using the Median and a spot intensity [SI1:SI2] data filter was applied to eliminate some of the noisy data. Each spot's color represents a p-value in the range indicated in the scale on the left edge of the image. Note that although all spots are assigned a p-Value, many may not be very significant without adequate preprocessing of the data (such as normalization, low intensity spot removal, etc.), so use this display with care.

    2.4.4.2 Scatter plots menu

    Scatter plots include HP-X vs HP-Y intensity for comparing data between HP-X and HP-Y samples (or sets if the HP-X and -Y sample 'sets' mode is enabled in the Samples menu - as is shown in Figure 2.2.1). You may zoom into any area of the scatter plot as is shown in Figure 2.4.4.2(C). If there are duplicate spots for each gene, you may plot F1 vs F2 intensity (or Cy3 vs Cy5 if using ratio data) for comparing replicate data (or Cy3 and Cy5 ratio data channels) within the same sample. It will also compute the correlation coefficient for the data and display it in the plot and in the message panel. The data is the intensity values using the current normalization method. If you are analyzing ratio Cy3/Cy5 data, you may compare Cy3 or Cy5 of the HP-X sample against Cy3 or Cy5 of the HP-Y sample. If you are in stand-alone mode, a SaveAs GIF button will also be available. This saves the current plot as a full resolution GIF file specified by the user in a popup file browser window.
      rSq=0.974, n=1728, X(mn+-sd)=(4.477+-7.845), Y(mn+-sd)=(12.379+-24.810)
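
    The summary statistics in a message line of this kind can be approximated with the Java sketch below. It assumes that rSq is the square of the Pearson correlation coefficient and that the reported sd is the sample standard deviation; the manual does not define either, so treat both as assumptions. The class and method names are hypothetical.

```java
import java.util.Locale;

// Illustrative sketch only: ScatterStatsSketch is a hypothetical name, not MAExplorer API.
final class ScatterStatsSketch {

    static double mean(double[] v) {
        double sum = 0.0;
        for (double x : v) {
            sum += x;
        }
        return sum / v.length;
    }

    /** Sample standard deviation (divide by n - 1); an assumption of this sketch. */
    static double stdDev(double[] v) {
        double m = mean(v);
        double sumSq = 0.0;
        for (double x : v) {
            sumSq += (x - m) * (x - m);
        }
        return Math.sqrt(sumSq / (v.length - 1));
    }

    /** Square of the Pearson correlation coefficient between x and y. */
    static double rSquared(double[] x, double[] y) {
        double mx = mean(x);
        double my = mean(y);
        double sxy = 0.0;
        double sxx = 0.0;
        double syy = 0.0;
        for (int j = 0; j < x.length; j++) {
            double dx = x[j] - mx;
            double dy = y[j] - my;
            sxy += dx * dy;
            sxx += dx * dx;
            syy += dy * dy;
        }
        double r = sxy / Math.sqrt(sxx * syy);
        return r * r;
    }

    /** Formats the statistics in the style of the message line shown above. */
    static String summary(double[] x, double[] y) {
        return String.format(Locale.ROOT,
                "rSq=%.3f, n=%d, X(mn+-sd)=(%.3f+-%.3f), Y(mn+-sd)=(%.3f+-%.3f)",
                rSquared(x, y), x.length, mean(x), stdDev(x), mean(y), stdDev(y));
    }

    public static void main(String[] args) {
        double[] x = {1.0, 2.0, 3.0, 4.0};
        double[] y = {2.1, 3.9, 6.2, 7.8};
        System.out.println(summary(x, y));
    }
}
```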
    
    The Scatter plots submenu includes:
    • HP-X vs HP-Y intensity - display a scatter plot of HP-X vs HP-Y intensity data of H.P. selected (average of F1+F2 fields). If HP-X/-Y 'sets' are enabled, then the plot is of the mean set values.
    • HP F1 vs F2 intensity - display a scatter plot of F1 vs F2 (or Cy3 vs Cy5) intensity data of the current selected sample
    • --------------------
    • HP-X Cy3 vs HP-Y Cy3 intensity - display a scatter plot of HP-X(Cy3) vs HP-Y(Cy3) intensity data of the currently selected HP-X and HP-Y samples (ratio data only).
    • HP-X Cy5 vs HP-Y Cy5 intensity - display a scatter plot of HP-X(Cy5) vs HP-Y(Cy5) intensity data of the currently selected HP-X and HP-Y samples (ratio data only).
    • HP-X Cy3 vs HP-Y Cy5 intensity - display a scatter plot of HP-X(Cy3) vs HP-Y(Cy5) intensity data of the currently selected HP-X and HP-Y samples (ratio data only).
    • HP-X Cy5 vs HP-Y Cy3 intensity - display a scatter plot of HP-X(Cy5) vs HP-Y(Cy3) intensity data of the currently selected HP-X and HP-Y samples (ratio data only).

    If Cy3/Cy5 ratio data is being analyzed, then the "HP F1 vs F2 intensity" menu entry becomes

    • HP Cy3 vs Cy5 intensity - display a scatter plot of Cy3 vs Cy5 intensity data of the current sample selected (i.e. being displayed).

    The following figure illustrates some of the scatter plots and zoomed regions using the scroll bars on the horizontal and vertical axes.

    A) Scatter plot of single HP-X vs HP-Y samples

    B) Scatter plot of HP-X vs HP-Y 'sets' of samples

    C) Zoomed scatter plot of HP-X vs HP-Y 'sets' of samples

    Figure 2.4.4.2 Scatter plot of HP-X and HP-Y single sample data. HP-X is C57B6 pregnancy day 13 and HP-Y is lactation day 1. A) An active scatter plot may be generated for the current HP-X and HP-Y samples filtered by "All named genes". B) A similar plot for HP-X and HP-Y 'sets' of replicate samples (3 pregnancy and 4 lactation samples in the sets respectively). Clicking on a point in the plot sets the current gene. C) Zoomed region (of B) at the bottom of the plot showing more detail and filtered by just "All named genes". Zooming is performed by adjusting the X or Y axes limits scroll bars. Note that the points enclosed in magenta boxes indicate genes in the E.G.L. gene list.


    Scatter plots of data from multiple channels on the same sample

    It is also possible to plot the separate channels within a single sample against each other, for example F1 vs F2 in samples with replicate data, or Cy3 vs Cy5 in samples with separate ratio data channels.

    A) Scatter plot of Field 1 (F1) vs Field 2 (F2) of current HP sample

    B) Scatter plot of Cy3 vs Cy5 of current HP sample

    C) Scatter plot of HP-X Cy3 vs HP-Y Cy3 channels of different samples

    D) Scatter plot of HP-X Cy3 vs HP-Y Cy5 channels of different samples

    Figure 2.4.4.2.1 Scatter plot of multiple channel data from a single sample. A) F1 vs F2 data for a C57B6 pregnancy day 13 sample. B) Cy3 vs Cy5 data for an NCI mAdb mouse array sample. C) Scatter plot of individual Cy3 channel of HP-X compared with Cy3 channel of HP-Y for ratio Cy3/Cy5 data hybridized samples. D) Scatter plot of individual Cy3 channel of HP-X compared with Cy5 channel of HP-Y for ratio Cy3/Cy5 data hybridized samples.


    2.4.4.3 Histogram plots menu

    You may compare ratios or Zdiffs of data using the HP-XY ratio or Zdiff command to display a ratio or Zdiff histogram of Filtered intensity data from two samples selected from the Samples menu. The HP-XY 'sets' ratio or Zdiff command is used if there are multiple samples in the HP-X or HP-Y sets; the mean values in each of the sets are then used in the calculations. If there are duplicate spots for each gene, you may plot the F1F2 ratio or Zdiff histogram of the F1/F2 ratios or F1-F2 Zdiff values of the normalized data for each spot in the currently displayed sample. If you are in stand-alone mode, a SaveAs GIF button will also be available to save the current plot as a full resolution GIF file specified by the user in a popup file browser window.
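
    As a minimal sketch of the 'sets' ratio calculation described above, assume each set is a list of per-sample intensity lists that are already normalized and aligned by gene. The function names are hypothetical and are not part of MAExplorer.

```python
def set_means(samples):
    """Mean intensity per gene across the samples of one set."""
    n_genes = len(samples[0])
    return [sum(s[g] for s in samples) / len(samples) for g in range(n_genes)]

def set_ratios(hpx_set, hpy_set):
    """Per-gene ratio of the mean HP-X set intensity to the mean HP-Y set intensity."""
    mean_x = set_means(hpx_set)
    mean_y = set_means(hpy_set)
    # Report None for genes whose HP-Y mean is zero (ratio undefined)
    return [x / y if y != 0 else None for x, y in zip(mean_x, mean_y)]

hpx = [[2.0, 4.0, 1.0], [4.0, 4.0, 3.0]]  # two HP-X samples, three genes
hpy = [[1.0, 8.0, 0.0], [2.0, 8.0, 0.0]]  # two HP-Y samples, three genes
print(set_ratios(hpx, hpy))
```

    How MAExplorer itself treats zero or missing intensities is not stated here; returning None is only a placeholder choice for this sketch.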

    The Intensity selection plots a histogram of the gene intensity data values for each Filtered spot (gene) in the current hybridized sample.

    The Histograms submenu includes:

    • HP-XY ratio or Zdiff - display ratio or Zdiff histogram of data from selected HP-X and HP-Y samples (mean F1+F2 data if duplicate spots).
    • HP-XY 'sets' ratio or Zdiff - display ratio or Zdiff histogram of mean data from HP-X and HP-Y 'set's of samples (mean data).
    • F1F2 ratio or Zdiff - display ratio or Zdiff histogram of F1 F2 duplicate spot data from the current sample.
    • HP Intensity - (or HP (Cy3/Cy5) if ratio data) display a histogram of spot intensity (or Cy3/Cy5 ratio) data for the current sample.

    If Cy3/Cy5 ratio data is being analyzed, then the F1F2 histogram menu entry becomes

    • Cy3Cy5 ratio or Zdiff - display ratio or Zdiff histogram of Cy3 Cy5 data from the current sample.

    The following figure illustrates the histograms. You may use the histogram to specify ranges of [I1:I2] or [R1:R2] for data filtering by selecting the corresponding histogram bins. This is described in the figure legend.

    A) Ratio histogram of (HP-X/HP-Y) samples data, median normalization

    B) Zdiff  histogram of (HP-X - HP-Y) samples data, Zscore normalization

    C) Intensity histogram of current sample data, median normalization

    Figure 2.4.4.3 Histogram plots. A) Ratio histogram of HP-X/HP-Y data with a particular histogram bin selected and the constraint set to filter all genes > that bin. HP-X is 13 day pregnancy C57B6 and HP-Y is day 1 lactation. The selected bin thresholds are then used in the Filter with the resulting Filtered genes shown in the array image. B) Zdiff histogram of HP-X - HP-Y 'sets' for same data as (A) but with the >< threshold constraint set to find genes outside of the symmetric histogram range. C) Intensity histogram of HP-X data filtered by [I1:I2] intensity range. As with ratio histograms, you can do additional filtering by selecting a particular histogram bin that is then used in the Filter. Filtering was disabled for the intensity histogram. To apply the filter, the "Don't re-Filter" button would be toggled to the "Re-Filter" state. The threshold constraints include: =, >, <, >, <>, and ><. Note that each time you click on the "Thr:" button, it cycles to the next option in the threshold constraints list.
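
    The bin-based filtering in panel (A) can be sketched roughly as follows. This is only an illustration under stated assumptions: equal-width bins over a fixed range, and the ">" constraint read as "ratio above the upper edge of the selected bin". The actual bin layout used by MAExplorer is not specified here, and the function names are hypothetical.

```python
def ratio_histogram(ratios, n_bins, lo, hi):
    """Count gene ratios falling in n_bins equal-width bins over [lo, hi)."""
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for r in ratios.values():
        if lo <= r < hi:
            # Clamp the index to guard against floating point rounding at the top edge
            counts[min(int((r - lo) / width), n_bins - 1)] += 1
    return counts

def genes_above_bin(ratios, bin_index, n_bins, lo, hi):
    """Genes whose ratio lies above the upper edge of the selected bin ('>' constraint)."""
    width = (hi - lo) / n_bins
    upper_edge = lo + (bin_index + 1) * width
    return sorted(g for g, r in ratios.items() if r > upper_edge)

ratios = {"a": 0.5, "b": 1.5, "c": 2.5, "d": 3.5}  # gene -> HP-X/HP-Y ratio
print(ratio_histogram(ratios, 4, 0.0, 4.0))
print(genes_above_bin(ratios, 1, 4, 0.0, 4.0))
```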


    2.4.4.4 Expression profile plots menu

    You may generate an individual expression profile plot (EP plot) or a scrollable list of EP plots. The ordered list of hybridized samples to plot is specified by the HP-E set. In the latter case, the genes are specified by the data Filter.

    You may generate as many individual expression profile plots as you want using the Display a gene's expr. profile for HP-E command. However, only the last one will be active and will be updated with different genes as you click on them in the microarray image or scatter plot. This could be used to compare the EP plots for several different genes. First view the EP plot for one gene, then create a new EP plot for the second gene, etc.

    If you use the Display Filtered genes expr. profiles for HP-E command, it will generate a scrollable list of expression profile plots for all of the genes passing the Filter. If the number of genes is very large, it may take a while.

    You may interrogate a line corresponding to a particular HP sample in an EP plot by moving the mouse over the line and then selecting the line. This will cause the name of the HP, its intensity and CV to appear in the plot. If the Err check box is set, then the mean of the intensity is indicated by a short horizontal bar and the +- CV by red vertical error bars above and below the mean. In the case of mean expression profiles used in K-means clustering, the standard deviation is used in place of the CV value. Pressing the plot style button repeatedly cycles through: Line (i.e. vertical bars for each point), Circle (small circles at each point), and Curve (i.e. a continuous curve connecting the circles of all samples). The various clustering methods have EP plots buttons. When they are invoked, the scrollable list of EP plots is sorted by the clustering method's ordered list of genes. This enables you to view the data in the same order as that produced by the cluster analysis. If the zoom nnX button is pressed, then all of the plots are magnified nn-fold to make low intensity plots more visible. Pressing the button repeatedly cycles through: 1X, 2X, 5X, 10X and 20X. It does not change the data itself. The Show HP names button pops up a numbered list of all HP entries used in the expression profile. If you are in stand-alone mode, a SaveAs GIF button will also be available for the EP overlay mode (Figure 2.4.4.4.1) or individual EP plot. This saves the current plot as a full resolution GIF file specified by the user in a popup file browser window.

    The Expression profile plots submenu contains:

    • Display a gene's expr. profile for HP-E - pop up a window and display the expression profile of a gene when you click on a spot in the Image or a point in the scatter plot.
    • Display Filtered genes expr. profiles for HP-E - popup a scrollable montage of expression profiles of the list of Filtered genes.
    • --------------------
    • Use EP overlay else EP list [CB] - display Filtered genes as an overlay plot of expression profiles, else as a scrollable list of EP plots.

    A) Expression profile plot - shown with various formatting options

    B) List of expression profile plots (it is scrollable if more than 10 plots)

    Figure 2.4.4.4 Expression profile plots. A) Individual expression profile plots may be created by clicking on any gene. Multiple instances may also be created. Here we show some of the presentation options for the 38 sample MGAP database. Error bars are computed for the standard error for that sample. There are three different plotting options: line, circle and curve. #1 is the default line plot with error bars. #2 is the line plot without the error bars but clicking on line 7 to find out which sample it is and what the intensity value is. #3 is the circle plot with error bars, and #4 is the curve plot without error bars. Window #5 shows the list of samples corresponding to the 38 points in the EP plots. B) List of EPplots of the oncogenes and proto-oncogenes in the database (set by the guesser with "onco" and "Set E.G.L." and the Edited Gene List Filter). The list would become scrollable if there were more than 10 profiles. Setting the current gene would scroll the list to the EPplot for the current gene.


    A) Scrollable list of expression profile plots

    B) Overlay expression profile plots with current gene in green

    C) Overlay expression profile plots of EGL filter genes with current gene in green

    Figure 2.4.4.4.1 Expression profile plots. A) Scrollable list of EP plots of Filtered named genes centered at Carbonic anhydrase III. B) Overlay plot of all named Filtered genes. C) Overlay plot of all ONCO or PROTO-ONCO genes with the draw EGL option active so the graphs are drawn for these genes.


    
    


    2.4.5 Cluster menu

    The Clustering menu lets you perform various types of gene and condition clustering operations. When you invoke a clustering operation it will pop up one or more windows and may modify the pseudoarray image. Some of the popup windows include clustergram and dendrogram analysis plots used with the hierarchical clustering.

    When enabled, cluster data appears as blue circles or squares drawn as overlays on the pseudoarray image. These options are discussed in the section on clustering.

    Cluster analysis plots include finding a subset of genes or subsets of samples based on cluster analysis of expression profile similarity measures. These show genes belonging to particular clusters, or genes that cluster well with specified genes. Cluster methods include: finding genes similar to the current selected gene within a "distance" threshold; K-means-like clustering where you specify a seed gene and the number of clusters; and hierarchical clustering with clustergram and dendrogram graphics.

    Cluster menu options

    Figure 2.4.5 Cluster Menu options. The hierarchical clustering option is being selected.


    Use of clustering to find patterns of similar gene expression

    Clustering is a way of possibly finding co-expressed genes that exhibit similar expression changes in a set of samples. Genes may show similar co-expression, but that does not prove they are co-regulated at the same point in a pathway - merely that measurements of those genes in a particular set of experiments show similar expression. However, identifying genes with similar expression for which some information is already known about some of the genes may be useful as a starting point to help figure out gene function and possibly aspects of its pathways in cell function using additional experiments and analysis.

    There are many methods for doing clustering - each with advantages and disadvantages. We present three methods in MAExplorer and plan on adding a variety of more powerful methods through the MAEPlugin facility under development.

    These methods may find genes belonging to particular clusters or genes that cluster well with particular genes. Gene clusters are sets of genes whose expression profiles are found to be similar according to a particular metric. We now define what we mean by "similar". The ordered list of hybridized samples used in computing the expression profiles is the HP-E list. MAExplorer has two different dissimilarity measures for Cij: Euclidean distance LSQdistij and Pearson correlation coefficient rij. These are computed as follows and are tested against the cluster distance threshold (set by the slider in the preferences sliders). Let n= |HP-E|, the number of samples in the expression profile. We define similarity as (1.0 - normalized dissimilarity).

    
    
    
    Hint: when working with very large data sets with many samples, it may be useful to pre-adjust the distance and/or number of clusters threshold sliders to an approximate range using the (Edit Menu | Preferences | Adjust all Filter threshold scrollers). This is because once the clustering starts, it does not (currently) let you abort the clustering to change the threshold value.

       LSQdistij = Sqrt( Sum ( D'hj - D'hi) **2 ) / n 
                   h in HP-E
    	       i,j in Filtered genes, i not j
    
    Let,
       sumij = Sum( D'hj * D'hi ),
       mni = (1/n)Sum( D'hi ),
       mnj = (1/n)Sum( D'hj ),
       sumSqi = Sum( D'hi *  D'hi ),
       sumSqj = Sum( D'hj *  D'hj ),
      
    then,
             [sumij - n*(mni * mnj)] 
       rij = ----------------------------------------------------
             [Sqrt(sumSqi - n*mni*mni) * Sqrt(sumSqj - n*mnj*mnj)]
                   h in HP-E
    	       i,j in Filtered genes, i not j
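
    The two measures can be sketched in code as follows. This is an illustration rather than the MAExplorer implementation: the function names are hypothetical, LSQdistij is read as the square root of the summed squared differences divided by n, and the correlation denominator uses the standard sum-of-squared-deviations form (sumSq - n*mn*mn).

```python
import math

def lsq_dist(ep_i, ep_j):
    """Euclidean distance between two expression profiles, divided by n."""
    n = len(ep_i)
    return math.sqrt(sum((dj - di) ** 2 for di, dj in zip(ep_i, ep_j))) / n

def pearson_r(ep_i, ep_j):
    """Pearson correlation coefficient of two expression profiles."""
    n = len(ep_i)
    sum_ij = sum(di * dj for di, dj in zip(ep_i, ep_j))
    mn_i = sum(ep_i) / n
    mn_j = sum(ep_j) / n
    sum_sq_i = sum(di * di for di in ep_i)
    sum_sq_j = sum(dj * dj for dj in ep_j)
    denom = math.sqrt(sum_sq_i - n * mn_i * mn_i) * math.sqrt(sum_sq_j - n * mn_j * mn_j)
    if denom == 0.0:
        return 0.0  # a flat profile has no defined correlation; 0.0 is a placeholder
    return (sum_ij - n * mn_i * mn_j) / denom

print(lsq_dist([0.0, 0.0], [3.0, 4.0]))
print(pearson_r([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
```

    With the correlation option enabled, the dissimilarity compared against the cluster distance threshold would then be (1.0 - rij).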
    

    Pressing the Escape key during a long cluster operation will abort the operation. If you are in stand-alone mode using the ClusterGram, a SaveAs GIF button will also be available for saving the current plot as a full resolution GIF file specified by the user in a popup file browser window. The Cluster plots submenu contains a number of clustering methods:

    • Cluster genes with expression profiles similar to current gene [RB] - click on gene in image to find other genes with similar HP-E expression profiles whose cluster distance is less than the cluster distance threshold. The larger the blue box, the higher the similarity.
    • Cluster counts of similar Filtered genes by expression profiles [RB] - draw blue circles around filtered genes indicating the number of other genes whose cluster distance is less than the cluster distance threshold. The larger the circle, the more similar genes were found. Clicking on a gene switches to the above mode.
    • K-means clustering of gene expression profiles [RB] - draw magenta circles around the N primary-node gene clusters representing the gene closest to representing the center of the cluster. Each of the nodes is a maximum distance from all other nodes in the recursive definition of nodes. N is determined by the State Scroller "# of Clusters". Changing N will recompute the clusters. It then pops up a scrollable text window with the clusters and indicates which genes belong to each one. If you select the EP plot button, it will draw the expression profiles for the clustered genes. The Mn-Cluster-Report button will generate a report for all genes sorted by K-means cluster. Summaries can be generated using the Mean EP plot and Mn-Cluster-Report buttons. The SaveAs GeneSets button saves all of the clusters as named Gene Sets ("Cluster #1", "Cluster #2", etc). If you change the filter or current gene, you should explicitly use the Recompute Clusters button to regenerate the new set of clustered genes. When you recompute the K-means clusters, it uses the current gene as the initial node.
    • Hierarchical clustering of expression profiles - this computes the hierarchical clustering of the expression profiles (normalized by HP-X sample data for each gene) of Filtered genes. The hierarchical clusters are displayed in an ordered gene clustergram and optional dendrogram. Sub-regions of the clustergram may be explored in more detail using the EP-subset plot button, or a report of the ordered genes can be created using the ClustGram Report button. Note: you may add (remove) genes you select from the Clustergram to the E.G.L. by holding the Control (Shift) key while clicking on the gene name.
    • S.O.M. gene clusters by expr profiles [RB] - [Future MAEPlugin]
    • Multi-Dimensional Scaling of genes by expr profiles [RB] - [Future MAEPlugin]
    • Clusters of (HP-E) samples as fct of Filtered genes [RB] - [Future]
    • --------------------
    • Use correlation-coefficient else Euclidian-distance [CB] - use the (1.0 - correlation coefficient) as the distance metric instead of the default Euclidean distance.
    • Scale EP vector by max magnitude prior to clustering [CB] - scale each sample in the EP by the max magnitude for all sample values in the EP.
    • Normalize by HP-X sample else HP max intensities [CB] - normalize data by the corresponding HP-X sample data for each gene or the maximum raw intensity for each HP in the expression profile.
    • Use median instead of mean for K-means clustering [CB] - compute each cluster center as the median instead of the mean of its genes (see (Bickel, 2001)).
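
    The two expression profile scaling options in the list above can be illustrated with a short sketch. It assumes an expression profile is a plain list of values for one gene and that the HP-X value for that gene is supplied separately; both function names are hypothetical.

```python
def scale_by_max_magnitude(ep):
    """Scale an expression profile so its largest absolute value becomes 1.0."""
    peak = max(abs(v) for v in ep)
    return [v / peak for v in ep] if peak > 0 else list(ep)

def normalize_by_hpx(ep, hpx_value):
    """Divide each sample value of a gene by that gene's value in the HP-X sample."""
    return [v / hpx_value for v in ep] if hpx_value != 0 else list(ep)

print(scale_by_max_magnitude([2.0, -4.0, 1.0]))
print(normalize_by_hpx([2.0, 4.0, 8.0], 4.0))
```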

    The Hierarchical Cluster plots submenu contains:

    • Display ClusterGram of gene expr profiles [CB] - compute the hierarchical clustering of the expression profiles (normalized by HP-X sample data for each gene) of Filtered genes. Then display the hierarchical clusters in an ordered gene clustergram and optional dendrogram when the dendrogram checkbox is selected. Expression profile plots of the clustergram may be explored in more detail using the EP plot button that generates a scrollable list of all EP plots ordered by the same order as the clustergram. A full report of the ordered genes expression profiles may be created using the ClustGram Report button.
    • --------------------
    • Use avg-arithmetic-linkage [RB] - set the hierarchical clustering linkage method to the average arithmetic linkage of sub-clusters. [Future]
    • Use avg-centroid-linkage [RB] - set the hierarchical clustering linkage method to average centroid linkage of sub-clusters (default).
    • Use next-min-linkage [RB] - set the hierarchical clustering linkage method to the next minimum distance sub-cluster linkage in random order.
    • Use cluster-distance matrix cache [CB] - if you do not have enough memory for clustering large gene sets, disable the cache. It will take MUCH longer without the cache. When clustering, if there is not enough memory available for the cache, it will warn you and suggest you either reduce the number of genes being clustered or use a computer with more memory.
    • Use short else float cluster-distance matrix cache [CB] - if there is not enough memory for the set of genes you wish to cluster and you still want to use the cache, you can use 16-bit (i.e. short) data instead of the 32-bit (i.e. float) data. The results will be less precise.
    • Use un-weighted else weighted average [CB] - set the hierarchical clustering vector averaging to un-weighted (the default), which weights each sub-cluster by the number of genes in that sub-cluster. Otherwise, weighted averaging gives equal (0.50) weighting to each sub-cluster.
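
    The difference between the two averaging modes can be sketched as follows. This assumes that merging two sub-clusters combines their mean expression vectors either in proportion to sub-cluster size or with equal 0.50 weights, as the option text describes; the function and parameter names are hypothetical.

```python
def merge_centroids(vec_a, n_a, vec_b, n_b, by_size=True):
    """Combine the mean vectors of two sub-clusters into one merged vector.

    by_size=True  : each sub-cluster counts in proportion to its number of genes
    by_size=False : each sub-cluster gets equal (0.50) weight
    """
    w_a = n_a / (n_a + n_b) if by_size else 0.5
    w_b = 1.0 - w_a
    return [w_a * a + w_b * b for a, b in zip(vec_a, vec_b)]

# A 3-gene sub-cluster merged with a single gene
print(merge_centroids([0.0, 0.0], 3, [4.0, 8.0], 1, by_size=True))
print(merge_centroids([0.0, 0.0], 3, [4.0, 8.0], 1, by_size=False))
```

    With size-proportional weights the single gene pulls the merged vector only a quarter of the way toward itself; with equal weights it pulls it halfway.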

    Handling of hierarchical clustering of large numbers of genes - problem with slow response

    The hierarchical clustering algorithm uses a gene-gene floating point (i.e. 32-bit) distance matrix of order N^2 (for N data Filtered genes). If you are experiencing a slow response, it may be due to several factors, some of which you may not be able to control. You might:
    1. If you determine that your computer is "paging", use a computer with more memory. The currently distributed stand-alone version will use up to 256 Mbytes of chip memory.
    2. If that is not possible, reduce the number of genes being clustered. Even if you have enough memory, the computation is still high if N is large.
    3. Set the Use short else float cluster-distance matrix cache option in the cluster plot menu. This reduces the memory requirements for the distance matrix by 1/2.
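
    A back-of-the-envelope estimate of the cache size makes these trade-offs concrete. The sketch below assumes a full N x N matrix is stored with 4 bytes per entry for float and 2 bytes per entry for short; whether MAExplorer stores the full matrix or only half of it is not stated here, and the function name is hypothetical.

```python
def distance_cache_bytes(n_genes, use_short=False):
    """Approximate size in bytes of an N x N gene-gene distance matrix."""
    bytes_per_entry = 2 if use_short else 4  # 16-bit short vs 32-bit float
    return n_genes * n_genes * bytes_per_entry

for n in (1000, 5000, 10000):
    mb_float = distance_cache_bytes(n) / 2**20
    mb_short = distance_cache_bytes(n, use_short=True) / 2**20
    print(f"N={n}: float {mb_float:.0f} MB, short {mb_short:.0f} MB")
```

    Under these assumptions, 10,000 genes would need 400,000,000 bytes for the float matrix, which is more than 256 Mbytes, while the short matrix would need half that.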

    2.4.5.1 Cluster genes with expression profiles similar to current gene

    The Cluster genes with expression profiles similar to current gene command is used to find genes with similar HP-E expression profiles, i.e. genes whose distance to the current gene, as measured by the least square error, is less than the cluster distance threshold. It pops up the "Cluster Distance" threshold scroller. Then click on a gene in the microarray image. It then pops up a window with a list of the similar genes and their expression profile distances to the current gene. Each gene that passes the cluster distance threshold test is indicated in the image with a blue square where the size of the square is proportional to its similarity, so larger squares indicate genes that are more similar. It also displays a sorted list of the genes with the cluster distance in the cluster panel that was popped up. On each line is a series of '*****' - the more stars the higher the similarity to the seed gene. This is a silhouette plot that is used to display a sorted list of similar objects and is similar to that described in (Kaufman and Rousseeuw, 1990). You may change the cluster distance threshold and it will update the display and the list. In addition, the 'edited gene list' is set to the subset of genes that belong to the current cluster.

    A) Cluster of similar genes: current seed gene in green, similar genes with blue boxes, similarity report on right

    B) Similar genes viewed in scrollable list of expression profile plots ordered by similarity

    Figure 2.4.5.1 Similar genes clustered to the current gene. This method finds all genes that are similar to the current gene, defined as those whose expression profile distance to the current gene is less than the threshold set by the user. Each gene that passes the cluster distance threshold test is indicated in the image with a blue square where the size of the square is proportional to its similarity. This data is from the 38 samples in the MGAP database containing duplicated spots. A) Main window with popup cluster similarity report and cluster distance threshold slider. B) Scrollable list of EP plots of similar genes with the red error bars indicating the variation for duplicated spots for each HP sample. The Err checkbox may turn the error bar overlays on and off.
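
    The similar-gene search described in this section can be sketched as follows. This is an illustration under stated assumptions, not the MAExplorer code: profiles are held in a dictionary keyed by gene name, the distance is the Euclidean distance divided by the number of samples, and similarity is taken as 1.0 minus the distance divided by the threshold. All names are hypothetical.

```python
import math

def ep_distance(ep_a, ep_b):
    """Euclidean distance between two expression profiles, divided by n."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(ep_a, ep_b))) / len(ep_a)

def similar_genes(profiles, seed, threshold, max_stars=10):
    """Return (gene, distance, stars) for genes within threshold of the seed, nearest first."""
    hits = []
    for gene, ep in profiles.items():
        if gene == seed:
            continue
        d = ep_distance(profiles[seed], ep)
        if d < threshold:
            # Similarity as 1.0 - normalized dissimilarity, drawn as a bar of stars
            similarity = 1.0 - d / threshold
            hits.append((gene, d, "*" * round(similarity * max_stars)))
    return sorted(hits, key=lambda hit: hit[1])

profiles = {"g1": [1.0, 1.0], "g2": [1.0, 1.0], "g3": [1.0, 3.0], "g4": [9.0, 9.0]}
for gene, d, stars in similar_genes(profiles, "g1", threshold=2.0):
    print(f"{gene}  {stars:<10}  {d:.3f}")
```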


    2.4.5.2 Cluster counts of similar filtered genes by expression profiles

    The Cluster counts of similar Filtered genes by expression profiles command analyzes the set of all Filtered genes for the expression profile defined by the HP-E samples. It counts the number of similar genes for each Filtered gene and draws a blue circle whose size is proportional to the number of genes similar to that gene. After it analyses these genes it lists the genes and their counts in the cluster panel. You may change the cluster distance threshold and/or Filter parameters and it will update the display and the list. If you click on a gene with a green circle, it will switch to single gene cluster mode (with the blue squares).

    For both of these commands, if you want to view the expression profile plots, click on the EP plot button in the cluster window and it pops up the scrollable expression profiles window. If you click on a gene in the image, it will select it as the new current gene and seed gene and recompute the cluster of genes most similar to the new seed gene.

    For both of these commands, if you want a permanent report, click on the "Cluster Report" button in the cluster window and it will generate a report in the current modality (i.e. scrollable spreadsheet or tab-delimited). You may switch between these two modes by pressing the "Go '...'" button in the report.

    Cluster counts showing clusters with large # genes with larger circles, cluster count report on right

    Figure 2.4.5.2 Display of cluster counts for all genes less than the cluster threshold from MGAP 38 sample database. The algorithm counts the number of similar genes for each Filtered gene and draws a blue circle whose size is proportional to the number of genes similar to that gene. That is why there are a larger number of the larger circles.
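
    The counting step can be sketched as follows, assuming profiles are held in a dictionary keyed by gene name and that two genes are "similar" when their Euclidean distance divided by the number of samples is below the threshold. The names are hypothetical and this is not the MAExplorer implementation.

```python
import math

def ep_distance(ep_a, ep_b):
    """Euclidean distance between two expression profiles, divided by n."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(ep_a, ep_b))) / len(ep_a)

def cluster_counts(profiles, threshold):
    """For each gene, count the other genes whose profile distance is below threshold."""
    genes = list(profiles)
    counts = {g: 0 for g in genes}
    for a in range(len(genes)):
        for b in range(a + 1, len(genes)):
            # Each pair is tested once; a hit counts for both genes
            if ep_distance(profiles[genes[a]], profiles[genes[b]]) < threshold:
                counts[genes[a]] += 1
                counts[genes[b]] += 1
    return counts

profiles = {"g1": [1.0, 1.0], "g2": [1.0, 1.0], "g3": [1.0, 3.0], "g4": [9.0, 9.0]}
print(cluster_counts(profiles, threshold=2.0))
```

    Every pair of Filtered genes is compared, so the work grows with the square of the number of genes.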


    2.4.5.3 K-means clustering of gene expression profiles for filtered genes

    The K-means cluster gene expression profiles for Filtered genes command searches the data Filtered gene list for the genes (i.e. primary genes) with the N most orthogonal expression profiles. It will start this recursive computation from the gene with minimum distance to all other genes unless you have selected a "current gene" with the mouse. All Filtered genes are assigned to the nearest K-means primary node. The mean cluster vector is computed and used as the new definition of the cluster center. If you set the "Use median instead of mean for K-means clustering" option in the Clustering submenu, it will compute the center as a median instead of a mean (Bickel, 2001). K-means clustering is described in (Sneath and Sokal, 1973). A new K-means primary gene (i.e. gene for the cluster center) is found that is closest to this new center. Then all of the data Filtered genes are reassigned to the new cluster centers. The mean+-stdDev of the within-cluster distance to its center is computed. It then pops up a text window with an ordered report of the Filtered genes, part of which is shown below. [This is part of a report from a 38 sample MGAP database subset of 141 genes from the set of named genes restricted by the CV data filter.] The "Similarity" data is plotted as a silhouette plot using variable length strings of '****'. Note that clusters where the similarity is about the same for the entire cluster (e.g. cluster #4) contain genes that probably belong together in the same cluster. Clusters where it is not (e.g. Cluster 6) probably contain two smaller more robust clusters.
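
    The steps described above can be sketched roughly as follows. This is one reading of the text, not the MAExplorer implementation: "most orthogonal" is interpreted as picking, one at a time, the gene farthest from the primary genes chosen so far, the seed gene is always supplied by the caller, and a plain Euclidean distance is used. All names are hypothetical.

```python
import math

def dist(a, b):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pick_primary_genes(profiles, seed, n_clusters):
    """Start from the seed, then repeatedly add the gene farthest from all chosen nodes."""
    primaries = [seed]
    while len(primaries) < n_clusters:
        candidates = [g for g in profiles if g not in primaries]
        primaries.append(max(
            candidates,
            key=lambda g: min(dist(profiles[g], profiles[p]) for p in primaries)))
    return primaries

def assign(profiles, primaries):
    """Assign every gene to the cluster of its nearest primary gene."""
    clusters = [[] for _ in primaries]
    for gene, ep in profiles.items():
        k = min(range(len(primaries)), key=lambda i: dist(ep, profiles[primaries[i]]))
        clusters[k].append(gene)
    return clusters

def kmeans_pass(profiles, seed, n_clusters):
    """One pass: pick nodes, assign, recenter on the gene nearest each mean, reassign."""
    primaries = pick_primary_genes(profiles, seed, n_clusters)
    clusters = assign(profiles, primaries)
    n_samples = len(profiles[seed])
    new_primaries = []
    for old, members in zip(primaries, clusters):
        if not members:
            new_primaries.append(old)  # keep the old node if its cluster is empty
            continue
        mean = [sum(profiles[g][h] for g in members) / len(members) for h in range(n_samples)]
        new_primaries.append(min(members, key=lambda g: dist(profiles[g], mean)))
    return new_primaries, assign(profiles, new_primaries)

profiles = {"a": [0.0, 0.0], "b": [0.0, 1.0],
            "c": [10.0, 10.0], "d": [10.0, 11.0], "e": [11.0, 10.0]}
primaries, clusters = kmeans_pass(profiles, seed="a", n_clusters=2)
print(primaries, clusters)
```

    Replacing the mean with a per-sample median in kmeans_pass would correspond to the median option mentioned above.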

    A) K-means clustering, current cluster (#4) in scatter plot, clone report on right

    B) Scrollable list of expression profiles for K-means clustering ordered by clusters

    C) Scrollable list of mean expression profiles for K-means clustering

    Figure 2.4.5.3 Genes clustered using the K-means cluster method. A) Using the current gene as the initial cluster, MAExplorer finds N orthogonal clusters assigning the set of filtered genes to these clusters using the HP-E expression profiles. All genes are iteratively assigned to these clusters. Genes belonging to the current cluster are labeled with a green cluster number both in the array and in the scatter plot. The slider determines the number of clusters (set to 6 here). A 2D scatter plot shows the genes belonging to cluster 6. The K-means cluster report on the right contains a sorted list of the genes in each cluster and has buttons to generate EP plots and reports as well as summary mean EP plots (shown) and mean cluster reports. The detailed list is shown below. B) Part of the scrollable EP plots for this data showing genes belonging to both clusters #5 and #6. C) The mean EP plots for the 6 clusters.


     
    Cluster report for 6 K-means clusters with 141 genes being clustered.
    The seed gene is [1248564] Jun-B oncogene.
    
    Clone ID  Similarity      Cluster-#  Distance-to-cluster  Gene-Name
    --------  --------------  ---------  -------------------  ----------------
    
    1248411   **************  1          Cluster [26 genes] in cluster [distNext: 1.035] wiCdist:mn+-sd=1.223+-0.453 CV=0.371  Calpactin I light chain
    1381592   **********      1          0.448  Surfeit gene 4
    1247956   *********       1          0.706  Protein kinase, cAMP dependent, catalytic, beta
    1381836   ********        1          0.761  Prohibitin
    1382325   ********        1          0.771  M.musculus mRNA for C1D protein
    1248270   ********        1          0.775  Seven in absentia 1A
    1247716   ********        1          0.794  Lipoprotein lipase
    1248184   ********        1          0.847  Mus musculus bromodomain-containing protein BP75 mRNA, complete cds
    1248564   *******         1          0.864  Jun-B oncogene
    1382667   *******         1          0.888  SERINE/THREONINE PROTEIN PHOSPHATASE PP2A-BETA, CATALYTIC SUBUNIT
    1382561   *******         1          0.931  Mus musculus GTP-specific succinyl-CoA synthetase beta subunit (Scs) mRNA, partial cds
    1248089   ******          1          1.013  M.musculus RPS3a gene
    1247780   ******          1          1.088  Proprotein convertase subtilisin/kexin type 7
    1247557   ******          1          1.104  M.musculus L28 mRNA for ribosomal protein L28
    1248321   *****           1          1.278  Decay accelerating factor 1
    1382751   ****            1          1.311  Clusterin
    1382007   ****            1          1.357  Murine mRNA with homology to yeast L29 ribosomal protein gene
    1382074   ****            1          1.390  Orosomucoid 1
    1381963   ****            1          1.417  M.musculus mRNA for ribosomal protein L36
    1248278   **              1          1.658  HISTONE H3.3
    1247630   **              1          1.675  Procollagen, type I, alpha 2
    1247865   *               1          1.837  Mouse beta-D-galactosidase fusion protein mRNA, complete cds
    1382236   *               1          1.85  Caspase 7
    1247833                   1          1.882  Mus musculus radio-resistance/chemo-resistance/cell cycle checkpoint control protein (Rad9) mRNA, complete cds
    1248535                   1          1.953  M.musculus mRNA for selenoprotein P
    1247702                   1          2.157  Cytochrome C oxidase, subunit Va
    1382282   **************  2          Cluster [13 genes] in cluster [distNext: 24.199] wiCdist:mn+-sd=16.184+-6.667 CV=0.412  Max interacting protein 1
    1382159   **********      2          9.086  TRANSPLANTATION ANTIGEN P35B
    1247854   *********       2          11.002  Prolyl 4-hydroxylase, beta polypeptide
    1247970   ********        2          11.786  Mouse mRNA for osteoblast specific factor 2 (OSF-2)
    1381663   ********        2          12.948  Mus musculus vacuolar adenosine triphosphatase subunit A gene, complete cds
    1382100   ********        2          13.34  T-complex protein 1, related sequence 1
    1248366   ********        2          13.541  Mus musculus cytochrome c oxidase subunit VIIa-L precursor (Cox7al) mRNA, nuclear gene encoding mitochondrial protein, complete cds
    1247568   ********        2          13.762  Cathepsin D
    1247872   *******         2          14.015  Mus musculus endothelial monocyte-activating polypeptide I mRNA, complete cds
    1382333   *******         2          14.065  Stromal cell derived factor 5
    1382008   *******         2          15.985  Mus musculus FK-506 binding protein homolog (SAM11) mRNA, complete cds
    1247724   ****            2          21.964  Glutathione-S-transferase, alpha 3
    1247846                   2          34.704  House mouse; Musculus domesticus kidney mRNA for Phosphatidic acid phosphatase, complete cds
    1247945   **************  3          Cluster [22 genes] in cluster [distNext: 11.979] wiCdist:mn+-sd=7.559+-3.347 CV=0.443  Mus musculus mRNA for DEDD protein
    1247797   **********      3          4.159  Mus musculus Btk locus, alpha-D-galactosidase A (Ags), ribosomal protein (L44L), and Bruton's tyrosine kinase (Btk) genes, complete cds
    1382087   **********      3          4.494  Cell division cycle 42
    1247539   **********      3          4.511  EST
    1248212   **********      3          5.009  Murine mRNA for integrin beta subunit
    1248470   **********      3          5.044  EST
    1247521   *********       3          5.299  Mus musculus mRNA for peroxisomal integral membrane protein PMP34
    1381808   *********       3          5.924  Mus musculus UDP-GalNAc:polypeptide N-acetylgalactosaminyltransferase-T3 mRNA, complete cds
    1381970   *********       3          6.285  Mus musculus thioredoxin mRNA, nuclear gene encoding mitochondrial protein, complete cds
    1382168   *********       3          6.343  N-terminal Asn amidase
    1382704   *********       3          6.36  Mus musculus N-myristoyltransferase 1 mRNA, complete cds
    1248548   *********       3          6.378  Mus musculus WDR protein mRNA, complete cds
    1247564   ********        3          6.652  Erythrocyte protein band 7.2
    1248588   ********        3          6.67  M.musculus BAP31 mRNA
    1247541   ********        3          6.690  Apolipoprotein D
    1248462   ********        3          7.322  Sterol O-acyltransferase 1
    1248462   ********        3          7.42  Sterol O-acyltransferase 1
    1248521   ******          3          9.121  Mus domesticus nuclear binding factor NF2d9 mRNA, complete cds
    1382212   ******          3          10.137  Thyroid autoantigen 70 kDa
    1382270   *****           3          10.529  Voltage-dependent anion channel 2
    1248152   *****           3          10.541  M. musculus mRNA for MAP kinase-activated protein kinase 2
    1247678                   3          19.431  Casein alpha
    1247543   **************  4          Cluster [44 genes] in cluster [distNext: 1.035] wiCdist:mn+-sd=0.439+-0.266 CV=0.606  RAS-related C3 botulinum substrate 1
    1381923   ************    4          0.158  Prolyl 4-hydroxylase, beta polypeptide
    1382052   ************    4          0.209  Trans-acting transcription factor 1
    1247882   ***********     4          0.237  Mus musculus AMP activated protein kinase mRNA, complete cds
    1248099   ***********     4          0.246  Mus musculus mitogen-responsive 96 kDa phosphoprotein p96 mRNA, alternatively spliced p67 mRNA, and alternatively spliced p93 mRNA, complete cds
    1248351   ***********     4          0.251  Abl-interactor 1
    1247540   ***********     4          0.255  Mus musculus mRNA for ZIP-kinase, complete cds
    1248316   ***********     4          0.26  Mus musculus proteasome alpha7/C8 subunit mRNA, complete cds
    1382671   ***********     4          0.264  Mouse MA-3 (apoptosis-related gene) mRNA, complete cds
    1382014   ***********     4          0.277  Transcription elongation factor B (SIII), polypeptide 1 (15 kDa),-like
    1247885   ***********     4          0.289  Mus musculus mRNA for ryudocan core protein, complete cds
    1248294   ***********     4          0.292  Mus musculus thioredoxin-related protein mRNA, complete cds
    1382066   ***********     4          0.306  Inhibitor of DNA binding 2
    1248597   ***********     4          0.307  Lipocortin 1
    1248591   ***********     4          0.324  Interferon beta, fibroblast
    1248445   **********      4          0.333  Mus musculus beta prime coatomer protein mRNA, partial cds
    1247775   **********      4          0.34  House mouse; Musculus domesticus male brain mRNA for ARF1, complete cds
    1382750   **********      4          0.340  Thymoma viral proto-oncogene
    1247905   **********      4          0.341  Monokine induced by gamma interferon
    1381668   **********      4          0.351  Mus musculus mitogen-activated protein kinase-activated protein kinase mRNA, complete cds
    1381811   **********      4          0.356  Protein tyrosine phosphatase, receptor type, D
    1382031   **********      4          0.358  Protease (prosome, macropain) 28 subunit, beta
    1248345   **********      4          0.363  Mus musculus alpha-methylacyl-CoA racemase mRNA, complete cds
    1382555   **********      4          0.364  Lysosomal membrane glycoprotein 1
    1247820   **********      4          0.367  Tight junction protein 1
    1247598   **********      4          0.374  Retinoblastoma 1
    1247595   **********      4          0.378  PROBABLE CALCIUM-BINDING PROTEIN PMP41
    1381928   **********      4          0.379  Mus musculus MRJ (Mrj) mRNA, complete cds
    1248196   **********      4          0.399  Max protein
    1381691   **********      4          0.423  SRY-box containing gene 17
    1248225   **********      4          0.434  Mus musculus heat shock transcription factor 1 (Hsf1) gene, partial cds
    1248084   **********      4          0.442  Mus musculus Supl15h gene
    1247941   *********       4          0.453  Fibroblast growth factor inducible 14
    1381623   *********       4          0.468  Stearoyl-coenzyme A desaturase 1
    1248202   *********       4          0.473  Mouse mRNA for PAP-1, complete cds
    1382115   *********       4          0.512  GLUTATHIONE S-TRANSFERASE GT8.7
    1382044   *********       4          0.515  Cartilage derived retinoic acid sensitive protein
    1381636   ********        4          0.567  Lymphotoxin B
    1381920   ********        4          0.569  Mus musculus mRNA for NEFA protein, complete cds
    1247757   ********        4          0.596  Granzyme B
    1382094   ********        4          0.609  High mobility group protein 1
    1247545   ********        4          0.638  Carbon catabolite repression 4 homolog (S. cerevisiae)
    1247607   ***             4          1.188  POLYADENYLATE-BINDING PROTEIN
    1247727                   4          1.667  Malate dehydrogenase, mitochondrial
    1248244   **************  5          Cluster [19 genes] in cluster [distNext: 3.473] wiCdist:mn+-sd=4.273+-2.059 CV=0.482  CD80 antigen
    1248534   **********      5          1.648  Carbonyl reductase
    1247764   **********      5          1.776  H-2 CLASS II HISTOCOMPATIBILITY ANTIGEN, GAMMA CHAIN
    1381933   *********       5          2.345  Mouse rpS17 mRNA for ribosomal protein S17, complete cds
    1381616   *********       5          2.42  Mus musculus oral tumor suppressor homolog (Doc-1) mRNA, partial cds
    1248232   *********       5          2.486  Mus musculus putative glycogen storage disease type 1b protein mRNA, complete cds
    1382644   ********        5          2.717  Cyclin G
    1248125   ********        5          2.791  Histocompatibility 2, class II, locus Mb2
    1247799   ********        5          2.869  Mus musculus signal recognition particle receptor beta subunit mRNA, complete cds
    1247708   ********        5          3.024  Ephrin A1
    1247932   ******          5          4.235  Mus musculus (clone: pMAT1) mRNA, complete cds
    1382515   *****           5          4.668  ATPase, Na+/K+ beta 3 polypeptide
    1248586   *****           5          4.838  Mus musculus viral envelope like protein (G7e) gene, complete cds
    1248198   ***             5          5.874  Mus musculus D9 splice variant 2 mRNA, complete cds
    1381623   **              5          6.224  Stearoyl-coenzyme A desaturase 1
    1382086   *               5          6.885  Mus musculus (strain C57Bl/6) mRNA sequence
    1247887   *               5          7.014  Mouse chromosome 6 BAC-284H12 (Research Genetics mouse BAC library) complete sequence
    1247886                   5          7.810  Cut (Drosophila)-like 1
    1248303                   5          8.094  Lipopolysaccharide response
    1247621   **************  6          Cluster [17 genes] in cluster [distNext: 19.157] wiCdist:mn+-sd=12.410+-3.024 CV=0.244  Mus musculus Lsc (lsc) oncogene mRNA, complete cds
    1248050   *******         6          7.407  Mus musculus C57BL/6J ribosomal protein S28 mRNA, complete cds
    1247698   *******         6          7.571  Adipocyte protein aP2
    1248240   *****           6          9.198  Mus musculus mRNA, complete cds
    1247862   ****            6          9.844  Mus musculus Nmi mRNA, complete cds
    1382162   ****            6          10.330  CAMP responsive element modulator
    1248398   ***             6          11.007  Mouse mRNA for ribosomal protein S12
    1248281   ***             6          11.143  M.musculus mRNA for histone H3.3A
    1247852   ***             6          11.576  Twist gene homolog, (Drosophila)
    1381991   **              6          12.809  Prolyl 4-hydroxylase, beta polypeptide
    1382753   **              6          13.019  Mus musculus cleavage and polyadenylation specificity factor (MCPSF) mRNA, complete cds
    1248368   *               6          13.639  Mus musculus ribosomal protein S26 (RPS26) mRNA, complete cds
    1247639   *               6          13.692  SRY-box containing gene 4
    1248435                   6          14.262  Thymus cell antigen 1, theta
    1247961                   6          14.75  ATP SYNTHASE ALPHA CHAIN, MITOCHONDRIAL PRECURSOR
    1248344                   6          15.217  Gut enriched Kruppel-like factor
    1382234                   6          16.351  CD8 antigen, beta chain
    
    

    We call the genes closest to the "center" of each of the K clusters primary genes, and they are reported with additional information. The "Cluster [# genes]" entries in the distance-to-cluster field indicate that these genes are the centers of the clusters (i.e. primary genes). The distNext value is the distance from this cluster center to the next nearest K-means cluster center. The number of clusters N (6 in this example) is set in the popup state scroller. If you change the value of N, MAExplorer will recompute the clusters and the primary genes.
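
    The summary fields on the primary-gene lines of the report above can be checked by hand: the reported CV equals sd divided by mean (for cluster 3, 3.347/7.559 is about 0.443). The following Java fragment is a minimal sketch of how distNext and CV might be computed. It assumes Euclidean distance between expression profiles, which this section does not state, and the class and method names are hypothetical and not part of MAExplorer.

```java
import java.util.List;

/** Hypothetical helper for illustration; not part of MAExplorer's API. */
final class ClusterStatsSketch {

    /** Euclidean distance between two expression profiles (an assumed metric). */
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /** distNext: distance from cluster center k to the nearest other cluster center. */
    static double distNext(List<double[]> centers, int k) {
        double best = Double.POSITIVE_INFINITY;
        for (int j = 0; j < centers.size(); j++) {
            if (j != k) {
                best = Math.min(best, distance(centers.get(k), centers.get(j)));
            }
        }
        return best;
    }

    /** Coefficient of variation of the within-cluster distances: sd / mean. */
    static double cv(double mean, double sd) {
        return sd / mean;
    }
}
```

    For example, ClusterStatsSketch.cv(7.559, 3.347) reproduces the CV reported for cluster 3 to three decimal places.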

    MAExplorer draws magenta circles around the primary genes in the microarray, with the cluster number to the right of each circle. The size of a circle corresponds to the number of genes clustered with that circle. If you click on a gene belonging to any cluster, that cluster becomes the "current cluster". The labels of the subset of genes that belong to the current cluster then change from a red (white) circle to the green (yellow) cluster number of the current cluster in the intensity (ratio) pseudoarray image. In addition, the 'edited gene list' is set to the subset of genes that belong to the current cluster. If you are also displaying a scatter plot, genes in the current cluster have their red '+' characters changed to the cluster number.

    You can click on a gene in the array image to determine its identity. You may also pop up a plot of the clusters' expression profiles, ordered the same way as the above report, by clicking on the EP plot button. You may plot the mean expression profiles of the N clusters using the Mean EP plot button. You may generate a report of all of the clustered genes or of the mean clusters using the Cluster-Report or Mn-Cluster-Report buttons respectively. If you change the Filter conditions, you may recompute the clusters using the Recompute Clusters button. Closing the text window will remove the magenta circles. If you selected the current cluster, the genes that belong to it will still be available in the 'edited gene list' for making reports, saving as a gene subset or for additional gene filtering. If you press the SaveAs GeneSets button, then K gene sets are created with the names "Cluster#1", "Cluster#2", ..., "Cluster#K". You can then save or rename the clusters you want and delete the rest. If you press the ClusterGram button, it displays the gene sets as a clustergram ordered the same way as the cluster report.

    2.4.5.4 Hierarchical clustering of expression profiles

    The Hierarchical clustering of expression profiles command computes the hierarchical clustering of the expression profiles of data Filtered genes and displays a clustergram and optional dendrogram. Hierarchical clustering is described in (Sneath and Sokol, 1973). The gene data is normalized either by the corresponding HP-X sample data for each gene or by the maximum raw intensities for each HP sample in the expression profile. The choice is set by the Normalize by HP-X else HP's max intensities menu toggle. There are three types of clustering linkages: average-arithmetic-linkage, average-centroid-linkage, and next-minimum-linkage. These may be modified using the weighted average, which gives equi-weighting to the child clusters in computing the mean of a new cluster, or the un-weighted average, which weights them by the number of non-terminal clusters. The average-linkage clustering is very compute-intensive and takes a while. The next-minimum-linkage is much faster and may result in adequate clustering for some situations.
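
    The difference between the two averaging options can be illustrated with a short Java sketch that merges the mean profiles of two child clusters. This is only an illustration under stated assumptions: the class and method names are hypothetical, and the counts used for the un-weighted average are passed in as parameters because this section does not show how MAExplorer tallies them.

```java
/** Hypothetical helper for illustration; not part of MAExplorer's API. */
final class LinkageMeanSketch {

    /** "Weighted" average: each child cluster contributes equally to the new mean. */
    static double[] weightedMean(double[] left, double[] right) {
        double[] mean = new double[left.length];
        for (int i = 0; i < left.length; i++) {
            mean[i] = 0.5 * (left[i] + right[i]);
        }
        return mean;
    }

    /** "Un-weighted" average: each child is weighted by its count (assumed supplied by the caller). */
    static double[] unweightedMean(double[] left, int nLeft, double[] right, int nRight) {
        double[] mean = new double[left.length];
        for (int i = 0; i < left.length; i++) {
            mean[i] = (nLeft * left[i] + nRight * right[i]) / (nLeft + nRight);
        }
        return mean;
    }
}
```

    With child means {0, 2} (count 3) and {4, 6} (count 1), the weighted mean is {2, 4} while the un-weighted mean is {1, 3}, so a large child cluster pulls the new mean toward itself only in the un-weighted case.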

    Clustering is represented by a binary tree and is visualized as an ordered gene clustergram and optional dendrogram sub-plot. This is similar to the methods of (DeRisi, 1996), (Eisen, 1998), and (White, 1999). Currently, MAExplorer does 1-way clustering, not the 2-way clustering of (Weinstein, 1998) and (Eisen, 1998). Each row of the clustergram represents a gene and each column represents an HP in the HP-E list of samples. Each box in a row represents the normalized expression of that gene for the HP represented in that column. The color of the box is one of 9 colors representing the normalized expression ranges, assigned according to the following table:

    
    

    Table 2.4.5.4. ClusterGram pseudocolor assignments. The colors are assigned to "box" entries in the clustergram corresponding to genes. The color represents data as either the X/Y ratio or X-Y Zdiff relative to the normalizing HP.


    Color:  bright green  .     .     dark green  black  dark red  .   .   bright red
    Range:  <1/8X         1/6X  1/4X  1/2X        1X     2X        4X  6X  >8X
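
    As an illustration of how a ratio could be mapped onto the nine ranges of Table 2.4.5.4, the Java sketch below returns a bin index from 0 (bright green) to 8 (bright red). The binning rule is an assumption: ratios below 1/8 and above 8 go to the end bins, and all other ratios go to the interior bin whose nominal value is nearest on a log scale. The manual does not state the exact bin boundaries, and the class and method names are hypothetical.

```java
/** Hypothetical helper; the binning rule is an assumption, not MAExplorer's documented behavior. */
final class ClusterGramColorSketch {

    /** Range labels from Table 2.4.5.4, from bright green (index 0) to bright red (index 8). */
    static final String[] LABELS = {"<1/8X", "1/6X", "1/4X", "1/2X", "1X", "2X", "4X", "6X", ">8X"};

    /** Nominal ratios of the seven interior bins (indices 1 to 7). */
    private static final double[] INTERIOR = {1.0 / 6, 1.0 / 4, 1.0 / 2, 1.0, 2.0, 4.0, 6.0};

    /** Returns the color bin index (0 to 8) for a positive X/Y ratio. */
    static int colorBin(double ratio) {
        if (ratio < 1.0 / 8) {
            return 0;
        }
        if (ratio > 8.0) {
            return 8;
        }
        int best = 0;
        double bestDiff = Double.POSITIVE_INFINITY;
        for (int i = 0; i < INTERIOR.length; i++) {
            // Compare on a log scale so that 2X and 1/2X are equally far from 1X.
            double diff = Math.abs(Math.log(ratio) - Math.log(INTERIOR[i]));
            if (diff < bestDiff) {
                bestDiff = diff;
                best = i;
            }
        }
        return best + 1;
    }
}
```

    Under this assumed rule, a ratio of 1.0 maps to the black 1X bin (index 4) and a ratio of 0.1 maps to the bright green end bin (index 0).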
    
    
    The current gene may be set by clicking on a row, which is then highlighted in green. If you click on a colored box, it will also report the HP name for that column and its normalized expression value (highlighting that box with a white circle). If the Web genomic databases are enabled (through the View menu), then it will also pop up a Web page for that gene. If you set the current gene in any of the array, scatter plot, gene guesser, etc. displays, the current gene is also set in the clustergram, which is positioned at that gene. If the Dendrogram checkbox is enabled, then a dendrogram is drawn to the left of the clustergram boxes. Clicking on a region in the dendrogram sets a distance threshold (displayed at the top) and displays in red all parts of the dendrogram tree that have a cluster distance less than the threshold you defined. If the zoom nnX button is pressed, then the dendrogram drawing is magnified nn-fold to make highly similar clusters more visible. Pressing the button repeatedly cycles through: 1X, 2X, 5X, 10X, 20X. Sub-regions of the clustergram may be explored in more detail using the EP plot button, which pops up a scrollable window of the ordered gene list. You may generate multiple EP-subset plots so as to compare different parts of the clustergram. A report of all of the ordered genes may be created using the ClustGram Report button. The Show HP names button pops up a numbered list of all samples used in the expression profiles and clustergram. This report has all of the normalized expression profiles on the right side of the report.

    
    
    A) Hierarchical clustering showing ClusterGram with current gene in green

    B) Hierarchical clustering showing ClusterGram and Dendrogram with thresholded dendrogram in red

    C) Selecting genes from Hierarchical clustering plots for the E.G.L. gene list

    Figure 2.4.5.4 Hierarchical clustering clustergram of genes filtered by ratio histogram bins for 19 samples from the MGAP data set. The hybridized samples are drawn as colored boxes in the 19 columns. Rows of boxes correspond to gene expression profiles. In A), the set of all genes and ESTs was filtered by the CV filter set to 0.387 and the normalization was the Zscore. The gene "Mus musculus D9 splice variant 2 mRNA, complete cds" was selected as the current gene in the clustergram. Data for this gene and the selected HP column is indicated at the top of the clustergram. The list of the 19 samples is shown on the left. B) Details of the clustergram and dendrogram are shown where the user had selected a cluster distance threshold at "Mouse mRNA for mitochondrial cytochrome c oxidase subunit Vb" in the dendrogram part of the plot (zoomed by 2X). This selection draws in red all parts of the dendrogram tree that are less than this distance. C) shows the manual selection of genes from the ClusterGram or Dendrogram by clicking on the gene names you wish to capture in the Edited Gene List (EGL) while the Control key is pressed. The zoomed subregion shows three genes in the same cluster that were selected (magenta stars at the right edge of the ClusterGram).

    
    


    2.4.6 Report menu

    Various reports summarizing gene or sample data may be generated and appear in popup tables. These include:
    • Data on all of the hybridized samples in the database including links to related Web databases
    • Data on additional quantification and descriptive information on the hybridized array samples in the database
    • Hybridized sample calibration, Samples vs. Samples correlation coefficient of Filtered genes, and mean & variance tables
    • Tables of all named genes, the calibration DNA genes, etc.
    • A table of genes in the 'Edited Gene List'
    • A table of genes in the current Gene Class
    • A table of genes passing the "Filter"
    • The N genes with the highest or lowest ratios (HP-X/HP-Y or F1/F2) of the Filtered genes (N defaults to 100, but may be changed as a Preference)
    • Tables of additional data for the Filtered genes including expression profile ratios, HP-X vs. HP-Y 'set' statistics if the data is enabled for using HP-X and HP-Y 'set' data. If you are doing a t-test or Kolmogorov-Smirnov test, then it reports the p-value and test statistics.
    • Table of statistics on F-test on the current Ordered Condition List of samples/condition, including p-value and F-test statistics.
    • Tables of correlation coefficient statistics of data Filtered genes comparing all samples in the HP-E list against each other.
    • Tables are presented as either: a) an active "clickable" spreadsheet that may contain links to Web sites and pops up a Web browser window to that site (e.g. histology links or model system in the array report), or display additional gene features and numeric comparisons. b) tab-delimited text that may be "cut" and then "pasted" into an Excel type spreadsheet for further analysis.
    • Additional expression profile data (from the HP-E samples) and/or statistical data from the HP-X vs. HP-Y 'sets' may be included in gene report tables.
    • Clicking on a column name in the dynamic report will sort the report by the data in that column, in ascending or descending order.
    
    

    Samples Reports menu

    Figure 2.4.6 Reports menu. You may create either dynamic or tab-delimited text reports of either Samples or of subsets of genes.


    
    

    These may be presented as interactive dynamic tables as well as scrollable text windows capable of being exported to Excel. If Web DB access is enabled, clicking on an entry will bring up a Web browser with access to GenBank data. If the report contains Clone ID as one of the fields, you can click on it to have it define that gene as the current gene and highlight it in the microarray image or scatter plot (if it is being used). The reports are divided into two types - those dealing with lists of arrays (i.e. the sample experimental condition) and those dealing with lists of genes.

    The Report menu includes:

    • Array reports - show tables of sample array information.
    • Gene reports - show tables of information on sets of genes.
    • --------------------
    • Table format - generates the table as 'Spreadsheet', 'Tab delimited', or 'Name=Value list'.
    • Table font size - changes the report font size from default 10 point.

    2.4.6.1 Array report menu - hybridized samples global data

    You may generate reports of sample array information. The first two menu selections contain descriptive information about specific hybridized microarray samples. The "Extra Samples info" contains quantitative and extra descriptive information (if available for your database).

    The "Samples vs Samples correlation coefficients" report computes the correlation coefficients for the current set of Filtered genes and presents them in an upper diagonal matrix showing the similarity of the HP samples. The entries are of the following form, where HP:1 and HP:2 correspond to samples listed in the field names of the table and the data are the intensity values using the current normalization method.

      rSq=0.748, n=1656, HP:1(mn+-sd)=(28991+-19564), HP:2(mn+-sd)=(5044+-9766)
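
    A minimal Java sketch of how one such entry could be produced for two samples is shown here. It assumes that rSq is the square of the Pearson correlation coefficient and that the standard deviation uses the n-1 denominator. Neither detail is stated in this section, and the class and method names are hypothetical.

```java
import java.util.Locale;

/** Hypothetical helper for illustration; not part of MAExplorer's API. */
final class SampleCorrelationSketch {

    static double mean(double[] v) {
        double sum = 0.0;
        for (double x : v) {
            sum += x;
        }
        return sum / v.length;
    }

    /** Sample standard deviation (n-1 denominator, an assumption). */
    static double stdDev(double[] v) {
        double m = mean(v);
        double ss = 0.0;
        for (double x : v) {
            ss += (x - m) * (x - m);
        }
        return Math.sqrt(ss / (v.length - 1));
    }

    /** Pearson correlation coefficient of two equal-length intensity vectors. */
    static double pearson(double[] x, double[] y) {
        double mx = mean(x);
        double my = mean(y);
        double sxy = 0.0, sxx = 0.0, syy = 0.0;
        for (int i = 0; i < x.length; i++) {
            double dx = x[i] - mx;
            double dy = y[i] - my;
            sxy += dx * dy;
            sxx += dx * dx;
            syy += dy * dy;
        }
        return sxy / Math.sqrt(sxx * syy);
    }

    /** Formats one matrix entry in the style of the example above. */
    static String entry(double[] hp1, double[] hp2) {
        double r = pearson(hp1, hp2);
        return String.format(Locale.ROOT,
                "rSq=%.3f, n=%d, HP:1(mn+-sd)=(%.0f+-%.0f), HP:2(mn+-sd)=(%.0f+-%.0f)",
                r * r, hp1.length, mean(hp1), stdDev(hp1), mean(hp2), stdDev(hp2));
    }
}
```

    Here each array holds the normalized intensities of the same Filtered genes in one sample, so n is the number of Filtered genes.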
    

    The "Calibration DNA summary" table contains the computed means, std-dev, and computed normalization scale factor for all active hybridized samples. The scale factors are used if the 'Calibration DNA' normalization is used.

    You must set the Web access checkbox if you want to click on a blue hyperlink in the resulting report to access an associated Web database.

    • Samples info - show table of all samples accessible in the database.
    • Samples Web links - show table of Web links for all samples accessible in the database.
    • MAE Projects DB - show table of project databases if the project startup database exists (stand-alone mode).
    • Extra info on Samples - show table of additional information on all of the samples in the database.
    • Sample vs Sample correlation coefficients - show table of correlation coefficients for all HP-E selected samples for genes meeting the data Filter criteria.
    • 'Calibration DNA' summary - show a table of means, std-dev, scale factor for all HP-E selected samples.
    • HP mean & variance summary - show table of means, std-dev, min, max, etc. of raw intensity data for all HP-E selected samples.

    
    

    A) Scrollable dynamic samples report

    B) Scrollable dynamic samples report with clickable Web links in blue

    C) Scrollable dynamic report of samples vs samples correlation using current data Filtered genes

    Figure 2.4.6.1 Hybridized samples dynamic Report windows. A) Samples Info report. B) Sample Web links. Clicking on a blue hypertext link brings up the corresponding genomic Web database entry in a separate Web browser window if the Web access is enabled. The tab-delimited version of the same reports (not shown) may be cut and then pasted into other programs such as an Excel spreadsheet. C) HP vs HP correlation table on genes passing the data Filter for all samples in the HP-E list.


    
    

    2.4.6.2 Gene reports menu

    You may generate gene reports with various additional options. You must set the Web access checkbox if you want to click on a blue hyperlink in the resulting report to access an associated Web database. In addition, specialized gene reports may be generated from some of the cluster plot command windows. These include lists of genes sorted by cluster (K-means cluster #), by hierarchical cluster order, by similarity to a gene, etc. The mean cluster expression values may be reported for K-means clustering.
    • All named genes - show table of all genes where the gene is named.
    • Genes in 'Edited Gene List' - show table of genes in the 'Edited Gene List'.
    • Genes in 'Normalization Gene List' - show table of genes in the 'Normalization Gene List'.
    • Genes in GeneClass - show table of genes in the current gene class.
    • Filtered gene reports - show genes that meet the data Filter criteria.

    2.4.6.2.1 Filtered gene reports menu

    You may generate gene reports of Filtered genes with various additional presentation options. In the highest/lowest N genes reports, N defaults to 100 and is set by the (Report | Table format | Set max # genes in highest/lowest report) command.
    • Genes passing the Filter - all genes passing the Filter
    • Highest HP-XY ratios or Zdiff - top N Filtered genes
    • Lowest HP-XY ratios or Zdiff - lowest N Filtered genes
    • Highest F1/F2 ratios or Zdiffs - top N Filtered genes (if duplicates). If ratio data, then highest Cy3/Cy5
    • Lowest F1/F2 ratios or Zdiffs - lowest N Filtered genes (if duplicates). If ratio data, then lowest Cy3/Cy5.
    • Expression profiles Filtered genes - show numeric expression ratios for HP-E data scaled to the first HP in HP-E.
    • HP-XY 'set' statistics of Filtered genes - show (mean,stdDev,CV,n) data for both the HP-X and HP-Y sets. If the t-Test is active, also show (p-value, df, t-statistic, pF-sameValue). If the Kolmogorov-Smirnov test is active, then it reports the (p-value, df, D-statistic).
    • Ordered Condition list statistics of Filtered genes - for the current OCL if the F-test is active, show the (f-statistic, dfWithin, dfBetween, mnSqWithin, mnSqBetween) (mean,stdDev,CV,n) data for each condition.

    If Cy3/Cy5 ratio data is being analyzed, then the Highest (Lowest) F1/F2 entries become

    • Highest Cy3Cy5 ratios or Zdiffs - top N Filtered genes (if duplicates)
    • Lowest Cy3Cy5 ratios or Zdiffs - lowest N Filtered genes (if duplicates)
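
    The highest/lowest N selection described above amounts to sorting the Filtered genes by ratio and keeping the first N. The Java fragment below is a minimal sketch of that idea. The GeneRatio class and the method names are hypothetical and do not reflect how MAExplorer stores its gene lists.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical helper for illustration; not part of MAExplorer's API. */
final class RatioReportSketch {

    /** A gene name paired with its HP-X/HP-Y (or F1/F2) ratio. */
    static final class GeneRatio {
        final String name;
        final double ratio;

        GeneRatio(String name, double ratio) {
            this.name = name;
            this.ratio = ratio;
        }
    }

    /** Returns at most n genes with the highest ratios, highest first. */
    static List<GeneRatio> highest(List<GeneRatio> filtered, int n) {
        List<GeneRatio> sorted = new ArrayList<>(filtered);
        sorted.sort(Comparator.comparingDouble((GeneRatio g) -> g.ratio).reversed());
        return sorted.subList(0, Math.min(n, sorted.size()));
    }

    /** Returns at most n genes with the lowest ratios, lowest first. */
    static List<GeneRatio> lowest(List<GeneRatio> filtered, int n) {
        List<GeneRatio> sorted = new ArrayList<>(filtered);
        sorted.sort(Comparator.comparingDouble((GeneRatio g) -> g.ratio));
        return sorted.subList(0, Math.min(n, sorted.size()));
    }
}
```

    Calling highest(filtered, 100) corresponds to the default N of 100. If fewer than N genes pass the Filter, all of them are returned.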

    
    

    A) Scrollable dynamic report of 50 named genes sorted by HP-X/HP-Y ratio

    B) Same report as tab-delimited text area that may be cut and pasted to Excel or saved as text file for export

    Figure 2.4.6.2 Gene Report windows of 50 named genes with highest HP-X/HP-Y 'set' ratios. A) Dynamic gene report of 50 genes with highest HP-X/HP-Y 'set' ratios. A similar report may be generated for the lowest ratios or for single HP-X/HP-Y samples. This type of report may be generated for the highest or lowest Zdiff values when the Zscore normalizations are used. Clicking on a blue hypertext link brings up the corresponding genomic Web database entry in a separate Web browser window if the Web access is enabled. It also sets the current gene to the gene for that row. B) The tab-delimited version of the same report may be cut and then pasted into other programs such as an Excel spreadsheet.


    
    

    2.4.6.3 Table format menu

    The report is presented as a table. However, it may be visualized in several different ways. The scrollable spreadsheet includes the ability to click on blue hypertext items and have a Web browser pop up for that item on a Web database (e.g. GenBank, dbEST, UniGene, LocusLink, mAdb Genes, GeneCard, etc). The tab-delimited option enables you to cut the table and paste it into a separate spreadsheet program such as Excel. You may also extend the data in the table by 'Adding' expression profile ratios and statistics from the HP-X and HP-Y 'set' comparisons.
    • Spreadsheet [RB] - some cells (indicated by blue) are connected to Web databases that are accessed by clicking on them (default).
    • Tab delimited [RB] - suitable for export to Excel, etc.
    • --------------------
    • Set max # genes in highest/lowest report or filter [CB] - sets the number of genes N to report or use in data Filter.
    • Add EP data to Gene-Reports [CB] - for each of the samples in HP-E.
    • Use EP data in Gene-Reports [CB] - use raw EP data (under the current normalization) else use data normalized to 0.0 to 1.0 by maximum value for all genes being displayed.
    • Add HP-X/-Y 'set' statistics data to Gene-Reports [CB] - that includes (mean,stdDev,CV,n) for both the HP-X and HP-Y sets.
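
    The tab-delimited format and the 0.0 to 1.0 scaling of expression profile (EP) data mentioned in the list above can be sketched in Java as follows. This is an illustration under stated assumptions: intensities are taken to be non-negative, values are printed with three decimals, and the class and method names are hypothetical rather than MAExplorer's own.

```java
import java.util.Locale;

/** Hypothetical helper for illustration; not part of MAExplorer's API. */
final class TabReportSketch {

    /** Scales all EP values by the maximum over all displayed genes, giving the range 0.0 to 1.0. */
    static double[][] normalizeByMax(double[][] ep) {
        double max = 0.0;
        for (double[] row : ep) {
            for (double v : row) {
                max = Math.max(max, v);
            }
        }
        double[][] out = new double[ep.length][];
        for (int i = 0; i < ep.length; i++) {
            out[i] = new double[ep[i].length];
            for (int j = 0; j < ep[i].length; j++) {
                out[i][j] = max > 0.0 ? ep[i][j] / max : 0.0;
            }
        }
        return out;
    }

    /** Builds a tab-delimited table: one header line, then one line per gene. */
    static String toTabDelimited(String[] header, String[] geneNames, double[][] values) {
        StringBuilder sb = new StringBuilder(String.join("\t", header)).append('\n');
        for (int i = 0; i < geneNames.length; i++) {
            sb.append(geneNames[i]);
            for (double v : values[i]) {
                sb.append('\t').append(String.format(Locale.ROOT, "%.3f", v));
            }
            sb.append('\n');
        }
        return sb.toString();
    }
}
```

    Text in this form can be pasted directly into a spreadsheet program, with one column per HP-E sample.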

    2.4.6.4 Table font size menu

    For wider tables, you can see more information if you use a smaller font to display the table. The font sizes available are:
    • 12pt [RB]
    • 10pt [RB] - (the default size)
    • 8pt [RB]

    
    


    2.5 View menu

    The View menu options are used to modify the view of genes visible in the pseudoarray image. Genes may be displayed with additional properties or capabilities including access to Web-based genomic database entries for specific genes. Note that depending on your particular database, if some genomic identifiers are not available then the corresponding "Enable display current gene in genomic DB Web browser" will not appear in the menu.

    Example of View Menu

    Figure 2.5 View Menu options. These are divided into various options for modifying the presentation as well as recording activity such as the messages or history popup scrollable log windows.


    
    

    A) Example of UniGene popup Web browser page for current gene

    B) Example of NCI/CIT mAdb popup Web browser page for current gene

    C) Example of LocusLink popup Web browser page for current gene

    Figure 2.5. Popup genomic browser database page. A) The UniGene Web page pops up in a new Web browser window when the user clicks on a gene in the array image, 2D scatter plot or Report while the "Display current gene in Unigene Web Browser" toggle is enabled in the View menu. The current gene was "Jun-B oncogene". B) Alternatively, the mAdb Gene DB may be selected, as well as the GenBank or dbEST genomic databases. C) Data from the NCBI LocusLink database may also be accessed if either the GenBank ID or LocusID is available.


    
    

    2.5.1 Logging MAExplorer messages

    MAExplorer shows various data measurements as well as many other types of information in the three text lines in the status area of the main window. The Show log of messages command pops up a scrollable log of all messages sent to the three-line status area. This is useful for recording measurements and other activity. The messages may be saved in a log file (typically maeMessages.log). Figure 2.5.2 shows an example of the messages popup log window. Clicking on genes in the pseudoarray image or in plots will log the gene data (see Section 3.3) given the current normalization, Samples use (single or multiple), and pseudoarray display mode. The current values of all of the State Threshold scrollers are saved in the message log when the (Edit menu | Preferences | Adjust all Filter threshold scrollers) State Thresholds popup window is closed. This is useful for capturing the current settings at any time.

    2.5.2 Logging command history

    During a datamining session, the user will typically execute many commands from the menu as well as click on genes in the pseudoarray image or in plots. It is useful to record the steps you took during this analysis. The Show log of command history command pops up a scrollable history of all commands issued to MAExplorer. The commands are automatically numbered. The history may be saved in a log file (typically maeHistory.log). Figure 2.5.2 shows an example of the command history popup log window.

    Examples of messages and command history popup log windows

    Figure 2.5.2 Examples of messages and command history popup log windows. Measurements and other activity are shown in more detail in the messages window whereas the command history indicates commands (numbered in the order they are executed) in the command history window. Data from either of these windows may be saved in text log files.



    2.6 Plugins menu

    MAExplorer may be extended by users with new analysis methods written as Java plugins. We call these new methods MAEPlugins. They are small Java programs written by users that may be dynamically loaded into MAExplorer and then applied to their data. These plugins will include plugins written by LECB as well as those written by academic or commercial groups. See MAEPlugins for details. If you have a compiled Java plugin in the form of either a Java .class or .jar file, you may load it at run time using the "Load plugin" command in the Plugins menu. If a submenu is specified in the MAEPlugin, the plugin will be added to the appropriate menu in the MAExplorer menu tree at the end of the specified submenu (see Appendix C. Table C.5.7). If this submenu "stub" is not specified, the plugin will be placed in the list of plugins in the Plugins menu (e.g. plugin #1, ..., plugin #n).

    MAEPlugin paradigm

    Figure 2.6 MAEPlugins paradigm. If you have a MAEPlugin .jar file, then it may be specified using the "Load plugin" command. When you invoke the command from the menus (or by other methods), the plugin accesses any data it may need from the current MAExplorer database through the Open Java API.


    
    
    The Plugins menu includes:

    • Load plugin - popup a browser to load a MAEPlugin jar file
    • Unload plugin(s) - popup a selector to pick MAEPlugins to unload
    • -------------------------------------
    • RLO methods - list of executable R methods (R LayOuts). These may be added by the user using the RtestPlugin. An RLO analysis allows you to export data from MAExplorer, execute it with the R program, and import the R results back into MAExplorer. [This is under development and is alpha-level.]
    • Save RLO reports in time-stamped Report/ folder [CB] - puts files generated by R from successive executions of the same RLO into separate sub-folders in the Report/ folder with names "RLOname-YYMMDD-HHMMSS/" to keep the data separate.
    • -------------------------------------
    • plugin #1 - 1st MAEPlugin added
    • plugin #2 - 2nd MAEPlugin added
    • . . .
    • plugin #n - n'th MAEPlugin added

    RLO methods menu

    This contains a list of executable R analysis methods (called R LayOuts or RLOs, created with the RtestPlugin) for evaluating MAExplorer data with R analysis scripts. It is only available if you have installed the R language program (www.r-project.org) on your computer. An RLO analysis allows you to automatically export data from MAExplorer, execute it with the associated R program, and import the R results back into MAExplorer. [This is under development and is alpha-level.] A recent poster on Extending MAExplorer with R is available as a PDF file.

    The Save RLO reports in time-stamped Report/ folder [CB] option puts files generated by R from successive executions of the same RLO into separate sub-folders in the Report/ folder with names "RLOname-YYMMDD-HHMMSS/" to keep the data separate. This is useful when you want to compare results from the same RLO method but with different MAExplorer preprocessing.
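
    The sub-folder naming described above can be illustrated with a minimal Java sketch. It assumes the "YYMMDD-HHMMSS" stamp means two-digit year, month, day, then 24-hour time. The class and method names are hypothetical and are not taken from the MAExplorer source.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

final class RloReportFolderSketch {
    // "YYMMDD-HHMMSS" from the text, written as a java.time pattern (assumed meaning).
    private static final DateTimeFormatter STAMP =
            DateTimeFormatter.ofPattern("yyMMdd-HHmmss");

    /** Build "Report/<rloName>-YYMMDD-HHMMSS/" for one execution of an RLO. */
    static String folderFor(String rloName, LocalDateTime when) {
        return "Report/" + rloName + "-" + when.format(STAMP) + "/";
    }

    public static void main(String[] args) {
        // Two runs of the same RLO at different times get different folders.
        System.out.println(folderFor("myRLO", LocalDateTime.now()));
    }
}
```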

    You may download the latest versions of all plugins using the (File | Update Plugins from maexplorer.sourceforge.net) menu command. Similarly, you can update your versions of the RLO methods using the (File | Update RLO methods from maexplorer.sourceforge.net) menu command.

    2.6.1 Example of using a Plugin

    This is a short demonstration of what is involved in using a MAEPlugin. Plugin .jar or .class files are generally stored in the Plugins/ directory where you installed MAExplorer. The user first loads a particular plugin from disk, which installs it in the Plugins pull-down menu, and then revisits that menu to invoke it. You may load any number of plugins, limited only by available computer memory.

    Loading a Plugin from the Plugins 'Load plugin' menu

    Figure 2.6.1 Loading a MAEPlugin from your file system using the Load plugin command in the Plugins pull-down menu. If you have a plugin .jar or .class file, it may be specified using the "Load plugin" command. This pops up a file browser to let you specify the plugin file.

    Selecting the loaded plugin command from the Plugin menu

    Figure 2.6.2 Executing the new command previously loaded in the Plugin menu. Selecting the new "Show List Active Filters" command that now appears in the Plugins menu invokes the plugin. This pops up a report shown in the next figure.

    Popup window from executing the MAEPlugin

    Figure 2.6.3 Popup window from executing the MAEPlugin. This plugin gives a full report on the data Filter status in a new pop up window.


    Selecting a previously loaded plugin from the Plugins menu

    Figure 2.6.2 Plugins menu - executing a previously loaded plugin. Plugins that do not go into particular MAExplorer submenus go into the Plugins menu. Selecting the command will invoke that MAEPlugin.


    
    


    2.7 Help menu

    Various on-line help pages and documents are available if you are connected to the Internet. These appear in a separate pop-up Web browser window so you may view them while working with MAExplorer. They include on-line documentation (including this reference manual), tutorials, and other information. There may also be links to other Web pages describing key areas of specific databases. For example, for the MGAP database, they point back to key areas of MGAP including the MGAP Animal Models, Histology atlas, etc. You can then use the browser's "Save as" and "Print" options to save the data to a file or print it.

    The Help menu includes:

    • Home page - MAExplorer home page on SourceForge.net
    • Introduction - to MAExplorer
    • Overview - overview of MAExplorer capabilities
    • Short tutorial - simple tutorial of things to try in MAExplorer (Appendix A)
    • Advanced tutorial - advanced tutorial of things to try in MAExplorer (Appendix B)
    • Menu summary - short summary of the MAExplorer menus.
    • Reference manual - this Reference Manual document
    • MAEPlugins - plugin home page for MAExplorer
    • Intro to data mining - for MAExplorer, (Chapter 3)
    • Glossary - glossary of terms used in MAExplorer Reference Manual
    • Index - Index of terms used in MAExplorer Reference Manual
    • -------------------------------------
    • Database-specific help menu entries -
      entries defined for a particular database (see below)
    • -------------------------------------
    • About - show MAExplorer authors, version and revision date

     
    

    2.7.1 Adding custom help links to your database to the Help menu

    The Database-specific help menu entries are keyed to the database you are using. The database maintainer may customize them in the configuration file (Section C.5.6) to provide links relating to the particular database. For example, the database-specific help entries for the MGAP database are:

    • List of hybridized samples - available from MGAP.
    • MGAP animal models - animal models used in MGAP.
    • MGAP home page - Mammary Genome Anatomy Program

    
    


    3. Exploratory Data Analysis - Introduction to Data Mining

    Data mining is the uncovering of relevant patterns of interest in data from a particular problem domain (Tukey, 1977). Typically this involves using various statistical techniques, including cluster analysis, to identify the patterns. See StatSoft Inc.'s 2002 on-line statistics textbook for definitions of clustering and other statistical terms. Researchers across a wide range of fields, such as (Tufte, 1997) and (Cleveland, 1985), have suggested that a major aspect of this problem is finding the correct means of graphical presentation to allow humans to be a part of the pattern recognition process. Tufte argues that the proper display of quantitative data in the context of the problem domain can aid in the understanding of complex sets of data. This carries over to the analysis of microarrays, where successful data mining requires statistical, genomic knowledge database, and graphical components. (Jagota, 2001) discusses a number of methods and applications for microarray data analysis and visualization. Other useful resources are the sets of papers in ("Chipping Forecast", Nature Genetics supplement, Jan, 1999) and ("Chipping Forecast II", Nature Genetics supplement, Dec, 2002).

    This section briefly addresses some of the issues you need to consider. However, a full discussion of the issues involved is beyond the scope of this manual. These issues are covered in other more focused statistical methods literature and you might also address them in consultation with biostatisticians. The Internet has vast resources for microarrays. A few to get you started might include: a microarray citation electronic library http://arrayit.com/e-library/, the National Library of Medicine PubMed journal search engine, and a general microarray Listserv GENE-ARRAYS@ITSSRV1.UCSF.EDU. The MGED group (Brazma, 2001) has published the MIAME standard (Minimum Information About a Microarray Experiment). This information is useful in doing an analysis. Also try searching using general Internet search engines. There are a number of public microarray data repositories. One that we find useful is NCBI's GEO (Gene Expression Omnibus), which contains array data and MIAME-compliant information about the arrays.

    • Data mining is a pattern discovery activity - use all the tools you have.
    • It is open-ended because of the variety of ways data may be partitioned, normalized, pre-filtered, clustered, and viewed. Patterns that are apparent in one view may not be apparent in another.
    • When data-mining microarray data, look at correlated genes from the point of view of what relationships might be biologically interesting, by characterizing genes that cluster together using additional information. For example, check the results with various NCBI and PubMed database searches on the resulting genes, design other lab experiments to better uncover cause and effect, etc. Remember that correlation does not imply cause and effect.

    Organization of Sections in this Chapter

  • 3.1 discusses Objectives in data mining, discovery and analysis,
  • 3.2 describes the steps in an analysis,
  • 3.3 describes how the database may be interrogated for gene spot intensity & identification.
  • 3.4 describes selecting subsets of genes using the data filter.
  • 3.5 describes selecting subsets of hybridized sample conditions.
  • 3.6 describes setting thresholds using state-slider controls.
  • 3.7 describes how to export (i.e. save) report and plot data to your local computer.

    3.1 Objectives in data mining, discovery and analysis

    There are a number of objectives an investigator has when analyzing a set of data. The types of analyses, and how useful they are, depend on what the investigator wishes to get out of them as well as on the type of data.

    A good and appropriate experimental design (i.e. the design and setting up of experiments to subsequently be analyzed) is critical for resolving significant differences in gene expression between experimental conditions. We touch on some of the issues here. (Simon, 2001), (Dudoit, 2000), and Kerr and Churchill (2001a, 2001b) discuss some of the issues of experimental design for microarrays. We do not currently implement the Kerr-Churchill method. However, some of the issues involved in experimental design based on the types of arrays are discussed in Section 3.1.1 for (Cy3/Cy5)-labeled as well as 33P-labeled samples.

    If users are comparing two different types of samples, the analysis would be different than if they were comparing an ordered sequence of samples (e.g. time series, cell cycle, dose-response, tumor-stage, etc.). MAExplorer gives users the ability to:

    1. Organize their experiments by sample characteristics allowing them to perform a variety of mining analyses comparing gene expression patterns between sets of different samples or comparing a single sample within an array's set of genes.
    2. Explore, compare, and record these analyses and share the results or intermediate data with other investigators.
    3. Use graphical direct-manipulation combined with statistical methods, clustering and spreadsheet techniques to gain different insights into the structure of the data.
    4. Access public Internet genomic databases for particular genes that are found to be interesting.

    Briefly, data mining is the discovery of potentially interesting patterns in the data that were previously unknown. One approaches the analysis of a set of data with minimal expectations. However, some idea of what you are interested in helps focus the search. But beware of the trap of mining the data until you get the results you hope for. The following figure helps illustrate this process.

    Flow chart of a typical data mining session

    Figure 3.1 Flow chart of a typical data mining session. The user makes some initial decisions on the experimental design, such as which hybridized samples to compare and the type and number of replicates. They then make initial guesses as to the normalization method to use, and the gene subset (the gene class) to concentrate on when setting the data filter. The data is viewed in various modalities to get a feeling for its inherent dynamic range and where interesting outliers might appear. Clustering and plots help bring these differences into view. The results are then evaluated and either the process is finished or the views are refined by adjusting data normalization and filter parameters, data subsets to be investigated, clustering methods, plots, etc., and the process is repeated until the user is able to see the differences between gene subsets more clearly or no significant differences appear to be found.

    Obviously, this approach is a first approximation to what is eventually required. But it does capture the flavor of the data-mining process. Typically the user would refine the search using variations of the data filters and might contrast (using gene sets and hybridized sample condition lists operations) results found under one set of conditions with those found under another set of conditions.

    Recording the analysis steps during your data mining session - command history

    Because of the iterative nature of this process, you might want to keep a record of the commands you have used or the messages and measurements you have made. To do this you need to enable message and command history logging. Go to the View pull-down menu and then select the type of logging you want using the Show log of messages or the Show log of command history commands.

    3.1.1 Some experimental design issues of microarray experiments

    *** THIS SUBSECTION IS IN THE PROCESS OF BEING UPDATED ***

    Proper experimental design of microarray experiments is critical to the successful use of microarray data. Several recent reports discuss some of the key issues involved in various aspects of statistical analysis of microarrays: (Radmacher, 2001), (McShane, 2001), (Korn, 2001), (Simon, 2001), (Dudoit, 2000).

    Comparing HP-X/HP-Y for Cy3/Cy5 data as 'ratio of ratios'

    If we have two samples HP-X and HP-Y with a common reference sample P (e.g. Cy5P), then we would be comparing the HP-X "intensity" Cy3X/Cy5X against the HP-Y "intensity" Cy3Y/Cy5Y. Alternatively, you can label the common reference sample P with Cy3, in which case just swap Cy3 and Cy5 in these equations. If you are using a common reference standard (i.e. Cy5X1 is the same sample as Cy5Y1, e.g. a pooled sample Cy5P), then
     a)   (Cy3X/Cy5X1) / (Cy3Y/Cy5Y1)
    becomes
     b)   (Cy3X/Cy3Y)
    
    However, this new comparison is accompanied by additional noise because of the use of the two Cy5P intermediaries.
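
    The cancellation in (a) and (b), and the effect of noise in the reference channel, can be checked with a small Java sketch. The numeric intensities are made-up illustration values, and the class and method names are hypothetical.

```java
final class RatioOfRatiosSketch {

    /** Expression (a): (Cy3X/Cy5X1) / (Cy3Y/Cy5Y1). */
    static double ratioOfRatios(double cy3X, double cy5X1, double cy3Y, double cy5Y1) {
        return (cy3X / cy5X1) / (cy3Y / cy5Y1);
    }

    public static void main(String[] args) {
        double cy3X = 200.0, cy3Y = 100.0;

        // Identical reference readings on both arrays cancel, leaving Cy3X/Cy3Y (b).
        System.out.println(ratioOfRatios(cy3X, 100.0, cy3Y, 100.0));

        // If the pooled reference reads 10% higher on the X array, the
        // estimate no longer equals Cy3X/Cy3Y: the reference adds noise.
        System.out.println(ratioOfRatios(cy3X, 110.0, cy3Y, 100.0));
    }
}
```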

    An alternative method would be to compute (Cy3X/Cy5Y) directly. However, this too has its own sources of error and other problems, namely that not all genes are labeled symmetrically with the two dyes since different dyes may have different sequence specific affinities due to a variety of causes. For that reason, dye-swap experiments are often done. I.e. the two samples would be run as (Cy3X/Cy5Y) as well as (Cy3Y/Cy5X). If one were to plot (Cy3X/Cy5Y) against 1.0/(Cy3Y/Cy5X) and the data were perfectly symmetric (which they are not) then one would expect a straight line. That is generally not what you get in practice.

    Another issue is that when you have a number of samples A, B, C, D, ..., N and wish to compare them, there are a number of alternate experimental designs you can use with different resulting sets of advantages and problems. If a common pooled Cy5P sample P were used, then the following experiments would be done:

       (Cy3A/Cy5P), (Cy3B/Cy5P), ... , (Cy3N/Cy5P)
    
    This assumes that there is enough of the pooled sample P to be used for all of the experiments - otherwise additional sources of error would be introduced. MAExplorer is ideally used with this common reference sample P. If a common pooled sample is not used, then the experimental design becomes more complicated - especially if dye-swap experiments are performed for all samples. For N samples taken 2 at a time (i.e. Cy3 and Cy5), the number of experiments may be impossibly large to perform for other than a very small N. E.g. for N of 3, the number of experiments is 3, or 6 if dye-swap experiments are also performed. For N of 4, the number of experiments is 6, or 12 with dye swaps. And this is without doing any replicate experiments. If a reasonable number of replicates is added, then this set of experiments becomes even more difficult to perform.
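
    The experiment counts quoted above follow from counting unordered pairs of samples, N(N-1)/2, and doubling for dye swaps. A minimal Java sketch of that arithmetic follows; the class and method names are hypothetical.

```java
final class PairwiseDesignSketch {

    /** Number of unordered sample pairs: N choose 2 = N(N-1)/2. */
    static long pairs(int n) {
        if (n < 0) throw new IllegalArgumentException("n must be non-negative");
        return (long) n * (n - 1) / 2;
    }

    /** Running every pair in both dye orientations doubles the count. */
    static long pairsWithDyeSwap(int n) {
        return 2 * pairs(n);
    }

    public static void main(String[] args) {
        // With a common pooled reference only N experiments are needed,
        // so the gap widens quickly as N grows.
        for (int n : new int[] {3, 4, 10, 20}) {
            System.out.println("N=" + n
                    + " pooled=" + n
                    + " pairs=" + pairs(n)
                    + " pairs+dye-swap=" + pairsWithDyeSwap(n));
        }
    }
}
```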

    MAExplorer is currently not oriented to handling these large combinatoric types of non-pooled sets of experiments. However, you do have the ability to swap (Cy3,Cy5) data on an individual basis so you could compute an average of data from dye-swap experiments - but with the caveats of non-uniform labeling mentioned above.

      [(Cy3X/Cy5Y) + 1.0/(Cy3Y/Cy5X)]/2
    
    In general, this is probably not a very good estimate.
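
    The dye-swap average above can be written out directly. This is a minimal Java sketch for a single gene with made-up intensity values; the class and method names are hypothetical and this is not MAExplorer code.

```java
final class DyeSwapAverageSketch {

    /** [(Cy3X/Cy5Y) + 1.0/(Cy3Y/Cy5X)] / 2 for one gene. */
    static double dyeSwapAverage(double cy3X, double cy5Y, double cy3Y, double cy5X) {
        double forward = cy3X / cy5Y;          // X over Y, with X labeled Cy3
        double swapped = 1.0 / (cy3Y / cy5X);  // X over Y again, from the dye-swapped array
        return (forward + swapped) / 2.0;
    }

    public static void main(String[] args) {
        // Symmetric labeling: both arrays give the same X/Y ratio.
        System.out.println(dyeSwapAverage(200.0, 100.0, 100.0, 200.0));
        // Asymmetric labeling: the two estimates disagree and are averaged.
        System.out.println(dyeSwapAverage(300.0, 100.0, 100.0, 200.0));
    }
}
```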

    3.1.2 Design philosophy of MAExplorer methodology

    There are several ways to implement a data mining system on moderate size databases. The first is to perform all computations on a Web server and have the user's Web browser display the results. The second is to download an applet from the Web server, get the data from the Web server and do the computations in the Web browser. A third way is to download data from a Web server and run a local stand-alone program on the data. MAExplorer can be run using both the second and third ways. However, we encourage the use of the stand-alone paradigm as having the best bandwidth and being the most robust. The browser-based computation paradigm (as opposed to server-based) is somewhat unusual. It keeps both the program and data on the server, making user maintenance of the latest versions easier than if they had to constantly upgrade the program or data. It also has the distinct advantage of giving the user instantaneous feedback through rapid visual and tabular views and the ability to more effectively navigate the data since the analysis is done on their desktop computer. Because reference data from other genomic sources (e.g. UniGene, GenBank, NCI/CIT's mAdb clone DB, dbEST, GeneCard, etc.) is easy to access, it can be retrieved from the respective Web servers as needed. Complex browser-based computations are used in other data mining or intensive computation domains. With the increased bandwidth of the Internet, and the compute power and memory of PCs approaching those of the Cray supercomputers of the previous decade, this paradigm becomes even more feasible. However, there are limits to how well it scales because of Web browser limitations. Appendix E.2 discusses these issues in more detail.

    The major focus of MAExplorer is interactive data mining with an emphasis on direct graphical and tabular manipulation of the data. The investigator is able to interact with the system by clicking on spots in the array image, points in graphic plots, or cells in spreadsheets, by manipulating threshold sliders, or by typing in gene names/clone IDs. This level of interaction allows investigators to search for and identify patterns of differences with greater ease than with a more static graphic system since it is easier to test ideas by "grabbing onto the data". For example, "what" is the identity of "this" outlier I am pointing to in a scatter plot; "which" genes are best clustered with "this" gene in this clustergram and are perhaps co-regulated; "which" genes have expression ratios within the range of the histogram bin to which I am pointing?

    Direct user manipulation of data, as incorporated in MAExplorer, was defined by (Schneiderman, 1997), who defends the position that the direct manipulation of data in data mining is an extremely effective means to amplify human creativity in understanding patterns. Schneiderman's dogma states "overview first, zoom, and then filter details on demand" and favors the use of "shallow search trees, slide controllers, and information-rich screens with tightly coordinated panel view of data" (Beardsly, 1999). MAExplorer also uses many of these direct manipulation principles. It was designed to run on desktop computers with data residing on the same computer and loaded into its memory for rapid direct manipulation - for both the Web browser and stand-alone versions.

    3.1.3 Evolution of MAExplorer from earlier proteomic data mining systems

    MAExplorer was designed to do flexible exploratory quantitative data analysis of gene data from microarray hybridized sample experiments. Many of the data-mining concepts are derived from a system called GELLAB-II (http://www.lecb.ncifcrf.gov/lemkin/gellab.html), a UNIX-based stand-alone exploratory data analysis system for 2D protein gels over multiple experiments (Lipkin and Lemkin, 1981); see also a review (Lemkin and Lester, 1989) and examples of graphical representations of this type of data (Lemkin, 1995). An on-line GELLAB-II Web-Poster (http://www.lecb.ncifcrf.gov/lemkin/gellab-ep93wd.html) is available showing various screen shots of GELLAB-II in action. Whereas GELLAB works with sets of corresponding spots (i.e. proteins) across sets of 2D gel samples, MAExplorer works with sets of genes (spots in the microarray) across sets of hybridized sample microarrays. With protein gels, one typically has spot alignment problems since gels are generally not superimposable. This is often called the rubber-sheet distortion problem and requires localized alignment of spots based on neighboring spot constellation morphology. We have used Web-based methods to visually compare gels, including the Flicker (http://www.lecb.ncifcrf.gov/flicker/) image comparison system, a Java applet (Lemkin, 1997), and the 2DWG (http://www.lecb.ncifcrf.gov/2dwgDB/) meta-database of 2D gel images (Lemkin, 1999a). Since the genes are precisely spotted on the arrays, aligning spots between arrays is not required, which greatly simplifies the data analysis problem.

    Part of the Flicker system allows comparison of user 2D gel images with standard images from SWISS-2DPROT for putative identification of unknown spots in the user gels. The user would select a standard 2D gel image from over 20 tissue types, enter their own 2D gel image and align them at spots of interest. They could then switch to a database access mode, click on those spots and generate popup SWISS-2DPROT Web pages for those proteins - similar to Clone reports in MAExplorer. That is accessed at http://www.lecb.ncifcrf.gov/flicker/swissProtIdFlkPair.html.

    MAExplorer will have a groupware facility similar to what we have done with our WebGel (http://www.lecb.ncifcrf.gov/webgel/) system described in (Lemkin et al., 1999b). It is a two-dimensional electrophoresis system for sharing data analyses. In WebGel, users may perform a data-mining analysis and leave the state of their analysis and accompanying notes to share with their collaborators on a login-protected basis.

    3.1.4 Concepts used in data mining with MAExplorer

    This section introduces some of the concepts used in data mining microarrays with an emphasis on how they are used with MAExplorer.

    Gene data filters - a Boolean AND of gene set tests

    A primary MAExplorer concept is that of the gene data filter, which selects a working set of genes by the conjunction (Boolean AND) of user-selectable tests. Each test further restricts the working set of genes to those meeting the test. These criteria include gene membership in particular gene classes, membership in particular user-defined or computed gene subsets, and meeting a variety of statistical constraints. Statistics include intra- and inter-array CV and X-Y sets t-tests. Range test criteria include X/Y ratio ranges and histogram bins, and intensity ranges and histogram bins. Membership criteria include testing whether genes are in the current cluster (derived from cluster analysis), gene set membership, etc. By selectively including one or more of these filter restrictions, the user can home in on the data that appears to be of interest. Of course, as in real mining, what appears interesting may turn out not to be interesting on further investigation.
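
    The idea of a filter as a Boolean AND of independent tests can be sketched in Java with java.util.function.Predicate. The Gene fields, thresholds and gene names below are hypothetical illustration values, not MAExplorer's internal data structures or defaults.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

final class GeneFilterSketch {

    /** Hypothetical per-gene summary used only for this illustration. */
    static final class Gene {
        final String name;
        final double x;   // intensity in sample X
        final double y;   // intensity in sample Y
        final double cv;  // coefficient of variation of duplicate spots
        Gene(String name, double x, double y, double cv) {
            this.name = name; this.x = x; this.y = y; this.cv = cv;
        }
        double ratio() { return x / y; }
    }

    /** Keep only the genes that pass every test (conjunction of the tests). */
    static List<Gene> apply(List<Gene> genes, List<Predicate<Gene>> tests) {
        Predicate<Gene> all = g -> true;          // empty filter passes everything
        for (Predicate<Gene> t : tests) {
            all = all.and(t);                     // each added test restricts further
        }
        List<Gene> kept = new ArrayList<>();
        for (Gene g : genes) {
            if (all.test(g)) kept.add(g);
        }
        return kept;
    }

    /** Small worked example; returns the names of the genes that survive. */
    static List<String> demo() {
        List<Gene> genes = new ArrayList<>();
        genes.add(new Gene("geneA", 400, 100, 0.10)); // passes all tests
        genes.add(new Gene("geneB", 400, 100, 0.60)); // fails the CV test
        genes.add(new Gene("geneC", 120, 100, 0.10)); // fails the ratio test
        genes.add(new Gene("geneD", 30, 5, 0.05));    // fails the intensity test

        List<Predicate<Gene>> tests = new ArrayList<>();
        tests.add(g -> g.ratio() >= 2.0);             // X/Y ratio range test
        tests.add(g -> g.cv <= 0.25);                 // CV test
        tests.add(g -> Math.min(g.x, g.y) >= 20.0);   // intensity range test

        List<String> names = new ArrayList<>();
        for (Gene g : apply(genes, tests)) names.add(g.name);
        return names;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```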

    Set operations on gene subsets

    Because of the complexity of comparing many different replicated samples, it may be difficult to manually organize the resulting comparisons. MAExplorer offers set-theoretic operations on sets of genes and sets of hybridized samples (i.e. intersection, union, difference) to help with this organization (step 9 in Table 3.2). The results of set operations may be saved and used in subsequent set operations, normalization, as well as with the data filter. This is useful when comparing and documenting procedures, methods, and analyses from several subsets of experiments.
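
    The three set operations named above map directly onto java.util.Set methods. The sketch below uses made-up gene names and hypothetical class and method names; it does not show how MAExplorer stores gene sets internally.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

final class GeneSetOpsSketch {

    /** Convenience constructor for a sorted set of gene names. */
    static Set<String> setOf(String... names) {
        return new TreeSet<>(Arrays.asList(names));
    }

    /** Genes in either set. */
    static Set<String> union(Set<String> a, Set<String> b) {
        Set<String> r = new TreeSet<>(a);
        r.addAll(b);
        return r;
    }

    /** Genes in both sets. */
    static Set<String> intersection(Set<String> a, Set<String> b) {
        Set<String> r = new TreeSet<>(a);
        r.retainAll(b);
        return r;
    }

    /** Genes in a but not in b. */
    static Set<String> difference(Set<String> a, Set<String> b) {
        Set<String> r = new TreeSet<>(a);
        r.removeAll(b);
        return r;
    }

    public static void main(String[] args) {
        Set<String> upInExperiment1 = setOf("g1", "g2", "g3");
        Set<String> upInExperiment2 = setOf("g2", "g3", "g4");
        System.out.println("both:   " + intersection(upInExperiment1, upInExperiment2));
        System.out.println("either: " + union(upInExperiment1, upInExperiment2));
        System.out.println("only 1: " + difference(upInExperiment1, upInExperiment2));
    }
}
```

    Each operation returns a new set and leaves its inputs unchanged, so a saved result can be reused in later operations.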

    User exploration states

    Users need to be able to save and restore the current state of their explorations of the data and option settings in order to document their work and continue at later times. When running in stand-alone mode, the user may save their data mining session on the local disk in named startup files (.mae file extension). Clicking on one of these startup files will restart MAExplorer and restore the state to that of the time it was saved. In addition to filter and parameter status, the HP-X, HP-Y, HP-X and HP-Y 'sets', HP-E 'list', the named gene sets and HP condition sets are saved as part of the state.

    User groupware sharing of exploration states with collaborators

    [In the future], these states could be saved on a public Web server using multiple named state files, protected for the user by a login procedure. Groupware sharing of these intermediate exploratory results would then be available when a user allows another user to access selected states. User states and groupware sharing complete step 11 in the analysis described in Table 3.2.

    We now discuss using these tools for analyzing one's data.

    3.2 Steps in data mining, discovery, and analysis

    An analysis scenario may use many methods for viewing the data. A typical sequence of analysis steps is listed below in Table 3.2 in the order they might be performed. Note that this is a rough guide for a possible analysis; iterating and backing up over some of these steps is required for data mining complex sets of conditions, especially when setting constraints for the "data filter" (step 4) as the user focuses on subtle patterns of interest (cf. Figure 3.1).
    
    

    Table 3.2 Steps in a data-mining analysis.
    1. Select a set of hybridized samples to be analyzed. Additional samples may be added or removed from the HP-X, HP-Y and HP-E sample lists prior to or during an analysis.
    2. Organize samples by selecting query type: 2-class (X vs. Y) sample sets or N-sample ordered expression lists.
    3. Select normalization (transformation scaling) method appropriate to the type of data and what types of changes you are looking for (patterns or outliers).
    4. Restrict search using the "data filter" to subset genes (pre-filter data) by gene-class, gene-subset, ratio range, intensity range, coefficient of variation, t-test, expression profile, cluster membership, etc.
    5. Review scatter and expression plots, ratio and intensity histograms to gain insights in overall characteristics of the data range and intensity distributions. Export results using either SaveAs or cut and paste operations.
    6. Cluster genes by similar expression profiles and other clustering methods. Review similarity, clustergram and dendrogram plots and reports. Export using either SaveAs or cut and paste operations.
    7. Select subset(s) of genes of interest in plots or reports.
    8. Access other Web genomic databases for genes of interest from spreadsheets or plots.
    9. Save genes of interest in named gene sets and perform set operations if required.
    10. Create gene reports for export to Excel using tab-delimited Reports and either SaveAs or cut and paste operations.
    11. In the stand-alone version of MAExplorer, save the state of exploratory data analysis for later use or sharing.

    In designing a data mining experiment, the first decision to be made is selecting the set of hybridized samples to be compared (steps 1 and 2). This is accomplished by setting the current hybridized sample-X (HP-X) and hybridized sample-Y (HP-Y). In Figure 2.4.4.2 for the scatter plot we selected a single C57B6 pregnancy day 13 and a single Stat5a (-,-) pregnancy day 13 as current HP-X and current HP-Y samples. Changing the normalization changes the view in the scatter plot so that hidden differences may be more apparent (see Figure 2.4.2.3).

    The names of the current HP-X and HP-Y samples are displayed at the top of the main window. The current HP-X and HP-Y samples may be changed at any time by clicking on a new sample from a list of samples shown on the left side of the main window or from lists of samples organized by sample population in the Samples menu.

    The next decision to be made is selection of the genes to be studied by choosing a subset from the gene class menu list (step 4). Further selection occurs throughout the analysis by clicking on spots in microarray images, points in graphic plots or cells in spreadsheets, by adjusting threshold sliders, or using the text-entry "guesser" to type in gene names, clone IDs, genomic IDs, samples, etc.

    The next decision the user must make is to set the intensity data normalization mode (step 3). Normalization of quantitative data is crucial when comparing data between different hybridized microarrays because of spotting, hybridization efficiency, uniformity, and other systematic errors.

    Genes of interest may be separated from all of the genes in the database using a cascade of data filters (step 4). Additional filtering options are easily accessible in the (data) Filter menu. Some of the filters require additional parameters. These parameters are set by state scroll bars that pop up on the screen when data filters requiring them are added to the filter cascade. Changing scroller values causes the data filter to be automatically reapplied and a new set of genes to be computed.

    It is desirable to reduce false-positives found by the data filter by eliminating genes with high quantification variability between duplicate spots on the same sample or between spots duplicated in replicate samples. If duplicate genes are available on the array (denoted by Field 1 and Field 2 or F1 and F2 spots), this allows the computation of a coefficient of variation (CV) for the duplicates. This CV may be used in a data filter to reduce potential false-positives. CV is computed as 2|F1-F2|/(F1+F2) using those spot values for each gene, or as StdDevHP/MeanHP for a set of replicate hybridized samples.
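
    The two CV formulas above can be written as a short Java sketch. The text does not say whether StdDevHP is the sample (n-1) or population (n) standard deviation; the sketch assumes the sample form. The class and method names are hypothetical.

```java
final class CvSketch {

    /** CV of duplicate spots on one array: 2|F1-F2| / (F1+F2). */
    static double cvDuplicates(double f1, double f2) {
        return 2.0 * Math.abs(f1 - f2) / (f1 + f2);
    }

    /** CV over replicate samples: StdDevHP / MeanHP (sample std dev, n-1, assumed). */
    static double cvReplicates(double[] values) {
        int n = values.length;
        if (n < 2) throw new IllegalArgumentException("need at least 2 replicates");
        double sum = 0.0;
        for (double v : values) sum += v;
        double mean = sum / n;
        double squares = 0.0;
        for (double v : values) squares += (v - mean) * (v - mean);
        double stdDev = Math.sqrt(squares / (n - 1));
        return stdDev / mean;
    }

    /** A gene passes the CV filter when its CV does not exceed the threshold. */
    static boolean passesCvFilter(double cv, double maxCv) {
        return cv <= maxCv;
    }

    public static void main(String[] args) {
        double cv = cvDuplicates(100.0, 80.0);
        System.out.println("duplicate CV = " + cv + ", passes 0.25: " + passesCvFilter(cv, 0.25));
        System.out.println("replicate CV = " + cvReplicates(new double[] {10.0, 20.0, 30.0}));
    }
}
```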

    Graphical views of the data give the user additional insights into the data. These include spot intensity and ratio or Zdiff pseudoarray images, scatter plots, histogram plots, expression profile plots, cluster plots showing genes similar to a specified gene, the number of clustered genes for each clone, divisive clusters for K-means clustering, and clustergrams and dendrograms for hierarchical clustering.

    Scatter plots are useful for visualizing data from two conditions

    The scatter plot method (step 5) allows the user to plot the intensity data between two samples, the X-sample and the Y-sample. Gene data may be spot data for two different samples (HP-X and HP-Y), means of two different sets of replicate hybridized samples (sets of HP-X and sets of HP-Y), or the left and right normalized replicate data (F1 vs. F2) for the current hybridized sample. If Cy3/Cy5 data is used, then each sample is the ratio of data from two different hybridized samples. So if we have samples Cy3a and Cy3b, then HP-X could be Cy3a/Cy5 and HP-Y could be Cy3b/Cy5, so that we are scaling the Cy3a and Cy3b samples using a common Cy5 normalization sample. Scatter plots are useful for obtaining a better understanding of the outliers when comparing different hybridized samples and determining the reproducibility of spotting when comparing F1 vs. F2 data or replicate sample data.

    Filtering genes by histogram plots of ratios, Zdiffs or intensities

    Histogram plots may be generated from either X/Y ratios or (X-Y) Zdiffs of two different hybridized samples (single samples or X and Y replicates) or from the F1/F2 intensities of a single hybridized sample. Selecting a bin in a histogram restricts filtered genes to those that are contained in that histogram bin. As an alternate method, data filtering by ratio (Zdiff) or intensity range may be used with adjustable range scrollers independent of the histograms. However, histograms and scrollers may be used together. For example, one could filter by the ratio histogram after filtering out genes with low-intensity values that may be considered noise using the intensity sliders. That might help eliminate falsely high ratios resulting from dividing high X values by very small, noisy Y values. Histograms are useful for getting a better understanding of the range and distribution of the gene intensities or ratios.
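
    The combined use of an intensity threshold and a ratio bin described above can be sketched in Java. The half-open bin [lo, hi), the threshold and the intensity values are assumptions chosen for illustration, and the names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class RatioBinFilterSketch {

    /**
     * Return the indices of genes whose X and Y intensities are both at least
     * minIntensity and whose X/Y ratio falls in the bin [lo, hi).
     */
    static List<Integer> filter(double[] x, double[] y,
                                double minIntensity, double lo, double hi) {
        if (x.length != y.length) {
            throw new IllegalArgumentException("x and y must have the same length");
        }
        List<Integer> kept = new ArrayList<>();
        for (int i = 0; i < x.length; i++) {
            // Intensity test first: drop spots near the noise floor.
            if (x[i] < minIntensity || y[i] < minIntensity) continue;
            double ratio = x[i] / y[i];
            // Ratio bin test on the remaining genes.
            if (ratio >= lo && ratio < hi) kept.add(i);
        }
        return kept;
    }

    /** Worked example with four genes. */
    static List<Integer> demo() {
        double[] x = {500.0, 400.0, 90.0, 300.0};
        double[] y = {100.0,   2.0, 30.0, 290.0};
        // Gene 1 has a ratio of 200, but only because Y is a tiny noisy value,
        // so the intensity test removes it before the ratio bin is applied.
        return filter(x, y, 20.0, 2.0, 8.0);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```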

    Expression profile plots (EP-plot) of N conditions for viewing time series, etc.

    List HP-E is an ordered list of samples, as distinct from HP-X and HP-Y, which are unordered sets of samples. The expression profile (step 5) of a gene is the plot of its normalized intensity as a function of the samples in the ordered HP-E list. It may be plotted for the current gene in a pop-up window. Selecting a different current gene causes the EP-plot to be displayed for that gene. Multiple EP-plots may be created to view the differences between a few genes you are investigating further. The HP name button pops up a window with the ordered list of samples so you can see the details of the sample names being plotted. Selecting a line in a plot displays the intensity data and sample name for that hybridized sample. The data may be plotted as a bar, point or continuous curve, and error bars may be turned off to better compare multiple plots.

    When there are too many EP-plots to be viewed simultaneously, you might use a scrollable list of expression profile plots that lets you scroll through an arbitrarily large list of genes. However, it is difficult to compare genes that are not sorted in some way (i.e. clustered). Therefore, these are most useful when used after clustering the data and displaying the scrollable EP-plots of the cluster-order data.

    Clustering is one way of possibly finding co-expressed genes that exhibit similar expression changes in a set of samples. Genes may show similar co-expression, but that does not prove they are co-regulated at the same point in a pathway - merely that measurements of those genes in a particular set of experiments show similar expression. However, identifying genes with similar expression for which some information is already known about some of the genes may be useful as a starting point to help figure out gene function and pathway using additional experiments and analysis.

    There are many methods for doing clustering - each with advantages and disadvantages. We present three methods in MAExplorer and plan on adding a variety of more powerful methods through the MAEPlugin facility under development.

    Finding clusters of genes with similar expression profiles: similar, cluster counts, K-means, and hierarchical methods

    We may define a cluster of genes as a set of genes whose expression profiles are found to be similar (step 6). The samples used in computing the expression profiles are specified by the HP-E ordered list. You can scale the list of normalized intensity data for each gene to 1.0 (resulting in finding genes with similar shaped EP-plots). Alternatively, if you don't scale this data it will cluster more on magnitude changes. You can select either the Euclidean distance or the correlation coefficient of the EP lists between two genes as the measure of gene-gene distance. Similarity is 1.0 - normalized distance.

    The first cluster method finds a cluster of genes whose expression profiles are similar to that of the currently selected gene. This list of genes is restricted by the constraint that the cluster distance between each of these genes to the selected gene is less than the "Cluster threshold" distance set by the user with a scroll bar. It displays genes that are found both with blue boxes (the larger the box, the higher the similarity) and in a text report window showing the genes and their distances to the current gene. By varying the threshold and observing the results, the user can find a set of highly correlated genes. If the threshold is set to 0.0, no genes are found. If it is set too high, all data filtered genes are found. So it is critical to adjust the threshold to a reasonable level commensurate with the type of data being analyzed and the approximate number of genes expected.

    A second cluster method draws blue circles in the array image around all filtered genes meeting the threshold criteria; the larger the circle, the larger the number of similar genes (i.e. genes passing the threshold) found to be clustered with that gene. Clicking on a gene toggles between the first and second methods. For both of these methods, MAExplorer pops up a "Cluster Distance" threshold scroller and recomputes the clusters if you change the scroller value or the current gene. It also shows a text report that displays the number of genes similar to each data filtered gene.
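
    The cluster-count idea can be sketched in a few lines. This is only an illustration, not MAExplorer's implementation: the function names, the root-mean-square distance and the made-up profiles are all assumptions chosen for the example.

```python
import math

def distance(ep, eq):
    """Root-mean-square distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(ep, eq)) / len(ep))

def cluster_counts(profiles, threshold):
    """Map each gene to the number of other genes closer than the threshold."""
    genes = list(profiles)
    counts = {g: 0 for g in genes}
    for i, g in enumerate(genes):
        for h in genes[i + 1:]:
            if distance(profiles[g], profiles[h]) < threshold:
                counts[g] += 1
                counts[h] += 1
    return counts

# Hypothetical scaled profiles over three ordered samples.
profiles = {
    "geneA": [0.1, 0.5, 1.0],
    "geneB": [0.1, 0.6, 1.0],
    "geneC": [1.0, 0.5, 0.1],
    "geneD": [0.2, 0.5, 0.9],
}
counts = cluster_counts(profiles, threshold=0.09)
```

    A gene with a larger count would be drawn with a larger circle; a gene with a count of zero has no neighbors within the threshold.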

    A third method, called "K-means" clustering, starts by finding K genes (which we call primary nodes) whose expression profiles are most orthogonal to each other. It uses the current gene as the first or "seed" node. It then finds the gene furthest from this and assigns it as node 2. Then the gene furthest from both nodes 1 and 2 is assigned to node 3, etc. This process is repeated until all K nodes are assigned. Then the remaining genes are assigned to the closest node. Having defined the initial cluster centers, it recomputes the centroid of each of the clusters. The centroid can alternatively be computed using a median instead of a mean, in which case we would be doing K-median clustering (Bickel, 2001). K genes are then reassigned to the nearest new centroids as the new K-means node instances. Finally, the remaining genes are assigned to the nearest centroid. A scrollable K-means cluster text window report pops up with genes sorted by cluster. Clicking on a gene in either the array image or scatter plot assigns all genes in the cluster to which that gene belongs to the "current cluster". Genes in the current cluster are labeled in the array and scatter plot with the cluster number in small type. In addition, genes in the current cluster are copied to the E.G.L. where they can be used in a report, saved in a named gene set, or used for additional filtering. It also pops up an "N-clusters" scroll bar window to let you dynamically adjust the number of clusters. Changing N will recompute the clusters. When the K-means is recomputed, it uses the current gene as the initial seed gene.

    The fourth method is a hierarchical clustering method that generates a clustergram and dendrogram similar to that of Eisen's red-black-green clustergram (Eisen, 1998). This was derived from the clustered correlation map (ClusCor) of Weinstein et al. (Weinstein, 1997). The MAExplorer clustergram and dendrogram are dynamic and may be interrogated and used to set the current gene. This means that it may also position a corresponding ordered list of expression profile plots to the same gene so you may view the data as a plot as well. The dendrogram may be zoomed in to explore a part of the dendrogram in more detail. As with the K-means clustering, a report can be made of the ordered genes.

    Gene reports: dynamic spreadsheets for Web access or tab-delimited for exporting data to Excel

    Pop-up report windows (step 7) may be generated for either individual genes or global array sample data. Instances of the latter include experimental information and Web links, global statistics, correlation coefficients between array samples, etc. Gene reports may present this data in a number of ways. These include: highest/lowest gene ratios, profiles, parametric and cluster statistics, etc. Reports may be presented as either dynamic Web-interactive spreadsheet tables or as static tab-delimited tables. The latter is useful for exporting data using cut and paste into Excel (step 10). If the user clicks on a blue hyperlinked cell in a dynamic spreadsheet table, it pops up another Web browser window and loads it with data (step 8) from the respective Internet genomic database such as mAdb Clone DB, UniGene, GenBank, dbEST, GeneCard, and MGAP model and histology Web pages.

    Collaborative groupware environment

    Having immediate access to collaborators' data is a powerful research tool. A collaborative environment is being implemented for MAExplorer that allows groups of users to share data and intermediate results. These capabilities include: 1) the ability to save and restore exploratory data mining sessions (states) through the Web server including named sets of genes, and 2) to selectively share these states with collaborators. The latter process is sometimes called a groupware environment because it offers a collaborative group the ability to share and interact. These capabilities are modeled after our WebGel system (Lemkin et al., 1999b). In addition, users can create a custom database Web page as a subset of samples from the entire database. This may be saved on their own computer through their Web browser's "File/Save as" command. This hypertext file could then be used at a later time to access the database or be E-mailed to a collaborator to do the same.

    3.2.1 Definition of expression profile

    It is helpful to define an expression profile. There may be alternate definitions, but the following is useful for getting an understanding of how it might be computed. An expression profile $e_j$ of an ordered list of N samples (k = 1 to N) for a particular gene j is a vector of scaled expression values $v_{jk}$.

    Then, the expression profile is expressed as a list of values:

       $$e_j = (v_{j1}, v_{j2}, v_{j3}, \ldots, v_{jN})$$

    A difference between two genes p and q may be estimated as an N-dimensional metric "distance" between $e_p$ and $e_q$. The Euclidean distance is then defined as

       $$d_{pq} = \left( \frac{1}{N} \sum_{k=1}^{N} (v_{pk} - v_{qk})^2 \right)^{1/2}$$

    Other distance measures may include the correlation coefficient, the city-block (or Manhattan) distance, etc.

    For data scaled such that $d_{pq}$ has a maximum value of 1.0 over all samples, a similarity measure could be computed as 1.0 - distance, or

       $$s_{pq} = 1 - d_{pq}$$
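
    As a worked illustration of the distance and similarity definitions, the following minimal sketch computes both for two made-up profiles. The function names and the sample values are assumptions for the example only; they are not taken from MAExplorer.

```python
import math

def euclidean_distance(ep, eq):
    """Distance between two profiles: square root of the mean squared difference."""
    if len(ep) != len(eq):
        raise ValueError("profiles must cover the same ordered sample list")
    n = len(ep)
    return math.sqrt(sum((vp - vq) ** 2 for vp, vq in zip(ep, eq)) / n)

def similarity(ep, eq):
    """Similarity as 1 - distance; meaningful when distances lie in [0, 1]."""
    return 1.0 - euclidean_distance(ep, eq)

# Two hypothetical profiles over N = 4 samples, each scaled so its maximum is 1.0.
e_p = [0.2, 0.4, 1.0, 0.6]
e_q = [0.2, 0.4, 0.8, 0.6]
d_pq = euclidean_distance(e_p, e_q)
s_pq = similarity(e_p, e_q)
```

    Here the profiles differ only in the third sample (by 0.2), so the distance is the square root of 0.04/4, or about 0.1, and the similarity is about 0.9.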
    

    3.2.2 Clustering Methods

    Clusters represent one way to identify similar gene expression across a set of experiment samples. There are many ways to cluster the data, some of which are available in MAExplorer. These include:
    1. Finding genes with similar expression
    2. K-means clusters where the number of clusters K is fixed prior to clustering and a particular gene is used to start off the clustering
    3. Hierarchical clustering where a binary tree hierarchy is created
    Other methods include Self-Organizing Maps (SOM), fuzzy clustering, Support Vector Machines (SVM), etc.

    3.2.2.1 Clustering similar genes

    If we have a particular gene s (the "seed" gene), we may want to find the set of all genes $\{g_j\}$ similar to $g_s$. We define a particular gene $g_j$ as similar to the seed gene if the distance between genes s and j meets the following criterion.
      $$d_{js} < T$$
    
    The threshold T is set by the investigator and in MAExplorer is changed using a slider. Typically, the set of all genes {gj} found is sorted by similarity before being viewed.
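
    The threshold test and the sort by similarity can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the choice to leave the seed gene out of its own result list, and the example profiles are all hypothetical.

```python
import math

def distance(ep, eq):
    """Root-mean-square distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(ep, eq)) / len(ep))

def similar_genes(profiles, seed, threshold):
    """Return (gene, distance) pairs with distance to the seed below T, nearest first."""
    seed_profile = profiles[seed]
    hits = []
    for gene, profile in profiles.items():
        if gene == seed:
            continue  # assumption: the seed is not listed as its own neighbor
        d = distance(profile, seed_profile)
        if d < threshold:
            hits.append((gene, d))
    return sorted(hits, key=lambda pair: pair[1])

# Hypothetical scaled profiles over three ordered samples.
profiles = {
    "geneA": [0.1, 0.5, 1.0],
    "geneB": [0.1, 0.6, 1.0],
    "geneC": [1.0, 0.5, 0.1],
    "geneD": [0.2, 0.5, 0.9],
}
cluster = similar_genes(profiles, seed="geneA", threshold=0.2)
```

    Because the comparison is strict, a threshold of 0.0 returns no genes, and a threshold larger than every distance returns all of them.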

    3.2.2.2 K-means clustering

    K-means clustering finds K clusters of genes with similar expression profiles to a given gene (see Sneath and Sokol, 1973). Given the number of clusters K, we could use high variance of clusters to determine if they should split into sub-clusters. K-means clustering does not need a distance matrix (see Hierarchical clustering which follows), so it is faster and can cluster a large number N of genes. However, it is highly dependent on seed selection. It may be useful for getting an initial estimate, especially if other techniques (such as silhouette plots) are also used. The following is a simplified definition of one way to compute a set of K-means clusters of gene expression profile data.

    Algorithm:
    1. Pick seed gene s and put it into cluster 1 (let k = 1)
    2. For all clusters j = 1 to k, find gene q such that $d_{jq}$ is a maximum
    3. Set k = k+1. Put gene q into new cluster k
    4. For j = k to K, repeat steps 2 and 3 until there are K clusters
    5. Then, assign (N-K) remaining genes q into one of the K clusters j with minimum $d_{jq}$
    6. Compute new virtual gene cluster centroids as means or medians $\{e_k\}$ for each of K clusters
    7. Reassign all N genes q into K new clusters with minimum $d_{pq}$ using virtual genes $\{e_p\}$
    8. Variants: use multiple seed genes, range of K values, minimize CV
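
    Steps 1 to 7 can be sketched as a single pass in Python. This is an illustration under assumptions, not MAExplorer's code: "furthest" in step 2 is interpreted here as the gene whose distance to its closest existing node is largest, centroids are computed as means, and all names and data are hypothetical. K must not exceed the number of genes.

```python
import math

def distance(ep, eq):
    """Root-mean-square distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(ep, eq)) / len(ep))

def nearest(profile, centers):
    """Index of the center closest to the given profile."""
    return min(range(len(centers)), key=lambda j: distance(profile, centers[j]))

def assign(profiles, centers):
    """Group gene names by their nearest center."""
    clusters = [[] for _ in centers]
    for gene, profile in profiles.items():
        clusters[nearest(profile, centers)].append(gene)
    return clusters

def kmeans_sketch(profiles, seed, k):
    # Steps 1-4: start from the seed, then repeatedly add the gene furthest
    # from the nodes chosen so far until there are K primary nodes.
    nodes = [seed]
    while len(nodes) < k:
        candidates = [g for g in profiles if g not in nodes]
        q = max(candidates,
                key=lambda g: min(distance(profiles[g], profiles[n]) for n in nodes))
        nodes.append(q)
    # Step 5: assign every gene to its closest primary node.
    clusters = assign(profiles, [profiles[n] for n in nodes])
    # Step 6: centroid ("virtual gene") of each cluster as the per-sample mean.
    centroids = []
    for node, members in zip(nodes, clusters):
        if not members:  # keep the node itself if nothing was assigned to it
            centroids.append(list(profiles[node]))
            continue
        columns = zip(*(profiles[g] for g in members))
        centroids.append([sum(col) / len(members) for col in columns])
    # Step 7: reassign all genes to the nearest centroid.
    return nodes, assign(profiles, centroids)

# Hypothetical two-sample profiles forming three obvious groups.
profiles = {
    "g1": [0.0, 0.0], "g2": [0.1, 0.0],
    "g3": [1.0, 1.0], "g4": [0.9, 1.0],
    "g5": [1.0, 0.0],
}
nodes, clusters = kmeans_sketch(profiles, seed="g1", k=3)
```

    Replacing the mean in step 6 with a per-sample median would give the K-median variant. Choosing a different seed gene can change the primary nodes, which is the seed dependence noted above.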

    3.2.2.3 Hierarchical clustering

    Hierarchical clustering of a set of genes will generate a binary tree of clusters with the genes at the terminal ends of the tree and a single cluster of the entire tree at the top (also called the root) of the tree. See (Sneath and Sokol, 1973) for a discussion on hierarchical clustering. There are many other variants of hierarchical clustering. Hierarchical clustering requires a distance matrix or the equivalent of one [there are more efficient ways to compute it]. For N genes (terminal clusters), it generates 2N-1 clusters. The distance matrix is an upper triangular matrix D of $d_{pq}$ values with N(N-1)/2 entries.

    D can get quite large for clustering a large number of genes N [for N=5000, this is > 50 Mbytes!]
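
    The growth of D is easy to check numerically. The sketch below counts the N(N-1)/2 entries and converts them to bytes; the 4 bytes per entry is an assumption made for the example (the manual does not state the storage type), and 8-byte values would double the figure.

```python
def distance_matrix_entries(n_genes):
    """Number of distinct gene pairs, i.e. entries in the upper triangle of D."""
    return n_genes * (n_genes - 1) // 2

BYTES_PER_ENTRY = 4  # assumption: one 32-bit float per distance

entries = distance_matrix_entries(5000)
megabytes = entries * BYTES_PER_ENTRY / 1e6
```

    For N = 5000 this gives about 12.5 million entries, which is on the order of 50 Mbytes at 4 bytes each. Doubling N roughly quadruples the storage.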

    The following is a simplified definition of one way to compute a hierarchical clustering of gene expression profile data.

    Algorithm:
    1. Assign all N genes to clusters 1 to N, set n to N
    2. Find two clusters p and q such that $d_{pq}$ is a minimum, doing the following:
       2.1 Compute a virtual cluster vector $e_k = \mathrm{average}(e_p, e_q)$
       2.2 Set n = n+1
       2.3 Assign virtual cluster to new cluster
    3. Repeat step 2 until n = 2N-1.
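
    The merge loop can be sketched as follows. This is a deliberately simple illustration, not MAExplorer's implementation: it recomputes distances on every pass instead of maintaining the matrix D, it uses 0-based cluster ids, and the function names and profiles are assumptions.

```python
import math
from itertools import combinations

def distance(ep, eq):
    """Root-mean-square distance between two profile vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(ep, eq)) / len(ep))

def hierarchical_clusters(profiles):
    """Return merges as (new_id, left_id, right_id); leaves are ids 0..N-1."""
    vectors = {i: list(p) for i, p in enumerate(profiles)}  # active clusters
    n = len(profiles)
    merges = []
    while len(vectors) > 1:
        # Step 2: find the closest pair of active clusters.
        p, q = min(combinations(sorted(vectors), 2),
                   key=lambda pair: distance(vectors[pair[0]], vectors[pair[1]]))
        # Step 2.1: virtual cluster vector as the average of the merged vectors.
        merged = [(a + b) / 2.0 for a, b in zip(vectors[p], vectors[q])]
        # Steps 2.2-2.3: the merged pair becomes a new cluster with the next id.
        del vectors[p], vectors[q]
        vectors[n] = merged
        merges.append((n, p, q))
        n += 1
    return merges

# Hypothetical two-sample profiles for N = 4 genes.
profiles = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.8, 1.0]]
merges = hierarchical_clusters(profiles)
```

    Each merge adds one cluster, so N leaves plus N-1 merges give the 2N-1 clusters mentioned above. The list of merges is enough to draw a dendrogram, with the last merge as the root.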

    3.3 Display of gene spot intensity and identification data measurements

    You may select the current gene by clicking in the pseudoarray image, in the X-Y scatter plot, or in MAExplorer reports. MAExplorer then reports the microarray grid coordinates, normalized quantified spot intensity data, plate coordinates, gene name (if known) and associated data for that gene. If you are displaying a pseudocolor ratio (X/Y) or Zdiff (X-Y) image, it will report HP-X/HP-Y or (HP-X - HP-Y) data respectively. It also sets the gene as the current gene. The pseudoarray image coordinates are reported as:

         [<field>-<grid name><row#>,<col#>].
    e.g.
         [1-A4,3]
    

    If there is only one field in the array, it will appear as field 1. In the above example, [1-A4,3] is field 1 grid A row 4 and column 3. Note that the pseudoarray coordinates are for visualization purposes in MAExplorer and may or may not be the same as the coordinates on the actual array. That depends on how the MAExplorer database was defined in the configuration file described in Appendix C.
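
    If you export status lines and want to process them elsewhere, the coordinate format can be parsed with a regular expression. The sketch below is an assumption-laden illustration: it assumes the grid name is alphabetic (as in the example), and the function and key names are hypothetical.

```python
import re

# [<field>-<grid name><row#>,<col#>], e.g. [1-A4,3]
COORD = re.compile(r"\[(\d+)-([A-Za-z]+)(\d+),(\d+)\]")

def parse_coordinate(text):
    """Split a pseudoarray coordinate into field, grid, row and column."""
    match = COORD.fullmatch(text.strip())
    if match is None:
        raise ValueError("not a pseudoarray coordinate: %r" % text)
    field, grid, row, col = match.groups()
    return {"field": int(field), "grid": grid, "row": int(row), "col": int(col)}

coord = parse_coordinate("[1-A4,3]")
```

    For the example above, this yields field 1, grid A, row 4 and column 3.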

    When the current gene is defined, it will draw a yellow (green) circle around the spot in the ratio (intensity) pseudoarray image and display other features of the gene in the three-line status area near the top of the main window. If background correction is enabled (the "Use background intensity correction" in the Normalization menu), then spot intensity values will appear as intensity' (with background intensity subtraction) and intensity (without background subtraction).

    There are a number of different reporting formats available depending on the array display mode and particular normalization method selected. These include: the pseudoarray image of the intensity of a single sample, the pseudocolor ratio X/Y or Zdiff (X-Y) image (using either HP 'sets' or single samples), or the ratio of Cy3/Cy5 for dual-labeled dyes or F1/F2 for replicate spots for a single sample. In addition, the normalization mode is also displayed in the reporting line. We will present examples of each of these different reporting formats.

    You may show the intensity data for a particular spot in the currently displayed pseudoarray image. First select the "Pseudograyscale image" option in the "Show Microarray" submenu in the "Plot menu". If your data has duplicate grids (i.e. fields F1 and F2) then you may look at F1, F2 and mean (F1+F2)/2 data in the reports when you click on a spot. If the "Gang F1-F2 scrolling" switch is disabled in the "View menu", then the intensity value is the intensity data value for the gene at that location. If the "Gang F1-F2 scrolling" switch is enabled, then it reports intensity[F1], intensity[F2], and the F1/F2 ratio. These two formats are shown in the following two examples for a C57B6 pregnancy day 13 sample in the MGAP database:

    a) Field F1 spot for a single spot in a single sample with the median intensity selected.

    
    [1-A4,5] intensity=4.5267, (Norm.: median intensity)
    CloneID: 1248228, dbEST3': 2279072, GenBankAcc3': AI463183, UniGene: Mm.13859, plate[5,A,5]
    GeneName: Mus musculus ribosomal protein L41 mRNA, complete cds
    
    
    b) Field F1 and F2 replicate spots for a single sample. The top line is shown for each of the different normalization methods.
    
    [1-A4,5] intensity[F1]=-0.3067, intensity[F2]=-0.2312, F1-F2=-0.0755, (Norm.: Zscore intensity)
    [1-A4,5] intensity[F1]=4.5267, intensity[F2]=6.2408, F1/F2=0.7253, (Norm.: median intensity)
    [1-A4,5] intensity[F1]=0.8755, intensity[F2]=1.1457, F1-F2=-0.2701, (Norm.: log median intensity)
    [1-A4,5] intensity[F1]=-0.1442, intensity[F2]=-0.0945, F1-F2=-0.0497, (Norm.: Z-score, stdDev, log intensity)
    [1-A4,5] intensity[F1]=-0.1533, intensity[F2]=-0.1004, F1-F2=-0.0528, (Norm.: Z-score, mean abs.deviation, log intensity)
    [1-A4,5] intensity[F1]=630.9911, intensity[F2]=869.9273, F1/F2=0.7253, (Norm.: calibration DNA intensity)
    [1-A4,5] intensity[F1]=1919.9376, intensity[F2]=2646.957, F1/F2=0.7253, (Norm.: scale to max. (65K) intensity)
    CloneID: 1248228, dbEST3': 2279072, GenBankAcc3': AI463183, UniGene: Mm.13859, plate[5,A,5]
    GeneName: Mus musculus ribosomal protein L41 mRNA, complete cds
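
    In the example lines of b), the Z-score and log normalizations report a difference (F1-F2), while the other normalizations report a ratio (F1/F2). The sketch below reproduces that pattern. The grouping of normalization names is inferred from these example lines only, and the function name is hypothetical; it is not a description of MAExplorer's internal logic.

```python
# Normalizations that show F1-F2 in the example lines of b); an inference
# from those lines, not an exhaustive list.
DIFFERENCE_NORMALIZATIONS = {
    "Zscore intensity",
    "log median intensity",
    "Z-score, stdDev, log intensity",
    "Z-score, mean abs.deviation, log intensity",
}

def compare_replicates(f1, f2, normalization):
    """Return a (label, value) pair comparing replicate spots F1 and F2."""
    if normalization in DIFFERENCE_NORMALIZATIONS:
        return "F1-F2", f1 - f2
    return "F1/F2", f1 / f2

ratio_label, ratio_value = compare_replicates(4.5267, 6.2408, "median intensity")
diff_label, diff_value = compare_replicates(-0.3067, -0.2312, "Zscore intensity")
```

    Using the intensities from the median and Zscore lines, the computed values agree with the reported F1/F2 and F1-F2 figures to the precision shown.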
    
    
    If the "Pseudocolor HP-X/HP-Y ratio or Zdiff" option is selected in the "Show Microarray" submenu, data is reported as either Ratio or Zdiff data depending on the normalization method selected. The data used in the following examples is for C57B6 pregnancy day 13 (HP-X) compared with Stat5a (-,-) pregnancy day 13 (HP-Y).

    c) Ratio data for two samples X and Y in separate hybridized arrays. Ratio data for the field F1 and F2 spot data as well as the mnX/mnY ratio is reported. The median normalization was used in this example.

    
    [1-A4,5] HP-XY: mn(X,Y)=(5.383,6.834) (X/Y)(F1,F2,mean)=(0.651,0.928,0.787), (Norm.: median intensity)
    CloneID: 1248228, dbEST3': 2279072, GenBankAcc3': AI463183, UniGene: Mm.13859, plate[5,A,5]
    GeneName: Mus musculus ribosomal protein L41 mRNA, complete cds
    
    
    d) Zdiff data for two separate samples X and Y. Zdiff data for the field F1 and F2 spot data as well as the mnX-mnY Zscore difference is reported. The Zscore, ZscoreLog, and logMean normalizations were used in this example (first lines are shown).
    
    [1-A4,5] HP-XY: mn(X,Y)=(-0.269,0.151) (X-Y)(F1,F2,mean)=(-0.470,-0.370,-0.420), (Norm.: Zscore intensity)
    [1-A4,5] HP-XY: mn(X,Y)=(-0.119,0.051) (X-Y)(F1,F2,mean)=(-0.199,-0.142,-0.170), (Norm.: Z-score, stdDev, log intensity)
    [1-A4,5] HP-XY: mn(X,Y)=(1.010,1.224) (X-Y)(F1,F2,mean)=(-0.362,-0.064,-0.213), (Norm.: log median intensity)
    CloneID: 1248228, dbEST3': 2279072, GenBankAcc3': AI463183, UniGene: Mm.13859, plate[5,A,5]
    GeneName: Mus musculus ribosomal protein L41 mRNA, complete cds
    
    
    e) Example of when the "Use dual HP-X & HP-Y Pseudoimage" mode is enabled in the "Show Microarray" submenu of the "Plot" menu. This displays mean data for the HP-X and HP-Y data side-by-side. The median normalization was selected.
    
    [1-A4,5] intensity[X]=5.3837, intensity[Y]=6.8342, X/Y=0.7877, (Norm.: median intensity)
    CloneID: 1248228, dbEST3': 2279072, GenBankAcc3': AI463183, UniGene: Mm.13859, plate[5,A,5]
    GeneName: Mus musculus ribosomal protein L41 mRNA, complete cds
    
    

    Reporting for multiple hybridized samples when using HP-X/-Y 'sets'

    If you have enabled MAExplorer to "use HP-X and HP-Y 'sets' of multiple samples" rather than single samples in the Samples menu, it will report a spot differently using the means (mn), standard deviations (S.D.), and coefficients of variation (CV) for the samples in the HP-X and HP-Y 'sets'. For duplicate fields, these are computed using the normalized average of F1 and F2 spots for each gene in each sample. The data used in the following examples is for three C57B6 pregnancy day 13 (HP-X) samples, and five Stat5a (-,-) pregnancy day 13 (HP-Y) samples.

    f) Multiple HP-XY 'sets' using median normalization for the pseudoarray image display for the HP-X 'set' of three C57B6 samples.

    [1-A4,5] HP-X 'set' mean intensity=3.295 stdDev=1.482 CV=0.449 n=3, (Norm.: median intensity)
    CloneID: 1248228, dbEST3': 2279072, GenBankAcc3': AI463183, UniGene: Mm.13859, plate[5,A,5]
    GeneName: Mus musculus ribosomal protein L41 mRNA, complete cds
    
    g) Multiple HP-XY 'sets' using median normalization for the pseudoarray image display for the HP-Y 'set' of five Stat5a (-,-) samples.
    [1-A4,5] HP-Y 'set' mean intensity=8.180 stdDev=0.986 CV=0.120 n=5, (Norm.: median intensity)
    CloneID: 1248228, dbEST3': 2279072, GenBankAcc3': AI463183, UniGene: Mm.13859, plate[5,A,5]
    GeneName: Mus musculus ribosomal protein L41 mRNA, complete cds
    
    h) Multiple HP-XY 'sets' using median normalization for the pseudoarray image display for the HP-X and HP-Y 'sets' when the "Use dual HP-X & HP-Y Pseudoimage" mode is enabled in the "Show Microarray" submenu of the "Plot" menu.
    [1-A4,5] HP-XY 'sets': mn(X,Y)=(3.295,8.180) mnX/mnY=0.402 SD(X,Y)=(1.482,0.986) CV(X,Y)=(0.449,0.120)\
          n(X,Y)=(3,5), (Norm.: median intensity)
    CloneID: 1248228, dbEST3': 2279072, GenBankAcc3': AI463183, UniGene: Mm.13859, plate[5,A,5]
    GeneName: Mus musculus ribosomal protein L41 mRNA, complete cds
    
    i) Multiple HP-XY 'sets' using median normalization for ratio (HP-X/HP-Y) data for the "Pseudocolor HP-X/HP-Y Ratio or Zdiff" display.
    [1-A4,5] HP-XY 'sets': mn(X,Y)=(3.295,8.180) mnX/mnY=0.402 SD(X,Y)=(1.482,0.986) CV(X,Y)=(0.449,0.120) \
            n(X,Y)=(3,5), (Norm.: median intensity)
    CloneID: 1248228, dbEST3': 2279072, GenBankAcc3': AI463183, UniGene: Mm.13859, plate[5,A,5]
    GeneName: Mus musculus ribosomal protein L41 mRNA, complete cds
    

    j) Multiple HP-XY 'sets' p-value using median normalization for ratio (HP-X/HP-Y) data for the "Pseudocolor (HP-X,HP-Y) 'sets' p-value" display.

    [1-A7,20] HP-XY: mn(X,Y)=(3.449,0.853) (X/Y)(F1,F2,mean)=(4.09,4.008,4.041), (Norm.: median intensity)
    CloneID: 1382656, dbEST5': 1775754, GenBank 5': AI036495, UniGene: Mm.300,  plate[12,A,8]
    GeneName: Carbonic anhydrase 3
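
    The 'set' statistics in examples f) to i) (mean, stdDev, CV, n and mnX/mnY) can be illustrated with a small sketch. The intensity values below are made up, and the use of the sample standard deviation (n - 1 in the denominator) is an assumption; the manual does not say which form MAExplorer uses.

```python
import statistics

def set_summary(values):
    """Mean, standard deviation, coefficient of variation and count for one 'set'."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1); an assumption
    return {"mean": mean, "stdDev": sd, "CV": sd / mean, "n": len(values)}

# Hypothetical normalized intensities of one gene in each sample of the sets.
hp_x = [2.0, 3.0, 4.0]            # three HP-X samples
hp_y = [7.0, 8.0, 8.0, 9.0, 8.0]  # five HP-Y samples

x = set_summary(hp_x)
y = set_summary(hp_y)
mn_ratio = x["mean"] / y["mean"]  # corresponds to mnX/mnY in the status line
```

    CV is the standard deviation divided by the mean, so it is only meaningful when the mean is not close to zero.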
    

    Reporting for Cy3 and Cy5 channels for a single hybridized sample

    k) If you have Cy3/Cy5 data, then you can look at the two channels for a single sample (the current HP sample). The following example uses median normalization with the display set to "Pseudocolor Red(Cy5)-Yellow-Green(Cy3) Cy3/Cy5 data".
    [1-A6,11] Cy5/Cy3=0.3588, Cy5=67.324, Cy3=187.622, (Norm.: median intensity)
    CloneID: IMAGE:1054189, 
    GeneName: expressed sequence AW213287
    

    Reporting HP-X (Cy3 or Cy5) vs HP-Y (Cy3 or Cy5) data for 2 samples

    l) If you want to compare Cy3 or Cy5 in the HP-X sample with a Cy3 or Cy5 value in the HP-Y sample, you do it through the special Cy3,Cy5 scatter plots. There are four types of plots:
    1. HP-X Cy3 vs HP-Y Cy3
    2. HP-X Cy3 vs HP-Y Cy5
    3. HP-X Cy5 vs HP-Y Cy3
    4. HP-X Cy5 vs HP-Y Cy5
    After the plot is started, clicking on a point in the scatter plot reports the data for that point, as shown in the following example where HP-X Cy3 is plotted against HP-Y Cy3.
    [1-A5,16] intensX=4.695, intensY=5.923, (X-Y)=-1.2275, (Norm.: log median intensity)
    CloneID: IMAGE:963758, 
    GeneName: RIKEN cDNA 2410114O14 gene
    

    
    

    3.4 Selecting subsets of genes using the data Filter

    Genes may be selected on a number of criteria specified by the data Filter (Section 2.4.3) that is a cascade of data tests. The first might be the gene class (Section 2.4.1) to restrict the set of all genes to a particular subset. Various numeric and statistical data tests might be applied on the remaining genes to exclude those not meeting these tests. For example, genes having a high coefficient of variation between duplicate spots on the same sample or duplicate samples could be eliminated. Then one could select genes that had a (HP-X/HP-Y) ratio greater than 4.0 but less than 8.0, etc. The latter could be done using either the ratio scrollers or by clicking on that bin in the ratio histogram plot. See the Filter menu options and look at one of the tutorials (Appendix A) for ideas on adjusting the Filter to close in on a particular subset of genes.
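
    The idea of a cascade of tests can be sketched as a list of predicates applied in order. Everything here is hypothetical: the gene records, field names, class labels and the CV cutoff of 0.25 are invented for the illustration; only the ratio range of 4.0 to 8.0 comes from the text.

```python
def cascade(genes, tests):
    """Apply tests in order; a gene survives only if it passes every test."""
    remaining = list(genes)
    for test in tests:
        remaining = [g for g in remaining if test(g)]
    return remaining

# Made-up gene records: class label, duplicate-spot CV, and HP-X / HP-Y intensities.
genes = [
    {"name": "g1", "class": "named", "cv": 0.10, "x": 10.0, "y": 2.0},  # ratio 5.0
    {"name": "g2", "class": "named", "cv": 0.60, "x": 12.0, "y": 2.0},  # CV too high
    {"name": "g3", "class": "EST",   "cv": 0.05, "x": 9.0,  "y": 1.5},  # wrong class
    {"name": "g4", "class": "named", "cv": 0.08, "x": 3.0,  "y": 2.0},  # ratio 1.5
]

tests = [
    lambda g: g["class"] == "named",        # gene class test
    lambda g: g["cv"] <= 0.25,              # drop genes with a high CV
    lambda g: 4.0 < g["x"] / g["y"] < 8.0,  # ratio greater than 4.0 but less than 8.0
]

selected = [g["name"] for g in cascade(genes, tests)]
```

    Since a gene must pass every test, the order of the tests does not change the final set; it only changes how many genes each later test has to examine.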

    3.5 Selecting subsets of hybridized sample conditions

    Sets and lists of hybridized sample conditions (HP-X and HP-Y sets, HP-E list) may be selected using various commands from the Samples Menu (Section 2.2) including pull-down menus, guessing by name or part of a name, or using a "Chooser" (see Figure 2.2.1) to design your settings for the current HP-X and HP-Y sets and HP-E list. The Chooser is the easiest way to select these entries. In addition, if you want to change the current HP-X or HP-Y individual sample, you can do this directly from the pseudoarray image by clicking on the [X] or [Y] part of the image and then selecting the particular sample to use. Note that if the mouse-over checkbox is enabled, then moving the mouse over the sample names gives you the full sample name. Otherwise, the sample name may be truncated.

    3.6 Setting threshold values using the state-scroller sliders

    You may filter genes using a variety of thresholding operations (see the Filter menu (Section 2.4.3) to select any of these). For example, these include a spot intensity (per channel) range [SI1:SI2], gene intensity range [I1:I2], ratio range [R1:R2], Zdiff range [Z1:Z2], Coefficient Of Variation (CV) range [0:1.0], p-value range [0:1.0] for the t-test, etc. Additional threshold scrollers are used with the clustering methods including the number of clusters (default 6), the maximum cluster distance from a gene to another gene for the latter to be considered in the same cluster, and the absolute difference between HP-X and HP-Y.

    For the intensity and ratio threshold filters, the range interpretation may be inside, or outside the specified range. The ratio range [R1:R2] is between 0.01 and 100.0. The Zdiff range [Z1:Z2] and [CZ1:CZ2] are between -4.0 and +4.0. The intensity threshold range [I1:I2] is set to the dynamic range of the min and max intensity for the current normalization method.
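
    The inside/outside interpretation of a range can be sketched as a single test. This is an illustration only: the function name, the inclusive treatment of the endpoints and the example ratios are assumptions.

```python
def in_range_filter(value, low, high, mode="inside"):
    """Test a value against [low:high], interpreted as inside or outside the range."""
    inside = low <= value <= high  # assumption: endpoints count as inside
    if mode == "inside":
        return inside
    if mode == "outside":
        return not inside
    raise ValueError("mode must be 'inside' or 'outside'")

# Hypothetical X/Y ratios; keep genes that changed more than two-fold either way.
ratios = {"g1": 0.2, "g2": 1.1, "g3": 5.0}
changed = [g for g, r in ratios.items() if in_range_filter(r, 0.5, 2.0, "outside")]
unchanged = [g for g, r in ratios.items() if in_range_filter(r, 0.5, 2.0, "inside")]
```

    The "outside" interpretation is convenient for ratios because a single range such as [0.5:2.0] then selects both strongly increased and strongly decreased genes.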

    A list of possible threshold sliders is shown in the following table. When a Filter is enabled that requires a slider, it pops up the State Scrollers window that contains one or more sliders. When you disable all filters that use these sliders, the popup window will disappear. The corresponding Ratio R1[R2] or Zdiff Z1[Z2] sliders are used if you are using a ratio or Zscore normalization - and will change if the normalization changes while the filter is active.

    Some of the sliders are implemented with a non-linear scale so that you have more resolution at the low end (eg. p-Value, Spot CV, Diff HP-XY).

    Depending on the set of data Filters selected, there may be multiple sliders present in the State Slider popup window (eg. see Figure 2.4.3).

    
    

    Table 3.3.1. List of threshold sliders. Sliders are enabled in the State-Scroller popup window when the corresponding data filters are enabled.


    Slider name          Associated with operation
    -------------------  ------------------------------------------------------------
    Spot Intensity SI1   Filter by spot intensity range per channel
    Spot Intensity SI2   Filter by spot intensity range per channel
    Percent SI OK        Filter by percent of spots whose spot intensity is in the
                         threshold range (meets the AT LEAST or AT MOST criteria)
    Intensity I1         Filter by gene intensity range
    Intensity I2         Filter by gene intensity range
    Ratio R1             Filter by gene X/Y ratio range
    Ratio R2             Filter by gene X/Y ratio range
    Zdiff Z1             Filter by gene X-Y Zdiff range
    Zdiff Z2             Filter by gene X-Y Zdiff range
    Ratio CR1            Filter by Cy3/Cy5 gene X/Y ratio range
    Ratio CR2            Filter by Cy3/Cy5 gene X/Y ratio range
    Zdiff CZ1            Filter by gene (Cy3-Cy5) X-Y Zdiff range
    Zdiff CZ2            Filter by gene (Cy3-Cy5) X-Y Zdiff range
    p-Value              Filter by t-Test
    Spot CV              Filter by Coefficient of Variation
    Cluster Distance     Plot - cluster by expression similarity
    # of Clusters        Plot - K-means clustering
    Diff HP-XY           Filter by absolute difference (HP-X,HP-Y)
    Spot Quality         Filter by continuous spot quality (If data available)
    
    

    3.7 Exporting report and plot data

    Data is typically reported in MAExplorer in report and plot windows. These may be saved using cut and paste if you are using MAExplorer as an Applet or with "SaveAs" buttons on the popup windows if you are running it as a stand-alone application. Reports are then saved as text (.txt extension) files, and plots are saved as GIF (.gif extension) files.

    If you are running on a windowing system supporting cut and paste, then you may cut and paste data from reports and plots into applications on your system that allow you to save or print this data. Set the Report menu table-format to "Tab-delimited". Then, in Windows 95/98/NT/2000/XP, cut data from the popup tables (or other text reports) and paste it into Microsoft Excel. In Windows, you can capture (i.e. "cut") the entire screen by pressing the "Prt Sc" or print screen button. To capture a specific window (e.g. a scatter plot), hold the "Alt" key when pressing the "Prt Sc" key. Then go into a Windows imaging application (such as PhotoShop) and paste it into the application. In PhotoShop, in the File menu, select New (or type Control/N). Then when the window is opened, click on the window and paste the MAExplorer screen you had cut into the image window by typing Control/V. In both Excel and PhotoShop you may print the data or save it in a file.

    MAExplorer - Microarray Exploratory Data Analysis

    4. Status and Bugs of MAExplorer

    This section discusses the status and known bugs in MAExplorer. It also discusses dealing with the reporting of fatal errors so we can resolve them.

    Section 4.1 discusses known bugs, and Section 4.2 lists the revision notes for known bugs in older versions. If you have experienced bugs with an older version of MAExplorer, you might check the revision notes to see if the bug was fixed and download a new version. Section 4.3 discusses problems in using MAExplorer as an applet with Web browsers. Section 4.4 describes handling fatal "DRYROT" errors.

    4.1 Known Bugs in MAExplorer

    Disclaimer: none of our code ever has bugs... :-). So despite this, we are working on resolving these bugs and implementing planned functionality. Here is a short non-inclusive list of known problems that we are resolving. We welcome and encourage you to E-mail us with any bugs that you find do exist as well as suggestions for capabilities you would like to see. As the new open-source MAEPlugins facility evolves, most new (and some old) functionality will migrate to these plugins. Then the user community can help maintain these analytic methods.

    If you encounter a fatal error that is detected by MAExplorer, it will popup an error reporting window. Please E-mail this data to us so we can try to resolve the problem.

    In the meantime, partially implemented commands are disabled to keep you out of trouble :-) ...

    You can help us and get MAExplorer to do more of the things you would like to see. Let us know of problems that you encounter as well as suggestions for changes or new methods you would like to see - send us E-mail.

    4.1.1 Browser Applet Bugs
    4.1.3 Downloading and Installer Bugs
    4.1.4 Computation speed and display Bugs
    4.1.5 User state and login Status
    4.1.6 Data file names Bug
    4.1.7 Gene Sets Bugs
    4.1.8 Clustering Bugs
    4.1.9 Expression profile plots
    4.1.10 Data conversion problems
    4.1.11 Java Plugins bugs

    4.1.1 Browser Applet Bugs

    1. These problems apply only when using MAExplorer as a Web browser applet, not when running it as a stand-alone application (which is what we recommend). MAExplorer has problems running as an applet on some Web browsers on some operating systems. It generally works with Netscape 4.7 or later on Windows PCs (95/98/NT/2000/XP) and on SUN Unix, and with Microsoft Internet Explorer 5.0 on PCs. It may not work as well as an applet on Macintoshes, even with Internet Explorer (although it works fine as a stand-alone application on the Mac), and problems have been reported on SGI systems because of browser problems.

      If you are experiencing Web browser problems using the MAExplorer applet, you might check the discussion of possible solutions.

    2. When run as a Web browser applet, you should only click once since multiple clicks may cause some browsers to hang. Note that when you click the first time, it starts downloading the applet. Clicking a second time may cause a second request for the applet to be launched. This interferes with the first request and so neither is started correctly.

    3. While the MAExplorer applet is being downloaded, some browsers do not indicate that a download is in progress, so it may appear that nothing is going on. So please be patient. When the applet starts, the main window will pop up and it continues to download data. Wait a while for it to load before giving up. For a 28Kb Internet connection, this could take several minutes. When it is finished, the menu bar will become active and it will display Ready - click on a gene to query database.

    4.1.3 Downloading and Installer Bugs

    1. We have had reports of problems with installing or running MAExplorer on MacOS 9 since we started distributing MAExplorer with the new InstallAnywhere version 5.0.2. We suggest you switch to MacOS-X or use a Windows, Linux or Solaris system until this gets resolved.

    2. Installer does not start or does not download the data. Normally, when you attempt to download MAExplorer to your computer, the InstallAnywhere software running in your Web browser will detect which OS you are running and suggest the particular download to use in:
           "Recommended version for your computer
            Download installer for ...your OS..."?
      
      Occasionally, we have seen instances where you can not install MAExplorer from within the Web browser. The solution is to explicitly download the particular Platform for your OS from the Available Installers list, and then to follow the instructions on running it.

    3. If you have problems downloading this with Netscape 4.7x or later, then try Internet Explorer 5.0. The cause could be that the MIME types in your particular browser are not set up correctly.

    4. On Solaris, and possibly other Unix systems, you may have problems with the stack limits. Do a "man limit" to read about the command for your particular Unix shell. We have found that the following seems to work for CSH when added to your .cshrc startup file:
         limit stacksize unlimited
      With versions 96.25 and later of MAExplorer, you may also adjust the memory size using the (Edit menu | Preferences | Resize MAExplorer memory limits for the next time it is run) command. This edits the MAExplorer.lax startup file according to the new memory limits you specify.
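
    One way to confirm that a changed memory limit is in effect is to ask a running Java Virtual Machine for its heap ceiling. The following is a minimal sketch and is not part of MAExplorer: the class name MemoryLimitReport is hypothetical, and the program reports the limits of whichever JVM executes it, so it has to be started with the same memory options as the ones being checked.

```java
/** Hypothetical helper, not part of MAExplorer: reports the heap limits of the JVM running it. */
class MemoryLimitReport {

    /** Converts a byte count to whole megabytes (1 MB = 1024 * 1024 bytes), rounding down. */
    static long toMegabytes(long bytes) {
        return bytes / (1024L * 1024L);
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        // maxMemory() is the most heap this JVM will attempt to use.
        System.out.println("Maximum heap:  " + toMegabytes(rt.maxMemory()) + " MB");
        // totalMemory() is the heap currently reserved; freeMemory() is the unused part of it.
        System.out.println("Reserved heap: " + toMegabytes(rt.totalMemory()) + " MB");
        System.out.println("Free in heap:  " + toMegabytes(rt.freeMemory()) + " MB");
    }
}
```

    For example, starting it with "java -Xmx256m MemoryLimitReport" should report a maximum heap close to 256 MB; the exact figure varies a little between JVMs.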
      

    4.1.4 Computation speed and display Bugs

    1. Because it does a lot of computation, MAExplorer runs best on a fast computer with lots of memory. An optimal system should have at least 128Mb of memory and at least 500MHz CPU speed. For doing hierarchical clustering we recommend 256Mb of memory, which will allow you to cluster larger sets of genes without the operating system paging. You may run out of memory on an under-powered computer. To increase the usable memory size to more than 256Mb (assuming your computer has it), you may need to edit the MAExplorer.lax file, which resides where you installed MAExplorer on your computer, to increase the heap and stack maximum limits.

    2. We have seen instances where a system with a lot of memory runs out of memory with a particular Web browser, Netscape 4.7 (again, this is not a problem when running stand-alone). Netscape 4.7 on the Windows PC seems to be able to use only about 5.5Mb before it runs out of memory. This does not seem to happen with Internet Explorer 5.0.

    3. MAExplorer is difficult to use with a system with a small screen because of the size and number of multiple graphic images and plots. We recommend a screen size of at least 1024x768. As they say these days, more is better :-), generally.

    4. For problems with MAExplorer and MacOS, you might also see our MacOS problems FAQ.

    5. For problems with MAExplorer and Sun Solaris OS, you might also see our SUNOS problems FAQ.

    6. On rare occasions we have seen a data-related problem on MacOS systems. If the data was edited in such a way that the line terminators are Return characters instead of Linefeeds or Return-Linefeeds, the data may appear corrupted and MAExplorer has difficulty reading it. The solution is to replace the Return characters with Linefeeds or Return-Linefeeds. One (painful) way to do this is to export the data to a system such as a Windows PC; read the data into Excel; save it in Excel worksheet file format; exit Excel; start Excel again on the new data files; and save the data files back into the original MAExplorer directories as tab-delimited text files. MAExplorer should recognize Return as a line terminator, and this bug will be resolved in a future release.
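
    The replacement of Return characters described in item 6 can also be done with a few lines of Java instead of a round trip through Excel. This is a minimal sketch and is not part of MAExplorer: the class name LineEndingFixer is hypothetical, and it rewrites each file in place, so work on a copy of the data. It converts both Return-Linefeed pairs and bare Returns to single Linefeeds.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper, not part of MAExplorer: rewrites Return line terminators as Linefeeds. */
class LineEndingFixer {

    /** Converts Return-Linefeed pairs and bare Return characters to a single Linefeed. */
    static String normalize(String text) {
        // Handle the two-character pair first so it does not turn into two line breaks.
        return text.replace("\r\n", "\n").replace('\r', '\n');
    }

    /** Reads a text file, normalizes its line terminators, and writes it back in place. */
    static void fixFile(Path file) throws IOException {
        // ISO-8859-1 maps every byte to one character, so all other bytes pass through unchanged.
        String text = new String(Files.readAllBytes(file), StandardCharsets.ISO_8859_1);
        Files.write(file, normalize(text).getBytes(StandardCharsets.ISO_8859_1));
    }

    public static void main(String[] args) throws IOException {
        // Each argument is the path of one tab-delimited data file to fix.
        for (String name : args) {
            fixFile(Paths.get(name));
        }
    }
}
```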

    4.1.5 User state and login Status

    1. User registration and the groupware access functionality are not available yet. However, login is available for collaborator databases on the MGAP MAExplorer Web server.

    2. The commands to save and restore user states for a particular user on the back-end server are not available yet. However, in stand-alone mode you may save the state (including all gene sets, condition sets, parameter conditions, etc.) on a local computer by doing a (File menu | Databases | Save As DB). The specialized groupware commands that will allow users to selectively share their saved states with particular other users on a public groupware server are not available yet.

    4.1.6 Data file names Bug

    1. There is a potential problem with using long file names when using MAExplorer with MacOS-8 or MacOS-9. Both of these operating systems limit file names to 31 characters. This could be a problem if importing data from another operating system (Windows, Unix, Linux, etc.) that does not have this restriction. The solution is to use short file names. This should not be a problem with MacOS-X since it allows file names with up to 255 characters.
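
    Before moving a database to a system with a file name length limit, the offending names can be listed with a short program. This is a minimal sketch and is not part of MAExplorer: the class name LongFileNameCheck is hypothetical, and the limit is passed as an argument so that it can be set to whatever length the target system accepts.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of MAExplorer: lists file names that exceed a length limit. */
class LongFileNameCheck {

    /** Returns the names that are longer than maxLength characters, in their original order. */
    static List<String> namesLongerThan(String[] names, int maxLength) {
        List<String> tooLong = new ArrayList<>();
        for (String name : names) {
            if (name.length() > maxLength) {
                tooLong.add(name);
            }
        }
        return tooLong;
    }

    public static void main(String[] args) {
        if (args.length != 2) {
            System.err.println("usage: java LongFileNameCheck <directory> <maxLength>");
            return;
        }
        // File.list() returns null when the path is not a readable directory.
        String[] names = new File(args[0]).list();
        if (names == null) {
            System.err.println("Not a readable directory: " + args[0]);
            return;
        }
        for (String name : namesLongerThan(names, Integer.parseInt(args[1]))) {
            System.out.println(name);
        }
    }
}
```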

    4.1.7 Gene Sets Bugs

    1. The full range of GeneClass subsets (Section 2.4.1) is not available yet. However, you may use the gene name guesser with a wildcard to put a set of matching genes into the Edited Gene List (e.g. "*ONCO*" to find all oncogenes or proto-oncogenes, etc). These EGL sets may then be saved as named gene sets and used in the data filter as "Filter by 'User Gene Set'". See the discussion of Gene Class ontologies using Gene Set operations (Section 2.4.1), which can achieve a similar effect.
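
    To illustrate what a wildcard such as "*ONCO*" selects, the matching can be sketched in Java. This is not MAExplorer's gene name guesser, whose internals are not described here: the class name GeneNameWildcard is hypothetical, and the sketch assumes that "*" stands for any run of characters and that matching ignores case.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical sketch, not MAExplorer code: selects gene names that match a "*" wildcard. */
class GeneNameWildcard {

    /** Builds a case-insensitive pattern in which '*' matches any run of characters (assumption). */
    static Pattern toPattern(String wildcard) {
        StringBuilder regex = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '*') {
                regex.append(".*");
            } else {
                // Quote every other character so that '.', '(' and so on are taken literally.
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(regex.toString(), Pattern.CASE_INSENSITIVE);
    }

    /** Returns the gene names that match the wildcard over their whole length. */
    static List<String> select(List<String> geneNames, String wildcard) {
        Pattern pattern = toPattern(wildcard);
        List<String> hits = new ArrayList<>();
        for (String name : geneNames) {
            if (pattern.matcher(name).matches()) {
                hits.add(name);
            }
        }
        return hits;
    }
}
```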

    4.1.8 Clustering Bugs

    1. The hierarchical clustering method is buggy. It currently clusters genes, but not samples; the latter is under construction. Centroid averaging is implemented and arithmetic averaging is under construction. Although it generates a clustergram plot, the dendrogram may not be drawn correctly using centroid averaging. Depending on the number of samples, if you do a complete hierarchical cluster for all genes and ESTs in a Netscape browser on a Windows PC, the browser may run out of memory and nothing appears. [When done in the stand-alone application, this is less of a problem since there are no memory restrictions.] You can do three things at this point: 1) switch to Internet Explorer; 2) decrease the number of genes being clustered (e.g. just "All named genes") or the CV filter threshold; or 3) disable the clustering cache. The third option will be done automatically for you and it will continue doing the clustering. You may turn off (or on) the Use cluster distance matrix cache option in the "Hierarchical Cluster Plots" submenu. Without the cache it will perform the clustering, but will take MUCH longer.

    2. If you are working with very large data sets and you start a find similar genes or K-Means clustering with a very high threshold, you can not change the threshold until it is done. In these situations, try pre-setting the threshold distance and number of clusters using the (Edit | Preferences | Adjust all Filter threshold scrollers) command. This will pop up the state scroller window with all of the thresholds. You may also select the current gene prior to starting clustering.
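
    The trade-off behind the cluster distance matrix cache mentioned in item 1 (more memory in exchange for speed) can be sketched as follows. This is not MAExplorer's implementation, which is not described here: the class name DistanceCache is hypothetical, and the sketch assumes Euclidean distance and one 4-byte float per pair of genes, stored as a packed upper triangle.

```java
import java.util.Arrays;

/** Hypothetical sketch, not MAExplorer code: caches pairwise gene distances in a packed triangle. */
class DistanceCache {
    private final float[][] data;  // one row of expression values per gene
    private final int n;           // number of genes
    private final float[] cache;   // n*(n-1)/2 entries; NaN marks "not computed yet"

    DistanceCache(float[][] data) {
        this.data = data;
        this.n = data.length;
        // int arithmetic is enough for a sketch; it overflows beyond about 46,000 genes.
        this.cache = new float[n * (n - 1) / 2];
        Arrays.fill(cache, Float.NaN);
    }

    /** Bytes needed to cache every gene pair once, assuming 4-byte floats. */
    static long cacheBytes(long nGenes) {
        return nGenes * (nGenes - 1) / 2 * 4L;
    }

    /** Returns the distance between genes i and j, computing it only on first use. */
    float distance(int i, int j) {
        if (i == j) {
            return 0f;
        }
        int lo = Math.min(i, j);
        int hi = Math.max(i, j);
        // Position of the pair (lo, hi) in the row-by-row packed upper triangle.
        int k = lo * n - lo * (lo + 1) / 2 + (hi - lo - 1);
        if (Float.isNaN(cache[k])) {
            cache[k] = euclidean(data[lo], data[hi]);
        }
        return cache[k];
    }

    private static float euclidean(float[] a, float[] b) {
        double sum = 0.0;
        for (int d = 0; d < a.length; d++) {
            double diff = a[d] - b[d];
            sum += diff * diff;
        }
        return (float) Math.sqrt(sum);
    }
}
```

    Under these assumptions, caching all pairs for 10,000 genes takes 199,980,000 bytes (about 190 MB), which suggests why a complete cluster of all genes can exhaust the memory available to a browser applet.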

    4.1.9 Expression profile plots

    1. The Expression Profile Overlay display is being improved to offer better graphics and interaction.

    2. We have observed some instances of the popup EP plot where the plot does not seem to correspond to the data. This may be related to the normalization method.

    3. When viewing an ordered list of EP plots associated with a sorted list of genes (e.g. clustering genes similar to a seed gene), if there are replicated genes in the clustered genes, they may not show up in the scrollable list of EP plots.

    4.1.10 Data conversion problems

    1. The Cvt2Mae data file converter application is currently available for beta-testing on a related MAExplorer Cvt2Mae Web page. It enables users to convert their data files (academic one-of-a-kind as well as commercial e.g. Affymetrix, Incyte, GenePix, Scanalyze, etc) to the standard MAExplorer files as described in Appendices C and D.

    2. If you are not able to convert your data using Cvt2Mae (which is currently in early Beta testing - see the Cvt2Mae Web page for current status), you can always manually convert the data using the examples in Appendix C.

    3. We have seen some problems running the converter with some types of data on MacOS 8-9. The temporary solution is to convert your data using another system such as Windows until we fix the problem.

    4. If you are using the NCI-CIT mAdb system to package data for MAExplorer, you must select all samples from the multiple projects that you want to compare with MAExplorer while packaging the data in mAdb (see the PDF document "How to download mAdb data For MAExplorer"). Currently, you can not easily merge data from separately unzipped files after you have downloaded them to your computer (without manually editing the SamplesDB.txt file). The solution is to use mAdb to prepackage all of the data you want to analyze in a single .zip file. To package data using mAdb: 1) select ALL of the projects that contain samples that you want (on the mAdb "Top Level Analysis Selection" web page); 2) further select the exact subset of samples that you want in the "Array Selection" table (on the "mAdb: Database Retrieval" web page). NOTE: currently, you can only select arrays that use the same GAL file so that they have compatible geometries. This is on our list of things to change so you can mix and match data from different chips.

    4.1.11 Java Plugins bugs

    1. Java Plugins are in the process of being implemented and will be made available to the user community to enable them to add their own analysis methods to MAExplorer.
    
    

    4.2 Revision notes

    This section lists the revision history and is useful for deciding whether to upgrade to the most recent release. You may want to check for the current "Stable release" available on the MAExplorer Web site. That may be different from the Stable release listed in this copy of the Reference Manual. The "Beta release" listed below the Stable release in the previous links is experimental and may generally be downloaded as it has more functionality. If you experience problems, you can just reinstall the Stable version.

    Note: An archive of some of the stable older releases is available on the NCI/LECB Web site for a limited period.

    
    

    • Version 0.96.34.01: (11-24-2003) Fixed problem introduced in the previous release where you could not increase the memory size invoked by the (Edit | Preferences | Resize MAExplorer memory limits for the next time it is run) command. To fix this bug you must reinstall MAExplorer using the full download using the Java installers. Also added ability to visualize the CV (coefficient of variation of a set of replicate samples) using the new (Analysis | Plots | Show microarray | Pseudocolor HP-EP 'list' CV (Coefficient of Variation) [RB]) command. It computes the CV for the HP list. So use the (Samples | Choose HP-X, HP-Y and HP-E samples) command to define the HP-EP list of samples you want to investigate.

    • Version 0.96.33.07: (08-15-2003) The code implementing the R eval paradigm has been refactored so that it is clearer and will be easier to use for other projects. See the RtestPlugin documentation Section 6 for details.

    • Version 0.96.33.04: (07-31-2003) Fixed minor bugs in MAExplorer and Reval data export for RLO methods.

    • Version 0.96.33.03: (07-23-2003) Fixed minor bugs in current OCL state as well as adding support for future MAERlibr.zip updates from the Web.

    • Version 0.96.33.01: (07-09-2003) Fixed bugs where the XY scatterplot and other functions may not work with some data where there were empty genes in the database. Changed the underlying control mechanism to switch between the 4 clustering methods so it is more robust and easier to use.

    • Version 0.96.32.10: (06-30-2003) added loessPlot() R method to MAERlibr R library. This is used by RtestPlugin for generated code.

    • Version 0.96.32.09: (06-26-2003) Added new command (Plugin menu | Save RLO reports in time-stamped Report/ folder [CB]). It puts files generated by R from successive executions of the same RLO into separate sub-folders in the Report/ folder with names "RLOname-YYMMDD-HHMMSS/" to keep the data separate. This is useful when you want to compare results from the same RLO method but with different MAExplorer preprocessing. Also added new command (Analysis menu | Filter | Filter by genes with non-zero intensity) which can be used to omit genes with zero intensity values in any samples of the HP-EP set of samples. This is useful when exporting data to RLOs that may have problems with dividing by 0.0, etc.

    • Version 0.96.32.08: (06-17-2003) Fixed bug in RtestPlugin for Version 0.96.32.07 where it did not always use the input file mode correctly. Simplified the "Add Template" usage to "Generate Template for 1st input file". Finished implementing the (Files | Update RLO methods from Web server) command so that it updates the MAERlibr package library file (lib/MAERlibr/R/MAERlibr) as well as the .R and .rlo files for the RLO methods.

    • Version 0.96.32.07: (06-15-2003) Fixed RtestPlugin editor bug that did not always set up parameters for New RLO correctly. There is still a bug that may require you to set the input files mode to the corresponding Demo BEFORE doing "Add Template". This will be corrected shortly.

    • Version 0.96.32.06: (06-13-2003) fixed RtestPlugin editor bug that did not always save the Routput files types. Also fixed bug in demo R script Test.R.

    • Version 0.96.32.05: (06-12-2003) fixed code generators and changed all R methods in the MAExplorer MAERlibr to have a "mae." prefix to minimize conflict with other packages. Added alpha-level support for XY overlay plots in MJAplot class of the API.

    • Version 0.96.32.04: (06-11-2003) fixed code generators on Demos 1-15. Demos 16-17 now broken with VSTACK error. Major code generators cleanup in RtestPlugin. Demo 15 (with use Filter Action) works but does not toggle the (Analysis | Filter | Filter by user gene set) menu checkbox. Also added code to R2MAE actions to better support reimporting gene sets back into MAExplorer.

    • Version 0.96.32.03: (06-08-2003) Moved the MAExplorer-R interface R library code from the .R files to a full R library that is now downloaded when you do the full download/install of MAExplorer. We will be adding an update for this MAERlibr library. The Web update is not currently available.

    • Version 0.96.32.01: (05-29-2003) Added (Files | Update RLO methods from maexplorer.sourceforge.net) command to update RLO methods from the server. RLO methods consist of pairs of R scripts and RLO files. Note: R must be installed on your computer to use these methods. When MAExplorer is restarted these new RLO methods will be available in the (Plugins | RLO methods) submenu.

    • Version 0.96.31.23: (05-27-2003) Fixed bug in reading RLOs if they were manually edited with a Windows text editor that replaces line feeds with carriage returns.

    • Version 0.96.31.22: (05-22-2003) Minor changes in RtestPlugin and R evaluation support code.

    • Version 0.96.31.21: (05-18-2003) Reorganized RtestPlugin code and updated some of the code generators.

    • Version 0.96.31.20: (05-14-2003) Fixed bug where MAExplorer could crash if R was not installed. This was also in the RtestPlugin. Now, it detects that R is not there if you cancel the prompt for specifying where R is when it can't find it. This is noted in a special .RbasePath file which contains "**NO_R_INSTALLED**". This is then used to prevent you from trying to access any R commands or menus (i.e. to keep you out of trouble).

    • Version 0.96.31.19: (05-14-2003) Fixed minor bugs in RtestPlugin and support for horizontal and vertical stacked data.

    • Version 0.96.31.18: (05-08-2003) Fixed minor bugs in RtestPlugin and support.

    • Version 0.96.31.17: (05-06-2003) Fixed prompt in "Update Plugins".

    • Version 0.96.31.16: (05-05-2003) Changed the "Update MAEPlugins from maexplorer.sourceforge.net" so that it will download ALL of the plugins if you answer YES rather than one at a time. More cleanup for RtestPlugin Template paradigm.

    • Version 0.96.31.15: (05-05-2003) In RtestPlugin, simplified R code to use new multi-line header data file paradigm and started to add data model Template generation code. OCL demo does not work with new paradigm.

    • Version 0.96.31.14: (04-30-2003) Fixed problem with R script evaluation if the sample names had spaces in the name. Quoting these fixed the problem.

    • Version 0.96.31.12: (04-28-2003) Fixed fatal error problem introduced in 0.96.31.12 if you start MAExplorer with no database. This version fixes bugs in some of the RtestPlugin Demo R code generators and data files.

    • Version 0.96.31.11: (04-25-2003) Fixed quote problem with executing R in some cases for RtestPlugin. Fixed Prev/Next problem.

    • Version 0.96.31.10: (04-23-2003) Added "Save ALL Demo RLOs" to RtestPlugin to generate demo RLOs for the (Plugins | RLO methods) menu the next time MAExplorer is restarted. The demo R Code generators are not validated yet.

    • Version 0.96.31.9: (04-22-2003) Fixed RtestPlugin problem with sample set names when generating RLOs for XY sets and OCL condition sample sets. Changed the export data formats so they are easier to use in R. The demo R Code generators are not validated yet.

    • Version 0.96.31.8: (04-21-2003) Fixed problem with RtestPlugin that sometimes showed up when installed in "C:/Program Files/" on a Windows system (needed to quote the path). It now works on Windows. Note: RtestPlugin Demos 10-13 and Demo 16 are not working correctly. The demo R Code generators are not validated yet.

    • Version 0.96.31.7: (04-18-2003) Gene Set actions now work for RLOs. The demo R Code generators are not validated yet.

    • Version 0.96.31.5: (04-13-2003) Extended the RLO definition and support code. Extended RtestPlugin to access this. The demo R Code generators are not validated yet.

    • Version 0.96.31.4: (04-11-2003) Added "RLO methods" to Plugins menu. This uses RLO files generated by the RtestPlugin. Additional bug fixes and better error reporting of R evaluation errors. The demo R Code generators are not validated yet.

    • Version 0.96.31.2: (04-07-2003) Extended RtestPlugin and corresponding changes in MJA API. The demo R Code generators are not validated yet.

    • Version 0.96.31.1: (04-03-2003) Cleaned up gene filter processing model.

    • Version 0.96.30.10: (03-29-2003) Better support for RtestPlugin.

    • Version 0.96.30.8: (03-28-2003) Modified the MJAReval, MJAgeneList, MJAfilter API class methods to support the RtestPlugin. RtestPlugin now supports "New RLO" at alpha level.

    • Version 0.96.30.6: (03-25-2003) The RtestPlugin editor is working (alpha level) for demonstration purposes-only at this time.

    • Version 0.96.30.5: (03-18-2003) The RtestPlugin is alpha level. Modified the MJAReval class methods to support this. This is still alpha-level, and may change.

    • Version 0.96.30.4: (03-17-2003) Added new functionality to evaluate MAExplorer data using the R system. There is a new MJAReval to access methods to do this. Using MJAReval, one can write MAEPlugins to process MAExplorer data in R and to bring back the results into MAExplorer. This is still alpha-level, and may change.

    • Version 0.96.30.3: (03-04-2003) Fixed radio-button usage of NormalizationPlugin so that only one normalization (either built-in or plugin) is active at a time. The MJA API was enhanced to add this functionality.

    • Version 0.96.30.2: (03-02-2003) Major enhancements of the NormalizationPlugin paradigm in the MJA library to support a wide variety of normalization methods - both global and local. This includes support code throughout MAExplorer to support it and to make it easier to debug. Although it works, the NormalizationPlugin is still Alpha-level and will be documented fully when it enters Beta-level code.

    • Version 0.96.29.3: (02-25-2003) Rebuilt installers (Windows, Linux, Solaris) with Java Virtual Machine (JDK) 1.4.1_01 that has improved performance. You must do a download and reinstall -- not just an update from (File menu | Update MAExplorer from maexplorer.sourceforge.net).

    • Version 0.96.29.3: (02-23-2003) Added more support for FilterPlugins. The ExampleXYdataFilterPlugin example now works. It is meant to be a model for more sophisticated plugins. [TODO] it needs parameters GUI popdown.

    • Version 0.96.29.2: (02-21-2003) Fixed many problems with MJA classes where some methods which should have been public were not so MAEPlugins could not access them. This has been corrected. The FtestNconditionsFilterPlugin is now working.

    • Version 0.96.29.1: (02-19-2003) Added popup alert message window for better informing users of conditions that prevent them from doing the operation they requested. They must press the Close button to pop-down the message, although they can do a SaveAs to a file to save the message. For complex problems, some of the messages may suggest what they need to do to correct the problem.

    • Version 0.96.28.3: (02-18-2003) Added 2 new filters: "Filter by HP-X,HP-Y 'sets' t-Test [p-Value] slider [RB]" and "Filter by current Ordered Cond. List (OCL) F-Test [p-Value] slider [RB]". The p-Value related Filter tests were changed to radio buttons [RB] so that only one would be operative at a time. If another p-Value dependent test is on, it will be shut off if you select a different p-Value dependent test. Added new support methods in the MJA API in classes MJAsampleList, MJAstatistics, MJAfilter, and MJAmath. Implemented the "List OCL" and "List All" OCLs in the OCL chooser. Improved the GUI legends in the Condition Chooser and the Ordered Condition Chooser.

    • Version 0.96.28.1: (02-15-2003) fixed default R2 and CR2. Modified State Scrollers paradigm so that it uses slider usage counters (instead of flags) so that MAEPlugins may share sliders (e.g., p-Value, cluster distance, etc).

    • Version 0.96.27.1: (02-12-2003) Added named Ordered Condition List (OCL) editor as a built-in method.

    • Version 0.96.26.1: (02-08-2003) added methods to MJAsampleList to get either normalized Filtered data or normalized complete set of genes data. Fixed bug in CompositeDatabase.

    • Version 0.96.25.5: (01-28-2003) Fixed bug in "Remove Parameter" in the condition chooser. Added new methods MJAcondition, added method in MJAsampleList to get all raw sample data.

    • Version 0.96.25.4: (01-13-2003) Fixed bug in "Remove Parameter" in the condition chooser.

    • Version 0.96.25.3: (01-02-2003) Fixed bug in "Remove Condition" where it previously deleted the wrong condition list.

    • Version 0.96.25.2: (12-14-2002) Added "Remove Condition" to (Sample menu | Choose named condition lists of samples) as well as fixed several bugs related to the condition chooser. Also added the same command to the (Edit menu | Sets of conditions | Choose named condition lists of samples) menu since it is a condition editing function.

    • Version 0.96.25.1: (12-11-2002) Made some of the tests for names case-independent for smoother operation.

    • Version 0.96.25: (12-06-2002) Added a new command (Sample menu | Choose named condition lists of samples) that lets you define or edit new named lists of hybridized samples.

    • Version 0.96.24.12: (12-04-2002) Added support for RefSeqID and alternate spellings of other genomic identifiers. Modified MJAcondition class to support ConditionChooser.

    • Version 0.96.24.11: (12-02-2002) Added support for RefSeqID and alternate spellings of other genomic identifiers. Modified MJAcondition class to support ConditionChooser.

    • Version 0.96.24.10: (11-30-2002) Updated the default URLs used for GenBank, UniGene, LocusLink, and added OMIM ID support.

    • Cvt2Mae Version 0.73 (11-26-2002) Added ability to specify multiple chips/sample (e.g. Affy U133A and U133B) for the MAS5.0 data. It will merge the data into one sample during the conversion.

    • Version 0.96.24.8: (11-20-2002) The ConditionChooserPlugin is now working (alpha) and is included in the distribution. Additional changes were made to MAExplorer to support it.

    • Version 0.96.24.7: (11-18-2002) Fixed Scrollbar initialization problem for the non-linear scrollers (p-Values, CV value, etc.)

    • Version 0.96.24.6: (11-15-2002) Fixed Q&A error in the command "Resize MAExplorer memory limits for the next time it is run" in the (Edit | Preferences) menu. Added or debugged methods to MJAproperty and extended MJAcondition MJA API.

    • Version 0.96.24.5: (11-07-2002) Added command to resize the memory limits of MAExplorer. The command "Resize MAExplorer memory limits for the next time it is run" is in the (Edit | Preferences) menu. After the command is run, you must exit and restart MAExplorer for the new memory limits to take effect. It changes the memory limits in the MAExplorer.lax startup file.

    • Cvt2Mae Version 0.72.4 (11-07-2002) Fixed bugs in "Define Gipo Fields" where it would be available under certain conditions. Also fixed another bug introduced with the addition of the MAS5.0 data.

    • Cvt2Mae Version 0.72.3 (11-05-2002) Increase buffer sizes to fix problem for very large input files with large gene descriptions.

    • Version 0.96.24.4: (11-03-2002) Fixed bug in non-linear sliders in State Scrollers. This had affected the values for the p-value and CV value. Extended functionality of popup State Scrollers sliders window. You can now toggle between all sliders, even those not active, and only those that are active.

    • Cvt2Mae Version 0.72.2 (11-02-2002) Fixed bugs in Affymetrix MAS5.0 where the "Define Quant Fields" were not being mapped automatically and separate files were not converting correctly.

    • Version 0.96.24.2: (10-24-2002) Added (File | Update Plugins) operation. It will prompt you on whether you want to update each default plugin one by one.

    • Cvt2Mae Version 0.71.2 (10-23-2002) Moved the name of the available Cvt2Mae.jar version number to the popup dialog query when doing the "Update Cvt2Mae" operation.

    • Version 0.96.24: (10-22-2002) Extended the number of grid groups from 52 [A-Z,a-z] to 152 [A-Z,a-z,1-100]. Moved the name of the available MAExplorer.jar version number to the popup dialog query when doing a (File | Update MAExplorer) operation.

    • Version 0.96.23: (10-21-2002) Extended the MJAgene.java API class so MJAPlugins can set gene annotations and properties from the plugin using external data.

    • Version 0.96.22: (10-18-2002) Extended the (File | Update MAExplorer) command so that it reads the version number of the MAExplorer.jar file on the maexplorer.sourceforge.net server and displays it prior to requesting the user to finish requesting the update. This allows users to see if they want to do the update.

    • Cvt2Mae Version 0.71.1 (10-16-2002) Added a "Update Cvt2Mae" button. This will (1) backup the current Cvt2Mae.jar file as Cvt2Mae.jar.bkup; (2) copy the latest Cvt2Mae.jar file from the maexplorer.sourceforge.net Web site and replace your Cvt2Mae.jar file in your installation directory. Then when you restart MAExplorer, it will use the new version of the program. The much more time consuming alternative is to do an entire download and reinstallation from the Web site.

    • Version 0.96.21: (10-15-2002) Fixed comments for (File | Update MAExplorer) to give the actual server it will use. Fixed default PseudoArray Image to use when have single Cy3/Cy5 sample to use (Red-Yellow-Green) display. Cvt2Mae Version 0.70.7 (10-15-2002) has a major re-indentation of the Java source code to make it easier to read using the convention described in the previous revision.

    • Version 0.96.20: (10-14-2002) We added a "Define GEO Platform ID" command for arrays that were submitted to NCBI's GEO (Gene Expression Omnibus) for use with future MAEPlugins. You can now use the new "Update MAExplorer" command in the File menu. This will (1) backup the current MAExplorer.jar file as MAExplorer.jar.bkup; (2) copy the latest MAExplorer.jar file from the maexplorer.sourceforge.net Web site and replace your MAExplorer.jar file in your installation directory. Then when you restart MAExplorer, it will use the new version of the program. The much more time consuming alternative is to do an entire download and reinstallation from the Web site. This version also has a major re-indentation of the Java source code to make it easier to read. It uses the automatic Java indentation engine in Forte 4.0 (2 spaces/tab, convert tabs/space, space before '{', no space before '(').

    • Cvt2Mae Version 0.70.6 (10-11-2002) Fixed bug in Configuration file generator.

    • Version 0.96.18: (10-09-2002) Changed popup HP chooser initial window sizing so it works more consistently across platforms.

    • Version 0.96.17: (10-04-2002) Fixed bug with DetValue slider (the test was reversed). Also increased the dynamic range and precision of value indicated. The default is to leave the Detection Value filter off. Changed the default PseudoArray Image if there is one Cy3/Cy5 sample from the intensity display to the sum of Cy5 (red) and Cy3 (green) channels.

    • Cvt2Mae Version 0.70.5 (10-03-2002) Fixed bug in Affy MAS4 converter when selected "Generate genomic IDs from Description". Also fixed bug in Quant Field assignments so it now shows the "DetValue" field when it was selected by the DetValue checkbox in the Genomic Identifiers Wizard.

    • Version 0.96.16: (9-30-2002) Fixed bug with positive data filter whereby it defaulted to being on. This was changed so that you need to explicitly enable it in the Filter menu.

    • Cvt2Mae Version 0.70.4 (09-27-2002) Added GEO Platform ID support.

    • Cvt2Mae Version 0.70.3 (09-23-2002) Fixed the ArrayLayout Affy MAS5 tmp file converter. It now appears to generate the temp file correctly and read it correctly. Need to verify the code generators for the GIPO.txt file. There was a problem with embedded double quotes and the MAS5 ALO was not correct. Also had to add the additional fields to the FieldMap when the tmp file is extended.

    • Version 0.96.15: (9-19-2002) Added a new "Filter by Spot Detection Value" data filter. This can be used with the new Affymetrix MAS5.0 "Detection p-value" mapped through the "DetValue" or "CorrCoef" field in the .quant file. The next version of Cvt2Mae after 0.70.2 will output the "DetValue" field.

    • Cvt2Mae Version 0.70.2: (09-18-2002) Added automatic delete of old default Affymetrix*.alo files when Cvt2Mae starts up. This was required since we changed the names to reflect the MAS-4.0 and MAS-5.0 versions of the analysis software. Without this, you would have multiple names for the same files in the Array Layout chip list.

    • Cvt2Mae Version 0.70.1: (09-17-2002) Fixed bugs introduced in 0.70. Added remap code to better handle MAS5.0 data.

    • Cvt2Mae Version 0.70: (09-15-2002) Added Affymetrix MAS5 array layout so it will map (Signal, Detection, Detection p-value) to (RawIntensity, QualCheck, CorrCoef) Quant data fields. Fixed bug in "SwissProtID" data parser.

    • Version 0.96.14: (9-14-2002) Fixed incorrect base URL in "Help" pull-down menu. It had pointed to the old MAExplorer Web site, not the new http://maexplorer.sourceforge.net/MAExplorer/. The same documents were always visible from the Web site, but this makes it more convenient.

    • Cvt2Mae Version 0.68: (08-30-2002) Fixed bugs in Edit Layout where it was not always saving the changes you make to the state. Pressing Cancel in the Edit Layout restores the initial state before starting the edit. Pressing Back saves the state of the current panel you are working - previously only Next or Finish did that.

    • Version 0.96.12: Fixed bug introduced in version 0.96.11 such that it would pop up the dialog query when you did an "Open Disk DB" command.

    • Version 0.96.11: Extended MJAeval methods so that one can more easily invoke a new .mae startup file via a client-server MAEPlugin as well as the invokeCommand method. Modified internal methods and structures to support this.

    • Cvt2Mae Version 0.67: (08-15-2002) Fixed bug where the generated pseudoarray image was overriding the user specified (grid,row,column) geometry.

    • Version 0.96.10: Fixed bug in pseudoarray image whereby the current gene icon (yellow circle) sometimes was not reset properly for a database with X,Y coordinate data if the current gene was selected and you switched the current sample.

    • Version 0.96.08: Fixed bug in pseudoarray image whereby the current gene icon (yellow circle) sometimes was not reset properly if you toggled the mouseover checkbox, selected a different sample using the sample list in the pseudoarray image, or scrolled the pseudoarray.

    • Version 0.96.07: Fixed bug in font size updates for the pseudoarray image and scatter plot. Added methods to MJAproperty to get font size and font family.

    • Version 0.96.06: Modified data filter property bits for Gene, MJAgene, and Filter to make them more consistent when writing MAEPlugins.

    • Version 0.96.05: Fixed bug in data filter that caused problems with FilterPlugins.

    • Version 0.96.04: Fixed bug where, when running with no startup database, it did not refresh spot lists in the PseudoArray image.

    • Version 0.96.02: Fixed bug when running with no startup database and no default MGAP demo database. It now lets you search the disk to find a .mae startup file instead of aborting with a Dryrot error. Fixed bug with chooser where it sometimes fails to let you access the last element of a list. The NCI/LECB site and the SourceForge site now both have the same full installers with JDKs (except for AIX and HPUX) and with the MGAP database.
      Cvt2Mae Version 0.66: Fixed a bug so that it is now easier to automatically find the first row of spot data when scanning the user's input data file. Previously, the user might have had to manually enter that starting data row number in the Edit Layout wizard.

    • Version 0.96.01: Initial release on SourceForge. Note: The NCI/LECB site will have the same installers, but with JVMs, if you want to download a version with the JVM for your computer. The SourceForge site does not have the JVMs because of space limitations.


    • Version 0.95.20: Inserted Open Source License into all code.

    • Version 0.95.19: Changed default for (File | Save file DB) and (File | SaveAs file DB) so that it saves all samples in the new .mae startup file - not just the samples in HP-X 'set', HP-Y 'set' or in HP-E 'list'. Made the three status lines in the main window non-editable.

    • Version 0.95.18: Maintenance release. The Open Java API has been changed in the mjaGeneList, mjaGene, mjaEval classes.

    • Version 0.95.17: Maintenance release. The Open Java API has been updated and now includes working updateCurGene(), updateFilter(), updateSlider(), updateLabel(), and close() methods as well as new methods in the MJAgeneList API class. The stable release is now the version with the MaeJavaAPI active. The previous stable release 0.95.04 did not have the API.

    • Version 0.95.16: Maintenance release.

    • Version 0.95.15: Maintenance release to fix FilterPlugins so the FilterTestPlugin example is working. Fixed bug recently introduced that prevented selecting a spot in the pseudoarray image.

    • Version 0.95.14: Fixed bug recently introduced that prevented menu commands from working correctly.

    • Version 0.95.12: Maintenance release. Changes to help support MAEPlugins and additional functions in the Open Java API. Fixed Quant input data parser bug so it now handles strictly lower-case letters for (Grid,Row,Column) data.

    • Version 0.95.11: Maintenance release. Changes to help support MAEPlugins and additional functions in the Open Java API.

    • Version 0.95.10: Maintenance release. Changes to help support MAEPlugins and additional functions in the Open Java API. Menu Plugin command is enabled to load MAEPlugins built using the Open Java API (under construction).

    • Version 0.95.08: Maintenance release to help support MAEPlugins and additional functions in the Open Java API. Menu Plugin command is enabled to load MAEPlugins built using the Open Java API (under construction). Added Plugins/ folder to distribution that is installed where you install MAExplorer. It contains several alpha-level JAR file plugins that work with the (Plugins | Load Plugin) command. We will be releasing the source code as soon as the Open Java API alpha release is put on the MAEPlugins Web site.

    • Version 0.95.05: Maintenance release. Changes to help support MAEPlugins and additional functions in the Open Java API. Menu Plugin command is enabled to load MAEPlugins built using the Open Java API (under construction). Added (Plugins | Test Plugin Code) to invoke code for a plugin in a debugging context (maintenance).

    • Version 0.95.04: Same as Version 0.95.03, but optimized. It was compiled with the optimizing Java compiler without debugging symbol tables. All Plugin support code was removed. This results in a MAExplorer.jar file of 449Kb vs 599Kb. Since the MGAP Web site Applet version can not use Plugins, this version is optimized for Applets and is to be preferred over the Beta 0.95.03 version for now.

    • Version 0.95.03: Maintenance release. Changes to help support MAEPlugins. Menu Plugin command is enabled, but will only load MAEPlugins built using the Open Java API (under construction).

    • Version 0.95.01: Increased the number of spots/array that may be used from 16K to unlimited. Maintenance release. Changes to help support MAEPlugins.

    • Version 0.94.15: Fixed bug where occasionally, it would not load additional samples from the (Samples | Set Samples from Lists | ...) commands.

    • Version 0.94.14: Maintenance release. Changes to help support MAEPlugins. This release fixes a menu-related bug that occurred if you used (File | Open File DB) to switch to a different database or to start the initial database rather than starting directly from the .mae startup file. The problem was that some of the sub menus generated might not correspond to the type of database. For example, if the previous database was an intensity database with duplicates F1 and F2, switching to a Cy3/Cy5 database might have some of the menus still appear as (F1/F2).

    • Version 0.94.12: Maintenance release. Changed placement of magenta circle marker so it is more visible in ClusterGram when selecting genes in the dendrogram/ClusterGram for the EGL by clicking on a gene with the Control key pressed.

    • Version 0.94.11: Fixed fatal error introduced in 0.94.10 that occurs if you start the database with no data. Version 0.94.10 did work if you started it with a particular .mae startup file.

    • Version 0.94.10: Maintenance release. The "Filter by positive Quant data" toggle checkbox in the (Analysis | Filter menu) is enabled for all databases - not just those requesting it. If the database has 2 channels (F1, F2) or (Cy3,Cy5), each channel is checked. If the background correction is enabled, the background corrected values are tested to see if any of them are negative. Changed the labels in the scatter plot so that instead of displaying the current gene values as intensX= and intensY= or intens'X= and intens'Y=, it now displays them as X= and Y= or X'= and Y'=. The ' indicates that background correction is active. A new (Analysis | Plot | Show Microarray | "Scale pseudoarray image by 1/100 to zoom low-range values") command has been added to rescale intensity and (Cy3+Cy5) and (HP-X + HP-Y) (Red-Yellow-Green) plots so that low values are easier to visualize.
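
      A minimal Java sketch of the per-spot test described for the positive Quant data filter is given below. The class and method names are hypothetical, and letting a value of exactly zero pass is an assumption; the text above only says that negative values are tested for.

```java
// Hypothetical illustration of the "Filter by positive Quant data" test.
final class PositiveQuantFilterSketch {

  /**
   * channels:    raw values for each channel of one spot, e.g. (F1, F2) or (Cy3, Cy5).
   * backgrounds: background estimates for the same channels.
   * Returns false if any channel value (background corrected when requested) is negative.
   */
  static boolean passes(double[] channels, double[] backgrounds, boolean useBackgroundCorrection) {
    for (int i = 0; i < channels.length; i++) {
      double value = useBackgroundCorrection ? channels[i] - backgrounds[i] : channels[i];
      if (value < 0.0) {   // assumption: exactly zero is not treated as negative
        return false;
      }
    }
    return true;
  }
}
```

      With channel values (100, 50) and backgrounds (10, 60), the spot passes when background correction is off, but fails when it is on because the second channel becomes -10.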

    • Version 0.94.09: Maintenance and Reference Manual update. Added copy of "Show 'Edited Gene List'" to Edit EGL submenu so easier to find.

    • Version 0.94.08: Changed button layouts on popup plots so it is easier to resize them to remove the right part of the plot if you want to conserve screen space.

    • Version 0.94.07: Fixed bug recently introduced in the histogram plots which prevent them from working. Fixed another histogram plot bug where the frequency count for "><" thresholding was reported incorrectly.

    • Version 0.94.06: Added saving of the current values of all of the State Threshold scrollers, set with (Edit menu | Preferences | Adjust all Filter threshold scrollers), to the message logging area when the State Thresholds popup window is closed. The Cvt2Mae (Version 0.60) has been extended to now handle Scanalyze data files as well as supporting a separate GIPO input data file.

    • Version 0.94.05: Added logging of MAExplorer messages to buttons in popup plots and reports. Fixed bug where old fixed gene sets for empty wells were not always restored correctly from the previous state.

    • Version 0.94.04: Added logging of MAExplorer messages and command history in popup windows in the View Menu as "Show log messages" that occur during a session and "Show log of command history" respectively. The windows may be saved in log files. Added additional error detection of bad (field,grid,row,column) data when reading the GIPO and Quant files.

    • Version 0.94.03: Maintenance release for improving MAEPlugin support.

    • Version 0.94.02: Minor bug fixes related to the renaming of HybProbe to Samples menus. Also, extensive update of reference manual in both text and figures to show many of the renamed menus so it is synchronized with the program.

    • Version 0.94.01: Major version release.
      Renamed all previous references in the program to "hybridization probe" or "hybridization sample probe" to the new term "hybridized sample" for clarity. Changed many "HybProbe Menu" to "Samples Menu" and also many menu selections as well as plot and report labels to reflect this change. We are in the process of updating the manual computer screen figures and PDF slide presentations so that "hybridization sample probe" is shown as "hybridized sample". Also fixed other minor problems including fixing the inverted color scale for the "Pseudocolor Red(Cy5)-Yellow-Green(Cy3) Cy5+Cy3 ratio or Zdiff" command.

    • Version 0.93.03: Minor reorganization of (Plots | Clusters) submenus. This is the last release using the "HybProbe" menu.

    • Version 0.93.02: Beta-test of JDK-1.3 MAExplorer.jar file and new InstallAnywhere 4.5 installer. This adds JDK-1.3 installers for MacOS-X, IBM AIX, HP-UX, Linux. However, MacOS 8.1-9.x use the MRJ-2.2.5 JDK1.1.8 library. It also fixes a problem recently found running MAExplorer on some Windows-NT systems.

    • Version 0.93.01: Major version release.
      Moved "Cluster Plots" submenu of the "Plots" submenu up one level in the "Analysis" menu.

    • Version 0.92.24: Maintenance and added more Plugins support. Top level Plugins menu appears, but is disabled in production version.

    • Version 0.92.23: Maintenance and changed names of some commands and some messages. Major edit of Reference Manual including expanded table of contents.

    • Version 0.92.22: Last Stable Release.
      Major version release. Optimized colors in grayscale display for "Pseudocolor (HP-X,HP-Y) 'sets' p-value". Fixed error and optimized t-test computation.

    • Version 0.92.21: Changed grayscale display for "Pseudocolor (HP-X,HP-Y) 'sets' p-value" to display it as a color spectrum. Changed data displayed so p-Value is in the front of the list when you click on a spot etc. Added tests to disallow this display unless the data supports it.

    • Version 0.92.20: Beta testing new pseudoarray image display "Pseudocolor (HP-X,HP-Y) 'sets' p-value" that displays a gray value proportional to the p-Value in a t-Test of the HP-X 'set' vs HP-Y 'set'.

    • Version 0.92.18: Improved EP plot labels so it no longer prints the "MID:" prefix. Improved spot features list so it no longer prints the "plate[null,null,null]" if the plate coordinates are not present in the GIPO file.

    • Version 0.92.17: Fixed a bug which incorrectly computed the max and minimum ratio Cy3/Cy5 ratio values if the data allows negative intensity on these channels after background subtraction. This is required to support the GenePix array data format converter in the new version of Cvt2Mae (11-24-2001).

    • Version 0.92.15: A new (Edit | Preferences) command, "Adjust all Filter threshold scrollers", pops up the state scroller window with all of the thresholds. This is useful when you want to adjust thresholds before you enable data Filtering or clustering. A bug was fixed where the state of the previous clustering method was not always cleared if you aborted clustering by clicking on the delete window button (eg. in Windows or the Mac). Clicking on a row with the Control (Shift) key pressed in the ClusterGram for hierarchical clustering will add (remove) that gene to (from) the Edited Gene List. If the (View | Show 'Edited Gene List') is enabled, then it will draw a magenta '*' before the gene name to indicate that you have selected it. This lets you select a particular subset of the genes from the hierarchical cluster. The Cvt2Mae data converter now has an alpha-release to let you convert non-standard data using the <User-defined> data option.
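
      The Control/Shift click behavior for the Edited Gene List can be pictured with the following Java sketch. It is not the MAExplorer implementation: the class, its methods and the plain-text '*' prefix (standing in for the magenta marker) are all hypothetical.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical model of the Edited Gene List (EGL) click handling.
final class EditedGeneListSketch {
  private final Set<String> genes = new LinkedHashSet<>();

  /** Control-click adds the gene to the EGL, Shift-click removes it, a plain click does nothing. */
  void onRowClick(String geneName, boolean controlDown, boolean shiftDown) {
    if (controlDown) {
      genes.add(geneName);
    } else if (shiftDown) {
      genes.remove(geneName);
    }
  }

  boolean contains(String geneName) {
    return genes.contains(geneName);
  }

  /** Row label: a '*' is drawn before the gene name when the EGL display is enabled and the gene is in it. */
  String rowLabel(String geneName, boolean showEditedGeneList) {
    return (showEditedGeneList && genes.contains(geneName)) ? "*" + geneName : geneName;
  }
}
```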

    • Version 0.92.14: The naming of samples and sample file IDs has been made more flexible. If the "Database_File" is longer than 32 characters, there may be problems reading that file with MacOS-8/9. Therefore, you can use the "DatabaseFileID" to specify the "Database_File" (in which case they would be the same). Then, the labels used for the samples are the "Sample_ID" fields. This is upwards compatible with the previous method which used the "Database_File" for both the file name and the label name. The Reports of Highest (Lowest) Cy3/Cy5 data for a single sample were not easily accessible and this has been fixed. Alternate names for automatic detection of ESTs and ESTs similar to named genes were expanded and now also include the "EmptyWell" gene set (i.e. spots with no genes) (see Automatic Gene Class naming based on Gene Name in Appendix C Table C.4.1). A clustering option, "Use median instead of mean for K-means clustering", was added (Bickel, 2001).

    • Version 0.92.13: Maintenance release. Changed per-sample Good Spot filter algorithm and the CV filter algorithm. Added options to filter by either "HP-X 'set'" or "HP-Y 'set'". Algorithm now uses filters "HP-X or HP-Y" and "HP-X or HP-Y 'sets'" rather than the previous "HP-X and HP-Y" and "HP-X and HP-Y 'sets'". Added some of the missing state that was not being saved during a "SaveAs DB" operation when creating a checking .mae startup file.
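
      To illustrate what using the median instead of the mean changes in K-means clustering, the Java sketch below computes a cluster center from the expression profiles of its member genes either way. It is a generic illustration with hypothetical names, not the MAExplorer code; it assumes the center is computed independently for each sample (column).

```java
import java.util.Arrays;

// Hypothetical illustration of a mean-based versus median-based cluster center.
final class MedianCentroidSketch {

  /** Median of the values; for an even count, the average of the two middle values. */
  static double median(double[] values) {
    double[] sorted = values.clone();
    Arrays.sort(sorted);
    int n = sorted.length;
    return (n % 2 == 1) ? sorted[n / 2] : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;
  }

  static double mean(double[] values) {
    double sum = 0.0;
    for (double v : values) {
      sum += v;
    }
    return sum / values.length;
  }

  /** members[g][s] is the value of member gene g in sample s; returns one center value per sample. */
  static double[] center(double[][] members, boolean useMedian) {
    int nSamples = members[0].length;
    double[] center = new double[nSamples];
    for (int s = 0; s < nSamples; s++) {
      double[] column = new double[members.length];
      for (int g = 0; g < members.length; g++) {
        column[g] = members[g][s];
      }
      center[s] = useMedian ? median(column) : mean(column);
    }
    return center;
  }
}
```

      For three member profiles (1, 10), (2, 20) and (30, 300), the mean center is (11, 110) while the median center is (2, 20), showing that the median is much less affected by the outlying third profile.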

    • Version 0.92.12: Maintenance release for Plugins.

    • Version 0.92.11: Maintenance release. Increased width of scrollable sample names selection windows so it is easier to see the details on long sample names.

    • Version 0.92.10: Maintenance release. Due to a problem that some users are experiencing with the Java Installer, we are reverting to the previous version 4.01 (that does not support Mac-x). We will resolve these problems as soon as possible. Also fixed a problem with the "Pseudocolor Red(Cy5)-Yellow-Green(Cy3) Cy3/Cy5 ratio or Zdiff" display.

    • Version 0.92.09: Maintenance release. Distribution of Java application with InstallAnywhere 4.5 (previous was version 4.0.1). This now supports MacOS-X, other Unix systems and may fix some other installation problems.

    • Version 0.92.08: Fixed bug in the "Edit use (Cy5/Cy3) else (Cy3/Cy5) for each HP" command in the HybProbe menu where changes were not saved in the state. Fixed bug introduced in HP Cy3 vs Cy5 intensity scatter plot.

    • Version 0.92.07: Added display of Cy3 and Cy5 channel intensity when click on gene in Pseudocolor Red-Yellow-Green Cy3/Cy5 ratio or Zdiff mode. Also added additional support for plugin installation from Configuration DB and .mae startup files.

    • Version 0.92.06: Added the "Pseudocolor Red-Yellow-Green Cy3/Cy5 ratio or Zdiff" display, which shows the Cy3 and Cy5 microarrays as a pseudocolor image formed from the sum of Cy3 (Red) and Cy5 (Green).

    • Version 0.92.04: Internal changes. Also fixed bug in reading Affymetrix data converted with Cvt2Mae in the case where there is no genomic identifier and the "Location" is used as the genomic identifier.

    • Version 0.91.03: Made configuration database "swapRowsColumns" default to FALSE. This should probably never need to be true except for the initial MGAP database. If this flag was not set to false in previous configuration DB files, it would swap rows and columns in the pseudoarray image display.

    • Version 0.91.02: maintenance revision.

    • Version 0.91.01: Major version release.
      This corrects a few minor bugs including crashing when starting up an empty database (that bug was introduced sometime in the last month). It is the first release with some reorganized code.

    • Version 0.90.08: Fixed bug in "Filter by 'Good Spot data'" where the initial "Check spots for Good Spot mode" was not properly initialized.

    • Version 0.90.07: Additional fixes in QualCheck mapping, which now handles the three types: alphabetic codes, MAExplorer integer property codes, and a continuous floating point monotonic quality value. If the latter is used, then a "Spot Quality threshold" slider will pop up when it is invoked. The MAExplorer Plugins Web page is also available at the Web site and will track progress with the development of the MAExplorer Plugins Open Java API for investigators adding their own analysis methods.

    • Version 0.90.06: Fixed bugs in Genomic database popups. It was ignoring the request to popup a web page for particular genomic IDs.

    • Version 0.90.05: Added QualCheck mappings of Affymetrix "Abs Call" of "P" (or "G" or "T") to Good Spot, "A" to Bad Spot, and "M" (or "B" or "F") to Marginal Spot. This allows the "Filter by 'Good Spot data'" to handle this type of data.
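
      The Abs Call to spot quality mapping listed above can be written as a small lookup. The Java sketch below uses hypothetical class and enum names; trimming and upper-casing the code before matching, and returning UNKNOWN for any other code, are assumptions that the text does not state.

```java
import java.util.Locale;

// Hypothetical illustration of the Affymetrix "Abs Call" to QualCheck mapping.
final class AbsCallQualCheckSketch {

  enum SpotQuality { GOOD, BAD, MARGINAL, UNKNOWN }

  static SpotQuality fromAbsCall(String absCall) {
    if (absCall == null) {
      return SpotQuality.UNKNOWN;
    }
    switch (absCall.trim().toUpperCase(Locale.ROOT)) {
      case "P":
      case "G":
      case "T":
        return SpotQuality.GOOD;      // "P" (or "G" or "T") maps to Good Spot
      case "A":
        return SpotQuality.BAD;       // "A" maps to Bad Spot
      case "M":
      case "B":
      case "F":
        return SpotQuality.MARGINAL;  // "M" (or "B" or "F") maps to Marginal Spot
      default:
        return SpotQuality.UNKNOWN;   // assumption: unrecognized codes are not classified
    }
  }
}
```

      A "Filter by 'Good Spot data'" style test would then keep only spots whose mapped value is GOOD.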

    • Version 0.90.04: Added experimental GenomicMenu/URL configuration options for future use.

    • Version 0.90.03: Added QualCheck data filter for "Filter by 'Good Spot data'". This will be active if the QualCheck data is available in the .quant files.

    • Version 0.90.02: Fixed bug in some of the State scroller controls where adjusting a scroller latched it into a fixed state that could only be changed by restarting MAExplorer. Added LocusLink popup Web pages from LocusID (if it exists) or LocusLink from GenBank identifiers.

    • Version 0.90.01: Major version release.
      (very Beta) Changed the convention to use "MasterID" as the master gene index. This is more flexible than using the CloneID as the master gene index. Added LocusLink.

    • Version 0.89.41: Added new scatter plots for comparing Cy3/Cy5 ratio data to compare either Cy3 or Cy5 of HP-X against Cy3 or Cy5 for HP-Y. Added additional change limit constraints "Compare channels meeting range" to "Filter by spot intensity [SI1:SI2] sliders". This allows partial filtering by spot intensity.

    • Version 0.89.40: Some of the State Scrollers were too sensitive at low values. Therefore, we set them to respond non-linearly with a more precise vernier at the low end.

    • Version 0.89.39: Fixed bug in t-Test when using multiple samples with arrays with 1 Field.

    • Version 0.89.38: Fixed bug in Filter that allows the user to use either "positive only" or "positive or negative" intensity data (used with Affymetrix data).

    • Version 0.89.37: Increased the dynamic range of the scrollers for p-value, cluster distance, [Z1:Z2], |Diff XY|. Fixed labels in the Cy3 vs Cy5 plot so they are updated when the current sample changes.

    • Version 0.89.36: Fixed error in "Inside Range" for "Filter by Ratio ..." and "Filter by Cy3/Cy5 Ratio ...". Fixed HP-X,Y,E labels in the popup HybProbe menu "Choose HP-X,Y,E ..." command. Also added a new button "<<" that moves all entries from the Selected list to the Remainder list. Changed the way the Cy3 vs Cy5 (for ratio data) or F1 vs F2 (if you are using intensity data where all spots are replicated) scatter plots work - the HP-X vs HP-Y scatter plots are not changed. The data Filter still works for all of the plots. Changed the name of the "NPN clustering" method in the menus and reports to "K-means clustering" since that is essentially what it is.

    • Version 0.89.35: Improved HP sample ID reporting in various lists, etc. Added HybProbe menu command "Edit use (Cy5/Cy3) else (Cy3/Cy5) for each HP" for use with ratio data. This selectively swaps (Cy3,Cy5) data entries so that you may (carefully!) use dye-swap data for replicates. Added UP/DOWN buttons in the HP chooser to make it easier to adjust the order of the HP-E list.

    • Version 0.89.34: Changed the pseudoarray image generation paradigm so that it uses a fixed size spot and spacing and makes the pseudoarray image size dependent on the number of grids, rows, columns and fields. This results in a more consistent interface.

    • Version 0.89.33: Added "Filter by Cy3/Cy5 HP-X ratio or Zdiff sliders" to the Filter menu. This is useful for filtering data from a single sample.

    • Version 0.89.31: Added Rename command for Gene Sets and Condition lists. Added "SaveAs GeneSets" to K-means clustering. This saves all of the clusters as named Gene sets ("Cluster #1", "Cluster #2", etc.).

    • Version 0.89.30: Fixed problem with popup Web browser for Macintoshes.

    • Version 0.89.29: Optimized sorting on startup so it starts much faster when working with large arrays.

    • Version 0.89.28: Gene set and Condition set operations now let you select the sets by a pull-down choice menu as well as by typing in names of sets. Additional error checking added for input data files.

    • Version 0.89.26: Enhanced configuration file parser and sample name menu default; better error checking on input files to help catch some errors in manually edited user data files.

    • Version 0.89.25: Fixed configuration parser errors when reading non-standard data.

    • Version 0.89.24a: Fixed data for MAExplorer MGAP demo.

    • Version 0.89.24: Fixed bugs in popup browser to use GenBank ID (if it exists) with NCBI server, else it uses the Clone-ID (if it exists) with the nciarray cloneID-to-GenBank server.

    • Version 0.89.22: Fixed bugs in ClusterGram, saving current Gene Class.

    • Version 0.89.21: Fixed bugs in HP Chooser, Pseudocolor Red-Yellow-Green HP-XY plot menu, HP vs. HP correlation Report.

    • Version 0.89.20: Added correct scale for (Plot | Show Microarray | Pseudocolor Red-Yellow-Green HP-XY ratio or Zdiff).

    • Version 0.89.19: Added (Plot | Show Microarray | Pseudocolor Red-Yellow-Green HP-XY ratio or Zdiff) command to display the X and Y data as an additive sum of red (HP-X) and green (HP-Y) data.
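
      The additive red/green display can be illustrated by packing two scaled intensities into one RGB pixel, as in the Java sketch below. The names and the linear 0 to 255 scaling against a common maximum are assumptions made for illustration; the actual scaling used by MAExplorer is not described here.

```java
// Hypothetical illustration of an additive Red (HP-X) + Green (HP-Y) pseudocolor pixel.
final class RedGreenPseudocolorSketch {

  /** Returns a 0xRRGGBB value with red taken from x and green from y; blue is always 0. */
  static int toRgb(double x, double y, double maxValue) {
    int red = scale(x, maxValue);
    int green = scale(y, maxValue);
    return (red << 16) | (green << 8);
  }

  /** Linearly maps value in [0, maxValue] to [0, 255], clamping values outside that range. */
  private static int scale(double value, double maxValue) {
    if (maxValue <= 0.0) {
      return 0;
    }
    double fraction = Math.max(0.0, Math.min(1.0, value / maxValue));
    return (int) Math.round(fraction * 255.0);
  }
}
```

      With this scheme a spot expressed only in HP-X appears red, one expressed only in HP-Y appears green, and one expressed equally in both appears yellow, which is the origin of the Red-Yellow-Green name.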

    • Version 0.89.18: New version of download Java installer program that improves portability on different platforms (eg. MacOS, etc). Minor changes (spelling errors in parts of the program). Improved documentation.

    • Version 0.89.17: Added commands and functionality to a) find all copies of a named gene in the guesser using the SetE.G.L. button, b) added the "Replicate genes" GeneClass, c) added the "Filter by Genes with replicates" data Filter.

    • Version 0.89.16: Fixed several consistency checks when reading non-standard array data.

    • Version 0.89.15: Ignore enclosing space characters in SamplesDB, Configuration and .mae startup files.

    • Version 0.89.14: Fixed bug in Gene guesser when press the "Done" button. It sets the gene, but did not update the pseudoarray image. Disable genomic browser options in View menu if dbEST3'/5' or GenBankAcc3'/5' is not in the GIPO file database. Since GenBank and dbEST may be accessed via the mAdb Clone report or UniGene report, this should not be a problem. In the Reference Manual a) figures were updated for Genes instead of the older notation of Clone (Version 0.88.*); b) the Short Tutorial (Appendix A) was updated and clarified.

    • Version 0.89.12: Added "/Report" directory to hold SaveAs text (.txt) and plot window image (.gif) files. Improved fatal error (called DRYROT errors) reporting. Additional parts of the state saved.

    • Version 0.89.11: Added "Filter by positive quant. data" to filter out raw quantified data with negative values which may occur with some types of arrays. Also improved some of the popup window updates when the data Filter, normalization, EGL, or current clone changes.

    • Version 0.89.10: Fixed bug in correlation coefficient computation so it works correctly with all Gene Classes and data Filter.

    • Version 0.89.09: Fixed bug in reading configuration file that occurs if SubMenus are omitted. Fixed bug recently introduced when finding clones similar to the current clone.

    • Version 0.89.08: Fixed bug in auto-update of List of Gene Sets when used with adding(removing) genes using the Control(Shift) clicking on a gene in the array image, scatter plot, etc.

    • Version 0.89.07: Fixed bugs in state saving.

    • Version 0.89.06: The full State is now saved when you do a (File | SaveAs DB). The full state is restored when you start up the database you saved. Corrected bug in Gene Set Difference operator. Clarified some menu option names. Can now set the # of genes to report or Filter in a highest/lowest ratio genes report in (Filter | Set max # genes in highest/lowest report). Added Report for (Report | Genes in 'Normalization Gene List').

    • Version 0.89.05: Added information fields in HP chooser.

    • Version 0.89.04: Corrected bug in Ratio Median correction option if using ratio data (i.e. Cy3/Cy5). Improved scatter plot labeling.

    • Version 0.89.03: Optimized the main screen area usage. Corrected data-dependent bug which sometimes occurs with the Use Ratio Median correction option if using ratio data (i.e. Cy3/Cy5).

    • Version 0.89.02: Set the Use Ratio Median correction option if using ratio data (i.e. Cy3/Cy5).

    • Version 0.89.01: Major version release.
      Because MAExplorer can be used with both spotted clone arrays and oligo arrays, we renamed clones as genes (except where Clone ID is used) in both the MAExplorer program and in the Reference Manual.

    • Version 0.88.08: Enabled saving the name of the last project when you exit so it knows which file to use as a default when you do an "Open disk DB". Fixed the mnX/mnY data in XY-Statistics Report. Fixed title update in main window when opening a new database. Also has a fix for the Linux Java installer (tested on Red Hat Linux version 6.0).

    • Version 0.88.07: Disabled the (File | Set Project) command as there is occasionally a problem with reading the maeProject.txt database file which causes it to hang. This should not be a problem for users with multiple databases as you still get the popup startup file browser when you invoke (File | Databases | Open file DB).

    • Version 0.88.06: Edit submenu "Sets of Conditions" HP lists is working and is saved/restored with the File menu "Save disk DB"/"Open disk DB".

    • Version 0.88.05: Added mouse-over to main window and moved gene locator button to left. Added "Use ratio median correction" option to Normalization menu. This option can be used for rescaling Cy3, Cy5 intensity data if the medians are unequal. Also, the manual has been updated and more discussion on data mining has been added.
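
      One simple way to rescale two-channel data whose medians are unequal is sketched below in Java. This is an assumption about what such a correction could look like, not the formula used by MAExplorer: it multiplies every Cy5 value by median(Cy3)/median(Cy5) so that both channels end up with the same median. All names are hypothetical.

```java
import java.util.Arrays;

// Hypothetical illustration of a median-based rescaling of Cy5 against Cy3.
final class RatioMedianCorrectionSketch {

  /** Median of the values; for an even count, the average of the two middle values. */
  static double median(double[] values) {
    double[] sorted = values.clone();
    Arrays.sort(sorted);
    int n = sorted.length;
    return (n % 2 == 1) ? sorted[n / 2] : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;
  }

  /** Returns a rescaled copy of cy5 whose median equals the median of cy3. */
  static double[] scaleCy5ToCy3Median(double[] cy3, double[] cy5) {
    double medianCy5 = median(cy5);
    if (medianCy5 == 0.0) {
      return cy5.clone();   // cannot rescale against a zero median; leave data unchanged
    }
    double factor = median(cy3) / medianCy5;
    double[] scaled = new double[cy5.length];
    for (int i = 0; i < cy5.length; i++) {
      scaled[i] = cy5[i] * factor;
    }
    return scaled;
  }
}
```

      For Cy3 = (10, 20, 30) and Cy5 = (20, 40, 60), the factor is 20/40 = 0.5 and the rescaled Cy5 values become (10, 20, 30), so the median Cy3/Cy5 ratio is 1.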

    • Version 0.88.04: Edit submenu "Sets of HybridProbes" Condition lists are partially working (not working with File menu "Save disk DB"/"Open disk DB", set operations not enabled yet).

    • Version 0.88.03: Edit submenu "Sets of clones" and File menu "Save disk DB"/"Open disk DB" Save/Restore clone sets as .cbs clone bit set files in stand-alone.

    • Version 0.88.02: Cleaned up SamplesDB.txt and MaExplorerConfig.txt specifications and sample files in Appendix C.

    • Version 0.87.08: Fixed some bugs in HP-X,-Y,-E chooser. Added additional checking for missing field names in data files. If a field is incorrect, MAExplorer can't continue. It now gives a popup "Dryrot" error message to give us information about which fields are incorrect.

    • Version 0.87.06: May now use combined chooser for selecting the active HP-X, HP-Y and HP-E probes.

    • Version 0.87.05: May now save all popup text windows as .txt files in stand-alone mode.

    • Version 0.87.01: May now save all plot windows as GIF images as .gif files in stand-alone mode.

    • Version 0.86.36: Fixed bug in "Filter by spot intensity [SI1:SI2] sliders". Allow simultaneous viewing of HP-X vs HP-Y and Cy3 vs Cy5 (or F1 vs F2) scatter plots. You may now also simultaneously view HP-X/HP-Y and Cy3/Cy5 ratio histograms.

    • Version 0.86.35: A new "Filter by spot intensity [SI1:SI2] sliders" mode has been added to the Filter menu. It is useful for filtering individual changes when analyzing Cy3/Cy5 data.

    • Version 0.86.34: The new stand-alone distribution is now generated with InstallAnywhere 3.5 instead of 3.5 beta. This corrects some of the problems we had seen with very large core images.

    • Version 0.86.34: The handling of "Out of memory errors" seen when clustering large numbers of clones on a small computer has been changed. Previously, it shut off the cache and continued the computation - although VERY slowly. Now, it prints out the following message:
       There is not enough memory to cluster current filtered clones. Options:
          1. reduce the number of filtered clones and try again, or,
          2. disable cluster-cache (Clustering menu) - will be VERY slow.
      

    • Version 0.86.33: A new "Set project" command has been added to the File menu. It selects the current project, if multiple projects have been set up (using the future "New project" command).

    • Version 0.86.29: A new "Show mouse-overinfo" mode has been added to the View menu (default is on). It reports HP probe and spot/clone details in the array image, scatter plot, expression profile overlay plot, etc.

    • Version 0.86.27: In stand-alone mode, the "Open disk DB" now lets the user open a .mae startup file. The user browses the local file system to select the startup file to use.

    • Version 0.86.25: When using ratio (i.e. Cy3/Cy5) data, the menu entries, plot and report legends change the names from (F1,F2) to (Cy3,Cy5) where appropriate.

    • Version 0.86.24: Corrected a bug in which the ratio and intensity thresholding of mean HP data was using the single probe data.

    4.3 Web Browser problems when running MAExplorer as an applet

    Because MAExplorer is a large system, there may be occasional problems running it in some Web browsers on some operating systems. We recommend you run MAExplorer as a stand-alone application as it is more robust.

    4.4 Handling fatal error reporting (i.e. DRYROT errors)

    If you encounter a fatal error that is detected by MAExplorer, it will pop up an error reporting window. We call this a "DRYROT" error (thanks to "S.A.I.L." - Stanford AI Lab) because something is wrong in the program or in the user's data files from which it cannot recover. This type of error should not have happened. Please save and e-mail the report to us so we can try to fix the bug or diagnose the problem. The following figure shows an example of part of a DRYROT error report.

    Figure 4.4 Example of a fatal Dryrot Error window. This may occur for a variety of reasons. This window lists the main reason and also lists some of the MAExplorer state information. If you wish, you may save this window (press the "SaveAs" button) and mail it to us. We may try to correct the problem in the next release if it is a problem with MAExplorer. Alternatively, it could be a user data error.


    Figure 4.4.1 Example of a fatal Dryrot Error window after SaveAs. This tells you where the saved error message file was saved and the email address to send it to if you wish.


    Release Archive for stand-alone MicroArray Explorer on NCI/LECB

    This is an archive of some of the older stable versions of the MAExplorer stand-alone application program. These are the full installers which include the Java JDKs for all operating systems as well as the MGAP database. The user reference manual (available as a zip file) specific for that version is also included. After a while, we will remove some of the older releases. To find what the current and beta releases are, see the Install home page. The changes between releases are listed in Section 4.2 Revision Notes.

    Release    Release Date    Manual (.zip) for Release
    0.96.02    07-02-2002      -
    0.95.20    05-31-2002      MaeRefMan.zip (10Mb)
    0.95.16    05-24-2002      MaeRefMan.zip (10Mb)
    0.95.04    03-22-2002      MaeRefMan.zip (10Mb)


    Acknowledgements

    Primary contributors to MAExplorer were Peter Lemkin (LECB/NCI), Greg Thornwall (SAIC/FCRDC) in the Laboratory of Experimental and Computational Biology, NCI/NIH, and Jai Evans (DECA/CIT/NIH).

    Primary contributors to Cvt2Mae were Peter Lemkin (LECB/NCI), Greg Thornwall (SAIC/FCRDC), and Bob Stephens (ABCC/NIH).

    We wish to thank the many members of Lothar Hennighausen's Laboratory of Genetics and Physiology (NIDDK) who inspired the initial development of MAExplorer and its continued development. Thanks also to:

    Greg Alvord (SAIC/FCRDC),
    Kevin Becker and Chris Cheadle (NIA/NIH),
    Breast Cancer Think Tank (NCI),
    Damien Chaussabel (NIAID),
    Terry Clark and Josef Jurrek (U. Chicago),
    Mitko Dimitrov (LECB/NCI),
    Jai Evans and Chris Santos (DECA/CIT/NIH),
    Troy Moore (Research Genetics),
    Peter Munson (CIT/NIH),
    Alan Li (SourceForge),
    Quang Tri Nguyen (LECB+LCRC/NCI),
    John Powell and Esther Asaki (CIT/NIH),
    Eric Shen (U. Arizona),
    Moshe Shani (Agr. U. Israel),
    Richard Simon (NCI/NIH),
    Bob Stephens and Gary Smithers (ABCC/FCRDC),
    Ron Taylor (U. Colorado),
    Mark Vawter (NIDA/NIH, UC-Irvine),
    John Weinstein (LMP/NCI), David Kane (SRA/NCI), Ajay (LMP/NCI),
    and to many others for useful discussions and suggestions that have helped improve the MAExplorer's capabilities and usability.

    Thanks also to Jeff Thomas, Charmaine Richman, and Tom Stackhouse (NCI) for helping with the MAExplorer Open Source process.

    References to related exploratory data analysis methods and MAExplorer

    This short list of references is limited to a few related to exploratory data analysis methods for microarrays as they relate to MAExplorer. It is not meant to be inclusive. More extensive lists of references to many of the array preparation and data mining methods can be found in some of these papers and on the Internet.

    1. Beardsly T (1999) Scientific American, March 1999, 35-36.

    2. Bickel, D (2001) Robust Cluster Analysis of DNA Microarray Data: An Application of Nonparametric Correlation Dissimilarity. Proceedings of the Joint Statistical Meetings of the American Statistical Association (Biometrics Section, to appear). [http://www.mathpreprints.com/math/Preprint/bickel/20010730/3/]

    3. Brazma A, Hingamp P, Quackenbush J, Sherlock G, Spellman P, Stoeckert C, Aach J, Ansorge W, Ball CA, Causton HC, Gaasterland T, Glenisson P, Holstege FC, Kim IF, Markowitz V, Matese JC, Parkinson H, Robinson A, Sarkans U, Schulze-Kremer S, Stewart J, Taylor R, Vilo J, Vingron M (2001) Minimum information about a microarray experiment (MIAME) - toward standards for microarray data. Nature Genetics 28: 365-371. [See http://genomics.nature.com/supplementary_info and http://www.mged.org/]

    4. Cleveland WS (1985) The Elements of Graphing Data. Wadsworth Press, Monterey, CA, pp 1-323.

    5. DeRisi J, Penland L, Brown PO, Bittner ML, Meltzer PS, Ray M, Chen Y, Su YA, Trent JM (1996) Use of a cDNA microarray to analyze gene expression patterns in human cancer. Nature Genetics 14: 457-460.

    6. Dudoit S, Yang YH, Callow MJ, Speed TP (2000,Aug) Statistical methods for identifying differentially expressed genes in replicated microarrays experiments. TR-578, Dept. of Statistics, Stanford Univ., Stanford, CA.

    7. Eisen MB, Spellman PT, Brown PO, Botstein D (1998) Cluster analysis and display of genome-wide expression patterns. PNAS USA 95: 14863-14868.

    8. Jagota, A (2001) Microarray Data Analysis and Visualization. Bioinformatics By The Bay Press, 101 pp. ISBN-09700297-3-X.

    9. Kaufman L, Rousseeuw PJ (1990) Finding Groups in Data: An Introduction to Cluster Analysis. J. Wiley and Sons, Inc., New York.

    10. Kerr MK, Churchill GA (2001) Statistical design and the analysis of gene expression microarray data. Genet Res. 77(2):123-8. Review.

    11. Kerr MK, Churchill GA (2001) Bootstrapping cluster analysis: Assessing the reliability of conclusions from microarray experiments. PNAS 98:8961-8965.

    12. Korn EL, Troendle JF, McShane LM, Simon R (2001) Controlling the number of false discoveries: Application to high-dimensional genomic data. BRB, NCI, Bethesda, MD. Aug 22, 2001, Tech Report #3.

    13. Lipkin LE, Lemkin PF (1980) Data-base techniques for multiple two-dimensional polyacrylamide gel electrophoresis analyses. Clin. Chem. 26: 1403-1412. (GELLAB-II home page http://www.lecb.ncifcrf.gov/lemkin/gellab.html)

    14. Lemkin PF, Lester EP (1989) Database and search techniques for two-dimensional gel protein data: a comparison of paradigms for exploratory data analysis and prospects for biological modeling. Electrophoresis 10: 122-140.

    15. Lemkin PF (1995) Representations of protein patterns from 2D gel electrophoresis databases. In: Pickover, C., (Ed) The Visual Display of Biological Information, World Scientific Publishers, River Edge, New Jersey, pp 43-59.

    16. Lemkin PF (1993) Xconf: a network-based image conferencing system. Comput. Biomed. Res. 26, 1-27.

    17. Lemkin PF (1997) Comparing Two-Dimensional electrophoretic gels across the Internet. Electrophoresis, 18: 461-470. [Extended paper], (Web site http://www.lecb.ncifcrf.gov/flicker/)

    18. Lemkin PF, Thornwall G (1999a) Flicker image comparison of 2-D gel images for putative protein identification using the 2DWG meta-database. Molecular Biotechnology, 12:(2) 159-172. [Extended paper and Web site http://www.lecb.ncifcrf.gov/2dwgDB/]

    19. Lemkin PF, Myrick JM, Lakshmanan Y, Shue MJ, Patrick JL, Hornbeck PV, Thornwall GC, Partin AW (1999b) Exploratory Data Analysis Groupware for Qualitative and Quantitative Electrophoretic Gel Analysis Over the Internet - WebGel. Electrophoresis 20:(18) 3492-507. (PDF), (also see WebGel server: http://www.lecb.ncifcrf.gov/webgel).

    20. Lemkin PF, Thornwall GC, Walton KD, Hennighausen L (2000) The Microarray Explorer tool for data mining of cDNA microarrays - application for the mammary gland. Nucleic Acids Research 28:(22) 4452-4459. (PDF).

    21. Lemkin PF, Thornwall GC, Walton KD, Hennighausen L (2000) MicroArray Explorer - a tool for data mining of microarrays - Overview. (PDF).

    22. Lemkin PF, Thornwall GC, Walton KD, Hennighausen L (2000) MicroArray Explorer - a tool for data mining of microarrays - Examples. (PDF).

    23. Lemkin PF, Thornwall GC, Powell J, Asaki E (2000) Using MAExplorer with the NCI/CIT mAdb Web server. (PDF) and [http://nciarray.nci.nih.gov/]

    24. Lemkin PF (2001) Introduction to Data Mining of Microarrays using the MicroArray Explorer. (PDF) or (PPT).

    25. Lemkin PF (2001) Using Cvt2Mae to convert Affymetrix array data for MAExplorer. (PDF)

    26. Lemkin PF, Thornwall G, Hennighausen L (2001) MicroArray Explorer - A Java-based Tool For Data Mining Microarrays. Paper presented at the AMS-IMS-SIAM Summer Conference on Statistics in Functional Genomics, June 10-14, 2001. (PDF).

    27. McShane LM, Radmacher MD, Freidlin B, Yu R, Li M-C, Simon R (2001) Methods for assessing reproducibility of clustering patterns observed in analyses of microarray data. BRB, NCI, Bethesda, MD. July 1, 2001, Tech Report #2.

    28. Radmacher MD, McShane LM, Simon R (2001) A paradigm for class prediction using gene expression profiles. BRB, NCI, Bethesda, Nov 19, 2001, Tech Report #1.

    29. Shneiderman B (1997) Designing the User Interface, 3rd edition. Addison-Wesley Pub. Co., NY, pp 1-638.

    30. Schulze A, Downward J (2001) Navigating gene expression using microarrays - a technology review. Nature Cell Biology 3: E190-E195.

    31. Simon R, Radmacher MD, Dobbin K (2001) Design of Studies Using DNA Microarrays. BRB, NCI, Bethesda, MD. 2001, Tech Report #4.

    32. Sneath PHA, Sokal RR (1973) Numerical taxonomy. W.H. Freeman and Co, San Francisco. pp 1-573.

    33. StatSoft Inc. (2002) Electronic Statistics Textbook, Tulsa, OK: StatSoft. Available at http://www.statsoftinc.com/textbook/stathome.html

    34. Tufte E (1997) Visual Explanations. Images and quantities, evidence and narrative. Graphics Press, Cheshire, CT, pp 1-156.

    35. Tukey J (1977) Exploratory data analysis. Addison-Wesley Pub. Co., Reading, MA, pp 1-688.

    36. Vawter MP, Barrett T, Cheadle C, Sokolov BP, Wood WH III, Donovan DM, Webster M, Freed WJ, Becker KG (2000) Application of cDNA Microarrays to Examine Gene Expression Differences in Schizophrenia. Submitted.

    37. Weinstein JN, Myers TG, O'Connor PM, Friend SH, Fornace AJ, Kohn KW, Fojo T, Bates SE, Rubinstein LV, Anderson NL, Buolamwini JK, van Osdol WW, Monks AP, Scudiero DA, Sausville EA, Zaharevitz DW, Bunow B, Viswanadhan VN, Johnson GS, Wittes RE, Paull KD (1997) An Information-Intensive Approach to the Molecular Pharmacology of Cancer. Science 275: 343-349.

    38. White KP, Rifkin SA, Hurban P, Hogness DS (1999) Microarray Analysis of Drosophila Development during Metamorphosis. Science 286: 2179-2184.

    Appendix A. Short tutorial for MAExplorer

    This tutorial is for use with MAExplorer, an exploratory data analysis facility for microarray DNA databases. It may be used with any MAExplorer database. Like any tutorial, it is only a starting point - in this case for understanding the data mining analysis environment. Try out new options on your own; you can't break anything :-).

    This tutorial lets you

    1. Analyze expression of individual genes
    2. Analyze expression of gene families and clusters
    3. Compare expression patterns in multiple hybridized samples
    NOTE: THIS APPENDIX IS BEING REVISED AND EXPANDED...

    A.1 Demonstration data

    Note that the downloadable MAExplorer stand-alone application includes a subset of 50 hybridized samples from the MGAP database, including a number of startup files for that data (see the list of startup .mae files included in the download installation).

    There is also a pre-computed example of an Ordered Condition List using 4 conditions of replicates of C57B6 (pregnancy day 13, lactation days 1 and 10, and stat5a(-,-)), 15 samples. The database also includes 4 additional condition sets of this data and an Ordered Condition List of the 4 conditions (in the State/ directory). This may be used to demo the OCL F-test filter.

    If you have access to another MAExplorer database, you can use it instead since the tutorials are fairly generic.

    Using the stand-alone application for the tutorial

    These same subsets as well as other subsets of the MGAP data are available in the set of .mae startup files distributed with MAExplorer. To access these files,

    1. Start MAExplorer after you have installed it. E.g., in Windows, go to the Windows "Start Menu" and click on MAExplorer. If it is not in your Start Menu, you can go to where you installed it (typically C:\Program Files\MAExplorer) and click on MAExplorer.exe.
    2. Then after it starts, go to the "File" menu and select "Open disk DB" and select the startup file you want. Alternatively, you can go directly to the list of startup files (in C:\Program Files\MAExplorer\MAE) and double-click on one of the startup files.

    A.2 General instructions:

    Throughout this tutorial we refer to condition X and condition Y. These are different hybridized samples in the particular database you have loaded. For example, in the MGAP database X might be lactation and Y might be pregnancy. X and Y 'sets' are multiple samples of these two conditions.

    First, select one of the start up databases.

    If the particular samples you want to analyze are not listed in that example, after it starts you will be able to add samples you do want and remove samples you don't want - regardless of which example was initially used - provided the "Samples" database contains additional hybridized samples.

    When it starts, a main window will pop up. It then downloads the gene database tables and the particular hybridized samples you specified. When it is ready for you to begin interaction, the menu bar will become active and it will display a green "Ready - click on a gene to query database" message. Depending on your Internet connection speed, it may take a few minutes to set up. If you are running MAExplorer as a stand-alone application and it is getting data from your local disk, startup will be much faster.

    Second, go to section A.3 (the self-guided tutorial) below for instructions on what to do next.

    HINT: print this tutorial page and then read the following instructions from the printout rather than trying to keep this window visible. You might also print the parts of the MAExplorer Reference Manual for the same reason.

    HINT: You might want to keep a record of the commands you have used or the messages and measurements you have made. To do this you need to enable message and command history logging. Go to the View pull-down menu and then select the type of logging you want using the Show log of messages or the Show log of command history commands.

    NOTE: On computers with low screen resolution (i.e. less than 1024 X 780) you may need to resize the windows and move them to different parts of the screen to view them simultaneously.

    
    

    A.3 Self-guided tutorial of MAExplorer - notation and examples

    The following is a self-guided tutorial (you issue the commands) that illustrates some of the data analysis capabilities. In the following examples, the notation "go to A:B:C" means go to pull-down menu A, then submenu B, and then make selection C. "Selecting a gene" from the microarray image or scatter plot means clicking on a spot in the pseudoarray image or a point in any of the plots.

    
    

    A.3.1 Review of types of gene data available in the database

    step 1: go to Analysis: GeneClass: All genes
               the array shows all genes with white circles.
    step 2: go to Analysis: GeneClass: All named genes
               the array shows named genes with white circles.
    step 3: go to Analysis: GeneClass: ESTs similar to genes
               the array shows ESTs similar to named genes with white circles.
    step 4: go to Analysis: GeneClass: ESTs
               the array shows unknown ESTs with white circles.
    step 5: go to Analysis: GeneClass: All genes and ESTs
               the array shows all named genes and all ESTs with white circles.
    step 6: go to Analysis: GeneClass: Replicate genes
               the array shows replicate genes having at least 2 copies in the
               array with white circles.
    step 7: go to Analysis: GeneClass: Calibration DNA
               the array shows calibration DNA (if present) with white circles.
    step 8: go to Analysis: GeneClass: Your plates
               the array shows clones from user's plates (if present) with white circles.

    
    

    A.3.1.1 Analysis of the expression of a single known gene

    1. ratio between two conditions X and Y (HP-X, HP-Y)
    2. expression profile of a set of conditions (HP-E) (see Example A.3.1.7)
    step 1: click on the blue "Enter gene name" button to pop up a name entry window
    step 2: start typing gene name into blue text entry window
    step 3: once gene names appear, click on gene of choice
    step 4: press "Done" button in pop up window
               A yellow circle will mark the gene as the "current gene" in the microarray
               pseudoarray image (info on the gene is also provided in the status area above the array).
               If there are replicate grids in the array (HP), the left and right fields of
               repeated genes are denoted by F1 and F2. The mean(HP-X,HP-Y) values and the
               (HP-X/HP-Y) values for the specified gene are reported.
    step 5: alternatively, click on an array spot of choice to define any gene
               in the array as the new current gene

    
    

    A.3.1.2 Find a subset of genes with a common substring (e.g. *ONCO*)

    step 1: click on the blue "Enter gene name" button to pop up a name entry window
    step 2: start typing "*ONCO*" (without the quotes) into blue text entry window
    step 3: once gene names appear, press "Set E.G.L." button in pop up window
               Magenta squares will indicate these genes in the pseudoarray image.
               These include the 'onco'genes and the proto-'onco'genes
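
    The substring search above can be pictured as a wildcard match over the gene names. The following is a minimal sketch, not MAExplorer's own code: the class GeneNameMatcher, the example gene names, and the choice of case-insensitive matching are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of the MAExplorer API.
class GeneNameMatcher {

    /** Returns the names matching a pattern in which '*' stands for any run of characters. */
    static List<String> match(String pattern, List<String> geneNames) {
        // Quote the literal pieces and join them with ".*" in place of each '*'.
        String[] pieces = pattern.split("\\*", -1);
        StringBuilder regex = new StringBuilder();
        for (int i = 0; i < pieces.length; i++) {
            if (i > 0) {
                regex.append(".*");
            }
            regex.append(Pattern.quote(pieces[i]));
        }
        // Case-insensitive matching is an assumption of this sketch.
        Pattern p = Pattern.compile(regex.toString(), Pattern.CASE_INSENSITIVE);

        List<String> hits = new ArrayList<>();
        for (String name : geneNames) {
            if (p.matcher(name).matches()) {
                hits.add(name);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Made-up gene names, used only to exercise the matcher.
        List<String> names = Arrays.asList("Oncogene A", "proto-oncogene B", "actin");
        System.out.println(match("*ONCO*", names));
    }
}
```

    With these made-up names, the pattern *ONCO* keeps the first two entries and drops "actin", which mirrors how the E.G.L. would end up holding both the oncogenes and the proto-oncogenes.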
    
    

    A.3.1.3 Two conditions - scatter plots:

    Create a scatter plot of two hybridized samples where condition X data is on the X axis and condition Y data on the Y axis.

    step 1: go to Analysis: Plot: Scatter plots: HP-X vs. HP-Y.
               then click on yellow circle in scatter plot to get HP-X/HP-Y ratio for the gene
    step 2: click on any point in the scatter plot
               this also defines any gene in the plot as the new current gene
    step 3: zoom in on a region of the plot using the vertical or horizontal scroll bars
    step 4: click on another point in the scatter plot to get the HP-X/HP-Y ratio for another gene
    step 5: press "Close" button to remove pop up window

    
    

    A.3.1.4 Scatter plot of Cy3 vs Cy5 or replicate spots (F1 vs F2) of one sample

    Create a scatter plot of Cy3 vs Cy5 channels or replicate spot F1, F2 data if your database contains (Cy3,Cy5) ratio data or replicate spot fields (F1,F2).

    step 1: go to Analysis: Plot: Scatter plots: Cy3 vs. Cy5
               or go to Analysis: Plot: Scatter plots: F1 vs. F2
               Then, click on green circle in scatter plot to get Cy3/Cy5 ratio for the gene
               or F1/F2 ratio for replicate spots for that gene
    step 2: click on any point in the scatter plot
               this also defines any gene in the plot as the new current gene
    step 3: zoom in on a region of the plot using the vertical or horizontal scroll bars
    step 4: click on another point in the scatter plot to get the Cy3/Cy5 (or F1/F2) ratio for another gene

    If you are working with Cy3/Cy5 dye-swap data, you may swap the Cy3/Cy5 channel data to Cy5/Cy3 for any selected subset of samples. This may make it easier to use the data in various ways when data mining. If you do not have this type of data, go to step 7.

    step 5': go to Samples: Edit (Cy5/Cy3) else use (Cy3/Cy5) menu
    step 6': select the samples you wish to swap and press "Done". This
               enables you to see the swapped results in the scatter plot
    step 7: press "Close" button to remove pop up window

    
    

    A.3.1.5 Filter by expression ratio between two conditions X and Y

    step 1: go to Analysis: Plot: Histograms: HP-X/HP-Y
               the histogram shows the ratios
    step 2: move pop up plot so you can see it and the array simultaneously
    step 3: choose (click on) a ratio bin
               genes filtered by the ratio range of the bin will light up on the array ('+'s)
    step 4: click on different bin in the histogram to select another bin
    step 5: click on word "Freq" on left in histogram to remove the histogram bin filter

    Note of caution: if the signal is close to background, the X/Y ratio may be unreliable.
    You can filter out low intensity genes with the spot intensity range filter described in A.3.1.6.
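
    Clicking a histogram bin amounts to keeping the genes whose X/Y ratio falls inside that bin's range. The sketch below illustrates the idea under stated assumptions: the class RatioBinFilter, the half-open bin range [lo, hi), and the rule of skipping genes with a non-positive Y value are hypothetical choices, not MAExplorer's documented behavior.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of the MAExplorer API.
class RatioBinFilter {

    /**
     * Returns the indices of genes whose X/Y ratio lies in [lo, hi).
     * Genes with y <= 0 are skipped because their ratio is undefined or meaningless.
     */
    static List<Integer> genesInRatioBin(double[] x, double[] y, double lo, double hi) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < x.length; i++) {
            if (y[i] <= 0.0) {
                continue;
            }
            double ratio = x[i] / y[i];
            if (ratio >= lo && ratio < hi) {
                hits.add(i);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Made-up intensities for four genes under conditions X and Y.
        double[] hpX = {100.0, 400.0, 50.0, 10.0};
        double[] hpY = {100.0, 100.0, 100.0, 0.0};
        System.out.println(genesInRatioBin(hpX, hpY, 2.0, 8.0));
    }
}
```

    Selecting a different bin corresponds to calling the method again with another [lo, hi) range, and removing the bin filter corresponds to not applying it at all.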

    
    

    A.3.1.6 Filter by spot intensity range

    step 1: go to Analysis: Filter: Filter by spot intensity [SI1:SI2] sliders: Use spot intensity [SI1:SI2] sliders
    step 2: adjust intensity lower bound (SI1) to remove low intensity genes
    step 3: when done, remove the 'Filter by intensity sliders' by toggling it off (redo step 1 to toggle it off)
    step 4: repeat steps 1-3, but this time use Filter : Filter by [I1:I2] sliders :
               Use spot intensity (or Cy3/Cy5) [I1:I2] sliders
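
    The slider filter can be thought of as a range test on each spot's intensity. The following is a sketch only: the class SpotIntensityFilter is hypothetical, and requiring both the HP-X and HP-Y intensities to lie in [SI1:SI2] is an assumption of this example rather than a statement of how MAExplorer combines the samples.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of the MAExplorer API.
class SpotIntensityFilter {

    /**
     * Returns the indices of genes whose HP-X and HP-Y intensities both lie in [si1, si2].
     * Requiring both samples to pass is an assumption made for this sketch.
     */
    static List<Integer> filter(double[] hpX, double[] hpY, double si1, double si2) {
        List<Integer> passing = new ArrayList<>();
        for (int i = 0; i < hpX.length; i++) {
            boolean xOk = hpX[i] >= si1 && hpX[i] <= si2;
            boolean yOk = hpY[i] >= si1 && hpY[i] <= si2;
            if (xOk && yOk) {
                passing.add(i);
            }
        }
        return passing;
    }

    public static void main(String[] args) {
        // Made-up intensities; gene 0 is near background in HP-X.
        double[] hpX = {5.0, 200.0, 900.0};
        double[] hpY = {300.0, 250.0, 100.0};
        System.out.println(filter(hpX, hpY, 50.0, 1000.0));
    }
}
```

    Raising the lower bound si1 removes spots close to background, which are the ones whose ratios are least trustworthy.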
    
    

    A.3.1.7 Multiple conditions - expression profile plots of HP-E data:

    step 1: go to Analysis: Plot: Expression profile: Display a gene's expression profile
    step 2: after the expression profile window pops up, click on a gene in array to see its profile
    step 3: click on a line in the profile plot to see its intensity
    step 4: click on a different gene in the array to see its profile
    step 5: press "Show HPs" button to see the list of samples used
    step 6: press "Close" button to remove pop up windows


    
    

    A.3.2 Changing the normalization between hybridized samples

    You may change the normalization method used to scale data between hybridized samples so they may be compared.
    
    

    A.3.2.1 Set normalization

    step 1: go to Analysis: Normalization: Median intensity
    step 2: go to Analysis: Plot: Scatter plots: HP-X vs. HP-Y
               to see the effect of normalization on the scatter plot. Note how outliers appear.
    step 3: go to Analysis: Normalization: Zscore of intensity
    step 4: go to Analysis: Normalization: Zscore of log intensity, stdDev
    step 5: go to Analysis: Normalization: Unnormalized
               this does not scale data between samples.
    step 6: go to Analysis: Normalization: Median intensity
               this leaves the normalization method in Median mode.
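
    To make the difference between the normalization modes concrete, here is a minimal sketch of two of them applied to one sample's intensities. It assumes that median normalization divides each intensity by the sample median and that the Zscore uses the sample standard deviation; the manual section above does not spell out either formula, and the class NormalizationSketch is hypothetical.

```java
import java.util.Arrays;

// Hypothetical helper for illustration; not part of the MAExplorer API.
class NormalizationSketch {

    /** Median of the values (mean of the two middle values for even counts). */
    static double median(double[] v) {
        double[] s = v.clone();
        Arrays.sort(s);
        int n = s.length;
        return (n % 2 == 1) ? s[n / 2] : (s[n / 2 - 1] + s[n / 2]) / 2.0;
    }

    /** Assumed median normalization: each intensity divided by the sample median. */
    static double[] byMedian(double[] v) {
        double m = median(v);
        double[] out = new double[v.length];
        for (int i = 0; i < v.length; i++) {
            out[i] = v[i] / m;
        }
        return out;
    }

    /** Z-score: (value - mean) / sample standard deviation (n - 1 in the denominator). */
    static double[] zscore(double[] v) {
        double mean = 0.0;
        for (double x : v) {
            mean += x;
        }
        mean /= v.length;
        double ss = 0.0;
        for (double x : v) {
            ss += (x - mean) * (x - mean);
        }
        double sd = Math.sqrt(ss / (v.length - 1));
        double[] out = new double[v.length];
        for (int i = 0; i < v.length; i++) {
            out[i] = (v[i] - mean) / sd;
        }
        return out;
    }

    public static void main(String[] args) {
        double[] sample = {10.0, 20.0, 40.0};   // made-up intensities
        System.out.println(Arrays.toString(byMedian(sample)));
        System.out.println(Arrays.toString(zscore(sample)));
    }
}
```

    A Zscore of log intensity would apply the same zscore step to the logarithms of the intensities instead of the raw values.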


    
    

    A.3.3 Analysis of the expression profiles of gene classes

    You may restrict the set of genes by Gene Class. Several built in gene classes are defined. You may also set up additional ones and filter by those (not covered in this short tutorial).
    
    

    A.3.3.1 Filter by gene class membership

    step 1: go to Analysis: GeneClass: All known genes
               the array only shows named genes (additional gene subclasses are being added)
    step 2: go to Analysis: Plot: Scatter plots: HP-X vs. HP-Y
               to see the two condition expression of just these genes
    step 3: go to Analysis: Plot: Expression profiles: Display Filtered genes expression profiles
               to see the multiple condition expression of just these genes. This may take a
               while if there are many genes
    step 4: you can click on a line in any of the plots to see the samples' intensity value for that gene
    step 5: when done, press "Close" button in all pop up plot windows
    
    

    A.3.3.2 Gene Reports

    step 1: go to Analysis: Report: Gene reports: Filtered genes: Genes passing Filter
               Clicking on a blue entry will bring up I.M.A.G.E, dbEST, UniGene, or GenBank,
               LocusLink, or mAdb Clone database in pop up Web page
    step 2: press "Close" button in report, and close this pop up Web page
    step 3: go to Analysis: Report: Table format: Tab-delimited
               to enable creating Excel-compatible reports
    
    

    A.3.3.3 Exporting Gene Reports to Excel

    step 1: repeat step 1 of the Gene Report, but this time to make a text-formatted report
    step 2: cut the text from this window and paste it into an Excel window.
               This is useful for exporting data if you are on a Windows PC
    step 3: go to Analysis: Filter: all genes
               to restore it to all of the genes from all named genes
    step 4: go to Analysis: Report: Table format: Spreadsheet
    step 5: press "Close" button in report
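
    A tab-delimited report is simply one line per gene with the fields separated by tab characters, which is why it pastes cleanly into spreadsheet columns. The sketch below shows the layout; the class TabDelimitedReport and the column names are hypothetical and do not reproduce the actual Gene Report fields.

```java
// Hypothetical helper for illustration; not part of the MAExplorer API.
class TabDelimitedReport {

    /** Builds a tab-delimited table: one header line, then one line per row. */
    static String toTabDelimited(String[] header, String[][] rows) {
        StringBuilder sb = new StringBuilder();
        sb.append(String.join("\t", header)).append('\n');
        for (String[] row : rows) {
            sb.append(String.join("\t", row)).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Made-up column names and values, used only to show the layout.
        String[] header = {"Gene", "HP-X", "HP-Y"};
        String[][] rows = {{"gene1", "1.0", "2.0"}, {"gene2", "3.5", "0.7"}};
        System.out.print(toTabDelimited(header, rows));
    }
}
```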


    
    

    A.3.4 Analysis of the expression profile of multiple hand picked genes

    Users can manually define a set of genes which are kept in the Edited Gene List (E.G.L.). Various operations can then use the EGL to restrict the set of data being analyzed.

    
    

    A.3.4.1 Define a list of edited genes, then plot all their expression profiles at one time

    step 1: go to View: Show 'Edited Gene List'
               this turns on the 'Edited Gene List' magenta square box overlays
    step 2: hold CONTROL key and click on genes in array to add a gene
    step 2': hold SHIFT key and click on genes in array to delete a gene.
               This lets you edit a list of genes. It also works when clicking in a scatter plot
    step 3: go to Analysis: Plot: Scatter plots: HP-X vs. HP-Y
               to see the Edited Gene List in the scatter plot
    step 4: try defining (or removing) E.G.L. genes in the scatter plot by holding the
               CONTROL (or SHIFT) key when clicking on points in the scatter plot
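
    The CONTROL-click and SHIFT-click editing described above behaves like adding to and removing from a set. A minimal sketch follows, assuming a plain click leaves the list unchanged; the class EditedGeneList, its Modifier enum, and the gene names are hypothetical.

```java
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical model of the Edited Gene List; not part of the MAExplorer API.
class EditedGeneList {

    enum Modifier { CONTROL, SHIFT, NONE }

    // LinkedHashSet keeps insertion order and ignores duplicate additions.
    private final Set<String> genes = new LinkedHashSet<>();

    /** CONTROL-click adds the gene, SHIFT-click removes it, a plain click changes nothing. */
    void click(String gene, Modifier modifier) {
        if (modifier == Modifier.CONTROL) {
            genes.add(gene);
        } else if (modifier == Modifier.SHIFT) {
            genes.remove(gene);
        }
    }

    /** Read-only view of the current list. */
    Set<String> genes() {
        return Collections.unmodifiableSet(genes);
    }

    public static void main(String[] args) {
        EditedGeneList egl = new EditedGeneList();
        egl.click("geneA", Modifier.CONTROL);
        egl.click("geneB", Modifier.CONTROL);
        egl.click("geneA", Modifier.SHIFT);
        System.out.println(egl.genes());
    }
}
```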
    
    

    A.3.4.2 Filtering by edited gene list

    step 1: go to Analysis: Filter: Filter by 'Edited Gene List'
    step 2: go to Analysis: Plot: Expression profiles: Display Filtered genes expression profiles
               scroll through the plots to see all of the profiles
    step 3: go to Analysis: Filter: Filter by 'edited gene list'
               this turns off the 'edited gene list' filter
    step 4: press "Close" button in expression profiles window
    
    

    A.3.4.3 Report of edited gene list

    step 1: go to Analysis: Report: Gene report: genes in 'edited gene list'
               reports edited genes
    step 2: press "Close" button in report
    step 3: go to Analysis: Filter: Filter by 'edited gene list'
               this turns off the 'edited gene list' filter
    step 4: go to View: Show 'edited gene list'
               this turns off the 'edited gene list' squares overlay


    
    

    A.3.5 Identify a cluster of genes with similar expression profile to the current selected gene

    step 1: go to GeneClasses: All named genes and ESTs
    step 2: go to Analysis: Plot: Cluster plots: Cluster genes with expression profiles similar to current gene
               this will pop up a cluster summary and cluster distance slider control window.
               Move the summary and slider windows so you can see all 3 windows. The size of
               the cyan boxes on similar genes in the pseudoarray is proportional to the similarity.
               Adjust the cluster distance slider to smaller values and note how the number of genes clustered decreases.
               It should be set for a reasonable number considering the material you are analyzing.
    step 3: select (click on) a new current gene
               the genes which belong to that cluster are labeled in the array with cyan boxes
               and are defined as the "current cluster". The current gene you click on has
               a green circle around it
    step 4: press "Cluster Report" button in the cluster summary
               this pops up a Gene Report for the clustered genes
    step 5: press "Close" button in the report
    step 6: press "EP plot" button in the cluster summary
               this pops up a scrollable list of expression profile plots sorted by similarity
               to the current selected gene.
    step 7: press "Close" button in the report
    step 8: press "Close" button in the cluster summary
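
    The effect of the cluster distance slider can be sketched as a threshold on the distance between each gene's expression profile and the current gene's profile. The example below is an illustration under assumptions: it uses Euclidean distance, which the tutorial does not specify, and the class SimilarProfileCluster and the profile values are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of the MAExplorer API.
class SimilarProfileCluster {

    /** Euclidean distance between two expression profiles of equal length. */
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /** Indices of genes whose profile lies within maxDist of the current gene (itself included). */
    static List<Integer> similarTo(double[][] profiles, int currentGene, double maxDist) {
        List<Integer> cluster = new ArrayList<>();
        for (int g = 0; g < profiles.length; g++) {
            if (distance(profiles[g], profiles[currentGene]) <= maxDist) {
                cluster.add(g);
            }
        }
        return cluster;
    }

    public static void main(String[] args) {
        // Made-up profiles: three genes measured in three samples.
        double[][] profiles = {{1, 2, 3}, {1, 2, 4}, {10, 10, 10}};
        System.out.println(similarTo(profiles, 0, 1.5));
        System.out.println(similarTo(profiles, 0, 0.5));
    }
}
```

    Lowering maxDist plays the role of moving the slider to smaller values: fewer genes remain in the current cluster.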


    
    

    A.3.6 Identify clusters of genes with similar expression under various conditions using data mining filters

    step 1: go to GeneClasses: ESTs similar to genes
    step 2: go to Analysis: Plot: Cluster plots: K-means clustering of gene expression profiles
               this will pop up a cluster summary and slider control window. Move the summary
               and slider windows so you can see all 3 windows. The size of the
               magenta circles in the array is proportional to # genes/cluster
    step 3: select (click on) a new current gene
               the genes which belong to that cluster are labeled in the array with tiny green
               numbers and are defined as the "current cluster". The current gene you click
               on has a green circle around it
    step 4: go to View: Show 'edited gene list'
               genes in the current cluster were also copied to the edited gene list
    step 5: go to Analysis: Report: Gene report: genes in 'edited gene list'
               reports genes in the current cluster
    step 6: press "Close" button in report
    step 7: go to View: Show 'edited gene list'
               this turns off the 'edited gene list' squares overlay
    
    

    A.3.6.1 Varying the number of clusters

    step 1: vary the "# of clusters" slider value from 6 to 10, then 20
               note the number of clusters changes and the gene cluster composition also changes
    
    

    A.3.6.2 Defining a new cluster "seed" to recluster the genes

    step 1: select a new current gene in array and press the "Recompute clusters" button
               this recomputes the clusters using the current gene as the new seed gene
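
    For readers unfamiliar with K-means, the following is a compact generic sketch of clustering expression profiles starting from a chosen seed gene. It is not MAExplorer's implementation: the class KMeansSketch is hypothetical, and the seeding rule (first seed is the current gene, each further seed is the gene farthest from the seeds chosen so far), the squared Euclidean distance, and the iteration limit are assumptions of this example.

```java
import java.util.Arrays;

// Generic K-means sketch for illustration; not MAExplorer's implementation.
class KMeansSketch {

    /** Squared Euclidean distance between two profiles. */
    static double dist2(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += (a[i] - b[i]) * (a[i] - b[i]);
        }
        return sum;
    }

    /** Returns a cluster index for each gene. Assumes 1 <= k <= number of genes. */
    static int[] cluster(double[][] p, int k, int seedGene, int maxIter) {
        int n = p.length;
        int d = p[0].length;
        double[][] centers = new double[k][];
        centers[0] = p[seedGene].clone();
        // Assumed seeding: each further seed is the gene farthest from all seeds so far.
        for (int c = 1; c < k; c++) {
            int best = 0;
            double bestD = -1.0;
            for (int g = 0; g < n; g++) {
                double dmin = Double.MAX_VALUE;
                for (int j = 0; j < c; j++) {
                    dmin = Math.min(dmin, dist2(p[g], centers[j]));
                }
                if (dmin > bestD) {
                    bestD = dmin;
                    best = g;
                }
            }
            centers[c] = p[best].clone();
        }
        int[] label = new int[n];
        for (int iter = 0; iter < maxIter; iter++) {
            boolean changed = (iter == 0);
            // Assignment step: each gene joins its nearest center.
            for (int g = 0; g < n; g++) {
                int best = 0;
                for (int c = 1; c < k; c++) {
                    if (dist2(p[g], centers[c]) < dist2(p[g], centers[best])) {
                        best = c;
                    }
                }
                if (label[g] != best) {
                    changed = true;
                }
                label[g] = best;
            }
            if (!changed) {
                break;
            }
            // Update step: each center becomes the mean profile of its members.
            for (int c = 0; c < k; c++) {
                double[] sum = new double[d];
                int count = 0;
                for (int g = 0; g < n; g++) {
                    if (label[g] == c) {
                        for (int j = 0; j < d; j++) {
                            sum[j] += p[g][j];
                        }
                        count++;
                    }
                }
                if (count > 0) {
                    for (int j = 0; j < d; j++) {
                        sum[j] /= count;
                    }
                    centers[c] = sum;
                }
            }
        }
        return label;
    }

    public static void main(String[] args) {
        double[][] profiles = {{0, 0}, {0, 1}, {10, 10}, {10, 11}};   // made-up data
        System.out.println(Arrays.toString(cluster(profiles, 2, 0, 20)));
    }
}
```

    Changing seedGene or k in this sketch changes the starting centers, which is why picking a new seed gene or moving the "# of clusters" slider can change the cluster composition.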

    
    

    A.3.6.3 Cluster expression profile plots

    step 1: press "EP plot" button and scroll down the list of plots after it appears
               the primary nodes for each cluster are indicated with red labels in the set of
               profiles, and the other genes are labeled with their cluster number
    step 2: press "Mean EP plot" button and scroll down the list of plots after it appears
               these are the mean expression profile plots of the clusters.
    
    

    A.3.6.4 Report of all clusters

    step 1: press the "Cluster-Report" button to get a sorted cluster list
               scroll the spreadsheet to the right to see the cluster statistics
    step 2: press the "Mn-Cluster-Report" button to get a sorted cluster list
               scroll the spreadsheet to the right to see the mean expression profiles
    step 3: press "Close" button in pop up windows
    
    

    A.3.6.5 Current cluster in scatter plot

    step 1: go to Analysis: Plot: Scatter plots: HP-X vs. HP-Y
    step 2: move the plot so you can see both scatter plot and array
    step 3: click on a gene in the cluster or on spots in the scatter plot
               note that the green cluster numbers are drawn in the scatter plot
    step 4: go to Edit: Sets of genes : Save 'Edited gene list' as gene sets
               this will pop up a dialog box requesting "Enter new gene set name"
    step 5: type "Genes in current cluster class"
               this will save the current cluster in a gene set. This gene set will
               be used in the next example
    step 6: press "Close" button in pop up windows
    step 7: (optionally) investigate hierarchical clustering with clustergrams and
               dendrograms by going to Plot : Cluster Plots : Hierarchical clustering plot for HP-E


    
    

    A.3.7 User Gene Set operations

    You may manipulate sets of genes. Some of these are predefined for you by the database (eg. All named genes, ESTs, etc.). Others are defined by particular operations (E.G.L., clustering, etc.), and lastly others may be defined by you using logical operations on these sets (OR, AND, DIFFERENCE).
    
    

    A.3.7.1 List of the current gene sets

    step 1: go to Edit: Sets of genes : List saved gene sets
               this lists the current list of gene sets
    step 2: change the E.G.L. set of genes and note how the # of E.G.L. genes changes in the list.
               You can add (remove) genes to the E.G.L. by clicking on a spot in the array while the
               CONTROL (SHIFT) key is held down.
    
    

    A.3.7.2 Filter by user defined gene set

    step 1: go to Edit: Sets of genes : Set 'User Filter Gene Set' (for Filter)
               this will request a gene set to use with the Filter in a pop up dialog box.
               Enter gene set # for the set for "Genes in current cluster class" which you saved
               in the previous example.
               then press "Ok" in the dialog box.
    step 2: go to Analysis: GeneClass: All genes and ESTs
               this resets the filter to look at all genes and ESTs
    step 3: go to Analysis: Filter: Filter by 'User Gene Set' membership
               this restricts the genes to the saved current cluster in the previous example

    
    

    A.3.7.3 Gene set operations

    step 1: go to Edit: Sets of genes : OR (Union) of 2 gene sets
               this will request 3 gene set names in a pop up dialog box.
               Enter set # for (All known genes) for the 1st gene set name,
               Enter set # for (Genes in current cluster class) for the 2nd gene set name,
               Enter "Union of known genes and genes in current cluster" for new gene set name.
               then press "Ok" in the dialog box.
               this computes the union of the two gene sets into a new gene set
    step 2: go to Edit: Sets of genes : Set 'User Filter Gene Set'
               this will reset the 'User Filter Gene Set' for the Filter in a pop up dialog box.
               Enter the set number or the beginning of the set name 'Union' that is the
               set for "Union of known genes and genes in current cluster" just saved.
    step 3: try saving other Filtered gene sets and doing other gene set operations.
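
    The three logical operations on gene sets (OR, AND, DIFFERENCE) correspond to standard set
    operations. The sketch below shows them with the Java collections API. The class GeneSetOps
    and the gene names are hypothetical, and gene sets are modeled simply as sets of names; this
    does not reflect how MAExplorer stores gene sets internally.

```java
import java.util.Set;
import java.util.TreeSet;

/** Illustrative sketch only; gene sets are modeled as sorted sets of gene names. */
class GeneSetOps {

    /** OR (union): genes that are in either set. */
    static Set<String> union(Set<String> a, Set<String> b) {
        Set<String> result = new TreeSet<>(a);
        result.addAll(b);
        return result;
    }

    /** AND (intersection): genes that are in both sets. */
    static Set<String> intersection(Set<String> a, Set<String> b) {
        Set<String> result = new TreeSet<>(a);
        result.retainAll(b);
        return result;
    }

    /** DIFFERENCE: genes that are in the first set but not in the second. */
    static Set<String> difference(Set<String> a, Set<String> b) {
        Set<String> result = new TreeSet<>(a);
        result.removeAll(b);
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical contents of two saved gene sets.
        Set<String> knownGenes = new TreeSet<>(Set.of("geneA", "geneB", "geneC"));
        Set<String> currentCluster = new TreeSet<>(Set.of("geneC", "est1"));

        System.out.println("OR:         " + union(knownGenes, currentCluster));
        System.out.println("AND:        " + intersection(knownGenes, currentCluster));
        System.out.println("DIFFERENCE: " + difference(knownGenes, currentCluster));
    }
}
```

    Each method copies its first argument before modifying it, so the two input sets are left
    unchanged, in the same way that a gene set operation produces a new named gene set.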


    
    

    A.4 Additional tutorials

    If you wish to investigate MAExplorer in more detail, try some of the suggested examples in the
    advanced tutorial (Appendix B) in the reference manual.
    
    

    
    
    
    


    
    

    Appendix B. Advanced tutorial for MAExplorer

    There are a number of things you may do in this facility. We wrote this advanced tutorial to help demonstrate some of its capabilities. A short tutorial (Appendix A) is also available and we recommend doing it before attempting the advanced tutorial. Sources of startup data to use with the tutorials are listed in the short tutorial. As with all tutorials, they are only starting points for getting you into the analysis environment - try out new options on your own, you can't break anything :-).
    1. Analyze expression of individual genes
    2. Analyze expression of gene families and clusters
    3. Compare expression patterns in multiple hybridized samples
    NOTE: THIS APPENDIX IS BEING REVISED...

    Here are some things to try

    1. You could click on the Reference manual in the Help menu. That is this reference manual, but it appears in a new Netscape Web browser window so you may read it while working on MAExplorer. Delete the window when you are finished with it.

    2. Click on the Glossary in the MAExplorer Help menu. This section describes terms used in MAExplorer. Delete the window when you are finished with it.

    3. Click on the Index. It may be useful in finding things that are not in the table of contents.
    4. Note: a hybridized sample is a microarray hybridized with cDNA derived from the mRNA from a particular experiment sample (see notation defined in the Overview). Currently, you may start MAExplorer with a preloaded set of samples by starting the stand-alone MAExplorer on a .mae file. Alternatively, you may start it with no samples loaded. In the latter case, you would load the samples you are interested in from the Samples menu.

    When first started, it loads some initial data it needs as well as the particular hybridized samples you specified. After MAExplorer starts, it displays "Ready - click on a gene to query database" and the menus become active. Here are some things to try.
    1. Click on different genes (i.e. spots in the pseudoarray image). The pseudoarray image may or may not correspond to the actual array - depending on how the data was derived. Notice the data that gets displayed in the three text lines above the image. If the spot you click on is a named gene (e.g. [1-A4,3] at row 4 column 3 in Field 1), it will also print out the GeneName of the gene.

    2. Look at the pull-down menus. They consist of sets of commands with similar functionality grouped together in sub-menus. In particular, look at the Analysis menu. It contains an ordered list of submenus that may be thought of as the sequence in which one might perform an analysis. In reality, an analysis is more complicated and involves iterating various steps (see Figure 3.1).

    3. Go to the "Samples" menu. This menu allows selecting hybridized samples for the HP-X, HP-Y, HP-X and HP-Y 'sets', and the HP-E list. You have several ways to do this. The easiest way is to use the "Choose HP-X, HP-Y and HP-E" sample selection wizard. The other alternative is to use one of the set of cascading menus that may be used to change the selected hybridized samples. Note that the HP-X and HP-Y submenus assign samples for subsequent analysis such as X-Y scatter plots. The HP-E submenu sets the list of samples for expression profiles. Note how you may assign a sample either by going through the cascading menus or from an alphabetic list of all hybridized samples in the pseudoarray image. To change the default HP-X (HP-Y) sample, click on the purple "[X]" ("[Y]") box on the left side of the image (above the list of samples) so it is selected. Then click on the desired purple "*" adjacent to the samples listed on the left edge of the pseudoarray image. You may switch between using single and multiple samples (i.e. 'sets') with HP-X and HP-Y. Note the "HP-X:" and "HP-Y:" labels at top left when you switch between single and multiple samples. Go to the HP-X/-Y 'set' and HP-E submenus and list the contents of the respective sets to see what they contain. Note how one may add or remove samples from these sets.

    4. Go to the "Samples" menu. Then select "Choose named condition list of samples". This lets you define new condition lists of samples. Go to the "Edit" menu then "Sets of Conditions (samples)" to see additional ways to manipulate these condition sets. For example, you might define a new condition and then assign it to the working HP-X 'set'.

    5. Go to the "Samples" menu. Then select "Choose ordered list of conditions". This lets you define a new or edit an existing Ordered Condition List (OCL). In the included MGAP database, there is a pre-computed example of an Ordered Condition List using 4 conditions of replicates of C57B6 (pregnancy day 13, lactation days 1 and 10, and stat5a(-,-) 15 samples). The database also includes 4 additional condition sets of this data and an Ordered Condition List of the 4 conditions (in the State/ directory). This may be used to demo the OCL F-test filter. Go to the "Filter" menu and select "Filter by current Ordered Condition List (OCL) F-test [p-Value]". This will pop up a p-value slider where you can adjust the threshold for selecting genes passing the F-test.
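
    For readers unfamiliar with the F-test, the sketch below computes the textbook one-way
    ANOVA F statistic for a single gene across several conditions, each with replicate samples.
    It is a generic illustration and is not taken from MAExplorer's source: the class name
    OclFTest and the sample values are hypothetical. Converting F into the p-value used by the
    slider requires the F distribution, which the Java standard library does not provide, so
    that step is left out.

```java
/** Illustrative sketch only: textbook one-way ANOVA F statistic for one gene. */
class OclFTest {

    /**
     * groups[c] holds the gene's expression values in the replicate samples
     * of condition c. Returns (between-condition mean square) divided by
     * (within-condition mean square).
     */
    static double fStatistic(double[][] groups) {
        int k = groups.length;
        int total = 0;
        double grandSum = 0.0;
        for (double[] g : groups) {
            if (g.length == 0) throw new IllegalArgumentException("empty condition");
            for (double v : g) {
                grandSum += v;
                total++;
            }
        }
        if (k < 2 || total <= k) {
            throw new IllegalArgumentException("need at least 2 conditions and some replicates");
        }
        double grandMean = grandSum / total;

        double ssBetween = 0.0; // variation of condition means around the grand mean
        double ssWithin = 0.0;  // variation of replicates around their condition mean
        for (double[] g : groups) {
            double sum = 0.0;
            for (double v : g) sum += v;
            double mean = sum / g.length;
            ssBetween += g.length * (mean - grandMean) * (mean - grandMean);
            for (double v : g) ssWithin += (v - mean) * (v - mean);
        }
        double msBetween = ssBetween / (k - 1);     // k - 1 degrees of freedom
        double msWithin = ssWithin / (total - k);   // N - k degrees of freedom
        return msBetween / msWithin;
    }

    public static void main(String[] args) {
        // Hypothetical values for one gene: 2 conditions, 3 replicates each.
        double[][] gene = { {1, 2, 3}, {4, 5, 6} };
        System.out.println("F = " + fStatistic(gene));
    }
}
```

    A large F means the condition means differ by more than the replicate scatter would
    suggest, which corresponds to a small p-value on the filter slider.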

    6. Go to the "GeneClass" submenu in the "Analysis" menu. This is a set of cascading menus that may be used to change the default Gene Class. Different genes belong to different gene classes and this is a way of sub-setting the data. You may currently set it to All Genes, All Named Genes, ESTs similar to genes, ESTs, Good genes, All named genes and ESTs, Replicate genes (multiple copies on the array), Calibration DNA. Select "ESTs similar to genes". Notice that the display now only shows the red (white) circles on the ESTs similar to genes in the intensity (ratio) pseudoarray image. Look at the circle overlays on the microarray image - note how they changed. Go back and set the gene class to "All genes" and then "Calibration DNA". The "Calibration DNA" is the set of spots that may be used for normalizing data between microarrays. Check out the "Set gene class subset" submenu. Leave the gene class set to All Genes.

    7. If you are connected to the Internet, go to the "View" Menu. Then turn on the switch "Enable display current gene in popup XXXX Web Browser". Depending on your database, XXXX may be GenBank, LocusID, UniGene, dbEST or mAdb Clone DB. Then click on a gene in the image. It pops up a Web browser showing the genomic database Web page for that gene (if any). If you click on a different spot, it will reuse the popup Web browser with new data. If you don't want it to be active, go back to the "View Menu" and click on "Enable display current gene ..." again to disable it.

    8. Go to the "Normalization" submenu in the "Analysis" menu. This is used for normalizing data between microarrays so they may be compared. The default is to scale the data to the median value for each array. Investigate the other normalization methods. If your quantified data contains spot background data, you may also enable background correction that subtracts a microarray specific overall background intensity value from each intensity value in each array.
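
    As a rough illustration of median scaling, the sketch below divides every intensity in an
    array by that array's median, so that two arrays with different overall brightness become
    comparable. The class name MedianNormalization and the intensity values are hypothetical,
    and MAExplorer's exact scaling (for example, any constant scale factor it applies) may
    differ from this.

```java
import java.util.Arrays;

/** Illustrative sketch only; not MAExplorer's normalization code. */
class MedianNormalization {

    /** Median of the values; the input array is not modified. */
    static double median(double[] values) {
        if (values.length == 0) throw new IllegalArgumentException("no values");
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        int n = sorted.length;
        return (n % 2 == 1) ? sorted[n / 2] : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;
    }

    /** Divides each spot intensity by the median intensity of its own array. */
    static double[] scaleToMedian(double[] intensities) {
        double m = median(intensities);
        if (m == 0.0) throw new IllegalArgumentException("median is zero");
        double[] scaled = new double[intensities.length];
        for (int i = 0; i < intensities.length; i++) {
            scaled[i] = intensities[i] / m;
        }
        return scaled;
    }

    public static void main(String[] args) {
        // Hypothetical arrays: the second is uniformly twice as bright as the first.
        double[] arrayA = {100, 200, 300};
        double[] arrayB = {200, 400, 600};
        System.out.println(Arrays.toString(scaleToMedian(arrayA)));
        System.out.println(Arrays.toString(scaleToMedian(arrayB)));
    }
}
```

    After scaling, both hypothetical arrays yield the same values, which is the point of
    normalizing before comparing samples.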

    9. Go to the "Report" submenu in the "Analysis" menu. You may generate a report in one of two formats, "Spreadsheet" or "tab-delimited", the latter being cut-and-paste compatible with Excel. In the "Samples Report" sub-submenu in the "Report" submenu, click on "Hybridized Samples", then switch to the other Report format and do it again. Click on "Hybridized Sample Web links". In the "Gene Reports" submenu, try "All named genes". Then try the "Highest HP-X/HP-Y ratio" in the "Filtered gene reports" submenu. If you are using a list of HP-E or HP-X/-Y 'sets' of samples, you might try looking at the expression profile ratios or statistical data respectively through other Reports.

    10. Review the data "Filter" submenu in the "Analysis" menu. Select the "Filter by expression sliders" [I1:I2]. Expression is intensity in the case of 33P or biotin labeled samples, or the (Cy3/Cy5) ratio in the case of ratio data. This pops up a state sliders window. Set the I1 scroll bar (lower limit of the gray value intensity) to about 100. Look at the genes that were eliminated because they are out of the range. If you select the "Filter by spot intensity sliders" [SI1:SI2], it will filter genes by spot intensity in either F1 or F2 duplicate (if you have duplicate spots), or the Cy3 or Cy5 spots. You can select which sets of HPs to use (current HP, HP-X and HP-Y, HP-XY sets, or HP-E). You might disable the [I1:I2] filter while you experiment with the [SI1:SI2] filter. If you have (F1,F2) or (Cy3,Cy5) data, put up the F1 vs F2 or Cy3 vs Cy5 scatter plot and you can visualize how the thresholding works.

      In the Filter menu, add the "Filter by ratio or Zdiff sliders". The [R1:R2] ratio range sliders are then added to the state slider window and may be used for filtering genes. If the normalization method is one of the Zscore methods, it filters by the difference of the Zscores and the [Z1:Z2] range is used; otherwise it filters by the ratio. Note that the genes that pass the filter will appear with a red (white) circle in the pseudoarray intensity (ratio) grayscale (pseudocolor) image, or a red "+" in the scatter plots, so you might try moving the controls while in those plot modes. Try some of the other filters. The spot CV test removes genes where replicate spot values (F1 and F2 in the case of a single sample, or replicate samples in the case of the HP-X and HP-Y 'sets' or the HP-E list) are not well correlated. The t-Test filter may be used with sets of X and Y samples to find genes with a p-value less than the specified threshold.
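
    The slider filters above amount to simple range and variability tests on each gene. The
    sketch below shows two of them: a range test like [I1:I2] and a coefficient of variation
    (CV) test on replicate values. The class SpotFilters, the use of the sample standard
    deviation divided by the mean as the CV, and the numeric thresholds are assumptions for
    illustration; MAExplorer may define its spot CV differently.

```java
/** Illustrative sketch only; not MAExplorer's filter code. */
class SpotFilters {

    /** Range test, as with the [I1:I2] sliders: keep values with lo <= value <= hi. */
    static boolean inRange(double value, double lo, double hi) {
        return value >= lo && value <= hi;
    }

    /** Mean of the replicate values. */
    static double mean(double[] replicates) {
        double sum = 0.0;
        for (double v : replicates) sum += v;
        return sum / replicates.length;
    }

    /** CV (assumed definition): sample standard deviation divided by the mean. */
    static double coefficientOfVariation(double[] replicates) {
        int n = replicates.length;
        if (n < 2) throw new IllegalArgumentException("need at least 2 replicates");
        double m = mean(replicates);
        if (m == 0.0) return Double.POSITIVE_INFINITY; // undefined; treat as failing
        double ss = 0.0;
        for (double v : replicates) ss += (v - m) * (v - m);
        return Math.sqrt(ss / (n - 1)) / m;
    }

    /** A gene passes if its mean is inside [i1, i2] and its replicates agree well enough. */
    static boolean passes(double[] replicates, double i1, double i2, double maxCv) {
        return inRange(mean(replicates), i1, i2)
                && coefficientOfVariation(replicates) <= maxCv;
    }

    public static void main(String[] args) {
        // Hypothetical duplicate-spot intensities (F1, F2) for two genes.
        double[] consistent = {480, 520};
        double[] noisy = {100, 900};
        System.out.println(passes(consistent, 100, 1000, 0.25));
        System.out.println(passes(noisy, 100, 1000, 0.25));
    }
}
```

    Combining the tests with a logical AND mirrors how enabling several Filters at once
    narrows the set of genes that remain marked in the pseudoarray image.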

    11. Go to the "Plot" submenu in the "Analysis" menu. Then go to the "Show Microarray" submenu, try the pseudocolor ratio (or Zscore) modes and finally leave it in the pseudograyscale image intensity mode. Try using the "Scatter Plots" and "Histograms" in the Plot menu. Vary the normalization methods and see how they affect the array image and scatter plots. When you are in a scatter plot, click on a point. It will display data similar to when clicking on an image. If UniGene data is available in your database, go to the View menu and set "Enable display current gene in popup UniGene Web browser" and click on a point again. This pops up another Web browser window and looks up that gene in the UniGene database. Change the threshold slider and notice how points appear and disappear. Click on a bin in the ratio histogram; it will filter the genes so that only the genes that have ratios in that bin are displayed.

    12. Go to the "Plot" submenu in the "Analysis" Menu and then to the "Expression profile" submenu. Then select the "Display gene expr. profile for HP-E". It pops up a window with two buttons "Show HP names" and "Close". Then click on a gene in the image. It will draw the expression profile for that gene. Move the mouse so it is over one of the vertical bars in the plot to get the data for that particular HP. If you click on a different spot in the image it will display the new expression profile in the same window. Click on the "Show HP names" button to pop up a window with the list of HPs matching the numbers in the expression profile plot. Now create a scatter plot of two HPs and then click on a red "+" in the plot. It will update the expression profile plot with this gene. If you want to compare several expression profile plots, repeatedly create the "Display gene expression..." windows from the View menu. Move the windows close to each other so they are easier to compare. Only the last one you created allows you to change the gene. If you don't want a popup window anymore, click on its "Close" button. You may save the scatter plot as a GIF file by pressing the "SaveAs" button which will save it in the Report subdirectory associated with the startup data.

    13. Next we look at gene clustering. Go to the "Plot" submenu in the "Analysis" menu, and then to the "Cluster plots" submenu, and try "Cluster genes with expression profiles similar to current gene". This pops up a slider with the cluster threshold. Then click on a gene in the image. It will then pop up a text window with the genes whose cluster distance is less than the cluster threshold. It is sorted by minimum cluster distance. Notice that some genes in the image have different size blue boxes around them. The larger the box, the smaller the cluster distance and the more similar the gene. Move the cluster threshold slider. This will change the clustering as seen in both the image and in the cluster popup window. Click on the "Report" button in the text window. This will pop up a report on these genes sorted by minimum cluster distance. Then click on the "EP plot" button. This pops up a scrollable list of expression profile plots for the genes that passed this test so you can review the actual expression profiles. Close this window and set the Filter to a small number of genes such as with "ESTs similar to genes". Note that the genes passing this filter are saved in the E.G.L. and may be saved as a gene subset or part of the data Filter.

      Turn on one or more Filters to reduce the number of genes to, say, under 100 (e.g. t-test or spot CV filters). Then press the "Go 'Cluster all genes'" button in the cluster window. This is equivalent to invoking the "Cluster counts of Filtered genes by expression profiles" command from the "Cluster plots" submenu. Notice that the Filtered genes have blue circles of different sizes. The larger the circle, the more genes there are that are similar to that gene. Move the cluster threshold slider and note that as the number of similar genes changes, the size of the blue circles changes. As with the other cluster mode, you may generate a report of sorted cluster counts. Click on a gene with the largest green circle. This will then switch you back to single gene clustering mode where you can investigate that gene in more detail.
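
    The two clustering modes in this exercise can be sketched as follows: find the genes whose
    expression profile lies within a threshold distance of the current gene, sorted by
    distance, and count such neighbors for every gene. The class SimilarGenes, the use of
    Euclidean distance as the "cluster distance", and the profile values are assumptions for
    illustration; MAExplorer's actual distance measure may differ.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Illustrative sketch only; not MAExplorer's clustering code. */
class SimilarGenes {

    /** Euclidean distance between two expression profiles (assumed cluster distance). */
    static double distance(double[] p, double[] q) {
        double s = 0.0;
        for (int j = 0; j < p.length; j++) {
            double d = p[j] - q[j];
            s += d * d;
        }
        return Math.sqrt(s);
    }

    /** Indices of genes closer than threshold to the current gene, nearest first. */
    static List<Integer> similarTo(double[][] profiles, int current, double threshold) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < profiles.length; i++) {
            if (i != current && distance(profiles[i], profiles[current]) < threshold) {
                hits.add(i);
            }
        }
        hits.sort(Comparator.comparingDouble(i -> distance(profiles[i], profiles[current])));
        return hits;
    }

    /** For every gene, the number of other genes within the threshold distance. */
    static int[] clusterCounts(double[][] profiles, double threshold) {
        int[] counts = new int[profiles.length];
        for (int i = 0; i < profiles.length; i++) {
            counts[i] = similarTo(profiles, i, threshold).size();
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical two-sample expression profiles for four genes.
        double[][] profiles = { {0, 0}, {3, 4}, {1, 0}, {10, 10} };
        System.out.println(similarTo(profiles, 0, 6.0));
        System.out.println(java.util.Arrays.toString(clusterCounts(profiles, 6.0)));
    }
}
```

    Lowering the threshold shortens the list returned by similarTo() and reduces the counts,
    which is the effect seen when moving the cluster threshold slider.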

    14. Next we look at K-means gene clustering. Go to the "Plot" menu and in the "Cluster plots" submenu, try 'Display K-means gene expression profiles for Filtered HP-E' (similar to K-means clustering). This pops up a scroller for "N-clusters", the number of clusters desired, with a default of 6 clusters. It will then popup a K-means report window with various controls. The centers of the N clusters are indicated with magenta circles where the size corresponds to the number of genes in the cluster. If you click on a gene in the array, it will draw all members of its cluster as green numbers (try this with the scatter plot present as well). Press "EP plot" to scroll through a list of expression profiles of the genes sorted by cluster. Press "Mean EP plot" to scroll through the summary expression profiles of the clusters. Press "Cluster-Report" and "Mn-Cluster-Report" to