3. Exploratory Data Analysis - Introduction to Data Mining

Data mining is the uncovering of relevant patterns of interest in data from a particular problem domain (
Tukey, 1977). Typically this involves using various statistical techniques to identify the patterns including cluster analysis. See StatSoft Inc's, 2002 on-line statistics textbook for definitions of clustering and other statistical terms. Researchers across a wide range of fields such as (Tufte, 1997) and (Cleveland, 1985) have suggested that a major aspect of this problem is finding the correct means of graphical presentation to allow humans to be a part of the pattern recognition process. Tufte argues that the proper display of quantitative data in the context of the problem domain can aid in the understanding of complex sets of data. This carries over to the analysis of microarrays with data mining involves having statistical, genomic knowledge database, and graphical components for success. (Jagota, 2001) discusses a number of methods and applications for microarray data analysis and visualization. Other useful resources are the sets of papers in ( "Chipping Forecast", Nature Genetics supplement, Jan, 1999), and ( "Chipping Forecast II, Nature Genetics supplement, Dec, 2002).

This section briefly addresses some of the issues you need to consider. However, a full discussion of the issues involved is beyond the scope of this manual. These issues are covered in other more focused statistical methods literature and you might also address them in consultation with biostatisticians. The Internet has vast resources for microarrays. A few to get you started might include: a microarray citation electronic library http://arrayit.com/e-library/, the National Library of Medicine PubMed journal search engine, a general microarray Listserv GENE-ARRAYS@ITSSRV1.UCSF.EDU. The MGED group (Brazma, 2001) has published the MIAME standard which specifies (Minimum Information About a Microarray Experiment). This information is useful in doing an analysis. Also try searching using general Internet search engines. There are a number of public microarray data repositories. One that we find useful is NCBI's GEO (Gene Expression Omnibus), that contains array data and MIAME compliant information about the arrays.

Organization of Sections in this Chapter

  • 3.1 discusses Objectives in data mining, discovery and analysis,
  • 3.2 describes the steps in an analysis,
  • 3.3 describes how the database may be interrogated for gene spot intensity & identification.
  • 3.4 describes selecting subsets of genes using the data filter.
  • 3.5 describes selecting subsets of hybridized sample conditions.
  • 3.6 describes setting thresholds using state-slider controls.
  • 3.7 describes how to export (i.e. save) report and plot data to your local computer.

    3.1 Objectives in data mining, discovery and analysis

    There are a number of objectives an investigator has when analyzing a set of data. The types of analyses and how useful they are depends on what they wish to get out of the analyses as well as the type of data.

    A good and appropriate experimental design (i.e. the design and setting up of experiments to subsequently be analyzed) is critical for resolving significant differences in gene expression between experimental conditions. We touch on some of the issues here. (Simon, 2001), (Dudoit,2000), and Kerr and Churchill (2001a, 2001b) discuss some of the issues of experimental design for microarrays. We do not currently implement the Kerr-Churchill method. However, some of the issues involved in experimental design based on the types of arrays are discussed in Section 3.1.1 for (Cy3/Cy5)-labeled as well as 33P-labeled samples.

    If users are comparing two different types of samples, the analysis would be different than if they were comparing an ordered sequence of samples (e.g. time series, cell cycle, dose-response, tumor-stage, etc.). MAExplorer gives users the ability to:

    1. Organize their experiments by sample characteristics allowing them to perform a variety of mining analyses comparing gene expression patterns between sets of different samples or comparing a single sample within an array's set of genes.
    2. Explore, compare, and record these analyses and share the results or intermediate data with other investigators.
    3. Use graphical direct-manipulation combined with statistical methods, clustering and spreadsheet techniques to gain different insights into the structure of the data.
    4. Access public Internet genomic databases for particular genes that are found to be interesting.

    Briefly, data mining is the discovery of potentially interesting patterns in the data that were previously unknown. One approaches the analysis of a set of data with minimal expectations. However, some idea of what you are interested in helps focus the search. But beware of the trap of mining the data until you get the results you hope for. The following figure helps illustrate this process.

    Flow chart of a typical data mining session

    Figure 3.1 Flow chart of a typical data mining session. The user makes some initial decisions on the experimental design such as which hybridized samples to compare, the type and numbers of replicates. They then make initial guesses as to the normalization method to use, and the gene subset (the gene class) to concentrate on when setting the data filter. The data is viewed in various modalities to get a feeling for its inherent dynamic range and where interesting outliers might appear. Clustering and plots helps bring these differences into view. The results are then evaluated and either the process is finished or the views are refined by adjusting data normalization and filter parameters, data subsets to be investigated, clustering methods, plots etc. and the process repeated until the user is able to see the differences between gene subsets more clearly or no significant differences appear to be found.

    Obviously, this approach is a first approximation to what is eventually required. But it does capture the flavor of the data-mining process. Typically the user would refine the search using variations of the data filters and might contrast (using gene sets and hybridized sample condition lists operations) results found under one set of conditions with those found under another set of conditions.

    Recording the analysis steps during your data mining session - command history

    Because of the iterative nature of this process, you might want to keep a record of the commands you have used or the messages and measurements you have made. To do this you need to enable message and command history logging. Go to the View pull-down menu and then select the type of logging you want using the Show log of messages or the Show log of command history commands.

    3.1.1 Some experimental design issues of microarray experiments


    Proper experimental design of microarray experiments is critical to successful use of microarray data. Several recent reports discuss some of the key issues involved in various aspects of statistical analysis of microarrays: (Radmacher, 2001), (McShane, 2001), (Korn, 2001), (Simon, 2001), (Dudoit,2000).

    Comparing HP-X/HP-Y for Cy3/Cy5 data as 'ratio of ratios'

    If we have two samples HP-X and HP-Y with a common reference sample P (e.g. Cy5P), then we would be comparing the HP-X "intensity" Cy3X/Cy5X against the HP-Y "intensity" Cy3Y/Cy5Y. Alternatively, you can label Cy3 as the common reference sample P in which case just swap Cy3 and Cy5 in these equations. If you are using a common reference standard (i.e. Cy5X1) is the same sample as Cy5Y1 eg. a pooled sample Cy5P, then
     a)   (Cy3X/Cy5X1) / (Cy3Y/Cy5Y1)
     b)   (Cy3X/Cy3Y)
    However, this new comparison is accompanied by additional noise because of use of the two Cy5P intermediaries.

    An alternative method would be to compute (Cy3X/Cy5Y) directly. However, this too has its own sources of error and other problems, namely that not all genes are labeled symmetrically with the two dyes since different dyes may have different sequence specific affinities due to a variety of causes. For that reason, dye-swap experiments are often done. I.e. the two samples would be run as (Cy3X/Cy5Y) as well as (Cy3Y/Cy5X). If one were to plot (Cy3X/Cy5Y) against 1.0/(Cy3Y/Cy5X) and the data were perfectly symmetric (which they are not) then one would expect a straight line. That is generally not what you get in practice.

    Another issue is that when you have a number of samples A, B, C, D, ..., N and wish to compare them, there are a number of alternate experimental designs you can use with different resulting sets of advantages and problems. If a common pooled Cy5P sample P were used, then the following experiments would be done:

       (Cy3A/Cy5P), (Cy3B/Cy5P), ... , (Cy3N/Cy5P)
    This assumes that there is enough of the pooled sample P to be used for all of the experiments - otherwise additional sources of error would be introduced. MAExplorer is ideally used with this common reference sample P. It a common pooled sample is not used, then the experimental design becomes more complicated - especially if dye-swap experiments are performed for all samples. For N samples taken 2 at a time (i.e. Cy3 and Cy5), then the number of experiments may be impossibly large to perform for other than a very small N. Eg. for N of 3, the number of experiments is 3 and 6 if dye swap experiments are also performed. For N of 4, the number of experiments is 6 and 12. And this is without doing any replicate experiments. If a reasonable number of replicates is added, then this set of experiments becomes even difficult to perform.

    MAExplorer is currently not oriented to handling these large combinatoric types of non-pooled sets of experiments. However, you do have the ability to swap (Cy3,Cy5) data on an individual basis so you could compute an average of data from dye-swap experiments - but with the caveats or non-uniform labeling mentioned above.

      [(Cy3X/Cy5Y) + 1.0/(Cy3Y/Cy5X)]/2
    In general, this is probably not a very good estimate.

    3.1.2 Design philosophy of MAExplorer methodology

    There are several ways to implement a data mining system on moderate size databases. The first is that all computations are performed on a Web server and the user's Web browser displays the results. The second is download an applet from the Web server, get the data from the Web server and do computations in the Web browser. A third way is do download data from a Web server and run a local stand-alone program on the data. MAExplorer can be run using both the second and third ways. However, we encourage the use of the stand-alone paradigm as having the best bandwidth and being the most robust. The browser-based computation paradigm (as opposed to server-based) is somewhat unusual. It keeps both the program and data on the server, making user maintenance of the latest versions easier than if they had to constantly upgrade the program or data. This also has the distinct advantage of giving the user instantaneous feedback through rapid visual and tabular views and the ability to more effectively navigate the data since the analysis is done on their desktop computer. Because it is easy to access reference data from other genomic sources (e.g. UniGene, GenBank, NCI/CIT's mAdb clone DB, dbEST, GeneCard, etc.), it can be accessed from their respective Web servers as needed. Complex browser-based computations are used in other data mining or intensive computation domains. With the increased bandwidth of the Internet and compute power and memory of PCs approaching the Cray supercomputers of the previous decade, this paradigm becomes even more feasible. However there are limits to how well it scales because of Web browser limitations. Appendix E.2 discusses these issues in more detail

    The major focus of the MAExplorer is interactive data mining with an emphasis on direct graphical and tabular manipulation of the data. The investigator is able to interact with the system by clicking on spots in the array image, points in graphic plots, cells in spreadsheets, by manipulating threshold sliders or typing in gene names/clone Ids. This level of interaction allows investigators to search for and identify patterns of differences with greater ease than with a more static graphic system since it is easier to test ideas by "grabbing onto the data". For example, "what" is the identity of "this" outlier I am pointing to in a scatter plot; "which" genes are best clustered with "this" gene in this clustergram and are perhaps co-regulated; "which" genes have expression ratios within the range of the histogram bins to that I am pointing?

    Direct user manipulation of data, as incorporated in MAExplorer, was defined by (Schneiderman, 1997) who defends the position that the direct manipulation of data in data mining is an extremely effective means to amplify human creativity in understanding patterns. Schneiderman's dogma states "overview first, zoom, and then filter details on demand" and favors the use of "shallow search trees, slide controllers, and information-right screens with tightly coordinated panel view of data", (Beardsly, 1999). MAExplorer also uses many of these direct manipulation principles. It was designed to run on the desktop computers with data residing on the same computer and loaded into its memory for rapid direct manipulation - for both the Web browser and stand-alone versions.

    3.1.3 Evolution of MAExplorer from earlier proteomic data mining systems

    MAExplorer was designed to do flexible exploratory quantitative data analysis of gene data from microarray hybridized sample experiments. Many of the data-mining concepts are derived from a system called GELLAB-II (http://www.lecb.ncifcrf.gov/lemkin/gellab.html) that is a UNIX-based stand-alone exploratory data analysis system for 2D protein gels over multiple experiments (Lipkin and Lemkin, 1981), a review (Lemkin and Lester, 1989) and examples of graphical representations of this type of data (Lemkin, 1995). An on-line GELLAB-II Web-Poster (http://www.lecb.ncifcrf.gov/lemkin/gellab-ep93wd.html) is available showing various screen shots of GELLAB-II in action. Whereas GELLAB works with sets of corresponding spots (i.e. proteins) across sets of 2D gel samples, MAExplorer works with sets of genes (spots in the microarray) across sets of hybridized sample microarrays. With protein gels, one typically has spot alignment problems since gels are generally not superimposable. This is often called the rubber-sheet distortion problem and requires localized alignment of spots based of neighboring spot constellation morphology. We have used Web-based visual methods to visually compare gels including the Flicker (http://www.lecb.ncifcrf.gov/flicker/) image comparison system a Java applet, (Lemkin, 1997), and the 2DWG (http://www.lecb.ncifcrf.gov/2dwgDB/) meta-database of 2D gel images, (Lemkin, 1999a). Since the genes are precisely spotted on the arrays, aligning spots between arrays is not required and greatly simplifies that the data analysis problem.

    Part of the Flicker system allows comparison of user 2D gel images with standard images from SWISS-2DPROT for putative identification of unknown spots in the user gels. The user would select a standard 2D gel image from over 20 tissue types, enter their own 2D gel image and align them at spots of interest. They could then switch to a database access mode, click on those spots and generate popup SWISS-2DPROT Web pages for those proteins - similar to Clone reports in MAExplorer. That is accessed at http://www.lecb.ncifcrf.gov/flicker/swissProtIdFlkPair.html.

    MAExplorer will have a groupware facility similar to what we have done with our WebGel (http://www.lecb.ncifcrf.gov/webgel/) system described in (Lemkin et al., 1999b). It is a two-dimensional electrophoresis system for sharing data analyses. In WebGel, users may perform a data-mining analysis and leave the state of the their analysis and accompanying notes to share with their collaborators on a login-protected basis.

    3.1.4 Concepts used in data mining with MAExplorer

    This section introduces some of the concepts used in data mining microarrays with an emphasis on how they are used with MAExplorer.

    Gene data filters - a Boolean AND of gene set tests

    A primary MAExplorer concept is that of
    gene data filter that selects a working set of genes by the conjunction (Boolean AND) of user selectable tests. Each test further restricts the working set of genes to those meeting the test. These criteria include gene membership in particular gene classes, membership in particular user defined or computed gene subsets, and meeting a variety of statistical constraints. Statistics include intra- and inter-array CV, X-Y sets t-tests. Range test criteria include X/Y ratio ranges and histogram bins, intensity ranges and histogram bins. Membership criteria include test if genes are in the current-cluster (derived from cluster-analysis), gene set membership, etc. By selectively including one or more of these filter restrictions, the user can home in on the data that appears to be interest. Of course as in real mining, what appears interesting may not be interesting based on further investigation.

    Set operations on gene subsets

    Because of the complexity of comparing many different replicated samples, it may be difficult to manually organize the resulting comparisons. MAExplorer offers set-theoretic operations on sets of genes and sets of hybridized samples (i.e. intersection, union, difference) to help with this organization (step 9 in Table 3.2). The results of set operations may be saved and used in subsequent set operations, normalization, as well as with the data filter. This is useful when comparing and documenting procedures, methods, and analyses from several subsets of experiments.

    User exploration states

    Users needs to be able to save and restore the current state of their explorations of the data and option settings to document and continue at later times. When running in stand-alone mode, the user may save their data mining session on the local disk as in named (.mae file extension) startup files. Clicking on one of a startup file will restart MAExplorer and restore the state to that of the time it was saved. In addition to filter and parameter status, the HP-X, HP-Y, HP-X and HP-Y 'sets', HP-E 'list', the named gene sets and HP condition sets are saved as part of the state

    User groupware sharing of exploration states with collaborators

    [In the future], these could be saved on a public Web server using multiple named state files. These are protected for the user using a login procedure. A groupware sharing of these intermediate exploratory results is available when they allow another user to access selected states. User states and groupware sharing complete step 11 in the analysis described in Table 3.2.

    We now discuss using these tools for analyzing ones data.

    3.2 Steps in data mining, discover, and analysis

    An analysis scenario may use many methods for viewing the data. A typical sequence of analysis steps is listed below in Table 3.2 in the order they might be performed. Note that this is a rough guide for a possible analysis and the iteration and backing up for of some of these steps is required for data mining complex sets of conditions, especially in the setting of constraints for the "data filter" (step 4) when the user focuses on subtle patterns of interest (c.f. Figure 3.1).

    Table 3.2 Steps in a data-mining analysis.
    1. Select a set of hybridized samples to be analyzed. Additional samples may be added or removed from the HP-X, HP-Y and HP-E sample lists prior to or during an analysis.
    2. Organize samples by selecting query type: 2-class (X vs. Y) sample sets or N-sample ordered expression lists.
    3. Select normalization (transformation scaling) method appropriate to the type of data and what types of changes you are looking for (patterns or outliers).
    4. Restrict search using the "data filter" to subset genes (pre-filter data) by gene-class, gene-subset, ratio range, intensity range, coefficient of variation, t-test, expression profile, cluster membership, etc.
    5. Review scatter and expression plots, ratio and intensity histograms to gain insights in overall characteristics of the data range and intensity distributions. Export results using either SaveAs or cut and paste operations.
    6. Cluster genes by similar expression profiles and other clustering methods. Review similarity, clustergram and dendrogram plots and reports. Export using either SaveAs or cut and paste operations.
    7. Select subset(s) of genes of interest in plots or reports.
    8. Access other Web genomic databases for genes of interest from spreadsheets or plots.
    9. Save genes of interest in named gene sets and perform set operations if required.
    10. Create gene reports for export to Excel using tab-delimited Reports and either SaveAs or cut and paste operations.
    11. In the stand-alone version of MAExplorer, save the state of exploratory data analysis for later use or sharing.

    In designing a data mining experiment, the first decision to be made is selecting the set of hybridized samples to be compared (steps 1 and 2). This is accomplished by setting the current hybridized sample-X (HP-X) and hybridized sample-Y (HP-Y). In Figure for the scatter plot we selected a single C57B6 pregnancy day 13 and a single Stat5a (-,-) pregnancy day 13 as current HP-X and current HP-Y samples. Changing the normalization changes the view in the scatter plot so that hidden differences may be more apparent (see Figure

    The names of the current HP-X and HP-Y samples are displayed at the top of the main window. The current HP-X and HP-Y samples may be changed at any time by clicking on a new sample from a list of samples shown on the left side of the main window or from lists of samples organized by sample population in the Samples menu.

    The next decision to be made is selection of the genes to be studied by choosing a subset from the gene class menu list (step 4). Further selection occurs throughout the analysis by clicking on spots in microarray images, points in graphic plots or cells in spreadsheets, by adjusting threshold sliders, or using the text-entry "guesser" to type in gene names, clone IDs, genomic IDs, samples, etc.

    The next decision the user must make is to set the intensity data normalization mode (step 3). Normalization of quantitative data is crucial when comparing data between different hybridized microarrays because of spotting, hybridization efficiency, uniformity, and other systematic errors.

    Genes of interest may be separated for all of the genes in the database using a cascade of data filters (step 4). Additional filtering options are easily accessible in the (data) Filter menu. Some of the filters require additional parameters. These parameters are set by state scroll bars that pop-up on the screen when data filters requiring them are added to the filter cascade. Changing scroller values causes the data filter to be automatically be reapplied and a new set of genes to be computed.

    It is desirable to reduce false-positives found by the data filter by eliminating genes with high quantification variability between duplicate spots on the same sample or spot duplicated in replicate samples. If duplicate genes are available on the array (denoted by Field 1 and Field 2 or F1 and F2 spots), this allows the computation of a coefficient of variation (CV) for the duplicates. This CV may be used in a data filter to reduce potential false-positives. CV is computed as 2|F1-F2|/(F1+F2) using those spot values for each gene, as StdDevHP/MeanHP for a set of replicate hybridized samples.

    Graphical views of the data give the user additional insights into the data. These include spot intensity and ratio or Zdiff pseudoarray images, scatter plots, histogram plots, expression profile plots, cluster plots showing genes similar to a specified gene, the number of clustered genes for each clone, divisive clusters for K-means clustering, and clustergrams and dendrogams for hierarchical clustering.

    Scatter plots are useful for visualizing data from two conditions

    The scatter plot method (step 5) allows the user to plot the intensity data between two samples, the X-sample and the Y-sample. Gene data may be spot data for two different samples (HP-X and HP-Y), means of two different sets of hybridized samples of replicate samples (sets of HP-X and sets of HP-Y), or the left and right normalized replicate data (F1 vs. F2) for the current hybridized sample. If Cy3/Cy5 data is used, then each sample is the ratio of data from two different hybridized samples. So if we have sample Cy3a and Cy3b then HP-X could be Cy3a/Cy5 and HP-Y could be Cy3b/Cy5 such that we are scaling the Cy3a and Cy3b samples using a common Cy5 normalization sample. Scatter plots are useful for obtaining a better understanding of the outliers when comparing different hybridized samples and determining the reproducibility of spotting when comparing F1 vs. F2 data or replicate sample data.

    Filtering genes by histogram plots of ratios, Zdiffs or intensities

    Histogram plots may be generated from either X/Y ratios or (X-Y) Zdiffs of two different hybridized samples (single samples or X and Y replicates) or from the F1/F2 intensities of a single hybridized sample. Selecting a bin in a histogram restricts filtered genes to those that are contained in that histogram bin. As an alternate method, data filtering by ratio (Zdiff) or intensity range may be used with adjustable range scrollers independent of the histograms. However, histograms and scrollers may be used together. For example, one could filter by the ratio histogram after filtering out genes with low-intensity values that may be considered noise using the intensity sliders. That might help eliminate falsely high ratios resulting from dividing high X values by a very small noisy Y values. Histograms are useful for getting a better understanding of the range and distribution of the gene intensities or ratios.

    Expression profile plots (EP-plot) of N conditions for viewing time series, etc.

    List HP-E is an ordered list of samples - as different from HP-X and HP-Y that are unordered sets of samples. The expression profile (step 5) of a gene is the plot of its normalized intensity as a function of the samples in the ordered HP-E list. It may be plotted for the current gene in a pop-up window. Selecting a different current gene causes the EP-plot to be displayed for that gene. Multiple EP-plots may be created to view the differences between a few genes you are investigating further. The HP name button pops up a window with the ordered list of samples so you can see the details of the sample names being plotted. Selecting a line in a plot displays the intensity data and sample name for that hybridized sample. The data may be plotted as a bar, point or continuous curve and error bars may be turned off to better compare multiple plots.

    When there are too many EP-plots to be viewed simultaneously, you might use a scrollable list of expression profile plots that lets you scroll through an arbitrarily large list of genes. However, it is difficult to compare genes that are not sorted in some way (i.e. clustered). Therefore, these are most useful when used after clustering the data and displaying the scrollable EP-plots of the cluster-order data.

    Clustering is one way of possibly finding co-expressed genes that exhibit similar expression changes in a set of samples. Genes may show similar co-expression, but that does not prove they are co-regulated at the same point in a pathway - merely that measurements of those genes in a particular set of experiments show similar expression. However, identifying genes with similar expression for which some information is already known about some of the genes may be useful as a starting point to help figure out gene function and pathway using additional experiments and analysis.

    There are many methods for doing clustering - each with advantages and disadvantages. We present three methods in MAExplorer and plan on adding a variety of more powerful methods through the MAEPlugin facility under development.

    Finding clusters of genes with similar expression profiles: similar, cluster counts, K-means, and hierarchical methods

    We may define a cluster of genes as a set of genes whose expression profiles are found to be similar (step 6). The samples used in computing the expression profiles are specified by the HP-E ordered list. You can scale the list of normalized intensity data for each gene to 1.0 (resulting in finding genes with similar shaped EP-plots). Alternatively, if you don't scale this data it will cluster more on magnitude changes. You can select either the Euclidean distance or the correlation coefficient of the EP lists between two genes as the measure of gene-gene distance. Similarity is 1.0 - normalized distance.

    The first cluster method finds a cluster of genes whose expression profiles are similar to that of the currently selected gene. This list of genes is restricted by the constraint that the cluster distance between each of these genes to the selected gene is less than the "Cluster threshold" distance set by the user with a scroll bar. It displays genes that are found both with blue boxes (the larger the box, the higher the similarity) and in a text report window showing the genes and their distances to the current gene. By varying the threshold and observing the results, the user can find a set of highly correlated genes. If the threshold is set to 0.0, no genes are found. If it is set too high, all data filtered genes are found. So it is critical to adjust the threshold to a reasonable level commensurate with the type of data being analyzed and the approximate number of genes expected.

    A second cluster method draws blue circles in the array image around all filtered genes meeting the threshold criteria, where the larger the circle the larger the number of similar genes (i.e. passing the threshold) are found to be clustered with that gene. Clicking on a gene toggles between the first and second methods. For both of these methods, it will pop-up a "Cluster Distance" threshold scroller and recomputes the clusters if you change the scroller value or the current gene. It also shows a text report that displays the number of genes similar to each data filtered gene.

    A third method called "K-means" clustering K genes (we call primary nodes) whose expression profiles are most orthogonal to each other. It uses the current gene as the first or "seed" node. It then finds the gene furthest from this and assigns it as node 2. Then the gene furthest from both nodes 1 and 2 is assigned to node 3, etc. This process is repeated until all K nodes are assigned. Then the remaining genes are assigned to the closest node. Having defined the initial cluster centers, it recomputes the centroid of each of the clusters. The centroid can alternatively be computed using a median instead of a mean in which case we would be doing K-median clustering (Bickel, 2001). K genes are then reassigned to the nearest new centroids as the new K-means node instances. Finally, the remaining genes are assigned to the nearest centroid. A scrollable K-means cluster text window report pops up with genes sorted by cluster. Clicking on a gene in either the array image or scatter plot assigns all genes in the cluster to which that gene belongs to the "current cluster". Genes in the current cluster are labeled in the array and scatter plot with a small number of the cluster. In addition, genes in the current cluster are copied to the E.G.L. where they can be used in a report, saved in a named gene set, or used for additional filtering. It also pops up a "N-clusters" scroll bar window to let you dynamically adjust the number of clusters. Changing N will recompute the clusters. When the K-means is recomputed, it uses the current gene as the initial seed gene.

    The fourth method is a hierarchical clustering method that generates a clustergram and dendrogram similar to that of Eisen's red-black-green clustergram (Eisen, 1998). This was derived from the clustered correlation map (ClusCor) of Weinstein et al. (Weinstein, 1997). The MAExplorer clustergram and dendrogram are dynamic and may be interrogated and used to set the current gene. This means that it may also position a corresponding ordered list of expression profile plots to the same gene so you may view the data as a plot as well. The dendrogram may be zoomed in to explore a part of the dendrogram in more detail. As with the K-means clustering, a report can be made of the ordered genes.

    Gene reports: dynamic spreadsheets for Web access or tab-delimited for exporting data to Excel

    Pop-up report windows (step 7) may be generated for either individual genes or a global array sample data. Instances of the latter include experimental information and Web links, global statistics, correlation coefficients between array samples, etc. Gene reports may present this data in a number of ways. These include: highest/lowest gene ratios, profiles, parametric and cluster statistics, etc. Reports may be presented as either dynamic Web-interactive spreadsheet tables or as static tab-delimited tables. The latter is useful for exporting data using cut and paste into Excel (step 10). If the user clicks on a blue hyperlinked cell in a dynamic spreadsheet table, it pops up another Web browser window and loads it with data (step 8) from the respective Internet genomic database such as mAdb Clone DB, UniGene, GenBank, dbEST, GeneCard, and MGAP model and histology Web pages.

    Collaborative groupware environment

    Having immediate access to collaborator's data is a powerful research tool. A collaborative environment is being implemented for MAExplorer that allows groups of users to share data and intermediate results. These capabilities include: 1) the ability to save and restore exploratory data mining sessions (states) through the Web server including named sets of genes, and 2) to selectively share these states with collaborators. The latter process is sometimes called a groupware environment because if offers a collaborative group the ability to share and interact. These capabilities are modeled after our WebGel system (Lemkin et al., 1999b). In addition, users can create a custom database Web page as a subset of samples from the entire database. This may be saved on their own computer through their Web browser's "File/Save as" command. This hypertext file could then be used at a later time to access the database or be E-mailed to a collaborator to do the same.

    3.2.1 Definition of expression profile

    It is helpful to define an expression profile. There may be alternate definitions, but the following is useful for getting an understanding of how it might be computed. An expression profile ej of an ordered list of N samples (k=1 to N) for a particular gene j is a vector of scaled expression values vjk.

    Then, the expression profile is expressed as a list of values:

       ej = (vj1, vj2, vj3, ..., vjN)
    A difference between two genes p and q may be estimated as a N-dimensional metric "distance" between ep and eq. The Euclidean distance is then defined as
       dpq = (1/N SUMj=1:N (vjp - vjp)2 )1/2
    Other distance measures may include correlation coefficient, city-block (or manhatten distance) etc.

    For scaled data such that dpq has a maximum value of 1.0 ovger all samples. A similarity measure could be computed as 1.0 - distance or

      spq = 1 - dpq

    3.2.2 Clustering Methods

    Clusters represent one way to identify similar gene expression across a set of experiment samples. There are many ways to cluster the data, some of which are available in MAExplorer. These include:
    1. Finding genes with similar expression
    2. K-means clusters where the number of clusters K is fixed prior to clustering and a particular gene is used to start off the cl.
    3. Hierarchical clustering where a binary tree hierarchy is created
    Other methods include Self Organizing Memory (SOM), fuzzy clustering, Support Vector Machines (SVM), etc. Clustering similar genes

    If we have a particular gene s (the "seed" gene), we may want to find a set of all genes {gj} similar to gs. We can find this set of genes by testing We define a particular gene gj as similar to seed gene if the distance between genes s and j meets the following criteria.
      djs < T
    The threshold T is set by the investigator and in MAExplorer is changed using a slider. Typically, the set of all genes {gj} found is sorted by similarity before being viewed. K-means clustering

    K-means clustering finds K clusters of genes with similar expression profiles to a given gene (see
    Sneath and Sokol, 1973). Given the number of clusters K, we could use high variance of clusters to determine if they should split into sub-clusters. K-means clustering does not need a distance matrix (see Hierarchical clustering which follows), so it is faster and may cluster large numbers of N genes. However, it is highly dependent on seed selection. It may be useful for getting an initial estimate - especially if other techniques (such as silhouette plots) are also used. The following is a simplified definition of one way to compute a set of K-means clusters of gene expression profile data.

    1. Pick seed gene s and put it into cluster 1 (let k = 1)
    2. For all clusters j = 1 to k, find gene q such that djq is a maximum
    3. Set k = k+1. Put gene q into new cluster k
    4. For j = k to K, repeat steps 2 and 3 until there are K clusters
    5. Then, assign (N-K) remaining genes q into one of the K clusters j with minimum djq
    6. Compute new virtual gene cluster centroids as means or medians {ek} for each of K clusters
    7. Reassign all N genes q into K new clusters with minimum dpq using virtual genes {ep}
    8. Variants: use multiple seed genes, range of K values, minimize CV Hierarchical clustering

    Hierarchical clustering of a set of genes will generate a binary tree of clusters with the genes at the terminal ends of the tree and a single cluster of the entire tree at the top (also called the root) of the tree. See (Sneath and Sokol, 1973) for a discussion on hierarchical clustering. There are many other variants of hierarchical clustering. Hierarchical clustering requires a distance matrix or the equivalent of one [there are more efficient ways to compute it]. ForN genes (terminal clusters), it generates 2N-1 clusters. Distance matrix is upper diagonal matrix D of dpq of size N(N-1)/2.

    D can get quite large for clustering a large number of genes N [for N=5000, this is > 50 Mbytes!]

    The following is a simplified definition of one way to compute a hierarchical clustering of gene expression profile data.

       1. Assign all N genes to clusters 1 to N, set n to N
        2. Find two clusters p and q such that dpq is a minimum doing the following:
    2.1 Compute a virtual cluster vector ek = average (ep,eq)
    2.2 Set n = n+1
    2.3 Assign virtual cluster to new cluster
       3. Repeat step 2 until n = 2N-1.

    3.3 Display of gene spot intensity and identification data measurements

    You may select the current gene by clicking in the pseudoarray image or in the X-Y scatter plot and MAExplorer reports. The microarray grid coordinates, normalized quantified spot intensity data, plate coordinates, gene name (if known) and associated data for that gene. If you are displaying a pseudocolor ratio (X/Y) or Zdiff (X-Y) image, it will report HP-X/HP-Y or (HP-X - HP-Y) data respectively. It also sets the gene as the current gene. The pseudoarray image coordinates are reported as:

         [<field>-<grid name><row#>,<col#>].

    If there is only one field in the array, it will appear as field 1. In the above example, [1-A4,3] is field 1 grid A row 4 and column 3. Note that the pseudoarray coordinates are for visualization purposes in MAExplorer and may or may not be the same as the coordinates on the actual array. That depends on how the MAExplorer database was defined in the configuration file described in Appendix C.

    When the current gene is defined, it will draw a yellow (green) circle around the spot in the ratio (intensity) pseudoarray image and display other features of the gene in the three-line status area near the top of the main window. If background correction is enabled (the "Use background intensity correction" in the Normalization menu), then spot intensity values will appear as intensity' (with background intensity subtraction) and intensity (without background subtraction).

    There are a number of different reporting formats available depending on the array display mode and particular normalization method selected. These include: the pseudoarray image of the intensity of a single sample, the pseudocolor ratio X/Y or Zdiff (X-Y) image (using either HP 'sets' or single samples), or the ratio of Cy3/Cy5 for dual-labeled dyes or F1/F2 for replicate spots for a single sample. In addition, the normalization mode is also displayed in the reporting line. We will present examples of each of these different reporting formats.

    You may show the intensity data for a particular spot in the currently displayed pseudoarray image. First select the "Pseudograyscale image" option in the "Show Microarray" submenu in the "Plot menu". If your data has duplicate grids (i.e. fields F1 and F2) then you may look at F1, F2 and mean (F1+F2)/2 data in the reports when you click on a spot. If the "Gang F1-F2 scrolling" switch is disabled in the "View menu", then the intensity value is the intensity data value for the gene at that location. If the "Gang F1-F2 scrolling" switch is enabled, then it reports intensity[F1], intensity[F2], and the F1/F2 ratio. These two formats are shown in the following two examples for a C57B6 pregnancy day 13 samples in the MGAP database:

    a) Field F1 spot for a single spot in a single sample with the median intensity selected.

    [1-A4,5] intensity=4.5267, (Norm.: median intensity)
    CloneID: 1248228, dbEST3': 2279072, GenBankAcc3': AI463183, UniGene: Mm.13859, plate[5,A,5]
    GeneName: Mus musculus ribosomal protein L41 mRNA, complete cds
    b) Field F1 and F2 replicate spots for a single sample. The top line is shown for each of the different normalization methods.
    [1-A4,5] intensity[F1]=-0.3067, intensity[F2]=-0.2312, F1-F2=-0.0755, (Norm.: Zscore intensity)
    [1-A4,5] intensity[F1]=4.5267, intensity[F2]=6.2408, F1/F2=0.7253, (Norm.: median intensity)
    [1-A4,5] intensity[F1]=0.8755, intensity[F2]=1.1457, F1-F2=-0.2701, (Norm.: log median intensity)
    [1-A4,5] intensity[F1]=-0.1442, intensity[F2]=-0.0945, F1-F2=-0.0497, (Norm.: Z-score, stdDev, log intensity)
    [1-A4,5] intensity[F1]=-0.1533, intensity[F2]=-0.1004, F1-F2=-0.0528, (Norm.: Z-score, mean abs.deviation, log intensity)
    [1-A4,5] intensity[F1]=630.9911, intensity[F2]=869.9273, F1/F2=0.7253, (Norm.: calibration DNA intensity)
    [1-A4,5] intensity[F1]=1919.9376, intensity[F2]=2646.957, F1/F2=0.7253, (Norm.: scale to max. (65K) intensity)
    CloneID: 1248228, dbEST3': 2279072, GenBankAcc3': AI463183, UniGene: Mm.13859, plate[5,A,5]
    GeneName: Mus musculus ribosomal protein L41 mRNA, complete cds
    If the "Pseudocolor HP-X/HP-Y ratio or Zdiff" option is selected in the "Show Microarray" submenu, data is reported as either Ratio or Zdiff data depending on the normalization method selected. The data used in the following examples is for C57B6 pregnancy day 13 (HP-X) compared with Stat5a (-,-) pregnancy day 13 (HP-Y).

    c) Ratio data for two samples X and Y in separate hybridized arrays. Ratio data for the field F1 and F2 spot data as well as the mnX/mnY ratio is reported. The median normalization was used in this example.

    [1-A4,5] HP-XY: mn(X,Y)=(5.383,6.834) (X/Y)(F1,F2,mean)=(0.651,0.928,0.787), (Norm.: median intensity)
    CloneID: 1248228, dbEST3': 2279072, GenBankAcc3': AI463183, UniGene: Mm.13859, plate[5,A,5]
    GeneName: Mus musculus ribosomal protein L41 mRNA, complete cds
    d) Zdiff data for two separate samples X and Y. Ratio data for the field F1 and F2 spot data as well as the mnX-mnY Zscore difference is reported. The three Zscore, ZscoreLog, and logMean normalizations were used in this example (first lines are shown).
    [1-A4,5] HP-XY: mn(X,Y)=(-0.269,0.151) (X-Y)(F1,F2,mean)=(-0.470,-0.370,-0.420), (Norm.: Zscore intensity)
    [1-A4,5] HP-XY: mn(X,Y)=(-0.119,0.051) (X-Y)(F1,F2,mean)=(-0.199,-0.142,-0.170), (Norm.: Z-score, stdDev, log intensity)
    [1-A4,5] HP-XY: mn(X,Y)=(1.010,1.224) (X-Y)(F1,F2,mean)=(-0.362,-0.064,-0.213), (Norm.: log median intensity)
    CloneID: 1248228, dbEST3': 2279072, GenBankAcc3': AI463183, UniGene: Mm.13859, plate[5,A,5]
    GeneName: Mus musculus ribosomal protein L41 mRNA, complete cds
    e) Example of when the "Use dual HP-X & HP-Y Pseudoimage" mode is enabled in the "Show Microarray" submenu of the "Plot" menu. This displays mean data for the HP-X and HP-Y data side-by-side. The median normalization was selected.
    [1-A4,5] intensity[X]=5.3837, intensity[Y]=6.8342, X/Y=0.7877, (Norm.: median intensity)
    CloneID: 1248228, dbEST3': 2279072, GenBankAcc3': AI463183, UniGene: Mm.13859, plate[5,A,5]
    GeneName: Mus musculus ribosomal protein L41 mRNA, complete cds

    Reporting for multiple hybridized samples when using HP-X/-Y 'sets'

    If you have enabled MAExplorer to "use HP-X and HP-Y 'sets' of multiple samples" rather than single samples" in the Samples menu, it will report a spot differently using the means (mn), standard deviations (S.D.), coefficient of variations (CV) for the samples in the HP-X and HP-Y 'sets'. For duplicate fields, these are computed using the normalized average of F1 and F2 spots for each gene in each samples. The data used in the following examples is for three C57B6 pregnancy day 13 (HP-X) samples, and five Stat5a (-,-) pregnancy day 13 (HP-Y) samples.

    f) Multiple HP-XY 'sets' using median normalization for the pseudoarray image display for the HP-X 'set' of three C57B6 samples.

    [1-A4,5] HP-X 'set' mean intensity=3.295 stdDev=1.482 CV=0.449 n=3, (Norm.: median intensity)
    CloneID: 1248228, dbEST3': 2279072, GenBankAcc3': AI463183, UniGene: Mm.13859, plate[5,A,5]
    GeneName: Mus musculus ribosomal protein L41 mRNA, complete cds
    g) Multiple HP-XY 'sets' using median normalization for the pseudoarray image display for the HP-Y 'set' of five Stat5a (-,-) samples.
    [1-A4,5] HP-Y 'set' mean intensity=8.180 stdDev=0.986 CV=0.120 n=5, (Norm.: median intensity)
    CloneID: 1248228, dbEST3': 2279072, GenBankAcc3': AI463183, UniGene: Mm.13859, plate[5,A,5]
    GeneName: Mus musculus ribosomal protein L41 mRNA, complete cds
    h) Multiple HP-XY 'sets' using median normalization for the pseudoarray image display for the HP-X and HP-Y 'sets' when the "Use dual HP-X & HP-Y Pseudoimage" mode is enabled in the "Show Microarray" submenu of the "Plot" menu.
    [1-A4,5] HP-XY 'sets': mn(X,Y)=(3.295,8.180) mnX/mnY=0.402 SD(X,Y)=(1.482,0.986) CV(X,Y)=(0.449,0.120)\
          n(X,Y)=(3,5), (Norm.: median intensity)
    CloneID: 1248228, dbEST3': 2279072, GenBankAcc3': AI463183, UniGene: Mm.13859, plate[5,A,5]
    GeneName: Mus musculus ribosomal protein L41 mRNA, complete cds
    i) Multiple HP-XY 'sets' using median normalization for ratio (HP-X/HP-Y) data for the "Pseudocolor HP-X/HP-Y Ratio or Zdiff" display.
    [1-A4,5] HP-XY 'sets': mn(X,Y)=(3.295,8.180) mnX/mnY=0.402 SD(X,Y)=(1.482,0.986) CV(X,Y)=(0.449,0.120) \
            n(X,Y)=(3,5), (Norm.: median intensity)
    CloneID: 1248228, dbEST3': 2279072, GenBankAcc3': AI463183, UniGene: Mm.13859, platey[5,A,5]
    GeneName: Mus musculus ribosomal protein L41 mRNA, complete cds

    j) Multiple HP-XY 'sets' p-value using median normalization for ratio (HP-X/HP-Y) data for the "Pseudocolor (HP-X,HP-Y) 'sets' p-value display.

    [1-A7,20] HP-XY: mn(X,Y)=(3.449,0.853) (X/Y)(F1,F2,mean)=(4.09,4.008,4.041), (Norm.: median intensity)
    CloneID: 1382656, dbEST5': 1775754, GenBank 5': AI036495, UniGene: Mm.300,  plate[12,A,8]
    GeneName: Carbonic anhydrase 3

    Reporting for Cy3 and Cy5 channels for a single hybridized sample

    k) If you have Cy3/Cy5 data, then you can look at the two channels for a single sample ( the current HP sample). For median normalization and the display set to "Pseudocolor Red(Cy5)-Yellow-Green(Cy3) Cy3/Cy5 data" display.
    [1-A6,11] Cy5/Cy3=0.3588, Cy5=67.324, Cy3=187.622, (Norm.: median intensity)
    CloneID: IMAGE:1054189, 
    GeneName: expressed sequence AW213287

    Reporting HP-X (Cy3 or Cy5) vs HP-Y (Cy3 or Cy5) data for 2 samples

    l) If you want to compare Cy3 or Cy5 in the HP-X sample with a Cy3 or Cy5 value in the HP-Y sample, you do it through the special Cy3,Cy5 scatter plots. There are four types of plots:
    1. HP-X Cy3 vs HP-Y Cy3
    2. HP-X Cy3 vs HP-Y Cy5
    3. HP-X Cy5 vs HP-Y Cy3
    4. HP-X Cy5 vs HP-Y Cy5
    After the plot is started, clicking on a scatter plot will report data from the point in that plot will print the following data as shown in the following example where HP-X Cy3 is plotted against HP-Y Cy3.
    [1-A5,16] intensX=4.695, intensY=5.923, (X-Y)=-1.2275, (Norm.: log median intensity)
    CloneID: IMAGE:963758, 
    GeneName: RIKEN cDNA 2410114O14 gene


    3.4 Selecting subsets of genes using the data Filter

    Genes may be selected on a number of criteria specified by the
    data Filter (Section 2.4.3) that is a cascade of data tests. The first might be the gene class (Section 2.4.1) to restrict the set of all genes to a particular subset. Various numeric and statistical data tests might be applied on the remaining genes to exclude those not meeting these tests. For example, genes having a high coefficient of variation between duplicate spots on the same sample or duplicate samples could be eliminated. Then one could select genes that had a (HP-X/HP-Y) ratio greater than 4.0 but less than 8.0, etc. The latter could be done using either the ratio scrollers or by clicking on that bin in the ratio histogram plot. See the Filter menu options and look at one of the tutorials (Appendix A) for ideas on adjusting the Filter to close in on a particular subset of genes.

    3.5 Selecting subsets of hybridized sample conditions

    Sets and lists of hybridized sample conditions (HP-X and HP-Y sets, HP-E list) may be selected using various commands from the Samples Menu (Section 2.2) including pull-down menus, guessing by name or part of a name, or using a "Chooser" (see Figure 2.2.1) to design your settings for the current (HP-X and HP-Y sets, HP-E list). The Chooser is the easiest way to select these entries. In addition, if you want to change the current HP-X or HP-Y individual sample, you can do this directly from the array pseudoarray image by clicking on the [X] or [Y] part of the image and then selecting the particular sample to use. Note that if the mouse-over checkbox is enabled, then moving the mouse over the sample names gives you the full sample name. Otherwise, the sample name may be truncated.

    3.6 Setting threshold values using the state-scroller sliders

    You may filter genes using a variety of thresholding operations (see the Filter menu (Section 2.4.3) to select any of these). For example, these include a spot intensity (per channel) range [SI1:SI2], gene intensity range [I1:I2], ratio range [R1:R2], Zdiff range [Z1:Z2], Coefficient Of Variation (CV) range [0:1.0], p-value range [0:1.0] for the t-test, etc. Additional threshold scrollers are used with the clustering methods including the number of clusters (default 6), the maximum cluster distance from a gene to a another gene for the latter to be considered in the same cluster, and the absolute difference between HP-X and HP-Y.

    For the intensity and ratio threshold filters, the range interpretation may be inside, or outside the specified range. The ratio range [R1:R2] is between 0.01 and 100.0. The Zdiff range [Z1:Z2] and [CZ1:CZ2] are between -4.0 and +4.0. The intensity threshold range [I1:I2] is set to the dynamic range of the min and max intensity for the current normalization method.

    A list of possible threshold sliders is shown in the following table. When a Filter is enabled that requires a slider, it pops up the State Scrollers window that contains one or more slides. When you disable all filters that use these sliders, the popup window will disappear. The corresponding Ratio R1[R2] or Zdiff Z1[Z2] sliders are used if you are using a ratio or Zscore normalization - and will change if the normalization changes while the filter is active.

    Some of the sliders are implemented with a non-linear scale so that you have more resolution at the low end (eg. p-Value, Spot CV, Diff HP-XY).

    Depending on the set of data Filters selected, there may be multiple sliders present in the State Slider popup window (eg. see Figure 2.4.3).


    Table 3.3.1. List of threshold sliders. Sliders are enabled in the State-Scroller popup window when the corresponding data filters are enabled.

    Slider nameAssociated with operation
    Spot Intensity SI1Filter by spot intensity range per channel
    Spot Intensity SI2Filter by spot intensity range per channel
    Percent SI OKFilter by percent of spots whose spot intensity
    is in threshold range criteria meets the AT LEAST or AT MOST criteria
    Intensity I1Filter by gene intensity range
    Intensity I2Filter by gene intensity range
    Ratio R1Filter by gene X/Y ratio range
    Ratio R2Filter by gene X/Y ratio range
    Zdiff Z1Filter by gene X-Y Zdiff range
    Zdiff Z2Filter by gene X-Y Zdiff range
    Ratio CR1Filter by Cy3/Cy5 gene X/Y ratio range
    Ratio CR2Filter by Cy3/Cy5 gene X/Y ratio range
    Zdiff CZ1Filter by gene (Cy3-Cy5) X-Y Zdiff range
    Zdiff CZ2Filter by gene (Cy3-Cy5) X-Y Zdiff range
    p-ValueFilter by t-Test
    Spot CVFilter by Coefficient of Variation
    Cluster DistancePlot - cluster by expression similarity
    # of ClustersPlot - K-means clustering
    Diff HP-XYFilter by absolute difference (HP-X,HP-Y)
    Spot QualityFilter by continuous spot quality (If data available)

    3.7 Exporting report and plot data

    Data is typically reported in MAExplorer in report and plot windows. These may be saved using cut and paste if your are using MAExplorer as an Applet or with "SaveAs" buttons on the popup windows if you are running it as an stand-alone application . Reports are then saved as text (.txt extension) files, and plots are saved as GIF (.gif extension) files.

    If you are running on a windowing system supporting cut and paste, then you may cut and paste data from reports and plots into applications on your system that allow you to save or print this data. Set the Report menu table-format to "Tab-delimited". Then, in Windows 95/98/NT/2000/XP, cut data from the popup tables (or other text reports) and paste it into Microsoft Excel. In Windows, you can capture (i.e. "cut") the entire screen by pressing the "Prt Sc" or print screen button. To capture a specific window (e.g. a scatter plot), hold the "Alt" key when pressing the "Prt Sc" key. Then go into a Windows imaging application (such as PhotoShop) and paste it into the application. In PhotoShop, in the File menu, select New (or type Control/N). Then when the window is opened, click on the window and paste the MAExplorer screen you had cut into the image window by typing Control/V. In both Excel and PhotoShop you may print the data or save it in a file.