Cvt2Mae Appendix

A. Cvt2Mae User's Guide
B. Generation of a pseudoarray geometry if no array geometry is specified
C. Examples of some typical input files

A. Cvt2Mae User's Guide

This section describes in detail how to use Cvt2Mae's sequential, step-by-step process to convert your data to the MAExplorer format.

Step 1 Select Array Chip (Array Layout)

Cvt2Mae has several predefined arrays (Affymetrix, GenePix, and Scanalyze) available from a pull-down menu. These may be edited or customized and saved under a different name. This allows an array layout to be set up once in a particular format and then reused to convert data many times.

You can also create, edit, and save your own custom Array Layouts using the <User-defined> Array Layout. Custom arrays that you define can be saved and will appear in this pull-down menu the next time you run Cvt2Mae. User-defined Array Layouts are saved as text files in a subdirectory called "ArrayLayout", under the name used in the Edit Layout menu with the extension ".alo". You may also remove array layouts using the "Remove Layout" button, which pops up a file dialog from which you can choose an Array Layout to delete. Downloading and installing newer versions of Cvt2Mae will not wipe out your custom array layouts.
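As a quick way to see which custom layouts are currently saved, the ".alo" files can be listed directly. The following is a minimal Python sketch, not part of Cvt2Mae; the function name `list_saved_layouts` is hypothetical, and it assumes the "ArrayLayout" subdirectory sits directly under the directory you pass in.

```python
from pathlib import Path


def list_saved_layouts(base_dir):
    """Return the names of the layouts saved as <base_dir>/ArrayLayout/*.alo.

    Assumption: "ArrayLayout" is a direct subdirectory of base_dir.
    """
    layout_dir = Path(base_dir) / "ArrayLayout"
    if not layout_dir.is_dir():
        return []
    # The layout name is the file name without the ".alo" extension.
    return sorted(path.stem for path in layout_dir.glob("*.alo"))


# Example: list layouts under the current working directory.
print(list_saved_layouts("."))
```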

Step 2 Select input file(s)

Depending on your data, there may be several variations in how the data is organized. For example, some array applications save multiple files for each sample, while others store multiple samples in one file. Sometimes the gene-in-plate-order (GIPO) data is available as a separate GIPO file. Once the files have been selected, go to Step 3 below.

2.1 Selecting a single data file

Click on the "Browse input file name" button to pop up a file selection window, then select the data file for conversion. The file name(s) will then appear in a text area below the "Browse input file name" button. Some arrays (e.g. Affymetrix) may have data for multiple samples in one file.

2.2 Selecting multiple files

You can convert multiple files. Repeatedly select files with the "Browse input file name" button. Each time you select a file, it will appear in a list below the "Browse input file name" button. Some data will consist of multiple files, such as a separate GIPO file.

W.1 Array Layout Name and Vendor

This is used as the file name for the layout and should be a unique designator for the array layout. It is generally specified by the chip vendor. If it is your own chip, use your own designator to differentiate your chip designs. You may also choose the name of the chip vendor. If you are specifying a <User-defined> chip, you can use whatever you wish, e.g. your organization.

W.2 Grid Geometry

These are the parameters for the orientation of the spots in the pseudoarray displayed in the main window of MAExplorer. A field is made up of grids, and grids are made up of rows and columns of spots.

Number of duplicated spot Fields in array
Some arrays have multiple or duplicate fields. This is the number of duplicated spot fields in the array, in which EVERY spot is duplicated; for example, each grid of spots is duplicated. We refer to these in MAExplorer as F1 and F2. If there are no duplicates, then there is 1 field.
Number of Grids per Field
A grid contains Grid Rows X Grid Columns of spots.
Number of spots per Grid Row
Number of Spots per Grid Column
Use Mol. Dynamics 'NAME-GRC' else (Grid, Row, Column)
Use the Molecular Dynamics 'NAME-GRC' specification for (grid, row, column); otherwise use separate fields for (grid, grid_row, grid_col).
Specify array layout by Grid-geometry or by number spots per array
If you specify the array layout by Grid-geometry (ABOVE), then enter (#Fields, #Grids, #Grid-rows, #Grid-cols). If you specify the layout by the maximum number of spots in the array (BELOW), Cvt2Mae will estimate a pseudo-layout into which the spots will fit, for visualization purposes. This pseudo-layout does not correspond to the actual array layout, which you then do not have to enter.
Maximum number of spots in array
This is the maximum number of spots that may occur in your data.
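
The two ways of specifying the layout are related by a simple product: a geometry of (#Fields, #Grids, #Grid-rows, #Grid-cols) implies a fixed number of spots. The Python sketch below illustrates that relationship; it is not Cvt2Mae code, and the function and parameter names are hypothetical.

```python
def total_spots(n_fields, n_grids, n_grid_rows, n_grid_cols):
    """Return the number of spots implied by a grid geometry."""
    geometry = {
        "n_fields": n_fields,
        "n_grids": n_grids,
        "n_grid_rows": n_grid_rows,
        "n_grid_cols": n_grid_cols,
    }
    for name, value in geometry.items():
        if value < 1:
            raise ValueError(f"{name} must be at least 1, got {value}")
    return n_fields * n_grids * n_grid_rows * n_grid_cols


# Hypothetical array: 10 grids of 28 rows by 25 columns.
print(total_spots(1, 10, 28, 25))  # single field
print(total_spots(2, 10, 28, 25))  # every spot duplicated in a second field
```

If your data has more rows than this product, the geometry entered is too small for the data.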

W.3 Input File Starting Rows Data

Row containing a list of sample names
Number of the row containing the names of the multiple samples if the file contains multiple samples. (Row #s start at row 1.) If there are no sample names, set it to 0 or leave blank. If you change it from 0 to any positive row number, it removes input files and re-reads the first file to get the proper data field names so you may use it with the Assign GIPO Fields and Assign Quant Field assignment operations.
Row containing a list of quantitative file names
Number of the row that contains the data file field names.
First row containing quantitative file data
Number of the first row that contains quantitative array data in the file. It is assumed that this is followed by the rest of the array data.
First row containing optional separate GIPO file data
Number of the first row that contains array data in the optional separate GIPO file.
Optional comment
Leave it blank if there are no comment lines.
Initial keyword for each data row
If you specify this, Cvt2Mae checks for it in each data row.
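
To make the row settings concrete, the Python sketch below splits a text table into field names and data rows using 1-based row numbers, as the wizard does. It is only an illustration under stated assumptions, not the Cvt2Mae reader: the function name `read_table` is hypothetical, the file is assumed to be tab-delimited, and rows that do not start with the optional keyword are assumed to be skipped.

```python
def read_table(lines, field_names_row, first_data_row, row_keyword=None, sep="\t"):
    """Split a delimited text table into (field_names, data_rows).

    Row numbers are 1-based, so row 1 is lines[0].
    """
    if not 1 <= field_names_row < first_data_row:
        raise ValueError("the field names row must come before the first data row")
    field_names = lines[field_names_row - 1].rstrip("\n").split(sep)
    data_rows = []
    for line in lines[first_data_row - 1:]:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Assumption: a row lacking the keyword is not a data row.
        if row_keyword is not None and not line.startswith(row_keyword):
            continue
        data_rows.append(dict(zip(field_names, line.split(sep))))
    return field_names, data_rows


sample = [
    "Example export",  # row 1: hypothetical title line
    "grid\tgrid col\tgrid row\tRawIntensity\tBackground",  # row 2: field names
    "1\t1\t1\t2226.8\t32.6",  # row 3: first data row
    "1\t1\t2\t1234.8\t25.6",
]
names, rows = read_table(sample, field_names_row=2, first_data_row=3)
print(names)
print(rows[0]["RawIntensity"])
```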

W.4 Ratio Fluorescence Data

Ratio or Intensity data
Data for MAExplorer is either ratio data such as Cy3/Cy5, or intensity data such as P33, etc.
If Ratio data use (Cy5/Cy3) else (Cy3/Cy5)
Ratio data may be presented as either (Cy5/Cy3) or (Cy3/Cy5).
Fluorescent dye for intensity 1 if (ratio data)
The dye to associate with quantified data intensity 1.
Fluorescent dye for intensity 2 if (ratio data)
The dye to associate with quantified data intensity 2.
Have background intensity data
The input data file includes background intensity data that you want to include. You do NOT have to include that data.
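
The effect of the ratio orientation option can be illustrated with a short Python sketch. It is not Cvt2Mae code; the function name `spot_ratio` and the flag `cy5_over_cy3` are hypothetical, and background values are deliberately left out because this guide does not say how they are applied.

```python
def spot_ratio(intensity_cy3, intensity_cy5, cy5_over_cy3=False):
    """Return Cy3/Cy5 by default, or Cy5/Cy3 when cy5_over_cy3 is True."""
    if cy5_over_cy3:
        numerator, denominator = intensity_cy5, intensity_cy3
    else:
        numerator, denominator = intensity_cy3, intensity_cy5
    if denominator == 0:
        raise ValueError("denominator intensity is zero; the ratio is undefined")
    return numerator / denominator


# Intensities taken from the first row of the Cy3/Cy5 example in section C.
cy3, cy5 = 2226.8, 2345.9
print(spot_ratio(cy3, cy5))
print(spot_ratio(cy3, cy5, cy5_over_cy3=True))
```

The two orientations are reciprocals of each other, so choosing the wrong one inverts every ratio in the converted database.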

W.5 Optional Microarray Coordinates

Use microarray pseudo (X,Y) coordinates
Generate a microarray pseudo image using a representation of the array based on Grids, Grid Rows, and Grid Columns. Otherwise, use the (X,Y) data supplied for each spot, if it exists. If this option is set, it will override the actual (X,Y) coordinates even if that option is selected as well.
Use actual microarray pseudo (X,Y) coordinates
The actual (X,Y) coordinate data exists for each spot. If the data exists but you do NOT select this option, it will use the pseudo-array option.
Reuse (X,Y) coordinates of first sample for all samples
Reuse (X,Y) coordinates of first sample for all samples. This is used if you want to 'Flicker' array pseudo images between two samples.
Swap microarray rows and columns
Reverse rows and columns in the microarray pseudo image.

W.6 Optional Genomic Identifiers

Has Location data
The user data file has Location identifier data. These could be 'probe_set' for Affymetrix, 'Incyte ID' for Incyte, etc. and are used as the gene identifier if there are no other IDs.
Has Clone ID data
The user data file has I.M.A.G.E 'Clone ID' data.
Has GenBank data
The user data file has 'GenBank' identifier data. See http://ncbi.nlm.nih.gov/ for more information.
Has UniGene ID data
The user data file has 'UniGene' identifier data.
Has dbEST data
The user data file has 'dbEST' identifier data.
Has LocusLink data
The user data file has 'LocusID' identifier data.
Has SwissProt data
The user data file has 'SwissProt' identifier data. See http://www.expasy.ch/ for more information.
Has Plate data
The user data file has user Plate well identifier data. This uniquely identifies the source of the spotted clone.
Get Genomic IDs from 'Description'
The genomic IDs are encoded in the 'Description' field of the user input file (the Affymetrix encoding of IDs). Cvt2Mae will find and generate genomic IDs for: Clone_ID if /cl=XXXX is in the Description, GenBank if /gb=XXXX is in the Description, and UniGene if /ug=XXXX is in the Description. If this switch is enabled, then the explicit ID options are disabled.
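
The tag scheme described above can be illustrated with a small Python sketch. It is not the parser used by Cvt2Mae; the function name `ids_from_description` is hypothetical, the sample description and its ID values are made up, and the sketch assumes each ID value ends at the next whitespace character.

```python
import re

# Tags named in the text above, mapped to the ID type each one encodes.
TAG_TO_ID = {"cl": "Clone_ID", "gb": "GenBank", "ug": "UniGene"}

# Assumption: the value after "=" runs up to the next whitespace character.
_TAG_RE = re.compile(r"/(cl|gb|ug)=(\S+)")


def ids_from_description(description):
    """Return a dict of the genomic IDs found in a Description string."""
    return {TAG_TO_ID[tag]: value for tag, value in _TAG_RE.findall(description)}


# Hypothetical Description text with placeholder ID values.
description = "hypothetical gene /cl=12345 /gb=AB000001 /ug=Hs.0001 /len=500"
print(ids_from_description(description))
```

Tags other than /cl=, /gb=, and /ug= (such as the /len= placeholder here) are ignored.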

W.7 Optional Gene Names (descriptions)

Has Gene Class user data
The user data file has Gene Class ontology data for each gene. [FUTURE]
Has UniGene Name user data
The user data file has UniGene Name data. This could be used if the default 'GeneName' description is not available.
Has separate per-spot QualCheck user data per-sample
The user data file has 'Quant' QualCheck data. This data is on a per-spot basis for each array hybridization. The code (see MAExplorer Reference Manual Appendix C Table C.4.2) may be used to flag bad spots or missing spot data.
Has 'GIPO' QualCheck user data for entire DB
The user data file has GIPO QualCheck data. This data is on a GIPO basis for the entire database. The code (see MAExplorer Reference Manual Appendix C) may be used to flag bad gene data.

W.8 Optional DNA Calibration and user plate names, UniGene species name

Name of calibration DNA (if in database)
This is the default name given to calibration DNA spotted on the array for calibration purposes and indicated in the GIPO file 'Clone ID' or 'GenBank ID' field.
Name of researcher's special clones (if in database)
This is the default name given in place of an I.M.A.G.E. Clone ID when the researcher's clones have not yet been placed in the I.M.A.G.E. repository and thus have no ID. It is indicated in the GIPO file 'Clone ID' or 'GenBank ID' field.
Name of empty wells
If you have empty or blank rows of spot data, type in the name used for them (e.g. 'empty'), if any.
Name species (opt)
Species name (Mouse, Human, ...). It is used to document the Array Layout.
Name UniGene Species prefix (opt)
UniGene species prefix (Mouse Mm, Human Hs, etc.). This is used in querying Genomic Web databases. If you do not see the prefix you want in the choice menu, type it in.

W.9 Optional Database and Data Quantification program

Your name of the created database (opt)
The name you want to give to the created database.
Your name of the database subset (opt)
Your name for the database subset used in the initial .mae startup file.
Generic project name for all samples (opt)
Generic name of the project to be used for all samples in the database. If no name is specified, it uses the input data files folder.
Name of spot quantification program (opt)
Name of the program used to quantitate the spot data from the sample images.

W.10 Optional Hybridized Sample "set" class names

Default name of X samples 'set'
This is the name for the samples assigned to the 'X set'.
Default name of Y samples 'set'
This is the name for the samples assigned to the 'Y set'.

W.11 Optional Default Data Filtering Thresholds

Default cluster similarity threshold [0 : 1000]
Default cluster similarity threshold used in some of the clustering methods. This is the initial value shown in popup sliders.
Default # genes in highest/lowest
The number of genes reported in gene Reports or in the data Filter when this restriction is invoked.
Default # clusters for K-means clustering [1 : 1000]
Default # of clusters used in the K-means clustering method. This is the initial value shown in popup sliders.
Default p-value threshold (for t-tests) [0.0 : 1.0]
Default p-Value used in the t-Test data Filter. This is the initial value shown in popup sliders.
Default Coeff. Of Variation threshold [0.0 : 1.0]
Default Coefficient Of Variation used in the data Filters. This is the initial value shown in popup sliders.
Default absolute difference threshold [0.0 : 4.0]
Default absolute difference threshold used in the data Filters. This is the initial value shown in popup sliders.
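
The bracketed limits above can be checked before a value is entered. The Python sketch below is an illustration only; the dictionary keys and the function name `check_threshold` are hypothetical, and the limits are assumed to be inclusive at both ends.

```python
# Ranges copied from the bracketed limits in the list above.
# The "# genes in highest/lowest" setting is omitted because no range is given for it.
THRESHOLD_RANGES = {
    "cluster_similarity": (0, 1000),
    "n_kmeans_clusters": (1, 1000),
    "p_value": (0.0, 1.0),
    "coeff_of_variation": (0.0, 1.0),
    "abs_difference": (0.0, 4.0),
}


def check_threshold(name, value):
    """Return value if it lies inside the allowed range, else raise ValueError."""
    low, high = THRESHOLD_RANGES[name]
    # Assumption: both end points of the range are allowed.
    if not low <= value <= high:
        raise ValueError(f"{name}={value} is outside [{low} : {high}]")
    return value


print(check_threshold("p_value", 0.05))
```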

3.2 "Assign GIPO Fields" Button

MAExplorer extracts data from the gene-in-plate-order (GIPO) gene coordinate table. This table links spots in a microarray to genomic "gene IDs" and gene names. It may also contain Clone ID, GenBank, dbEST, UniGene, and LocusID identifiers corresponding to these Master Gene IDs. An optional table of Clone IDs and the Gene Classes each gene belongs to may also be defined. The "Assign GIPO Fields" button allows you to associate your fields with the ones MAExplorer uses, since yours may have different names. A detailed description of the MAExplorer GIPO file can be found in the MAExplorer manual under Appendix C.

Often GIPO files supplied by array vendors have additional fields not currently used by MAExplorer. You can leave them in (they will be ignored) or take them out (loading a database is faster).

3.3 "Assign Quant Fields" Button

Quantified hybridized sample array spot data (Quant files) from each array is put into a separate data file. Essentially, each row of a data file contains one or more spot intensity values for a gene. A spot location is specified by a GIPO (field#, grid#, grid column#, grid row#) 4-tuple, with the field value optional. If the field specification is omitted and there are duplicate spots in multiple fields of grids, then the field is defined implicitly. In that case, the spot intensity data for each field of a gene is specified in separate columns going from left to right. As with the GIPO fields, the "Assign Quant Fields" button allows you to associate your field names with the ones MAExplorer uses, since yours may differ. A detailed description of the MAExplorer Quant files can be found in the MAExplorer manual under Appendix C.
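
The implicit field convention can be sketched in Python as follows. This illustrates the left-to-right rule described above and is not Cvt2Mae code; the function name `expand_fields` is hypothetical, and the column names (RawIntensity1, Background1, and so on) follow the example tables in section C.

```python
def expand_fields(row, n_fields):
    """Turn one row with per-field columns into one record per field.

    The k-th intensity/background column pair, counting from the left,
    is taken to belong to field k.
    """
    records = []
    for field in range(1, n_fields + 1):
        records.append({
            "field": field,
            "grid": row["grid"],
            "grid col": row["grid col"],
            "grid row": row["grid row"],
            "RawIntensity": row[f"RawIntensity{field}"],
            "Background": row[f"Background{field}"],
        })
    return records


# First data row of the two-field example in section C.
row = {
    "grid": 1, "grid col": 1, "grid row": 1,
    "RawIntensity1": 2226.8, "Background1": 32.6,
    "RawIntensity2": 2345.9, "Background2": 39.4,
}
for record in expand_fields(row, n_fields=2):
    print(record)
```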

Step 4 Choose output folder/directory

This allows you to pick the location of where the converted data will be stored. We recommend using the first one, "Create New project Folder".

A. "Create New project Folder": This will create a new project folder anywhere you want.
B. "Merge with Existing Project Folder": This will put the new files into an existing folder. This will not merge data.
C. "Use Input folder for output files": Use the current input project directory.

Step 5 Convert Data

Once you have set up all of the essential parameters, you can convert your data by clicking on the green "Run" button. The status lines will show what the converter is doing. Once it is done, the red "Abort" button will change into a green "Done" button, which you press to exit the program. This means that your data has been successfully converted, and you can now go to the MAE folder and click on the Startup.mae file to run MAExplorer on your converted data.

B. Generation of a pseudoarray geometry if no array geometry is specified

MAExplorer requires that the data in the GIPO and Quant files be specified by spot position. This is indicated by the array spot geometry of (#fields, #grids, #rows/grid, #columns/grid). The #fields value is the number of duplicated sets of grids if available; it is 1 otherwise. This 4-tuple must be specified in the Configuration file. However, some array data does not have spot geometry position data available. The alternative is to generate a pseudoarray geometry. This is possible since the pseudoarray image in MAExplorer is used simply to indicate success of the data filter or relative differences, depending on the "(Plot | Show Microarray)" menu option. The algorithm presented below generates a geometry (nGrids, nGridRows, nGridCols) that is compatible with the visual use of the pseudoarray. Its only input is nRowsExpected, the number of spots in the microarray (rows in the database input file). The number of spots in the array is computed automatically, and the option to use the pseudoarray instead of the actual array geometry is selected in the Edit Layout Wizard for Grid Geometry.

Pseudoarray Geometry Algorithm
  OPT_GRID_SIZE = 1200;                  /* Optimal grid size for MAExplorer viewing */
  ROWS_TO_COLS_ASPECT_RATIO = 3.0/4.0;   /* desired rows/cols aspect ratio for a grid */
  n = nRowsExpected;                     /* # of spots (rows in the input file) */
  extra = 0;                             /* # of extra grid cols required */

  /* Estimate # of grids. Assume a square aspect ratio */
  if(n <= OPT_GRID_SIZE)
    nGrids = 1;
  else
    nGrids = (n / OPT_GRID_SIZE)+1;

  /* Estimate rows (r) and columns (c) from a rectangular grid
   * where cols = (4/3) rows.
   * Then, c = (4/3)r and r*c = area.
   * Then (4/3)*r*r = area or
   * r = sqrt((3/4)*area).
   */
  if(nRowsExpected > 0)
    while(true)
    { /* iterate to optimal size */
      gridSize = n/nGrids;
      nGridRows = sqrt( ROWS_TO_COLS_ASPECT_RATIO * gridSize );
      nGridCols = (nGridRows / ROWS_TO_COLS_ASPECT_RATIO);
      nGridCols += extra;
      estTotSize = (nGrids * nGridRows * nGridCols);
      if(estTotSize > nRowsExpected)
        break;
      else
        extra++;  /* keep trying until meet criteria */
    } /* iterate to optimal size */
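
For readers who want to try the algorithm, here is a Python sketch of the same steps. It is a port written for illustration and may not match the Cvt2Mae source exactly: the function name `pseudo_geometry` is hypothetical, integer truncation of the row and column counts is assumed, and a guard that forces at least one grid row is added (it is not in the listing above) so that very small inputs terminate.

```python
import math

OPT_GRID_SIZE = 1200                   # optimal grid size for viewing
ROWS_TO_COLS_ASPECT_RATIO = 3.0 / 4.0  # desired rows/cols ratio for a grid


def pseudo_geometry(n_rows_expected):
    """Return (n_grids, n_grid_rows, n_grid_cols) able to hold the spots."""
    if n_rows_expected <= 0:
        raise ValueError("n_rows_expected must be positive")
    n = n_rows_expected

    # Estimate the number of grids.
    n_grids = 1 if n <= OPT_GRID_SIZE else n // OPT_GRID_SIZE + 1

    # From c = (4/3) r and r * c = area, r = sqrt((3/4) * area).
    grid_size = n // n_grids
    n_grid_rows = int(math.sqrt(ROWS_TO_COLS_ASPECT_RATIO * grid_size))
    n_grid_rows = max(1, n_grid_rows)  # added guard, not in the original listing
    n_grid_cols = int(n_grid_rows / ROWS_TO_COLS_ASPECT_RATIO)

    # Add extra columns until the layout holds more than n spots
    # (the same strict ">" test as the listing above).
    while n_grids * n_grid_rows * n_grid_cols <= n:
        n_grid_cols += 1
    return n_grids, n_grid_rows, n_grid_cols


print(pseudo_geometry(7000))
```

Because the loop only ever adds columns, the resulting grids can be somewhat wider than the 3:4 target ratio.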

C. Examples of some typical input files

A typical sample database table might look like:

Single spot/gene intensity data.

  grid    grid col    grid row   RawIntensity  Background
  1       1           1          2226.8        32.6
  1       1           2          1234.8        25.6
      . . .
  10      25          28         3333.8        23.6

Double spots/gene intensity data contained in two fields of duplicate spots, with the field implied by the left-to-right order of the intensity columns.

  grid   grid col  grid row  RawIntensity1 Background1 RawIntensity2 Background2
  1      1         1         2226.8        32.6        2345.9        39.4
  1      1         2         1234.8        25.6        1245.9        39.4
      . . .
  10     25        28        3333.8        23.6        3345.9        25.4

Double spots/gene intensity data contained in two fields of duplicate spots, with the field given explicitly in a separate column.

  field grid   grid col  grid row  RawIntensity  Background
  1     1      1         1         2226.8        32.6
  1     1      1         2         1234.8        25.6
      . . .
  1     10     25        28        3333.8        23.6
      . . .
  2     1      1         1         2226.8        39.4
  2     1      1         2         1234.8        39.4
      . . .
  2     10     25        28        3333.8        25.4

Double spots/gene intensity data using the Molecular Dynamics' NAME_GRC notation.

  NAME_GRC      RawIntensity1   RawIntensity2
  GRID- 1-R1C1  2126.500        3662.350
  GRID- 1-R2C1  2311.430        3306.290
  GRID- 1-R3C1  3696.470        5780.310
  GRID- 1-R4C1  3167.450        5245.440
  . . .
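
A NAME_GRC value of the form shown above can be split into its (grid, row, column) parts with a regular expression. The Python sketch below is based only on the four sample values in this table, so the exact format accepted by Cvt2Mae may differ; the function name `parse_name_grc` is hypothetical.

```python
import re

# Pattern inferred from the sample values, e.g. "GRID- 1-R1C1".
# Optional spaces are allowed between "GRID-" and the grid number.
_NAME_GRC_RE = re.compile(r"^GRID-\s*(\d+)-R(\d+)C(\d+)$")


def parse_name_grc(name):
    """Return (grid, row, col) as integers from a NAME_GRC string."""
    match = _NAME_GRC_RE.match(name.strip())
    if match is None:
        raise ValueError(f"not a recognized NAME_GRC value: {name!r}")
    grid, row, col = (int(part) for part in match.groups())
    return grid, row, col


print(parse_name_grc("GRID- 1-R3C1"))
```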

Cy3/Cy5 spot/gene ratio data.

  grid   grid col  grid row  Cy3     Cy3Bkgd   Cy5      Cy5Bkgd
  1      1         1         2226.8  32.6      2345.9   39.4
  1      1         2         1234.8  25.6      1245.9   39.4
      . . .
  10     25        28        3333.8  23.6      3345.9   25.4
