NIA Array Analysis Tool



General Description of NIA Array Analysis tool

Terms of use

What can you do with the NIA Array Analysis tool?

  1. Evaluate the statistical significance of differential gene expression based on microarray data
  2. Make log-ratio plots and scatter-plots
  3. Find clusters of tissues with similar expression patterns and identify specific genes for each cluster
  4. Find major patterns of variability in gene expression using Principal Component Analysis (PCA) and biplot
  5. Cluster genes according to their contribution to principal components
  6. Find genes whose expression matches a given pattern (pattern matching)
  7. Plot the dendrogram for replications to check abnormal arrays
  8. Plot the error function (SD vs. expression level)
  9. Make a correlation matrix
  10. Normalize input data
  11. Import principal components from an earlier analysis to overlay 2 sets of results
  12. Save results of analysis for personal or public access

What are advantages of the NIA Array Analysis tool?

  1. Outlier detection and handling
  2. A set of error models for reducing the proportion of false positives
  3. Direct use of False Discovery Rates (FDR) for evaluation of significance
  4. Singular value decomposition (PCA-biplot) for visualization of linkages between genes and tissues
  5. High-quality interactive 3-D graphics based on public-domain VRML viewers
  6. Import principal components
  7. Optional permutation test
  8. Server-based software (speed does not depend on the power of your computer)
You are welcome to use our server to analyze your data. However, if you plan an extensive use of it (in terms of disk space and computation time) then we reecommend to install the NIA Array Analysis software on your own server (Windows, Unix, or Linux).
Installation instructions are available in the readme.txt file.

Components of the NIA Array Analysis tool

  1. Input data format
  2. Arrayjoin tool for compiling multiple files
  3. Filtering and adjustment of data
  4. Gene annotations
  5. ANOVA
  6. Hierarchical clustering
  7. Principal Component Analysis (PCA/SVD)
  8. Pattern matching
  9. Gene list analysis (GO, domains, location)
  10. Glossary

1. Input data format

The input file is a tab-delimited text. You can assemble it in Excel and then save as a tab-delimited text file. The first row consists of column headers. The first column has gene IDs which can be numerical or textual. If you use split-normalization, then the second column should be indexes (numerical) of arrays or portions of arrays to be normalized separately; in this case your data start from the third column. If you do not use split-normalization, then your data starts from the second column. You can normalize data yourself before uploading it or you can use a normalization tool that is a part of our software. If your input data are log-transformed, then select the appropriate log-transformation type (Log10, Log2, or Loge) before loading the file. Missing values are coded as -9999.

For 1-color arrays, each column represents an array. For 2-color arrays, a pair of adjacent columns represents an array. The order of columns should match to the order of tissue/experiment names that you specify in the input form. First you put data from 1-st tissue/experiment (1st replication then 2nd replication, ...), then 2-nd tissue/experiment, and so on. If some replications have multiple subreplications, then subreplications should be placed together one after another.

Below are examples of input file formats. You can use column headers to automatically generate the experiment design (see details in examples).

(1) Compare 2 tissues with 1-color arrays. Several replicates are needed (we suggest using 3 biological replicates); some replicates may have technical subreplicates.
(2) Compare multiple tissues with 1-color arrays. At least some tissues should have independent biological replicates; technical subreplicates are possible.
(3) Compare 2 tissues with 2-color arrays. Several replicates are needed. Dye swap is recommended as technical subreplicates.
(4) Compare multiple tissues that are all hybridized with universal reference on 2-color arrays. At least some tissues should have independent biological replicates; technical subreplicates are possible. Dye swap is not needed.
(5) Visualization of a data set with no replications

If the software did not recognize your experiment design or if the generated design is wrong, you can always correct it manually by entering tissue/experiment names and the number of replications/subreplications in the form.

2. Arrayjoin tool for compiling multiple files

Arrayjoin Tool is designed for compiling an input file from multiple scanner files. Scanner files should be tab-delimited text files. The software also checks the error for mean values and can eliminate data with high error. This software is available only for Windows. However you can compile it yourself for other OS.

To use the software you need to make a tab-delimited parameter file (name it as convenient for you). Two columns are required, and 2 additional columns are optional.

Column 1 indicates a dye-swap (0=no dye swap, 1=dye swap). If your array has only one color (or radioactive label) then put 0. For example, if you compare embryo and placenta and selected green color for embryo and red color for placenta as a normal design, then a dye swap would be red color for placenta and green color for embryo.
Column 2 has file names. We recommend writing full file names that include the drive, path, and file name.
Column 3 (optional) has tissue/experiment names. If no tissue name provided then column headers will be taken from file names.
Column 4 (optional) has replication numbers. Arrays with same tissue and same replication numbers will be considered sub-replications (technical replications).
The order of files in the parameter file should be the same as required by the NIA Array Analysis tool. Do not save the parameter file to your desktop! Put it either into the directory of the program or any other directory.

Example of parameter file with dye swap:
0C:\Projects\Array\embryo_placenta064.txtembryo1
1C:\Projects\Array\embryo_placenta065.txtembryo1
0C:\Projects\Array\embryo_placenta066.txtembryo2
1C:\Projects\Array\embryo_placenta067.txtembryo2

Example of parameter file without dye swap:
0C:\Projects\Array\embryo074.txtembryo1
0C:\Projects\Array\embryo075.txtembryo2
0C:\Projects\Array\placenta074.txtplacenta1
0C:\Projects\Array\placenta075.txtplacenta2

The program can be configured to read any tab-delimited input files. It can skip a specific number of header lines and then extract specific data columns. To configure the program correctly you need to execute it from the command line. Parameters in the command line override default settings which correspond to the file format generated by Agilent scanner. The syntax of command is:

arrayjoin.exe [-i input][-o output][-a aliasFile][-t1 error/mean][-t2 error/avr.error][-d color#][-r rowsSkip][-h headerRow][-m meanCols][-e errorCols][-id idCol][-c col,min,max]

(1) input = infput parameter file
(2) output = output file
(3) aliasFile = file with alias names for oligos. We include the file with alias names for Agilent oligos (oligo-alias.txt). The first column is a unique oligo name, and the second column is a comma-separated list of aliases.
(4) error/mean = threshold for acceptable (Error/Mean) ratio for a gene. This parameter is used only if the scanner provides an error value for gene intensity. If the ratio Error/Mean is below the threshold, then gene intensity is directly taken from the file. If it is above the threshold, then the value will be either replaced by a missing value (-9999) or by a surrogate value (explained in the following paragraph). The default value =0.6. If you wish to take all values directly from the file, then make this threshold high (e.g, 10 is high enough).
(5) error/avr.error = threshold for (gene error/expected error for this intensity) This parameter is needed to determine how to handle a gene that has Error/Mean ratio above the threshold specified in the paragraph above. First, the computer estimates the average error for genes with specific intensity level (error may change depending on gene average intensity). Then, the actual error is compared with this expected error (i.e., average error). If the ratio (actual error/expected error) is above this threshold then the mean value is replaced by a missing value (-9999). If the ratio is below the threshold but the mean is greater than the expected error value, then the mean value is left unchanged. Finally, if the mean value is below the error value, then it is replaced by a surrogate value that is equal to the expected error. Default value = 3.
(6) color# = number of colors in the array Use 1 for 1-color array and 2 for 2-color arrays
(7) rowsSkip = Number of header lines to skip The text file generated by a scanner usually include several header lines (8) headerRow = row number that contains column descriptions. The default value = 10 because the current version of Agilent output files has headers in the 10th row. Headers are used to find correct output columns. The following options override these column numbers
(9) meanCols = column number for means Specify the column number for mean intensity of a gene. If scanner software includes normalization or color balance, then use normalized (or processed) means. For 1-color array specify just one number. For 2-color arrays put 2 numbers separated by a comma.
(10) errorCols = column number for mean errors (use -1 if no errors provided) Specify the column number for error of mean intensity of a gene. If scanner software includes normalization or color balance, then use normalized (or processed) errors. For 1-color array specify just one number. For 2-color arrays put 2 numbers separated by a comma.
(11) idCol = column number for oligo ID. All arrays should have the same order of oligo (genes or clones) ID in the file. Specify the column number with oligo ID. If the alias file is specified then oligo ID will be considered an alias and replaced by the name specified in that file. If several features have the same ID) then the value is averaged for all features with the same ID.
(12) col = column number for condition that a row is included into output (use -1 if no condition). Some arrays use control features for normalization. It is often convenient to remove these features from output. If the value is in the range of (min, max), then it is included in the output file. Column number, min, and max should be comma-separated with no space.

Launch the program arrayjoin.exe, it will prompt you to enter the name of the input parameter file (if you have not specified it in the command). Enter the name of parameter file. If the parameter file is located in the same directory as the program (i.e., arrayjoin.exe) then you don't need to specify the path (directory) to the parameter file. If the parameter file is located elsewhere you need to specify the full file name that includes the path. Then the program prompts for the name of output file (if you have not specified it in the command). If you cannot find the output file in the program folder, specify the full path for the output file.

3. Filtering and adjustment of data

Data normalization option is available at the file upload step. Normalization is needed for 1-color arrays. Although the same method can be used for 2-color arrays, we never tested it with 2-color arrays. Most companies that produce 2-color arrays also provide software for image analysis and data normalization. It is better to use commercial software for normalization of 2-color arrays than our tool. Our normalization tool implements the non-parametric method that equalizes multiple quantiles of the probability distribution of gene expression using a piece-linear function that converts actual quantiles in each column into target quantiles (see details).

Split-normalization means that normalization is done for a subset of genes at a time. The major reason for using split-normalization is that all genes do not fit on one array. In this case the full set of genes may be represented by 2 or more arrays that require independent normalization. For split-normalization the input file must have array indexes in the second column.

The NIA Array Analysis software starts the pre-processing of data with log-transformation. Negative data are treated as 0.0000000001. User-defined cutoffs can be assigned to filter and adjust the data. If a data value is less than the minimum cutoff, then it is replaced by the minimum cutoff value. This adjustment may artificially lower the error variance for low-expressed genes. To avoid this effect, the software adjusts the averaged error variance for genes with average intensity within 2*SD from the minimum cutoff, by not letting it decrease as the average intensity decreases. The maximum cutoff simply ignores genes with the average intensity exceeding the cutoff value.

In the case of two-color arrays with a common reference, the software adjusts signal according to the reference signal. Adjustment is done on a gene by gene basis in 2 steps: (1) the mean reference signal, MR, is estimated for gene G using all arrays; (2) signal, S, for gene G in each array is replaced by adjusted value S1 = S-R+MR, where R is reference signal in the same array.

Two-color arrays with a common reference occasionally have cross-channel correlation which results in underestimation of gene expression change. Theoretically, red and green intensities should be independent, however we often observed that the intensity in the red channel, which resulted from hybridization with the same reference sample, increased with the increasing intensity in the green channel. The NIA Array Analysis tool has an option to adjust data that have cross-channel correlation. Adjustment is done on the gene-by-gene basis if the average log-intensity of the reference is by log(5) less than the average log-intensity in the other channel (i.e., at least 5-fold difference), and if the cross-channel correlation is >0.7. In this case, the reference intensity is set to its average value for this gene plus the adjustment to the average reference intensity for all genes in each array.

Outliers can be removed based on a user-selected z-threshold level. The default threshold (z = 8) removes only most deviating outliers. Estimation of the z-value is based on the error variance model (see below) which uses ANOVA results. This creates a circular reference which is resolved by iterative ANOVA with outlier removal after each iteration. The process stops if no new outliers were detected.

4. Gene annotations

If a probe annotation file is uploaded, then all results are accompanied with annotations which may contain hyperlinks to various web resources. Association with major gene indexes (e.g., Unigene, TIGR, MGI, NIA Mouse Gene Index) may provides unique gene identity, and can be used as input for programs that estimate the frequency of Gene Ontology terms (e.g. GenMAPP).

Table 1. Example of output table with annotations.
PlotGene idSymbolAverage logintensityLog-ratioFold change
FDR
Noisy
Annotation Gene Index4 'U' Cluster RefSeq Accession GenMAPP (MGI)
Hist
Z00014577-1
 3.92931.933785.84200  
 
 
 
Hist
Z00004888-1
Ctsl  4.36861.776959.82700cathepsin L  
U034911  
NM_009984.2  
MGI:88564  
Hist
Z00017881-1
Rbm10  3.44571.594839.33600RNA binding motif protein 10  
U043750  
NM_145627.1  
MGI:2384310  
Hist
Z00000306-1
BC003965  3.80421.575437.61800cDNA sequence BC003965  
U017665  
 
 
Hist
Z00010554-1
Slc34a2  3.10391.54835.31800solute carrier family 34 (sodium phosphate), member 2  
U005581  
NM_011402.1  
MGI:1342284  
Hist
Z00000225-1
5730436H21Rik  3.09651.535134.28401RIKEN cDNA 5730436H21 gene  
U019121  
NM_134139.1  
 
Hist
Z00006495-1
Ppat  3.00371.517532.92300phosphoribosyl pyrophosphate amidotransferase  
U026328  
NM_172146.1  
MGI:2387203  

The file with gene annotations is a tab-delimited text file with headers in the first row. The following three columns are required:
First column is probe ID (oligo ID, or cDNA clone ID), which should match to the gene ID in the data file that you analyze. Gene ID can be either a number or a text. If possible, select gene ID that can be referenced on the web (e.g., GenBank accession#).
Second column is gene symbol. If several genes match to the probe, but a comma and space between gene symbols. If there is no space after the comma, then table columns in the output will be too wide.
Third column is gene annotation.

The file may have additional columns if necessary (e.g., gene bank accession number, Unigene, LocusLink, MGI, etc.). These columns should have headers to be displayed in all tables. You can use HTML hyperlinks in additional columns (recommended). Do not use hyperlinks in the first column!

To upload the annotation file, click on the "Add Array" button. Fill the form as specified. In the form you can provide the URL link to the Probe ID (1st column). Instead of the probe name put <NAME>. Then click the "Add/Update Array" button.

5. ANOVA

Statistical analysis is based on the single-factor ANalysis Of VAriance (ANOVA). We assume that all replications are independent, which means that the entire process from tissue collection to RNA extraction, labeling and hybridization is done in parallel processing lines. Multiple hybridization of the same labeled RNA or dye swaps can be used as sub-replications, however they are averaged by the NIA Array Analysis software.

Two-color arrays are used either for pair-wise tissue comparison, or for multiple comparison if one color (e.g., red = CY5) is used for the same reference sample in all arrays. We mostly use the Stratogene (TM) Universal Mouse Reference as a reference sample. The software cannot analyze complex experiment designs with multiple reference samples.

The average error variance for genes with similar intensity is needed for adjusting the actual error variance (see error models below). It is estimated using the sliding window of adjustable size applied to genes sorted by their average intensity. Because some genes may have unusually high error variance due to outliers, a small proportion of highest variance values (1% by default) are not used for variance averaging. The error model attempts to get a better estimate for the true error variance than the error variance estimated from data, which we call here 'actual error variance'.

In this software we use 5 error models:

  1. Actual error variance; with this option each gene is processed independently from other genes (null error model)
  2. Intensity-specific average error variance; this and subsequent options stabilize ANOVA results
  3. Bayesian model (Baldi and Long, 2001. Bioinformatics 17: 509-519); here the error variance is a weighted average of the actual and average error variances, and it is an intermediate case between models #1 and #2;
  4. (default) Maximum of intensity-specific average error variance and actual error variance; this option is the most conservative
  5. Maximum of intensity-specific average error variance and Bayesian error variances; it is an intermediate model between #2 and #4
Option #4 is the most conservative and thus we use it as a default. If the error variance is too high, then none of error models is reliable. Thus we tag genes that have high error variance (>5 times the average), so that they can be examined visually. To reduce the number of genes with high error valiance, the user may select more stringent criteria for removing outliers. Error model also affects the degrees of freedom for the error variance (denominator in Fisher's test). For error models 2, 4, and 5, the d.f. is multiplied by the size of the sliding window in which error variances are averaged. For error model 3, d.f. = bayesian degrees of freedom which are specified by the user before the analysis.

The statistical significance is determined using the False Discovery Rate (FDR) method which has been proposed by Benjamini and Hochberg (Benjamini, Y. & Hochberg, Y., 1995. J Roy Stat Soc B 57: 289-300) and now has become a standard for the analysis of gene expression. The FDR equals the expected proportion of false positives among genes detected as significant. There are multiple methods for FDR estimation which yield slightly different results. In the NIA Array Analysis tool we have implemented the method of Benjamini and Hochberg:

(1)
where r is the rank of a gene ordered by increasing p-values, pi is the p-value for gene with rank i, and N is the total number of genes tested. The FDR value increases monotonously with increasing p-value. It indicates the proportion of false positives among all genes with p-values lower or equal to the p-value of the gene of interest.

Pair-wise comparison of means between 2 tissues/experiments is based on FDR statistics too. Z-values are estimated from the error model using equation
z = (M1-M2)/sqrt[ErrVar*((1/n1)+(1/n2))](2)
where Mi and ni are the mean and sample size for tissue i, and ErrVar is the error variance estimated from the error model. Significant genes are displayed using scatter-plot, log-ratio plot (Fig. 1), and tables (e.g., Table 1).

Fig. 1. Scatter-plot and log-ration plot for mouse preimplantation stages (Here and below we use the data of Hamatani et al. 2004. Developmental Cell 6: 117-131).

6. Hierarchical clustering

Hierarchical clustering of tissues is done using the average distance method. Then a set of specific genes is identified for each cluster using the following method. For each gene we identify tissue T1 within the cluster with the lowest average expression, E(T1), and tissue T2 outside of the cluster with the highest average expression, E(T2). Assume that K genes have E(T1) > E(T2); these genes always have a higher expression in tissues within the cluster than in all tissues outside of the cluster. To determine if the difference E(T1) - E(T2) is statistically significant, we calculated z-values based on the error model, and p-values based on a single-tail normal distribution. Finally we estimated FDR values using equation (1) in which N is the minimum between 2K and the total number of genes. The set of K genes represents only a half (the positive part) of the normal distribution, thus K is doubled for estimating the FDR.

Results are displayed as a dendrogram which also shows the number of specific genes (Fig. 2). For example, cluster #9 (unfertilized egg + 1-cell embryo) has 1296 specific genes. By clicking on the cluster box, you will get a table of cluster-specific genes.

Fig. 2. Hierarchical clustering of mouse preimplantation stages. The number of cluster-specific genes is shown next to the cluster box.

7. Principal Component Analysis (PCA/SVD)

Principal component analysis (PCA) is done using the Singular Value Decomposition (SVD) method that generates eigenvectors both for rows and columns of the log-transformed data matrix (Gabriel 1971. Biometrika 58: 453-467; Chapman et al. 2002. Bioinformatics. 18: 202-204). For plotting of tissues and genes (biplot) we used column projections. The advantage of the biplot compared to a traditional PCA is that the user can visually explore associations between genes and tissues. The NIA Array Analysis tool generates 2-dimensional and 3-dimensional (based on VRML) biplots (Fig. 3).

Fig. 3. 3D biplot of mouse preimplantation stages

All biplots (including 3D) are interactive; each gene is hyperlinked with its annotation and histogram that shows details of expression pattern.

For each principal component (PC) we identify 2 clusters of genes that are positively and negatively correlated with this PC (Fig. 4). The degree of gene expression change within a specific PC is measured by the slope of regression of log-transformed gene expression versus the corresponding eigenvector multiplied by the range of values within the eigenvector. Gene is associated with the most correlated PC; however two additional conditions should be met: (a) the degree of gene expression change exceeds the user-defined threshold (the default is 2-fold change), and (b) the absolute value of correlation exceeds the user-defined threshold (the default is r=0.7).

Fig. 4. Gene clustering based on principal components

It is possible to save PCA results from one experiment and then use them for the analysis of another experiment as reference coordinate system. This method works best if both experiments used the same array platform, i.e., all probe IDs are matching. Partial matching of probe IDs is also acceptable (e.g., if one array is an extension of the earlier array version). Use "same array platform" checkbox if both experiments were done with the same array platform (in this case the software will not attempt to co-normalize data sets). If array platforms are different, then do PCA based on a common identifier which can be gene symbol (recommended), GenBank accession number, or others. After you generate PCA, click the button "Save of import". You can specify a different login name (e.g. if you use public data, the login name is "public", but you can enter your own login name so that the results will be available in your personal session). When you analyze the second data set, then saved files will appear in the select box (PCA menu). Saved files remain available for 1 day.

8. Pattern matching

Pattern matching can be used to find genes with expression pattern similar to some other gene (or group of genes). The gene of interest (or list of genes) should be pasted into the input window as probe ID or gene symbol (if annotation file is used). If multiple probes/genes are entered, then the pattern of expression is averaged for the entire list. An alternative way of defining a pattern is a table with 2 columns: tissue/experiment and log-transformed expression value. Columns can be separated either by tabs or by "=".

A user-defined threshold of the log fold-change is used to detect genes that fit to the given pattern (the default is log(3), i.e. a 3-fold change). The degree of gene expression change is measured by the slope of regression of mean log-transformed gene expression versus the given pattern multiplied by the range of values within the pattern.

The NIA Array Analysis tool identifies 2 clusters associated with a given pattern: genes positively and negatively correlated with the pattern (Fig. 5).

Fig. 5. Pattern matching. The blue line shows the pattern of the target gene (Slc12a2), gray lines are centered gene intensities, and the red line is the average intensity of found genes.

9. Gene list analysis (GO, domains, location)

Getting information on selected genes is one of most important components of microarray analysis. To get a list of genes, it is easier to use tables in text or Excel format. Text tables can be loaded into Excel for analysis. Subsequent steps of analysis depend on the organism species and annotation file that you use. If your annotation file contains columns acceptable as input in GenMAPP software, then you can save the corresponding column in a new file and use it as import into GrnMAPP.

If you work with mouse microarrays, then you can use the NIA Mouse Gene Index to explore your list of genes. From the homepage select "View a list of selected genes/transcripts". Then you can paste a list of genes (gene symbols, U-clusters, or NIA oligos) or load it as a file. After the list of genes is uploaded, you can plot it on the genome and find over-represented GO-terms or protein domains.

Terms of use

This software is provided "AS IS". NIA makes no warranties, expressed or implied, including no representation or warranty with respect to the performance of the software and derivatives or their safety, effectiveness, or commercial viability. NIA does not warrant the merchantability or fitness of the software and derivatives for any particular purpose, or that they may be exploited without infringing the copyrights, patent rights or property rights of others. NIA shall not be liable for any claim, demand or action for any loss, harm, illness or other damage or injury arising from access to or use of the software or associated information, including without limitation any direct, indirect, incidental, exemplary, special or consequential damages.

Please report problems to Alexei Sharov.