|NIA Array Analysis Tool|
For 1-color arrays, each column represents an array. For 2-color arrays, a pair of adjacent columns represents an array. The order of columns should match to the order of tissue/experiment names that you specify in the input form. First you put data from 1-st tissue/experiment (1st replication then 2nd replication, ...), then 2-nd tissue/experiment, and so on. If some replications have multiple subreplications, then subreplications should be placed together one after another.
Below are examples of input file formats. You can use column headers to automatically generate the experiment design (see details in examples).
(1) Compare 2 tissues with 1-color arrays. Several replicates are needed (we suggest
using 3 biological replicates); some replicates may have technical subreplicates.
(2) Compare multiple tissues with 1-color arrays. At least some tissues should have independent biological replicates; technical subreplicates are possible.
(3) Compare 2 tissues with 2-color arrays. Several replicates are needed. Dye swap is recommended as technical subreplicates.
(4) Compare multiple tissues that are all hybridized with universal reference on 2-color arrays. At least some tissues should have independent biological replicates; technical subreplicates are possible. Dye swap is not needed.
(5) Visualization of a data set with no replications
If the software did not recognize your experiment design or if the generated design is wrong, you can always correct it manually by entering tissue/experiment names and the number of replications/subreplications in the form.
To use the software you need to make a tab-delimited parameter file (name it as convenient for you). Two columns are required, and 2 additional columns are optional.
Column 1 indicates a dye-swap (0=no dye swap, 1=dye swap). If your
array has only one color (or radioactive label) then put 0. For example,
if you compare embryo and placenta and selected green color for embryo
and red color for placenta as a normal design, then a dye
swap would be red color for placenta and green color for embryo.
Column 2 has file names. We recommend writing full file names that include the drive, path, and file name.
Column 3 (optional) has tissue/experiment names. If no tissue name provided then column headers will be taken from file names.
Column 4 (optional) has replication numbers. Arrays with same tissue and same replication numbers will be considered sub-replications (technical replications).
The order of files in the parameter file should be the same as required by the NIA Array Analysis tool. Do not save the parameter file to your desktop! Put it either into the directory of the program or any other directory.
Example of parameter file with dye swap:
Example of parameter file without dye swap:
The program can be configured to read any tab-delimited input files. It can skip a specific number of header lines and then extract specific data columns. To configure the program correctly you need to execute it from the command line. Parameters in the command line override default settings which correspond to the file format generated by Agilent scanner. The syntax of command is:
arrayjoin.exe [-i input][-o output][-a aliasFile][-t1 error/mean][-t2 error/avr.error][-d color#][-r rowsSkip][-h headerRow][-m meanCols][-e errorCols][-id idCol][-c col,min,max]
(1) input = infput parameter file
(2) output = output file
(3) aliasFile = file with alias names for oligos. We include the file with alias names for Agilent oligos (oligo-alias.txt). The first column is a unique oligo name, and the second column is a comma-separated list of aliases.
(4) error/mean = threshold for acceptable (Error/Mean) ratio for a gene. This parameter is used only if the scanner provides an error value for gene intensity. If the ratio Error/Mean is below the threshold, then gene intensity is directly taken from the file. If it is above the threshold, then the value will be either replaced by a missing value (-9999) or by a surrogate value (explained in the following paragraph). The default value =0.6. If you wish to take all values directly from the file, then make this threshold high (e.g, 10 is high enough).
(5) error/avr.error = threshold for (gene error/expected error for this intensity) This parameter is needed to determine how to handle a gene that has Error/Mean ratio above the threshold specified in the paragraph above. First, the computer estimates the average error for genes with specific intensity level (error may change depending on gene average intensity). Then, the actual error is compared with this expected error (i.e., average error). If the ratio (actual error/expected error) is above this threshold then the mean value is replaced by a missing value (-9999). If the ratio is below the threshold but the mean is greater than the expected error value, then the mean value is left unchanged. Finally, if the mean value is below the error value, then it is replaced by a surrogate value that is equal to the expected error. Default value = 3.
(6) color# = number of colors in the array Use 1 for 1-color array and 2 for 2-color arrays
(7) rowsSkip = Number of header lines to skip The text file generated by a scanner usually include several header lines (8) headerRow = row number that contains column descriptions. The default value = 10 because the current version of Agilent output files has headers in the 10th row. Headers are used to find correct output columns. The following options override these column numbers
(9) meanCols = column number for means Specify the column number for mean intensity of a gene. If scanner software includes normalization or color balance, then use normalized (or processed) means. For 1-color array specify just one number. For 2-color arrays put 2 numbers separated by a comma.
(10) errorCols = column number for mean errors (use -1 if no errors provided) Specify the column number for error of mean intensity of a gene. If scanner software includes normalization or color balance, then use normalized (or processed) errors. For 1-color array specify just one number. For 2-color arrays put 2 numbers separated by a comma.
(11) idCol = column number for oligo ID. All arrays should have the same order of oligo (genes or clones) ID in the file. Specify the column number with oligo ID. If the alias file is specified then oligo ID will be considered an alias and replaced by the name specified in that file. If several features have the same ID) then the value is averaged for all features with the same ID.
(12) col = column number for condition that a row is included into output (use -1 if no condition). Some arrays use control features for normalization. It is often convenient to remove these features from output. If the value is in the range of (min, max), then it is included in the output file. Column number, min, and max should be comma-separated with no space.
Launch the program arrayjoin.exe, it will prompt you to enter the name of the input parameter file (if you have not specified it in the command). Enter the name of parameter file. If the parameter file is located in the same directory as the program (i.e., arrayjoin.exe) then you don't need to specify the path (directory) to the parameter file. If the parameter file is located elsewhere you need to specify the full file name that includes the path. Then the program prompts for the name of output file (if you have not specified it in the command). If you cannot find the output file in the program folder, specify the full path for the output file.
Split-normalization means that normalization is done for a subset of genes at a time. The major reason for using split-normalization is that all genes do not fit on one array. In this case the full set of genes may be represented by 2 or more arrays that require independent normalization. For split-normalization the input file must have array indexes in the second column.
The NIA Array Analysis software starts the pre-processing of data with log-transformation. Negative data are treated as 0.0000000001. User-defined cutoffs can be assigned to filter and adjust the data. If a data value is less than the minimum cutoff, then it is replaced by the minimum cutoff value. This adjustment may artificially lower the error variance for low-expressed genes. To avoid this effect, the software adjusts the averaged error variance for genes with average intensity within 2*SD from the minimum cutoff, by not letting it decrease as the average intensity decreases. The maximum cutoff simply ignores genes with the average intensity exceeding the cutoff value.
In the case of two-color arrays with a common reference, the software adjusts signal according to the reference signal. Adjustment is done on a gene by gene basis in 2 steps: (1) the mean reference signal, MR, is estimated for gene G using all arrays; (2) signal, S, for gene G in each array is replaced by adjusted value S1 = S-R+MR, where R is reference signal in the same array.
Two-color arrays with a common reference occasionally have cross-channel correlation which results in underestimation of gene expression change. Theoretically, red and green intensities should be independent, however we often observed that the intensity in the red channel, which resulted from hybridization with the same reference sample, increased with the increasing intensity in the green channel. The NIA Array Analysis tool has an option to adjust data that have cross-channel correlation. Adjustment is done on the gene-by-gene basis if the average log-intensity of the reference is by log(5) less than the average log-intensity in the other channel (i.e., at least 5-fold difference), and if the cross-channel correlation is >0.7. In this case, the reference intensity is set to its average value for this gene plus the adjustment to the average reference intensity for all genes in each array.
Outliers can be removed based on a user-selected z-threshold level. The default threshold (z = 8) removes only most deviating outliers. Estimation of the z-value is based on the error variance model (see below) which uses ANOVA results. This creates a circular reference which is resolved by iterative ANOVA with outlier removal after each iteration. The process stops if no new outliers were detected.
Table 1. Example of output table with annotations.
|Plot||Gene id||Symbol||Average logintensity||Log-ratio||Fold change||Annotation
||Gene Index4 'U' Cluster
||Hist||3.9293||1.9337||85.842||0||0||Hist||Ctsl ||4.3686||1.7769||59.827||0||0||cathepsin L
||Hist||Rbm10 ||3.4457||1.5948||39.336||0||0||RNA binding motif protein 10
||Hist||BC003965 ||3.8042||1.5754||37.618||0||0||cDNA sequence BC003965
||Hist||Slc34a2 ||3.1039||1.548||35.318||0||0||solute carrier family 34 (sodium phosphate), member 2
||Hist||5730436H21Rik ||3.0965||1.5351||34.284||0||1||RIKEN cDNA 5730436H21 gene
||Hist||Ppat ||3.0037||1.5175||32.923||0||0||phosphoribosyl pyrophosphate amidotransferase
The file with gene annotations is a tab-delimited text file with headers in the first row.
The following three columns are required:
First column is probe ID (oligo ID, or cDNA clone ID), which should match to the gene ID in the data file that you analyze. Gene ID can be either a number or a text. If possible, select gene ID that can be referenced on the web (e.g., GenBank accession#).
Second column is gene symbol. If several genes match to the probe, but a comma and space between gene symbols. If there is no space after the comma, then table columns in the output will be too wide.
Third column is gene annotation.
The file may have additional columns if necessary (e.g., gene bank accession number, Unigene, LocusLink, MGI, etc.). These columns should have headers to be displayed in all tables. You can use HTML hyperlinks in additional columns (recommended). Do not use hyperlinks in the first column!
To upload the annotation file, click on the "Add Array" button. Fill the form as specified. In the form you can provide the URL link to the Probe ID (1st column). Instead of the probe name put <NAME>. Then click the "Add/Update Array" button.
Two-color arrays are used either for pair-wise tissue comparison, or for multiple comparison if one color (e.g., red = CY5) is used for the same reference sample in all arrays. We mostly use the Stratogene (TM) Universal Mouse Reference as a reference sample. The software cannot analyze complex experiment designs with multiple reference samples.
The average error variance for genes with similar intensity is needed for adjusting the actual error variance (see error models below). It is estimated using the sliding window of adjustable size applied to genes sorted by their average intensity. Because some genes may have unusually high error variance due to outliers, a small proportion of highest variance values (1% by default) are not used for variance averaging. The error model attempts to get a better estimate for the true error variance than the error variance estimated from data, which we call here 'actual error variance'.
In this software we use 5 error models:
The statistical significance is determined using the False Discovery Rate (FDR) method which has been proposed by Benjamini and Hochberg (Benjamini, Y. & Hochberg, Y., 1995. J Roy Stat Soc B 57: 289-300) and now has become a standard for the analysis of gene expression. The FDR equals the expected proportion of false positives among genes detected as significant. There are multiple methods for FDR estimation which yield slightly different results. In the NIA Array Analysis tool we have implemented the method of Benjamini and Hochberg:
Pair-wise comparison of means between 2 tissues/experiments is based on FDR
statistics too. Z-values are estimated from the error model using equation
|z = (M1-M2)/sqrt[ErrVar*((1/n1)+(1/n2))]||(2)|
Fig. 1. Scatter-plot and log-ration plot for mouse preimplantation stages (Here and below we use the data of Hamatani et al. 2004. Developmental Cell 6: 117-131).
Results are displayed as a dendrogram which also shows the number of specific genes (Fig. 2). For example, cluster #9 (unfertilized egg + 1-cell embryo) has 1296 specific genes. By clicking on the cluster box, you will get a table of cluster-specific genes.
Fig. 2. Hierarchical clustering of mouse preimplantation stages. The number of cluster-specific genes is shown next to the cluster box.
Fig. 3. 3D biplot of mouse preimplantation stages
All biplots (including 3D) are interactive; each gene is hyperlinked with its annotation and histogram that shows details of expression pattern.
For each principal component (PC) we identify 2 clusters of genes that are positively and negatively correlated with this PC (Fig. 4). The degree of gene expression change within a specific PC is measured by the slope of regression of log-transformed gene expression versus the corresponding eigenvector multiplied by the range of values within the eigenvector. Gene is associated with the most correlated PC; however two additional conditions should be met: (a) the degree of gene expression change exceeds the user-defined threshold (the default is 2-fold change), and (b) the absolute value of correlation exceeds the user-defined threshold (the default is r=0.7).
Fig. 4. Gene clustering based on principal components
It is possible to save PCA results from one experiment and then use them for the analysis of another experiment as reference coordinate system. This method works best if both experiments used the same array platform, i.e., all probe IDs are matching. Partial matching of probe IDs is also acceptable (e.g., if one array is an extension of the earlier array version). Use "same array platform" checkbox if both experiments were done with the same array platform (in this case the software will not attempt to co-normalize data sets). If array platforms are different, then do PCA based on a common identifier which can be gene symbol (recommended), GenBank accession number, or others. After you generate PCA, click the button "Save of import". You can specify a different login name (e.g. if you use public data, the login name is "public", but you can enter your own login name so that the results will be available in your personal session). When you analyze the second data set, then saved files will appear in the select box (PCA menu). Saved files remain available for 1 day.
A user-defined threshold of the log fold-change is used to detect genes that fit to the given pattern (the default is log(3), i.e. a 3-fold change). The degree of gene expression change is measured by the slope of regression of mean log-transformed gene expression versus the given pattern multiplied by the range of values within the pattern.
The NIA Array Analysis tool identifies 2 clusters associated with a given pattern: genes positively and negatively correlated with the pattern (Fig. 5).
Fig. 5. Pattern matching. The blue line shows the pattern of the target gene (Slc12a2), gray lines are centered gene intensities, and the red line is the average intensity of found genes.
If you work with mouse microarrays, then you can use the NIA Mouse Gene Index to explore your list of genes. From the homepage select "View a list of selected genes/transcripts". Then you can paste a list of genes (gene symbols, U-clusters, or NIA oligos) or load it as a file. After the list of genes is uploaded, you can plot it on the genome and find over-represented GO-terms or protein domains.
Please report problems to Alexei Sharov.