ChromoScanGWA 1.0 User Manual
About ChromoScanGWA
ChromoScanGWA complements ChromoScan 1.0 and is intended to be used with it. ChromoScanGWA is a command-line Java tool tuned specifically for genome-wide analysis; it adds rigor by offering three different permutation tests. It computes the statistical significance of marker positions from SNP genotype data and performs permutation testing through the outcome-shuffling and regional-significance strategies described later in this document. For an introduction to the purpose and use of ChromoScan, please refer to the ChromoScan website at: http://www.epidkardia.sph.umich.edu/software/chromoscan/
Input Data
ChromoScanGWA reads comma-delimited text (CSV) files, which can be generated by most popular scientific and spreadsheet software applications. Its input requirements differ from those of ChromoScan: the data are similar, but the files are organized differently. ChromoScanGWA uses SNP genotype data and a dichotomous categorical outcome variable to test the association of each SNP with the outcome, based on chi-squared tests and the resulting p-values. It also needs the chromosomal marker positions to assess regional significance. ChromoScanGWA therefore requires two separate input CSV files: a marker genotype data file and a marker chromosomal location data file. Examples of the respective file formats follow:

Marker Genotype Data File
SampleID    SNP_A    SNP_B    Outcome
1           A_G      G_G      1
2           T_C      ?_?      0

Marker Location Data File
SNP_ID    Marker_Location
SNP_A     1500
SNP_B     3000
Marker Genotype Data File: The genotype file needs only two kinds of data: one outcome column and one or more marker genotype columns. The marker genotype data file must have the following properties:
● The file must be organized with samples differentiated by rows, and variables by columns
● The Outcome column must contain exactly two distinct values (it doesn't matter what they are: 1/0; Yes/No; Case/Control), and missing values are not allowed in the Outcome column
● The Outcome column identifier can be any unique identifier
● The marker identifiers must be unique, and must exactly (case-sensitive) match the marker identifiers in the marker location data file
● Marker genotype column identifiers must start with “SNP_”
● Each value in the marker genotype columns must be a properly formatted genotype
● The acceptable genotype format for ChromoScanGWA is “N_N”, where N is in the set [“A”, “C”, “G”, “T”, “?”]
● “?_?” is used to specify a missing genotype
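The genotype file rules above can be checked before a run. The following is a minimal sketch and not part of ChromoScanGWA: the function names are hypothetical, and it assumes the file has already been parsed into a header list and a list of row dictionaries (for example with Python's csv.DictReader), with a missing outcome appearing as an empty string.

```python
import re

# Documented genotype format "N_N", where N is one of A, C, G, T or ?.
GENOTYPE_PATTERN = re.compile(r"[ACGT?]_[ACGT?]")


def is_valid_genotype(value):
    """Return True if value matches the documented N_N genotype format."""
    return GENOTYPE_PATTERN.fullmatch(value) is not None


def check_genotype_rows(header, rows, outcome_column):
    """Return a list of problems found in a parsed genotype file.

    header is the list of column names; rows is a list of dicts keyed by
    column name. Both names are illustrative, not part of the tool.
    """
    problems = []
    marker_columns = [name for name in header if name.startswith("SNP_")]
    if len(set(marker_columns)) != len(marker_columns):
        problems.append("marker identifiers are not unique")

    # Assumption: a missing outcome shows up as an empty string.
    outcomes = {row[outcome_column] for row in rows}
    if "" in outcomes:
        problems.append("missing outcome value")
    if len(outcomes - {""}) != 2:
        problems.append("outcome column must have exactly two distinct values")

    # Data rows start on line 2 of the file, after the header line.
    for line_number, row in enumerate(rows, start=2):
        for name in marker_columns:
            if not is_valid_genotype(row[name]):
                problems.append(
                    f"line {line_number}: bad genotype {row[name]!r} in {name}"
                )
    return problems
```

An empty list means that none of the checked rules were violated; the match against the marker location file is not covered here.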
Marker Location Data File: The marker location file must specify the chromosomal location for all markers in the genotype data file.
● The Marker ID column identifier must be unique
● The Distance (marker location) column identifier must be unique
● Markers should be differentiated by rows, and variables by columns
● Markers need not be sorted in any particular order; sorting is performed automatically
● No two markers may share the same location (x_m – x_n ≠ 0 for any two distinct markers m and n)
● Marker IDs must exactly match (case-sensitive comparison) the marker IDs in the genotype data file
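The location file rules can be checked in a similar way. This is a minimal sketch under stated assumptions: check_marker_locations is a hypothetical helper, not part of ChromoScanGWA, and it expects the location file to have been parsed into (marker ID, numeric location) pairs. The sort only mirrors the ordering that the tool is described as performing itself.

```python
def check_marker_locations(markers, genotype_marker_ids):
    """Check parsed marker locations against the documented rules.

    markers is a list of (marker_id, location) pairs read from the location
    file; genotype_marker_ids are the SNP column names of the genotype file.
    Returns (markers sorted by location, list of problems).
    """
    problems = []
    ids = [marker_id for marker_id, _ in markers]
    if len(set(ids)) != len(ids):
        problems.append("duplicate marker IDs in location file")

    # Every genotype marker needs a location; the comparison is case-sensitive.
    missing = sorted(set(genotype_marker_ids) - set(ids))
    if missing:
        problems.append("no location for: " + ", ".join(missing))

    # No two markers may share a location.
    first_at = {}
    for marker_id, position in markers:
        if position in first_at:
            problems.append(
                f"{marker_id} and {first_at[position]} share location {position}"
            )
        else:
            first_at[position] = marker_id

    ordered = sorted(markers, key=lambda pair: pair[1])
    return ordered, problems
```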
Permutation Strategies
The first phase of calculations is the same for each of the permutation strategies:
1. Marker p-values are calculated
2. The scan-statistic algorithm discovers significant regions using the intact dataset
For each permutation:
1. The outcome values are shuffled with respect to the set of samples
2. Permutation marker p-values are re-calculated
The details for each permutation strategy follow:
Best: In each permutation, the significant regions are discovered using the permutation marker p-values. The alpha cutoff used to determine significance in permutations is relaxed in order to capture an appropriate number of significant regions. In each permutation the best m regions are retained, where m is the number of regions discovered using the original dataset. After all permutations are completed, the permutation regions are sorted in increasing order by p-value, and the significance of each original region is computed based on where its p-value falls compared to the permutation region p-values.
Fixed: In each permutation, the significance of each of the original regions is recomputed using the permutation marker p-values. The permutation p-value is computed as the proportion of permutations in which regional significance was retained, relative to the total number of permutations.
Floating: In each permutation, significant regions are discovered using the permutation marker p-values. The overall p-value for the set of original regions is calculated based on the frequency with which the count of permutation regions equals or exceeds the count of original regions.
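The final arithmetic of the three strategies can be sketched as follows. This is an illustration under stated assumptions, not the ChromoScanGWA implementation: all function names are hypothetical, the scan-statistic region discovery is not shown (its per-permutation results are taken as inputs), and the manual does not give the exact formula or tie handling for "best" mode, so the rank-based fraction used here is an assumption.

```python
import bisect
import random


def shuffle_outcomes(outcomes, rng):
    """Return a shuffled copy of the outcome values; genotypes stay in place."""
    shuffled = list(outcomes)
    rng.shuffle(shuffled)
    return shuffled


def fixed_p_value(retained_flags):
    """Fixed: share of permutations in which an original region stayed significant.

    retained_flags holds one True/False value per permutation.
    """
    return sum(retained_flags) / len(retained_flags)


def floating_p_value(original_region_count, permutation_region_counts):
    """Floating: share of permutations with at least as many regions as the original data."""
    hits = sum(
        1 for count in permutation_region_counts if count >= original_region_count
    )
    return hits / len(permutation_region_counts)


def best_p_value(original_region_p, pooled_permutation_region_ps):
    """Best (assumed formula): fraction of pooled permutation region p-values
    that are less than or equal to the original region's p-value."""
    ordered = sorted(pooled_permutation_region_ps)
    at_or_below = bisect.bisect_right(ordered, original_region_p)
    return at_or_below / len(ordered)
```

For example, if an original region stays significant in 2 of 4 permutations, fixed_p_value([True, False, False, True]) gives 0.5. Passing a seeded random.Random instance to shuffle_outcomes makes a permutation run reproducible.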
Using ChromoScanGWA
ChromoScanGWA has been developed as a command-line utility so that it can be integrated into an automated workflow. The one caveat is that each set of markers (each chromosome) needs to be reviewed in the ChromoScan GUI before it is used with ChromoScanGWA. ChromoScanGWA requires a “transformation_power” command-line option to transform the marker location values so that the distribution of inter-marker distances roughly follows the exponential distribution. This value can be determined by setting the “Transformation” option in ChromoScan to “Power Transformation” and modifying the exponential factor until the Q-Q plot indicates a reasonably good match to the exponential distribution (a transformation factor of 0.5 is usually a good starting point). Running ChromoScanGWA without any arguments gives a usage screen and two usage examples:
Input
java -jar ChromoScanGWA_1.0.jar
Output
Incorrect command line arguments.
java -jar ChromoScanGWA_1.0.jar type #permutations genotype_file.csv outcome_column marker_location.csv snp_label_column location_column transformation_power output_file.csv output_pval_file.csv AlphaCutoff PValueCutoff [BestModeLooseAlphaCutoff]
The type argument can be 'fixed', 'floating', or 'best' depending on the type of permutation test desired.
for example:
java -jar ChromoScanGWA_1.0.jar fixed 1000 chr2.csv HYT chr2loc.csv SNP Location 0.65 out.csv out_pval.csv 0.01 0.05
java -jar ChromoScanGWA_1.0.jar best 1000 chr2.csv HTY chr2loc.csv SNP Location 0.65 out.csv out_pval.csv 0.01 0.05 0.1
Parameter Description
type: Permutation type; can be “fixed”, “floating”, or “best”
#permutations: Integer value for the number of permutations to be performed (usually 1000 or 10,000)
genotype_file.csv: CSV input marker genotype data file as described above
outcome_column: The column identifier for the outcome variable in the marker genotype file
marker_location.csv: CSV input marker location data file as described above
snp_label_column: The column identifier for the marker label column in the marker location file
location_column: The column identifier for the marker location column in the marker location file
transformation_power: The exponential transformation factor determined using ChromoScan
output_file.csv: The name of the specified output CSV file
output_pval_file.csv: The name of the specified output CSV file containing the computed marker p-values prior to shuffling
AlphaCutoff: Value used to determine region significance (usually 0.01)
PValueCutoff: Value used to determine marker significance (usually 0.05)
BestModeLooseAlphaCutoff: Value used to determine permutation region significance in “best” mode (usually 0.1)
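When ChromoScanGWA is driven from a script, the argument list can be assembled in the documented order. The following is a minimal sketch: build_command and its parameter names are hypothetical, and the jar file name is taken from the usage screen and assumed to be in the working directory.

```python
def build_command(perm_type, permutations, genotype_file, outcome_column,
                  location_file, snp_label_column, location_column,
                  transformation_power, output_file, output_pval_file,
                  alpha_cutoff, p_value_cutoff, loose_alpha_cutoff=None,
                  jar="ChromoScanGWA_1.0.jar"):
    """Assemble the ChromoScanGWA argument list in the documented order."""
    if perm_type not in ("fixed", "floating", "best"):
        raise ValueError("type must be 'fixed', 'floating', or 'best'")
    command = [
        "java", "-jar", jar,
        perm_type, str(permutations),
        genotype_file, outcome_column,
        location_file, snp_label_column, location_column,
        str(transformation_power),
        output_file, output_pval_file,
        str(alpha_cutoff), str(p_value_cutoff),
    ]
    # The usage screen shows BestModeLooseAlphaCutoff as an optional last argument.
    if loose_alpha_cutoff is not None:
        command.append(str(loose_alpha_cutoff))
    return command
```

The returned list can be passed to subprocess.run(command, check=True) from Python's standard library, which raises an error if the Java process exits with a non-zero status.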
Operation Note:
ChromoScanGWA has been used with the Sun Java JRE version 1.5, but should work with any Java 1.5 JRE. It has been tested using both the Windows and Linux versions of Sun JRE 1.5.