Single Nucleotide Polymorphism Spectral Decomposition (SNPSpD)
NOTE:
SNPSpD is not designed to
analyse >100 SNPs - use SNPSpDlite or preferably SNPSpDsuperlite!
If you wish to estimate the effective number of independent markers in
larger datasets, then I recommend one of the following:
i) download the SNPSpD scripts from the link provided below to perform SNPSpD
analysis on your local machine,
ii) use my matSpDlite
approach (which only requires users to upload a correlation matrix),
iii) download the matSpDlite.R
R script to perform matSpDlite analysis on your local machine, or
iv) use the Genetic Type I error calculator (GEC).
Users may also be interested in the leaner version of SNPSpD (SNPSpDlite), which ONLY calculates Meff and MeffLi (i.e., it does not perform time-consuming varimax/promax rotations), thus allowing users to obtain Meff/MeffLi values for large numbers (1000s) of SNPs.
Users may also be interested in the related site (matSpD) which only requires users to upload a correlation matrix.
Please use the following reference when
reporting results based on SNPSpD:
Nyholt
DR (2004) A simple correction for multiple testing for SNPs in linkage
disequilibrium with each other. Am J Hum Genet 74(4):765-769.
To run SNPSpD using all fully genotyped
family members:
To run SNPSpD using all family members:
Notes:
These files are run through a slightly altered version of Gonçalo Abecasis' ldmax program, part of the GOLD (Command Line Tools) package [gold-1.1.0.tar.gz]. Please refer to the ldmax documentation for specific details on the format of these input files.
The following links show example input files for a subset of the Keavney et al. (1998) data:
keavney.pre
keavney.map
Results from post-July 1 2005 SNPSpD analysis of Keavney data.
Results from pre-July 1 2005 SNPSpD analysis of Keavney data.
View original paper.
The following links show example input files for the Moffatt et al. (2000) data:
moffatt.pre
moffatt.map
Results from post-July 1 2005 SNPSpD analysis of Moffatt data.
Results from pre-July 1 2005 SNPSpD analysis of Moffatt data.
View original paper.
As described in the program's documentation, ldmax uses the expectation-maximization (EM) based approach of Slatkin and Excoffier (1995) Mol Biol Evol 12:921-7, to estimate haplotype frequencies. Using these haplotype frequencies, ldmax outputs a table summarising pairwise linkage disequilibrium (LD) statistics to a file named LD.XT. From this output, a Perl script is used to create a matrix of the disequilibrium measure delta (Δ), where
\[ \Delta = \frac{p_{11}\,p_{22} - p_{12}\,p_{21}}{\sqrt{p_{1+}\,p_{2+}\,p_{+1}\,p_{+2}}} \]
and p_{ij} are the four haplotype frequencies for a pair of SNPs, with p_{1+}, p_{2+}, p_{+1} and p_{+2} the corresponding marginal allele frequencies.
Devlin and Risch (1995) Genomics 29:311-22, provide a summary of common LD measures and note that Δ is the correlation coefficient for a 2 x 2 table [Hill and Robertson (1968) Theor Appl Genet 38:226-231].
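As an illustration of the measure described above, the following Python sketch computes the correlation coefficient for a 2 x 2 haplotype table in its standard form, (p11*p22 - p12*p21) / sqrt(p1+ * p2+ * p+1 * p+2). It is not the Perl script used by SNPSpD. The function name and the example frequencies are hypothetical.

```python
import math

def delta_from_haplotypes(p11, p12, p21, p22):
    """Correlation coefficient (delta) between two biallelic SNPs.

    p11, p12, p21, p22 are the four haplotype frequencies of a 2 x 2 table
    (rows: alleles at the first SNP, columns: alleles at the second SNP).
    """
    total = p11 + p12 + p21 + p22
    if not math.isclose(total, 1.0, abs_tol=1e-9):
        raise ValueError("haplotype frequencies must sum to 1")
    # Marginal allele frequencies at each SNP.
    row1, row2 = p11 + p12, p21 + p22
    col1, col2 = p11 + p21, p12 + p22
    denom = math.sqrt(row1 * row2 * col1 * col2)
    if denom == 0.0:
        raise ValueError("a monomorphic SNP has no defined correlation")
    return (p11 * p22 - p12 * p21) / denom

# Hypothetical haplotype frequencies, for illustration only.
print(delta_from_haplotypes(0.5, 0.0, 0.0, 0.5))      # two haplotypes only: complete LD
print(delta_from_haplotypes(0.25, 0.25, 0.25, 0.25))  # all four equally common: no LD
```

Applying such a function to every pair of SNPs would fill a symmetric matrix with 1s on the diagonal, which is the kind of LD correlation matrix the rest of this page works with.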
Eigenvalues (λs) for this LD correlation matrix may be calculated by principal components (factor) analysis, or more generally by spectral decomposition (SpD). SNPSpD obtains the λs by spectral decomposition using the eigen function of R [R is a language and environment for statistical computing (R Development Core Team 2003)].
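SNPSpD itself relies on R for this step. The Python sketch below is only a stand-in to show what a spectral decomposition returns for a small symmetric correlation matrix. It uses the classical cyclic Jacobi rotation method, which is a substitute technique chosen so that the example needs nothing beyond the standard library. The function name and the 3 x 3 matrix are hypothetical.

```python
import math

def symmetric_eigenvalues(matrix, tol=1e-20, max_sweeps=50):
    """Eigenvalues of a real symmetric matrix by cyclic Jacobi rotations."""
    n = len(matrix)
    a = [list(row) for row in matrix]  # work on a copy
    for _ in range(max_sweeps):
        # Stop once the sum of squared off-diagonal entries is negligible.
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if a[p][q] == 0.0:
                    continue
                # Rotation angle that zeroes a[p][q], folded into [-pi/4, pi/4].
                theta = 0.5 * math.atan2(2.0 * a[p][q], a[q][q] - a[p][p])
                if theta > math.pi / 4:
                    theta -= math.pi / 2
                elif theta < -math.pi / 4:
                    theta += math.pi / 2
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):  # update columns p and q
                    akp, akq = a[k][p], a[k][q]
                    a[k][p], a[k][q] = c * akp - s * akq, s * akp + c * akq
                for k in range(n):  # update rows p and q
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k], a[q][k] = c * apk - s * aqk, s * apk + c * aqk
    # The diagonal now holds the eigenvalues; report them largest first.
    return sorted((a[i][i] for i in range(n)), reverse=True)

# Hypothetical LD correlation matrix for three SNPs.
ld = [[1.0, 0.8, 0.1],
      [0.8, 1.0, 0.2],
      [0.1, 0.2, 1.0]]
eigenvalues = symmetric_eigenvalues(ld)
print(eigenvalues)
print(sum(eigenvalues))  # equals the number of SNPs, up to rounding
```

Because a correlation matrix has 1s on its diagonal, its eigenvalues always sum to the number of SNPs, which gives a quick sanity check on any decomposition.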
The λ for a given factor measures the variance in all the SNP-SNP correlations which is accounted for by that factor.
Higher correlation among the SNPs leads to higher λs.
For example, if all SNPs are maximally correlated, the first λ is equal to the number of SNPs represented in the matrix while the rest of the λs are equal to zero. In this situation the variance of the λs is at its maximum and equal to the number of SNPs (M) in the matrix.
Conversely, when there is no correlation among SNPs, all of the λs are equal to 1 and the set of λs has no variance. Therefore, the variance of the λs will range between 0, when all the SNPs are independent, and M.
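An intermediate case may help to fix ideas. Assume, purely for illustration, that every pair of the M SNPs shares the same correlation r (with 0 <= r <= 1), and take the variance to be the sample variance (divisor M - 1), which is the convention under which the maximum equals M. The eigenvalues and their variance are then available in closed form:

```latex
\lambda_1 = 1 + (M-1)\,r, \qquad
\lambda_2 = \dots = \lambda_M = 1 - r, \qquad
\bar{\lambda} = 1

\mathrm{Var}(\lambda)
  = \frac{\bigl((M-1)\,r\bigr)^2 + (M-1)\,r^2}{M-1}
  = M r^2
```

Setting r = 0 recovers the independent case (variance 0) and r = 1 recovers the maximally correlated case (variance M), with the variance growing as the square of the shared correlation in between.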
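Hence, the ratio of the observed eigenvalue variance to its maximum (M) gives the proportional reduction in the number of independent SNPs in a set, and the effective number of SNPs (Meff) may be calculated as follows:
\[ M_{\mathrm{eff}} = 1 + (M - 1)\left(1 - \frac{\mathrm{Var}(\lambda_{\mathrm{obs}})}{M}\right) \]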
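The calculation from Nyholt (2004), Meff = 1 + (M - 1)(1 - Var(λobs)/M), can be sketched in a few lines of Python. This is a minimal illustration rather than the SNPSpD source. The function name is hypothetical, and the two eigenvalue sets are the limiting cases described above rather than real data.

```python
from statistics import variance

def effective_number_of_snps(eigenvalues):
    """Meff = 1 + (M - 1) * (1 - Var(eigenvalues) / M)."""
    m = len(eigenvalues)
    if m < 2:
        return float(m)  # a single SNP is trivially one independent test
    # statistics.variance is the sample variance (divisor M - 1), the
    # convention under which the maximum possible variance equals M.
    return 1.0 + (m - 1) * (1.0 - variance(eigenvalues) / m)

print(effective_number_of_snps([1.0] * 10))          # ten independent SNPs
print(effective_number_of_snps([10.0] + [0.0] * 9))  # ten maximally correlated SNPs
```

The first call returns M itself and the second returns 1, matching the two extremes of the eigenvalue variance.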
Selecting a subset of SNPs:
The SpD of the LD correlation matrix also allows investigation of which SNPs best represent a group of SNPs in high LD with each other. That is, by examining the factor "loadings" for each λ, one may determine which SNP best captures the information contained in each group.
Although Meng et al. (2003) Am J Hum Genet 73:115-130, independently came to the same conclusion during the development of this web interface, I have nonetheless extended SNPSpD to report results after orthogonal rotation. The results include the λs, principal component coefficients, and factor "loadings" after orthogonal rotation utilising the varimax function of R.
I have maximised interpretability of these results by flagging the SNP(s) contributing the MOST to each rotated factor (group of SNPs). These flagged SNPs may be viewed as "haplotype tagging SNPs". Indeed, even in data with strong LD the rotated factors correspond well with haplotypes obtained via traditional methods. For example, the 7 haplotypes reported in the Keavney et al. (1998) paper correspond to the 7 factors produced by SNPSpD after varimax rotation.
Finally, in order to select the minimum subset of SNPs which maximise information, I propose using the effective number of SNPs (Meff) as a guide to choosing the total number of haplotypes to tag [i.e., choose the tagging SNPs which best represent the factors with the Meff highest rotated λs]. For example, the 23 SNPs in the Moffatt data have low intermarker LD, producing an Meff of 22.53, thus indicating that the SNP loading highest on the factor with the smallest λ [i.e., factor 23; λ=0.0483; SNP 7 (hADV14S1A)] provides the least amount of independent information [i.e., it is highly correlated with SNP 9 (hADV14S1C)]. Analogously, the 10 SNPs in the Keavney data have very high intermarker LD, producing an Meff of 4.59, indicating that the 5 SNPs loading highest on the factors with the 5 largest λs [i.e., any one of SNP 7 (g2215a) / SNP 8 (i/d) / SNP 9 (g2350a); SNP 1 (t-5991c); SNP 3 (t-3892c); SNP 6 (t-1237c); and SNP 10 (4656ct)] provide MOST of the independent information of the 10 SNPs.
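The flagging step described above can be sketched as follows, assuming the rotated loadings and rotated eigenvalues have already been obtained elsewhere. The function name, SNP names, loadings and eigenvalues are all hypothetical, and the number of factors to keep is passed in explicitly because how Meff is rounded to an integer is a choice left to the user.

```python
def flag_tagging_snps(snp_names, loadings, rotated_eigenvalues, n_factors):
    """Pick one tagging SNP for each of the n_factors largest rotated factors.

    loadings[i][j] is the loading of SNP i on rotated factor j.
    Returns (factor_index, snp_name) pairs, largest factor first.
    """
    # Rank factors by their rotated eigenvalue, largest first.
    order = sorted(range(len(rotated_eigenvalues)),
                   key=lambda j: rotated_eigenvalues[j], reverse=True)
    tags = []
    for j in order[:n_factors]:
        # The SNP contributing most to a factor has the largest absolute loading.
        best = max(range(len(snp_names)), key=lambda i: abs(loadings[i][j]))
        tags.append((j, snp_names[best]))
    return tags

# Hypothetical rotated results for four SNPs (rows) and four factors (columns).
snps = ["snpA", "snpB", "snpC", "snpD"]
loadings = [
    [0.95, 0.10,  0.05,  0.02],
    [0.93, 0.12,  0.08, -0.30],
    [0.10, 0.98,  0.03,  0.01],
    [0.05, 0.02, -0.99,  0.00],
]
rotated_eigenvalues = [1.78, 0.98, 0.99, 0.09]

print(flag_tagging_snps(snps, loadings, rotated_eigenvalues, n_factors=3))
```

In this made-up example snpA and snpB load on the same factor, so only one of them is flagged, and the factor with the smallest eigenvalue is left untagged.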
Alternatively, and perhaps more statistically
appealing, you can choose the tagging SNPs which represent the factors
explaining a selected proportion of variance. For example, only the
last factor explains ≤1% of the variance in the Moffatt data. Analogously,
for the Keavney data, the first three factors explain 95% (94.65%) of the
variance, while the first four factors explain 99% (98.74%) of the variance,
etc.
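A minimal Python sketch of this alternative follows. It counts how many leading factors are needed to reach a chosen proportion of the total variance, where each factor's share is its eigenvalue divided by the sum of all eigenvalues. The function name and eigenvalues are hypothetical. The strict "at least the target" rule is an assumption made here, so a rounded figure such as 94.65% would not count as reaching 95%.

```python
def factors_for_variance(eigenvalues, target):
    """Smallest number of leading factors explaining at least `target`
    (a proportion between 0 and 1) of the total variance."""
    total = sum(eigenvalues)
    cumulative = 0.0
    for count, value in enumerate(sorted(eigenvalues, reverse=True), start=1):
        cumulative += value
        if cumulative / total >= target:
            return count
    return len(eigenvalues)  # fall back to all factors

# Hypothetical eigenvalues for four SNPs (they sum to 4).
example = [2.5, 1.0, 0.4, 0.1]
print(factors_for_variance(example, 0.60))
print(factors_for_variance(example, 0.95))
print(factors_for_variance(example, 0.99))
```

The tagging SNPs would then be those flagged for the returned number of leading factors.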
Abecasis GR, Cherny SS, Cookson WO, Cardon LR (2002) Merlin-rapid analysis of dense genetic maps using sparse gene flow trees. Nat Genet 30:97-101
Abecasis GR, Cookson WO (2000) GOLD-graphical overview of linkage disequilibrium. Bioinformatics 16:182-3
Cheverud JM (2001) A simple correction for multiple comparisons in interval mapping genome scans. Heredity 87:52-8
Horne BD, Camp NJ (2004) Principal component analysis for selection of optimal SNP-sets that capture intragenic genetic variation. Genet Epidemiol 26:11-21
Keavney B, McKenzie CA, Connell JM, Julier C, Ratcliffe PJ, Sobel E, Lathrop M, Farrall M (1998) Measured haplotype analysis of the angiotensin-I converting enzyme gene. Hum Mol Genet 7:1745-51
Li J, Ji L (2005) Adjusting multiple testing in multilocus analyses using the eigenvalues of a correlation matrix. Heredity 95:221-227
Meng Z, Zaykin DV, Xu CF, Wagner M, Ehm MG (2003) Selection of genetic markers for association analyses, using linkage disequilibrium and haplotypes. Am J Hum Genet 73:115-30
Moffatt MF, Traherne JA, Abecasis GR, Cookson WO (2000) Single nucleotide polymorphism and linkage disequilibrium within the TCR alpha/delta locus. Hum Mol Genet 9:1011-9
R Development Core Team (2003) R: a language
and environment for statistical computing. R Foundation for Statistical
Computing, Vienna, http://www.R-project.org
(accessed March 1, 2004)
License agreement:
SNPSpD is coded by Dale R. Nyholt.
Copyright (C) 2004, Dale R. Nyholt.
All rights reserved.
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
3. The names of its contributors may not be used to endorse or promote products derived from this software without specific prior written permission.
4. We also request that use of this software
be cited in publications as:
Nyholt DR (2004) A simple correction for
multiple testing for SNPs in linkage disequilibrium with each other. Am
J Hum Genet 74(4):765-769.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT
HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES,
INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY
AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT
SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT,
INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE,
EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
Useful Links:
Cartographer
- a great tool to build physically-ordered marker maps from STS marker
names [also gives deCODE cM position].
Page last updated October 12, 2007.
Special thanks to David Smyth (Genepi's
IT guru) for assisting with the development of this web interface.
Tel: +61-7-3362 0258
Email: daleN@qimr.edu.au