
Single Nucleotide Polymorphism Spectral Decomposition (SNPSpD)

Users may also be interested in SNPSpDlite, a leaner version of SNPSpD which calculates ONLY Meff and MeffLi (i.e., it does not perform the time-consuming varimax/promax rotations), thus allowing users to obtain Meff/MeffLi values for large numbers (1000s) of SNPs.

Users may also be interested in the related site (matSpD), which requires users to upload only a correlation matrix.

Please use the following reference when reporting results based on SNPSpD:
Nyholt DR (2004) A simple correction for multiple testing for SNPs in linkage disequilibrium with each other. Am J Hum Genet 74(4):765-769.
 

SNPSpD can be run using any one of the following sets of individuals:

1) all fully genotyped family members;
2) fully genotyped founders only;
3) all family members;
4) founders only.

For the chosen option:

Please specify where your "merlin.pre" file is located:

Please specify where your "merlin.map" file is located:

Please be patient, the spectral decomposition will take a few moments.


 
Please note the following, paying particular attention to the first 6 points relating to common problems with input files:


Notes:

The SNPSpD interface takes two input files:
1) a MERLIN/GOLD-format pre-makeped pedigree file ("merlin.pre"), and
2) a MERLIN/GOLD-format map file ("merlin.map").

These files are run through a slightly altered version of Gonçalo Abecasis' ldmax program, which is part of the GOLD (Command Line Tools) package [gold-1.1.0.tar.gz]. Please refer to the ldmax documentation for specific details on the format of these input files.

The following links show example input files for a subset of the Keavney et al. (1998) data:
keavney.pre
keavney.map
Results from post-July 1 2005 SNPSpD analysis of Keavney data.
Results from pre-July 1 2005 SNPSpD analysis of Keavney data.
View original paper.

The following links show example input files for the Moffatt et al. (2000) data:
moffatt.pre
moffatt.map
Results from post-July 1 2005 SNPSpD analysis of Moffatt data.
Results from pre-July 1 2005 SNPSpD analysis of Moffatt data.
View original paper.

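As described in the program's documentation, ldmax uses the expectation-maximization (EM) based approach of Slatkin and Excoffier (1995) Mol Biol Evol 12:921-7 to estimate haplotype frequencies. Using these haplotype frequencies, ldmax outputs a table summarising pairwise linkage disequilibrium (LD) statistics to a file named LD.XT. From this output, a perl script is used to create a matrix of the disequilibrium measure delta (Δ), where

$$\Delta = \frac{p_{11}\,p_{22} - p_{12}\,p_{21}}{\sqrt{p_{1+}\,p_{2+}\,p_{+1}\,p_{+2}}},$$

and the notation for estimated haplotype and marker allele frequencies in a 2 x 2 table is

$$
\begin{array}{l|cc|c}
 & \text{Marker 2, allele 1} & \text{Marker 2, allele 2} & \text{Total} \\
\hline
\text{Marker 1, allele 1} & p_{11} & p_{12} & p_{1+} \\
\text{Marker 1, allele 2} & p_{21} & p_{22} & p_{2+} \\
\hline
\text{Total} & p_{+1} & p_{+2} & 1
\end{array}
$$

Devlin and Risch (1995) Genomics 29:311-22 provide a summary of common LD measures and note that Δ is the correlation coefficient for a 2 x 2 table [Hill and Robertson (1968) Theor Appl Genet 38:226-231].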
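As a minimal sketch of the Δ calculation (not the perl script used by SNPSpD), the correlation coefficient for a 2 x 2 table can be computed directly from the four estimated haplotype frequencies. The function name `delta` and the example frequencies are hypothetical and chosen for illustration only; p11, p12, p21 and p22 denote the frequencies of the four two-marker haplotypes, and the marker allele frequencies are their row and column sums.

```python
from math import sqrt


def delta(p11, p12, p21, p22):
    """Correlation coefficient (delta) for a 2 x 2 haplotype frequency table.

    p11, p12, p21, p22 are estimated haplotype frequencies, where the first
    index is the allele at marker 1 and the second the allele at marker 2.
    """
    # Marker allele frequencies are the margins of the 2 x 2 table.
    p1_plus, p2_plus = p11 + p12, p21 + p22   # marker 1 alleles
    p_plus1, p_plus2 = p11 + p21, p12 + p22   # marker 2 alleles
    denom = sqrt(p1_plus * p2_plus * p_plus1 * p_plus2)
    if denom == 0:
        raise ValueError("delta is undefined when a marker is monomorphic")
    return (p11 * p22 - p12 * p21) / denom


# Hypothetical haplotype frequencies, for illustration only.
print(delta(0.5, 0.0, 0.0, 0.5))      # two markers in complete LD
print(delta(0.25, 0.25, 0.25, 0.25))  # two markers in linkage equilibrium
```

Applying such a function to every pair of SNPs would give the symmetric LD correlation matrix (with 1s on the diagonal) that is decomposed in the next step.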

Eigenvalues (λs) for this LD correlation matrix may be calculated by principal components (factor) analysis, or more generally by spectral decomposition (SpD). SNPSpD obtains the λs by spectral decomposition using the eigen function of R [R is a language and environment for statistical computing (R Development Core Team 2003)].
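To illustrate what the spectral decomposition step produces, the sketch below computes the eigenvalues of a small symmetric correlation matrix. Please note that SNPSpD itself uses R; this example swaps in the classical cyclic Jacobi rotation method purely so that it runs with the Python standard library alone. The function name `symmetric_eigenvalues` and the 3 x 3 matrix are hypothetical.

```python
from math import copysign, hypot


def symmetric_eigenvalues(matrix, tol=1e-18, max_sweeps=100):
    """Eigenvalues of a real symmetric matrix via cyclic Jacobi rotations.

    Returns the eigenvalues sorted from largest to smallest.
    """
    n = len(matrix)
    a = [list(map(float, row)) for row in matrix]  # work on a copy
    for _ in range(max_sweeps):
        # Stop once the off-diagonal elements are numerically zero.
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(i + 1, n))
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if a[p][q] == 0.0:
                    continue
                # Rotation angle chosen so that the new a[p][q] is zero.
                theta = (a[q][q] - a[p][p]) / (2.0 * a[p][q])
                t = copysign(1.0, theta) / (abs(theta) + hypot(theta, 1.0))
                c = 1.0 / hypot(t, 1.0)
                s = t * c
                # Apply the rotation to columns p and q ...
                for k in range(n):
                    akp, akq = a[k][p], a[k][q]
                    a[k][p] = c * akp - s * akq
                    a[k][q] = s * akp + c * akq
                # ... and then to rows p and q (a similarity transform).
                for k in range(n):
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k] = c * apk - s * aqk
                    a[q][k] = s * apk + c * aqk
    return sorted((a[i][i] for i in range(n)), reverse=True)


# Hypothetical LD correlation matrix for three SNPs.
ld = [
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.2],
    [0.1, 0.2, 1.0],
]
eigenvalues = symmetric_eigenvalues(ld)
print(eigenvalues)
print(sum(eigenvalues))  # the eigenvalues of a correlation matrix sum to M
```

For real data a well-tested routine (such as the R function used by SNPSpD) is preferable; the Jacobi method is used here only because it is short and easy to verify by hand.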

The λ for a given factor measures the variance in all the SNP-SNP correlations which is accounted for by that factor.

Higher correlation among the SNPs leads to higher λs.

For example, if all SNPs are maximally correlated, the first λ is equal to the number of SNPs represented in the matrix while the rest of the λs are equal to zero. In this situation the variance of the λs is at its maximum and equal to the number of SNPs (M) in the matrix.

Conversely, when there is no correlation among SNPs, all of the λs are equal to 1 and the set of λs has no variance. Therefore, the variance of the λs will range between 0, when all the SNPs are independent, and M.

Hence, the ratio of the observed λ variance to its maximum (M) gives the proportional reduction in the number of independent SNPs in a set, and the effective number of SNPs (Meff) may be calculated as follows:

$$M_{\text{eff}} = 1 + (M - 1)\left(1 - \frac{\mathrm{Var}(\lambda_{\text{obs}})}{M}\right)$$
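The calculation can be sketched in a few lines, assuming the formula given in Nyholt (2004), Meff = 1 + (M - 1)(1 - Var(λobs)/M), where Var(λobs) is the sample variance (M - 1 denominator) of the observed λs. The function name `meff_nyholt` is hypothetical, and the two example sets of λs are simply the limiting cases described above for M = 4.

```python
from statistics import variance


def meff_nyholt(eigenvalues):
    """Effective number of SNPs from the eigenvalues of an LD correlation matrix."""
    m = len(eigenvalues)
    if m < 2:
        raise ValueError("at least two eigenvalues are required")
    # Sample variance (M - 1 denominator), so that the maximum variance is M.
    var_obs = variance(eigenvalues)
    return 1 + (m - 1) * (1 - var_obs / m)


# Limiting cases for M = 4 SNPs.
print(meff_nyholt([4.0, 0.0, 0.0, 0.0]))  # all SNPs maximally correlated -> 1.0
print(meff_nyholt([1.0, 1.0, 1.0, 1.0]))  # all SNPs independent -> 4.0
```

The two limiting cases confirm that Meff ranges from 1 (one effective SNP) to M (no reduction).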

Indeed, Cheverud (2001) Heredity 87:52-58 utilised this approach to provide a simple correction for multiple comparisons in interval mapping genome scans.
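As a rough illustration of how Meff feeds into a multiple-testing correction, the sketch below applies the Šidák formula with Meff in place of the raw number of SNPs, which is my understanding of the approach described in Nyholt (2004). The function name `sidak_threshold` is hypothetical, and the default alpha of 0.05 is an assumption chosen for the example.

```python
def sidak_threshold(meff, alpha=0.05):
    """Per-test significance threshold keeping the experiment-wide rate at alpha.

    Uses the Sidak formula with the effective number of SNPs (Meff) in
    place of the actual number of tests.
    """
    if meff < 1:
        raise ValueError("Meff must be at least 1")
    return 1 - (1 - alpha) ** (1 / meff)


# Meff values reported on this page for the Keavney and Moffatt data.
for name, meff in [("Keavney", 4.59), ("Moffatt", 22.53)]:
    print(name, sidak_threshold(meff))
```

A simpler Bonferroni-style alternative is alpha / Meff, which is slightly more conservative than the Šidák threshold.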
 

Selecting a subset of SNPs:

The SpD of the LD correlation matrix also allows investigation of which SNPs best represent a group of SNPs in high LD with each other. That is, by examining the factor "loadings" for each λ, one may determine which SNP best captures the information contained in each group.

Although Meng et al. (2003) Am J Hum Genet 73:115-130 independently came to the same conclusion during the development of this web interface, I have nonetheless extended SNPSpD to report results after orthogonal rotation. The results include the λs, principal component coefficients, and factor "loadings" after orthogonal rotation utilising the varimax function of R.

I have maximised interpretability of these results by flagging the SNP(s) contributing the MOST to each rotated factor (group of SNPs). These flagged SNPs may be viewed as "haplotype tagging SNPs". Indeed, even in data with strong LD the rotated factors correspond well with haplotypes obtained via traditional methods. For example, the 7 haplotypes reported in the Keavney et al. (1998) paper correspond to the 7 factors produced by SNPSpD after varimax rotation.
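The flagging step can be sketched as follows. This is an illustration only: I am assuming here that "contributing the most" means having the largest absolute loading on a rotated factor, and that SNPs tied for the largest loading (within a small tolerance) are all flagged. The function name `flag_tagging_snps`, the tolerance, and the loadings are hypothetical.

```python
def flag_tagging_snps(loadings, snp_names, tol=1e-9):
    """Flag the SNP(s) with the largest absolute loading on each factor.

    loadings[i][j] is the loading of SNP i on rotated factor j.
    Returns a dict mapping factor number (from 1) to a list of SNP names.
    """
    n_factors = len(loadings[0])
    flagged = {}
    for j in range(n_factors):
        column = [abs(row[j]) for row in loadings]
        top = max(column)
        # Keep every SNP tied (within tol) for the largest loading.
        flagged[j + 1] = [
            name for name, value in zip(snp_names, column) if top - value <= tol
        ]
    return flagged


# Hypothetical rotated loadings for three SNPs on two factors.
loadings = [
    [0.95, 0.10],
    [0.90, 0.20],
    [0.05, 0.98],
]
print(flag_tagging_snps(loadings, ["SNP1", "SNP2", "SNP3"]))
# -> {1: ['SNP1'], 2: ['SNP3']}
```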

Finally, in order to select the minimum subset of SNPs which maximises information, I propose using the effective number of SNPs (Meff) as a guide to choosing the total number of haplotypes to tag [i.e., choose the tagging SNPs which best represent the factors with the Meff highest rotated λs]. For example, the 23 SNPs in the Moffatt data have low intermarker LD, producing an Meff of 22.53, thus indicating that the SNP loading highest on the factor with the smallest λ [i.e., factor 23; λ=0.0483; SNP 7 (hADV14S1A)] provides the least amount of independent information [i.e., it is highly correlated with SNP 9 (hADV14S1C)]. Analogously, the 10 SNPs in the Keavney data have very high intermarker LD, producing an Meff of 4.59, indicating that the 5 SNPs loading highest on the factors with the 5 largest λs [i.e., any one of SNP 7 (g2215a) / SNP 8 (i/d) / SNP 9 (g2350a); SNP 1 (t-5991c); SNP 3 (t-3892c); SNP 6 (t-1237c); and SNP 10 (4656ct)] provide MOST of the independent information of the 10 SNPs.

Alternatively, and perhaps more statistically appealing, you can choose the tagging SNPs which represent the factors explaining a selected proportion of variance. For example, only the last factor explains ≤1% of the variance in the Moffatt data. Analogously, for the Keavney data, the first three factors explain 95% (94.65%) of the variance, while the first four factors explain 99% (98.74%) of the variance, etc.
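The proportion-of-variance rule can be sketched as below: each factor explains λ divided by the sum of all λs, and factors are added from the largest λ downwards until the selected proportion is reached. The function name `factors_for_variance` and the λs are hypothetical (they are not the Keavney or Moffatt values).

```python
def factors_for_variance(eigenvalues, target):
    """Smallest number of factors whose cumulative variance reaches target.

    target is a proportion between 0 and 1 (e.g. 0.95 for 95%).
    """
    total = sum(eigenvalues)
    cumulative = 0.0
    ordered = sorted(eigenvalues, reverse=True)
    for k, lam in enumerate(ordered, start=1):
        cumulative += lam
        if cumulative / total >= target:
            return k
    return len(ordered)


# Hypothetical eigenvalues for 10 SNPs in five factors with non-zero variance.
lambdas = [6.0, 2.5, 1.0, 0.375, 0.125]
print(factors_for_variance(lambdas, 0.90))  # -> 3 (cumulative 60%, 85%, 95%)
print(factors_for_variance(lambdas, 0.97))  # -> 4 (cumulative 98.75%)
```

The tagging SNPs would then be those flagged for the first k factors.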
 

References:

Abecasis GR, Cherny SS, Cookson WO, Cardon LR (2002) Merlin-rapid analysis of dense genetic maps using sparse gene flow trees. Nat Genet 30:97-101

Abecasis GR, Cookson WO (2000) GOLD-graphical overview of linkage disequilibrium. Bioinformatics 16:182-3

Cheverud JM (2001) A simple correction for multiple comparisons in interval mapping genome scans. Heredity 87:52-8

Horne BD, Camp NJ (2004) Principal component analysis for selection of optimal SNP-sets that capture intragenic genetic variation. Genet Epidemiol 26:11-21

Keavney B, McKenzie CA, Connell JM, Julier C, Ratcliffe PJ, Sobel E, Lathrop M, Farrall M (1998) Measured haplotype analysis of the angiotensin-I converting enzyme gene. Hum Mol Genet 7:1745-51

Li J, Ji L (2005) Adjusting multiple testing in multilocus analyses using the eigenvalues of a correlation matrix. Heredity 95:221-227

Meng Z, Zaykin DV, Xu CF, Wagner M, Ehm MG (2003) Selection of genetic markers for association analyses, using linkage disequilibrium and haplotypes. Am J Hum Genet 73:115-30

Moffatt MF, Traherne JA, Abecasis GR, Cookson WO (2000) Single nucleotide polymorphism and linkage disequilibrium within the TCR alpha/delta locus. Hum Mol Genet 9:1011-9

R Development Core Team (2003) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, http://www.R-project.org (accessed March 1, 2004)
 

License agreement:

SNPSpD is coded by Dale R. Nyholt.

Copyright (C) 2004, Dale R. Nyholt.
All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.

2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

3. The names of its contributors may not be used to endorse or promote products derived from this software without specific prior written permission.

4. We also request that use of this software be cited in publications as:
Nyholt DR (2004) A simple correction for multiple testing for SNPs in linkage disequilibrium with each other. Am J Hum Genet 74(4):765-769.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 

Useful Links:

Cartographer - a great tool to build physically-ordered marker maps from STS marker names [also gives deCODE cM position].
 

Page last updated October 12, 2007.

Special thanks to David Smyth (Genepi's IT guru) for assisting with the development of this web interface.
 

The Genetic Epidemiology Laboratory
The Queensland Institute of Medical Research
Tel: +61-7-3362 0258
Email: daleN@qimr.edu.au