Author:
Hahn Georg,Lutz Sharon M.,Hecker Julian,Prokopenko Dmitry,Cho Michael H.,Silverman Edwin K.,Weiss Scott T.,Lange Christoph
Abstract
AbstractThe computation of a similarity measure for genomic data is a standard tool in computational genetics. The principal components of such matrices are routinely used to correct for biases due to confounding by population stratification, for instance in linear regressions. However, the calculation of both a similarity matrix and its singular value decomposition (SVD) are computationally intensive. The contribution of this article is threefold. First, we demonstrate that the calculation of three matrices (called the covariance matrix, the weighted Jaccard matrix, and the genomic relationship matrix) can be reformulated in a unified way which allows for the application of a randomized SVD algorithm, which is faster than the traditional computation. The fast SVD algorithm we present is adapted from an existing randomized SVD algorithm and ensures that all computations are carried out in sparse matrix algebra. The algorithm only assumes that row-wise and column-wise subtraction and multiplication of a vector with a sparse matrix is available, an operation that is efficiently implemented in common sparse matrix packages. An exception is the so-called Jaccard matrix, which does not have a structure applicable for the fast SVD algorithm. Second, an approximate Jaccard matrix is introduced to which the fast SVD computation is applicable. Third, we establish guaranteed theoretical bounds on the accuracy (in $$L_2$$
L
2
norm and angle) between the principal components of the Jaccard matrix and the ones of our proposed approximation, thus putting the proposed Jaccard approximation on a solid mathematical foundation, and derive the theoretical runtime of our algorithm. We illustrate that the approximation error is low in practice and empirically verify the theoretical runtime scalings on both simulated data and data of the 1000 Genome Project.
Funder
Cure Alzheimer's Fund
National Institutes of Health
National Science Foundation
NIH Center grant
Publisher
Springer Science and Business Media LLC
Reference30 articles.
1. Abraham G, Inouye M. Fast principal component analysis of large-scale genome-wide data. PLoS ONE. 2014;9(4):e93766.
2. Bates D, Maechler M, Jagan M, Davis TA, Oehlschlägel J, and Riedy J. Matrix: sparse and dense matrix classes and methods, 2023. R-package version 1.5-4.1: https://cran.r-project.org/package=Matrix.
3. Campbell CD, Ogburn EL, Lunetta KL, Lyon HN, Freedman ML, Groop LC, Altshuler D, Ardlie KG, Hirschhorn JN. Demonstrating stratification in a European American population. Nat Genet. 2005;37(8):868–72.
4. Davis C, Kahan WM. The rotation of eigenvectors by a perturbation. III SIAM J Numer Anal. 1970;7(1):1–46.
5. Epstein MP, Allen AS, Satten GA. A simple and improved correction for population stratification in case-control studies. Am J Hum Genet. 2007;80(5):921–30.