Integrative analysis of individual-level data and high-dimensional summary statistics

Published:2023-03-25 Issue:4 Volume:39
ISSN:1367-4811
Container-title:Bioinformatics
language:en
Author:

Fu Sheng¹, Deng Lu², Zhang Han¹, Wheeler William³, Qin Jing⁴, Yu Kai¹

Affiliation:

1. Division of Cancer Epidemiology and Genetics, National Cancer Institute, Bethesda, MD 20892, USA

2. School of Statistics and Data Science, Nankai University, Tianjin 300071, China

3. Information Management Services, Inc, Bethesda, MD 20892, USA

4. National Institute of Allergy and Infectious Diseases, National Institutes of Health, Bethesda, MD 20892, USA

Abstract

Motivation: Researchers usually conduct statistical analyses based on models built on raw data collected from individual participants (individual-level data). There is a growing interest in enhancing inference efficiency by incorporating aggregated summary information from other sources, such as summary statistics on genetic markers’ marginal associations with a given trait generated from genome-wide association studies. However, combining high-dimensional summary data with individual-level data using existing integrative procedures can be challenging due to various numeric issues in optimizing an objective function over a large number of unknown parameters.

Results: We develop a procedure to improve the fitting of a targeted statistical model by leveraging external summary data for more efficient statistical inference (both effect estimation and hypothesis testing). To make this procedure scalable to high-dimensional summary data, we propose a divide-and-conquer strategy by breaking the task into easier parallel jobs, each fitting the targeted model by integrating the individual-level data with a small proportion of summary data. We obtain the final estimates of model parameters by pooling results from multiple fitted models through the minimum distance estimation procedure. We improve the procedure for a general class of additive models commonly encountered in genetic studies. We further expand these two approaches to integrate individual-level and high-dimensional summary data from different study populations. We demonstrate the advantage of the proposed methods through simulations and an application to the study of the effect on pancreatic cancer risk by the polygenic risk score defined by BMI-associated genetic markers.

Availability and implementation: R package is available at https://github.com/fushengstat/MetaGIM.
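The pooling step described in the abstract (combining estimates from several parallel fits through minimum distance estimation) can be illustrated with a minimal sketch. This is not the MetaGIM package's API: the function name `pool_minimum_distance`, its arguments, and the assumption that the joint covariance of the stacked estimates is already available are all hypothetical choices made for illustration. The sketch uses the generic minimum distance (generalized least squares) form, in which K estimates of the same parameter vector are stacked and the pooled value minimizes their covariance-weighted distance to a common vector.

```python
import numpy as np

def pool_minimum_distance(estimates, joint_cov):
    """Pool K estimates of the same p-vector by minimum distance estimation.

    estimates: array of shape (K, p), one row per parallel job (hypothetical input).
    joint_cov: array of shape (K*p, K*p), covariance of the stacked estimates.
               Assumed known here; its off-diagonal blocks would carry the
               correlation that arises when jobs reuse the same individual-level data.
    """
    estimates = np.asarray(estimates, dtype=float)
    joint_cov = np.asarray(joint_cov, dtype=float)
    K, p = estimates.shape
    stacked = estimates.reshape(K * p)        # job 1 first, then job 2, ...
    A = np.tile(np.eye(p), (K, 1))            # maps the common theta to the stacked mean
    Vinv_A = np.linalg.solve(joint_cov, A)    # V^{-1} A without forming V^{-1}
    info = A.T @ Vinv_A                       # A' V^{-1} A
    pooled = np.linalg.solve(info, Vinv_A.T @ stacked)
    pooled_cov = np.linalg.inv(info)          # covariance of the pooled estimate
    return pooled, pooled_cov

# Two jobs estimating one scalar parameter, treated as independent for simplicity.
est = np.array([[1.0], [3.0]])
V = np.diag([1.0, 3.0])
theta, theta_cov = pool_minimum_distance(est, V)
```

When the jobs are independent, as in the toy call above, this reduces to ordinary inverse-variance weighting; supplying a non-diagonal `joint_cov` accounts for correlated estimates.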

Funder

National Cancer Institute, Division of Cancer Epidemiology and Genetics

Fundamental Research Funds for the Central Universities

Publisher

Oxford University Press (OUP)

Subject

Computational Mathematics,Computational Theory and Mathematics,Computer Science Applications,Molecular Biology,Biochemistry,Statistics and Probability

Link

https://academic.oup.com/bioinformatics/advance-article-pdf/doi/10.1093/bioinformatics/btad156/49631362/btad156.pdf

Reference: 33 articles.

1. Genome-wide association study identifies variants in the ABO locus associated with susceptibility to pancreatic cancer;Amundadottir;Nat Genet,2009

2. LD score regression distinguishes confounding from polygenicity in genome-wide association studies;Bulik-Sullivan;Nat Genet,2015

3. The UK biobank resource with deep phenotyping and genomic data;Bycroft;Nature,2018

4. Constrained maximum likelihood estimation for model calibration using summary-level information from external big data sources;Chatterjee;J Am Stat Assoc,2016

5. Generalised linear models incorporating population level information: An empirical likelihood based approach;Chaudhuri;J R Stat Soc Series B Stat Methodol,2008

Cited by 3 articles.

1. The goldmine of GWAS summary statistics: a systematic review of methods and tools;BioData Mining;2024-09-05

2. Improve the model of disease subtype heterogeneity by leveraging external summary data;PLOS Computational Biology;2023-07-12

3. Correction to: Integrative analysis of individual-level data and high-dimensional summary statistics;Bioinformatics;2023-05-01