Identifying correlations driven by influential observations in large datasets

Author:

Bu Kevin (1), Wallach David S (1), Wilson Zach (1), Shen Nan (1), Segal Leopoldo N (2), Bagiella Emilia (3), Clemente Jose C (1, 4)

Affiliation:

1. Department of Genetics and Data Science, Icahn School of Medicine at Mount Sinai, New York, NY, USA

2. Division of Pulmonary, Critical Care and Sleep Medicine, Department of Medicine, New York University School of Medicine, New York, NY, USA

3. Department of Population Health Science and Policy, Icahn School of Medicine at Mount Sinai, New York, NY, USA

4. Immunology Institute, Icahn School of Medicine at Mount Sinai, New York, NY, USA

Abstract

Although high-throughput data allow researchers to interrogate thousands of variables simultaneously, they can also introduce a significant number of spurious results. Here we demonstrate that correlation analysis of large datasets can yield numerous false positives due to the presence of outliers that canonical methods fail to identify. We present Correlations Under The InfluencE (CUTIE), an open-source jackknifing-based method that detects such cases with both parametric and non-parametric correlation measures, and that can also uniquely rescue correlations that were not originally deemed significant or that had the incorrect sign. Our approach can additionally be used to identify variables or samples that induce a high proportion of these false correlations. A meta-analysis of various omics datasets using CUTIE reveals that this issue is pervasive across different domains, although microbiome data are particularly susceptible to it. Although the significance of a correlation ultimately depends on the thresholds used, our approach provides an efficient way to automatically identify correlations that warrant closer examination in very large datasets.
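The jackknifing idea described in the abstract can be illustrated with a minimal leave-one-out sketch. This is not CUTIE's actual implementation or API: the function names (`pearson_r`, `approx_p_value`, `jackknife_flag`), the 0.05 threshold, and the use of a Fisher z-transform normal approximation for the p-value are all assumptions made here for illustration. The sketch flags a correlation as potentially driven by one influential observation when it is significant on the full data but loses significance after removing a single sample.

```python
import math
from statistics import NormalDist


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant variable carries no correlation signal
    return sxy / math.sqrt(sxx * syy)


def approx_p_value(r, n):
    """Two-sided p-value via the Fisher z approximation (an assumption
    of this sketch, not necessarily the test used by CUTIE)."""
    if n <= 3:
        return 1.0
    r = max(min(r, 0.999999), -0.999999)  # keep atanh finite
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2 * (1 - NormalDist().cdf(abs(z)))


def jackknife_flag(x, y, alpha=0.05):
    """Recompute the correlation with each sample left out in turn and
    flag the pair if any single removal destroys significance."""
    n = len(x)
    full_p = approx_p_value(pearson_r(x, y), n)
    loo_p = []
    for i in range(n):
        xi = x[:i] + x[i + 1:]
        yi = y[:i] + y[i + 1:]
        loo_p.append(approx_p_value(pearson_r(xi, yi), n - 1))
    worst = max(loo_p)
    return {
        "full_p": full_p,
        "max_loo_p": worst,
        "influential_index": loo_p.index(worst),
        "flagged": full_p < alpha and worst >= alpha,
    }


# Toy data (hypothetical): the last point is an outlier in both variables.
x_outlier = [1, 2, 3, 4, 5, 6, 7, 8, 9, 50]
y_outlier = [2, 1, 3, 2, 1, 3, 2, 1, 2, 40]
result_outlier = jackknife_flag(x_outlier, y_outlier)

# Toy data (hypothetical): a genuine linear trend with small noise.
x_clean = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y_clean = [2.1, 3.9, 6.2, 8.0, 9.8, 12.1, 14.2, 15.9, 18.1, 20.0]
result_clean = jackknife_flag(x_clean, y_clean)
```

In the first toy dataset the correlation is significant only because of the final sample, so the pair is flagged and that sample is reported as the influential one; in the second, no single removal changes the conclusion. Applying the same check to every variable pair is what makes this kind of screen practical for very large datasets.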

Funder

National Institutes of Health

Publisher

Oxford University Press (OUP)

Subject

Molecular Biology, Information Systems
