Adjacency-constrained hierarchical clustering of a band similarity matrix with application to genomics

Author:

Ambroise Christophe, Dehman Alia, Neuvial Pierre, Rigaill Guillem, Vialaneix Nathalie

Abstract

Background: Genomic data analyses such as Genome-Wide Association Studies (GWAS) or Hi-C studies are often faced with the problem of partitioning chromosomes into successive regions based on a similarity matrix of high-resolution, locus-level measurements. An intuitive way of doing this is to perform a modified Hierarchical Agglomerative Clustering (HAC), where only adjacent clusters (according to the ordering of positions within a chromosome) are allowed to be merged. However, a major practical drawback of this method is its quadratic time and space complexity in the number of loci, which is typically of the order of $10^4$ to $10^5$ for each chromosome.

Results: By assuming that the similarity between physically distant objects is negligible, we are able to propose an implementation of adjacency-constrained HAC with quasi-linear complexity. This is achieved by pre-calculating specific sums of similarities, and storing candidate fusions in a min-heap. Our illustrations on GWAS and Hi-C datasets demonstrate the relevance of this assumption, and show that this method highlights biologically meaningful signals. Thanks to its small time and memory footprint, the method can be run on a standard laptop in minutes or even seconds.

Availability and implementation: Software and sample data are available as an R package, adjclust, that can be downloaded from the Comprehensive R Archive Network (CRAN).
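The min-heap idea mentioned in the Results can be illustrated with a minimal Python sketch. This is not the adjclust implementation: it assumes one numeric value per locus and a Ward-type fusion cost computed from running cluster sums, whereas the paper works on a band similarity matrix with pre-calculated sums of similarities. The function name `adjacency_constrained_hac` and all variable names are hypothetical.

```python
import heapq


def adjacency_constrained_hac(values):
    """Merge only neighbouring clusters, cheapest fusion first.

    Assumption: 1-D values and a Ward-type cost; the paper instead
    uses a band similarity matrix. Returns a list of (a, b, cost).
    """
    n = len(values)
    size = {i: 1 for i in range(n)}                 # cluster sizes
    total = {i: float(values[i]) for i in range(n)}  # cluster sums
    left = {i: (i - 1 if i > 0 else None) for i in range(n)}
    right = {i: (i + 1 if i < n - 1 else None) for i in range(n)}
    alive = set(range(n))

    def ward(a, b):
        # Increase in within-cluster sum of squares if a and b merge.
        mean_a = total[a] / size[a]
        mean_b = total[b] / size[b]
        return size[a] * size[b] / (size[a] + size[b]) * (mean_a - mean_b) ** 2

    # Candidate fusions between neighbouring singletons, kept in a min-heap.
    heap = [(ward(i, i + 1), i, i + 1) for i in range(n - 1)]
    heapq.heapify(heap)

    merges = []
    next_id = n
    while heap:
        cost, a, b = heapq.heappop(heap)
        if a not in alive or b not in alive:
            continue  # stale candidate: one side was already merged
        new = next_id
        next_id += 1
        size[new] = size[a] + size[b]
        total[new] = total[a] + total[b]
        l, r = left[a], right[b]
        left[new], right[new] = l, r
        if l is not None:
            right[l] = new
        if r is not None:
            left[r] = new
        alive.discard(a)
        alive.discard(b)
        alive.add(new)
        merges.append((a, b, cost))
        # Only the two fusions involving the new cluster need (re)computing.
        if l is not None:
            heapq.heappush(heap, (ward(l, new), l, new))
        if r is not None:
            heapq.heappush(heap, (ward(new, r), new, r))
    return merges


merges = adjacency_constrained_hac([0.0, 1.0, 10.0, 12.0])
```

Each merge creates at most two new candidate fusions, and outdated candidates are discarded lazily when popped, so the heap never has to be searched or rebuilt. Clusters are numbered in order of creation: ids 0 to n-1 are the original loci, and each merge receives the next free id.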

Publisher

Springer Science and Business Media LLC

Subject

Applied Mathematics, Computational Theory and Mathematics, Molecular Biology, Structural Biology


Cited by 20 articles.
