Author:
Luo Xizhi, Qin Fei, Cai Guoshuai, Xiao Feifei
Abstract
Copy number variation plays an important role in complex human diseases. Detecting copy number variants (CNVs) amounts to identifying mean shifts in genetic intensities in order to locate chromosomal breakpoints, a step referred to as chromosomal segmentation. Many segmentation algorithms have been developed under the strong assumption that observations at genetic loci are independent and that each locus has an equal chance of being a breakpoint (i.e., a CNV boundary). However, this assumption is violated from a genetic perspective because genomic positions are correlated, for example through linkage disequilibrium (LD). Our study showed that LD structure is related to the location distribution of CNVs, which indeed follows a non-random pattern on the genome. To generate more accurate CNV calls, we therefore propose a novel algorithm, LDcnv, that models CNV data together with its biological characteristics relating to genetic correlation (i.e., LD). To evaluate the performance of LDcnv, we conducted extensive simulations and analyzed large-scale HapMap datasets. We showed that LDcnv achieves high accuracy, stability and robustness in CNV detection, and higher precision in detecting short CNVs than existing methods. We also demonstrated theoretically the correlation structure of CNV data, which further supports the need to integrate biological structure into statistical methods for CNV detection. This new segmentation algorithm is widely applicable to next-generation sequencing and single-cell sequencing data analysis.
Author Summary
Copy number variants (CNVs) are gains or losses of DNA segments relative to a reference genome. CNVs have garnered extensive interest in recent years because they play an important role in susceptibility to disorders and diseases such as autism, schizophrenia and cancer [1-7].
Although innovation in modern technology is promoting discoveries related to CNVs, the methodology for CNV detection is still lagging, which limits novel discoveries regarding the role of CNVs in complex diseases. In this study, we propose a novel segmentation algorithm, LDcnv, to accurately locate the breakpoints, or boundaries, of CNVs in the human genome. Instead of assuming that signal intensities are independent, as traditional segmentation algorithms do, LDcnv models the correlation structure of the genome within a change-point CNV detection model, which allows for accurate and fast computation in a whole-genome scan. Our study showed strong theoretical evidence that a correlation structure exists in real CNV data, and we believe that taking this evidence into consideration will improve the power of CNV detection. Extensive simulation studies demonstrated the advantage of the LDcnv algorithm over existing methods in stability, robustness and accuracy. We also used high-quality CNV profiles to further support the superior performance of the LDcnv algorithm over existing methods. The development of the LDcnv algorithm provides insights for new directions in developing CNV detection tools.
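To make the notion of "identifying mean shift ... to locate chromosomal breakpoints" concrete, the following is a minimal Python sketch of a single change-point scan. It is not the LDcnv algorithm: it illustrates only the traditional baseline that treats loci as independent with equal variance, which is the assumption the abstract calls into question. The function name `best_breakpoint`, the toy intensities, and the noise level are all hypothetical and chosen for illustration.

```python
import math
import random


def best_breakpoint(intensities):
    """Return (k, stat) for the split index k that maximizes the scaled
    difference in means between intensities[:k] and intensities[k:]."""
    n = len(intensities)
    if n < 2:
        raise ValueError("need at least two observations")
    total = sum(intensities)
    best_k, best_stat = 1, -1.0
    left_sum = 0.0
    for k in range(1, n):
        left_sum += intensities[k - 1]
        mean_left = left_sum / k
        mean_right = (total - left_sum) / (n - k)
        # Mean-difference statistic under the assumption of independent,
        # equal-variance loci (no LD modelling, unlike LDcnv).
        stat = abs(mean_left - mean_right) * math.sqrt(k * (n - k) / n)
        if stat > best_stat:
            best_k, best_stat = k, stat
    return best_k, best_stat


# Toy data (assumed): 60 loci at baseline, then 40 loci with a raised mean.
rng = random.Random(0)
signal = [rng.gauss(0.0, 0.2) for _ in range(60)]
signal += [rng.gauss(1.0, 0.2) for _ in range(40)]

k, stat = best_breakpoint(signal)
print("estimated breakpoint index:", k)
```

A method in the spirit described above would replace the independence-based statistic with one that accounts for correlation between neighbouring loci; how LDcnv does this is not specified in this record.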
Publisher
Cold Spring Harbor Laboratory