XSI—a genotype compression tool for compressive genomics in large biobanks


Wertenbroek Rick (1,2), Rubinacci Simone (2), Xenarios Ioannis (2), Thoma Yann (1), Delaneau Olivier (2)


1. School of Management and Engineering Vaud (HEIG-VD), HES-SO University of Applied Sciences and Arts Western Switzerland, Yverdon-les-Bains 1401, Switzerland

2. Department of Computational Biology, University of Lausanne, Lausanne 1015, Switzerland


Abstract

Motivation: Generation of genotype data has been growing exponentially over the last decade. The large size of recent datasets brings a storage and computational burden with ever-increasing costs. To reduce this burden, we propose XSI, a file format with a reduced storage footprint that also allows computation on the compressed data, and we show how this can improve future analyses.

Results: We show that xSqueezeIt (XSI) reduces file size by 4-20× compared with compressed BCF, and we demonstrate its potential for 'compressive genomics' on the UK Biobank whole-genome sequencing genotypes, with 8× faster loading times, 5× faster run-of-homozygosity computation, 30× faster dot-product computation and 280× faster allele counts.

Availability and implementation: The XSI file format specifications, API and command line tool are released under an open-source (MIT) license and are available at https://github.com/rwk-unil/xSqueezeIt

Supplementary information: Supplementary data are available at Bioinformatics online.
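To make the idea of computing on compressed data concrete, here is a minimal Python sketch. It does not reproduce the XSI format or its API: the toy encoding (one list of alternate-allele carrier indices per variant) and the names `allele_count`, `dot_product` and `to_dense` are assumptions made purely for illustration. It only shows why allele counts and dot products can be obtained without first expanding a variant to a dense genotype vector.

```python
from typing import List

# Hypothetical toy encoding (NOT the real XSI layout): a variant is stored
# as the sorted indices of the haplotypes that carry the alternate allele.
SparseVariant = List[int]


def allele_count(carriers: SparseVariant) -> int:
    """Alternate allele count: simply the number of stored carrier indices."""
    return len(carriers)


def dot_product(carriers: SparseVariant, weights: List[float]) -> float:
    """Dot product of the implied 0/1 haplotype vector with `weights`.

    Only the carrier positions are visited, so the cost scales with the
    number of carriers rather than with the number of haplotypes.
    """
    return sum(weights[i] for i in carriers)


def to_dense(carriers: SparseVariant, n_haplotypes: int) -> List[int]:
    """Expand to a dense 0/1 vector (used here only to check the results)."""
    dense = [0] * n_haplotypes
    for i in carriers:
        dense[i] = 1
    return dense


if __name__ == "__main__":
    carriers = [1, 4, 5]  # three carriers among eight haplotypes
    weights = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]

    dense = to_dense(carriers, len(weights))
    # The sparse results agree with the dense reference computation.
    assert allele_count(carriers) == sum(dense)
    assert dot_product(carriers, weights) == sum(d * w for d, w in zip(dense, weights))
    print(allele_count(carriers), dot_product(carriers, weights))
```

For rare variants, which dominate large sequencing cohorts, the carrier list is far shorter than the dense vector, which is the intuition behind operating on the compressed form. The speedups quoted in the abstract come from the authors' own implementation, not from this sketch.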


School of Management and Engineering Vaud


Oxford University Press (OUP)


Computational Mathematics, Computational Theory and Mathematics, Computer Science Applications, Molecular Biology, Biochemistry, Statistics and Probability








