Abstract
Motivation: Recent advancements in sequencing technologies have led to the discovery of numerous variants in the human genome. However, understanding their precise roles in diseases remains challenging due to their complex functional mechanisms. Various methodologies have emerged to predict the pathogenic significance of these genetic variants. Typically, these methods employ an integrative approach, leveraging diverse data sources that provide critical insights into genomic function. Despite the abundance of publicly available data sources and databases, the process of navigating, extracting, and pre-processing features for machine learning models can be daunting. Furthermore, researchers often invest substantial effort in feature extraction, only to later discover that these features lack informativeness.
Results: In this paper, we present DrivR-Base, an innovative resource that efficiently extracts and integrates molecular information (features) for single nucleotide variants from a wide range of databases and tools, including AlphaFold, ENCODE, and Variant Effect Predictor. The resulting features can be used as input for machine learning models designed to predict the pathogenic impact of human genome variants in disease. Moreover, these feature sets have applications beyond this, including haploinsufficiency prediction and the development of drug repurposing tools. We describe the resource’s development, practical applications, and potential for future expansion and enhancement.
Availability and Implementation: DrivR-Base source code is available at https://github.com/amyfrancis97/DrivR-Base.
Publisher
Cold Spring Harbor Laboratory