A pipeline to further enhance quality, integrity and reusability of the NCCID clinical data
Published:2023-07-27
Issue:1
Volume:10
ISSN:2052-4463
Container-title:Scientific Data
language:en
Short-container-title:Sci Data
Author:
Breger Anna, Selby Ian, Roberts Michael, Babar Judith, Gkrania-Klotsas Effrossyni, Preller Jacobus, Escudero Sánchez Lorena, Dittmer Sören, Thorpe Matthew, Gilbey Julian, Korhonen Anna, Jefferson Emily, Langs Georg, Yang Guang, Xing Xiaodan, Nan Yang, Li Ming, Prosch Helmut, Stanczuk Jan, Tang Jing, Teare Philip, Patel Mishal, Wassink Marcel, Holzer Markus, Solares Eduardo González, Walton Nicholas, Liò Pietro, Shadbahr Tolou, Rudd James H. F., Aston John A. D., Weir-McCall Jonathan R., Sala Evis, Schönlieb Carola-Bibiane
Abstract
The National COVID-19 Chest Imaging Database (NCCID) is a centralized UK database of thoracic imaging and corresponding clinical data. It is made available by the National Health Service Artificial Intelligence (NHS AI) Lab to support the development of machine learning tools focused on Coronavirus Disease 2019 (COVID-19). A bespoke cleaning pipeline for NCCID, developed by NHSx, was introduced in 2021. We present an extension to the original cleaning pipeline for the clinical data of the database. It has been adjusted to correct additional systematic inconsistencies in the raw data, such as those in patient sex, oxygen levels and date values. This paper discusses the most important changes, while the code and further explanations are publicly available on GitLab. The suggested cleaning allows users worldwide to work with more consistent data when developing machine learning tools, without having to be experts. In addition, the paper highlights some of the challenges of working with clinical multi-center data and includes recommendations for similar future initiatives.
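As a rough illustration of the kind of correction the abstract describes, the following minimal Python sketch normalises a sex field, a date field and an oxygen value. It is not the published NCCID pipeline: the function names, the accepted spellings, the date formats, the day-first reading of slash-separated dates, and the reading of "oxygen levels" as an oxygen saturation percentage (with fractions rescaled to percent) are all assumptions made for this example.

```python
from datetime import date, datetime
from typing import Optional

# Hypothetical mapping of raw spellings to canonical codes (not the NCCID schema).
SEX_MAP = {"m": "M", "male": "M", "f": "F", "female": "F"}

# Hypothetical set of date formats; slash-separated dates are assumed day-first.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d-%m-%Y")


def normalise_sex(raw: Optional[str]) -> Optional[str]:
    """Map free-text sex entries to 'M' or 'F'; anything else becomes None."""
    if raw is None:
        return None
    return SEX_MAP.get(raw.strip().lower())


def parse_date(raw: Optional[str]) -> Optional[date]:
    """Try each assumed format in turn; unparseable values become None."""
    if not raw:
        return None
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date()
        except ValueError:
            continue
    return None


def clean_oxygen_saturation(raw) -> Optional[float]:
    """Return a saturation in percent, or None if the value is implausible.

    Assumed rule: values in (0, 1] were entered as fractions and are rescaled.
    """
    try:
        value = float(raw)
    except (TypeError, ValueError):
        return None
    if 0 < value <= 1:
        value *= 100
    return value if 0 < value <= 100 else None


def clean_record(record: dict) -> dict:
    """Apply the three cleaners to a record with hypothetical field names."""
    return {
        "sex": normalise_sex(record.get("sex")),
        "date_of_admission": parse_date(record.get("date_of_admission")),
        "oxygen_saturation": clean_oxygen_saturation(record.get("oxygen_saturation")),
    }
```

Returning None for values that cannot be interpreted, rather than guessing, keeps uncertain entries visible to downstream users; a real multi-center pipeline would likely need site-specific rules on top of this.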
Publisher
Springer Science and Business Media LLC
Subject
Library and Information Sciences; Statistics, Probability and Uncertainty; Computer Science Applications; Education; Information Systems; Statistics and Probability
Cited by
2 articles.