Performance deterioration of deep learning models after clinical deployment: a case study with auto-segmentation for definitive prostate cancer radiotherapy

Published:2024-06-01 Issue:2 Volume:5 Page:025077
ISSN:2632-2153
Container-title:Machine Learning: Science and Technology
Short-container-title:Mach. Learn.: Sci. Technol.

Author:

Wang Biling, Dohopolski Michael, Bai Ti, Wu Junjie, Hannan Raquibul, Desai Neil, Garant Aurelie, Yang Daniel, Nguyen Dan, Lin Mu-Han, Timmerman Robert, Wang Xinlei, Jiang Steve B

Abstract

Our study aims to explore the long-term performance patterns of deep learning (DL) models deployed in the clinic and to investigate their efficacy in relation to evolving clinical practices. We conducted a retrospective study simulating the clinical implementation of our DL model, involving 1328 prostate cancer patients treated between January 2006 and August 2022. We trained and validated a U-Net-based auto-segmentation model on data from 2006 to 2011 and tested it on data from 2012 to 2022, simulating the model’s clinical deployment starting in 2012. We visualized trends in model performance using exponentially weighted moving average (EMA) curves. Additionally, we performed the Wilcoxon rank-sum test and multiple linear regression to investigate Dice similarity coefficient (DSC) variations across distinct periods and the impact of clinical factors, respectively. Initially, from 2012 to 2014, the model showed high performance in segmenting the prostate, rectum, and bladder. Post-2015, a notable decline in EMA DSC was observed for the prostate and rectum, while bladder contours remained stable. Key factors impacting prostate contour quality included physician contouring styles (p < 0.0001), the use of various hydrogel spacers (p < 0.0001), CT scan slice thickness (p = 0.0085), MRI-guided contouring (p = 0.0012), and intravenous (IV) contrast (p < 0.0001). Rectum contour quality was notably influenced by slice thickness, physician contouring styles, and the use of various hydrogel spacers. Bladder contour quality was primarily affected by IV contrast. The deployed DL model exhibited a substantial decline in performance over time, aligning with the evolving clinical settings.
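
The two quantities the abstract tracks, the per-patient DSC and its EMA over time, can be illustrated with a minimal sketch. This is not the authors' code: the function names, the smoothing factor `alpha`, the choice to seed the EMA with the first observation, and the convention for two empty masks are all assumptions made for illustration, since the abstract does not specify them.

```python
from typing import Iterable, List, Sequence


def dice_coefficient(pred: Sequence[int], truth: Sequence[int]) -> float:
    """DSC = 2|A & B| / (|A| + |B|) for two flattened binary masks."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    # Assumed convention: two empty masks count as perfect agreement.
    return 1.0 if total == 0 else 2.0 * intersection / total


def ema(values: Iterable[float], alpha: float = 0.1) -> List[float]:
    """EMA with s_0 = x_0 and s_t = alpha * x_t + (1 - alpha) * s_{t-1}."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    smoothed: List[float] = []
    for x in values:
        # Seed the average with the first observation (an assumption).
        prev = smoothed[-1] if smoothed else x
        smoothed.append(alpha * x + (1.0 - alpha) * prev)
    return smoothed


if __name__ == "__main__":
    # Hypothetical DSC values ordered by treatment date.
    dsc_by_date = [0.90, 0.90, 0.50]
    print(ema(dsc_by_date, alpha=0.5))
```

A smaller `alpha` gives a smoother curve that reacts more slowly to a change in practice, so the apparent onset of a decline depends on this choice.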

Funder

NIH - NCI - National Cancer Institute

Publisher

IOP Publishing

Link

https://iopscience.iop.org/article/10.1088/2632-2153/ad580f/pdf
