Limitations of GPT-3.5 and GPT-4 in Applying Fleischner Society Guidelines to Incidental Lung Nodules

Authors:

Joel L. Gamble1 (ORCID), Duncan Ferguson1, Joanna Yuen1, Adnan Sheikh1

Affiliation:

1. Department of Radiology, University of British Columbia, Vancouver, BC, Canada

Abstract

Purpose: To evaluate the accuracy of GPT-3.5, GPT-4, and a fine-tuned GPT-3.5 model in applying Fleischner Society recommendations to incidental lung nodules.

Methods: We generated 10 lung nodule descriptions for each of the 12 nodule categories in the Fleischner Society guidelines, incorporating each into a fictitious report (n = 120). GPT-3.5 and GPT-4 were prompted to make follow-up recommendations based on these reports. We then incorporated the full guideline text into the prompts and re-submitted them. Finally, we submitted the original prompts to a fine-tuned GPT-3.5 model. Results were analyzed using binary accuracy analysis in R.

Results: GPT-3.5 accuracy in applying the Fleischner Society guidelines was 0.058 (95% CI: 0.02, 0.12). GPT-4 was more accurate, at 0.15 (95% CI: 0.09, 0.23; P = .02 for the accuracy comparison). In recommending PET-CT and/or biopsy, both GPT-3.5 and GPT-4 had an F-score of 0.00. After the Fleischner Society guidelines were explicitly included in the prompt, accuracy improved significantly, to 0.42 (95% CI: 0.33, 0.51; P < .001) for GPT-3.5 and 0.66 (95% CI: 0.57, 0.74; P < .001) for GPT-4; GPT-4 remained significantly more accurate than GPT-3.5 (P < .001). The fine-tuned GPT-3.5 model had an accuracy of 0.46 (95% CI: 0.37, 0.55), not significantly different from GPT-3.5 with the guidelines included in the prompt (P = .53).

Conclusion: GPT-3.5 and GPT-4 performed poorly in applying widely known guidelines and never correctly recommended biopsy. Both flawed knowledge and flawed reasoning contributed to their poor performance. Although GPT-4 was more accurate than GPT-3.5, its error rate remained unacceptable for clinical practice. These results underscore the limitations of large language models for knowledge- and reasoning-based tasks.
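For illustration only, the following is a minimal R sketch of the kind of binary accuracy analysis described above: an exact binomial 95% confidence interval for one model's accuracy and a simple two-proportion comparison between models. The counts 7/120 and 18/120 are assumptions inferred from the reported proportions (0.058 and 0.15 over 120 prompts), and the authors' exact statistical procedure is not specified here.

    # Assumed counts: 7 of 120 correct (GPT-3.5) and 18 of 120 correct (GPT-4)
    gpt35_correct <- 7
    gpt4_correct  <- 18
    n_prompts     <- 120

    # Exact binomial 95% confidence interval for GPT-3.5 accuracy
    binom.test(gpt35_correct, n_prompts)$conf.int

    # Unpaired two-proportion comparison of the two models' accuracies
    # (a paired design could instead use McNemar's test on per-prompt results)
    prop.test(c(gpt35_correct, gpt4_correct), c(n_prompts, n_prompts))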

Publisher

SAGE Publications

Subject

Radiology, Nuclear Medicine and Imaging; General Medicine

