ChatGPT-4 Surpasses Residents: A Study of Artificial Intelligence Competency in Plastic Surgery In-service Examinations and Its Advancements from ChatGPT-3.5

Published: 2024-09 Issue: 9 Volume: 12 Page: e6136
ISSN: 2169-7574
Container-title: Plastic and Reconstructive Surgery - Global Open
Language: en

Author:

Hubany Shannon S.¹,², Scala Fernanda D.², Hashemi Kiana¹,², Kapoor Saumya¹,², Fedorova Julia R.¹,², Vaccaro Matthew J.¹,², Ridout Rees P.¹,², Hedman Casey C.¹,², Kellogg Brian C.², Leto Barone Angelo A.²

Affiliation:

1. University of Central Florida College of Medicine, Orlando, Fla.

2. Division of Craniofacial and Pediatric Plastic Surgery, Nemours Children’s Hospital, Orlando, Fla.

Abstract

Background: ChatGPT, launched in 2022 and updated to Generative Pre-trained Transformer 4 (GPT-4) in 2023, is a large language model trained on extensive data, including medical information. This study compares ChatGPT-4’s performance on the Plastic Surgery In-service Examinations with that of medical residents nationally and with that of its earlier version, ChatGPT-3.5.

Methods: This study reviewed 1500 questions from the Plastic Surgery In-service Examinations from 2018 to 2023. After excluding image-based, unscored, and inconclusive questions, 1292 were analyzed. The question stem and each multiple-choice answer were entered verbatim into ChatGPT-4.

Results: ChatGPT-4 correctly answered 961 (74.4%) of the included questions. By section, performance was best in core surgical principles (79.1% correct) and lowest in craniomaxillofacial (69.1%). ChatGPT-4 ranked between the 61st and 97th percentiles compared with all residents. ChatGPT-4 significantly outperformed ChatGPT-3.5 on the 2018–2022 examinations (P < 0.001): ChatGPT-3.5 averaged 55.5% correct, whereas ChatGPT-4 averaged 74%, a mean difference of 18.54 percentage points. In 2021, ChatGPT-3.5 ranked in the 23rd percentile of all residents, whereas ChatGPT-4 ranked in the 97th percentile. ChatGPT-4 outperformed 80.7% of residents on average and scored above the 97th percentile among first-year residents. Its performance was comparable with that of sixth-year integrated residents, ranking in the 55.7th percentile on average. These results show significant improvements in ChatGPT-4’s application of medical knowledge within six months of ChatGPT-3.5’s release.

Conclusion: This study reveals ChatGPT’s rapid development, advancing from a first-year medical resident’s level to surpassing independent residents and matching a sixth-year resident’s proficiency.
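The headline figures in the abstract can be checked with a few lines of arithmetic. The Python sketch below is illustrative only: the variable and function names are not from the study, and the gap is recomputed from the rounded averages quoted above, so it will not exactly reproduce the reported 18.54 (which presumably comes from unrounded scores not given here).

```python
# Figures quoted in the abstract; names are illustrative, not the authors'.
TOTAL_REVIEWED = 1500   # questions reviewed, 2018 to 2023
INCLUDED = 1292         # questions left after exclusions
CORRECT_GPT4 = 961      # included questions ChatGPT-4 answered correctly


def percent(part: int, whole: int) -> float:
    """Return part/whole expressed as a percentage."""
    return 100.0 * part / whole


excluded = TOTAL_REVIEWED - INCLUDED
accuracy_gpt4 = percent(CORRECT_GPT4, INCLUDED)

# Rounded averages for the 2018-2022 examinations, as reported.
# Assumption: the difference of rounded values only approximates
# the reported mean difference.
gap_from_rounded = 74.0 - 55.5

print(f"Excluded questions: {excluded}")
print(f"ChatGPT-4 accuracy: {accuracy_gpt4:.1f}%")
print(f"Gap from rounded averages: {gap_from_rounded:.1f} percentage points")
```

The gap is expressed in percentage points because it is a difference between two percentages, not a relative change.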

Publisher

Ovid Technologies (Wolters Kluwer Health)
