An Exploratory Analysis of ChatGPT Compared to Human Performance With the Anesthesiology Oral Board Examination: Initial Insights and Implications

Authors:

Blacker Samuel N.1, Chen Fei1, Winecoff Daniel2, Antonio Benjamin L.1, Arora Harendra3, Hierlmeier Bryan J.3, Kacmar Rachel M.4, Passannante Anthony N.1, Plunkett Anthony R.5, Zvara David1, Cobb Benjamin1, Doyal Alexander1, Rosenkrans Daniel1, Brown Kenneth Bradbury1, Gonzalez Michael A.1, Hood Courtney1, Pham Tiffany T.1, Lele Abhijit V.6, Hall Lesley7, Ali Ameer7, Isaak Robert S.1

Affiliations:

1. Department of Anesthesiology, University of North Carolina, Chapel Hill, North Carolina

2. School of Medicine, University of North Carolina, Chapel Hill, North Carolina

3. Department of Anesthesiology, University of Mississippi Medical Center, Jackson, Mississippi

4. Department of Anesthesiology, University of Colorado, Aurora, Colorado

5. Department of Anesthesia & Operative Services, Womack Army Medical Center, Ft. Liberty, North Carolina

6. Department of Anesthesiology and Pain Medicine, University of Washington, Harborview Medical Center, Seattle, Washington

7. University of North Carolina Health Enterprises, University of North Carolina Health Care System, Chapel Hill, North Carolina

Abstract

BACKGROUND: Chat Generative Pre-Trained Transformer (ChatGPT) has been tested on, and has passed, various high-level examinations. However, it has not been tested on an examination such as the American Board of Anesthesiology (ABA) Standardized Oral Examination (SOE). The SOE is designed to assess higher-level competencies, such as judgment, organization, adaptability to unexpected clinical changes, and presentation of information.

METHODS: Four anesthesiology fellows were examined on 2 sample ABA SOEs. Their answers were compared with ChatGPT's responses to the same questions. All human and ChatGPT responses were transcribed, randomized by module, and then reproduced as complete examinations using commercially available voice-replication software. Eight ABA applied examiners listened to and scored the topics and modules from 1 of the 4 versions of each of the 2 sample examinations. The ABA did not provide support for, or collaborate with, any of the authors.

RESULTS: The anesthesiology fellows' answers received a better median score than ChatGPT's answers for the module topic scores (P = .03). However, there was no significant difference in the median overall global module scores between the human and ChatGPT responses (P = .17). The examiners identified the ChatGPT-generated answers in 23 of 24 modules (95.83%), with only 1 ChatGPT response perceived as coming from a human. In contrast, the examiners thought the human (fellow) responses were artificial intelligence (AI)-generated in 10 of 24 modules (41.67%). Examiner comments noted that ChatGPT generated relevant content but gave lengthy answers that at times did not focus on the priorities of the specific scenario. There were no examiner comments regarding ChatGPT fact "hallucinations."

CONCLUSIONS: ChatGPT generated SOE answers with module ratings comparable to those of anesthesiology fellows, as graded by 8 ABA oral board examiners. However, the ChatGPT answers were judged subjectively inferior because of their length and lack of focus. Future curation and training of an AI model, such as ChatGPT, could produce answers more in line with ideal ABA SOE answers. This could lead to higher performance and an anesthesiology-specific trained AI useful for training and examination preparation.
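The abstract reports median score comparisons (P = .03 for module topics, P = .17 for global module scores) and examiner identification rates, but does not name the statistical test used. The sketch below is a minimal illustration, assuming a paired nonparametric comparison (Wilcoxon signed-rank test) across the 24 modules; the score arrays are hypothetical placeholders, not data from the study, and only the 23/24 and 10/24 identification counts come from the abstract.

```python
# Minimal sketch of the kind of analysis described in the abstract.
# Assumption: a Wilcoxon signed-rank test on paired per-module scores;
# the score values below are illustrative placeholders, not study data.
from scipy.stats import wilcoxon

# Hypothetical paired module-topic scores for the same 24 modules,
# one set graded for fellow answers and one for ChatGPT answers.
fellow_scores  = [7, 6, 8, 7, 7, 6, 8, 7, 6, 7, 8, 7, 7, 6, 8, 7, 7, 6, 7, 8, 7, 6, 7, 7]
chatgpt_scores = [6, 6, 7, 6, 7, 5, 7, 6, 6, 6, 7, 7, 6, 6, 7, 6, 7, 5, 6, 7, 6, 6, 6, 7]

stat, p_value = wilcoxon(fellow_scores, chatgpt_scores)
print(f"Wilcoxon statistic = {stat:.1f}, P = {p_value:.3f}")

# Examiner identification rates reported in the abstract.
chatgpt_identified = 23 / 24   # ChatGPT answers correctly flagged as AI-generated
fellow_misjudged   = 10 / 24   # fellow answers mistaken for AI-generated
print(f"ChatGPT identified as AI: {chatgpt_identified:.2%}")   # 95.83%
print(f"Fellows mistaken for AI:  {fellow_misjudged:.2%}")     # 41.67%
```

This is only meant to show how the reported percentages and a paired score comparison could be reproduced; the actual study design and statistics are described in the full article.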

Publisher

Ovid Technologies (Wolters Kluwer Health)

