Evaluating Large Language Models for Drafting Emergency Department Discharge Summaries

Author:

Williams Christopher Y.K.ORCID,Bains Jaskaran,Tang Tianyu,Patel Kishan,Lucas Alexa N.,Chen Fiona,Miao Brenda Y.,Butte Atul J.,Kornblith Aaron E.

Abstract

AbstractImportanceLarge language models (LLMs) possess a range of capabilities which may be applied to the clinical domain, including text summarization. As ambient artificial intelligence scribes and other LLM-based tools begin to be deployed within healthcare settings, rigorous evaluations of the accuracy of these technologies are urgently needed.ObjectiveTo investigate the performance of GPT-4 and GPT-3.5-turbo in generating Emergency Department (ED) discharge summaries and evaluate the prevalence and type of errors across each section of the discharge summary.DesignCross-sectional study.SettingUniversity of California, San Francisco ED.ParticipantsWe identified all adult ED visits from 2012 to 2023 with an ED clinician note. We randomly selected a sample of 100 ED visits for GPT-summarization.ExposureWe investigate the potential of two state-of-the-art LLMs, GPT-4 and GPT-3.5-turbo, to summarize the full ED clinician note into a discharge summary.Main Outcomes and MeasuresGPT-3.5-turbo and GPT-4-generated discharge summaries were evaluated by two independent Emergency Medicine physician reviewers across three evaluation criteria: 1) Inaccuracy of GPT-summarized information; 2) Hallucination of information; 3) Omission of relevant clinical information. On identifying each error, reviewers were additionally asked to provide a brief explanation for their reasoning, which was manually classified into subgroups of errors.ResultsFrom 202,059 eligible ED visits, we randomly sampled 100 for GPT-generated summarization and then expert-driven evaluation. In total, 33% of summaries generated by GPT-4 and 10% of those generated by GPT-3.5-turbo were entirely error-free across all evaluated domains. Summaries generated by GPT-4 were mostly accurate, with inaccuracies found in only 10% of cases, however, 42% of the summaries exhibited hallucinations and 47% omitted clinically relevant information. Inaccuracies and hallucinations were most commonly found in the Plan sections of GPT-generated summaries, while clinical omissions were concentrated in text describing patients’ Physical Examination findings or History of Presenting Complaint.Conclusions and RelevanceIn this cross-sectional study of 100 ED encounters, we found that LLMs could generate accurate discharge summaries, but were liable to hallucination and omission of clinically relevant information. A comprehensive understanding of the location and type of errors found in GPT-generated clinical text is important to facilitate clinician review of such content and prevent patient harm.

Publisher

Cold Spring Harbor Laboratory

Cited by 1 articles. 订阅此论文施引文献 订阅此论文施引文献,注册后可以免费订阅5篇论文的施引文献,订阅后可以查看论文全部施引文献

同舟云学术

1.学者识别学者识别

2.学术分析学术分析

3.人才评估人才评估

"同舟云学术"是以全球学者为主线,采集、加工和组织学术论文而形成的新型学术文献查询和分析系统,可以对全球学者进行文献检索和人才价值评估。用户可以通过关注某些学科领域的顶尖人物而持续追踪该领域的学科进展和研究前沿。经过近期的数据扩容,当前同舟云学术共收录了国内外主流学术期刊6万余种,收集的期刊论文及会议论文总量共计约1.5亿篇,并以每天添加12000余篇中外论文的速度递增。我们也可以为用户提供个性化、定制化的学者数据。欢迎来电咨询!咨询电话:010-8811{复制后删除}0370

www.globalauthorid.com

TOP

Copyright © 2019-2024 北京同舟云网络信息技术有限公司
京公网安备11010802033243号  京ICP备18003416号-3