Evaluation of a Digital Scribe: Conversation Summarization for Emergency Department Consultation Calls

Published:2024-05 Issue:03 Volume:15 Page:600-611
ISSN:1869-0327
Container-title:Applied Clinical Informatics
language:en
Short-container-title:Appl Clin Inform

Author:

Sezgin Emre, Sirrianni Joseph W.¹, Kranz Kelly²

Affiliation:

1. IT Research and Innovation, The Abigail Wexner Research Institute at Nationwide Children's Hospital, Columbus, Ohio, United States

2. Physician Consult and Transfer Center, Nationwide Children's Hospital, Columbus, Ohio, United States

Abstract

Objectives: We present a proof-of-concept digital scribe system as an emergency department (ED) consultation call-based clinical conversation summarization pipeline to support clinical documentation, and we report its performance.

Methods: We use four pretrained large language models to establish the digital scribe system: T5-small, T5-base, PEGASUS-PubMed, and BART-Large-CNN, via zero-shot and fine-tuning approaches. Our dataset includes 100 referral conversations among ED clinicians and medical records. We report ROUGE-1, ROUGE-2, and ROUGE-L to compare model performance. In addition, we annotated transcriptions to assess the quality of generated summaries.

Results: The fine-tuned BART-Large-CNN model achieves the best summarization performance, with the highest ROUGE scores (ROUGE-1 F1 = 0.49, ROUGE-2 F1 = 0.23, ROUGE-L F1 = 0.35). In contrast, PEGASUS-PubMed lags notably (ROUGE-1 F1 = 0.28, ROUGE-2 F1 = 0.11, ROUGE-L F1 = 0.22). BART-Large-CNN's performance decreases by more than 50% with the zero-shot approach. Annotations show that BART-Large-CNN achieves 71.4% recall in identifying key information and a 67.7% accuracy rate.

Conclusion: The BART-Large-CNN model demonstrates a high level of understanding of clinical dialogue structure, indicated by its performance with and without fine-tuning. Despite some instances of high recall, the model's performance varies, particularly in achieving consistent correctness, which suggests room for refinement. The model's recall also varies across information categories. The study provides evidence for the potential of artificial intelligence-assisted tools to support clinical documentation. Future work should expand the research scope with additional language models and hybrid approaches, and include comparative analysis to measure documentation burden and human factors.
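The ROUGE F1 scores reported above can be illustrated with a minimal Python sketch. It is a simplified version that assumes lowercase whitespace tokenization with no stemming, so it is not the authors' evaluation code and will not reproduce the published numbers. The function names and the example sentences are hypothetical and serve only to show how n-gram overlap (ROUGE-1, ROUGE-2) and longest common subsequence (ROUGE-L) turn into an F1 score.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def f1(overlap, cand_total, ref_total):
    """Harmonic mean of precision (overlap/candidate) and recall (overlap/reference)."""
    if overlap == 0 or cand_total == 0 or ref_total == 0:
        return 0.0
    precision = overlap / cand_total
    recall = overlap / ref_total
    return 2 * precision * recall / (precision + recall)


def rouge_n_f1(candidate, reference, n):
    """ROUGE-N F1 from clipped n-gram overlap (assumed tokenization: lowercase split)."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((cand & ref).values())  # Counter intersection clips repeated n-grams
    return f1(overlap, sum(cand.values()), sum(ref.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L F1 based on the longest common subsequence of tokens."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    return f1(lcs_length(cand, ref), len(cand), len(ref))


# Hypothetical example sentences, not taken from the study's dataset.
generated = "patient has fever and cough"
reference = "patient has a fever and dry cough"

print(round(rouge_n_f1(generated, reference, 1), 2))  # ROUGE-1 F1
print(round(rouge_n_f1(generated, reference, 2), 2))  # ROUGE-2 F1
print(round(rouge_l_f1(generated, reference), 2))     # ROUGE-L F1
```

For this pair, all five generated unigrams appear in the seven-token reference, so precision is 1 and recall is 5/7. Only two of the four generated bigrams appear among the six reference bigrams, which is why ROUGE-2 is typically the lowest of the three scores, as it is in the results reported above.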

Funder

U.S. Department of Health and Human Services

Publisher

Georg Thieme Verlag KG

Link

http://www.thieme-connect.de/products/ejournals/pdf/10.1055/a-2327-4121.pdf
