Evaluating ChatGPT in Information Extraction: A Case Study of Extracting Cognitive Exam Dates and Scores
Authors:
Neil Jethani, Simon Jones, Nicholas Genes, Vincent J. Major, Ian S. Jaffe, Anthony B. Cardillo, Noah Heilenbach, Nadia Fazal Ali, Luke J. Bonanni, Andrew J. Clayburn, Zain Khera, Erica C. Sadler, Jaideep Prasad, Jamie Schlacter, Kevin Liu, Benjamin Silva, Sophie Montgomery, Eric J. Kim, Jacob Lester, Theodore M. Hill, Alba Avoricani, Ethan Chervonski, James Davydov, William Small, Eesha Chakravartty, Himanshu Grover, John A. Dodson, Abraham A. Brody, Yindalon Aphinyanaphongs, Narges Razavian
Abstract
Background: Large language models (LLMs) provide powerful natural language processing (NLP) capabilities for medical and clinical tasks. Evaluating LLM performance is crucial because of the potential for false results. In this study, we assessed ChatGPT, a state-of-the-art LLM, on extracting information from clinical notes, focusing on two cognitive tests: the Mini-Mental State Exam (MMSE) and the Clinical Dementia Rating (CDR). We tasked ChatGPT with extracting MMSE and CDR scores, and the dates on which they were recorded, from clinical notes.

Methods: Our cohort comprised 135,307 clinical notes (January 12, 2010 to May 24, 2023) mentioning the MMSE, CDR, or Montreal Cognitive Assessment (MoCA). After applying the inclusion criteria and excluding notes mentioning only the MoCA, 34,465 notes remained, from which 765 were randomly selected for analysis. ChatGPT (GPT-4, API version "2023-03-15-preview") was applied to these 765 notes to extract MMSE and CDR instances with their corresponding dates; inference succeeded for 742 notes. Twenty notes were used for fine-tuning and for training the reviewers. The remaining 722 notes were assigned to 22 medically trained expert reviewers, who reviewed ChatGPT's responses and provided the ground truth; 309 of these notes were assigned to two reviewers simultaneously. Inter-rater agreement (Fleiss' kappa), precision, recall, true- and false-negative rates, and accuracy were calculated.

Results: For MMSE information extraction, ChatGPT achieved 83% accuracy, with high sensitivity (macro-recall of 89.7%), a 96% true-negative rate, and 82.7% precision. For CDR information extraction, ChatGPT achieved 89% accuracy, with a macro-recall of 91.3% and a perfect (100%) true-negative rate, but lower precision of 57%. In the ground-truth data, 89.1% of notes documented an MMSE whereas only 14.3% documented a CDR, and this imbalance lowered the precision of CDR extraction. Inter-rater agreement was substantial, supporting the validity of our findings. Reviewers judged ChatGPT's responses correct (96% for MMSE, 98% for CDR) and complete (84% for MMSE, 83% for CDR).

Conclusion: ChatGPT extracts MMSE and CDR scores and dates with high overall accuracy, which could benefit dementia research and clinical care. The prior probability of the target information appearing in the text affected ChatGPT's precision. Rigorous evaluation of LLMs across diverse medical tasks is crucial to understanding their capabilities and limitations.
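The methods name only the model (GPT-4) and the API version ("2023-03-15-preview"); the study's actual prompt and configuration are not given in the abstract. As a minimal sketch, assuming the era-appropriate Azure OpenAI chat-completions API, the Python snippet below shows what such an extraction call might look like. The endpoint, deployment name, and prompt wording are illustrative assumptions, not the authors' setup.

```python
import json
import openai  # openai<1.0 client, matching the "2023-03-15-preview" API version

# Illustrative Azure OpenAI configuration -- endpoint, key, and deployment
# name are placeholders, not the study's actual setup.
openai.api_type = "azure"
openai.api_base = "https://<your-resource>.openai.azure.com/"
openai.api_version = "2023-03-15-preview"  # API version cited in the abstract
openai.api_key = "<your-key>"

# Hypothetical prompt; the study's exact prompt is not given in the abstract.
PROMPT = (
    "From the clinical note below, extract every Mini-Mental State Exam (MMSE) "
    "and Clinical Dementia Rating (CDR) score together with the date each was "
    "administered. Respond only with JSON of the form "
    '{"mmse": [{"score": ..., "date": ...}], "cdr": [{"score": ..., "date": ...}]}. '
    "Return an empty list for any test that is not mentioned.\n\nNote:\n"
)

def extract_scores(note_text: str) -> dict:
    """Ask GPT-4 to extract MMSE/CDR scores and dates from one clinical note."""
    response = openai.ChatCompletion.create(
        engine="gpt-4",  # Azure deployment name (an assumption)
        messages=[{"role": "user", "content": PROMPT + note_text}],
        temperature=0,  # deterministic decoding for extraction
    )
    # Raises if the model strays from the requested JSON format;
    # a real pipeline would validate and retry.
    return json.loads(response.choices[0].message["content"])
```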
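Similarly, the reported evaluation metrics (precision, macro-recall, true-negative rate, accuracy, and Fleiss' kappa over the doubly-reviewed notes) can be computed as sketched below with scikit-learn and statsmodels; the label arrays are toy data, not the study's annotations.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Toy binary labels (1 = score present/extraction correct, 0 = absent);
# in the study these come from expert review of ChatGPT's output.
y_true = np.array([1, 1, 0, 1, 0, 1, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 1])

precision = precision_score(y_true, y_pred)
macro_recall = recall_score(y_true, y_pred, average="macro")  # mean per-class recall
accuracy = accuracy_score(y_true, y_pred)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
true_negative_rate = tn / (tn + fp)

# Fleiss' kappa over the notes reviewed by two raters: one row per note,
# one column per rater, entries are the assigned category labels.
ratings = np.array([[1, 1], [0, 0], [1, 0], [1, 1], [0, 0]])  # toy double reviews
counts, _ = aggregate_raters(ratings)  # rows: notes, columns: category counts
kappa = fleiss_kappa(counts)
```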
Publisher
Cold Spring Harbor Laboratory
Cited by
5 articles.