Abstract
BACKGROUND
ChatGPT is a large language model with promising healthcare applications. However, its ability to analyze complex clinical data and provide consistent results is poorly understood. This study evaluated ChatGPT-4's risk stratification of simulated patients with acute nontraumatic chest pain against validated tools.

METHODS
Three datasets of simulated case studies were created: one based on the TIMI score variables, another on the HEART score variables, and a third comprising 44 randomized variables related to nontraumatic chest pain presentations. ChatGPT independently scored each dataset five times. Its risk scores were compared to calculated TIMI and HEART scores. A model trained on the 44 clinical variables was evaluated for consistency.

RESULTS
ChatGPT's mean scores correlated strongly with TIMI and HEART scores (r = 0.898 and 0.928, respectively), but the distribution of individual risk assessments was broad: ChatGPT assigned a different risk 45-48% of the time for a fixed TIMI or HEART score. On the 44-variable model, a majority of the five ChatGPT runs agreed on a diagnosis category only 56% of the time, and risk scores were poorly correlated (r = 0.605). ChatGPT assigned higher risk scores to males and African Americans.

CONCLUSION
While ChatGPT correlates closely with established risk stratification tools in terms of mean scores, its inconsistency when presented with identical patient data on separate occasions raises concerns about its reliability. The findings suggest that although large language models like ChatGPT hold promise for healthcare applications, further refinement and customization are necessary, particularly for the clinical risk assessment of atraumatic chest pain patients.
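The two consistency measures reported above (Pearson correlation between ChatGPT's scores and the reference scores, and the rate at which a majority of the five repeated runs agreed on a category) can be sketched as follows. This is an illustrative reconstruction, not the authors' analysis code; the function names and the example data are hypothetical.

```python
from statistics import mean
from math import sqrt
from collections import Counter

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def majority_agreement_rate(runs):
    """Fraction of cases where a strict majority of repeated runs assigned
    the same category. `runs` is a list of per-run label lists, one label
    per simulated case (e.g. five runs over the same case set)."""
    n_runs = len(runs)
    n_cases = len(runs[0])
    agreed = 0
    for i in range(n_cases):
        labels = [run[i] for run in runs]
        top_count = Counter(labels).most_common(1)[0][1]
        if top_count > n_runs / 2:
            agreed += 1
    return agreed / n_cases
```

With five runs over the same simulated cases, `majority_agreement_rate` returning 0.56 would correspond to the 56% majority-agreement figure reported for the 44-variable dataset.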
Publisher
Cold Spring Harbor Laboratory