A machine learning model for predicting congenital heart defects from administrative data

Author:

Shi Haoming (1), Book Wendy (2, 3), Raskind‐Hood Cheryl (3), Downing Karrie F. (4), Farr Sherry L. (4), Bell Mary N. (1), Sameni Reza (5), Rodriguez Fred H. (2, 6), Kamaleswaran Rishikesan (1, 5)

Affiliation:

1. Department of Biomedical Engineering Georgia Institute of Technology Atlanta Georgia USA

2. Division of Cardiology Emory University School of Medicine Atlanta Georgia USA

3. Department of Epidemiology Emory University, Rollins School of Public Health Atlanta Georgia USA

4. National Center on Birth Defects and Developmental Disabilities Centers for Disease Control and Prevention Atlanta Georgia USA

5. Department of Biomedical Informatics Emory University School of Medicine Atlanta Georgia USA

6. Children's Healthcare of Atlanta Atlanta Georgia USA

Abstract

Introduction: International Classification of Diseases (ICD) codes recorded in administrative data are often used to identify congenital heart defects (CHD). However, these codes may inaccurately identify true positive (TP) CHD individuals. CHD surveillance could be strengthened by accurate CHD identification in administrative records using machine learning (ML) algorithms.

Methods: To identify features relevant to accurate CHD identification, traditional ML models were applied to a validated dataset of 779 patients; encounter level data, including ICD‐9‐CM and CPT codes, from 2011 to 2013 at four US sites were utilized. Five‐fold cross‐validation determined overlapping important features that best predicted TP CHD individuals. Median values and 95% confidence intervals (CIs) of area under the receiver operating curve, positive predictive value (PPV), negative predictive value, sensitivity, specificity, and F1‐score were compared across four ML models: Logistic Regression, Gaussian Naive Bayes, Random Forest, and eXtreme Gradient Boosting (XGBoost).

Results: Baseline PPV was 76.5% from expert clinician validation of ICD‐9‐CM CHD‐related codes. Feature selection for ML decreased 7138 features to 10 that best predicted TP CHD cases. During training and testing, XGBoost performed the best in median accuracy (F1‐score) and PPV, 0.84 (95% CI: 0.76, 0.91) and 0.94 (95% CI: 0.91, 0.96), respectively. When applied to the entire dataset, XGBoost revealed a median PPV of 0.94 (95% CI: 0.94, 0.95).

Conclusions: Applying ML algorithms improved the accuracy of identifying TP CHD cases in comparison to ICD codes alone. Use of this technique to identify CHD cases would improve generalizability of results obtained from large datasets to the CHD patient population, enhancing public health surveillance efforts.
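For readers less familiar with the evaluation metrics named in the abstract, the following is a minimal Python sketch of how PPV, negative predictive value, sensitivity, specificity, and F1‐score follow from confusion‐matrix counts. The function name `confusion_metrics` and the toy labels are assumptions made for illustration only; they are not the authors' code or data.

```python
def confusion_metrics(y_true, y_pred):
    """Compute PPV, NPV, sensitivity, specificity, and F1 from binary labels.

    Labels are 1 for a clinician-confirmed CHD case and 0 otherwise.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    def ratio(num, den):
        # Return NaN instead of raising when a denominator is zero.
        return num / den if den else float("nan")

    return {
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
    }


# Hypothetical toy labels (NOT data from the study).
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 1, 0, 0, 1, 1, 0, 1]
metrics = confusion_metrics(y_true, y_pred)
```

In a cross‐validation setting such as the one described, a function like this would be applied to each held‐out fold, and the per‐fold values summarized (for example, by their median).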

Funder

Centers for Disease Control and Prevention

Publisher

Wiley

Subject

Health, Toxicology and Mutagenesis; Developmental Biology; Toxicology; Embryology; Pediatrics, Perinatology and Child Health
