Comparison of machine learning techniques to handle imbalanced COVID-19 CBC datasets

Author:

Dorn Marcio (1, 2, 3), Grisci Bruno Iochins (1), Narloch Pedro Henrique (1), Feltes Bruno César (1, 4), Avila Eduardo (3, 5), Kahmann Alessandro (6), Alho Clarice Sampaio (3, 5)

Affiliation:

1. Institute of Informatics, Federal University of Rio Grande do Sul, Porto Alegre, RS, Brazil

2. Center of Biotechnology, Federal University of Rio Grande do Sul, Porto Alegre, RS, Brazil

3. Forensic Science, National Institute of Science and Technology, Porto Alegre, RS, Brazil

4. Department of Genetics, Federal University of Rio Grande do Sul, Porto Alegre, RS, Brazil

5. School of Health and Life Sciences, Pontifical Catholic University of Rio Grande do Sul, Porto Alegre, RS, Brazil

6. Institute of Mathematics, Statistics and Physics, Federal University of Rio Grande, Rio Grande, RS, Brazil

Abstract

The coronavirus pandemic caused by the novel SARS-CoV-2 has significantly impacted human health and the economy, especially in countries with limited financial resources for medical testing and treatment, such as Brazil, the third most affected country. In this scenario, machine learning (ML) techniques have been heavily employed to analyze different types of medical data and aid decision making, offering a low-cost alternative. Given the urgency of fighting the pandemic, a large number of studies apply machine learning approaches to clinical data, including complete blood count (CBC) tests, which are among the most widely available medical tests. In this work, we review the most commonly employed machine learning classifiers for CBC data, together with popular sampling methods for dealing with class imbalance. Additionally, we describe and critically analyze three publicly available Brazilian COVID-19 CBC datasets and evaluate the performance of eight classifiers and five sampling techniques on the selected datasets. Our work provides a panorama of which classifiers and sampling methods give the best results for different relevant metrics and discusses their impact on future analyses. The metrics and algorithms are introduced in a way that aids newcomers to the field. Finally, the panorama discussed here can significantly benefit the comparison of results from new ML algorithms.

Funder

Fundacao de Amparo a Pesquisa do Estado do Rio Grande do Sul - FAPERGS

Conselho Nacional de Desenvolvimento Cientifico e Tecnologico - CNPq

Coordenacao de Aperfeicoamento de Pessoal de Nivel Superior - STICAMSUD

DAAD/CAPES PROBRAL

Coordenação de Aperfeiçoamento de Pessoal de Nivel Superior - Brasil

Publisher

PeerJ

Subject

General Computer Science

References: 110 articles.

