The use of gene expression datasets in feature selection research: 20 years of inherent bias?

Author:

Grisci Bruno I. (1, 2), Feltes Bruno César (3), de Faria Poloni Joice (3), Narloch Pedro H. (1), Dorn Márcio (1, 4, 5)

Affiliation:

1. Institute of Informatics, Federal University of Rio Grande do Sul, Porto Alegre, Rio Grande do Sul, Brazil

2. Faculty of Computer Science, Dalhousie University, Halifax, Nova Scotia, Canada

3. Institute of Biosciences, Federal University of Rio Grande do Sul, Porto Alegre, Rio Grande do Sul, Brazil

4. National Institute of Science and Technology ‐ Forensic Science, Porto Alegre, Rio Grande do Sul, Brazil

5. Center for Biotechnology, Federal University of Rio Grande do Sul, Porto Alegre, Rio Grande do Sul, Brazil

Abstract

Feature selection algorithms are frequently employed in preprocessing machine learning pipelines applied to biological data to identify relevant features. The use of feature selection in gene expression studies began at the end of the 1990s with the analysis of human cancer microarray datasets. Since then, gene expression technology has been perfected, the Human Genome Project has been completed, new microarray platforms have been created and discontinued, and RNA‐seq has gradually replaced microarrays. However, most feature selection methods in the last two decades were designed, evaluated, and validated on the same datasets from the microarray technology's infancy. In this review of over 1200 publications regarding feature selection and gene expression, published between 2010 and 2020, we found that 57% of the publications used at least one outdated dataset, 23% used only outdated data, and 32% did not cite data sources. Other issues include referencing databases that are no longer available, the slow adoption of RNA‐seq datasets, and bias toward human cancer data, even for methods designed for a broader scope. In the most popular datasets, some being 23 years old, mislabeled samples, experimental biases, distribution shifts, and the absence of classification challenges are common. These problems are more predominant in publications with computer science backgrounds compared to publications from biology and can lead to inaccurate and misleading biological results.

This article is categorized under: Algorithmic Development > Biological Data Mining Technologies > Machine Learning

Funder

Conselho Nacional de Desenvolvimento Científico e Tecnológico

Coordenação de Aperfeiçoamento de Pessoal de Nível Superior

Fundação de Amparo à Pesquisa do Estado do Rio Grande do Sul

Global Affairs Canada

Publisher

Wiley

Subject

General Computer Science

