Abstract
Background: The routine diagnostic process increasingly entails the processing of high-volume and high-dimensional data. This processing can pose scaling issues that limit the use of such data in research and in integrated diagnostics in routine care. Here, we investigate whether existing dimension reduction techniques can provide visualisations and analyses for a complete blood count (CBC) while maintaining representativeness of the original data. We considered over 3 million CBC measurements encompassing over 70 parameters of cell frequency, size and complexity from the UMC Utrecht UPOD database. We evaluated PCA as an example of a linear dimension reduction technique and UMAP, TriMap and PaCMAP as non-linear dimension reduction techniques. We assessed their technical performance using quality metrics for dimension reduction, and their biological representation by evaluating preservation of diurnal, age and sex patterns, cluster preservation and the identification of leukemia patients.

Results: We found that PCA performs systematically better than UMAP, TriMap and PaCMAP in representing the underlying data. Biological relevance was retained for periodicity in the data. However, we also observed a decrease in predictive performance of the reduced data for both age and sex, as well as an overestimation of clusters within the reduced data. Finally, we were able to identify the diverging patterns for leukemia patients after applying dimension reduction methods.

Conclusions: We conclude that for hematology data, the use of unsupervised dimension reduction techniques should be limited to data visualization applications, as implementing them in diagnostic pipelines may lead to decreased quality of integrated diagnostics in routine care.
Publisher
Cold Spring Harbor Laboratory