Nonparametric clustering of RNA‐sequencing data

Published:2023-07-30 Issue:6 Volume:16 Page:547-559
ISSN:1932-1864
Container-title:Statistical Analysis and Data Mining: The ASA Data Science Journal
language:en
Short-container-title:Statistical Analysis

Author:

Lozano Gabriel¹, Atallah Nadia², Levine Michael³

Affiliation:

1. Computer Systems Engineering, National University of Colombia, Bogota, Colombia

2. Department of Comparative Pathobiology, Purdue University, West Lafayette, Indiana, USA

3. Department of Statistics, Purdue University, West Lafayette, Indiana, USA

Abstract

Identification of clusters of co‐expressed genes in transcriptomic data is a difficult task. Most algorithms used for this purpose fall into two broad categories: distance‐based and model‐based approaches. Distance‐based approaches typically use a distance function between pairs of data objects and group similar objects together into clusters. Model‐based approaches rely on the mixture‐modeling framework. Compared to distance‐based approaches, model‐based approaches offer better interpretability because each cluster can be explicitly characterized in terms of the proposed model. However, these models present a particular difficulty: identifying a correct multivariate distribution on which the mixture can be based. In this manuscript, we first review some of the approaches used to select a distribution for the needed mixture model. Then, we propose avoiding this problem altogether by using a nonparametric MSL (maximum smoothed likelihood) algorithm. This algorithm was proposed earlier in the statistical literature but has not, to the best of our knowledge, been applied to transcriptomics data. The salient feature of this approach is that it avoids explicit specification of the distributions of individual biological samples altogether, thus making the practitioner's task easier. We performed both a simulation study and an application of the proposed algorithm to two different real datasets. When used on a real dataset, the algorithm produces a large number of biologically meaningful clusters and performs at least as well as several other mixture‐based algorithms commonly used for RNA‐seq data clustering. Our results also show that this algorithm is capable of uncovering clustering solutions that may go unnoticed by several other model‐based clustering algorithms. Our code is publicly available on GitHub at https://github.com/Matematikoi/non_parametric_clustering
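To make the idea of a mixture model without a prespecified distribution more concrete, here is a minimal Python sketch. It is not the authors' MSL implementation (see the repository above for that). It swaps in a simpler EM-like update in which each cluster's density is a product of weighted kernel density estimates, one per sample. The function name `np_em_sketch`, the fixed `bandwidth`, the quantile-based initialization, and the simulated data are all assumptions made for illustration, and the input is assumed to be a continuous-valued matrix with genes in rows and samples in columns.

```python
import numpy as np


def gaussian_kernel(u):
    """Standard normal density, used as the smoothing kernel."""
    return np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)


def np_em_sketch(X, n_clusters, bandwidth=0.5, n_iter=30):
    """EM-like nonparametric mixture clustering (illustrative sketch only).

    Assumption: within a cluster, the columns of X are independent, so each
    cluster density is a product of one-dimensional kernel density estimates.
    Returns hard labels and the (n, n_clusters) matrix of posterior weights.
    """
    n, d = X.shape

    # Hypothetical initialization: split rows into equal-size groups by row mean.
    order = np.argsort(X.mean(axis=1))
    init_labels = np.empty(n, dtype=int)
    init_labels[order] = np.arange(n) * n_clusters // n
    post = np.eye(n_clusters)[init_labels]

    for _ in range(n_iter):
        # Mixing proportions are the average posterior weights.
        weights = post.mean(axis=0)
        mass = np.maximum(post.sum(axis=0), 1e-12)

        # Accumulate log densities column by column.
        log_dens = np.zeros((n, n_clusters))
        for k in range(d):
            diff = (X[:, None, k] - X[None, :, k]) / bandwidth
            kern = gaussian_kernel(diff) / bandwidth  # kern[i, l] = K_h(x_ik - x_lk)
            # Weighted KDE of cluster j, column k, evaluated at every x_ik.
            f = kern @ post / mass
            log_dens += np.log(f + 1e-300)

        # Posterior update, normalized in log space for numerical stability.
        log_post = np.log(weights + 1e-300) + log_dens
        log_post -= log_post.max(axis=1, keepdims=True)
        post = np.exp(log_post)
        post /= post.sum(axis=1, keepdims=True)

    return post.argmax(axis=1), post


# Simulated data (not from the paper): two well-separated groups of unequal size.
rng = np.random.default_rng(1)
low = rng.normal(0.0, 1.0, size=(60, 4))
high = rng.normal(6.0, 1.0, size=(40, 4))
X = np.vstack([low, high])
truth = np.repeat([0, 1], [60, 40])

labels, post = np_em_sketch(X, n_clusters=2)
# Agreement is computed up to a swap of the two cluster labels.
agreement = max(np.mean(labels == truth), np.mean(labels != truth))
print(f"agreement with simulated truth: {agreement:.2f}")
```

The point of the sketch is that no Poisson, negative binomial, or Gaussian family is chosen for the clusters. Only a kernel and a bandwidth are needed, which is the practical convenience the abstract describes.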

Funder

Purdue University Center for Cancer Research

Walther Cancer Foundation

Publisher

Wiley

Subject

Computer Science Applications, Information Systems, Analysis

Link

https://onlinelibrary.wiley.com/doi/pdf/10.1002/sam.11638

References (28 articles)

1. Co-expression analysis of high-throughput transcriptome sequencing data with Poisson mixture models

2. Model-based clustering for RNA-seq data

3. A multivariate Poisson-log normal mixture model for clustering transcriptome sequencing data

4. A family of parsimonious mixtures of multivariate Poisson‐lognormal distributions for clustering multivariate count data

5. An EM-Like Algorithm for Semi- and Nonparametric Estimation in Multivariate Mixtures