NmTHC: a hybrid error correction method based on a generative neural machine translation model with transfer learning-Reference-Cited by-同舟云学术

NmTHC: a hybrid error correction method based on a generative neural machine translation model with transfer learning

Published:2024-06-07 Issue:1 Volume:25 Page:
ISSN:1471-2164
Container-title:BMC Genomics
language:en
Short-container-title:BMC Genomics

Author:

Wang Rongshu,Chen Jianhua

Abstract

Abstract Backgrounds The single-pass long reads generated by third-generation sequencing technology exhibit a higher error rate. However, the circular consensus sequencing (CCS) produces shorter reads. Thus, it is effective to manage the error rate of long reads algorithmically with the help of the homologous high-precision and low-cost short reads from the Next Generation Sequencing (NGS) technology. Methods In this work, a hybrid error correction method (NmTHC) based on a generative neural machine translation model is proposed to automatically capture discrepancies within the aligned regions of long reads and short reads, as well as the contextual relationships within the long reads themselves for error correction. Akin to natural language sequences, the long read can be regarded as a special “genetic language” and be processed with the idea of generative neural networks. The algorithm builds a sequence-to-sequence(seq2seq) framework with Recurrent Neural Network (RNN) as the core layer. The before and post-corrected long reads are regarded as the sentences in the source and target language of translation, and the alignment information of long reads with short reads is used to create the special corpus for training. The well-trained model can be used to predict the corrected long read. Results NmTHC outperforms the latest mainstream hybrid error correction methods on real-world datasets from two mainstream platforms, including PacBio and Nanopore. Our experimental evaluation results demonstrate that NmTHC can align more bases with the reference genome without any segmenting in the six benchmark datasets, proving that it enhances alignment identity without sacrificing any length advantages of long reads. Conclusion Consequently, NmTHC reasonably adopts the generative Neural Machine Translation (NMT) model to transform hybrid error correction tasks into machine translation problems and provides a novel perspective for solving long-read error correction problems with the ideas of Natural Language Processing (NLP). More remarkably, the proposed methodology is sequencing-technology-independent and can produce more precise reads.

Funder

National Natural Science Foundation of China

Publisher

Springer Science and Business Media LLC

Link

https://link.springer.com/content/pdf/10.1186/s12864-024-10446-4.pdf

Reference38 articles.

1. Treangen TJ, Salzberg SL. Repetitive DNA and next-generation sequencing: computational challenges and solutions. Nat Rev Genet. 2012;13(1):36–46.

2. Kim K-R, Yu J-N, Hong JM, Kim S-Y, Park SY. Genome assembly and microsatellite marker development using illumina and PacBio sequencing in the Carex pumila (Cyperaceae) from Korea. Genes (Basel). 2023;14(11):2063.

3. Wang S, Zhang X, Qiang G, Wang J. DelInsCaller: an efficient algorithm for identifying Delins and estimating haplotypes from long reads with high level of sequencing errors. Genes (Basel). 2023;14(1):4.

4. Quail MA, Smith M, Coupland P, Otto TD, Harris SR, Connor TR, Bertoni A, Swerdlow HP, Gu Y. A tale of three next generation sequencing platforms: comparison of Ion Torrent, Pacific Biosciences and Illumina MiSeq sequencers. BMC Genomics. 2012;13(1):1–13.

5. Foord C, Hsu J, Jarroux J, Hu W, Belchikov N, Pollard S, He Y, Joglekar A, Tilgner HU. The variables on RNA molecules: concert or cacophony? Answers in long-read sequencing. Nat Methods. 2023;20(1):20–4.