Efficient computation of absent words in genomic sequences

Published:2008-03-26 Issue:1 Volume:9 Page:
ISSN:1471-2105
Container-title:BMC Bioinformatics
language:en
Short-container-title:BMC Bioinformatics

Author:

Herold Julia,Kurtz Stefan,Giegerich Robert

Abstract

Background: Analysis of sequence composition is a routine task in genome research. Organisms are characterized by their base composition, dinucleotide relative abundance, codon usage, and so on. Unique subsequences are markers of special interest in genome comparison, expression profiling, and genetic engineering. Relative to a random sequence of the same length, unique subsequences are overrepresented in real genomes. Shortest words absent from a genome have been addressed in two recent studies.

Results: We describe a new algorithm and software for the computation of absent words. It is more efficient than previous algorithms and easier to use. It directly computes unwords without the need to specify a length estimate. Moreover, it avoids the space requirements of index structures such as suffix trees and suffix arrays. Our implementation is available as an open source package. We compute unwords of human and mouse as well as some other organisms, covering a genome size range from 10^9 down to 10^5 bp.

Conclusion: The new algorithm computes absent words for the human genome in 10 minutes on standard hardware, using only 2.5 Mb of space. This enables us to perform this type of analysis not only for the largest genomes available so far, but also for the emerging pan- and meta-genome data.
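To make the notion of a shortest absent word ("unword") concrete, here is a minimal brute-force sketch in Python. It is not the authors' algorithm, which the abstract describes only as avoiding index structures and running in small space. The function name `shortest_absent_words`, the default `ACGT` alphabet, and the decision to scan only the given strand (no reverse complement) are assumptions made for illustration. Like the method described in the abstract, it needs no length estimate: it simply increases the word length k until some word of that length is missing.

```python
from itertools import product


def shortest_absent_words(seq, alphabet="ACGT"):
    """Return (k, words): the smallest length k at which at least one word
    over `alphabet` does not occur in `seq`, and those absent words, sorted.

    Illustrative brute force only; memory grows with len(alphabet) ** k,
    so this is suitable for short sequences, not whole genomes.
    """
    seq = seq.upper()
    symbols = set(alphabet)
    k = 1
    while True:
        # Every length-k substring of seq made up only of alphabet symbols
        # (substrings containing e.g. 'N' are ignored).
        present = {
            seq[i:i + k]
            for i in range(len(seq) - k + 1)
            if set(seq[i:i + k]) <= symbols
        }
        # If fewer than |alphabet|^k distinct words occur, some are absent.
        if len(present) < len(alphabet) ** k:
            absent = sorted(
                "".join(p)
                for p in product(alphabet, repeat=k)
                if "".join(p) not in present
            )
            return k, absent
        k += 1
```

For example, `shortest_absent_words("AAAA")` returns `(1, ['C', 'G', 'T'])`, while for `"ACGT"` every single base occurs, so the search moves on to length 2 and reports the 13 dinucleotides other than `AC`, `CG`, and `GT`.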

Publisher

Springer Science and Business Media LLC

Subject

Applied Mathematics,Computer Science Applications,Molecular Biology,Biochemistry,Structural Biology

Link

https://link.springer.com/content/pdf/10.1186/1471-2105-9-167.pdf

References: 24 articles.

1. Wang Y, Hill K, Singh S, Kari L: The spectrum of genomic signatures: from dinucleotides to chaos game representation. Gene 2005, 346: 173–185.

2. Workman C, Krogh A: No evidence that mRNAs have lower folding free energies than random sequences with the same dinucleotide distribution. Nucleic Acids Res 1999, 27(24):4816–4822.

3. Krause L, McHardy A, Nattkemper T, Pühler A, Stoye J, Meyer F: GISMO – gene identification using a support vector machine for ORF classification. Nucleic Acids Res 2007, 35(2):540–549.

4. Pingoud A, Jeltsch A: Structure and function of type II restriction endonucleases. Nucleic Acids Res 2001, 29: 3705–3727.

5. Apostolico A, Bock ME, Lonardi S: Monotony of Surprise And Large-Scale Quest for Unusual Words. Proceedings of the Sixth Annual International Conference on Computational Biology (RECOMB 2002) 2002, 22–31.

Cited by 61 articles.

1. kmerDB: A database encompassing the set of genomic and proteomic sequence information for each species;Computational and Structural Biotechnology Journal;2024-12

2. Enhanced expression of the myogenic factor Myocyte enhancer factor-2 in imaginal disc myoblasts activates a partial, but incomplete, muscle development program;Developmental Biology;2024-12

3. Cell sorting based on single nucleotide variation enables characterization of mutation-dependent transcriptome and chromatin states;2024-07-08

4. kmerDB: A Database Encompassing the Set of Genomic and Proteomic Sequence Information for Each Species;2023-11-16

5. Linear-time computation of DAWGs, symmetric indexing structures, and MAWs for integer alphabets;Theoretical Computer Science;2023-09