Abstract
Efficient de novo motif discovery from the results of genome-wide mapping of transcription factor binding sites (ChIP-seq) depends on the choice of background nucleotide sequences. The foreground sequences (peaks) contain not only the specific motifs of target transcription factors, but also motifs overrepresented throughout the genome, such as simple sequence repeats. We performed a large-scale comparison of the ‘synthetic’ and ‘genomic’ approaches to generating background sequences for de novo motif discovery. The ‘synthetic’ approach shuffled nucleotides within peaks, while the ‘genomic’ approach randomly selected sequences from the whole reference genome, or only from gene promoters, matched to the fraction of A/T nucleotides in each peak. We compiled benchmark collections of ChIP-seq datasets for mammals and Arabidopsis, and performed de novo motif discovery. We showed that the genomic approach provides both more robust detection of the known motifs of target transcription factors and more stringent exclusion of simple sequence repeats as possible non-specific motifs. The advantage of the genomic approach over the synthetic one was greater in plants than in mammals. We developed the AntiNoise web service (https://denovosea.icgbio.ru/antinoise/), which implements the genomic approach to extract genomic background sequences for twelve eukaryotic genomes.
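The difference between the two background-generation approaches can be illustrated with a minimal Python sketch. It is not the AntiNoise implementation: the function names, the `tolerance` and `max_tries` parameters, and the rejection-sampling strategy are all assumptions introduced here for illustration, and the genome is treated as a single plain string.

```python
import random


def at_fraction(seq):
    """Return the fraction of A/T nucleotides in a sequence."""
    seq = seq.upper()
    return sum(base in "AT" for base in seq) / len(seq)


def synthetic_background(peak, rng):
    """'Synthetic' approach: shuffle the nucleotides of the peak itself.

    The mononucleotide composition of the peak is preserved exactly.
    """
    bases = list(peak)
    rng.shuffle(bases)
    return "".join(bases)


def genomic_background(peak, genome, rng, tolerance=0.05, max_tries=10000):
    """'Genomic' approach: draw a random genome window of the same length
    whose A/T fraction is within `tolerance` of the peak's A/T fraction.

    `tolerance` and `max_tries` are hypothetical parameters, not values
    taken from the source. Returns None if no matching window is found.
    """
    target = at_fraction(peak)
    n = len(peak)
    for _ in range(max_tries):
        start = rng.randrange(len(genome) - n + 1)
        window = genome[start:start + n]
        if abs(at_fraction(window) - target) <= tolerance:
            return window
    return None
```

A shuffled peak keeps only the base composition of the foreground, whereas a sampled genomic window also carries naturally occurring sequence features, such as simple sequence repeats, into the background. Restricting `genome` to promoter sequences would correspond to the promoter-only variant of the genomic approach.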
Publisher
Cold Spring Harbor Laboratory