Seqrutinator: Non-Functional Homologue Sequence Scrutiny for the Generation of large Datasets for Protein Superfamily Analysis

Published: 2022-03-25

Author:

Amalfitano Agustín, Stocchi Nicolás, Atencio Hugo Marcelo, Villarreal Fernando, ten Have Arjen

Abstract

Background: In recent years protein bioinformatics has resulted in many good algorithms for multiple sequence alignment (MSA) and phylogeny. Little attention has been paid to sequence selection, even though complete proteomes, notably recently published ones, often contain many sequences that are partial or derive from pseudogenes. Not only do these sequences add noise to the MSA, phylogeny and other downstream computational analyses, they also instigate many errors in the processing of the MSAs and downstream analyses, including the phylogeny.

Objective: This work aims to provide and test an objective, automated but flexible pipeline for the scrutiny of sequence sets from large, complex, eukaryotic protein superfamilies. The pipeline should classify sequences with high precision and recall as either functional or non-functional. It should classify no or only a few SwissProt sequences as non-functional (high precision), classify sequences from other, related superfamilies as non-functional (high recall), and result in a demonstrably improved MSA (high performance).

Results: Seqrutinator is a pipeline that consists of five modules written in Python 3 that identify and remove sequences that are likely Non-Functional Homologues (NFH). Here we tested the pipeline using three complex plant superfamilies (BAHD, CYP and UGT) that act in specialized metabolism, using the complete proteomes of 16 plant species as input and SwissProt as a control. Only 1.94% of SwissProt sequences with wet-lab evidence were identified as NFH, and all sequences from other related superfamilies were removed. Most NFH sequences are partial; interestingly, their removal results in highly improved MSAs. A few, but significant, sequences that instigate large gaps were also found. The five modules show similar behaviour when applied to the 16 sequence sets of the three analysed superfamilies. Pipelines with different module orders result in similar classifications and, moreover, show that different modules often detect the same sequences.

Conclusion and perspective: Seqrutinator forms a consistent pipeline for sequence scrutiny that results in sequence sets that generate high-fidelity MSAs. Recovery analyses show the method has high precision and recall.
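
The control-based evaluation described in the abstract can be illustrated with a minimal Python sketch. It is not taken from Seqrutinator itself: the function name `control_rates`, the labels `"positive"`, `"negative"`, `"functional"` and `"NFH"`, and the toy sequence identifiers are all assumptions made for illustration. Positive controls stand in for SwissProt sequences with wet-lab evidence, negative controls for sequences from other related superfamilies.

```python
from typing import Dict, Tuple


def control_rates(calls: Dict[str, str], controls: Dict[str, str]) -> Tuple[float, float]:
    """Return (fraction of positive controls flagged as NFH,
    fraction of negative controls flagged as NFH).

    calls:    sequence id -> "functional" or "NFH" (hypothetical pipeline output)
    controls: sequence id -> "positive" or "negative" (hypothetical control labels)
    """
    positives = [s for s, kind in controls.items() if kind == "positive"]
    negatives = [s for s, kind in controls.items() if kind == "negative"]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative control")

    # Positive controls wrongly removed (the abstract reports 1.94% for this rate).
    pos_flagged = sum(calls[s] == "NFH" for s in positives)
    # Negative controls correctly removed (the abstract reports all of them).
    neg_removed = sum(calls[s] == "NFH" for s in negatives)
    return pos_flagged / len(positives), neg_removed / len(negatives)


# Toy data, invented for this example only.
controls = {
    "sp1": "positive", "sp2": "positive", "sp3": "positive", "sp4": "positive",
    "out1": "negative", "out2": "negative",
}
calls = {
    "sp1": "functional", "sp2": "functional", "sp3": "NFH", "sp4": "functional",
    "out1": "NFH", "out2": "NFH",
}

flagged, removed = control_rates(calls, controls)
print(f"positive controls flagged as NFH: {flagged:.2%}")
print(f"negative controls removed: {removed:.2%}")
```

A low first rate and a high second rate correspond to the behaviour the abstract asks of the pipeline; how the authors computed their reported figures may differ from this sketch.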

Publisher

Cold Spring Harbor Laboratory

References (70 articles)

1. Functional Classification and Characterization of the Fungal Glycoside Hydrolase 28 Protein Family

2. Evolution and functional diversification of the small heat shock protein/α-crystallin family in higher plants

3. Evolutionary and Functional Relationships in the Truncated Hemoglobin Family

4. Identification of the functions of 4-coumarate-CoA ligase/acyl-CoA synthetase paralogs in potato

5. Extensive Expansion of A1 Family Aspartic Proteinases in Fungi Revealed by Evolutionary Analyses of 107 Complete Eukaryotic Proteomes

Cited by 1 article

1. Comprehensive characterization of the complex BAHD acyltransferase family from 218 land plants species: phylogenomic analysis and identification of specificity determinant positions (2023-03-16)