HMMER Cut-off Threshold Tool (HMMERCTTER): Supervised Classification of Superfamily Protein Sequences with a Reliable Cut-off Threshold

Author:

Pagnuco Inti Anabela, Revuelta María Victoria, Bondino Hernán Gabriel, Brun Marcel, ten Have Arjen

Abstract

Protein superfamilies can be divided into subfamilies of proteins with different functional characteristics. Their sequences can be classified hierarchically, which is part of sequence function assignment. Typically, there are no clear subfamily hallmarks that would allow pattern-based function assignment, so this task is mostly performed based on the similarity principle. This is hampered by the lack of a score cut-off that is both sensitive and specific.

HMMER Cut-off Threshold Tool (HMMERCTTER) adds a reliable cut-off threshold to the popular HMMER. Using a high-quality superfamily phylogeny, it clusters a set of training sequences such that the cluster-specific HMMER profiles show 100% precision and recall (P&R), thereby generating a specific threshold as inclusion cut-off. Profiles and thresholds are then used as classifiers to screen a target dataset. Iterative inclusion of novel sequences into groups and the corresponding HMMER profiles results in high sensitivity, while specificity is maintained by imposing 100% P&R. In three presented case studies of protein superfamilies, classification of large datasets with 100% P&R was achieved with over 95% coverage. Limits and caveats are presented and explained.

HMMERCTTER is a promising protein superfamily sequence classifier, provided high-quality training datasets are used. It provides a decision support system that aids in the difficult task of sequence function assignment in the twilight zone of sequence similarity. A package containing source code and full dataset will be deposited at GitHub and is available for reviewers at: https://www.dropbox.com/s/aacao6ggcak30bg/Repo.tar.gz?dl=0

Author summary

The enormous number of genome sequences made available in the last decade provides new challenges for scientists. An important step in genome sequence processing is function assignment of the encoded protein sequences, typically based on the similarity principle: the more similar sequences are, the more likely they encode the same function. However, evolution generated many protein superfamilies that consist of various subfamilies with different functional characteristics, such as substrate specificity, optimal activity conditions or the catalyzed reaction. The classification of superfamily sequences into their respective subfamilies can be performed based on similarity, but since the different subfamilies also remain similar to each other, it requires a reliable similarity score cut-off.

We present a tool that clusters training sequences and describes them in profiles that identify cluster members with higher similarity scores than non-cluster members, i.e. with 100% precision and recall. This defines a score cut-off threshold. Profiles and thresholds are then used to classify other sequences. Classified sequences are included in the profiles in order to improve sensitivity while maintaining specificity by imposing 100% precision and recall. Results on three case studies show that the tool can correctly classify complex superfamilies with over 95% coverage.

HMMERCTTER is meant as a decision support system for the expert biologist rather than the computational biologist.
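The 100% precision and recall criterion described above can be illustrated with a minimal Python sketch. It is not taken from the HMMERCTTER package: the function names `precision_recall` and `separating_cutoff` are hypothetical, the scores are invented for illustration, and using the lowest member score as the cut-off is an assumption about one reasonable way to place the threshold.

```python
from typing import Optional, Sequence, Tuple


def precision_recall(
    member_scores: Sequence[float],
    nonmember_scores: Sequence[float],
    cutoff: float,
) -> Tuple[float, float]:
    """Precision and recall when every score >= cutoff is called a member."""
    tp = sum(s >= cutoff for s in member_scores)      # members accepted
    fp = sum(s >= cutoff for s in nonmember_scores)   # non-members accepted
    fn = len(member_scores) - tp                      # members rejected
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def separating_cutoff(
    member_scores: Sequence[float],
    nonmember_scores: Sequence[float],
) -> Optional[float]:
    """Return a cut-off giving 100% precision and recall, or None.

    Assumption: the cut-off is placed at the lowest member score, which
    works only if every member outscores every non-member.
    """
    if not member_scores:
        return None
    lowest_member = min(member_scores)
    highest_nonmember = max(nonmember_scores, default=float("-inf"))
    if lowest_member > highest_nonmember:
        return lowest_member
    return None


# Invented profile scores for one cluster (illustrative values only).
members = [310.2, 295.0, 288.4]
nonmembers = [120.1, 95.3]

cutoff = separating_cutoff(members, nonmembers)
if cutoff is not None:
    print(cutoff, precision_recall(members, nonmembers, cutoff))
```

A cluster whose members do not all outscore the non-members yields `None` here, which corresponds to a cluster that fails the 100% precision and recall requirement.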

Publisher

Cold Spring Harbor Laboratory

Cited by 1 article.