Improving the workflow to crack Small, Unbalanced, Noisy, but Genuine (SUNG) datasets in bioacoustics: The case of bonobo calls

Author:

Arnaud VincentORCID,Pellegrino FrançoisORCID,Keenan Sumir,St-Gelais Xavier,Mathevon Nicolas,Levréro Florence,Coupé ChristopheORCID

Abstract

Despite the accumulation of data and studies, deciphering animal vocal communication remains challenging. In most cases, researchers must deal with the sparse recordings composing Small, Unbalanced, Noisy, but Genuine (SUNG) datasets. SUNG datasets are characterized by a limited number of recordings, most often noisy, and unbalanced in number between the individuals or categories of vocalizations. SUNG datasets therefore offer a valuable but inevitably distorted vision of communication systems. Adopting the best practices in their analysis is essential to effectively extract the available information and draw reliable conclusions. Here we show that the most recent advances in machine learning applied to a SUNG dataset succeed in unraveling the complex vocal repertoire of the bonobo, and we propose a workflow that can be effective with other animal species. We implement acoustic parameterization in three feature spaces and run a Supervised Uniform Manifold Approximation and Projection (S-UMAP) to evaluate how call types and individual signatures cluster in the bonobo acoustic space. We then implement three classification algorithms (Support Vector Machine, xgboost, neural networks) and their combination to explore the structure and variability of bonobo calls, as well as the robustness of the individual signature they encode. We underscore how classification performance is affected by the feature set and identify the most informative features. In addition, we highlight the need to address data leakage in the evaluation of classification performance to avoid misleading interpretations. Our results lead to identifying several practical approaches that are generalizable to any other animal communication system. To improve the reliability and replicability of vocal communication studies with SUNG datasets, we thus recommend: i) comparing several acoustic parameterizations; ii) visualizing the dataset with supervised UMAP to examine the species acoustic space; iii) adopting Support Vector Machines as the baseline classification approach; iv) explicitly evaluating data leakage and possibly implementing a mitigation strategy.

Funder

Ministère de l’Enseignement Supérieur et de la Recherche

Ecole Doctorale SIS of the University of Saint-Etienne

Université du Québec à Chicoutimi

Université Jean Monnet Saint-Etienne

LABEX ASLAN

Agence Nationale de la Recherche

Institut Universitaire de France

Social Sciences and Humanities Research Council of Canada

Publisher

Public Library of Science (PLoS)

Subject

Computational Theory and Mathematics,Cellular and Molecular Neuroscience,Genetics,Molecular Biology,Ecology,Modeling and Simulation,Ecology, Evolution, Behavior and Systematics

Cited by 7 articles. 订阅此论文施引文献 订阅此论文施引文献,注册后可以免费订阅5篇论文的施引文献,订阅后可以查看论文全部施引文献

同舟云学术

1.学者识别学者识别

2.学术分析学术分析

3.人才评估人才评估

"同舟云学术"是以全球学者为主线,采集、加工和组织学术论文而形成的新型学术文献查询和分析系统,可以对全球学者进行文献检索和人才价值评估。用户可以通过关注某些学科领域的顶尖人物而持续追踪该领域的学科进展和研究前沿。经过近期的数据扩容,当前同舟云学术共收录了国内外主流学术期刊6万余种,收集的期刊论文及会议论文总量共计约1.5亿篇,并以每天添加12000余篇中外论文的速度递增。我们也可以为用户提供个性化、定制化的学者数据。欢迎来电咨询!咨询电话:010-8811{复制后删除}0370

www.globalauthorid.com

TOP

Copyright © 2019-2024 北京同舟云网络信息技术有限公司
京公网安备11010802033243号  京ICP备18003416号-3