Can We Standardize Name Reconciliation via OpenRefine?

Author:

Mozzherin Dmitry, Paul Deborah, Whitmire Amanda

Abstract

Scientific names are among the oldest identifiers used in science. As a result, reconciling a list of scientific names against curated data sources is a common, repetitive task in biodiversity research. Reconciliation allows one to determine whether names in a list are spelled correctly, whether they are currently accepted, and what their nomenclatural status is. Several online and local resources provide reconciliation services. Here we discuss the potential for interoperability across reconciliation tools. Global Names Verifier (GNverifier), Catalogue of Life, Global Biodiversity Information Facility (GBIF), Taxonomic Name Resolution Service (TNRS), LifeWatch, National Center for Biotechnology Information (NCBI), World Flora Online, Global Biotic Interactions (GloBI), Nomer, Wikidata, and others provide their own tools for name reconciliation. Each of these tools has its own scope, design decisions, and input and output formats. It is often useful to reconcile against several such services because they often include complementary data. However, given the idiosyncrasies of the services and the lack of standardization, this is not an easy task (Islam et al. 2024). Researchers would benefit if all existing and future tools were standardized: moving from one resource to another would then be as easy as changing the URL. Implementing elements of the Findable, Accessible, Interoperable, and Reusable (FAIR) data management principles would help to create such standards. However, standardizing all existing and future resources to a common interface would be difficult. Some of them lack the funding or programming capacity to modify their code, while others have more urgent priorities. Some resources support a specific research path where adhering to a rigid standard might hinder their innovation. In this paper we propose achieving interoperability between reconciliation tools by implementing the OpenRefine Reconciliation Service.
OpenRefine is a popular and powerful reconciliation and data cleaning application. It is used by many researchers for data transformation and normalization. Any service that implements the OpenRefine Reconciliation Service can be incorporated into data-management workflows just by providing the service's OpenRefine-compatible URL. Such compatible services become easy to discover when their metadata are listed in the OpenRefine Services Registry. In this paper we discuss our implementation of the OpenRefine Reconciliation Service in the Global Names Verifier (GNverifier) reconciliation tool. GNverifier is developed at the Species File Group as a part of the Global Names Architecture initiative. It offers a powerful, configurable, and fast way to reconcile scientific names. GNverifier aggregates data from more than 100 source datasets. Queries return currently accepted names when a dataset provides them. GNverifier can find matches for names that historically had several suffixes, and it supports fuzzy and partial matching. It sorts results by many factors to reliably return the best available match. With a strong focus on software optimization and a sophisticated matching algorithm, it can process 2000 names a second, making it one of the fastest services available. OpenRefine can use GNverifier directly because GNverifier is compatible with the OpenRefine protocol. As shown in Fig. 1, switching between GNverifier and Wikidata reconciliation of scientific names requires only a change of the service URL. Implementing the OpenRefine protocol might solve many standardization problems. Some resources have already implemented it (e.g., Wikidata, GNverifier (Whitmire and Mozzherin 2023), WFO Plant List, IPNI). Many people already use OpenRefine for their other reconciliation needs. For them, adding name reconciliation would be especially beneficial because it fits into their existing data-management workflow (Fig. 2). Basic reconciliation by itself is standard by design and a big step forward.
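The claim that switching services requires only a URL change can be illustrated with a minimal Python sketch. It assumes the batch query shape of the Reconciliation Service API (a form-encoded `queries` field holding JSON keyed `q0`, `q1`, ..., and a response with a `result` list per key). The two endpoint URLs are placeholders, not the real service addresses, and the helper names are hypothetical; this is not GNverifier's or OpenRefine's own client code.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Placeholder endpoints (assumptions): substitute the reconciliation URL
# published by each service.
SERVICES = {
    "gnverifier": "https://example.org/gnverifier/reconcile",
    "wikidata": "https://example.org/wikidata/reconcile",
}


def build_queries(names, limit=3):
    """Build the batch of reconciliation queries keyed q0, q1, ..."""
    return {f"q{i}": {"query": name, "limit": limit}
            for i, name in enumerate(names)}


def build_request(service_url, names, limit=3):
    """Prepare a form-encoded POST carrying the batch in the `queries` field."""
    body = urlencode({"queries": json.dumps(build_queries(names, limit))})
    return Request(
        service_url,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


def best_candidates(names, response):
    """Map each input name to its highest-scoring candidate, or None."""
    best = {}
    for i, name in enumerate(names):
        results = response.get(f"q{i}", {}).get("result", [])
        best[name] = (max(results, key=lambda c: c.get("score", 0))
                      if results else None)
    return best


def reconcile(service_url, names):
    """Send the batch to a live service; requires network access."""
    with urlopen(build_request(service_url, names), timeout=30) as resp:
        return best_candidates(names, json.load(resp))
```

Under these assumptions, `reconcile(SERVICES["gnverifier"], names)` and `reconcile(SERVICES["wikidata"], names)` send the same request body; only the target URL differs.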
Beyond basic reconciliation (as seen in Fig. 2), researchers are interested in additional data. The Service Protocol allows one to add optional "extended" features. For example, for scientific names, we provide "currently accepted" names, the data sources where a name was found, taxonomic classification, etc. (Fig. 3). To standardize these fields, we would need a recommendation document that describes the additional fields and their format. We need interested parties to participate in its creation and agree on its usage. The same would apply to optional input filters, for example for restricting reconciliation to certain data sources or higher taxonomic entities. We think OpenRefine would be a significant step forward for standardization across name-reconciliation tools.
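To make the "extended" fields concrete, here is a hedged Python sketch of how a client might ask for them. It assumes the request and response shape of the protocol's data extension feature (an `extend` form field with `ids` and `properties`, and a response with `rows`). The property identifiers `current_name`, `data_sources`, and `classification` are hypothetical; agreeing on such names is exactly what a recommendation document would have to settle.

```python
import json

# Hypothetical property identifiers (assumptions, not an agreed standard).
EXTENDED_PROPERTIES = ["current_name", "data_sources", "classification"]


def build_extend_payload(entity_ids, property_ids=EXTENDED_PROPERTIES):
    """Build the form field asking for extra properties of matched records."""
    request = {
        "ids": list(entity_ids),
        "properties": [{"id": prop} for prop in property_ids],
    }
    return {"extend": json.dumps(request)}


def flatten_rows(response):
    """Reduce the `rows` part of a response to plain strings per property.

    Only string cells ({"str": ...}) and entity cells ({"name": ...}) are
    handled here; other cell types yield None.
    """
    table = {}
    for entity_id, props in response.get("rows", {}).items():
        table[entity_id] = {
            prop: [cell.get("str", cell.get("name")) for cell in cells]
            for prop, cells in props.items()
        }
    return table
```

If every service used the same property identifiers and cell formats, a client like this could read extended data from any of them without service-specific code.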

Publisher

Pensoft Publishers
