Categorizing drug approval populations and matching their clinical trials using natural language processing: a practical case study fine-tuning BERT (Preprint)

Published: 2022-12-07

Author:

Gendrin Aline, Souliotis Leonidas, Loudon-Griffiths James, Aggarwal Ravisha, Amoako Daniel, Desouza Gregory, Dimitrievska Sashka, Metcalfe Paul, Louvet Emilie, Sahni Harpreet

Abstract

BACKGROUND

New drug treatments are regularly approved, and it is challenging to remain up to date in this rapidly changing environment. Fast and accurate information extraction is important for a global understanding of the drug market; automating it provides a helpful starting point for the subject matter expert, helps to mitigate human error, and saves time.

OBJECTIVE

We apply natural language processing (NLP) methods to classify disease populations within the free text of oncology drug approval descriptions from the BioMedTracker database, and to extract the clinical trial entities that provide evidence for these approvals.

METHODS

We fine-tune a BERT model. This methodology has demonstrated state-of-the-art results on a wide variety of NLP tasks, so we expect its performance to remain stable or improve over time as the amount of input data grows. BERT’s performance is validated against a rule-based text mining approach.
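
To make the rule-based comparison concrete, the following is a minimal sketch of a keyword-driven line-of-therapy classifier. The class labels, the regular expressions, and the function name `classify_line_of_therapy` are illustrative assumptions; the abstract does not describe the rules or class names actually used in the study.

```python
import re

# Hypothetical (label, pattern) rules, checked in order. These are
# illustrative only and are NOT the rules used in the study.
RULES = [
    ("first_line", re.compile(
        r"\b(first[- ]line|1st[- ]line|previously untreated|treatment[- ]naive)\b",
        re.IGNORECASE)),
    ("second_line_or_later", re.compile(
        r"\b(second[- ]line|2nd[- ]line|relapsed|refractory|previously treated)\b",
        re.IGNORECASE)),
    ("adjuvant", re.compile(r"\b(neo)?adjuvant\b", re.IGNORECASE)),
    ("maintenance", re.compile(r"\bmaintenance\b", re.IGNORECASE)),
]


def classify_line_of_therapy(text: str) -> str:
    """Return the label of the first rule whose pattern matches, else 'unknown'."""
    for label, pattern in RULES:
        if pattern.search(text):
            return label
    return "unknown"
```

A baseline of this kind is cheap to build and easy to audit, but it only recognizes phrasings someone anticipated, which is the gap a fine-tuned model is meant to close.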

RESULTS

Our fine-tuned BERT models achieve 5-fold cross-validated accuracies of 61% for the line of therapy and 56% for the stage of cancer classification tasks. With five classes each, both are well above the 20% expected from uniform random guessing. For the trial identification named entity recognition (NER) task, the 5-fold cross-validated F1 score is currently 87%. The training dataset is small (~400 entries), and scores on both the classification and NER tasks are expected to improve as additional data become available. For clinical validation of the model, a subject matter expert corrected the results before use, then leveraged them as a helpful starting point for further analysis in a crowded clinical area such as oncology.
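
The evaluation quantities reported here can be sketched in plain Python. This is a minimal illustration, assuming a simple shuffled (unstratified) 5-fold split, a majority-class reference baseline, and exact-match entity-level F1; the abstract does not state how the folds were built or whether F1 was computed at the entity or token level. All function names are hypothetical.

```python
import random
from collections import Counter


def k_fold_indices(n_items, k=5, seed=0):
    """Shuffle indices 0..n_items-1 and split them into k near-equal folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def cross_validated_accuracy(texts, labels, train_fn, k=5, seed=0):
    """Mean held-out accuracy over k folds.

    train_fn(train_texts, train_labels) must return a predict(text) callable.
    """
    scores = []
    for held_out in k_fold_indices(len(texts), k, seed):
        held = set(held_out)
        train_idx = [i for i in range(len(texts)) if i not in held]
        predict = train_fn([texts[i] for i in train_idx],
                           [labels[i] for i in train_idx])
        scores.append(accuracy([labels[i] for i in held_out],
                               [predict(texts[i]) for i in held_out]))
    return sum(scores) / k


def majority_class_trainer(train_texts, train_labels):
    """Reference baseline: always predict the most frequent training label."""
    most_common = Counter(train_labels).most_common(1)[0][0]
    return lambda text: most_common


def entity_f1(gold, predicted):
    """Exact-match F1 over sets of (start, end, type) entity spans."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)
```

Passing a real training routine as `train_fn` in place of `majority_class_trainer` yields the model's cross-validated accuracy under the same splits, so the two numbers are directly comparable.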

CONCLUSIONS

We developed an NLP algorithm that is currently assisting subject matter experts in extracting the stage of cancer, the line of therapy, and the relevant clinical trials that support these Health Authority approvals from a free, unstructured text source. The added structure these results bring can be further utilized in downstream applications, aiding searchability of relevant content against related drug project sources.

Publisher

JMIR Publications Inc.
