BACKGROUND
New drug treatments are approved regularly, and staying up to date in this rapidly changing environment is challenging. Fast, accurate extraction of approval information supports a global view of the drug market; automating this extraction gives the subject matter expert a helpful starting point, helps to mitigate human error, and saves time.
OBJECTIVE
We apply natural language processing (NLP) methods to classify disease populations in the free text of oncology drug approval descriptions from the BioMedTracker database, and to extract the clinical trial entities that provide evidence for these approvals.
METHODS
We fine-tune a BERT model, a methodology that has demonstrated state-of-the-art results on a wide variety of NLP tasks. We therefore expect its performance to remain stable or improve over time as the amount of input data increases. BERT’s performance is validated against a rule-based text mining approach.
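A rule-based text mining baseline of the kind mentioned above could look roughly like the following minimal sketch. The rule set, the class labels ("first-line", "second-line", "maintenance", "unspecified"), and the function names are hypothetical, since the rules and the five class names used in this work are not given here. The trial pattern assumes that trials are cited by ClinicalTrials.gov registry number ("NCT" followed by eight digits); trials cited only by name would need a different rule.

```python
import re

# Hypothetical keyword rules: the source does not list its rules or class names.
# Order matters, because the first matching rule wins.
LINE_OF_THERAPY_RULES = [
    ("first-line", re.compile(r"\bfirst[- ]line\b|\bpreviously untreated\b", re.I)),
    ("second-line", re.compile(r"\bsecond[- ]line\b|\bone prior (therapy|regimen)\b", re.I)),
    ("maintenance", re.compile(r"\bmaintenance\b", re.I)),
]

# ClinicalTrials.gov registry numbers have the form "NCT" plus eight digits.
NCT_ID = re.compile(r"\bNCT\d{8}\b")


def classify_line_of_therapy(text, default="unspecified"):
    """Return the label of the first rule whose pattern occurs in the text."""
    for label, pattern in LINE_OF_THERAPY_RULES:
        if pattern.search(text):
            return label
    return default


def extract_trial_ids(text):
    """Return all registry-style trial identifiers in order of appearance."""
    return NCT_ID.findall(text)


# Invented example sentence, not taken from BioMedTracker.
example = ("Approved for the first-line treatment of metastatic disease, "
           "based on results from a Phase III study (NCT01234567).")
print(classify_line_of_therapy(example))
print(extract_trial_ids(example))
```

Rules like these are transparent and need no training data, which is what makes them a useful reference point for a learned model, but they miss any phrasing that the patterns do not anticipate.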
RESULTS
Using our fine-tuned BERT models, we achieve 5-fold cross-validated accuracies of 61% and 56% for the line of therapy and stage of cancer classification tasks, respectively. With five classes each, this is a marked increase over random classification, which would yield 20% accuracy if the classes were guessed uniformly at random. For the trial identification named entity recognition (NER) task, the 5-fold cross-validated F1 score is currently 87%. The training dataset is small (~400 entries), and both the classification and NER scores are expected to improve over time as additional data become available. For clinical validation of the model, a subject matter expert corrected the results before use, and then used them as a helpful starting point for further analysis in a crowded clinical field such as oncology.
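The evaluation quantities reported above can be illustrated with a minimal sketch. It assumes shuffled, non-stratified folds and an exact-match, entity-level F1 over (start, end) spans; the source does not state whether folds were stratified or whether F1 was computed per entity or per token, and the function names are hypothetical. For reference, uniform guessing over five classes gives an expected accuracy of 1/5 = 20%.

```python
import random


def kfold_indices(n_items, k=5, seed=0):
    """Yield (train, test) index lists for k shuffled, disjoint folds."""
    idx = list(range(n_items))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test


def accuracy(gold, pred):
    """Fraction of items whose predicted label equals the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def entity_f1(gold_spans, pred_spans):
    """Exact-match F1 over entity spans (an assumed scoring convention)."""
    gold, pred = set(gold_spans), set(pred_spans)
    true_positives = len(gold & pred)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# About 400 entries split into 5 folds gives 320 training and 80 test items per fold.
for train, test in kfold_indices(400, k=5):
    print(len(train), len(test))

# Expected accuracy of uniform random guessing over five classes.
chance_accuracy = 1 / 5
```

In a cross-validated setup, each metric would be computed on the held-out fold and then averaged over the five folds.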
CONCLUSIONS
We developed an NLP algorithm that currently assists subject matter experts in extracting the stage of cancer, the line of therapy, and the relevant clinical trials that support Health Authority approvals from an unstructured free-text source. The added structure that these results bring can be used in downstream applications, improving the searchability of relevant content against related drug project sources.