German Medical Named Entity Recognition Model and Data Set Creation Using Machine Translation and Word Alignment: Algorithm Development and Validation

Published: 2023-02-28 Volume: 7 Page: e39077
ISSN: 2561-326X
Container-title: JMIR Formative Research
Language: en
Short-container-title: JMIR Form Res

Author:

Frei Johann, Kramer Frank

Abstract

Background: Data mining in the field of medical data analysis often needs to rely solely on the processing of unstructured data to retrieve relevant data. For German natural language processing, few open medical neural named entity recognition (NER) models have been published before this work. A major issue can be attributed to the lack of German training data.

Objective: We developed a synthetic data set and a novel German medical NER model for public access to demonstrate the feasibility of our approach. In order to bypass legal restrictions due to potential data leaks through model analysis, we did not make use of internal, proprietary data sets, which is a frequent veto factor for data set publication.

Methods: The underlying German data set was retrieved by translation and word alignment of a public English data set. The data set served as a foundation for model training and evaluation. For demonstration purposes, our NER model follows a simple network architecture that is designed for low computational requirements.

Results: The obtained data set consisted of 8599 sentences including 30,233 annotations. The model achieved a class frequency–averaged F1 score of 0.82 on the test set after training across 7 different NER types. Artifacts in the synthesized data set with regard to translation and alignment induced by the proposed method were exposed. The annotation performance was evaluated on an external data set and measured in comparison with an existing baseline model that has been trained on a dedicated German data set in a traditional fashion. We discussed the drop in annotation performance on an external data set for our simple NER model. Our model is publicly available.

Conclusions: We demonstrated the feasibility of obtaining a data set and training a German medical NER model by the exclusive use of public training data through our suggested method. The discussion on the limitations of our approach includes ways to further mitigate remaining problems in future work.
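
The Methods summary describes carrying English annotations over to a German translation through word alignment. The following is a minimal sketch of that general idea, not the authors' implementation: the example sentence pair, the hand-written alignment, the label names, and the function name `project_spans` are all hypothetical, and the rule of taking the smallest covering target span is an assumption made here for illustration. In practice the alignment pairs would come from a word alignment tool rather than being written by hand.

```python
from typing import Dict, List, Tuple

# (start_token, end_token_exclusive, label); labels here are hypothetical.
Span = Tuple[int, int, str]


def project_spans(spans: List[Span],
                  alignment: List[Tuple[int, int]]) -> List[Span]:
    """Project token-level entity spans from a source sentence onto its
    translation, given (source_index, target_index) alignment pairs."""
    # Index the alignment by source token for quick lookup.
    src_to_tgt: Dict[int, List[int]] = {}
    for src, tgt in alignment:
        src_to_tgt.setdefault(src, []).append(tgt)

    projected: List[Span] = []
    for start, end, label in spans:
        # Collect every target token aligned to any token of the entity.
        targets = [t for s in range(start, end) for t in src_to_tgt.get(s, [])]
        if not targets:
            continue  # assumption: entities without any alignment are dropped
        # Assumption: use the smallest contiguous span covering all targets.
        projected.append((min(targets), max(targets) + 1, label))
    return projected


# Hypothetical sentence pair (not taken from the data set).
english = ["The", "patient", "takes", "20", "mg", "aspirin", "daily"]
german = ["Der", "Patient", "nimmt", "täglich", "20", "mg", "Aspirin"]

# Hand-written alignment; note that "daily" moves forward in the German order.
alignment = [(0, 0), (1, 1), (2, 2), (3, 4), (4, 5), (5, 6), (6, 3)]

english_spans = [(3, 5, "Strength"), (5, 6, "Drug"), (6, 7, "Frequency")]

for start, end, label in project_spans(english_spans, alignment):
    print(label, german[start:end])
```

Taking the covering span keeps each projected entity contiguous, but it can absorb unrelated tokens when the translation reorders words inside an entity, which is one way alignment artifacts of the kind mentioned in the Results can arise.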

Publisher

JMIR Publications Inc.

Subject

Health Informatics, Medicine (miscellaneous)

Reference: 62 articles.

1. Clinical Text Data in Machine Learning: Systematic Review

2. Modern Clinical Text Mining: A Guide and Review

3. Digitalizing Health Services by Implementing a Personal Electronic Health Record in Germany: Qualitative Analysis of Fundamental Prerequisites From the Perspective of Selected Experts

4. Carlini N, Tramèr F, Wallace E, Jagielski M, Herbert-Voss A, Lee K, Roberts A, Brown T, Song D, Erlingsson U, Oprea A, Raffel C. Extracting training data from large language models. Usenix Association. 2021. 2023-01-27. https://www.usenix.org/conference/usenixsecurity21/presentation/carlini-extracting

Cited by 4 articles.

1. A Visualization Method of Knowledge Graphs for the Computation and Comprehension of Ultrasound Reports;Biomimetics;2023-11-21

2. GERNERMED++: Semantic annotation in German medical NLP through transfer-learning, translation and word alignment;Journal of Biomedical Informatics;2023-11

3. Task-Specific Transformer-Based Language Models in Medicine: A Survey (Preprint);2023-06-07

4. Development and Application of Teaching Model for Medical Humanities Education using Artificial Intelligence and Digital Humans Technologies;2023 IEEE 6th Eurasian Conference on Educational Innovation (ECEI);2023-02-03