SimKG-BERT: A Security Enhancement Approach for Healthcare Models Consisting of Fusing SimBERT and a Knowledge Graph
-
Published:2024-02-18
Issue:4
Volume:14
Page:1633
-
ISSN:2076-3417
-
Container-title:Applied Sciences
-
language:en
-
Short-container-title:Applied Sciences
Author:
Li Songpu (1, 2), Yu Xinran (3), Chen Peng (2, 4)
Affiliation:
1. College of Economics & Management, Three Gorges University, Yichang 443002, China
2. Hubei Key Laboratory of Intelligent Vision Based Monitoring for Hydroelectric Engineering, Three Gorges University, Yichang 443002, China
3. The Second Clinical Medical College, Lanzhou University, Lanzhou 730000, China
4. College of Computer and Information Technology, Three Gorges University, Yichang 443002, China
Abstract
Model robustness is an important index in medical cybersecurity, and hard-negative samples in electronic medical records can provide more gradient information, which can effectively improve the robustness of a model. However, hard negatives are difficult both to define and to acquire. To address these problems, this paper proposes a data augmentation approach for hard-negative samples that fuses SimBERT with a knowledge graph. First, we selected 40 misdiagnosed cases of diabetic complications as the original data for augmentation. Second, we divided the contents of each electronic medical record into two parts. One part consisted of the core disease phrases in the misdiagnosed case records, selected by a medical specialist; these denoted the critical diseases that the model diagnosed as negative samples. From these core phrases, new symptom phrases were generated directly using the SimBERT model. The other part consisted of the noncore phrases of the misdiagnosed records, which were highly similar to the positive samples. For this part, we computed the cosine similarity between the embedding vectors of the knowledge graph entities and a vector built from the noncore phrases, and then used Top-K sampling to generate text. Finally, combining the generated text from the two parts with the perturbed numerical indexes yielded 160 augmented samples. Our experiments show that, in the vector space, the samples generated by SimKG-BERT lay closer to the positive samples and the anchor points than those generated by the other models, which is more in line with how hard negatives are defined. In addition, compared with the model without data augmentation, the F1 values on the three data sets of diabetic complications increased by 6.4%, 2.24%, and 5.54%, respectively. The SimKG-BERT model achieves data augmentation in the absence of misdiagnosed medical records, providing more gradient information to the model, which not only improves the robustness of the model but also meets the realistic needs of assisted-diagnosis safety.
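The noncore-phrase step described in the abstract (cosine similarity against knowledge graph entity embeddings, followed by Top-K sampling) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy 3-dimensional embeddings, the example entity names, and the softmax weighting of the top-K candidates are all assumptions made for illustration, since the abstract does not specify them.

```python
import math
import random


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def top_k_entities(phrase_vec, entity_vecs, k):
    """Return the k (entity, similarity) pairs most similar to phrase_vec."""
    scored = [
        (name, cosine_similarity(phrase_vec, vec))
        for name, vec in entity_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


def sample_top_k(phrase_vec, entity_vecs, k, rng=None):
    """Draw one entity from the top-k candidates.

    Assumption: candidates are weighted by a softmax over their
    similarity scores; the source does not state the weighting.
    """
    rng = rng or random.Random()
    candidates = top_k_entities(phrase_vec, entity_vecs, k)
    names = [name for name, _ in candidates]
    weights = [math.exp(score) for _, score in candidates]
    return rng.choices(names, weights=weights, k=1)[0]


# Hypothetical toy embeddings; real entity vectors would come from a
# knowledge graph embedding model.
entity_vecs = {
    "polyuria": [0.9, 0.1, 0.0],
    "polydipsia": [0.8, 0.2, 0.1],
    "blurred vision": [0.1, 0.9, 0.2],
    "foot ulcer": [0.0, 0.2, 0.9],
}
phrase_vec = [1.0, 0.1, 0.0]  # hypothetical vector for the noncore phrases

replacement = sample_top_k(phrase_vec, entity_vecs, k=2, rng=random.Random(0))
```

Restricting the draw to the K nearest entities keeps the generated phrase close to the original noncore text, while the random choice among them adds variety across augmented samples.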
Funder
National Key Research and Development Program of China