BACKGROUND
Advances in genetics have revealed strong links between genetic factors and health, increasing the demand for genetic counseling services. At the same time, the shortage of genetic counseling professionals has become a significant challenge. The recent emergence of large language models (LLMs) offers a potential solution to this issue. However, the current capabilities and limitations of Japanese LLMs for genetic counseling require further investigation. In addition, to develop a dialogue system that supports genetic counseling, domain adaptation methods for LLMs should be explored, and expert evaluations should be collected to assess the quality of LLM responses.
OBJECTIVE
This study aims to evaluate the current capabilities of, and identify obstacles to, an LLM-based dialogue system for genetic counseling. The primary focus is to assess the effectiveness of domain adaptation methods in this context. Furthermore, we construct a dataset in which experts evaluate responses generated by LLMs adapted with various domain adaptation methods, gathering expert feedback for the future development of genetic counseling LLMs.
METHODS
Our study used two main datasets: (1) a question-answering (QA) dataset for LLM adaptation and (2) a genetic counseling question dataset for evaluation. The QA dataset comprised 899 pairs covering topics in medicine and genetic counseling, whereas the evaluation dataset comprised 120 carefully refined questions across six genetic counseling categories. Three domain adaptation methods, namely instruction tuning, retrieval-augmented generation (RAG), and prompt engineering, were applied to a lightweight Japanese LLM, and the adapted model's responses to the 120 evaluation questions were assessed by two certified genetic counselors and one ophthalmologist on four key metrics: (1) inappropriateness of information, (2) sufficiency of information, (3) severity of harm, and (4) alignment with medical consensus.
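As an illustration of the RAG setup described above, the following is a minimal sketch of the retrieve-then-prompt pattern. The document list, similarity measure, and prompt template are hypothetical stand-ins for illustration only; the study's actual Japanese LLM, retriever, and reference corpus are not reproduced here.

```python
# Minimal retrieval-augmented generation (RAG) sketch: retrieve the most
# similar reference passage for a question, then prepend it to the prompt.
# Bag-of-words cosine similarity stands in for a real embedding retriever.
from collections import Counter
import math


def tokenize(text):
    return text.lower().split()


def cosine(a, b):
    """Cosine similarity between two token lists."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(question, documents, k=1):
    """Rank reference passages by similarity to the question."""
    q = tokenize(question)
    ranked = sorted(documents, key=lambda d: cosine(q, tokenize(d)), reverse=True)
    return ranked[:k]


def build_prompt(question, documents):
    """Prepend the retrieved context to the question for the LLM."""
    context = "\n".join(retrieve(question, documents))
    return f"Reference:\n{context}\n\nQuestion: {question}\nAnswer:"


# Hypothetical mini knowledge base, for illustration only.
docs = [
    "BRCA1 variants are associated with hereditary breast and ovarian cancer.",
    "Carrier screening estimates the chance of passing on a recessive condition.",
]
prompt = build_prompt("What conditions are BRCA1 variants associated with?", docs)
```

In a real pipeline, the resulting prompt would be passed to the adapted LLM; dense embeddings would replace the bag-of-words similarity used here.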
RESULTS
The evaluation conducted by the certified genetic counselors and the ophthalmologist revealed varied outcomes across the domain adaptation methods. RAG demonstrated promising results, particularly in enhancing key aspects of genetic counseling, whereas instruction tuning and prompt engineering yielded less favorable outcomes. This evaluation process yielded a dataset of expert-evaluated responses generated by LLMs adapted using various combinations of these methods. Error analysis highlighted critical ethical concerns, including the inappropriate promotion of prenatal testing, criticism of relatives, and inaccurate probability statements.
CONCLUSIONS
RAG significantly improved performance across all evaluation criteria, with potential for further gains through the expansion of the RAG data. Our expert-evaluated dataset offers valuable insights for future development. However, the ethical issues identified in LLM responses underscore the need for continued refinement and careful ethical consideration before such systems are deployed in health care settings.