BACKGROUND
Chatbots have recently emerged as an alternative approach for delivering cancer risk assessment and genetic counseling. Understanding the metrics used to describe the user-chatbot experience can highlight the strengths and weaknesses of AI-assisted healthcare applications and help ensure safe and reliable medical care. While research supports the use of chatbots in cancer genetic risk assessment and counseling, the measures used to evaluate them remain inconsistent and unsystematic.
OBJECTIVE
This systematic review analyzes the metrics used to evaluate chatbot platforms that provide cancer genetic risk assessment and pretest and posttest genetic education. We examine these measures to identify their limitations and to inform a more systematic approach to evaluation.
METHODS
A comprehensive search was conducted in three databases: PubMed, Web of Science, and Engineering Village. Articles were screened and selected following the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) guidelines. Study and chatbot characteristics were documented, along with variables that may affect metric use. Metrics evaluating the user-chatbot experience were extracted, categorized into domains, and mapped onto the Reach, Effectiveness, Adoption, Implementation, and Maintenance (RE-AIM) framework to identify assessment gaps and generate insights into chatbot application and effectiveness.
RESULTS
The database search retrieved 684 citations, of which 11 articles met the inclusion criteria. The studies varied in methodology, research setting, chatbot functionality, and participant characteristics. A total of 104 measures were extracted and categorized into 16 groups, with individual studies using 2 to 22 metrics (median 8). The measurement groups were organized into five domains: user experience, knowledge acquisition, outcomes and behaviors, emotional response, and technical performance, with user experience measures being the most common. Notably, despite the educational purpose of AI-assisted genetic counseling, the knowledge acquisition domain ranked only third in metric usage. Mapping the metrics onto the RE-AIM framework showed that the studies collectively addressed all five of its dimensions but also revealed user-centric measures omitted from chatbot evaluations, including accuracy, transparency, data privacy, and educational continuity.
CONCLUSIONS
The small number of studies on automated cancer genetic risk assessment and education showed substantial variability in the metrics used, and a unified evaluation process is essential for accurately assessing chatbot effectiveness. Measures of the knowledge users gain are particularly valuable, yet they currently receive only moderate emphasis. Expanding educational metrics would strengthen the informed consent process and empower patients in their healthcare decisions. Additionally, accounting for confounding variables and using frameworks such as RE-AIM can help ensure that appropriate measures are implemented and not overlooked. Together, these strategies will promote the safe and effective use of novel genetic services.
CLINICALTRIAL
N/A