BACKGROUND
Linking individual-level records across databases managed by different institutions can yield large datasets useful for epidemiological research. Such linkage must protect private information while still matching data efficiently and to a high standard.
OBJECTIVE
Privacy-Preserving Distributed Data Integration (PDDI) is a technology that enables data matching between multiple databases without moving private information. Because matching keys can contain errors, we conducted a basic matching experiment modeled on the accuracy assessment of cancer screening.
METHODS
We created a dataset that mimics cancer screening and registration data in Japan and conducted a matching experiment using a PDDI system between geographically distant institutions. Errors similar to those found empirically in datasets recorded in Japanese were artificially introduced. The matching-key error rate for the records common to both datasets was deliberately set higher than expected in actual databases: 85.0% and 59.0% for the data simulating colorectal and breast cancer, respectively. Various combinations of name, gender, date of birth, and address were used as the matching key. To evaluate matching accuracy, the matching sensitivity and specificity were calculated based on the number of cancer screening data points, and the effect of matching accuracy on the sensitivity and specificity of cancer screening was estimated from the obtained values. To evaluate performance, we measured CPU usage, memory usage, and network traffic.
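As a rough illustration of how matching sensitivity and specificity could be tallied for one key combination, the following Python sketch matches records in plaintext on an exact composite key. It is not the PDDI protocol, which avoids moving private information, and it is not the authors' implementation. The field names, the `person_id` ground-truth column, and the toy records are all hypothetical.

```python
def make_key(record, fields):
    """Build a composite matching key from the chosen fields."""
    return tuple(record[f] for f in fields)


def matching_accuracy(screening, registry, fields):
    """Return (sensitivity, specificity) of exact matching on `fields`.

    `person_id` is a hypothetical ground-truth identifier used only for
    scoring, never for matching.
    """
    ids_by_key = {}
    for rec in registry:
        ids_by_key.setdefault(make_key(rec, fields), set()).add(rec["person_id"])
    registered = {rec["person_id"] for rec in registry}

    tp = fn = tn = fp = 0
    for rec in screening:
        hit_ids = ids_by_key.get(make_key(rec, fields), set())
        if rec["person_id"] in registered:
            # Truly registered: correct only if linked to the same person.
            if rec["person_id"] in hit_ids:
                tp += 1
            else:
                fn += 1
        else:
            # Truly unregistered: any link is a false match.
            if hit_ids:
                fp += 1
            else:
                tn += 1
    return tp / (tp + fn), tn / (tn + fp)


# Toy data (hypothetical). Record 2 has a spelling error in the first name;
# record 4 is unregistered but shares family name and birth date with record 1.
registry = [
    {"person_id": 1, "family_name": "Sato", "first_name": "Hanako", "birth_date": "1960-01-01"},
    {"person_id": 2, "family_name": "Suzuki", "first_name": "Taro", "birth_date": "1955-05-05"},
]
screening = [
    {"person_id": 1, "family_name": "Sato", "first_name": "Hanako", "birth_date": "1960-01-01"},
    {"person_id": 2, "family_name": "Suzuki", "first_name": "Tarou", "birth_date": "1955-05-05"},
    {"person_id": 3, "family_name": "Tanaka", "first_name": "Taro", "birth_date": "1970-07-07"},
    {"person_id": 4, "family_name": "Sato", "first_name": "Jiro", "birth_date": "1960-01-01"},
]

print(matching_accuracy(screening, registry, ("birth_date", "first_name")))   # (0.5, 1.0)
print(matching_accuracy(screening, registry, ("birth_date", "family_name")))  # (1.0, 0.5)
```

In this toy case, the choice of key fields trades sensitivity against specificity: a key containing an error-prone field misses true pairs, while a less discriminating key links unrelated people.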
RESULTS
Among key combinations achieving a matching specificity of 99% or higher together with high sensitivity, date of birth and first name were used for the data simulating colorectal cancer, giving a matching sensitivity and specificity of 55.00% and 99.85%, respectively. For the data simulating breast cancer, date of birth and family name were used, giving a matching sensitivity and specificity of 88.71% and 99.98%, respectively. Assuming a screening sensitivity and specificity of 90% each, the apparent values decreased to 74.90% and 89.93%, respectively. A trial calculation was also performed for the same dataset using a key combination with a matching specificity of 100%. With a matching sensitivity of 82.26%, the apparent screening sensitivity was maintained at 90%, and the apparent screening specificity dropped to 89.89%, a small error from the original value. For 2^14 (16,384) data points, the execution time was 82 minutes and 26 seconds without parallelization and 11 minutes and 38 seconds with parallelization; 19.33% of the calculation time was spent at the data-holding institutions. Memory usage was 3.4 GB for the PDDI server and 2.7 GB for the data-holding institutions.
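To show why imperfect matching shifts the apparent screening accuracy, the following Python sketch propagates matching errors into the screening sensitivity and specificity. It is a simplified model, not the authors' calculation: it assumes that matching errors are independent of the screening result, and the prevalence value is hypothetical, so it is not expected to reproduce the figures reported above.

```python
def apparent_screening_accuracy(prevalence, screen_sens, screen_spec,
                                match_sens, match_spec):
    """Return (apparent sensitivity, apparent specificity) of screening
    when cancer status is assigned by imperfect record matching.

    Assumes matching errors are independent of the screening result.
    """
    cancer_matched = prevalence * match_sens
    cancer_missed = prevalence * (1 - match_sens)
    healthy_false_match = (1 - prevalence) * (1 - match_spec)
    healthy_unmatched = (1 - prevalence) * match_spec

    # Apparent cancer group = matched cancers + falsely matched non-cancers.
    apparent_sens = (
        cancer_matched * screen_sens + healthy_false_match * (1 - screen_spec)
    ) / (cancer_matched + healthy_false_match)

    # Apparent non-cancer group = unmatched non-cancers + missed cancers.
    apparent_spec = (
        healthy_unmatched * screen_spec + cancer_missed * (1 - screen_sens)
    ) / (healthy_unmatched + cancer_missed)
    return apparent_sens, apparent_spec


# Hypothetical prevalence of 1% among screened individuals.
sens, spec = apparent_screening_accuracy(
    prevalence=0.01, screen_sens=0.90, screen_spec=0.90,
    match_sens=0.8226, match_spec=1.0,
)
print(f"apparent sensitivity={sens:.4f}, apparent specificity={spec:.4f}")
```

In this model, a matching specificity of 100% leaves the apparent sensitivity equal to the true value, because no non-cancer record enters the apparent cancer group. Missed matches only lower the apparent specificity slightly, since unlinked cancers are a small fraction of the apparent non-cancer group.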
CONCLUSIONS
We demonstrated the basic feasibility of introducing a PDDI system for assessing the accuracy of cancer screening. We plan to conduct matching experiments on actual data and to compare the system with existing methods.