TY - JOUR
T1 - Bilingual autoencoder-based efficient harmonization of multi-source private data for accurate predictive modeling
AU - Lee, Taek Ho
AU - Lee, Junghye
AU - Jun, Chi Hyuck
N1 - Publisher Copyright: © 2021 The Authors
PY - 2021/8
Y1 - 2021/8
N2 - Sharing electronic health record data is essential for advanced analysis, but may put sensitive information at risk. Several studies have attempted to address this risk using contextual embedding, but with many hospitals involved, they are often inefficient and inflexible. Thus, we propose a bilingual autoencoder-based model to harmonize local embeddings in different spaces. Cross-hospital reconstruction of embeddings makes encoders map embeddings from hospitals to a shared space and align them spontaneously. We also suggest two-phase training to prevent distortion of embeddings during harmonization with hospitals that have biased information. In experiments, we used medical event sequences from the Medical Information Mart for Intensive Care-III dataset and simulated the situation of multiple hospitals. For evaluation, we measured the alignment of events from different hospitals and the prediction accuracy of a patient's diagnosis in the next admission in three scenarios in which local embeddings do not work. The proposed method efficiently harmonizes embeddings in different spaces, increases prediction accuracy, and gives flexibility to include new hospitals, so is superior to previous methods in most cases. It will be useful in predictive tasks to utilize distributed data while preserving private information.
AB - Sharing electronic health record data is essential for advanced analysis, but may put sensitive information at risk. Several studies have attempted to address this risk using contextual embedding, but with many hospitals involved, they are often inefficient and inflexible. Thus, we propose a bilingual autoencoder-based model to harmonize local embeddings in different spaces. Cross-hospital reconstruction of embeddings makes encoders map embeddings from hospitals to a shared space and align them spontaneously. We also suggest two-phase training to prevent distortion of embeddings during harmonization with hospitals that have biased information. In experiments, we used medical event sequences from the Medical Information Mart for Intensive Care-III dataset and simulated the situation of multiple hospitals. For evaluation, we measured the alignment of events from different hospitals and the prediction accuracy of a patient's diagnosis in the next admission in three scenarios in which local embeddings do not work. The proposed method efficiently harmonizes embeddings in different spaces, increases prediction accuracy, and gives flexibility to include new hospitals, so is superior to previous methods in most cases. It will be useful in predictive tasks to utilize distributed data while preserving private information.
KW - Autoencoder
KW - Contextual embedding
KW - Distributed EHR
KW - Predictive tasks
KW - Space alignment
UR - http://www.scopus.com/inward/record.url?scp=85104930760&partnerID=8YFLogxK
U2 - 10.1016/j.ins.2021.03.064
DO - 10.1016/j.ins.2021.03.064
M3 - Article
AN - SCOPUS:85104930760
SN - 0020-0255
VL - 568
SP - 403
EP - 426
JO - Information Sciences
JF - Information Sciences
ER -