An unsupervised embedding harmonization system for privacy-preserving data mining in healthcare

Abstract

Sharing data across hospitals for disease modeling is challenging because of patient privacy concerns and the lack of an efficient privacy-preserving data mining framework. Contextual embedding models, which encode medical events as vector representations while preserving the contextual dependencies between events, have shown promise for privacy-preserving data mining because they do not require disclosure of the original data. However, the medical event representations learned from multiple data sources lie in different embedding spaces and cannot be integrated directly. Existing embedding harmonization algorithms, known as supervised harmonization methods, require a list of medical events common to the data sources and use them as corresponding pairs for the transformation. Such common medical events can be difficult to collect in clinical practice. To promote data mining across hospitals, we developed a novel unsupervised embedding harmonization system that introduces an unsupervised harmonization algorithm to align contextual embeddings without the need for corresponding pairs. The proposed framework also considers different contextual embedding techniques, including Word2Vec and Med2Vec, to explore the robustness of the unsupervised harmonization algorithm. The framework was evaluated using medical events extracted from the Medical Information Mart for Intensive Care III (MIMIC-III) database. By integrating the embeddings from multiple sources, the proposed framework can achieve better disease prediction accuracy and medical event clustering than models built on a single data source. The proposed unsupervised harmonization method, which achieves performance similar to that of the supervised harmonization model under different contextual embedding techniques, holds great promise for predictive modeling and event clustering in healthcare.
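
To make the idea of harmonization with corresponding pairs concrete, the following is a minimal sketch of the supervised setting only, not the authors' system. It assumes, purely for illustration, that the two embedding spaces differ by a rotation and that embeddings are 2-D; the abstract does not specify the form of the transformation, and the unsupervised algorithm itself is not described here. All names (`fit_rotation`, `rotate`, `hospital_a`, `hospital_b`, the event labels) are hypothetical.

```python
import math

def fit_rotation(source, target):
    """Least-squares rotation angle that maps paired 2-D source vectors onto target."""
    # Accumulate dot and cross products over the corresponding pairs;
    # the angle maximizing the total alignment is atan2(cross, dot).
    dot = sum(sx * tx + sy * ty for (sx, sy), (tx, ty) in zip(source, target))
    cross = sum(sx * ty - sy * tx for (sx, sy), (tx, ty) in zip(source, target))
    return math.atan2(cross, dot)

def rotate(vectors, angle):
    """Rotate each 2-D vector counter-clockwise by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y) for x, y in vectors]

# Toy embeddings for three hypothetical medical events at hospital A.
hospital_a = {"event_1": (1.0, 0.0), "event_2": (0.0, 2.0), "event_3": (1.5, -0.5)}

# Assumption: hospital B learned the same geometry in a rotated space.
true_angle = 0.7
hospital_b = dict(zip(hospital_a, rotate(list(hospital_a.values()), true_angle)))

# Supervised harmonization: only the events known to be common act as pairs.
anchors = ["event_1", "event_2"]
angle = fit_rotation([hospital_a[k] for k in anchors],
                     [hospital_b[k] for k in anchors])

# Map every hospital A embedding, including the unpaired event, into B's space.
aligned = dict(zip(hospital_a, rotate(list(hospital_a.values()), angle)))
```

In this toy setup the transformation fitted on the two anchor events also places `event_3` correctly in the second space. That is the role the list of common medical events plays in supervised harmonization, and it is the requirement the unsupervised method is designed to remove.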

Disclosure statement

No potential conflict of interest was reported by the authors.

Consent and approval statement

There are no human subjects involved in this study and informed consent is not applicable.

Data availability statement

The data that support the findings of this study are available from the corresponding author upon reasonable request.

Additional information

Funding

This work was partially supported by NIH grant R01MH121394.
