EHR-Safe: generating high-fidelity and privacy-preserving synthetic electronic health records
Abstract
Privacy concerns often arise as the key bottleneck for the sharing of data between consumers and data holders, particularly for sensitive data such as Electronic Health Records (EHR). This impedes the application of data analytics and ML-based innovations with tremendous potential. One promising approach to such privacy concerns is to use synthetic data instead. We propose a generative modeling framework, EHR-Safe, for generating highly realistic and privacy-preserving synthetic EHR data. EHR-Safe is based on a two-stage model that consists of sequential encoder-decoder networks and generative adversarial networks. Our innovations focus on the key challenging aspects of real-world EHR data: heterogeneity, sparsity, coexistence of numerical and categorical features with distinct characteristics, and time-varying features with highly varying sequence lengths. Under numerous evaluations, we demonstrate that the fidelity of EHR-Safe is almost identical to that of real data (<3% accuracy difference for the models trained on them) while yielding almost ideal performance in practical privacy metrics.
Topics & Keywords
Authors (14)
Jinsung Yoon
Michel Mizrahi
Nahid Farhady Ghalaty
Thomas Jarvinen
Ashwin S. Ravi
Peter Brune
Fanyu Kong
Dave Anderson
George Lee
Arie Meir
Farhana Bandukwala
Elli Kanal
Sercan Ö. Arık
Tomas Pfister
Quick Access
- Publication Year: 2023
- Source Database: DOAJ
- DOI: 10.1038/s41746-023-00888-7
- Access: Open Access ✓