Privacy-Preserving Generative AI for Secure Healthcare Data Sharing in Kenya
Sam Odongo · Nairobi, Kenya
DP-GAN Framework
Wasserstein GAN with differential privacy generating synthetic EHRs under Kenya's Data Protection Act 2019
Privacy-Utility Balance
Jensen-Shannon divergence metrics preserving clinical signal for malaria, HIV, and maternal health AI tasks
County-Ready Deployment
Lightweight Docker container integrating with KHIS/DHIS2 across Kenya's 47 devolved county health systems
Abstract
Kenya's healthcare system holds enormous untapped potential for AI-driven clinical research, but the data that would power it is trapped behind silos, fragmented county systems, and a relatively new privacy law that most existing research tools were never designed to satisfy. This paper proposes a framework that combines Wasserstein Generative Adversarial Networks with differential privacy to generate synthetic healthcare datasets that are statistically realistic, clinically useful, and provably private.
Designed to operate within Kenya's Data Protection Act 2019 and aligned with Africa CDC's emerging health data-sharing guidelines, the system enables secure exchange of electronic health records across county hospitals and national research institutions without exposing patient identities. New utility metrics, including Jensen-Shannon divergence and predictive task validation on malaria severity prediction and maternal mortality risk scoring, demonstrate that synthetic records can support the same clinical AI research as real patient data.
The framework is built for Kenya's infrastructure reality: lightweight enough to deploy on commodity hospital hardware, containerized for integration with the Kenya Health Information System (KHIS), and designed around the 47-county devolved structure. The output is an open-source tool that health researchers across East Africa can use without waiting months for the complex data-sharing agreements that currently block clinical AI development.
Research Background
Working with health data in Kenya means sitting inside a frustrating standoff. On one side, you have researchers and AI teams who need patient records to build models that could genuinely save lives. On the other, you have hospitals and county health offices holding records they cannot legally share, even internally across counties, because Kenya's Data Protection Act 2019 classifies health information as sensitive personal data with strict handling and transfer restrictions.
The result is paralysis. Kenya's KHIS platform collects aggregate statistics from across the country, but the granular patient-level data needed to train a malaria severity predictor or a maternal mortality risk model sits in paper files and disconnected EMR systems that cannot communicate with each other. The Social Health Authority (SHA), which replaced NHIF in 2024 as Kenya's national health insurance scheme, is building new digital infrastructure under the Universal Health Coverage agenda, but the data pipeline for AI research is still years away from usable.
Traditional anonymization techniques make the problem worse rather than better. K-anonymity degrades data utility and fails on small population clusters, which are common in Kenya's county-level datasets. A person's county, age bracket, and primary diagnosis can be enough for re-identification in a sub-county of 80,000 people. Standard anonymization gives a false sense of compliance while delivering neither genuine privacy nor usable data.
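The small-cluster failure described above can be checked directly by counting how many records share each combination of quasi-identifiers. The following is a minimal sketch, assuming records are plain dictionaries; the field names (`sub_county`, `age_bracket`, `diagnosis`) and the toy rows are hypothetical and not drawn from any real dataset.

```python
from collections import Counter

def equivalence_class_sizes(records, quasi_identifiers):
    """Count how many records share each combination of quasi-identifier values."""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    return Counter(keys)

def k_anonymity(records, quasi_identifiers):
    """Return k, the size of the smallest equivalence class (0 for empty input)."""
    sizes = equivalence_class_sizes(records, quasi_identifiers)
    return min(sizes.values()) if sizes else 0

def unique_fraction(records, quasi_identifiers):
    """Fraction of records that are the only member of their equivalence class."""
    sizes = equivalence_class_sizes(records, quasi_identifiers)
    singles = sum(1 for count in sizes.values() if count == 1)
    return singles / len(records) if records else 0.0

# Hypothetical toy records, for illustration only
records = [
    {"sub_county": "A", "age_bracket": "20-29", "diagnosis": "malaria"},
    {"sub_county": "A", "age_bracket": "20-29", "diagnosis": "malaria"},
    {"sub_county": "A", "age_bracket": "60-69", "diagnosis": "HIV"},
    {"sub_county": "B", "age_bracket": "30-39", "diagnosis": "pre-eclampsia"},
]
qi = ["sub_county", "age_bracket", "diagnosis"]
k = k_anonymity(records, qi)
frac = unique_fraction(records, qi)
print(f"k = {k}, unique on quasi-identifiers: {frac:.0%}")
```

A dataset with k equal to 1 contains at least one record that is unique on these three fields alone, which is the re-identification situation the paragraph describes.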
The core insight is straightforward: if you can generate synthetic patient records that closely match real data in their statistical and clinical properties, while formal differential privacy guarantees bound how much any single patient's record can influence the output, the standoff breaks. Researchers get the training data they need. Patients keep their privacy. Hospitals stay compliant with the ODPC.
This project applies Wasserstein GANs combined with differential privacy to do exactly that, grounded in Kenya's regulatory environment and validated against the health challenges that matter most here: malaria, HIV, maternal mortality, and the rising burden of non-communicable diseases in urban and peri-urban areas.
Research Objectives
Build a Differential Privacy-Constrained GAN (DP-GAN)
Synthesize both structured records (lab values, diagnoses, SHA insurance codes) and temporal sequences (ICU monitoring data, outpatient visit histories) from Kenya's EHR landscape. Target data from KEMRI, Kenyatta National Hospital, and county-level KHIS exports from Universal Health Coverage pilot counties.
Design clinically grounded utility metrics
Ensure synthetic data retains statistical fidelity measured through Jensen-Shannon divergence across age, gender, county, and diagnosis strata. Validate clinical relevance through predictive tasks including malaria severity classification, maternal mortality risk scoring, and HIV progression modelling on both real and synthetic records, targeting less than 5% AUC-ROC gap.
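As an illustration of the fidelity metric named above, the sketch below computes Jensen-Shannon divergence between the real and synthetic distributions of one categorical stratum. It is a minimal example under stated assumptions: the age brackets and counts are made up, base-2 logarithms are used so the value lies between 0 (identical) and 1 (disjoint), and the helper names are hypothetical.

```python
import math
from collections import Counter

def distribution(values, categories):
    """Empirical probability of each category, in a fixed category order."""
    counts = Counter(values)
    total = len(values)
    return [counts.get(c, 0) / total for c in categories]

def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i = 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: average KL of p and q to their midpoint m."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Hypothetical age-bracket values for one county, for illustration only
categories = ["0-4", "5-14", "15-49", "50+"]
real_ages = ["0-4"] * 4 + ["5-14"] * 2 + ["15-49"] * 3 + ["50+"] * 1
synth_ages = ["0-4"] * 3 + ["5-14"] * 3 + ["15-49"] * 3 + ["50+"] * 1

jsd = js_divergence(
    distribution(real_ages, categories),
    distribution(synth_ages, categories),
)
print(f"JS divergence (age bracket): {jsd:.4f}")
```

Running the same comparison per stratum (age, gender, county, diagnosis) gives one score per stratum, which is what allows the results to be reported disaggregated.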
Validate compliance with Kenya's Data Protection Act 2019
Demonstrate that the synthetic data generation pipeline satisfies the ODPC's requirements for sensitive personal data processing, with formal privacy guarantees derived from differential privacy epsilon bounds rather than assumed from anonymization heuristics that have repeatedly proven insufficient.
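For reference, the epsilon bound mentioned above comes from the standard definition of differential privacy. A randomized mechanism M (here, the training procedure) is (ε, δ)-differentially private if, for every pair of datasets D and D' that differ in one patient's record and every set of outcomes S, the inequality below holds. The δ term is an assumption added here because Gaussian-noise mechanisms are usually analysed under the (ε, δ) form; the source text specifies only an epsilon range.

```latex
\Pr\big[\mathcal{M}(D) \in S\big] \;\le\; e^{\varepsilon}\,\Pr\big[\mathcal{M}(D') \in S\big] + \delta
```

Smaller ε means the output distribution changes less when any one record is added or removed. How a given ε maps onto the ODPC's requirements is not implied by the definition itself and remains the question this objective sets out to answer.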
Deploy for county-level health infrastructure
Package the framework as a lightweight Docker container integrable with KHIS (DHIS2-based), SHA's new digital claims systems, and KEMRI research environments. Designed to run on standard hospital servers, not cloud-only, given the connectivity and budget realities across Kenya's 47 county health systems.
Research Questions
How can Wasserstein GANs combined with differential privacy generate synthetic healthcare records that satisfy both Kenya's Data Protection Act 2019 and the clinical utility requirements of health AI research?
What privacy epsilon bounds allow synthetic EHR data to remain ODPC-compliant while preserving enough signal for malaria severity prediction and maternal mortality risk scoring tasks?
How does the utility-privacy tradeoff in DP-GAN perform when calibrated on Kenya-specific datasets (KEMRI records, SHA claims data) compared to Western benchmarks like MIMIC-III, given the smaller and sparser nature of Kenyan health data?
What are the practical barriers to deploying a synthetic data generation system across Kenya's 47 devolved county health systems, and how can containerized deployment reduce integration friction?
Methodology
Four phases over 24 months, each grounded in Kenya's institutional structure and infrastructure constraints. The sequencing matters: ethics clearance from three Kenyan bodies, not one, is required before data access, and that timeline is realistic only if started early.
Data Acquisition and Ethics (Months 1–6)
- Datasets: Partner with KEMRI and Kenyatta National Hospital for de-identified EHRs covering malaria, HIV, maternal health, and outpatient records. Supplement with KHIS aggregate exports and SHA claims pilot data from UHC counties (Kisumu, Machakos, Nyeri, Isiolo).
- Ethics: Obtain clearance from KEMRI's Scientific and Ethics Review Unit (SERU), Kenya's National Commission for Science, Technology and Innovation (NACOSTI), and the Office of the Data Protection Commissioner for synthetic data research involving sensitive health records.
- Regulatory mapping: Document specific obligations under the Data Protection Act 2019, the Health Act 2017, and SHA data governance guidelines to define the exact compliance requirements the DP-GAN pipeline must satisfy at output.
Model Development (Months 7–12)
- Architecture: CNN-LSTM hybrid generator to capture both structured tabular records (lab values, diagnoses, insurance claim codes) and temporal sequences (ICU sensor streams, outpatient visit histories across multiple care episodes).
- Discriminator: Differentially private using PyTorch Opacus with epsilon range 0.5 to 2.0, calibrated against Kenya's ODPC consent and data minimization requirements rather than defaulting to HIPAA-derived benchmarks.
- Privacy mechanism: Gaussian noise via DP-SGD, with per-sample gradient clipping to bound each record's influence on every update. Wasserstein loss is used for training stability and to reduce mode collapse on smaller Kenyan datasets, which are significantly sparser than MIMIC-III and may need tighter clipping thresholds.
- Kenya-specific preprocessing: Handle ICD-10 coding patterns for common Kenyan diagnoses (P. falciparum malaria, HIV staging, obstetric complications), Swahili-English mixed clinical notes, and the missing data patterns characteristic of paper-to-digital migration in county facilities.
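The privacy mechanism in the list above can be illustrated with a single DP-SGD gradient step: clip each record's gradient to a maximum L2 norm, sum, add Gaussian noise scaled to that norm, and average. This is a minimal sketch in plain Python and does not call Opacus; the function names, the `noise_multiplier` and `max_norm` parameters, and the toy gradients are assumptions made for illustration.

```python
import math
import random

def clip(grad, max_norm):
    """Scale a per-sample gradient down so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]

def dp_sgd_gradient(per_sample_grads, max_norm, noise_multiplier, rng):
    """Clip each sample's gradient, sum, add Gaussian noise, then average."""
    clipped = [clip(g, max_norm) for g in per_sample_grads]
    dim = len(clipped[0])
    summed = [sum(g[i] for g in clipped) for i in range(dim)]
    # Noise standard deviation is proportional to the clipping bound,
    # because max_norm is the most any one record can contribute to the sum.
    sigma = noise_multiplier * max_norm
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [n / len(per_sample_grads) for n in noisy]

# Hypothetical per-sample discriminator gradients (two records, two parameters)
grads = [[3.0, 4.0], [0.3, 0.4]]
rng = random.Random(0)
private_grad = dp_sgd_gradient(grads, max_norm=1.0, noise_multiplier=1.1, rng=rng)
print(private_grad)
```

In this sketch a smaller `max_norm` limits the influence of outlier records but also discards more of the gradient signal, which is one way the privacy-utility tradeoff shows up on small, sparse datasets. Converting a noise multiplier and a number of training steps into an epsilon value requires a privacy accountant, which is not shown here.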
Quality Assurance (Months 13–18)
- Statistical validation: Jensen-Shannon divergence to compare distribution similarity between synthetic and real records across age, gender, county of origin, and primary diagnosis strata. Report disaggregated by county to catch regional data quality gaps.
- Clinical utility: Train XGBoost models on both synthetic and real data; compare AUC-ROC scores for malaria severity classification, maternal mortality risk scoring, and HIV progression prediction. Target less than 5% AUC gap, which corresponds to synthetic data being practically substitutable in downstream AI research.
- Adversarial testing: Run linkage attacks and membership inference attacks calibrated for small-population Kenyan data, where re-identification risk is meaningfully higher than in large Western EHR datasets. Standard MIMIC-III-based attack benchmarks do not apply directly and will be re-parameterized for Kenya's demographic context.
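The clinical utility check above reduces to comparing two AUC-ROC scores on the same held-out real records: one from a model trained on real data and one from a model trained on synthetic data. The sketch below shows only that comparison, using a rank-based AUC so it needs no external libraries. The scores are hypothetical stand-ins for model outputs, and reading the 5% target as an absolute gap of 0.05 is an assumption, since the target could also be read as a relative difference.

```python
def auc_roc(labels, scores):
    """Rank-based AUC: probability a positive outscores a negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def auc_gap(labels, scores_real_model, scores_synth_model):
    """Absolute AUC difference between real-trained and synthetic-trained models."""
    return abs(auc_roc(labels, scores_real_model) - auc_roc(labels, scores_synth_model))

# Hypothetical held-out real labels (1 = severe malaria) and model scores
labels = [1, 1, 0, 0, 1, 0]
scores_real_model = [0.9, 0.8, 0.2, 0.3, 0.7, 0.4]   # model trained on real data
scores_synth_model = [0.9, 0.6, 0.2, 0.7, 0.8, 0.4]  # model trained on synthetic data

gap = auc_gap(labels, scores_real_model, scores_synth_model)
meets_target = gap < 0.05  # assumed absolute reading of the 5% target
print(f"AUC gap: {gap:.3f}, meets target: {meets_target}")
```

Both models are scored on the same real test set so that the gap reflects the training data rather than differences in the evaluation sample.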
Deployment and Dissemination (Months 19–24)
- Lightweight deployment: Docker-containerized framework with a DHIS2/KHIS adapter for integration with Kenya's existing national health data infrastructure. Tested on commodity hospital hardware, not cloud instances, to validate real-world deployability across county facilities.
- Policy contribution: Whitepaper for Kenya's Office of the Data Protection Commissioner and Ministry of Health on synthetic data as a legally compliant mechanism for health AI research, with a governance framework that Uganda, Tanzania, and other DHIS2-adopting East African countries can adapt directly.
- Open-source release: Full framework published on GitHub under an open license with documentation in English and Swahili, targeting East African health researchers who currently have no practical access to privacy-preserving synthetic data tooling designed for their regulatory and infrastructure context.
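One possible shape for the DHIS2/KHIS adapter mentioned in the deployment item is sketched below. It builds a request URL for the DHIS2 `dataValueSets` Web API resource and flattens a JSON response into rows. This is a sketch under assumptions, not the project's adapter: the base URL, the UIDs, and the sample payload are hypothetical, no network call is made, and a real KHIS instance would also need authentication.

```python
import json
from urllib.parse import urlencode

def data_value_sets_url(base_url, data_set, org_unit, period):
    """Build a DHIS2 dataValueSets request URL for one data set, org unit, and period."""
    query = urlencode({"dataSet": data_set, "orgUnit": org_unit, "period": period})
    return f"{base_url.rstrip('/')}/api/dataValueSets.json?{query}"

def parse_data_values(payload):
    """Flatten a dataValueSets JSON payload into a list of row dictionaries."""
    doc = json.loads(payload)
    rows = []
    for dv in doc.get("dataValues", []):
        rows.append({
            "data_element": dv.get("dataElement"),
            "org_unit": dv.get("orgUnit"),
            "period": dv.get("period"),
            "value": dv.get("value"),  # DHIS2 returns values as strings
        })
    return rows

# Hypothetical instance URL and UIDs, for illustration only
url = data_value_sets_url("https://khis.example.org/", "DS_UID", "OU_UID", "202401")

# Hypothetical response body in the dataValueSets shape
sample_payload = json.dumps({
    "dataValues": [
        {"dataElement": "DE_MALARIA", "orgUnit": "OU_UID", "period": "202401", "value": "42"},
        {"dataElement": "DE_ANC", "orgUnit": "OU_UID", "period": "202401", "value": "17"},
    ]
})
rows = parse_data_values(sample_payload)
print(url)
print(rows)
```

Keeping URL construction and parsing as separate pure functions lets the adapter be tested inside the container without connectivity, which matters for facilities with intermittent network access.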
Expected Outcomes
Technical
- Open-source DP-GAN framework for East African healthcare data, targeting npj Digital Medicine or JAMIA for peer-reviewed publication.
- Utility-privacy tradeoff metrics calibrated specifically for smaller, sparser Kenyan datasets where standard MIMIC-III-based benchmarks do not transfer without recalibration.
- DHIS2/KHIS integration adapter enabling synthetic data generation directly from Kenya's existing national health data infrastructure without bespoke ETL engineering by each partner institution.
Societal
- Enable data collaboration between KEMRI, Kenyatta National Hospital, county health departments, and AI research teams without requiring patient re-consent or formal data transfer agreements that currently take months to execute.
- Reduce the health research data access lag from months to hours using synthetic surrogates that satisfy ODPC compliance, unblocking clinical AI development that has stalled across Kenya's research institutions.
- Give Kenya's Ministry of Health and SHA a practical synthetic data governance framework applicable to the new digital infrastructure being built under Universal Health Coverage, ahead of the real-data pipelines maturing.
Why Kenya First
Kenya is not the obvious choice for a healthcare AI privacy paper. The EHR datasets are smaller than MIMIC-III. Digital record-keeping is still partial across county facilities. The Data Protection Act 2019 is less than six years old and enforcement is still being established. On paper, it looks like a harder problem than running the same research in a high-income country with mature hospital data infrastructure.
That is exactly the point. Privacy-preserving healthcare AI research is almost entirely designed for high-income contexts with large hospital networks, stable EHR systems, and well-resourced compliance teams. The techniques that work in those environments often break or become impractical in a 47-county devolved system where some facilities still run paper registers and connectivity cuts out during the rainy season. Retrofitting a framework designed for MIMIC-III onto Kenyan data does not work cleanly, and the gap between what the research literature assumes and what Kenya's health system actually looks like is significant.
Kenya's specific conditions also create privacy research problems that do not exist in Western datasets. Small county-level population clusters make re-identification risk materially higher, which means standard epsilon values need recalibration for this context. Swahili-English code-switching in clinical notes requires different text preprocessing before synthesis. Missing data patterns from paper-to-digital migration demand different imputation strategies. These are not edge cases to handle later. They are the core research problem.
A framework validated on Kenya's regulatory constraints, data characteristics, and infrastructure limitations is more directly useful to East African health researchers than one validated on MIMIC-III and theoretically generalized. Build for where you are. The lessons export from there.
Kenya also has the institutional base to make this work. KEMRI is one of Africa's strongest health research bodies. The KHIS platform on DHIS2 is nationally deployed and actively maintained. SHA is building new claims infrastructure with digital-first architecture. The Data Protection Act 2019 gives the ODPC authority to define compliant synthetic data practices. The ingredients exist. The question is whether tooling exists that is designed for this context, rather than imported from elsewhere in the hope that it fits.
The wider beneficiaries extend beyond Kenya. Uganda, Tanzania, Rwanda, and Ethiopia all use DHIS2. Most face the same data access standoff. A DP-GAN framework with a DHIS2 adapter and governance documentation written for the African Union regulatory context is directly transferable across the region in a way that a framework designed for HIPAA compliance is not.
References
Kenya Data Protection Act, 2019. No. 24 of 2019. Office of the Data Protection Commissioner (ODPC), Nairobi.
Kenya Ministry of Health. (2023). Kenya Health Sector Strategic Plan 2023–2030. Government of Kenya.
Kenya National Bureau of Statistics. (2022). Kenya Demographic and Health Survey 2022. KNBS, Nairobi.
Africa CDC. (2023). Africa Health Data Sharing Framework. African Union, Africa Centres for Disease Control and Prevention.
Feretzakis, G., Papaspyridis, K., Gkoulalas-Divanis, A., & Verykios, V. S. (2024). Privacy-Preserving Techniques in Generative AI and Large Language Models: A Narrative Review. Information, 15(11), 697.
Mumtaz, M., Tayyab, M., Jhanjhi, N. Z., Muzammal, S. M., & Hameed, K. (2025). Privacy preserving data analysis with generative AI. In AI Techniques for Securing Medical and Business Practices (pp. 391–410). IGI Global.
Chen, Y., & Esmaeilzadeh, P. (2024). Generative AI in medical practice: in-depth exploration of privacy and security challenges. Journal of Medical Internet Research, 26, e53008.
Khalid, N., Qayyum, A., Bilal, M., Al-Fuqaha, A., & Qadir, J. (2023). Privacy-preserving artificial intelligence in healthcare: Techniques and applications. Computers in Biology and Medicine, 158, 106848.
Venugopal, R., Shafqat, N., Venugopal, I., Tillbury, B. M. J., Stafford, H. D., & Bourazeri, A. (2022). Privacy preserving generative adversarial networks to model electronic health records. Neural Networks, 153, 339–348.
Yoon, J. et al. (2020). GANs for Synthetic EHR Data. Journal of the American Medical Informatics Association (JAMIA).
KEMRI. (2023). Annual Report: Health Research for a Healthier Kenya. Kenya Medical Research Institute, Nairobi.