Nguyen, T. T. et al. Deep learning for deepfakes creation and detection: a survey. Comp. Vis. Image Underst. 223, 103525 (2022).

Fallis, D. The epistemic threat of deepfakes. Philos. Technol. 34, 623–643 (2021).

Raghunathan, T. E. Synthetic data. Annu. Rev. Stat. Appl. 8, 129–140 (2021).

Figueira, A. & Vaz, B. Survey on synthetic data generation, evaluation methods and GANs. Mathematics 10, 2733 (2022).

de Melo, C. M. et al. Next-generation deep learning based on simulators and synthetic data. Trends Cognit. Sci. 26, 174–187 (2022).

Laubenbacher, R., Mehrad, B., Shmulevich, I. & Trayanova, N. Digital twins in medicine. Nat. Comput. Sci. 4, 184–191 (2024).

Rubin, D. B. Multiple Imputation for Nonresponse in Surveys (John Wiley & Sons, 2004).

Goodfellow, I. J. et al. Generative adversarial networks. Preprint at https://arxiv.org/abs/1406.2661 (2014). This study introduces the concept of GANs, upon which all further iterations of GAN-based generative models are based.

Kingma, D. P. & Welling, M. Auto-encoding variational Bayes. Preprint at https://arxiv.org/abs/1312.6114 (2022). This study introduces VAEs, a key technology in synthetic data generation.

Ho, J., Jain, A. & Abbeel, P. in Advances in Neural Information Processing Systems Vol. 33 6840–6851 (Curran Associates, 2020). This study introduced denoising diffusion probabilistic models, a generative model based on iterative noise removal that laid the foundation for the widespread adoption of diffusion models for imaging.

Nichol, A. Q. & Dhariwal, P. Improved denoising diffusion probabilistic models. In Proc. 38th International Conference on Machine Learning 8162–8171 (PMLR, 2021).

Saharia, C. et al. Photorealistic text-to-image diffusion models with deep language understanding. Adv. Neural Inf. Process. Syst. 35, 36479–36494 (2022).


Khader, F. et al. Denoising diffusion probabilistic models for 3D medical image generation. Sci. Rep. 13, 7303 (2023).

Pozzi, M. et al. Generating and evaluating synthetic data in digital pathology through diffusion models. Sci. Rep. 14, 28435 (2024).

Vaswani, A. et al. Attention is all you need. Preprint at https://arxiv.org/abs/1706.03762v7 (2017). This study introduced transformer architecture, a self-attention-based deep-learning framework that has become the foundation of state-of-the-art large language models.

Clusmann, J. et al. The future landscape of large language models in medicine. Commun. Med. 3, 1–8 (2023).

Truhn, D., Reis-Filho, J. S. & Kather, J. N. Large language models should be used as scientific reasoning engines, not knowledge databases. Nat. Med. 29, 2983–2984 (2023).

Kather, J. N., Ghaffari Laleh, N., Foersch, S. & Truhn, D. Medical domain knowledge in domain-agnostic generative AI. npj Digit. Med. 5, 90 (2022).

Smolyak, D., Bjarnadóttir, M. V., Crowley, K. & Agarwal, R. Large language models and synthetic health data: progress and prospects. JAMIA Open 7, ooae114 (2024).

Vallevik, V. B. et al. Can I trust my fake data — a comprehensive quality assessment framework for synthetic tabular data in healthcare. Int. J. Med. Inform. 185, 105413 (2024). This study provides a comprehensive evaluation framework for quality assessment of synthetic tabular healthcare data.

Tucker, A., Wang, Z., Rotalinti, Y. & Myles, P. Generating high-fidelity synthetic patient data for assessing machine learning healthcare software. npj Digit. Med. 3, 1–13 (2020).

Alaa, A. M., van Breugel, B., Saveliev, E. & van der Schaar, M. How faithful is your synthetic data? Sample-level metrics for evaluating and auditing generative models. In Proc. 39th International Conference on Machine Learning 290–306 (PMLR, 2022).

Hittmeir, M., Ekelhart, A. & Mayer, R. On the utility of synthetic data: an empirical evaluation on machine learning tasks. In Proc. 14th International Conference on Availability, Reliability and Security 1–6 (Association for Computing Machinery, 2019); https://doi.org/10.1145/3339252.3339281.

El Emam, K. Seven ways to evaluate the utility of synthetic data. IEEE Security Priv. 18, 56–59 (2020). This study provides a framework for evaluating the usability of synthetic data using study replication, structural similarity, domain expertise and metrics.

Yale, A. et al. Generation and evaluation of privacy preserving synthetic health data. Neurocomputing 416, 244–255 (2020).

Moore, J. & Guichot, Y. D. How to harness the power of health data to improve patient outcomes. World Economic Forum https://www.weforum.org/stories/2024/01/how-to-harness-health-data-to-improve-patient-outcomes-wef24/ (5 January 2024).

US Department of Health and Human Services. Health information privacy. https://www.hhs.gov/hipaa/index.html (2021).

European Union. Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC (General Data Protection Regulation) (text with EEA relevance). EUR-Lex https://eur-lex.europa.eu/eli/reg/2016/679/oj/eng (2016).

El Emam, K., Jonker, E., Arbuckle, L. & Malin, B. A systematic review of re-identification attacks on health data. PLoS ONE 6, e28071 (2011). This review systematically evaluates studies on re-identification attacks for healthcare, which succeed in 34% of cases, highlighting an urgent need for privacy preservation techniques.

Ferguson, A. R., Nielson, J. L., Cragin, M. H., Bandrowski, A. E. & Martone, M. E. Big data from small data: data-sharing in the ‘long tail’ of neuroscience. Nat. Neurosci. 17, 1442–1447 (2014).

Li, L., Fan, Y., Tse, M. & Lin, K.-Y. A review of applications in federated learning. Computers Ind. Eng. 149, 106854 (2020).

Warnat-Herresthal, S. et al. Swarm learning for decentralized and confidential clinical machine learning. Nature 594, 265–270 (2021).

Lyu, L. et al. Privacy and robustness in federated learning: attacks and defenses. IEEE Trans. Neural Netw. Learn. Syst. 35, 8726–8746 (2024).

Graves, J. L., Kearney, M., Barabino, G. & Malcom, S. Inequality in science and the case for a new agenda. Proc. Natl Acad. Sci. USA 119, e2117831119 (2022).

Ejermo, O. & Sofer, Y. When colleges graduate: micro-level effects on publications and scientific organization. Res. Policy 53, 105007 (2024).

Eckardt, J.-N. et al. Synthetic bone marrow images augment real samples in developing acute myeloid leukemia microscopy classification models. npj Digit. Med. 8, 173 (2025).

Levine, A. B. et al. Synthesis of diagnostic quality cancer pathology images by generative adversarial networks. J. Pathol. 252, 178–188 (2020).

Deshpande, S., Dawood, M., Minhas, F. & Rajpoot, N. SynCLay: interactive synthesis of histology images from bespoke cellular layouts. Med. Image Anal. 91, 102995 (2024).

Müller-Franzes, G. et al. A multimodal comparison of latent denoising diffusion probabilistic models and generative adversarial networks for medical image synthesis. Sci. Rep. 13, 12098 (2023).

Osorio, P. et al. Latent diffusion models with image-derived annotations for enhanced AI-assisted cancer diagnosis in histopathology. Diagnostics 14, 1442 (2024).

Niehues, J. M. et al. Using histopathology latent diffusion models as privacy-preserving dataset augmenters improves downstream classification performance. Computers Biol. Med. 175, 108410 (2024).

Kather, J. N. et al. Deep learning can predict microsatellite instability directly from histology in gastrointestinal cancer. Nat. Med. 25, 1054–1056 (2019).

Krause, J. et al. Deep learning detects genetic alterations in cancer histology generated by adversarial networks. J. Pathol. 254, 70–79 (2021).

Dolezal, J. M. et al. Deep learning generates synthetic cancer histology for explainability and education. npj Precis. Oncol. 7, 49 (2023).

Howard, F. M. et al. Generative adversarial networks accurately reconstruct pan-cancer histology from pathologic, genomic, and radiographic latent features. Sci. Adv. 10, eadq0856 (2024).

Carrillo-Perez, F. et al. Synthetic whole-slide image tile generation with gene expression profile-infused deep generative models. Cell Rep. Methods 3, 100534 (2023).

Carrillo-Perez, F. et al. Generation of synthetic whole-slide image tiles of tumours from RNA-sequencing data via cascaded diffusion models. Nat. Biomed. Eng. 9, 320–332 (2025).

Bai, B. et al. Deep learning-enabled virtual histological staining of biological samples. Light Sci. Appl. 12, 57 (2023).

Pati, P. et al. Accelerating histopathology workflows with generative AI-based virtually multiplexed tumour profiling. Nat. Mach. Intell. 6, 1077–1093 (2024).

Koetzier, L. R. et al. Generating synthetic data for medical imaging. Radiology 312, e232471 (2024).

Sizikova, E. et al. Synthetic data in radiological imaging: current state and future outlook. BJR Artif. Intell. 1, ubae007 (2024).


Jung, H. K., Kim, K., Park, J. E. & Kim, N. Image-based generative artificial intelligence in radiology: comprehensive updates. Korean J. Radiol. 25, 959–981 (2024).

D’Amico, S. et al. Synthetic data generation by artificial intelligence to accelerate research and precision medicine in hematology. JCO Clin. Cancer Inf. 7, e2300021 (2023).


Bernard, E. et al. Molecular international prognostic scoring system for myelodysplastic syndromes. NEJM Evid. 1, EVIDoa2200008 (2022).

Kang, H. Y. J. et al. Synthetic tabular data based on generative adversarial networks in health care: generation and validation using the divide-and-conquer strategy. JMIR Med. Inform. 11, e47859 (2023).

Ganguli, R., Lad, R., Lin, A. & Yu, X. Novel generative recurrent neural network framework to produce accurate, applicable, and deidentified synthetic medical data for patients with metastatic cancer. JCO Clin. Cancer Inf. 7, e2200125 (2023).


Díaz-Navarro, A., Zhang, X., Jiao, W., Wang, B. & Stein, L. In silico generation of synthetic cancer genomes using generative AI. Cell Genom. 5, 100969 (2025).

Kim, J. & Seok, J. ctGAN: combined transformation of gene expression and survival data with generative adversarial network. Brief. Bioinform. 25, bbae325 (2024).

Norcliffe, A., Cebere, B., Imrie, F., Lio, P. & van der Schaar, M. SurvivalGAN: generating time-to-event data for survival analysis. Preprint at https://arxiv.org/abs/2302.12749 (2023).

Hogenboom, J. et al. Actionability of synthetic data in a heterogeneous and rare health care demographic: adolescents and young adults with cancer. JCO Clin. Cancer Inf. 8, e2400056 (2024).


Vlooswijk, C. et al. Recruiting adolescent and young adult cancer survivors for patient-reported outcome research: experiences and sample characteristics of the SURVAYA study. Curr. Oncol. 29, 5407–5425 (2022).

Juwara, L., El-Hussuna, A. & El Emam, K. An evaluation of synthetic data augmentation for mitigating covariate bias in health data. Patterns 5, 100946 (2024).

Walonoski, J. et al. Synthea: an approach, method, and software mechanism for generating synthetic patients and the synthetic electronic health care record. J. Am. Med. Inform. Assoc. 25, 230–238 (2018).

National Disease Registration Service. The Simulacrum. NDRS https://digital.nhs.uk/ndrs/data/data-outputs/cancer-publications-and-tools/simulacrum (accessed 28 September 2025).

Netherlands Comprehensive Cancer Organisation. Synthetic dataset. IKNL https://iknl.nl/en/ncr/synthetic-dataset (accessed 28 September 2025).

Wouters, O. J., McKee, M. & Luyten, J. Estimated research and development investment needed to bring a new medicine to market, 2009–2018. JAMA 323, 844–853 (2020).

Mulcahy, A. et al. Use of clinical trial characteristics to estimate costs of new drug development. JAMA Netw. Open 8, e2453275 (2025).

Sertkaya, A., Beleche, T., Jessup, A. & Sommers, B. D. Costs of drug development and research and development intensity in the US, 2000–2018. JAMA Netw. Open 7, e2415445 (2024).

Sun, D., Gao, W., Hu, H. & Zhou, S. Why 90% of clinical drug development fails and how to improve it? Acta Pharm. Sin. B 12, 3049–3062 (2022).

Mullard, A. The high, and redundant, cost of failure in cancer drug development. Nat. Rev. Drug Discov. 22, 688 (2023).

Briel, M. et al. A systematic review of discontinued trials suggested that most reasons for recruitment failure were preventable. J. Clin. Epidemiol. 80, 8–15 (2016).

Briel, M. et al. Exploring reasons for recruitment failure in clinical trials: a qualitative study with clinical trial stakeholders in Switzerland, Germany, and Canada. Trials 22, 844 (2021).

Van Norman, G. A. Drugs and devices: comparison of European and U.S. approval processes. JACC Basic Transl. Sci. 1, 399–412 (2016).

Brown, D. G., Wobst, H. J., Kapoor, A., Kenna, L. A. & Southall, N. T. Clinical development times for innovative drugs. Nat. Rev. Drug Discov. 21, 793–794 (2022).

Stewart, D. J. et al. The importance of greater speed in drug development for advanced malignancies. Cancer Med. 7, 1824–1836 (2018).

Sengupta, S. et al. Emulating randomized controlled trials with hybrid control arms in oncology: a case study. Clin. Pharmacol. Ther. 113, 867–877 (2023).

Tan, W. K. et al. Augmenting control arms with real-world data for cancer trials: hybrid control arm methods and considerations. Contemp. Clin. Trials Commun. 30, 101000 (2022).

Ventz, S. et al. The design and evaluation of hybrid controlled trials that leverage external data and randomization. Nat. Commun. 13, 5783 (2022).

Ghadessi, M. et al. A roadmap to using historical controls in clinical trials — by Drug Information Association Adaptive Design Scientific Working Group (DIA-ADSWG). Orphanet J. Rare Dis. 15, 69 (2020).

Hall, K. T. et al. Historical controls in randomized clinical trials: opportunities and challenges. Clin. Pharmacol. Ther. 109, 343–351 (2021).

Marion, J. D. & Althouse, A. D. The use of historical controls in clinical trials. JAMA 330, 1484–1485 (2023).

Fountzilas, E., Tsimberidou, A. M., Vo, H. H. & Kurzrock, R. Clinical trial design in the era of precision medicine. Genome Med. 14, 101 (2022).

Duan, X.-P. et al. New clinical trial design in precision medicine: discovery, development and direction. Signal Transduct. Target. Ther. 9, 1–29 (2024).


Lee, H. Y., Ha, H., Kang, J. H. & Park, H.-S. Precision oncology clinical trials: a systematic review of phase II clinical trials with biomarker-driven, adaptive design. J. Clin. Oncol. 42, e23005 (2024).

Gatta, G. et al. Rare cancers are not so rare: the rare cancer burden in Europe. Eur. J. Cancer 47, 2493–2511 (2011).

Matsuda, T. et al. Rare cancers are not rare in Asia as well: the rare cancer burden in East Asia. Cancer Epidemiol. 67, 101702 (2020).

Eckardt, J.-N. et al. Mimicking clinical trials with synthetic acute myeloid leukemia patients using generative artificial intelligence. npj Digit. Med. 7, 1–11 (2024).

Piciocchi, A. et al. Unlocking the potential of synthetic patients for accelerating clinical trials: results of the first GIMEMA experience on acute myeloid leukemia patients. eJHaem 5, 353–359 (2024).

Nowok, B., Raab, G. M. & Dibben, C. synthpop: bespoke creation of synthetic data in R. J. Stat. Softw. 74, 1–26 (2016).

Venditti, A. et al. GIMEMA AML1310 trial of risk-adapted, MRD-directed therapy for young adults with newly diagnosed acute myeloid leukemia. Blood 134, 935–945 (2019).

Azizi, Z., Zheng, C., Mosquera, L., Pilote, L. & El Emam, K. Can synthetic data be a proxy for real clinical trial data? A validation study. BMJ Open 11, e043497 (2021).

Dahdaleh, F. S. et al. Obstruction predicts worse long-term outcomes in stage III colon cancer: a secondary analysis of the N0147 trial. Surgery 164, 1223–1229 (2018).

Elvatun, S., Knoors, D., Brant, S., Jonasson, C. & Nygård, J. F. Synthetic data as external control arms in scarce single-arm clinical trials. PLoS Digital Health 4, e0000581 (2025).

Zhang, J., Cormode, G., Procopiuc, C. M., Srivastava, D. & Xiao, X. PrivBayes: private data release via Bayesian networks. ACM Trans. Database Syst. 42, 25:1–25:41 (2017).

Xu, L., Skoularidou, M., Cuesta-Infante, A. & Veeramachaneni, K. Modeling tabular data using conditional GAN. In Advances in Neural Information Processing Systems Vol. 32 https://papers.nips.cc/paper_files/paper/2019/file/254ed7d2de3b23ab10936522dd547b78-Paper.pdf (Curran Associates, 2019).

Akiya, I., Ishihara, T. & Yamamoto, K. Comparison of synthetic data generation techniques for control group survival data in oncology clinical trials: simulation study. JMIR Med. Inform. 12, e55118 (2024).

El-Kababji, S. et al. Augmenting insufficiently accruing oncology clinical trials using generative models: validation study. J. Med. Internet Res. 27, e66821 (2025).

Beigi, M., Shafquat, A., Mezey, J. & Aptekar, J. Simulants: synthetic clinical trial data via subject-level privacy-preserving synthesis. AMIA Annu. Symp. Proc. 2022, 231–240 (2023).

Giuffrè, M. & Shung, D. L. Harnessing the power of synthetic data in healthcare: innovation, application, and privacy. npj Digit. Med. 6, 1–8 (2023).

Chen, R. J., Lu, M. Y., Chen, T. Y., Williamson, D. F. K. & Mahmood, F. Synthetic data in machine learning for medicine and healthcare. Nat. Biomed. Eng. 5, 493–497 (2021).

Draghi, B., Wang, Z., Myles, P. & Tucker, A. BayesBoost: identifying and handling bias using synthetic data generators. In Proceedings of the Third International Workshop on Learning with Imbalanced Domains: Theory and Applications 49–62 (PMLR, 2021).

Shahul Hameed, M. A., Qureshi, A. M. & Kaushik, A. Bias mitigation via synthetic data generation: a review. Electronics 13, 3909 (2024).

Baumann, J., Castelnovo, A., Crupi, R., Inverardi, N. & Regoli, D. Bias on demand: a modelling framework that generates synthetic data with bias. In Proc. 2023 ACM Conference on Fairness, Accountability, and Transparency 1002–1013 (Association for Computing Machinery, 2023); https://doi.org/10.1145/3593013.3594058.

Ge, L., Li, H., Wang, X. & Wang, Z. A review of secure federated learning: privacy leakage threats, protection technologies, challenges and future directions. Neurocomputing 561, 126897 (2023).

Hu, H. et al. Membership inference attacks on machine learning: a survey. ACM Comput. Surv. 54, 235:1–235:37 (2022).

Yang, W. et al. Deep learning model inversion attacks and defenses: a comprehensive survey. Artif. Intell. Rev. 58, 242 (2025).

Farah, E., Kenney, M., Warkentin, M. T., Cheung, W. Y. & Brenner, D. R. Examining external control arms in oncology: a scoping review of applications to date. Cancer Med. 13, e7447 (2024).

Serrano, C. et al. Rethinking placebos: embracing synthetic control arms in clinical trials for rare tumors. Nat. Med. 29, 2689–2692 (2023).

Davies, J. et al. Comparative effectiveness from a single-arm trial and real-world data: alectinib versus ceritinib. J. Comp. Effective. Res. 7, 855–865 (2018).

Jaksa, A. et al. A comparison of 7 oncology external control arm case studies: critiques from regulatory and health technology assessment agencies. Value Health 25, 1967–1976 (2022).

Arondekar, B. et al. Real-world evidence in support of oncology product registration: a systematic review of new drug application and biologics license application approvals from 2015–2020. Clin. Cancer Res. 28, 27–35 (2022).

Food and Drug Administration. Good Machine Learning Practice for Medical Device Development: Guiding Principles. FDA https://www.fda.gov/medical-devices/software-medical-device-samd/good-machine-learning-practice-medical-device-development-guiding-principles (2025). The FDA recently introduced Good Machine Learning Practices to guide the development, evaluation and implementation of medical software for clinical use based on machine learning.

Hernandez, M. et al. Comprehensive evaluation framework for synthetic tabular data in health: fidelity, utility and privacy analysis of generative models with and without privacy guarantees. Front. Digit. Health 7, 1576290 (2025).

Shahbazian, R. & Greco, S. Generative adversarial networks assist missing data imputation: a comprehensive survey and evaluation. IEEE Access 11, 88908–88928 (2023).

Jarrett, D., Cebere, B. C., Liu, T., Curth, A. & van der Schaar, M. HyperImpute: generalized iterative imputation with automatic model selection. In Proc. 39th International Conference on Machine Learning 9916–9937 (PMLR, 2022).

Vero, M., Balunovic, M. & Vechev, M. CuTS: customizable tabular synthetic data generation. In Proc. 41st International Conference on Machine Learning 49408–49433 (PMLR, 2024).

van Breugel, B., Qian, Z. & van der Schaar, M. Synthetic data, real errors: how (not) to publish and use synthetic data. In Proc. 40th International Conference on Machine Learning 34793–34808 (PMLR, 2023).

Probst, P., Boulesteix, A.-L. & Bischl, B. Tunability: importance of hyperparameters of machine learning algorithms. J. Mach. Learn. Res. 20, 1–32 (2019).


Wang, H., Sudalairaj, S., Henning, J., Greenewald, K. & Srivastava, A. Post-processing private synthetic data for improving utility on selected measures. Adv. Neural Inf. Process. Syst. 36, 64139–64154 (2023).


Pilgram, L. et al. A consensus privacy metrics framework for synthetic data. Patterns https://doi.org/10.1016/j.patter.2025.101320 (2025).

Dwork, C. in Automata, Languages and Programming (eds Bugliesi, M. et al.) 1–12 (Springer, 2006); https://doi.org/10.1007/11787006_1. This study introduces the concept of differential privacy, demonstrating that adding noise to synthetic health data may mitigate re-identification attempts.

Boulemtafes, A., Derhab, A. & Challal, Y. A review of privacy-preserving techniques for deep learning. Neurocomputing 384, 21–45 (2020).

Gibney, E. Could machine learning fuel a reproducibility crisis in science? Nature 608, 250–251 (2022).
