Nguyen, T. T. et al. Deep learning for deepfakes creation and detection: a survey. Comp. Vis. Image Underst. 223, 103525 (2022).
Fallis, D. The epistemic threat of deepfakes. Philos. Technol. 34, 623–643 (2021).
Raghunathan, T. E. Synthetic data. Annu. Rev. Stat. Appl. 8, 129–140 (2021).
Figueira, A. & Vaz, B. Survey on synthetic data generation, evaluation methods and GANs. Mathematics 10, 2733 (2022).
de Melo, C. M. et al. Next-generation deep learning based on simulators and synthetic data. Trends Cognit. Sci. 26, 174–187 (2022).
Laubenbacher, R., Mehrad, B., Shmulevich, I. & Trayanova, N. Digital twins in medicine. Nat. Comput. Sci. 4, 184–191 (2024).
Rubin, D. B. Multiple Imputation for Nonresponse in Surveys (John Wiley & Sons, 2004).
Goodfellow, I. J. et al. Generative adversarial networks. Preprint at https://arxiv.org/abs/1406.2661 (2014). This study introduces the concept of GANs, on which all subsequent GAN-based generative models build.
Kingma, D. P. & Welling, M. Auto-encoding variational Bayes. Preprint at https://arxiv.org/abs/1312.6114 (2022). This study introduces VAEs, a key technology in synthetic data generation.
Ho, J., Jain, A. & Abbeel, P. in Advances in Neural Information Processing Systems Vol. 33 6840–6851 (Curran Associates, 2020). This study introduced denoising diffusion probabilistic models, a class of generative models based on iterative noise removal that laid the foundation for the widespread adoption of diffusion models for imaging.
Nichol, A. Q. & Dhariwal, P. Improved denoising diffusion probabilistic models. In Proc. 38th International Conference on Machine Learning 8162–8171 (PMLR, 2021).
Saharia, C. et al. Photorealistic text-to-image diffusion models with deep language understanding. Adv. Neural Inf. Process. Syst. 35, 36479–36494 (2022).
Khader, F. et al. Denoising diffusion probabilistic models for 3D medical image generation. Sci. Rep. 13, 7303 (2023).
Pozzi, M. et al. Generating and evaluating synthetic data in digital pathology through diffusion models. Sci. Rep. 14, 28435 (2024).
Vaswani, A. et al. Attention is all you need. Preprint at https://arxiv.org/abs/1706.03762v7 (2017). This study introduced the transformer architecture, a self-attention-based deep-learning framework that has become the foundation of state-of-the-art large language models.
Clusmann, J. et al. The future landscape of large language models in medicine. Commun. Med. 3, 1–8 (2023).
Truhn, D., Reis-Filho, J. S. & Kather, J. N. Large language models should be used as scientific reasoning engines, not knowledge databases. Nat. Med. 29, 2983–2984 (2023).
Kather, J. N., Ghaffari Laleh, N., Foersch, S. & Truhn, D. Medical domain knowledge in domain-agnostic generative AI. npj Digit. Med. 5, 90 (2022).
Smolyak, D., Bjarnadóttir, M. V., Crowley, K. & Agarwal, R. Large language models and synthetic health data: progress and prospects. JAMIA Open 7, ooae114 (2024).
Vallevik, V. B. et al. Can I trust my fake data — a comprehensive quality assessment framework for synthetic tabular data in healthcare. Int. J. Med. Inform. 185, 105413 (2024). This study provides a comprehensive evaluation framework for quality assessment of synthetic tabular healthcare data.
Tucker, A., Wang, Z., Rotalinti, Y. & Myles, P. Generating high-fidelity synthetic patient data for assessing machine learning healthcare software. npj Digit. Med. 3, 1–13 (2020).
Alaa, A. M., van Breugel, B., Saveliev, E. & van der Schaar, M. How faithful is your synthetic data? Sample-level metrics for evaluating and auditing generative models. In Proc. 39th International Conference on Machine Learning 290–306 (PMLR, 2022).
Hittmeir, M., Ekelhart, A. & Mayer, R. On the utility of synthetic data: an empirical evaluation on machine learning tasks. In Proc. 14th International Conference on Availability, Reliability and Security 1–6 (Association for Computing Machinery, 2019); https://doi.org/10.1145/3339252.3339281.
El Emam, K. Seven ways to evaluate the utility of synthetic data. IEEE Security Priv. 18, 56–59 (2020). This study provides a framework for evaluating the usability of synthetic data using study replication, structural similarity, domain expertise and metrics.
Yale, A. et al. Generation and evaluation of privacy preserving synthetic health data. Neurocomputing 416, 244–255 (2020).
Moore, J. & Guichot, Y. D. How to harness the power of health data to improve patient outcomes. World Economic Forum https://www.weforum.org/stories/2024/01/how-to-harness-health-data-to-improve-patient-outcomes-wef24/ (5 January 2024).
US Department of Health and Human Services. Health information privacy. https://www.hhs.gov/hipaa/index.html (2021).
European Union. Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC (General Data Protection Regulation) (text with EEA relevance). EUR-Lex https://eur-lex.europa.eu/eli/reg/2016/679/oj/eng (2016).
El Emam, K., Jonker, E., Arbuckle, L. & Malin, B. A systematic review of re-identification attacks on health data. PLoS ONE 6, e28071 (2011). This review systematically evaluates studies on re-identification attacks for healthcare, which succeed in 34% of cases, highlighting an urgent need for privacy preservation techniques.
Ferguson, A. R., Nielson, J. L., Cragin, M. H., Bandrowski, A. E. & Martone, M. E. Big data from small data: data-sharing in the ‘long tail’ of neuroscience. Nat. Neurosci. 17, 1442–1447 (2014).
Li, L., Fan, Y., Tse, M. & Lin, K.-Y. A review of applications in federated learning. Computers Ind. Eng. 149, 106854 (2020).
Warnat-Herresthal, S. et al. Swarm learning for decentralized and confidential clinical machine learning. Nature 594, 265–270 (2021).
Lyu, L. et al. Privacy and robustness in federated learning: attacks and defenses. IEEE Trans. Neural Netw. Learn. Syst. 35, 8726–8746 (2024).
Graves, J. L., Kearney, M., Barabino, G. & Malcom, S. Inequality in science and the case for a new agenda. Proc. Natl Acad. Sci. USA 119, e2117831119 (2022).
Ejermo, O. & Sofer, Y. When colleges graduate: micro-level effects on publications and scientific organization. Res. Policy 53, 105007 (2024).
Eckardt, J.-N. et al. Synthetic bone marrow images augment real samples in developing acute myeloid leukemia microscopy classification models. npj Digit. Med. 8, 173 (2025).
Levine, A. B. et al. Synthesis of diagnostic quality cancer pathology images by generative adversarial networks. J. Pathol. 252, 178–188 (2020).
Deshpande, S., Dawood, M., Minhas, F. & Rajpoot, N. SynCLay: interactive synthesis of histology images from bespoke cellular layouts. Med. Image Anal. 91, 102995 (2024).
Müller-Franzes, G. et al. A multimodal comparison of latent denoising diffusion probabilistic models and generative adversarial networks for medical image synthesis. Sci. Rep. 13, 12098 (2023).
Osorio, P. et al. Latent diffusion models with image-derived annotations for enhanced AI-assisted cancer diagnosis in histopathology. Diagnostics 14, 1442 (2024).
Niehues, J. M. et al. Using histopathology latent diffusion models as privacy-preserving dataset augmenters improves downstream classification performance. Computers Biol. Med. 175, 108410 (2024).
Kather, J. N. et al. Deep learning can predict microsatellite instability directly from histology in gastrointestinal cancer. Nat. Med. 25, 1054–1056 (2019).
Krause, J. et al. Deep learning detects genetic alterations in cancer histology generated by adversarial networks. J. Pathol. 254, 70–79 (2021).
Dolezal, J. M. et al. Deep learning generates synthetic cancer histology for explainability and education. npj Precis. Oncol. 7, 49 (2023).
Howard, F. M. et al. Generative adversarial networks accurately reconstruct pan-cancer histology from pathologic, genomic, and radiographic latent features. Sci. Adv. 10, eadq0856 (2024).
Carrillo-Perez, F. et al. Synthetic whole-slide image tile generation with gene expression profile-infused deep generative models. Cell Rep. Methods 3, 100534 (2023).
Carrillo-Perez, F. et al. Generation of synthetic whole-slide image tiles of tumours from RNA-sequencing data via cascaded diffusion models. Nat. Biomed. Eng. 9, 320–332 (2025).
Bai, B. et al. Deep learning-enabled virtual histological staining of biological samples. Light Sci. Appl. 12, 57 (2023).
Pati, P. et al. Accelerating histopathology workflows with generative AI-based virtually multiplexed tumour profiling. Nat. Mach. Intell. 6, 1077–1093 (2024).
Koetzier, L. R. et al. Generating synthetic data for medical imaging. Radiology 312, e232471 (2024).
Sizikova, E. et al. Synthetic data in radiological imaging: current state and future outlook. BJR Artif. Intell. 1, ubae007 (2024).
Jung, H. K., Kim, K., Park, J. E. & Kim, N. Image-based generative artificial intelligence in radiology: comprehensive updates. Korean J. Radiol. 25, 959–981 (2024).
D’Amico, S. et al. Synthetic data generation by artificial intelligence to accelerate research and precision medicine in hematology. JCO Clin. Cancer Inf. 7, e2300021 (2023).
Bernard, E. et al. Molecular international prognostic scoring system for myelodysplastic syndromes. NEJM Evid. 1, EVIDoa2200008 (2022).
Kang, H. Y. J. et al. Synthetic tabular data based on generative adversarial networks in health care: generation and validation using the divide-and-conquer strategy. JMIR Med. Inform. 11, e47859 (2023).
Ganguli, R., Lad, R., Lin, A. & Yu, X. Novel generative recurrent neural network framework to produce accurate, applicable, and deidentified synthetic medical data for patients with metastatic cancer. JCO Clin. Cancer Inf. 7, e2200125 (2023).
Díaz-Navarro, A., Zhang, X., Jiao, W., Wang, B. & Stein, L. In silico generation of synthetic cancer genomes using generative AI. Cell Genom. 5, 100969 (2025).
Kim, J. & Seok, J. ctGAN: combined transformation of gene expression and survival data with generative adversarial network. Brief. Bioinform. 25, bbae325 (2024).
Norcliffe, A., Cebere, B., Imrie, F., Lio, P. & van der Schaar, M. SurvivalGAN: generating time-to-event data for survival analysis. Preprint at https://arxiv.org/abs/2302.12749 (2023).
Hogenboom, J. et al. Actionability of synthetic data in a heterogeneous and rare health care demographic: adolescents and young adults with cancer. JCO Clin. Cancer Inf. 8, e2400056 (2024).
Vlooswijk, C. et al. Recruiting adolescent and young adult cancer survivors for patient-reported outcome research: experiences and sample characteristics of the SURVAYA study. Curr. Oncol. 29, 5407–5425 (2022).
Juwara, L., El-Hussuna, A. & El Emam, K. An evaluation of synthetic data augmentation for mitigating covariate bias in health data. Patterns 5, 100946 (2024).
Walonoski, J. et al. Synthea: an approach, method, and software mechanism for generating synthetic patients and the synthetic electronic health care record. J. Am. Med. Inform. Assoc. 25, 230–238 (2018).
National Disease Registration Service. The Simulacrum. NDRS https://digital.nhs.uk/ndrs/data/data-outputs/cancer-publications-and-tools/simulacrum (accessed 28 September 2025).
Netherlands Comprehensive Cancer Organisation. Synthetic dataset. IKNL https://iknl.nl/en/ncr/synthetic-dataset (accessed 28 September 2025).
Wouters, O. J., McKee, M. & Luyten, J. Estimated research and development investment needed to bring a new medicine to market, 2009–2018. JAMA 323, 844–853 (2020).
Mulcahy, A. et al. Use of clinical trial characteristics to estimate costs of new drug development. JAMA Netw. Open 8, e2453275 (2025).
Sertkaya, A., Beleche, T., Jessup, A. & Sommers, B. D. Costs of drug development and research and development intensity in the US, 2000–2018. JAMA Netw. Open 7, e2415445 (2024).
Sun, D., Gao, W., Hu, H. & Zhou, S. Why 90% of clinical drug development fails and how to improve it? Acta Pharm. Sin. B 12, 3049–3062 (2022).
Mullard, A. The high, and redundant, cost of failure in cancer drug development. Nat. Rev. Drug Discov. 22, 688 (2023).
Briel, M. et al. A systematic review of discontinued trials suggested that most reasons for recruitment failure were preventable. J. Clin. Epidemiol. 80, 8–15 (2016).
Briel, M. et al. Exploring reasons for recruitment failure in clinical trials: a qualitative study with clinical trial stakeholders in Switzerland, Germany, and Canada. Trials 22, 844 (2021).
Van Norman, G. A. Drugs and devices: comparison of European and U.S. approval processes. JACC Basic Transl. Sci. 1, 399–412 (2016).
Brown, D. G., Wobst, H. J., Kapoor, A., Kenna, L. A. & Southall, N. T. Clinical development times for innovative drugs. Nat. Rev. Drug Discov. 21, 793–794 (2022).
Stewart, D. J. et al. The importance of greater speed in drug development for advanced malignancies. Cancer Med. 7, 1824–1836 (2018).
Sengupta, S. et al. Emulating randomized controlled trials with hybrid control arms in oncology: a case study. Clin. Pharmacol. Ther. 113, 867–877 (2023).
Tan, W. K. et al. Augmenting control arms with real-world data for cancer trials: hybrid control arm methods and considerations. Contemp. Clin. Trials Commun. 30, 101000 (2022).
Ventz, S. et al. The design and evaluation of hybrid controlled trials that leverage external data and randomization. Nat. Commun. 13, 5783 (2022).
Ghadessi, M. et al. A roadmap to using historical controls in clinical trials — by Drug Information Association Adaptive Design Scientific Working Group (DIA-ADSWG). Orphanet J. Rare Dis. 15, 69 (2020).
Hall, K. T. et al. Historical controls in randomized clinical trials: opportunities and challenges. Clin. Pharmacol. Ther. 109, 343–351 (2021).
Marion, J. D. & Althouse, A. D. The use of historical controls in clinical trials. JAMA 330, 1484–1485 (2023).
Fountzilas, E., Tsimberidou, A. M., Vo, H. H. & Kurzrock, R. Clinical trial design in the era of precision medicine. Genome Med. 14, 101 (2022).
Duan, X.-P. et al. New clinical trial design in precision medicine: discovery, development and direction. Signal Transduct. Target. Ther. 9, 1–29 (2024).
Lee, H. Y., Ha, H., Kang, J. H. & Park, H.-S. Precision oncology clinical trials: a systematic review of phase II clinical trials with biomarker-driven, adaptive design. J. Clin. Oncol. 42, e23005 (2024).
Gatta, G. et al. Rare cancers are not so rare: the rare cancer burden in Europe. Eur. J. Cancer 47, 2493–2511 (2011).
Matsuda, T. et al. Rare cancers are not rare in Asia as well: the rare cancer burden in East Asia. Cancer Epidemiol. 67, 101702 (2020).
Eckardt, J.-N. et al. Mimicking clinical trials with synthetic acute myeloid leukemia patients using generative artificial intelligence. npj Digit. Med. 7, 1–11 (2024).
Piciocchi, A. et al. Unlocking the potential of synthetic patients for accelerating clinical trials: results of the first GIMEMA experience on acute myeloid leukemia patients. eJHaem 5, 353–359 (2024).
Nowok, B., Raab, G. M. & Dibben, C. synthpop: bespoke creation of synthetic data in R. J. Stat. Softw. 74, 1–26 (2016).
Venditti, A. et al. GIMEMA AML1310 trial of risk-adapted, MRD-directed therapy for young adults with newly diagnosed acute myeloid leukemia. Blood 134, 935–945 (2019).
Azizi, Z., Zheng, C., Mosquera, L., Pilote, L. & El Emam, K. Can synthetic data be a proxy for real clinical trial data? A validation study. BMJ Open 11, e043497 (2021).
Dahdaleh, F. S. et al. Obstruction predicts worse long-term outcomes in stage III colon cancer: a secondary analysis of the N0147 trial. Surgery 164, 1223–1229 (2018).
Elvatun, S., Knoors, D., Brant, S., Jonasson, C. & Nygård, J. F. Synthetic data as external control arms in scarce single-arm clinical trials. PLoS Digital Health 4, e0000581 (2025).
Zhang, J., Cormode, G., Procopiuc, C. M., Srivastava, D. & Xiao, X. PrivBayes: private data release via Bayesian networks. ACM Trans. Database Syst. 42, 25:1–25:41 (2017).
Xu, L., Skoularidou, M., Cuesta-Infante, A. & Veeramachaneni, K. Modeling tabular data using conditional GAN. In Advances in Neural Information Processing Systems Vol. 32 https://papers.nips.cc/paper_files/paper/2019/file/254ed7d2de3b23ab10936522dd547b78-Paper.pdf (Curran Associates, 2019).
Akiya, I., Ishihara, T. & Yamamoto, K. Comparison of synthetic data generation techniques for control group survival data in oncology clinical trials: simulation study. JMIR Med. Inform. 12, e55118 (2024).
El-Kababji, S. et al. Augmenting insufficiently accruing oncology clinical trials using generative models: validation study. J. Med. Internet Res. 27, e66821 (2025).
Beigi, M., Shafquat, A., Mezey, J. & Aptekar, J. Simulants: synthetic clinical trial data via subject-level privacy-preserving synthesis. AMIA Annu. Symp. Proc. 2022, 231–240 (2023).
Giuffrè, M. & Shung, D. L. Harnessing the power of synthetic data in healthcare: innovation, application, and privacy. npj Digit. Med. 6, 1–8 (2023).
Chen, R. J., Lu, M. Y., Chen, T. Y., Williamson, D. F. K. & Mahmood, F. Synthetic data in machine learning for medicine and healthcare. Nat. Biomed. Eng. 5, 493–497 (2021).
Draghi, B., Wang, Z., Myles, P. & Tucker, A. BayesBoost: identifying and handling bias using synthetic data generators. In Proceedings of the Third International Workshop on Learning with Imbalanced Domains: Theory and Applications 49–62 (PMLR, 2021).
Shahul Hameed, M. A., Qureshi, A. M. & Kaushik, A. Bias mitigation via synthetic data generation: a review. Electronics 13, 3909 (2024).
Baumann, J., Castelnovo, A., Crupi, R., Inverardi, N. & Regoli, D. Bias on demand: a modelling framework that generates synthetic data with bias. In Proc. 2023 ACM Conference on Fairness, Accountability, and Transparency 1002–1013 (Association for Computing Machinery, 2023); https://doi.org/10.1145/3593013.3594058.
Ge, L., Li, H., Wang, X. & Wang, Z. A review of secure federated learning: privacy leakage threats, protection technologies, challenges and future directions. Neurocomputing 561, 126897 (2023).
Hu, H. et al. Membership inference attacks on machine learning: a survey. ACM Comput. Surv. 54, 235:1–235:37 (2022).
Yang, W. et al. Deep learning model inversion attacks and defenses: a comprehensive survey. Artif. Intell. Rev. 58, 242 (2025).
Farah, E., Kenney, M., Warkentin, M. T., Cheung, W. Y. & Brenner, D. R. Examining external control arms in oncology: a scoping review of applications to date. Cancer Med. 13, e7447 (2024).
Serrano, C. et al. Rethinking placebos: embracing synthetic control arms in clinical trials for rare tumors. Nat. Med. 29, 2689–2692 (2023).
Davies, J. et al. Comparative effectiveness from a single-arm trial and real-world data: alectinib versus ceritinib. J. Comp. Effective. Res. 7, 855–865 (2018).
Jaksa, A. et al. A comparison of 7 oncology external control arm case studies: critiques from regulatory and health technology assessment agencies. Value Health 25, 1967–1976 (2022).
Arondekar, B. et al. Real-world evidence in support of oncology product registration: a systematic review of new drug application and biologics license application approvals from 2015–2020. Clin. Cancer Res. 28, 27–35 (2022).
Food and Drug Administration. Good Machine Learning Practice for Medical Device Development: Guiding Principles. FDA https://www.fda.gov/medical-devices/software-medical-device-samd/good-machine-learning-practice-medical-device-development-guiding-principles (2025). The FDA introduced Good Machine Learning Practice guiding principles to steer the development, evaluation and implementation of machine-learning-based medical software for clinical use.
Hernandez, M. et al. Comprehensive evaluation framework for synthetic tabular data in health: fidelity, utility and privacy analysis of generative models with and without privacy guarantees. Front. Digit. Health 7, 1576290 (2025).
Shahbazian, R. & Greco, S. Generative adversarial networks assist missing data imputation: a comprehensive survey and evaluation. IEEE Access 11, 88908–88928 (2023).
Jarrett, D., Cebere, B. C., Liu, T., Curth, A. & van der Schaar, M. HyperImpute: generalized iterative imputation with automatic model selection. In Proc. 39th International Conference on Machine Learning 9916–9937 (PMLR, 2022).
Vero, M., Balunovic, M. & Vechev, M. CuTS: customizable tabular synthetic data generation. In Proc. 41st International Conference on Machine Learning 49408–49433 (PMLR, 2024).
van Breugel, B., Qian, Z. & van der Schaar, M. Synthetic data, real errors: how (not) to publish and use synthetic data. In Proc. 40th International Conference on Machine Learning 34793–34808 (PMLR, 2023).
Probst, P., Boulesteix, A.-L. & Bischl, B. Tunability: importance of hyperparameters of machine learning algorithms. J. Mach. Learn. Res. 20, 1–32 (2019).
Wang, H., Sudalairaj, S., Henning, J., Greenewald, K. & Srivastava, A. Post-processing private synthetic data for improving utility on selected measures. Adv. Neural Inf. Process. Syst. 36, 64139–64154 (2023).
Pilgram, L. et al. A consensus privacy metrics framework for synthetic data. Patterns https://doi.org/10.1016/j.patter.2025.101320 (2025).
Dwork, C. in Automata, Languages and Programming (eds Bugliesi, M. et al.) 1–12 (Springer, 2006); https://doi.org/10.1007/11787006_1. This study introduces the concept of differential privacy, demonstrating that adding noise to synthetic health data may mitigate re-identification attempts.
Boulemtafes, A., Derhab, A. & Challal, Y. A review of privacy-preserving techniques for deep learning. Neurocomputing 384, 21–45 (2020).
Gibney, E. Could machine learning fuel a reproducibility crisis in science? Nature 608, 250–251 (2022).