Artificial intelligence-generated synthetic data for cancer research and clinical trials

Nguyen, T. T. et al. Deep learning for deepfakes creation and detection: a survey. Comp. Vis. Image Underst. 223, 103525 (2022).

Article

Google Scholar

Fallis, D. The epistemic threat of deepfakes. Philos. Technol. 34, 623–643 (2021).

Article
PubMed

Google Scholar

Raghunathan, T. E. Synthetic data. Annu. Rev. Stat. Appl. 8, 129–140 (2021).

Article

Google Scholar

Figueira, A. & Vaz, B. Survey on synthetic data generation, evaluation methods and GANs. Mathematics 10, 2733 (2022).

Article

Google Scholar

de MeloA, C. M. et al. Next-generation deep learning based on simulators and synthetic data. Trends Cognit. Sci. 26, 174–187 (2022).

Article

Google Scholar

Laubenbacher, R., Mehrad, B., Shmulevich, I. & Trayanova, N. Digital twins in medicine. Nat. Comput. Sci. 4, 184–191 (2024).

Article
CAS
PubMed
PubMed Central

Google Scholar

Rubin, D. B. Multiple Imputation for Nonresponse in Surveys (John Wiley & Sons, 2004).

Goodfellow, I. J. et al. Generative adversarial network. Preprint at https://arxiv.org/abs/1406.2661 (2014). This study introduces the concept of GANs, upon which all further iterations of GAN-based generative models are based.

Kingma, D. P. & Welling, M. Auto-encoding variational Bayes. Preprint at https://arxiv.org/abs/1312.6114 (2022). This study introduces VAEs, a key technology in synthetic data generation.

Ho, J., Jain, A. & Abbeel, P. in Advances in Neural Information Processing Systems Vol. 33 6840–6851 (Curran Associates, 2020). This study introduced denoising diffusion probabilistic models, a generative model based on iterative noise removal that laid the foundation for the widespread adoption of diffusion models for imaging.

Nichol, A. Q. & Dhariwal, P. Improved denoising diffusion probabilistic models. In Proc. 38th International Conference on Machine Learning 8162–8171 (PMLR, 2021).

Saharia, C. et al. Photorealistic text-to-image diffusion models with deep language understanding. Adv. Neural Inf. Process. Syst. 35, 36479–36494 (2022).

Google Scholar

Khader, F. et al. Denoising diffusion probabilistic models for 3D medical image generation. Sci. Rep. 13, 7303 (2023).

Article
CAS
PubMed
PubMed Central

Google Scholar

Pozzi, M. et al. Generating and evaluating synthetic data in digital pathology through diffusion models. Sci. Rep. 14, 28435 (2024).

Article
CAS
PubMed
PubMed Central

Google Scholar

Vaswani, A. et al. Attention is all you need. Preprint at https://arxiv.org/abs/1706.03762v7 (2017). This study introduced transformer architecture, a self-attention-based deep-learning framework that has become the foundation of state-of-the-art large language models.

Clusmann, J. et al. The future landscape of large language models in medicine. Commun. Med. 3, 1–8 (2023).

Article

Google Scholar

Truhn, D., Reis-Filho, J. S. & Kather, J. N. Large language models should be used as scientific reasoning engines, not knowledge databases. Nat. Med. 29, 2983–2984 (2023).

Article
CAS
PubMed

Google Scholar

Kather, J. N., Ghaffari Laleh, N., Foersch, S. & Truhn, D. Medical domain knowledge in domain-agnostic generative AI. npj Digit. Med. 5, 90 (2022).

Article
PubMed
PubMed Central

Google Scholar

Smolyak, D., Bjarnadóttir, M. V., Crowley, K. & Agarwal, R. Large language models and synthetic health data: progress and prospects. JAMIA Open 7, ooae114 (2024).

Article
PubMed
PubMed Central

Google Scholar

Vallevik, V. B. et al. Can I trust my fake data — a comprehensive quality assessment framework for synthetic tabular data in healthcare. Int. J. Med. Inform. 185, 105413 (2024). This study provides a comprehensive evaluation framework for quality assessment of synthetic tabular healthcare data.

Article
PubMed

Google Scholar

Tucker, A., Wang, Z., Rotalinti, Y. & Myles, P. Generating high-fidelity synthetic patient data for assessing machine learning healthcare software. npj Digit. Med. 3, 1–13 (2020).

Article

Google Scholar

Alaa, A. M., van Breugel, B., Saveliev, E. & van der Schaar, M. How faithful is your synthetic data? Sample-level metrics for evaluating and auditing generative models. In Proc. 39th International Conference on Machine Learning 290–306 (PMLR, 2022).

Hittmeir, M., Ekelhart, A. & Mayer, R. On the utility of synthetic data: an empirical evaluation on machine learning tasks. In Proc. 14th International Conference on Availability, Reliability and Security 1–6 (Association for Computing Machinery, 2019); https://doi.org/10.1145/3339252.3339281.

El Emam, K. Seven ways to evaluate the utility of synthetic data. IEEE Security Priv. 18, 56–59 (2020). This study provides a framework for evaluating the usability of synthetic data using study replication, structural similarity, domain expertise and metrics.

Article

Google Scholar

Yale, A. et al. Generation and evaluation of privacy preserving synthetic health data. Neurocomputing 416, 244–255 (2020).

Article

Google Scholar

Moore, J. & Guichot, Y. D. How to harness the power of health data to improve patient outcomes. World Economic Forum https://www.weforum.org/stories/2024/01/how-to-harness-health-data-to-improve-patient-outcomes-wef24/ (5 January 2024).

US Department of Health and Human Services. Health information privacy. https://www.hhs.gov/hipaa/index.html (2021).

European Union. Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC (General Data Protection Regulation) (text with EEA relevance). EUR-Lex https://eur-lex.europa.eu/eli/reg/2016/679/oj/eng (2016).

Emam, K. E., Jonker, E., Arbuckle, L. & Malin, B. A systematic review of re-identification attacks on health data. PLoS ONE 6, e28071 (2011). This review systematically evaluates studies on re-identification attacks for healthcare, which succeed in 34% of cases, highlighting an urgent need for privacy preservation techniques.

Article
PubMed
PubMed Central

Google Scholar

Ferguson, A. R., Nielson, J. L., Cragin, M. H., Bandrowski, A. E. & Martone, M. E. Big data from small data: data-sharing in the ‘long tail’ of neuroscience. Nat. Neurosci. 17, 1442–1447 (2014).

Article
CAS
PubMed
PubMed Central

Google Scholar

Li, L., Fan, Y., Tse, M. & Lin, K.-Y. A review of applications in federated learning. Computers Ind. Eng. 149, 106854 (2020).

Article

Google Scholar

Warnat-Herresthal, S. et al. Swarm learning for decentralized and confidential clinical machine learning. Nature 594, 265–270 (2021).

Article
CAS
PubMed
PubMed Central

Google Scholar

Lyu, L. et al. Privacy and robustness in federated learning: attacks and defenses. IEEE Trans. Neural Netw. Learn. Syst. 35, 8726–8746 (2024).

Article
PubMed

Google Scholar

Graves, J. L., Kearney, M., Barabino, G. & Malcom, S. Inequality in science and the case for a new agenda. Proc. Natl Acad. Sci. USA 119, e2117831119 (2022).

Article
CAS
PubMed
PubMed Central

Google Scholar

Ejermo, O. & Sofer, Y. When colleges graduate: micro-level effects on publications and scientific organization. Res. Policy 53, 105007 (2024).

Article

Google Scholar

Eckardt, J.-N. et al. Synthetic bone marrow images augment real samples in developing acute myeloid leukemia microscopy classification models. npj Digit. Med. 8, 173 (2025).

Article
PubMed
PubMed Central

Google Scholar

Levine, A. B. et al. Synthesis of diagnostic quality cancer pathology images by generative adversarial networks. J. Pathol. 252, 178–188 (2020).

Article
CAS
PubMed

Google Scholar

Deshpande, S., Dawood, M., Minhas, F. & Rajpoot, N. SynCLay: interactive synthesis of histology images from bespoke cellular layouts. Med. Image Anal. 91, 102995 (2024).

Article
PubMed

Google Scholar

Müller-Franzes, G. et al. A multimodal comparison of latent denoising diffusion probabilistic models and generative adversarial networks for medical image synthesis. Sci. Rep. 13, 12098 (2023).

Article
PubMed
PubMed Central

Google Scholar

Osorio, P. et al. Latent diffusion models with image-derived annotations for enhanced AI-assisted cancer diagnosis in histopathology. Diagnostics 14, 1442 (2024).

Article
PubMed
PubMed Central

Google Scholar

Niehues, J. M. et al. Using histopathology latent diffusion models as privacy-preserving dataset augmenters improves downstream classification performance. Computers Biol. Med. 175, 108410 (2024).

Article

Google Scholar

Kather, J. N. et al. Deep learning can predict microsatellite instability directly from histology in gastrointestinal cancer. Nat. Med. 25, 1054–1056 (2019).

Article
CAS
PubMed
PubMed Central

Google Scholar

Krause, J. et al. Deep learning detects genetic alterations in cancer histology generated by adversarial networks. J. Pathol. 254, 70–79 (2021).

PubMed

Google Scholar

Dolezal, J. M. et al. Deep learning generates synthetic cancer histology for explainability and education. npj Precis. Oncol. 7, 49 (2023).

Article
PubMed
PubMed Central

Google Scholar

Howard, F. M. et al. Generative adversarial networks accurately reconstruct pan-cancer histology from pathologic, genomic, and radiographic latent features. Sci. Adv. 10, eadq0856 (2024).

Article
CAS
PubMed
PubMed Central

Google Scholar

Carrillo-Perez, F. et al. Synthetic whole-slide image tile generation with gene expression profile-infused deep generative models. Cell Rep. Methods 3, 100534 (2023).

Article
PubMed
PubMed Central

Google Scholar

Carrillo-Perez, F. et al. Generation of synthetic whole-slide image tiles of tumours from RNA-sequencing data via cascaded diffusion models. Nat. Biomed. Eng. 9, 320–332 (2025).

Article
CAS
PubMed

Google Scholar

Bai, B. et al. Deep learning-enabled virtual histological staining of biological samples. Light Sci. Appl. 12, 57 (2023).

Article
CAS
PubMed
PubMed Central

Google Scholar

Pati, P. et al. Accelerating histopathology workflows with generative AI-based virtually multiplexed tumour profiling. Nat. Mach. Intell. 6, 1077–1093 (2024).

Article
PubMed
PubMed Central

Google Scholar

Koetzier, L. R. et al. Generating synthetic data for medical imaging. Radiology 312, e232471 (2024).

Article
PubMed
PubMed Central

Google Scholar

Sizikova, E. et al. Synthetic data in radiological imaging: current state and future outlook. BJR Artif. Intell. 1, ubae007 (2024).

Google Scholar

Jung, H. K., Kim, K., Park, J. E. & Kim, N. Image-based generative artificial intelligence in radiology: comprehensive updates. Korean J. Radiol. 25, 959–981 (2024).

Article
PubMed
PubMed Central

Google Scholar

D’Amico, S. et al. Synthetic data generation by artificial intelligence to accelerate research and precision medicine in hematology. JCO Clin. Cancer Inf. 7, e2300021 (2023).

Google Scholar

Bernard, E. et al. Molecular international prognostic scoring system for myelodysplastic syndromes. NEJM Evid. 1, EVIDoa2200008 (2022).

Article
PubMed

Google Scholar

Kang, H. Y. J. et al. Synthetic tabular data based on generative adversarial networks in health care: generation and validation using the divide-and-conquer strategy. JMIR Med. Inform. 11, e47859 (2023).

Article
PubMed
PubMed Central

Google Scholar

Ganguli, R., Lad, R., Lin, A. & Yu, X. Novel generative recurrent neural network framework to produce accurate, applicable, and deidentified synthetic medical data for patients with metastatic cancer. JCO Clin. Cancer Inf. 7, e2200125 (2023).

Google Scholar

Díaz-Navarro, A., Zhang, X., Jiao, W., Wang, B. & Stein, L. In silico generation of synthetic cancer genomes using generative AI. Cell Genom. 5, 100969 (2025).

Article
PubMed
PubMed Central

Google Scholar

Kim, J. & Seok, J. ctGAN: combined transformation of gene expression and survival data with generative adversarial network. Brief. Bioinform. 25, bbae325 (2024).

Article
PubMed
PubMed Central

Google Scholar

Norcliffe, A., Cebere, B., Imrie, F., Lio, P. & van der Schaar, M. SurvivalGAN: generating time-to-event data for survival analysis. Preprint at https://arxiv.org/abs/2302.12749 (2023).

Hogenboom, J. et al. Actionability of synthetic data in a heterogeneous and rare health care demographic: adolescents and young adults with cancer. JCO Clin. Cancer Inf. 8, e2400056 (2024).

Google Scholar

Vlooswijk, C. et al. Recruiting adolescent and young adult cancer survivors for patient-reported outcome research: experiences and sample characteristics of the SURVAYA study. Curr. Oncol. 29, 5407–5425 (2022).

Article
PubMed
PubMed Central

Google Scholar

Juwara, L., El-Hussuna, A. & El Emam, K. An evaluation of synthetic data augmentation for mitigating covariate bias in health data. Patterns 5, 100946 (2024).

Article
PubMed
PubMed Central

Google Scholar

Walonoski, J. et al. Synthea: an approach, method, and software mechanism for generating synthetic patients and the synthetic electronic health care record. J. Am. Med. Inform. Assoc. 25, 230–238 (2018).

Article
PubMed
PubMed Central

Google Scholar

National Disease Registration Service. The Simulacrum. NDRS https://digital.nhs.uk/ndrs/data/data-outputs/cancer-publications-and-tools/simulacrum (accessed 28 September 2025).

Netherlands Comprehensive Cancer Organisation. Synthetic dataset. IKNL https://iknl.nl/en/ncr/synthetic-dataset (accessed 28 September 2025).

Wouters, O. J., McKee, M. & Luyten, J. Estimated research and development investment needed to bring a new medicine to market, 2009–2018. JAMA 323, 844–853 (2020).

Article
PubMed
PubMed Central

Google Scholar

Mulcahy, A. et al. Use of clinical trial characteristics to estimate costs of new drug development. JAMA Netw. Open 8, e2453275 (2025).

Article
PubMed
PubMed Central

Google Scholar

Sertkaya, A., Beleche, T., Jessup, A. & Sommers, B. D. Costs of drug development and research and development intensity in the US, 2000–2018. JAMA Netw. Open 7, e2415445 (2024).

Article
PubMed
PubMed Central

Google Scholar

Sun, D., Gao, W., Hu, H. & Zhou, S. Why 90% of clinical drug development fails and how to improve it? Acta Pharm. Sin. B 12, 3049–3062 (2022).

Article
CAS
PubMed
PubMed Central

Google Scholar

Mullard, A. The high, and redundant, cost of failure in cancer drug development. Nat. Rev. Drug Discov. 22, 688–688 (2023).

Article
CAS
PubMed

Google Scholar

Briel, M. et al. A systematic review of discontinued trials suggested that most reasons for recruitment failure were preventable. J. Clin. Epidemiol. 80, 8–15 (2016).

Article
PubMed

Google Scholar

Briel, M. et al. Exploring reasons for recruitment failure in clinical trials: a qualitative study with clinical trial stakeholders in Switzerland, Germany, and Canada. Trials 22, 844 (2021).

Article
PubMed
PubMed Central

Google Scholar

Van Norman, G. A. Drugs and devices: comparison of European and U.S. approval processes. JACC Basic Transl. Sci. 1, 399–412 (2016).

Article
PubMed
PubMed Central

Google Scholar

Brown, D. G., Wobst, H. J., Kapoor, A., Kenna, L. A. & Southall, N. T. Clinical development times for innovative drugs. Nat. Rev. Drug Discov. 21, 793–794 (2022).

Article
CAS
PubMed
PubMed Central

Google Scholar

Stewart, D. J. et al. The importance of greater speed in drug development for advanced malignancies. Cancer Med. 7, 1824–1836 (2018).

Article
PubMed
PubMed Central

Google Scholar

Sengupta, S. et al. Emulating randomized controlled trials with hybrid control arms in oncology: a case study. Clin. Pharmacol. Ther. 113, 867–877 (2023).

Article
CAS
PubMed

Google Scholar

Tan, W. K. et al. Augmenting control arms with real-world data for cancer trials: hybrid control arm methods and considerations. Contemp. Clin. Trials Commun. 30, 101000 (2022).

Article
PubMed
PubMed Central

Google Scholar

Ventz, S. et al. The design and evaluation of hybrid controlled trials that leverage external data and randomization. Nat. Commun. 13, 5783 (2022).

Article
CAS
PubMed
PubMed Central

Google Scholar

Ghadessi, M. et al. A roadmap to using historical controls in clinical trials — by Drug Information Association Adaptive Design Scientific Working Group (DIA-ADSWG). Orphanet J. Rare Dis. 15, 69 (2020).

Article
PubMed
PubMed Central

Google Scholar

Hall, K. T. et al. Historical controls in randomized clinical trials: opportunities and challenges. Clin. Pharmacol. Ther. 109, 343–351 (2021).

Article
PubMed

Google Scholar

Marion, J. D. & Althouse, A. D. The use of historical controls in clinical trials. JAMA 330, 1484–1485 (2023).

Article
PubMed

Google Scholar

Fountzilas, E., Tsimberidou, A. M., Vo, H. H. & Kurzrock, R. Clinical trial design in the era of precision medicine. Genome Med. 14, 101 (2022).

Article
PubMed
PubMed Central

Google Scholar

Duan, X.-P. et al. New clinical trial design in precision medicine: discovery, development and direction. Signal Transduct. Target. Ther. 9, 1–29 (2024).

Google Scholar

Lee, H. Y., Ha, H., Kang, J. H. & Park, H.-S. Precision oncology clinical trials: a systematic review of phase II clinical trials with biomarker-driven, adaptive design. JCO 42, e23005 (2024).

Article

Google Scholar

Gatta, G. et al. Rare cancers are not so rare: the rare cancer burden in Europe. Eur. J. Cancer 47, 2493–2511 (2011).

Article
PubMed

Google Scholar

Matsuda, T. et al. Rare cancers are not rare in Asia as well: the rare cancer burden in East Asia. Cancer Epidemiol. 67, 101702 (2020).

Article
PubMed

Google Scholar

Eckardt, J.-N. et al. Mimicking clinical trials with synthetic acute myeloid leukemia patients using generative artificial intelligence. npj Digit. Med. 7, 1–11 (2024).

Article

Google Scholar

Piciocchi, A. et al. Unlocking the potential of synthetic patients for accelerating clinical trials: results of the first GIMEMA experience on acute myeloid leukemia patients. eJHaem 5, 353–359 (2024).

Article
CAS
PubMed
PubMed Central

Google Scholar

Nowok, B., Raab, G. M. & Dibben, C. synthpop: bespoke creation of synthetic data in R. J. Stat. Softw. 74, 1–26 (2016).

Article

Google Scholar

Venditti, A. et al. GIMEMA AML1310 trial of risk-adapted, MRD-directed therapy for young adults with newly diagnosed acute myeloid leukemia. Blood 134, 935–945 (2019).

Article
CAS
PubMed

Google Scholar

Azizi, Z., Zheng, C., Mosquera, L., Pilote, L. & Emam, K. E. Can synthetic data be a proxy for real clinical trial data? A validation study. BMJ Open 11, e043497 (2021).

Article
PubMed
PubMed Central

Google Scholar

Dahdaleh, F. S. et al. Obstruction predicts worse long-term outcomes in stage III colon cancer: a secondary analysis of the N0147 trial. Surgery 164, 1223–1229 (2018).

Article
PubMed

Google Scholar

Elvatun, S., Knoors, D., Brant, S., Jonasson, C. & Nygård, J. F. Synthetic data as external control arms in scarce single-arm clinical trials. PLoS Digital Health 4, e0000581 (2025).

Article
PubMed
PubMed Central

Google Scholar

Zhang, J., Cormode, G., Procopiuc, C. M., Srivastava, D. & Xiao, X. PrivBayes: private data release via Bayesian networks. ACM Trans. Database Syst. 42, 25:1–25:41 (2017).

Article

Google Scholar

Xu, L., Skoularidou, M., Cuesta-Infante, A. & Veeramachaneni, K. Modeling tabular data using conditional GAN. In Advances in Neural Information Processing Systems Vol. 32 https://papers.nips.cc/paper_files/paper/2019/file/254ed7d2de3b23ab10936522dd547b78-Paper.pdf (Curran Associates, 2019).

Akiya, I., Ishihara, T. & Yamamoto, K. Comparison of synthetic data generation techniques for control group survival data in oncology clinical trials: simulation study. JMIR Med. Inf. 12, e55118 (2024).

Article

Google Scholar

El-Kababji, S. et al. Augmenting insufficiently accruing oncology clinical trials using generative models: validation study. J. Med. Internet Res. 27, e66821 (2025).

Article
PubMed
PubMed Central

Google Scholar

Beigi, M., Shafquat, A., Mezey, J. & Aptekar, J. Simulants: synthetic clinical trial data via subject-level privacy-preserving synthesis. AMIA Annu. Symp. Proc. 2022, 231–240 (2023).

PubMed
PubMed Central

Google Scholar

Giuffrè, M. & Shung, D. L. Harnessing the power of synthetic data in healthcare: innovation, application, and privacy. npj Digit. Med. 6, 1–8 (2023).

Article

Google Scholar

Chen, R. J., Lu, M. Y., Chen, T. Y., Williamson, D. F. K. & Mahmood, F. Synthetic data in machine learning for medicine and healthcare. Nat. Biomed. Eng. 5, 493–497 (2021).

Article
PubMed
PubMed Central

Google Scholar

Draghi, B., Wang, Z., Myles, P. & Tucker, A. BayesBoost: identifying and handling bias using synthetic data generators. In Proceedings of the Third International Workshop on Learning with Imbalanced Domains: Theory and Applications 49–62 (PMLR, 2021).

Shahul Hameed, M. A., Qureshi, A. M. & Kaushik, A. Bias mitigation via synthetic data generation: a review. Electronics 13, 3909 (2024).

Article

Google Scholar

Baumann, J., Castelnovo, A., Crupi, R., Inverardi, N. & Regoli, D. Bias on demand: a modelling framework that generates synthetic data with bias. In Proc. 2023 ACM Conference on Fairness, Accountability, and Transparency 1002–1013 (Association for Computing Machinery, 2023); https://doi.org/10.1145/3593013.3594058.

Ge, L., Li, H., Wang, X. & Wang, Z. A review of secure federated learning: privacy leakage threats, protection technologies, challenges and future directions. Neurocomputing 561, 126897 (2023).

Article

Google Scholar

Hu, H. et al. Membership inference attacks on machine learning: a survey. ACM Comput. Surv. 54, 235:1–235:37 (2022).

Article

Google Scholar

Yang, W. et al. Deep learning model inversion attacks and defenses: a comprehensive survey. Artif. Intell. Rev. 58, 242 (2025).

Article

Google Scholar

Farah, E., Kenney, M., Warkentin, M. T., Cheung, W. Y. & Brenner, D. R. Examining external control arms in oncology: a scoping review of applications to date. Cancer Med. 13, e7447 (2024).

Article
PubMed
PubMed Central

Google Scholar

Serrano, C. et al. Rethinking placebos: embracing synthetic control arms in clinical trials for rare tumors. Nat. Med. 29, 2689–2692 (2023).

Article
CAS
PubMed

Google Scholar

Davies, J. et al. Comparative effectiveness from a single-arm trial and real-world data: alectinib versus ceritinib. J. Comp. Effective. Res. 7, 855–865 (2018).

Article

Google Scholar

Jaksa, A. et al. A comparison of 7 oncology external control arm case studies: critiques from regulatory and health technology assessment agencies. Value Health 25, 1967–1976 (2022).

Article
PubMed

Google Scholar

Arondekar, B. et al. Real-world evidence in support of oncology product registration: a systematic review of new drug application and biologics license application approvals from 2015–2020. Clin. Cancer Res. 28, 27–35 (2022).

Article
PubMed

Google Scholar

Food and Drug Administration. Good Machine Learning Practice for Medical Device Development: Guiding Principles. FDA https://www.fda.gov/medical-devices/software-medical-device-samd/good-machine-learning-practice-medical-device-development-guiding-principles (2025). The FDA recently introduced Good Machine Learning Practices to guide the development, evaluation and implementation of medical software for clinical use based on machine learning.

Hernandez, M. et al. Comprehensive evaluation framework for synthetic tabular data in health: fidelity, utility and privacy analysis of generative models with and without privacy guarantees. Front. Digit. Health 7, 1576290 (2025).

Article
PubMed
PubMed Central

Google Scholar

Shahbazian, R. & Greco, S. Generative adversarial networks assist missing data imputation: a comprehensive survey and evaluation. IEEE Access 11, 88908–88928 (2023).

Article

Google Scholar

Jarrett, D., Cebere, B. C., Liu, T., Curth, A. & van der Schaar, M. HyperImpute: generalized iterative imputation with automatic model selection. In Proc. 39th International Conference on Machine Learning 9916–9937 (PMLR, 2022).

Vero, M., Balunovic, M. & Vechev, M. CuTS: customizable tabular synthetic data generation. In Proc. 41st International Conference on Machine Learning 49408–49433 (PMLR, 2024).

Breugel, B. V., Qian, Z. & Schaar, M. V. D. Synthetic data, real errors: how (not) to publish and use synthetic data. In Proc. 40th International Conference on Machine Learning 34793–34808 (PMLR, 2023).

Probst, P., Boulesteix, A.-L. & Bischl, B. Tunability: importance of hyperparameters of machine learning algorithms. J. Mach. Learn. Res. 20, 1–32 (2019).

Google Scholar

Wang, H., Sudalairaj, S., Henning, J., Greenewald, K. & Srivastava, A. Post-processing private synthetic data for improving utility on selected measures. Adv. Neural Inf. Process. Syst. 36, 64139–64154 (2023).

Google Scholar

Pilgram, L. et al. A consensus privacy metrics framework for synthetic data. Patterns https://doi.org/10.1016/j.patter.2025.101320 (2025).

Article
PubMed
PubMed Central

Google Scholar

Dwork, C. in Automata, Languages and Programming (eds Bugliesi, M. et al.) 1–12 (Springer, 2006); https://doi.org/10.1007/11787006_1. This study introduces the concept of differential privacy, demonstrating that adding noise to synthetic health data may mitigate re-identification attempts.

Boulemtafes, A., Derhab, A. & Challal, Y. A review of privacy-preserving techniques for deep learning. Neurocomputing 384, 21–45 (2020).

Article

Google Scholar

Gibney, E. Could machine learning fuel a reproducibility crisis in science? Nature 608, 250–251 (2022).

Article
CAS
PubMed

Google Scholar

Artificial intelligence-generated synthetic data for cancer research and clinical trials

Tags: