Synthetic Data Generation for Biomedical Deep Learning: Methods, Challenges, and Opportunities

2nd International Conference on Chemo and Bioinformatics, ICCBIKG 2023 (pp. 26-33)

AUTHOR(S): Zlatan Car, Sandi Baressi Šegota, Branka Dobraš

E-MAIL: car@riteh.hr

DOI: 10.46793/ICCBI23.026C

ABSTRACT:

With the proliferation of deep learning (DL) techniques in biomedical applications, the need for large-scale and diverse datasets has become increasingly apparent. However, obtaining labeled biomedical data is often challenging. Patient privacy, data sharing restrictions, ethical questions, and a lack of data, caused either by the bias of medical examinations towards patients already affected by an illness or by the rarity of the investigated disease, can cause significant issues in the data collection process. In addition, manual annotation of collected data is time-intensive and requires trained personnel. One potential solution discussed in data science is the use of synthetically generated data: artificial data points, created from previously collected data, that can aid in model training. This paper explores the potential of synthetic data generation as a solution to this data scarcity. It reviews existing synthetic data applications and generation methods for both numeric and image data, with a focus on current state-of-the-art methods, standard approaches, the challenges introduced by applying synthetic data in DL methodologies, and future opportunities in the field.

KEYWORDS:

biomedical data, data science, deep learning, generative data methods, synthetic data
