Abstract
Synthetic data, generated using advanced artificial intelligence (AI) techniques, replicates the statistical properties of real-world datasets while excluding identifiable information. Although synthetic data does not consist of actual data points, it is derived from original datasets, thereby enabling analyses that yield results comparable to those obtained with real data. Synthetic datasets are evaluated based on their utility—a measure of how effectively they mirror real data for analytical purposes. This paper presents the generation of synthetic datasets through the Healthcare Big Data Showcase Project (2019–2023). The original dataset comprises comprehensive multi-omics data from 400 individuals, including cancer survivors, chronic disease patients, and healthy participants. Synthetic data facilitates efficient access and robust analyses, serving as a practical tool for research and education. It addresses privacy concerns, supports AI research, and provides a foundation for innovative applications across diverse fields, such as public health and precision medicine.
Keywords: Genomics; Life-log data; Omics-data; Public health; Synthetic data
Introduction
Biomedical research increasingly relies on large-scale multi-omics datasets, which include wearable life-log data, whole exome sequencing, whole genome sequencing, RNA sequencing (RNA-seq), and microbiome profiles, to uncover insights into human health. However, these datasets often contain sensitive information, which limits their accessibility and hampers research progress [1, 2].
To address these challenges, the Healthcare Big Data Showcase Project (2019–2023) was launched to collect and integrate multi-omics and life-log data from 400 individuals, including cancer survivors, chronic disease patients, and healthy participants, thereby creating a valuable resource for healthcare research. Despite its comprehensive nature, the sensitivity of the dataset restricts its use, necessitating innovative solutions to facilitate analysis without compromising privacy [3].
Synthetic data have emerged as a transformative solution. By replicating the statistical properties of real data while eliminating identifiable information, synthetic datasets enable analyses that yield results comparable to those obtained with real data [1, 4]. Rather than being direct replicas, synthetic data are derived from original datasets and designed to mirror their statistical characteristics. Their utility is evaluated by how effectively they serve as a proxy for real data, allowing researchers to draw reliable conclusions while safeguarding privacy [5, 6].
Among the various data types, life-log data were synthesized using advanced deep learning models, such as recurrent time-series generative adversarial networks (RTSGAN), due to their temporal structure and suitability for such models. Generative adversarial network (GAN)-based approaches, such as the Wasserstein GAN, significantly improved the quality and stability of the synthesis process [7].
Methods
In this study, synthetic datasets comprising 1,000 individuals for each data type were generated based on original multi-omics and life-log data collected from 400 participants in the Healthcare Big Data Showcase Project (2019–2023) (Table 1).
For the life-log data, the RTSGAN model was employed to generate synthetic datasets (Supplementary Material 1). In contrast, synthetic sequencing data, including RNA-seq and methyl-seq, were generated using statistical methods tailored to each data type, whereas microbiome data were reconstructed without applying statistical perturbations.
The RTSGAN model, introduced at the 2021 Institute of Electrical and Electronics Engineers International Conference on Data Mining, is an autoencoder-based GAN architecture specifically designed for medical data with irregular time intervals, addressing limitations of conventional GAN models [8, 9]. Life-log data were categorized into dynamic and static variables. Dynamic variables represent time-dependent information, such as daily weight, calories burned, and step count, whereas static variables include fixed attributes like date of birth, blood type, and gender. The RTSGAN model learns temporal and structural patterns from both types of variables, enabling the generation of synthetic data that reflects the statistical properties of the original data.
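As a rough illustration of the static/dynamic split described above, one participant's life-log record could be organized as in the sketch below. The class names, field names, and values are hypothetical and are not taken from the RTSGAN implementation or from the project's data schema; the uneven day offsets simply illustrate irregular sampling intervals.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DynamicObservation:
    """One time-stamped set of measurements (hypothetical layout)."""
    day: int                  # days since enrollment; gaps may be irregular
    values: Dict[str, float]  # e.g. weight, calories burned, step count


@dataclass
class LifeLogRecord:
    """Static attributes plus an irregularly sampled dynamic series."""
    static: Dict[str, str]
    dynamic: List[DynamicObservation] = field(default_factory=list)

    def intervals(self) -> List[int]:
        """Gaps in days between consecutive observations."""
        days = sorted(obs.day for obs in self.dynamic)
        return [later - earlier for earlier, later in zip(days, days[1:])]


# Made-up example values, for illustration only.
record = LifeLogRecord(
    static={"blood_type": "A", "gender": "F"},
    dynamic=[
        DynamicObservation(0, {"weight_kg": 61.2, "steps": 8040.0}),
        DynamicObservation(1, {"weight_kg": 61.0, "steps": 10512.0}),
        DynamicObservation(4, {"weight_kg": 60.8, "steps": 3970.0}),
    ],
)
print(record.intervals())  # [1, 3]
```

Keeping the static attributes separate from the time series mirrors the distinction drawn in the text: static values are stored once per individual, while dynamic values carry their own time index so that uneven gaps remain visible to the model.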
For RNA-seq data, synthetic datasets were created using 28,284 Ensembl gene IDs. Metrics such as read count, fragments per kilobase of transcript per million mapped reads (FPKM), and transcripts per million (TPM) were synthesized by introducing random errors to group-specific mean values, thereby mimicking realistic biological variability [5, 9].
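The idea of perturbing group-specific means can be sketched as follows for a single gene. This is only a minimal illustration under stated assumptions: the text does not specify the error distribution, so Gaussian noise scaled by each group's sample standard deviation is assumed here, negative draws are clipped to zero, and the function name, parameters, and input values are hypothetical.

```python
import random
from statistics import mean, stdev
from typing import Dict, List


def synthesize_from_group_means(
    real: Dict[str, List[float]],
    n_per_group: int,
    noise_scale: float = 1.0,
    seed: int = 0,
) -> Dict[str, List[float]]:
    """Draw synthetic values for one gene as group mean + random error.

    Assumptions (not from the paper): the error is Gaussian with the
    group's sample standard deviation times `noise_scale`, and negative
    draws are clipped to zero because expression values are non-negative.
    """
    rng = random.Random(seed)
    synthetic: Dict[str, List[float]] = {}
    for group, values in real.items():
        mu = mean(values)
        # stdev needs at least two points; fall back to no noise otherwise.
        sigma = stdev(values) * noise_scale if len(values) > 1 else 0.0
        synthetic[group] = [
            max(0.0, rng.gauss(mu, sigma)) for _ in range(n_per_group)
        ]
    return synthetic


# Made-up TPM values for one gene in two groups.
real_tpm = {"normal": [10.0, 12.0, 11.0], "breast_cancer": [30.0, 28.0, 35.0]}
synthetic_tpm = synthesize_from_group_means(real_tpm, n_per_group=5)
print({group: len(values) for group, values in synthetic_tpm.items()})
```

Because each group is sampled around its own mean, differences between groups are retained by construction, while individual-level values no longer correspond to any real participant.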
Synthetic methyl-seq data were generated by adding stochastic noise to group-level methylation averages, preserving statistical fidelity while introducing appropriate variation [5, 10].
Microbiome datasets were generated by retaining microbial features and taxonomic classifications from original fecal and saliva samples, with all personal identifiers removed to ensure data privacy.
To assess the quality of the synthetic data, a “train on synthesized, test on real” (TSTR) evaluation was conducted (Table 2). This evaluation was performed exclusively on the synthetic life-log dataset generated using RTSGAN, owing to its temporal nature and suitability for downstream predictive modeling tasks. The results demonstrated a high degree of similarity between the synthetic and real data, as evidenced by comparable area under the receiver operating characteristic curve (AUROC) and classification accuracy in both test scenarios. Although scenario 2 showed a slight decline in performance, it remained within an acceptable range for practical applications.
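The TSTR procedure can be sketched in a few lines: a classifier is fitted on synthetic records only and then scored on real records. The sketch below is illustrative and makes several assumptions that are not stated in the text: a nearest-centroid classifier stands in for whatever model the study used, the labels are binary, and the toy feature vectors are invented. It is not the study's evaluation pipeline.

```python
from math import dist
from typing import List, Sequence, Tuple

Vector = Sequence[float]
Model = Tuple[List[float], List[float]]  # (negative centroid, positive centroid)


def centroid(rows: List[Vector]) -> List[float]:
    """Column-wise mean of a list of feature vectors."""
    return [sum(column) / len(rows) for column in zip(*rows)]


def fit_centroids(X: List[Vector], y: List[int]) -> Model:
    """Fit a nearest-centroid classifier (stand-in model, an assumption)."""
    negatives = [x for x, label in zip(X, y) if label == 0]
    positives = [x for x, label in zip(X, y) if label == 1]
    return centroid(negatives), centroid(positives)


def score(model: Model, x: Vector) -> float:
    """Positive when x lies closer to the positive-class centroid."""
    negative_centroid, positive_centroid = model
    return dist(x, negative_centroid) - dist(x, positive_centroid)


def auroc(scores: List[float], labels: List[int]) -> float:
    """AUROC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, label in zip(scores, labels) if label == 1]
    neg = [s for s, label in zip(scores, labels) if label == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def tstr(X_syn, y_syn, X_real, y_real) -> Tuple[float, float]:
    """Train on synthesized data, test on real data; return (AUROC, accuracy)."""
    model = fit_centroids(X_syn, y_syn)
    scores = [score(model, x) for x in X_real]
    predictions = [1 if s > 0 else 0 for s in scores]
    accuracy = sum(p == t for p, t in zip(predictions, y_real)) / len(y_real)
    return auroc(scores, y_real), accuracy


# Invented toy data: two features per record, binary outcome.
X_syn = [(1.0, 1.0), (1.2, 0.8), (4.0, 4.2), (3.8, 4.0)]
y_syn = [0, 0, 1, 1]
X_real = [(0.9, 1.1), (1.3, 1.0), (4.1, 3.9), (3.5, 4.4)]
y_real = [0, 0, 1, 1]
print(tstr(X_syn, y_syn, X_real, y_real))
```

If the synthetic data preserve the relationship between features and outcome, the scores obtained on real records should be close to those of a model trained on real data, which is the comparison that a TSTR table summarizes.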
In contrast, the RNA-seq, methyl-seq, and microbiome datasets were not subjected to TSTR validation, as these sequencing-based data types are primarily used for biological interpretation (such as differential expression and pathway analysis) rather than supervised classification. Their validity was instead assessed through statistical similarity assessments, namely the preservation of statistical distributions, biological patterns, and group-level trends relative to the original data. Nonetheless, the TSTR results indicate that RTSGAN-generated synthetic life-log data retain substantial analytical value and can serve as reliable proxies for real data in predictive modeling tasks [4, 9].
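One common way to quantify distributional similarity between real and synthetic values is the two-sample Kolmogorov-Smirnov statistic, the largest gap between the two empirical cumulative distribution functions. The text does not say which similarity measures were used, so the choice of this statistic, the function name, and the example values below are assumptions made purely for illustration.

```python
from bisect import bisect_right
from typing import Sequence


def ks_statistic(real: Sequence[float], synthetic: Sequence[float]) -> float:
    """Two-sample Kolmogorov-Smirnov statistic.

    Returns the maximum absolute difference between the empirical CDFs
    of the two samples: 0.0 for identical samples, 1.0 for samples whose
    value ranges do not overlap at all.
    """
    a, b = sorted(real), sorted(synthetic)
    largest_gap = 0.0
    # The maximum gap is always attained at one of the observed values.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap


# Invented methylation-like values for one gene in one group.
real_values = [0.42, 0.45, 0.47, 0.51, 0.55]
synthetic_values = [0.41, 0.46, 0.48, 0.50, 0.56, 0.53]
print(round(ks_statistic(real_values, synthetic_values), 3))
```

A value near 0 suggests that the synthetic distribution closely follows the real one for that feature; applying such a check per gene and per group would be one way to examine whether group-level trends are preserved.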
Collection
Data Resource Area and Population Coverage
Synthetic datasets encompassing 1,000 individuals for each data type were generated based on original data collected from 400 participants in the Healthcare Big Data Showcase Project (2019–2023). These synthetic datasets provide an integrated view of microbial, genetic, and lifestyle metrics, enabling multidimensional analysis while preserving privacy.
The data were synthesized as follows:
(1) Microbiome data: original fecal and saliva samples were processed using next-generation sequencing to identify microbial features and their taxonomic classifications. Synthetic datasets retained these structures without introducing additional statistical perturbations, thereby preserving the compositional integrity of the original microbiome profiles.
(2) Life-log data: activity, body composition, and sleep metrics were collected using wearable devices, providing temporal and longitudinal insights. Synthetic life-log datasets were generated using RTSGAN to reflect irregular temporal dynamics.
(3) Methyl-seq data: DNA methylation profiles covering 45,579 Ensembl gene IDs were synthesized by applying statistical noise to group-level methylation patterns to reflect epigenetic modifications.
(4) RNA-seq data: transcriptomic profiles including gene expression patterns (FPKM, read counts, TPM) were generated using 28,284 Ensembl gene IDs with random perturbations added to preserve the statistical properties of the original data.
This integrated synthetic data resource provides a comprehensive framework for exploring the relationships among microbiome composition, genetic and epigenetic features, and lifestyle behaviors in a privacy-preserving manner.
Survey Frequency
Since the dataset consisted of synthetic data, there were no actual survey dates or repeated measures.
Measures
Microbiome data: (1) feature IDs and DNA sequences: captured in the dna-seq folder; (2) feature abundance: stored as relative abundance in the feature table; (3) taxonomy: taxonomic classifications in the taxonomy folder.
Life-log data: (1) activity metrics: calories burned, distance, steps, activity time (vigorous, moderate, light), rest time; (2) body metrics: body mass index, body fat percentage, weight; (3) sleep metrics: awake time, rapid eye movement sleep time, light sleep time, total sleep time.
Methyl-seq data: (1) methylation matrix: Ensembl gene IDs (e.g., ENSG00000123456) and DNA methylation levels for each gene and patient; (2) summary statistics: a total of 45,579 genes analyzed and distribution of methylation values.
RNA-seq data: gene expression matrix: Ensembl gene IDs (e.g., ENSG00000123456) and gene expression levels (FPKM, read count, and TPM) for each of 28,284 genes and each patient.
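For users of the RNA-seq measures, it may help to recall how FPKM and TPM relate within a single sample: TPM rescales FPKM so that the values sum to one million. The sketch below applies that standard relation; the gene labels and values are invented, and whether the released synthetic FPKM and TPM columns satisfy this relation exactly is not stated in the text, so it is offered only as an optional consistency check.

```python
from math import isclose
from typing import Dict


def fpkm_to_tpm(fpkm: Dict[str, float]) -> Dict[str, float]:
    """Convert one sample's FPKM values to TPM.

    Uses TPM_i = FPKM_i / sum_j(FPKM_j) * 1e6, so the result sums to 1e6.
    """
    total = sum(fpkm.values())
    if total <= 0:
        raise ValueError("FPKM values must have a positive sum")
    return {gene: value / total * 1e6 for gene, value in fpkm.items()}


# Invented FPKM values for three placeholder genes in one sample.
sample_fpkm = {"gene_a": 10.0, "gene_b": 30.0, "gene_c": 60.0}
sample_tpm = fpkm_to_tpm(sample_fpkm)
print(isclose(sum(sample_tpm.values()), 1e6))
```

Comparing the TPM values provided in a dataset with those recomputed from FPKM in this way gives a quick indication of how far independently perturbed metrics have drifted from one another.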
Data Resource Use
The synthetic datasets generated from life-log, RNA-seq, Methyl-seq, and microbiome data offer a wide range of applications in predictive modeling, risk assessment, biomarker discovery, and public health research. By leveraging these datasets, researchers can address critical health challenges while maintaining privacy and data security. The expected benefits of synthetic data, combined with its versatile applications, render it a powerful tool for advancing healthcare and personalized medicine.
One of the primary applications of synthetic datasets is the development of predictive models for health monitoring and disease risk assessment. Using life-log data, researchers can identify behavioral patterns, such as physical activity levels and sleep quality, and correlate these with health outcomes like metabolic syndrome and mental health conditions. These insights are valuable for creating personalized health interventions and preventive strategies. For RNA and microbiome data, synthetic datasets enable the identification of biomarkers and predictive factors for a wide range of diseases, including cancer, chronic illnesses, and gastrointestinal disorders. By integrating RNA-seq and microbiome data with clinical information, researchers can develop robust risk stratification models to predict disease onset or progression, facilitating earlier diagnosis and targeted treatment strategies.
The multi-omics nature of the synthetic datasets facilitates the discovery of novel biomarkers by enabling the analysis of complex relationships between genetic, transcriptomic, and microbiome data. Gene expression profiles can be examined to identify transcriptional signatures specific to cancer subtypes or chronic disease conditions, thereby facilitating the discovery of potential therapeutic targets [2]. The composition of the gut microbiota can be linked to metabolic disorders or immune system dysregulation, aiding in the development of interventions to restore microbiome balance and treat related diseases. By identifying disease-specific biomarkers, synthetic data provides a foundation for population-specific biomarker identification and the development of diagnostic tools tailored to diverse demographic or health profiles. These findings can also lead to more accurate diagnostics and personalized treatments.
Synthetic datasets derived from life-log, RNA, and microbiome data also offer valuable insights for public health research. Descriptive statistics generated from these datasets can simulate national trends in health behaviors, such as physical activity, sleep patterns, and microbiome diversity. Such analyses assist researchers in estimating the prevalence of health behaviors or biological factors associated with chronic diseases, thereby informing public health interventions and policy-making. These insights are essential for developing population-level health strategies, such as campaigns to promote physical activity, improve sleep quality, or enhance gut health. Additionally, the capacity to simulate national trends aids in predicting healthcare demands and allocating resources effectively.
Combining synthetic life-log, RNA-seq, and microbiome data allows for comprehensive integrative analyses that uncover the interactions among lifestyle factors, genetic profiles, and microbial ecosystems. This multi-dimensional approach enables researchers to investigate the interplay between physical activity, gene expression, and gut health. Furthermore, lifestyle variables such as sleep patterns can influence microbiome diversity and, subsequently, health outcomes. By leveraging these datasets, researchers can uncover new biological pathways and interactions that drive disease progression, ultimately enabling the development of personalized healthcare solutions and preventive strategies. This integrative analysis is crucial for advancing precision medicine, where treatments are tailored to individuals’ unique genetic, environmental, and microbiome profiles.
Synthetic datasets are also valuable for educational programs, offering realistic data for training in data analysis, model development, and simulation-based projects.
Strengths and Weaknesses
Strengths and Weaknesses of Synthetic Data
Synthetic data offers several compelling advantages that make it a valuable tool for research and analysis. First, it provides robust privacy protection by eliminating identifiable information, thereby addressing the critical privacy concerns that often accompany real-world datasets. This feature enables researchers to work with data without violating individual confidentiality, a significant hurdle in many studies involving sensitive information.
Another key strength is the simplification of regulatory processes. Because synthetic data does not involve actual personal data, it bypasses the need for extensive ethical reviews and institutional approvals, thereby streamlining the data-sharing process. This significantly reduces the administrative burden on researchers and accelerates project initiation.
Synthetic data also demonstrates high utility by accurately reflecting the statistical properties of real-world data. This fidelity supports diverse analytical approaches and ensures that findings derived from synthetic data are comparable to those based on real data. Additionally, synthetic datasets are freely shareable for research and educational purposes, enhancing accessibility and promoting collaboration across institutions and disciplines.
Lastly, the broad applicability of synthetic data makes it an invaluable resource across various fields, including healthcare, AI development, and public health. Its versatility ensures that it can support a wide range of studies, from predictive modeling to public health surveillance.
Nonetheless, synthetic data also have notable limitations. One significant challenge is the need for thorough validation. To ensure usability and fidelity, synthetic data must be rigorously evaluated against real-world datasets. Without proper validation, insights derived from synthetic data may lack reliability, thereby limiting its effectiveness.
Another weakness is the inability of synthetic data to fully capture rare or highly complex patterns present in real data. This limitation is particularly evident when studying rare diseases or genetic variants, where the unique nuances of the original dataset may be lost during the synthetic generation process. Addressing this issue requires continuous refinement of synthetic data generation techniques to improve their ability to replicate these intricate patterns.
By understanding both the strengths and weaknesses of synthetic data, researchers can better leverage its potential while mitigating its limitations. This balanced approach ensures that synthetic data remains a powerful and reliable tool for advancing scientific and medical knowledge.
Discussion
Integration and Relationship of Multimodal Data
To ensure that synthetic data accurately reflects real-world characteristics, different generation methods were applied, each tailored to a specific data type. For life-log data, time-series patterns were preserved using an RTSGAN model, which captures temporal dependencies and variability. For multi-omics data, synthetic RNA-seq datasets were constructed by introducing controlled variations to gene expression metrics such as read count, FPKM, and TPM. Similarly, synthetic methyl-seq data was generated by incorporating stochastic noise to maintain the statistical integrity of group-specific trends.
In contrast, microbiome datasets were generated by retaining microbial features and taxonomic classifications from original fecal and saliva profiles, without introducing statistical perturbations. All personal identifiers were removed from all data types to ensure data privacy while preserving the compositional structure of the original data.
Whole genome sequencing (WGS) and whole exome sequencing (WES) data, although originally collected as part of the Healthcare Big Data Showcase Project, were not included in this synthetic data generation study due to several practical limitations. These data types are extremely large in scale, requiring substantial storage, processing power, and computational time to handle. Moreover, the complexity of genomic variation (including rare variants, structural rearrangements, and coverage depth) makes it challenging to generate realistic synthetic equivalents.
Another key limitation is the absence of widely accepted evaluation frameworks to validate the biological fidelity of synthetic WGS/WES data. Unlike expression or methylation profiles, where group-level statistical patterns can be preserved with controlled noise, genomic sequencing data require more advanced modeling techniques that are still under development. Therefore, this study focused on RNA-seq and methyl-seq data, which are more amenable to statistical synthesis approaches and allow more reliable fidelity assessments.
Collectively, these approaches support the robustness of synthetic multimodal datasets and demonstrate their potential for reliable downstream analyses.
Ethical Implications of Using Synthetic Data in Biomedical Research
The use of synthetic data in biomedical research presents significant ethical advantages, particularly in addressing concerns related to patient privacy and data security. One primary ethical benefit of synthetic data is its ability to mitigate the risks of re-identification and unauthorized access to sensitive personal health information. Unlike real patient data, synthetic datasets do not contain direct identifiers, thereby reducing concerns about privacy breaches and ethical dilemmas associated with data sharing.
Additionally, synthetic data facilitates broader access to biomedical datasets without compromising individual patient confidentiality. This is particularly relevant in collaborative research settings where data sharing across institutions or countries can be restricted by stringent privacy regulations. By providing a secure and privacy-preserving alternative, synthetic data enables researchers to conduct meaningful analyses while adhering to ethical and legal frameworks.
Moreover, the generation of synthetic datasets allows for the control or mitigation of biases inherent in real-world data, contributing to more balanced and representative analyses. This can enhance the fairness of machine learning models and prevent unintended biases in biomedical research applications.
These ethical advantages underscore the potential of synthetic data as a responsible and effective tool for advancing biomedical research while upholding the principles of patient privacy and ethical integrity.
Access
The synthetic omics data will be made publicly available as educational resources through the Clinical & Omics Data Archive (CODA) platform (https://coda.nih.go.kr/) once the validation process is complete. This validation ensures that the unique characteristics of each omics dataset are thoroughly considered, thereby preserving the fidelity and utility of the synthetic data for research and educational purposes.
HIGHLIGHTS
• Each synthetic dataset includes 1,000 individuals and was generated based on original datasets from 400 individuals in the Healthcare Big Data Showcase Project (2019–2023).
• Recurrent time-series generative adversarial network (RTSGAN), a state-of-the-art autoencoder-based generative adversarial network (GAN) model designed for irregularly timed medical data, was employed to generate the synthetic life-log datasets.
• Synthetic data replicate the statistical properties of real data, enabling privacy-preserving research without compromising analytical validity.
• Synthetic datasets for life-log, RNA sequencing, methyl-seq and microbiome data included 1,000 individuals each.
• Utility tests, including “train on synthesized, test on real” (TSTR) and “train on real, test on synthesized” (TRTS), demonstrated the practical value of synthetic data for predictive modeling.
• Synthetic data eliminate the need for institutional review board approval, overcoming barriers to data access and collaboration.
• Synthetic datasets enhance privacy and accessibility, offering broad societal and research benefits.
Supplementary Material
Article information
Ethics Approval
Informed consent was waived because of the retrospective nature of this study.
Conflicts of Interest
Hee Youl Chai has been the Journal Management Team director of Osong Public Health and Research Perspectives. The other authors have no conflicts of interest to declare.
Availability of Data
The datasets generated during the current study will be available in the CODA repository, https://coda.nih.go.kr/.
Authors’ Contributions
Conceptualization: YGL, JEK; Data curation: YGL, MSK, DUN; Formal analysis: MSK; Funding acquisition: HYC, MSK; Project administration: HYC; Resources: HYC; Writing–original draft: YGL; Writing–review & editing: all authors. All authors read and approved the final manuscript.
Table 1. The Healthcare Big Data Showcase Project data
Multi-omics | Total (n=399) | | | Normal (n=99) | | | Chronic disease (n=100) | | | Breast cancer (n=80) | | | Colorectal cancer (n=80) | | | Gastric cancer (n=40) | |
 | 1st a) | 2nd | 3rd | 1st | 2nd | 3rd | 1st | 2nd | 3rd | 1st | 2nd | 3rd | 1st | 2nd | 3rd | 1st | 2nd | 3rd
WGS | 396 | - | - | 99 | - | - | 95 | - | - | 81 | - | - | 81 | - | - | 40 | - | -
RNA-Seq | 374 | - | - | 98 | - | - | 95 | - | - | 81 | - | - | 81 | - | - | 24 | - | -
LiquidBx. | 202 | - | - | - | - | - | - | - | - | 81 | - | - | 81 | - | - | 40 | - | -
DTC genes | 99 | - | - | 99 | - | - | - | - | - | - | - | - | - | - | - | - | - | -
Proteome | 379 | - | - | 99 | - | - | 100 | - | - | 65 | - | - | 80 | - | - | 35 | - | -
Methyl-Seq | 390 | 44 | - | 99 | - | - | 90 | - | - | 81 | 16 | - | 81 | - | - | 39 | - | -
Microbiome | 353 | 140 | 168 | 99 | 79 | 73 | 89 | 61 | 40 | 46 | - | 34 | 79 | - | 21 | 40 | - | -
Metabolome | 313 | 165 | - | 92 | - | 49 | 78 | - | 51 | 30 | - | - | 74 | 65 | - | 39 | - | -
Table 2. TSTR results (applied only to synthetic life-log data)
Metric | Scenario 1 | Scenario 2
AUROC a) | 0.9844 | 0.9667
Accuracy | 0.9999 | 0.9677
References
1. Snoke J, Raab GM, Nowok B, et al. General and specific utility measures for synthetic data. J R Stat Soc Ser A Stat Soc 2018;181:663−88.
2. Peng C, Yang X, Chen A, et al. A study of generative large language model for medical research and healthcare. NPJ Digit Med 2023;6:210.
3. Vaid A, Lampert J, Lee J, et al. Natural language programming in medicine: administering evidence based clinical workflows with autonomous agents powered by generative large language models. arXiv [Preprint] 2024 Aug 22 https://doi.org/10.48550/arXiv.2401.02851.
4. Ke YH, Jin L, Elangovan K, et al. Development and testing of retrieval augmented generation in large language models: a case study report. arXiv [Preprint] 2024 Feb 22 https://doi.org/10.48550/arXiv.2402.01733.
5. Kim J, Shim C, Yang BS, et al. General-purpose retrieval-enhanced medical prediction model using near-infinite history. arXiv [Preprint] 2024 Jul 22 https://doi.org/10.48550/arXiv.2310.20204.
6. Stamatakos GS, Dionysiou DD, Zacharaki EI, et al. In silico radiation oncology: combining novel simulation algorithms with current visualization techniques. Proc IEEE 2002;90:1764−77.
7. Shah NH, Milstein A, Bagley SC. Making machine learning models clinically useful. JAMA 2019;322:1351−2.
8. Viceconti M, Henney A, Morley-Fletcher E. In silico clinical trials: how computer simulation will transform the biomedical industry. Int J Clin Trials 2016;3:37−46.
9. Pfohl SR, Foryciarz A, Shah NH. An empirical characterization of fair machine learning for clinical risk prediction. J Biomed Inform 2021;113:103621.
10. Erdman AG, Keefe DF, Schiestl R. Grand challenge: applying regulatory science and big data to improve medical device innovation. IEEE Trans Biomed Eng 2013;60:700−6.