Standardizing the approach to clinical-based human microbiome research: from clinical information collection to microbiome profiling and human resource utilization
Abstract
Objectives
This study presents the standardized protocols developed by the Clinical-Based Human Microbiome Research and Development Project (cHMP) in the Republic of Korea.
Methods
It addresses clinical metadata collection, specimen handling, DNA extraction, sequencing methods, and quality control measures for microbiome research.
Results
The cHMP involves collecting samples from healthy individuals and patients across various body sites, including the gastrointestinal tract, oral cavity, respiratory system, urogenital tract, and skin. These standardized procedures ensure consistent data quality through controlled specimen collection, storage, transportation, DNA extraction, and sequencing. Sequencing encompasses both amplicon and whole metagenome methods, followed by stringent quality checks. The protocols conform to international guidelines, ensuring that the data generated are both reliable and comparable across microbiome studies.
Conclusion
The cHMP underscores the importance of methodological standardization for enhancing data integrity and reproducibility and for advancing microbiome-based research, with potential applications for improving human health outcomes.
Introduction
The human microbiome comprises all microbes inhabiting various organs and their associated ecosystems [1]. Advancements in high-throughput sequencing and bioinformatics have made microbiome research more feasible, revealing links between microbiomes and diseases [2]. In this field, standardization is crucial to ensure comparable results across studies and accelerate research progress. International initiatives, including the European Union Metagenomics of the Human Intestinal Tract project (http://www.gutmicrobiotaforhealth.com/metahit) and the National Institutes of Health (NIH) Human Microbiome Project (http://commonfund.nih.gov/human-microbiome-project-hmp), have sought to standardize microbiome research methods. The International Human Microbiome Consortium (http://www.human-microbiome.org/) was organized to coordinate these efforts. However, it offers limited guidelines, as it neither covers the full scope of human microbiome research nor includes considerations for other study types.
Since 2023, the Korea Disease Control and Prevention Agency (KDCA) and the Korea National Institute of Health (KNIH), in collaboration with the Ministry of Health and Welfare, have led a Clinical-Based Human Microbiome Research and Development Project (cHMP). The project collects clinical samples from both healthy individuals and patients, including specimens from the gastrointestinal tract, oral and respiratory systems, urogenital system, and skin. These samples are analyzed by a sequencing consortium employing standardized methods to ensure data harmonization. All data are stored, managed, and made publicly available to researchers through an integrated database within the KNIH (Figure 1). Data generated by the cHMP adhere to principles of standardization, digitization, centralization, and national-level standard data sharing. Here, we summarize the principles and processes established for standardizing the cHMP.

Figure 1. Structure and governance of the Clinical-Based Human Microbiome Research and Development Project (cHMP), Republic of Korea.
KNIH, Korea National Institute of Health; MSIT, Ministry of Science and ICT; K-BDS, Korea BioData Station; KRIBB, Korea Research Institute of Bioscience & Biotechnology; KOBIC, Korea Bioinformation Center; CODA, Clinical & Omics Data Archive; MOHW, Ministry of Health and Welfare; KDCA, Korea Disease Control and Prevention Agency.
Materials and Methods
Clinical Metadata Collection
Accurate microbiome data collection necessitates corresponding clinical metadata, which is essential for interpreting metagenome and multiomics data in clinical settings. Essential patient information comprises details on both antibiotic and non-antibiotic medication use, dietary habits, and health history recorded within 6 months of specimen collection. Clinical data are collected via case report forms and anonymized by assigning unique participant codes. The case report form excludes identifying information such as names or registration numbers, and the rate of missing clinical data should be less than 10%. Participants are categorized into disease, healthy, and disease control groups; the disease control group comprises individuals without the disease under study. Demographic, comorbidity, and medication data are collected for all groups, while additional blood tests are performed for disease groups and controls, with the exception of oral and skin specimens (Table 1). Essential information items have been defined for gastrointestinal, respiratory, urogenital, and oral specimens. For gastrointestinal specimens, information regarding bowel habits, daily activities, and dietary habits is mandatory (Table 2).
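The missing-data criterion described above can be checked programmatically before case report forms are submitted. The following is a minimal sketch, not part of the cHMP protocol itself: the field names, the record layout, and the choice to count `None` and empty strings as missing are illustrative assumptions, and only the 10% threshold comes from the text.

```python
# Hypothetical completeness check for anonymized case report form records.
# Only the 10% threshold is taken from the protocol; everything else is assumed.
MAX_MISSING_RATE = 0.10


def missing_rate(records, required_fields):
    """Return the fraction of required values that are absent (None or "")."""
    total = len(records) * len(required_fields)
    if total == 0:
        return 0.0
    missing = sum(
        1
        for record in records
        for field in required_fields
        if record.get(field) in (None, "")
    )
    return missing / total


def passes_missing_check(records, required_fields):
    """True when the missing rate is strictly below the 10% threshold."""
    return missing_rate(records, required_fields) < MAX_MISSING_RATE


# Illustrative records keyed by hypothetical field names.
fields = ["age", "sex", "antibiotics_6mo", "diet", "bmi"]
records = [
    {"age": 40, "sex": "F", "antibiotics_6mo": "no", "diet": "mixed", "bmi": 22.1},
    {"age": 51, "sex": "M", "antibiotics_6mo": None, "diet": "mixed", "bmi": 25.0},
]
print(missing_rate(records, fields), passes_missing_check(records, fields))
```

In this example one of ten required values is missing, so the rate equals the threshold and the data set would not pass under a strict "less than 10%" reading.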

Table 1. Essential and additional clinical information for healthy individuals, disease groups, and disease control groups
Sample Collection, Storage, and Transportation
Sample collection
For gut microbiota analysis, specimens such as feces, colonic biopsies, and rectal swabs are utilized [1]. Colonic biopsies are invasive and challenging to obtain from healthy individuals, as they necessitate a colonoscopy. The condition of the stool specimen is recorded according to the Bristol stool chart. A minimum of 1 g of solid stool or 5 mL of liquid stool is required [3]. Rectal swabs are used only selectively for gut microbiota analysis because of the high risk of human DNA contamination.
Urogenital specimens mainly include vaginal swabs [4–6] and urine samples. In addition, cervical and urethral swabs can be used selectively for specific research purposes. Urine samples can be collected via various methods, including clean-catch midstream urine [7,8], catheterized urine, and suprapubic aspiration. However, suprapubic aspiration has been excluded as a practical collection method due to its invasive nature.
Respiratory specimens are collected from both the upper airways (e.g., nasopharyngeal and oropharyngeal swabs) [9,10] and the lower airways (e.g., sputum, bronchial washing, and bronchoalveolar lavage [BAL]) [11].
For oral microbiome analysis, saliva is the preferred specimen, collected either by non-stimulated methods [12] or through rinsing. Subgingival plaque is collected using a curette-based [13] or paper strip-based method [14].
Skin microbiome sampling primarily relies on swabbing and taping, with instructions to refrain from washing the area shortly before collection. For lesion sampling, both the lesion and adjacent non-lesion areas are sampled.
Preventing contamination during sample collection is vital, and gloves and sterilized tools are required. Different preprocessing techniques are used depending on the analysis type—metagenome, metatranscriptome, metabolome, or culturome—to maintain sample integrity and support accurate microbiome research.
Sample storage and transportation
The storage method depends on the time interval between specimen collection at the hospital and delivery to the analytical institution (consortium). If the specimen can be delivered to the analytical institution within 2 hours of collection, it should be immediately placed in an icebox for transport. For delivery times between 2 and 4 hours, specimens should be refrigerated at 4 °C until delivery, after which they are placed in an icebox for transportation. For delivery times exceeding 4 hours, specimens must be stored at –20 °C and transported in a frozen state. All specimens should reach the analytical institution within 72 hours of collection, with frozen specimens transported within 24 hours under a maintained cold chain. Upon receipt, specimens must be stored at –70 to –80 °C, as appropriate, to minimize freeze–thaw cycles. The analytical institution must conduct nucleic acid extraction from the received specimens within 72 hours.
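The time-based handling rules above lend themselves to a simple decision function. The sketch below encodes them under stated assumptions: the function names and return labels are hypothetical, and the treatment of the exact 2-hour and 4-hour boundaries (assigned here to the shorter interval) is an interpretation, since the text does not specify it.

```python
# Hypothetical encoding of the pre-transport handling rules.
# Boundary handling at exactly 2 h and 4 h is an assumption.
MAX_HOURS_TO_ARRIVAL = 72  # all specimens should arrive within 72 hours


def pre_transport_handling(hours_to_delivery):
    """Map the expected collection-to-delivery time to a handling rule."""
    if hours_to_delivery < 0:
        raise ValueError("hours_to_delivery must be non-negative")
    if hours_to_delivery <= 2:
        return "icebox"
    if hours_to_delivery <= 4:
        return "refrigerate_4C_then_icebox"
    return "freeze_minus20C_transport_frozen"


def within_delivery_limit(hours_to_delivery):
    """True when the specimen would arrive within the 72-hour limit."""
    return hours_to_delivery <= MAX_HOURS_TO_ARRIVAL


for hours in (1.5, 3, 10):
    print(hours, pre_transport_handling(hours), within_delivery_limit(hours))
```

A collection site could run such a check when a courier pickup is scheduled, so that the storage condition is fixed before the specimen leaves the clinic.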
Nucleic Acid Extraction, Storage, and Transportation
For stool specimens, frozen samples are thawed at room temperature and homogenized with a spatula, and an appropriate amount is then aliquoted for DNA extraction. Urogenital specimens may undergo additional preprocessing steps, such as homogenization and centrifugation, or the sediment may be used after centrifugation, depending on specimen conditions and research objectives. Preliminary validation must be conducted before analysis. Frozen urine specimens are thawed at room temperature and centrifuged at speeds exceeding 3,000 × g for at least 10 minutes at 4 °C. For upper respiratory swab specimens, the swabs are aseptically cut with a sterile scalpel and vortexed with a transport medium; DNA extraction is performed after removing liquid and swab debris. For induced sputum, mucus removal is essential. Bronchial washings and BAL fluids may be concentrated as needed before nucleic acid extraction. Oral specimens often contain substantial human DNA; therefore, selective removal of human DNA can enhance microbiome sequencing efficiency. Several methods may be employed to reduce host genomic DNA contamination and enrich microbial DNA from oral samples, including differential lysis, enzymatic depletion, commercial host DNA removal kits, human-specific DNA blocking during library preparation, and density-based ultracentrifugation. DNA extraction is carried out according to IHMS SOP 01 ver. 2. Each experiment must include either a commercially available mock community or a custom-made mock community designed for the experiment’s specific purpose. DNA should be stored at 4 °C for up to 1 week and at –70 to –80 °C for longer storage periods [15]. The DNA must be transported to the sequencing consortium while maintaining the cold chain.
Sequencing Analysis
DNA samples provided to the sequencing consortium were used to prepare libraries for both amplicon sequencing and whole metagenome sequencing. Sequencing was performed using equipment specifically designated and validated for microbiome analysis by the consortium. The consortium periodically conducted parallel tests to verify sequencing data reliability across instruments and runs; data were deemed reliable if the Bray-Curtis dissimilarity between results was below 0.3. Raw metagenomic reads underwent preprocessing through the following steps: (1) trimming low-quality bases (Phred score <20) and adapter sequences using tools such as Trimmomatic; (2) removing duplicate reads (e.g., using TRF); and (3) filtering out human-derived reads by aligning against the human reference genome GRCh37 (hg19) or GRCh38 (hg38) with read mapping programs such as Bowtie or BWA.
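The reliability criterion for parallel tests can be illustrated with the standard Bray-Curtis formula, BC = Σ|a_i − b_i| / Σ(a_i + b_i), applied to two abundance profiles over the same ordered set of taxa. The sketch below is illustrative only: the function names and example profiles are assumptions, and the source does not state at which taxonomic level or on which normalization the consortium computes the dissimilarity.

```python
# Illustrative Bray-Curtis check between two parallel sequencing runs.
# Only the 0.3 threshold comes from the text; the profiles are made up.
RELIABILITY_THRESHOLD = 0.3


def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two non-negative abundance vectors."""
    if len(a) != len(b):
        raise ValueError("profiles must cover the same taxa in the same order")
    total = sum(a) + sum(b)
    if total == 0:
        return 0.0  # two empty profiles are treated as identical
    return sum(abs(x - y) for x, y in zip(a, b)) / total


def is_reliable(a, b):
    """True when the dissimilarity is strictly below the 0.3 threshold."""
    return bray_curtis(a, b) < RELIABILITY_THRESHOLD


run_a = [0.5, 0.3, 0.2]  # hypothetical relative abundances, run A
run_b = [0.4, 0.4, 0.2]  # same taxa, run B
print(round(bray_curtis(run_a, run_b), 4), is_reliable(run_a, run_b))
```

The value ranges from 0 for identical profiles to 1 for profiles sharing no taxa, so a threshold of 0.3 tolerates moderate run-to-run variation.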
Amplicon sequencing
DNA libraries were prepared targeting the hypervariable V3–V4 region of the 16S rRNA gene. The V3–V4 region was amplified using the 341F (5'-CCTACGGGNGGCWGCAG-3') and 805R (5'-GACTACHVGGGTATCTAATCC-3') primers. The amplification program comprised: initial denaturation at 95 °C for 5 minutes; 25 cycles of denaturation at 95 °C for 30 seconds, primer annealing at 55 °C for 30 seconds, and extension at 72 °C for 30 seconds; followed by a final elongation at 72 °C for 5 minutes. After sequencing, a minimum of 20,000 quality-controlled reads is required for fecal specimens and 5,000 for other human tissue specimens. The amplicon size should be at least 1,200 bp; if shorter amplicon sequences are used, their use must be justified. Databases such as the National Center for Biotechnology Information (NCBI), SILVA, and Greengenes are recommended. Software performance should be validated using virtual data for quality control (QC); recommended programs include Divisive Amplicon Denoising Algorithm 2, K-mer-based Rapid Classification and Identification of Metagenomes, Ribosomal Database Project classifier, and Quantitative Insights Into Microbial Ecology [16–19]. The output should be saved as a CSV table. The table’s first row should list sample names (columns 2 to the last column), while the first column (from the second row onward) should list taxon names; subsequent columns should indicate the proportion of each taxon for each sample, displayed to 4 decimal places. The table should consist of 6 levels: phylum, class, order, family, genus, and species. Bacterial names should adhere to the NCBI Taxonomy version released after January 1, 2023 (Figure 2).
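The CSV layout described above can be produced with the Python standard library. The following sketch shows a single taxonomic level (genus) only; the function name, the `taxon` label in the top-left cell, the alphabetical row order, and the example values are assumptions, as the text specifies only the position of sample names, taxon names, and the 4-decimal proportions.

```python
import csv
import io


def write_taxon_table(proportions, sample_names, handle):
    """Write taxa as rows and samples as columns, proportions to 4 decimals.

    proportions maps taxon name -> {sample name -> proportion}; a taxon that
    is absent from a sample is written as 0.0000.
    """
    writer = csv.writer(handle, lineterminator="\n")
    # First row: an assumed header label, then sample names from column 2 on.
    writer.writerow(["taxon"] + list(sample_names))
    for taxon in sorted(proportions):
        values = proportions[taxon]
        writer.writerow(
            [taxon] + [f"{values.get(name, 0.0):.4f}" for name in sample_names]
        )


# Hypothetical genus-level proportions for two samples.
genus_table = {
    "Bacteroides": {"S1": 0.625, "S2": 0.4},
    "Prevotella": {"S1": 0.375, "S2": 0.6},
}
buffer = io.StringIO()
write_taxon_table(genus_table, ["S1", "S2"], buffer)
print(buffer.getvalue(), end="")
```

Writing to a file instead of a buffer only requires passing a handle opened with `open(path, "w", newline="")`.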
Whole metagenome sequencing
After sequencing, the quantity of data that passes QC and human genome removal should be at least 5 GB for fecal samples and 2 GB for other human tissue specimens. Before use, software performance should be validated with virtual data for QC; MetaPhlAn [20] and Kraken [17] are recommended for this purpose. The output format should match that used for amplicon sequencing. The assembly process is necessary to generate bacterial genomes from the microbiome. Co-assembly should be performed using MEGAHIT, MetaSPAdes, and MetaVelvet [21–23], followed by contig binning using CONCOCT, MetaBAT2, and MaxBin2 [24–26]. The quality of the generated metagenome-assembled genome (MAG) should be assessed based on its completeness and contamination. Completeness is evaluated using the ratio of single-copy genes, while contamination is determined by the frequency of single-copy genes present more than once. MAGs that meet the criteria for medium or higher quality are selected according to MAG evaluation standards. For gene function prediction, tools such as MetaProdigal, MetaGeneMark, FragGeneScan, and Glimmer-MG [27–30] are recommended. Annotation must include at least 1 of the following: Enzyme Commission number, Kyoto Encyclopedia of Genes and Genomes Orthology, Pfam number, or Gene Ontology terms (Figure 2).
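The completeness and contamination definitions above can be expressed as a simplified calculation over single-copy marker genes. This sketch is not how dedicated tools compute these metrics in detail; it only mirrors the definitions given in the text. The marker names are hypothetical, and the "medium quality" cut-offs (completeness of at least 50% and contamination below 10%) are taken from the widely used MIMAG standard on the assumption that this is the evaluation standard intended here.

```python
from collections import Counter

# Assumed medium-quality cut-offs (MIMAG); the source does not state them.
MIN_COMPLETENESS = 50.0
MAX_CONTAMINATION = 10.0


def mag_quality(observed_markers, expected_markers):
    """Return (completeness %, contamination %) for one MAG.

    completeness: share of expected single-copy markers found at least once.
    contamination: share of expected markers found more than once.
    """
    expected = set(expected_markers)
    if not expected:
        raise ValueError("expected_markers must not be empty")
    counts = Counter(m for m in observed_markers if m in expected)
    completeness = 100.0 * len(counts) / len(expected)
    duplicated = sum(1 for c in counts.values() if c > 1)
    contamination = 100.0 * duplicated / len(expected)
    return completeness, contamination


def is_medium_or_better(completeness, contamination):
    """Apply the assumed medium-quality thresholds."""
    return completeness >= MIN_COMPLETENESS and contamination < MAX_CONTAMINATION


# Hypothetical marker set: 16 of 20 markers found, one of them twice.
expected = [f"marker_{i}" for i in range(20)]
observed = expected[:16] + ["marker_0"]
completeness, contamination = mag_quality(observed, expected)
print(completeness, contamination, is_medium_or_better(completeness, contamination))
```

In this example the MAG is 80% complete with 5% contamination and would be retained under the assumed thresholds.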
Quality Control
Within the cHMP, external quality assessment is performed using mock community standard materials produced by a QC center. Inter-laboratory proficiency tests (IPT) are carried out using a small amount of positive control material. A comparison study was performed on reagents, and target values were assigned to QC materials at each stage. IPT is conducted on 1% of the analyzed samples, randomly selected. The QC center receives these samples to verify the consistency of test results obtained through nucleic acid extraction and sequencing analysis.
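The random selection of 1% of analyzed samples for IPT can be sketched as follows. The function name, the rounding rule, the minimum of one sample for small batches, and the optional seed for reproducible audits are all assumptions; the text specifies only the 1% fraction and random selection.

```python
import random

IPT_FRACTION = 0.01  # 1% of analyzed samples, as stated in the text


def select_ipt_samples(sample_ids, fraction=IPT_FRACTION, seed=None):
    """Randomly pick samples for inter-laboratory proficiency testing.

    At least one sample is chosen from any non-empty batch (an assumption).
    Passing a seed makes the selection reproducible for auditing.
    """
    ids = list(sample_ids)
    if not ids:
        return []
    n_selected = max(1, round(len(ids) * fraction))
    rng = random.Random(seed)
    return rng.sample(ids, n_selected)


# Hypothetical batch of 500 sample identifiers.
batch = [f"cHMP-{i:04d}" for i in range(500)]
selected = select_ipt_samples(batch, seed=2023)
print(len(selected), selected)
```

Sampling without replacement ensures that no specimen is sent to the QC center twice within the same batch.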
Data Collection, Disclosure, and Deposit of Human Resources
All clinical and genomic data collected through this project are stored in the data core, where authorized researchers can access and analyze them (Figure 1). The data sets from cHMP deemed publicly shareable are made available to the public.
Human resources collected in this project (referring to DNA extracted from samples) are generally deposited in the National Biobank of Korea (NBK) (Figure 2). The deposit procedure complies with the “Regulations on the Operation and Management of the NBK (KDCA Directive No. 42)” and the “Guidelines for the Management of Human Resources of the NBK” under Section 2.2, Deposit of Human Resources.
Results
Standardizing experimental protocols is essential to minimize biases in microbiome research, especially in sample collection, DNA extraction, sequencing, and analysis. These guidelines provide recommended practices for the initial stages of microbiome studies across various fields, including gastrointestinal, respiratory, oral, skin, urogenital, and clinical research.
Discussion
The cHMP can substantially advance microbiome-based research and clinical applications by establishing rigorous protocols for sample collection, processing, and analysis, which may lead to improved human health outcomes. However, the current Korean microbiome initiative faces several challenges: sporadic government initiatives affecting research and development (R&D) efficiency and national competitiveness, insufficient establishment and management of microbiome banks and databases, inadequate comprehensive R&D support for various disease targets, a lack of commercialization-based technologies for microbiome applications, and an absence of strategies to achieve outcomes distinct from previous initiatives. To overcome these challenges, sustained efforts are required to integrate, manage, and standardize data through large-scale, inter-agency national R&D projects involving multiple government ministries.
HIGHLIGHTS
• Standardized clinical microbiome research protocols were developed to achieve consistent data quality.
• Samples were collected from gastrointestinal, oral, respiratory, urogenital, and skin sites.
• The reliability of data was ensured through strict quality control in amplicon and metagenome sequencing procedures.
• Protocols were established for data storage, management, and disclosure in the microbiome data core.
• Resources have been deposited in the National Biobank of Korea, the Clinical & Omics Data Archive, and the National Culture Collection for Pathogens.
Notes
Ethics Approval
Not applicable.
Conflicts of Interest
The authors have no conflicts of interest to declare.
Funding
This research was supported by the Korea National Institute of Health (KNIH) Research Project (Project No. 2023-NI-019-01).
Availability of Data
All data generated or analyzed during this study are included in this published article. Other data may be requested from the corresponding author.
Authors’ Contributions
Conceptualization: ECC, KJL; Data curation: JWK; Funding acquisition: KJL; Project administration: JWK, ECC; Investigation: JWK, ECC; Visualization: JWK; Writing–original draft: JWK; Writing–review & editing: all authors. All authors read and approved the final manuscript.