This chapter applies the CDI Systematic Dataset Discovery workflow to a practical example: healthy human gut microbiome datasets.
The purpose of the case study is to show how a broad biological topic can be converted into a documented, screened, prioritized, and CDI-DAS-ready public dataset package.
The case study follows the full discovery workflow:
Research Question
↓
Search Strategy
↓
Candidate Studies
↓
Eligibility Criteria
↓
Dataset Screening
↓
Study Prioritization
↓
Included Study Assembly
↓
CDI-DAS Input Package
The final output is more than a selected BioProject: it is a documented decision trail showing why the BioProject was selected, which accessions are used for testing, and how the result connects to the CDI Data Acquisition System.
Default Case Study Settings
This executable case study uses the default accessions defined in the workflow scripts.
The case-study research question is:
Which public omics studies contain healthy human gut microbiome sequencing data suitable for reference dataset assembly?
The case-study objective is:
Identify, screen, and prioritize public studies containing healthy human gut microbiome sequencing data, and assemble a curated accession package suitable for CDI-DAS data acquisition.
This objective connects the discovery workflow directly to downstream acquisition.
Discovery Scope
The discovery scope focuses on public omics studies with:
human samples
gut, stool, fecal, faecal, intestinal, or gastrointestinal sample context
healthy, non-diseased, control, or baseline sample relevance
microbiome sequencing data
public repository accessions
sufficient metadata for screening and reuse
suitability for CDI-DAS handoff
The workflow does not attempt to include every microbiome publication.
It focuses on accession-linked public studies that can move from discovery into reproducible acquisition.
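The scope terms above can be turned into a simple first-pass text filter over study titles or descriptions. The following is a minimal sketch, not the workflow's own script: the `SCOPE_TERMS` grouping and the function names are assumptions introduced for illustration, and a keyword match is only a hint that still needs manual review.

```python
import re

# Hypothetical keyword groups derived from the discovery scope listed above.
SCOPE_TERMS = {
    "host": ["human", "homo sapiens"],
    "sample_context": ["gut", "stool", "fecal", "faecal",
                       "intestinal", "gastrointestinal"],
    "health_status": ["healthy", "non-diseased", "control", "baseline"],
}


def scope_matches(text):
    """Return, for each scope group, the terms found as whole words in text."""
    lowered = text.lower()
    found = {}
    for group, terms in SCOPE_TERMS.items():
        found[group] = [
            term for term in terms
            if re.search(r"\b" + re.escape(term) + r"\b", lowered)
        ]
    return found


def in_scope(text):
    """A record passes the first-pass filter if every group has a match."""
    return all(scope_matches(text).values())


print(in_scope("Stool samples from healthy human adults"))
print(in_scope("Mouse gut microbiome from healthy controls"))
```

Whole-word matching keeps "unhealthy" from counting as "healthy", but it also means plural forms such as "controls" are not matched. A real search would likely need a broader term list.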
Primary Case-Study Accession
The prioritized BioProject for this case study is:
PRJNA802976
This BioProject is used as the primary case-study accession because it provides a practical human gut microbiome example for systematic discovery and CDI-DAS handoff.
It is treated as the main study carried forward from discovery into data acquisition.
Primary Test Subset
The primary test subset is:
SRR17868090
SRR17868091
SRR17868092
These three run accessions are used as a small acquisition-testing subset.
A test subset is useful because it allows the workflow to validate metadata retrieval, manifest generation, download planning, and file validation before scaling to the full BioProject.
PRJNA802976
↓
SRR17868090
SRR17868091
SRR17868092
↓
CDI-DAS test acquisition
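Before a test subset is passed on, a quick format check on the accessions can catch typos early. The sketch below assumes the usual INSDC accession shapes (a `PRJNA`, `PRJEB`, or `PRJDB` prefix for BioProjects and an `SRR`, `ERR`, or `DRR` prefix for runs); the `AcquisitionSubset` class is a hypothetical helper, not part of CDI-DAS. It checks format only and does not confirm that the runs belong to the BioProject, which requires a metadata lookup.

```python
import re
from dataclasses import dataclass, field

# Assumed accession shapes: prefix followed by digits.
BIOPROJECT_RE = re.compile(r"PRJ(NA|EB|DB)\d+")
RUN_RE = re.compile(r"[SED]RR\d+")


@dataclass
class AcquisitionSubset:
    """Hypothetical container for one BioProject and its test runs."""
    bioproject: str
    runs: list = field(default_factory=list)

    def problems(self):
        """Return a list of human-readable format problems (empty if none)."""
        issues = []
        if not BIOPROJECT_RE.fullmatch(self.bioproject):
            issues.append(f"unexpected BioProject format: {self.bioproject}")
        if not self.runs:
            issues.append("no run accessions listed")
        for run in self.runs:
            if not RUN_RE.fullmatch(run):
                issues.append(f"unexpected run format: {run}")
        if len(set(self.runs)) != len(self.runs):
            issues.append("duplicate run accessions")
        return issues


subset = AcquisitionSubset(
    "PRJNA802976",
    ["SRR17868090", "SRR17868091", "SRR17868092"],
)
print(subset.problems())
```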
Secondary Comparison Accession
A secondary comparison BioProject is retained:
PRJNA322554
This accession is not part of the primary case-study handoff.
It is retained as a comparison record to demonstrate that relevant public studies may differ in technical characteristics, sequencing layout, metadata completeness, repository structure, or acquisition readiness.
Step 1: Discovery Output Structure
The case study begins by creating the standard discovery output files.
The inclusion criteria define what a study must contain to move forward.
The exclusion criteria define what removes a study from the main workflow.
For this case study, eligibility focuses on human gut or stool microbiome studies with public sequencing accessions, usable metadata, and readiness for CDI-DAS handoff.
Step 6: Dataset Screening
The candidate studies are screened using the eligibility criteria.
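One way to keep screening decisions traceable is to record which criteria a record failed, not just the final decision. The following is a minimal sketch under stated assumptions: the field names (`host`, `sample_context`, `bioproject`, `metadata_usable`) and the placeholder accession are hypothetical and do not describe the workflow's actual screening table.

```python
# Hypothetical inclusion checks; each maps a criterion name to a test
# on a candidate record represented as a plain dictionary.
INCLUSION_CHECKS = {
    "human host": lambda r: r.get("host") == "Homo sapiens",
    "gut or stool sample": lambda r: r.get("sample_context") in {"gut", "stool"},
    "public sequencing accession": lambda r: bool(r.get("bioproject")),
    "usable metadata": lambda r: r.get("metadata_usable") is True,
}


def screen(record):
    """Apply every inclusion check and keep the names of failed criteria."""
    failed = [name for name, check in INCLUSION_CHECKS.items()
              if not check(record)]
    return {
        "study": record.get("bioproject", "unknown"),
        "decision": "exclude" if failed else "include",
        "failed_criteria": failed,
    }


# Placeholder record for illustration only; values are not real metadata.
candidate = {
    "bioproject": "PRJNA000000",
    "host": "Homo sapiens",
    "sample_context": "stool",
    "metadata_usable": True,
}
print(screen(candidate))
```

Storing the failed criteria alongside the decision means an excluded study can be revisited later without repeating the review.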
The prioritization table ranks records based on research alignment, metadata quality, accession clarity, file availability, CDI-DAS readiness, and test subset suitability.
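The six ranking dimensions can be sketched as a simple additive score. This is an illustrative assumption, not the scoring rule used by the workflow scripts: the 0 to 2 scale, the equal weighting, and the placeholder study names are all hypothetical.

```python
# The six prioritization dimensions named in the text.
CRITERIA = (
    "research_alignment",
    "metadata_quality",
    "accession_clarity",
    "file_availability",
    "cdi_das_readiness",
    "test_subset_suitability",
)


def total_score(scores):
    """Sum the per-criterion scores, refusing incomplete score sheets."""
    missing = [c for c in CRITERIA if c not in scores]
    if missing:
        raise ValueError(f"missing scores for: {', '.join(missing)}")
    return sum(scores[c] for c in CRITERIA)


def rank(records):
    """Sort records by descending total score, then by study name."""
    return sorted(records,
                  key=lambda r: (-total_score(r["scores"]), r["study"]))


# Placeholder records with made-up scores on an assumed 0-2 scale.
records = [
    {"study": "STUDY_B", "scores": dict.fromkeys(CRITERIA, 1)},
    {"study": "STUDY_A", "scores": dict.fromkeys(CRITERIA, 2)},
]
for record in rank(records):
    print(record["study"], total_score(record["scores"]))
```

An additive score is only a starting point for ordering candidates. The final choice of a primary BioProject remains a manual decision.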
Together, these files document the full discovery process.
They show what was planned, where records were searched, which candidates were identified, how criteria were applied, which studies were prioritized, and what was handed off to CDI-DAS.
Manual Judgment in the Case Study
This case study is executable and reproducible, but it is not fully automatic.
Manual judgment is required when defining the research question, interpreting search results, reviewing metadata, applying eligibility criteria, resolving ambiguous records, and approving the final handoff.
The scripts do not replace these decisions; they record them in a structured, documented form.
For this case study, the key manual decision is that PRJNA802976 is prioritized as the main BioProject, while PRJNA322554 is retained only as a secondary comparison.
Handoff to CDI-DAS
At the end of this workflow, CDI-DAS receives a clear input package.
CDI Systematic Dataset Discovery
↓
PRJNA802976
↓
SRR17868090
SRR17868091
SRR17868092
↓
CDI Data Acquisition System
CDI-DAS can then use these accessions for metadata acquisition, manifest generation, data download, file validation, and reference dataset assembly.
This completes the bridge between systematic discovery and reproducible acquisition.
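The handoff above can be captured as a small machine-readable record. The sketch below serializes the case-study accessions to JSON; the field names and the JSON layout are assumptions made for illustration, and the actual input format expected by CDI-DAS may differ.

```python
import json


def build_handoff_package():
    """Collect the case-study accessions into one dictionary.

    Field names are hypothetical; the accessions come from this case study.
    """
    return {
        "primary_bioproject": "PRJNA802976",
        "test_runs": ["SRR17868090", "SRR17868091", "SRR17868092"],
        "secondary_comparison": "PRJNA322554",
        "intended_steps": [
            "metadata acquisition",
            "manifest generation",
            "data download",
            "file validation",
            "reference dataset assembly",
        ],
    }


def to_json(package):
    """Serialize the package as indented JSON text."""
    return json.dumps(package, indent=2)


package = build_handoff_package()
print(to_json(package))
```

Keeping the secondary comparison accession in the same record, but in its own field, preserves the distinction between what is handed off for acquisition and what is retained only for comparison.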
Case Study Summary
The healthy human gut microbiome case study demonstrates how a broad public-data topic can be converted into a documented, executable, and reproducible discovery workflow.
The workflow begins with a research question and ends with CDI-DAS-ready accessions.
The primary case-study BioProject is:
PRJNA802976
The primary test subset is:
SRR17868090
SRR17868091
SRR17868092
The secondary comparison BioProject is:
PRJNA322554
The result is a transparent and reproducible handoff from dataset discovery into data acquisition.
Looking Ahead
This concludes the CDI Systematic Dataset Discovery guide.
The same workflow can be adapted to other omics domains, including RNA-seq, single-cell, metagenomics, GWAS, proteomics, metabolomics, and multi-omics studies.