Research Questions and Objectives

Published: Jun 2026

Start With the Question

Systematic dataset discovery should begin with a clear research question.

A research question defines what the dataset discovery process is trying to support. It helps determine which studies are relevant, which studies should be excluded, which metadata fields are important, and which accessions should eventually be passed to the CDI Data Acquisition System (CDI-DAS).

Without a clear research question, dataset discovery can become too broad, inconsistent, or difficult to justify.

For example, the phrase “human gut microbiome datasets” is useful as a topic, but it is not yet a discovery question.

A stronger question would be:

Which public studies contain healthy human gut microbiome sequencing data suitable for building a reference dataset?

This question gives the discovery process a clearer direction. It identifies the domain, population, sample context, and intended use.

From Topic to Discovery Question

A topic is a broad area of interest.

A discovery question is a focused question that guides dataset selection.

Topic
      ↓
Research Area
      ↓
Discovery Question
      ↓
Eligibility Criteria
      ↓
Dataset Selection

For example:

Broad Topic	More Focused Discovery Question
Human microbiome	Which public studies contain healthy human gut microbiome sequencing data?
Cancer RNA-seq	Which public RNA-seq studies include tumor and matched normal samples?
Antimicrobial resistance	Which public metagenomic studies include antibiotic exposure metadata?
Plant genomics	Which public sequencing studies include drought-stress experiments in maize?
Single-cell biology	Which public single-cell studies include annotated immune cell populations?

The discovery question does not need to be perfect at the beginning. It can be refined as the analyst learns more about the available data landscape. However, it should be clear enough to guide the first search strategy.

Why the Question Matters

The research question influences every downstream decision.

It affects:

which repositories are searched
which search terms are used
which metadata fields are required
which studies are considered eligible
which studies are excluded
which datasets are prioritized
which accessions are prepared for acquisition

A vague question produces vague selection decisions.

A clear question makes screening and prioritization more consistent.

For example, if the question focuses on healthy adult gut microbiome studies, then studies involving disease cohorts, animal models, infants, or intervention-only designs may need to be excluded or handled separately.

If the question focuses on benchmarking, then platform consistency and technical metadata may be especially important.

If the question focuses on biological interpretation, then phenotype and sample metadata may be more important.

Define the Purpose of Discovery

A dataset discovery workflow should also define the purpose of the dataset search.

Different purposes may require different selection decisions.

Common purposes include:

Purpose	Discovery Focus
Reference dataset assembly	Well-described studies with clear metadata and reusable accessions
Benchmarking	Comparable studies with consistent technologies or known expected results
Method development	Datasets with enough complexity to test analysis methods
Training or teaching	Small, accessible datasets with clear documentation
Comparative analysis	Studies with compatible groups, metadata, and biological context
Evidence mapping	Broad coverage of available studies in a topic area
Client-facing work	Transparent, defensible, and auditable study selection

The same public study may be suitable for one purpose but unsuitable for another.

For example, a small microbiome study with clear metadata may be excellent for teaching but insufficient for a large reference dataset. A large study may be useful for benchmarking but difficult to interpret if metadata are incomplete.

Define the Discovery Objective

A discovery objective translates the research question into an operational goal.

The research question asks what the analyst wants to know.

The discovery objective defines what the discovery workflow should produce.

Example research question:

Which public studies contain healthy human gut microbiome sequencing data suitable for reference dataset assembly?

Example discovery objective:

Identify, screen, and prioritize public omics studies containing healthy human gut microbiome sequencing data, and produce a curated accession list suitable for CDI-DAS data acquisition.

The objective is important because it connects the discovery workflow to a concrete output.

Research Question
      ↓
Discovery Objective
      ↓
Search Strategy
      ↓
Screening Criteria
      ↓
Included Study List
      ↓
CDI-DAS Input Accessions

Define the Population or Biological Context

Many omics topics are broad, so the biological context should be defined early.

Depending on the domain, this may include:

organism
tissue or body site
population group
disease status
exposure or treatment
developmental stage
environmental condition
phenotype or clinical status

For the healthy human gut microbiome case study, the biological context may be defined as:

Component	Example Definition
Organism	Human
Body site	Gut or stool
Health status	Healthy or non-diseased
Data type	Microbiome sequencing
Intended use	Reference dataset assembly

This definition helps separate eligible studies from studies that are only loosely related.

Define the Data Type

The discovery question should also define the type of omics data required.

Examples include:

16S rRNA amplicon sequencing
shotgun metagenomic sequencing
bulk RNA-seq
single-cell RNA-seq
whole-genome sequencing
genotyping array data
ATAC-seq
proteomics
metabolomics

Data type matters because it affects repository choice, metadata expectations, file types, and downstream analysis workflows.

For microbiome studies, 16S rRNA amplicon data and shotgun metagenomic data are both useful, but they support different types of analysis. They may need separate eligibility criteria, separate acquisition workflows, or separate reference dataset packages.

Define the Unit of Inclusion

A systematic discovery workflow should define the unit at which inclusion decisions are made.

The unit of inclusion may be:

publication
study
BioProject
dataset
experiment
sample
run accession
repository record

This distinction matters because one publication may describe multiple datasets, one BioProject may contain multiple experiments, and one study may include both eligible and ineligible samples.

For CDI workflows, the practical unit often becomes the public study or BioProject, with run-level accessions used later for acquisition.

Publication
      ↓
Study Record
      ↓
BioProject / Repository Accession
      ↓
Sample Metadata
      ↓
Run Accessions
      ↓
Download Manifest

Defining the unit of inclusion prevents confusion during screening and documentation.
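
The step from study-level inclusion to run-level acquisition can be illustrated with a small join. The following is a minimal sketch, not a CDI-DAS component. It assumes two hypothetical tab-separated files, each with a header row: an included-studies table with the study accession in column 1, and a run table with `study_accession` and `run_accession` columns. The function name and both layouts are illustrative assumptions.

```bash
#!/usr/bin/env bash
set -eu

# Print the run accessions that belong to included studies, each run once.
# $1: included-studies table (study accession in column 1, header row expected)
# $2: run table with columns study_accession<TAB>run_accession (header row expected)
runs_for_included_studies() {
  awk -F'\t' '
    FNR == 1  { next }                            # skip the header row of each file
    NR == FNR { included[$1] = 1; next }          # first file: remember included studies
    ($1 in included) && !seen[$2]++ { print $2 }  # second file: emit each matching run once
  ' "$1" "$2"
}
```

A call such as `runs_for_included_studies outputs/included-studies.tsv run-table.tsv > outputs/cdi-das-input-accessions.txt` would then write the run list, where `run-table.tsv` is a hypothetical file name. Keeping the inclusion decision at the study level and deriving runs afterwards means a change in screening only requires regenerating the list.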

Define the Scope

The discovery scope sets boundaries around the search.

Scope may include:

time period
species
population
sample type
disease status
sequencing technology
repository coverage
language or publication type
metadata requirements
file availability

For example, a healthy human gut microbiome discovery workflow may choose to include only human stool studies with public sequence data and basic sample-level metadata.

The scope should be broad enough to find relevant studies, but narrow enough to support consistent screening.

Define Inclusion and Exclusion Direction Early

Detailed eligibility criteria are developed in a later chapter, but the general direction should be clear from the beginning.

For example:

Potential inclusion direction:

human samples
gut or stool microbiome
healthy or non-diseased participants
public accession available
sequencing data available from public repositories
sufficient metadata for reuse

Potential exclusion direction:

animal studies
disease-only cohorts
intervention-only studies without baseline healthy controls
unclear sample source
missing accession information
unavailable sequence files
insufficient metadata

These early boundaries help make the search strategy more focused and reduce ambiguity during screening.
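
The early direction above can also be expressed as a coarse first-pass filter. The sketch below is illustrative only and does not replace the detailed eligibility criteria or manual review. It assumes a hypothetical tab-separated candidate table with a header row and the columns `study_accession`, `organism`, `sample_source`, `health_status`, and `runs_available`; the column names, the accepted values, and the reason strings are all assumptions.

```bash
#!/usr/bin/env bash
set -eu

# Tag each candidate study with a first-pass decision and a reason.
# Assumed input columns (tab-separated, header row):
#   1 study_accession  2 organism  3 sample_source  4 health_status  5 runs_available
# $1: candidate table
first_pass_screen() {
  awk -F'\t' -v OFS='\t' '
    NR == 1 { print $1, "decision", "reason"; next }
    {
      decision = "include"
      reason   = "meets early inclusion direction"
      if (tolower($2) != "human") {
        decision = "exclude"; reason = "non-human study"
      } else if ($3 !~ /^(gut|stool)$/) {
        decision = "exclude"; reason = "unclear or non-gut sample source"
      } else if ($4 != "healthy") {
        decision = "exclude"; reason = "no healthy participants recorded"
      } else if ($5 != "yes") {
        decision = "exclude"; reason = "sequence files unavailable"
      }
      print $1, decision, reason
    }
  ' "$1"
}
```

Recording a reason alongside every decision keeps exclusions justifiable later, even when the first pass is automated.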

Define Required Outputs

The discovery workflow should define its expected outputs before the search begins.

For CDI Systematic Dataset Discovery, expected outputs may include:

outputs/
├── candidate-studies.tsv
├── screened-studies.tsv
├── included-studies.tsv
├── excluded-studies.tsv
├── prioritization-table.tsv
└── cdi-das-input-accessions.txt

These outputs turn the discovery question into a reproducible workflow.

They also prepare the selected studies for the CDI Data Acquisition System.
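
One way to make these outputs concrete before any search is run is to scaffold them as empty, header-only files. The following is a minimal sketch under stated assumptions: the file names come from the listing above, but the shared column header and the one-accession-per-line format of the accession list are illustrative assumptions rather than a CDI-defined schema.

```bash
#!/usr/bin/env bash
set -eu

# Hypothetical shared header for the study tables; these column names are
# illustrative assumptions, not a CDI-defined schema.
STUDY_HEADER="$(printf 'study_accession\trepository\ttitle\tdecision\treason')"

# Create the expected output files without overwriting existing work.
# $1: output directory (defaults to "outputs")
scaffold_outputs() {
  local out_dir="${1:-outputs}"
  local name
  mkdir -p "$out_dir"
  for name in candidate-studies screened-studies included-studies \
              excluded-studies prioritization-table; do
    if [ ! -e "$out_dir/$name.tsv" ]; then
      printf '%s\n' "$STUDY_HEADER" > "$out_dir/$name.tsv"
    fi
  done
  # Assumed format: one accession per line, so the list starts empty with no header.
  if [ ! -e "$out_dir/cdi-das-input-accessions.txt" ]; then
    : > "$out_dir/cdi-das-input-accessions.txt"
  fi
}
```

Running `scaffold_outputs outputs` from the project root would create the directory and its six files. Because existing files are left untouched, the function can be rerun safely after screening has started.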

Creating a Discovery Planning Template

Before searching repositories, it is useful to document the research question, discovery objective, scope, and expected outputs.

#| label: create-discovery-plan
#| eval: false

# Record the research question, discovery objective, scope, and expected outputs in a plan file.
bash scripts/bash/02-create-discovery-plan.sh

The resulting file captures the starting assumptions for the discovery workflow and provides a documented foundation for search strategy development.
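
The contents of `02-create-discovery-plan.sh` are not shown in this chapter. As a rough illustration of what a planning template might record, the sketch below writes a two-column plan file and checks it for empty fields. It is not that script: the function names, the field names, and the tab-separated layout are assumptions, and the values are taken from the healthy human gut microbiome case study.

```bash
#!/usr/bin/env bash
set -eu

# Write a two-column plan (field<TAB>value) to the given path.
# $1: plan file path (hypothetical; the real script's output path is not shown)
write_discovery_plan() {
  local plan_file="$1"
  mkdir -p "$(dirname "$plan_file")"
  # printf reuses the format for each pair of arguments: one row per pair.
  printf '%s\t%s\n' \
    field value \
    research_question 'Which public studies contain healthy human gut microbiome sequencing data suitable for reference dataset assembly?' \
    discovery_objective 'Identify, screen, and prioritize eligible public studies and produce a curated accession list for CDI-DAS.' \
    organism 'Human' \
    body_site 'Gut or stool' \
    health_status 'Healthy or non-diseased' \
    data_type 'Microbiome sequencing' \
    unit_of_inclusion 'Public study or BioProject' \
    expected_outputs 'Eligible studies and accession list' \
    > "$plan_file"
}

# Print the name of every plan field whose value is empty.
# Returns non-zero if any field is empty, so gaps surface before searching.
# $1: plan file path
missing_plan_fields() {
  awk -F'\t' '
    NR > 1 && $2 == "" { print $1; missing = 1 }
    END { exit (missing ? 1 : 0) }
  ' "$1"
}
```

Checking the plan for empty fields before searching gives a quick, scriptable way to confirm that the question, objective, context, and outputs have all been written down.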

Example: Healthy Human Gut Microbiome

For the case study in this guide, the discovery question can be framed as:

Which public studies contain healthy human gut microbiome sequencing data suitable for reference dataset assembly?

The discovery objective can be framed as:

Identify, screen, and prioritize public omics studies containing healthy human gut microbiome sequencing data, and produce a curated accession list suitable for CDI-DAS data acquisition.

The scope may include:

Discovery Element	Case Study Definition
Domain	Microbiome
Organism	Human
Body site	Gut or stool
Health status	Healthy or non-diseased
Data source	Public omics repositories and linked publications
Output	Eligible studies and accession list
Downstream system	CDI Data Acquisition System

This case study will be used later to demonstrate how a research question becomes a set of eligible public studies.

Common Problems When Questions Are Unclear

Unclear research questions can create several problems:

search results become too broad
unrelated studies enter the candidate list
screening decisions become inconsistent
eligibility criteria are difficult to apply
studies are excluded without clear justification
downstream datasets become difficult to interpret
accession lists become disconnected from the original purpose

These problems can usually be reduced by defining the research question, objective, scope, and expected outputs before searching.

Practical Checklist

Before developing the search strategy, confirm that the following items are clear:

Research topic defined
        ↓
Discovery question written
        ↓
Discovery objective stated
        ↓
Biological context defined
        ↓
Data type identified
        ↓
Unit of inclusion clarified
        ↓
Scope boundaries described
        ↓
Expected outputs listed

This checklist helps ensure that the dataset discovery workflow begins with a clear direction.

Summary

Systematic dataset discovery begins with a clear research question and a practical discovery objective.

The research question defines what the workflow is trying to support. The objective defines what the workflow should produce. Together, they guide repository selection, search strategy development, eligibility criteria, screening decisions, study prioritization, and the final CDI-DAS input package.

A well-defined discovery question helps transform a broad topic into a reproducible dataset selection workflow.

Looking Ahead

In the next chapter, we examine the public data sources and repositories that can be used to identify candidate omics studies.