This workflow provides two complementary ways to perform genomic distribution analysis across multiple annotation layers. Users may either run a single wrapper function to execute all selected analyses in one step, or call each annotation function individually for inspection of intermediate results.
Integrated workflow (recommended):
The genomic_distribution() wrapper function executes one or
more distribution analyses within a single, unified pipeline. According
to the specified distributions argument, it internally
invokes the four distribution functions in sequence, while automatically
handling shared parameters and organizing all outputs in a common
directory.
Step-by-step workflow:
The individual functions genic_distribution(),
ccre_distribution(), chromhmm_distribution(),
and repeat_distribution() can be run independently. This
approach allows users to customize parameters for each analysis and to
inspect intermediate results with greater flexibility.
The genomic_distribution() function performs genomic
distribution analyses across multiple annotation types.
| Parameter | Type | Required | Description | Example |
|---|---|---|---|---|
query |
GRanges or GRangesList | Yes | Genomic regions to be annotated. Each element in a GRangesList is treated as an independent sample or cluster. | query = my_peaks |
out_dir |
character | Yes | Output directory for all result tables and plots. The directory will be created if it does not exist. | out_dir = "./distribution_results" |
distributions |
character | No (default: c("genic", "ccre")) |
Annotation layers to run. Supported values include
"genic", "ccre", "chromhmm", and
"repeat". Multiple analyses can be selected
simultaneously. |
distributions = c("genic", "repeat") |
ref_genome |
character | No (default: "hg38") |
Reference genome used for annotation. Supported values are
"hg38" and "mm10". |
ref_genome = "mm10" |
ref_source |
character | No (default: "knownGene") |
Gene annotation source used for genic annotation.
"knownGene" uses UCSC knownGene annotations via TxDb
packages, while "GENCODE" uses GENCODE reference
annotations (Release 49 for hg38, vM35 for mm10). |
ref_source = "GENCODE" |
mode |
character | No (default: "nearest") |
Assignment mode for overlapping features. "nearest"
assigns each query region to the closest annotation feature, while
"weighted" assigns features proportionally based on
overlap. |
mode = "weighted" |
plot |
logical | No (default: TRUE) |
Whether to generate stacked bar plots in addition to summary tables. | plot = FALSE |
library(multiEpiCore)
genomic_distribution(
query = my_grl,
out_dir = "./genomic_distribution_results",
distributions = c("genic", "ccre", "chromhmm", "repeat"),
ref_genome = "hg38",
ref_source = "knownGene",
mode = "nearest",
plot = TRUE
)
The genic_distribution() function annotates genomic
regions with gene structural features, calculates
overlap-based feature proportions, and generates a summary table with an
optional visualization. The genic categories are defined as follows:
| Parameter | Type | Required | Description | Example |
|---|---|---|---|---|
query |
GRangesList or GRanges | Yes | Query regions to annotate with gene structures | query = my_peaks |
out_dir |
character | No (default: “./”) | Output directory for results. Created if doesn’t exist | out_dir = "./genic_results" |
ref_genome |
character | No (default: “hg38”) | Reference genome: “hg38” or “mm10” | ref_genome = "mm10" |
ref_source |
character | No (default: “knownGene”) | Gene annotation source used to define gene models. “knownGene” uses UCSC’s manually-curated gene models (via TxDb packages), “GENCODE” uses comprehensive reference annotations from GENCODE (Release 49 for hg38, vM35 for mm10) | ref_source = "GENCODE" |
mode |
character | No (default: “nearest”) | Assignment mode: “nearest” (closest feature) or “weighted” (proportional by overlap) | mode = "weighted" |
plot |
logical | No (default: TRUE) | Generate stacked barplot of genic feature distributions | plot = FALSE |
The function generates the following output files in the specified
out_dir:
genic_distribution.tsv
genic_distribution.pdf (if plot = TRUE)
The ccre_distribution() function annotates genomic
regions with candidate cis-regulatory elements (cCREs)
using RepeatMasker data, calculates overlap-based repeat class
proportions, and outputs a summary table with an optional visualization.
The cCRE categories are defined as follows:
Promoter (Promoter-like signature):
cCREs located within 200 bp of an annotated or experimentally supported
TSS. They show high chromatin accessibility together with strong H3K4me3
enrichment, consistent with active transcription initiation
sites.
Enhancer (Enhancer-like signature):
cCREs with high chromatin accessibility and strong H3K27ac signals.
Elements within 2 kb of a TSS are classified as TSS-proximal enhancers,
while those farther away are classified as TSS-distal enhancers. Regions
within 200 bp of a TSS must have low H3K4me3 to avoid promoter
assignment.
CA-H3K4me3 (Chromatin accessibility with
H3K4me3):
cCREs showing high chromatin accessibility and H3K4me3 enrichment, low
H3K27ac, and located outside the 200 bp TSS window. These regions
resemble promoter-like chromatin states but are not directly associated
with annotated TSSs.
CA-CTCF (Chromatin accessibility with
CTCF):
cCREs characterized by high chromatin accessibility and CTCF binding,
with low H3K4me3 and H3K27ac signals. They are typically associated with
chromatin insulation and higher-order genome organization.
CA-TF (Chromatin accessibility with
transcription factor binding):
cCREs with high chromatin accessibility that overlap transcription
factor binding clusters, while showing low H3K4me3, H3K27ac, and CTCF
signals. These regions likely represent TF-driven regulatory
sites.
CA (Chromatin accessibility only):
cCREs marked by high chromatin accessibility but lacking H3K4me3,
H3K27ac, and CTCF signals. They may represent open chromatin regions
with potential or context-dependent regulatory activity.
TF (Transcription factor binding):
cCREs overlapping transcription factor binding clusters but showing low
chromatin accessibility and low levels of H3K4me3, H3K27ac, and CTCF
signals. These sites may reflect transient or condition-specific TF
occupancy.

| Parameter | Type | Required | Description | Example |
|---|---|---|---|---|
query |
GRangesList or GRanges | Yes | Query regions to annotate with cCRE elements | query = my_peaks |
out_dir |
character | No (default: “./”) | Output directory for results. Created if doesn’t exist | out_dir = "./ccre_results" |
ref_genome |
character | No (default: “hg38”) | Reference genome: “hg38” or “mm10” | ref_genome = "mm10" |
mode |
character | No (default: “nearest”) | Assignment mode: “nearest” (closest category) or “weighted” (proportional by overlap) | mode = "weighted" |
plot |
logical | No (default: TRUE) | Generate stacked barplot of cCRE distributions | plot = FALSE |
The function generates the following output files in the specified
out_dir:
ccre_distribution.tsv
ccre_distribution.pdf (if plot = TRUE)
The chromhmm_distribution() function annotates genomic
regions with chromatin states using ChromHMM data,
calculates overlap-based chromatin state proportions, and produces a
summary table with an optional visualization.
Acet: Acetylation-associated active regulatory
regions
EnhWk: Weak enhancers
EnhA: Active enhancers
PromF: Promoter flanking regions
TSS: Transcription start sites
TxWk: Weak transcription
TxEx: Transcription with exon signals
Tx: Transcription
OpenC: Open chromatin
TxEnh: Transcribed enhancers
ReprPCopenC: Polycomb-repressed regions with open chromatin
BivProm: Bivalent promoters
ZNF: Zinc finger gene clusters
ReprPC: Polycomb-repressed chromatin
HET: Constitutive heterochromatin
GapArtf: Assembly gaps and artifacts
Quies: Quiescent background
| Parameter | Type | Required | Description | Example |
|---|---|---|---|---|
query |
GRangesList or GRanges | Yes | Query regions to annotate with cCRE elements | query = my_peaks |
out_dir |
character | No (default: “./”) | Output directory for results. Created if doesn’t exist | out_dir = "./ccre_results" |
ref_genome |
character | No (default: “hg38”) | Reference genome: “hg38” or “mm10” | ref_genome = "mm10" |
mode |
character | No (default: “nearest”) | Assignment mode: “nearest” (closest category) or “weighted” (proportional by overlap) | mode = "weighted" |
plot |
logical | No (default: TRUE) | Generate stacked barplot of cCRE distributions | plot = FALSE |
The function generates the following output files in the specified
out_dir:
chromhmm_distribution.tsv
chromhmm_distribution.pdf (if plot = TRUE)
The repeat_distribution() function anotates genomic
regions with chromatin states using ChromHMM data, computes
overlap-based chromatin state percentages, and generates a summary table
along with an optional visualization.
SINE: Short interspersed nuclear elements
LINE: Long interspersed nuclear elements
LTR: Long terminal repeat retrotransposons
Retroposon: Non-LTR retrotransposon-derived
elements
RC: Rolling-circle transposons
DNA: DNA transposons
Satellite: Satellite repeats
Simple_repeat: Simple sequence repeats
Low_complexity: Low-complexity sequences
rRNA: Ribosomal RNA repeats
tRNA: Transfer RNA repeats
snRNA: Small nuclear RNA repeats
scRNA: Small cytoplasmic RNA repeats
srpRNA: Signal recognition particle RNA
repeats
RNA: Other RNA-related repeats
Unknown: Unclassified repeat elements
| Parameter | Type | Required | Description | Example |
|---|---|---|---|---|
query |
GRangesList or GRanges | Yes | Query regions to annotate with cCRE elements | query = my_peaks |
out_dir |
character | No (default: “./”) | Output directory for results. Created if doesn’t exist | out_dir = "./ccre_results" |
ref_genome |
character | No (default: “hg38”) | Reference genome: “hg38” or “mm10” | ref_genome = "mm10" |
mode |
character | No (default: “nearest”) | Assignment mode: “nearest” (closest feature) or “weighted” (proportional by overlap) | mode = "weighted" |
plot |
logical | No (default: TRUE) | Generate stacked barplot of cCRE distributions | plot = FALSE |
The function generates the following output files in the specified
out_dir:
repeat_distribution.tsv
repeat_distribution.pdf (if plot = TRUE)
The get_matched_control() function generates background
control regions by randomly selecting genes with similar length to the
nearest gene of each query region, then placing control regions at the
same TSS-relative distance on the selected genes. This approach
preserves the relationship between peaks and gene structure while
avoiding biases from using the same genomic neighborhoods.
What this function does:
For each query genomic region: - Identify the nearest gene (used only to define the TSS-relative offset) - Compute the signed distance to the gene TSS, accounting for strand
To generate control regions: - Select genes with similar lengths (excluding the original gene) - Randomly sample matched genes as anchors - Place control regions at the same TSS-relative positions
Quality control: - Automatically relax gene length tolerance if candidates are insufficient - Discard control regions outside valid chromosome boundaries
Output: - Generate multiple control regions per query region (configurable replicates) - Return all control regions as a single GRanges object
| Parameter | Type | Required | Description | Example |
|---|---|---|---|---|
query |
GRanges or GRangesList | Yes | Query regions. If GRangesList, controls are generated separately for each element and combined | query = my_peaks or
query = peak_clusters |
ref_genome |
character | No (default: “hg38”) | Reference genome version: “hg38” (Human GRCh38) or “mm10” (Mouse GRCm38) | ref_genome = "hg38" |
ref_source |
character | No (default: “knownGene”) | Gene annotation source: “knownGene” (UCSC TxDb) or “GENCODE” (v49 for hg38, vM23 for mm10) | ref_source = "GENCODE" |
style |
character | No (default: NULL) | Chromosome naming style. If NULL, automatically inferred from input query object | style = "UCSC" |
n_rep |
integer | No (default: 1) | Number of control regions to generate per query region | n_rep = 3 |
regions |
integer | No (default: 800) | Width of control regions in base pairs (centered on calculated position) | regions = 500 |
seed |
integer | No (default: 42) | Random seed for reproducible control region selection | seed = 12345 |
length_tolerance |
numeric | No (default: 0.2) | Initial tolerance for gene length matching (0.2 = ±20%). Automatically relaxed up to ±100% if insufficient candidates | length_tolerance = 0.3 |
length_tolerance
that is adaptively relaxed up to ±100% if needed.The function returns a GRanges object containing all successfully generated control regions.
Object type: GRanges (GenomicRanges package)
Number of regions:
n_rep × length(query)
regionsn_rep × sum(lengths(query)) regionsn_rep control
regionsRegion characteristics:
regions parameterstyle parameter
(e.g., “chr1” for UCSC)
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr21 6497621-6498421 *
[2] chr21 8197426-8198226 *
[3] chr21 10482659-10483459 *
[4] chr21 13979717-13980517 *
[5] chr21 14216386-14217186 *
[6] chr21 15729512-15730312 *
[7] chr21 16301884-16302684 *
[8] chr21 25735402-25736202 *
[9] chr21 26059459-26060259 *
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths
library(multiEpiCore)
# Load query peaks
query_peaks <- rtracklayer::import("CTCF_peaks.bed")
# Basic usage with single GRanges
control_regions <- get_matched_control(
query_gr = query_peaks,
ref_genome = "hg38",
regions = 800 # 800bp control regions
)
# Generate multiple controls per query
control_regions <- get_matched_control(
query = query_peaks,
ref_genome = "hg38",
n_rep = 5, # 5 controls per query
regions = 800
)
# Mouse genome with GENCODE annotations
control_regions <- get_matched_control(
query = query_peaks,
ref_genome = "mm10",
ref_source = "GENCODE", # Use GENCODE instead of knownGene
n_rep = 3,
length_tolerance = 0.3,
seed = 123
)
# GRangesList input (biclustering results)
cluster_file <- read.table("biclustering_results.tsv", header = TRUE)
peaks_gr <- makeGRangesFromDataFrame(cluster_file, keep.extra.columns = TRUE)
peaks_grl <- split(peaks_gr, peaks_gr$cluster)
control_regions <- get_matched_control(
query = peaks_grl, # GRangesList with named clusters
ref_genome = "hg38",
ref_source = "knownGene",
n_rep = 2
)
The TFBS_enrichment() function performs motif enrichment
analysis comparing a set of query genomic regions against control
regions using Fisher’s exact test to identify significantly enriched
transcription factor binding motifs.
What this function does:
Compares query genomic regions (GRanges) against background/control regions (GRanges)
Works directly with GRanges objects without file I/O for region inputs
Overlaps regions with a comprehensive motif library (stored as RDS files)
Optionally filters motif sites using functional region annotations (GRanges)
Counts motif occurrences in query vs control regions using
countOverlaps()
Performs Fisher’s exact test for each motif with pseudocount correction
Calculates odds ratios with standard errors
Adjusts p-values for multiple testing using Benjamini-Hochberg FDR
Identifies transcription factors with significantly enriched binding sites
Provides quantitative enrichment statistics for downstream interpretation
| Parameter | Type | Required | Description | Example |
|---|---|---|---|---|
query |
GRanges or GRangesList | Yes | Target genomic regions. If GRangesList, analysis is performed separately for each element | query = my_peaks or
query = peak_clusters |
control |
GRanges | Yes | Control/background genomic regions. Same control set is used for all elements if query is GRangesList | control = background_peaks |
regions |
integer or NULL | No (default: NULL) | Width in base pairs to resize all regions around their center. If NULL, uses original region sizes | regions = 500 |
ref_genome |
character | No (default: “hg38”) | Reference genome version: “hg38” (human) or “mm10” (mouse) | ref_genome = "hg38" |
functional_region |
GRanges or NULL | No (default: NULL) | Optional functional regions (e.g., ATAC-seq peaks) for filtering motif sites. Only motifs overlapping these regions are considered | functional_region = open_chromatin |
out_dir |
character | No (default: “./”) | Output directory path. For GRanges input, can also be a full file path. Directory is created if it doesn’t exist | out_dir = "./results/" |
style |
character or NULL | No (default: NULL) | Chromosome naming style (e.g., “UCSC”, “NCBI”). If NULL, auto-detected from query object | style = "UCSC" |
Key parameter notes:
out_dir/TFBS_enrichment.tsvTFBS_enrichment_<label>.tsvout_dir is a file path (not directory) for GRanges
input, uses that exact pathregions = NULL: Regions keep their original
widthsregions specified: Both query and control regions
are resized to this width, centered on each region’s midpointFor GRanges input: - Single TSV file:
<out_dir>/TFBS_enrichment.tsv
For GRangesList input: - Multiple TSV files:
<out_dir>/TFBS_enrichment_<label>.tsv (one per
element)
TSV file format: Each TSV file contains motif enrichment statistics with the following columns:
| feature | target_hit | control_hit | target_off | control_off | odds_ratio | pvalue | odds_ratio_se | FDR |
|---|---|---|---|---|---|---|---|---|
| CREBBP | 11 | 4 | 206 | 3086 | 35.686987900689 | 2.41692888678869e-11 | 0.537110966474507 | 7.46831026017705e-09 |
| NCOA2 | 10 | 22 | 207 | 3068 | 7.04766662951468 | 6.09470440920052e-06 | 0.373524943105363 | 0.000812607561274904 |
| STAT5A | 12 | 34 | 205 | 3056 | 5.50587515453413 | 7.88939379878547e-06 | 0.332680015204062 | 0.000812607561274904 |
| MCM5 | 3 | 0 | 214 | 3090 | 57.3957662256922 | 8.84174540651768e-05 | 1.1202565253937 | 0.00546419866122793 |
| ZNF274 | 3 | 0 | 214 | 3090 | 57.3957662256922 | 8.84174540651768e-05 | 1.1202565253937 | 0.00546419866122793 |
| PHB2 | 10 | 36 | 207 | 3054 | 4.36306397950694 | 0.000210357129487557 | 0.350814969016483 | 0.0108333921686092 |
library(multiEpiCore)
# =======================================
# Enrichment analysis with GRanges
# =======================================
query_region <- rtracklayer::import("query_peaks.bed")
control_region <- rtracklayer::import("control_peaks.bed")
# Using original region widths
TFBS_enrichment(
query = query_peaks,
control = control_peaks,
out_dir = "./results/",
ref_genome = "hg38"
)
# Resize all regions to 200bp centered windows
TFBS_enrichment(
query = query_peaks,
control = control_peaks,
regions = 200,
out_dir = "./results/",
ref_genome = "hg38"
)
# With functional region filtering (e.g., open chromatin)
atac_peaks <- rtracklayer::import("ATAC_peaks.bed")
TFBS_enrichment(
query = query_peaks,
control = control_peaks,
regions = 500,
functional_region = atac_peaks,
out_dir = "./results/",
ref_genome = "hg38"
)
# =======================================
# Batch processing with GRangesList
# =======================================
cluster_file <- read.table("biclustering_results.tsv", header = TRUE)
peaks_gr <- makeGRangesFromDataFrame(cluster_file, keep.extra.columns = TRUE)
peaks_grl <- split(peaks_gr, peaks_gr$cluster)
TFBS_enrichment(
query = peaks_grl, # GRangesList with clusters A, B, C
control = control_peaks,
regions = 800,
out_dir = "./results/",
ref_genome = "hg38"
)
The TFBS_enrichment_heatmap() function generates
heatmaps to visualize transcription factor binding site (TFBS)
enrichment patterns across multiple genomic clusters or experimental
conditions, using results from TFBS_enrichment()
analysis.
What this function does:
Reads multiple TFBS enrichment result files from
TFBS_enrichment() output
Combines odds ratios, FDR values, and query hits across all samples/clusters
Filters TFBS based on significance (FDR ≤ 0.05), effect size (odds ratio ≥ 2), and coverage criteria
Allows filtering for specific transcription factors of interest via pattern matching
Applies log2 transformation to odds ratios for symmetric visualization
Generates a comprehensive heatmap showing all TFBS passing filtering criteria
(Optional) Creates a focused heatmap displaying the union of top N most enriched TFBS per sample
(Optional) Supports optional hierarchical clustering of samples for pattern discovery
Produces publication-ready visualizations with consistent formatting and scalable dimensions
Exports log2-transformed odds ratio and FDR matrices for downstream analysis
| Parameter | Type | Required | Description | Example |
|---|---|---|---|---|
tsv_path |
character vector | Yes | Vector of file paths to TFBS enrichment result TSV files from TFBS_enrichment() analysis | tsv_path = c("cluster1.tsv", "cluster2.tsv") |
label |
character vector | Yes | Vector of sample/cluster labels corresponding to tsv_path. Must have same length as tsv_path | label = c("Cluster1", "Cluster2", "Cluster3") |
out_dir |
character | Yes | Output directory to save heatmaps and data matrices | out_dir = "./heatmaps" |
top_n |
integer | No (default: NULL) | Number of top TFBS to display in second heatmap based on coefficient of variation (CV = SD/mean) across samples. If NULL, only the full heatmap is generated | top_n = 50 |
selected_tfs |
character vector | No (default: NULL) | Vector of transcription factor names or substrings to filter rows (TFBS). Matching is case-insensitive. If NULL, all TFBS retained | selected_tfs = c("CTCF", "AP1", "STAT") |
apply_cluster |
logical | No (default: FALSE) | Whether to apply hierarchical clustering to columns and reorder them. If TRUE, columns are clustered and reversed for visualization | apply_cluster = TRUE |
TFBS_enrichment() functionodds_ratio, FDR,
query_hitapply_cluster = FALSE): Columns
ordered by sorted labelsapply_cluster = TRUE):
The function generates the following output files in the specified
out_dir:
TFBS_heatmap_all.pdf
apply_cluster = TRUETFBS_heatmap_top<n>.pdf (if top_n
provided)
apply_cluster = TRUETFBS_odds_ratio_log2.csv
TFBS_FDR.csv
TFBS_odds_ratio_log2.csv)TFBS_odds_ratio_log2.csv)library(multiEpiCore)
# Example 1: Basic usage - generate heatmap for all filtered TFBS
TFBS_enrichment_heatmap(
tsv_path = c(
"output/TFBS_enrichment_Cluster1.tsv",
"output/TFBS_enrichment_Cluster2.tsv",
"output/TFBS_enrichment_Cluster3.tsv"
),
label = c("Cluster1", "Cluster2", "Cluster3"),
out_dir = "heatmaps/all_tfbs"
)
# Example 2: With top N selection - focus on most enriched TFBS
TFBS_enrichment_heatmap(
tsv_path = c(
"output/TFBS_enrichment_Cluster1.tsv",
"output/TFBS_enrichment_Cluster2.tsv",
"output/TFBS_enrichment_Cluster3.tsv"
),
label = c("Cluster1", "Cluster2", "Cluster3"),
out_dir = "heatmaps/top10",
top_n = 10
)
# Example 3: Filter for specific transcription factors
TFBS_enrichment_heatmap(
tsv_path = c(
"output/TFBS_enrichment_TypeA.tsv",
"output/TFBS_enrichment_TypeB.tsv",
"output/TFBS_enrichment_TypeC.tsv"
),
label = c("TypeA", "TypeB", "TypeC"),
out_dir = "heatmaps/selected_tfs",
top_n = 15,
selected_tfs = c("CTCF", "AP1", "STAT", "NRF", "FOXO")
)