7.4 DIVAS: Data Integration Via Analysis of Subspaces
DIVAS (Prothero et al. 2024; Sun et al. 2026) was developed
specifically for the multi-block setting. Given \(K\) data blocks measured on the same \(n\) samples, DIVAS decomposes each block’s
signal subspace into components that are jointly shared
(present in all blocks), partially shared (present in a
subset of blocks), or individual (present in one block
only).
The key idea is geometric. Each data block \(\mathbf{X}_k \in \mathbb{R}^{d_k \times
n}\) has a signal subspace in trait space \(\mathbb{R}^n\). DIVAS measures the
relationships between these subspaces via principal
angles: the canonical angles between pairs of subspaces. Small
angles indicate shared structure; large angles indicate independent
variation. A rotational bootstrap provides statistical inference on
which angles are significantly small (i.e. which components are
genuinely shared vs. coincidentally aligned).
For each component, DIVAS returns: - A score vector
\(\mathbf{s} \in \mathbb{R}^n\) (shared
across the relevant blocks), analogous to a factor loading in NMF — it
tells you which samples participate. - A loading vector
\(\boldsymbol{\ell}_k \in
\mathbb{R}^{d_k}\) per block, analogous to a gene/protein weight
— it tells you which features drive the component in each block.
The decomposition is:
\[\mathbf{A}_k = \sum_{\mathbf{i} \ni k}
\mathbf{L}_{\mathbf{i},k} \, \mathbf{S}_\mathbf{i}^\top\]
where \(\mathbf{i}\) indexes subsets
of blocks, \(\mathbf{S}_\mathbf{i}\)
are the score vectors shared by blocks in subset \(\mathbf{i}\), and \(\mathbf{L}_{\mathbf{i},k}\) are the
corresponding loadings for block \(k\).
Key advantages over NMF stacking:
| Shared vs. individual |
Not distinguished |
Explicitly identified and labelled |
| Number of each type |
Single total K |
\(K_\text{shared}\), \(K_\text{indiv,1}\), \(K_\text{indiv,2}\) estimated
separately |
| Statistical inference |
None on component identity |
Rotational bootstrap p-values for shared structure |
| Input structure |
Single matrix (stacking choice is the user’s) |
Separate blocks; no stacking decision needed |
| Sparsity / prior |
EB priors (flashier), regularisation (standard NMF) |
SVD-based; no sparsity prior |
7.5 COVID-19 case study: proteomics T1 vs. T2
We applied DIVAS to the same COVID-19 proteomics data used in the NMF
analyses, treating T1 and T2 as two blocks (each 481 proteins × 120
patients, Olink NPX log2, column-centred — proteins centred within each
block). DIVAS identified 24 components: 10 shared
(T1+T2), 8 T1-individual, and 6 T2-individual.
library(DIVAS)
# ── Build blocks: features × samples (DIVAS convention) ─────────────────────
# Prot_T1 and Prot_T2: 481 proteins × 120 patients, same column order
Prot_T1 <- X_prot[grepl("T1", rownames(X_prot)), ]
Prot_T2 <- X_prot[grepl("T2", rownames(X_prot)), ]
data_list <- list(T1 = t(Prot_T1), T2 = t(Prot_T2))
# ── Fit DIVAS ────────────────────────────────────────────────────────────────
if(F){
set.seed(123)
tic <- Sys.time()
divas_res <- DIVASmain(
datablock = data_list,
nsim = 400, # bootstrap simulations for rank estimation
colCent = TRUE, # centre proteins within each block
rowCent = FALSE,
seed = 123,
ReturnDetail = TRUE # needed for getTopFeatures()
)
Sys.time() - tic
saveRDS(divas_res, "EBNMF/CovidCase/results/divas_2block_forWorkshop.rds")
} else {
divas_res <- readRDS("EBNMF/CovidCase/results/divas_2block_forWorkshop.rds")
}
# ── Extract results ──────────────────────────────────────────────────────────
scores_mat <- divas_res$Scores # 120 patients × 24 components
comp_names <- colnames(scores_mat)
cat(sprintf("Components: %d\n", ncol(scores_mat)))
print(table(ifelse(grepl("\\+", comp_names), "Shared", "Individual")))
# ── Score heatmap ────────────────────────────────────────────────────────────
library(pheatmap)
ann_col <- data.frame(
T1_severity = t1_severity,
WHO_delta = who_delta,
row.names = rownames(scores_mat)
)
pheatmap(
t(scores_mat),
annotation_col = ann_col,
cluster_rows = FALSE,
cluster_cols = TRUE,
show_colnames = FALSE,
color = colorRampPalette(c("#313695", "white", "#A50026"))(100),
main = "DIVAS scores: proteomics T1 vs T2"
)
# ── Associate with clinical metadata ────────────────────────────────────────
t1_meta <- meta[rownames(Prot_T1),]
t2_meta <- meta[rownames(Prot_T2),]
t1_severity <- t1_meta$who_score_num
t2_severity <- t2_meta$who_score_num
rho_sev <- apply(scores_mat, 2, function(x)
cor(x, t1_severity, method = "spearman", use = "complete.obs"))
pval_sev <- sapply(seq_len(ncol(scores_mat)), function(k)
cor.test(scores_mat[, k], t1_severity, method = "spearman")$p.value)
fdr_sev <- p.adjust(pval_sev, method = "BH")
data.frame(
rho = rho_sev, pval = pval_sev, pval_adj = fdr_sev
)
rho_sev <- apply(scores_mat, 2, function(x)
cor(x, who_delta, method = "spearman", use = "complete.obs"))
pval_sev <- sapply(seq_len(ncol(scores_mat)), function(k)
cor.test(scores_mat[, k], who_delta, method = "spearman")$p.value)
fdr_sev <- p.adjust(pval_sev, method = "BH")
data.frame(
rho = rho_sev, pval = pval_sev, pval_adj = fdr_sev
)
data.frame(
T1T2_1 = scores_mat[,"T1+T2-1"],
T1T2_2 = scores_mat[,"T1+T2-2"],
T1_severity = t1_severity
) |>
ggplot(aes(T1_severity, T1T2_1)) +
geom_point() +
geom_smooth()
# ── Top proteins for a shared component ─────────────────────────────────────
# Shared components: compName = e.g. "T1+T2-1", modName = "T1" or "T2"
feats_T1 <- getTopFeatures(divas_res, compName = "T1+T2-1", modName = "T1",
n_top_pos = 10, n_top_neg = 10)
cat("T1+T2-1 top positive proteins (T1):", paste(feats_T1$top_positive, collapse = ", "), "\n")
feats_T2 <- getTopFeatures(divas_res, compName = "T1+T2-1", modName = "T2",
n_top_pos = 10, n_top_neg = 10)
cat("T1+T2-1 top positive proteins (T2):", paste(feats_T2$top_positive, collapse = ", "), "\n")
feats_T1 <- getTopFeatures(divas_res, compName = "T1+T2-6", modName = "T1",
n_top_pos = 10, n_top_neg = 10)
cat("T1+T2-6 top positive proteins (T1):", paste(feats_T1$top_positive, collapse = ", "), "\n")
feats_T2 <- getTopFeatures(divas_res, compName = "T1+T2-6", modName = "T2",
n_top_pos = 10, n_top_neg = 10)
cat("T1+T2-6 top positive proteins (T2):", paste(feats_T2$top_positive, collapse = ", "), "\n")
Shared components and severity
The shared component T1+T2-1 showed the strongest
severity association of any component in the entire analysis (Spearman
\(\rho = -0.721\) with T1 WHO score,
FDR \(< 0.0001\)). This is a
time-stable severity program — the same proteomics axis separates mild
from severe patients at both admission and follow-up (negative \(\rho\): higher score = lower severity). The
strength of this association (\(|\rho| =
0.721\)) substantially exceeds the best NMF result (\(\rho = 0.344\) for EBNMF F5), illustrating
how separating shared from individual variation sharpens the severity
signal. The second shared component T1+T2-2 also
associates strongly with severity (\(\rho =
-0.500\), FDR \(<
0.0001\)).
The shared component T1+T2-6 shows the strongest
association with clinical change (WHO delta, \(\rho = -0.334\), FDR \(= 0.005\)): patients with higher scores
tended to improve more from admission to follow-up.
Individual components: what NMF cannot easily reveal
The 14 individual components (8 T1, 6 T2) are the distinctive
contribution of DIVAS. These capture proteomics variation unique to one
time point — programs that are present at admission but not follow-up,
or vice versa. None of these have strong severity associations (all FDR
> 0.3), which is itself informative: it confirms that severity is
predominantly a time-stable phenomenon. The value of individual
components lies in revealing time-specific biology.
T1-individual (admission-specific) themes:
Troponin I / IFN-gamma vs. IL-10 axis
(T1-Indiv-1). Troponin I (+0.188) and FKBP4 (immunophilin)
vs. IFN-gamma (−0.178) and IL-10 at admission.
Troponin I / HAVCR1 cardiac-renal damage
(T1-Indiv-2). Troponin I (+0.262) and HAVCR1/TIM-1 (kidney
injury) dominate the positive pole; opposed by IL-5 isoforms (−0.178). A
cardiac/renal injury axis present at admission.
FGF21 / HAO1 hepatic metabolic stress
(T1-Indiv-3). HAO1 (liver peroxisomal, −0.313) and FGF21
(hepatokine, multiple isoforms) mark liver metabolic derangement.
Distinct from T1-Indiv-2 (metabolic/mitochondrial stress vs
cardiac/renal injury).
NAD kinase / IFN-gamma axis (T1-Indiv-4). NAD
kinase (+0.129) and TRIM21 vs. HAVCR1/TIM-1 (−0.133) and FGF21.
Metabolic enzyme vs. kidney/hepatic damage axis.
Meprin A / IFN-gamma axis (T1-Indiv-5). Meprin A
metalloprotease (+0.165) and cardiac markers (TnI, BNP) vs. IFN-gamma
(−0.269), HAO1, and CXCL9. An antiviral response opposed by tissue
remodelling and cardiac markers.
SERPINA12 / troponin cardiovascular axis
(T1-Indiv-6). SERPINA12/vaspin (+0.221), RNase 3, troponin I,
and BNP vs. KRT19 (−0.162), AGR2. Cardiovascular markers at
admission.
IL-10 / pleiotrophin axis (T1-Indiv-7). IL-10
isoforms (+0.166) and pleiotrophin vs. GH1 (−0.245). Anti-inflammatory
vs. somatotropic axis at admission.
NCF2 / SERPINA12 neutrophil axis (T1-Indiv-8).
NCF2/p67phox (+0.191), nibrin, SH2D1A vs. SERPINA12 (−0.209), TnI, and
IFN-gamma. Neutrophil oxidative burst markers.
T2-individual (follow-up-specific) themes:
FGF21 / CCL3 vs. HAO1 (T2-Indiv-1). FGF21
isoforms (+0.152) and CCL3 vs. ARNT (−0.150, hypoxia transcription
factor), HAO1, and meprin. A hepatic metabolic axis emerging at
follow-up.
IL-6 vs. troponin I (T2-Indiv-2). IL-6 isoforms
(+0.129) vs. troponin I (−0.245) and HAVCR1 isoforms. Persistent
inflammation opposed by cardiac/renal damage.
GH1 / IL-5 vs. HAO1 / meprin (T2-Indiv-3). GH1
(+0.266) and IL-5 isoforms vs. HAO1 (−0.236), meprin, FBP1, and CA5A.
Somatotropic/eosinophilic vs. hepatorenal damage axis at
follow-up.
GH1 / IL-10 vs. HAVCR1 (T2-Indiv-4). GH1
(+0.156) and IL-10 isoforms vs. HAVCR1 isoforms (−0.175) and SERPINA12.
Metabolic recovery vs. kidney injury axis.
GH1 / HAVCR1 vs. TMPRSS15 (T2-Indiv-5). GH1
(+0.305) and HAVCR1 isoforms vs. TMPRSS15 (−0.136, enterokinase), EPO,
and IL-5. GI/renal axis at follow-up.
GH1 / SERPINA12 vs. IL-10 (T2-Indiv-6). GH1
(+0.294) and SERPINA12 vs. IL-10 isoforms (−0.194) and CCL20.
Metabolic/adipokine vs. anti-inflammatory axis.
What individual components reveal that NMF does not
The NMF analysis (Part IV) found 14 programs on the stacked matrix,
some of which hinted at time-dependent behaviour (e.g. factors with
higher T2 than T1 loadings). But it could not answer: “Is the troponin I
cardiac damage program at admission a genuinely distinct axis, or just a
noisy variant of a time-stable program?” DIVAS answers this directly:
T1-Indiv-2 (troponin/HAVCR1) is statistically individual — it passed the
rotational bootstrap test for being absent from the T2 signal subspace.
Similarly, the GH1-dominated metabolic axes (T2-Indiv-3 through
T2-Indiv-6) are confirmed as follow-up-specific phenomena.
The biological picture that emerges: at admission, patients vary
along axes of acute cardiac/renal injury, hepatic metabolic stress, and
immune activation. By follow-up, new axes emerge centred on GH1
(somatotropic recovery), IL-6 (persistent inflammation), and HAVCR1
(ongoing kidney injury). The time-stable severity program (T1+T2-1)
persists throughout, but the character of the disease
changes — and only DIVAS, not NMF, can formally separate these
dynamics.