BridgeAnnotation.Rmd
ScRNA-seq reference + scATAC-seq query
Utilising a bimodal bridge dataset
Comprehensive benchmark study
Single-cell ATAC-seq (Assay for Transposase-Accessible Chromatin sequencing) measures chromatin accessibility across the genome. In the nucleus of each cell, each DNA chain is like a long thread wrapping around protein spools called histones. Each DNA-wrapped histone is known as a nucleosome – the basic unit of chromatin. Measuring the openness of chromatin regions tells us two things:
Whether a gene can be transcribed. For a gene to be transcribed, the gene’s corresponding chromatin region has to be open, i.e. part of the DNA chain being unwound.
Transcription factor (TF) footprints. TFs bind to gene regions to promote or suppress gene expression. ATAC-seq can infer which regions TFs are binding to.
ScATAC-seq allows us to infer cell-type-specific gene regulations. Hence identifying cell types based on scATAC-seq is an important step. One common appraoch is to view ATAC-seq as surrogate for RNA-seq. That is, if a gene has more open regions, then we assume that it has more transcripts.
In this case study, we use a reference-based approach to annotate cell types in scATAC-seq. This approach is inspired by bridge integration implemented in Seurat V5 (Hao et al., 2024). The idea is to transfer cell type annotations in scRNA-seq reference to scATAC-seq query, via a bimodal bridge dataset with matchcd RNA and ATAC features (e.g. generated from 10x multiome).
We use
Reference: BMMC (bone marrow mononuclear cell) scRNA-seq data set (Stuart et al. (2019));
Bridge and query: 10x multiome (scRNA+ATAC-seq) dataset with 13 batches from Leucken et al. (2021).
In our paper we used a cross-validation (CV) scheme as follows. Each time we use one of the 13 batches of multiome data as the bridge dataset to facilitate the transfer of cell type annotations from the scRNA-seq reference to the remaining 12 batches of query scATAC-seq data (with their RNA part and ground truth cell type labels hidden). This allowed us to benchmark PhiSpace cross-modality annotation with Seurat V5. Here we simply show how PhiSpace bridge annotation can be done using one bridge and one query datasets.
All data related to this case study can be downloaded here.
suppressPackageStartupMessages(library(PhiSpace))
suppressPackageStartupMessages(library(ggplot2))
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(magrittr))
suppressPackageStartupMessages(library(ggpubr))
suppressPackageStartupMessages(library(tidyr))
suppressPackageStartupMessages(library(RColorBrewer))
dat_dir <- "~/Dropbox/Research_projects/PhiSpace/VignetteData/ATAC/" # Replace this by your own directory
source("~/Dropbox/Research_projects/PhiSpace/VignetteData/ATAC/utils.R")
ScATAC-seq data are often available in the peak by cell format, where a peak refers to a small region on the genome that are open in a significant number of cells (i.e. region of interest). Peaking calling, i.e. defining the peaks based on the raw sequencing data, is a sophisticated computational step and there are a few different approaches, which we cannot cover here. Peaks are often further aggregated to the gene level as a proxy of gene expression. That is, open peaks falling in the region of a certain gene are summed up as ‘gene activity score’, approximating the expression level of that gene.
In our paper, we demonstrated that gene activite scores (aggregted peaks) are more suitable for cell typing due to its enhanced data quality. Here we show PhiSpace cross-modality cell typing using both peaks and gene activity scores.
reference <- readRDS(paste0(dat_dir, "data/obj.rna_for_refMap_sce.rds"))
bridgeRNA <- readRDS(paste0(dat_dir, "data/bridgeRNA_s4d8.rds"))
bridgeATACpeaks <- readRDS(paste0(dat_dir, "data/bridgeATACpeaks_s4d8.rds"))
bridgeGA <- readRDS(paste0(dat_dir, "data/bridgeGA_s4d8.rds"))
queryPeaks <- readRDS(paste0(dat_dir, "data/queryATACpeaks_s1d1.rds"))
queryGA <- readRDS(paste0(dat_dir, "data/queryATAC_GA_s1d1.rds"))
Run PhiSpace to transfer annotations from reference scRNA-seq to bridge scRNA-seq (RNA part of the bimodal bridge dataset).
PhiResPath <- paste0(dat_dir, "output/PhiResRNA.rds")
if(!file.exists(PhiResPath)){
PhiSpaceAssay <- "logcounts"
YtrainName <- "celltype.l2"
PhiRes <- PhiSpaceR_1ref(
reference,
query = bridgeRNA,
phenotypes = YtrainName,
PhiSpaceAssay = PhiSpaceAssay,
regMethod = "PLS",
scale = FALSE
)
saveRDS(PhiRes, PhiResPath)
} else {
PhiRes <- readRDS(PhiResPath)
}
Note that the bridge dataset has now been annotated.
bridgeAnn <- normPhiScores(PhiRes$PhiSpaceScore)
head(bridgeAnn)
## Prog_RBC gdT CD4 Naive CD4 Memory
## TTCGCAACAATAATGG-14-s4d8 -0.1707345986 0.3142937 0.024318290 -0.13258113
## TATCCAGCAATTGAGA-14-s4d8 -0.0539900171 0.1223587 -0.079132130 0.25909237
## ACTCGCGCAAACTGTT-14-s4d8 0.0136845293 -0.1913396 0.074089946 0.01314921
## GACTATTCATGTCGCG-14-s4d8 0.1381644903 -0.1472680 -0.139126446 -0.11354641
## CTAATGTCATTGTTGG-14-s4d8 0.0006747695 -0.1353683 -0.003968247 0.34867338
## TGACCAAGTAGACAAA-14-s4d8 -0.1286231148 0.1834022 -0.300778458 -0.01186283
## CD14 Mono Naive B CD8 Naive Treg
## TTCGCAACAATAATGG-14-s4d8 -0.03508110 0.177299624 -0.005793706 -0.18729220
## TATCCAGCAATTGAGA-14-s4d8 0.03920407 0.070065045 0.069233769 0.02057322
## ACTCGCGCAAACTGTT-14-s4d8 0.55715881 -0.004419651 -0.146966721 0.12118794
## GACTATTCATGTCGCG-14-s4d8 0.01349943 0.657490769 0.010944475 -0.01029004
## CTAATGTCATTGTTGG-14-s4d8 -0.02409488 0.015744293 -0.075164530 0.22040510
## TGACCAAGTAGACAAA-14-s4d8 0.21756821 0.081154893 0.405585859 0.05267516
## CD8 Effector_2 NK GMP CD8 Effector_1
## TTCGCAACAATAATGG-14-s4d8 -0.19359786 -0.23114732 0.10532532 0.030164486
## TATCCAGCAATTGAGA-14-s4d8 0.10336668 -0.04702014 -0.01607817 -0.009467531
## ACTCGCGCAAACTGTT-14-s4d8 0.02719140 -0.06959604 -0.08155136 0.009924969
## GACTATTCATGTCGCG-14-s4d8 0.18685296 -0.10031176 0.03269648 -0.083924599
## CTAATGTCATTGTTGG-14-s4d8 -0.06730475 0.04239765 -0.02165142 0.161820780
## TGACCAAGTAGACAAA-14-s4d8 -0.25069804 0.32892606 -0.22216454 0.247644301
## CD16 Mono pDC CD8 Memory_1 MAIT
## TTCGCAACAATAATGG-14-s4d8 0.0267491818 0.07519187 0.10838891 0.15693329
## TATCCAGCAATTGAGA-14-s4d8 0.0004520518 0.04187673 0.16961830 0.28565867
## ACTCGCGCAAACTGTT-14-s4d8 0.0401469480 0.07347996 0.03591260 -0.11384447
## GACTATTCATGTCGCG-14-s4d8 0.0956767357 0.08542073 -0.11933952 0.15696125
## CTAATGTCATTGTTGG-14-s4d8 0.0896238445 -0.03629314 0.09045062 -0.32292449
## TGACCAAGTAGACAAA-14-s4d8 -0.2205784394 0.06602202 0.03483928 -0.09627815
## Memory B cDC2 CD56 bright NK Prog_B 2
## TTCGCAACAATAATGG-14-s4d8 -0.018326246 -0.14549994 0.10604594 0.67305697
## TATCCAGCAATTGAGA-14-s4d8 -0.157773873 -0.03388604 0.11796970 0.08840598
## ACTCGCGCAAACTGTT-14-s4d8 -0.011452452 -0.02233130 0.05531265 -0.01953300
## GACTATTCATGTCGCG-14-s4d8 -0.009060715 -0.15772452 -0.19432405 -0.03411201
## CTAATGTCATTGTTGG-14-s4d8 -0.087183069 0.02049838 -0.07045638 -0.03339777
## TGACCAAGTAGACAAA-14-s4d8 -0.034366419 0.04425579 0.04903824 -0.01442982
## Prog_Mk CD8 Memory_2 Plasmablast HSC
## TTCGCAACAATAATGG-14-s4d8 -0.05306825 0.11360104 -0.0368015817 -0.012091183
## TATCCAGCAATTGAGA-14-s4d8 -0.23100435 -0.01249257 -0.0136314348 -0.128567181
## ACTCGCGCAAACTGTT-14-s4d8 0.20827609 0.04596609 0.0014879483 0.153315573
## GACTATTCATGTCGCG-14-s4d8 0.02229154 -0.04417593 -0.0519034938 -0.008114113
## CTAATGTCATTGTTGG-14-s4d8 0.24471407 0.12718647 0.0486055125 0.077611859
## TGACCAAGTAGACAAA-14-s4d8 -0.37960088 0.28419243 -0.0006143916 -0.182652596
## LMPP Prog_DC Prog_B 1
## TTCGCAACAATAATGG-14-s4d8 0.382422573 0.54725478 0.52730214
## TATCCAGCAATTGAGA-14-s4d8 -0.072865706 0.06203403 0.04468466
## ACTCGCGCAAACTGTT-14-s4d8 0.036138028 -0.06053255 0.12502706
## GACTATTCATGTCGCG-14-s4d8 -0.113021357 -0.16087069 0.01583117
## CTAATGTCATTGTTGG-14-s4d8 -0.005820409 -0.07928313 0.06925808
## TGACCAAGTAGACAAA-14-s4d8 0.001141363 0.03850285 -0.13583515
## used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
## Ncells 7136499 381.2 11783855 629.4 NA 8882335 474.4
## Vcells 452249997 3450.4 581936216 4439.9 16384 465353384 3550.4
Next we transfer this continuous annotation from bridge to query. Note that we use bridgeATACpeaks as reference in the following code to match the query. Peaks in both bridge and query have been normalised using TF-IDF.
PhiResPath <- paste0(dat_dir, "output/PhiResATACpeaks.rds")
if(!file.exists(PhiResPath)){
PhiSpaceAssay <- "data"
PhiRes <- PhiSpaceR_1ref(
reference = bridgeATACpeaks,
query = queryPeaks,
response = bridgeAnn,
PhiSpaceAssay = PhiSpaceAssay,
regMethod = "PLS",
center = F,
scale = FALSE
)
saveRDS(PhiRes, PhiResPath)
} else {
PhiRes <- readRDS(PhiResPath)
}
PhiScPeaks <- normPhiScores(PhiRes$PhiSpaceScore)
Note that we set center = F
here since centring would
turn a sparse peak matrix to dense, consuming a lot of RAM. If you have
enough RAM, centring is still encouraged.
Alternatively, we can transfer continuous annotation via aggregated peaks or gene activity (GA) score.
PhiResPath <- paste0(dat_dir, "output/PhiResATAC_GA.rds")
if(!file.exists(PhiResPath)){
PhiSpaceAssay <- "logcounts"
PhiRes <- PhiSpaceR_1ref(
reference = bridgeRNA,
query = queryGA,
response = bridgeAnn,
PhiSpaceAssay = PhiSpaceAssay,
regMethod = "PLS",
center = T,
scale = FALSE
)
saveRDS(PhiRes, PhiResPath)
} else {
PhiRes <- readRDS(PhiResPath)
}
PhiScGA <- normPhiScores(PhiRes$PhiSpaceScore)
Now we can compare which annotation gave more accurate cell type annotation. An immediate problem is that the ground truth cell types for query were different from the reference cell types. Finding one-to-one correspondence between them is tricky. However, it’s possible to define broad cell types and align both sets of cell types with the broad cell types. This will also allow us to compare the accuracy of the peaks-based and GA-based approaches at the broad cell type level.
refTypes_l2 <- unique(reference$celltype.l2) %>% sort
refTypes_l1 <- sapply(
refTypes_l2,
function(x){
unique(reference$celltype.l1[reference$celltype.l2 == x])
}
)
ref_lookup <- data.frame(l2 = refTypes_l2, l1 = refTypes_l1)
# Query original annotations (ILC is hard to classify)
cellTypeTable <- readRDS(paste0(dat_dir, "data/CellTypeTable.rds"))
queryTypes_l2 <- rownames(cellTypeTable) %>% sort
queryTypes_l1 <- c(
"B cell", "Mono/DC", "Mono/DC", "T cell", "T cell",
"T cell", "T cell", "Mono/DC", "Progenitor cells", "Progenitor cells",
"Progenitor cells", "Progenitor cells", "ILC", "Progenitor cells", "Progenitor cells",
"B cell", "NK", "Progenitor cells", "Mono/DC", "B cell",
"Progenitor cells", "B cell"
)
query_lookup <- data.frame(
l2 = queryTypes_l2,
l1 = queryTypes_l1
)
Evaluate performances: calculating overall and balanced (per cell type average) classification errors. Indeed using aggregated peaks gave lower erros compared to using peaks.
originAnn <- queryGA$cellType
originAnn_l1 <- query_lookup$l1[match(originAnn, query_lookup$l2)]
# Translate PhiSpace labels
PhiSpaceAnn <- PhiScGA %>% getClass
PhiSpaceAnn_l1 <- ref_lookup$l1[match(PhiSpaceAnn, ref_lookup$l2)]
PhiSpaceAnnPeaks <- PhiScPeaks %>% getClass()
PhiSpaceANN_l1_peaks <- ref_lookup$l1[match(PhiSpaceAnnPeaks, ref_lookup$l2)]
PhiSpace:::classErr(PhiSpaceAnn_l1, originAnn_l1)$err
## [1] 0.2291131 0.3536215
PhiSpace:::classErr(PhiSpaceANN_l1_peaks, originAnn_l1)$err
## [1] 0.2892031 0.3978839