Feature Quantification#
This page describes how to obtain probe/peak/isoform-level quantification for various spatial transcriptomics (ST) platforms as input to SPLISOSM.
To check if you already have a compatible AnnData or SpatialData object, see Expected input data format.
Platform overview#
The table below summarizes the supported ST platforms, the type of isoform feature used, and the recommended quantification approach.
Platform type |
Example platforms |
Feature |
Quantification approach |
|---|---|---|---|
Long-read ST |
ONT + Visium (e.g., SiT [LBT+23]) |
Full-length transcript isoform |
|
Short-read 3' end |
10x Visium/Visium HD (fresh-frozen), Slide-seqV2 |
3' end diversity (TREND) event (peak) |
Sierra for de novo peak calling |
Short-read targeted |
10x Visium v2 CytAssist (FFPE), 10x Visium HD (FFPE), 10x Flex |
Exon/junction probe |
Space Ranger output directly ( |
In situ targeted |
10x Xenium Prime 5K |
Exon/junction probe (codeword) |
Xenium Ranger output directly ( |
Expected data format#
See Expected input data format in the Quick Start for the expected
AnnData / SpatialData layout. The rest of this page covers how to
produce such inputs from each platform’s raw output.
For SplisosmFFT, counts are internally rasterised into square
bins via spatialdata.rasterize_bins
(unobserved bins are zero-padded). The SpatialData table must therefore also
carry .uns['spatialdata_attrs'] with spatial_key / row_key /
col_key entries — see the SpatialData tutorial on table metadata.
Long-read ST data#
SPLISOSM is compatible with any long-read quantification tool that produces an isoform-by-spot count matrix. Popular options include IsoQuant and kallisto.
Note
Detection power scales linearly with sequencing depth (FAQ). Any processing choice that increases captured UMIs per spot will improve results.
Tested platforms:
10x Visium + ONT (SiT [LBT+23])
10x Visium HD + ONT (the EPI2ME dataset)
The Visium HD ONT tutorial shows how to load a preprocessed ONT + Visium HD dataset and run SPLISOSM.
The dataset is publicly accessible and can be downloaded from EPI2ME.
Transcript assignment and quantification were performed using minimap2, Stringtie, and FLAMES. See the EPI2ME workflow documentation for details.
The generated SpatialData object has the following structure (example):
SpatialData object
├── Images
│ ├── '_hires_image': DataArray[cyx] (3, 3000, 3200)
│ └── '_lowres_image': DataArray[cyx] (3, 563, 600)
├── Shapes
│ ├── '_square_002um': GeoDataFrame shape: (7857218, 1) (2D shapes)
│ ├── '_square_008um': GeoDataFrame shape: (492663, 1) (2D shapes)
│ └── '_square_016um': GeoDataFrame shape: (123658, 1) (2D shapes)
└── Tables
├── 'square_002um': AnnData (7857218, 48001)
├── 'square_008um': AnnData (492663, 48001)
└── 'square_016um': AnnData (123658, 48001)
with coordinate systems:
▸ '', with elements:
_hires_image (Images), _lowres_image (Images), _square_002um (Shapes), _square_008um (Shapes), _square_016um (Shapes)
▸ '_downscaled_hires', with elements:
_hires_image (Images), _square_002um (Shapes), _square_008um (Shapes), _square_016um (Shapes)
▸ '_downscaled_lowres', with elements:
_lowres_image (Images), _square_002um (Shapes), _square_008um (Shapes), _square_016um (Shapes)
Short-read 3’ end ST data#
We recommend using Sierra to call 3’ end diversity (TREND) events de novo from Space Ranger BAM files. See the Sierra vignette for installation and usage details.
Tested platforms:
10x Visium 3’ (fresh-frozen)
Slide-seqV2
10x Visium HD 3’ (fresh-frozen)
Warning
When processing multiple samples, avoid Sierra’s MergePeakCoordinates function — it can produce overlapping peak definitions.
Note
If you encounter issues when running Sierra’s CountPeaks, the following workaround quantification workflow may be helpful:
10x Visium HD 3’ data (custom quantification)#
For 10x Visium HD 3’ data, we have prepared a hybrid quantification workflow where Sierra is used for peak calling but not counting. See the attached bash script (scripts/visiumhd_3p_trend_quant.sh).
Step-by-step workflow explained (click to expand)
Run
spaceranger count(>=v4.0) withcreate-bam=trueto get the BAM file.Run Sierra
FindPeaks(plusAnnotatePeaksFromGTF) on the Space Ranger BAM.Convert peak definitions to SAF/BED and run
featureCounts -R BAMto add peak-level tags using Subread.Count UMIs per peak per barcode with
umi_tools count.Build a 10x-compatible feature-barcode matrix and save as
raw_probe_bc_matrix.h5under the output directoryouts/binned_outputs/square_002um.Load peak-level data into
SpatialDatawithsplisosm.io.load_visiumhd_probe().
The generated SpatialData object has the following structure (example):
SpatialData object
├── Images
│ ├── '_hires_image': DataArray[cyx] (3, 5492, 6000)
│ └── '_lowres_image': DataArray[cyx] (3, 549, 600)
├── Shapes
│ ├── '_cell_segmentations': GeoDataFrame shape: (84031, 2) (2D shapes)
│ ├── '_square_002um': GeoDataFrame shape: (5998466, 1) (2D shapes)
│ ├── '_square_008um': GeoDataFrame shape: (376419, 1) (2D shapes)
│ └── '_square_016um': GeoDataFrame shape: (94592, 1) (2D shapes)
└── Tables
├── 'cell_segmentations': AnnData (84031, 19575)
├── 'square_002um': AnnData (5998466, 19575)
├── 'square_008um': AnnData (376419, 19575)
└── 'square_016um': AnnData (94592, 19575)
with coordinate systems:
▸ '', with elements:
_hires_image (Images), _lowres_image (Images), _cell_segmentations (Shapes), _square_002um (Shapes), _square_008um (Shapes), _square_016um (Shapes)
▸ '_downscaled_hires', with elements:
_hires_image (Images), _cell_segmentations (Shapes), _square_002um (Shapes), _square_008um (Shapes), _square_016um (Shapes)
▸ '_downscaled_lowres', with elements:
_lowres_image (Images), _cell_segmentations (Shapes), _square_002um (Shapes), _square_008um (Shapes), _square_016um (Shapes)
For downstream analysis of TREND spatial patterns, see the Visium HD 3’ tutorial.
10x Visium / Slide-seqV2 3’ data (Sierra quantification)#
Running Sierra on Space Ranger BAM
# load the R package
library(Sierra)
FindPeaks(
output.file = '${peak_file}',
gtf.file = '${gtf_file}', # SpaceRanger reference gtf file
bamfile = '${bam_file}', # bam file from SpaceRanger
junctions.file = '${junc_file}', # junctions bed files extracted using regtools junctions extract
# optional arguments for retaining low-abundance peaks
# min.jcutoff.prop = 0.0,
# min.cov.prop = 0.0,
# min.peak.prop = 0.0
)
CountPeaks(
peak.sites.file = '${peak_file}',
gtf.file = '${gtf_file}',
bamfile = '${bam_file}',
whitelist.file = '${whitelist_file}', # barcodes.tsv file from SpaceRanger
output.dir = '${output_dir}',
)
Converting Sierra output to AnnData (click to expand)
import scanpy as sc
import pandas as pd
from splisosm.io import load_visium_sp_meta
sierra_out_dir = "path/to/sierra/output" # 'output.dir' in CountPeaks
sp_meta_dir = "path/to/visium/spatial/metadata" # 'spatial' directory in 10X Visium data
# load the Sierra outputs as an AnnData object
adata = sc.read(
f"{sierra_out_dir}/matrix.mtx.gz",
cache_compression='cache_compression',
).T
# load TREND peak metadata
peaks = pd.read_csv(
f"{sierra_out_dir}/sitenames.tsv.gz",
header=None,
sep="\t",
)
df_var = peaks[0].str.split(':', expand=True)
df_var.columns = ['gene_symbol', 'chr', 'position', 'strand']
df_var.index = peaks[0].values
# load spatial barcode metadata
barcodes = pd.read_csv(f"{sierra_out_dir}/barcodes.tsv.gz", header=None)
# add metadata to the AnnData object
adata.var_names = peaks[0].values
adata.obs_names = barcodes[0].values
adata.var = df_var
adata.var['gene_id'] = adata.var['gene_symbol']
# load Visium spatial metadata
adata = load_visium_sp_meta(adata, f"{sp_meta_dir}/", library_id='adata_peak')
adata = adata[adata.obs['in_tissue'].astype(bool), :].copy()
Custom AnnData filtering (click to expand)
SPLISOSM compares all events associated with the same gene and is agnostic to their specific structure. Filter out low-abundance features before testing to reduce computation and improve power.
# filter out lowly expressed peaks
sc.pp.filter_genes(adata, min_cells=0.01 * adata.shape[0])
# extract gene symbols and peak ids
df_iso_meta = adata.var.copy() # gene_symbol, chr, position, strand, gene_id
df_iso_meta['peak_id'] = adata.var_names
# prepare gene-level metadata
df_gene_meta = df_iso_meta.groupby('gene_symbol').size().reset_index(name='n_peak')
df_gene_meta = df_gene_meta.set_index('gene_symbol')
print(f"Number of spots: {adata.shape[0]}")
print(f"Number of genes before QC: {df_gene_meta.shape[0]}")
print(f"Number of peaks before QC: {adata.shape[1]}")
print(f"Average number of peaks per gene before QC: {adata.shape[1] / df_gene_meta.shape[0]}")
# calculate the total counts per gene
mapping_matrix = pd.get_dummies(df_iso_meta['gene_symbol'])
mapping_matrix = mapping_matrix.loc[df_iso_meta.index, df_gene_meta.index]
isog_counts = adata[:, mapping_matrix.index].layers['counts'] @ mapping_matrix
# calculate mean and sd of total gene counts
df_gene_meta['pct_spot_on'] = (isog_counts > 0).mean(axis=0)
df_gene_meta['count_avg'] = isog_counts.mean(axis=0)
df_gene_meta['count_std'] = isog_counts.std(axis=0)
# filter out lowly expressed genes
_gene_keep = df_gene_meta['pct_spot_on'] > 0.01
# _gene_keep = (df_gene_meta['count_avg'] > 0.5) & _gene_keep
# filter out genes with single isoform
_gene_keep = (df_gene_meta['n_peak'] > 1) & _gene_keep
# filter for isoforms
_iso_keep = df_iso_meta['gene_symbol'].isin(df_gene_meta.index[_gene_keep])
# update feature meta
df_gene_meta = df_gene_meta.loc[_gene_keep, :]
adata = adata[:, _iso_keep]
adata.var = df_iso_meta.loc[_iso_keep, :].copy()
print(f"Number of genes after QC: {sum(_gene_keep)}")
print(f"Number of peaks after QC: {sum(_iso_keep)}")
print(f"Average number of peaks per gene after QC: {sum(_iso_keep) / sum(_gene_keep)}")
Short-read targeted ST data#
10x Visium and Visium HD FFPE kits use fixed pan-genome probes for read enrichment. SPLISOSM treats probe-level counts as the isoform-level quantification — probes targeting the same gene are grouped and tested jointly.
Tested platforms:
10x Visium FFPE (v2 CytAssist Spatial Gene Expression)
10x Visium HD FFPE
10x Visium HD FFPE#
Given Space Ranger output, the following code creates a SpatialData object with probe-level counts (from raw_probe_bc_matrix.h5):
from splisosm.io import load_visiumhd_probe
sdata = load_visiumhd_probe(
path=visium_hd_outs,
bin_sizes=[2, 8, 16],
filtered_counts_file=True,
load_all_images=False,
var_names_make_unique=True,
counts_layer_name="counts",
)
Note
splisosm.io.load_visiumhd_probe() uses the barcode_mappings.parquet file,
which contains the spatial mapping information of 2um barcodes to coarser bins and segmented cells.
See 10x documentation for details.
If you don’t have this file, please re-run the
Space Ranger count pipeline
with Space Ranger v4.0+.
The generated SpatialData object has the following structure (example):
SpatialData object
├── Images
│ ├── 'Visium_HD_Mouse_Brain_full_image': DataTree[cyx] (3, 23947, 18872), (3, 11973, 9436), (3, 5986, 4718), (3, 2993, 2359), (3, 1496, 1179)
│ ├── 'Visium_HD_Mouse_Brain_hires_image': DataArray[cyx] (3, 6000, 4729)
│ └── 'Visium_HD_Mouse_Brain_lowres_image': DataArray[cyx] (3, 600, 473)
├── Shapes
│ ├── 'Visium_HD_Mouse_Brain_cell_segmentations': GeoDataFrame shape: (40222, 2) (2D shapes)
│ ├── 'Visium_HD_Mouse_Brain_square_002um': GeoDataFrame shape: (6296688, 1) (2D shapes)
│ ├── 'Visium_HD_Mouse_Brain_square_008um': GeoDataFrame shape: (393543, 1) (2D shapes)
│ └── 'Visium_HD_Mouse_Brain_square_016um': GeoDataFrame shape: (98917, 1) (2D shapes)
└── Tables
├── 'cell_segmentations': AnnData (40222, 55538)
├── 'square_002um': AnnData (6296688, 55538)
├── 'square_008um': AnnData (393543, 55538)
└── 'square_016um': AnnData (98917, 55538)
with coordinate systems:
▸ 'Visium_HD_Mouse_Brain', with elements:
Visium_HD_Mouse_Brain_full_image (Images), Visium_HD_Mouse_Brain_hires_image (Images), Visium_HD_Mouse_Brain_lowres_image (Images), Visium_HD_Mouse_Brain_cell_segmentations (Shapes), Visium_HD_Mouse_Brain_square_002um (Shapes), Visium_HD_Mouse_Brain_square_008um (Shapes), Visium_HD_Mouse_Brain_square_016um (Shapes)
▸ 'Visium_HD_Mouse_Brain_downscaled_hires', with elements:
Visium_HD_Mouse_Brain_hires_image (Images), Visium_HD_Mouse_Brain_cell_segmentations (Shapes), Visium_HD_Mouse_Brain_square_002um (Shapes), Visium_HD_Mouse_Brain_square_008um (Shapes), Visium_HD_Mouse_Brain_square_016um (Shapes)
▸ 'Visium_HD_Mouse_Brain_downscaled_lowres', with elements:
Visium_HD_Mouse_Brain_lowres_image (Images), Visium_HD_Mouse_Brain_cell_segmentations (Shapes), Visium_HD_Mouse_Brain_square_002um (Shapes), Visium_HD_Mouse_Brain_square_008um (Shapes), Visium_HD_Mouse_Brain_square_016um (Shapes)
For downstream analysis, see the Visium HD FFPE tutorial.
10x Visium FFPE (v2 CytAssist)#
Standard-resolution Visium CytAssist FFPE uses a probe-based capture workflow.
Space Ranger (v3.0+) outputs a raw_probe_bc_matrix.h5 file containing per-probe, per-barcode counts
that SPLISOSM uses directly for probe usage testing.
Running Space Ranger count
See the 10x documentation for detailed instructions. Example Space Ranger command:
spaceranger count \
--id="CytAssist_FFPE_Mouse_Brain_Rep1" \
--transcriptome=refdata-gex-mm10-2020-A \
--probe-set=CytAssist_FFPE_Mouse_Brain_Rep1_probe_set.csv \
--fastqs=path/to/fastqs \
--cytaimage=CytAssist_FFPE_Mouse_Brain_Rep1_image.tif \
--image=CytAssist_FFPE_Mouse_Brain_Rep1_tissue_image.tif \
--loupe-alignment=CytAssist_FFPE_Mouse_Brain_Rep1_alignment_file.json \
--slide=V42A20-353 \
--area=A1 \
--create-bam=false \
--localcores=16 \
--localmem=64
Loading as AnnData/SpatialData
from splisosm.io import load_visium_probe
adata = load_visium_probe(
"CytAssist_FFPE_Mouse_Brain_Rep1/outs",
counts_file="raw_probe_bc_matrix.h5", # probe-level counts (default)
counts_layer_name="counts",
)
The loaded AnnData has the following structure:
AnnData object with n_obs x n_vars = 4992 x 21178
obs: 'filtered_barcodes', 'in_tissue', 'array_row', 'array_col'
var: 'gene_ids', 'probe_ids', 'feature_types', 'filtered_probes', 'gene_name', 'genome', 'probe_region'
uns: 'spatial'
obsm: 'spatial'
layers: 'counts'
Each row is a Visium spot (barcode) and each column is an individual probe.
The gene_ids column in .var groups probes by gene — pass group_iso_by="gene_ids"
and gene_names="gene_name" to setup_data().
To load as a SpatialData object (e.g., for SplisosmFFT), pass return_type="spatialdata":
sdata = load_visium_probe(
"CytAssist_FFPE_Mouse_Brain_Rep1/outs",
return_type="spatialdata",
)
Note
On the Mouse Brain Coronal Section 1 (FFPE) dataset,
281 genes with multiple probes passed filtering (min_counts=10, filter_single_iso_genes=True),
of which 83 were identified as spatially variably processed (SVP, HSIC-IR) at FDR < 0.01.
For downstream analysis comparing SplisosmNP (AnnData) and SplisosmFFT (SpatialData) spatial variability tests using this dataset,
see the Visium FFPE tutorial.
In situ targeted ST data#
For imaging-based platforms with exon- or junction-specific probes (e.g., 10x Xenium Prime 5K), SPLISOSM uses codeword-level counts as isoform proxies. Data can be analyzed at single-cell resolution (segmented cells) or on spatially binned spots.
Tested platform:
10x Xenium Prime 5K
Given Xenium Ranger output, the following code creates a binned SpatialData object with codeword-level counts (from transcripts.zarr.zip):
from splisosm.io import load_xenium_codeword
sdata = load_xenium_codeword(
path=xenium_ranger_outs,
spatial_resolutions=[8.0, 16.0], # None or [] if you only want segmented cell-level data
quality_threshold=20.0,
chunk_batch_size=64,
counts_layer_name="counts",
build_cell_codeword_table=True,
create_square_shapes=True,
)
Note
If your Xenium Output Bundle was generated by older Xenium Ranger versions before v3.1.0, please re-run the
Xenium Ranger relabel pipeline
to get an updated transcripts.zarr.zip file.
The generated SpatialData object has the following structure (example):
SpatialData object
├── Images
│ └── 'morphology_focus': DataTree[cyx] (4, 23912, 34154), (4, 11956, 17077), (4, 5978, 8538), (4, 2989, 4269), (4, 1494, 2134)
├── Labels
│ ├── 'cell_labels': DataTree[yx] (23912, 34154), (11956, 17077), (5978, 8538), (2989, 4269), (1494, 2134)
│ └── 'nucleus_labels': DataTree[yx] (23912, 34154), (11956, 17077), (5978, 8538), (2989, 4269), (1494, 2134)
├── Points
│ └── 'transcripts': DataFrame with shape: (<Delayed>, 13) (3D points)
├── Shapes
│ ├── 'cell_boundaries': GeoDataFrame shape: (63173, 1) (2D shapes)
│ ├── 'nucleus_boundaries': GeoDataFrame shape: (63036, 1) (2D shapes)
│ ├── 'square_008um_bins': GeoDataFrame shape: (576580, 1) (2D shapes)
│ └── 'square_016um_bins': GeoDataFrame shape: (144372, 1) (2D shapes)
└── Tables
├── 'square_008um': AnnData (576580, 11163)
├── 'square_016um': AnnData (144372, 11163)
├── 'table': AnnData (63173, 5006)
└── 'table_codeword': AnnData (63173, 11163)
with coordinate systems:
▸ 'global', with elements:
morphology_focus (Images), cell_labels (Labels), nucleus_labels (Labels), transcripts (Points), cell_boundaries (Shapes), nucleus_boundaries (Shapes), square_008um_bins (Shapes), square_016um_bins (Shapes)
For downstream analysis, see the Xenium Prime 5K tutorial.