Feature Quantification#
This page describes how to obtain probe/peak/isoform-level quantification for various spatial transcriptomics (ST) platforms as input to SPLISOSM.
If you already have a compatible AnnData or SpatialData object, skip ahead to the expected data format.
Platform overview#
The table below summarizes the supported ST platforms, the type of isoform feature used, and the recommended quantification approach.
Platform type |
Example platforms |
Feature |
Quantification approach |
|---|---|---|---|
Long-read ST |
ONT + Visium (e.g., SiT [LBT+23]) |
Full-length transcript isoform |
|
Short-read 3' end |
10x Visium/Visium HD (fresh-frozen), Slide-seqV2 |
3' end diversity (TREND) event (peak) |
Sierra for de novo peak calling |
Short-read targeted |
10x Visium/Visium HD (FFPE), 10x Flex |
Exon/junction probe |
Space Ranger output directly ( |
In situ targeted |
10x Xenium Prime 5K |
Exon/junction probe (codeword) |
Xenium Ranger output directly ( |
Expected data format#
SPLISOSM expects isoform-level data in an AnnData of shape (n_spots, n_isoforms):
.layers[<layer>]: raw isoform counts..var: isoform metadata with at least a<group_iso_by>column for gene assignment.Spatial coordinates in one of:
.obsm[<spatial_key>]: a(n_spots, n_dim)array with 2/3/more columns. Scanpy/Squidpy conventions are'spatial'..obsp[<adj_key>]: a custom(n_spots, n_spots)adjacency matrix (e.g.,'connectivities'from callingscanpy.pp.neighbors).
See
splisosm.utils.prepare_inputs_from_anndata()for parsing details.
The spot-by-isoform AnnData can also be a table of a SpatialData object, which is the required input format for splisosm.SplisosmFFT (FFT-accelerated tests).
In such cases, we rely on spatialdata.rasterize_bins to
rasterize counts into square bins (i.e., padding unobserved spots with zeros) to improve memory efficiency and computation speed.
The minimum required structure of the SpatialData object is as follows:
sdata.shapes[<bins>]: aGeoDataFrameofn_binselements containing the geometry of each bin (e.g., square polygons).sdata.tables[<table_name>]: anAnnDataof shape(n_bins, n_isoforms)with.layers[<layer>]: raw isoform counts.var: isoform metadata with at least a<group_iso_by>column for gene assignment..obs: bin metadata with[<row_key>, <col_key>]for row and column indices (e.g.,'array_row','array_col') used during rasterization..uns['spatialdata_attrs']: a dictionary with keysspatial_key,row_key, andcol_keyindicating the spatial mapping information of bins. See the SpatialData documentation.
Long-read ST data#
SPLISOSM is compatible with any long-read quantification tool that produces an isoform-by-spot count matrix. Popular options include IsoQuant and kallisto.
Note
Detection power scales linearly with sequencing depth (FAQ). Any processing choice that increases captured UMIs per spot will improve results.
Tested platforms:
10x Visium + ONT (SiT [LBT+23])
10x Visium HD + ONT (the EPI2ME dataset)
The Visium HD ONT tutorial shows how to load a preprocessed ONT + Visium HD dataset and run SPLISOSM.
The dataset is publicly accessible and can be downloaded from EPI2ME.
Transcript assignment and quantification were performed using minimap2, Stringtie, and FLAMES. See the EPI2ME workflow documentation for details.
The generated SpatialData object has the following structure (example):
SpatialData object
├── Images
│ ├── '_hires_image': DataArray[cyx] (3, 3000, 3200)
│ └── '_lowres_image': DataArray[cyx] (3, 563, 600)
├── Shapes
│ ├── '_square_002um': GeoDataFrame shape: (7857218, 1) (2D shapes)
│ ├── '_square_008um': GeoDataFrame shape: (492663, 1) (2D shapes)
│ └── '_square_016um': GeoDataFrame shape: (123658, 1) (2D shapes)
└── Tables
├── 'square_002um': AnnData (7857218, 48001)
├── 'square_008um': AnnData (492663, 48001)
└── 'square_016um': AnnData (123658, 48001)
with coordinate systems:
▸ '', with elements:
_hires_image (Images), _lowres_image (Images), _square_002um (Shapes), _square_008um (Shapes), _square_016um (Shapes)
▸ '_downscaled_hires', with elements:
_hires_image (Images), _square_002um (Shapes), _square_008um (Shapes), _square_016um (Shapes)
▸ '_downscaled_lowres', with elements:
_lowres_image (Images), _square_002um (Shapes), _square_008um (Shapes), _square_016um (Shapes)
Short-read 3’ end ST data#
We recommend using Sierra to call 3’ end diversity (TREND) events de novo from Space Ranger BAM files. See the Sierra vignette for installation and usage details.
Tested platforms:
10x Visium 3’ (fresh-frozen)
Slide-seqV2
10x Visium HD 3’ (fresh-frozen)
Warning
When processing multiple samples, avoid Sierra’s MergePeakCoordinates function — it can produce overlapping peak definitions.
Note
If you encounter issues when running Sierra’s CountPeaks, the following workaround quantification workflow may be helpful:
Visium HD 3’ data (custom quantification)#
For 10x Visium HD 3’ data, we have prepared a hybrid quantification workflow where Sierra is used for peak calling but not counting. See the attached bash script (scripts/visiumhd_3p_trend_quant.sh).
Step-by-step workflow explained (click to expand)
Run
spaceranger count(>=v4.0) withcreate-bam=trueto get the BAM file.Run Sierra
FindPeaks(plusAnnotatePeaksFromGTF) on the Space Ranger BAM.Convert peak definitions to SAF/BED and run
featureCounts -R BAMto add peak-level tags using Subread.Count UMIs per peak per barcode with
umi_tools count.Build a 10x-compatible feature-barcode matrix and save as
raw_probe_bc_matrix.h5under the output directoryouts/binned_outputs/square_002um.Load peak-level data into
SpatialDatawithsplisosm.io.load_visiumhd_probe().
The generated SpatialData object has the following structure (example):
SpatialData object
├── Images
│ ├── '_hires_image': DataArray[cyx] (3, 5492, 6000)
│ └── '_lowres_image': DataArray[cyx] (3, 549, 600)
├── Shapes
│ ├── '_cell_segmentations': GeoDataFrame shape: (84031, 2) (2D shapes)
│ ├── '_square_002um': GeoDataFrame shape: (5998466, 1) (2D shapes)
│ ├── '_square_008um': GeoDataFrame shape: (376419, 1) (2D shapes)
│ └── '_square_016um': GeoDataFrame shape: (94592, 1) (2D shapes)
└── Tables
├── 'cell_segmentations': AnnData (84031, 19575)
├── 'square_002um': AnnData (5998466, 19575)
├── 'square_008um': AnnData (376419, 19575)
└── 'square_016um': AnnData (94592, 19575)
with coordinate systems:
▸ '', with elements:
_hires_image (Images), _lowres_image (Images), _cell_segmentations (Shapes), _square_002um (Shapes), _square_008um (Shapes), _square_016um (Shapes)
▸ '_downscaled_hires', with elements:
_hires_image (Images), _cell_segmentations (Shapes), _square_002um (Shapes), _square_008um (Shapes), _square_016um (Shapes)
▸ '_downscaled_lowres', with elements:
_lowres_image (Images), _cell_segmentations (Shapes), _square_002um (Shapes), _square_008um (Shapes), _square_016um (Shapes)
For downstream analysis of TREND spatial patterns, see the Visium HD 3’ tutorial.
Visium/Slide-seqV2 3’ data (Sierra quantification)#
Running Sierra on Space Ranger BAM (click to expand)
# load the R package
library(Sierra)
FindPeaks(
output.file = '${peak_file}',
gtf.file = '${gtf_file}', # SpaceRanger reference gtf file
bamfile = '${bam_file}', # bam file from SpaceRanger
junctions.file = '${junc_file}', # junctions bed files extracted using regtools junctions extract
# optional arguments for retaining low-abundance peaks
# min.jcutoff.prop = 0.0,
# min.cov.prop = 0.0,
# min.peak.prop = 0.0
)
CountPeaks(
peak.sites.file = '${peak_file}',
gtf.file = '${gtf_file}',
bamfile = '${bam_file}',
whitelist.file = '${whitelist_file}', # barcodes.tsv file from SpaceRanger
output.dir = '${output_dir}',
)
Converting Sierra output to AnnData (click to expand)
import scanpy as sc
import pandas as pd
from splisosm.io import load_visium_sp_meta
sierra_out_dir = "path/to/sierra/output" # 'output.dir' in CountPeaks
sp_meta_dir = "path/to/visium/spatial/metadata" # 'spatial' directory in 10X Visium data
# load the Sierra outputs as an AnnData object
adata = sc.read(
f"{sierra_out_dir}/matrix.mtx.gz",
cache_compression='cache_compression',
).T
# load TREND peak metadata
peaks = pd.read_csv(
f"{sierra_out_dir}/sitenames.tsv.gz",
header=None,
sep="\t",
)
df_var = peaks[0].str.split(':', expand=True)
df_var.columns = ['gene_symbol', 'chr', 'position', 'strand']
df_var.index = peaks[0].values
# load spatial barcode metadata
barcodes = pd.read_csv(f"{sierra_out_dir}/barcodes.tsv.gz", header=None)
# add metadata to the AnnData object
adata.var_names = peaks[0].values
adata.obs_names = barcodes[0].values
adata.var = df_var
adata.var['gene_id'] = adata.var['gene_symbol']
# load Visium spatial metadata
adata = load_visium_sp_meta(adata, f"{sp_meta_dir}/", library_id='adata_peak')
adata = adata[adata.obs['in_tissue'].astype(bool), :].copy()
Custom AnnData filtering (click to expand)
SPLISOSM compares all events associated with the same gene and is agnostic to their specific structure. Filter out low-abundance features before testing to reduce computation and improve power.
# filter out lowly expressed peaks
sc.pp.filter_genes(adata, min_cells=0.01 * adata.shape[0])
# extract gene symbols and peak ids
df_iso_meta = adata.var.copy() # gene_symbol, chr, position, strand, gene_id
df_iso_meta['peak_id'] = adata.var_names
# prepare gene-level metadata
df_gene_meta = df_iso_meta.groupby('gene_symbol').size().reset_index(name='n_peak')
df_gene_meta = df_gene_meta.set_index('gene_symbol')
print(f"Number of spots: {adata.shape[0]}")
print(f"Number of genes before QC: {df_gene_meta.shape[0]}")
print(f"Number of peaks before QC: {adata.shape[1]}")
print(f"Average number of peaks per gene before QC: {adata.shape[1] / df_gene_meta.shape[0]}")
# calculate the total counts per gene
mapping_matrix = pd.get_dummies(df_iso_meta['gene_symbol'])
mapping_matrix = mapping_matrix.loc[df_iso_meta.index, df_gene_meta.index]
isog_counts = adata[:, mapping_matrix.index].layers['counts'] @ mapping_matrix
# calculate mean and sd of total gene counts
df_gene_meta['pct_spot_on'] = (isog_counts > 0).mean(axis=0)
df_gene_meta['count_avg'] = isog_counts.mean(axis=0)
df_gene_meta['count_std'] = isog_counts.std(axis=0)
# filter out lowly expressed genes
_gene_keep = df_gene_meta['pct_spot_on'] > 0.01
# _gene_keep = (df_gene_meta['count_avg'] > 0.5) & _gene_keep
# filter out genes with single isoform
_gene_keep = (df_gene_meta['n_peak'] > 1) & _gene_keep
# filter for isoforms
_iso_keep = df_iso_meta['gene_symbol'].isin(df_gene_meta.index[_gene_keep])
# update feature meta
df_gene_meta = df_gene_meta.loc[_gene_keep, :]
adata = adata[:, _iso_keep]
adata.var = df_iso_meta.loc[_iso_keep, :].copy()
print(f"Number of genes after QC: {sum(_gene_keep)}")
print(f"Number of peaks after QC: {sum(_iso_keep)}")
print(f"Average number of peaks per gene after QC: {sum(_iso_keep) / sum(_gene_keep)}")
Short-read targeted ST data#
10x Visium HD FFPE uses a fixed pan-genome probe for read enrichment. SPLISOSM treats probe-level counts as the isoform-level quantification — probes targeting the same gene are grouped and tested jointly.
Tested platform:
10x Visium HD FFPE
Given Space Ranger output, the following code creates a SpatialData object with probe-level counts (from raw_probe_bc_matrix.h5):
from splisosm.io import load_visiumhd_probe
sdata = load_visiumhd_probe(
path=visium_hd_outs,
bin_sizes=[2, 8, 16],
filtered_counts_file=True,
load_all_images=False,
var_names_make_unique=True,
counts_layer_name="counts",
)
Note
splisosm.io.load_visiumhd_probe() uses the barcode_mappings.parquet file,
which contains the spatial mapping information of 2um barcodes to coarser bins and segmented cells.
See 10x documentation for details.
If you don’t have this file, please re-run the
Space Ranger count pipeline
with Space Ranger v4.0+.
The generated SpatialData object has the following structure (example):
SpatialData object
├── Images
│ ├── 'Visium_HD_Mouse_Brain_full_image': DataTree[cyx] (3, 23947, 18872), (3, 11973, 9436), (3, 5986, 4718), (3, 2993, 2359), (3, 1496, 1179)
│ ├── 'Visium_HD_Mouse_Brain_hires_image': DataArray[cyx] (3, 6000, 4729)
│ └── 'Visium_HD_Mouse_Brain_lowres_image': DataArray[cyx] (3, 600, 473)
├── Shapes
│ ├── 'Visium_HD_Mouse_Brain_cell_segmentations': GeoDataFrame shape: (40222, 2) (2D shapes)
│ ├── 'Visium_HD_Mouse_Brain_square_002um': GeoDataFrame shape: (6296688, 1) (2D shapes)
│ ├── 'Visium_HD_Mouse_Brain_square_008um': GeoDataFrame shape: (393543, 1) (2D shapes)
│ └── 'Visium_HD_Mouse_Brain_square_016um': GeoDataFrame shape: (98917, 1) (2D shapes)
└── Tables
├── 'cell_segmentations': AnnData (40222, 55538)
├── 'square_002um': AnnData (6296688, 55538)
├── 'square_008um': AnnData (393543, 55538)
└── 'square_016um': AnnData (98917, 55538)
with coordinate systems:
▸ 'Visium_HD_Mouse_Brain', with elements:
Visium_HD_Mouse_Brain_full_image (Images), Visium_HD_Mouse_Brain_hires_image (Images), Visium_HD_Mouse_Brain_lowres_image (Images), Visium_HD_Mouse_Brain_cell_segmentations (Shapes), Visium_HD_Mouse_Brain_square_002um (Shapes), Visium_HD_Mouse_Brain_square_008um (Shapes), Visium_HD_Mouse_Brain_square_016um (Shapes)
▸ 'Visium_HD_Mouse_Brain_downscaled_hires', with elements:
Visium_HD_Mouse_Brain_hires_image (Images), Visium_HD_Mouse_Brain_cell_segmentations (Shapes), Visium_HD_Mouse_Brain_square_002um (Shapes), Visium_HD_Mouse_Brain_square_008um (Shapes), Visium_HD_Mouse_Brain_square_016um (Shapes)
▸ 'Visium_HD_Mouse_Brain_downscaled_lowres', with elements:
Visium_HD_Mouse_Brain_lowres_image (Images), Visium_HD_Mouse_Brain_cell_segmentations (Shapes), Visium_HD_Mouse_Brain_square_002um (Shapes), Visium_HD_Mouse_Brain_square_008um (Shapes), Visium_HD_Mouse_Brain_square_016um (Shapes)
For downstream analysis, see the Visium HD FFPE tutorial.
In situ targeted ST data#
For imaging-based platforms with exon- or junction-specific probes (e.g., 10x Xenium Prime 5K), SPLISOSM uses codeword-level counts as isoform proxies. Data can be analyzed at single-cell resolution (segmented cells) or on spatially binned spots.
Tested platform:
10x Xenium Prime 5K
Given Xenium Ranger output, the following code creates a binned SpatialData object with codeword-level counts (from transcripts.zarr.zip):
from splisosm.io import load_xenium_codeword
sdata = load_xenium_codeword(
path=xenium_ranger_outs,
spatial_resolutions=[8.0, 16.0], # None or [] if you only want segmented cell-level data
quality_threshold=20.0,
chunk_batch_size=64,
counts_layer_name="counts",
build_cell_codeword_table=True,
create_square_shapes=True,
)
Note
If your Xenium Output Bundle was generated by older Xenium Ranger versions before v3.1.0, please re-run the
Xenium Ranger relabel pipeline
to get an updated transcripts.zarr.zip file.
The generated SpatialData object has the following structure (example):
SpatialData object
├── Images
│ └── 'morphology_focus': DataTree[cyx] (4, 23912, 34154), (4, 11956, 17077), (4, 5978, 8538), (4, 2989, 4269), (4, 1494, 2134)
├── Labels
│ ├── 'cell_labels': DataTree[yx] (23912, 34154), (11956, 17077), (5978, 8538), (2989, 4269), (1494, 2134)
│ └── 'nucleus_labels': DataTree[yx] (23912, 34154), (11956, 17077), (5978, 8538), (2989, 4269), (1494, 2134)
├── Points
│ └── 'transcripts': DataFrame with shape: (<Delayed>, 13) (3D points)
├── Shapes
│ ├── 'cell_boundaries': GeoDataFrame shape: (63173, 1) (2D shapes)
│ ├── 'nucleus_boundaries': GeoDataFrame shape: (63036, 1) (2D shapes)
│ ├── 'square_008um_bins': GeoDataFrame shape: (576580, 1) (2D shapes)
│ └── 'square_016um_bins': GeoDataFrame shape: (144372, 1) (2D shapes)
└── Tables
├── 'square_008um': AnnData (576580, 11163)
├── 'square_016um': AnnData (144372, 11163)
├── 'table': AnnData (63173, 5006)
└── 'table_codeword': AnnData (63173, 11163)
with coordinate systems:
▸ 'global', with elements:
morphology_focus (Images), cell_labels (Labels), nucleus_labels (Labels), transcripts (Points), cell_boundaries (Shapes), nucleus_boundaries (Shapes), square_008um_bins (Shapes), square_016um_bins (Shapes)
For downstream analysis, see the Xenium Prime 5K tutorial.