Feature Quantification

Feature Quantification#

This page describes how to obtain probe/peak/isoform-level quantification for various spatial transcriptomics (ST) platforms as input to SPLISOSM. If you already have a compatible AnnData or SpatialData object, skip ahead to the expected data format.

Platform overview#

The table below summarizes the supported ST platforms, the type of isoform feature used, and the recommended quantification approach.

Platform type	Example platforms	Feature	Quantification approach
Long-read ST	ONT + Visium (e.g., SiT [LBT+23])	Full-length transcript isoform	Any long-read aligner/quantifier (e.g., IsoQuant, kallisto)
Short-read 3' end	10x Visium/Visium HD (fresh-frozen), Slide-seqV2	3' end diversity (TREND) event (peak)	Sierra for de novo peak calling
Short-read targeted	10x Visium/Visium HD (FFPE), 10x Flex	Exon/junction probe	Space Ranger output directly (`raw_probe_bc_matrix.h5`)
In situ targeted	10x Xenium Prime 5K	Exon/junction probe (codeword)	Xenium Ranger output directly (`transcripts.zarr.zip`)

Expected data format#

SPLISOSM expects isoform-level data in an AnnData of shape (n_spots, n_isoforms):

adata.layers['counts']: raw isoform counts.
adata.var: isoform metadata with at least:
- gene_symbol/gene_id: gene assignment for each isoform.
- A unique feature identifier in adata.var_names (e.g., transcript ID, peak ID, codeword ID).
Spatial coordinates in one of:
- adata.obsm['spatial']: preferred; a (n_spots, 2) array with [x (col), y (row)] columns, compatible with Scanpy/Squidpy conventions.
- adata.obs[['array_row', 'array_col']]: legacy pixel-coordinate format.
See splisosm.utils.prepare_inputs_from_anndata() for parsing details.

The spot-by-isoform AnnData can also be a table of a SpatialData object, which is the required input format for splisosm.SplisosmFFT (FFT-accelerated tests). In such cases, we rely on spatialdata.rasterize_bins to rasterize counts into square bins (i.e., padding unobserved spots with zeros) to improve memory efficiency and computation speed. See splisosm.SplisosmFFT.setup_data() for details.

Long-read ST data#

SPLISOSM is compatible with any long-read quantification tool that produces an isoform-by-spot count matrix. Popular options include IsoQuant and kallisto.

Note

Detection power scales linearly with sequencing depth (FAQ). Any processing choice that increases captured UMIs per spot will improve results.

Tested platforms:

10x Visium + ONT (SiT [LBT+23])
10x Visium HD + ONT (the EPI2ME dataset)

Visium HD ONT tutorial shows an example workflow of loading preprocessed ONT + Visium HD data and running SPLISOSM. The dataset is publicly accessible and can be downloaded from EPI2ME. Transcript assignment and quantification were performed using minimap2, Stringtie, and FLAMES. See the EPI2ME workflow documentation for details. The generated SpatialData object has the following structure (example):

SpatialData object, with associated Zarr store: /Users/jysumac/Projects/SPLISOSM_paper/data/visiumhd_ont_mouse_cbs/sdata_tx.filtered.zarr
├── Images
│     ├── '_hires_image': DataArray[cyx] (3, 3000, 3200)
│     └── '_lowres_image': DataArray[cyx] (3, 563, 600)
├── Shapes
│     ├── '_square_002um': GeoDataFrame shape: (7857218, 1) (2D shapes)
│     ├── '_square_008um': GeoDataFrame shape: (492663, 1) (2D shapes)
│     └── '_square_016um': GeoDataFrame shape: (123658, 1) (2D shapes)
└── Tables
      ├── 'square_002um': AnnData (7857218, 48001)
      ├── 'square_008um': AnnData (492663, 48001)
      └── 'square_016um': AnnData (123658, 48001)
with coordinate systems:
    ▸ '', with elements:
        _hires_image (Images), _lowres_image (Images), _square_002um (Shapes), _square_008um (Shapes), _square_016um (Shapes)
    ▸ '_downscaled_hires', with elements:
        _hires_image (Images), _square_002um (Shapes), _square_008um (Shapes), _square_016um (Shapes)
    ▸ '_downscaled_lowres', with elements:
        _lowres_image (Images), _square_002um (Shapes), _square_008um (Shapes), _square_016um (Shapes)

Short-read 3’ end ST data#

Use Sierra to call 3’ end diversity (TREND) events de novo from Space Ranger BAM files. See the Sierra vignette for installation and usage details.

Tested platforms:

10x Visium (fresh-frozen)
Slide-seqV2
10x Visium HD 3’ (fresh-frozen)

Warning

When processing multiple samples, avoid Sierra’s MergePeakCoordinates function — it can produce overlapping peak definitions.

Note

There is a known issue with running Sierra’s CountPeaks on Visium HD 3' data. See the workaround workflow below and under scripts/visiumhd_3p_trend_quant.sh for details.

Visium HD 3’ data (custom quantification)#

For 10x Visium HD 3’ data, we recommend running a hybrid workflow:

Run spaceranger count (>=v4.0) with create-bam=true to get the BAM file.
Run Sierra FindPeaks (plus AnnotatePeaksFromGTF) on the Space Ranger BAM.
Convert peak definitions to SAF/BED and run featureCounts -R BAM to add peak-level tags using Subread.
Count UMIs per peak per barcode with umi_tools count.
Build a 10x-compatible feature-barcode matrix and save as raw_probe_bc_matrix.h5 under the output directory outs/binned_outputs/square_002um.
Load peak-level data into SpatialData with splisosm.io.load_visiumhd_probe().

An end-to-end implementation of the above quantification workflow is available in scripts/visiumhd_3p_trend_quant.bash under the scripts directory. The generated SpatialData object has the following structure (example):

SpatialData object, with associated Zarr store: /Users/jysumac/Projects/SPLISOSM_paper/data/visiumhd_3p_mouse_cbs/sdata_peak.filtered.zarr
├── Images
│     ├── '_hires_image': DataArray[cyx] (3, 5492, 6000)
│     └── '_lowres_image': DataArray[cyx] (3, 549, 600)
├── Shapes
│     ├── '_cell_segmentations': GeoDataFrame shape: (84031, 2) (2D shapes)
│     ├── '_square_002um': GeoDataFrame shape: (5998466, 1) (2D shapes)
│     ├── '_square_008um': GeoDataFrame shape: (376419, 1) (2D shapes)
│     └── '_square_016um': GeoDataFrame shape: (94592, 1) (2D shapes)
└── Tables
      ├── 'cell_segmentations': AnnData (84031, 19575)
      ├── 'square_002um': AnnData (5998466, 19575)
      ├── 'square_008um': AnnData (376419, 19575)
      └── 'square_016um': AnnData (94592, 19575)
with coordinate systems:
    ▸ '', with elements:
        _hires_image (Images), _lowres_image (Images), _cell_segmentations (Shapes), _square_002um (Shapes), _square_008um (Shapes), _square_016um (Shapes)
    ▸ '_downscaled_hires', with elements:
        _hires_image (Images), _cell_segmentations (Shapes), _square_002um (Shapes), _square_008um (Shapes), _square_016um (Shapes)
    ▸ '_downscaled_lowres', with elements:
        _lowres_image (Images), _cell_segmentations (Shapes), _square_002um (Shapes), _square_008um (Shapes), _square_016um (Shapes)

For downstream analysis of TREND spatial patterns, see the Visium HD 3’ tutorial.

Running Sierra on Space Ranger BAM (10x Visium etc.)#

# load the R package
library(Sierra)
FindPeaks(
  output.file = '${peak_file}',
  gtf.file = '${gtf_file}', # SpaceRanger reference gtf file
  bamfile = '${bam_file}',  # bam file from SpaceRanger
  junctions.file = '${junc_file}', # junctions bed files extracted using regtools junctions extract
# optional arguments for retaining low-abundance peaks
#   min.jcutoff.prop = 0.0,
#   min.cov.prop = 0.0,
#   min.peak.prop = 0.0
)

CountPeaks(
  peak.sites.file = '${peak_file}',
  gtf.file = '${gtf_file}',
  bamfile = '${bam_file}',
  whitelist.file = '${whitelist_file}', # barcodes.tsv file from SpaceRanger
  output.dir = '${output_dir}',
)

Converting Sierra output to AnnData#

import scanpy as sc
import pandas as pd
from splisosm.utils import load_visium_sp_meta

sierra_out_dir = "path/to/sierra/output"  # 'output.dir' in CountPeaks
sp_meta_dir = "path/to/visium/spatial/metadata"  # 'spatial' directory in 10X Visium data

# load the Sierra outputs as an AnnData object
adata = sc.read(
    f"{sierra_out_dir}/matrix.mtx.gz",
    cache_compression='cache_compression',
).T

# load TREND peak metadata
peaks = pd.read_csv(
    f"{sierra_out_dir}/sitenames.tsv.gz",
    header=None,
    sep="\t",
)
df_var = peaks[0].str.split(':', expand=True)
df_var.columns = ['gene_symbol', 'chr', 'position', 'strand']
df_var.index = peaks[0].values

# load spatial barcode metadata
barcodes = pd.read_csv(f"{sierra_out_dir}/barcodes.tsv.gz", header=None)

# add metadata to the AnnData object
adata.var_names = peaks[0].values
adata.obs_names = barcodes[0].values
adata.var = df_var
adata.var['gene_id'] = adata.var['gene_symbol']

# load Visium spatial metadata
adata = load_visium_sp_meta(adata, f"{sp_meta_dir}/", library_id='adata_peak')
adata = adata[adata.obs['in_tissue'].astype(bool), :].copy()

SPLISOSM compares all events associated with the same gene and is agnostic to their specific structure. Filter out low-abundance features before testing to reduce computation and improve power.

# filter out lowly expressed peaks
sc.pp.filter_genes(adata, min_cells=0.01 * adata.shape[0])

# extract gene symbols and peak ids
df_iso_meta = adata.var.copy()  # gene_symbol, chr, position, strand, gene_id
df_iso_meta['peak_id'] = adata.var_names

# prepare gene-level metadata
df_gene_meta = df_iso_meta.groupby('gene_symbol').size().reset_index(name='n_peak')
df_gene_meta = df_gene_meta.set_index('gene_symbol')

print(f"Number of spots: {adata.shape[0]}")
print(f"Number of genes before QC: {df_gene_meta.shape[0]}")
print(f"Number of peaks before QC: {adata.shape[1]}")
print(f"Average number of peaks per gene before QC: {adata.shape[1] / df_gene_meta.shape[0]}")

# calculate the total counts per gene
mapping_matrix = pd.get_dummies(df_iso_meta['gene_symbol'])
mapping_matrix = mapping_matrix.loc[df_iso_meta.index, df_gene_meta.index]
isog_counts = adata[:, mapping_matrix.index].layers['counts'] @ mapping_matrix

# calculate mean and sd of total gene counts
df_gene_meta['pct_spot_on'] = (isog_counts > 0).mean(axis=0)
df_gene_meta['count_avg'] = isog_counts.mean(axis=0)
df_gene_meta['count_std'] = isog_counts.std(axis=0)

# filter out lowly expressed genes
_gene_keep = df_gene_meta['pct_spot_on'] > 0.01
# _gene_keep = (df_gene_meta['count_avg'] > 0.5) & _gene_keep

# filter out genes with single isoform
_gene_keep = (df_gene_meta['n_peak'] > 1) & _gene_keep

# filter for isoforms
_iso_keep = df_iso_meta['gene_symbol'].isin(df_gene_meta.index[_gene_keep])

# update feature meta
df_gene_meta = df_gene_meta.loc[_gene_keep, :]
adata = adata[:, _iso_keep]
adata.var = df_iso_meta.loc[_iso_keep, :].copy()

print(f"Number of genes after QC: {sum(_gene_keep)}")
print(f"Number of peaks after QC: {sum(_iso_keep)}")
print(f"Average number of peaks per gene after QC: {sum(_iso_keep) / sum(_gene_keep)}")

Short-read targeted ST data#

10x Visium HD FFPE uses a fixed pan-genome probe for read enrichment. SPLISOSM treats probe-level counts as the isoform-level quantification — probes targeting the same gene are grouped and tested jointly.

Tested platform:

10x Visium HD FFPE

Given Space Ranger output, the following code creates a SpatialData object with probe-level counts (from raw_probe_bc_matrix.h5):

from splisosm.io import load_visiumhd_probe

sdata = load_visiumhd_probe(
  path=visium_hd_outs,
  bin_sizes=[2, 8, 16],
  filtered_counts_file=True,
  load_all_images=False,
  var_names_make_unique=True,
  counts_layer_name="counts",
)

Note

splisosm.io.load_visiumhd_probe() uses the barcode_mappings.parquet file, which contains the spatial mapping information of 2um barcodes to coarser bins and segmented cells. See 10x documentation for details. If you don’t have this file, please re-run the Space Ranger count pipeline with Space Ranger v4.0+.

The generated SpatialData object has the following structure (example):

SpatialData object
├── Images
│     ├── 'Visium_HD_Mouse_Brain_full_image': DataTree[cyx] (3, 23947, 18872), (3, 11973, 9436), (3, 5986, 4718), (3, 2993, 2359), (3, 1496, 1179)
│     ├── 'Visium_HD_Mouse_Brain_hires_image': DataArray[cyx] (3, 6000, 4729)
│     └── 'Visium_HD_Mouse_Brain_lowres_image': DataArray[cyx] (3, 600, 473)
├── Shapes
│     ├── 'Visium_HD_Mouse_Brain_cell_segmentations': GeoDataFrame shape: (40222, 2) (2D shapes)
│     ├── 'Visium_HD_Mouse_Brain_square_002um': GeoDataFrame shape: (6296688, 1) (2D shapes)
│     ├── 'Visium_HD_Mouse_Brain_square_008um': GeoDataFrame shape: (393543, 1) (2D shapes)
│     └── 'Visium_HD_Mouse_Brain_square_016um': GeoDataFrame shape: (98917, 1) (2D shapes)
└── Tables
      ├── 'cell_segmentations': AnnData (40222, 55538)
      ├── 'square_002um': AnnData (6296688, 55538)
      ├── 'square_008um': AnnData (393543, 55538)
      └── 'square_016um': AnnData (98917, 55538)
with coordinate systems:
    ▸ 'Visium_HD_Mouse_Brain', with elements:
        Visium_HD_Mouse_Brain_full_image (Images), Visium_HD_Mouse_Brain_hires_image (Images), Visium_HD_Mouse_Brain_lowres_image (Images), Visium_HD_Mouse_Brain_cell_segmentations (Shapes), Visium_HD_Mouse_Brain_square_002um (Shapes), Visium_HD_Mouse_Brain_square_008um (Shapes), Visium_HD_Mouse_Brain_square_016um (Shapes)
    ▸ 'Visium_HD_Mouse_Brain_downscaled_hires', with elements:
        Visium_HD_Mouse_Brain_hires_image (Images), Visium_HD_Mouse_Brain_cell_segmentations (Shapes), Visium_HD_Mouse_Brain_square_002um (Shapes), Visium_HD_Mouse_Brain_square_008um (Shapes), Visium_HD_Mouse_Brain_square_016um (Shapes)
    ▸ 'Visium_HD_Mouse_Brain_downscaled_lowres', with elements:
        Visium_HD_Mouse_Brain_lowres_image (Images), Visium_HD_Mouse_Brain_cell_segmentations (Shapes), Visium_HD_Mouse_Brain_square_002um (Shapes), Visium_HD_Mouse_Brain_square_008um (Shapes), Visium_HD_Mouse_Brain_square_016um (Shapes)

For downstream analysis, see the Visium HD FFPE tutorial.

In situ targeted ST data#

For imaging-based platforms with exon- or junction-specific probes (e.g., 10x Xenium Prime 5K), SPLISOSM uses codeword-level counts as isoform proxies. Data can be analyzed at single-cell resolution (segmented cells) or on spatially binned spots.

Tested platform:

10x Xenium Prime 5K

Given Xenium Ranger output, the following code creates a binned SpatialData object with codeword-level counts (from transcripts.zarr.zip):

from splisosm.io import load_xenium_codeword

sdata = load_xenium_codeword(
  path=xenium_ranger_outs,
  spatial_resolutions=[8.0, 16.0],
  quality_threshold=20.0,
  n_jobs=-1,
  chunk_batch_size=64,
  counts_layer_name="counts",
  build_cell_codeword_table=True,
  create_square_shapes=True,
)

Note

If your Xenium Output Bundle was generated by older Xenium Ranger versions before v3.1.0, please re-run the Xenium Ranger relabel pipeline to get an updated transcripts.zarr.zip file.

The generated SpatialData object has the following structure (example):

SpatialData object
├── Images
│     └── 'morphology_focus': DataTree[cyx] (4, 23912, 34154), (4, 11956, 17077), (4, 5978, 8538), (4, 2989, 4269), (4, 1494, 2134)
├── Labels
│     ├── 'cell_labels': DataTree[yx] (23912, 34154), (11956, 17077), (5978, 8538), (2989, 4269), (1494, 2134)
│     └── 'nucleus_labels': DataTree[yx] (23912, 34154), (11956, 17077), (5978, 8538), (2989, 4269), (1494, 2134)
├── Points
│     └── 'transcripts': DataFrame with shape: (<Delayed>, 13) (3D points)
├── Shapes
│     ├── 'cell_boundaries': GeoDataFrame shape: (63173, 1) (2D shapes)
│     ├── 'nucleus_boundaries': GeoDataFrame shape: (63036, 1) (2D shapes)
│     ├── 'square_008um_bins': GeoDataFrame shape: (576580, 1) (2D shapes)
│     └── 'square_016um_bins': GeoDataFrame shape: (144372, 1) (2D shapes)
└── Tables
      ├── 'square_008um': AnnData (576580, 11163)
      ├── 'square_016um': AnnData (144372, 11163)
      ├── 'table': AnnData (63173, 5006)
      └── 'table_codeword': AnnData (63173, 11163)
with coordinate systems:
    ▸ 'global', with elements:
        morphology_focus (Images), cell_labels (Labels), nucleus_labels (Labels), transcripts (Points), cell_boundaries (Shapes), nucleus_boundaries (Shapes), square_008um_bins (Shapes), square_016um_bins (Shapes)

For downstream analysis, see the Xenium Prime 5K tutorial.

Feature Quantification

Contents

Feature Quantification#

Platform overview#

Expected data format#

Long-read ST data#

Short-read 3’ end ST data#

Visium HD 3’ data (custom quantification)#

Running Sierra on Space Ranger BAM (10x Visium etc.)#

Converting Sierra output to AnnData#

Short-read targeted ST data#

In situ targeted ST data#