Quick start demo (Visium ONT)

Quick start demo (Visium ONT)#

This notebook showcases the SPLISOSM spatial isoform analysis workflow on a SiT (Visium-ONT) mouse olfactory bulb (MOB) dataset, which includes the following steps:

Run spatial variability (SV) tests to find genes with variable expression (SVE) or isoform usage (SVP).
Run differential isoform usage (DU) tests against spatial regions to find region markers.
Run differential isoform usage (DU) tests against RNA binding proteins (RBPs) to find potential regulators.

The preprocessed MOB dataset is available for download from Dropbox (~100 MB). For preprocessing details, see related scripts from the SPLISOSM paper.

mob_ont_filtered_1107.h5ad: AnnData object of isoform counts with spot metadata.
mob_visium_rbp_1107.h5ad: AnnData object of RBP gene counts with spot metadata.

Estimated runtime: ~10 min.

Imports#

from pathlib import Path

import warnings
from sklearn.exceptions import ConvergenceWarning
from scipy.stats import spearmanr
import numpy as np
import pandas as pd
import scanpy as sc
import matplotlib.pyplot as plt

from splisosm import SplisosmNP, SplisosmGLMM
from splisosm.utils import run_hsic_gc, add_ratio_layer

# Update this path to your local data folder
DATA_DIR = Path("path/to/data/")
ont_path = DATA_DIR / "mob_ont_filtered_1107.h5ad"
rbp_path = DATA_DIR / "mob_visium_rbp_1107.h5ad"

Load isoform-level `AnnData`#

This MOB dataset contains only 918 spots and 1871 genes. Raw UMI counts, log-normalized counts, and isoform usage ratios are precomputed and stored as separate layers.

adata_ont = sc.read(ont_path)

# # log normalize
# sc.pp.normalize_total(adata_ont, target_sum=1e4)
# sc.pp.log1p(adata_ont)
# adata_ont.layers['log1p'] = adata_ont.X.copy()

# # compute observed isoform ratios
# from splisosm.utils import extract_counts_n_ratios
# _, _, _, ratio_obs = extract_counts_n_ratios(
#     adata_ont, layer = 'counts', group_iso_by = 'gene_symbol'
# )
# adata_ont.layers['ratios_obs'] = ratio_obs.copy() # dense matrix of shape (n_spots, n_isos)

print(f"== Number of spots: {adata_ont.n_obs}")
print(f"== Number of genes: {adata_ont.var['gene_symbol'].nunique()}")
print(f"== Number of isoforms: {adata_ont.shape[1]}")

adata_ont

== Number of spots: 918
== Number of genes: 1871
== Number of isoforms: 4529

AnnData object with n_obs × n_vars = 918 × 4529
    obs: 'orig.ident', 'nCount_RNA', 'nFeature_RNA', 'region', 'AtoI_total', 'AtoI_ratio', 'AtoI_detected', 'in_tissue', 'array_row', 'array_col'
    var: 'features', 'n_cells', 'gene_symbol', 'transcript_id', 'transcript_name'
    uns: 'log1p', 'spatial'
    obsm: 'spatial'
    layers: 'counts', 'log1p', 'ratios_obs'

with plt.rc_context({"figure.figsize": (3, 3)}):
    sc.pl.spatial(adata_ont, color = [None, 'region'], title=['H&E', 'Region'], img_key = 'lowres')

/var/folders/_f/m4v2g8c54gdfks59bp2f2cm80000gn/T/ipykernel_16814/638070984.py:2: FutureWarning: Use `squidpy.pl.spatial_scatter` instead.
  sc.pl.spatial(adata_ont, color = [None, 'region'], title=['H&E', 'Region'], img_key = 'lowres')

../_images/8b5fc94d5b05afbce6f85b57b348f6eec8eaa345ad2d69e8258295f0a6457dca.png

Spatial variability (SV) tests#

We first set up the SplisosmNP model for SV testing using the AnnData as inputs.

%%time
model_sv = SplisosmNP()
model_sv.setup_data(
    adata=adata_ont,
    spatial_key="spatial",
    layer="counts",
    group_iso_by="gene_symbol",
    gene_names="gene_symbol",
    # optionally, remove minor isoforms and single-isoform genes
    min_counts=10,
    min_bin_pct=0.01,
    filter_single_iso_genes=True,
)
model_sv

CPU times: user 132 ms, sys: 21.5 ms, total: 154 ms
Wall time: 147 ms

=== SplisosmNP
- Number of genes: 1871
- Number of spots: 918
- Number of covariates: 0
- Average isoforms per gene: 2.4
=== Model configurations
- Spatial kernel source: spatial_key='spatial'
- k_neighbors: 4, rho: 0.99
- Standardize spatial covariance: True
=== Test results
- Spatial variability: N/A
- Differential usage: N/A

# Top 5 genes with the most effective isoforms
model_sv.extract_feature_summary(level="gene").sort_values(
    "perplexity", ascending=False
).head(5)

Genes: 100%|██████████| 1871/1871 [00:00<00:00, 4178.21it/s]

	n_isos	perplexity	pct_bin_on	count_avg	count_std	major_ratio_avg
gene
Pigx	6	5.524084	0.226580	0.284314	0.589296	0.275862
0610010K14Rik	6	5.430532	0.264706	0.338780	0.639682	0.260450
Rpl41	6	5.096268	0.314815	0.391068	0.642204	0.367688
Mff	7	5.048239	0.285403	0.386710	0.730385	0.433803
Hmga1	6	4.987885	0.159041	0.201525	0.515255	0.281081

Find spatially variably expressed (SVE) genes#

Here we will aggregate isoform counts to gene level and then run the HSIC-GC test to see if the total expression of a gene is spatially variable. Because the dataset is small, exact spatial-kernel cumulants are used by default. For large implicit CAR kernels, set null_configs = {'n_probes': int} in test_spatial_variability to tune the Hutchinson trace budget.

%%time
model_sv.test_spatial_variability(method="hsic-gc", null_method='liu')
sve_res = model_sv.get_formatted_test_results(test_type="sv").sort_values("pvalue")
sve_res.head(5)

SV [hsic-gc]: 100%|██████████| 59/59 [00:00<00:00, 148.33it/s]

CPU times: user 644 ms, sys: 507 ms, total: 1.15 s
Wall time: 716 ms

	gene	statistic	pvalue	pvalue_adj
242	Apoe	0.282076	3.754841e-58	7.025307e-55
979	Tmsb4x	0.584759	1.491054e-47	1.394881e-44
753	Rps5	0.155722	1.683776e-39	1.050115e-36
682	Apod	0.240496	5.437424e-39	2.543355e-36
1177	Fth1	1.767816	6.668350e-38	2.495297e-35

Or alternatively, through a convenient standalone wrapper function splisosm.utils.run_hsic_gc():

%%time
# Aggregate isoform counts to gene level
iso_to_gene = pd.get_dummies(adata_ont.var["gene_symbol"])
gene_counts = adata_ont.layers["counts"].copy() @ iso_to_gene.values # (918, 1871)

# Extract spatial coordinates
coords = adata_ont.obsm["spatial"] # (918, 2)

# Run the HSIC-GC wrapper function splisosm.utils.run_hsic_gc
sve_res_wrapper = run_hsic_gc(
    counts_gene=gene_counts, 
    coordinates=coords,
    null_method='liu',
    # SplisosmNP default configurations for spatial kernel:
    k_neighbors=4, 
    rho=0.99,
    standardize_cov=True,
)
sve_res_wrapper = pd.DataFrame(
    {'gene': iso_to_gene.columns.values} | sve_res_wrapper
).sort_values("pvalue")
sve_res_wrapper.head(5)

Genes: 100%|██████████| 59/59 [00:00<00:00, 515.01it/s]

CPU times: user 696 ms, sys: 33.2 ms, total: 729 ms
Wall time: 504 ms

	gene	statistic	pvalue	method	null_method	n_spots	chunk_size	pvalue_adj
119	Apoe	0.282076	3.754629e-58	hsic-gc	liu	918	32	7.024911e-55
1679	Tmsb4x	0.584759	1.490958e-47	hsic-gc	liu	918	32	1.394791e-44
1387	Rps5	0.155722	1.683676e-39	hsic-gc	liu	918	32	1.050053e-36
118	Apod	0.240496	5.437580e-39	hsic-gc	liu	918	32	2.543428e-36
581	Fth1	1.767816	6.668486e-38	hsic-gc	liu	918	32	2.495347e-35

Find spatially variably processed (SVP) genes#

Here we will run the HSIC-IR test to see if the isoform usage of a gene is spatially variable, independently of its overall expression level.

%%time
model_sv.test_spatial_variability(
    method="hsic-ir", 
    null_method='liu',
    ratio_transformation="none", 
    nan_filling="mean"
)
svp_res = model_sv.get_formatted_test_results(
    test_type="sv", with_gene_summary=True
).sort_values("pvalue")
svp_res.head(5)

SV [hsic-ir]: 100%|██████████| 146/146 [00:00<00:00, 182.37it/s]

CPU times: user 906 ms, sys: 797 ms, total: 1.7 s
Wall time: 971 ms

	gene	statistic	pvalue	pvalue_adj	n_isos	perplexity	pct_bin_on	count_avg	count_std	major_ratio_avg
1023	Myl6	0.000692	3.148214e-15	5.890308e-12	4	2.568840	0.807190	2.335512	2.222548	0.476213
304	Rps24	0.000482	2.446140e-12	2.288364e-09	4	2.093565	0.844227	2.331155	1.913525	0.731776
1461	Rpl5	0.000526	9.723733e-12	6.064368e-09	3	2.143138	0.527233	0.973856	1.288195	0.510067
1539	Plp1	0.000312	8.414019e-09	3.935657e-06	2	1.998197	0.331155	0.744009	2.231397	0.521230
216	Clta	0.000430	2.332949e-05	8.729895e-03	5	2.951403	0.724401	1.820261	1.999276	0.606224

Compare SVE and SVP results#

def _neg_log10(p, floor=1e-300):
    return -np.log10(np.clip(p, floor, 1.0))

sv_genes = sve_res.merge(svp_res, on="gene", suffixes=("_sve", "_svp"))

fig, ax = plt.subplots(figsize=(5, 5))
ax.scatter(
    _neg_log10(sv_genes["pvalue_sve"]),
    _neg_log10(sv_genes["pvalue_svp"]),
    alpha=0.2
)
# add pvalue = 0.05 lines
ax.axhline(_neg_log10(0.05), color="red", linestyle="--")
ax.axvline(_neg_log10(0.05), color="red", linestyle="--")

# annotate top 10 genes by SVP p-value
top_10_svp = sv_genes.nsmallest(10, "pvalue_svp")
for i, (idx, row) in enumerate(top_10_svp.iterrows()):
    ax.annotate(row["gene"], xy=(_neg_log10(row["pvalue_sve"]), _neg_log10(row["pvalue_svp"])), 
                style='italic', fontsize=10, color="red")

# annotate top 5 genes by SVE p-value
top_5_sve = sv_genes.nsmallest(5, "pvalue_sve")
for i, (idx, row) in enumerate(top_5_sve.iterrows()):
    if row["gene"] not in top_10_svp["gene"].values:
        # only annotate non-SVP genes
        ax.annotate(
            row["gene"], xy=(_neg_log10(row["pvalue_sve"]), _neg_log10(row["pvalue_svp"])), 
            style='italic', fontsize=10, color="black"
        )
    
ax.set_xlabel("-log10(p) SVE (HSIC-GC)", fontsize=10)
ax.set_ylabel("-log10(p) SVP (HSIC-IR)", fontsize=10)
ax.set_title("Variation in expression vs isoform processing", fontsize=14)
plt.show()

../_images/dae0dcfcdab428ae1eacae0237496b221cafbd909896dad98ea4e3fe1933fe14.png

We can take a look at genes with spatially variable isoform usage as well as genes with only variable expression, to see how they differ.

# Compute isoform ratios and store in a new layer "ratios_obs"
add_ratio_layer(adata_ont, group_iso_by="gene_symbol", layer="counts", ratio_layer_key="ratios_obs")

# Spatially variably processed: Myl6
_gene = "Myl6"
_iso_to_plot = adata_ont.var.index[adata_ont.var['gene_symbol'] == _gene]
_name_to_plot = adata_ont.var.loc[adata_ont.var['gene_symbol'] == _gene, 'transcript_name']
_titles_log1p = [f"{name} (log1p)" for name in _name_to_plot]
_titles_ratio = [f"{name} (ratio)" for name in _name_to_plot]
with plt.rc_context({"figure.figsize": (3, 3)}):
    sc.pl.spatial(
        adata_ont, color=_iso_to_plot, title=_titles_log1p,
        img_key=None, layer='log1p', ncols = len(_iso_to_plot)
    )
    sc.pl.spatial(
        adata_ont, color=_iso_to_plot, title=_titles_ratio,
        img_key=None, layer='ratios_obs', ncols = len(_iso_to_plot)
    )

/var/folders/_f/m4v2g8c54gdfks59bp2f2cm80000gn/T/ipykernel_16814/3160921472.py:11: FutureWarning: Use `squidpy.pl.spatial_scatter` instead.
  sc.pl.spatial(

../_images/325791f5220dc110f6511fd47c21b0171936d4c5f4bac6fe1407c8ad37621f3d.png

/var/folders/_f/m4v2g8c54gdfks59bp2f2cm80000gn/T/ipykernel_16814/3160921472.py:15: FutureWarning: Use `squidpy.pl.spatial_scatter` instead.
  sc.pl.spatial(

../_images/9388c2ff6376f451208de8f5b721c864e3a461bcdc794fdf22586c967f60f576.png

# Spatially variably expressed but not processed: Apoe
_gene = "Apoe"
_iso_to_plot = adata_ont.var.index[adata_ont.var['gene_symbol'] == _gene]
_name_to_plot = adata_ont.var.loc[adata_ont.var['gene_symbol'] == _gene, 'transcript_name']
_titles_log1p = [f"{name} (log1p)" for name in _name_to_plot]
_titles_ratio = [f"{name} (ratio)" for name in _name_to_plot]
with plt.rc_context({"figure.figsize": (3, 3)}):
    sc.pl.spatial(
        adata_ont, color=_iso_to_plot, title=_titles_log1p,
        img_key=None, layer='log1p', ncols = len(_iso_to_plot)
    )
    sc.pl.spatial(
        adata_ont, color=_iso_to_plot, title=_titles_ratio,
        img_key=None, layer='ratios_obs', ncols = len(_iso_to_plot)
    )

/var/folders/_f/m4v2g8c54gdfks59bp2f2cm80000gn/T/ipykernel_16814/1513142415.py:8: FutureWarning: Use `squidpy.pl.spatial_scatter` instead.
  sc.pl.spatial(

../_images/95a528fd7b05d367997744be62af7870fb78de598eb1cb93c074834c81daf809.png

/var/folders/_f/m4v2g8c54gdfks59bp2f2cm80000gn/T/ipykernel_16814/1513142415.py:12: FutureWarning: Use `squidpy.pl.spatial_scatter` instead.
  sc.pl.spatial(

../_images/65821d6412958279968608793f391e2ad0b9e1c3d7b750affd1123c41051ef09.png

Differential isoform usage (DU) tests#

The covariates for DU tests can be any spot-level metadata, such as spatial regions or RBP expression levels.

Find isoform switching events between spatial regions#

We will first use the spatial region annotation available from the original SiT paper.

%%time
# Subset to the significant SVP genes
svp_gene_list = svp_res.query("pvalue < 0.01")["gene"].tolist()
adata_svp = adata_ont[:, adata_ont.var["gene_symbol"].isin(svp_gene_list)].copy()

# Create a model with design matrix based on spatial regions
model_region = SplisosmNP()
model_region.setup_data(
    adata=adata_svp,
    # will one-hot encode the region column in adata.obs
    design_mtx='region',
    spatial_key="spatial",
    layer="counts",
    group_iso_by="gene_symbol",
    gene_names="gene_symbol",
    min_counts=10,
    min_bin_pct=0.01,
    skip_spatial_kernel=True, # DU test does not require initializing spatial kernel
)
model_region

CPU times: user 14.3 ms, sys: 5.21 ms, total: 19.5 ms
Wall time: 22.7 ms

=== SplisosmNP
- Number of genes: 51
- Number of spots: 918
- Number of covariates: 5
- Average isoforms per gene: 2.6
=== Model configurations
- Spatial kernel source: identity (skip_spatial_kernel=True)
- k_neighbors: N/A, rho: 0.99
- Standardize spatial covariance: True
=== Test results
- Spatial variability: N/A
- Differential usage: N/A

%%time
# Conditional DU test using sklearn's GaussianProgressRegressor (default)
with warnings.catch_warnings():
    warnings.filterwarnings("ignore", category=ConvergenceWarning, message=".*is close to the specified.*")
    model_region.test_differential_usage(method="hsic-gp", gpr_backend="sklearn")

du_res = model_region.get_formatted_test_results(test_type="du").sort_values("pvalue")
du_res.head(5)

Covariates: 100%|██████████| 5/5 [00:01<00:00,  3.63it/s]
DU [hsic-gp]: 100%|██████████| 51/51 [00:00<00:00, 332.53it/s]

CPU times: user 1.39 s, sys: 782 ms, total: 2.18 s
Wall time: 1.59 s

	gene	covariate	statistic	pvalue	pvalue_adj
233	Plp1	region_Olfactory Nerve Layer (ONL)	0.000973	0.000001	0.000038
130	Nnat	region_Glomerular Layer (GL)	0.003000	0.000001	0.000042
133	Nnat	region_Olfactory Nerve Layer (ONL)	0.001121	0.000002	0.000038
225	Ap3s1	region_Glomerular Layer (GL)	0.002433	0.000002	0.000042
226	Ap3s1	region_Granule Cell Layer (GCL+RMS)	0.000906	0.000002	0.000124

Let’s plot the most significant switching events for each region.

du_res.groupby("covariate").apply(
    lambda df: df.nsmallest(1, "pvalue"),
    include_groups=False
)

		gene	statistic	pvalue	pvalue_adj
covariate
region_Glomerular Layer (GL)	130	Nnat	0.003000	0.000001	0.000042
region_Granule Cell Layer (GCL+RMS)	226	Ap3s1	0.000906	0.000002	0.000124
region_Mitral Cell Layer (MCL)	227	Ap3s1	0.002586	0.000025	0.001298
region_Olfactory Nerve Layer (ONL)	233	Plp1	0.000973	0.000001	0.000038
region_Outer plexiform Layer (EPL)	94	Rps9	0.000807	0.009689	0.390307

for _gene in ['Nnat', 'Ap3s1', 'Plp1', 'Rps9']:
    _iso_to_plot = adata_ont.var.index[adata_ont.var['gene_symbol'] == _gene]
    _name_to_plot = adata_ont.var.loc[adata_ont.var['gene_symbol'] == _gene, 'transcript_name']
    with plt.rc_context({"figure.figsize": (3, 3)}):
        sc.pl.spatial(
            adata_ont, color=_iso_to_plot, title=_name_to_plot,
            img_key=None, layer='ratios_obs', ncols = len(_iso_to_plot)
        )

/var/folders/_f/m4v2g8c54gdfks59bp2f2cm80000gn/T/ipykernel_16814/720942732.py:5: FutureWarning: Use `squidpy.pl.spatial_scatter` instead.
  sc.pl.spatial(

../_images/183ea2b87d679367b94a86b88058fc98fa78fc3842f6cc96649f2f99e1cdc1f3.png

/var/folders/_f/m4v2g8c54gdfks59bp2f2cm80000gn/T/ipykernel_16814/720942732.py:5: FutureWarning: Use `squidpy.pl.spatial_scatter` instead.
  sc.pl.spatial(

../_images/af94402d63ac9e8bcc98ca889ff49384da5bd8f9c1e18a61341d1820f721925f.png

/var/folders/_f/m4v2g8c54gdfks59bp2f2cm80000gn/T/ipykernel_16814/720942732.py:5: FutureWarning: Use `squidpy.pl.spatial_scatter` instead.
  sc.pl.spatial(

../_images/722a30fee1d343027bab9311c2dbf4f6c4ee7d937e0ab162c2b7e0ed582d1ab8.png

/var/folders/_f/m4v2g8c54gdfks59bp2f2cm80000gn/T/ipykernel_16814/720942732.py:5: FutureWarning: Use `squidpy.pl.spatial_scatter` instead.
  sc.pl.spatial(

../_images/19a520dd90354db5b82785bd0d4ee00a0a9b2baf0661aead3ba8f4417e979668.png

Find potential RBP regulators of isoform switching events#

Next, we load the short-read-based expression matrix of all RNA binding proteins (RBPs) in the same dataset. The list of RBPs is obtained from EuRBPDB, excluding ribosomal proteins and non-canonical RBPs.

adata_rbp = sc.read(rbp_path)
adata_rbp

# # log normalize is performed with all genes
# sc.pp.normalize_total(adata_sr, target_sum=1e4)
# sc.pp.log1p(adata_sr)
# adata_sr.layers['log1p'] = adata_sr.X.copy()
# data_rbp = adata_sr[:, adata_sr.var_names.isin(rbp_genes)].copy()

AnnData object with n_obs × n_vars = 918 × 1399
    obs: 'orig.ident', 'nCount_RNA', 'nFeature_RNA', 'region', 'AtoI_total', 'AtoI_ratio', 'AtoI_detected', 'in_tissue', 'array_row', 'array_col'
    var: 'features', 'n_cells', 'is_rbp', 'pvalue_adj_hsic', 'pvalue_adj_sparkx', 'is_visium_sve'
    uns: 'log1p', 'spatial'
    obsm: 'spatial'
    layers: 'counts', 'log1p'

Let’s first run SVE test (HSIC-GC) to find spatially variably expressed RBPs.

%%time
# Run the HSIC-GC wrapper function (anndata mode)
sve_rbp = run_hsic_gc(
    adata=adata_rbp,
    layer='counts',
    spatial_key='spatial',
    null_method='liu',
    k_neighbors=4, 
    rho=0.99,
    standardize_cov=True,
)
sve_rbp = pd.DataFrame(
    {'gene': adata_rbp.var_names} | sve_rbp
).sort_values("pvalue")

print(f"Number of SVE RBPs (FDR < 0.01): {(sve_rbp['pvalue_adj'] < 0.01).sum()} / {adata_rbp.shape[1]}")
sve_rbp.head(5)

Genes: 100%|██████████| 44/44 [00:00<00:00, 225.96it/s]

Number of SVE RBPs (FDR < 0.01): 455 / 1399
CPU times: user 307 ms, sys: 27.8 ms, total: 335 ms
Wall time: 283 ms

	gene	statistic	pvalue	method	null_method	n_spots	chunk_size	pvalue_adj
1326	Ndufa2	0.244349	1.237645e-29	hsic-gc	liu	918	32	1.731465e-26
1185	Arf3	0.018160	1.160708e-27	hsic-gc	liu	918	32	8.119149e-25
876	Eef1a1	0.392478	2.804746e-27	hsic-gc	liu	918	32	1.307946e-24
911	Ewsr1	0.033771	7.899712e-25	hsic-gc	liu	918	32	2.762924e-22
300	Rab2a	0.058817	4.460292e-23	hsic-gc	liu	918	32	1.247990e-20

We can then run conditional HSIC (method="hsic-gp") to find potential RBP regulators of the isoform switching events.

%%time
# Subset to the significant SVP genes
svp_gene_list = svp_res.query("pvalue < 0.01")["gene"].tolist()
adata_svp = adata_ont[:, adata_ont.var["gene_symbol"].isin(svp_gene_list)].copy()

# Subset to the significant SVE RBPs
sve_rbp_list = sve_rbp.query("pvalue_adj < 0.01")["gene"].tolist()
adata_rbp_sve = adata_rbp[
    adata_svp.obs_names, 
    adata_rbp.var_names.isin(sve_rbp_list)
]
design_mtx = adata_rbp_sve.layers['log1p'].copy()
covariate_names = adata_rbp_sve.var_names.tolist()

# Create a model with RBP expression as covariates
model_rbp = SplisosmNP()
model_rbp.setup_data(
    adata=adata_svp,
    design_mtx=design_mtx,
    covariate_names=covariate_names,
    spatial_key="spatial",
    layer="counts",
    group_iso_by="gene_symbol",
    gene_names="gene_symbol",
    min_counts=10,
    min_bin_pct=0.01,
    skip_spatial_kernel=True, # DU test does not require initializing spatial kernel
)
model_rbp

CPU times: user 17.5 ms, sys: 7.24 ms, total: 24.7 ms
Wall time: 24.9 ms

=== SplisosmNP
- Number of genes: 51
- Number of spots: 918
- Number of covariates: 455
- Average isoforms per gene: 2.6
=== Model configurations
- Spatial kernel source: identity (skip_spatial_kernel=True)
- k_neighbors: N/A, rho: 0.99
- Standardize spatial covariance: True
=== Test results
- Spatial variability: N/A
- Differential usage: N/A

%%time
# Conditional: spatially residualize both probe usage and RBP expression
with warnings.catch_warnings():
    warnings.filterwarnings("ignore", category=ConvergenceWarning, message=".*is close to the specified.*")
    model_rbp.test_differential_usage(method="hsic-gp", gpr_backend='sklearn')
    du_rbp_sklearn = model_rbp.get_formatted_test_results(test_type="du").rename(columns={"pvalue": "pvalue_np"})

Covariates: 100%|██████████| 455/455 [02:06<00:00,  3.58it/s]
DU [hsic-gp]: 100%|██████████| 51/51 [00:13<00:00,  3.82it/s]

CPU times: user 2min 11s, sys: 1min 16s, total: 3min 27s
Wall time: 2min 24s

du_rbp_sklearn.query("gene == 'Myl6'").sort_values("pvalue_np").head(5)

	gene	covariate	statistic	pvalue_np	pvalue_adj
13967	Myl6	Zfp385c	0.003972	0.000032	0.001618
13809	Myl6	Lsm5	0.003395	0.000127	0.006488
13725	Myl6	Mbnl1	0.003372	0.000130	0.006628
13999	Myl6	Srsf5	0.003094	0.000270	0.013778
14097	Myl6	Hnrnpul2	0.002612	0.000678	0.030727

Scalable GP fitting via NUFFT#

Fitting Gaussian Processes (GPs) can become computationally intensive at larger sample sizes. To mitigate this, when using sklearn’s GaussianProcessRegressor, hyperparameter optimization is performed on a subset of samples by default (n_inducing=5000).

For better scalability on large irregular 2-D coordinates, consider the NUFFT backend (requires the finufft package). Tune lml_approx_rank to control the eigensummary used for GP hyperparameter fitting; larger values are usually more accurate but use more memory and time.

For more details on scalable GP fitting, please see the Methods page.

%%time
model_rbp.test_differential_usage(method="hsic-gp", gpr_backend='nufft', gpr_configs={"covariate": {"lml_approx_rank": 32}})
du_rbp_nufft = model_rbp.get_formatted_test_results(test_type="du").rename(columns={"pvalue": "pvalue_np"})

Covariates: 100%|██████████| 455/455 [00:02<00:00, 160.15it/s]
DU [hsic-gp]: 100%|██████████| 51/51 [00:12<00:00,  3.94it/s]

CPU times: user 11.7 s, sys: 46.3 s, total: 58 s
Wall time: 19.9 s

The two backends give largely concordant results, with NUFFT being much faster.

_df = pd.merge(
    du_rbp_sklearn, du_rbp_nufft, on=["gene", "covariate"], suffixes=("_sklearn", "_nufft")
).assign(
    stat_diff=lambda df: np.abs(df["statistic_sklearn"] - df["statistic_nufft"]),
    pvalue_diff=lambda df: np.abs(df["pvalue_np_sklearn"] - df["pvalue_np_nufft"])
)

print(f"Spearman correlation of test statistics: {spearmanr(_df['statistic_sklearn'], _df['statistic_nufft']).correlation:.4f}")
print(f"Spearman correlation of p-values: {spearmanr(_df['pvalue_np_sklearn'], _df['pvalue_np_nufft']).correlation:.4f}")
print(f"Significant associations (p < 0.01) by sklearn: {_df.query('pvalue_np_sklearn < 0.01').shape[0]} / {_df.shape[0]}")
print(f"Significant associations (p < 0.01) by nufft: {_df.query('pvalue_np_nufft < 0.01').shape[0]} / {_df.shape[0]}")

Spearman correlation of test statistics: 1.0000
Spearman correlation of p-values: 1.0000
Significant associations (p < 0.01) by sklearn: 377 / 23205
Significant associations (p < 0.01) by nufft: 379 / 23205

Visualize RBP-SVP gene pairs#

_rbps = ['Mbnl1', 'Srsf5', 'Hnrnpul2']
_gene = 'Myl6'
with plt.rc_context({"figure.figsize": (3, 3)}):
    sc.pl.spatial(
        adata_rbp, color=_rbps, title=_rbps,
        img_key=None, layer='log1p', ncols = len(_rbps)
    )

_iso_to_plot = adata_ont.var.index[adata_ont.var['gene_symbol'] == _gene]
_name_to_plot = adata_ont.var.loc[adata_ont.var['gene_symbol'] == _gene, 'transcript_name']
with plt.rc_context({"figure.figsize": (3, 3)}):
    sc.pl.spatial(
        adata_ont, color=_iso_to_plot, title=_name_to_plot,
        img_key=None, layer='ratios_obs', ncols = len(_iso_to_plot)
    )

/var/folders/_f/m4v2g8c54gdfks59bp2f2cm80000gn/T/ipykernel_16814/1108538305.py:4: FutureWarning: Use `squidpy.pl.spatial_scatter` instead.
  sc.pl.spatial(

../_images/d532933e32de54e8f113cf8529de43af77dbfc51ea94d1006462bc7cf2e171c6.png

/var/folders/_f/m4v2g8c54gdfks59bp2f2cm80000gn/T/ipykernel_16814/1108538305.py:12: FutureWarning: Use `squidpy.pl.spatial_scatter` instead.
  sc.pl.spatial(

../_images/e2b5281e48d78fa5165b0c1922530a10f18d0cd31bea09cb43e247e3c593d99c.png

(Optional) Differential usage tests with `SplisosmGLMM`#

Conventionally, differential splicing analysis is often performed using parametric models such as generalized linear models (GLM), which are implemented in SplisosmGLMM. We can run the same DU tests using SplisosmGLMM and compare the results with the non-parametric HSIC-based tests (results will be different).

GLM-based differential usage test (analogous to the unconditional `method='hsic'`)#

Let’s first run the unconditional SplisosmNP HSIC test as baseline.

%%time
# Model setup
model_hsic = SplisosmNP()
model_hsic.setup_data(
    adata=adata_svp,
    design_mtx=design_mtx,
    covariate_names=covariate_names,
    spatial_key="spatial",
    layer="counts",
    group_iso_by="gene_symbol",
    gene_names="gene_symbol",
    min_counts=10,
    min_bin_pct=0.01,
    skip_spatial_kernel=True, # DU test does not require initializing spatial kernel
)

# Unconditional HSIC test
model_hsic.test_differential_usage(method="hsic")
du_hsic_res = model_hsic.get_formatted_test_results(
    test_type="du"
).rename(columns={"pvalue": "pvalue_hsic"})

DU [hsic]: 100%|██████████| 51/51 [00:13<00:00,  3.74it/s]

CPU times: user 10.9 s, sys: 38 s, total: 48.8 s
Wall time: 17.4 s

Next, set up a Multinomial GLM model and run the score test for association.

%%time
# Model setup for GLM-based DU test
model_glm = SplisosmGLMM(
    model_type="glm", # the unconditional DU test
)
model_glm.setup_data(
    adata=adata_svp,
    design_mtx=design_mtx,
    covariate_names=covariate_names,
    spatial_key="spatial",
    layer="counts",
    group_iso_by="gene_symbol",
    gene_names="gene_symbol",
    min_counts=10,
    min_bin_pct=0.01,
)

# Model fitting
model_glm.fit(with_design_mtx=False)

# DU test with the score statistic
model_glm.test_differential_usage(method="score")
du_glm = model_glm.get_formatted_test_results(test_type="du").sort_values("pvalue")
du_glm.rename(columns={"pvalue": "pvalue_glm"}, inplace=True)

Fitting with single core for 51 genes (batch_size=1).

Fitting:   0%|          | 0/51 [00:00<?, ?it/s]

Fitting: 100%|██████████| 51/51 [00:00<00:00, 334.48it/s]

Fitting finished. Time elapsed: 0.16 seconds.

DU [score]: 100%|██████████| 51/51 [00:00<00:00, 70.70it/s]

CPU times: user 892 ms, sys: 109 ms, total: 1 s
Wall time: 900 ms

GLMM-based differential usage test (analogous to the conditional `method='hsic-gp'`)#

%%time
# Model setup for GLMM-based DU test
model_glmm = SplisosmGLMM(
    model_type="glmm-full",
    fitting_method="joint_gd",
    var_fix_sigma=True,
    fitting_configs={"max_epochs": -1}, # until convergence
    device="cpu", # optional GPU support
    approx_rank=None # optional low-rank kernel approximation
)
model_glmm.setup_data(
    adata=adata_svp,
    design_mtx=design_mtx,
    covariate_names=covariate_names,
    spatial_key="spatial",
    layer="counts",
    group_iso_by="gene_symbol",
    gene_names="gene_symbol",
    min_counts=10,
    min_bin_pct=0.01,
    group_gene_by_n_iso=False, # optional batching
)

# Model fitting
model_glmm.fit(
    n_jobs=1, # optional joblib parallelization
    batch_size=1, # optional batching
    with_design_mtx=False,
    quiet=True, # suppress non-convergence warnings
)

# DU test with the score statistic
model_glmm.test_differential_usage(method="score")
du_glmm = model_glmm.get_formatted_test_results(test_type="du").sort_values("pvalue")
du_glmm.rename(columns={"pvalue": "pvalue_glmm"}, inplace=True)

Fitting with single core for 51 genes (batch_size=1).

Fitting: 100%|██████████| 51/51 [02:40<00:00,  3.15s/it]

Fitting finished. Time elapsed: 160.84 seconds.

DU [score]: 100%|██████████| 51/51 [00:00<00:00, 71.70it/s]

CPU times: user 2min 43s, sys: 19 s, total: 3min 2s
Wall time: 2min 41s

Let’s check model fitting summary and see how long it takes for the per-gene GLMMs to converge.

model_glmm.get_training_summary().head(5)

	model_type	converged	best_loss	best_epoch	fitting_time_s
gene
Ift20	glmm-full	True	597.396851	6820	4.770505
Zmat2	glmm-full	True	-96.564835	5447	3.082019
Nt5c3	glmm-full	True	-221.691910	4606	2.612573
Btbd6	glmm-full	True	-4024.287354	2722	1.977117
Comt	glmm-full	True	-342.045227	4685	2.669552

We can visualize the inferred posterior isoform ratios from the GLMM model. Due to data sparsity, the model almost always underestimates the random effect variance (\(\sigma^2\)), and the resulting ratios look over-smoothed. Setting var_fix_sigma=True (our default) helps mitigate this issue.

# Extract anndata with fitted ratios for visualization
adata_fit = model_glmm.get_fitted_ratios_anndata(layer_name="ratios_glmm")

# Observed vs fitted ratios for Myl6
_iso_to_plot = adata_fit.var.index[adata_fit.var['gene_symbol'] == _gene]
_name_to_plot = adata_fit.var.loc[adata_fit.var['gene_symbol'] == _gene, 'transcript_name']
_titles_obs = [f"{name} (observed ratio)" for name in _name_to_plot]
_titles_glmm = [f"{name} (fitted ratio)" for name in _name_to_plot]
with plt.rc_context({"figure.figsize": (3, 3)}):
    sc.pl.spatial(
        adata_fit, color=_iso_to_plot, title=_titles_obs,
        img_key=None, layer='ratios_obs', ncols = len(_iso_to_plot)
    )
    sc.pl.spatial(
        adata_fit, color=_iso_to_plot, title=_titles_glmm,
        img_key=None, layer='ratios_glmm', ncols = len(_iso_to_plot)
    )

/var/folders/_f/m4v2g8c54gdfks59bp2f2cm80000gn/T/ipykernel_16814/3009630499.py:10: FutureWarning: Use `squidpy.pl.spatial_scatter` instead.
  sc.pl.spatial(

../_images/a812301d0902f5fc1ca1e2bc41fe3d4359e8c53900a638ae006fbb033c4cb775.png

/var/folders/_f/m4v2g8c54gdfks59bp2f2cm80000gn/T/ipykernel_16814/3009630499.py:14: FutureWarning: Use `squidpy.pl.spatial_scatter` instead.
  sc.pl.spatial(

../_images/9b8eff79d330e23e939b24b178436470f22576a26f0af1b316e1fdb863f137c4.png

Compare GLM/GLMM results with HSIC-based results#

def _neg_log10(p, floor=1e-300):
    return -np.log10(np.clip(p, floor, 1.0))

merge_cols = ["gene", "covariate"]

du_all = (
    du_rbp_sklearn[merge_cols + ["pvalue_np"]]
    .merge(du_hsic_res[merge_cols + ["pvalue_hsic"]], on=merge_cols, how="inner")
    .merge(du_glm[merge_cols + ["pvalue_glm"]], on=merge_cols, how="inner")
    .merge(du_glmm[merge_cols + ["pvalue_glmm"]], on=merge_cols, how="inner")
)

# Spearman correlations between DU methods
rho_hsic, _ = spearmanr(du_all["pvalue_hsic"],    du_all["pvalue_np"])
rho_glmm, _  = spearmanr(du_all["pvalue_glm"],    du_all["pvalue_glmm"])
rho_np_glm, _  = spearmanr(du_all["pvalue_hsic"],    du_all["pvalue_glm"])
rho_np_glmm, _  = spearmanr(du_all["pvalue_np"],    du_all["pvalue_glmm"])

print(f"Spearman ρ(HSIC vs HSIC-GP) = {rho_hsic:.3f}")
print(f"Spearman ρ(GLM vs GLMM) = {rho_glmm:.3f}")
print(f"Spearman ρ(HSIC vs GLM) = {rho_np_glm:.3f}")
print(f"Spearman ρ(HSIC-GP vs GLMM) = {rho_np_glmm:.3f}")

Spearman ρ(HSIC vs HSIC-GP) = 0.868
Spearman ρ(GLM vs GLMM) = 0.921
Spearman ρ(HSIC vs GLM) = 0.709
Spearman ρ(HSIC-GP vs GLMM) = 0.626

pairs_du = [
    ("pvalue_hsic",   "pvalue_np",
     "SplisosmNP (HSIC)", "SplisosmNP (HSIC-GP)",  rho_hsic),
    ("pvalue_glm",   "pvalue_glmm",
     "SplisosmGLMM (GLM)", "SplisosmGLMM (GLMM)",  rho_glmm),
    ("pvalue_hsic",    "pvalue_glm",
     "SplisosmNP (HSIC)", "SplisosmGLMM (GLM)",  rho_np_glm),
    ("pvalue_np",    "pvalue_glmm",
     "SplisosmNP (HSIC-GP)", "SplisosmGLMM (GLMM)",  rho_np_glmm),
]

fig, axes = plt.subplots(1, 4, figsize=(18, 5))
for ax, (cx, cy, lx, ly, rho) in zip(axes, pairs_du):
    x = _neg_log10(du_all[cx].astype(float).values, floor=1e-100)
    y = _neg_log10(du_all[cy].astype(float).values, floor=1e-100)
    ax.scatter(x, y, s=6, alpha=0.4, rasterized=True)
    lim = max(x.max(), y.max()) * 1.05
    ax.plot([0, lim], [0, lim], "k--", label="y = x")
    ax.axhline(_neg_log10(0.01), color="red", linestyle="--", label="pvalue = 0.01")
    ax.axvline(_neg_log10(0.01), color="red", linestyle="--")
    ax.set_xlabel(f"-log10(p)  {lx}", fontsize=10)
    ax.set_ylabel(f"-log10(p)  {ly}", fontsize=10)
    ax.set_title(f"Spearman ρ = {rho:.2f}", fontsize=14)
    ax.legend(fontsize=8)

fig.suptitle("SplisosmNP vs SplisosmGLMM: DU method comparison (per gene-RBP pair)", fontsize=14)
fig.tight_layout()
plt.show()

../_images/ce9b05762b920ab62a6b060c6bf96c1975ece380075798bd81ab0ff54d5f469b.png

For reproducibility#

import sys
from datetime import date
import splisosm

print("Last updated:", date.today())
print("Python:", sys.version.split()[0])
print("splisosm:", getattr(splisosm, "__version__", "unknown"))

Last updated: 2026-04-27
Python: 3.12.13
splisosm: 1.2.0rc2.dev4+g83955c42e.d20260427