Core Classes#

Use these top-level classes for complete SPLISOSM workflows. The recommended imports are from splisosm import SplisosmNP, SplisosmFFT, SplisosmGLMM.

splisosm.SplisosmNP

Non-parametric SPLISOSM model for arbitrary spatial geometries.

splisosm.SplisosmFFT

FFT-accelerated SPLISOSM model for rasterized spatial isoform testing.

splisosm.SplisosmGLMM

Parametric spatial isoform statistical modeling using GLMM.

class splisosm.SplisosmNP(k_neighbors=4, rho=0.99, standardize_cov=True)#

Bases: _ResultsMixin, _FeatureSummaryMixin

Non-parametric SPLISOSM model for arbitrary spatial geometries.

SplisosmNP works directly on spot- or cell-level coordinates and a sparse CAR spatial kernel. Use setup_data() to load an AnnData object, then run spatial-variability (SV) or differential-usage (DU) tests.

Examples

Spatial variability test:

>>> from splisosm import SplisosmNP
>>> # adata : AnnData of shape (n_spots, n_isoforms)
>>> #   adata.layers["counts"]    — raw isoform counts
>>> #   adata.var["gene_symbol"]  — column grouping isoforms by gene
>>> #   adata.obsm["spatial"]     — (n_spots, 2) spatial coordinates
>>> model = SplisosmNP()
>>> model.setup_data(adata, layer="counts", group_iso_by="gene_symbol")
>>> model.test_spatial_variability(method="hsic-ir")
>>> sv_results = model.get_formatted_test_results("sv")

Differential usage test:

>>> model = SplisosmNP()
>>> model.setup_data(
...     adata, layer="counts", group_iso_by="gene_symbol",
...     design_mtx="covariate",  # obs column name, or (n_spots, n_factors) array
... )
>>> model.test_differential_usage(method="hsic-gp", residualize="cov_only")
>>> du_results = model.get_formatted_test_results("du")
Parameters:
  • k_neighbors (int)

  • rho (float)

  • standardize_cov (bool)

property filtered_adata: AnnData#

The filtered AnnData of shape (n_spots, sum(n_isos_per_gene)).

This is the data used internally after setup_data(). It is a copy of the input adata, subsetted to the retained spots and isoforms after filtering.

Raises:

RuntimeError – If setup_data() has not been called.

setup_data(adata, *, spatial_key='spatial', adj_key=None, layer='counts', group_iso_by='gene_symbol', gene_names=None, design_mtx=None, covariate_names=None, min_counts=10, min_bin_pct=0.0, filter_single_iso_genes=True, min_component_size=1, skip_spatial_kernel=False)#

Set up isoform-level spatial data for hypothesis testing.

This method extracts isoform count tensors, optionally filters small disconnected graph components, builds the spatial kernel, and resolves the design matrix for DU tests.

Parameters:
  • adata (AnnData) – Annotated data matrix. Counts are read from adata.layers[layer] grouped by group_iso_by, and spatial coordinates from adata.obsm[spatial_key]. See splisosm.utils.preprocessing.prepare_inputs_from_anndata() for full preprocessing details.

  • spatial_key (str, optional) – Key in adata.obsm for spatial coordinates (default "spatial"). Optional when adj_key is provided: if the key is missing from adata.obsm the spatial kernel is built from the adjacency alone and coordinate-free SV tests ("hsic-ir" / "hsic-ic" / "hsic-gc") and DU tests still run. method="spark-x" (SV) and method="hsic-gp" (DU) require raw coordinates and raise a clear error at call time when they are absent.

  • adj_key (str or None, optional) – Key in adata.obsp for a pre-built adjacency matrix. When provided, it overrides the k-NN graph construction from coordinates and is used directly to build the spatial kernel. It also makes spatial_key optional (see above). The adjacency matrix is symmetrized internally.

  • layer (str, optional) – Layer in adata.layers that stores isoform counts (default "counts").

  • group_iso_by (str, optional) – Column in adata.var used to group isoforms by gene (default "gene_symbol").

  • gene_names (str or None, optional) – Column name in adata.var used as display names for genes. If None, the values of group_iso_by are used.

  • design_mtx (tensor, array, sparse matrix, DataFrame, str, or list of str, optional) –

    Design matrix for differential-usage tests. Accepts an array/tensor/DataFrame of shape (n_spots, n_factors), a single obs-column name (str), or a list of obs-column names. Categorical obs columns are one-hot encoded automatically.

    When a scipy sparse matrix is passed directly, it is stored as scipy CSR internally and all differential-usage methods handle it without densifying: "hsic" uses a sparse matrix-multiply path in linear_hsic_test(); "t-fisher" and "t-tippett" extract group indices directly from the sparse non-zero structure. "hsic-gp" densifies each column via _get_design_col() before GPR fitting (GPR residuals are always dense).

    All other input types (obs column names, array, tensor, DataFrame) are converted to a dense torch float32 tensor.

  • covariate_names (list of str or None, optional) – Explicit covariate names. When design_mtx is given as column name(s) and this is None, the column names are used automatically; otherwise auto-generated as factor_1, etc.

  • min_counts (int, optional) – Minimum total isoform count across spots required to retain an isoform (default 10).

  • min_bin_pct (float, optional) – Minimum spot prevalence required to retain an isoform. Values in [0, 1] are fractions; values in (1, 100] are percentages. Default 0.0.

  • filter_single_iso_genes (bool, optional) – Whether to remove genes with fewer than two retained isoforms (default True).

  • min_component_size (int, optional) – Minimum number of spots a connected component must contain to be retained. Spots in smaller components are removed from all data structures before the spatial kernel is built. Default 1 disables filtering. A UserWarning is issued when spots are removed.

  • skip_spatial_kernel (bool, optional) – If True, skip construction of the CAR spatial kernel and store an IdentityKernel placeholder as self.sp_kernel instead. Use this when only test_differential_usage() is needed; method="hsic-gp" fits its own GP for spatial residualization. Calling test_spatial_variability() on a model set up with skip_spatial_kernel=True will raise a RuntimeError. Default False.

Raises:

ValueError – If input arguments are invalid or required fields are missing.

Return type:

None

test_spatial_variability(method='hsic-ir', ratio_transformation='none', nan_filling='mean', null_method='liu', null_configs=None, chunk_size='auto', n_jobs=-1, return_results=False, print_progress=True)#

Test each gene for spatial variability.

The HSIC-based methods test association between a centered spatial kernel and one gene-level response at a time:

  • "hsic-ir": isoform usage ratios.

  • "hsic-ic": isoform counts.

  • "hsic-gc": gene-level total counts.

  • "spark-x": SPARK-X gene-count test [ZSZ21].

Parameters:
  • method ({"hsic-ir", "hsic-ic", "hsic-gc", "spark-x"}, optional) – Test target: "hsic-ir" (isoform usage ratios), "hsic-ic" (isoform counts), "hsic-gc" (gene-level counts), or "spark-x" (SPARK-X [ZSZ21]).

  • ratio_transformation ({"none", "clr", "ilr", "alr", "radial"}, optional) – Compositional transformation applied to isoform ratios when method="hsic-ir". See splisosm.utils.preprocessing.counts_to_ratios() and [PYPA22] for details.

  • nan_filling ({"mean", "none"}, optional) – Strategy for NaN values in isoform ratios. See splisosm.utils.preprocessing.counts_to_ratios() for details.

  • null_method ({"liu", "welch", "perm"}, optional) –

    Method for computing the null distribution of the test statistic:

    • "liu" (default): asymptotic chi-square mixture using cumulants with Liu’s method [LTZ09]. Exact eigenvalue cumulants are used when cheap; large implicit kernels use Hutchinson Rademacher trace estimates controlled by null_configs["n_probes"].

    • "welch": Welch-Satterthwaite moment matching. It uses the first two null cumulants and approximates the null by a scaled chi-square variable g * chi2(h), where g = Var / (2 * E) and h = 2 * E**2 / Var.

    • "perm": permutation null distribution. Supports optional null_configs["n_perms_per_gene"] (default 1000), and null_configs["perm_batch_size"] (default 50) for batched null-statistic computation.

    "eig" is accepted as a deprecated alias for "liu". "clt" and "trace" are accepted as deprecated aliases for "welch".

  • null_configs (dict or None, optional) –

    Extra keyword arguments for the chosen null_method. For null_method="liu", SPLISOSM evaluates Liu’s approximation from cumulants instead of materializing all pairwise eigenvalue products. Supported keys include:

    • "n_probes": int. Estimate spatial kernel cumulants with this many Hutchinson Rademacher probes when exact eigenvalue or trace cumulants are unavailable. The same budget is used by "welch" when the spatial kernel does not expose exact tr(K) and tr(K**2) (for example implicit CAR kernels). Large implicit kernels use 60 probes by default.

    • "approx_rank": int or None. Advanced diagnostic override to use the top-k spatial eigenvalues and a rank-consistent statistic. This is normally unnecessary for SV tests because the default Liu path uses direct cumulant estimates.

  • chunk_size (int or {"auto"}, optional) – Maximum number of response columns to process per worker task. "auto" (default) estimates a memory-safe column cap using a 2 GiB live-memory budget per worker and then caps the result at 32 columns for per-feature runtime. Genes are never split across chunks; a single gene with more response columns than the cap is processed as a singleton chunk. For method="hsic-gc", each gene contributes one response column.

  • n_jobs (int, optional) – Number of parallel workers for the per-gene loop. -1 uses all available CPUs. With chunk_size="auto", each worker targets roughly a 2 GiB live-memory budget. Default -1.

  • return_results (bool, optional) – If True, return the result dict. Otherwise store results in self._sv_test_results and return None.

  • print_progress (bool, optional) – Whether to show a progress bar.

Returns:

If return_results is True, return a result dictionary with test statistics and p-values. Otherwise, store results in self._sv_test_results and return None.

Return type:

dict or None

Notes

method="spark-x" requires the R package SPARK and Python package rpy2.

test_differential_usage(method='hsic-gp', ratio_transformation='none', nan_filling='mean', gpr_backend='sklearn', gpr_configs=None, residualize='cov_only', n_jobs=-1, print_progress=True, return_results=False)#

Test each gene for differential isoform usage.

Call setup_data() with design_mtx before running this method. Each design-matrix column is tested against each gene’s isoform usage ratios, producing one statistic and p-value per (gene, covariate) pair.

Two types of association tests are supported:

  • Unconditional ("hsic", "t-fisher", "t-tippett"): test the association between isoform usage ratios and the covariate.

  • Conditional ("hsic-gp"): test the association conditioned on spatial coordinates by residualizing with spatial GP regression. See [ZPJScholkopf12].

Parameters:
  • method ({"hsic", "hsic-gp", "t-fisher", "t-tippett"}, optional) –

    Method for association testing:

    • "hsic": Unconditional HSIC test (multivariate RV coefficient). For continuous factors, equivalent to the multivariate Pearson correlation test. For binary factors, equivalent to the two-sample Hotelling T**2 test.

    • "hsic-gp": Conditional HSIC test. Spatial effects are removed via Gaussian process regression before computing the HSIC statistic.

    • "t-fisher", "t-tippett": each isoform is tested independently for binary covariates only; p-values are combined gene-wise via Fisher’s or Tippett’s method.

  • ratio_transformation ({"none", "clr", "ilr", "alr", "radial"}, optional) – Compositional transformation for isoform ratios. One of 'none', 'clr', 'ilr', 'alr', 'radial' [PYPA22]. See splisosm.utils.preprocessing.counts_to_ratios().

  • nan_filling ({"mean", "none"}, optional) – How to fill NaN values in isoform ratios. One of 'mean' or 'none'. See splisosm.utils.preprocessing.counts_to_ratios().

  • gpr_backend ({"sklearn", "gpytorch", "nufft", "finufft"}, optional) – GPR backend to use for method='hsic-gp'. One of 'sklearn' (default), 'gpytorch', 'nufft', or 'finufft'. The NUFFT aliases use FINUFFT for irregular 2-D coordinates with an implicit periodic RBF kernel. For FFT-accelerated spatial GP on regular grids use SplisosmFFT instead.

  • gpr_configs (dict, optional) –

    Nested configuration dict for the GPR objects, with optional keys "covariate" and/or "isoform". Each sub-dict configures the selected gpr_backend. Unspecified keys use SPLISOSM defaults. The shared defaults include constant_value=1.0, constant_value_bounds=(1e-3, 1e3), length_scale=1.0, and length_scale_bounds="fixed". Backend-irrelevant known keys are ignored.

    "n_inducing" (int or None) controls the scale of spatial GP fitting for the dense/sparse GP backends:

    • sklearn — maximum number of observations used for hyperparameter fitting. Full exact GP when n_spots <= n_inducing (or None); a randomly sub-sampled subset-of-data of n_inducing points otherwise (not the same inducing-point approximation as gpytorch). Default: 5000. Set to None to use all observations (warns when n_spots > 10_000).

    • gpytorch — FITC sparse-GP inducing-point approximation with n_inducing points; set to None for exact GP. Default: 5000.

    nufft / finufft has additional NUFFT-specific options for the RBF-GP fitting. See splisosm.gpr.NUFFTKernelGPR for the full backend signature. Common options include:

    • max_auto_modes - cap the size of the automatically inferred grid;

    • lml_approx_rank - approximate the irregular-coordinate GP likelihoods using only the leading eigenvalues and eigensummaries. Ignored when the input grid is already regular (full spectrum available via FFT). It costs O(n_spots * lml_approx_rank) memory and uses a trace/trace(K^2)-corrected tail for the omitted spectrum. In practice, ranks around 32-64 often beat same-time sklearn subset fits by a wide margin; increase the rank when memory permits and hyperparameter accuracy is important.

  • residualize ({"cov_only", "both"}, optional) –

    Controls which signals are spatially residualized when method="hsic-gp":

    • "cov_only" (default): residualize covariates only; test HSIC(Z_res, Y). Fastest; calibration matches "both" when covariate GPR captures most spatial confounding.

    • "both": residualize both covariates and isoform ratios.

  • n_jobs (int, optional) – Number of parallel workers for the per-gene loop. -1 uses all available CPUs. Each worker densifies one sparse count tensor (about 4-40 MB for 100K-1M spots by 10 isoforms). When gpr_backend="gpytorch" and device != "cpu", the GPU is not thread-safe; parallelism is automatically disabled. Default -1.

  • print_progress (bool, optional) – Whether to show the progress bar. Default to True.

  • return_results (bool, optional) – Whether to return the test statistics and p-values. If False, the results are stored in self._du_test_results.

Returns:

results – If return_results is True, return a dictionary with test statistics and p-values. Otherwise, return None and store results in self._du_test_results.

Return type:

dict or None

get_formatted_test_results(test_type, with_gene_summary=False)#

Get formatted test results as a pandas.DataFrame.

Parameters:
  • test_type ({"sv", "du"}) – Which results to retrieve: "sv" for spatial variability or "du" for differential usage.

  • with_gene_summary (bool, optional) – If True, append gene-level summary statistics from extract_feature_summary().

Returns:

Result table. SV results contain gene, statistic, pvalue, and pvalue_adj. DU results additionally contain covariate and include one row per gene-covariate pair.

Return type:

DataFrame

Raises:

ValueError – If test_type is invalid or the requested test has not been run.

extract_feature_summary(level='gene', print_progress=True)#

Compute filtered feature-level summary statistics.

Gene-level statistics are aggregated across all isoforms retained by setup_data(). Isoform-level statistics are computed per isoform and joined to the corresponding rows of adata.var.

Results are cached: repeated calls with the same level return the cached pandas.DataFrame without recomputation.

Parameters:
  • level ({"gene", "isoform"}, optional) – Summary granularity. "gene" returns one row per gene; "isoform" returns one row per retained isoform.

  • print_progress (bool, optional) – Whether to show a progress bar while computing summaries.

Returns:

Feature summary table. Gene-level summaries include n_isos, perplexity, pct_bin_on, count_avg, and count_std. Isoform-level summaries include the retained adata.var columns plus count and usage-ratio summaries.

Return type:

DataFrame

Raises:
  • RuntimeError – If setup_data() has not been called.

  • ValueError – If level is not "gene" or "isoform".

class splisosm.SplisosmFFT(rho=0.99, neighbor_degree=1, spacing=(1.0, 1.0), workers=None)#

Bases: _ResultsMixin, _FeatureSummaryMixin

FFT-accelerated SPLISOSM model for rasterized spatial isoform testing.

SplisosmFFT follows the same non-parametric workflow as splisosm.SplisosmNP, but it operates on regular raster grids stored in SpatialData and applies spatial kernels with FFTs.

Examples

Spatial variability test:

>>> from splisosm import SplisosmFFT
>>> model = SplisosmFFT(rho=0.9, neighbor_degree=1)
>>> model.setup_data(
...     sdata=sdata,
...     bins="ID_square_016um",
...     table_name="square_016um",
...     col_key="array_col",
...     row_key="array_row",
...     layer="counts",
...     group_iso_by="gene_ids",
...     gene_names="gene_name",
...     min_counts=10,
...     min_bin_pct=0.0,
... )
>>> model.test_spatial_variability(method="hsic-ir")
>>> sv_results = model.get_formatted_test_results(test_type="sv")

Differential usage test:

>>> model = SplisosmFFT(rho=0.9, neighbor_degree=1)
>>> model.setup_data(
...     sdata=sdata,
...     bins="ID_square_016um",
...     table_name="square_016um_svp",
...     design_mtx="square_016um_rbp_sve",
...     col_key="array_col",
...     row_key="array_row",
...     layer="counts",
...     group_iso_by="gene_ids",
...     gene_names="gene_name",
...     min_counts=10,
...     min_bin_pct=0.0,
... )
>>> model.test_differential_usage(method="hsic-gp", residualize="cov_only")
>>> du_results = model.get_formatted_test_results("du")
Parameters:
  • rho (float)

  • neighbor_degree (int)

  • spacing (tuple[float, float])

  • workers (int | None)

setup_data(sdata, bins, table_name, col_key, row_key, layer='counts', group_iso_by='gene_symbol', gene_names=None, min_counts=10, min_bin_pct=0.0, filter_single_iso_genes=True, design_mtx=None, covariate_names=None)#

Set up SpatialData-backed isoform data for FFT-based testing.

bins, table_name, col_key, and row_key are passed to spatialdata.rasterize_bins() to rasterize isoform counts.

Parameters:
  • sdata (SpatialData) – SpatialData-like object with tables mapping.

  • bins (str) – Name of the SpatialData bin geometry for rasterization.

  • table_name (str) – Key of the table in sdata.tables.

  • col_key (str) – Column-coordinate key in adata.obs for rasterization.

  • row_key (str) – Row-coordinate key in adata.obs for rasterization.

  • layer (str, optional) – AnnData layer that stores isoform count matrix.

  • group_iso_by (str, optional) – Column in adata.var used to group isoforms by gene. The unique values of this column define the gene-level groups.

  • gene_names (str or None, optional) – Optional column name in adata.var whose values are used as display gene names in results. If None, the values of group_iso_by are used directly.

  • min_counts (int, optional) – Minimum total count (summed across all spots) required for an isoform to be retained. Isoforms below this threshold are excluded before gene grouping. Genes with fewer than two remaining isoforms after filtering are also excluded.

  • min_bin_pct (float, optional) – Minimum percentage of bins in which an isoform must be expressed (count greater than zero) to be retained. Values in [0, 1] are interpreted as fractions of bins, and values in (1, 100] are interpreted as percentages.

  • filter_single_iso_genes (bool, optional) – If True (default), genes with fewer than two isoforms passing QC filters are removed — they cannot contribute to within-gene ratio tests. Set to False to keep single-isoform genes, e.g. when testing gene-level expression variability with test_spatial_variability(method="hsic-gc").

  • design_mtx (str, list[str], array, sparse matrix, DataFrame, or None) –

    Design matrix specification. Three input modes:

    1. Table name (str matching a key in sdata.tables): Use the existing AnnData table’s X as the design matrix. Must have the same number of observations as the isoform table.

    2. Obs column names (str or list[str] not matching a table): Extract the named columns from the isoform table’s adata.obs. Categorical columns are one-hot encoded automatically.

    3. Pre-computed matrix (array, sparse matrix, or DataFrame of shape (n_spots, n_factors)): Used as-is.

    In cases 2 and 3, the design matrix is stored as a new AnnData table inside sdata and rasterized when test_differential_usage() is called.

  • covariate_names (list[str] or None, optional) – Factor names. These override inferred names. If None, names are inferred from design_mtx when possible; otherwise they are generated as ["factor_0", ...].

Raises:

ValueError – If required table/layer/metadata is missing.

Return type:

None

See also

splisosm.SplisosmNP.setup_data()

AnnData-based setup for data with general geometry.

test_spatial_variability(method='hsic-ir', ratio_transformation='none', chunk_size='auto', n_jobs=-1, return_results=False, print_progress=True)#

Test each gene for spatial variability using FFT-accelerated HSIC.

Parameters:
  • method ({"hsic-ir", "hsic-ic", "hsic-gc"}, optional) – One of "hsic-ir" (isoform ratios), "hsic-ic" (isoform counts), or "hsic-gc" (gene counts).

  • ratio_transformation ({"none", "clr", "ilr", "alr", "radial"}, optional) – Ratio transform used when method="hsic-ir".

  • chunk_size (int or {"auto"}, optional) – Maximum number of response channels to process in one FFT kernel call. "auto" (default) estimates a memory-safe cap using a 2 GiB live-memory budget per worker and caps the result at 32 channels for per-feature runtime. Genes are never split across chunks; a single gene with more channels than the cap is processed as a singleton chunk.

  • n_jobs (int, optional) – Number of joblib workers. -1 uses all available CPUs.

  • return_results (bool, optional) – If True, return the result dictionary.

  • print_progress (bool, optional) – Whether to show a progress bar.

Returns:

Result dictionary when return_results=True; otherwise None.

Return type:

dict or None

See also

splisosm.SplisosmNP.test_spatial_variability()

Non-FFT implementation for arbitrary spatial geometries.

test_differential_usage(method='hsic-gp', ratio_transformation='none', gpr_configs=None, residualize='cov_only', n_jobs=-1, return_results=False, print_progress=True)#

Test each gene for differential isoform usage on a raster grid.

Call setup_data() with design_mtx before running this method. Each design-matrix column is tested against each gene’s isoform usage ratios, producing one statistic and p-value per (gene, covariate) pair.

Four test strategies are supported, all operating on rasterized grid data to avoid densifying the full isoform or covariate matrix in memory:

  • "hsic-gp" (default): spatially residualize covariates (and optionally isoform ratios) with splisosm.gpr.FFTKernelGPR, then compute linear HSIC. Controlled by residualize.

  • "hsic": linear HSIC between raw centered isoform ratios and raw centered covariates, with no spatial residualization.

  • "t-fisher": per-isoform two-sample t-tests (binary covariates only) combined by Fisher’s method (chi-square with df = 2 * n_isoforms).

  • "t-tippett": per-isoform two-sample t-tests (binary covariates only) combined by Tippett’s corrected minimum p-value.

Regardless of method, covariates are processed in chunks of at most 100 at a time and isoform data is loaded on-the-fly per gene so that neither the full covariate grid nor the full isoform matrix is held in memory simultaneously.

Parameters:
  • method ({"hsic", "hsic-gp", "t-fisher", "t-tippett"}, optional) –

    Method for association testing:

    • "hsic": Unconditional HSIC test (multivariate RV coefficient). For continuous factors, equivalent to the multivariate Pearson correlation test. For binary factors, equivalent to the two-sample Hotelling T**2 test.

    • "hsic-gp": Conditional HSIC test. Spatial effects are removed via Gaussian process regression before computing the HSIC statistic.

    • "t-fisher", "t-tippett": two-sample t-test per isoform (binary covariates only; exactly two distinct non-NaN values required); p-values are combined gene-wise via Fisher’s chi-squared or Tippett’s corrected minimum method.

  • ratio_transformation ({"none", "clr", "ilr", "alr", "radial"}, optional) – Compositional transformation for isoform ratios. One of 'none', 'clr', 'ilr', 'alr', 'radial' [PYPA22]. See splisosm.utils.preprocessing.counts_to_ratios().

  • gpr_configs (dict, optional) – Nested configuration dict for the GPR objects, with optional keys 'covariate' and/or 'isoform'. Each sub-dict is forwarded to the FFT GPR path. Unspecified keys use the common GP defaults from SPLISOSM’s backend configuration: constant_value, constant_value_bounds, length_scale, and length_scale_bounds. Large-n sklearn/gpytorch and NUFFT-only keys in the shared defaults are ignored by this FFT path. See splisosm.gpr.FFTKernelGPR for backend-specific options.

  • residualize ({"cov_only", "both"}, optional) –

    Controls which signals are spatially residualized when method="hsic-gp":

    • "cov_only" (default): residualize covariates only; test HSIC(Z_res, Y). Fastest; calibration matches "both" when covariate GPR captures most spatial confounding.

    • "both": residualize both covariates and isoform ratios.

  • n_jobs (int, optional) – Number of parallel jobs. -1 uses all available CPUs.

  • print_progress (bool, optional) – Whether to show the progress bar. Default True.

  • return_results (bool, optional) – If True, return the test statistics and p-values. Otherwise, store results in self._du_test_results.

Returns:

results – If return_results is True, return a dictionary with test statistics and p-values. Otherwise, return None and store results in self._du_test_results.

Return type:

dict or None

Raises:
  • RuntimeError – If setup_data() or the design_mtx argument has not been set.

  • ValueError – If method, residualize, or ratio_transformation is invalid.

See also

splisosm.SplisosmNP.test_differential_usage()

Non-FFT implementation for arbitrary spatial geometries.

get_formatted_test_results(test_type, with_gene_summary=False)#

Get formatted test results as a pandas.DataFrame.

Parameters:
  • test_type ({"sv", "du"}) – Which results to retrieve: "sv" for spatial variability or "du" for differential usage.

  • with_gene_summary (bool, optional) – If True, append gene-level summary statistics from extract_feature_summary().

Returns:

Result table. SV results contain gene, statistic, pvalue, and pvalue_adj. DU results additionally contain covariate and include one row per gene-covariate pair.

Return type:

DataFrame

Raises:

ValueError – If test_type is invalid or the requested test has not been run.

extract_feature_summary(level='gene', print_progress=True)#

Compute filtered feature-level summary statistics.

Gene-level statistics are aggregated across all isoforms retained by setup_data(). Isoform-level statistics are computed per isoform and joined to the corresponding rows of adata.var.

Results are cached: repeated calls with the same level return the cached pandas.DataFrame without recomputation.

Parameters:
  • level ({"gene", "isoform"}, optional) – Summary granularity. "gene" returns one row per gene; "isoform" returns one row per retained isoform.

  • print_progress (bool, optional) – Whether to show a progress bar while computing summaries.

Returns:

Feature summary table. Gene-level summaries include n_isos, perplexity, pct_bin_on, count_avg, and count_std. Isoform-level summaries include the retained adata.var columns plus count and usage-ratio summaries.

Return type:

DataFrame

Raises:
  • RuntimeError – If setup_data() has not been called.

  • ValueError – If level is not "gene" or "isoform".

class splisosm.SplisosmGLMM(model_type='glmm-full', share_variance=True, var_fix_sigma=True, var_prior_model='none', var_prior_model_params=None, init_ratio='uniform', fitting_method='joint_gd', fitting_configs=None, k_neighbors=4, rho=0.99, approx_rank=<object object>, device='cpu')#

Bases: _ResultsMixin, _FeatureSummaryMixin

Parametric spatial isoform statistical modeling using GLMM.

This is a convenience class that wraps around the splisosm.glmm.MultinomGLMM for batched model fitting and spatial variability and differential usage testing.

Examples

Prepare an AnnData object and fit a GLMM:

>>> from splisosm import SplisosmGLMM
>>> # adata : AnnData of shape (n_spots, n_isoforms)
>>> #   adata.layers["counts"]    — raw isoform counts
>>> #   adata.var["gene_symbol"]  — column grouping isoforms by gene
>>> #   adata.obsm["spatial"]     — (n_spots, 2) spatial coordinates
>>> #   adata.obs["covariate"]    — optional covariate column for DU testing
>>> model = SplisosmGLMM(
...     model_type="glmm-full",    # 'glmm-full' | 'glmm-null' | 'glm'
...     fitting_method="joint_gd", # 'joint_gd' | 'joint_newton' | 'marginal_gd' | 'marginal_newton'
...     device="cpu",              # 'cpu' | 'cuda' (NVIDIA) | 'mps' (Apple Silicon)
...     # approx_rank: omit for auto | None (force full rank) | int (fixed low-rank)
... )
>>> model.setup_data(
...     adata,
...     layer="counts",
...     group_iso_by="gene_symbol",
...     group_gene_by_n_iso=True,  # required for batch_size > 1 in fit()
...     design_mtx="covariate",    # obs column name, or (n_spots, n_factors) array
... )
>>> model.fit(
...     n_jobs=1, batch_size=20,
...     with_design_mtx=False,     # False → score test (recommended for DU)
... )
>>> fitted_models = model.get_fitted_models()

Differential usage test:

>>> model.test_differential_usage(method="score")  # or "wald" if with_design_mtx=True
>>> du_results = model.get_formatted_test_results("du")
>>> print(du_results.head())
Parameters:
  • model_type (Literal['glmm-full', 'glmm-null', 'glm'])

  • share_variance (bool)

  • var_fix_sigma (bool)

  • var_prior_model (str)

  • var_prior_model_params (dict | None)

  • init_ratio (str)

  • fitting_method (str)

  • fitting_configs (dict | None)

  • k_neighbors (int)

  • rho (float)

  • device (Literal['cpu', 'cuda', 'mps'])

property filtered_adata: AnnData#

The filtered AnnData of shape (n_spots, sum(n_isos_per_gene)).

This is the data used internally after setup_data(). It is a copy of the input adata, subsetted to the retained spots and isoforms after filtering.

Raises:

RuntimeError – If setup_data() has not been called.

setup_data(adata, *, spatial_key='spatial', adj_key=None, layer='counts', group_iso_by='gene_symbol', gene_names=None, group_gene_by_n_iso=False, design_mtx=None, covariate_names=None, min_counts=10, min_bin_pct=0.0, min_component_size=1)#

Setup the data for the GLMM model.

Parameters:
  • adata (AnnData) – Annotated data matrix. Counts are read from adata.layers[layer] grouped by group_iso_by, and spatial coordinates from adata.obsm[spatial_key].

  • spatial_key (str, optional) – Key in adata.obsm for spatial coordinates (default "spatial"). Optional when adj_key is provided: if the key is missing from adata.obsm the spatial kernel is built from the adjacency alone. SplisosmGLMM does not require raw coordinates past kernel construction.

  • adj_key (str or None, optional) – Key in adata.obsp for a pre-built adjacency matrix. When provided, it overrides the k-NN graph construction from coordinates and be used directly to build the spatial kernel. Also makes spatial_key optional (see above). The adjacency matrix is symmetrized internally.

  • layer (str, optional) – Layer in adata.layers storing isoform counts (default "counts").

  • group_iso_by (str, optional) – Column in adata.var used to group isoforms by gene (default "gene_symbol").

  • gene_names (str or None, optional) – Column name in adata.var used as display names for genes. If None, the values of group_iso_by are used.

  • group_gene_by_n_iso (bool, optional) – If True, genes are sorted and grouped by their isoform count before batching. Required for batch_size > 1 in fit() because the GLMM model expects a fixed isoform count per batch (default False).

  • design_mtx (tensor, array, DataFrame, str, or list of str, optional) – Design matrix for fixed effects. Accepts an array/tensor/DataFrame of shape (n_spots, n_factors), a single obs-column name (str), or a list of obs-column names.

  • covariate_names (list of str or None, optional) – Explicit covariate names. When not provided, names are inferred from column names or auto-generated.

  • min_counts (int, optional) – Minimum total isoform count across spots to retain an isoform (default 10).

  • min_bin_pct (float, optional) – Minimum fraction/percentage of spots where an isoform must be expressed (default 0.0).

  • min_component_size (int, optional) – Minimum number of spots a connected component must contain to be retained. Spots in smaller components are removed from all data structures before the spatial kernel is built. Default 1 disables filtering. A UserWarning is issued when spots are removed.

Raises:

ValueError – If input arguments are invalid or required fields are missing.

Return type:

None

fit(n_jobs=1, batch_size=1, quiet=True, print_progress=True, with_design_mtx=False, from_null=False, refit_null=True, random_seed=None)#

Model fitting.

Parameters:
  • n_jobs (int, optional) – The number of cores to use for parallel fitting. Default to 1.

  • batch_size (int, optional) – The maximum number of genes per job to fit in parallel. Default to 1.

  • quiet (bool, optional) – Whether to suppress the fitting logs. Default to True.

  • print_progress (bool, optional) – Whether to show the progress bar. Default to True.

  • with_design_mtx (bool, optional) – Whether to include the design matrix for the fixed effects. Default to False.

  • from_null (bool, optional) – Whether to initialize the full model from a null model with zero spatial variability (white-noise random effect).

  • refit_null (bool, optional) – Whether to refit the null model after fitting the full model. Only applicable when from_null=True.

  • random_seed (int or None, optional) – The random seed for reproducibility. Default to None.

Return type:

None

test_spatial_variability(method='llr', use_perm_null=False, return_results=False, print_progress=True, n_perms_per_gene=None, **kwargs)#

Parametric test for spatial variability.

Caution

The likelihood ratio statistic is not well-calibrated for sparse data. We recommend the non-parametric HSIC-based tests in splisosm.hyptest.np.SplisosmNP for spatial variability testing.

Note that the parametric and non-parametric tests are assymptotically equivalent under the null. See [SFW+26] for detailed theoretical analysis.

Parameters:
  • method ({"llr"}, optional) – The test method. Currently only support "llr", the likelihood ratio test (H_0: theta = 0, i.e. no spatial variance).

  • use_perm_null (bool, optional) – Whether to generate the null distribution from permutation. If False, use the chi-square with df = n_var_components as the null.

  • return_results (bool, optional) – Whether to return the test statistics and p-values. If False, the results will be stored in self._sv_test_results.

  • print_progress (bool, optional) – Whether to show the progress bar for permutation. Default to True.

  • n_perms_per_gene (int or None, optional) – Number of permutations per gene when use_perm_null=True. Default to 20.

  • **kwargs (Any) – Additional arguments passed to self._fit_sv_llr_perm() if use_perm_null = True.

Returns:

If return_results is True, returns dict with test statistics and p-values. Otherwise returns None.

Return type:

dict or None

test_differential_usage(method='score', print_progress=True, return_results=False)#

Parametric test for spatial isoform differential usage.

Before running this function, the design matrix must be set up using setup_data(). Each column of the design matrix corresponds to a covariate to test for differential association with the isoform usage ratios of each gene. Test statistics and p-values are computed per (gene, covariate) pair separately.

Similar to splisosm.SplisosmNP.test_differential_usage(), here we also support two types of association tests but implicitly:

  • Unconditional (when model_type='glm'): test the unconditional association between isoform usage ratios and the covariate of interest (H_0: beta = 0).

  • Conditional (when model_type='glmm-full'): test for association (H_0: beta = 0) conditioned on the spatial random effect.

Parameters:
  • method ({"score", "wald"}, optional) –

    Depending on whether the design matrix is used for model fitting, different methods must be used for hypothesis testing:

    • "wald" when models were fit with fit(..., with_design_mtx=True).

    • "score" when models were fit with fit(..., with_design_mtx=False).

    Caution

    The Wald statistic with GLM/GLMM is empirically anti-conserved (i.e., lots of false positives). Always use method="score" when possible.

  • print_progress (bool, optional) – Whether to show the progress bar. Default to True.

  • return_results (bool, optional) – Whether to return the test statistics and p-values. If False, the results are stored in self._du_test_results.

Returns:

If return_results is True, returns dict with test statistics and p-values. Otherwise, returns None and stores results in self._du_test_results.

Return type:

dict or None

get_formatted_test_results(test_type, with_gene_summary=False)#

Get formatted test results as a pandas.DataFrame.

Parameters:
  • test_type ({"sv", "du"}) – Which results to retrieve: "sv" for spatial variability or "du" for differential usage.

  • with_gene_summary (bool, optional) – If True, append gene-level summary statistics from extract_feature_summary().

Returns:

Result table. SV results contain gene, statistic, pvalue, and pvalue_adj. DU results additionally contain covariate and include one row per gene-covariate pair.

Return type:

DataFrame

Raises:

ValueError – If test_type is invalid or the requested test has not been run.

extract_feature_summary(level='gene', print_progress=True)#

Compute filtered feature-level summary statistics.

Gene-level statistics are aggregated across all isoforms retained by setup_data(). Isoform-level statistics are computed per isoform and joined to the corresponding rows of adata.var.

Results are cached: repeated calls with the same level return the cached pandas.DataFrame without recomputation.

Parameters:
  • level ({"gene", "isoform"}, optional) – Summary granularity. "gene" returns one row per gene; "isoform" returns one row per retained isoform.

  • print_progress (bool, optional) – Whether to show a progress bar while computing summaries.

Returns:

Feature summary table. Gene-level summaries include n_isos, perplexity, pct_bin_on, count_avg, and count_std. Isoform-level summaries include the retained adata.var columns plus count and usage-ratio summaries.

Return type:

DataFrame

Raises:
  • RuntimeError – If setup_data() has not been called.

  • ValueError – If level is not "gene" or "isoform".

get_fitted_models()#

Get the fitted models after running fit().

Each call reconstructs the full model objects from the stored lightweight state. The returned list is a fresh reconstruction; mutations to the returned models are not persisted.

Returns:

models

  • model_type='glmm-full': list[splisosm.hyptest.glmm.IsoFullModel]

  • model_type='glmm-null': list[splisosm.hyptest.glmm.IsoNullNoSpVar]

  • model_type='glm': list[splisosm.glmm.MultinomGLM]

Return type:

list of fitted models

get_gene_model(gene_name)#

Return the fitted model for a single gene by display name.

Each call reconstructs a fresh model from stored state.

Parameters:

gene_name (str) – Display name of the gene (must appear in gene_names).

Returns:

  • Fitted MultinomGLM, IsoFullModel, or IsoNullNoSpVar instance

  • corresponding to gene_name.

Raises:
  • RuntimeError – If fit() has not been called yet.

  • KeyError – If gene_name is not found in gene_names.

Return type:

Any

get_training_summary()#

Return a per-gene training summary as a pandas.DataFrame.

The DataFrame is indexed by gene name and contains the following columns:

  • 'model_type': str. Model type ('glmm-full', 'glmm-null', or 'glm').

  • 'converged': bool. Whether the gene converged during training.

  • 'best_loss': float. Best (lowest) negative log-likelihood achieved.

  • 'best_epoch': int. Epoch at which the best loss was recorded.

  • 'fitting_time_s': float. Wall-clock seconds spent fitting this gene.

Only available after calling fit().

Raises:

RuntimeError – If fit() has not been called yet.

Return type:

DataFrame

get_fitted_ratios_anndata(layer_name='fitted_ratios')#

Return a copy of the filtered AnnData with fitted isoform ratios as a new layer.

For each gene the per-spot softmax isoform ratios are taken from the fitted model (model.get_isoform_ratio()) and placed into the corresponding isoform columns of a (n_spots, n_filtered_isoforms) dense matrix, which is stored under layer_name.

Isoform ordering guarantee: the columns of the new layer follow the exact row order of self._filtered_adata.var. This holds because setup_data() builds per-gene count tensors by slicing filtered_adata isoforms in their var row order (via groupby(sort=False)), and get_isoform_ratio() returns ratios in the same column order as the input counts.

Parameters:

layer_name (str) – Name of the new layer. Defaults to 'fitted_ratios'.

Returns:

A copy of self._filtered_adata (shape (n_spots, n_filtered_isoforms)) with layer_name added. The var DataFrame is identical to self._filtered_adata.var, so isoform metadata (e.g. the gene-grouping column) is intact.

Return type:

AnnData

Raises:

RuntimeError – If fit() has not been called, or if setup_data() was not called with an AnnData object (i.e. raw tensors were provided instead).

save(path)#

Save the fitted model to a file.

Only lightweight per-gene _FittedGeneState objects (fitted parameters + convergence metadata) are stored – no full nn.Module objects. The raw .adata is not saved; only _filtered_adata is persisted. After loading, .adata will be None.

Parameters:

path (str) – The path to save the fitted models.

Return type:

None

static load(path, map_location=None)#

Load a SplisosmGLMM model previously saved with save().

Parameters:
  • path (str) – Path to the file written by save().

  • map_location (str | None) – Optional device string (e.g. 'cpu', 'cuda:0') to remap tensor storage when loading a model that was saved on a different device. Defaults to None (preserves original device mapping).

Returns:

The fully restored model, including fitted parameters, test results, and all metadata.

Return type:

SplisosmGLMM

Examples

>>> model.save("model.pt")
>>> loaded = SplisosmGLMM.load("model.pt")
>>> loaded.get_training_summary()