splisosm.hyptest_glmm#

GLMM-based hypothesis tests for spatial isoform usage.

Classes#

SplisosmGLMM

Parametric spatial isoform statistical modeling using GLMM.

Module Contents#

class splisosm.hyptest_glmm.SplisosmGLMM(model_type='glmm-full', share_variance=True, var_fix_sigma=True, var_prior_model='none', var_prior_model_params=None, init_ratio='uniform', fitting_method='joint_gd', fitting_configs=None, k_neighbors=4, rho=0.99, approx_rank=_APPROX_RANK_AUTO, device='cpu')#

Parametric spatial isoform statistical modeling using GLMM.

This is a convenience class that wraps around the splisosm.model.MultinomGLMM for batched model fitting and spatial variability and differential usage testing.

Examples

Prepare an AnnData object and fit a GLMM:

>>> from splisosm import SplisosmGLMM
>>> # adata : AnnData of shape (n_spots, n_isoforms)
>>> #   adata.layers["counts"]    — raw isoform counts
>>> #   adata.var["gene_symbol"]  — column grouping isoforms by gene
>>> #   adata.obsm["spatial"]     — (n_spots, 2) spatial coordinates
>>> #   adata.obs["covariate"]    — optional covariate column for DU testing
>>> model = SplisosmGLMM(
...     model_type="glmm-full",    # 'glmm-full' | 'glmm-null' | 'glm'
...     fitting_method="joint_gd", # 'joint_gd' | 'joint_newton' | 'marginal_gd' | 'marginal_newton'
...     device="cpu",              # 'cpu' | 'cuda' (NVIDIA) | 'mps' (Apple Silicon)
...     # approx_rank: omit for auto | None (force full rank) | int (fixed low-rank)
... )
>>> model.setup_data(
...     adata,
...     layer="counts",
...     group_iso_by="gene_symbol",
...     group_gene_by_n_iso=True,  # required for batch_size > 1 in fit()
...     design_mtx="covariate",    # obs column name, or (n_spots, n_factors) array
... )
>>> model.fit(
...     n_jobs=1, batch_size=20,
...     with_design_mtx=False,     # False → score test (recommended for DU)
... )
>>> fitted_models = model.get_fitted_models()

Differential usage test:

>>> model.test_differential_usage(method="score")  # or "wald" if with_design_mtx=True
>>> du_results = model.get_formatted_test_results("du")
>>> print(du_results.head())
Parameters:
  • model_type (Literal['glmm-full', 'glmm-null', 'glm']) – Which model to fit. Can be one of 'glmm-full' (Multinomial GLMM with spatial random effects), 'glmm-null' (Multinomial GLMM with white noise), 'glm' (Multinomial GLM).

  • share_variance (bool) – Whether to share the variance component across isoforms.

  • var_fix_sigma (bool) – Whether to fix the total variance (sigma) to the Fano-factor initial estimate. When True (default), only theta_logit is learned, producing conservative but well-calibrated hypothesis tests. Set to False to learn sigma jointly with other parameters; this may yield higher power for the SV test but can inflate false positive rates for both SV and DU tests.

  • var_prior_model (str) – The prior model on the total variance sigma. Default is 'none' with no prior. Other options are 'gamma' (Gamma prior) and 'inv_gamma' (Inverse Gamma prior).

  • var_prior_model_params (dict | None) – The parameters for the prior model on the total variance sigma. For 'gamma', the default parameters are {'alpha': 2.0, 'beta': 0.3}. For 'inv_gamma', the default parameters are {'alpha': 3, 'beta': 0.5}.

  • init_ratio (str) – The initialization method for the logit isoform usage ratio. Options are 'observed' (initialize using observed counts) and 'uniform' (equal isoform usage across space).

  • fitting_method (str) – The fitting method to use when model_type='glmm-full' or 'glmm-null'. Options are 'joint_gd' (joint likelihood with gradient descent), 'joint_newton' (joint likelihood with Newton’s method), 'marginal_gd' (marginal likelihood with gradient descent), and 'marginal_newton' (marginal likelihood with Newton’s method).

  • fitting_configs (dict | None) –

    A dictionary of fitting configurations with the following keys:

    • 'lr': float, learning rate

    • 'optim': str, optimization method (one of 'adam', 'sgd', or 'lbfgs')

    • 'tol': float, tolerance for convergence

    • 'max_epochs': int, maximum number of epochs

    • 'patience': int, number of epochs to wait for improvement before stopping

    • 'update_nu_every_k': int, number of iterations to update nu when using fitting_method='marginal_newton'

  • k_neighbors (int, optional) – Number of nearest neighbours used to build the spatial k-NN adjacency graph (default 4). Passed to splisosm.kernel.SpatialCovKernel. Ignored when adj_key is provided to setup_data().

  • rho (float, optional) – Spectral smoothing parameter for the ICAR-like spatial kernel (default 0.99). Values close to 1 produce smoother spatial covariance.

  • approx_rank (int or None, optional) –

    Number of leading eigenvectors of the spatial kernel to retain.

    • Not specified (default): automatically selects the rank at setup_data() time — full rank when n_spots 5000, otherwise ceil(sqrt(n_spots) * 4).

    • None: always use full rank regardless of n_spots (a warning is emitted when n_spots > 5000 because the eigendecomposition of a large dense matrix is expensive).

    • Positive integer: use exactly that many eigenvectors.

  • device ({"cpu", "cuda", "mps"}, optional) – Device for all model computation (default "cpu"). Use "cuda" for NVIDIA GPU or "mps" for Apple Silicon. Parallel fitting (n_jobs > 1) is not supported for non-CPU devices and will automatically fall back to single-core with a warning.

extract_feature_summary(level='gene', print_progress=True)#

Compute filtered feature-level summary statistics.

Gene-level statistics are aggregated across all isoforms that passed the filters applied in setup_data(). Isoform-level statistics are computed per isoform and augmented onto the corresponding rows of adata.var.

Results are cached: repeated calls with the same level return the cached pandas.DataFrame without recomputation.

Parameters:
  • level (Literal['gene', 'isoform']) – Summary granularity. 'gene': one row per gene. 'isoform': one row per isoform that passed filtering.

  • print_progress (bool) – Whether to show a progress bar.

Returns:

For level='gene', the index is the gene display name and the columns are:

  • 'n_isos': int. Number of isoforms retained after filtering.

  • 'perplexity': float. Effective number of isoforms based on the marginal isoform usage entropy.

  • 'pct_bin_on': float. Fraction of spots with non-zero total gene counts.

  • 'count_avg': float. Mean per-spot total count for the gene.

  • 'count_std': float. Std of per-spot total count for the gene.

For level='isoform', the index is the isoform name (matching adata.var_names) and the columns are the original adata.var columns plus:

  • 'pct_bin_on': float. Fraction of spots with count > 0.

  • 'count_total': float. Total counts across all spots.

  • 'count_avg': float. Mean count per spot.

  • 'count_std': float. Std of count per spot.

  • 'ratio_total': float. Fraction of total gene counts attributable to this isoform.

  • 'ratio_avg': float. Mean per-spot isoform usage ratio (computed over spots with non-zero gene coverage).

  • 'ratio_std': float. Std of per-spot isoform usage ratio (computed over spots with non-zero gene coverage).

Return type:

DataFrame

Raises:
  • RuntimeError – If setup_data() has not been called.

  • ValueError – If level is not 'gene' or 'isoform'.

fit(n_jobs=1, batch_size=1, quiet=True, print_progress=True, with_design_mtx=False, from_null=False, refit_null=True, random_seed=None)#

Model fitting.

Parameters:
  • n_jobs (int, optional) – The number of cores to use for parallel fitting. Default to 1.

  • batch_size (int, optional) – The maximum number of genes per job to fit in parallel. Default to 1.

  • quiet (bool, optional) – Whether to suppress the fitting logs. Default to True.

  • print_progress (bool, optional) – Whether to show the progress bar. Default to True.

  • with_design_mtx (bool, optional) – Whether to include the design matrix for the fixed effects. Default to False.

  • from_null (bool, optional) – Whether to initialize the full model from a null model with zero spatial variability (white-noise random effect).

  • refit_null (bool, optional) – Whether to refit the null model after fitting the full model. Only applicable when from_null=True.

  • random_seed (int or None, optional) – The random seed for reproducibility. Default to None.

Return type:

None

get_fitted_models()#

Get the fitted models after running fit().

Each call reconstructs the full model objects from the stored lightweight state. The returned list is a fresh reconstruction; mutations to the returned models are not persisted.

Returns:

models

  • model_type='glmm-full': list[splisosm.hyptest_glmm.IsoFullModel]

  • model_type='glmm-null': list[splisosm.hyptest_glmm.IsoNullNoSpVar]

  • model_type='glm': list[splisosm.model.MultinomGLM]

Return type:

list of fitted models

get_fitted_ratios_anndata(layer_name='fitted_ratios')#

Return a copy of the filtered AnnData with fitted isoform ratios as a new layer.

For each gene the per-spot softmax isoform ratios are taken from the fitted model (model.get_isoform_ratio()) and placed into the corresponding isoform columns of a (n_spots, n_filtered_isoforms) dense matrix, which is stored under layer_name.

Isoform ordering guarantee: the columns of the new layer follow the exact row order of self._filtered_adata.var. This holds because setup_data() builds per-gene count tensors by slicing filtered_adata isoforms in their var row order (via groupby(sort=False)), and get_isoform_ratio() returns ratios in the same column order as the input counts.

Parameters:

layer_name (str) – Name of the new layer. Defaults to 'fitted_ratios'.

Returns:

A copy of self._filtered_adata (shape (n_spots, n_filtered_isoforms)) with layer_name added. The var DataFrame is identical to self._filtered_adata.var, so isoform metadata (e.g. the gene-grouping column) is intact.

Return type:

AnnData

Raises:

RuntimeError – If fit() has not been called, or if setup_data() was not called with an AnnData object (i.e. raw tensors were provided instead).

get_formatted_test_results(test_type, with_gene_summary=False)#

Get the formatted test results as data frame.

Parameters:
  • test_type ({"sv", "du"}) – Type of test results to retrieve.

  • with_gene_summary (bool, optional) – If True, append gene-level summary statistics from extract_feature_summary().

Returns:

Formatted test results.

Return type:

DataFrame

get_gene_model(gene_name)#

Return the fitted model for a single gene by display name.

Each call reconstructs a fresh model from stored state.

Parameters:

gene_name (str) – Display name of the gene (must appear in gene_names).

Returns:

  • Fitted MultinomGLM, IsoFullModel, or IsoNullNoSpVar instance

  • corresponding to gene_name.

Raises:
  • RuntimeError – If fit() has not been called yet.

  • KeyError – If gene_name is not found in gene_names.

Return type:

Any

get_training_summary()#

Return a per-gene training summary as a pandas.DataFrame.

The DataFrame is indexed by gene name and contains the following columns:

  • 'model_type': str. Model type ('glmm-full', 'glmm-null', or 'glm').

  • 'converged': bool. Whether the gene converged during training.

  • 'best_loss': float. Best (lowest) negative log-likelihood achieved.

  • 'best_epoch': int. Epoch at which the best loss was recorded.

  • 'fitting_time_s': float. Wall-clock seconds spent fitting this gene.

Only available after calling fit().

Raises:

RuntimeError – If fit() has not been called yet.

Return type:

DataFrame

static load(path, map_location=None)#

Load a SplisosmGLMM model previously saved with save().

Parameters:
  • path (str) – Path to the file written by save().

  • map_location (Optional[str]) – Optional device string (e.g. 'cpu', 'cuda:0') to remap tensor storage when loading a model that was saved on a different device. Defaults to None (preserves original device mapping).

Returns:

The fully restored model, including fitted parameters, test results, and all metadata.

Return type:

SplisosmGLMM

Examples

>>> model.save("model.pt")
>>> loaded = SplisosmGLMM.load("model.pt")
>>> loaded.get_training_summary()
save(path)#

Save the fitted model to a file.

Only lightweight per-gene _FittedGeneState objects (fitted parameters + convergence metadata) are stored – no full nn.Module objects. The raw .adata is not saved; only _filtered_adata is persisted. After loading, .adata will be None.

Parameters:

path (str) – The path to save the fitted models.

Return type:

None

setup_data(adata, *, spatial_key='spatial', adj_key=None, layer='counts', group_iso_by='gene_symbol', gene_names=None, group_gene_by_n_iso=False, design_mtx=None, covariate_names=None, min_counts=10, min_bin_pct=0.0, min_component_size=1)#

Setup the data for the GLMM model.

Parameters:
  • adata (AnnData) – Annotated data matrix. Counts are read from adata.layers[layer] grouped by group_iso_by, and spatial coordinates from adata.obsm[spatial_key].

  • spatial_key (str, optional) – Key in adata.obsm for spatial coordinates (default "spatial"). Optional when adj_key is provided: if the key is missing from adata.obsm the spatial kernel is built from the adjacency alone. SplisosmGLMM does not require raw coordinates past kernel construction.

  • adj_key (str or None, optional) – Key in adata.obsp for a pre-built adjacency matrix. When provided, it overrides the k-NN graph construction from coordinates and be used directly to build the spatial kernel. Also makes spatial_key optional (see above). The adjacency matrix is symmetrized internally.

  • layer (str, optional) – Layer in adata.layers storing isoform counts (default "counts").

  • group_iso_by (str, optional) – Column in adata.var used to group isoforms by gene (default "gene_symbol").

  • gene_names (str or None, optional) – Column name in adata.var used as display names for genes. If None, the values of group_iso_by are used.

  • group_gene_by_n_iso (bool, optional) – If True, genes are sorted and grouped by their isoform count before batching. Required for batch_size > 1 in fit() because the GLMM model expects a fixed isoform count per batch (default False).

  • design_mtx (tensor, array, DataFrame, str, or list of str, optional) – Design matrix for fixed effects. Accepts an array/tensor/DataFrame of shape (n_spots, n_factors), a single obs-column name (str), or a list of obs-column names.

  • covariate_names (list of str or None, optional) – Explicit covariate names. When not provided, names are inferred from column names or auto-generated.

  • min_counts (int, optional) – Minimum total isoform count across spots to retain an isoform (default 10).

  • min_bin_pct (float, optional) – Minimum fraction/percentage of spots where an isoform must be expressed (default 0.0).

  • min_component_size (int, optional) – Minimum number of spots a connected component must contain to be retained. Spots in smaller components are removed from all data structures before the spatial kernel is built. Default 1 disables filtering. A UserWarning is issued when spots are removed.

Raises:

ValueError – If input arguments are invalid or required fields are missing.

Return type:

None

test_differential_usage(method='score', print_progress=True, return_results=False)#

Parametric test for spatial isoform differential usage.

Before running this function, the design matrix must be set up using setup_data(). Each column of the design matrix corresponds to a covariate to test for differential association with the isoform usage ratios of each gene. Test statistics and p-values are computed per (gene, covariate) pair separately.

Similar to splisosm.hyptest_np.SplisosmNP.test_differential_usage(), here we also support two types of association tests but implicitly:

  • Unconditional (when model_type='glm'): test the unconditional association between isoform usage ratios and the covariate of interest (H_0: beta = 0).

  • Conditional (when model_type='glmm-full'): test for association (H_0: beta = 0) conditioned on the spatial random effect.

Parameters:
  • method ({"score", "wald"}, optional) –

    Depending on whether the design matrix is used for model fitting, different methods must be used for hypothesis testing:

    • "wald" when models were fit with fit(..., with_design_mtx=True).

    • "score" when models were fit with fit(..., with_design_mtx=False).

    Caution

    The Wald statistic with GLM/GLMM is empirically anti-conserved (i.e., lots of false positives). Always use method="score" when possible.

  • print_progress (bool, optional) – Whether to show the progress bar. Default to True.

  • return_results (bool, optional) – Whether to return the test statistics and p-values. If False, the results are stored in self._du_test_results.

Returns:

If return_results is True, returns dict with test statistics and p-values. Otherwise, returns None and stores results in self._du_test_results.

Return type:

dict or None

test_spatial_variability(method='llr', use_perm_null=False, return_results=False, print_progress=True, n_perms_per_gene=None, **kwargs)#

Parametric test for spatial variability.

Caution

The likelihood ratio statistic is not well-calibrated for sparse data. We recommend the non-parametric HSIC-based tests in splisosm.hyptest_np.SplisosmNP for spatial variability testing.

Note that the parametric and non-parametric tests are assymptotically equivalent under the null. See [SFW+26] for detailed theoretical analysis.

Parameters:
  • method ({"llr"}, optional) – The test method. Currently only support "llr", the likelihood ratio test (H_0: theta = 0, i.e. no spatial variance).

  • use_perm_null (bool, optional) – Whether to generate the null distribution from permutation. If False, use the chi-square with df = n_var_components as the null.

  • return_results (bool, optional) – Whether to return the test statistics and p-values. If False, the results will be stored in self._sv_test_results.

  • print_progress (bool, optional) – Whether to show the progress bar for permutation. Default to True.

  • n_perms_per_gene (int or None, optional) – Number of permutations per gene when use_perm_null=True. Default to 20.

  • **kwargs (Any) – Additional arguments passed to self._fit_sv_llr_perm() if use_perm_null = True.

Returns:

If return_results is True, returns dict with test statistics and p-values. Otherwise returns None.

Return type:

dict or None

adata: AnnData | None#

Source AnnData object; None before setup_data().

covariate_names: list[str]#

Covariate display names (length n_factors).

design_mtx: Tensor | None#

Design matrix (n_spots, n_factors); None if no covariates.

property filtered_adata: AnnData#

The filtered AnnData of shape (n_spots, sum(n_isos_per_gene)).

This is the data used internally after setup_data(). It is a copy of the input adata, subsetted to the retained spots and isoforms after filtering.

Raises:

RuntimeError – If setup_data() has not been called.

Return type:

AnnData

gene_names: list[str]#

Gene display names (length n_genes).

n_factors: int#

Number of covariates for differential usage testing.

n_genes: int#

Number of genes after filtering.

n_isos_per_gene: list[int]#

Number of isoforms per gene (list of length n_genes).

n_spots: int#

Number of spatial spots.

sp_kernel: Any#

Spatial kernel (SpatialCovKernel for 'glmm-full', IdentityKernel for 'glmm-null', None for 'glm'). Set by setup_data().