splisosm.hyptest_glmm

splisosm.hyptest_glmm#

GLMM-based hypothesis tests for spatial isoform usage.

Classes#

SplisosmGLMM

Parametric spatial isoform statistical modeling using GLMM.

Module Contents#

class splisosm.hyptest_glmm.SplisosmGLMM(model_type='glmm-full', share_variance=True, var_fix_sigma=True, var_prior_model='none', var_prior_model_params=None, init_ratio='uniform', fitting_method='joint_gd', fitting_configs=None, k_neighbors=4, rho=0.99, approx_rank=_APPROX_RANK_AUTO, device='cpu')#

Parametric spatial isoform statistical modeling using GLMM.

This is a convenience class that wraps around the splisosm.model.MultinomGLMM for batched model fitting and spatial variability and differential usage testing.

Examples

Prepare an AnnData object and fit a GLMM:

>>> from splisosm import SplisosmGLMM
>>> # adata : AnnData of shape (n_spots, n_isoforms)
>>> #   adata.layers["counts"]    — raw isoform counts
>>> #   adata.var["gene_symbol"]  — column grouping isoforms by gene
>>> #   adata.obsm["spatial"]     — (n_spots, 2) spatial coordinates
>>> #   adata.obs["covariate"]    — optional covariate column for DU testing
>>> model = SplisosmGLMM(
...     model_type="glmm-full",    # 'glmm-full' | 'glmm-null' | 'glm'
...     fitting_method="joint_gd", # 'joint_gd' | 'joint_newton' | 'marginal_gd' | 'marginal_newton'
...     device="cpu",              # 'cpu' | 'cuda' (NVIDIA) | 'mps' (Apple Silicon)
...     # approx_rank: omit for auto | None (force full rank) | int (fixed low-rank)
... )
>>> model.setup_data(
...     adata,
...     layer="counts",
...     group_iso_by="gene_symbol",
...     group_gene_by_n_iso=True,  # required for batch_size > 1 in fit()
...     design_mtx="covariate",    # obs column name, or (n_spots, n_factors) array
... )
>>> model.fit(
...     n_jobs=1, batch_size=20,
...     with_design_mtx=False,     # False → score test (recommended for DU)
... )
>>> fitted_models = model.get_fitted_models()

Differential usage test:

>>> model.test_differential_usage(method="score")  # or "wald" if with_design_mtx=True
>>> du_results = model.get_formatted_test_results("du")
>>> print(du_results.head())

Parameters:

model_type (Literal['glmm-full', 'glmm-null', 'glm']) – Which model to fit. Can be one of 'glmm-full' (Multinomial GLMM with spatial random effects), 'glmm-null' (Multinomial GLMM with white noise), 'glm' (Multinomial GLM).
share_variance (bool) – Whether to share the variance component across isoforms.
var_fix_sigma (bool) – Whether to fix the total variance (sigma) to the Fano-factor initial estimate. When True (default), only theta_logit is learned, producing conservative but well-calibrated hypothesis tests. Set to False to learn sigma jointly with other parameters; this may yield higher power for the SV test but can inflate false positive rates for both SV and DU tests.
var_prior_model (str) – The prior model on the total variance sigma. Default is 'none' with no prior. Other options are 'gamma' (Gamma prior) and 'inv_gamma' (Inverse Gamma prior).
var_prior_model_params (dict | None) – The parameters for the prior model on the total variance sigma. For 'gamma', the default parameters are {'alpha': 2.0, 'beta': 0.3}. For 'inv_gamma', the default parameters are {'alpha': 3, 'beta': 0.5}.
init_ratio (str) – The initialization method for the logit isoform usage ratio. Options are 'observed' (initialize using observed counts) and 'uniform' (equal isoform usage across space).
fitting_method (str) – The fitting method to use when model_type='glmm-full' or 'glmm-null'. Options are 'joint_gd' (joint likelihood with gradient descent), 'joint_newton' (joint likelihood with Newton’s method), 'marginal_gd' (marginal likelihood with gradient descent), and 'marginal_newton' (marginal likelihood with Newton’s method).
fitting_configs (dict | None) –
A dictionary of fitting configurations with the following keys:
- 'lr': float, learning rate
- 'optim': str, optimization method (one of 'adam', 'sgd', or 'lbfgs')
- 'tol': float, tolerance for convergence
- 'max_epochs': int, maximum number of epochs
- 'patience': int, number of epochs to wait for improvement before stopping
- 'update_nu_every_k': int, number of iterations to update nu when using fitting_method='marginal_newton'
k_neighbors (int, optional) – Number of nearest neighbours used to build the spatial k-NN adjacency graph (default 4). Passed to splisosm.kernel.SpatialCovKernel. Ignored when adj_key is provided to setup_data().
rho (float, optional) – Spectral smoothing parameter for the ICAR-like spatial kernel (default 0.99). Values close to 1 produce smoother spatial covariance.
approx_rank (int or None, optional) –
Number of leading eigenvectors of the spatial kernel to retain.
- Not specified (default): automatically selects the rank at setup_data() time — full rank when n_spots ≤ 5000, otherwise ceil(sqrt(n_spots) * 4).
- None: always use full rank regardless of n_spots (a warning is emitted when n_spots > 5000 because the eigendecomposition of a large dense matrix is expensive).
- Positive integer: use exactly that many eigenvectors.
device ({"cpu", "cuda", "mps"}, optional) – Device for all model computation (default "cpu"). Use "cuda" for NVIDIA GPU or "mps" for Apple Silicon. Parallel fitting (n_jobs > 1) is not supported for non-CPU devices and will automatically fall back to single-core with a warning.

See also

splisosm.model.MultinomGLMM.fit()

get_fitted_models()#

Get the fitted models after running fit().

Each call reconstructs the full model objects from the stored lightweight state. The returned list is a fresh reconstruction; mutations to the returned models are not persisted.

Returns:

models –

model_type='glmm-full': list[splisosm.hyptest_glmm.IsoFullModel]
model_type='glmm-null': list[splisosm.hyptest_glmm.IsoNullNoSpVar]
model_type='glm': list[splisosm.model.MultinomGLM]

Return type:

list of fitted models

get_fitted_ratios_anndata(layer_name='fitted_ratios')#

Return a copy of the filtered AnnData with fitted isoform ratios as a new layer.

For each gene the per-spot softmax isoform ratios are taken from the fitted model (model.get_isoform_ratio()) and placed into the corresponding isoform columns of a (n_spots, n_filtered_isoforms) dense matrix, which is stored under layer_name.

Isoform ordering guarantee: the columns of the new layer follow the exact row order of self._filtered_adata.var. This holds because setup_data() builds per-gene count tensors by slicing filtered_adata isoforms in their var row order (via groupby(sort=False)), and get_isoform_ratio() returns ratios in the same column order as the input counts.

Parameters:: layer_name (str) – Name of the new layer. Defaults to 'fitted_ratios'.
Returns:: A copy of self._filtered_adata (shape (n_spots, n_filtered_isoforms)) with layer_name added. The var DataFrame is identical to self._filtered_adata.var, so isoform metadata (e.g. the gene-grouping column) is intact.
Return type:: AnnData
Raises:: RuntimeError – If fit() has not been called, or if setup_data() was not called with an AnnData object (i.e. raw tensors were provided instead).

get_formatted_test_results(test_type, with_gene_summary=False)#

Get the formatted test results as data frame.

Parameters:

test_type ({"sv", "du"}) – Type of test results to retrieve.
with_gene_summary (bool, optional) – If True, append gene-level summary statistics from extract_feature_summary().

Returns:

Formatted test results.

Return type:

DataFrame

get_gene_model(gene_name)#

Return the fitted model for a single gene by display name.

Each call reconstructs a fresh model from stored state.

Parameters:

gene_name (str) – Display name of the gene (must appear in gene_names).

Returns:

Fitted MultinomGLM, IsoFullModel, or IsoNullNoSpVar instance
corresponding to gene_name.

Raises:

RuntimeError – If fit() has not been called yet.
KeyError – If gene_name is not found in gene_names.

Return type:

Any

get_training_summary()#

Return a per-gene training summary as a pandas.DataFrame.

The DataFrame is indexed by gene name and contains the following columns:

'model_type': str. Model type ('glmm-full', 'glmm-null', or 'glm').
'converged': bool. Whether the gene converged during training.
'best_loss': float. Best (lowest) negative log-likelihood achieved.
'best_epoch': int. Epoch at which the best loss was recorded.
'fitting_time_s': float. Wall-clock seconds spent fitting this gene.

Only available after calling fit().

Raises:: RuntimeError – If fit() has not been called yet.
Return type:: DataFrame

static load(path, map_location=None)#

Load a SplisosmGLMM model previously saved with save().

Parameters:

path (str) – Path to the file written by save().
map_location (Optional[str]) – Optional device string (e.g. 'cpu', 'cuda:0') to remap tensor storage when loading a model that was saved on a different device. Defaults to None (preserves original device mapping).

Returns:

The fully restored model, including fitted parameters, test results, and all metadata.

Return type:

SplisosmGLMM

Examples

>>> model.save("model.pt")
>>> loaded = SplisosmGLMM.load("model.pt")
>>> loaded.get_training_summary()

save(path)#

Save the fitted model to a file.

Only lightweight per-gene _FittedGeneState objects (fitted parameters + convergence metadata) are stored – no full nn.Module objects. The raw .adata is not saved; only _filtered_adata is persisted. After loading, .adata will be None.

Parameters:: path (str) – The path to save the fitted models.
Return type:: None

setup_data(adata, *, spatial_key='spatial', adj_key=None, layer='counts', group_iso_by='gene_symbol', gene_names=None, group_gene_by_n_iso=False, design_mtx=None, covariate_names=None, min_counts=10, min_bin_pct=0.0, min_component_size=1)#

Setup the data for the GLMM model.

Parameters:

adata (AnnData) – Annotated data matrix. Counts are read from adata.layers[layer] grouped by group_iso_by, and spatial coordinates from adata.obsm[spatial_key].
spatial_key (str, optional) – Key in adata.obsm for spatial coordinates (default "spatial"). Optional when adj_key is provided: if the key is missing from adata.obsm the spatial kernel is built from the adjacency alone. SplisosmGLMM does not require raw coordinates past kernel construction.
adj_key (str or None, optional) – Key in adata.obsp for a pre-built adjacency matrix. When provided, it overrides the k-NN graph construction from coordinates and be used directly to build the spatial kernel. Also makes spatial_key optional (see above). The adjacency matrix is symmetrized internally.
layer (str, optional) – Layer in adata.layers storing isoform counts (default "counts").
group_iso_by (str, optional) – Column in adata.var used to group isoforms by gene (default "gene_symbol").
gene_names (str or None, optional) – Column name in adata.var used as display names for genes. If None, the values of group_iso_by are used.
group_gene_by_n_iso (bool, optional) – If True, genes are sorted and grouped by their isoform count before batching. Required for batch_size > 1 in fit() because the GLMM model expects a fixed isoform count per batch (default False).
design_mtx (tensor, array, DataFrame, str, or list of str, optional) – Design matrix for fixed effects. Accepts an array/tensor/DataFrame of shape (n_spots, n_factors), a single obs-column name (str), or a list of obs-column names.
covariate_names (list of str or None, optional) – Explicit covariate names. When not provided, names are inferred from column names or auto-generated.
min_counts (int, optional) – Minimum total isoform count across spots to retain an isoform (default 10).
min_bin_pct (float, optional) – Minimum fraction/percentage of spots where an isoform must be expressed (default 0.0).
min_component_size (int, optional) – Minimum number of spots a connected component must contain to be retained. Spots in smaller components are removed from all data structures before the spatial kernel is built. Default 1 disables filtering. A UserWarning is issued when spots are removed.

Raises:

ValueError – If input arguments are invalid or required fields are missing.

Return type:

None

test_differential_usage(method='score', print_progress=True, return_results=False)#

Parametric test for spatial isoform differential usage.

Before running this function, the design matrix must be set up using setup_data(). Each column of the design matrix corresponds to a covariate to test for differential association with the isoform usage ratios of each gene. Test statistics and p-values are computed per (gene, covariate) pair separately.

Similar to splisosm.hyptest_np.SplisosmNP.test_differential_usage(), here we also support two types of association tests but implicitly:

Unconditional (when model_type='glm'): test the unconditional association between isoform usage ratios and the covariate of interest (H_0: beta = 0).
Conditional (when model_type='glmm-full'): test for association (H_0: beta = 0) conditioned on the spatial random effect.

Parameters:

method ({"score", "wald"}, optional) –
Depending on whether the design matrix is used for model fitting, different methods must be used for hypothesis testing:
- "wald" when models were fit with fit(..., with_design_mtx=True).
- "score" when models were fit with fit(..., with_design_mtx=False).
Caution

The Wald statistic with GLM/GLMM is empirically anti-conserved (i.e., lots of false positives). Always use method="score" when possible.
print_progress (bool, optional) – Whether to show the progress bar. Default to True.
return_results (bool, optional) – Whether to return the test statistics and p-values. If False, the results are stored in self._du_test_results.

Returns:

If return_results is True, returns dict with test statistics and p-values. Otherwise, returns None and stores results in self._du_test_results.

Return type:

dict or None

test_spatial_variability(method='llr', use_perm_null=False, return_results=False, print_progress=True, n_perms_per_gene=None, **kwargs)#

Parametric test for spatial variability.

Caution

The likelihood ratio statistic is not well-calibrated for sparse data. We recommend the non-parametric HSIC-based tests in splisosm.hyptest_np.SplisosmNP for spatial variability testing.

Note that the parametric and non-parametric tests are assymptotically equivalent under the null. See [SFW+26] for detailed theoretical analysis.

Parameters:

method ({"llr"}, optional) – The test method. Currently only support "llr", the likelihood ratio test (H_0: theta = 0, i.e. no spatial variance).
use_perm_null (bool, optional) – Whether to generate the null distribution from permutation. If False, use the chi-square with df = n_var_components as the null.
return_results (bool, optional) – Whether to return the test statistics and p-values. If False, the results will be stored in self._sv_test_results.
print_progress (bool, optional) – Whether to show the progress bar for permutation. Default to True.
n_perms_per_gene (int or None, optional) – Number of permutations per gene when use_perm_null=True. Default to 20.
**kwargs (Any) – Additional arguments passed to self._fit_sv_llr_perm() if use_perm_null = True.

Returns:

If return_results is True, returns dict with test statistics and p-values. Otherwise returns None.

Return type:

dict or None