splisosm.hyptest_glmm#
GLMM-based hypothesis tests for spatial isoform usage.
Classes#
Parametric spatial isoform statistical modeling using GLMM. |
Module Contents#
- class splisosm.hyptest_glmm.SplisosmGLMM(model_type='glmm-full', share_variance=True, var_fix_sigma=True, var_prior_model='none', var_prior_model_params=None, init_ratio='uniform', fitting_method='joint_gd', fitting_configs=None, k_neighbors=4, rho=0.99, approx_rank=_APPROX_RANK_AUTO, device='cpu')#
Parametric spatial isoform statistical modeling using GLMM.
This is a convenience class that wraps around the
splisosm.model.MultinomGLMMfor batched model fitting and spatial variability and differential usage testing.Examples
Prepare an AnnData object and fit a GLMM:
>>> from splisosm import SplisosmGLMM >>> # adata : AnnData of shape (n_spots, n_isoforms) >>> # adata.layers["counts"] — raw isoform counts >>> # adata.var["gene_symbol"] — column grouping isoforms by gene >>> # adata.obsm["spatial"] — (n_spots, 2) spatial coordinates >>> # adata.obs["covariate"] — optional covariate column for DU testing >>> model = SplisosmGLMM( ... model_type="glmm-full", # 'glmm-full' | 'glmm-null' | 'glm' ... fitting_method="joint_gd", # 'joint_gd' | 'joint_newton' | 'marginal_gd' | 'marginal_newton' ... device="cpu", # 'cpu' | 'cuda' (NVIDIA) | 'mps' (Apple Silicon) ... # approx_rank: omit for auto | None (force full rank) | int (fixed low-rank) ... ) >>> model.setup_data( ... adata, ... layer="counts", ... group_iso_by="gene_symbol", ... group_gene_by_n_iso=True, # required for batch_size > 1 in fit() ... design_mtx="covariate", # obs column name, or (n_spots, n_factors) array ... ) >>> model.fit( ... n_jobs=1, batch_size=20, ... with_design_mtx=False, # False → score test (recommended for DU) ... ) >>> fitted_models = model.get_fitted_models()
Differential usage test:
>>> model.test_differential_usage(method="score") # or "wald" if with_design_mtx=True >>> du_results = model.get_formatted_test_results("du") >>> print(du_results.head())
- Parameters:
model_type (Literal['glmm-full', 'glmm-null', 'glm']) – Which model to fit. Can be one of
'glmm-full'(Multinomial GLMM with spatial random effects),'glmm-null'(Multinomial GLMM with white noise),'glm'(Multinomial GLM).share_variance (bool) – Whether to share the variance component across isoforms.
var_fix_sigma (bool) – Whether to fix the total variance (
sigma) to the Fano-factor initial estimate. WhenTrue(default), onlytheta_logitis learned, producing conservative but well-calibrated hypothesis tests. Set toFalseto learn sigma jointly with other parameters; this may yield higher power for the SV test but can inflate false positive rates for both SV and DU tests.var_prior_model (str) – The prior model on the total variance
sigma. Default is'none'with no prior. Other options are'gamma'(Gamma prior) and'inv_gamma'(Inverse Gamma prior).var_prior_model_params (dict | None) – The parameters for the prior model on the total variance
sigma. For'gamma', the default parameters are{'alpha': 2.0, 'beta': 0.3}. For'inv_gamma', the default parameters are{'alpha': 3, 'beta': 0.5}.init_ratio (str) – The initialization method for the logit isoform usage ratio. Options are
'observed'(initialize using observed counts) and'uniform'(equal isoform usage across space).fitting_method (str) – The fitting method to use when
model_type='glmm-full'or'glmm-null'. Options are'joint_gd'(joint likelihood with gradient descent),'joint_newton'(joint likelihood with Newton’s method),'marginal_gd'(marginal likelihood with gradient descent), and'marginal_newton'(marginal likelihood with Newton’s method).fitting_configs (dict | None) –
A dictionary of fitting configurations with the following keys:
'lr': float, learning rate'optim': str, optimization method (one of'adam','sgd', or'lbfgs')'tol': float, tolerance for convergence'max_epochs': int, maximum number of epochs'patience': int, number of epochs to wait for improvement before stopping'update_nu_every_k': int, number of iterations to updatenuwhen usingfitting_method='marginal_newton'
k_neighbors (int, optional) – Number of nearest neighbours used to build the spatial k-NN adjacency graph (default 4). Passed to
splisosm.kernel.SpatialCovKernel. Ignored whenadj_keyis provided tosetup_data().rho (float, optional) – Spectral smoothing parameter for the ICAR-like spatial kernel (default 0.99). Values close to 1 produce smoother spatial covariance.
approx_rank (int or None, optional) –
Number of leading eigenvectors of the spatial kernel to retain.
Not specified (default): automatically selects the rank at
setup_data()time — full rank whenn_spots ≤ 5000, otherwiseceil(sqrt(n_spots) * 4).None: always use full rank regardless ofn_spots(a warning is emitted whenn_spots > 5000because the eigendecomposition of a large dense matrix is expensive).Positive integer: use exactly that many eigenvectors.
device ({"cpu", "cuda", "mps"}, optional) – Device for all model computation (default
"cpu"). Use"cuda"for NVIDIA GPU or"mps"for Apple Silicon. Parallel fitting (n_jobs > 1) is not supported for non-CPU devices and will automatically fall back to single-core with a warning.
See also
- extract_feature_summary(level='gene', print_progress=True)#
Compute filtered feature-level summary statistics.
Gene-level statistics are aggregated across all isoforms that passed the filters applied in
setup_data(). Isoform-level statistics are computed per isoform and augmented onto the corresponding rows ofadata.var.Results are cached: repeated calls with the same
levelreturn the cachedpandas.DataFramewithout recomputation.- Parameters:
level (Literal['gene', 'isoform']) – Summary granularity.
'gene': one row per gene.'isoform': one row per isoform that passed filtering.print_progress (bool) – Whether to show a progress bar.
- Returns:
For
level='gene', the index is the gene display name and the columns are:'n_isos': int. Number of isoforms retained after filtering.'perplexity': float. Effective number of isoforms based on the marginal isoform usage entropy.'pct_bin_on': float. Fraction of spots with non-zero total gene counts.'count_avg': float. Mean per-spot total count for the gene.'count_std': float. Std of per-spot total count for the gene.
For
level='isoform', the index is the isoform name (matchingadata.var_names) and the columns are the originaladata.varcolumns plus:'pct_bin_on': float. Fraction of spots with count > 0.'count_total': float. Total counts across all spots.'count_avg': float. Mean count per spot.'count_std': float. Std of count per spot.'ratio_total': float. Fraction of total gene counts attributable to this isoform.'ratio_avg': float. Mean per-spot isoform usage ratio (computed over spots with non-zero gene coverage).'ratio_std': float. Std of per-spot isoform usage ratio (computed over spots with non-zero gene coverage).
- Return type:
- Raises:
RuntimeError – If
setup_data()has not been called.ValueError – If
levelis not'gene'or'isoform'.
- fit(n_jobs=1, batch_size=1, quiet=True, print_progress=True, with_design_mtx=False, from_null=False, refit_null=True, random_seed=None)#
Model fitting.
- Parameters:
n_jobs (int, optional) – The number of cores to use for parallel fitting. Default to 1.
batch_size (int, optional) – The maximum number of genes per job to fit in parallel. Default to 1.
quiet (bool, optional) – Whether to suppress the fitting logs. Default to True.
print_progress (bool, optional) – Whether to show the progress bar. Default to True.
with_design_mtx (bool, optional) – Whether to include the design matrix for the fixed effects. Default to False.
from_null (bool, optional) – Whether to initialize the full model from a null model with zero spatial variability (white-noise random effect).
refit_null (bool, optional) – Whether to refit the null model after fitting the full model. Only applicable when
from_null=True.random_seed (int or None, optional) – The random seed for reproducibility. Default to None.
- Return type:
None
See also
- get_fitted_models()#
Get the fitted models after running fit().
Each call reconstructs the full model objects from the stored lightweight state. The returned list is a fresh reconstruction; mutations to the returned models are not persisted.
- Returns:
models –
model_type='glmm-full': list[splisosm.hyptest_glmm.IsoFullModel]model_type='glmm-null': list[splisosm.hyptest_glmm.IsoNullNoSpVar]model_type='glm': list[splisosm.model.MultinomGLM]
- Return type:
list of fitted models
- get_fitted_ratios_anndata(layer_name='fitted_ratios')#
Return a copy of the filtered AnnData with fitted isoform ratios as a new layer.
For each gene the per-spot softmax isoform ratios are taken from the fitted model (
model.get_isoform_ratio()) and placed into the corresponding isoform columns of a(n_spots, n_filtered_isoforms)dense matrix, which is stored underlayer_name.Isoform ordering guarantee: the columns of the new layer follow the exact row order of
self._filtered_adata.var. This holds becausesetup_data()builds per-gene count tensors by slicingfiltered_adataisoforms in theirvarrow order (viagroupby(sort=False)), andget_isoform_ratio()returns ratios in the same column order as the input counts.- Parameters:
layer_name (str) – Name of the new layer. Defaults to
'fitted_ratios'.- Returns:
A copy of
self._filtered_adata(shape(n_spots, n_filtered_isoforms)) withlayer_nameadded. ThevarDataFrame is identical toself._filtered_adata.var, so isoform metadata (e.g. the gene-grouping column) is intact.- Return type:
AnnData
- Raises:
RuntimeError – If
fit()has not been called, or ifsetup_data()was not called with anAnnDataobject (i.e. raw tensors were provided instead).
- get_formatted_test_results(test_type, with_gene_summary=False)#
Get the formatted test results as data frame.
- Parameters:
test_type ({"sv", "du"}) – Type of test results to retrieve.
with_gene_summary (bool, optional) – If
True, append gene-level summary statistics fromextract_feature_summary().
- Returns:
Formatted test results.
- Return type:
- get_gene_model(gene_name)#
Return the fitted model for a single gene by display name.
Each call reconstructs a fresh model from stored state.
- Parameters:
gene_name (str) – Display name of the gene (must appear in
gene_names).- Returns:
Fitted
MultinomGLM,IsoFullModel, orIsoNullNoSpVarinstancecorresponding to
gene_name.
- Raises:
RuntimeError – If
fit()has not been called yet.KeyError – If
gene_nameis not found ingene_names.
- Return type:
Any
- get_training_summary()#
Return a per-gene training summary as a
pandas.DataFrame.The DataFrame is indexed by gene name and contains the following columns:
'model_type': str. Model type ('glmm-full','glmm-null', or'glm').'converged': bool. Whether the gene converged during training.'best_loss': float. Best (lowest) negative log-likelihood achieved.'best_epoch': int. Epoch at which the best loss was recorded.'fitting_time_s': float. Wall-clock seconds spent fitting this gene.
Only available after calling
fit().
- static load(path, map_location=None)#
Load a
SplisosmGLMMmodel previously saved withsave().- Parameters:
path (str) – Path to the file written by
save().map_location (Optional[str]) – Optional device string (e.g.
'cpu','cuda:0') to remap tensor storage when loading a model that was saved on a different device. Defaults toNone(preserves original device mapping).
- Returns:
The fully restored model, including fitted parameters, test results, and all metadata.
- Return type:
Examples
>>> model.save("model.pt") >>> loaded = SplisosmGLMM.load("model.pt") >>> loaded.get_training_summary()
- save(path)#
Save the fitted model to a file.
Only lightweight per-gene
_FittedGeneStateobjects (fitted parameters + convergence metadata) are stored – no fullnn.Moduleobjects. The raw.adatais not saved; only_filtered_adatais persisted. After loading,.adatawill beNone.- Parameters:
path (str) – The path to save the fitted models.
- Return type:
None
- setup_data(adata, *, spatial_key='spatial', adj_key=None, layer='counts', group_iso_by='gene_symbol', gene_names=None, group_gene_by_n_iso=False, design_mtx=None, covariate_names=None, min_counts=10, min_bin_pct=0.0, min_component_size=1)#
Setup the data for the GLMM model.
- Parameters:
adata (AnnData) – Annotated data matrix. Counts are read from
adata.layers[layer]grouped bygroup_iso_by, and spatial coordinates fromadata.obsm[spatial_key].spatial_key (str, optional) – Key in
adata.obsmfor spatial coordinates (default"spatial").adj_key (str or None, optional) – Key in
adata.obspfor a pre-built adjacency matrix. When provided, it overrides the k-NN graph construction from coordinates and be used directly to build the spatial kernel. The adjacency matrix is symmetrized internally.layer (str, optional) – Layer in
adata.layersstoring isoform counts (default"counts").group_iso_by (str, optional) – Column in
adata.varused to group isoforms by gene (default"gene_symbol").gene_names (str or None, optional) – Column name in
adata.varused as display names for genes. IfNone, the values ofgroup_iso_byare used.group_gene_by_n_iso (bool, optional) – If
True, genes are sorted and grouped by their isoform count before batching. Required forbatch_size > 1infit()because the GLMM model expects a fixed isoform count per batch (defaultFalse).design_mtx (tensor, array, DataFrame, str, or list of str, optional) – Design matrix for fixed effects. Accepts an array/tensor/DataFrame of shape
(n_spots, n_factors), a single obs-column name (str), or a list of obs-column names.covariate_names (list of str or None, optional) – Explicit covariate names. When not provided, names are inferred from column names or auto-generated.
min_counts (int, optional) – Minimum total isoform count across spots to retain an isoform (default 10).
min_bin_pct (float, optional) – Minimum fraction/percentage of spots where an isoform must be expressed (default 0.0).
min_component_size (int, optional) – Minimum number of spots a connected component must contain to be retained. Spots in smaller components are removed from all data structures before the spatial kernel is built. Default 1 disables filtering. A
UserWarningis issued when spots are removed.
- Raises:
ValueError – If input arguments are invalid or required fields are missing.
- Return type:
None
- test_differential_usage(method='score', print_progress=True, return_results=False)#
Parametric test for spatial isoform differential usage.
Before running this function, the design matrix must be set up using
setup_data(). Each column of the design matrix corresponds to a covariate to test for differential association with the isoform usage ratios of each gene. Test statistics and p-values are computed per (gene, covariate) pair separately.Similar to
splisosm.hyptest_np.SplisosmNP.test_differential_usage(), here we also support two types of association tests but implicitly:Unconditional (when
model_type='glm'): test the unconditional association between isoform usage ratios and the covariate of interest (H_0:beta= 0).Conditional (when
model_type='glmm-full'): test for association (H_0:beta= 0) conditioned on the spatial random effect.
- Parameters:
method ({"score", "wald"}, optional) –
Depending on whether the design matrix is used for model fitting, different methods must be used for hypothesis testing:
"wald"when models were fit withfit(..., with_design_mtx=True)."score"when models were fit withfit(..., with_design_mtx=False).
Caution
The Wald statistic with GLM/GLMM is empirically anti-conserved (i.e., lots of false positives). Always use
method="score"when possible.print_progress (bool, optional) – Whether to show the progress bar. Default to True.
return_results (bool, optional) – Whether to return the test statistics and p-values. If False, the results are stored in
self._du_test_results.
- Returns:
If
return_resultsis True, returns dict with test statistics and p-values. Otherwise, returns None and stores results in self._du_test_results.- Return type:
dict or None
- test_spatial_variability(method='llr', use_perm_null=False, return_results=False, print_progress=True, n_perms_per_gene=None, **kwargs)#
Parametric test for spatial variability.
Caution
The likelihood ratio statistic is not well-calibrated for sparse data. We recommend the non-parametric HSIC-based tests in
splisosm.hyptest_np.SplisosmNPfor spatial variability testing.Note that the parametric and non-parametric tests are assymptotically equivalent under the null. See [SFW+26] for detailed theoretical analysis.
- Parameters:
method ({"llr"}, optional) – The test method. Currently only support
"llr", the likelihood ratio test (H_0:theta= 0, i.e. no spatial variance).use_perm_null (bool, optional) – Whether to generate the null distribution from permutation. If False, use the chi-square with df = n_var_components as the null.
return_results (bool, optional) – Whether to return the test statistics and p-values. If False, the results will be stored in
self._sv_test_results.print_progress (bool, optional) – Whether to show the progress bar for permutation. Default to True.
n_perms_per_gene (int or None, optional) – Number of permutations per gene when
use_perm_null=True. Default to 20.**kwargs (Any) – Additional arguments passed to
self._fit_sv_llr_perm()ifuse_perm_null = True.
- Returns:
If
return_resultsis True, returns dict with test statistics and p-values. Otherwise returns None.- Return type:
dict or None
- adata: AnnData | None#
Source
AnnDataobject;Nonebeforesetup_data().
- property filtered_adata: AnnData#
The filtered AnnData of shape (
n_spots, sum(n_isos_per_gene)).This is the data used internally after
setup_data(). It is a copy of the inputadata, subsetted to the retained spots and isoforms after filtering.- Raises:
RuntimeError – If
setup_data()has not been called.- Return type:
- n_factors: int#
Number of covariates for differential usage testing.
- n_genes: int#
Number of genes after filtering.
- n_spots: int#
Number of spatial spots.
- sp_kernel: Any#
Spatial kernel (
SpatialCovKernelfor'glmm-full',IdentityKernelfor'glmm-null',Nonefor'glm'). Set bysetup_data().