splisosm.hyptest_glmm#

GLMM-based hypothesis tests for spatial isoform usage.

Classes#

SplisosmGLMM

Parametric spatial isoform statistical modeling using GLMM.

Module Contents#

class splisosm.hyptest_glmm.SplisosmGLMM(model_type='glmm-full', share_variance=True, var_parameterization_sigma_theta=True, var_fix_sigma=False, var_prior_model='none', var_prior_model_params={}, init_ratio='observed', fitting_method='joint_gd', fitting_configs={'max_epochs': -1})#

Parametric spatial isoform statistical modeling using GLMM.

This is a convenience class that wraps around the splisosm.model.MultinomGLMM for batched model fitting and spatial variability and differential usage testing.

Examples

Setup data:

>>> from splisosm import SplisosmGLMM
>>> import torch
>>> # Simulate data for 10 genes with different number of isoforms
>>> data_3_iso = [torch.randint(low=0, high=5, size=(100, 3)) for _ in range(5)]  # 5 genes with 3 isoforms
>>> data_4_iso = [torch.randint(low=0, high=5, size=(100, 4)) for _ in range(5)]  # 5 genes with 4 isoforms
>>> data = data_3_iso + data_4_iso
>>> coordinates = torch.rand(100, 2)  # 100 spots with 2D coordinates
>>> design_mtx = torch.rand(100, 2)  # 100 spots with 2 covariates

Model fitting:

>>> model = SplisosmGLMM(model_type='glmm-full')
>>> model.setup_data(data, coordinates, design_mtx=design_mtx, group_gene_by_n_iso=True)
>>> model.fit(n_jobs=1, batch_size=5, with_design_mtx=False)
>>> fitted_models = model.get_fitted_models()
>>> print(fitted_models[0])  # print the fitted model for the first gene

Differential usage test:

>>> model.test_differential_usage(method='score')
>>> du_results = model.get_formatted_test_results('du')
>>> print(du_results.head())
Parameters:
  • model_type (Literal['glmm-full', 'glmm-null', 'glm']) – Which model to fit. Can be one of 'glmm-full' (Multinomial GLMM with spatial random effects), 'glmm-null' (Multinomial GLMM with white noise), 'glm' (Multinomial GLM).

  • share_variance (bool) – Whether to share the variance component across isoforms.

  • var_parameterization_sigma_theta (bool) – Whether to parameterize the variance components as (sigma, theta_logit) or (sigma_sp, sigma_nsp). If True, the variance components will be (sigma, theta_logit), where sigma is the total variance and theta_logit is the logit of the spatial variance proportion. If False, the variance components will be (sigma_sp, sigma_nsp), where sigma_sp is the spatial variance and sigma_nsp is the non-spatial variance.

  • var_fix_sigma (bool) – Whether to fix the total variance (sigma) or not. If True, the total variance will be fixed to the initial value, which is the average per-spot variance of isoform counts normalized by its mean expression. See MultinomGLMM._initialize_params for details.

  • var_prior_model (str) – The prior model on the total variance sigma. Default is 'none' with no prior. Other options are 'gamma' (Gamma prior) and 'inv_gamma' (Inverse Gamma prior).

  • var_prior_model_params (dict) – The parameters for the prior model on the total variance sigma. For 'gamma', the default parameters are {'alpha': 2.0, 'beta': 0.3}. For 'inv_gamma', the default parameters are {'alpha': 3, 'beta': 0.5}.

  • init_ratio (str) – The initialization method for the logit isoform usage ratio. Options are 'observed' (initialize using observed counts) and 'uniform' (equal isoform usage across space).

  • fitting_method (str) – The fitting method to use when model_type='glmm-full' or 'glmm-null'. Options are 'joint_gd' (joint likelihood with gradient descent), 'joint_newton' (joint likelihood with Newton’s method), 'marginal_gd' (marginal likelihood with gradient descent), and 'marginal_newton' (marginal likelihood with Newton’s method).

  • fitting_configs (dict) –

    A dictionary of fitting configurations with the following keys:

    • 'lr': float, learning rate

    • 'optim': str, optimization method (one of 'adam', 'sgd', or 'lbfgs')

    • 'tol': float, tolerance for convergence

    • 'max_epochs': int, maximum number of epochs

    • 'patience': int, number of epochs to wait for improvement before stopping

    • 'update_nu_every_k': int, number of iterations to update nu when using fitting_method='marginal_newton'

fit(n_jobs=1, batch_size=1, quiet=True, print_progress=True, with_design_mtx=False, from_null=False, refit_null=True, random_seed=None)#

Model fitting.

Parameters:
  • n_jobs (int) – The number of cores to use for parallel fitting. Default to 1.

  • batch_size (int) – The maximum number of genes per job to fit in parallel. Default to 1.

  • quiet (bool) – Whether to suppress the fitting logs. Default to True.

  • print_progress (bool) – Whether to show the progress bar. Default to True.

  • with_design_mtx (bool) – Whether to include the design matrix for the fixed effects. Default to False.

  • from_null (bool) – Whether to initialize the full model from a null model with zero spatial variability (white-noise random effect).

  • refit_null (bool) – Whether to refit the null model after fitting the full model.

  • random_seed (Optional[int]) – The random seed for reproducibility. Default to None.

Return type:

None

get_fitted_models()#

Get the fitted models after running fit().

Returns:

models

  • model_type='glmm-full': list[splisosm.hyptest_glmm.IsoFullModel]

  • model_type='glmm-null': list[splisosm.hyptest_glmm.IsoNullNoSpVar]

  • model_type='glm': list[splisosm.model.MultinomGLM]

Return type:

list of fitted models

get_formatted_test_results(test_type)#

Get the formatted test results as data frame.

Parameters:

test_type (Literal['sv', 'du']) – Type of test results to retrieve. Can be one of 'sv' (spatial variability) or 'du' (differential usage).

Returns:

Formatted test results.

Return type:

DataFrame

save(path)#

Save the fitted models to a file.

This function is designed to overcome a limitation of torch.save(), which naively saves n_genes copies of the spatial kernel matrix self.corr_sp. The kernel matrix is reconstructed per fitted model using self._corr_sp_eigvals and self._corr_sp_eigvecs.

Parameters:

path (str) – The path to save the fitted models.

Return type:

None

setup_data(data=None, coordinates=None, design_mtx=None, gene_names=None, group_gene_by_n_iso=False, covariate_names=None, *, adata=None, spatial_key='spatial', layer='counts', group_iso_by='gene_symbol', min_counts=10, min_bin_pct=0.0, filter_single_iso_genes=True)#

Setup the data for the model.

This method supports two input modes for backward compatibility.

  • Legacy mode: pass data and coordinates directly.

  • AnnData mode: pass adata, where counts are extracted from adata.layers[layer] grouped by group_iso_by, and coordinates are read from adata.obsm[spatial_key]. See splisosm.utils.prepare_inputs_from_anndata() for details.

Parameters:
  • data (Optional[list[Union[Tensor, ndarray]]]) – Legacy mode only. List of tensors/arrays with shape (n_spots, n_isos) containing isoform counts for each gene.

  • coordinates (Optional[Union[ndarray, Tensor, DataFrame]]) – Legacy mode only. Shape (n_spots, 2), the spatial coordinates.

  • design_mtx (Optional[Union[ndarray, Tensor, DataFrame, str, list[str]]]) –

    Design matrix for fixed effects.

    • Legacy mode: tensor/array/dataframe of shape (n_spots, n_factors).

    • AnnData mode: tensor/array/dataframe, or one obs-column name (str), or a list of obs-column names.

  • gene_names (Optional[Union[list[str], str]]) –

    Gene names.

    • Legacy mode: list of gene names.

    • AnnData mode: optional column name in adata.var used as display names for grouped genes; if None, use grouped gene IDs.

  • group_gene_by_n_iso (bool) – Whether to group genes by the number of isoforms.

  • covariate_names (Optional[list[str]]) – List of covariate names.

  • adata (Optional[AnnData]) – AnnData object used in the new input mode.

  • spatial_key (str) – Key in adata.obsm for spatial coordinates.

  • layer (str) – Counts layer in adata.layers.

  • group_iso_by (str) – Column in adata.var used to group isoforms by gene.

  • min_counts (int) – Minimum total isoform count across spots required to retain an isoform in AnnData mode.

  • min_bin_pct (float) – Minimum percentage/fraction of spots where an isoform is expressed in AnnData mode. Values in [0, 1] are treated as fractions; values in (1, 100] are treated as percentages.

  • filter_single_iso_genes (bool) – AnnData mode only. Whether to remove genes with fewer than two retained isoforms.

Raises:

ValueError – If input arguments are invalid or required fields are missing.

Return type:

None

test_differential_usage(method='score', print_progress=True, return_results=False)#

Parametric test for spatial isoform differential usage.

Before running this function, the design matrix must be set up using setup_data(). Each column of the design matrix corresponds to a covariate to test for differential association with the isoform usage ratios of each gene. Test statistics and p-values are computed per (gene, covariate) pair separately.

Similar to splisosm.hyptest_np.SplisosmNP.test_differential_usage(), here we also support two types of association tests but implicitly:

  • Unconditional (when model_type='glm'): test the unconditional association between isoform usage ratios and the covariate of interest (H_0: beta = 0).

  • Conditional (when model_type='glmm-full'): test for association (H_0: beta = 0) conditioned on the spatial random effect.

Parameters:
  • method (Literal['score', 'wald']) –

    Depending on whether the design matrix is used for model fitting, different methods must be used for hypothesis testing:

    • "wald" when models were fit with fit(..., with_design_mtx=True).

    • "score" when models were fit with fit(..., with_design_mtx=False).

    Caution

    The Wald statistic with GLM/GLMM is empirically anti-conserved (i.e., lots of false positives). Always use method="score" when possible.

  • print_progress (bool) – Whether to show the progress bar. Default to True.

  • return_results (bool) – Whether to return the test statistics and p-values. If False, the results are stored in self.du_test_results.

Returns:

If return_results is True, returns dict with test statistics and p-values. Otherwise, returns None and stores results in self.du_test_results.

Return type:

dict or None

test_spatial_variability(method='llr', use_perm_null=False, return_results=False, print_progress=True, n_perms_per_gene=None, **kwargs)#

Parametric test for spatial variability.

Caution

The likelihood ratio statistic is not well-calibrated for sparse data. We recommend the non-parametric HSIC-based tests in splisosm.hyptest_np.SplisosmNP for spatial variability testing.

Note that the parametric and non-parametric tests are assymptotically equivalent under the null. See [SFW+26] for detailed theoretical analysis.

Parameters:
  • method (str) – The test method. Currently only support "llr", the likelihood ratio test (H_0: sigma_sp = 0).

  • use_perm_null (bool) – Whether to generate the null distribution from permutation. If False, use the chi-square with df = n_var_components as the null.

  • return_results (bool) – Whether to return the test statistics and p-values. If False, the results will be stored in self.sv_test_results.

  • print_progress (bool) – Whether to show the progress bar for permutation. Default to True.

  • **kwargs (Any) – Additional arguments passed to self._fit_sv_llr_perm() if use_perm_null = True.

  • n_perms_per_gene (Optional[int])

  • **kwargs

Returns:

If return_results is True, returns dict with test statistics and p-values. Otherwise returns None.

Return type:

dict or None

adata = None#
du_test_results: dict#

Dictionary to store the differential usage test results after running test_differential_usage(). It contains the following keys:

  • 'method': str, the method used for the test.

  • 'statistic': numpy.ndarray of shape (n_genes, n_factors), the test statistic for each gene and covariate.

  • 'pvalue': numpy.ndarray of shape (n_genes, n_factors), the p-value for each gene and covariate.

  • 'pvalue_adj': numpy.ndarray of shape (n_genes, n_factors), the BH adjusted p-value for each gene and covariate. Each column/covariate is adjusted separately.

fitting_results: dict#

Dictionary to store lists of fitted models, with keys 'glmm-full', 'glmm-null', 'glm'.

model_configs: dict#

Dictionary of model configurations, with keys share_variance, var_parameterization_sigma_theta, var_fix_sigma, var_prior_model, var_prior_model_params, init_ratio, fitting_method, and fitting_configs.

model_type: Literal['glmm-full', 'glmm-null', 'glm']#

The type of GLM or GLMM model.

n_factors: int#

Number of covariates to test for differential usage.

n_genes: int#

Number of genes.

n_isos_per_gene: list[int]#

List of numbers of isoforms per gene.

n_spots: int#

Number of spots.

setup_input_mode = None#
sv_test_results: dict#

Dictionary to store the spatial variability test results after running test_spatial_variability(). It contains the following keys:

  • 'method': str, the method used for the test.

  • 'statistic': numpy.ndarray of shape (n_genes,), the log likelihood ratio test statistic for each gene.

  • 'df': int, the degrees of freedom for the test statistic.

  • 'use_perm_null': bool, whether the null distribution is estimated using permutation.

  • 'pvalue': numpy.ndarray of shape (n_genes,), the p-value for each gene.

  • 'pvalue_adj': numpy.ndarray of shape (n_genes,), the BH adjusted p-value for each gene.