splisosm.hyptest_glmm#
GLMM-based hypothesis tests for spatial isoform usage.
Classes#
Parametric spatial isoform statistical modeling using GLMM. |
Module Contents#
- class splisosm.hyptest_glmm.SplisosmGLMM(model_type='glmm-full', share_variance=True, var_parameterization_sigma_theta=True, var_fix_sigma=False, var_prior_model='none', var_prior_model_params={}, init_ratio='observed', fitting_method='joint_gd', fitting_configs={'max_epochs': -1})#
Parametric spatial isoform statistical modeling using GLMM.
This is a convenience class that wraps around the
splisosm.model.MultinomGLMMfor batched model fitting and spatial variability and differential usage testing.Examples
Setup data:
>>> from splisosm import SplisosmGLMM >>> import torch >>> # Simulate data for 10 genes with different number of isoforms >>> data_3_iso = [torch.randint(low=0, high=5, size=(100, 3)) for _ in range(5)] # 5 genes with 3 isoforms >>> data_4_iso = [torch.randint(low=0, high=5, size=(100, 4)) for _ in range(5)] # 5 genes with 4 isoforms >>> data = data_3_iso + data_4_iso >>> coordinates = torch.rand(100, 2) # 100 spots with 2D coordinates >>> design_mtx = torch.rand(100, 2) # 100 spots with 2 covariates
Model fitting:
>>> model = SplisosmGLMM(model_type='glmm-full') >>> model.setup_data(data, coordinates, design_mtx=design_mtx, group_gene_by_n_iso=True) >>> model.fit(n_jobs=1, batch_size=5, with_design_mtx=False) >>> fitted_models = model.get_fitted_models() >>> print(fitted_models[0]) # print the fitted model for the first gene
Differential usage test:
>>> model.test_differential_usage(method='score') >>> du_results = model.get_formatted_test_results('du') >>> print(du_results.head())
- Parameters:
model_type (Literal['glmm-full', 'glmm-null', 'glm']) – Which model to fit. Can be one of
'glmm-full'(Multinomial GLMM with spatial random effects),'glmm-null'(Multinomial GLMM with white noise),'glm'(Multinomial GLM).share_variance (bool) – Whether to share the variance component across isoforms.
var_parameterization_sigma_theta (bool) – Whether to parameterize the variance components as (
sigma,theta_logit) or (sigma_sp,sigma_nsp). If True, the variance components will be (sigma,theta_logit), wheresigmais the total variance andtheta_logitis the logit of the spatial variance proportion. If False, the variance components will be (sigma_sp,sigma_nsp), wheresigma_spis the spatial variance andsigma_nspis the non-spatial variance.var_fix_sigma (bool) – Whether to fix the total variance (
sigma) or not. If True, the total variance will be fixed to the initial value, which is the average per-spot variance of isoform counts normalized by its mean expression. SeeMultinomGLMM._initialize_paramsfor details.var_prior_model (str) – The prior model on the total variance
sigma. Default is'none'with no prior. Other options are'gamma'(Gamma prior) and'inv_gamma'(Inverse Gamma prior).var_prior_model_params (dict) – The parameters for the prior model on the total variance
sigma. For'gamma', the default parameters are{'alpha': 2.0, 'beta': 0.3}. For'inv_gamma', the default parameters are{'alpha': 3, 'beta': 0.5}.init_ratio (str) – The initialization method for the logit isoform usage ratio. Options are
'observed'(initialize using observed counts) and'uniform'(equal isoform usage across space).fitting_method (str) – The fitting method to use when
model_type='glmm-full'or'glmm-null'. Options are'joint_gd'(joint likelihood with gradient descent),'joint_newton'(joint likelihood with Newton’s method),'marginal_gd'(marginal likelihood with gradient descent), and'marginal_newton'(marginal likelihood with Newton’s method).fitting_configs (dict) –
A dictionary of fitting configurations with the following keys:
'lr': float, learning rate'optim': str, optimization method (one of'adam','sgd', or'lbfgs')'tol': float, tolerance for convergence'max_epochs': int, maximum number of epochs'patience': int, number of epochs to wait for improvement before stopping'update_nu_every_k': int, number of iterations to updatenuwhen usingfitting_method='marginal_newton'
See also
- fit(n_jobs=1, batch_size=1, quiet=True, print_progress=True, with_design_mtx=False, from_null=False, refit_null=True, random_seed=None)#
Model fitting.
- Parameters:
n_jobs (int) – The number of cores to use for parallel fitting. Default to 1.
batch_size (int) – The maximum number of genes per job to fit in parallel. Default to 1.
quiet (bool) – Whether to suppress the fitting logs. Default to True.
print_progress (bool) – Whether to show the progress bar. Default to True.
with_design_mtx (bool) – Whether to include the design matrix for the fixed effects. Default to False.
from_null (bool) – Whether to initialize the full model from a null model with zero spatial variability (white-noise random effect).
refit_null (bool) – Whether to refit the null model after fitting the full model.
random_seed (Optional[int]) – The random seed for reproducibility. Default to None.
- Return type:
None
See also
- get_fitted_models()#
Get the fitted models after running fit().
- Returns:
models –
model_type='glmm-full': list[splisosm.hyptest_glmm.IsoFullModel]model_type='glmm-null': list[splisosm.hyptest_glmm.IsoNullNoSpVar]model_type='glm': list[splisosm.model.MultinomGLM]
- Return type:
list of fitted models
- get_formatted_test_results(test_type)#
Get the formatted test results as data frame.
- Parameters:
test_type (Literal['sv', 'du']) – Type of test results to retrieve. Can be one of
'sv'(spatial variability) or'du'(differential usage).- Returns:
Formatted test results.
- Return type:
- save(path)#
Save the fitted models to a file.
This function is designed to overcome a limitation of
torch.save(), which naively savesn_genescopies of the spatial kernel matrixself.corr_sp. The kernel matrix is reconstructed per fitted model usingself._corr_sp_eigvalsandself._corr_sp_eigvecs.- Parameters:
path (str) – The path to save the fitted models.
- Return type:
None
- setup_data(data=None, coordinates=None, design_mtx=None, gene_names=None, group_gene_by_n_iso=False, covariate_names=None, *, adata=None, spatial_key='spatial', layer='counts', group_iso_by='gene_symbol', min_counts=10, min_bin_pct=0.0, filter_single_iso_genes=True)#
Setup the data for the model.
This method supports two input modes for backward compatibility.
Legacy mode: pass
dataandcoordinatesdirectly.AnnData mode: pass
adata, where counts are extracted fromadata.layers[layer]grouped bygroup_iso_by, and coordinates are read fromadata.obsm[spatial_key]. Seesplisosm.utils.prepare_inputs_from_anndata()for details.
- Parameters:
data (Optional[list[Union[Tensor, ndarray]]]) – Legacy mode only. List of tensors/arrays with shape
(n_spots, n_isos)containing isoform counts for each gene.coordinates (Optional[Union[ndarray, Tensor, DataFrame]]) – Legacy mode only. Shape
(n_spots, 2), the spatial coordinates.design_mtx (Optional[Union[ndarray, Tensor, DataFrame, str, list[str]]]) –
Design matrix for fixed effects.
Legacy mode: tensor/array/dataframe of shape
(n_spots, n_factors).AnnData mode: tensor/array/dataframe, or one obs-column name (str), or a list of obs-column names.
gene_names (Optional[Union[list[str], str]]) –
Gene names.
Legacy mode: list of gene names.
AnnData mode: optional column name in
adata.varused as display names for grouped genes; if None, use grouped gene IDs.
group_gene_by_n_iso (bool) – Whether to group genes by the number of isoforms.
covariate_names (Optional[list[str]]) – List of covariate names.
adata (Optional[AnnData]) – AnnData object used in the new input mode.
spatial_key (str) – Key in
adata.obsmfor spatial coordinates.layer (str) – Counts layer in
adata.layers.group_iso_by (str) – Column in
adata.varused to group isoforms by gene.min_counts (int) – Minimum total isoform count across spots required to retain an isoform in AnnData mode.
min_bin_pct (float) – Minimum percentage/fraction of spots where an isoform is expressed in AnnData mode. Values in
[0, 1]are treated as fractions; values in(1, 100]are treated as percentages.filter_single_iso_genes (bool) – AnnData mode only. Whether to remove genes with fewer than two retained isoforms.
- Raises:
ValueError – If input arguments are invalid or required fields are missing.
- Return type:
None
- test_differential_usage(method='score', print_progress=True, return_results=False)#
Parametric test for spatial isoform differential usage.
Before running this function, the design matrix must be set up using
setup_data(). Each column of the design matrix corresponds to a covariate to test for differential association with the isoform usage ratios of each gene. Test statistics and p-values are computed per (gene, covariate) pair separately.Similar to
splisosm.hyptest_np.SplisosmNP.test_differential_usage(), here we also support two types of association tests but implicitly:Unconditional (when
model_type='glm'): test the unconditional association between isoform usage ratios and the covariate of interest (H_0:beta= 0).Conditional (when
model_type='glmm-full'): test for association (H_0:beta= 0) conditioned on the spatial random effect.
- Parameters:
method (Literal['score', 'wald']) –
Depending on whether the design matrix is used for model fitting, different methods must be used for hypothesis testing:
"wald"when models were fit withfit(..., with_design_mtx=True)."score"when models were fit withfit(..., with_design_mtx=False).
Caution
The Wald statistic with GLM/GLMM is empirically anti-conserved (i.e., lots of false positives). Always use
method="score"when possible.print_progress (bool) – Whether to show the progress bar. Default to True.
return_results (bool) – Whether to return the test statistics and p-values. If False, the results are stored in
self.du_test_results.
- Returns:
If
return_resultsis True, returns dict with test statistics and p-values. Otherwise, returns None and stores results in self.du_test_results.- Return type:
dict or None
- test_spatial_variability(method='llr', use_perm_null=False, return_results=False, print_progress=True, n_perms_per_gene=None, **kwargs)#
Parametric test for spatial variability.
Caution
The likelihood ratio statistic is not well-calibrated for sparse data. We recommend the non-parametric HSIC-based tests in
splisosm.hyptest_np.SplisosmNPfor spatial variability testing.Note that the parametric and non-parametric tests are assymptotically equivalent under the null. See [SFW+26] for detailed theoretical analysis.
- Parameters:
method (str) – The test method. Currently only support
"llr", the likelihood ratio test (H_0:sigma_sp= 0).use_perm_null (bool) – Whether to generate the null distribution from permutation. If False, use the chi-square with df = n_var_components as the null.
return_results (bool) – Whether to return the test statistics and p-values. If False, the results will be stored in
self.sv_test_results.print_progress (bool) – Whether to show the progress bar for permutation. Default to True.
**kwargs (Any) – Additional arguments passed to
self._fit_sv_llr_perm()ifuse_perm_null = True.n_perms_per_gene (Optional[int])
**kwargs
- Returns:
If
return_resultsis True, returns dict with test statistics and p-values. Otherwise returns None.- Return type:
dict or None
- adata = None#
- du_test_results: dict#
Dictionary to store the differential usage test results after running test_differential_usage(). It contains the following keys:
'method': str, the method used for the test.'statistic': numpy.ndarray of shape (n_genes, n_factors), the test statistic for each gene and covariate.'pvalue': numpy.ndarray of shape (n_genes, n_factors), the p-value for each gene and covariate.'pvalue_adj': numpy.ndarray of shape (n_genes, n_factors), the BH adjusted p-value for each gene and covariate. Each column/covariate is adjusted separately.
- fitting_results: dict#
Dictionary to store lists of fitted models, with keys
'glmm-full','glmm-null','glm'.
- model_configs: dict#
Dictionary of model configurations, with keys
share_variance,var_parameterization_sigma_theta,var_fix_sigma,var_prior_model,var_prior_model_params,init_ratio,fitting_method, andfitting_configs.
- model_type: Literal['glmm-full', 'glmm-null', 'glm']#
The type of GLM or GLMM model.
- n_factors: int#
Number of covariates to test for differential usage.
- n_genes: int#
Number of genes.
- n_isos_per_gene: list[int]#
List of numbers of isoforms per gene.
- n_spots: int#
Number of spots.
- setup_input_mode = None#
- sv_test_results: dict#
Dictionary to store the spatial variability test results after running test_spatial_variability(). It contains the following keys:
'method': str, the method used for the test.'statistic': numpy.ndarray of shape (n_genes,), the log likelihood ratio test statistic for each gene.'df': int, the degrees of freedom for the test statistic.'use_perm_null': bool, whether the null distribution is estimated using permutation.'pvalue': numpy.ndarray of shape (n_genes,), the p-value for each gene.'pvalue_adj': numpy.ndarray of shape (n_genes,), the BH adjusted p-value for each gene.