Core Classes#
Use these top-level classes for complete SPLISOSM workflows. The recommended
imports are from splisosm import SplisosmNP, SplisosmFFT, SplisosmGLMM.
Non-parametric SPLISOSM model for arbitrary spatial geometries. |
|
FFT-accelerated SPLISOSM model for rasterized spatial isoform testing. |
|
Parametric spatial isoform statistical modeling using GLMM. |
- class splisosm.SplisosmNP(k_neighbors=4, rho=0.99, standardize_cov=True)#
Bases:
_ResultsMixin,_FeatureSummaryMixinNon-parametric SPLISOSM model for arbitrary spatial geometries.
SplisosmNPworks directly on spot- or cell-level coordinates and a sparse CAR spatial kernel. Usesetup_data()to load anAnnDataobject, then run spatial-variability (SV) or differential-usage (DU) tests.Examples
Spatial variability test:
>>> from splisosm import SplisosmNP >>> # adata : AnnData of shape (n_spots, n_isoforms) >>> # adata.layers["counts"] — raw isoform counts >>> # adata.var["gene_symbol"] — column grouping isoforms by gene >>> # adata.obsm["spatial"] — (n_spots, 2) spatial coordinates >>> model = SplisosmNP() >>> model.setup_data(adata, layer="counts", group_iso_by="gene_symbol") >>> model.test_spatial_variability(method="hsic-ir") >>> sv_results = model.get_formatted_test_results("sv")
Differential usage test:
>>> model = SplisosmNP() >>> model.setup_data( ... adata, layer="counts", group_iso_by="gene_symbol", ... design_mtx="covariate", # obs column name, or (n_spots, n_factors) array ... ) >>> model.test_differential_usage(method="hsic-gp", residualize="cov_only") >>> du_results = model.get_formatted_test_results("du")
- Parameters:
k_neighbors (int)
rho (float)
standardize_cov (bool)
- property filtered_adata: AnnData#
The filtered AnnData of shape (
n_spots, sum(n_isos_per_gene)).This is the data used internally after
setup_data(). It is a copy of the inputadata, subsetted to the retained spots and isoforms after filtering.- Raises:
RuntimeError – If
setup_data()has not been called.
- setup_data(adata, *, spatial_key='spatial', adj_key=None, layer='counts', group_iso_by='gene_symbol', gene_names=None, design_mtx=None, covariate_names=None, min_counts=10, min_bin_pct=0.0, filter_single_iso_genes=True, min_component_size=1, skip_spatial_kernel=False)#
Set up isoform-level spatial data for hypothesis testing.
This method extracts isoform count tensors, optionally filters small disconnected graph components, builds the spatial kernel, and resolves the design matrix for DU tests.
- Parameters:
adata (AnnData) – Annotated data matrix. Counts are read from
adata.layers[layer]grouped bygroup_iso_by, and spatial coordinates fromadata.obsm[spatial_key]. Seesplisosm.utils.preprocessing.prepare_inputs_from_anndata()for full preprocessing details.spatial_key (str, optional) – Key in
adata.obsmfor spatial coordinates (default"spatial"). Optional whenadj_keyis provided: if the key is missing fromadata.obsmthe spatial kernel is built from the adjacency alone and coordinate-free SV tests ("hsic-ir"/"hsic-ic"/"hsic-gc") and DU tests still run.method="spark-x"(SV) andmethod="hsic-gp"(DU) require raw coordinates and raise a clear error at call time when they are absent.adj_key (str or None, optional) – Key in
adata.obspfor a pre-built adjacency matrix. When provided, it overrides the k-NN graph construction from coordinates and is used directly to build the spatial kernel. It also makesspatial_keyoptional (see above). The adjacency matrix is symmetrized internally.layer (str, optional) – Layer in
adata.layersthat stores isoform counts (default"counts").group_iso_by (str, optional) – Column in
adata.varused to group isoforms by gene (default"gene_symbol").gene_names (str or None, optional) – Column name in
adata.varused as display names for genes. IfNone, the values ofgroup_iso_byare used.design_mtx (tensor, array, sparse matrix, DataFrame, str, or list of str, optional) –
Design matrix for differential-usage tests. Accepts an array/tensor/DataFrame of shape
(n_spots, n_factors), a single obs-column name (str), or a list of obs-column names. Categorical obs columns are one-hot encoded automatically.When a scipy sparse matrix is passed directly, it is stored as scipy CSR internally and all differential-usage methods handle it without densifying:
"hsic"uses a sparse matrix-multiply path inlinear_hsic_test();"t-fisher"and"t-tippett"extract group indices directly from the sparse non-zero structure."hsic-gp"densifies each column via_get_design_col()before GPR fitting (GPR residuals are always dense).All other input types (obs column names, array, tensor, DataFrame) are converted to a dense torch float32 tensor.
covariate_names (list of str or None, optional) – Explicit covariate names. When
design_mtxis given as column name(s) and this isNone, the column names are used automatically; otherwise auto-generated asfactor_1, etc.min_counts (int, optional) – Minimum total isoform count across spots required to retain an isoform (default 10).
min_bin_pct (float, optional) – Minimum spot prevalence required to retain an isoform. Values in
[0, 1]are fractions; values in(1, 100]are percentages. Default0.0.filter_single_iso_genes (bool, optional) – Whether to remove genes with fewer than two retained isoforms (default
True).min_component_size (int, optional) – Minimum number of spots a connected component must contain to be retained. Spots in smaller components are removed from all data structures before the spatial kernel is built. Default 1 disables filtering. A
UserWarningis issued when spots are removed.skip_spatial_kernel (bool, optional) – If
True, skip construction of the CAR spatial kernel and store anIdentityKernelplaceholder asself.sp_kernelinstead. Use this when onlytest_differential_usage()is needed;method="hsic-gp"fits its own GP for spatial residualization. Callingtest_spatial_variability()on a model set up withskip_spatial_kernel=Truewill raise aRuntimeError. DefaultFalse.
- Raises:
ValueError – If input arguments are invalid or required fields are missing.
- Return type:
None
- test_spatial_variability(method='hsic-ir', ratio_transformation='none', nan_filling='mean', null_method='liu', null_configs=None, chunk_size='auto', n_jobs=-1, return_results=False, print_progress=True)#
Test each gene for spatial variability.
The HSIC-based methods test association between a centered spatial kernel and one gene-level response at a time:
"hsic-ir": isoform usage ratios."hsic-ic": isoform counts."hsic-gc": gene-level total counts."spark-x": SPARK-X gene-count test [ZSZ21].
- Parameters:
method ({"hsic-ir", "hsic-ic", "hsic-gc", "spark-x"}, optional) – Test target:
"hsic-ir"(isoform usage ratios),"hsic-ic"(isoform counts),"hsic-gc"(gene-level counts), or"spark-x"(SPARK-X [ZSZ21]).ratio_transformation ({"none", "clr", "ilr", "alr", "radial"}, optional) – Compositional transformation applied to isoform ratios when
method="hsic-ir". Seesplisosm.utils.preprocessing.counts_to_ratios()and [PYPA22] for details.nan_filling ({"mean", "none"}, optional) – Strategy for NaN values in isoform ratios. See
splisosm.utils.preprocessing.counts_to_ratios()for details.null_method ({"liu", "welch", "perm"}, optional) –
Method for computing the null distribution of the test statistic:
"liu"(default): asymptotic chi-square mixture using cumulants with Liu’s method [LTZ09]. Exact eigenvalue cumulants are used when cheap; large implicit kernels use Hutchinson Rademacher trace estimates controlled bynull_configs["n_probes"]."welch": Welch-Satterthwaite moment matching. It uses the first two null cumulants and approximates the null by a scaled chi-square variableg * chi2(h), whereg = Var / (2 * E)andh = 2 * E**2 / Var."perm": permutation null distribution. Supports optionalnull_configs["n_perms_per_gene"](default 1000), andnull_configs["perm_batch_size"](default 50) for batched null-statistic computation.
"eig"is accepted as a deprecated alias for"liu"."clt"and"trace"are accepted as deprecated aliases for"welch".null_configs (dict or None, optional) –
Extra keyword arguments for the chosen
null_method. Fornull_method="liu", SPLISOSM evaluates Liu’s approximation from cumulants instead of materializing all pairwise eigenvalue products. Supported keys include:"n_probes": int. Estimate spatial kernel cumulants with this many Hutchinson Rademacher probes when exact eigenvalue or trace cumulants are unavailable. The same budget is used by"welch"when the spatial kernel does not expose exacttr(K)andtr(K**2)(for example implicit CAR kernels). Large implicit kernels use 60 probes by default."approx_rank": int or None. Advanced diagnostic override to use the top-k spatial eigenvalues and a rank-consistent statistic. This is normally unnecessary for SV tests because the default Liu path uses direct cumulant estimates.
chunk_size (int or {"auto"}, optional) – Maximum number of response columns to process per worker task.
"auto"(default) estimates a memory-safe column cap using a 2 GiB live-memory budget per worker and then caps the result at 32 columns for per-feature runtime. Genes are never split across chunks; a single gene with more response columns than the cap is processed as a singleton chunk. Formethod="hsic-gc", each gene contributes one response column.n_jobs (int, optional) – Number of parallel workers for the per-gene loop.
-1uses all available CPUs. Withchunk_size="auto", each worker targets roughly a 2 GiB live-memory budget. Default-1.return_results (bool, optional) – If
True, return the result dict. Otherwise store results inself._sv_test_resultsand returnNone.print_progress (bool, optional) – Whether to show a progress bar.
- Returns:
If
return_resultsisTrue, return a result dictionary with test statistics and p-values. Otherwise, store results inself._sv_test_resultsand returnNone.- Return type:
dict or None
Notes
method="spark-x"requires the R packageSPARKand Python packagerpy2.
- test_differential_usage(method='hsic-gp', ratio_transformation='none', nan_filling='mean', gpr_backend='sklearn', gpr_configs=None, residualize='cov_only', n_jobs=-1, print_progress=True, return_results=False)#
Test each gene for differential isoform usage.
Call
setup_data()withdesign_mtxbefore running this method. Each design-matrix column is tested against each gene’s isoform usage ratios, producing one statistic and p-value per(gene, covariate)pair.Two types of association tests are supported:
Unconditional (
"hsic","t-fisher","t-tippett"): test the association between isoform usage ratios and the covariate.Conditional (
"hsic-gp"): test the association conditioned on spatial coordinates by residualizing with spatial GP regression. See [ZPJScholkopf12].
- Parameters:
method ({"hsic", "hsic-gp", "t-fisher", "t-tippett"}, optional) –
Method for association testing:
"hsic": Unconditional HSIC test (multivariate RV coefficient). For continuous factors, equivalent to the multivariate Pearson correlation test. For binary factors, equivalent to the two-sample Hotelling T**2 test."hsic-gp": Conditional HSIC test. Spatial effects are removed via Gaussian process regression before computing the HSIC statistic."t-fisher","t-tippett": each isoform is tested independently for binary covariates only; p-values are combined gene-wise via Fisher’s or Tippett’s method.
ratio_transformation ({"none", "clr", "ilr", "alr", "radial"}, optional) – Compositional transformation for isoform ratios. One of
'none','clr','ilr','alr','radial'[PYPA22]. Seesplisosm.utils.preprocessing.counts_to_ratios().nan_filling ({"mean", "none"}, optional) – How to fill NaN values in isoform ratios. One of
'mean'or'none'. Seesplisosm.utils.preprocessing.counts_to_ratios().gpr_backend ({"sklearn", "gpytorch", "nufft", "finufft"}, optional) – GPR backend to use for
method='hsic-gp'. One of'sklearn'(default),'gpytorch','nufft', or'finufft'. The NUFFT aliases use FINUFFT for irregular 2-D coordinates with an implicit periodic RBF kernel. For FFT-accelerated spatial GP on regular grids useSplisosmFFTinstead.gpr_configs (dict, optional) –
Nested configuration dict for the GPR objects, with optional keys
"covariate"and/or"isoform". Each sub-dict configures the selectedgpr_backend. Unspecified keys use SPLISOSM defaults. The shared defaults includeconstant_value=1.0,constant_value_bounds=(1e-3, 1e3),length_scale=1.0, andlength_scale_bounds="fixed". Backend-irrelevant known keys are ignored."n_inducing"(int or None) controls the scale of spatial GP fitting for the dense/sparse GP backends:sklearn — maximum number of observations used for hyperparameter fitting. Full exact GP when
n_spots <= n_inducing(orNone); a randomly sub-sampled subset-of-data ofn_inducingpoints otherwise (not the same inducing-point approximation as gpytorch). Default:5000. Set toNoneto use all observations (warns whenn_spots > 10_000).gpytorch — FITC sparse-GP inducing-point approximation with
n_inducingpoints; set toNonefor exact GP. Default:5000.
nufft / finufft has additional NUFFT-specific options for the RBF-GP fitting. See
splisosm.gpr.NUFFTKernelGPRfor the full backend signature. Common options include:max_auto_modes- cap the size of the automatically inferred grid;lml_approx_rank- approximate the irregular-coordinate GP likelihoods using only the leading eigenvalues and eigensummaries. Ignored when the input grid is already regular (full spectrum available via FFT). It costsO(n_spots * lml_approx_rank)memory and uses a trace/trace(K^2)-corrected tail for the omitted spectrum. In practice, ranks around32-64often beat same-time sklearn subset fits by a wide margin; increase the rank when memory permits and hyperparameter accuracy is important.
residualize ({"cov_only", "both"}, optional) –
Controls which signals are spatially residualized when
method="hsic-gp":"cov_only"(default): residualize covariates only; testHSIC(Z_res, Y). Fastest; calibration matches"both"when covariate GPR captures most spatial confounding."both": residualize both covariates and isoform ratios.
n_jobs (int, optional) – Number of parallel workers for the per-gene loop.
-1uses all available CPUs. Each worker densifies one sparse count tensor (about 4-40 MB for 100K-1M spots by 10 isoforms). Whengpr_backend="gpytorch"anddevice != "cpu", the GPU is not thread-safe; parallelism is automatically disabled. Default-1.print_progress (bool, optional) – Whether to show the progress bar. Default to True.
return_results (bool, optional) – Whether to return the test statistics and p-values. If False, the results are stored in
self._du_test_results.
- Returns:
results – If
return_resultsisTrue, return a dictionary with test statistics and p-values. Otherwise, returnNoneand store results inself._du_test_results.- Return type:
dict or None
- get_formatted_test_results(test_type, with_gene_summary=False)#
Get formatted test results as a
pandas.DataFrame.- Parameters:
test_type ({"sv", "du"}) – Which results to retrieve:
"sv"for spatial variability or"du"for differential usage.with_gene_summary (bool, optional) – If
True, append gene-level summary statistics fromextract_feature_summary().
- Returns:
Result table. SV results contain
gene,statistic,pvalue, andpvalue_adj. DU results additionally containcovariateand include one row per gene-covariate pair.- Return type:
- Raises:
ValueError – If
test_typeis invalid or the requested test has not been run.
- extract_feature_summary(level='gene', print_progress=True)#
Compute filtered feature-level summary statistics.
Gene-level statistics are aggregated across all isoforms retained by
setup_data(). Isoform-level statistics are computed per isoform and joined to the corresponding rows ofadata.var.Results are cached: repeated calls with the same
levelreturn the cachedpandas.DataFramewithout recomputation.- Parameters:
level ({"gene", "isoform"}, optional) – Summary granularity.
"gene"returns one row per gene;"isoform"returns one row per retained isoform.print_progress (bool, optional) – Whether to show a progress bar while computing summaries.
- Returns:
Feature summary table. Gene-level summaries include
n_isos,perplexity,pct_bin_on,count_avg, andcount_std. Isoform-level summaries include the retainedadata.varcolumns plus count and usage-ratio summaries.- Return type:
- Raises:
RuntimeError – If
setup_data()has not been called.ValueError – If
levelis not"gene"or"isoform".
- class splisosm.SplisosmFFT(rho=0.99, neighbor_degree=1, spacing=(1.0, 1.0), workers=None)#
Bases:
_ResultsMixin,_FeatureSummaryMixinFFT-accelerated SPLISOSM model for rasterized spatial isoform testing.
SplisosmFFTfollows the same non-parametric workflow assplisosm.SplisosmNP, but it operates on regular raster grids stored inSpatialDataand applies spatial kernels with FFTs.Examples
Spatial variability test:
>>> from splisosm import SplisosmFFT >>> model = SplisosmFFT(rho=0.9, neighbor_degree=1) >>> model.setup_data( ... sdata=sdata, ... bins="ID_square_016um", ... table_name="square_016um", ... col_key="array_col", ... row_key="array_row", ... layer="counts", ... group_iso_by="gene_ids", ... gene_names="gene_name", ... min_counts=10, ... min_bin_pct=0.0, ... ) >>> model.test_spatial_variability(method="hsic-ir") >>> sv_results = model.get_formatted_test_results(test_type="sv")
Differential usage test:
>>> model = SplisosmFFT(rho=0.9, neighbor_degree=1) >>> model.setup_data( ... sdata=sdata, ... bins="ID_square_016um", ... table_name="square_016um_svp", ... design_mtx="square_016um_rbp_sve", ... col_key="array_col", ... row_key="array_row", ... layer="counts", ... group_iso_by="gene_ids", ... gene_names="gene_name", ... min_counts=10, ... min_bin_pct=0.0, ... ) >>> model.test_differential_usage(method="hsic-gp", residualize="cov_only") >>> du_results = model.get_formatted_test_results("du")
- Parameters:
rho (float)
neighbor_degree (int)
spacing (tuple[float, float])
workers (int | None)
- setup_data(sdata, bins, table_name, col_key, row_key, layer='counts', group_iso_by='gene_symbol', gene_names=None, min_counts=10, min_bin_pct=0.0, filter_single_iso_genes=True, design_mtx=None, covariate_names=None)#
Set up SpatialData-backed isoform data for FFT-based testing.
bins,table_name,col_key, androw_keyare passed tospatialdata.rasterize_bins()to rasterize isoform counts.- Parameters:
sdata (SpatialData) – SpatialData-like object with
tablesmapping.bins (str) – Name of the SpatialData bin geometry for rasterization.
table_name (str) – Key of the table in
sdata.tables.col_key (str) – Column-coordinate key in
adata.obsfor rasterization.row_key (str) – Row-coordinate key in
adata.obsfor rasterization.layer (str, optional) – AnnData layer that stores isoform count matrix.
group_iso_by (str, optional) – Column in
adata.varused to group isoforms by gene. The unique values of this column define the gene-level groups.gene_names (str or None, optional) – Optional column name in
adata.varwhose values are used as display gene names in results. IfNone, the values ofgroup_iso_byare used directly.min_counts (int, optional) – Minimum total count (summed across all spots) required for an isoform to be retained. Isoforms below this threshold are excluded before gene grouping. Genes with fewer than two remaining isoforms after filtering are also excluded.
min_bin_pct (float, optional) – Minimum percentage of bins in which an isoform must be expressed (count greater than zero) to be retained. Values in
[0, 1]are interpreted as fractions of bins, and values in(1, 100]are interpreted as percentages.filter_single_iso_genes (bool, optional) – If
True(default), genes with fewer than two isoforms passing QC filters are removed — they cannot contribute to within-gene ratio tests. Set toFalseto keep single-isoform genes, e.g. when testing gene-level expression variability withtest_spatial_variability(method="hsic-gc").design_mtx (str, list[str], array, sparse matrix, DataFrame, or None) –
Design matrix specification. Three input modes:
Table name (
strmatching a key insdata.tables): Use the existing AnnData table’sXas the design matrix. Must have the same number of observations as the isoform table.Obs column names (
strorlist[str]not matching a table): Extract the named columns from the isoform table’sadata.obs. Categorical columns are one-hot encoded automatically.Pre-computed matrix (array, sparse matrix, or DataFrame of shape
(n_spots, n_factors)): Used as-is.
In cases 2 and 3, the design matrix is stored as a new AnnData table inside
sdataand rasterized whentest_differential_usage()is called.covariate_names (list[str] or None, optional) – Factor names. These override inferred names. If
None, names are inferred fromdesign_mtxwhen possible; otherwise they are generated as["factor_0", ...].
- Raises:
ValueError – If required table/layer/metadata is missing.
- Return type:
None
See also
splisosm.SplisosmNP.setup_data()AnnData-based setup for data with general geometry.
- test_spatial_variability(method='hsic-ir', ratio_transformation='none', chunk_size='auto', n_jobs=-1, return_results=False, print_progress=True)#
Test each gene for spatial variability using FFT-accelerated HSIC.
- Parameters:
method ({"hsic-ir", "hsic-ic", "hsic-gc"}, optional) – One of
"hsic-ir"(isoform ratios),"hsic-ic"(isoform counts), or"hsic-gc"(gene counts).ratio_transformation ({"none", "clr", "ilr", "alr", "radial"}, optional) – Ratio transform used when
method="hsic-ir".chunk_size (int or {"auto"}, optional) – Maximum number of response channels to process in one FFT kernel call.
"auto"(default) estimates a memory-safe cap using a 2 GiB live-memory budget per worker and caps the result at 32 channels for per-feature runtime. Genes are never split across chunks; a single gene with more channels than the cap is processed as a singleton chunk.n_jobs (int, optional) – Number of joblib workers.
-1uses all available CPUs.return_results (bool, optional) – If
True, return the result dictionary.print_progress (bool, optional) – Whether to show a progress bar.
- Returns:
Result dictionary when
return_results=True; otherwiseNone.- Return type:
dict or None
See also
splisosm.SplisosmNP.test_spatial_variability()Non-FFT implementation for arbitrary spatial geometries.
- test_differential_usage(method='hsic-gp', ratio_transformation='none', gpr_configs=None, residualize='cov_only', n_jobs=-1, return_results=False, print_progress=True)#
Test each gene for differential isoform usage on a raster grid.
Call
setup_data()withdesign_mtxbefore running this method. Each design-matrix column is tested against each gene’s isoform usage ratios, producing one statistic and p-value per(gene, covariate)pair.Four test strategies are supported, all operating on rasterized grid data to avoid densifying the full isoform or covariate matrix in memory:
"hsic-gp"(default): spatially residualize covariates (and optionally isoform ratios) withsplisosm.gpr.FFTKernelGPR, then compute linear HSIC. Controlled byresidualize."hsic": linear HSIC between raw centered isoform ratios and raw centered covariates, with no spatial residualization."t-fisher": per-isoform two-sample t-tests (binary covariates only) combined by Fisher’s method (chi-square withdf = 2 * n_isoforms)."t-tippett": per-isoform two-sample t-tests (binary covariates only) combined by Tippett’s corrected minimum p-value.
Regardless of method, covariates are processed in chunks of at most 100 at a time and isoform data is loaded on-the-fly per gene so that neither the full covariate grid nor the full isoform matrix is held in memory simultaneously.
- Parameters:
method ({"hsic", "hsic-gp", "t-fisher", "t-tippett"}, optional) –
Method for association testing:
"hsic": Unconditional HSIC test (multivariate RV coefficient). For continuous factors, equivalent to the multivariate Pearson correlation test. For binary factors, equivalent to the two-sample Hotelling T**2 test."hsic-gp": Conditional HSIC test. Spatial effects are removed via Gaussian process regression before computing the HSIC statistic."t-fisher","t-tippett": two-sample t-test per isoform (binary covariates only; exactly two distinct non-NaN values required); p-values are combined gene-wise via Fisher’s chi-squared or Tippett’s corrected minimum method.
ratio_transformation ({"none", "clr", "ilr", "alr", "radial"}, optional) – Compositional transformation for isoform ratios. One of
'none','clr','ilr','alr','radial'[PYPA22]. Seesplisosm.utils.preprocessing.counts_to_ratios().gpr_configs (dict, optional) – Nested configuration dict for the GPR objects, with optional keys
'covariate'and/or'isoform'. Each sub-dict is forwarded to the FFT GPR path. Unspecified keys use the common GP defaults from SPLISOSM’s backend configuration:constant_value,constant_value_bounds,length_scale, andlength_scale_bounds. Large-n sklearn/gpytorch and NUFFT-only keys in the shared defaults are ignored by this FFT path. Seesplisosm.gpr.FFTKernelGPRfor backend-specific options.residualize ({"cov_only", "both"}, optional) –
Controls which signals are spatially residualized when
method="hsic-gp":"cov_only"(default): residualize covariates only; testHSIC(Z_res, Y). Fastest; calibration matches"both"when covariate GPR captures most spatial confounding."both": residualize both covariates and isoform ratios.
n_jobs (int, optional) – Number of parallel jobs.
-1uses all available CPUs.print_progress (bool, optional) – Whether to show the progress bar. Default
True.return_results (bool, optional) – If
True, return the test statistics and p-values. Otherwise, store results inself._du_test_results.
- Returns:
results – If
return_resultsisTrue, return a dictionary with test statistics and p-values. Otherwise, returnNoneand store results inself._du_test_results.- Return type:
dict or None
- Raises:
RuntimeError – If
setup_data()or thedesign_mtxargument has not been set.ValueError – If
method,residualize, orratio_transformationis invalid.
See also
splisosm.SplisosmNP.test_differential_usage()Non-FFT implementation for arbitrary spatial geometries.
- get_formatted_test_results(test_type, with_gene_summary=False)#
Get formatted test results as a
pandas.DataFrame.- Parameters:
test_type ({"sv", "du"}) – Which results to retrieve:
"sv"for spatial variability or"du"for differential usage.with_gene_summary (bool, optional) – If
True, append gene-level summary statistics fromextract_feature_summary().
- Returns:
Result table. SV results contain
gene,statistic,pvalue, andpvalue_adj. DU results additionally containcovariateand include one row per gene-covariate pair.- Return type:
- Raises:
ValueError – If
test_typeis invalid or the requested test has not been run.
- extract_feature_summary(level='gene', print_progress=True)#
Compute filtered feature-level summary statistics.
Gene-level statistics are aggregated across all isoforms retained by
setup_data(). Isoform-level statistics are computed per isoform and joined to the corresponding rows ofadata.var.Results are cached: repeated calls with the same
levelreturn the cachedpandas.DataFramewithout recomputation.- Parameters:
level ({"gene", "isoform"}, optional) – Summary granularity.
"gene"returns one row per gene;"isoform"returns one row per retained isoform.print_progress (bool, optional) – Whether to show a progress bar while computing summaries.
- Returns:
Feature summary table. Gene-level summaries include
n_isos,perplexity,pct_bin_on,count_avg, andcount_std. Isoform-level summaries include the retainedadata.varcolumns plus count and usage-ratio summaries.- Return type:
- Raises:
RuntimeError – If
setup_data()has not been called.ValueError – If
levelis not"gene"or"isoform".
- class splisosm.SplisosmGLMM(model_type='glmm-full', share_variance=True, var_fix_sigma=True, var_prior_model='none', var_prior_model_params=None, init_ratio='uniform', fitting_method='joint_gd', fitting_configs=None, k_neighbors=4, rho=0.99, approx_rank=<object object>, device='cpu')#
Bases:
_ResultsMixin,_FeatureSummaryMixinParametric spatial isoform statistical modeling using GLMM.
This is a convenience class that wraps around the
splisosm.glmm.MultinomGLMMfor batched model fitting and spatial variability and differential usage testing.Examples
Prepare an AnnData object and fit a GLMM:
>>> from splisosm import SplisosmGLMM >>> # adata : AnnData of shape (n_spots, n_isoforms) >>> # adata.layers["counts"] — raw isoform counts >>> # adata.var["gene_symbol"] — column grouping isoforms by gene >>> # adata.obsm["spatial"] — (n_spots, 2) spatial coordinates >>> # adata.obs["covariate"] — optional covariate column for DU testing >>> model = SplisosmGLMM( ... model_type="glmm-full", # 'glmm-full' | 'glmm-null' | 'glm' ... fitting_method="joint_gd", # 'joint_gd' | 'joint_newton' | 'marginal_gd' | 'marginal_newton' ... device="cpu", # 'cpu' | 'cuda' (NVIDIA) | 'mps' (Apple Silicon) ... # approx_rank: omit for auto | None (force full rank) | int (fixed low-rank) ... ) >>> model.setup_data( ... adata, ... layer="counts", ... group_iso_by="gene_symbol", ... group_gene_by_n_iso=True, # required for batch_size > 1 in fit() ... design_mtx="covariate", # obs column name, or (n_spots, n_factors) array ... ) >>> model.fit( ... n_jobs=1, batch_size=20, ... with_design_mtx=False, # False → score test (recommended for DU) ... ) >>> fitted_models = model.get_fitted_models()
Differential usage test:
>>> model.test_differential_usage(method="score") # or "wald" if with_design_mtx=True >>> du_results = model.get_formatted_test_results("du") >>> print(du_results.head())
- Parameters:
model_type (Literal['glmm-full', 'glmm-null', 'glm'])
share_variance (bool)
var_fix_sigma (bool)
var_prior_model (str)
var_prior_model_params (dict | None)
init_ratio (str)
fitting_method (str)
fitting_configs (dict | None)
k_neighbors (int)
rho (float)
device (Literal['cpu', 'cuda', 'mps'])
- property filtered_adata: AnnData#
The filtered AnnData of shape (
n_spots, sum(n_isos_per_gene)).This is the data used internally after
setup_data(). It is a copy of the inputadata, subsetted to the retained spots and isoforms after filtering.- Raises:
RuntimeError – If
setup_data()has not been called.
- setup_data(adata, *, spatial_key='spatial', adj_key=None, layer='counts', group_iso_by='gene_symbol', gene_names=None, group_gene_by_n_iso=False, design_mtx=None, covariate_names=None, min_counts=10, min_bin_pct=0.0, min_component_size=1)#
Setup the data for the GLMM model.
- Parameters:
adata (AnnData) – Annotated data matrix. Counts are read from
adata.layers[layer]grouped bygroup_iso_by, and spatial coordinates fromadata.obsm[spatial_key].spatial_key (str, optional) – Key in
adata.obsmfor spatial coordinates (default"spatial"). Optional whenadj_keyis provided: if the key is missing fromadata.obsmthe spatial kernel is built from the adjacency alone. SplisosmGLMM does not require raw coordinates past kernel construction.adj_key (str or None, optional) – Key in
adata.obspfor a pre-built adjacency matrix. When provided, it overrides the k-NN graph construction from coordinates and be used directly to build the spatial kernel. Also makesspatial_keyoptional (see above). The adjacency matrix is symmetrized internally.layer (str, optional) – Layer in
adata.layersstoring isoform counts (default"counts").group_iso_by (str, optional) – Column in
adata.varused to group isoforms by gene (default"gene_symbol").gene_names (str or None, optional) – Column name in
adata.varused as display names for genes. IfNone, the values ofgroup_iso_byare used.group_gene_by_n_iso (bool, optional) – If
True, genes are sorted and grouped by their isoform count before batching. Required forbatch_size > 1infit()because the GLMM model expects a fixed isoform count per batch (defaultFalse).design_mtx (tensor, array, DataFrame, str, or list of str, optional) – Design matrix for fixed effects. Accepts an array/tensor/DataFrame of shape
(n_spots, n_factors), a single obs-column name (str), or a list of obs-column names.covariate_names (list of str or None, optional) – Explicit covariate names. When not provided, names are inferred from column names or auto-generated.
min_counts (int, optional) – Minimum total isoform count across spots to retain an isoform (default 10).
min_bin_pct (float, optional) – Minimum fraction/percentage of spots where an isoform must be expressed (default 0.0).
min_component_size (int, optional) – Minimum number of spots a connected component must contain to be retained. Spots in smaller components are removed from all data structures before the spatial kernel is built. Default 1 disables filtering. A
UserWarningis issued when spots are removed.
- Raises:
ValueError – If input arguments are invalid or required fields are missing.
- Return type:
None
- fit(n_jobs=1, batch_size=1, quiet=True, print_progress=True, with_design_mtx=False, from_null=False, refit_null=True, random_seed=None)#
Model fitting.
- Parameters:
n_jobs (int, optional) – The number of cores to use for parallel fitting. Default to 1.
batch_size (int, optional) – The maximum number of genes per job to fit in parallel. Default to 1.
quiet (bool, optional) – Whether to suppress the fitting logs. Default to True.
print_progress (bool, optional) – Whether to show the progress bar. Default to True.
with_design_mtx (bool, optional) – Whether to include the design matrix for the fixed effects. Default to False.
from_null (bool, optional) – Whether to initialize the full model from a null model with zero spatial variability (white-noise random effect).
refit_null (bool, optional) – Whether to refit the null model after fitting the full model. Only applicable when
from_null=True.random_seed (int or None, optional) – The random seed for reproducibility. Default to None.
- Return type:
None
See also
- test_spatial_variability(method='llr', use_perm_null=False, return_results=False, print_progress=True, n_perms_per_gene=None, **kwargs)#
Parametric test for spatial variability.
Caution
The likelihood ratio statistic is not well-calibrated for sparse data. We recommend the non-parametric HSIC-based tests in
splisosm.hyptest.np.SplisosmNPfor spatial variability testing.Note that the parametric and non-parametric tests are assymptotically equivalent under the null. See [SFW+26] for detailed theoretical analysis.
- Parameters:
method ({"llr"}, optional) – The test method. Currently only support
"llr", the likelihood ratio test (H_0:theta= 0, i.e. no spatial variance).use_perm_null (bool, optional) – Whether to generate the null distribution from permutation. If False, use the chi-square with df = n_var_components as the null.
return_results (bool, optional) – Whether to return the test statistics and p-values. If False, the results will be stored in
self._sv_test_results.print_progress (bool, optional) – Whether to show the progress bar for permutation. Default to True.
n_perms_per_gene (int or None, optional) – Number of permutations per gene when
use_perm_null=True. Default to 20.**kwargs (Any) – Additional arguments passed to
self._fit_sv_llr_perm()ifuse_perm_null = True.
- Returns:
If
return_resultsis True, returns dict with test statistics and p-values. Otherwise returns None.- Return type:
dict or None
- test_differential_usage(method='score', print_progress=True, return_results=False)#
Parametric test for spatial isoform differential usage.
Before running this function, the design matrix must be set up using
setup_data(). Each column of the design matrix corresponds to a covariate to test for differential association with the isoform usage ratios of each gene. Test statistics and p-values are computed per (gene, covariate) pair separately.Similar to
splisosm.SplisosmNP.test_differential_usage(), here we also support two types of association tests but implicitly:Unconditional (when
model_type='glm'): test the unconditional association between isoform usage ratios and the covariate of interest (H_0:beta= 0).Conditional (when
model_type='glmm-full'): test for association (H_0:beta= 0) conditioned on the spatial random effect.
- Parameters:
method ({"score", "wald"}, optional) –
Depending on whether the design matrix is used for model fitting, different methods must be used for hypothesis testing:
"wald"when models were fit withfit(..., with_design_mtx=True)."score"when models were fit withfit(..., with_design_mtx=False).
Caution
The Wald statistic with GLM/GLMM is empirically anti-conserved (i.e., lots of false positives). Always use
method="score"when possible.print_progress (bool, optional) – Whether to show the progress bar. Default to True.
return_results (bool, optional) – Whether to return the test statistics and p-values. If False, the results are stored in
self._du_test_results.
- Returns:
If
return_resultsis True, returns dict with test statistics and p-values. Otherwise, returns None and stores results in self._du_test_results.- Return type:
dict or None
- get_formatted_test_results(test_type, with_gene_summary=False)#
Get formatted test results as a
pandas.DataFrame.- Parameters:
test_type ({"sv", "du"}) – Which results to retrieve:
"sv"for spatial variability or"du"for differential usage.with_gene_summary (bool, optional) – If
True, append gene-level summary statistics fromextract_feature_summary().
- Returns:
Result table. SV results contain
gene,statistic,pvalue, andpvalue_adj. DU results additionally containcovariateand include one row per gene-covariate pair.- Return type:
- Raises:
ValueError – If
test_typeis invalid or the requested test has not been run.
- extract_feature_summary(level='gene', print_progress=True)#
Compute filtered feature-level summary statistics.
Gene-level statistics are aggregated across all isoforms retained by
setup_data(). Isoform-level statistics are computed per isoform and joined to the corresponding rows ofadata.var.Results are cached: repeated calls with the same
levelreturn the cachedpandas.DataFramewithout recomputation.- Parameters:
level ({"gene", "isoform"}, optional) – Summary granularity.
"gene"returns one row per gene;"isoform"returns one row per retained isoform.print_progress (bool, optional) – Whether to show a progress bar while computing summaries.
- Returns:
Feature summary table. Gene-level summaries include
n_isos,perplexity,pct_bin_on,count_avg, andcount_std. Isoform-level summaries include the retainedadata.varcolumns plus count and usage-ratio summaries.- Return type:
- Raises:
RuntimeError – If
setup_data()has not been called.ValueError – If
levelis not"gene"or"isoform".
- get_fitted_models()#
Get the fitted models after running fit().
Each call reconstructs the full model objects from the stored lightweight state. The returned list is a fresh reconstruction; mutations to the returned models are not persisted.
- Returns:
models –
model_type='glmm-full': list[splisosm.hyptest.glmm.IsoFullModel]model_type='glmm-null': list[splisosm.hyptest.glmm.IsoNullNoSpVar]model_type='glm': list[splisosm.glmm.MultinomGLM]
- Return type:
list of fitted models
- get_gene_model(gene_name)#
Return the fitted model for a single gene by display name.
Each call reconstructs a fresh model from stored state.
- Parameters:
gene_name (str) – Display name of the gene (must appear in
gene_names).- Returns:
Fitted
MultinomGLM,IsoFullModel, orIsoNullNoSpVarinstancecorresponding to
gene_name.
- Raises:
RuntimeError – If
fit()has not been called yet.KeyError – If
gene_nameis not found ingene_names.
- Return type:
Any
- get_training_summary()#
Return a per-gene training summary as a
pandas.DataFrame.The DataFrame is indexed by gene name and contains the following columns:
'model_type': str. Model type ('glmm-full','glmm-null', or'glm').'converged': bool. Whether the gene converged during training.'best_loss': float. Best (lowest) negative log-likelihood achieved.'best_epoch': int. Epoch at which the best loss was recorded.'fitting_time_s': float. Wall-clock seconds spent fitting this gene.
Only available after calling
fit().- Raises:
RuntimeError – If
fit()has not been called yet.- Return type:
DataFrame
- get_fitted_ratios_anndata(layer_name='fitted_ratios')#
Return a copy of the filtered AnnData with fitted isoform ratios as a new layer.
For each gene the per-spot softmax isoform ratios are taken from the fitted model (
model.get_isoform_ratio()) and placed into the corresponding isoform columns of a(n_spots, n_filtered_isoforms)dense matrix, which is stored underlayer_name.Isoform ordering guarantee: the columns of the new layer follow the exact row order of
self._filtered_adata.var. This holds becausesetup_data()builds per-gene count tensors by slicingfiltered_adataisoforms in theirvarrow order (viagroupby(sort=False)), andget_isoform_ratio()returns ratios in the same column order as the input counts.- Parameters:
layer_name (str) – Name of the new layer. Defaults to
'fitted_ratios'.- Returns:
A copy of
self._filtered_adata(shape(n_spots, n_filtered_isoforms)) withlayer_nameadded. ThevarDataFrame is identical toself._filtered_adata.var, so isoform metadata (e.g. the gene-grouping column) is intact.- Return type:
AnnData
- Raises:
RuntimeError – If
fit()has not been called, or ifsetup_data()was not called with anAnnDataobject (i.e. raw tensors were provided instead).
- save(path)#
Save the fitted model to a file.
Only lightweight per-gene
_FittedGeneStateobjects (fitted parameters + convergence metadata) are stored – no fullnn.Moduleobjects. The raw.adatais not saved; only_filtered_adatais persisted. After loading,.adatawill beNone.- Parameters:
path (str) – The path to save the fitted models.
- Return type:
None
- static load(path, map_location=None)#
Load a
SplisosmGLMMmodel previously saved withsave().- Parameters:
path (str) – Path to the file written by
save().map_location (str | None) – Optional device string (e.g.
'cpu','cuda:0') to remap tensor storage when loading a model that was saved on a different device. Defaults toNone(preserves original device mapping).
- Returns:
The fully restored model, including fitted parameters, test results, and all metadata.
- Return type:
Examples
>>> model.save("model.pt") >>> loaded = SplisosmGLMM.load("model.pt") >>> loaded.get_training_summary()