splisosm.hyptest_np

Non-parametric hypothesis tests for spatial isoform usage.

Classes#

SplisosmNP

Non-parametric spatial isoform statistical model.

Module Contents#

class splisosm.hyptest_np.SplisosmNP(k_neighbors=4, rho=0.99, standardize_cov=True)#

Non-parametric spatial isoform statistical model.

Examples

Spatial variability test:

>>> from splisosm import SplisosmNP
>>> # adata : AnnData of shape (n_spots, n_isoforms)
>>> #   adata.layers["counts"]    — raw isoform counts
>>> #   adata.var["gene_symbol"]  — column grouping isoforms by gene
>>> #   adata.obsm["spatial"]     — (n_spots, 2) spatial coordinates
>>> model = SplisosmNP()
>>> model.setup_data(adata, layer="counts", group_iso_by="gene_symbol")
>>> model.test_spatial_variability(method="hsic-ir")
>>> sv_results = model.get_formatted_test_results("sv")

Differential usage test:

>>> model = SplisosmNP()
>>> model.setup_data(
...     adata, layer="counts", group_iso_by="gene_symbol",
...     design_mtx="covariate",  # obs column name, or (n_spots, n_factors) array
... )
>>> model.test_differential_usage(method="hsic-gp", residualize="cov_only")
>>> du_results = model.get_formatted_test_results("du")

Initialise the model.

Parameters:

k_neighbors (int, optional) – Number of nearest neighbours used to build the spatial adjacency graph for the CAR kernel (default 4).
rho (float, optional) – Spatial autocorrelation strength in the CAR model (default 0.99). Values close to 1 give a smoother spatial kernel.
standardize_cov (bool, optional) – Whether to standardise the spatial covariance matrix so that its diagonal entries are 1 (default True).
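
To make the role of k_neighbors concrete, the following is a minimal sketch of building a k-nearest-neighbour adjacency matrix from spot coordinates. The helper name knn_adjacency is hypothetical, and the union rule used for symmetrization is an assumption; the library's own graph construction and CAR kernel may differ.

```python
import math

def knn_adjacency(coords, k=4):
    """Binary k-NN adjacency, symmetrized by union (an assumed rule)."""
    n = len(coords)
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        # Rank all other spots by Euclidean distance to spot i.
        order = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: math.dist(coords[i], coords[j]),
        )
        for j in order[:k]:
            adj[i][j] = 1
            adj[j][i] = 1  # keep the edge in both directions
    return adj

# Toy example: five spots on a line, two neighbours each.
coords = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (10.0, 0.0)]
adj = knn_adjacency(coords, k=2)
```

Because of the union rule, a spot can end up with more than k neighbours when it is among the nearest neighbours of many other spots.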

extract_feature_summary(level='gene', print_progress=True)#

Compute filtered feature-level summary statistics.

Gene-level statistics are aggregated across all isoforms that passed the filters applied in setup_data(). Isoform-level statistics are computed per isoform and augmented onto the corresponding rows of adata.var.

Results are cached: repeated calls with the same level return the cached pandas.DataFrame without recomputation.

Parameters:

level (Literal['gene', 'isoform']) – Summary granularity. 'gene': one row per gene. 'isoform': one row per isoform that passed filtering.
print_progress (bool) – Whether to show a progress bar.

Returns:

For level='gene', the index is the gene display name and the columns are:

'n_isos': int. Number of isoforms retained after filtering.
'perplexity': float. Effective number of isoforms based on the marginal isoform usage entropy.
'pct_bin_on': float. Fraction of spots with non-zero total gene counts.
'count_avg': float. Mean per-spot total count for the gene.
'count_std': float. Std of per-spot total count for the gene.

For level='isoform', the index is the isoform name (matching adata.var_names) and the columns are the original adata.var columns plus:

'pct_bin_on': float. Fraction of spots with count > 0.
'count_total': float. Total counts across all spots.
'count_avg': float. Mean count per spot.
'count_std': float. Std of count per spot.
'ratio_total': float. Fraction of total gene counts attributable to this isoform.
'ratio_avg': float. Mean per-spot isoform usage ratio (computed over spots with non-zero gene coverage).
'ratio_std': float. Std of per-spot isoform usage ratio (computed over spots with non-zero gene coverage).

Return type:

DataFrame

Raises:

RuntimeError – If setup_data() has not been called.
ValueError – If level is not 'gene' or 'isoform'.
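
As an illustration of the gene-level columns, the sketch below recomputes them for a single gene from a small count table. It assumes that perplexity is the exponential of the Shannon entropy (natural log) of the pooled isoform usage and that count_std is a population standard deviation; both are assumptions, and gene_summary is a hypothetical helper, not part of the library.

```python
import math
import statistics

def gene_summary(counts):
    """counts: list of per-spot lists, one count per retained isoform."""
    n_spots = len(counts)
    n_isos = len(counts[0])
    gene_totals = [sum(row) for row in counts]  # per-spot total gene count
    iso_totals = [sum(row[i] for row in counts) for i in range(n_isos)]
    grand_total = sum(iso_totals)
    # Marginal isoform usage and its Shannon entropy.
    usage = [t / grand_total for t in iso_totals]
    entropy = -sum(p * math.log(p) for p in usage if p > 0)
    return {
        "n_isos": n_isos,
        "perplexity": math.exp(entropy),
        "pct_bin_on": sum(t > 0 for t in gene_totals) / n_spots,
        "count_avg": statistics.fmean(gene_totals),
        "count_std": statistics.pstdev(gene_totals),
    }

# Four spots, two isoforms with equal pooled usage.
counts = [[2, 2], [0, 0], [3, 1], [1, 3]]
summary = gene_summary(counts)
```

With equal pooled usage across two isoforms the perplexity is 2, meaning both isoforms contribute fully; a gene dominated by one isoform would give a value close to 1.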

get_formatted_test_results(test_type, with_gene_summary=False)#

Get formatted test results as a pandas DataFrame.

Parameters:

test_type ({"sv", "du"}) – Which results to retrieve: "sv" for spatial variability or "du" for differential usage.
with_gene_summary (bool, optional) – If True, append gene-level summary statistics from extract_feature_summary() (columns: n_isos, perplexity, pct_bin_on, count_avg, count_std).

Returns:

Formatted test results.

Return type:

DataFrame

setup_data(adata, *, spatial_key='spatial', adj_key=None, layer='counts', group_iso_by='gene_symbol', gene_names=None, design_mtx=None, covariate_names=None, min_counts=10, min_bin_pct=0.0, filter_single_iso_genes=True, min_component_size=1, skip_spatial_kernel=False)#

Setup isoform-level spatial data for hypothesis testing.

Extracts isoform count tensors from an AnnData object, optionally filters disconnected graph components, builds a spatial covariance kernel, and resolves the design matrix.

Parameters:

adata (AnnData) – Annotated data matrix. Counts are read from adata.layers[layer] grouped by group_iso_by, and spatial coordinates from adata.obsm[spatial_key]. See splisosm.utils.prepare_inputs_from_anndata() for full preprocessing details.
spatial_key (str, optional) – Key in adata.obsm for spatial coordinates (default "spatial"). Optional when adj_key is provided: if the key is missing from adata.obsm the spatial kernel is built from the adjacency alone and coordinate-free SV tests ("hsic-ir" / "hsic-ic" / "hsic-gc") and DU tests still run. method="spark-x" (SV) and method="hsic-gp" (DU) require raw coordinates and raise a clear error at call time when they are absent.
adj_key (str or None, optional) – Key in adata.obsp for a pre-built adjacency matrix. When provided, it overrides the k-NN graph construction from coordinates and is used directly to build the spatial kernel. It also makes spatial_key optional (see above). The adjacency matrix is symmetrized internally.
layer (str, optional) – Layer in adata.layers that stores isoform counts (default "counts").
group_iso_by (str, optional) – Column in adata.var used to group isoforms by gene (default "gene_symbol").
gene_names (str or None, optional) – Column name in adata.var used as display names for genes. If None, the values of group_iso_by are used.
design_mtx (tensor, array, DataFrame, str, or list of str, optional) –
Design matrix for differential-usage tests. Accepts an array/tensor/DataFrame of shape (n_spots, n_factors), a single obs-column name (str), or a list of obs-column names. Categorical obs columns are one-hot encoded automatically.

When a scipy sparse matrix is passed directly, it is stored as scipy CSR internally and all differential-usage methods handle it without densifying: "hsic" uses a sparse matrix-multiply path in linear_hsic_test(); "t-fisher" and "t-tippett" extract group indices directly from the sparse non-zero structure. "hsic-gp" densifies each column via _get_design_col() before GPR fitting (GPR residuals are always dense).

All other input types (obs column names, array, tensor, DataFrame) are converted to a dense torch float32 tensor.
covariate_names (list of str or None, optional) – Explicit covariate names. When design_mtx is given as column name(s) and this is None, the column names are used automatically; otherwise auto-generated as factor_1, etc.
min_counts (int, optional) – Minimum total isoform count across spots required to retain an isoform (default 10).
min_bin_pct (float, optional) – Minimum fraction/percentage of spots where an isoform must be expressed (default 0.0).
filter_single_iso_genes (bool, optional) – Whether to remove genes with fewer than two retained isoforms (default True).
min_component_size (int, optional) – Minimum number of spots a connected component must contain to be retained. Spots in smaller components are removed from all data structures before the spatial kernel is built. Default 1 disables filtering. A UserWarning is issued when spots are removed.
skip_spatial_kernel (bool, optional) – If True, skip construction of the CAR spatial kernel and store an IdentityKernel placeholder as self.sp_kernel instead. Use this when only test_differential_usage() is needed (it fits custom GPR to handle spatial autocorrelation). Calling test_spatial_variability() on a model set up with skip_spatial_kernel=True will raise a RuntimeError. Default False.
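
The effect of min_component_size can be sketched as a connected-component filter on the adjacency graph. The helpers component_labels and spots_to_keep are hypothetical and only illustrate the idea; the library's actual implementation is not shown here.

```python
from collections import deque

def component_labels(adj):
    """Label connected components of an undirected graph via BFS."""
    n = len(adj)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        queue = deque([start])
        while queue:
            i = queue.popleft()
            for j in range(n):
                if adj[i][j] and labels[j] == -1:
                    labels[j] = current
                    queue.append(j)
        current += 1
    return labels

def spots_to_keep(adj, min_component_size=1):
    """Indices of spots whose component has at least min_component_size spots."""
    labels = component_labels(adj)
    sizes = {}
    for lab in labels:
        sizes[lab] = sizes.get(lab, 0) + 1
    return [i for i, lab in enumerate(labels) if sizes[lab] >= min_component_size]

# Spots 0-1-2 form a chain; spots 3-4 form an isolated pair.
adj = [
    [0, 1, 0, 0, 0],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 1, 0],
]
kept = spots_to_keep(adj, min_component_size=3)
```

With the default of 1 every spot is kept, which matches the statement that the default disables filtering.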

Raises:

ValueError – If input arguments are invalid or required fields are missing.

Return type:

None
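
The isoform filters (min_counts, min_bin_pct, filter_single_iso_genes) can be pictured with the following sketch. It treats min_bin_pct as a fraction in [0, 1] and uses inclusive thresholds; both choices are assumptions, and filter_isoforms is a hypothetical helper rather than a library function.

```python
def filter_isoforms(counts_by_gene, min_counts=10, min_bin_pct=0.0,
                    filter_single_iso_genes=True):
    """counts_by_gene: {gene: {isoform: [count per spot]}}."""
    kept = {}
    for gene, isoforms in counts_by_gene.items():
        passing = {}
        for iso, counts in isoforms.items():
            total = sum(counts)
            frac_on = sum(c > 0 for c in counts) / len(counts)
            # Keep isoforms with enough total counts and enough expressing spots.
            if total >= min_counts and frac_on >= min_bin_pct:
                passing[iso] = counts
        # Optionally drop genes left with fewer than two isoforms.
        if filter_single_iso_genes and len(passing) < 2:
            continue
        if passing:
            kept[gene] = passing
    return kept

toy = {
    "geneA": {"iso1": [5, 5, 0, 2], "iso2": [4, 4, 4, 0], "iso3": [1, 0, 0, 0]},
    "geneB": {"iso1": [20, 0, 0, 0], "iso2": [1, 1, 0, 0]},
}
kept = filter_isoforms(toy)
```

Here geneA keeps two isoforms, while geneB is dropped because only one of its isoforms reaches min_counts.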

test_differential_usage(method='hsic-gp', ratio_transformation='none', nan_filling='mean', gpr_backend='sklearn', gpr_configs=None, residualize='cov_only', n_jobs=-1, print_progress=True, return_results=False)#

Test for spatial isoform differential usage.

Before running this function, the design matrix must be set up using setup_data(). Each column of the design matrix corresponds to a covariate to test for differential association with the isoform usage ratios of each gene. Test statistics and p-values are computed per (gene, covariate) pair separately.

Two types of association tests are supported:

Unconditional ("hsic", "t-fisher", "t-tippett"): test the unconditional association between isoform usage ratios and the covariate.
Conditional ("hsic-gp"): test the association conditioned on spatial coordinates via Gaussian process regression. See [ZPJScholkopf12] for more details.

Parameters:

method (str, optional) –
Method for association testing:
- "hsic": Unconditional HSIC test (multivariate RV coefficient). For continuous factors, equivalent to the multivariate Pearson correlation test. For binary factors, equivalent to the two-sample Hotelling T**2 test.
- "hsic-gp": Conditional HSIC test. Spatial effects are removed via Gaussian process regression before computing the HSIC statistic.
Or one of the T-tests (binary factors only):
- "t-fisher", "t-tippett": each isoform is tested independently and p-values are combined gene-wise via Fisher’s or Tippett’s method.
ratio_transformation (str, optional) – Compositional transformation for isoform ratios. One of 'none', 'clr', 'ilr', 'alr', 'radial' [PYPA22]. See splisosm.utils.counts_to_ratios().
nan_filling (str, optional) – How to fill NaN values in isoform ratios. One of 'mean' or 'none'. See splisosm.utils.counts_to_ratios().
gpr_backend (str, optional) – GPR backend to use for method='hsic-gp'. One of 'sklearn' (default) or 'gpytorch'. For FFT-accelerated spatial GP on regular grids use SplisosmFFT instead.
gpr_configs (dict, optional) –
Nested configuration dict for the GPR objects, with optional keys 'covariate' and/or 'isoform'. Each sub-dict is forwarded to splisosm.kernel_gpr.make_kernel_gpr(). Unspecified keys use the defaults from splisosm.kernel_gpr._DEFAULT_GPR_CONFIGS:
```
{
    "covariate": {
        "constant_value": 1.0,
        "constant_value_bounds": (1e-3, 1e3),
        "length_scale": 1.0,
        "length_scale_bounds": "fixed",
        "n_inducing": 5000,
    },
    "isoform": {
        "constant_value": 1.0,
        "constant_value_bounds": (1e-3, 1e3),
        "length_scale": 1.0,
        "length_scale_bounds": "fixed",
        "n_inducing": 5000,
    },
}
```
"n_inducing" (int or None) controls the scale of spatial GP fitting for each backend:
- sklearn — maximum number of observations used for hyperparameter fitting. Full exact GP when n_obs ≤ n_inducing (or None); a randomly sub-sampled subset-of-data of n_inducing points otherwise (not the same inducing-point approximation as gpytorch). Default: 5000. Set to None to use all observations (warns when n_obs > 10_000).
- gpytorch — FITC sparse-GP inducing-point approximation with n_inducing points; set to None for exact GP. Default: 5000.
residualize ({"cov_only", "both"}, optional) –
Controls which signals are spatially residualized when method="hsic-gp":
- "cov_only" (default): residualize covariates only; test HSIC(Z_res, Y_raw). Fastest; calibration matches "both" when covariate GPR captures most spatial confounding.
- "both": residualize both covariates and isoform ratios.
n_jobs (int, optional) – Number of parallel workers for the per-gene loop. -1 uses all available CPUs. Each worker densifies one sparse count tensor (~4–40 MB at 100 K–1 M spots × 10 isoforms). When gpr_backend="gpytorch" and device != "cpu", the GPU is not thread-safe; parallelism is automatically disabled. Default -1.
print_progress (bool, optional) – Whether to show a progress bar. Defaults to True.
return_results (bool, optional) – Whether to return the test statistics and p-values. If False, the results are stored in self._du_test_results.
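
For the "t-fisher" and "t-tippett" options, the gene-wise combination of per-isoform p-values can be sketched as below. This shows the textbook forms of Fisher's and Tippett's methods under an independence assumption; the function names are hypothetical and the library's internals may differ in detail.

```python
import math

def fisher_combine(pvals):
    """Fisher's method: -2 * sum(log p) follows chi2 with 2k degrees of freedom."""
    k = len(pvals)
    half_stat = -sum(math.log(p) for p in pvals)  # equals the statistic / 2
    # Closed-form chi-square survival function for even degrees of freedom 2k.
    return math.exp(-half_stat) * sum(
        half_stat**i / math.factorial(i) for i in range(k)
    )

def tippett_combine(pvals):
    """Tippett's method: combined p-value based on the minimum p-value."""
    return 1.0 - (1.0 - min(pvals)) ** len(pvals)

iso_pvals = [0.01, 0.5]
p_fisher = fisher_combine(iso_pvals)
p_tippett = tippett_combine(iso_pvals)
```

Fisher's method pools evidence across all isoforms, whereas Tippett's method is driven by the single most significant isoform. Both sketches require strictly positive p-values.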

Returns:

results – If return_results is True, returns dict with test statistics and p-values. Otherwise, returns None and stores results in self._du_test_results.

Return type:

dict or None
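
As a sketch of how a partial gpr_configs dict might be combined with the documented defaults, the snippet below overrides a single key and leaves the rest untouched. The defaults are copied from the listing above; merge_gpr_configs is a hypothetical helper, and the way the library actually merges user settings is an assumption.

```python
import copy

# Defaults as listed in the gpr_configs documentation.
DEFAULT_GPR_CONFIGS = {
    "covariate": {
        "constant_value": 1.0,
        "constant_value_bounds": (1e-3, 1e3),
        "length_scale": 1.0,
        "length_scale_bounds": "fixed",
        "n_inducing": 5000,
    },
    "isoform": {
        "constant_value": 1.0,
        "constant_value_bounds": (1e-3, 1e3),
        "length_scale": 1.0,
        "length_scale_bounds": "fixed",
        "n_inducing": 5000,
    },
}

def merge_gpr_configs(user_configs=None):
    """Overlay user-supplied keys on a copy of the defaults."""
    merged = copy.deepcopy(DEFAULT_GPR_CONFIGS)
    for role, overrides in (user_configs or {}).items():
        if role not in merged:
            raise KeyError(f"unknown gpr_configs key: {role!r}")
        merged[role].update(overrides)
    return merged

# Request an exact GP for the isoform model only.
configs = merge_gpr_configs({"isoform": {"n_inducing": None}})
```

In this sketch only the 'isoform' entry changes; the 'covariate' entry keeps n_inducing at 5000.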

test_spatial_variability(method='hsic-ir', ratio_transformation='none', nan_filling='mean', null_method='eig', null_configs=None, n_jobs=-1, return_results=False, print_progress=True)#

Test for spatial variability.

Kernel-based multivariate hypothesis testing for spatial variability in

gene-level total counts ("hsic-gc" or "spark-x" [ZSZ21])
isoform usage ratios ("hsic-ir")
isoform counts ("hsic-ic")

Test statistics and p-values are computed separately for each gene.
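
The isoform usage ratios tested by "hsic-ir" can be pictured as per-spot proportions. The sketch below assumes that nan_filling='mean' replaces undefined ratios (spots with zero gene counts) by the mean ratio over covered spots; that reading, and the helper name counts_to_ratios_sketch, are assumptions and do not reproduce splisosm.utils.counts_to_ratios().

```python
def counts_to_ratios_sketch(counts, nan_filling="mean"):
    """counts: per-spot lists of isoform counts for one gene."""
    n_isos = len(counts[0])
    ratios = []
    for row in counts:
        total = sum(row)
        # The ratio is undefined when the gene has no counts at this spot.
        ratios.append([c / total for c in row] if total > 0 else None)
    if nan_filling == "mean":
        # Assumes at least one spot has non-zero gene coverage.
        covered = [r for r in ratios if r is not None]
        mean_ratio = [sum(r[i] for r in covered) / len(covered) for i in range(n_isos)]
        return [r if r is not None else list(mean_ratio) for r in ratios]
    return [r if r is not None else [float("nan")] * n_isos for r in ratios]

counts = [[3, 1], [0, 0], [1, 3]]
ratios = counts_to_ratios_sketch(counts)
raw = counts_to_ratios_sketch(counts, nan_filling="none")
```

In this toy example the middle spot has no coverage, so it receives the mean ratio [0.5, 0.5] under 'mean' and stays NaN under 'none'.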

Parameters:

method ({"hsic-ir", "hsic-ic", "hsic-gc", "spark-x"}, optional) – Test target: "hsic-ir" (isoform usage ratios), "hsic-ic" (isoform counts), "hsic-gc" (gene-level counts), or "spark-x" (SPARK-X [ZSZ21]).
ratio_transformation ({"none", "clr", "ilr", "alr", "radial"}, optional) – Compositional transformation applied to isoform ratios when method="hsic-ir". See splisosm.utils.counts_to_ratios() and [PYPA22] for details.
nan_filling ({"mean", "none"}, optional) – Strategy for NaN values in isoform ratios. See splisosm.utils.counts_to_ratios() for details.
null_method ({"eig", "clt", "welch", "perm"}, optional) –
Method for computing the null distribution of the test statistic:
- "eig" (default): asymptotic chi-square mixture using kernel eigenvalues; Liu’s method [LTZ09]. Supports optional null_configs["approx_rank"] (int) to use only the top-k eigenvalues. By default, approx_rank = np.ceil(np.sqrt(n_spots) * 4) for large datasets (n_spots > 5000). Set it to None to use all eigenvalues, which can be slow for large n_spots.
- "clt": moment-matching normal (Central Limit Theorem) approximation using tr(K’) and tr(K’²) of the (centred) spatial kernel. Fastest, but tail probabilities can be inaccurate when the effective degrees of freedom of the chi-squared mixture is small. "trace" is accepted as a deprecated alias.
- "welch": Welch-Satterthwaite moment matching. Uses the same tr(K’) and tr(K’²) as "clt" but approximates the null by a scaled chi-squared g * chi2(h) with g = Var/(2*E) and h = 2*E^2/Var. Comparable cost to "clt" with more accurate right-tail p-values, typically closer to the "eig" (Liu) reference.
- "perm": permutation-based null distribution. Supports optional null_configs["n_perms_per_gene"] (default 1000), and null_configs["perm_batch_size"] (default 50, larger values lead to more memory usage) for batch-wise null statistic computation.
null_configs (dict or None, optional) – Extra keyword arguments for the chosen null_method.
n_jobs (int, optional) – Number of parallel workers for the per-gene loop. -1 uses all available CPUs. Each worker densifies one sparse count tensor (~4–40 MB at 100 K–1 M spots × 10 isoforms) so choose n_jobs to fit within available RAM. Default -1.
return_results (bool, optional) – If True, return the result dict. Otherwise store results in self._sv_test_results and return None.
print_progress (bool, optional) – Whether to show a progress bar.
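
The moment-matching formulas quoted for "welch" and "clt" can be checked numerically. The sketch below takes the null mean E and variance Var as given inputs (how they are derived from the kernel traces is not shown) and uses hypothetical helper names.

```python
from statistics import NormalDist

def welch_params(mean, var):
    """Scale g and degrees of freedom h so that g * chi2(h) matches mean and var."""
    g = var / (2.0 * mean)
    h = 2.0 * mean**2 / var
    return g, h

def clt_pvalue(stat, mean, var):
    """Right-tail p-value under the normal approximation."""
    return 1.0 - NormalDist(mu=mean, sigma=var**0.5).cdf(stat)

null_mean, null_var = 2.0, 1.0
g, h = welch_params(null_mean, null_var)

# g * chi2(h) has mean g * h and variance 2 * g**2 * h.
matched_mean = g * h
matched_var = 2.0 * g**2 * h

p_clt = clt_pvalue(2.0, null_mean, null_var)
# A Welch p-value would be the chi-square survival function of stat / g with
# h degrees of freedom, for example scipy.stats.chi2.sf(stat / g, h).
```

The scaled chi-squared keeps the right skew of the null distribution, which is one way to see why its tail p-values can differ from the symmetric normal approximation.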

Returns:

If return_results is True, returns dict with test statistics and p-values. Otherwise, returns None and stores results in self._sv_test_results.

Return type:

dict or None

Notes

To run the SPARK-X test, the R-package SPARK must be installed and accessible from Python via rpy2.
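
The default approx_rank rule described for null_method="eig" can be written out as a small helper. Returning None for datasets of 5000 spots or fewer (meaning all eigenvalues are used) is an assumption, and default_approx_rank is a hypothetical name.

```python
import math

def default_approx_rank(n_spots, threshold=5000):
    """Number of top eigenvalues kept by default, following the documented rule."""
    if n_spots > threshold:
        return math.ceil(math.sqrt(n_spots) * 4)
    return None  # assumption: small datasets use all eigenvalues

rank_small = default_approx_rank(5000)
rank_large = default_approx_rank(10_000)
```

For 10,000 spots the rule gives 400 eigenvalues, so the rank grows with the square root of the dataset size rather than linearly.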

adata: AnnData | None#: Source AnnData object; None before setup_data().

covariate_names: list[str]#: Covariate display names (length n_factors).

design_mtx: Tensor | None#: Design matrix (n_spots, n_factors); None if no covariates.

property filtered_adata: AnnData#

The filtered AnnData of shape (n_spots, sum(n_isos_per_gene)).

This is the data used internally after setup_data(). It is a copy of the input adata, subsetted to the retained spots and isoforms after filtering.

Raises:

RuntimeError – If setup_data() has not been called.

Return type:

AnnData

gene_names: list[str]#: Gene display names (length n_genes).

n_factors: int#: Number of covariates for differential usage testing.

n_genes: int#: Number of genes after filtering.

n_isos_per_gene: list[int]#: Number of isoforms per gene (list of length n_genes).

n_spots: int#: Number of spatial spots/cells.

sp_kernel: Any#: Spatial kernel (SpatialCovKernel or IdentityKernel). Set by setup_data().