splisosm.hyptest_np#

Non-parametric hypothesis tests for spatial isoform usage.

Classes#

SplisosmNP

Non-parametric spatial isoform statistical model.

Module Contents#

class splisosm.hyptest_np.SplisosmNP(k_neighbors=4, rho=0.99, standardize_cov=True)#

Non-parametric spatial isoform statistical model.

Examples

Spatial variability test:

>>> from splisosm import SplisosmNP
>>> # adata : AnnData of shape (n_spots, n_isoforms)
>>> #   adata.layers["counts"]    — raw isoform counts
>>> #   adata.var["gene_symbol"]  — column grouping isoforms by gene
>>> #   adata.obsm["spatial"]     — (n_spots, 2) spatial coordinates
>>> model = SplisosmNP()
>>> model.setup_data(adata, layer="counts", group_iso_by="gene_symbol")
>>> model.test_spatial_variability(method="hsic-ir")
>>> sv_results = model.get_formatted_test_results("sv")

Differential usage test:

>>> model = SplisosmNP()
>>> model.setup_data(
...     adata, layer="counts", group_iso_by="gene_symbol",
...     design_mtx="covariate",  # obs column name, or (n_spots, n_factors) array
... )
>>> model.test_differential_usage(method="hsic-gp", residualize="cov_only")
>>> du_results = model.get_formatted_test_results("du")

Initialise the model.

Parameters:
  • k_neighbors (int, optional) – Number of nearest neighbours used to build the spatial adjacency graph for the CAR kernel (default 4).

  • rho (float, optional) – Spatial autocorrelation strength in the CAR model (default 0.99). Values close to 1 give a smoother spatial kernel.

  • standardize_cov (bool, optional) – Whether to standardise the spatial covariance matrix so that its diagonal entries are 1 (default True).
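As a rough illustration of how the three constructor arguments interact, the sketch below builds a symmetrised k-NN graph, forms a CAR-style covariance, and rescales it to unit diagonal. The exact CAR parameterisation used by SplisosmNP is not stated on this page, so the form inv(I - rho * D^{-1/2} W D^{-1/2}) and the helper name car_covariance are assumptions made for illustration only.

```python
import numpy as np

def car_covariance(coords, k_neighbors=4, rho=0.99, standardize_cov=True):
    """Illustrative CAR-style covariance (hypothetical helper, not the library code)."""
    n = coords.shape[0]
    # Pairwise squared distances; the diagonal is masked so a spot is never its own neighbour.
    d2 = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)
    # Binary k-NN adjacency, then symmetrise.
    nn = np.argsort(d2, axis=1)[:, :k_neighbors]
    W = np.zeros((n, n))
    W[np.repeat(np.arange(n), k_neighbors), nn.ravel()] = 1.0
    W = np.maximum(W, W.T)
    # Assumed precision matrix: I - rho * D^{-1/2} W D^{-1/2} (positive definite for |rho| < 1).
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    Q = np.eye(n) - rho * (d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :])
    K = np.linalg.inv(Q)
    if standardize_cov:
        # Rescale so that every diagonal entry equals 1.
        s = 1.0 / np.sqrt(np.diag(K))
        K = s[:, None] * K * s[None, :]
    return K

rng = np.random.default_rng(0)
coords = rng.uniform(size=(30, 2))   # toy (n_spots, 2) coordinates
K = car_covariance(coords)
```

With rho close to 1 the off-diagonal entries decay slowly along the graph, which is one way to read the "smoother spatial kernel" remark above.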

extract_feature_summary(level='gene', print_progress=True)#

Compute filtered feature-level summary statistics.

Gene-level statistics are aggregated across all isoforms that passed the filters applied in setup_data(). Isoform-level statistics are computed per isoform and appended as extra columns to the corresponding rows of adata.var.

Results are cached: repeated calls with the same level return the cached pandas.DataFrame without recomputation.

Parameters:
  • level (Literal['gene', 'isoform']) – Summary granularity. 'gene': one row per gene. 'isoform': one row per isoform that passed filtering.

  • print_progress (bool) – Whether to show a progress bar.

Returns:

For level='gene', the index is the gene display name and the columns are:

  • 'n_isos': int. Number of isoforms retained after filtering.

  • 'perplexity': float. Effective number of isoforms based on the marginal isoform usage entropy.

  • 'pct_bin_on': float. Fraction of spots with non-zero total gene counts.

  • 'count_avg': float. Mean per-spot total count for the gene.

  • 'count_std': float. Std of per-spot total count for the gene.

For level='isoform', the index is the isoform name (matching adata.var_names) and the columns are the original adata.var columns plus:

  • 'pct_bin_on': float. Fraction of spots with count > 0.

  • 'count_total': float. Total counts across all spots.

  • 'count_avg': float. Mean count per spot.

  • 'count_std': float. Std of count per spot.

  • 'ratio_total': float. Fraction of total gene counts attributable to this isoform.

  • 'ratio_avg': float. Mean per-spot isoform usage ratio (computed over spots with non-zero gene coverage).

  • 'ratio_std': float. Std of per-spot isoform usage ratio (computed over spots with non-zero gene coverage).

Return type:

DataFrame

Raises:
  • RuntimeError – If setup_data() has not been called.

  • ValueError – If level is not 'gene' or 'isoform'.
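The 'perplexity' column is described as an effective number of isoforms derived from the marginal usage entropy. A minimal sketch, assuming it is the exponential of the Shannon entropy of the pooled isoform proportions (the helper name isoform_perplexity is hypothetical and the library may compute it differently):

```python
import math

def isoform_perplexity(total_counts):
    """exp(Shannon entropy) of the marginal isoform usage (hypothetical helper)."""
    total = sum(total_counts)
    # Zero-count isoforms contribute nothing to the entropy.
    probs = [c / total for c in total_counts if c > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)

even = isoform_perplexity([50, 50, 50, 50])   # four equally used isoforms
skewed = isoform_perplexity([97, 1, 1, 1])    # one dominant isoform
```

Under this definition, equal usage of four isoforms gives a perplexity of 4 (up to floating-point error), while a gene dominated by one isoform scores close to 1.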

get_formatted_test_results(test_type, with_gene_summary=False)#

Get formatted test results as a pandas DataFrame.

Parameters:
  • test_type ({"sv", "du"}) – Which results to retrieve: "sv" for spatial variability or "du" for differential usage.

  • with_gene_summary (bool, optional) – If True, append gene-level summary statistics from extract_feature_summary() (columns: n_isos, perplexity, pct_bin_on, count_avg, count_std).

Returns:

Formatted test results.

Return type:

DataFrame

setup_data(adata, *, spatial_key='spatial', adj_key=None, layer='counts', group_iso_by='gene_symbol', gene_names=None, design_mtx=None, covariate_names=None, min_counts=10, min_bin_pct=0.0, filter_single_iso_genes=True, min_component_size=1, skip_spatial_kernel=False)#

Set up isoform-level spatial data for hypothesis testing.

Extracts isoform count tensors from an AnnData object, optionally filters disconnected graph components, builds a spatial covariance kernel, and resolves the design matrix.

Parameters:
  • adata (AnnData) – Annotated data matrix. Counts are read from adata.layers[layer] grouped by group_iso_by, and spatial coordinates from adata.obsm[spatial_key]. See splisosm.utils.prepare_inputs_from_anndata() for full preprocessing details.

  • spatial_key (str, optional) – Key in adata.obsm for spatial coordinates (default "spatial").

  • adj_key (str or None, optional) – Key in adata.obsp for a pre-built adjacency matrix. When provided, it overrides the k-NN graph construction from coordinates and is used directly to build the spatial kernel. The adjacency matrix is symmetrized internally.

  • layer (str, optional) – Layer in adata.layers that stores isoform counts (default "counts").

  • group_iso_by (str, optional) – Column in adata.var used to group isoforms by gene (default "gene_symbol").

  • gene_names (str or None, optional) – Column name in adata.var used as display names for genes. If None, the values of group_iso_by are used.

  • design_mtx (tensor, array, DataFrame, str, or list of str, optional) –

    Design matrix for differential-usage tests. Accepts an array/tensor/DataFrame of shape (n_spots, n_factors), a single obs-column name (str), or a list of obs-column names. Categorical obs columns are one-hot encoded automatically.

    When a scipy sparse matrix is passed directly, it is stored as scipy CSR internally and all differential-usage methods handle it without densifying: "hsic" uses a sparse matrix-multiply path in linear_hsic_test(); "t-fisher" and "t-tippett" extract group indices directly from the sparse non-zero structure. "hsic-gp" densifies each column via _get_design_col() before GPR fitting (GPR residuals are always dense).

    All other input types (obs column names, array, tensor, DataFrame) are converted to a dense torch float32 tensor.

  • covariate_names (list of str or None, optional) – Explicit covariate names. When design_mtx is given as column name(s) and this is None, the column names are used automatically; otherwise names are auto-generated as factor_1, etc.

  • min_counts (int, optional) – Minimum total isoform count across spots required to retain an isoform (default 10).

  • min_bin_pct (float, optional) – Minimum fraction/percentage of spots where an isoform must be expressed (default 0.0).

  • filter_single_iso_genes (bool, optional) – Whether to remove genes with fewer than two retained isoforms (default True).

  • min_component_size (int, optional) – Minimum number of spots a connected component must contain to be retained. Spots in smaller components are removed from all data structures before the spatial kernel is built. Default 1 disables filtering. A UserWarning is issued when spots are removed.

  • skip_spatial_kernel (bool, optional) – If True, skip construction of the CAR spatial kernel and store an IdentityKernel placeholder as self.sp_kernel instead. Use this when only test_differential_usage() is needed (it fits its own GPR models to handle spatial autocorrelation). Calling test_spatial_variability() on a model set up with skip_spatial_kernel=True raises a RuntimeError. Default False.

Raises:

ValueError – If input arguments are invalid or required fields are missing.

Return type:

None
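The min_component_size filter can be pictured as a connected-components pass over the spatial graph. The sketch below is a plain breadth-first search over an adjacency list; the helper name spots_to_keep and the dict-based graph are hypothetical and only illustrate the documented behaviour, not the library's internals.

```python
from collections import deque

def spots_to_keep(adjacency, min_component_size=1):
    """Return sorted indices of spots in components with >= min_component_size spots."""
    n = len(adjacency)
    seen = [False] * n
    keep = []
    for start in range(n):
        if seen[start]:
            continue
        # Breadth-first search collects one connected component.
        seen[start] = True
        component, queue = [], deque([start])
        while queue:
            node = queue.popleft()
            component.append(node)
            for nb in adjacency[node]:
                if not seen[nb]:
                    seen[nb] = True
                    queue.append(nb)
        if len(component) >= min_component_size:
            keep.extend(component)
    return sorted(keep)

# Spots 0-3 form one component, spots 4-5 a detached pair, spot 6 is isolated.
adjacency = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2], 4: [5], 5: [4], 6: []}
kept = spots_to_keep(adjacency, min_component_size=3)
```

With the default min_component_size=1 every spot is kept, which matches the statement that the default disables filtering.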

test_differential_usage(method='hsic-gp', ratio_transformation='none', nan_filling='mean', gpr_backend='sklearn', gpr_configs=None, residualize='cov_only', n_jobs=-1, print_progress=True, return_results=False)#

Test for spatial isoform differential usage.

Before running this function, the design matrix must be set up using setup_data(). Each column of the design matrix corresponds to a covariate to test for differential association with the isoform usage ratios of each gene. Test statistics and p-values are computed per (gene, covariate) pair separately.

Two types of association tests are supported:

  • Unconditional ("hsic", "t-fisher", "t-tippett"): test the unconditional association between isoform usage ratios and the covariate.

  • Conditional ("hsic-gp"): test the association conditioned on spatial coordinates via Gaussian process regression. See [ZPJScholkopf12] for more details.
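To make the unconditional case concrete, the sketch below computes an HSIC statistic with linear kernels on both the isoform ratios and the covariate, together with its normalised form, the RV coefficient. The function names are hypothetical, and the choice of linear kernels is an assumption for illustration; the p-value computation used by the library is not reproduced here.

```python
import numpy as np

def _centre(X):
    return X - X.mean(axis=0)

def linear_hsic(Y, Z):
    """tr(Kc Lc) / (n - 1)^2 with linear kernels, i.e. ||Yc^T Zc||_F^2 / (n - 1)^2."""
    Yc, Zc = _centre(Y), _centre(Z)
    n = Y.shape[0]
    return float(np.sum((Yc.T @ Zc) ** 2) / (n - 1) ** 2)

def rv_coefficient(Y, Z):
    """Normalised version of the same statistic, bounded in [0, 1]."""
    Yc, Zc = _centre(Y), _centre(Z)
    num = np.sum((Yc.T @ Zc) ** 2)
    den = np.sqrt(np.sum((Yc.T @ Yc) ** 2) * np.sum((Zc.T @ Zc) ** 2))
    return float(num / den)

rng = np.random.default_rng(1)
n_spots = 200
z = rng.normal(size=(n_spots, 1))                      # one continuous covariate
usage = np.clip(0.5 + 0.1 * z[:, 0] + rng.normal(0, 0.05, n_spots), 0, 1)
ratios = np.column_stack([usage, 1 - usage])           # two-isoform gene

stat = linear_hsic(ratios, z)
rv = rv_coefficient(ratios, z)
# With a single response column the RV coefficient is the squared Pearson correlation.
rv_1d = rv_coefficient(ratios[:, :1], z)
r = np.corrcoef(ratios[:, 0], z[:, 0])[0, 1]
```

In the single-column case the RV coefficient reduces to the squared Pearson correlation, which gives some intuition for why a linear-kernel HSIC test behaves like a correlation test for continuous factors.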

Parameters:
  • method (str, optional) –

    Method for association testing:

    • "hsic": Unconditional HSIC test (multivariate RV coefficient). For continuous factors, equivalent to the multivariate Pearson correlation test. For binary factors, equivalent to the two-sample Hotelling T² test.

    • "hsic-gp": Conditional HSIC test. Spatial effects are removed via Gaussian process regression before computing the HSIC statistic.

    Or one of the T-tests (binary factors only):

    • "t-fisher", "t-tippett": each isoform is tested independently and p-values are combined gene-wise via Fisher’s or Tippett’s method.

  • ratio_transformation (str, optional) – Compositional transformation for isoform ratios. One of 'none', 'clr', 'ilr', 'alr', 'radial' [PYPA22]. See splisosm.utils.counts_to_ratios().

  • nan_filling (str, optional) – How to fill NaN values in isoform ratios. One of 'mean' or 'none'. See splisosm.utils.counts_to_ratios().

  • gpr_backend (str, optional) – GPR backend to use for method='hsic-gp'. One of 'sklearn' (default) or 'gpytorch'. For FFT-accelerated spatial GP on regular grids use SplisosmFFT instead.

  • gpr_configs (dict, optional) –

    Nested configuration dict for the GPR objects, with optional keys 'covariate' and/or 'isoform'. Each sub-dict is forwarded to splisosm.kernel_gpr.make_kernel_gpr(). Unspecified keys use the defaults from splisosm.kernel_gpr._DEFAULT_GPR_CONFIGS:

    {
        "covariate": {
            "constant_value": 1.0,
            "constant_value_bounds": (1e-3, 1e3),
            "length_scale": 1.0,
            "length_scale_bounds": "fixed",
            "n_inducing": 5000,
        },
        "isoform": {
            "constant_value": 1.0,
            "constant_value_bounds": (1e-3, 1e3),
            "length_scale": 1.0,
            "length_scale_bounds": "fixed",
            "n_inducing": 5000,
        },
    }
    

    "n_inducing" (int or None) controls the scale of spatial GP fitting for each backend:

    • sklearn: maximum number of observations used for hyperparameter fitting. Full exact GP when n_obs ≤ n_inducing (or when n_inducing is None); otherwise a randomly sub-sampled subset-of-data of n_inducing points (not the same inducing-point approximation as gpytorch). Default: 5000. Set to None to use all observations (warns when n_obs > 10_000).

    • gpytorch: FITC sparse-GP inducing-point approximation with n_inducing points; set to None for exact GP. Default: 5000.

  • residualize ({"cov_only", "both"}, optional) –

    Controls which signals are spatially residualized when method="hsic-gp":

    • "cov_only" (default): residualize covariates only; test HSIC(Z_res, Y_raw). Fastest; calibration matches "both" when covariate GPR captures most spatial confounding.

    • "both": residualize both covariates and isoform ratios.

  • n_jobs (int, optional) – Number of parallel workers for the per-gene loop. -1 uses all available CPUs. Each worker densifies one sparse count tensor (~4–40 MB at 100 K–1 M spots × 10 isoforms). When gpr_backend="gpytorch" and device != "cpu", parallelism is automatically disabled because the GPU is not thread-safe. Default -1.

  • print_progress (bool, optional) – Whether to show the progress bar. Defaults to True.

  • return_results (bool, optional) – Whether to return the test statistics and p-values. If False, the results are stored in self._du_test_results.

Returns:

results – If return_results is True, returns dict with test statistics and p-values. Otherwise, returns None and stores results in self._du_test_results.

Return type:

dict or None
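The "t-fisher" and "t-tippett" methods combine per-isoform p-values into one gene-level p-value. A minimal sketch of the two textbook combination rules follows; the function names are hypothetical, and both rules assume independent p-values, which is an idealisation for isoforms of the same gene.

```python
import math

def fisher_combine(pvalues):
    """Fisher's method: -2 * sum(log p) follows chi-square with 2k degrees of freedom."""
    k = len(pvalues)
    half_stat = -sum(math.log(p) for p in pvalues)   # test statistic divided by 2
    # Closed-form chi-square survival function for an even number of degrees of freedom.
    return math.exp(-half_stat) * sum(half_stat ** i / math.factorial(i) for i in range(k))

def tippett_combine(pvalues):
    """Tippett's method: combined p-value from the smallest of k p-values."""
    k = len(pvalues)
    return 1.0 - (1.0 - min(pvalues)) ** k

iso_pvalues = [0.01, 0.20, 0.65]   # toy per-isoform p-values for one gene
p_fisher = fisher_combine(iso_pvalues)
p_tippett = tippett_combine(iso_pvalues)
```

Fisher's rule pools evidence across all isoforms, whereas Tippett's rule reacts only to the single most significant isoform, so the two can rank genes differently.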

test_spatial_variability(method='hsic-ir', ratio_transformation='none', nan_filling='mean', null_method='eig', null_configs=None, n_jobs=-1, return_results=False, print_progress=True)#

Test for spatial variability.

Kernel-based multivariate hypothesis testing for spatial variability in:

  • gene-level total counts ("hsic-gc" or "spark-x" [ZSZ21])

  • isoform usage ratios ("hsic-ir")

  • isoform counts ("hsic-ic")

Test statistics and p-values are computed for each gene separately.
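For the isoform-ratio target, the computation can be pictured in two steps: turn counts into per-spot usage ratios (filling spots without coverage), then measure how strongly the centred ratios align with a spatial kernel. The sketch below assumes mean filling and an HSIC statistic of the form tr(K Yc Ycᵀ) / (n - 1)²; the helper names and the Gaussian stand-in kernel are hypothetical, and splisosm.utils.counts_to_ratios() may differ in detail.

```python
import numpy as np

def counts_to_ratios_mean_filled(counts):
    """Per-spot isoform ratios; spots with zero gene coverage get the column mean."""
    totals = counts.sum(axis=1, keepdims=True)
    with np.errstate(invalid="ignore", divide="ignore"):
        ratios = counts / totals              # NaN where the gene has no counts
    col_mean = np.nanmean(ratios, axis=0)     # mean over covered spots only
    return np.where(np.isnan(ratios), col_mean, ratios)

def hsic_statistic(ratios, K):
    """tr(K Yc Yc^T) / (n - 1)^2 with column-centred ratios Yc."""
    Yc = ratios - ratios.mean(axis=0)
    n = ratios.shape[0]
    return float(np.trace(Yc.T @ K @ Yc) / (n - 1) ** 2)

# Toy 1-D tissue: usage of isoform A increases along the axis.
x = np.linspace(0.0, 1.0, 50)
K = np.exp(-((x[:, None] - x[None, :]) ** 2) / (2 * 0.1 ** 2))  # stand-in spatial kernel
iso_a = np.round(20 * x)
counts = np.column_stack([iso_a, 20 - iso_a])
counts[[3, 17]] = 0.0                          # two spots without reads for this gene
ratios = counts_to_ratios_mean_filled(counts)

rng = np.random.default_rng(0)
stat_pattern = hsic_statistic(ratios, K)
stat_shuffled = hsic_statistic(ratios[rng.permutation(50)], K)  # spatial structure destroyed
```

The statistic is large when neighbouring spots have similar usage and drops once the spot order is shuffled, which is the contrast the test quantifies.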

Parameters:
  • method ({"hsic-ir", "hsic-ic", "hsic-gc", "spark-x"}, optional) – Test target: "hsic-ir" (isoform usage ratios), "hsic-ic" (isoform counts), "hsic-gc" (gene-level counts), or "spark-x" (SPARK-X [ZSZ21]).

  • ratio_transformation ({"none", "clr", "ilr", "alr", "radial"}, optional) – Compositional transformation applied to isoform ratios when method="hsic-ir". See splisosm.utils.counts_to_ratios() and [PYPA22] for details.

  • nan_filling ({"mean", "none"}, optional) – Strategy for NaN values in isoform ratios. See splisosm.utils.counts_to_ratios() for details.

  • null_method ({"eig", "trace", "perm"}, optional) –

    Method for computing the null distribution of the test statistic:

    • "eig" (default): asymptotic chi-square mixture using kernel eigenvalues; Liu’s method [LTZ09]. Supports optional null_configs["approx_rank"] (int) to use only the top-k eigenvalues. By default, approx_rank = np.ceil(np.sqrt(n_spots) * 4) for large datasets (n_spots > 5000). Set it to None to use all eigenvalues, which can be slow for large n_spots.

    • "trace": moment-matching normal approximation using tr(K’) and tr(K’²) of the (centred) spatial kernel.

    • "perm": permutation-based null distribution. Supports optional null_configs["n_perms_per_gene"] (default 1000), and null_configs["perm_batch_size"] (default 50, larger values lead to more memory usage) for batch-wise null statistic computation.

  • null_configs (dict or None, optional) – Extra keyword arguments for the chosen null_method.

  • n_jobs (int, optional) – Number of parallel workers for the per-gene loop. -1 uses all available CPUs. Each worker densifies one sparse count tensor (~4–40 MB at 100 K–1 M spots × 10 isoforms) so choose n_jobs to fit within available RAM. Default -1.

  • return_results (bool, optional) – If True, return the result dict. Otherwise store results in self._sv_test_results and return None.

  • print_progress (bool, optional) – Whether to show a progress bar.

Returns:

If return_results is True, returns dict with test statistics and p-values. Otherwise, returns None and stores results in self._sv_test_results.

Return type:

dict or None

Notes

To run the SPARK-X test, the R package SPARK must be installed and accessible from Python via rpy2.
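The "perm" null method can be sketched as follows: shuffle the spot order of the centred response, recompute the kernel statistic, and compare the observed value against the shuffled ones, processing permutations in batches to bound memory. The argument names n_perms_per_gene and perm_batch_size mirror the documented null_configs keys; the function perm_null_pvalue, the unnormalised statistic, and the Gaussian stand-in kernel are assumptions for illustration.

```python
import numpy as np

def perm_null_pvalue(Y, K, n_perms_per_gene=1000, perm_batch_size=50, seed=0):
    """Permutation p-value for tr(Kc Yc Yc^T); hypothetical helper."""
    rng = np.random.default_rng(seed)
    n = Y.shape[0]
    Yc = Y - Y.mean(axis=0)
    H = np.eye(n) - 1.0 / n                 # centring matrix I - 11^T / n
    Kc = H @ K @ H
    # The constant 1 / (n - 1)^2 is omitted: it is identical for observed and null values.
    observed = np.einsum("ip,ij,jp->", Yc, Kc, Yc)
    null = []
    for start in range(0, n_perms_per_gene, perm_batch_size):
        b = min(perm_batch_size, n_perms_per_gene - start)
        idx = np.stack([rng.permutation(n) for _ in range(b)])  # (b, n) spot orders
        Yp = Yc[idx]                                            # (b, n, p): memory grows with b
        null.append(np.einsum("bip,ij,bjp->b", Yp, Kc, Yp))
    null = np.concatenate(null)
    return (1 + np.sum(null >= observed)) / (1 + n_perms_per_gene)

x = np.linspace(0.0, 1.0, 60)
K = np.exp(-((x[:, None] - x[None, :]) ** 2) / (2 * 0.1 ** 2))  # stand-in spatial kernel
rng = np.random.default_rng(2)
usage = np.clip(0.2 + 0.6 * x + rng.normal(0, 0.05, 60), 0, 1)  # smooth usage gradient
Y = np.column_stack([usage, 1 - usage])
p_value = perm_null_pvalue(Y, K, n_perms_per_gene=200, perm_batch_size=64)
```

The batch array has shape (perm_batch_size, n_spots, n_isoforms), which is why larger batches trade memory for fewer loop iterations.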

adata: AnnData | None#

Source AnnData object; None before setup_data().

covariate_names: list[str]#

Covariate display names (length n_factors).

design_mtx: Tensor | None#

Design matrix (n_spots, n_factors); None if no covariates.

property filtered_adata: AnnData#

The filtered AnnData of shape (n_spots, sum(n_isos_per_gene)).

This is the data used internally after setup_data(). It is a copy of the input adata, subsetted to the retained spots and isoforms after filtering.

Raises:

RuntimeError – If setup_data() has not been called.

Return type:

AnnData

gene_names: list[str]#

Gene display names (length n_genes).

n_factors: int#

Number of covariates for differential usage testing.

n_genes: int#

Number of genes after filtering.

n_isos_per_gene: list[int]#

Number of isoforms per gene (list of length n_genes).

n_spots: int#

Number of spatial spots/cells.

sp_kernel: Any#

Spatial kernel (SpatialCovKernel or IdentityKernel). Set by setup_data().