splisosm.hyptest_np

Non-parametric hypothesis tests for spatial isoform usage.

Classes

SplisosmNP

Non-parametric spatial isoform statistical model.

Module Contents

class splisosm.hyptest_np.SplisosmNP

Non-parametric spatial isoform statistical model.

Examples

Setup data:

>>> from splisosm import SplisosmNP
>>> import torch
>>> # Simulate counts for 10 genes with different numbers of isoforms
>>> data_3_iso = [torch.randint(low=0, high=5, size=(100, 3)) for _ in range(5)]  # 5 genes with 3 isoforms
>>> data_4_iso = [torch.randint(low=0, high=5, size=(100, 4)) for _ in range(5)]  # 5 genes with 4 isoforms
>>> data = data_3_iso + data_4_iso
>>> coordinates = torch.rand(100, 2)  # 100 spots with 2D coordinates
>>> design_mtx = torch.rand(100, 2)  # 100 spots with 2 covariates

Spatial variability test:

>>> model = SplisosmNP()
>>> model.setup_data(data, coordinates)
>>> model.test_spatial_variability(method='hsic-ir')
>>> sv_results = model.get_formatted_test_results('sv')
>>> print(sv_results.head())

Differential usage test:

>>> model = SplisosmNP()
>>> model.setup_data(data, coordinates, design_mtx=design_mtx)
>>> model.test_differential_usage(method='hsic')
>>> du_results = model.get_formatted_test_results('du')
>>> print(du_results.head())

get_formatted_test_results(test_type)

Return the formatted test results as a DataFrame.

Parameters: test_type (Literal['sv', 'du']) – Type of test results to retrieve: 'sv' (spatial variability) or 'du' (differential usage).
Returns: Formatted test results.
Return type: DataFrame

setup_data(data=None, coordinates=None, approx_rank=None, design_mtx=None, gene_names=None, covariate_names=None, *, adata=None, spatial_key='spatial', layer='counts', group_iso_by='gene_symbol', min_counts=10, min_bin_pct=0.0, filter_single_iso_genes=True)

Set up isoform-level spatial data for hypothesis testing.

For backward compatibility, this method supports two input modes:

Legacy mode: pass data and coordinates directly.
AnnData mode: pass adata, where counts are extracted from adata.layers[layer] grouped by group_iso_by, and coordinates are read from adata.obsm[spatial_key]. See splisosm.utils.prepare_inputs_from_anndata() for details.
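
As a rough illustration of how the two modes relate, the sketch below turns a flat spot-by-isoform count matrix into the list-of-arrays layout that legacy mode expects. The helper `group_isoforms_by_gene` is hypothetical and is not part of splisosm; the actual conversion is done by splisosm.utils.prepare_inputs_from_anndata(), whose internals may differ. The filtering follows the `min_counts` and `filter_single_iso_genes` descriptions given in the parameter list.

```python
import numpy as np

def group_isoforms_by_gene(counts, gene_labels, min_counts=10,
                           filter_single_iso_genes=True):
    """Split an (n_spots, n_isoforms_total) matrix into per-gene arrays.

    Hypothetical helper for illustration only; not a splisosm function.
    """
    counts = np.asarray(counts)
    gene_labels = np.asarray(gene_labels)
    # Keep isoforms whose total count across spots reaches min_counts.
    keep = counts.sum(axis=0) >= min_counts
    data, gene_names = [], []
    # dict.fromkeys keeps genes in first-seen order.
    for gene in dict.fromkeys(gene_labels[keep].tolist()):
        cols = keep & (gene_labels == gene)
        if filter_single_iso_genes and cols.sum() < 2:
            continue  # fewer than two retained isoforms
        data.append(counts[:, cols])
        gene_names.append(gene)
    return data, gene_names

rng = np.random.default_rng(0)
counts = rng.integers(0, 5, size=(100, 7))   # 100 spots, 7 isoforms in total
counts[:, 6] = 0                             # an isoform with no reads
gene_labels = ["geneA"] * 3 + ["geneB"] + ["geneC"] * 3

data, gene_names = group_isoforms_by_gene(counts, gene_labels)
print(gene_names, [d.shape for d in data])
```

Here geneB is dropped because it has a single isoform, and geneC keeps two of its three isoforms. The resulting `data` list has the shape convention described for the legacy-mode `data` argument.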

Parameters:

data (Optional[list[Union[Tensor, ndarray]]]) – Legacy mode only. List of tensors/arrays with shape (n_spots, n_isos) containing isoform counts for each gene.
coordinates (Optional[Union[Tensor, ndarray, DataFrame]]) – Legacy mode only. Shape (n_spots, 2), spatial coordinates.
approx_rank (Optional[int]) – The rank of the low-rank approximation for the spatial covariance matrix. If None, use the full-rank dense covariance matrix. For larger datasets (n_spots > 5,000), the maximum rank is set to 4 * sqrt(n_spots).
design_mtx (Optional[Union[Tensor, ndarray, DataFrame, str, list[str]]]) –
Design matrix for differential usage tests.
- Legacy mode: tensor/array/dataframe of shape (n_spots, n_factors).
- AnnData mode: tensor/array/dataframe, or one obs-column name (str), or a list of obs-column names.
When design_mtx contains categorical obs columns in AnnData mode, they are automatically one-hot encoded. Covariate names are inferred when not explicitly provided (see covariate_names below).
gene_names (Optional[Union[list[str], str]]) –
Gene names.
- Legacy mode: list of gene names.
- AnnData mode: optional column name in adata.var used as display names for grouped genes; if None, use grouped gene IDs.
covariate_names (Optional[list[str]]) –
List of covariate names. If not provided, names are inferred as follows:
- In AnnData mode with column name(s): column names are used, with categorical columns expanded to one-hot encoded names (e.g., col_cat0, col_cat1 for col if it has categorical values).
- In legacy mode with DataFrame: DataFrame column names are used.
- Otherwise: default names like factor_1, factor_2, etc. are generated.
When explicitly provided, must match the number of factors in the design matrix (after any categorical encoding/one-hot expansion).
adata (Optional[AnnData]) – AnnData mode only. AnnData object from which counts and coordinates are read.
spatial_key (str) – Key in adata.obsm for spatial coordinates.
layer (str) – Counts layer in adata.layers.
group_iso_by (str) – Column in adata.var used to group isoforms by gene.
min_counts (int) – Minimum total isoform count across spots required to retain an isoform in AnnData mode.
min_bin_pct (float) – Minimum percentage/fraction of spots where an isoform is expressed in AnnData mode. Values in [0, 1] are treated as fractions; values in (1, 100] are treated as percentages.
filter_single_iso_genes (bool) – AnnData mode only. Whether to remove genes with fewer than two retained isoforms.

Raises:

ValueError – If input arguments are invalid or required fields are missing.

Return type:

None
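
The dual interpretation of `min_bin_pct` can be sketched as follows. Both helpers are hypothetical and only mirror the parameter description above; in particular, treating "expressed" as a count greater than zero is an assumption, not something this page confirms.

```python
import numpy as np

def min_bin_fraction(min_bin_pct):
    """Convert min_bin_pct to a fraction of spots (hypothetical helper)."""
    if 0.0 <= min_bin_pct <= 1.0:
        return float(min_bin_pct)        # already a fraction
    if 1.0 < min_bin_pct <= 100.0:
        return min_bin_pct / 100.0       # given as a percentage
    raise ValueError("min_bin_pct must lie in [0, 100]")

def passes_bin_filter(iso_counts, min_bin_pct):
    """Flag isoforms detected (count > 0, an assumption) in enough spots."""
    detected_fraction = (np.asarray(iso_counts) > 0).mean(axis=0)
    return detected_fraction >= min_bin_fraction(min_bin_pct)

counts = np.array([[1, 0], [2, 0], [0, 0], [3, 5]])  # 4 spots, 2 isoforms
print(passes_bin_filter(counts, 0.5))   # fraction form
print(passes_bin_filter(counts, 50))    # percentage form, same threshold
```

Note that under the documented rule a value of exactly 1 is read as the fraction 1.0 (all spots), not as 1 percent.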

test_differential_usage(method='hsic-gp', ratio_transformation='none', nan_filling='mean', gpr_backend='sklearn', gpr_configs=None, residualize='cov_only', print_progress=True, return_results=False)

Test for spatial isoform differential usage.

Before running this function, the design matrix must be set up using setup_data(). Each column of the design matrix corresponds to a covariate to test for differential association with the isoform usage ratios of each gene. Test statistics and p-values are computed per (gene, covariate) pair separately.

Two types of association tests are supported:

Unconditional ("hsic", "t-fisher", "t-tippett"): test the unconditional association between isoform usage ratios and the covariate.
Conditional ("hsic-gp"): test the association conditioned on spatial coordinates via Gaussian process regression. See [ZPJScholkopf12] for more details.
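
To make the idea of conditioning on spatial coordinates concrete, here is a minimal numpy sketch of spatial residualization: a covariate is regressed on the coordinates with a Gaussian process posterior mean, and the residual is what an association test would then see. This stands in for the GPR backends and is not the library's code; the RBF kernel, the length scale, and the noise level are illustrative assumptions.

```python
import numpy as np

def rbf_kernel(coords, length_scale=1.0):
    """Squared-exponential kernel matrix between all pairs of spots."""
    sq_dist = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq_dist / length_scale**2)

def spatial_residual(values, coords, length_scale=0.2, noise=1e-2):
    """Values minus the GP posterior mean under a fixed RBF kernel."""
    K = rbf_kernel(coords, length_scale)
    centered = values - values.mean()
    fitted = K @ np.linalg.solve(K + noise * np.eye(len(coords)), centered)
    return centered - fitted

rng = np.random.default_rng(0)
coords = rng.random((200, 2))
# A covariate that is mostly a smooth function of the x coordinate.
covariate = np.sin(2 * np.pi * coords[:, 0]) + 0.05 * rng.standard_normal(200)
residual = spatial_residual(covariate, coords)
print(covariate.var(), residual.var())
```

Most of the covariate's variance is explained by location, so little remains after residualization. An unconditional test would use `covariate` directly, whereas a conditional test in the spirit of "hsic-gp" would use `residual`.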

Parameters:

method (str, optional) –
Method for association testing:
- "hsic": Unconditional HSIC test (multivariate RV coefficient). For continuous factors, equivalent to the multivariate Pearson correlation test. For binary factors, equivalent to the two-sample Hotelling T² test.
- "hsic-gp": Conditional HSIC test. Spatial effects are removed via Gaussian process regression before computing the HSIC statistic.
Or one of the t-tests (binary factors only):
- "t-fisher", "t-tippett": each isoform is tested independently and p-values are combined gene-wise via Fisher’s or Tippett’s method.
ratio_transformation (str, optional) – Compositional transformation for isoform ratios. One of 'none', 'clr', 'ilr', 'alr', 'radial' [PYPA22]. See splisosm.utils.counts_to_ratios().
nan_filling (str, optional) – How to fill NaN values in isoform ratios. One of 'mean' or 'none'. See splisosm.utils.counts_to_ratios().
gpr_backend (str, optional) – GPR backend to use for method='hsic-gp'. One of 'sklearn' (default) or 'gpytorch'. For FFT-accelerated spatial GP on regular grids use SplisosmFFT instead.
gpr_configs (dict, optional) –
Nested configuration dict for the GPR objects, with optional keys 'covariate' and/or 'isoform'. Each sub-dict is forwarded to splisosm.kernel_gpr.make_kernel_gpr(). Unspecified keys use the defaults from splisosm.kernel_gpr._DEFAULT_GPR_CONFIGS:
```
{
    "covariate": {
        "constant_value": 1.0,
        "constant_value_bounds": (1e-3, 1e3),
        "length_scale": 1.0,
        "length_scale_bounds": "fixed",
        "n_inducing": 5000,
    },
    "isoform": {
        "constant_value": 1.0,
        "constant_value_bounds": (1e-3, 1e3),
        "length_scale": 1.0,
        "length_scale_bounds": "fixed",
        "n_inducing": 5000,
    },
}
```
"n_inducing" (int) is supported by both backends with the same semantics:
- sklearn — full exact GP when n_obs ≤ n_inducing; a randomly sub-sampled subset of n_inducing points is used as the inducing set otherwise (default: 5000).
- gpytorch — FITC sparse GP approximation with n_inducing points; set to None to use exact GP (default: 5000).
residualize ({"cov_only", "both"}, optional) –
Controls which signals are spatially residualized when method="hsic-gp":
- "cov_only" (default): residualize covariates only; test HSIC(Z_res, Y_raw). Fastest; calibration matches "both" when covariate GPR captures most spatial confounding.
- "both": residualize both covariates and isoform ratios.
print_progress (bool, optional) – Whether to show the progress bar. Defaults to True.
return_results (bool, optional) – Whether to return the test statistics and p-values. If False, the results are stored in self.du_test_results.
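
Since unspecified `gpr_configs` keys fall back to the defaults, a partial dict is enough to override a single setting. The sketch below shows one plausible way such a nested override could be resolved. `merge_gpr_configs` is a hypothetical helper, and the defaults are copied from the listing above rather than imported from splisosm.

```python
import copy

# Defaults as listed in the gpr_configs description above.
DEFAULT_GPR_CONFIGS = {
    "covariate": {
        "constant_value": 1.0,
        "constant_value_bounds": (1e-3, 1e3),
        "length_scale": 1.0,
        "length_scale_bounds": "fixed",
        "n_inducing": 5000,
    },
    "isoform": {
        "constant_value": 1.0,
        "constant_value_bounds": (1e-3, 1e3),
        "length_scale": 1.0,
        "length_scale_bounds": "fixed",
        "n_inducing": 5000,
    },
}

def merge_gpr_configs(user_configs=None):
    """Overlay user sub-dicts on a copy of the defaults (hypothetical helper)."""
    merged = copy.deepcopy(DEFAULT_GPR_CONFIGS)
    for group, overrides in (user_configs or {}).items():
        if group not in merged:
            raise KeyError(f"unknown gpr_configs key: {group!r}")
        merged[group].update(overrides)
    return merged

# Use fewer inducing points for the covariate GPR only.
gpr_configs = {"covariate": {"n_inducing": 1000}}
effective = merge_gpr_configs(gpr_configs)
print(effective["covariate"]["n_inducing"], effective["isoform"]["n_inducing"])
```

A dict like `gpr_configs` above would be passed as `test_differential_usage(method="hsic-gp", gpr_configs=gpr_configs)`.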

Returns:

results – If return_results is True, returns dict with test statistics and p-values. Otherwise, returns None and stores results in self.du_test_results.

Return type:

dict or None
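
For the "t-fisher" and "t-tippett" methods, the per-isoform p-values are combined into one gene-level p-value. The sketch below shows the textbook forms of both combination rules using only the standard library. It assumes the per-isoform p-values are independent and strictly positive, and it is not taken from the splisosm source.

```python
import math

def fisher_combine(pvalues):
    """Fisher's method: -2 * sum(log p) is chi-square with 2k d.o.f. under the null."""
    k = len(pvalues)
    half_stat = -sum(math.log(p) for p in pvalues)  # test statistic divided by 2
    # Closed-form chi-square survival function for even degrees of freedom 2k.
    return math.exp(-half_stat) * sum(
        half_stat**i / math.factorial(i) for i in range(k)
    )

def tippett_combine(pvalues):
    """Tippett's method: based on the smallest of k p-values."""
    k = len(pvalues)
    return 1.0 - (1.0 - min(pvalues)) ** k

iso_pvalues = [0.01, 0.20, 0.50]   # made-up per-isoform p-values for one gene
print(fisher_combine(iso_pvalues), tippett_combine(iso_pvalues))
```

With a single isoform both rules return the input p-value unchanged, which is a quick sanity check on either formula.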

test_spatial_variability(method='hsic-ir', ratio_transformation='none', nan_filling='mean', use_perm_null=False, n_perms_per_gene=None, return_results=False, print_progress=True)

Test for spatial variability.

Kernel-based multivariate hypothesis testing for spatial variability in one of:

gene-level total counts ("hsic-gc" or "spark-x" [ZSZ21])
isoform usage ratios ("hsic-ir")
isoform counts ("hsic-ic")

Test statistics and p-values are computed separately for each gene.

Parameters:

method (Literal['hsic-ir', 'hsic-ic', 'hsic-gc', 'spark-x']) – Must be one of "hsic-ir", "hsic-ic", "hsic-gc", or "spark-x".
ratio_transformation (Literal['none', 'clr', 'ilr', 'alr', 'radial']) – If using isoform ratios, the compositional transformation to apply. Can be one of 'none', 'clr', 'ilr', 'alr', or 'radial' [PYPA22]. See splisosm.utils.counts_to_ratios() for more details.
nan_filling (Literal['mean', 'none']) – How to fill the NaN values in the isoform ratios. Can be one of 'mean' or 'none'. See splisosm.utils.counts_to_ratios() for more details.
use_perm_null (bool) – Whether to build the null distribution by permutation. If False, use the asymptotic null distribution, a mixture of chi-squares, approximated with Liu’s method [LTZ09].
n_perms_per_gene (Optional[int]) – Number of permutations per gene for permutation test.
return_results (bool) – Whether to return the test statistics and p-values. Defaults to False, in which case the results are stored in self.sv_test_results.
print_progress (bool) – Whether to show the progress bar. Defaults to True.
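
The interaction between isoform ratios and `nan_filling` can be illustrated with a small numpy sketch. It assumes that a spot with zero total counts yields NaN ratios and that 'mean' replaces them with the per-isoform mean over the remaining spots. The real behaviour is defined by splisosm.utils.counts_to_ratios(), and `counts_to_ratios_sketch` is only a hypothetical stand-in.

```python
import numpy as np

def counts_to_ratios_sketch(counts, nan_filling="mean"):
    """Per-spot isoform ratios with optional NaN filling (illustrative only)."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    with np.errstate(invalid="ignore", divide="ignore"):
        ratios = counts / totals          # spots with zero total become NaN
    if nan_filling == "mean":
        col_means = np.nanmean(ratios, axis=0)
        ratios = np.where(np.isnan(ratios), col_means, ratios)
    elif nan_filling != "none":
        raise ValueError("nan_filling must be 'mean' or 'none'")
    return ratios

counts = np.array([[2, 2], [0, 0], [1, 3]])   # the second spot has no reads
raw = counts_to_ratios_sketch(counts, nan_filling="none")
filled = counts_to_ratios_sketch(counts, nan_filling="mean")
print(raw)
print(filled)
```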

Returns:

If return_results is True, returns dict with test statistics and p-values. Otherwise, returns None and stores results in self.sv_test_results.

Return type:

dict or None

Notes

To run the SPARK-X test, the R package SPARK must be installed and accessible from Python via rpy2.
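
As a conceptual companion to the HSIC-based methods and the `use_perm_null` option, the following sketch computes a generic HSIC statistic between a spatial kernel and a kernel on isoform ratios, and derives a permutation p-value. The kernel choices (RBF on coordinates, linear on ratios), the length scale, and the number of permutations are assumptions made for illustration. The asymptotic chi-square-mixture approximation used when `use_perm_null=False` is not shown.

```python
import numpy as np

def center(K):
    """Double-center a kernel matrix: H K H with H = I - 11^T / n."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def permutation_pvalue(K_x, K_y, n_perms=199, seed=0):
    """Permutation p-value for the HSIC statistic tr(H K_x H K_y) / (n - 1)^2."""
    rng = np.random.default_rng(seed)
    n = K_x.shape[0]
    Kc = center(K_x)
    observed = np.sum(Kc * K_y) / (n - 1) ** 2
    exceed = 0
    for _ in range(n_perms):
        perm = rng.permutation(n)
        # Permuting spots in K_y breaks any link with the coordinates.
        stat = np.sum(Kc * K_y[np.ix_(perm, perm)]) / (n - 1) ** 2
        exceed += int(stat >= observed)
    return (1 + exceed) / (1 + n_perms)

rng = np.random.default_rng(1)
coords = rng.random((100, 2))
sq_dist = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1)
K_space = np.exp(-0.5 * sq_dist / 0.2**2)          # RBF spatial kernel

usage = 0.2 + 0.6 * coords[:, 0]                   # isoform 1 usage rises along x
ratios = np.column_stack([usage, 1.0 - usage])
p_spatial = permutation_pvalue(K_space, ratios @ ratios.T)
print(p_spatial)
```

With 199 permutations the smallest attainable p-value is 1/200, which is why permutation nulls need many permutations per gene to resolve small p-values.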

adata = None

du_test_results: dict

Dictionary to store the differential usage test results after running test_differential_usage(). It contains the following keys:

'method': str, the method used for the test.
'statistic': numpy.ndarray of shape (n_genes, n_factors), the test statistic for each gene and covariate.
'pvalue': numpy.ndarray of shape (n_genes, n_factors), the p-value for each gene and covariate.
'pvalue_adj': numpy.ndarray of shape (n_genes, n_factors), the BH adjusted p-value for each gene and covariate. Each column/covariate is adjusted separately.
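
The column-wise Benjamini-Hochberg (BH) adjustment described for 'pvalue_adj' can be sketched in numpy as follows. This is the standard BH step-up procedure applied to each covariate column independently. `bh_adjust` is an illustrative helper, not the function used inside splisosm, and the p-values are made up.

```python
import numpy as np

def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values for a 1-D array."""
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)
    scaled = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, working down from the largest p-value.
    adjusted_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.clip(adjusted_sorted, 0.0, 1.0)
    return adjusted

# Made-up p-values with shape (n_genes, n_factors) = (4, 2).
pvalue = np.array([[0.01, 0.50],
                   [0.04, 0.20],
                   [0.03, 0.90],
                   [0.005, 0.04]])
# Adjust each covariate column separately.
pvalue_adj = np.column_stack(
    [bh_adjust(pvalue[:, j]) for j in range(pvalue.shape[1])]
)
print(pvalue_adj)
```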

n_factors: int

Number of covariates to test for differential usage.

n_genes: int

Number of genes.

n_isos: list[int]

Number of isoforms for each gene.

n_spots: int

Number of spots.

setup_input_mode = None

sv_test_results: dict

Dictionary to store the spatial variability test results after running test_spatial_variability(). It contains the following keys:

'method': str, the method used for the test.
'statistic': numpy.ndarray of shape (n_genes,), the test statistic for each gene.
'pvalue': numpy.ndarray of shape (n_genes,), the p-value for each gene.
'pvalue_adj': numpy.ndarray of shape (n_genes,), the BH adjusted p-value for each gene.