splisosm.hyptest_np#

Non-parametric hypothesis tests for spatial isoform usage.

Classes#

SplisosmNP

Non-parametric spatial isoform statistical model.

Module Contents#

class splisosm.hyptest_np.SplisosmNP#

Non-parametric spatial isoform statistical model.

Examples

Setup data:

>>> from splisosm import SplisosmNP
>>> import torch
>>> # Simulate data for 10 genes with different number of isoforms
>>> data_3_iso = [torch.randint(low=0, high=5, size=(100, 3)) for _ in range(5)]  # 5 genes with 3 isoforms
>>> data_4_iso = [torch.randint(low=0, high=5, size=(100, 4)) for _ in range(5)]  # 5 genes with 4 isoforms
>>> data = data_3_iso + data_4_iso
>>> coordinates = torch.rand(100, 2)  # 100 spots with 2D coordinates
>>> design_mtx = torch.rand(100, 2)  # 100 spots with 2 covariates

Spatial variability test:

>>> model = SplisosmNP()
>>> model.setup_data(data, coordinates)
>>> model.test_spatial_variability(method = 'hsic-ir')
>>> sv_results = model.get_formatted_test_results('sv')
>>> print(sv_results.head())

Differential usage test:

>>> model = SplisosmNP()
>>> model.setup_data(data, coordinates, design_mtx=design_mtx)
>>> model.test_differential_usage(method = 'hsic')
>>> du_results = model.get_formatted_test_results('du')
>>> print(du_results.head())

get_formatted_test_results(test_type)#

Get the formatted test results as a data frame.

Parameters:

test_type (Literal['sv', 'du']) – Type of test results to retrieve. Can be one of 'sv' (spatial variability) or 'du' (differential usage).

Returns:

Formatted test results.

Return type:

DataFrame

setup_data(data=None, coordinates=None, approx_rank=None, design_mtx=None, gene_names=None, covariate_names=None, *, adata=None, spatial_key='spatial', layer='counts', group_iso_by='gene_symbol', min_counts=10, min_bin_pct=0.0, filter_single_iso_genes=True)#

Set up isoform-level spatial data for hypothesis testing.

This method supports two input modes for backward compatibility.

  • Legacy mode: pass data and coordinates directly.

  • AnnData mode: pass adata, where counts are extracted from adata.layers[layer] grouped by group_iso_by, and coordinates are read from adata.obsm[spatial_key]. See splisosm.utils.prepare_inputs_from_anndata() for details.

Parameters:
  • data (Optional[list[Union[Tensor, ndarray]]]) – Legacy mode only. List of tensors/arrays with shape (n_spots, n_isos) containing isoform counts for each gene.

  • coordinates (Optional[Union[Tensor, ndarray, DataFrame]]) – Legacy mode only. Shape (n_spots, 2), spatial coordinates.

  • approx_rank (Optional[int]) – The rank of the low-rank approximation for the spatial covariance matrix. If None, use the full-rank dense covariance matrix. For larger datasets (n_spots > 5,000), the maximum rank is set to 4 * sqrt(n_spots).

  • design_mtx (Optional[Union[Tensor, ndarray, DataFrame, str, list[str]]]) –

    Design matrix for differential usage tests.

    • Legacy mode: tensor/array/dataframe of shape (n_spots, n_factors).

    • AnnData mode: tensor/array/dataframe, or one obs-column name (str), or a list of obs-column names.

    When design_mtx contains categorical obs columns in AnnData mode, they are automatically one-hot encoded. Covariate names are inferred when not explicitly provided (see covariate_names below).

  • gene_names (Optional[Union[list[str], str]]) –

    Gene names.

    • Legacy mode: list of gene names.

    • AnnData mode: optional column name in adata.var used as display names for grouped genes; if None, use grouped gene IDs.

  • covariate_names (Optional[list[str]]) –

    List of covariate names. If not provided, names are inferred as follows:

    • In AnnData mode with column name(s): column names are used, with categorical columns expanded to one-hot encoded names (e.g., col_cat0, col_cat1 for col if it has categorical values).

    • In legacy mode with DataFrame: DataFrame column names are used.

    • Otherwise: default names like factor_1, factor_2, etc. are generated.

    When explicitly provided, must match the number of factors in the design matrix (after any categorical encoding/one-hot expansion).

  • adata (Optional[AnnData]) – AnnData mode only. AnnData object from which isoform counts and spatial coordinates are read.

  • spatial_key (str) – Key in adata.obsm for spatial coordinates.

  • layer (str) – Counts layer in adata.layers.

  • group_iso_by (str) – Column in adata.var used to group isoforms by gene.

  • min_counts (int) – Minimum total isoform count across spots required to retain an isoform in AnnData mode.

  • min_bin_pct (float) – Minimum percentage/fraction of spots where an isoform is expressed in AnnData mode. Values in [0, 1] are treated as fractions; values in (1, 100] are treated as percentages.

  • filter_single_iso_genes (bool) – AnnData mode only. Whether to remove genes with fewer than two retained isoforms.
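
The naming rules described for covariate_names can be illustrated with a small standalone sketch. It is not taken from the SplisosmNP source: the helpers default_covariate_names and one_hot_encode are hypothetical, and the sorted category order is an assumption made only for this example.

```python
def default_covariate_names(n_factors, provided=None):
    """Return covariate names, falling back to factor_1, factor_2, ..."""
    if provided is not None:
        # Explicit names must match the number of design-matrix columns.
        if len(provided) != n_factors:
            raise ValueError(
                f"Expected {n_factors} covariate names, got {len(provided)}."
            )
        return list(provided)
    return [f"factor_{i}" for i in range(1, n_factors + 1)]


def one_hot_encode(col_name, values):
    """One-hot encode one categorical column; return (names, rows)."""
    # Assumption: categories are ordered by sorting their labels.
    categories = sorted(set(values))
    names = [f"{col_name}_{cat}" for cat in categories]
    rows = [[1.0 if v == cat else 0.0 for cat in categories] for v in values]
    return names, rows


fallback_names = default_covariate_names(3)
encoded_names, encoded_rows = one_hot_encode("region", ["tumor", "stroma", "tumor"])
print(fallback_names)
print(encoded_names, encoded_rows)
```

With a categorical column expanded this way, an explicit covariate_names list would need one entry per indicator column, not one per original column.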

Raises:

ValueError – If input arguments are invalid or required fields are missing.

Return type:

None
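
The dual interpretation of min_bin_pct (fraction versus percentage) and its combination with min_counts can be sketched as below. This is an illustration under stated assumptions, not the library's filtering code: min_bin_fraction and keep_isoform are hypothetical helpers, and "expressed" is assumed to mean a count greater than zero.

```python
def min_bin_fraction(min_bin_pct):
    """Convert min_bin_pct to a fraction of spots in [0, 1]."""
    if 0.0 <= min_bin_pct <= 1.0:
        return float(min_bin_pct)      # already a fraction
    if 1.0 < min_bin_pct <= 100.0:
        return min_bin_pct / 100.0     # given as a percentage
    raise ValueError("min_bin_pct must lie in [0, 100].")


def keep_isoform(counts_per_spot, min_counts=10, min_bin_pct=0.0):
    """Decide whether one isoform passes both filters."""
    total = sum(counts_per_spot)
    # Assumption: a spot expresses the isoform if its count is positive.
    expressed = sum(1 for c in counts_per_spot if c > 0)
    fraction = expressed / len(counts_per_spot)
    return total >= min_counts and fraction >= min_bin_fraction(min_bin_pct)


counts = [0, 0, 12, 0]  # one isoform across four spots
print(keep_isoform(counts, min_counts=10, min_bin_pct=25))   # 25 percent
print(keep_isoform(counts, min_counts=10, min_bin_pct=0.5))  # fraction 0.5
```

Note that a value of exactly 1 falls in the fraction range, so it means "all spots" rather than "1 percent".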

test_differential_usage(method='hsic-gp', ratio_transformation='none', nan_filling='mean', gpr_backend='sklearn', gpr_configs=None, residualize='cov_only', print_progress=True, return_results=False)#

Test for spatial isoform differential usage.

Before running this function, the design matrix must be set up using setup_data(). Each column of the design matrix corresponds to a covariate to test for differential association with the isoform usage ratios of each gene. Test statistics and p-values are computed per (gene, covariate) pair separately.

Two types of association tests are supported:

  • Unconditional ("hsic", "t-fisher", "t-tippett"): test the unconditional association between isoform usage ratios and the covariate.

  • Conditional ("hsic-gp"): test the association conditioned on spatial coordinates via Gaussian process regression. See [ZPJScholkopf12] for more details.

Parameters:
  • method (str, optional) –

    Method for association testing:

    • "hsic": Unconditional HSIC test (multivariate RV coefficient). For continuous factors, equivalent to the multivariate Pearson correlation test. For binary factors, equivalent to the two-sample Hotelling T² test.

    • "hsic-gp": Conditional HSIC test. Spatial effects are removed via Gaussian process regression before computing the HSIC statistic.

    Or one of the t-tests (binary factors only):

    • "t-fisher", "t-tippett": each isoform is tested independently and p-values are combined gene-wise via Fisher’s or Tippett’s method.

  • ratio_transformation (str, optional) – Compositional transformation for isoform ratios. One of 'none', 'clr', 'ilr', 'alr', 'radial' [PYPA22]. See splisosm.utils.counts_to_ratios().

  • nan_filling (str, optional) – How to fill NaN values in isoform ratios. One of 'mean' or 'none'. See splisosm.utils.counts_to_ratios().

  • gpr_backend (str, optional) – GPR backend to use for method='hsic-gp'. One of 'sklearn' (default) or 'gpytorch'. For FFT-accelerated spatial GP on regular grids use SplisosmFFT instead.

  • gpr_configs (dict, optional) –

    Nested configuration dict for the GPR objects, with optional keys 'covariate' and/or 'isoform'. Each sub-dict is forwarded to splisosm.kernel_gpr.make_kernel_gpr(). Unspecified keys use the defaults from splisosm.kernel_gpr._DEFAULT_GPR_CONFIGS:

    {
        "covariate": {
            "constant_value": 1.0,
            "constant_value_bounds": (1e-3, 1e3),
            "length_scale": 1.0,
            "length_scale_bounds": "fixed",
            "n_inducing": 5000,
        },
        "isoform": {
            "constant_value": 1.0,
            "constant_value_bounds": (1e-3, 1e3),
            "length_scale": 1.0,
            "length_scale_bounds": "fixed",
            "n_inducing": 5000,
        },
    }
    

    "n_inducing" (int) is supported by both backends with the same semantics:

    • sklearn: full exact GP when n_obs <= n_inducing; otherwise a randomly sub-sampled subset of n_inducing points is used as the inducing set (default: 5000).

    • gpytorch: FITC sparse GP approximation with n_inducing points; set to None to use exact GP (default: 5000).

  • residualize ({"cov_only", "both"}, optional) –

    Controls which signals are spatially residualized when method="hsic-gp":

    • "cov_only" (default): residualize covariates only; test HSIC(Z_res, Y_raw). Fastest; calibration matches "both" when covariate GPR captures most spatial confounding.

    • "both": residualize both covariates and isoform ratios.

  • print_progress (bool, optional) – Whether to show the progress bar. Defaults to True.

  • return_results (bool, optional) – Whether to return the test statistics and p-values. If False, the results are stored in self.du_test_results.
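
For the "t-fisher" and "t-tippett" methods, the gene-wise combination of per-isoform p-values can be sketched with the textbook formulas. This is a generic illustration that assumes independent, strictly positive p-values; fisher_combine and tippett_combine are hypothetical names and not SplisosmNP functions.

```python
import math


def chi2_sf_even_dof(x, dof):
    """Survival function of a chi-square variable with even dof (closed form)."""
    k = dof // 2
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half / i      # (x/2)^i / i!
        total += term
    return math.exp(-half) * total


def fisher_combine(pvalues):
    """Fisher: -2 * sum(log p) follows chi-square with 2k dof under the null."""
    stat = -2.0 * sum(math.log(p) for p in pvalues)
    return chi2_sf_even_dof(stat, 2 * len(pvalues))


def tippett_combine(pvalues):
    """Tippett: based on the minimum of k independent p-values."""
    return 1.0 - (1.0 - min(pvalues)) ** len(pvalues)


isoform_pvalues = [0.05, 0.05]
print(fisher_combine(isoform_pvalues))
print(tippett_combine(isoform_pvalues))
```

Fisher's method rewards consistent moderate evidence across isoforms, while Tippett's method responds to a single very small p-value.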

Returns:

results – If return_results is True, returns dict with test statistics and p-values. Otherwise, returns None and stores results in self.du_test_results.

Return type:

dict or None
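
Because gpr_configs is nested and unspecified keys fall back to defaults, a partial override only needs the keys that change. The sketch below copies the default values listed above and shows one plausible way such a merge could work; merge_gpr_configs is a hypothetical helper, and the library's own merging logic may differ.

```python
# Defaults copied from the documentation of gpr_configs above.
DEFAULT_GPR_CONFIGS = {
    "covariate": {
        "constant_value": 1.0,
        "constant_value_bounds": (1e-3, 1e3),
        "length_scale": 1.0,
        "length_scale_bounds": "fixed",
        "n_inducing": 5000,
    },
    "isoform": {
        "constant_value": 1.0,
        "constant_value_bounds": (1e-3, 1e3),
        "length_scale": 1.0,
        "length_scale_bounds": "fixed",
        "n_inducing": 5000,
    },
}


def merge_gpr_configs(user_configs=None):
    """Overlay user sub-dicts on the defaults without mutating either."""
    user_configs = user_configs or {}
    unknown = set(user_configs) - set(DEFAULT_GPR_CONFIGS)
    if unknown:
        raise ValueError(f"Unknown gpr_configs keys: {sorted(unknown)}")
    return {
        name: {**defaults, **user_configs.get(name, {})}
        for name, defaults in DEFAULT_GPR_CONFIGS.items()
    }


# Only the isoform GPR gets a smaller inducing set; everything else is default.
partial = {"isoform": {"n_inducing": 1000}}
merged = merge_gpr_configs(partial)
print(merged["isoform"]["n_inducing"], merged["covariate"]["n_inducing"])
```

A partial dict like the one above would be passed as gpr_configs=partial when calling test_differential_usage(method="hsic-gp").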

test_spatial_variability(method='hsic-ir', ratio_transformation='none', nan_filling='mean', use_perm_null=False, n_perms_per_gene=None, return_results=False, print_progress=True)#

Test for spatial variability.

Kernel-based multivariate hypothesis testing for spatial variability in

  • gene-level total counts ("hsic-gc" or "spark-x" [ZSZ21])

  • isoform usage ratios ("hsic-ir")

  • isoform counts ("hsic-ic")

Test statistics and p-values are computed for each gene separately.
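
To make the kernel-based statistic concrete, the following sketch computes a common biased HSIC estimator, the sum of elementwise products of two double-centered kernel matrices divided by (n - 1)^2, between spatial coordinates and isoform ratios. The choice of a Gaussian kernel on coordinates and a linear kernel on ratios is an assumption made for illustration; the kernels and scaling used by SplisosmNP are not specified in this section.

```python
import math


def gaussian_kernel(points, length_scale=1.0):
    """Gaussian (RBF) kernel matrix on spatial coordinates."""
    def sq_dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))
    return [[math.exp(-sq_dist(p, q) / (2 * length_scale ** 2)) for q in points]
            for p in points]


def linear_kernel(rows):
    """Linear kernel matrix on per-spot feature vectors."""
    return [[sum(a * b for a, b in zip(r, s)) for s in rows] for r in rows]


def center(K):
    """Double-center a kernel matrix, i.e. compute H K H."""
    n = len(K)
    row_means = [sum(row) / n for row in K]
    col_means = [sum(K[i][j] for i in range(n)) / n for j in range(n)]
    grand = sum(row_means) / n
    return [[K[i][j] - row_means[i] - col_means[j] + grand for j in range(n)]
            for i in range(n)]


def hsic(K, L):
    """Biased HSIC estimate: trace(HKH @ HLH) / (n - 1)^2 for symmetric K, L."""
    n = len(K)
    Kc, Lc = center(K), center(L)
    return sum(Kc[i][j] * Lc[i][j] for i in range(n) for j in range(n)) / (n - 1) ** 2


coords = [(float(i), 0.0) for i in range(8)]        # spots along a line
patterned = [[i / 7, 1 - i / 7] for i in range(8)]  # usage shifts along x
constant = [[0.5, 0.5] for _ in range(8)]           # no spatial pattern

K_spatial = gaussian_kernel(coords, length_scale=2.0)
hsic_patterned = hsic(K_spatial, linear_kernel(patterned))
hsic_constant = hsic(K_spatial, linear_kernel(constant))
print(hsic_patterned, hsic_constant)
```

A gene whose isoform usage changes smoothly across space yields a positive statistic, while spatially constant usage yields zero.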

Parameters:
  • method (Literal['hsic-ir', 'hsic-ic', 'hsic-gc', 'spark-x']) – Must be one of "hsic-ir", "hsic-ic", "hsic-gc", or "spark-x".

  • ratio_transformation (Literal['none', 'clr', 'ilr', 'alr', 'radial']) – If using isoform ratios, the compositional transformation to apply. Can be one of 'none', 'clr', 'ilr', 'alr', or 'radial' [PYPA22]. See splisosm.utils.counts_to_ratios() for more details.

  • nan_filling (Literal['mean', 'none']) – How to fill the NaN values in the isoform ratios. Can be one of 'mean' or 'none'. See splisosm.utils.counts_to_ratios() for more details.

  • use_perm_null (bool) – Whether to generate the null distribution from permutation. If False, use the asymptotic distribution of chi-square mixtures with Liu’s method [LTZ09].

  • n_perms_per_gene (Optional[int]) – Number of permutations per gene for permutation test.

  • return_results (bool) – Whether to return the test statistics and p-values. Defaults to False, in which case the results are stored in self.sv_test_results.

  • print_progress (bool) – Whether to show the progress bar. Defaults to True.
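
The effect of nan_filling on isoform ratios can be sketched as follows. This is not the implementation of splisosm.utils.counts_to_ratios(); it assumes that NaN ratios arise at spots with zero total counts for the gene, and that 'mean' replaces them with the per-isoform mean over the remaining spots. The function name counts_to_ratios_sketch is hypothetical.

```python
import math


def counts_to_ratios_sketch(counts, nan_filling="mean"):
    """Convert an n_spots x n_isos count table (list of lists) to ratios."""
    if nan_filling not in ("mean", "none"):
        raise ValueError("nan_filling must be 'mean' or 'none'.")
    n_isos = len(counts[0])
    ratios = []
    for row in counts:
        total = sum(row)
        if total > 0:
            ratios.append([c / total for c in row])
        else:
            # Ratios are undefined when the gene has no counts at this spot.
            ratios.append([math.nan] * n_isos)
    if nan_filling == "mean":
        observed = [r for r in ratios if not math.isnan(r[0])]
        if observed:
            means = [sum(r[j] for r in observed) / len(observed)
                     for j in range(n_isos)]
            ratios = [list(means) if math.isnan(r[0]) else r for r in ratios]
    return ratios


counts = [[3, 1], [0, 0], [1, 3]]  # the second spot has no coverage
filled = counts_to_ratios_sketch(counts, nan_filling="mean")
raw = counts_to_ratios_sketch(counts, nan_filling="none")
print(filled)
print(raw)
```

Mean filling keeps every spot in the test at the cost of pulling uncovered spots toward the average usage.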

Returns:

If return_results is True, returns dict with test statistics and p-values. Otherwise, returns None and stores results in self.sv_test_results.

Return type:

dict or None
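
When use_perm_null is True, the null distribution comes from permutations instead of the asymptotic approximation. The generic idea is sketched below with a toy statistic; the exact permutation scheme used by SplisosmNP is not described here, and both the add-one correction and the helper names are assumptions.

```python
import random


def abs_covariance(x, y):
    """Toy association statistic: absolute sample covariance."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return abs(sum((a - mx) * (b - my) for a, b in zip(x, y))) / len(x)


def permutation_pvalue(statistic_fn, x, y, n_perms=200, seed=0):
    """P-value from shuffling y across spots, with an add-one correction."""
    rng = random.Random(seed)
    observed = statistic_fn(x, y)
    y_perm = list(y)
    n_extreme = 0
    for _ in range(n_perms):
        rng.shuffle(y_perm)  # break any link between position and value
        if statistic_fn(x, y_perm) >= observed:
            n_extreme += 1
    return (n_extreme + 1) / (n_perms + 1)


positions = [float(i) for i in range(20)]   # 1D spot positions
usage = [0.1 * p for p in positions]        # usage increases along the axis
p_value = permutation_pvalue(abs_covariance, positions, usage, n_perms=200)
print(p_value)
```

The smallest attainable p-value is 1 / (n_perms + 1), so n_perms_per_gene bounds the resolution of the permutation test.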

Notes

To run the SPARK-X test, the R package SPARK must be installed and accessible from Python via rpy2.

adata = None#

du_test_results: dict#

Dictionary to store the differential usage test results after running test_differential_usage(). It contains the following keys:

  • 'method': str, the method used for the test.

  • 'statistic': numpy.ndarray of shape (n_genes, n_factors), the test statistic for each gene and covariate.

  • 'pvalue': numpy.ndarray of shape (n_genes, n_factors), the p-value for each gene and covariate.

  • 'pvalue_adj': numpy.ndarray of shape (n_genes, n_factors), the BH adjusted p-value for each gene and covariate. Each column/covariate is adjusted separately.
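
The per-covariate Benjamini-Hochberg (BH) adjustment mentioned for 'pvalue_adj' can be sketched in plain Python. This is the standard BH step-up procedure applied column by column; it ignores NaN handling and is not the library's own code. The helper names bh_adjust and bh_adjust_columns are hypothetical.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values for one list of p-values."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def bh_adjust_columns(pvalue_matrix):
    """Adjust each column (covariate) of an n_genes x n_factors table separately."""
    n_rows, n_cols = len(pvalue_matrix), len(pvalue_matrix[0])
    columns = [bh_adjust([row[j] for row in pvalue_matrix]) for j in range(n_cols)]
    return [[columns[j][i] for j in range(n_cols)] for i in range(n_rows)]


pvalues = [
    [0.01, 0.2],
    [0.04, 0.9],
    [0.03, 0.5],
    [0.50, 0.1],
]  # 4 genes x 2 covariates
adjusted = bh_adjust_columns(pvalues)
for row in adjusted:
    print(row)
```

Adjusting each column on its own means the number of hypotheses per correction is n_genes, not n_genes * n_factors.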

n_factors: int#

Number of covariates to test for differential usage.

n_genes: int#

Number of genes.

n_isos: list[int]#

List of numbers of isoforms per gene.

n_spots: int#

Number of spots.

setup_input_mode = None#

sv_test_results: dict#

Dictionary to store the spatial variability test results after running test_spatial_variability(). It contains the following keys:

  • 'method': str, the method used for the test.

  • 'statistic': numpy.ndarray of shape (n_genes,), the test statistic for each gene.

  • 'pvalue': numpy.ndarray of shape (n_genes,), the p-value for each gene.

  • 'pvalue_adj': numpy.ndarray of shape (n_genes,), the BH adjusted p-value for each gene.