splisosm.hyptest_np
Non-parametric hypothesis tests for spatial isoform usage.
Classes
SplisosmNP | Non-parametric spatial isoform statistical model.
Module Contents
- class splisosm.hyptest_np.SplisosmNP
Non-parametric spatial isoform statistical model.
Examples
Setup data:
>>> from splisosm import SplisosmNP
>>> import torch
>>> # Simulate data for 10 genes with different number of isoforms
>>> data_3_iso = [torch.randint(low=0, high=5, size=(100, 3)) for _ in range(5)]  # 5 genes with 3 isoforms
>>> data_4_iso = [torch.randint(low=0, high=5, size=(100, 4)) for _ in range(5)]  # 5 genes with 4 isoforms
>>> data = data_3_iso + data_4_iso
>>> coordinates = torch.rand(100, 2)  # 100 spots with 2D coordinates
>>> design_mtx = torch.rand(100, 2)  # 100 spots with 2 covariates
Spatial variability test:
>>> model = SplisosmNP()
>>> model.setup_data(data, coordinates)
>>> model.test_spatial_variability(method='hsic-ir')
>>> sv_results = model.get_formatted_test_results('sv')
>>> print(sv_results.head())
Differential usage test:
>>> model = SplisosmNP()
>>> model.setup_data(data, coordinates, design_mtx=design_mtx)
>>> model.test_differential_usage(method='hsic')
>>> du_results = model.get_formatted_test_results('du')
>>> print(du_results.head())
- get_formatted_test_results(test_type)
Get the formatted test results as a data frame.
- Parameters:
test_type (Literal['sv', 'du']) – Type of test results to retrieve. Can be one of 'sv' (spatial variability) or 'du' (differential usage).
- Returns:
Formatted test results.
- Return type:
- setup_data(data=None, coordinates=None, approx_rank=None, design_mtx=None, gene_names=None, covariate_names=None, *, adata=None, spatial_key='spatial', layer='counts', group_iso_by='gene_symbol', min_counts=10, min_bin_pct=0.0, filter_single_iso_genes=True)
Set up isoform-level spatial data for hypothesis testing.
This method supports two input modes for backward compatibility.
- Legacy mode: pass data and coordinates directly.
- AnnData mode: pass adata, where counts are extracted from adata.layers[layer] grouped by group_iso_by, and coordinates are read from adata.obsm[spatial_key]. See splisosm.utils.prepare_inputs_from_anndata() for details.
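The grouping step of AnnData mode can be pictured in plain Python. This is a minimal sketch, not the library's implementation: the helper group_columns_by_gene and the toy inputs are hypothetical stand-ins for adata.var[group_iso_by] and adata.layers[layer].

```python
from collections import defaultdict


def group_columns_by_gene(gene_of_isoform):
    """Map each gene to the column indices of its isoforms (hypothetical helper)."""
    groups = defaultdict(list)
    for col, gene in enumerate(gene_of_isoform):
        groups[gene].append(col)
    return dict(groups)


# Toy stand-in for adata.var[group_iso_by]: one gene label per isoform column.
gene_of_isoform = ["GeneA", "GeneA", "GeneB", "GeneA", "GeneB"]
# Toy stand-in for adata.layers[layer]: rows are spots, columns are isoforms.
counts = [
    [3, 0, 1, 2, 4],
    [0, 5, 2, 1, 0],
]

groups = group_columns_by_gene(gene_of_isoform)
# One (n_spots, n_isos) matrix per gene, matching the legacy-mode data layout.
data = {
    gene: [[row[c] for c in cols] for row in counts]
    for gene, cols in groups.items()
}
print(groups)
print(data["GeneA"])
```

Each per-gene matrix has the (n_spots, n_isos) shape that legacy mode expects for the entries of data.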
- Parameters:
data (Optional[list[Union[Tensor, ndarray]]]) – Legacy mode only. List of tensors/arrays with shape (n_spots, n_isos) containing isoform counts for each gene.
coordinates (Optional[Union[Tensor, ndarray, DataFrame]]) – Legacy mode only. Shape (n_spots, 2), spatial coordinates.
approx_rank (Optional[int]) – The rank of the low-rank approximation for the spatial covariance matrix. If None, use the full-rank dense covariance matrix. For larger datasets (n_spots > 5,000), the maximum rank is set to 4 * sqrt(n_spots).
design_mtx (Optional[Union[Tensor, ndarray, DataFrame, str, list[str]]]) –
Design matrix for differential usage tests.
- Legacy mode: tensor/array/dataframe of shape (n_spots, n_factors).
- AnnData mode: tensor/array/dataframe, or one obs-column name (str), or a list of obs-column names.
When design_mtx contains categorical obs columns in AnnData mode, they are automatically one-hot encoded. Covariate names are inferred when not explicitly provided (see covariate_names below).
gene_names (Optional[Union[list[str], str]]) –
Gene names.
- Legacy mode: list of gene names.
- AnnData mode: optional column name in adata.var used as display names for grouped genes; if None, use grouped gene IDs.
covariate_names (Optional[list[str]]) –
List of covariate names. If not provided, names are inferred as follows:
- In AnnData mode with column name(s): column names are used, with categorical columns expanded to one-hot encoded names (e.g., col_cat0, col_cat1 for col if it has categorical values).
- In legacy mode with DataFrame: DataFrame column names are used.
- Otherwise: default names like factor_1, factor_2, etc. are generated.
When explicitly provided, must match the number of factors in the design matrix (after any categorical encoding/one-hot expansion).
adata (Optional[AnnData]) – AnnData object used in the new input mode.
spatial_key (str) – Key in adata.obsm for spatial coordinates.
layer (str) – Counts layer in adata.layers.
group_iso_by (str) – Column in adata.var used to group isoforms by gene.
min_counts (int) – Minimum total isoform count across spots required to retain an isoform in AnnData mode.
min_bin_pct (float) – Minimum percentage/fraction of spots where an isoform is expressed in AnnData mode. Values in [0, 1] are treated as fractions; values in (1, 100] are treated as percentages.
filter_single_iso_genes (bool) – AnnData mode only. Whether to remove genes with fewer than two retained isoforms.
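The two isoform filters, min_counts and min_bin_pct, can be sketched as follows. The helper names are hypothetical, and two details are assumptions rather than documented behavior: an isoform counts as expressed in a spot when its count is above zero, and both thresholds are inclusive.

```python
def as_fraction(min_bin_pct):
    """Interpret min_bin_pct as a fraction of spots (hypothetical helper)."""
    if 0.0 <= min_bin_pct <= 1.0:
        return float(min_bin_pct)  # already a fraction
    if 1.0 < min_bin_pct <= 100.0:
        return min_bin_pct / 100.0  # given as a percentage
    raise ValueError("min_bin_pct must lie in [0, 100]")


def keep_isoform(counts_per_spot, min_counts=10, min_bin_pct=0.0):
    """Return True if an isoform passes both filters (assumed inclusive)."""
    total = sum(counts_per_spot)
    expressed = sum(1 for c in counts_per_spot if c > 0) / len(counts_per_spot)
    return total >= min_counts and expressed >= as_fraction(min_bin_pct)


# 13 counts in total, expressed in 3 of 10 spots.
isoform = [0, 4, 0, 6, 0, 0, 3, 0, 0, 0]
print(keep_isoform(isoform, min_counts=10, min_bin_pct=0.2))  # fraction form
print(keep_isoform(isoform, min_counts=10, min_bin_pct=50))   # percentage form
```

Note that a value of exactly 1 falls in the fraction range, so it means all spots rather than 1 percent.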
- Raises:
ValueError – If input arguments are invalid or required fields are missing.
- Return type:
None
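The covariate naming rules described for covariate_names can be illustrated with a small sketch. The helpers below are hypothetical; treating string-valued columns as categorical and ordering categories alphabetically are assumptions, and whether the library drops a reference level is not stated in this page.

```python
def infer_covariate_names(columns):
    """Expand categorical columns into one-hot names (hypothetical helper).

    `columns` maps an obs-column name to its list of per-spot values.
    String-valued columns are treated as categorical, others as numeric.
    """
    names = []
    for col, values in columns.items():
        if all(isinstance(v, str) for v in values):
            # Category order is an assumption here: sorted unique values.
            names.extend(f"{col}_{cat}" for cat in sorted(set(values)))
        else:
            names.append(col)
    return names


def default_names(n_factors):
    """Fallback names when nothing can be inferred: factor_1, factor_2, ..."""
    return [f"factor_{i + 1}" for i in range(n_factors)]


obs = {
    "depth": [0.1, 0.4, 0.9],
    "region": ["cortex", "hippocampus", "cortex"],
}
names = infer_covariate_names(obs)
print(names)
print(default_names(2))
```

An explicitly supplied covariate_names list would need one entry per expanded column, three in this toy case.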
- test_differential_usage(method='hsic-gp', ratio_transformation='none', nan_filling='mean', gpr_backend='sklearn', gpr_configs=None, residualize='cov_only', print_progress=True, return_results=False)
Test for spatial isoform differential usage.
Before running this function, the design matrix must be set up using setup_data(). Each column of the design matrix corresponds to a covariate to test for differential association with the isoform usage ratios of each gene. Test statistics and p-values are computed per (gene, covariate) pair separately.
Two types of association tests are supported:
- Unconditional ("hsic", "t-fisher", "t-tippett"): test the unconditional association between isoform usage ratios and the covariate.
- Conditional ("hsic-gp"): test the association conditioned on spatial coordinates via Gaussian process regression. See [ZPJScholkopf12] for more details.
- Parameters:
method (str, optional) –
Method for association testing:
- "hsic": Unconditional HSIC test (multivariate RV coefficient). For continuous factors, equivalent to the multivariate Pearson correlation test. For binary factors, equivalent to the two-sample Hotelling T**2 test.
- "hsic-gp": Conditional HSIC test. Spatial effects are removed via Gaussian process regression before computing the HSIC statistic.
Or one of the T-tests (binary factors only):
- "t-fisher", "t-tippett": each isoform is tested independently and p-values are combined gene-wise via Fisher’s or Tippett’s method.
ratio_transformation (str, optional) – Compositional transformation for isoform ratios. One of 'none', 'clr', 'ilr', 'alr', 'radial' [PYPA22]. See splisosm.utils.counts_to_ratios().
nan_filling (str, optional) – How to fill NaN values in isoform ratios. One of 'mean' or 'none'. See splisosm.utils.counts_to_ratios().
gpr_backend (str, optional) – GPR backend to use for method='hsic-gp'. One of 'sklearn' (default) or 'gpytorch'. For FFT-accelerated spatial GP on regular grids use SplisosmFFT instead.
gpr_configs (dict, optional) –
Nested configuration dict for the GPR objects, with optional keys 'covariate' and/or 'isoform'. Each sub-dict is forwarded to splisosm.kernel_gpr.make_kernel_gpr(). Unspecified keys use the defaults from splisosm.kernel_gpr._DEFAULT_GPR_CONFIGS:
    {
        "covariate": {
            "constant_value": 1.0,
            "constant_value_bounds": (1e-3, 1e3),
            "length_scale": 1.0,
            "length_scale_bounds": "fixed",
            "n_inducing": 5000,
        },
        "isoform": {
            "constant_value": 1.0,
            "constant_value_bounds": (1e-3, 1e3),
            "length_scale": 1.0,
            "length_scale_bounds": "fixed",
            "n_inducing": 5000,
        },
    }
"n_inducing" (int) is supported by both backends with the same semantics:
- sklearn: full exact GP when n_obs ≤ n_inducing; a randomly sub-sampled subset of n_inducing points is used as the inducing set otherwise (default: 5000).
- gpytorch: FITC sparse GP approximation with n_inducing points; set to None to use exact GP (default: 5000).
residualize ({"cov_only", "both"}, optional) –
Controls which signals are spatially residualized when method="hsic-gp":
- "cov_only" (default): residualize covariates only; test HSIC(Z_res, Y_raw). Fastest; calibration matches "both" when covariate GPR captures most spatial confounding.
- "both": residualize both covariates and isoform ratios.
print_progress (bool, optional) – Whether to show the progress bar. Defaults to True.
return_results (bool, optional) – Whether to return the test statistics and p-values. If False, the results are stored in self.du_test_results.
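To make the "hsic" statistic concrete, the sketch below computes a biased HSIC estimate with linear kernels on both the covariate and the isoform ratios. It is an illustration under stated assumptions, not the library's code: the function names are hypothetical, and the linear kernels and the 1/(n-1)^2 scaling are choices made here for simplicity.

```python
def center_columns(mat):
    """Subtract the column mean from every column of a list-of-rows matrix."""
    n = len(mat)
    n_cols = len(mat[0])
    means = [sum(row[j] for row in mat) / n for j in range(n_cols)]
    return [[row[j] - means[j] for j in range(n_cols)] for row in mat]


def linear_hsic(x, y):
    """Biased HSIC estimate with linear kernels (hypothetical helper).

    With K = Xc Xc^T and L = Yc Yc^T, tr(K L) equals the squared Frobenius
    norm of the cross-product matrix Xc^T Yc, so no n-by-n kernel matrix
    needs to be formed. The 1/(n-1)^2 scaling is an assumed convention.
    """
    n = len(x)
    xc, yc = center_columns(x), center_columns(y)
    total = 0.0
    for a in range(len(xc[0])):
        for b in range(len(yc[0])):
            cross = sum(xc[i][a] * yc[i][b] for i in range(n))
            total += cross * cross
    return total / (n - 1) ** 2


covariate = [[-1.0], [0.0], [1.0]]
ratios_dependent = [[-1.0], [0.0], [1.0]]
ratios_uncorrelated = [[1.0], [-2.0], [1.0]]
print(linear_hsic(covariate, ratios_dependent))
print(linear_hsic(covariate, ratios_uncorrelated))
```

With linear kernels the statistic is zero exactly when every covariate column is uncorrelated with every ratio column, which is consistent with the Pearson-correlation interpretation given for continuous factors.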
- Returns:
results – If return_results is True, returns dict with test statistics and p-values. Otherwise, returns None and stores results in self.du_test_results.
- Return type:
dict or None
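For the "t-fisher" and "t-tippett" methods, per-isoform p-values are combined into one gene-level p-value. The sketch below shows the two textbook combination rules under the assumption of independent, strictly positive p-values; the function names are hypothetical and the library may handle edge cases differently.

```python
import math


def chi2_sf_even_df(x, k):
    """Survival function of a chi-square variable with 2k degrees of freedom."""
    half = x / 2.0
    return math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(k))


def fisher_combine(pvalues):
    """Fisher's method: -2 * sum(log p) is chi-square with 2k df under the null."""
    stat = -2.0 * sum(math.log(p) for p in pvalues)
    return chi2_sf_even_df(stat, len(pvalues))


def tippett_combine(pvalues):
    """Tippett's method: based on the smallest of k independent p-values."""
    return 1.0 - (1.0 - min(pvalues)) ** len(pvalues)


isoform_pvalues = [0.05, 0.05]
print(fisher_combine(isoform_pvalues))
print(tippett_combine(isoform_pvalues))
```

Fisher's rule rewards consistent moderate evidence across isoforms, while Tippett's rule responds to a single very small p-value.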
- test_spatial_variability(method='hsic-ir', ratio_transformation='none', nan_filling='mean', use_perm_null=False, n_perms_per_gene=None, return_results=False, print_progress=True)
Test for spatial variability.
Kernel-based multivariate hypothesis testing for spatial variability in
- gene-level total counts ("hsic-gc" or "spark-x" [ZSZ21])
- isoform usage ratios ("hsic-ir")
- isoform counts ("hsic-ic")
Test statistics and p-values are computed for each gene separately.
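The three test targets are different views of the same per-gene count matrix. The sketch below derives gene-level totals and isoform usage ratios from isoform counts, filling spots without coverage with the per-isoform mean ratio. The helper is hypothetical and only mirrors the idea of nan_filling='mean'; the behavior of splisosm.utils.counts_to_ratios() may differ in detail.

```python
def counts_to_ratio_sketch(counts):
    """Convert per-spot isoform counts to totals and usage ratios.

    Spots with zero total counts have undefined ratios; here they are filled
    with the per-isoform mean over covered spots (an assumed reading of
    nan_filling='mean'). Assumes at least one spot has non-zero counts.
    """
    totals = [sum(row) for row in counts]
    covered = [[c / t for c in row] for row, t in zip(counts, totals) if t > 0]
    n_isos = len(counts[0])
    means = [sum(r[j] for r in covered) / len(covered) for j in range(n_isos)]
    ratios = [
        [c / t for c in row] if t > 0 else list(means)
        for row, t in zip(counts, totals)
    ]
    return totals, ratios


# Rows are spots, columns are the isoforms of one gene ("hsic-ic" input).
counts = [[2, 2], [0, 0], [1, 3]]
totals, ratios = counts_to_ratio_sketch(counts)
print(totals)  # gene-level totals ("hsic-gc" input)
print(ratios)  # usage ratios ("hsic-ir" input)
```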
- Parameters:
method (Literal['hsic-ir', 'hsic-ic', 'hsic-gc', 'spark-x']) – Must be one of "hsic-ir", "hsic-ic", "hsic-gc", or "spark-x".
ratio_transformation (Literal['none', 'clr', 'ilr', 'alr', 'radial']) – If using isoform ratios, the compositional transformation to apply. Can be one of 'none', 'clr', 'ilr', 'alr', or 'radial' [PYPA22]. See splisosm.utils.counts_to_ratios() for more details.
nan_filling (Literal['mean', 'none']) – How to fill the NaN values in the isoform ratios. Can be one of 'mean' or 'none'. See splisosm.utils.counts_to_ratios() for more details.
use_perm_null (bool) – Whether to generate the null distribution from permutation. If False, use the asymptotic distribution of chi-square mixtures with Liu’s method [LTZ09].
n_perms_per_gene (Optional[int]) – Number of permutations per gene for permutation test.
return_results (bool) – Whether to return the test statistics and p-values. Defaults to False, in which case the results are stored in self.sv_test_results.
print_progress (bool) – Whether to show the progress bar. Defaults to True.
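As an illustration of the 'clr' option of ratio_transformation, the sketch below applies a centered log-ratio transform to one spot's isoform ratios. The function is a hypothetical stand-in, and the pseudo-value added before taking logs is an assumption made here to avoid log(0); the library's handling of zeros is not described in this page.

```python
import math


def clr(ratios, pseudo=1e-6):
    """Centered log-ratio transform of one composition (hypothetical helper).

    The pseudo-value guards against log(0); its size is an assumption.
    """
    logs = [math.log(r + pseudo) for r in ratios]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]


z = clr([0.7, 0.2, 0.1])
print(z)
```

The transformed values sum to zero by construction, which removes the unit-sum constraint of raw ratios at the cost of one redundant dimension.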
- Returns:
If return_results is True, returns dict with test statistics and p-values. Otherwise, returns None and stores results in self.sv_test_results.
- Return type:
dict or None
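The idea behind use_perm_null can be sketched with a generic permutation test: shuffle the values across spots, recompute the statistic, and compare against the observed value. Everything below is illustrative. The function names are hypothetical, a squared covariance with a 1D coordinate stands in for the kernel statistic, and the +1 correction is a common convention rather than a documented library choice.

```python
import random


def permutation_pvalue(x, y, statistic, n_perms=200, seed=0):
    """Permutation p-value with the usual +1 correction (hypothetical helper)."""
    rng = random.Random(seed)
    observed = statistic(x, y)
    exceed = 0
    y_perm = list(y)
    for _ in range(n_perms):
        rng.shuffle(y_perm)  # break the link between position and value
        if statistic(x, y_perm) >= observed:
            exceed += 1
    return (exceed + 1) / (n_perms + 1)


def squared_covariance(x, y):
    """Toy dependence statistic standing in for a kernel statistic."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)
    return cov * cov


x = [float(i) for i in range(20)]  # toy 1D spot coordinate
y = [2.0 * v + 1.0 for v in x]     # value with a strong spatial trend
p = permutation_pvalue(x, y, squared_covariance)
print(p)
```

The smallest attainable p-value is 1 / (n_perms + 1), so the number of permutations bounds the resolution of the test.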
Notes
To run the SPARK-X test, the R package SPARK must be installed and accessible from Python via rpy2.
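Before requesting method='spark-x', it can be convenient to check whether rpy2 is importable at all. The sketch below uses only the standard library; it does not verify that the R package SPARK itself is installed, and falling back to "hsic-gc" (the other gene-level total count test listed above) is a choice made for this example, not library behavior.

```python
import importlib.util


def rpy2_available():
    """Return True if the rpy2 package can be found in this environment."""
    return importlib.util.find_spec("rpy2") is not None


# Pick a gene-level total count test depending on what is available.
method = "spark-x" if rpy2_available() else "hsic-gc"
print(method)
```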
- adata = None
- du_test_results: dict
Dictionary to store the differential usage test results after running test_differential_usage(). It contains the following keys:
- 'method': str, the method used for the test.
- 'statistic': numpy.ndarray of shape (n_genes, n_factors), the test statistic for each gene and covariate.
- 'pvalue': numpy.ndarray of shape (n_genes, n_factors), the p-value for each gene and covariate.
- 'pvalue_adj': numpy.ndarray of shape (n_genes, n_factors), the BH adjusted p-value for each gene and covariate. Each column/covariate is adjusted separately.
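The column-wise Benjamini-Hochberg (BH) adjustment behind 'pvalue_adj' can be sketched in plain Python. The helpers are hypothetical and operate on nested lists rather than numpy arrays; they show the standard BH step-up procedure applied to each covariate independently.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values for one list of p-values."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted


def bh_adjust_columns(pvalue_matrix):
    """Adjust each covariate (column) of a genes-by-covariates matrix separately."""
    n_rows, n_cols = len(pvalue_matrix), len(pvalue_matrix[0])
    cols = [bh_adjust([row[j] for row in pvalue_matrix]) for j in range(n_cols)]
    return [[cols[j][i] for j in range(n_cols)] for i in range(n_rows)]


# Toy (n_genes, n_factors) p-value matrix: 4 genes, 2 covariates.
pvalue = [[0.01, 0.20], [0.04, 0.50], [0.03, 0.90], [0.005, 0.01]]
pvalue_adj = bh_adjust_columns(pvalue)
print(pvalue_adj)
```

Because each column is adjusted on its own, a gene's adjusted p-value for one covariate does not depend on how many other covariates were tested.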
- n_factors: int
Number of covariates to test for differential usage.
- n_genes: int
Number of genes.
- n_isos: list[int]
List of numbers of isoforms per gene.
- n_spots: int
Number of spots.
- setup_input_mode = None
- sv_test_results: dict
Dictionary to store the spatial variability test results after running test_spatial_variability(). It contains the following keys:
- 'method': str, the method used for the test.
- 'statistic': numpy.ndarray of shape (n_genes,), the test statistic for each gene.
- 'pvalue': numpy.ndarray of shape (n_genes,), the p-value for each gene.
- 'pvalue_adj': numpy.ndarray of shape (n_genes,), the BH adjusted p-value for each gene.