splisosm.hyptest_np#
Non-parametric hypothesis tests for spatial isoform usage.
Classes#
SplisosmNP | Non-parametric spatial isoform statistical model.
Module Contents#
- class splisosm.hyptest_np.SplisosmNP(k_neighbors=4, rho=0.99, standardize_cov=True)#
Non-parametric spatial isoform statistical model.
Examples
Spatial variability test:
>>> from splisosm import SplisosmNP
>>> # adata : AnnData of shape (n_spots, n_isoforms)
>>> # adata.layers["counts"]: raw isoform counts
>>> # adata.var["gene_symbol"]: column grouping isoforms by gene
>>> # adata.obsm["spatial"]: (n_spots, 2) spatial coordinates
>>> model = SplisosmNP()
>>> model.setup_data(adata, layer="counts", group_iso_by="gene_symbol")
>>> model.test_spatial_variability(method="hsic-ir")
>>> sv_results = model.get_formatted_test_results("sv")
Differential usage test:
>>> model = SplisosmNP()
>>> model.setup_data(
...     adata, layer="counts", group_iso_by="gene_symbol",
...     design_mtx="covariate",  # obs column name, or (n_spots, n_factors) array
... )
>>> model.test_differential_usage(method="hsic-gp", residualize="cov_only")
>>> du_results = model.get_formatted_test_results("du")
Initialise the model.
- Parameters:
k_neighbors (int, optional) – Number of nearest neighbours used to build the spatial adjacency graph for the CAR kernel (default 4).
rho (float, optional) – Spatial autocorrelation strength in the CAR model (default 0.99). Values close to 1 give a smoother spatial kernel.
standardize_cov (bool, optional) – Whether to standardise the spatial covariance matrix so that its diagonal entries are 1 (default True).
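To make the roles of k_neighbors, rho, and standardize_cov concrete, the following is a minimal NumPy sketch of one common CAR-style construction. It is an illustration under stated assumptions, not the implementation used by SplisosmNP: the helper name car_kernel_sketch, the symmetrization rule, and the precision form D - rho * W are all hypothetical choices.

```python
import numpy as np

def car_kernel_sketch(coords, k_neighbors=4, rho=0.99, standardize_cov=True):
    """Illustrative CAR-style covariance; NOT the library's implementation."""
    n = coords.shape[0]
    # Pairwise squared distances between spots.
    d2 = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)  # exclude self-neighbours
    # Binary k-NN adjacency, symmetrized so the graph is undirected.
    nn_idx = np.argsort(d2, axis=1)[:, :k_neighbors]
    W = np.zeros((n, n))
    W[np.repeat(np.arange(n), k_neighbors), nn_idx.ravel()] = 1.0
    W = np.maximum(W, W.T)
    # Assumed CAR precision Q = D - rho * W; the covariance is its inverse.
    D = np.diag(W.sum(axis=1))
    K = np.linalg.inv(D - rho * W)
    if standardize_cov:
        s = np.sqrt(np.diag(K))
        K = K / np.outer(s, s)  # rescale so every diagonal entry is 1
    return K

rng = np.random.default_rng(0)
coords = rng.uniform(size=(30, 2))  # hypothetical spot coordinates
K = car_kernel_sketch(coords)
```

In this sketch, rho < 1 keeps the precision matrix strictly diagonally dominant and therefore invertible, and moving rho toward 1 strengthens the correlation between neighbouring spots.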
- extract_feature_summary(level='gene', print_progress=True)#
Compute filtered feature-level summary statistics.
Gene-level statistics are aggregated across all isoforms that passed the filters applied in setup_data(). Isoform-level statistics are computed per isoform and added to the corresponding rows of adata.var.
Results are cached: repeated calls with the same level return the cached pandas.DataFrame without recomputation.
- Parameters:
level (Literal['gene', 'isoform']) – Summary granularity.
'gene': one row per gene.
'isoform': one row per isoform that passed filtering.
print_progress (bool) – Whether to show a progress bar.
- Returns:
For level='gene', the index is the gene display name and the columns are:
'n_isos': int. Number of isoforms retained after filtering.
'perplexity': float. Effective number of isoforms based on the marginal isoform usage entropy.
'pct_bin_on': float. Fraction of spots with non-zero total gene counts.
'count_avg': float. Mean per-spot total count for the gene.
'count_std': float. Std of per-spot total count for the gene.
For level='isoform', the index is the isoform name (matching adata.var_names) and the columns are the original adata.var columns plus:
'pct_bin_on': float. Fraction of spots with count > 0.
'count_total': float. Total counts across all spots.
'count_avg': float. Mean count per spot.
'count_std': float. Std of count per spot.
'ratio_total': float. Fraction of total gene counts attributable to this isoform.
'ratio_avg': float. Mean per-spot isoform usage ratio (computed over spots with non-zero gene coverage).
'ratio_std': float. Std of per-spot isoform usage ratio (computed over spots with non-zero gene coverage).
- Return type:
pandas.DataFrame
- Raises:
RuntimeError – If setup_data() has not been called.
ValueError – If level is not 'gene' or 'isoform'.
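The gene-level columns can be illustrated on a toy count matrix. The sketch below assumes that perplexity is the exponential of the Shannon entropy (natural log) of the pooled isoform usage; the library may use a different convention, and the counts array is made up for illustration.

```python
import numpy as np

# Hypothetical counts for one gene: rows are spots, columns are isoforms.
counts = np.array([
    [5, 5, 0],
    [3, 2, 0],
    [0, 0, 0],  # spot with no coverage for this gene
    [2, 3, 0],
])

gene_total = counts.sum(axis=1)
pct_bin_on = (gene_total > 0).mean()  # fraction of spots with non-zero gene counts
count_avg = gene_total.mean()         # mean per-spot total count

# Marginal isoform usage: pool counts over spots, then normalise.
p = counts.sum(axis=0) / counts.sum()
nz = p[p > 0]                         # treat 0 * log(0) as 0
entropy = -(nz * np.log(nz)).sum()
perplexity = np.exp(entropy)          # assumed definition: exp(Shannon entropy)
```

Here three isoforms are present but only two are used, in equal proportion, so the effective number of isoforms under this definition is 2 rather than 3.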
- get_formatted_test_results(test_type, with_gene_summary=False)#
Get formatted test results as a pandas DataFrame.
- Parameters:
test_type ({"sv", "du"}) – Which results to retrieve: "sv" for spatial variability or "du" for differential usage.
with_gene_summary (bool, optional) – If True, append gene-level summary statistics from extract_feature_summary() (columns: n_isos, perplexity, pct_bin_on, count_avg, count_std).
- Returns:
Formatted test results.
- Return type:
pandas.DataFrame
- setup_data(adata, *, spatial_key='spatial', adj_key=None, layer='counts', group_iso_by='gene_symbol', gene_names=None, design_mtx=None, covariate_names=None, min_counts=10, min_bin_pct=0.0, filter_single_iso_genes=True, min_component_size=1, skip_spatial_kernel=False)#
Setup isoform-level spatial data for hypothesis testing.
Extracts isoform count tensors from an AnnData object, optionally filters disconnected graph components, builds a spatial covariance kernel, and resolves the design matrix.
- Parameters:
adata (AnnData) – Annotated data matrix. Counts are read from adata.layers[layer] grouped by group_iso_by, and spatial coordinates from adata.obsm[spatial_key]. See splisosm.utils.prepare_inputs_from_anndata() for full preprocessing details.
spatial_key (str, optional) – Key in adata.obsm for spatial coordinates (default "spatial"). Optional when adj_key is provided: if the key is missing from adata.obsm, the spatial kernel is built from the adjacency alone, and coordinate-free SV tests ("hsic-ir" / "hsic-ic" / "hsic-gc") and DU tests still run. method="spark-x" (SV) and method="hsic-gp" (DU) require raw coordinates and raise a clear error at call time when they are absent.
adj_key (str or None, optional) – Key in adata.obsp for a pre-built adjacency matrix. When provided, it overrides the k-NN graph construction from coordinates and is used directly to build the spatial kernel. It also makes spatial_key optional (see above). The adjacency matrix is symmetrized internally.
layer (str, optional) – Layer in adata.layers that stores isoform counts (default "counts").
group_iso_by (str, optional) – Column in adata.var used to group isoforms by gene (default "gene_symbol").
gene_names (str or None, optional) – Column name in adata.var used as display names for genes. If None, the values of group_iso_by are used.
design_mtx (tensor, array, DataFrame, str, or list of str, optional) –
Design matrix for differential-usage tests. Accepts an array/tensor/DataFrame of shape (n_spots, n_factors), a single obs-column name (str), or a list of obs-column names. Categorical obs columns are one-hot encoded automatically.
When a scipy sparse matrix is passed directly, it is stored as scipy CSR internally and all differential-usage methods handle it without densifying the full matrix:
"hsic" uses a sparse matrix-multiply path in linear_hsic_test().
"t-fisher" and "t-tippett" extract group indices directly from the sparse non-zero structure.
"hsic-gp" densifies each column via _get_design_col() before GPR fitting (GPR residuals are always dense).
All other input types (obs column names, array, tensor, DataFrame) are converted to a dense torch float32 tensor.
covariate_names (list of str or None, optional) – Explicit covariate names. When design_mtx is given as column name(s) and this is None, the column names are used automatically; otherwise names are auto-generated as factor_1, etc.
min_counts (int, optional) – Minimum total isoform count across spots required to retain an isoform (default 10).
min_bin_pct (float, optional) – Minimum fraction/percentage of spots where an isoform must be expressed (default 0.0).
filter_single_iso_genes (bool, optional) – Whether to remove genes with fewer than two retained isoforms (default True).
min_component_size (int, optional) – Minimum number of spots a connected component must contain to be retained. Spots in smaller components are removed from all data structures before the spatial kernel is built. Default 1 disables filtering. A UserWarning is issued when spots are removed.
skip_spatial_kernel (bool, optional) – If True, skip construction of the CAR spatial kernel and store an IdentityKernel placeholder as self.sp_kernel instead. Use this when only test_differential_usage() is needed (it fits its own GPR to handle spatial autocorrelation). Calling test_spatial_variability() on a model set up with skip_spatial_kernel=True will raise a RuntimeError. Default False.
- Raises:
ValueError – If input arguments are invalid or required fields are missing.
- Return type:
None
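The effect of min_component_size can be sketched with SciPy's graph utilities. This is a hedged illustration of the general idea (label connected components, then drop spots in small ones), not the code path used inside setup_data(); the toy adjacency and the variable names are hypothetical.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

# Hypothetical adjacency over 6 spots: a 4-spot chain (0-1-2-3) and an isolated pair (4-5).
rows = [0, 1, 2, 4]
cols = [1, 2, 3, 5]
adj = csr_matrix((np.ones(4), (rows, cols)), shape=(6, 6))
adj = adj.maximum(adj.T)  # symmetrize so the graph is undirected

# Label each spot with its connected component and count component sizes.
n_comp, labels = connected_components(adj, directed=False)
sizes = np.bincount(labels)

min_component_size = 3
keep = sizes[labels] >= min_component_size  # boolean mask over spots
adj_filtered = adj[keep][:, keep]           # adjacency restricted to retained spots
```

With min_component_size=3 the isolated pair is dropped and the four chained spots are kept; with the default of 1 every spot would be retained.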
- test_differential_usage(method='hsic-gp', ratio_transformation='none', nan_filling='mean', gpr_backend='sklearn', gpr_configs=None, residualize='cov_only', n_jobs=-1, print_progress=True, return_results=False)#
Test for spatial isoform differential usage.
Before running this function, the design matrix must be set up using setup_data(). Each column of the design matrix corresponds to a covariate to test for differential association with the isoform usage ratios of each gene. Test statistics and p-values are computed separately for each (gene, covariate) pair.
Two types of association tests are supported:
Unconditional ("hsic", "t-fisher", "t-tippett"): test the unconditional association between isoform usage ratios and the covariate.
Conditional ("hsic-gp"): test the association conditioned on spatial coordinates via Gaussian process regression. See [ZPJScholkopf12] for more details.
- Parameters:
method (str, optional) –
Method for association testing:
"hsic": Unconditional HSIC test (multivariate RV coefficient). For continuous factors, equivalent to the multivariate Pearson correlation test. For binary factors, equivalent to the two-sample Hotelling T² test.
"hsic-gp": Conditional HSIC test. Spatial effects are removed via Gaussian process regression before computing the HSIC statistic.
Or one of the t-tests (binary factors only):
"t-fisher", "t-tippett": each isoform is tested independently and p-values are combined gene-wise via Fisher’s or Tippett’s method.
ratio_transformation (str, optional) – Compositional transformation for isoform ratios. One of 'none', 'clr', 'ilr', 'alr', 'radial' [PYPA22]. See splisosm.utils.counts_to_ratios().
nan_filling (str, optional) – How to fill NaN values in isoform ratios. One of 'mean' or 'none'. See splisosm.utils.counts_to_ratios().
gpr_backend (str, optional) – GPR backend to use for method='hsic-gp'. One of 'sklearn' (default) or 'gpytorch'. For FFT-accelerated spatial GP on regular grids, use SplisosmFFT instead.
gpr_configs (dict, optional) –
Nested configuration dict for the GPR objects, with optional keys 'covariate' and/or 'isoform'. Each sub-dict is forwarded to splisosm.kernel_gpr.make_kernel_gpr(). Unspecified keys use the defaults from splisosm.kernel_gpr._DEFAULT_GPR_CONFIGS:
{
    "covariate": {
        "constant_value": 1.0,
        "constant_value_bounds": (1e-3, 1e3),
        "length_scale": 1.0,
        "length_scale_bounds": "fixed",
        "n_inducing": 5000,
    },
    "isoform": {
        "constant_value": 1.0,
        "constant_value_bounds": (1e-3, 1e3),
        "length_scale": 1.0,
        "length_scale_bounds": "fixed",
        "n_inducing": 5000,
    },
}
"n_inducing" (int or None) controls the scale of spatial GP fitting for each backend:
sklearn: maximum number of observations used for hyperparameter fitting. Full exact GP when n_obs ≤ n_inducing (or None); a randomly sub-sampled subset-of-data of n_inducing points otherwise (not the same inducing-point approximation as gpytorch). Default: 5000. Set to None to use all observations (warns when n_obs > 10_000).
gpytorch: FITC sparse-GP inducing-point approximation with n_inducing points; set to None for exact GP. Default: 5000.
residualize ({"cov_only", "both"}, optional) –
Controls which signals are spatially residualized when method="hsic-gp":
"cov_only" (default): residualize covariates only; test HSIC(Z_res, Y_raw). Fastest; calibration matches "both" when covariate GPR captures most spatial confounding.
"both": residualize both covariates and isoform ratios.
n_jobs (int, optional) – Number of parallel workers for the per-gene loop. -1 uses all available CPUs. Each worker densifies one sparse count tensor (~4–40 MB at 100 K–1 M spots × 10 isoforms). When gpr_backend="gpytorch" and device != "cpu", parallelism is automatically disabled because the GPU is not thread-safe. Default -1.
print_progress (bool, optional) – Whether to show the progress bar (default True).
return_results (bool, optional) – Whether to return the test statistics and p-values. If False, the results are stored in self._du_test_results.
- Returns:
results – If return_results is True, returns a dict with test statistics and p-values. Otherwise, returns None and stores results in self._du_test_results.
- Return type:
dict or None
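The idea behind the "t-fisher" and "t-tippett" options (one t-test per isoform, then a gene-level combination) can be sketched with SciPy. This is an assumption-laden illustration rather than the library's code: the simulated data, the equal-variance t-test, and the use of scipy.stats.combine_pvalues are all choices made here for brevity.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_spots, n_isos = 200, 3
group = rng.integers(0, 2, size=n_spots).astype(bool)  # hypothetical binary covariate
ratios = rng.dirichlet(np.ones(n_isos), size=n_spots)  # hypothetical isoform usage ratios
ratios[group, 0] += 0.2                                # inject a shift in isoform 0
ratios /= ratios.sum(axis=1, keepdims=True)            # renormalise rows to sum to 1

# One two-sample t-test per isoform (column-wise).
t_stat, p_iso = stats.ttest_ind(ratios[group], ratios[~group], axis=0)

# Combine the isoform-level p-values into one gene-level p-value.
_, p_fisher = stats.combine_pvalues(p_iso, method="fisher")
_, p_tippett = stats.combine_pvalues(p_iso, method="tippett")
```

Fisher's method aggregates evidence across all isoforms, whereas Tippett's method is driven by the smallest p-value. Both combinations assume independent tests, which isoform ratios only approximate because they sum to 1 within a gene.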
- test_spatial_variability(method='hsic-ir', ratio_transformation='none', nan_filling='mean', null_method='eig', null_configs=None, n_jobs=-1, return_results=False, print_progress=True)#
Test for spatial variability.
Kernel-based multivariate hypothesis testing for spatial variability in one of:
gene-level total counts ("hsic-gc" or "spark-x" [ZSZ21])
isoform usage ratios ("hsic-ir")
isoform counts ("hsic-ic")
Test statistics and p-values are computed separately for each gene.
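As a rough picture of what a kernel-based spatial variability test computes, the sketch below evaluates an HSIC-style statistic between a spatial kernel and the isoform ratios of one gene, then calibrates it by permutation. Everything here is an assumption for illustration: the Gaussian kernel stands in for the CAR kernel, the normalisation by (n - 1)² is one common convention, and hsic_stat is a hypothetical helper, not a library function.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 60
coords = rng.uniform(size=(n, 2))  # hypothetical spot coordinates

# Stand-in spatial kernel: Gaussian (RBF) on coordinates, chosen for brevity.
d2 = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1)
K = np.exp(-d2 / (2 * 0.2 ** 2))

# Hypothetical two-isoform usage ratios with a gradient along the x-axis.
p0 = np.clip(0.2 + 0.6 * coords[:, 0] + rng.normal(0, 0.05, size=n), 0, 1)
Y = np.column_stack([p0, 1 - p0])

def hsic_stat(K, Y):
    """tr(K H Y Y^T H) / (n - 1)^2, computed as tr(Yc^T K Yc) / (n - 1)^2."""
    Yc = Y - Y.mean(axis=0)  # column-centring Y is equivalent to applying H
    return np.trace(Yc.T @ K @ Yc) / (Y.shape[0] - 1) ** 2

stat_obs = hsic_stat(K, Y)

# Permutation null: shuffling spots breaks any link between ratios and location.
n_perms = 200
null = np.array([hsic_stat(K, Y[rng.permutation(n)]) for _ in range(n_perms)])
p_perm = (1 + (null >= stat_obs).sum()) / (1 + n_perms)
```

The permutation loop is the conceptually simplest null, but it requires one statistic evaluation per permutation; asymptotic approximations avoid that cost by working from the kernel's eigenvalues or traces instead.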
- Parameters:
method ({"hsic-ir", "hsic-ic", "hsic-gc", "spark-x"}, optional) – Test target: "hsic-ir" (isoform usage ratios), "hsic-ic" (isoform counts), "hsic-gc" (gene-level counts), or "spark-x" (SPARK-X [ZSZ21]).
ratio_transformation ({"none", "clr", "ilr", "alr", "radial"}, optional) – Compositional transformation applied to isoform ratios when method="hsic-ir". See splisosm.utils.counts_to_ratios() and [PYPA22] for details.
nan_filling ({"mean", "none"}, optional) – Strategy for NaN values in isoform ratios. See splisosm.utils.counts_to_ratios() for details.
null_method ({"eig", "clt", "welch", "perm"}, optional) –
Method for computing the null distribution of the test statistic:
"eig" (default): asymptotic chi-squared mixture using kernel eigenvalues; Liu’s method [LTZ09]. Supports optional null_configs["approx_rank"] (int) to use only the top-k eigenvalues. By default, approx_rank = np.ceil(np.sqrt(n_spots) * 4) for large datasets (n_spots > 5000). Set it to None to use all eigenvalues, which can be slow for large n_spots.
"clt": moment-matching normal (Central Limit Theorem) approximation using tr(K’) and tr(K’²) of the (centred) spatial kernel. Fastest, but tail probabilities can be inaccurate when the effective degrees of freedom of the chi-squared mixture is small. "trace" is accepted as a deprecated alias.
"welch": Welch-Satterthwaite moment matching. Uses the same tr(K’) and tr(K’²) as "clt" but approximates the null by a scaled chi-squared g * chi2(h) with g = Var/(2*E) and h = 2*E^2/Var. Comparable cost to "clt" with more accurate right-tail p-values, typically closer to the "eig" (Liu) reference.
"perm": permutation-based null distribution. Supports optional null_configs["n_perms_per_gene"] (default 1000) and null_configs["perm_batch_size"] (default 50; larger values use more memory) for batch-wise null statistic computation.
null_configs (dict or None, optional) – Extra keyword arguments for the chosen null_method.
n_jobs (int, optional) – Number of parallel workers for the per-gene loop. -1 uses all available CPUs. Each worker densifies one sparse count tensor (~4–40 MB at 100 K–1 M spots × 10 isoforms), so choose n_jobs to fit within available RAM. Default -1.
return_results (bool, optional) – If True, return the result dict. Otherwise store results in sv_test_results and return None.
print_progress (bool, optional) – Whether to show a progress bar.
- Returns:
If return_results is True, returns a dict with test statistics and p-values. Otherwise, returns None and stores results in self._sv_test_results.
- Return type:
dict or None
Notes
To run the SPARK-X test, the R package SPARK must be installed and accessible from Python via rpy2.
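The difference between the "clt" and "welch" null approximations described under null_method can be made concrete with the formulas g = Var/(2*E) and h = 2*E^2/Var. The sketch below is a simplified illustration, not the library's code: it assumes the null statistic behaves like a weighted sum of independent chi-squared(1) variables with hypothetical weights lam, so that E = sum(lam) and Var = 2 * sum(lam**2).

```python
import numpy as np
from scipy import stats

# Hypothetical eigenvalues; the null statistic is modelled here as
# sum_i lam_i * chi2_1, a simplification made for illustration.
lam = np.array([3.0, 1.5, 0.5, 0.2, 0.1])
mean = lam.sum()              # E   (trace of the kernel)
var = 2.0 * (lam ** 2).sum()  # Var (twice the trace of the squared kernel)

t_obs = 15.0  # hypothetical observed statistic

# "clt"-style: normal approximation matching the first two moments.
p_clt = stats.norm.sf(t_obs, loc=mean, scale=np.sqrt(var))

# "welch"-style: scaled chi-squared g * chi2(h) matching the same two moments.
g = var / (2.0 * mean)
h = 2.0 * mean ** 2 / var
p_welch = stats.chi2.sf(t_obs / g, df=h)

# Monte Carlo reference for the mixture, for comparison.
rng = np.random.default_rng(0)
null_draws = (rng.chisquare(1, size=(200_000, lam.size)) * lam).sum(axis=1)
p_mc = (null_draws >= t_obs).mean()
```

Both approximations share the same mean and variance, but the scaled chi-squared is right-skewed like the mixture it replaces, whereas the normal is symmetric; comparing p_clt and p_welch against p_mc shows how that affects the right tail.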
- adata: AnnData | None#
Source AnnData object; None before setup_data().
- property filtered_adata: AnnData#
The filtered AnnData of shape (n_spots, sum(n_isos_per_gene)).
This is the data used internally after setup_data(). It is a copy of the input adata, subsetted to the retained spots and isoforms after filtering.
- Raises:
RuntimeError – If setup_data() has not been called.
- Return type:
AnnData
- n_factors: int#
Number of covariates for differential usage testing.
- n_genes: int#
Number of genes after filtering.
- n_spots: int#
Number of spatial spots/cells.
- sp_kernel: Any#
Spatial kernel (SpatialCovKernel or IdentityKernel). Set by setup_data().