Helper Functions#

These helpers support data preparation, standalone tests, and result post-processing outside the model classes.

Data preparation#

splisosm.utils.preprocessing.prepare_inputs_from_anndata

Extract and filter isoform count tensors from an AnnData object.

splisosm.utils.preprocessing.counts_to_ratios

Convert isoform counts to proportions.

splisosm.utils.preprocessing.add_ratio_layer

Compute within-gene isoform usage ratios and store them as a new layer.

splisosm.utils.preprocessing.compute_feature_summaries

Compute gene-level and isoform-level summary statistics.

splisosm.utils.preprocessing.auto_chunk_size

Return an automatic response-column chunk cap.

splisosm.utils.preprocessing.prepare_inputs_from_anndata(adata, layer, group_iso_by, spatial_key, adj_key=None, min_counts=0, min_bin_pct=0, filter_single_iso_genes=False, gene_names=None, design_mtx=None, covariate_names=None, min_component_size=1, k_neighbors=4, return_filtered_anndata=False)#

Extract and filter isoform count tensors from an AnnData object.

Shared helper used by both splisosm.hyptest.np.SplisosmNP and splisosm.hyptest.glmm.SplisosmGLMM to prepare legacy-compatible tensors from an AnnData input. Feature filtering, sparse/dense handling, coordinate extraction, and design-matrix resolution are all performed here.

Parameters:
  • adata (AnnData) – Annotated data matrix.

  • layer (str) – Key in adata.layers containing raw isoform counts.

  • group_iso_by (str) – Column in adata.var used to group isoforms by gene.

  • spatial_key (str) – Key in adata.obsm for spatial coordinates. Optional when adj_key is provided: if the key is missing from adata.obsm the returned coordinates is None and downstream kernel construction proceeds from the adjacency matrix alone. Raw coordinates are still required by SPARK-X and the GP-conditional DU test; those callers raise a dedicated error at call time when coordinates are absent.

  • adj_key (str | None) – Key in adata.obsp for a pre-built adjacency matrix. When provided, it overrides the k-NN graph construction from coordinates and is used directly to build the spatial kernel. The adjacency matrix is symmetrized internally before being returned.

  • min_counts (int) – Minimum total isoform count across spots required to retain an isoform.

  • min_bin_pct (float) – Minimum fraction/percentage of spots with non-zero expression for an isoform. Values in [0, 1] are treated as fractions; values in (1, 100] are treated as percentages.

  • filter_single_iso_genes (bool) – Whether to discard genes with fewer than two retained isoforms.

  • gene_names (str | None) – Column name in adata.var used as display names for grouped genes. If None, the grouped gene IDs are used.

  • design_mtx (Any | None) – Design matrix for differential-usage tests. Accepts a tensor, array, sparse matrix, DataFrame of shape (n_spots, n_factors), a single obs-column name, or a list of obs-column names.

  • covariate_names (list[str] | None) – Explicit covariate names. When design_mtx is given as column name(s) and this is None, the column names are used automatically.

  • min_component_size (int) – Minimum number of spots a connected component must contain to be retained. Spots in smaller components are removed from all arrays (counts, coordinates, design matrix). Requires building a k-NN graph from coordinates (unless adj_key is provided). Default 1 disables filtering. Emits UserWarning when spots are removed.

  • k_neighbors (int) – Number of nearest neighbours for the k-NN graph when adj_key is None and component filtering is active. Default 4.

  • return_filtered_anndata (bool) – If True, return a copy of adata restricted to the spots and isoforms that survived all filtering steps, with columns ordered to match counts_list. Default False returns None as the seventh element.

Returns:

  • counts_list (list[torch.Tensor]) – Per-gene isoform count tensors, each of shape (n_spots, n_isos). Sparse adata.layers[layer] input yields sparse COO tensors.

  • coordinates (torch.Tensor or None) – Spatial coordinates with shape (n_spots, n_spatial_dims) and dtype float32. None when spatial_key is missing from adata.obsm and adj_key supplies the neighborhood graph instead.

  • resolved_gene_names (list[str]) – Display names for each gene in counts_list.

  • resolved_design (np.ndarray or tensor or None) – Resolved design matrix, or the original object if it was already array-like; None when design_mtx is None.

  • resolved_covariates (list[str] or None) – Resolved covariate names, or None when design_mtx is None.

  • adj_matrix (scipy.sparse.spmatrix or None) – The (possibly filtered) adjacency matrix to use for building the spatial kernel. This is:

    • adata.obsp[adj_key] (filtered to retained spots) when adj_key is provided, or

    • the k-NN adjacency built from coordinates (filtered) when min_component_size > 1 and filtering removed spots, or

    • None otherwise (caller should build k-NN inside SpatialCovKernel).

  • filtered_adata (anndata.AnnData) – A copy of adata restricted to the spots and isoforms that survived all filtering steps (component filtering + min_counts / min_bin_pct / single-isoform-gene filtering). Columns are ordered to match the concatenation order of isoforms across genes in counts_list. This is useful for computing per-feature summary statistics without re-running the filtering logic.

Raises:

ValueError – If required fields are missing from adata, no isoforms survive filtering, or argument values are out of range.

Return type:

tuple[list[Tensor], Tensor | None, list[str], Any | None, list[str] | None, spmatrix | None, AnnData | None]

splisosm.utils.preprocessing.counts_to_ratios(counts, transformation='none', nan_filling='mean', fill_before_transform=None)#

Convert isoform counts to proportions.

Spots with zero total counts (“zero-coverage spots”) are handled according to nan_filling and fill_before_transform. When nan_filling='mean' the zero-coverage rows are replaced with the per-isoform mean of the remaining rows. The timing of this fill relative to the ratio transformation is controlled by fill_before_transform:

  • False (new default): zero-coverage rows are filled with the column-wise mean of the transformed values (i.e. after the ratio transform is applied). For log-ratio transforms (clr, ilr, alr) this means pseudocount-filled rows are replaced with the mean of the true transformed values.

  • True (old behaviour): zero-coverage rows are filled with the mean of the raw isoform ratios before the transformation is applied. For log-ratio transforms the pseudocount-based rows are kept as-is (no explicit fill).

Note

Behaviour change since v1.1.0: the default filling now happens after transformation. Code that relied on the previous before-transform fill should pass fill_before_transform=True explicitly. Passing fill_before_transform=False adopts the new default without a warning.

After conversion, the isoform ratios can be further transformed using log-ratio-based transformations (clr, ilr, alr) or radial transformation [PYPA22].

Parameters:
  • counts (ndarray | Tensor) – Shape (n_spots, n_isos). Isoform counts.

  • transformation (Literal['none', 'clr', 'ilr', 'alr', 'radial']) – Transformation applied to the proportions. Can be one of the following: 'none': no transformation, return isoform ratios. 'clr': centered log-ratio transformation. 'ilr': isometric log-ratio transformation. 'alr': additive log-ratio transformation. 'radial': radial transformation [PYPA22].

  • nan_filling (Literal['mean', 'none']) – Method to fill all-zero rows. 'mean': fill all-zero rows with the per-isoform mean across expressed spots (timing controlled by fill_before_transform). 'none': do not fill rows and return NaNs at all-zero rows.

  • fill_before_transform (bool | None) – Controls when zero-coverage rows are mean-filled relative to the ratio transformation. Only relevant when nan_filling='mean' and transformation != 'none'. False: fill after transformation (new default). True: fill before transformation (old behaviour). None (default): use the new default (False) and emit a FutureWarning to inform callers of the behaviour change.

Returns:

ratios – Shape (n_spots, n_isos) or (n_spots, n_isos - 1) if ilr or alr transformation is used.

Return type:

Tensor

Notes

Log-ratio-based transformations (clr, ilr, alr) are implemented via scikit-bio, with a pseudocount of 1% of the global mean per isoform to avoid zeros in the ratio.

splisosm.utils.preprocessing.add_ratio_layer(adata, layer, group_iso_by, ratio_layer_key, fill_nan_with_mean=False)#

Compute within-gene isoform usage ratios and store them as a new layer.

For each spot and gene, every isoform’s ratio is its count divided by the total count of that gene at that spot. The result is written to adata.layers[ratio_layer_key] in-place.

Parameters:
  • adata (AnnData) – Annotated data matrix.

  • layer (str) – Key in adata.layers containing raw isoform counts.

  • group_iso_by (str) – Column in adata.var grouping isoforms to genes.

  • ratio_layer_key (str) – Key under which to store the computed ratio matrix. Must differ from layer.

  • fill_nan_with_mean (bool, optional) –

    How to handle spots where the gene has zero total counts.

    • False (default): store as a sparse CSR matrix with the same sparsity structure as the count layer. Spots with zero gene total have ratio 0 for all isoforms of that gene (structural zeros, not stored explicitly).

    • True: store as a dense float32 matrix; spots with zero gene total are filled with the per-isoform mean ratio across expressed spots (or 0.0 if no spot expresses the isoform).

Raises:

ValueError – If layer is missing, group_iso_by is not a var column, or ratio_layer_key equals layer.

Return type:

None

splisosm.utils.preprocessing.compute_feature_summaries(adata, gene_names, layer='counts', group_iso_by='gene_symbol', print_progress=True)#

Compute gene-level and isoform-level summary statistics.

This is the shared implementation behind extract_feature_summary(), extract_feature_summary(), and extract_feature_summary().

Parameters:
  • adata (AnnData) – Annotated data matrix (typically the filtered AnnData from setup_data).

  • gene_names (list[str]) – Ordered list of gene display names (length n_genes).

  • layer (str) – Layer in adata.layers containing raw isoform counts.

  • group_iso_by (str) – Column in adata.var that groups isoforms by gene.

  • print_progress (bool) – Show a tqdm progress bar.

Returns:

  • gene_summary (pandas.DataFrame) – Indexed by gene name with columns: n_isos, perplexity, pct_bin_on, count_avg, count_std, major_ratio_avg.

  • isoform_summary (pandas.DataFrame) – Indexed by isoform name with original adata.var columns plus: pct_bin_on, count_total, count_avg, count_std, ratio_total, ratio_avg, ratio_std.

Return type:

tuple[DataFrame, DataFrame]

splisosm.utils.preprocessing.auto_chunk_size(n_observations, *, backend='np', n_jobs=1, dtype_bytes=8, memory_budget=2147483648, max_columns=32)#

Return an automatic response-column chunk cap.

Parameters:
  • n_observations (int) – Number of spots for the NP backend or grid cells for the FFT backend.

  • backend (Literal['np', 'fft']) – Backend used by the spatial variability test. "np" estimates dense RHS and kernel-product buffers; "fft" also accounts for complex Fourier work arrays.

  • n_jobs (int) – Joblib-style worker count. The default budget is interpreted per worker, so an 8-worker run targets roughly 16 GiB aggregate live memory.

  • dtype_bytes (int) – Bytes per real-valued input element.

  • memory_budget (int) – Per-worker live-memory budget in bytes. Defaults to 2 GiB.

  • max_columns (int) – Hard performance cap for response columns/channels. SPLISOSM uses 32 by default for both NP and FFT SV tests.

Returns:

Positive response-column cap. Genes are never split across chunks; a single gene with more columns than this cap should be processed as a singleton chunk.

Return type:

int

Standalone tests#

splisosm.utils.stats.run_hsic_gc

Compute the HSIC-GC statistic for gene-level counts.

splisosm.utils.stats.run_sparkx

Wrapper for running the SPARK-X test for spatial gene expression variability.

splisosm.utils.stats.false_discovery_control

Adjust p-values to control the false discovery rate.

splisosm.utils.stats.run_hsic_gc(counts_gene=None, coordinates=None, null_method='liu', null_configs=None, chunk_size='auto', n_jobs=1, min_component_size=1, adata=None, layer=None, spatial_key='spatial', adj_key=None, min_counts=0, min_bin_pct=0.0, **spatial_kernel_kwargs)#

Compute the HSIC-GC statistic for gene-level counts.

This standalone helper runs the gene-count SV test used by method="hsic-gc" without constructing a SplisosmNP model.

Parameters:
  • counts_gene (np.ndarray | Tensor | None) – Shape (n_spots, n_genes). Gene counts.

  • coordinates (np.ndarray | Tensor | None) – Shape (n_spots, n_dim). Spatial coordinates of spots.

  • null_method ({"liu", "welch"}, optional) –

    Method for computing the null distribution of the test statistic:

    • "liu" (default): asymptotic chi-square mixture using kernel cumulants with Liu’s method. Use null_configs["n_probes"] (int) to tune Hutchinson Rademacher probes for implicit CAR cumulants. An advanced null_configs["approx_rank"] override is available for diagnostics, but is not needed for the default SV path.

    • "welch": Welch-Satterthwaite scaled chi-squared approximation using the first two HSIC null moments. Typically more accurate in the right tail than the retired normal approximation at the same cost.

    "eig" is accepted as a deprecated alias for "liu". "clt" and "trace" are accepted as deprecated aliases for "welch".

  • null_configs (dict or None, optional) – Extra keyword arguments for the chosen null_method. Use "n_probes" to control the Hutchinson budget for both "liu" cumulants and "welch" first-two-moment traces when the spatial kernel has no exact trace path; implicit CAR kernels default to 60 probes. "n_probes" is stochastic trace control, not a low-rank approximation.

  • chunk_size (int or {"auto"}, optional) – Maximum number of gene-count response columns to process in one spatial-kernel application. "auto" (default) estimates a memory-safe cap from a 2 GiB live-memory budget and caps the result at 32 columns for per-feature runtime. Genes are never split across chunks.

  • n_jobs (int, optional) – Number of joblib workers for chunked gene-level HSIC computation. 1 (default) runs serially. Use -1 to use all available CPUs. Threads are preferred so the sparse spatial kernel and count matrix can be shared without large pickled copies.

  • min_component_size (int, optional) – Minimum number of spots a connected component must contain to be retained. Spots that belong to components smaller than this threshold are removed before the spatial kernel is built. Components are detected on the same k-NN graph used for the spatial kernel (controlled by k_neighbors), unless adj_key supplies a graph. The default value of 1 disables filtering. A UserWarning is issued whenever spots are removed.

  • adata (AnnData or None, optional) – If provided, use adata.X (when layer=None) or adata.layers[layer] as counts_gene — a (n_spots, n_genes) count matrix (dense or sparse) — and adata.obsm[spatial_key] for coordinates. Mutually exclusive with counts_gene / coordinates.

  • layer (str or None, optional) – Layer key in adata.layers. When None (default), adata.X is used. Used only in AnnData mode.

  • spatial_key (str, optional) – Key in adata.obsm for spatial coordinates. Used only in AnnData mode and optional when adj_key is provided: if the key is missing from adata.obsm the kernel is built from the adjacency alone.

  • adj_key (str or None, optional) – Key in adata.obsp for a pre-built adjacency matrix. When provided in AnnData mode, the adjacency is loaded from adata.obsp[adj_key] and used for both component filtering and the spatial kernel construction. In AnnData mode this also makes spatial_key optional. Ignored in matrix mode.

  • min_counts (int, optional) – Minimum total count to retain a gene in AnnData mode. Default 0.

  • min_bin_pct (float, optional) – Minimum fraction of spots expressing a gene (count > 0). Default 0.0.

  • **spatial_kernel_kwargs – Additional arguments forwarded to SpatialCovKernel. For example, standardize_cov=True (default) standardises the CAR covariance diagonal to one, which can reduce graph-outlier leverage but slows setup.

Returns:

Results with keys:

  • 'statistic': np.ndarray of shape (n_genes,).

  • 'pvalue': np.ndarray of shape (n_genes,).

  • 'pvalue_adj': np.ndarray of shape (n_genes,).

  • 'method': "hsic-gc".

  • 'null_method': the value of null_method.

  • 'n_spots': number of spots after component filtering.

Return type:

dict

splisosm.utils.stats.run_sparkx(counts_gene, coordinates)#

Wrapper for running the SPARK-X test for spatial gene expression variability.

It runs the R-package SPARK [ZSZ21] via rpy2.

Parameters:
  • counts_gene (ndarray | Tensor) – Shape (n_spots, n_genes), the observed gene counts.

  • coordinates (ndarray | Tensor) – Shape (n_spots, 2), the spatial coordinates.

Returns:

Results of the SPARK-X spatial variability test with keys:

  • 'statistic': np.ndarray of shape (n_genes,). Mean SPARK-X statistics.

  • 'pvalue': np.ndarray of shape (n_genes,). Combined p-values.

  • 'pvalue_adj': np.ndarray of shape (n_genes,). Adjusted combined p-values.

  • 'method': str. Method name “spark-x”.

Return type:

dict

splisosm.utils.stats.false_discovery_control(ps, *, axis=0, method='bh')#

Adjust p-values to control the false discovery rate.

The false discovery rate (FDR) is the expected proportion of rejected null hypotheses that are actually true. If the null hypothesis is rejected when the adjusted p-value falls below a specified level, the false discovery rate is controlled at that level.

Parameters:
  • ps (_Buffer | _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | complex | bytes | str | _NestedSequence[complex | bytes | str]) – The p-values to adjust. Elements must be real numbers between 0 and 1.

  • axis (int | None) – The axis along which to perform the adjustment. The adjustment is performed independently along each axis-slice. If axis is None, ps is raveled before performing the adjustment.

  • method (Literal['bh', 'by']) – The false discovery rate control procedure to apply: 'bh' is for Benjamini-Hochberg [BH95] (Eq. 1), 'by' is for Benjamini-Yekutieli [BY01] (Theorem 1.3). The latter is more conservative, but it is guaranteed to control the FDR even when the p-values are not from independent tests.

Returns:

ps_adjusted – The adjusted p-values. If the null hypothesis is rejected where these fall below a specified level, the false discovery rate is controlled at that level.

Return type:

ndarray

Notes

From scipy.stats.false_discovery_control in SciPy v1.13.1. See scipy/scipy.

HSIC and Liu helpers#

splisosm.utils.hsic.linear_hsic_test

The linear HSIC test (multivariate RV coefficient).

splisosm.utils.hsic.liu_sf

Compute p-values for weighted chi-squared sums using Liu's approximation.

splisosm.utils.hsic.liu_sf_from_cumulants

Compute Liu p-values from chi-squared-mixture cumulants.

splisosm.utils.hsic.linear_hsic_test(X, Y, centering=True)#

The linear HSIC test (multivariate RV coefficient).

Equivalent to a multivariate extension of Pearson correlation.

Supports sparse inputs for memory and speed efficiency:

  • Sparse X (scipy sparse matrix, shape (n, p)): the cross-product Y_c.T @ X_c is computed via a sparse matrix multiply (X.T.dot(Y_c)). Because Y is mean-centred, the X centering correction reduces to zero, so only the original sparse X is needed. X_c.T @ X_c is obtained as X.T @ X  -  n * mean_X mean_X, keeping the first term sparse.

  • Sparse Y (scipy sparse or torch sparse COO): densified once upfront before any computation.

Parameters:
  • X (Tensor or spmatrix, shape (n_samples, n_x))

  • Y (Tensor or spmatrix, shape (n_samples, n_y))

  • centering (bool) – Whether to mean-centre X and Y before computing the statistic.

Returns:

  • hsic (float) – HSIC test statistic (scaled by 1 / (n - 1)**2).

  • pvalue (float) – P-value from the asymptotic chi-squared mixture distribution.

Return type:

tuple[float, float]

splisosm.utils.hsic.liu_sf(t, lambs, dofs=None, deltas=None, kurtosis=False)#

Compute p-values for weighted chi-squared sums using Liu’s approximation.

Parameters:
  • t (_Buffer | _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | complex | bytes | str | _NestedSequence[complex | bytes | str]) – Observed test statistic value(s).

  • lambs (_Buffer | _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | complex | bytes | str | _NestedSequence[complex | bytes | str]) – Mixture weights.

  • dofs (_Buffer | _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | complex | bytes | str | _NestedSequence[complex | bytes | str] | None) – Degrees of freedom for each component. Defaults to one.

  • deltas (_Buffer | _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | complex | bytes | str | _NestedSequence[complex | bytes | str] | None) – Noncentrality parameters for each component. Defaults to zero.

  • kurtosis (bool) – If True, use the kurtosis-matching edge-case branch.

Returns:

Survival probabilities Pr(X > t).

Return type:

ndarray

splisosm.utils.hsic.liu_sf_from_cumulants(t, cumulants, kurtosis=False)#

Compute Liu p-values from chi-squared-mixture cumulants.

Parameters:
  • t (_Buffer | _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | complex | bytes | str | _NestedSequence[complex | bytes | str])

  • cumulants (dict[int, float])

  • kurtosis (bool)

Return type:

ndarray