Helper Functions#
These helpers support data preparation, standalone tests, and result post-processing outside the model classes.
Data preparation#
Extract and filter isoform count tensors from an AnnData object. |
|
Convert isoform counts to proportions. |
|
Compute within-gene isoform usage ratios and store them as a new layer. |
|
Compute gene-level and isoform-level summary statistics. |
|
Return an automatic response-column chunk cap. |
- splisosm.utils.preprocessing.prepare_inputs_from_anndata(adata, layer, group_iso_by, spatial_key, adj_key=None, min_counts=0, min_bin_pct=0, filter_single_iso_genes=False, gene_names=None, design_mtx=None, covariate_names=None, min_component_size=1, k_neighbors=4, return_filtered_anndata=False)#
Extract and filter isoform count tensors from an AnnData object.
Shared helper used by both
splisosm.hyptest.np.SplisosmNPandsplisosm.hyptest.glmm.SplisosmGLMMto prepare legacy-compatible tensors from an AnnData input. Feature filtering, sparse/dense handling, coordinate extraction, and design-matrix resolution are all performed here.- Parameters:
adata (AnnData) – Annotated data matrix.
layer (str) – Key in
adata.layerscontaining raw isoform counts.group_iso_by (str) – Column in
adata.varused to group isoforms by gene.spatial_key (str) – Key in
adata.obsmfor spatial coordinates. Optional whenadj_keyis provided: if the key is missing fromadata.obsmthe returnedcoordinatesisNoneand downstream kernel construction proceeds from the adjacency matrix alone. Raw coordinates are still required by SPARK-X and the GP-conditional DU test; those callers raise a dedicated error at call time when coordinates are absent.adj_key (str | None) – Key in
adata.obspfor a pre-built adjacency matrix. When provided, it overrides the k-NN graph construction from coordinates and is used directly to build the spatial kernel. The adjacency matrix is symmetrized internally before being returned.min_counts (int) – Minimum total isoform count across spots required to retain an isoform.
min_bin_pct (float) – Minimum fraction/percentage of spots with non-zero expression for an isoform. Values in
[0, 1]are treated as fractions; values in(1, 100]are treated as percentages.filter_single_iso_genes (bool) – Whether to discard genes with fewer than two retained isoforms.
gene_names (str | None) – Column name in
adata.varused as display names for grouped genes. IfNone, the grouped gene IDs are used.design_mtx (Any | None) – Design matrix for differential-usage tests. Accepts a tensor, array, sparse matrix, DataFrame of shape
(n_spots, n_factors), a single obs-column name, or a list of obs-column names.covariate_names (list[str] | None) – Explicit covariate names. When
design_mtxis given as column name(s) and this isNone, the column names are used automatically.min_component_size (int) – Minimum number of spots a connected component must contain to be retained. Spots in smaller components are removed from all arrays (counts, coordinates, design matrix). Requires building a k-NN graph from coordinates (unless
adj_keyis provided). Default 1 disables filtering. EmitsUserWarningwhen spots are removed.k_neighbors (int) – Number of nearest neighbours for the k-NN graph when
adj_keyisNoneand component filtering is active. Default 4.return_filtered_anndata (bool) – If
True, return a copy ofadatarestricted to the spots and isoforms that survived all filtering steps, with columns ordered to matchcounts_list. DefaultFalsereturnsNoneas the seventh element.
- Returns:
counts_list (list[torch.Tensor]) – Per-gene isoform count tensors, each of shape
(n_spots, n_isos). Sparseadata.layers[layer]input yields sparse COO tensors.coordinates (torch.Tensor or None) – Spatial coordinates with shape
(n_spots, n_spatial_dims)and dtype float32.Nonewhenspatial_keyis missing fromadata.obsmandadj_keysupplies the neighborhood graph instead.resolved_gene_names (list[str]) – Display names for each gene in
counts_list.resolved_design (np.ndarray or tensor or None) – Resolved design matrix, or the original object if it was already array-like;
Nonewhendesign_mtxisNone.resolved_covariates (list[str] or None) – Resolved covariate names, or
Nonewhendesign_mtxisNone.adj_matrix (scipy.sparse.spmatrix or None) – The (possibly filtered) adjacency matrix to use for building the spatial kernel. This is:
adata.obsp[adj_key](filtered to retained spots) whenadj_keyis provided, orthe k-NN adjacency built from coordinates (filtered) when
min_component_size > 1and filtering removed spots, orNoneotherwise (caller should build k-NN insideSpatialCovKernel).
filtered_adata (anndata.AnnData) – A copy of
adatarestricted to the spots and isoforms that survived all filtering steps (component filtering + min_counts / min_bin_pct / single-isoform-gene filtering). Columns are ordered to match the concatenation order of isoforms across genes incounts_list. This is useful for computing per-feature summary statistics without re-running the filtering logic.
- Raises:
ValueError – If required fields are missing from
adata, no isoforms survive filtering, or argument values are out of range.- Return type:
tuple[list[Tensor], Tensor | None, list[str], Any | None, list[str] | None, spmatrix | None, AnnData | None]
- splisosm.utils.preprocessing.counts_to_ratios(counts, transformation='none', nan_filling='mean', fill_before_transform=None)#
Convert isoform counts to proportions.
Spots with zero total counts (“zero-coverage spots”) are handled according to
nan_fillingandfill_before_transform. Whennan_filling='mean'the zero-coverage rows are replaced with the per-isoform mean of the remaining rows. The timing of this fill relative to the ratio transformation is controlled byfill_before_transform:False(new default): zero-coverage rows are filled with the column-wise mean of the transformed values (i.e. after the ratio transform is applied). For log-ratio transforms (clr, ilr, alr) this means pseudocount-filled rows are replaced with the mean of the true transformed values.True(old behaviour): zero-coverage rows are filled with the mean of the raw isoform ratios before the transformation is applied. For log-ratio transforms the pseudocount-based rows are kept as-is (no explicit fill).
Note
Behaviour change since v1.1.0: the default filling now happens after transformation. Code that relied on the previous before-transform fill should pass
fill_before_transform=Trueexplicitly. Passingfill_before_transform=Falseadopts the new default without a warning.After conversion, the isoform ratios can be further transformed using log-ratio-based transformations (clr, ilr, alr) or radial transformation [PYPA22].
- Parameters:
counts (ndarray | Tensor) – Shape (n_spots, n_isos). Isoform counts.
transformation (Literal['none', 'clr', 'ilr', 'alr', 'radial']) – Transformation applied to the proportions. Can be one of the following:
'none': no transformation, return isoform ratios.'clr': centered log-ratio transformation.'ilr': isometric log-ratio transformation.'alr': additive log-ratio transformation.'radial': radial transformation [PYPA22].nan_filling (Literal['mean', 'none']) – Method to fill all-zero rows.
'mean': fill all-zero rows with the per-isoform mean across expressed spots (timing controlled byfill_before_transform).'none': do not fill rows and return NaNs at all-zero rows.fill_before_transform (bool | None) – Controls when zero-coverage rows are mean-filled relative to the ratio transformation. Only relevant when
nan_filling='mean'andtransformation != 'none'.False: fill after transformation (new default).True: fill before transformation (old behaviour).None(default): use the new default (False) and emit aFutureWarningto inform callers of the behaviour change.
- Returns:
ratios – Shape (n_spots, n_isos) or (n_spots, n_isos - 1) if ilr or alr transformation is used.
- Return type:
Notes
Log-ratio-based transformations (clr, ilr, alr) are implemented via
scikit-bio, with a pseudocount of 1% of the global mean per isoform to avoid zeros in the ratio.
- splisosm.utils.preprocessing.add_ratio_layer(adata, layer, group_iso_by, ratio_layer_key, fill_nan_with_mean=False)#
Compute within-gene isoform usage ratios and store them as a new layer.
For each spot and gene, every isoform’s ratio is its count divided by the total count of that gene at that spot. The result is written to
adata.layers[ratio_layer_key]in-place.- Parameters:
adata (AnnData) – Annotated data matrix.
layer (str) – Key in
adata.layerscontaining raw isoform counts.group_iso_by (str) – Column in
adata.vargrouping isoforms to genes.ratio_layer_key (str) – Key under which to store the computed ratio matrix. Must differ from layer.
fill_nan_with_mean (bool, optional) –
How to handle spots where the gene has zero total counts.
False(default): store as a sparse CSR matrix with the same sparsity structure as the count layer. Spots with zero gene total have ratio 0 for all isoforms of that gene (structural zeros, not stored explicitly).True: store as a dense float32 matrix; spots with zero gene total are filled with the per-isoform mean ratio across expressed spots (or 0.0 if no spot expresses the isoform).
- Raises:
ValueError – If layer is missing, group_iso_by is not a
varcolumn, or ratio_layer_key equals layer.- Return type:
None
- splisosm.utils.preprocessing.compute_feature_summaries(adata, gene_names, layer='counts', group_iso_by='gene_symbol', print_progress=True)#
Compute gene-level and isoform-level summary statistics.
This is the shared implementation behind
extract_feature_summary(),extract_feature_summary(), andextract_feature_summary().- Parameters:
adata (AnnData) – Annotated data matrix (typically the filtered AnnData from
setup_data).gene_names (list[str]) – Ordered list of gene display names (length
n_genes).layer (str) – Layer in
adata.layerscontaining raw isoform counts.group_iso_by (str) – Column in
adata.varthat groups isoforms by gene.print_progress (bool) – Show a tqdm progress bar.
- Returns:
gene_summary (pandas.DataFrame) – Indexed by gene name with columns:
n_isos,perplexity,pct_bin_on,count_avg,count_std,major_ratio_avg.isoform_summary (pandas.DataFrame) – Indexed by isoform name with original
adata.varcolumns plus:pct_bin_on,count_total,count_avg,count_std,ratio_total,ratio_avg,ratio_std.
- Return type:
tuple[DataFrame, DataFrame]
- splisosm.utils.preprocessing.auto_chunk_size(n_observations, *, backend='np', n_jobs=1, dtype_bytes=8, memory_budget=2147483648, max_columns=32)#
Return an automatic response-column chunk cap.
- Parameters:
n_observations (int) – Number of spots for the NP backend or grid cells for the FFT backend.
backend (Literal['np', 'fft']) – Backend used by the spatial variability test.
"np"estimates dense RHS and kernel-product buffers;"fft"also accounts for complex Fourier work arrays.n_jobs (int) – Joblib-style worker count. The default budget is interpreted per worker, so an 8-worker run targets roughly 16 GiB aggregate live memory.
dtype_bytes (int) – Bytes per real-valued input element.
memory_budget (int) – Per-worker live-memory budget in bytes. Defaults to 2 GiB.
max_columns (int) – Hard performance cap for response columns/channels. SPLISOSM uses
32by default for both NP and FFT SV tests.
- Returns:
Positive response-column cap. Genes are never split across chunks; a single gene with more columns than this cap should be processed as a singleton chunk.
- Return type:
int
Standalone tests#
Compute the HSIC-GC statistic for gene-level counts. |
|
Wrapper for running the SPARK-X test for spatial gene expression variability. |
|
Adjust p-values to control the false discovery rate. |
- splisosm.utils.stats.run_hsic_gc(counts_gene=None, coordinates=None, null_method='liu', null_configs=None, chunk_size='auto', n_jobs=1, min_component_size=1, adata=None, layer=None, spatial_key='spatial', adj_key=None, min_counts=0, min_bin_pct=0.0, **spatial_kernel_kwargs)#
Compute the HSIC-GC statistic for gene-level counts.
This standalone helper runs the gene-count SV test used by
method="hsic-gc"without constructing aSplisosmNPmodel.- Parameters:
counts_gene (np.ndarray | Tensor | None) – Shape
(n_spots, n_genes). Gene counts.coordinates (np.ndarray | Tensor | None) – Shape
(n_spots, n_dim). Spatial coordinates of spots.null_method ({"liu", "welch"}, optional) –
Method for computing the null distribution of the test statistic:
"liu"(default): asymptotic chi-square mixture using kernel cumulants with Liu’s method. Usenull_configs["n_probes"](int) to tune Hutchinson Rademacher probes for implicit CAR cumulants. An advancednull_configs["approx_rank"]override is available for diagnostics, but is not needed for the default SV path."welch": Welch-Satterthwaite scaled chi-squared approximation using the first two HSIC null moments. Typically more accurate in the right tail than the retired normal approximation at the same cost.
"eig"is accepted as a deprecated alias for"liu"."clt"and"trace"are accepted as deprecated aliases for"welch".null_configs (dict or None, optional) – Extra keyword arguments for the chosen
null_method. Use"n_probes"to control the Hutchinson budget for both"liu"cumulants and"welch"first-two-moment traces when the spatial kernel has no exact trace path; implicit CAR kernels default to 60 probes."n_probes"is stochastic trace control, not a low-rank approximation.chunk_size (int or {"auto"}, optional) – Maximum number of gene-count response columns to process in one spatial-kernel application.
"auto"(default) estimates a memory-safe cap from a 2 GiB live-memory budget and caps the result at 32 columns for per-feature runtime. Genes are never split across chunks.n_jobs (int, optional) – Number of joblib workers for chunked gene-level HSIC computation.
1(default) runs serially. Use-1to use all available CPUs. Threads are preferred so the sparse spatial kernel and count matrix can be shared without large pickled copies.min_component_size (int, optional) – Minimum number of spots a connected component must contain to be retained. Spots that belong to components smaller than this threshold are removed before the spatial kernel is built. Components are detected on the same k-NN graph used for the spatial kernel (controlled by
k_neighbors), unlessadj_keysupplies a graph. The default value of1disables filtering. AUserWarningis issued whenever spots are removed.adata (AnnData or None, optional) – If provided, use
adata.X(whenlayer=None) oradata.layers[layer]ascounts_gene— a(n_spots, n_genes)count matrix (dense or sparse) — andadata.obsm[spatial_key]for coordinates. Mutually exclusive withcounts_gene/coordinates.layer (str or None, optional) – Layer key in
adata.layers. WhenNone(default),adata.Xis used. Used only in AnnData mode.spatial_key (str, optional) – Key in
adata.obsmfor spatial coordinates. Used only in AnnData mode and optional whenadj_keyis provided: if the key is missing fromadata.obsmthe kernel is built from the adjacency alone.adj_key (str or None, optional) – Key in
adata.obspfor a pre-built adjacency matrix. When provided in AnnData mode, the adjacency is loaded fromadata.obsp[adj_key]and used for both component filtering and the spatial kernel construction. In AnnData mode this also makesspatial_keyoptional. Ignored in matrix mode.min_counts (int, optional) – Minimum total count to retain a gene in AnnData mode. Default 0.
min_bin_pct (float, optional) – Minimum fraction of spots expressing a gene (count > 0). Default
0.0.**spatial_kernel_kwargs – Additional arguments forwarded to
SpatialCovKernel. For example,standardize_cov=True(default) standardises the CAR covariance diagonal to one, which can reduce graph-outlier leverage but slows setup.
- Returns:
Results with keys:
'statistic': np.ndarray of shape (n_genes,).'pvalue': np.ndarray of shape (n_genes,).'pvalue_adj': np.ndarray of shape (n_genes,).'method':"hsic-gc".'null_method': the value of null_method.'n_spots': number of spots after component filtering.
- Return type:
dict
- splisosm.utils.stats.run_sparkx(counts_gene, coordinates)#
Wrapper for running the SPARK-X test for spatial gene expression variability.
It runs the R-package SPARK [ZSZ21] via rpy2.
- Parameters:
- Returns:
Results of the SPARK-X spatial variability test with keys:
'statistic': np.ndarray of shape (n_genes,). Mean SPARK-X statistics.'pvalue': np.ndarray of shape (n_genes,). Combined p-values.'pvalue_adj': np.ndarray of shape (n_genes,). Adjusted combined p-values.'method': str. Method name “spark-x”.
- Return type:
dict
- splisosm.utils.stats.false_discovery_control(ps, *, axis=0, method='bh')#
Adjust p-values to control the false discovery rate.
The false discovery rate (FDR) is the expected proportion of rejected null hypotheses that are actually true. If the null hypothesis is rejected when the adjusted p-value falls below a specified level, the false discovery rate is controlled at that level.
- Parameters:
ps (_Buffer | _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | complex | bytes | str | _NestedSequence[complex | bytes | str]) – The p-values to adjust. Elements must be real numbers between 0 and 1.
axis (int | None) – The axis along which to perform the adjustment. The adjustment is performed independently along each axis-slice. If
axisis None,psis raveled before performing the adjustment.method (Literal['bh', 'by']) – The false discovery rate control procedure to apply:
'bh'is for Benjamini-Hochberg [BH95] (Eq. 1),'by'is for Benjamini-Yekutieli [BY01] (Theorem 1.3). The latter is more conservative, but it is guaranteed to control the FDR even when the p-values are not from independent tests.
- Returns:
ps_adjusted – The adjusted p-values. If the null hypothesis is rejected where these fall below a specified level, the false discovery rate is controlled at that level.
- Return type:
Notes
From
scipy.stats.false_discovery_controlin SciPy v1.13.1. See scipy/scipy.
HSIC and Liu helpers#
The linear HSIC test (multivariate RV coefficient). |
|
Compute p-values for weighted chi-squared sums using Liu's approximation. |
|
Compute Liu p-values from chi-squared-mixture cumulants. |
- splisosm.utils.hsic.linear_hsic_test(X, Y, centering=True)#
The linear HSIC test (multivariate RV coefficient).
Equivalent to a multivariate extension of Pearson correlation.
Supports sparse inputs for memory and speed efficiency:
Sparse X (scipy sparse matrix, shape
(n, p)): the cross-productY_c.T @ X_cis computed via a sparse matrix multiply (X.T.dot(Y_c)). BecauseYis mean-centred, theXcentering correction reduces to zero, so only the original sparseXis needed.X_c.T @ X_cis obtained asX.T @ X - n * mean_X ⊗ mean_X, keeping the first term sparse.Sparse Y (scipy sparse or torch sparse COO): densified once upfront before any computation.
- Parameters:
- Returns:
hsic (float) – HSIC test statistic (scaled by 1 / (n - 1)**2).
pvalue (float) – P-value from the asymptotic chi-squared mixture distribution.
- Return type:
tuple[float, float]
- splisosm.utils.hsic.liu_sf(t, lambs, dofs=None, deltas=None, kurtosis=False)#
Compute p-values for weighted chi-squared sums using Liu’s approximation.
- Parameters:
t (_Buffer | _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | complex | bytes | str | _NestedSequence[complex | bytes | str]) – Observed test statistic value(s).
lambs (_Buffer | _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | complex | bytes | str | _NestedSequence[complex | bytes | str]) – Mixture weights.
dofs (_Buffer | _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | complex | bytes | str | _NestedSequence[complex | bytes | str] | None) – Degrees of freedom for each component. Defaults to one.
deltas (_Buffer | _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | complex | bytes | str | _NestedSequence[complex | bytes | str] | None) – Noncentrality parameters for each component. Defaults to zero.
kurtosis (bool) – If
True, use the kurtosis-matching edge-case branch.
- Returns:
Survival probabilities
Pr(X > t).- Return type:
- splisosm.utils.hsic.liu_sf_from_cumulants(t, cumulants, kurtosis=False)#
Compute Liu p-values from chi-squared-mixture cumulants.