splisosm.utils

splisosm.utils#

General utilities for preprocessing and statistical helpers.

Functions#

`add_ratio_layer`(adata, layer, group_iso_by, ...[, ...])	Compute within-gene isoform usage ratios and store them as a new layer.
`compute_feature_summaries`(adata, gene_names[, layer, ...])	Compute gene-level and isoform-level summary statistics.
`counts_to_ratios`(counts[, transformation, ...])	Convert isoform counts to proportions.
`extract_counts_n_ratios`(adata[, layer, group_iso_by, ...])	Extract per-gene lists of isoform counts and ratios from anndata.
`extract_gene_level_statistics`(adata[, layer, group_iso_by])	Extract gene-level metadata from isoform-level counts anndata.
`false_discovery_control`(ps, *[, axis, method])	Adjust p-values to control the false discovery rate.
`get_cov_sp`(coords[, k, rho])	Return the dense standardised CAR spatial covariance matrix.
`prepare_inputs_from_anndata`(adata, layer, ...[, ...])	Extract and filter isoform count tensors from an AnnData object.
`run_hsic_gc`([counts_gene, coordinates, null_method, ...])	Compute the HSIC-GC statistic for gene-level counts.
`run_sparkx`(counts_gene, coordinates)	Wrapper for running the SPARK-X test for spatial gene expression variability.

Module Contents#

splisosm.utils.add_ratio_layer(adata, layer, group_iso_by, ratio_layer_key, fill_nan_with_mean=False)#

Compute within-gene isoform usage ratios and store them as a new layer.

For each spot and gene, every isoform’s ratio is its count divided by the total count of that gene at that spot. The result is written to adata.layers[ratio_layer_key] in-place.

Parameters:

adata (AnnData) – Annotated data matrix.
layer (str) – Key in adata.layers containing raw isoform counts.
group_iso_by (str) – Column in adata.var grouping isoforms to genes.
ratio_layer_key (str) – Key under which to store the computed ratio matrix. Must differ from layer.
fill_nan_with_mean (bool, optional) –
How to handle spots where the gene has zero total counts.
- False (default): store as a sparse CSR matrix with the same sparsity structure as the count layer. Spots with zero gene total have ratio 0 for all isoforms of that gene (structural zeros, not stored explicitly).
- True: store as a dense float32 matrix; spots with zero gene total are filled with the per-isoform mean ratio across expressed spots (or 0.0 if no spot expresses the isoform).

Raises:

ValueError – If layer is missing, group_iso_by is not a var column, or ratio_layer_key equals layer.

Return type:

None

splisosm.utils.compute_feature_summaries(adata, gene_names, layer='counts', group_iso_by='gene_symbol', print_progress=True)#

Compute gene-level and isoform-level summary statistics.

This is the shared implementation behind extract_feature_summary(), extract_feature_summary(), and extract_feature_summary().

Parameters:

adata (AnnData) – Annotated data matrix (typically the filtered AnnData from setup_data).
gene_names (list[str]) – Ordered list of gene display names (length n_genes).
layer (str) – Layer in adata.layers containing raw isoform counts.
group_iso_by (str) – Column in adata.var that groups isoforms by gene.
print_progress (bool) – Show a tqdm progress bar.

Returns:

gene_summary (pandas.DataFrame) – Indexed by gene name with columns: n_isos, perplexity, pct_bin_on, count_avg, count_std, major_ratio_avg.
isoform_summary (pandas.DataFrame) – Indexed by isoform name with original adata.var columns plus: pct_bin_on, count_total, count_avg, count_std, ratio_total, ratio_avg, ratio_std.

Return type:

tuple[DataFrame, DataFrame]

splisosm.utils.counts_to_ratios(counts, transformation='none', nan_filling='mean', fill_before_transform=None)#

Convert isoform counts to proportions.

Spots with zero total counts (“zero-coverage spots”) are handled according to nan_filling and fill_before_transform. When nan_filling='mean' the zero-coverage rows are replaced with the per-isoform mean of the remaining rows. The timing of this fill relative to the ratio transformation is controlled by fill_before_transform:

False (new default): zero-coverage rows are filled with the column-wise mean of the transformed values (i.e. after the ratio transform is applied). For log-ratio transforms (clr, ilr, alr) this means pseudocount-filled rows are replaced with the mean of the true transformed values.
True (old behaviour): zero-coverage rows are filled with the mean of the raw isoform ratios before the transformation is applied. For log-ratio transforms the pseudocount-based rows are kept as-is (no explicit fill).

Note

Behaviour change since v1.1.0: the default filling now happens after transformation. Code that relied on the previous before-transform fill should pass fill_before_transform=True explicitly. Passing fill_before_transform=False adopts the new default without a warning.

After conversion, the isoform ratios can be further transformed using log-ratio-based transformations (clr, ilr, alr) or radial transformation [PYPA22].

Parameters:

counts (ndarray | Tensor) – Shape (n_spots, n_isos). Isoform counts.
transformation (Literal['none', 'clr', 'ilr', 'alr', 'radial']) – Transformation applied to the proportions. Can be one of the following: 'none': no transformation, return isoform ratios. 'clr': centered log-ratio transformation. 'ilr': isometric log-ratio transformation. 'alr': additive log-ratio transformation. 'radial': radial transformation [PYPA22].
nan_filling (Literal['mean', 'none']) – Method to fill all-zero rows. 'mean': fill all-zero rows with the per-isoform mean across expressed spots (timing controlled by fill_before_transform). 'none': do not fill rows and return NaNs at all-zero rows.
fill_before_transform (Optional[bool]) – Controls when zero-coverage rows are mean-filled relative to the ratio transformation. Only relevant when nan_filling='mean' and transformation != 'none'. False: fill after transformation (new default). True: fill before transformation (old behaviour). None (default): use the new default (False) and emit a FutureWarning to inform callers of the behaviour change.

Returns:

ratios – Shape (n_spots, n_isos) or (n_spots, n_isos - 1) if ilr or alr transformation is used.

Return type:

Tensor

Notes

Log-ratio-based transformations (clr, ilr, alr) are implemented via scikit-bio, with a pseudocount of 1% of the global mean per isoform to avoid zeros in the ratio.

splisosm.utils.extract_counts_n_ratios(adata, layer='counts', group_iso_by='gene_symbol', return_sparse=False, filter_single_iso_genes=True)#

Extract per-gene lists of isoform counts and ratios from anndata.

Deprecated since version 1.1.0: Use add_ratio_layer() to compute ratios and store them in adata.layers, then use prepare_inputs_from_anndata() with the ratio layer key to extract per-gene ratio tensors.

Parameters:

adata (AnnData) – Annotated data matrix.
layer (str) – Layer to extract isoform counts (adata.layers[layer]).
group_iso_by (str) – Gene index in adata.var to group isoforms by.
return_sparse (bool) – Whether to return sparse torch tensors for counts_list. If True, ratios_list will be empty and ratio_obs_merged will be None.
filter_single_iso_genes (bool) – Whether to filter out genes with only one isoform. By default True for compatibility with splisosm models.

Returns:

counts_list (list[torch.Tensor]) – Isoform counts per gene, each of shape (n_spots, n_isos).
ratios_list (list[torch.Tensor]) – Isoform ratios per gene, each of shape (n_spots, n_isos).
gene_name_list (list[str]) – Gene names.
ratio_obs_merged (np.ndarray | None) – Observed isoform ratios, shape (n_spots, n_isos_total), or None if return_sparse is True.

Return type:

tuple[list[Tensor], list[Tensor], list[str], Optional[ndarray]]

splisosm.utils.extract_gene_level_statistics(adata, layer='counts', group_iso_by='gene_symbol')#

Extract gene-level metadata from isoform-level counts anndata.

Deprecated since version Use: compute_feature_summaries() instead, which returns both gene-level and isoform-level statistics.

Parameters:

adata (AnnData) – Annotated data matrix.
layer (str) – Layer to extract isoform counts (adata.layers[layer]).
group_iso_by (str) – Gene index in adata.var to group isoforms by.

Returns:

Gene-level metadata.

Return type:

DataFrame

splisosm.utils.false_discovery_control(ps, *, axis=0, method='bh')#

Adjust p-values to control the false discovery rate.

The false discovery rate (FDR) is the expected proportion of rejected null hypotheses that are actually true. If the null hypothesis is rejected when the adjusted p-value falls below a specified level, the false discovery rate is controlled at that level.

Parameters:

ps (numpy.typing.ArrayLike) – The p-values to adjust. Elements must be real numbers between 0 and 1.
axis (Optional[int]) – The axis along which to perform the adjustment. The adjustment is performed independently along each axis-slice. If axis is None, ps is raveled before performing the adjustment.
method (Literal['bh', 'by']) – The false discovery rate control procedure to apply: 'bh' is for Benjamini-Hochberg [BH95] (Eq. 1), 'by' is for Benjamini-Yekutieli [BY01] (Theorem 1.3). The latter is more conservative, but it is guaranteed to control the FDR even when the p-values are not from independent tests.

Returns:

ps_adjusted – The adjusted p-values. If the null hypothesis is rejected where these fall below a specified level, the false discovery rate is controlled at that level.

Return type:

ndarray

Notes

From scipy.stats.false_discovery_control in SciPy v1.13.1. See scipy/scipy.

splisosm.utils.get_cov_sp(coords, k=4, rho=0.99)#

Return the dense standardised CAR spatial covariance matrix.

Constructs a k-mutual-nearest-neighbour graph from coords, builds the CAR precision matrix M = I - rho * D^{-1/2} W_sym D^{-1/2}, and returns K = M⁻¹ with unit marginal variance.

Parameters:

coords (ndarray | Tensor) – Shape (n_spots, n_dims). Euclidean spatial coordinates of spots.
k (int) – Number of mutual nearest neighbors.
rho (float) – Spatial autocorrelation coefficient.

Returns:

cov_sp – Shape (n_spots, n_spots). Spatial covariance matrix with standardized variance (== 1).

Return type:

Tensor

splisosm.utils.prepare_inputs_from_anndata(adata, layer, group_iso_by, spatial_key, adj_key=None, min_counts=0, min_bin_pct=0, filter_single_iso_genes=False, gene_names=None, design_mtx=None, covariate_names=None, min_component_size=1, k_neighbors=4, return_filtered_anndata=False)#

Extract and filter isoform count tensors from an AnnData object.

Shared helper used by both splisosm.hyptest_np.SplisosmNP and splisosm.hyptest_glmm.SplisosmGLMM to prepare legacy-compatible tensors from an AnnData input. Feature filtering, sparse/dense handling, coordinate extraction, and design-matrix resolution are all performed here.

Parameters:

adata (AnnData) – Annotated data matrix.
layer (str) – Key in adata.layers containing raw isoform counts.
group_iso_by (str) – Column in adata.var used to group isoforms by gene.
spatial_key (str) – Key in adata.obsm for spatial coordinates.
adj_key (Optional[str]) – Key in adata.obsp for a pre-built adjacency matrix. When provided, it overrides the k-NN graph construction from coordinates and be used directly to build the spatial kernel. The adjacency matrix is symmetrized internally before returned.
min_counts (int) – Minimum total isoform count across spots required to retain an isoform.
min_bin_pct (float) – Minimum fraction/percentage of spots with non-zero expression for an isoform. Values in [0, 1] are treated as fractions; values in (1, 100] are treated as percentages.
filter_single_iso_genes (bool) – Whether to discard genes with fewer than two retained isoforms.
gene_names (Optional[str]) – Column name in adata.var used as display names for grouped genes. If None, the grouped gene IDs are used.
design_mtx (Optional[Any]) – Design matrix for differential-usage tests. Accepts a tensor/array/dataframe of shape (n_spots, n_factors), a single obs-column name (str), or a list of obs-column names.
covariate_names (Optional[list[str]]) – Explicit covariate names. When design_mtx is given as column name(s) and this is None, the column names are used automatically.
min_component_size (int) – Minimum number of spots a connected component must contain to be retained. Spots in smaller components are removed from all arrays (counts, coordinates, design matrix). Requires building a k-NN graph from coordinates (unless adj_key is provided). Default 1 disables filtering. Emits UserWarning when spots are removed.
k_neighbors (int) – Number of nearest neighbours for the k-NN graph when adj_key is None and component filtering is active. Default 4.
return_filtered_anndata (bool) – If True, return a copy of adata restricted to the spots and isoforms that survived all filtering steps, with columns ordered to match counts_list. Default False returns None as the seventh element.

Returns:

counts_list (list[torch.Tensor]) – Per-gene isoform count tensors, each of shape (n_spots, n_isos). Sparse adata.layers[layer] input yields sparse COO tensors.
coordinates (torch.Tensor) – Shape (n_spots, 2) spatial coordinates, dtype float32.
resolved_gene_names (list[str]) – Display names for each gene in counts_list.
resolved_design (np.ndarray or tensor or None) – Resolved design matrix, or the original object if it was already array-like; None when design_mtx is None.
resolved_covariates (list[str] or None) – Resolved covariate names, or None when design_mtx is None.
adj_matrix (scipy.sparse.spmatrix or None) – The (possibly filtered) adjacency matrix to use for building the spatial kernel. This is:
- adata.obsp[adj_key] (filtered to retained spots) when adj_key is provided, or
- the k-NN adjacency built from coordinates (filtered) when min_component_size > 1 and filtering removed spots, or
- None otherwise (caller should build k-NN inside SpatialCovKernel).
filtered_adata (anndata.AnnData) – A copy of adata restricted to the spots and isoforms that survived all filtering steps (component filtering + min_counts / min_bin_pct / single-isoform-gene filtering). Columns are ordered to match the concatenation order of isoforms across genes in counts_list. This is useful for computing per-feature summary statistics without re-running the filtering logic.

Raises:

ValueError – If required fields are missing from adata, no isoforms survive filtering, or argument values are out of range.

Return type:

tuple[list[Tensor], Tensor, list[str], Optional[Any], Optional[list[str]], spmatrix | None, AnnData | None]

splisosm.utils.run_hsic_gc(counts_gene=None, coordinates=None, null_method='eig', null_configs=None, min_component_size=1, adata=None, layer=None, spatial_key='spatial', adj_key=None, min_counts=0, min_bin_pct=0.0, **spatial_kernel_kwargs)#

Compute the HSIC-GC statistic for gene-level counts.

This function is designed to be a plugin replacement for SPARK-X.

Parameters:

counts_gene (ndarray | Tensor | None) – Shape (n_spots, n_genes). Gene counts.
coordinates (ndarray | Tensor | None) – Shape (n_spots, n_dim). Spatial coordinates of spots.
null_method ({"eig", "trace"}, optional) –
Method for computing the null distribution of the test statistic:
- "eig" (default): asymptotic chi-square mixture using kernel eigenvalues; Liu’s method. Supports optional null_configs["approx_rank"] (int) to restrict to the top-k eigenvalues.
- "trace": moment-matching normal approximation using tr(K’) and tr(K’²) of the centred spatial kernel and the scalar per-gene variance.
null_configs (dict or None, optional) – Extra keyword arguments for the chosen null_method.
min_component_size (int, optional) – Minimum number of spots a connected component must contain to be retained. Spots that belong to components smaller than this threshold are removed from all data structures (counts, coordinates, design matrix) before the spatial kernel is built. Components are detected on the same k-NN graph used for the spatial kernel (controlled by k_neighbors). The default value of 1 disables filtering. A UserWarning is issued whenever spots are removed.
adata (AnnData or None, optional) – If provided, use adata.X (when layer=None) or adata.layers[layer] as counts_gene — a (n_spots, n_genes) count matrix (dense or sparse) — and adata.obsm[spatial_key] for coordinates. Mutually exclusive with counts_gene / coordinates.
layer (str or None, optional) – Layer key in adata.layers. When None (default), adata.X is used. Used only in AnnData mode.
spatial_key (str, optional) – Key in adata.obsm for spatial coordinates. Used only in AnnData mode.
adj_key (str or None, optional) – Key in adata.obsp for a pre-built adjacency matrix. When provided in AnnData mode, the adjacency is loaded from adata.obsp[adj_key] and used for both component filtering and the spatial kernel construction. Ignored in matrix mode.
min_counts (int, optional) – Minimum total count to retain a gene in AnnData mode. Default 0.
min_bin_pct (float, optional) – Minimum fraction of spots expressing a gene (count > 0). Default 0.
**spatial_kernel_kwargs – Additional arguments forwarded to SpatialCovKernel.

Returns:

Results with keys:

'statistic': np.ndarray of shape (n_genes,).
'pvalue': np.ndarray of shape (n_genes,).
'pvalue_adj': np.ndarray of shape (n_genes,).
'method': "hsic-gc".
'null_method': the value of null_method.
'n_spots': number of spots after component filtering.

Return type:

dict

splisosm.utils.run_sparkx(counts_gene, coordinates)#

Wrapper for running the SPARK-X test for spatial gene expression variability.

It runs the R-package SPARK [ZSZ21] via rpy2.

Parameters:

counts_gene (ndarray | Tensor) – Shape (n_spots, n_genes), the observed gene counts.
coordinates (ndarray | Tensor) – Shape (n_spots, 2), the spatial coordinates.

Returns:

Results of the SPARK-X spatial variability test with keys:

'statistic': np.ndarray of shape (n_genes,). Mean SPARK-X statistics.
'pvalue': np.ndarray of shape (n_genes,). Combined p-values.
'pvalue_adj': np.ndarray of shape (n_genes,). Adjusted combined p-values.
'method': str. Method name “spark-x”.

Return type:

dict