splisosm.utils

General utilities for preprocessing and statistical helpers.

Functions

`counts_to_ratios`(counts[, transformation, nan_filling])	Convert isoform counts to proportions.
`extract_counts_n_ratios`(adata[, layer, group_iso_by, ...])	Extract per-gene lists of isoform counts and ratios from anndata.
`extract_gene_level_statistics`(adata[, layer, group_iso_by])	Extract gene-level metadata from isoform-level counts anndata.
`false_discovery_control`(ps, *[, axis, method])	Adjust p-values to control the false discovery rate.
`get_cov_sp`(coords[, k, rho])	Wrapper function to get the spatial covariance matrix from spatial coordinates.
`load_visium_sp_meta`(adata, path_to_spatial[, library_id])	Helper function to load Visium spatial metadata.
`prepare_inputs_from_anndata`(adata, layer, ...)	Extract and filter isoform count tensors from an AnnData object.
`run_hsic_gc`(counts_gene, coordinates[, approx_rank])	Function to compute HSIC-GC statistic for gene-level counts.
`run_sparkx`(counts_gene, coordinates)	Wrapper for running the SPARK-X test for spatial gene expression variability.

Module Contents

splisosm.utils.counts_to_ratios(counts, transformation='none', nan_filling='mean')

Convert isoform counts to proportions.

By default, isoform ratios at zero-coverage spots are filled with the mean ratio per isoform across all spots. After conversion, the isoform ratios can be further transformed using log-ratio-based transformations (clr, ilr, alr) or radial transformation [PYPA22].

Parameters:

counts (ndarray | Tensor) – Shape (n_spots, n_isos). Isoform counts.
transformation (Literal['none', 'clr', 'ilr', 'alr', 'radial']) – Transformation applied to the proportions. One of:

'none': no transformation; return the isoform ratios.
'clr': centered log-ratio transformation.
'ilr': isometric log-ratio transformation.
'alr': additive log-ratio transformation.
'radial': radial transformation [PYPA22].

nan_filling (Literal['mean', 'none']) – Method to fill all-zero rows. One of:

'mean': fill all-zero rows with the per-isoform mean ratio (the column mean across spots) before transformation.
'none': do not fill; return NaNs at all-zero rows.

Returns:

ratios – Shape (n_spots, n_isos), or (n_spots, n_isos - 1) if the ilr or alr transformation is used.

Return type:

Tensor

Notes

Log-ratio-based transformations (clr, ilr, alr) are implemented via scikit-bio, with a pseudocount of 1% of the global mean per isoform to avoid zeros in the ratio.
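
The conversion described above can be illustrated without the library. The following is a minimal pure-Python sketch, not the splisosm implementation: the helper names counts_to_ratios_sketch and clr are hypothetical, the fill value is assumed to be the per-isoform mean over spots with non-zero coverage, and the pseudocount step mentioned in the Notes is omitted, so clr here expects strictly positive ratios.

```python
import math


def counts_to_ratios_sketch(counts, nan_filling="mean"):
    """Convert rows of isoform counts (one row per spot) to proportions."""
    n_isos = len(counts[0])
    # Ratios are undefined (None) at spots where every isoform count is zero.
    ratios = [
        [c / sum(row) for c in row] if sum(row) > 0 else None
        for row in counts
    ]
    covered = [r for r in ratios if r is not None]
    # Assumption: the fill value is the per-isoform mean over covered spots.
    if covered:
        col_mean = [sum(r[j] for r in covered) / len(covered) for j in range(n_isos)]
    else:
        col_mean = [math.nan] * n_isos
    fill = col_mean if nan_filling == "mean" else [math.nan] * n_isos
    return [r if r is not None else list(fill) for r in ratios]


def clr(composition):
    """Centered log-ratio of one strictly positive composition."""
    logs = [math.log(p) for p in composition]
    center = sum(logs) / len(logs)
    return [v - center for v in logs]


counts = [[3, 1], [0, 0], [1, 1]]
ratios = counts_to_ratios_sketch(counts)
print(ratios[1])       # the zero-coverage spot, filled with the column means
print(clr(ratios[0]))  # CLR components sum to zero
```

The second spot has no coverage, so with nan_filling="mean" it receives the average of the other two spots' ratios; with nan_filling="none" it would hold NaNs instead.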

splisosm.utils.extract_counts_n_ratios(adata, layer='counts', group_iso_by='gene_symbol', return_sparse=False, filter_single_iso_genes=True)

Extract per-gene lists of isoform counts and ratios from anndata.

Parameters:

adata (AnnData) – Annotated data matrix.
layer (str) – Layer from which to extract isoform counts (adata.layers[layer]).
group_iso_by (str) – Column in adata.var by which isoforms are grouped into genes.
return_sparse (bool) – Whether to return sparse torch tensors for counts_list. If True, ratios_list will be empty and ratio_obs_merged will be None.
filter_single_iso_genes (bool) – Whether to filter out genes with only one isoform. By default True for compatibility with splisosm models.

Returns:

counts_list (list[torch.Tensor]) – Isoform counts per gene, each of shape (n_spots, n_isos).
ratios_list (list[torch.Tensor]) – Isoform ratios per gene, each of shape (n_spots, n_isos).
gene_name_list (list[str]) – Gene names.
ratio_obs_merged (np.ndarray | None) – Observed isoform ratios, shape (n_spots, n_isos_total), or None if return_sparse is True.

Return type:

tuple[list[Tensor], list[Tensor], list[str], Optional[ndarray]]
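
The grouping step can be pictured as splitting the columns of one spots-by-isoforms matrix by gene label. Below is a minimal pure-Python sketch under stated assumptions: group_isoforms_by_gene is a hypothetical helper, plain lists stand in for the AnnData layer and the adata.var column, and genes are returned in order of first appearance, which may differ from the library's ordering.

```python
from collections import defaultdict


def group_isoforms_by_gene(counts, gene_of_isoform, filter_single_iso_genes=True):
    """Split a (n_spots, n_isos_total) count matrix into per-gene matrices."""
    # Collect the column indices that belong to each gene label.
    columns_by_gene = defaultdict(list)
    for j, gene in enumerate(gene_of_isoform):
        columns_by_gene[gene].append(j)

    counts_list, gene_name_list = [], []
    for gene, cols in columns_by_gene.items():
        # Genes with a single isoform carry no isoform-usage information.
        if filter_single_iso_genes and len(cols) < 2:
            continue
        counts_list.append([[row[j] for j in cols] for row in counts])
        gene_name_list.append(gene)
    return counts_list, gene_name_list


# Two spots, four isoforms: three belong to gene "A", one to gene "B".
counts = [[5, 1, 2, 0], [3, 3, 0, 4]]
gene_of_isoform = ["A", "A", "B", "A"]
counts_list, gene_name_list = group_isoforms_by_gene(counts, gene_of_isoform)
print(gene_name_list)
print(counts_list[0])
```

Gene "B" is dropped because it has only one isoform, mirroring filter_single_iso_genes=True.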

splisosm.utils.extract_gene_level_statistics(adata, layer='counts', group_iso_by='gene_symbol')

Extract gene-level metadata from isoform-level counts anndata.

Parameters:

adata (AnnData) – Annotated data matrix.
layer (str) – Layer from which to extract isoform counts (adata.layers[layer]).
group_iso_by (str) – Column in adata.var by which isoforms are grouped into genes.

Returns:

Gene-level metadata with columns:

'n_iso': int. Number of isoforms per gene.
'pct_spot_on': float. Percentage of spots with non-zero counts.
'count_avg': float. Average counts per gene.
'count_std': float. Standard deviation of counts per gene.
'perplexity': float. Expression-based effective number of isoforms.
'major_ratio_avg': float. Average ratio of the major isoform.

Return type:

DataFrame
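
To make a few of these columns concrete, here is a minimal pure-Python sketch for a single gene. It is an illustration only: gene_stats_sketch is a hypothetical helper, 'perplexity' is assumed to be the exponential of the Shannon entropy of the gene's pooled isoform proportions, and 'pct_spot_on' is computed as a fraction in [0, 1]. The library's exact definitions and scaling may differ.

```python
import math


def gene_stats_sketch(counts):
    """Summarize one gene's (n_spots, n_isos) count matrix."""
    n_spots, n_iso = len(counts), len(counts[0])
    spot_totals = [sum(row) for row in counts]
    iso_totals = [sum(row[j] for row in counts) for j in range(n_iso)]
    grand_total = sum(iso_totals)

    # Pooled isoform proportions across all spots (assumes grand_total > 0).
    props = [t / grand_total for t in iso_totals]
    # Assumed definition: perplexity = exp(Shannon entropy of the proportions).
    entropy = -sum(p * math.log(p) for p in props if p > 0)

    return {
        "n_iso": n_iso,
        "pct_spot_on": sum(t > 0 for t in spot_totals) / n_spots,  # a fraction here
        "count_avg": grand_total / n_spots,
        "perplexity": math.exp(entropy),
    }


stats = gene_stats_sketch([[2, 2], [0, 0], [3, 3]])
print(stats)
```

Two equally expressed isoforms give a perplexity of 2, while a gene dominated by one isoform approaches 1.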

splisosm.utils.false_discovery_control(ps, *, axis=0, method='bh')

Adjust p-values to control the false discovery rate.

The false discovery rate (FDR) is the expected proportion of rejected null hypotheses that are actually true. If the null hypothesis is rejected when the adjusted p-value falls below a specified level, the false discovery rate is controlled at that level.

Parameters:

ps (numpy.typing.ArrayLike) – The p-values to adjust. Elements must be real numbers between 0 and 1.
axis (Optional[int]) – The axis along which to perform the adjustment. The adjustment is performed independently along each axis-slice. If axis is None, ps is raveled before performing the adjustment.
method (Literal['bh', 'by']) – The false discovery rate control procedure to apply: 'bh' is for Benjamini-Hochberg [BH95] (Eq. 1), 'by' is for Benjamini-Yekutieli [BY01] (Theorem 1.3). The latter is more conservative, but it is guaranteed to control the FDR even when the p-values are not from independent tests.

Returns:

ps_adjusted – The adjusted p-values. If the null hypothesis is rejected where these fall below a specified level, the false discovery rate is controlled at that level.

Return type:

ndarray

Notes

Taken from scipy.stats.false_discovery_control in SciPy v1.13.1; see the scipy/scipy repository.
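
For a one-dimensional input, both procedures reduce to a few lines. The following is a minimal pure-Python sketch of the step-up adjustment, not the vectorized SciPy code: fdr_adjust is a hypothetical helper that handles a flat list only and performs no input validation.

```python
def fdr_adjust(ps, method="bh"):
    """Adjust a flat list of p-values with Benjamini-Hochberg or Benjamini-Yekutieli."""
    m = len(ps)
    order = sorted(range(m), key=lambda i: ps[i])  # indices by ascending p-value
    # Benjamini-Yekutieli multiplies by the harmonic number 1 + 1/2 + ... + 1/m.
    scale = sum(1.0 / k for k in range(1, m + 1)) if method == "by" else 1.0

    adjusted = [0.0] * m
    running_min = 1.0  # also caps adjusted p-values at 1
    # Walk from the largest p-value down, enforcing monotonicity via a running minimum.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, ps[i] * m * scale / rank)
        adjusted[i] = running_min
    return adjusted


ps = [0.01, 0.04, 0.03, 0.5]
print(fdr_adjust(ps, "bh"))
print(fdr_adjust(ps, "by"))
```

The Benjamini-Yekutieli values are never smaller than the Benjamini-Hochberg ones, which reflects the more conservative behavior noted above.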

splisosm.utils.get_cov_sp(coords, k=4, rho=0.99)

Wrapper function to get the spatial covariance matrix from spatial coordinates.

It first constructs a mutual k-nearest-neighbor graph from the Euclidean spatial coordinates, then converts the adjacency matrix to a standardized spatial covariance matrix using the intrinsic conditional autoregressive (ICAR) model with spatial autocorrelation coefficient rho. See [SRF+23] for details.

Parameters:

coords (ndarray | Tensor) – Shape (n_spots, n_dims). Euclidean spatial coordinates of spots.
k (int) – Number of nearest neighbors.
rho (float) – Spatial autocorrelation coefficient.

Returns:

cov_sp – Shape (n_spots, n_spots). Spatial covariance matrix, standardized so that the variance equals 1.

Return type:

Tensor
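
The first step, building the mutual k-nearest-neighbor graph, can be sketched in pure Python. This is an illustration under stated assumptions, not the library's code: mutual_knn_adjacency is a hypothetical helper, two spots are linked only when each is among the other's k nearest neighbors, and distance ties are broken by index order. The subsequent ICAR covariance step is not shown.

```python
import math


def mutual_knn_adjacency(coords, k):
    """Return a symmetric 0/1 adjacency matrix of the mutual k-NN graph."""
    n = len(coords)
    neighbors = []
    for i in range(n):
        # Rank all other spots by Euclidean distance to spot i.
        others = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: math.dist(coords[i], coords[j]),
        )
        neighbors.append(set(others[:k]))
    # Keep an edge only if the neighbor relation holds in both directions.
    return [
        [1 if (j in neighbors[i] and i in neighbors[j]) else 0 for j in range(n)]
        for i in range(n)
    ]


coords = [(0.0, 0.0), (1.0, 0.0), (2.5, 0.0), (10.0, 0.0)]
adj = mutual_knn_adjacency(coords, k=1)
for row in adj:
    print(row)
```

With k=1, only the first two spots are each other's nearest neighbor, so the isolated spot at x=10 ends up with no edges. Requiring mutuality is what keeps distant outliers from attaching to the graph.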

splisosm.utils.load_visium_sp_meta(adata, path_to_spatial, library_id=None)

Helper function to load Visium spatial metadata.

Parameters:

adata (AnnData) – Annotated data matrix to store the spatial metadata.
path_to_spatial (str | pathlib.Path) – Path to the spatial folder generated by Space Ranger.
library_id (Optional[str]) – Library ID of the spatial data.

Returns:

anndata – AnnData with spatial metadata.

Return type:

AnnData

splisosm.utils.prepare_inputs_from_anndata(adata, layer, group_iso_by, spatial_key, min_counts, min_bin_pct, filter_single_iso_genes, gene_names, design_mtx, covariate_names)

Extract and filter isoform count tensors from an AnnData object.

Shared helper used by both splisosm.hyptest_np.SplisosmNP and splisosm.hyptest_glmm.SplisosmGLMM to prepare legacy-compatible tensors from an AnnData input. Feature filtering, sparse/dense handling, coordinate extraction, and design-matrix resolution are all performed here.

Parameters:

adata (AnnData) – Annotated data matrix.
layer (str) – Key in adata.layers containing raw isoform counts.
group_iso_by (str) – Column in adata.var used to group isoforms by gene.
spatial_key (str) – Key in adata.obsm for spatial coordinates.
min_counts (int) – Minimum total isoform count across spots required to retain an isoform.
min_bin_pct (float) – Minimum fraction/percentage of spots with non-zero expression for an isoform. Values in [0, 1] are treated as fractions; values in (1, 100] are treated as percentages.
filter_single_iso_genes (bool) – Whether to discard genes with fewer than two retained isoforms.
gene_names (Optional[str]) – Column name in adata.var used as display names for grouped genes. If None, the grouped gene IDs are used.
design_mtx (Optional[Any]) – Design matrix for differential-usage tests. Accepts a tensor/array/dataframe of shape (n_spots, n_factors), a single obs-column name (str), or a list of obs-column names.
covariate_names (Optional[list[str]]) – Explicit covariate names. When design_mtx is given as column name(s) and this is None, the column names are used automatically.

Returns:

counts_list (list[torch.Tensor]) – Per-gene isoform count tensors, each of shape (n_spots, n_isos). Sparse adata.layers[layer] input yields sparse COO tensors.
coordinates (torch.Tensor) – Shape (n_spots, 2) spatial coordinates, dtype float32.
resolved_gene_names (list[str]) – Display names for each gene in counts_list.
resolved_design (np.ndarray or tensor or None) – Resolved design matrix, or the original object if it was already array-like; None when design_mtx is None.
resolved_covariates (list[str] or None) – Resolved covariate names, or None when design_mtx is None.

Raises:

ValueError – If required fields are missing from adata, no isoforms survive filtering, or argument values are out of range.

Return type:

tuple[list[Tensor], Tensor, list[str], Optional[Any], Optional[list[str]]]
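
The dual interpretation of min_bin_pct and the two isoform filters can be sketched as follows. This is a minimal illustration, not the library's code: normalize_min_bin_pct and keep_isoform are hypothetical helpers, and treating both thresholds as inclusive is an assumption.

```python
def normalize_min_bin_pct(min_bin_pct):
    """Map min_bin_pct to a fraction: [0, 1] as-is, (1, 100] as a percentage."""
    if 0 <= min_bin_pct <= 1:
        return float(min_bin_pct)
    if 1 < min_bin_pct <= 100:
        return min_bin_pct / 100.0
    raise ValueError(f"min_bin_pct must lie in [0, 100], got {min_bin_pct}")


def keep_isoform(column, min_counts, min_bin_pct):
    """Decide whether one isoform (its counts across spots) survives filtering."""
    frac_on = sum(c > 0 for c in column) / len(column)
    # Assumption: both thresholds are inclusive.
    return sum(column) >= min_counts and frac_on >= normalize_min_bin_pct(min_bin_pct)


isoform_counts = [0, 3, 0, 7]  # expressed in half of the spots, 10 counts in total
print(keep_isoform(isoform_counts, min_counts=10, min_bin_pct=50))   # percentage form
print(keep_isoform(isoform_counts, min_counts=10, min_bin_pct=0.5))  # fraction form
print(keep_isoform(isoform_counts, min_counts=11, min_bin_pct=0.5))  # too few counts
```

Note that a value of exactly 1 is read as the fraction 1.0 (all spots), not as 1 percent.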

splisosm.utils.run_hsic_gc(counts_gene, coordinates, approx_rank=None, **spatial_kernel_kwargs)

Function to compute HSIC-GC statistic for gene-level counts.

This function is designed to be a drop-in replacement for SPARK-X.

Parameters:

counts_gene (ndarray | Tensor) – Shape (n_spots, n_genes). Gene counts.
coordinates (ndarray | Tensor) – Shape (n_spots, 2). Spatial coordinates of spots.
approx_rank (Optional[int]) – Approximate rank of the spatial kernel matrix.
**spatial_kernel_kwargs (Any) – Additional arguments for SpatialCovKernel.

Returns:

Results of the HSIC-GC spatial variability test with keys:

'statistic': np.ndarray of shape (n_genes,). HSIC-GC statistics.
'pvalue': np.ndarray of shape (n_genes,). P-values.
'pvalue_adj': np.ndarray of shape (n_genes,). Adjusted p-values.
'method': str. Method name 'hsic-gc'.

Return type:

dict
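
For intuition about what an HSIC-type spatial variability statistic measures, the sketch below computes the generic biased HSIC estimate, tr(KHLH) / (n - 1)^2, for a single gene. It is not the HSIC-GC statistic itself: hsic_linear is a hypothetical helper, a Gaussian kernel on coordinates stands in for SpatialCovKernel, a linear kernel is assumed on the counts, and no p-value or low-rank approximation is computed.

```python
import math


def hsic_linear(y, coords, bandwidth=1.0):
    """Biased HSIC between one gene's counts (linear kernel) and space (Gaussian kernel)."""
    n = len(y)
    mean_y = sum(y) / n
    yc = [v - mean_y for v in y]  # centering the counts applies H on both sides
    total = 0.0
    for i in range(n):
        for j in range(n):
            sq_dist = math.dist(coords[i], coords[j]) ** 2
            k_ij = math.exp(-sq_dist / (2.0 * bandwidth ** 2))
            total += yc[i] * k_ij * yc[j]
    # With L = y y^T, tr(K H L H) equals (H y)^T K (H y).
    return total / (n - 1) ** 2


coords = [(float(x), 0.0) for x in range(6)]
smooth = [0, 0, 0, 1, 1, 1]       # expression clustered in space
alternating = [0, 1, 0, 1, 0, 1]  # same mean and variance, no spatial structure
print(hsic_linear(smooth, coords))
print(hsic_linear(alternating, coords))
```

Both patterns have identical marginal distributions, yet the spatially clustered one yields the larger statistic, which is the kind of dependence between expression and location that a spatial variability test targets.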

splisosm.utils.run_sparkx(counts_gene, coordinates)

Wrapper for running the SPARK-X test for spatial gene expression variability.

It runs the R package SPARK [ZSZ21] via rpy2.

Parameters:

counts_gene (ndarray | Tensor) – Shape (n_spots, n_genes), the observed gene counts.
coordinates (ndarray | Tensor) – Shape (n_spots, 2), the spatial coordinates.

Returns:

Results of the SPARK-X spatial variability test with keys:

'statistic': np.ndarray of shape (n_genes,). Mean SPARK-X statistics.
'pvalue': np.ndarray of shape (n_genes,). Combined p-values.
'pvalue_adj': np.ndarray of shape (n_genes,). Adjusted combined p-values.
'method': str. Method name 'spark-x'.

Return type:

dict