splisosm.utils#
General utilities for preprocessing and statistical helpers.
Functions#
|
Compute within-gene isoform usage ratios and store them as a new layer. |
|
Compute gene-level and isoform-level summary statistics. |
|
Convert isoform counts to proportions. |
|
Extract per-gene lists of isoform counts and ratios from anndata. |
|
Extract gene-level metadata from isoform-level counts anndata. |
|
Adjust p-values to control the false discovery rate. |
|
Return the dense standardised CAR spatial covariance matrix. |
|
Extract and filter isoform count tensors from an AnnData object. |
|
Compute the HSIC-GC statistic for gene-level counts. |
|
Wrapper for running the SPARK-X test for spatial gene expression variability. |
Module Contents#
- splisosm.utils.add_ratio_layer(adata, layer, group_iso_by, ratio_layer_key, fill_nan_with_mean=False)#
Compute within-gene isoform usage ratios and store them as a new layer.
For each spot and gene, every isoform’s ratio is its count divided by the total count of that gene at that spot. The result is written to
adata.layers[ratio_layer_key]in-place.- Parameters:
adata (AnnData) – Annotated data matrix.
layer (str) – Key in
adata.layerscontaining raw isoform counts.group_iso_by (str) – Column in
adata.vargrouping isoforms to genes.ratio_layer_key (str) – Key under which to store the computed ratio matrix. Must differ from layer.
fill_nan_with_mean (bool, optional) –
How to handle spots where the gene has zero total counts.
False(default): store as a sparse CSR matrix with the same sparsity structure as the count layer. Spots with zero gene total have ratio 0 for all isoforms of that gene (structural zeros, not stored explicitly).True: store as a dense float32 matrix; spots with zero gene total are filled with the per-isoform mean ratio across expressed spots (or 0.0 if no spot expresses the isoform).
- Raises:
ValueError – If layer is missing, group_iso_by is not a
varcolumn, or ratio_layer_key equals layer.- Return type:
None
- splisosm.utils.compute_feature_summaries(adata, gene_names, layer='counts', group_iso_by='gene_symbol', print_progress=True)#
Compute gene-level and isoform-level summary statistics.
This is the shared implementation behind
extract_feature_summary(),extract_feature_summary(), andextract_feature_summary().- Parameters:
adata (AnnData) – Annotated data matrix (typically the filtered AnnData from
setup_data).gene_names (list[str]) – Ordered list of gene display names (length
n_genes).layer (str) – Layer in
adata.layerscontaining raw isoform counts.group_iso_by (str) – Column in
adata.varthat groups isoforms by gene.print_progress (bool) – Show a tqdm progress bar.
- Returns:
gene_summary (pandas.DataFrame) – Indexed by gene name with columns:
n_isos,perplexity,pct_bin_on,count_avg,count_std,major_ratio_avg.isoform_summary (pandas.DataFrame) – Indexed by isoform name with original
adata.varcolumns plus:pct_bin_on,count_total,count_avg,count_std,ratio_total,ratio_avg,ratio_std.
- Return type:
- splisosm.utils.counts_to_ratios(counts, transformation='none', nan_filling='mean', fill_before_transform=None)#
Convert isoform counts to proportions.
Spots with zero total counts (“zero-coverage spots”) are handled according to
nan_fillingandfill_before_transform. Whennan_filling='mean'the zero-coverage rows are replaced with the per-isoform mean of the remaining rows. The timing of this fill relative to the ratio transformation is controlled byfill_before_transform:False(new default): zero-coverage rows are filled with the column-wise mean of the transformed values (i.e. after the ratio transform is applied). For log-ratio transforms (clr, ilr, alr) this means pseudocount-filled rows are replaced with the mean of the true transformed values.True(old behaviour): zero-coverage rows are filled with the mean of the raw isoform ratios before the transformation is applied. For log-ratio transforms the pseudocount-based rows are kept as-is (no explicit fill).
Note
Behaviour change since v1.1.0: the default filling now happens after transformation. Code that relied on the previous before-transform fill should pass
fill_before_transform=Trueexplicitly. Passingfill_before_transform=Falseadopts the new default without a warning.After conversion, the isoform ratios can be further transformed using log-ratio-based transformations (clr, ilr, alr) or radial transformation [PYPA22].
- Parameters:
counts (ndarray | Tensor) – Shape (n_spots, n_isos). Isoform counts.
transformation (Literal['none', 'clr', 'ilr', 'alr', 'radial']) – Transformation applied to the proportions. Can be one of the following:
'none': no transformation, return isoform ratios.'clr': centered log-ratio transformation.'ilr': isometric log-ratio transformation.'alr': additive log-ratio transformation.'radial': radial transformation [PYPA22].nan_filling (Literal['mean', 'none']) – Method to fill all-zero rows.
'mean': fill all-zero rows with the per-isoform mean across expressed spots (timing controlled byfill_before_transform).'none': do not fill rows and return NaNs at all-zero rows.fill_before_transform (Optional[bool]) – Controls when zero-coverage rows are mean-filled relative to the ratio transformation. Only relevant when
nan_filling='mean'andtransformation != 'none'.False: fill after transformation (new default).True: fill before transformation (old behaviour).None(default): use the new default (False) and emit aFutureWarningto inform callers of the behaviour change.
- Returns:
ratios – Shape (n_spots, n_isos) or (n_spots, n_isos - 1) if ilr or alr transformation is used.
- Return type:
Notes
Log-ratio-based transformations (clr, ilr, alr) are implemented via
scikit-bio, with a pseudocount of 1% of the global mean per isoform to avoid zeros in the ratio.
- splisosm.utils.extract_counts_n_ratios(adata, layer='counts', group_iso_by='gene_symbol', return_sparse=False, filter_single_iso_genes=True)#
Extract per-gene lists of isoform counts and ratios from anndata.
Deprecated since version 1.1.0: Use
add_ratio_layer()to compute ratios and store them inadata.layers, then useprepare_inputs_from_anndata()with the ratio layer key to extract per-gene ratio tensors.- Parameters:
adata (AnnData) – Annotated data matrix.
layer (str) – Layer to extract isoform counts (adata.layers[layer]).
group_iso_by (str) – Gene index in adata.var to group isoforms by.
return_sparse (bool) – Whether to return sparse torch tensors for counts_list. If True,
ratios_listwill be empty andratio_obs_mergedwill be None.filter_single_iso_genes (bool) – Whether to filter out genes with only one isoform. By default True for compatibility with splisosm models.
- Returns:
counts_list (list[torch.Tensor]) – Isoform counts per gene, each of shape (n_spots, n_isos).
ratios_list (list[torch.Tensor]) – Isoform ratios per gene, each of shape (n_spots, n_isos).
gene_name_list (list[str]) – Gene names.
ratio_obs_merged (np.ndarray | None) – Observed isoform ratios, shape (n_spots, n_isos_total), or None if
return_sparseis True.
- Return type:
tuple[list[Tensor], list[Tensor], list[str], Optional[ndarray]]
- splisosm.utils.extract_gene_level_statistics(adata, layer='counts', group_iso_by='gene_symbol')#
Extract gene-level metadata from isoform-level counts anndata.
Deprecated since version Use:
compute_feature_summaries()instead, which returns both gene-level and isoform-level statistics.
- splisosm.utils.false_discovery_control(ps, *, axis=0, method='bh')#
Adjust p-values to control the false discovery rate.
The false discovery rate (FDR) is the expected proportion of rejected null hypotheses that are actually true. If the null hypothesis is rejected when the adjusted p-value falls below a specified level, the false discovery rate is controlled at that level.
- Parameters:
ps (numpy.typing.ArrayLike) – The p-values to adjust. Elements must be real numbers between 0 and 1.
axis (Optional[int]) – The axis along which to perform the adjustment. The adjustment is performed independently along each axis-slice. If
axisis None,psis raveled before performing the adjustment.method (Literal['bh', 'by']) – The false discovery rate control procedure to apply:
'bh'is for Benjamini-Hochberg [BH95] (Eq. 1),'by'is for Benjamini-Yekutieli [BY01] (Theorem 1.3). The latter is more conservative, but it is guaranteed to control the FDR even when the p-values are not from independent tests.
- Returns:
ps_adjusted – The adjusted p-values. If the null hypothesis is rejected where these fall below a specified level, the false discovery rate is controlled at that level.
- Return type:
Notes
From
scipy.stats.false_discovery_controlin SciPy v1.13.1. See scipy/scipy.
- splisosm.utils.get_cov_sp(coords, k=4, rho=0.99)#
Return the dense standardised CAR spatial covariance matrix.
Constructs a k-mutual-nearest-neighbour graph from coords, builds the CAR precision matrix M = I - rho * D^{-1/2} W_sym D^{-1/2}, and returns K = M⁻¹ with unit marginal variance.
- Parameters:
- Returns:
cov_sp – Shape (n_spots, n_spots). Spatial covariance matrix with standardized variance (== 1).
- Return type:
- splisosm.utils.prepare_inputs_from_anndata(adata, layer, group_iso_by, spatial_key, adj_key=None, min_counts=0, min_bin_pct=0, filter_single_iso_genes=False, gene_names=None, design_mtx=None, covariate_names=None, min_component_size=1, k_neighbors=4, return_filtered_anndata=False)#
Extract and filter isoform count tensors from an AnnData object.
Shared helper used by both
splisosm.hyptest_np.SplisosmNPandsplisosm.hyptest_glmm.SplisosmGLMMto prepare legacy-compatible tensors from an AnnData input. Feature filtering, sparse/dense handling, coordinate extraction, and design-matrix resolution are all performed here.- Parameters:
adata (AnnData) – Annotated data matrix.
layer (str) – Key in
adata.layerscontaining raw isoform counts.group_iso_by (str) – Column in
adata.varused to group isoforms by gene.spatial_key (str) – Key in
adata.obsmfor spatial coordinates.adj_key (Optional[str]) – Key in
adata.obspfor a pre-built adjacency matrix. When provided, it overrides the k-NN graph construction from coordinates and be used directly to build the spatial kernel. The adjacency matrix is symmetrized internally before returned.min_counts (int) – Minimum total isoform count across spots required to retain an isoform.
min_bin_pct (float) – Minimum fraction/percentage of spots with non-zero expression for an isoform. Values in
[0, 1]are treated as fractions; values in(1, 100]are treated as percentages.filter_single_iso_genes (bool) – Whether to discard genes with fewer than two retained isoforms.
gene_names (Optional[str]) – Column name in
adata.varused as display names for grouped genes. IfNone, the grouped gene IDs are used.design_mtx (Optional[Any]) – Design matrix for differential-usage tests. Accepts a tensor/array/dataframe of shape
(n_spots, n_factors), a single obs-column name (str), or a list of obs-column names.covariate_names (Optional[list[str]]) – Explicit covariate names. When
design_mtxis given as column name(s) and this isNone, the column names are used automatically.min_component_size (int) – Minimum number of spots a connected component must contain to be retained. Spots in smaller components are removed from all arrays (counts, coordinates, design matrix). Requires building a k-NN graph from coordinates (unless
adj_keyis provided). Default 1 disables filtering. EmitsUserWarningwhen spots are removed.k_neighbors (int) – Number of nearest neighbours for the k-NN graph when
adj_keyisNoneand component filtering is active. Default 4.return_filtered_anndata (bool) – If
True, return a copy ofadatarestricted to the spots and isoforms that survived all filtering steps, with columns ordered to matchcounts_list. DefaultFalsereturnsNoneas the seventh element.
- Returns:
counts_list (list[torch.Tensor]) – Per-gene isoform count tensors, each of shape
(n_spots, n_isos). Sparseadata.layers[layer]input yields sparse COO tensors.coordinates (torch.Tensor) – Shape
(n_spots, 2)spatial coordinates, dtype float32.resolved_gene_names (list[str]) – Display names for each gene in
counts_list.resolved_design (np.ndarray or tensor or None) – Resolved design matrix, or the original object if it was already array-like;
Nonewhendesign_mtxisNone.resolved_covariates (list[str] or None) – Resolved covariate names, or
Nonewhendesign_mtxisNone.adj_matrix (scipy.sparse.spmatrix or None) – The (possibly filtered) adjacency matrix to use for building the spatial kernel. This is:
adata.obsp[adj_key](filtered to retained spots) whenadj_keyis provided, orthe k-NN adjacency built from coordinates (filtered) when
min_component_size > 1and filtering removed spots, orNoneotherwise (caller should build k-NN insideSpatialCovKernel).
filtered_adata (anndata.AnnData) – A copy of
adatarestricted to the spots and isoforms that survived all filtering steps (component filtering + min_counts / min_bin_pct / single-isoform-gene filtering). Columns are ordered to match the concatenation order of isoforms across genes incounts_list. This is useful for computing per-feature summary statistics without re-running the filtering logic.
- Raises:
ValueError – If required fields are missing from
adata, no isoforms survive filtering, or argument values are out of range.- Return type:
tuple[list[Tensor], Tensor, list[str], Optional[Any], Optional[list[str]], spmatrix | None, AnnData | None]
- splisosm.utils.run_hsic_gc(counts_gene=None, coordinates=None, null_method='eig', null_configs=None, min_component_size=1, adata=None, layer=None, spatial_key='spatial', adj_key=None, min_counts=0, min_bin_pct=0.0, **spatial_kernel_kwargs)#
Compute the HSIC-GC statistic for gene-level counts.
This function is designed to be a plugin replacement for SPARK-X.
- Parameters:
counts_gene (ndarray | Tensor | None) – Shape
(n_spots, n_genes). Gene counts.coordinates (ndarray | Tensor | None) – Shape
(n_spots, n_dim). Spatial coordinates of spots.null_method ({"eig", "trace"}, optional) –
Method for computing the null distribution of the test statistic:
"eig"(default): asymptotic chi-square mixture using kernel eigenvalues; Liu’s method. Supports optionalnull_configs["approx_rank"](int) to restrict to the top-k eigenvalues."trace": moment-matching normal approximation using tr(K’) and tr(K’²) of the centred spatial kernel and the scalar per-gene variance.
null_configs (dict or None, optional) – Extra keyword arguments for the chosen
null_method.min_component_size (int, optional) – Minimum number of spots a connected component must contain to be retained. Spots that belong to components smaller than this threshold are removed from all data structures (counts, coordinates, design matrix) before the spatial kernel is built. Components are detected on the same k-NN graph used for the spatial kernel (controlled by
k_neighbors). The default value of1disables filtering. AUserWarningis issued whenever spots are removed.adata (AnnData or None, optional) – If provided, use
adata.X(whenlayer=None) oradata.layers[layer]ascounts_gene— a(n_spots, n_genes)count matrix (dense or sparse) — andadata.obsm[spatial_key]for coordinates. Mutually exclusive withcounts_gene/coordinates.layer (str or None, optional) – Layer key in
adata.layers. WhenNone(default),adata.Xis used. Used only in AnnData mode.spatial_key (str, optional) – Key in
adata.obsmfor spatial coordinates. Used only in AnnData mode.adj_key (str or None, optional) – Key in
adata.obspfor a pre-built adjacency matrix. When provided in AnnData mode, the adjacency is loaded fromadata.obsp[adj_key]and used for both component filtering and the spatial kernel construction. Ignored in matrix mode.min_counts (int, optional) – Minimum total count to retain a gene in AnnData mode. Default 0.
min_bin_pct (float, optional) – Minimum fraction of spots expressing a gene (count > 0). Default 0.
**spatial_kernel_kwargs – Additional arguments forwarded to
SpatialCovKernel.
- Returns:
Results with keys:
'statistic': np.ndarray of shape (n_genes,).'pvalue': np.ndarray of shape (n_genes,).'pvalue_adj': np.ndarray of shape (n_genes,).'method':"hsic-gc".'null_method': the value of null_method.'n_spots': number of spots after component filtering.
- Return type:
dict
- splisosm.utils.run_sparkx(counts_gene, coordinates)#
Wrapper for running the SPARK-X test for spatial gene expression variability.
It runs the R-package SPARK [ZSZ21] via rpy2.
- Parameters:
- Returns:
Results of the SPARK-X spatial variability test with keys:
'statistic': np.ndarray of shape (n_genes,). Mean SPARK-X statistics.'pvalue': np.ndarray of shape (n_genes,). Combined p-values.'pvalue_adj': np.ndarray of shape (n_genes,). Adjusted combined p-values.'method': str. Method name “spark-x”.
- Return type:
dict