splisosm.utils
==============

.. py:module:: splisosm.utils

.. autoapi-nested-parse::

   General utilities for preprocessing and statistical helpers.


Functions
---------

.. autoapisummary::

   splisosm.utils.add_ratio_layer
   splisosm.utils.compute_feature_summaries
   splisosm.utils.counts_to_ratios
   splisosm.utils.extract_counts_n_ratios
   splisosm.utils.extract_gene_level_statistics
   splisosm.utils.false_discovery_control
   splisosm.utils.get_cov_sp
   splisosm.utils.prepare_inputs_from_anndata
   splisosm.utils.run_hsic_gc
   splisosm.utils.run_sparkx


Module Contents
---------------

.. py:function:: add_ratio_layer(adata, layer, group_iso_by, ratio_layer_key, fill_nan_with_mean = False)

   Compute within-gene isoform usage ratios and store them as a new layer.

   For each spot and gene, every isoform's ratio is its count divided by the
   total count of that gene at that spot.  The result is written to
   ``adata.layers[ratio_layer_key]`` **in-place**.

   :param adata: Annotated data matrix.
   :param layer: Key in ``adata.layers`` containing raw isoform counts.
   :param group_iso_by: Column in ``adata.var`` grouping isoforms to genes.
   :param ratio_layer_key: Key under which to store the computed ratio matrix.  Must differ
                           from *layer*.
   :param fill_nan_with_mean: How to handle spots where the gene has zero total counts.

                              * ``False`` (default): store as a **sparse CSR matrix** with the same
                                sparsity structure as the count layer.  Spots with zero gene total
                                have ratio 0 for all isoforms of that gene (structural zeros, not
                                stored explicitly).
                              * ``True``: store as a dense float32 matrix; spots with zero gene
                                total are filled with the per-isoform mean ratio across expressed
                                spots (or 0.0 if no spot expresses the isoform).
   :type fill_nan_with_mean: bool, optional

   :raises ValueError: If *layer* is missing, *group_iso_by* is not a ``var`` column, or
       *ratio_layer_key* equals *layer*.


.. py:function:: compute_feature_summaries(adata, gene_names, layer = 'counts', group_iso_by = 'gene_symbol', print_progress = True)

   Compute gene-level and isoform-level summary statistics.

   This is the shared implementation behind
   :meth:`~splisosm.SplisosmNP.extract_feature_summary`,
   :meth:`~splisosm.SplisosmFFT.extract_feature_summary`, and
   :meth:`~splisosm.SplisosmGLMM.extract_feature_summary`.

   :param adata: Annotated data matrix (typically the filtered AnnData from ``setup_data``).
   :param gene_names: Ordered list of gene display names (length ``n_genes``).
   :param layer: Layer in ``adata.layers`` containing raw isoform counts.
   :param group_iso_by: Column in ``adata.var`` that groups isoforms by gene.
   :param print_progress: Show a tqdm progress bar.

   :returns: * **gene_summary** (*pandas.DataFrame*) -- Indexed by gene name with columns: ``n_isos``, ``perplexity``,
               ``pct_bin_on``, ``count_avg``, ``count_std``, ``major_ratio_avg``.
             * **isoform_summary** (*pandas.DataFrame*) -- Indexed by isoform name with original ``adata.var`` columns plus:
               ``pct_bin_on``, ``count_total``, ``count_avg``, ``count_std``,
               ``ratio_total``, ``ratio_avg``, ``ratio_std``.


.. py:function:: counts_to_ratios(counts, transformation = 'none', nan_filling = 'mean', fill_before_transform = None)

   Convert isoform counts to proportions.

   Spots with zero total counts ("zero-coverage spots") are handled according to
   ``nan_filling`` and ``fill_before_transform``.  When ``nan_filling='mean'`` the
   zero-coverage rows are replaced with the per-isoform mean of the remaining rows.
   The timing of this fill relative to the ratio transformation is controlled by
   ``fill_before_transform``:

   * ``False`` (**new default**): zero-coverage rows are filled with the column-wise
     mean of the **transformed** values (i.e. after the ratio transform is applied).
     For log-ratio transforms (clr, ilr, alr) this means pseudocount-filled rows are
     replaced with the mean of the true transformed values.
   * ``True`` (**old behaviour**): zero-coverage rows are filled with the mean of the
     raw isoform ratios **before** the transformation is applied.  For log-ratio
     transforms the pseudocount-based rows are kept as-is (no explicit fill).

   .. note::
       **Behaviour change since v1.1.0:** the default filling now happens *after*
       transformation.  Code that relied on the previous before-transform fill
       should pass ``fill_before_transform=True`` explicitly.  Passing
       ``fill_before_transform=False`` adopts the new default without a warning.

   After conversion, the isoform ratios can be further transformed using log-ratio-based
   transformations (clr, ilr, alr) or radial transformation :cite:`park2022kernel`.

   :param counts: Shape (n_spots, n_isos). Isoform counts.
   :param transformation: Transformation applied to the proportions. Can be one of the following:
                          ``'none'``: no transformation, return isoform ratios.
                          ``'clr'``: centered log-ratio transformation.
                          ``'ilr'``: isometric log-ratio transformation.
                          ``'alr'``: additive log-ratio transformation.
                          ``'radial'``: radial transformation :cite:`park2022kernel`.
   :param nan_filling: Method to fill all-zero rows.
                       ``'mean'``: fill all-zero rows with the per-isoform mean across expressed spots
                       (timing controlled by ``fill_before_transform``).
                       ``'none'``: do not fill rows and return NaNs at all-zero rows.
   :param fill_before_transform: Controls when zero-coverage rows are mean-filled relative to the ratio
                                 transformation.  Only relevant when ``nan_filling='mean'`` and
                                 ``transformation != 'none'``.
                                 ``False``: fill **after** transformation (new default).
                                 ``True``: fill **before** transformation (old behaviour).
                                 ``None`` (default): use the new default (``False``) and emit a
                                 :class:`FutureWarning` to inform callers of the behaviour change.

   :returns: **ratios** -- Shape (n_spots, n_isos) or (n_spots, n_isos - 1) if ilr or alr transformation is used.
   :rtype: torch.Tensor

   .. rubric:: Notes

   Log-ratio-based transformations (clr, ilr, alr) are implemented via ``scikit-bio``, with
   a pseudocount of 1% of the global mean per isoform to avoid zeros in the ratio.


.. py:function:: extract_counts_n_ratios(adata, layer = 'counts', group_iso_by = 'gene_symbol', return_sparse = False, filter_single_iso_genes = True)

   Extract per-gene lists of isoform counts and ratios from anndata.

   .. deprecated:: 1.1.0
       Use :func:`add_ratio_layer` to compute ratios and store them in
       ``adata.layers``, then use
       :func:`prepare_inputs_from_anndata` with the ratio layer key to
       extract per-gene ratio tensors.

   :param adata: Annotated data matrix.
   :param layer: Layer to extract isoform counts (adata.layers[layer]).
   :param group_iso_by: Gene index in adata.var to group isoforms by.
   :param return_sparse: Whether to return sparse torch tensors for counts_list.
                         If True, `ratios_list` will be empty and `ratio_obs_merged` will be None.
   :param filter_single_iso_genes: Whether to filter out genes with only one isoform.
                                   By default True for compatibility with splisosm models.

   :returns: * **counts_list** (*list[torch.Tensor]*) -- Isoform counts per gene, each of shape (n_spots, n_isos).
             * **ratios_list** (*list[torch.Tensor]*) -- Isoform ratios per gene, each of shape (n_spots, n_isos).
             * **gene_name_list** (*list[str]*) -- Gene names.
             * **ratio_obs_merged** (*np.ndarray | None*) -- Observed isoform ratios, shape (n_spots, n_isos_total), or None if `return_sparse` is True.


.. py:function:: extract_gene_level_statistics(adata, layer = 'counts', group_iso_by = 'gene_symbol')

   Extract gene-level metadata from isoform-level counts anndata.

   .. deprecated::
       Use :func:`compute_feature_summaries` instead, which returns both
       gene-level and isoform-level statistics.

   :param adata: Annotated data matrix.
   :param layer: Layer to extract isoform counts (adata.layers[layer]).
   :param group_iso_by: Gene index in adata.var to group isoforms by.

   :returns: Gene-level metadata.
   :rtype: pandas.DataFrame


.. py:function:: false_discovery_control(ps, *, axis = 0, method = 'bh')

   Adjust p-values to control the false discovery rate.

   The false discovery rate (FDR) is the expected proportion of rejected null
   hypotheses that are actually true.
   If the null hypothesis is rejected when the *adjusted* p-value falls below
   a specified level, the false discovery rate is controlled at that level.

   :param ps: The p-values to adjust. Elements must be real numbers between 0 and 1.
   :param axis: The axis along which to perform the adjustment. The adjustment is
                performed independently along each axis-slice. If `axis` is None, `ps`
                is raveled before performing the adjustment.
   :param method: The false discovery rate control procedure to apply: ``'bh'`` is for
                  Benjamini-Hochberg :cite:`benjamini1995controlling` (Eq. 1), ``'by'`` is for Benjamini-Yekutieli
                  :cite:`benjamini2001control` (Theorem 1.3). The latter is more conservative, but it is
                  guaranteed to control the FDR even when the p-values are not from
                  independent tests.

   :returns: **ps_adjusted** -- The adjusted p-values. If the null hypothesis is rejected where these
             fall below a specified level, the false discovery rate is controlled
             at that level.
   :rtype: numpy.ndarray

   .. rubric:: Notes

   From `scipy.stats.false_discovery_control` in SciPy v1.13.1.
   See https://github.com/scipy/scipy/blob/v1.13.1/scipy/stats/_morestats.py#L4737.


.. py:function:: get_cov_sp(coords, k = 4, rho = 0.99)

   Return the dense standardised CAR spatial covariance matrix.

   Constructs a k-mutual-nearest-neighbour graph from *coords*, builds the
   CAR precision matrix M = I - rho * D^{-1/2} W_sym D^{-1/2}, and
   returns K = M⁻¹ with unit marginal variance.

   :param coords: Shape (n_spots, n_dims). Euclidean spatial coordinates of spots.
   :param k: Number of mutual nearest neighbors.
   :param rho: Spatial autocorrelation coefficient.

   :returns: **cov_sp** -- Shape (n_spots, n_spots). Spatial covariance matrix with standardized variance (== 1).
   :rtype: torch.Tensor


.. py:function:: prepare_inputs_from_anndata(adata, layer, group_iso_by, spatial_key, adj_key = None, min_counts = 0, min_bin_pct = 0, filter_single_iso_genes = False, gene_names = None, design_mtx = None, covariate_names = None, min_component_size = 1, k_neighbors = 4, return_filtered_anndata = False)

   Extract and filter isoform count tensors from an AnnData object.

   Shared helper used by both :class:`splisosm.hyptest_np.SplisosmNP` and
   :class:`splisosm.hyptest_glmm.SplisosmGLMM` to prepare legacy-compatible
   tensors from an AnnData input.  Feature filtering, sparse/dense handling,
   coordinate extraction, and design-matrix resolution are all performed here.

   :param adata: Annotated data matrix.
   :param layer: Key in ``adata.layers`` containing raw isoform counts.
   :param group_iso_by: Column in ``adata.var`` used to group isoforms by gene.
   :param spatial_key: Key in ``adata.obsm`` for spatial coordinates.  Optional when
                       ``adj_key`` is provided: if the key is missing from ``adata.obsm`` the
                       returned ``coordinates`` is ``None`` and downstream kernel construction
                       proceeds from the adjacency matrix alone.  Raw coordinates are still
                       required by SPARK-X and the GP-conditional DU test; those callers
                       raise a dedicated error at call time when coordinates are absent.
   :param adj_key: Key in ``adata.obsp`` for a pre-built adjacency matrix.
                   When provided, it overrides the k-NN graph construction
                   from coordinates and be used directly to build the spatial kernel.
                   The adjacency matrix is symmetrized internally before returned.
   :param min_counts: Minimum total isoform count across spots required to retain an isoform.
   :param min_bin_pct: Minimum fraction/percentage of spots with non-zero expression for an
                       isoform.  Values in ``[0, 1]`` are treated as fractions; values in
                       ``(1, 100]`` are treated as percentages.
   :param filter_single_iso_genes: Whether to discard genes with fewer than two retained isoforms.
   :param gene_names: Column name in ``adata.var`` used as display names for grouped genes.
                      If ``None``, the grouped gene IDs are used.
   :param design_mtx: Design matrix for differential-usage tests.  Accepts a
                      tensor/array/dataframe of shape ``(n_spots, n_factors)``, a single
                      obs-column name (str), or a list of obs-column names.
   :param covariate_names: Explicit covariate names.  When ``design_mtx`` is given as column
                           name(s) and this is ``None``, the column names are used automatically.
   :param min_component_size: Minimum number of spots a connected component must contain to be
                              retained.  Spots in smaller components are removed from all arrays
                              (counts, coordinates, design matrix).  Requires building a k-NN
                              graph from coordinates (unless ``adj_key`` is provided).
                              Default 1 disables filtering.  Emits ``UserWarning`` when spots
                              are removed.
   :param k_neighbors: Number of nearest neighbours for the k-NN graph when
                       ``adj_key`` is ``None`` and component filtering is active.
                       Default 4.
   :param return_filtered_anndata: If ``True``, return a copy of ``adata`` restricted to the spots and
                                   isoforms that survived all filtering steps, with columns ordered to
                                   match ``counts_list``.  Default ``False`` returns ``None`` as the
                                   seventh element.

   :returns: * **counts_list** (*list[torch.Tensor]*) -- Per-gene isoform count tensors, each of shape ``(n_spots, n_isos)``.
               Sparse ``adata.layers[layer]`` input yields sparse COO tensors.
             * **coordinates** (*torch.Tensor or None*) -- Shape ``(n_spots, n_spatial_dims)`` spatial coordinates, dtype float32.  ``None``
               when ``spatial_key`` is missing from ``adata.obsm`` and ``adj_key``
               supplies the neighborhood graph instead.
             * **resolved_gene_names** (*list[str]*) -- Display names for each gene in ``counts_list``.
             * **resolved_design** (*np.ndarray or tensor or None*) -- Resolved design matrix, or the original object if it was already
               array-like; ``None`` when ``design_mtx`` is ``None``.
             * **resolved_covariates** (*list[str] or None*) -- Resolved covariate names, or ``None`` when ``design_mtx`` is ``None``.
             * **adj_matrix** (*scipy.sparse.spmatrix or None*) -- The (possibly filtered) adjacency matrix to use for building the
               spatial kernel.  This is:

               * ``adata.obsp[adj_key]`` (filtered to retained spots) when
                 ``adj_key`` is provided, or
               * the k-NN adjacency built from coordinates (filtered) when
                 ``min_component_size > 1`` and filtering removed spots, or
               * ``None`` otherwise (caller should build k-NN inside
                 :class:`~splisosm.kernel.SpatialCovKernel`).
             * **filtered_adata** (*anndata.AnnData*) -- A copy of ``adata`` restricted to the spots and isoforms that survived
               all filtering steps (component filtering + min_counts / min_bin_pct /
               single-isoform-gene filtering).  Columns are ordered to match the
               concatenation order of isoforms across genes in ``counts_list``.
               This is useful for computing per-feature summary statistics without
               re-running the filtering logic.

   :raises ValueError: If required fields are missing from ``adata``, no isoforms survive
       filtering, or argument values are out of range.


.. py:function:: run_hsic_gc(counts_gene = None, coordinates = None, null_method = 'eig', null_configs = None, min_component_size = 1, adata = None, layer = None, spatial_key = 'spatial', adj_key = None, min_counts = 0, min_bin_pct = 0.0, **spatial_kernel_kwargs)

   Compute the HSIC-GC statistic for gene-level counts.

   This function is designed to be a plugin replacement for SPARK-X.

   :param counts_gene: Shape ``(n_spots, n_genes)``. Gene counts.
   :param coordinates: Shape ``(n_spots, n_dim)``. Spatial coordinates of spots.
   :param null_method: Method for computing the null distribution of the test statistic:

                       * ``"eig"`` (default): asymptotic chi-square mixture using kernel
                         eigenvalues; Liu's method.  Supports optional
                         ``null_configs["approx_rank"]`` (int) to restrict to the top-k
                         eigenvalues.
                       * ``"clt"``: moment-matching normal (Central Limit Theorem)
                         approximation using tr(K') and tr(K'²) of the centred spatial
                         kernel and the scalar per-gene variance.  ``"trace"`` is accepted
                         as a deprecated alias.
                       * ``"welch"``: Welch-Satterthwaite scaled chi-squared approximation
                         using the same two traces as ``"clt"``.  Typically more accurate
                         in the right tail than ``"clt"`` at the same cost.
   :type null_method: {"eig", "clt", "welch"}, optional
   :param null_configs: Extra keyword arguments for the chosen ``null_method``.
   :type null_configs: dict or None, optional
   :param min_component_size: Minimum number of spots a connected component must contain to be
                              retained.  Spots that belong to components smaller than this
                              threshold are removed from all data structures (counts, coordinates,
                              design matrix) before the spatial kernel is built. Components are detected
                              on the same k-NN graph used for the spatial kernel (controlled by ``k_neighbors``).
                              The default value of ``1`` disables filtering.
                              A ``UserWarning`` is issued whenever spots are removed.
   :type min_component_size: int, optional
   :param adata: If provided, use ``adata.X`` (when ``layer=None``) or
                 ``adata.layers[layer]`` as ``counts_gene`` — a ``(n_spots, n_genes)``
                 count matrix (dense or sparse) — and ``adata.obsm[spatial_key]`` for
                 coordinates.  Mutually exclusive with ``counts_gene`` / ``coordinates``.
   :type adata: AnnData or None, optional
   :param layer: Layer key in ``adata.layers``.  When ``None`` (default), ``adata.X``
                 is used.  Used only in AnnData mode.
   :type layer: str or None, optional
   :param spatial_key: Key in ``adata.obsm`` for spatial coordinates.  Used only in
                       AnnData mode and optional when ``adj_key`` is provided: if the key
                       is missing from ``adata.obsm`` the kernel is built from the
                       adjacency alone.
   :type spatial_key: str, optional
   :param adj_key: Key in ``adata.obsp`` for a pre-built adjacency matrix.  When
                   provided in AnnData mode, the adjacency is loaded from
                   ``adata.obsp[adj_key]`` and used for both component filtering and
                   the spatial kernel construction.  In AnnData mode this also makes
                   ``spatial_key`` optional.  Ignored in matrix mode.
   :type adj_key: str or None, optional
   :param min_counts: Minimum total count to retain a gene in AnnData mode.  Default 0.
   :type min_counts: int, optional
   :param min_bin_pct: Minimum fraction of spots expressing a gene (count > 0).  Default 0.
   :type min_bin_pct: float, optional
   :param \*\*spatial_kernel_kwargs: Additional arguments forwarded to :class:`~splisosm.kernel.SpatialCovKernel`.

   :returns: Results with keys:

             - ``'statistic'``: np.ndarray of shape (n_genes,).
             - ``'pvalue'``: np.ndarray of shape (n_genes,).
             - ``'pvalue_adj'``: np.ndarray of shape (n_genes,).
             - ``'method'``: ``"hsic-gc"``.
             - ``'null_method'``: the value of *null_method*.
             - ``'n_spots'``: number of spots after component filtering.
   :rtype: dict


.. py:function:: run_sparkx(counts_gene, coordinates)

   Wrapper for running the SPARK-X test for spatial gene expression variability.

   It runs the R-package SPARK :cite:`zhu2021spark` via rpy2.

   :param counts_gene: Shape (n_spots, n_genes), the observed gene counts.
   :param coordinates: Shape (n_spots, 2), the spatial coordinates.

   :returns: Results of the SPARK-X spatial variability test with keys:

             - ``'statistic'``: np.ndarray of shape (n_genes,). Mean SPARK-X statistics.
             - ``'pvalue'``: np.ndarray of shape (n_genes,). Combined p-values.
             - ``'pvalue_adj'``: np.ndarray of shape (n_genes,). Adjusted combined p-values.
             - ``'method'``: str. Method name "spark-x".
   :rtype: dict