splisosm.utils
==============

.. py:module:: splisosm.utils

.. autoapi-nested-parse::

   General utilities for preprocessing and statistical helpers.


Functions
---------

.. autoapisummary::

   splisosm.utils.counts_to_ratios
   splisosm.utils.extract_counts_n_ratios
   splisosm.utils.extract_gene_level_statistics
   splisosm.utils.false_discovery_control
   splisosm.utils.get_cov_sp
   splisosm.utils.load_visium_sp_meta
   splisosm.utils.prepare_inputs_from_anndata
   splisosm.utils.run_hsic_gc
   splisosm.utils.run_sparkx


Module Contents
---------------

.. py:function:: counts_to_ratios(counts, transformation = 'none', nan_filling = 'mean')

   Convert isoform counts to proportions.

   By default, isoform ratios at zero-coverage spots are filled with the mean ratio per isoform across all spots.
   After conversion, the isoform ratios can be further transformed using log-ratio-based transformations
   (clr, ilr, alr) or radial transformation :cite:`park2022kernel`.

   :param counts: Shape (n_spots, n_isos). Isoform counts.
   :param transformation: Transformation applied to the proportions. Can be one of the following:
                          ``'none'``: no transformation, return isoform ratios.
                          ``'clr'``: centered log-ratio transformation.
                          ``'ilr'``: isometric log-ratio transformation.
                          ``'alr'``: additive log-ratio transformation.
                          ``'radial'``: radial transformation :cite:`park2022kernel`.
   :param nan_filling: Method to fill all-zero rows.
                       ``'mean'``: fill all-zero rows with the per-isoform (column-wise) mean ratio across spots **before transformation**.
                       ``'none'``: do not fill rows and return NaNs at all-zero rows.

   :returns: **ratios** -- Shape (n_spots, n_isos) or (n_spots, n_isos - 1) if ilr or alr transformation is used.
   :rtype: torch.Tensor

   .. rubric:: Notes

   Log-ratio-based transformations (clr, ilr, alr) are implemented via ``scikit-bio``, with
   a pseudocount of 1% of the global mean per isoform to avoid zeros in the ratio.
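
The conversion and filling described above can be sketched in plain NumPy. This is a minimal illustration, not the library's implementation: the helper names ``counts_to_ratios_sketch`` and ``clr_sketch`` are hypothetical, the column mean is assumed to be taken over spots with non-zero coverage, and the CLR step omits the pseudocount and therefore assumes strictly positive ratios.

```python
import numpy as np


def counts_to_ratios_sketch(counts, nan_filling="mean"):
    """Row-normalize counts; optionally fill zero-coverage rows (hypothetical helper)."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    # Zero-coverage rows give 0/0 and therefore NaN.
    with np.errstate(invalid="ignore", divide="ignore"):
        ratios = counts / totals
    if nan_filling == "mean":
        # Assumption: the per-isoform mean is computed over covered spots only.
        col_mean = np.nanmean(ratios, axis=0)
        ratios[totals[:, 0] == 0] = col_mean
    return ratios


def clr_sketch(ratios):
    """Centered log-ratio: log of each part minus the row mean of the logs."""
    log_r = np.log(ratios)  # assumes strictly positive ratios (no pseudocount here)
    return log_r - log_r.mean(axis=1, keepdims=True)


counts = np.array([[8, 2], [0, 0], [3, 3]])
ratios = counts_to_ratios_sketch(counts)
clr = clr_sketch(ratios)
```

In this toy input the middle spot has zero coverage, so its row is replaced by the mean ratios of the two covered spots. CLR-transformed rows sum to zero by construction.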


.. py:function:: extract_counts_n_ratios(adata, layer = 'counts', group_iso_by = 'gene_symbol', return_sparse = False, filter_single_iso_genes = True)

   Extract per-gene lists of isoform counts and ratios from anndata.

   :param adata: Annotated data matrix.
   :param layer: Layer to extract isoform counts (adata.layers[layer]).
   :param group_iso_by: Gene index in adata.var to group isoforms by.
   :param return_sparse: Whether to return sparse torch tensors for counts_list.
                         If True, `ratios_list` will be empty and `ratio_obs_merged` will be None.
   :param filter_single_iso_genes: Whether to filter out genes with only one isoform.
                                   By default True for compatibility with splisosm models.

   :returns: * **counts_list** (*list[torch.Tensor]*) -- Isoform counts per gene, each of shape (n_spots, n_isos).
             * **ratios_list** (*list[torch.Tensor]*) -- Isoform ratios per gene, each of shape (n_spots, n_isos).
             * **gene_name_list** (*list[str]*) -- Gene names.
             * **ratio_obs_merged** (*np.ndarray | None*) -- Observed isoform ratios, shape (n_spots, n_isos_total), or None if `return_sparse` is True.


.. py:function:: extract_gene_level_statistics(adata, layer = 'counts', group_iso_by = 'gene_symbol')

   Extract gene-level metadata from isoform-level counts anndata.

   :param adata: Annotated data matrix.
   :param layer: Layer to extract isoform counts (adata.layers[layer]).
   :param group_iso_by: Gene index in adata.var to group isoforms by.

   :returns: Gene-level metadata with columns:

             - ``'n_iso'``: int. Number of isoforms per gene.
             - ``'pct_spot_on'``: float. Percentage of spots with non-zero counts.
             - ``'count_avg'``: float. Average counts per gene.
             - ``'count_std'``: float. Standard deviation of counts per gene.
             - ``'perplexity'``: float. Expression-based effective number of isoforms.
             - ``'major_ratio_avg'``: float. Average ratio of the major isoform.
   :rtype: pandas.DataFrame
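
To make the ``'perplexity'`` column concrete, the sketch below computes an effective number of isoforms as the exponential of the Shannon entropy of isoform proportions. It is an illustration under assumptions: the helper name ``perplexity_sketch`` is hypothetical, and pooling counts across spots before computing proportions is an assumed choice that may differ from what the library does.

```python
import numpy as np


def perplexity_sketch(counts):
    """Effective number of isoforms for one gene (hypothetical helper).

    counts: array of shape (n_spots, n_isos).
    """
    counts = np.asarray(counts, dtype=float)
    iso_totals = counts.sum(axis=0)
    # Assumption: proportions are pooled over all spots.
    p = iso_totals / iso_totals.sum()
    p = p[p > 0]  # 0 * log(0) is treated as 0
    entropy = -(p * np.log(p)).sum()
    return float(np.exp(entropy))


balanced = perplexity_sketch([[5, 5], [5, 5]])   # two equally used isoforms
dominant = perplexity_sketch([[10, 0], [4, 0]])  # only one isoform expressed
```

A gene whose isoforms are used equally has a perplexity equal to its isoform count, while a gene dominated by a single isoform has a perplexity close to 1.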


.. py:function:: false_discovery_control(ps, *, axis = 0, method = 'bh')

   Adjust p-values to control the false discovery rate.

   The false discovery rate (FDR) is the expected proportion of rejected null
   hypotheses that are actually true.
   If the null hypothesis is rejected when the *adjusted* p-value falls below
   a specified level, the false discovery rate is controlled at that level.

   :param ps: The p-values to adjust. Elements must be real numbers between 0 and 1.
   :param axis: The axis along which to perform the adjustment. The adjustment is
                performed independently along each axis-slice. If `axis` is None, `ps`
                is raveled before performing the adjustment.
   :param method: The false discovery rate control procedure to apply: ``'bh'`` is for
                  Benjamini-Hochberg :cite:`benjamini1995controlling` (Eq. 1), ``'by'`` is for Benjamini-Yekutieli
                  :cite:`benjamini2001control` (Theorem 1.3). The latter is more conservative, but it is
                  guaranteed to control the FDR even when the p-values are not from
                  independent tests.

   :returns: **ps_adjusted** -- The adjusted p-values. If the null hypothesis is rejected where these
             fall below a specified level, the false discovery rate is controlled
             at that level.
   :rtype: numpy.ndarray

   .. rubric:: Notes

   From `scipy.stats.false_discovery_control` in SciPy v1.13.1.
   See https://github.com/scipy/scipy/blob/v1.13.1/scipy/stats/_morestats.py#L4737.
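
For intuition, the Benjamini-Hochberg adjustment can be written in a few lines of NumPy. This is a simplified sketch for 1-D input only, with a hypothetical helper name ``bh_adjust``; it does not handle the ``axis`` argument or the ``'by'`` method.

```python
import numpy as np


def bh_adjust(ps):
    """Benjamini-Hochberg adjusted p-values for a 1-D input (hypothetical helper)."""
    ps = np.asarray(ps, dtype=float)
    m = ps.size
    order = np.argsort(ps)
    # Scale the i-th smallest p-value by m / i.
    scaled = ps[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity: running minimum from the largest p-value downwards.
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.clip(scaled, 0.0, 1.0)
    return adjusted


adj = bh_adjust([0.01, 0.04, 0.03, 0.5])
```

Rejecting every hypothesis whose adjusted p-value falls below a level ``alpha`` then controls the FDR at ``alpha`` under the procedure's assumptions.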


.. py:function:: get_cov_sp(coords, k = 4, rho = 0.99)

   Wrapper function to get the spatial covariance matrix from spatial coordinates.

   It first constructs a mutual k-nearest-neighbor graph from the Euclidean spatial coordinates,
   then convert the adjacency matrix to a standardized spatial covariance matrix using the
   intrinsic conditional autoregressive (ICAR) model with spatial autocorrelation coefficient rho.
   See :cite:`su2023smoother` for details.

   :param coords: Shape (n_spots, n_dims). Euclidean spatial coordinates of spots.
   :param k: Number of nearest neighbors.
   :param rho: Spatial autocorrelation coefficient.

   :returns: **cov_sp** -- Shape (n_spots, n_spots). Spatial covariance matrix with standardized variance (== 1).
   :rtype: torch.Tensor
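
The two steps described above (mutual k-nearest-neighbor graph, then an autoregressive covariance) can be sketched with dense NumPy arrays. This is a rough illustration under stated assumptions, not the library's code: the helper name ``cov_sp_sketch`` is hypothetical, the precision matrix is assumed to take the form ``D - rho * W``, isolated spots are given a degree of 1 to keep the matrix invertible, and standardization is assumed to mean rescaling to a unit diagonal.

```python
import numpy as np


def cov_sp_sketch(coords, k=4, rho=0.99):
    """Dense spatial covariance from a mutual kNN graph (hypothetical helper)."""
    coords = np.asarray(coords, dtype=float)
    n = coords.shape[0]
    # Pairwise Euclidean distances; exclude self-matches.
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)
    # Directed kNN graph: knn[i, j] is True if j is among the k nearest of i.
    nn_idx = np.argsort(dist, axis=1)[:, :k]
    knn = np.zeros((n, n), dtype=bool)
    knn[np.repeat(np.arange(n), k), nn_idx.ravel()] = True
    # Mutual kNN: keep an edge only if it exists in both directions.
    w = (knn & knn.T).astype(float)
    # Assumption: floor the degree at 1 so isolated spots stay invertible.
    deg = np.maximum(w.sum(axis=1), 1.0)
    # Assumption: precision of the form D - rho * W.
    precision = np.diag(deg) - rho * w
    cov = np.linalg.inv(precision)
    # Assumption: standardize so every spot has variance 1.
    scale = np.sqrt(np.diag(cov))
    return cov / np.outer(scale, scale)


rng = np.random.default_rng(0)
coords = rng.uniform(size=(20, 2))
cov_sp = cov_sp_sketch(coords, k=4, rho=0.99)
```

The dense inverse is only practical for small numbers of spots; it is used here to keep the sketch short.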


.. py:function:: load_visium_sp_meta(adata, path_to_spatial, library_id = None)

   Helper function to load Visium spatial metadata.

   :param adata: Annotated data matrix to store the spatial metadata.
   :param path_to_spatial: Path to the `spatial` folder generated by Space Ranger.
   :param library_id: Library ID of the spatial data.

   :returns: **anndata** -- AnnData with spatial metadata.
   :rtype: anndata.AnnData


.. py:function:: prepare_inputs_from_anndata(adata, layer, group_iso_by, spatial_key, min_counts, min_bin_pct, filter_single_iso_genes, gene_names, design_mtx, covariate_names)

   Extract and filter isoform count tensors from an AnnData object.

   Shared helper used by both :class:`splisosm.hyptest_np.SplisosmNP` and
   :class:`splisosm.hyptest_glmm.SplisosmGLMM` to prepare legacy-compatible
   tensors from an AnnData input.  Feature filtering, sparse/dense handling,
   coordinate extraction, and design-matrix resolution are all performed here.

   :param adata: Annotated data matrix.
   :param layer: Key in ``adata.layers`` containing raw isoform counts.
   :param group_iso_by: Column in ``adata.var`` used to group isoforms by gene.
   :param spatial_key: Key in ``adata.obsm`` for spatial coordinates.
   :param min_counts: Minimum total isoform count across spots required to retain an isoform.
   :param min_bin_pct: Minimum fraction/percentage of spots with non-zero expression for an
                       isoform.  Values in ``[0, 1]`` are treated as fractions; values in
                       ``(1, 100]`` are treated as percentages.
   :param filter_single_iso_genes: Whether to discard genes with fewer than two retained isoforms.
   :param gene_names: Column name in ``adata.var`` used as display names for grouped genes.
                      If ``None``, the grouped gene IDs are used.
   :param design_mtx: Design matrix for differential-usage tests.  Accepts a
                      tensor/array/dataframe of shape ``(n_spots, n_factors)``, a single
                      obs-column name (str), or a list of obs-column names.
   :param covariate_names: Explicit covariate names.  When ``design_mtx`` is given as column
                           name(s) and this is ``None``, the column names are used automatically.

   :returns: * **counts_list** (*list[torch.Tensor]*) -- Per-gene isoform count tensors, each of shape ``(n_spots, n_isos)``.
               Sparse ``adata.layers[layer]`` input yields sparse COO tensors.
             * **coordinates** (*torch.Tensor*) -- Shape ``(n_spots, 2)`` spatial coordinates, dtype float32.
             * **resolved_gene_names** (*list[str]*) -- Display names for each gene in ``counts_list``.
             * **resolved_design** (*np.ndarray or tensor or None*) -- Resolved design matrix, or the original object if it was already
               array-like; ``None`` when ``design_mtx`` is ``None``.
             * **resolved_covariates** (*list[str] or None*) -- Resolved covariate names, or ``None`` when ``design_mtx`` is ``None``.

   :raises ValueError: If required fields are missing from ``adata``, no isoforms survive
       filtering, or argument values are out of range.
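
The dual interpretation of ``min_bin_pct`` (fraction versus percentage) and the two isoform filters can be illustrated as follows. The helper names ``resolve_min_bin_fraction`` and ``keep_isoforms`` are hypothetical, and treating both thresholds as inclusive is an assumption.

```python
import numpy as np


def resolve_min_bin_fraction(min_bin_pct):
    """Map a value in [0, 1] or (1, 100] to a fraction (hypothetical helper)."""
    if not 0 <= min_bin_pct <= 100:
        raise ValueError("min_bin_pct must lie in [0, 100]")
    # Values up to 1 are already fractions; larger values are percentages.
    return float(min_bin_pct) if min_bin_pct <= 1 else min_bin_pct / 100.0


def keep_isoforms(counts, min_counts, min_bin_pct):
    """Boolean mask of isoforms (columns) passing both filters (hypothetical helper)."""
    counts = np.asarray(counts)
    total = counts.sum(axis=0)            # total count per isoform
    frac_on = (counts > 0).mean(axis=0)   # fraction of spots expressing it
    # Assumption: both thresholds are inclusive.
    return (total >= min_counts) & (frac_on >= resolve_min_bin_fraction(min_bin_pct))


counts = np.array([[5, 0, 1], [3, 0, 0], [2, 1, 0], [0, 0, 0]])
mask = keep_isoforms(counts, min_counts=2, min_bin_pct=50)
```

With these toy counts, only the first isoform has enough total counts and is expressed in at least half of the spots.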


.. py:function:: run_hsic_gc(counts_gene, coordinates, approx_rank = None, **spatial_kernel_kwargs)

   Compute the HSIC-GC statistic for gene-level counts.

   This function is designed as a drop-in replacement for SPARK-X.

   :param counts_gene: Shape (n_spots, n_genes). Gene counts.
   :param coordinates: Shape (n_spots, 2). Spatial coordinates of spots.
   :param approx_rank: Approximate rank of the spatial kernel matrix.
   :param \*\*spatial_kernel_kwargs: Additional arguments for SpatialCovKernel.

   :returns: Results of the HSIC-GC spatial variability test with keys:

             - ``'statistic'``: np.ndarray of shape (n_genes,). HSIC-GC statistics.
             - ``'pvalue'``: np.ndarray of shape (n_genes,). P-values.
             - ``'pvalue_adj'``: np.ndarray of shape (n_genes,). Adjusted p-values.
             - ``'method'``: str. Method name "hsic-gc".
   :rtype: dict


.. py:function:: run_sparkx(counts_gene, coordinates)

   Wrapper for running the SPARK-X test for spatial gene expression variability.

   It runs the R package SPARK :cite:`zhu2021spark` via rpy2.

   :param counts_gene: Shape (n_spots, n_genes), the observed gene counts.
   :param coordinates: Shape (n_spots, 2), the spatial coordinates.

   :returns: Results of the SPARK-X spatial variability test with keys:

             - ``'statistic'``: np.ndarray of shape (n_genes,). Mean SPARK-X statistics.
             - ``'pvalue'``: np.ndarray of shape (n_genes,). Combined p-values.
             - ``'pvalue_adj'``: np.ndarray of shape (n_genes,). Adjusted combined p-values.
             - ``'method'``: str. Method name "spark-x".
   :rtype: dict