splisosm.hyptest_np
===================

.. py:module:: splisosm.hyptest_np

.. autoapi-nested-parse::

   Non-parametric hypothesis tests for spatial isoform usage.


Classes
-------

.. autoapisummary::

   splisosm.hyptest_np.SplisosmNP


Module Contents
---------------

.. py:class:: SplisosmNP(k_neighbors = 4, rho = 0.99, standardize_cov = True)

   Non-parametric spatial isoform statistical model.

   .. rubric:: Examples

   Spatial variability test:

   >>> from splisosm import SplisosmNP
   >>> # adata : AnnData of shape (n_spots, n_isoforms)
   >>> #   adata.layers["counts"]    — raw isoform counts
   >>> #   adata.var["gene_symbol"]  — column grouping isoforms by gene
   >>> #   adata.obsm["spatial"]     — (n_spots, 2) spatial coordinates
   >>> model = SplisosmNP()
   >>> model.setup_data(adata, layer="counts", group_iso_by="gene_symbol")
   >>> model.test_spatial_variability(method="hsic-ir")
   >>> sv_results = model.get_formatted_test_results("sv")

   Differential usage test:

   >>> model = SplisosmNP()
   >>> model.setup_data(
   ...     adata, layer="counts", group_iso_by="gene_symbol",
   ...     design_mtx="covariate",  # obs column name, or (n_spots, n_factors) array
   ... )
   >>> model.test_differential_usage(method="hsic-gp", residualize="cov_only")
   >>> du_results = model.get_formatted_test_results("du")

   Initialise the model.

   :param k_neighbors: Number of nearest neighbours used to build the spatial adjacency
                       graph for the CAR kernel (default 4).
   :type k_neighbors: int, optional
   :param rho: Spatial autocorrelation strength in the CAR model (default 0.99).
               Values close to 1 give a smoother spatial kernel.
   :type rho: float, optional
   :param standardize_cov: Whether to standardise the spatial covariance matrix so that its
                           diagonal entries are 1 (default ``True``).
   :type standardize_cov: bool, optional
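
   The interaction of the three constructor arguments can be sketched as follows. This is an illustration only: it assumes a CAR covariance of the form ``inv(I - rho * D^-1/2 W D^-1/2)`` on a symmetrized k-NN graph, which may differ from the exact formulation used by the package, and ``car_kernel`` is a hypothetical helper, not part of the API.

   ```python
   import numpy as np

   def car_kernel(coords, k_neighbors=4, rho=0.99, standardize_cov=True):
       """Hypothetical helper: dense CAR covariance on a symmetrized k-NN graph."""
       n = coords.shape[0]
       # Pairwise squared Euclidean distances; exclude self-matches.
       d2 = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1)
       np.fill_diagonal(d2, np.inf)
       # Binary adjacency from each spot's k nearest neighbours.
       nn_idx = np.argsort(d2, axis=1)[:, :k_neighbors]
       W = np.zeros((n, n))
       W[np.repeat(np.arange(n), k_neighbors), nn_idx.ravel()] = 1.0
       W = np.maximum(W, W.T)  # symmetrize
       # The normalized adjacency has eigenvalues in [-1, 1], so rho < 1
       # keeps the precision matrix positive definite.
       d_inv_sqrt = 1.0 / np.sqrt(W.sum(1))
       W_norm = d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
       K = np.linalg.inv(np.eye(n) - rho * W_norm)
       if standardize_cov:
           # Rescale so that every diagonal entry equals 1.
           s = 1.0 / np.sqrt(np.diag(K))
           K = s[:, None] * K * s[None, :]
       return K

   # 5 x 5 regular grid of spots
   xx, yy = np.meshgrid(np.arange(5.0), np.arange(5.0))
   coords = np.column_stack([xx.ravel(), yy.ravel()])
   K = car_kernel(coords)
   ```

   In this sketch, raising ``rho`` towards 1 shrinks the smallest eigenvalue of the precision matrix, so correlations decay more slowly with graph distance.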


   .. py:method:: extract_feature_summary(level = 'gene', print_progress = True)

      Compute filtered feature-level summary statistics.

      Gene-level statistics are aggregated across all isoforms that passed
      the filters applied in :meth:`setup_data`.  Isoform-level statistics
      are computed per isoform and augmented onto the corresponding rows of
      ``adata.var``.

      Results are cached: repeated calls with the same ``level`` return the
      cached :class:`pandas.DataFrame` without recomputation.

      :param level: Summary granularity.
                    ``'gene'``: one row per gene.
                    ``'isoform'``: one row per isoform that passed filtering.
      :param print_progress: Whether to show a progress bar.

      :returns: For ``level='gene'``, the index is the gene display name and the
                columns are:

                - ``'n_isos'``: int. Number of isoforms retained after filtering.
                - ``'perplexity'``: float. Effective number of isoforms based on
                  the marginal isoform usage entropy.
                - ``'pct_bin_on'``: float. Fraction of spots with non-zero total
                  gene counts.
                - ``'count_avg'``: float. Mean per-spot total count for the gene.
                - ``'count_std'``: float. Std of per-spot total count for the gene.

                For ``level='isoform'``, the index is the isoform name (matching
                ``adata.var_names``) and the columns are the original ``adata.var``
                columns plus:

                - ``'pct_bin_on'``: float. Fraction of spots with count > 0.
                - ``'count_total'``: float. Total counts across all spots.
                - ``'count_avg'``: float. Mean count per spot.
                - ``'count_std'``: float. Std of count per spot.
                - ``'ratio_total'``: float. Fraction of total gene counts
                  attributable to this isoform.
                - ``'ratio_avg'``: float. Mean per-spot isoform usage ratio
                  (computed over spots with non-zero gene coverage).
                - ``'ratio_std'``: float. Std of per-spot isoform usage ratio
                  (computed over spots with non-zero gene coverage).
      :rtype: pandas.DataFrame

      :raises RuntimeError: If :meth:`setup_data` has not been called.
      :raises ValueError: If ``level`` is not ``'gene'`` or ``'isoform'``.
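
      The gene-level columns can be illustrated on a toy count matrix. The sketch below is not the package's implementation: it assumes that perplexity is the exponential of the natural-log entropy of the pooled isoform usage and that the standard deviation is the population one (``ddof=0``).

      ```python
      import numpy as np

      # Toy counts for one gene: 4 spots x 3 isoforms (made-up numbers).
      counts = np.array([
          [4, 0, 0],
          [2, 2, 0],
          [0, 0, 0],
          [1, 1, 2],
      ], dtype=float)

      gene_total = counts.sum(axis=1)            # per-spot total gene count

      # Marginal isoform usage, pooled over all spots.
      usage = counts.sum(axis=0) / counts.sum()
      nonzero = usage[usage > 0]
      entropy = -(nonzero * np.log(nonzero)).sum()

      summary = {
          "n_isos": counts.shape[1],
          "perplexity": float(np.exp(entropy)),  # effective number of isoforms
          "pct_bin_on": float((gene_total > 0).mean()),
          "count_avg": float(gene_total.mean()),
          "count_std": float(gene_total.std()),  # assumption: ddof=0
      }
      ```

      A gene whose counts are spread evenly over its isoforms reaches a perplexity equal to ``n_isos``; a gene dominated by one isoform has a perplexity close to 1.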


   .. py:method:: get_formatted_test_results(test_type, with_gene_summary = False)

      Get formatted test results as a pandas DataFrame.

      :param test_type: Which results to retrieve: ``"sv"`` for spatial variability or
                        ``"du"`` for differential usage.
      :type test_type: {"sv", "du"}
      :param with_gene_summary: If ``True``, append gene-level summary statistics from
                                :meth:`extract_feature_summary` (columns: ``n_isos``,
                                ``perplexity``, ``pct_bin_on``, ``count_avg``, ``count_std``).
      :type with_gene_summary: bool, optional

      :returns: Formatted test results.
      :rtype: pandas.DataFrame


   .. py:method:: setup_data(adata, *, spatial_key = 'spatial', adj_key = None, layer = 'counts', group_iso_by = 'gene_symbol', gene_names = None, design_mtx = None, covariate_names = None, min_counts = 10, min_bin_pct = 0.0, filter_single_iso_genes = True, min_component_size = 1, skip_spatial_kernel = False)

      Set up isoform-level spatial data for hypothesis testing.

      Extracts isoform count tensors from an AnnData object, optionally
      filters disconnected graph components, builds a spatial covariance
      kernel, and resolves the design matrix.

      :param adata: Annotated data matrix.  Counts are read from
                    ``adata.layers[layer]`` grouped by ``group_iso_by``, and
                    spatial coordinates from ``adata.obsm[spatial_key]``.
                    See :func:`splisosm.utils.prepare_inputs_from_anndata` for
                    full preprocessing details.
      :type adata: anndata.AnnData
      :param spatial_key: Key in ``adata.obsm`` for spatial coordinates (default
                          ``"spatial"``).
      :type spatial_key: str, optional
      :param adj_key: Key in ``adata.obsp`` for a pre-built adjacency matrix.
                      When provided, it overrides the k-NN graph construction
                      from coordinates and is used directly to build the spatial kernel.
                      The adjacency matrix is symmetrized internally.
      :type adj_key: str or None, optional
      :param layer: Layer in ``adata.layers`` that stores isoform counts (default
                    ``"counts"``).
      :type layer: str, optional
      :param group_iso_by: Column in ``adata.var`` used to group isoforms by gene
                           (default ``"gene_symbol"``).
      :type group_iso_by: str, optional
      :param gene_names: Column name in ``adata.var`` used as display names for genes.
                         If ``None``, the values of ``group_iso_by`` are used.
      :type gene_names: str or None, optional
      :param design_mtx: Design matrix for differential-usage tests.  Accepts an
                         array/tensor/DataFrame of shape ``(n_spots, n_factors)``, a
                         single obs-column name (str), or a list of obs-column names.
                         Categorical obs columns are one-hot encoded automatically.

                         When a **scipy sparse matrix** is passed directly, it is stored as
                         scipy CSR internally and the differential-usage methods never
                         densify the full matrix: ``"hsic"`` uses a sparse matrix-multiply path
                         in :func:`linear_hsic_test`; ``"t-fisher"`` and ``"t-tippett"``
                         extract group indices directly from the sparse non-zero structure;
                         ``"hsic-gp"`` densifies one column at a time via :meth:`_get_design_col`
                         before GPR fitting (GPR residuals are always dense).

                         All other input types (obs column names, array, tensor, DataFrame)
                         are converted to a dense torch float32 tensor.
      :type design_mtx: tensor, array, DataFrame, scipy sparse matrix, str, or list of str, optional
      :param covariate_names: Explicit covariate names.  When ``design_mtx`` is given as
                              column name(s) and this is ``None``, the column names are used
                              automatically; otherwise auto-generated as ``factor_1``, etc.
      :type covariate_names: list of str or None, optional
      :param min_counts: Minimum total isoform count across spots required to retain an
                         isoform (default 10).
      :type min_counts: int, optional
      :param min_bin_pct: Minimum fraction/percentage of spots where an isoform must be
                          expressed (default 0.0).
      :type min_bin_pct: float, optional
      :param filter_single_iso_genes: Whether to remove genes with fewer than two retained isoforms
                                      (default ``True``).
      :type filter_single_iso_genes: bool, optional
      :param min_component_size: Minimum number of spots a connected component must contain to
                                 be retained.  Spots in smaller components are removed from all
                                 data structures before the spatial kernel is built.  Default 1
                                 disables filtering.  A ``UserWarning`` is issued when spots are
                                 removed.
      :type min_component_size: int, optional
      :param skip_spatial_kernel: If ``True``, skip construction of the CAR spatial kernel and
                                  store an :class:`~splisosm.kernel.IdentityKernel` placeholder as
                                  ``self.sp_kernel`` instead.  Use this when only
                                  :meth:`test_differential_usage` is needed, since that method
                                  fits its own GPR models to handle spatial autocorrelation.
                                  Calling :meth:`test_spatial_variability` on a model set up with
                                  ``skip_spatial_kernel=True`` will raise a ``RuntimeError``.
                                  Default ``False``.
      :type skip_spatial_kernel: bool, optional

      :raises ValueError: If input arguments are invalid or required fields are missing.
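
      The combined effect of ``min_counts``, ``min_bin_pct`` and ``filter_single_iso_genes`` can be sketched as a boolean mask over isoforms. ``filter_isoforms`` is a hypothetical helper written for illustration; it assumes inclusive thresholds and treats ``min_bin_pct`` as a fraction, neither of which is confirmed here.

      ```python
      import numpy as np

      def filter_isoforms(counts, gene_ids, min_counts=10, min_bin_pct=0.0,
                          filter_single_iso_genes=True):
          """Hypothetical helper: boolean mask of isoforms that pass filtering."""
          gene_ids = np.asarray(gene_ids)
          # Assumption: an isoform needs at least `min_counts` total counts and
          # must be detected in at least `min_bin_pct` (a fraction) of spots.
          keep = (counts.sum(axis=0) >= min_counts) & (
              (counts > 0).mean(axis=0) >= min_bin_pct
          )
          if filter_single_iso_genes:
              # Count retained isoforms per gene; drop genes left with fewer than two.
              genes, n_kept = np.unique(gene_ids[keep], return_counts=True)
              multi_iso = genes[n_kept >= 2]
              keep &= np.isin(gene_ids, multi_iso)
          return keep

      # 3 spots x 4 isoforms; isoforms 0-1 belong to gene A, 2-3 to gene B.
      counts = np.array([
          [6, 5, 0, 9],
          [7, 8, 1, 4],
          [5, 0, 0, 3],
      ])
      gene_ids = ["A", "A", "B", "B"]
      mask = filter_isoforms(counts, gene_ids)
      ```

      Here the third isoform fails ``min_counts``, which leaves gene B with a single isoform, so the whole gene is dropped.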


   .. py:method:: test_differential_usage(method = 'hsic-gp', ratio_transformation = 'none', nan_filling = 'mean', gpr_backend = 'sklearn', gpr_configs = None, residualize = 'cov_only', n_jobs = -1, print_progress = True, return_results = False)

      Test for spatial isoform differential usage.

      Before running this method, the design matrix must be set up using :meth:`setup_data`.
      Each column of the design matrix corresponds to a covariate to test for differential
      association with the isoform usage ratios of each gene.
      Test statistics and p-values are computed per (gene, covariate) pair separately.

      Two types of association tests are supported:

      - Unconditional (``"hsic"``, ``"t-fisher"``, ``"t-tippett"``): test the
        unconditional association between isoform usage ratios and the covariate.
      - Conditional (``"hsic-gp"``): test the association conditioned on spatial
        coordinates via Gaussian process regression.  See :cite:`zhang2012kernel`
        for more details.
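
      For the unconditional case, the statistic can be sketched with linear kernels on synthetic data. The normalisation by ``(n - 1) ** 2`` and the variable names are assumptions made for illustration; the package's own estimator and p-value computation are not reproduced here.

      ```python
      import numpy as np

      rng = np.random.default_rng(0)
      n_spots, n_isos = 200, 3

      z = rng.normal(size=(n_spots, 1))                 # one covariate column
      y = rng.dirichlet(np.ones(n_isos), n_spots)       # synthetic usage ratios
      y[:, 0] += 0.1 * z[:, 0]                          # inject an association

      # Centring matrix and linear kernels.
      H = np.eye(n_spots) - np.ones((n_spots, n_spots)) / n_spots
      K_z = z @ z.T
      K_y = y @ y.T

      # Biased HSIC estimator: tr(K_z H K_y H) / (n - 1)^2.
      hsic_kernel = np.trace(K_z @ H @ K_y @ H) / (n_spots - 1) ** 2

      # With linear kernels the same quantity is the squared Frobenius norm
      # of the cross-covariance between the covariate and the responses.
      z_c = z - z.mean(axis=0)
      y_c = y - y.mean(axis=0)
      hsic_cov = np.linalg.norm(z_c.T @ y_c) ** 2 / (n_spots - 1) ** 2
      ```

      The second form avoids building any ``n_spots`` by ``n_spots`` matrix, which is why linear-kernel HSIC scales to large datasets.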

      :param method: Method for association testing:

                     * ``"hsic"``: Unconditional HSIC test (multivariate RV coefficient).
                       For continuous factors, equivalent to the multivariate Pearson correlation
                       test.  For binary factors, equivalent to the two-sample Hotelling's T² test.
                     * ``"hsic-gp"``: Conditional HSIC test.  Spatial effects are removed via
                       Gaussian process regression before computing the HSIC statistic.

                     Or one of the t-test-based methods (binary factors only):

                     * ``"t-fisher"``, ``"t-tippett"``: each isoform is tested independently
                       and p-values are combined gene-wise via Fisher's or Tippett's method.
      :type method: str, optional
      :param ratio_transformation: Compositional transformation for isoform ratios.
                                   One of ``'none'``, ``'clr'``, ``'ilr'``, ``'alr'``, ``'radial'``
                                   :cite:`park2022kernel`.  See :func:`splisosm.utils.counts_to_ratios`.
      :type ratio_transformation: str, optional
      :param nan_filling: How to fill NaN values in isoform ratios.  One of ``'mean'`` or ``'none'``.
                          See :func:`splisosm.utils.counts_to_ratios`.
      :type nan_filling: str, optional
      :param gpr_backend: GPR backend to use for ``method='hsic-gp'``.
                          One of ``'sklearn'`` (default) or ``'gpytorch'``.
                          For FFT-accelerated spatial GP on regular grids use
                          :class:`~splisosm.hyptest_fft.SplisosmFFT` instead.
      :type gpr_backend: str, optional
      :param gpr_configs: Nested configuration dict for the GPR objects, with optional keys
                          ``'covariate'`` and/or ``'isoform'``.  Each sub-dict is forwarded to
                          :func:`splisosm.kernel_gpr.make_kernel_gpr`.  Unspecified keys use the
                          defaults from :data:`splisosm.kernel_gpr._DEFAULT_GPR_CONFIGS`::

                              {
                                  "covariate": {
                                      "constant_value": 1.0,
                                      "constant_value_bounds": (1e-3, 1e3),
                                      "length_scale": 1.0,
                                      "length_scale_bounds": "fixed",
                                      "n_inducing": 5000,
                                  },
                                  "isoform": {
                                      "constant_value": 1.0,
                                      "constant_value_bounds": (1e-3, 1e3),
                                      "length_scale": 1.0,
                                      "length_scale_bounds": "fixed",
                                      "n_inducing": 5000,
                                  },
                              }

                          ``"n_inducing"`` *(int or None)* controls the scale of spatial GP
                          fitting for each backend:

                          * **sklearn** — maximum number of observations used for
                            hyperparameter fitting.  Full exact GP when ``n_obs ≤ n_inducing``
                            (or ``None``); a randomly sub-sampled **subset-of-data** of
                            ``n_inducing`` points otherwise (**not** the same inducing-point
                            approximation as gpytorch).  Default: ``5000``.  Set to ``None``
                            to use all observations (warns when ``n_obs > 10_000``).
                          * **gpytorch** — FITC sparse-GP inducing-point approximation with
                            ``n_inducing`` points; set to ``None`` for exact GP.
                            Default: ``5000``.
      :type gpr_configs: dict, optional
      :param residualize: Controls which signals are spatially residualized when
                          ``method="hsic-gp"``:

                          * ``"cov_only"`` (default): residualize covariates only; test
                            HSIC(Z_res, Y_raw).  Fastest; calibration matches ``"both"``
                            when covariate GPR captures most spatial confounding.
                          * ``"both"``: residualize both covariates and isoform ratios.
      :type residualize: {"cov_only", "both"}, optional
      :param n_jobs: Number of parallel workers for the per-gene loop.  ``-1`` uses all
                     available CPUs.  Each worker densifies one sparse count tensor
                     (~4–40 MB at 100 K–1 M spots × 10 isoforms).  When
                     ``gpr_backend="gpytorch"`` and ``device != "cpu"``, the GPU is not
                     thread-safe; parallelism is automatically disabled.  Default ``-1``.
      :type n_jobs: int, optional
      :param print_progress: Whether to show a progress bar (default ``True``).
      :type print_progress: bool, optional
      :param return_results: Whether to return the test statistics and p-values.
                             If ``False``, the results are stored in ``self._du_test_results``.
      :type return_results: bool, optional

      :returns: **results** -- If ``return_results`` is ``True``, a dict with test statistics and
                p-values.  Otherwise ``None``, with results stored in
                ``self._du_test_results``.
      :rtype: dict or None
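
      The gene-wise combination used by ``"t-fisher"`` and ``"t-tippett"`` can be sketched with the textbook formulas, which assume independent per-isoform tests. The helper names are hypothetical and the p-values are made up.

      ```python
      import math

      def chi2_sf_even_dof(x, k):
          """Survival function of a chi-square variable with 2k degrees of freedom."""
          half = x / 2.0
          return math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(k))

      def combine_fisher(pvals):
          # Fisher: -2 * sum(log p) is chi-square with 2k dof under the null.
          stat = -2.0 * sum(math.log(p) for p in pvals)
          return chi2_sf_even_dof(stat, len(pvals))

      def combine_tippett(pvals):
          # Tippett: based on the smallest of k independent p-values.
          return 1.0 - (1.0 - min(pvals)) ** len(pvals)

      iso_pvals = [0.01, 0.20, 0.50]   # made-up per-isoform t-test p-values
      p_fisher = combine_fisher(iso_pvals)
      p_tippett = combine_tippett(iso_pvals)
      ```

      Fisher's method pools evidence across all isoforms, whereas Tippett's method reacts only to the single strongest isoform.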


   .. py:method:: test_spatial_variability(method = 'hsic-ir', ratio_transformation = 'none', nan_filling = 'mean', null_method = 'eig', null_configs = None, n_jobs = -1, return_results = False, print_progress = True)

      Test for spatial variability.

      Kernel-based multivariate hypothesis testing for spatial variability in

      - gene-level total counts (``"hsic-gc"`` or ``"spark-x"`` :cite:`zhu2021spark`)
      - isoform usage ratios (``"hsic-ir"``)
      - isoform counts (``"hsic-ic"``)

      Test statistics and p-values are computed separately for each gene.
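
      The three test targets can be derived from the same isoform count matrix. The sketch below uses made-up counts and assumes that ``nan_filling="mean"`` replaces undefined ratios with the per-isoform mean over covered spots; see :func:`splisosm.utils.counts_to_ratios` for the actual behaviour.

      ```python
      import numpy as np

      # Toy isoform counts for one gene: 4 spots x 2 isoforms (made-up numbers).
      iso_counts = np.array([                                # target of "hsic-ic"
          [3.0, 1.0],
          [0.0, 0.0],   # spot with no coverage for this gene
          [2.0, 2.0],
          [0.0, 4.0],
      ])

      gene_counts = iso_counts.sum(axis=1, keepdims=True)    # target of "hsic-gc"

      # Usage ratios are undefined (NaN) where the gene has zero counts.
      with np.errstate(invalid="ignore", divide="ignore"):
          ratios = iso_counts / gene_counts

      # Assumed mean filling: per-isoform mean over spots with coverage.
      col_mean = np.nanmean(ratios, axis=0)
      ratios_filled = np.where(np.isnan(ratios), col_mean, ratios)  # target of "hsic-ir"
      ```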

      :param method: Test target: ``"hsic-ir"`` (isoform usage ratios), ``"hsic-ic"``
                     (isoform counts), ``"hsic-gc"`` (gene-level counts), or
                     ``"spark-x"`` (SPARK-X :cite:`zhu2021spark`).
      :type method: {"hsic-ir", "hsic-ic", "hsic-gc", "spark-x"}, optional
      :param ratio_transformation: Compositional transformation applied to isoform ratios when
                                   ``method="hsic-ir"``.  See :func:`splisosm.utils.counts_to_ratios`
                                   and :cite:`park2022kernel` for details.
      :type ratio_transformation: {"none", "clr", "ilr", "alr", "radial"}, optional
      :param nan_filling: Strategy for NaN values in isoform ratios.
                          See :func:`splisosm.utils.counts_to_ratios` for details.
      :type nan_filling: {"mean", "none"}, optional
      :param null_method: Method for computing the null distribution of the test statistic:

                          * ``"eig"`` (default): asymptotic chi-square mixture using kernel
                            eigenvalues; Liu's method :cite:`liu2009new`.  Supports optional
                            ``null_configs["approx_rank"]`` (int) to use only the top-k
                            eigenvalues.  By default, ``approx_rank = np.ceil(np.sqrt(n_spots) * 4)``
                            for large datasets (``n_spots > 5000``).  Set it to ``None`` to use
                            all eigenvalues, which can be slow for large ``n_spots``.
                          * ``"trace"``: moment-matching normal approximation using
                            tr(K') and tr(K'²) of the (centred) spatial kernel.
                          * ``"perm"``: permutation-based null distribution.  Supports
                            optional ``null_configs["n_perms_per_gene"]`` (default 1000),
                            and ``null_configs["perm_batch_size"]`` (default 50, larger values
                            lead to more memory usage) for batch-wise null statistic computation.
      :type null_method: {"eig", "trace", "perm"}, optional
      :param null_configs: Extra keyword arguments for the chosen ``null_method``.
      :type null_configs: dict or None, optional
      :param n_jobs: Number of parallel workers for the per-gene loop.  ``-1`` uses all
                     available CPUs.  Each worker densifies one sparse count tensor
                     (~4–40 MB at 100 K–1 M spots × 10 isoforms) so choose ``n_jobs``
                     to fit within available RAM.  Default ``-1``.
      :type n_jobs: int, optional
      :param return_results: If ``True``, return the result dict.  Otherwise store results in
                             ``self._sv_test_results`` and return ``None``.
      :type return_results: bool, optional
      :param print_progress: Whether to show a progress bar.
      :type print_progress: bool, optional

      :returns: If ``return_results`` is ``True``, a dict with test statistics and p-values.
                Otherwise ``None``, with results stored in ``self._sv_test_results``.
      :rtype: dict or None

      .. rubric:: Notes

      To run the SPARK-X test, the R package ``SPARK`` must be installed and accessible from Python via ``rpy2``.
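
      The idea behind ``null_method="perm"`` can be sketched on synthetic data. ``hsic_perm_test`` is a hypothetical helper, a Gaussian kernel stands in for the CAR spatial kernel, and the normalisation of the statistic is an assumption; batching via ``perm_batch_size`` is omitted.

      ```python
      import math
      import numpy as np

      def hsic_perm_test(K_sp, y, n_perms=1000, seed=0):
          """Hypothetical helper: HSIC statistic with a permutation null."""
          rng = np.random.default_rng(seed)
          n = y.shape[0]
          y_c = y - y.mean(axis=0)
          # Double-centre the spatial kernel once: K' = H K H.
          H = np.eye(n) - np.ones((n, n)) / n
          K_c = H @ K_sp @ H

          def stat(v):
              # tr(K' L) with a linear kernel L = v v^T on the centred response.
              return np.sum((K_c @ v) * v) / (n - 1) ** 2

          observed = stat(y_c)
          # Shuffling spots breaks the link between location and response.
          null = np.array([stat(y_c[rng.permutation(n)]) for _ in range(n_perms)])
          # Add-one correction keeps the p-value strictly positive.
          pval = (1 + np.sum(null >= observed)) / (1 + n_perms)
          return observed, pval

      # Synthetic 1-D layout with a smooth spatial signal plus noise.
      x = np.linspace(0.0, 1.0, 60)
      K_sp = np.exp(-((x[:, None] - x[None, :]) ** 2) / (2 * 0.1 ** 2))
      rng = np.random.default_rng(1)
      y = np.column_stack([np.sin(2 * np.pi * x), np.cos(2 * np.pi * x)])
      y = y + 0.1 * rng.normal(size=y.shape)

      stat_obs, pval = hsic_perm_test(K_sp, y, n_perms=200)

      # Default eigenvalue truncation quoted for "eig", for a hypothetical 10,000 spots.
      approx_rank = math.ceil(math.sqrt(10_000) * 4)
      ```

      The smallest attainable p-value is ``1 / (n_perms + 1)``, so ``n_perms_per_gene`` bounds the resolution of the permutation test.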


   .. py:attribute:: adata
      :type:  Optional[anndata.AnnData]

      Source :class:`~anndata.AnnData` object; ``None`` before :meth:`setup_data`.


   .. py:attribute:: covariate_names
      :type:  list[str]

      Covariate display names (length :attr:`n_factors`).


   .. py:attribute:: design_mtx
      :type:  Optional[torch.Tensor]

      Design matrix ``(n_spots, n_factors)``; ``None`` if no covariates.


   .. py:property:: filtered_adata
      :type: anndata.AnnData


      The filtered AnnData of shape (:attr:`n_spots`, sum(:attr:`n_isos_per_gene`)).

      This is the data used internally after :meth:`setup_data`.
      It is a copy of the input :attr:`adata`, subsetted to the retained spots and isoforms after filtering.

      :raises RuntimeError: If :meth:`setup_data` has not been called.


   .. py:attribute:: gene_names
      :type:  list[str]

      Gene display names (length :attr:`n_genes`).


   .. py:attribute:: n_factors
      :type:  int

      Number of covariates for differential usage testing.


   .. py:attribute:: n_genes
      :type:  int

      Number of genes after filtering.


   .. py:attribute:: n_isos_per_gene
      :type:  list[int]

      Number of isoforms per gene (list of length :attr:`n_genes`).


   .. py:attribute:: n_spots
      :type:  int

      Number of spatial spots/cells.


   .. py:attribute:: sp_kernel
      :type:  Any

      Spatial kernel (:class:`~splisosm.kernel.SpatialCovKernel` or
      :class:`~splisosm.kernel.IdentityKernel`).  Set by :meth:`setup_data`.