splisosm.hyptest_np
===================

.. py:module:: splisosm.hyptest_np

.. autoapi-nested-parse::

   Non-parametric hypothesis tests for spatial isoform usage.


Classes
-------

.. autoapisummary::

   splisosm.hyptest_np.SplisosmNP


Module Contents
---------------

.. py:class:: SplisosmNP

   Non-parametric spatial isoform statistical model.

   .. rubric:: Examples

   Setup data:

   >>> from splisosm import SplisosmNP
   >>> import torch
   >>> # Simulate data for 10 genes with different numbers of isoforms
   >>> data_3_iso = [torch.randint(low=0, high=5, size=(100, 3)) for _ in range(5)]  # 5 genes with 3 isoforms
   >>> data_4_iso = [torch.randint(low=0, high=5, size=(100, 4)) for _ in range(5)]  # 5 genes with 4 isoforms
   >>> data = data_3_iso + data_4_iso
   >>> coordinates = torch.rand(100, 2)  # 100 spots with 2D coordinates
   >>> design_mtx = torch.rand(100, 2)  # 100 spots with 2 covariates

   Spatial variability test:

   >>> model = SplisosmNP()
   >>> model.setup_data(data, coordinates)
   >>> model.test_spatial_variability(method='hsic-ir')
   >>> sv_results = model.get_formatted_test_results('sv')
   >>> print(sv_results.head())

   Differential usage test:

   >>> model = SplisosmNP()
   >>> model.setup_data(data, coordinates, design_mtx=design_mtx)
   >>> model.test_differential_usage(method='hsic')
   >>> du_results = model.get_formatted_test_results('du')
   >>> print(du_results.head())


   .. py:method:: get_formatted_test_results(test_type)

      Return the formatted test results as a data frame.

      :param test_type: Type of test results to retrieve. Can be one of ``'sv'`` (spatial variability) or ``'du'`` (differential usage).

      :returns: Formatted test results.
      :rtype: pandas.DataFrame


   .. py:method:: setup_data(data = None, coordinates = None, approx_rank = None, design_mtx = None, gene_names = None, covariate_names = None, *, adata = None, spatial_key = 'spatial', layer = 'counts', group_iso_by = 'gene_symbol', min_counts = 10, min_bin_pct = 0.0, filter_single_iso_genes = True)

      Set up isoform-level spatial data for hypothesis testing.

      This method supports two input modes for backward compatibility.

      - Legacy mode: pass ``data`` and ``coordinates`` directly.
      - AnnData mode: pass ``adata``, where counts are extracted from
        ``adata.layers[layer]`` grouped by ``group_iso_by``, and coordinates
        are read from ``adata.obsm[spatial_key]``.
        See :func:`splisosm.utils.prepare_inputs_from_anndata` for details.

      :param data: Legacy mode only. List of tensors/arrays with shape
                   ``(n_spots, n_isos)`` containing isoform counts for each gene.
      :param coordinates: Legacy mode only. Shape ``(n_spots, 2)``, spatial coordinates.
      :param approx_rank: The rank of the low-rank approximation for the spatial covariance matrix.
                          If None, use the full-rank dense covariance matrix.
                          For larger datasets (n_spots > 5,000), the maximum rank is set to ``4 * sqrt(n_spots)``.
      :param design_mtx: Design matrix for differential usage tests.

                         - Legacy mode: tensor/array/dataframe of shape ``(n_spots, n_factors)``.
                         - AnnData mode: tensor/array/dataframe, or one obs-column name
                           (str), or a list of obs-column names.

                         When ``design_mtx`` contains categorical obs columns in AnnData mode,
                         they are automatically one-hot encoded. Covariate names are inferred
                         when not explicitly provided (see ``covariate_names`` below).
      :param gene_names: Gene names.

                         - Legacy mode: list of gene names.
                         - AnnData mode: optional column name in ``adata.var`` used as
                           display names for grouped genes; if None, use grouped gene IDs.
      :param covariate_names: List of covariate names. If not provided, names are inferred as follows:

                              - In **AnnData mode with column name(s)**: column names are used, with
                                categorical columns expanded to one-hot encoded names (e.g., ``col_cat0``,
                                ``col_cat1`` for ``col`` if it has categorical values).
                              - In **legacy mode with DataFrame**: DataFrame column names are used.
                              - **Otherwise**: default names like ``factor_1``, ``factor_2``, etc. are generated.

                              When explicitly provided, must match the number of factors in the
                              design matrix (after any categorical encoding/one-hot expansion).
      :param adata: AnnData mode only. AnnData object from which counts and coordinates are read.
      :param spatial_key: Key in ``adata.obsm`` for spatial coordinates.
      :param layer: Counts layer in ``adata.layers``.
      :param group_iso_by: Column in ``adata.var`` used to group isoforms by gene.
      :param min_counts: Minimum total isoform count across spots required to retain an isoform
                         in AnnData mode.
      :param min_bin_pct: Minimum percentage/fraction of spots where an isoform is expressed in
                          AnnData mode. Values in ``[0, 1]`` are treated as fractions; values in
                          ``(1, 100]`` are treated as percentages.
      :param filter_single_iso_genes: AnnData mode only. Whether to remove genes with fewer than two retained
                                      isoforms.

      :raises ValueError: If input arguments are invalid or required fields are missing.
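
      The AnnData-mode isoform filters can be pictured with a small NumPy sketch.
      The helper names ``normalize_min_bin_pct`` and ``filter_isoforms`` are
      hypothetical and not part of ``splisosm``. The inclusive ``>=`` comparisons
      are an assumption, since the thresholds above are only described as minimums.

      ```python
      import numpy as np


      def normalize_min_bin_pct(min_bin_pct):
          """Map a fraction in [0, 1] or a percentage in (1, 100] to a fraction."""
          if not 0 <= min_bin_pct <= 100:
              raise ValueError("min_bin_pct must lie in [0, 100]")
          return min_bin_pct if min_bin_pct <= 1 else min_bin_pct / 100.0


      def filter_isoforms(counts, min_counts=10, min_bin_pct=0.0):
          """Return the retained columns of a (n_spots, n_isos) count matrix and the mask."""
          counts = np.asarray(counts)
          total_counts = counts.sum(axis=0)  # total count per isoform
          frac_expressed = (counts > 0).mean(axis=0)  # fraction of spots with expression
          # Assumption: both thresholds are inclusive lower bounds
          keep = (total_counts >= min_counts) & (
              frac_expressed >= normalize_min_bin_pct(min_bin_pct)
          )
          return counts[:, keep], keep


      # One gene with three isoforms measured at four spots
      counts = np.array([[5, 0, 1], [6, 0, 0], [0, 1, 0], [0, 0, 0]])
      filtered, keep = filter_isoforms(counts, min_counts=10, min_bin_pct=25)
      ```

      Only the first isoform passes here, so a gene like this would then be
      dropped when ``filter_single_iso_genes=True``.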


   .. py:method:: test_differential_usage(method = 'hsic-gp', ratio_transformation = 'none', nan_filling = 'mean', gpr_backend = 'sklearn', gpr_configs = None, residualize = 'cov_only', print_progress = True, return_results = False)

      Test for spatial isoform differential usage.

      Before running this function, the design matrix must be set up using :func:`setup_data`.
      Each column of the design matrix corresponds to a covariate to test for differential
      association with the isoform usage ratios of each gene.
      Test statistics and p-values are computed per (gene, covariate) pair separately.

      Two types of association tests are supported:

      - Unconditional (``"hsic"``, ``"t-fisher"``, ``"t-tippett"``): test the
        unconditional association between isoform usage ratios and the covariate.
      - Conditional (``"hsic-gp"``): test the association conditioned on spatial
        coordinates via Gaussian process regression.  See :cite:`zhang2012kernel`
        for more details.
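
      The idea behind the conditional test can be sketched with plain NumPy:
      regress a covariate on the spatial coordinates and keep the residual.
      The sketch uses the posterior mean of a GP with a fixed RBF kernel
      (equivalent to kernel ridge regression). It is not the ``sklearn`` or
      ``gpytorch`` backend used by this method, and ``rbf_kernel``,
      ``spatial_residual``, ``length_scale``, and ``noise`` are illustrative
      names and values.

      ```python
      import numpy as np


      def rbf_kernel(coords, length_scale=0.2):
          """Gaussian (RBF) kernel matrix between all pairs of spots."""
          sq_dists = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(axis=-1)
          return np.exp(-0.5 * sq_dists / length_scale**2)


      def spatial_residual(z, coords, length_scale=0.2, noise=1e-2):
          """Subtract the fixed-hyperparameter GP posterior mean from z."""
          z = np.asarray(z, dtype=float)
          mean = z.mean(axis=0)
          K = rbf_kernel(coords, length_scale)
          # Posterior mean at the training points: K (K + noise * I)^{-1} (z - mean)
          alpha = np.linalg.solve(K + noise * np.eye(len(K)), z - mean)
          return z - mean - K @ alpha


      rng = np.random.default_rng(0)
      coords = rng.uniform(size=(200, 2))
      # Covariate with a smooth spatial trend plus independent noise
      z = np.sin(2 * np.pi * coords[:, 0]) + 0.1 * rng.normal(size=200)
      z_res = spatial_residual(z, coords)
      ```

      The residual ``z_res`` plays the role of ``Z_res`` in the
      ``residualize="cov_only"`` setting: most of the spatial trend is removed
      before the association with isoform ratios is tested.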

      :param method: Method for association testing:

                     * ``"hsic"``: Unconditional HSIC test (multivariate RV coefficient).
                       For continuous factors, equivalent to the multivariate Pearson correlation
                       test.  For binary factors, equivalent to the two-sample Hotelling's :math:`T^2` test.
                     * ``"hsic-gp"``: Conditional HSIC test.  Spatial effects are removed via
                       Gaussian process regression before computing the HSIC statistic.

                     Or one of the T-tests (binary factors only):

                     * ``"t-fisher"``, ``"t-tippett"``: each isoform is tested independently
                       and p-values are combined gene-wise via Fisher's or Tippett's method.
      :type method: str, optional
      :param ratio_transformation: Compositional transformation for isoform ratios.
                                   One of ``'none'``, ``'clr'``, ``'ilr'``, ``'alr'``, ``'radial'``
                                   :cite:`park2022kernel`.  See :func:`splisosm.utils.counts_to_ratios`.
      :type ratio_transformation: str, optional
      :param nan_filling: How to fill NaN values in isoform ratios.  One of ``'mean'`` or ``'none'``.
                          See :func:`splisosm.utils.counts_to_ratios`.
      :type nan_filling: str, optional
      :param gpr_backend: GPR backend to use for ``method='hsic-gp'``.
                          One of ``'sklearn'`` (default) or ``'gpytorch'``.
                          For FFT-accelerated spatial GP on regular grids use
                          :class:`~splisosm.hyptest_fft.SplisosmFFT` instead.
      :type gpr_backend: str, optional
      :param gpr_configs: Nested configuration dict for the GPR objects, with optional keys
                          ``'covariate'`` and/or ``'isoform'``.  Each sub-dict is forwarded to
                          :func:`splisosm.kernel_gpr.make_kernel_gpr`.  Unspecified keys use the
                          defaults from :data:`splisosm.kernel_gpr._DEFAULT_GPR_CONFIGS`::

                              {
                                  "covariate": {
                                      "constant_value": 1.0,
                                      "constant_value_bounds": (1e-3, 1e3),
                                      "length_scale": 1.0,
                                      "length_scale_bounds": "fixed",
                                      "n_inducing": 5000,
                                  },
                                  "isoform": {
                                      "constant_value": 1.0,
                                      "constant_value_bounds": (1e-3, 1e3),
                                      "length_scale": 1.0,
                                      "length_scale_bounds": "fixed",
                                      "n_inducing": 5000,
                                  },
                              }

                          ``"n_inducing"`` *(int)* is supported by both backends with the
                          same semantics:

                          * **sklearn** — full exact GP when ``n_obs ≤ n_inducing``; a
                            randomly sub-sampled subset of ``n_inducing`` points is used
                            as the inducing set otherwise (default: ``5000``).
                          * **gpytorch** — FITC sparse GP approximation with ``n_inducing``
                            points; set to ``None`` to use exact GP (default: ``5000``).
      :type gpr_configs: dict, optional
      :param residualize: Controls which signals are spatially residualized when
                          ``method="hsic-gp"``:

                          * ``"cov_only"`` (default): residualize covariates only; test
                            HSIC(Z_res, Y_raw).  Fastest; calibration matches ``"both"``
                            when covariate GPR captures most spatial confounding.
                          * ``"both"``: residualize both covariates and isoform ratios.
      :type residualize: {"cov_only", "both"}, optional
      :param print_progress: Whether to show the progress bar. Defaults to True.
      :type print_progress: bool, optional
      :param return_results: Whether to return the test statistics and p-values.
                             If False, the results are stored in ``self.du_test_results``.
      :type return_results: bool, optional

      :returns: **results** -- If ``return_results`` is True, returns dict with test statistics and
                p-values. Otherwise, returns None and stores results in
                ``self.du_test_results``.
      :rtype: dict or None
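
      For the ``"t-fisher"`` and ``"t-tippett"`` options, the gene-wise
      combination step can be illustrated in pure Python. This is a sketch of the
      two textbook combination rules, assuming independent per-isoform p-values;
      ``combine_pvalues`` is a hypothetical helper, not a ``splisosm`` function.

      ```python
      import math


      def combine_pvalues(pvalues, method="fisher"):
          """Combine k per-isoform p-values into one gene-level p-value."""
          k = len(pvalues)
          if method == "fisher":
              # X = -2 * sum(log p) follows a chi-square with 2k degrees of freedom.
              # For even degrees of freedom the survival function has a closed form:
              # exp(-X/2) * sum_{i<k} (X/2)^i / i!
              half_stat = -sum(math.log(p) for p in pvalues)
              terms = (half_stat**i / math.factorial(i) for i in range(k))
              return min(1.0, math.exp(-half_stat) * sum(terms))
          if method == "tippett":
              # The minimum of k independent uniform p-values follows Beta(1, k).
              return 1.0 - (1.0 - min(pvalues)) ** k
          raise ValueError(f"unknown method: {method!r}")


      # Hypothetical t-test p-values for the three isoforms of one gene
      isoform_pvalues = [0.01, 0.20, 0.65]
      p_fisher = combine_pvalues(isoform_pvalues, "fisher")
      p_tippett = combine_pvalues(isoform_pvalues, "tippett")
      ```

      Fisher's rule pools evidence across all isoforms, while Tippett's rule
      reacts only to the single smallest p-value.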


   .. py:method:: test_spatial_variability(method = 'hsic-ir', ratio_transformation = 'none', nan_filling = 'mean', use_perm_null = False, n_perms_per_gene = None, return_results = False, print_progress = True)

      Test for spatial variability.

      Kernel-based multivariate hypothesis testing for spatial variability in:

      - gene-level total counts (``"hsic-gc"`` or ``"spark-x"`` :cite:`zhu2021spark`)
      - isoform usage ratios (``"hsic-ir"``)
      - isoform counts (``"hsic-ic"``)

      Test statistics and p-values are computed separately for each gene.
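
      A minimal NumPy sketch of a kernel test of this kind for one gene, assuming
      a Gaussian kernel on the coordinates, a linear kernel on the isoform ratios,
      and a permutation null (the ``use_perm_null=True`` route). The kernel choice,
      bandwidth, normalization, and the helper name ``hsic_perm_test`` are
      assumptions for illustration, not the library's implementation.

      ```python
      import numpy as np


      def hsic_perm_test(coords, ratios, length_scale=0.2, n_perms=200, seed=0):
          """HSIC statistic between coordinates and ratios with a permutation p-value."""
          rng = np.random.default_rng(seed)
          n = coords.shape[0]
          sq_dists = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(axis=-1)
          K = np.exp(-0.5 * sq_dists / length_scale**2)  # spatial kernel
          H = np.eye(n) - np.ones((n, n)) / n  # centering matrix
          Kc = H @ K @ H
          Yc = ratios - ratios.mean(axis=0)  # centered data gives a centered linear kernel

          def statistic(Y):
              # trace(Kc @ Y @ Y.T) / (n - 1)^2 without forming the n x n linear kernel
              return np.sum(Y * (Kc @ Y)) / (n - 1) ** 2

          observed = statistic(Yc)
          # Shuffling spots breaks any link between location and isoform usage
          null = np.array([statistic(Yc[rng.permutation(n)]) for _ in range(n_perms)])
          pvalue = (1 + np.sum(null >= observed)) / (1 + n_perms)
          return observed, pvalue


      rng = np.random.default_rng(1)
      coords = rng.uniform(size=(100, 2))
      # Isoform 1 usage increases along the x-axis; isoform 2 takes the remainder
      usage = np.clip(0.2 + 0.6 * coords[:, 0] + 0.05 * rng.normal(size=100), 0.0, 1.0)
      ratios_sv = np.column_stack([usage, 1.0 - usage])
      ratios_null = ratios_sv[rng.permutation(100)]  # same values, spatial pattern destroyed

      stat_sv, p_sv = hsic_perm_test(coords, ratios_sv)
      stat_null, p_null = hsic_perm_test(coords, ratios_null)
      ```

      With ``use_perm_null=False`` the permutation loop is replaced by the
      asymptotic chi-square mixture approximation described below.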

      :param method: Must be one of ``"hsic-ir"``, ``"hsic-ic"``, ``"hsic-gc"``, or ``"spark-x"``.
      :param ratio_transformation: If using isoform ratios, the compositional transformation to apply.
                                   Can be one of ``'none'``, ``'clr'``, ``'ilr'``, ``'alr'``, or ``'radial'`` :cite:`park2022kernel`.
                                   See :func:`splisosm.utils.counts_to_ratios` for more details.
      :param nan_filling: How to fill the NaN values in the isoform ratios. Can be one of ``'mean'`` or ``'none'``.
                          See :func:`splisosm.utils.counts_to_ratios` for more details.
      :param use_perm_null: Whether to generate the null distribution from permutation.
                            If False, use the asymptotic distribution of chi-square mixtures with Liu's method :cite:`liu2009new`.
      :param n_perms_per_gene: Number of permutations per gene for permutation test.
      :param return_results: Whether to return the test statistics and p-values.
                             Defaults to False, in which case the results are stored in ``self.sv_test_results``.
      :param print_progress: Whether to show the progress bar. Defaults to True.

      :returns: If ``return_results`` is True, returns dict with test statistics and p-values.
                Otherwise, returns None and stores results in ``self.sv_test_results``.
      :rtype: dict or None
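
      The role of ``nan_filling`` can be sketched as follows. The sketch assumes
      that NaN ratios arise at spots where the gene has zero total counts and that
      ``'mean'`` replaces them with each isoform's mean ratio over the remaining
      spots. ``ratios_with_nan_filling`` is a hypothetical stand-in; see
      :func:`splisosm.utils.counts_to_ratios` for the actual behavior.

      ```python
      import numpy as np


      def ratios_with_nan_filling(counts, nan_filling="mean"):
          """Convert (n_spots, n_isos) counts to usage ratios, optionally filling NaNs."""
          counts = np.asarray(counts, dtype=float)
          totals = counts.sum(axis=1, keepdims=True)
          # Spots with zero total counts have undefined (NaN) ratios
          ratios = np.divide(
              counts, totals, out=np.full_like(counts, np.nan), where=totals > 0
          )
          if nan_filling == "mean":
              # Assumption: fill with the per-isoform mean over spots with defined ratios
              col_means = np.nanmean(ratios, axis=0)
              ratios = np.where(np.isnan(ratios), col_means, ratios)
          elif nan_filling != "none":
              raise ValueError(f"unknown nan_filling: {nan_filling!r}")
          return ratios


      # Three spots, two isoforms; the second spot has no counts for this gene
      counts = np.array([[3, 1], [0, 0], [1, 1]])
      filled = ratios_with_nan_filling(counts, "mean")
      unfilled = ratios_with_nan_filling(counts, "none")
      ```

      Filling with the mean keeps every spot in the test while leaving the
      per-isoform average unchanged.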

      .. rubric:: Notes

      To run the SPARK-X test, the R package ``SPARK`` must be installed and accessible from Python via ``rpy2``.


   .. py:attribute:: adata
      :value: None


   .. py:attribute:: du_test_results
      :type:  dict

      Dictionary to store the differential usage test results after running ``test_differential_usage()``.
      It contains the following keys:

      - ``'method'``: str, the method used for the test.
      - ``'statistic'``: numpy.ndarray of shape (n_genes, n_factors), the test statistic for each gene and covariate.
      - ``'pvalue'``: numpy.ndarray of shape (n_genes, n_factors), the p-value for each gene and covariate.
      - ``'pvalue_adj'``: numpy.ndarray of shape (n_genes, n_factors), the BH adjusted p-value for each gene and covariate. Each column/covariate is adjusted separately.
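
      The column-wise adjustment can be illustrated with a short NumPy sketch of
      the standard Benjamini-Hochberg procedure. ``bh_adjust`` is a hypothetical
      helper and the p-values are made up; the library may use a different routine
      internally.

      ```python
      import numpy as np


      def bh_adjust(pvalues):
          """Benjamini-Hochberg adjusted p-values for a 1-D array."""
          pvalues = np.asarray(pvalues, dtype=float)
          n = pvalues.size
          order = np.argsort(pvalues)
          scaled = pvalues[order] * n / np.arange(1, n + 1)
          # Enforce monotonicity from the largest p-value downwards, then cap at 1
          adjusted_sorted = np.minimum(np.minimum.accumulate(scaled[::-1])[::-1], 1.0)
          adjusted = np.empty(n)
          adjusted[order] = adjusted_sorted  # restore the original gene order
          return adjusted


      # Rows are genes, columns are covariates: shape (n_genes, n_factors)
      pvalue = np.array([[0.01, 0.50], [0.04, 0.03], [0.03, 0.90], [0.20, 0.04]])
      # Adjust each covariate (column) on its own
      pvalue_adj = np.column_stack(
          [bh_adjust(pvalue[:, j]) for j in range(pvalue.shape[1])]
      )
      ```

      Adjusting per column means the false discovery rate is controlled across
      genes for each covariate, not across all (gene, covariate) pairs jointly.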


   .. py:attribute:: n_factors
      :type:  int

      Number of covariates to test for differential usage.


   .. py:attribute:: n_genes
      :type:  int

      Number of genes.


   .. py:attribute:: n_isos
      :type:  list[int]

      List of numbers of isoforms per gene.


   .. py:attribute:: n_spots
      :type:  int

      Number of spots.


   .. py:attribute:: setup_input_mode
      :value: None


   .. py:attribute:: sv_test_results
      :type:  dict

      Dictionary to store the spatial variability test results after running ``test_spatial_variability()``.
      It contains the following keys:

      - ``'method'``: str, the method used for the test.
      - ``'statistic'``: numpy.ndarray of shape (n_genes,), the test statistic for each gene.
      - ``'pvalue'``: numpy.ndarray of shape (n_genes,), the p-value for each gene.
      - ``'pvalue_adj'``: numpy.ndarray of shape (n_genes,), the BH adjusted p-value for each gene.