splisosm.hyptest_glmm
=====================

.. py:module:: splisosm.hyptest_glmm

.. autoapi-nested-parse::

   GLMM-based hypothesis tests for spatial isoform usage.


Classes
-------

.. autoapisummary::

   splisosm.hyptest_glmm.SplisosmGLMM


Module Contents
---------------

.. py:class:: SplisosmGLMM(model_type = 'glmm-full', share_variance = True, var_fix_sigma = True, var_prior_model = 'none', var_prior_model_params = None, init_ratio = 'uniform', fitting_method = 'joint_gd', fitting_configs = None, k_neighbors = 4, rho = 0.99, approx_rank=_APPROX_RANK_AUTO, device = 'cpu')

   Parametric spatial isoform statistical modeling using GLMM.

   This is a convenience class that wraps around the :class:`splisosm.model.MultinomGLMM`
   for batched model fitting and spatial variability and differential usage testing.

   .. rubric:: Examples

   Prepare an AnnData object and fit a GLMM:

   >>> from splisosm import SplisosmGLMM
   >>> # adata : AnnData of shape (n_spots, n_isoforms)
   >>> #   adata.layers["counts"]    — raw isoform counts
   >>> #   adata.var["gene_symbol"]  — column grouping isoforms by gene
   >>> #   adata.obsm["spatial"]     — (n_spots, 2) spatial coordinates
   >>> #   adata.obs["covariate"]    — optional covariate column for DU testing
   >>> model = SplisosmGLMM(
   ...     model_type="glmm-full",    # 'glmm-full' | 'glmm-null' | 'glm'
   ...     fitting_method="joint_gd", # 'joint_gd' | 'joint_newton' | 'marginal_gd' | 'marginal_newton'
   ...     device="cpu",              # 'cpu' | 'cuda' (NVIDIA) | 'mps' (Apple Silicon)
   ...     # approx_rank: omit for auto | None (force full rank) | int (fixed low-rank)
   ... )
   >>> model.setup_data(
   ...     adata,
   ...     layer="counts",
   ...     group_iso_by="gene_symbol",
   ...     group_gene_by_n_iso=True,  # required for batch_size > 1 in fit()
   ...     design_mtx="covariate",    # obs column name, or (n_spots, n_factors) array
   ... )
   >>> model.fit(
   ...     n_jobs=1, batch_size=20,
   ...     with_design_mtx=False,     # False → score test (recommended for DU)
   ... )
   >>> fitted_models = model.get_fitted_models()

   Differential usage test:

   >>> model.test_differential_usage(method="score")  # or "wald" if with_design_mtx=True
   >>> du_results = model.get_formatted_test_results("du")
   >>> print(du_results.head())

   :param model_type: Which model to fit. Can be one of ``'glmm-full'`` (Multinomial GLMM with spatial random effects),
                      ``'glmm-null'`` (Multinomial GLMM with white noise), ``'glm'`` (Multinomial GLM).
   :param share_variance: Whether to share the variance component across isoforms.
   :param var_fix_sigma: Whether to fix the total variance (``sigma``) to the Fano-factor
                         initial estimate.  When ``True`` (default), only ``theta_logit``
                         is learned, producing conservative but well-calibrated hypothesis
                         tests.  Set to ``False`` to learn sigma jointly with other
                         parameters; this may yield higher power for the SV test but can
                         inflate false positive rates for both SV and DU tests.
   :param var_prior_model: The prior model on the total variance ``sigma``. Default is ``'none'`` with no prior.
                           Other options are ``'gamma'`` (Gamma prior) and ``'inv_gamma'`` (Inverse Gamma prior).
   :param var_prior_model_params: The parameters for the prior model on the total variance ``sigma``.
                                  For ``'gamma'``, the default parameters are ``{'alpha': 2.0, 'beta': 0.3}``.
                                  For ``'inv_gamma'``, the default parameters are ``{'alpha': 3, 'beta': 0.5}``.
   :param init_ratio: The initialization method for the logit isoform usage ratio. Options are ``'observed'`` (initialize using observed counts)
                      and ``'uniform'`` (equal isoform usage across space).
   :param fitting_method: The fitting method to use when ``model_type='glmm-full'`` or ``'glmm-null'``.
                          Options are ``'joint_gd'`` (joint likelihood with gradient descent),
                          ``'joint_newton'`` (joint likelihood with Newton's method),
                          ``'marginal_gd'`` (marginal likelihood with gradient descent),
                          and ``'marginal_newton'`` (marginal likelihood with Newton's method).
   :param fitting_configs: A dictionary of fitting configurations with the following keys:

                           - ``'lr'``: float, learning rate
                           - ``'optim'``: str, optimization method (one of ``'adam'``, ``'sgd'``, or ``'lbfgs'``)
                           - ``'tol'``: float, tolerance for convergence
                           - ``'max_epochs'``: int, maximum number of epochs
                           - ``'patience'``: int, number of epochs to wait for improvement before stopping
                           - ``'update_nu_every_k'``: int, number of iterations to update ``nu`` when using ``fitting_method='marginal_newton'``
   :param k_neighbors: Number of nearest neighbours used to build the spatial k-NN adjacency graph
                       (default 4).  Passed to :class:`splisosm.kernel.SpatialCovKernel`.
                       Ignored when ``adj_key`` is provided to :meth:`setup_data`.
   :type k_neighbors: int, optional
   :param rho: Spectral smoothing parameter for the ICAR-like spatial kernel (default 0.99).
               Values close to 1 produce smoother spatial covariance.
   :type rho: float, optional
   :param approx_rank: Number of leading eigenvectors of the spatial kernel to retain.

                       * **Not specified (default)**: automatically selects the rank at
                         :meth:`setup_data` time — full rank when ``n_spots ≤ 5000``,
                         otherwise ``ceil(sqrt(n_spots) * 4)``.
                       * ``None``: always use full rank regardless of ``n_spots`` (a warning
                         is emitted when ``n_spots > 5000`` because the eigendecomposition of
                         a large dense matrix is expensive).
                       * Positive integer: use exactly that many eigenvectors.
   :type approx_rank: int or None, optional
   :param device: Device for all model computation (default ``"cpu"``).
                  Use ``"cuda"`` for NVIDIA GPU or ``"mps"`` for Apple Silicon.
                  Parallel fitting (``n_jobs > 1``) is not supported for non-CPU devices
                  and will automatically fall back to single-core with a warning.
   :type device: {"cpu", "cuda", "mps"}, optional

   .. seealso:: :class:`splisosm.model.MultinomGLMM`


   .. py:method:: extract_feature_summary(level = 'gene', print_progress = True)

      Compute filtered feature-level summary statistics.

      Gene-level statistics are aggregated across all isoforms that passed
      the filters applied in :meth:`setup_data`.  Isoform-level statistics
      are computed per isoform and augmented onto the corresponding rows of
      ``adata.var``.

      Results are cached: repeated calls with the same ``level`` return the
      cached :class:`pandas.DataFrame` without recomputation.

      :param level: Summary granularity.
                    ``'gene'``: one row per gene.
                    ``'isoform'``: one row per isoform that passed filtering.
      :param print_progress: Whether to show a progress bar.

      :returns: For ``level='gene'``, the index is the gene display name and the
                columns are:

                - ``'n_isos'``: int. Number of isoforms retained after filtering.
                - ``'perplexity'``: float. Effective number of isoforms based on
                  the marginal isoform usage entropy.
                - ``'pct_bin_on'``: float. Fraction of spots with non-zero total
                  gene counts.
                - ``'count_avg'``: float. Mean per-spot total count for the gene.
                - ``'count_std'``: float. Std of per-spot total count for the gene.

                For ``level='isoform'``, the index is the isoform name (matching
                ``adata.var_names``) and the columns are the original ``adata.var``
                columns plus:

                - ``'pct_bin_on'``: float. Fraction of spots with count > 0.
                - ``'count_total'``: float. Total counts across all spots.
                - ``'count_avg'``: float. Mean count per spot.
                - ``'count_std'``: float. Std of count per spot.
                - ``'ratio_total'``: float. Fraction of total gene counts
                  attributable to this isoform.
                - ``'ratio_avg'``: float. Mean per-spot isoform usage ratio
                  (computed over spots with non-zero gene coverage).
                - ``'ratio_std'``: float. Std of per-spot isoform usage ratio
                  (computed over spots with non-zero gene coverage).
      :rtype: pandas.DataFrame

      :raises RuntimeError: If :meth:`setup_data` has not been called.
      :raises ValueError: If ``level`` is not ``'gene'`` or ``'isoform'``.


   .. py:method:: fit(n_jobs = 1, batch_size = 1, quiet = True, print_progress = True, with_design_mtx = False, from_null = False, refit_null = True, random_seed = None)

      Model fitting.

      :param n_jobs: The number of cores to use for parallel fitting. Default to 1.
      :type n_jobs: int, optional
      :param batch_size: The maximum number of genes per job to fit in parallel. Default to 1.
      :type batch_size: int, optional
      :param quiet: Whether to suppress the fitting logs. Default to True.
      :type quiet: bool, optional
      :param print_progress: Whether to show the progress bar. Default to True.
      :type print_progress: bool, optional
      :param with_design_mtx: Whether to include the design matrix for the fixed effects. Default to False.
      :type with_design_mtx: bool, optional
      :param from_null: Whether to initialize the full model from a null model
                        with zero spatial variability (white-noise random effect).
      :type from_null: bool, optional
      :param refit_null: Whether to refit the null model after fitting the full model.
                         Only applicable when ``from_null=True``.
      :type refit_null: bool, optional
      :param random_seed: The random seed for reproducibility. Default to None.
      :type random_seed: int or None, optional

      .. seealso:: :func:`splisosm.model.MultinomGLMM.fit`


   .. py:method:: get_fitted_models()

      Get the fitted models after running fit().

      Each call reconstructs the full model objects from the stored
      lightweight state.  The returned list is a fresh reconstruction;
      mutations to the returned models are *not* persisted.

      :returns: **models** --

                - ``model_type='glmm-full'``: list[splisosm.hyptest_glmm.IsoFullModel]
                - ``model_type='glmm-null'``: list[splisosm.hyptest_glmm.IsoNullNoSpVar]
                - ``model_type='glm'``: list[splisosm.model.MultinomGLM]
      :rtype: list of fitted models


   .. py:method:: get_fitted_ratios_anndata(layer_name = 'fitted_ratios')

      Return a copy of the filtered AnnData with fitted isoform ratios as a new layer.

      For each gene the per-spot softmax isoform ratios are taken from the
      fitted model (``model.get_isoform_ratio()``) and placed into the
      corresponding isoform columns of a ``(n_spots, n_filtered_isoforms)``
      dense matrix, which is stored under ``layer_name``.

      **Isoform ordering guarantee**: the columns of the new layer follow
      the exact row order of ``self._filtered_adata.var``.  This holds
      because :meth:`setup_data` builds per-gene count tensors by slicing
      ``filtered_adata`` isoforms in their ``var`` row order (via
      ``groupby(sort=False)``), and :meth:`~splisosm.model.MultinomGLM.get_isoform_ratio`
      returns ratios in the same column order as the input counts.

      :param layer_name: Name of the new layer.  Defaults to ``'fitted_ratios'``.

      :returns: A *copy* of ``self._filtered_adata`` (shape
                ``(n_spots, n_filtered_isoforms)``) with ``layer_name`` added.
                The ``var`` DataFrame is identical to ``self._filtered_adata.var``,
                so isoform metadata (e.g. the gene-grouping column) is intact.
      :rtype: AnnData

      :raises RuntimeError: If :meth:`fit` has not been called, or if :meth:`setup_data` was
          not called with an :class:`~anndata.AnnData` object (i.e. raw
          tensors were provided instead).


   .. py:method:: get_formatted_test_results(test_type, with_gene_summary = False)

      Get the formatted test results as data frame.

      :param test_type: Type of test results to retrieve.
      :type test_type: {"sv", "du"}
      :param with_gene_summary: If ``True``, append gene-level summary statistics from
                                :meth:`extract_feature_summary`.
      :type with_gene_summary: bool, optional

      :returns: Formatted test results.
      :rtype: pandas.DataFrame


   .. py:method:: get_gene_model(gene_name)

      Return the fitted model for a single gene by display name.

      Each call reconstructs a fresh model from stored state.

      :param gene_name: Display name of the gene (must appear in :attr:`gene_names`).

      :returns: * Fitted ``MultinomGLM``, ``IsoFullModel``, or ``IsoNullNoSpVar`` instance
                * corresponding to ``gene_name``.

      :raises RuntimeError: If :meth:`fit` has not been called yet.
      :raises KeyError: If ``gene_name`` is not found in :attr:`gene_names`.


   .. py:method:: get_training_summary()

      Return a per-gene training summary as a :class:`pandas.DataFrame`.

      The DataFrame is indexed by gene name and contains the following columns:

      - ``'model_type'``: str. Model type (``'glmm-full'``, ``'glmm-null'``, or ``'glm'``).
      - ``'converged'``: bool. Whether the gene converged during training.
      - ``'best_loss'``: float. Best (lowest) negative log-likelihood achieved.
      - ``'best_epoch'``: int. Epoch at which the best loss was recorded.
      - ``'fitting_time_s'``: float. Wall-clock seconds spent fitting this gene.

      Only available after calling :meth:`fit`.

      :raises RuntimeError: If :meth:`fit` has not been called yet.


   .. py:method:: load(path, map_location = None)
      :staticmethod:


      Load a :class:`SplisosmGLMM` model previously saved with :meth:`save`.

      :param path: Path to the file written by :meth:`save`.
      :param map_location: Optional device string (e.g. ``'cpu'``, ``'cuda:0'``) to remap
                           tensor storage when loading a model that was saved on a different
                           device.  Defaults to ``None`` (preserves original device mapping).

      :returns: The fully restored model, including fitted parameters, test results,
                and all metadata.
      :rtype: SplisosmGLMM

      .. rubric:: Examples

      >>> model.save("model.pt")
      >>> loaded = SplisosmGLMM.load("model.pt")
      >>> loaded.get_training_summary()


   .. py:method:: save(path)

      Save the fitted model to a file.

      Only lightweight per-gene :class:`_FittedGeneState` objects (fitted
      parameters + convergence metadata) are stored -- no full ``nn.Module``
      objects.  The raw ``.adata`` is **not** saved; only
      ``_filtered_adata`` is persisted.  After loading, ``.adata`` will be
      ``None``.

      :param path: The path to save the fitted models.


   .. py:method:: setup_data(adata, *, spatial_key = 'spatial', adj_key = None, layer = 'counts', group_iso_by = 'gene_symbol', gene_names = None, group_gene_by_n_iso = False, design_mtx = None, covariate_names = None, min_counts = 10, min_bin_pct = 0.0, min_component_size = 1)

      Setup the data for the GLMM model.

      :param adata: Annotated data matrix.  Counts are read from
                    ``adata.layers[layer]`` grouped by ``group_iso_by``, and
                    spatial coordinates from ``adata.obsm[spatial_key]``.
      :type adata: anndata.AnnData
      :param spatial_key: Key in ``adata.obsm`` for spatial coordinates (default
                          ``"spatial"``).
      :type spatial_key: str, optional
      :param adj_key: Key in ``adata.obsp`` for a pre-built adjacency matrix.
                      When provided, it overrides the k-NN graph construction
                      from coordinates and be used directly to build the spatial kernel.
                      The adjacency matrix is symmetrized internally.
      :type adj_key: str or None, optional
      :param layer: Layer in ``adata.layers`` storing isoform counts (default
                    ``"counts"``).
      :type layer: str, optional
      :param group_iso_by: Column in ``adata.var`` used to group isoforms by gene
                           (default ``"gene_symbol"``).
      :type group_iso_by: str, optional
      :param gene_names: Column name in ``adata.var`` used as display names for genes.
                         If ``None``, the values of ``group_iso_by`` are used.
      :type gene_names: str or None, optional
      :param group_gene_by_n_iso: If ``True``, genes are sorted and grouped by their isoform
                                  count before batching.  Required for ``batch_size > 1`` in
                                  :meth:`fit` because the GLMM model expects a fixed isoform
                                  count per batch (default ``False``).
      :type group_gene_by_n_iso: bool, optional
      :param design_mtx: Design matrix for fixed effects.  Accepts an
                         array/tensor/DataFrame of shape ``(n_spots, n_factors)``, a
                         single obs-column name (str), or a list of obs-column names.
      :type design_mtx: tensor, array, DataFrame, str, or list of str, optional
      :param covariate_names: Explicit covariate names.  When not provided, names are
                              inferred from column names or auto-generated.
      :type covariate_names: list of str or None, optional
      :param min_counts: Minimum total isoform count across spots to retain an isoform
                         (default 10).
      :type min_counts: int, optional
      :param min_bin_pct: Minimum fraction/percentage of spots where an isoform must be
                          expressed (default 0.0).
      :type min_bin_pct: float, optional
      :param min_component_size: Minimum number of spots a connected component must contain to
                                 be retained.  Spots in smaller components are removed from all
                                 data structures before the spatial kernel is built.  Default 1
                                 disables filtering.  A ``UserWarning`` is issued when spots are
                                 removed.
      :type min_component_size: int, optional

      :raises ValueError: If input arguments are invalid or required fields are missing.


   .. py:method:: test_differential_usage(method = 'score', print_progress = True, return_results = False)

      Parametric test for spatial isoform differential usage.

      Before running this function, the design matrix must be set up using :func:`setup_data`.
      Each column of the design matrix corresponds to a covariate to test for differential association
      with the isoform usage ratios of each gene.
      Test statistics and p-values are computed per (gene, covariate) pair separately.

      Similar to :func:`splisosm.hyptest_np.SplisosmNP.test_differential_usage`, here we also support two types of association tests but **implicitly**:

      - Unconditional (when ``model_type='glm'``): test the unconditional association between isoform usage ratios and the covariate of interest (H_0: ``beta`` = 0).
      - Conditional (when ``model_type='glmm-full'``): test for association (H_0: ``beta`` = 0) conditioned on the spatial random effect.

      :param method: Depending on whether the design matrix is used for model fitting,
                     different methods must be used for hypothesis testing:

                     - ``"wald"`` when models were fit with ``fit(..., with_design_mtx=True)``.
                     - ``"score"`` when models were fit with ``fit(..., with_design_mtx=False)``.

                     .. caution::
                         The Wald statistic with GLM/GLMM is empirically anti-conserved (i.e., lots of false positives).
                         Always use ``method="score"`` when possible.
      :type method: {"score", "wald"}, optional
      :param print_progress: Whether to show the progress bar. Default to True.
      :type print_progress: bool, optional
      :param return_results: Whether to return the test statistics and p-values.
                             If False, the results are stored in ``self._du_test_results``.
      :type return_results: bool, optional

      :returns: If `return_results` is True, returns dict with test statistics and p-values.
                Otherwise, returns None and stores results in self._du_test_results.
      :rtype: dict or None


   .. py:method:: test_spatial_variability(method = 'llr', use_perm_null = False, return_results = False, print_progress = True, n_perms_per_gene = None, **kwargs)

      Parametric test for spatial variability.

      .. caution::
          The likelihood ratio statistic is not well-calibrated for sparse data.
          We recommend the non-parametric HSIC-based tests in :class:`splisosm.hyptest_np.SplisosmNP`
          for spatial variability testing.

      Note that the parametric and non-parametric tests are assymptotically equivalent under the null.
      See :cite:`su2026consistent` for detailed theoretical analysis.

      :param method: The test method.
                     Currently only support ``"llr"``, the likelihood ratio test (H_0: ``theta`` = 0, i.e. no spatial variance).
      :type method: {"llr"}, optional
      :param use_perm_null: Whether to generate the null distribution from permutation.
                            If False, use the chi-square with df = n_var_components as the null.
      :type use_perm_null: bool, optional
      :param return_results: Whether to return the test statistics and p-values.
                             If False, the results will be stored in ``self._sv_test_results``.
      :type return_results: bool, optional
      :param print_progress: Whether to show the progress bar for permutation. Default to True.
      :type print_progress: bool, optional
      :param n_perms_per_gene: Number of permutations per gene when ``use_perm_null=True``. Default to 20.
      :type n_perms_per_gene: int or None, optional
      :param \*\*kwargs: Additional arguments passed to `self._fit_sv_llr_perm()` if ``use_perm_null = True``.
      :type \*\*kwargs: Any

      :returns: If `return_results` is True, returns dict with test statistics and p-values.
                Otherwise returns None.
      :rtype: dict or None

      .. seealso:: :func:`splisosm.hyptest_np.SplisosmNP.test_spatial_variability`


   .. py:attribute:: adata
      :type:  Optional[anndata.AnnData]

      Source :class:`~anndata.AnnData` object; ``None`` before :meth:`setup_data`.


   .. py:attribute:: covariate_names
      :type:  list[str]

      Covariate display names (length :attr:`n_factors`).


   .. py:attribute:: design_mtx
      :type:  Optional[torch.Tensor]

      Design matrix ``(n_spots, n_factors)``; ``None`` if no covariates.


   .. py:property:: filtered_adata
      :type: anndata.AnnData


      The filtered AnnData of shape (:attr:`n_spots`, sum(:attr:`n_isos_per_gene`)).

      This is the data used internally after :meth:`setup_data`.
      It is a copy of the input :attr:`adata`, subsetted to the retained spots and isoforms after filtering.

      :raises RuntimeError: If :meth:`setup_data` has not been called.


   .. py:attribute:: gene_names
      :type:  list[str]

      Gene display names (length :attr:`n_genes`).


   .. py:attribute:: n_factors
      :type:  int

      Number of covariates for differential usage testing.


   .. py:attribute:: n_genes
      :type:  int

      Number of genes after filtering.


   .. py:attribute:: n_isos_per_gene
      :type:  list[int]

      Number of isoforms per gene (list of length :attr:`n_genes`).


   .. py:attribute:: n_spots
      :type:  int

      Number of spatial spots.


   .. py:attribute:: sp_kernel
      :type:  Any

      Spatial kernel (:class:`~splisosm.kernel.SpatialCovKernel` for ``'glmm-full'``,
      :class:`~splisosm.kernel.IdentityKernel` for ``'glmm-null'``, ``None`` for ``'glm'``).
      Set by :meth:`setup_data`.