splisosm.hyptest_glmm
=====================

.. py:module:: splisosm.hyptest_glmm

.. autoapi-nested-parse::

   GLMM-based hypothesis tests for spatial isoform usage.


Classes
-------

.. autoapisummary::

   splisosm.hyptest_glmm.SplisosmGLMM


Module Contents
---------------

.. py:class:: SplisosmGLMM(model_type = 'glmm-full', share_variance = True, var_parameterization_sigma_theta = True, var_fix_sigma = False, var_prior_model = 'none', var_prior_model_params = {}, init_ratio = 'observed', fitting_method = 'joint_gd', fitting_configs = {'max_epochs': -1})

   Parametric spatial isoform statistical modeling using GLMM.

   This is a convenience class that wraps around the :class:`splisosm.model.MultinomGLMM`
   for batched model fitting and spatial variability and differential usage testing.

   .. rubric:: Examples

   Setup data:

   >>> from splisosm import SplisosmGLMM
   >>> import torch
   >>> # Simulate data for 10 genes with different number of isoforms
   >>> data_3_iso = [torch.randint(low=0, high=5, size=(100, 3)) for _ in range(5)]  # 5 genes with 3 isoforms
   >>> data_4_iso = [torch.randint(low=0, high=5, size=(100, 4)) for _ in range(5)]  # 5 genes with 4 isoforms
   >>> data = data_3_iso + data_4_iso
   >>> coordinates = torch.rand(100, 2)  # 100 spots with 2D coordinates
   >>> design_mtx = torch.rand(100, 2)  # 100 spots with 2 covariates

   Model fitting:

   >>> model = SplisosmGLMM(model_type='glmm-full')
   >>> model.setup_data(data, coordinates, design_mtx=design_mtx, group_gene_by_n_iso=True)
   >>> model.fit(n_jobs=1, batch_size=5, with_design_mtx=False)
   >>> fitted_models = model.get_fitted_models()
   >>> print(fitted_models[0])  # print the fitted model for the first gene

   Differential usage test:

   >>> model.test_differential_usage(method='score')
   >>> du_results = model.get_formatted_test_results('du')
   >>> print(du_results.head())

   :param model_type: Which model to fit. Can be one of ``'glmm-full'`` (Multinomial GLMM with spatial random effects),
                      ``'glmm-null'`` (Multinomial GLMM with white noise), ``'glm'`` (Multinomial GLM).
   :param share_variance: Whether to share the variance component across isoforms.
   :param var_parameterization_sigma_theta: Whether to parameterize the variance components as (``sigma``, ``theta_logit``) or (``sigma_sp``, ``sigma_nsp``).
                                            If True, the variance components will be (``sigma``, ``theta_logit``), where ``sigma`` is the total variance and
                                            ``theta_logit`` is the logit of the spatial variance proportion.
                                            If False, the variance components will be (``sigma_sp``, ``sigma_nsp``), where ``sigma_sp`` is the spatial
                                            variance and ``sigma_nsp`` is the non-spatial variance.
   :param var_fix_sigma: Whether to fix the total variance (``sigma``) or not. If True, the total variance will be fixed to the initial value,
                         which is the average per-spot variance of isoform counts normalized by its mean expression.
                         See `MultinomGLMM._initialize_params` for details.
   :param var_prior_model: The prior model on the total variance ``sigma``. Default is ``'none'`` with no prior.
                           Other options are ``'gamma'`` (Gamma prior) and ``'inv_gamma'`` (Inverse Gamma prior).
   :param var_prior_model_params: The parameters for the prior model on the total variance ``sigma``.
                                  For ``'gamma'``, the default parameters are ``{'alpha': 2.0, 'beta': 0.3}``.
                                  For ``'inv_gamma'``, the default parameters are ``{'alpha': 3, 'beta': 0.5}``.
   :param init_ratio: The initialization method for the logit isoform usage ratio. Options are ``'observed'`` (initialize using observed counts)
                      and ``'uniform'`` (equal isoform usage across space).
   :param fitting_method: The fitting method to use when ``model_type='glmm-full'`` or ``'glmm-null'``.
                          Options are ``'joint_gd'`` (joint likelihood with gradient descent),
                          ``'joint_newton'`` (joint likelihood with Newton's method),
                          ``'marginal_gd'`` (marginal likelihood with gradient descent),
                          and ``'marginal_newton'`` (marginal likelihood with Newton's method).
   :param fitting_configs: A dictionary of fitting configurations with the following keys:

                           - ``'lr'``: float, learning rate
                           - ``'optim'``: str, optimization method (one of ``'adam'``, ``'sgd'``, or ``'lbfgs'``)
                           - ``'tol'``: float, tolerance for convergence
                           - ``'max_epochs'``: int, maximum number of epochs
                           - ``'patience'``: int, number of epochs to wait for improvement before stopping
                           - ``'update_nu_every_k'``: int, number of iterations to update ``nu`` when using ``fitting_method='marginal_newton'``

   .. seealso:: :class:`splisosm.model.MultinomGLMM`


   .. py:method:: fit(n_jobs = 1, batch_size = 1, quiet = True, print_progress = True, with_design_mtx = False, from_null = False, refit_null = True, random_seed = None)

      Model fitting.

      :param n_jobs: The number of cores to use for parallel fitting. Default to 1.
      :param batch_size: The maximum number of genes per job to fit in parallel. Default to 1.
      :param quiet: Whether to suppress the fitting logs. Default to True.
      :param print_progress: Whether to show the progress bar. Default to True.
      :param with_design_mtx: Whether to include the design matrix for the fixed effects. Default to False.
      :param from_null: Whether to initialize the full model from a null model
                        with zero spatial variability (white-noise random effect).
      :param refit_null: Whether to refit the null model after fitting the full model.
      :param random_seed: The random seed for reproducibility. Default to None.

      .. seealso:: :func:`splisosm.model.MultinomGLMM.fit`


   .. py:method:: get_fitted_models()

      Get the fitted models after running fit().

      :returns: **models** --

                - ``model_type='glmm-full'``: list[splisosm.hyptest_glmm.IsoFullModel]
                - ``model_type='glmm-null'``: list[splisosm.hyptest_glmm.IsoNullNoSpVar]
                - ``model_type='glm'``: list[splisosm.model.MultinomGLM]
      :rtype: list of fitted models


   .. py:method:: get_formatted_test_results(test_type)

      Get the formatted test results as data frame.

      :param test_type: Type of test results to retrieve. Can be one of ``'sv'`` (spatial variability) or ``'du'`` (differential usage).

      :returns: Formatted test results.
      :rtype: pandas.DataFrame


   .. py:method:: save(path)

      Save the fitted models to a file.

      This function is designed to overcome a limitation of ``torch.save()``,
      which naively saves `n_genes` copies of the spatial kernel matrix ``self.corr_sp``.
      The kernel matrix is reconstructed per fitted model using ``self._corr_sp_eigvals``
      and ``self._corr_sp_eigvecs``.

      :param path: The path to save the fitted models.


   .. py:method:: setup_data(data = None, coordinates = None, design_mtx = None, gene_names = None, group_gene_by_n_iso = False, covariate_names = None, *, adata = None, spatial_key = 'spatial', layer = 'counts', group_iso_by = 'gene_symbol', min_counts = 10, min_bin_pct = 0.0, filter_single_iso_genes = True)

      Setup the data for the model.

      This method supports two input modes for backward compatibility.

      - Legacy mode: pass ``data`` and ``coordinates`` directly.
      - AnnData mode: pass ``adata``, where counts are extracted from
        ``adata.layers[layer]`` grouped by ``group_iso_by``, and coordinates
        are read from ``adata.obsm[spatial_key]``.
        See :func:`splisosm.utils.prepare_inputs_from_anndata` for details.

      :param data: Legacy mode only. List of tensors/arrays with shape
                   ``(n_spots, n_isos)`` containing isoform counts for each gene.
      :param coordinates: Legacy mode only. Shape ``(n_spots, 2)``, the spatial coordinates.
      :param design_mtx: Design matrix for fixed effects.

                         - Legacy mode: tensor/array/dataframe of shape ``(n_spots, n_factors)``.
                         - AnnData mode: tensor/array/dataframe, or one obs-column name
                           (str), or a list of obs-column names.
      :param gene_names: Gene names.

                         - Legacy mode: list of gene names.
                         - AnnData mode: optional column name in ``adata.var`` used as
                           display names for grouped genes; if None, use grouped gene IDs.
      :param group_gene_by_n_iso: Whether to group genes by the number of isoforms.
      :param covariate_names: List of covariate names.
      :param adata: AnnData object used in the new input mode.
      :param spatial_key: Key in ``adata.obsm`` for spatial coordinates.
      :param layer: Counts layer in ``adata.layers``.
      :param group_iso_by: Column in ``adata.var`` used to group isoforms by gene.
      :param min_counts: Minimum total isoform count across spots required to retain an
                         isoform in AnnData mode.
      :param min_bin_pct: Minimum percentage/fraction of spots where an isoform is expressed
                          in AnnData mode. Values in ``[0, 1]`` are treated as fractions;
                          values in ``(1, 100]`` are treated as percentages.
      :param filter_single_iso_genes: AnnData mode only. Whether to remove genes with fewer than two
                                      retained isoforms.

      :raises ValueError: If input arguments are invalid or required fields are missing.


   .. py:method:: test_differential_usage(method = 'score', print_progress = True, return_results = False)

      Parametric test for spatial isoform differential usage.

      Before running this function, the design matrix must be set up using :func:`setup_data`.
      Each column of the design matrix corresponds to a covariate to test for differential association
      with the isoform usage ratios of each gene.
      Test statistics and p-values are computed per (gene, covariate) pair separately.

      Similar to :func:`splisosm.hyptest_np.SplisosmNP.test_differential_usage`, here we also support two types of association tests but **implicitly**:

      - Unconditional (when ``model_type='glm'``): test the unconditional association between isoform usage ratios and the covariate of interest (H_0: ``beta`` = 0).
      - Conditional (when ``model_type='glmm-full'``): test for association (H_0: ``beta`` = 0) conditioned on the spatial random effect.

      :param method: Depending on whether the design matrix is used for model fitting,
                     different methods must be used for hypothesis testing:

                     - ``"wald"`` when models were fit with ``fit(..., with_design_mtx=True)``.
                     - ``"score"`` when models were fit with ``fit(..., with_design_mtx=False)``.

                     .. caution::
                         The Wald statistic with GLM/GLMM is empirically anti-conserved (i.e., lots of false positives).
                         Always use ``method="score"`` when possible.
      :param print_progress: Whether to show the progress bar. Default to True.
      :param return_results: Whether to return the test statistics and p-values.
                             If False, the results are stored in ``self.du_test_results``.

      :returns: If `return_results` is True, returns dict with test statistics and p-values.
                Otherwise, returns None and stores results in self.du_test_results.
      :rtype: dict or None


   .. py:method:: test_spatial_variability(method = 'llr', use_perm_null = False, return_results = False, print_progress = True, n_perms_per_gene = None, **kwargs)

      Parametric test for spatial variability.

      .. caution::
          The likelihood ratio statistic is not well-calibrated for sparse data.
          We recommend the non-parametric HSIC-based tests in :class:`splisosm.hyptest_np.SplisosmNP`
          for spatial variability testing.

      Note that the parametric and non-parametric tests are assymptotically equivalent under the null.
      See :cite:`su2026consistent` for detailed theoretical analysis.

      :param method: The test method.
                     Currently only support ``"llr"``, the likelihood ratio test (H_0: ``sigma_sp`` = 0).
      :param use_perm_null: Whether to generate the null distribution from permutation.
                            If False, use the chi-square with df = n_var_components as the null.
      :param return_results: Whether to return the test statistics and p-values.
                             If False, the results will be stored in ``self.sv_test_results``.
      :param print_progress: Whether to show the progress bar for permutation. Default to True.
      :param \*\*kwargs: Additional arguments passed to `self._fit_sv_llr_perm()` if ``use_perm_null = True``.
      :type \*\*kwargs: Any

      :returns: If `return_results` is True, returns dict with test statistics and p-values.
                Otherwise returns None.
      :rtype: dict or None

      .. seealso:: :func:`splisosm.hyptest_np.SplisosmNP.test_spatial_variability`


   .. py:attribute:: adata
      :value: None


   .. py:attribute:: du_test_results
      :type:  dict

      Dictionary to store the differential usage test results after running test_differential_usage().
      It contains the following keys:

      - ``'method'``: str, the method used for the test.
      - ``'statistic'``: numpy.ndarray of shape (n_genes, n_factors), the test statistic for each gene and covariate.
      - ``'pvalue'``: numpy.ndarray of shape (n_genes, n_factors), the p-value for each gene and covariate.
      - ``'pvalue_adj'``: numpy.ndarray of shape (n_genes, n_factors), the BH adjusted p-value for each gene and covariate. Each column/covariate is adjusted separately.


   .. py:attribute:: fitting_results
      :type:  dict

      Dictionary to store lists of fitted models, with keys
      ``'glmm-full'``, ``'glmm-null'``, ``'glm'``.


   .. py:attribute:: model_configs
      :type:  dict

      Dictionary of model configurations, with keys
      ``share_variance``, ``var_parameterization_sigma_theta``,
      ``var_fix_sigma``, ``var_prior_model``, ``var_prior_model_params``,
      ``init_ratio``, ``fitting_method``, and ``fitting_configs``.


   .. py:attribute:: model_type
      :type:  Literal['glmm-full', 'glmm-null', 'glm']

      The type of GLM or GLMM model.


   .. py:attribute:: n_factors
      :type:  int

      Number of covariates to test for differential usage.


   .. py:attribute:: n_genes
      :type:  int

      Number of genes.


   .. py:attribute:: n_isos_per_gene
      :type:  list[int]

      List of numbers of isoforms per gene.


   .. py:attribute:: n_spots
      :type:  int

      Number of spots.


   .. py:attribute:: setup_input_mode
      :value: None


   .. py:attribute:: sv_test_results
      :type:  dict

      Dictionary to store the spatial variability test results after running test_spatial_variability().
      It contains the following keys:

      - ``'method'``: str, the method used for the test.
      - ``'statistic'``: numpy.ndarray of shape (n_genes,), the log likelihood ratio test statistic for each gene.
      - ``'df'``: int, the degrees of freedom for the test statistic.
      - ``'use_perm_null'``: bool, whether the null distribution is estimated using permutation.
      - ``'pvalue'``: numpy.ndarray of shape (n_genes,), the p-value for each gene.
      - ``'pvalue_adj'``: numpy.ndarray of shape (n_genes,), the BH adjusted p-value for each gene.