splisosm.dataset#

Dataset helpers for batched GLM and GLMM training.

Classes#

IsoDataset

Dataset for batched training of GLM and GLMM models.

Module Contents#

class splisosm.dataset.IsoDataset(data, gene_names=None, group_gene_by_n_iso=False)#

Dataset for batched training of GLM and GLMM models.

IsoDataset.get_dataloader returns a DataLoader that yields batches of genes for training.

If group_gene_by_n_iso is True, genes with the same number of isoforms are grouped together and stored as a 3D tensor of shape (n_genes, n_spots, n_isos). Otherwise, genes are stored as a list of per-gene tensors of shape (n_spots, n_isos).
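
The grouping described above can be illustrated without PyTorch. The snippet below is a minimal sketch, not the library's implementation: `group_by_n_isos` is a hypothetical helper, and nested lists stand in for the per-gene tensors of shape (n_spots, n_isos).

```python
from collections import defaultdict


def group_by_n_isos(data, gene_names):
    """Group per-gene matrices by isoform count (hypothetical helper).

    Each element of `data` is a nested list of shape (n_spots, n_isos),
    standing in for a 2D tensor.
    """
    groups = defaultdict(lambda: {"genes": [], "counts": []})
    for name, matrix in zip(gene_names, data):
        n_isos = len(matrix[0])  # number of columns = number of isoforms
        groups[n_isos]["genes"].append(name)
        # Matrices collected here share a shape, so they could be stacked
        # into one (n_genes, n_spots, n_isos) array.
        groups[n_isos]["counts"].append(matrix)
    return dict(groups)


# Two genes with 3 isoforms and one gene with 4, each observed at 5 spots.
data = [
    [[0] * 3 for _ in range(5)],
    [[0] * 4 for _ in range(5)],
    [[0] * 3 for _ in range(5)],
]
groups = group_by_n_isos(data, ["gene_0", "gene_1", "gene_2"])
```

Here `groups[3]` collects `gene_0` and `gene_2`, whose matrices could be stacked along a new leading axis, while `groups[4]` holds only `gene_1`.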

Example

>>> from splisosm.dataset import IsoDataset
>>> import torch
>>> # Simulate data for 10 genes with differing numbers of isoforms
>>> data_3_iso = [torch.randn(100, 3) for _ in range(5)]  # 5 genes with 3 isoforms
>>> data_4_iso = [torch.randn(100, 4) for _ in range(5)]  # 5 genes with 4 isoforms
>>> data = data_3_iso + data_4_iso
>>> gene_names = [f"gene_{i}" for i in range(10)]
>>> dataset = IsoDataset(data, gene_names, group_gene_by_n_iso=True)
>>> # Get dataloader for batched training
>>> dataloader = dataset.get_dataloader(batch_size=2)
>>> batch = next(iter(dataloader))

Parameters:
  • data (list[Tensor]) – List of tensors with shape (n_spots, n_isos).

  • gene_names (Optional[list[str]]) – List of gene names. If None, names are generated automatically.

  • group_gene_by_n_iso (bool) – Whether to group genes by the number of isoforms.

get_dataloader(batch_size=1)#

Get dataloader for the dataset.

Parameters:

batch_size (int) – Maximum number of genes in a batch.

Returns:

A DataLoader that yields batches of genes.

Return type:

Iterator[Any]
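
Because batch_size is a maximum, the last batch may contain fewer genes. The following is a minimal sketch of that chunking behaviour only; `iter_gene_batches` is a hypothetical helper, and how IsoDataset actually assembles batches (for example, whether a batch can span genes with different isoform counts) is not stated here.

```python
def iter_gene_batches(gene_names, batch_size=1):
    """Yield consecutive batches of at most `batch_size` genes (hypothetical helper)."""
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(gene_names), batch_size):
        yield gene_names[start:start + batch_size]


genes = [f"gene_{i}" for i in range(5)]
# Five genes with batch_size=2 give two full batches and one leftover gene.
batches = list(iter_gene_batches(genes, batch_size=2))
```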

data: list[Tensor]#

Input list of per-gene isoform count tensors, each of shape (n_spots, n_isos).

dataset: list[Dataset]#

If group_by_n_iso is True, a list of GroupedIsoDataset objects in which isoform counts are stored as 3D tensors. Otherwise, a list of UngroupedIsoDataset objects in which isoform counts are stored as lists of 2D tensors.

datasets = None#
gene_name: list[str]#

List of gene names.

gene_names#
group_by_n_iso: bool#

Whether to group genes by the number of isoforms.

group_gene_by_n_iso = False#
n_genes: int#

Number of genes.

n_isos_per_gene: list[int]#

Number of isoforms for each gene.

n_spots: int#

Number of spots.
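
The summary attributes above relate directly to the shapes of the input tensors. The sketch below shows one way they could be derived from a list of (n_spots, n_isos) shapes; `summarize_shapes` is a hypothetical helper, and the check that every gene shares the same number of spots is an assumption inferred from n_spots being a single integer.

```python
def summarize_shapes(shapes):
    """Derive summary attributes from per-gene (n_spots, n_isos) shapes.

    Hypothetical helper; assumes every gene is measured at the same spots.
    """
    if not shapes:
        raise ValueError("at least one gene is required")
    spot_counts = {n_spots for n_spots, _ in shapes}
    if len(spot_counts) != 1:
        raise ValueError("all genes must share the same number of spots")
    return {
        "n_genes": len(shapes),
        "n_spots": spot_counts.pop(),
        "n_isos_per_gene": [n_isos for _, n_isos in shapes],
    }


# Shapes matching the class example: 5 genes with 3 isoforms, 5 with 4.
summary = summarize_shapes([(100, 3)] * 5 + [(100, 4)] * 5)
```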