# data + dataloader

## Dataloader information
We use a packed sequence format, so the values returned from this function are not, in general, a `(batch_size, length, features)` tensor but a `(length, features)` matrix.
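For illustration, here is a minimal sketch of what the packed format looks like (the shapes and names are hypothetical): variable-length neighborhoods are concatenated along the cell axis rather than padded into a batch dimension.

```python
import torch

# three hypothetical neighborhoods with different numbers of cells
n_features = 500  # e.g. number of genes measured (hypothetical)
neighborhoods = [torch.randn(n, n_features) for n in (12, 7, 31)]

# packed: concatenate along the cell axis instead of padding to a batch dim
packed = torch.cat(neighborhoods, dim=0)
lengths = torch.tensor([len(nb) for nb in neighborhoods])

print(packed.shape)  # torch.Size([50, 500]) -- (length, features), no batch dim
```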
Here is the docstring for the loader:
Sampler that returns the gene expression matrices and cell-type identity vectors
for two groups of cells: the observed cells and the masked/reference cells. The observed cells
are the cells in the neighborhood of the reference cell (set by `patch_size`).
We assume the type will be found at `cell_type_colname` and that `cell_id` can be used
successfully to index into the `adata` object.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `metadata` | `DataFrame` | Metadata for the cells. Must contain columns for cell_id, cell_type, x, y, and tissue_section. | required |
| `adata` | `AnnData` | Expression-containing (as `.X`) anndata. We assume input will be log scaled. | required |
| `patch_size` | `Union[List[int], Tuple[int]]` | Size in arbitrary units for the neighborhood calculation. | required |
| `cell_id_colname` | `Optional[str]` | The column to use to index into the anndata, by default `"cell_label"`. | `'cell_label'` |
| `cell_type_colname` | `Optional[str]` | The column to use to access the cell type identities as class-encoded integers, by default `"cell_type"`. | `'cell_type'` |
| `tissue_section_colname` | `Optional[str]` | To simplify computation, group the cells in each section by this column, by default `"brain_section_label"`. | `'brain_section_label'` |
| `max_num_cells` | `Optional[Union[int, None]]` | How many cells to threshold at for the neighborhood size, by default `None`. | `None` |
| `indices` | `Optional[Union[List[int], None]]` | Used to specify train/test sets via subsetting on only these cells. This should be a numeric index compatible with the metadata. | `None` |
Raises:
| Type | Description |
|---|---|
| `TypeError` | If `adata` is not an `AnnData` object |
| `ValueError` | If `metadata` does not contain the necessary columns |
| `TypeError` | If `metadata` is not a pandas `DataFrame` |
| `ValueError` | If `patch_size` is not a tuple of length 2 |
| `ValueError` | If `metadata` and `adata` are not the same length |
| `ValueError` | If `metadata` does not contain columns `"x"` and `"y"` |
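Putting the docstring together, constructing a sampler might look like the sketch below. The keyword names come from the parameter table above, while `metadata` and `adata` are assumed to be prepared already (their required structure is described next):

```python
from celltransformer.data import CenterMaskSampler

# sketch only: assumes `metadata` (pandas DataFrame) and `adata` (log-scaled
# AnnData) are already loaded and row-aligned
sampler = CenterMaskSampler(
    metadata=metadata,
    adata=adata,
    patch_size=(100, 100),  # same units as the x/y columns
    cell_id_colname="cell_label",
    cell_type_colname="cell_type",
    tissue_section_colname="brain_section_label",
    max_num_cells=None,  # no cap on neighborhood size
    indices=None,        # or a numeric index to define a train/test split
)
```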
The required data components are an `anndata` object with keys corresponding to columns in a dataframe of metadata, `metadata`.
The dataframe must have:

- `x`: spatial coordinates [default: `x`]
- `y`: spatial coordinates [default: `y`]
- one column (`cell_id_colname`) which refers to the keys; these will be used directly to subset into the `anndata` object [default: `cell_label`]
- one column (`cell_type_colname`) which is a class-encoded integer corresponding to the single-cell classes from the non-spatial scRNA-seq data clustering [default: `cell_type`]
The units of `patch_size` are not scaled internally, so they must "match" the units of `x` and `y`.
What if you don't have a dataframe with those columns (and it is therefore incorrectly formatted for `celltransformer.data.CenterMaskSampler`)? Your options are:

- just create a new version of the csv/tsv/etc. with the correct column names and metadata (a rename sketch follows below)
- use the `scripts/training/lightning_model.py:BaseTrainer.load_data` function to preprocess your data into a format that satisfies the conditions (see `scripts/training/train_aibs_mouse.py:load_data` as an example). The motivations for this are covered in the TLDR on the main page, but we will also discuss it here.
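For the first option, a quick rename before saving a corrected copy might look like this; the source column names on the left are made up for illustration:

```python
import pandas as pd

# hypothetical original column names on the left, expected names on the right
metadata = pd.read_csv("metadata.csv").rename(
    columns={
        "x_position": "x",
        "y_position": "y",
        "cell_id": "cell_label",
        "section_id": "brain_section_label",
    }
)
metadata.to_csv("metadata_renamed.csv", index=False)
```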
## Expectations for inputs into `celltransformer.data.CenterMaskSampler`, using `scripts/training/train_aibs_mouse.py:load_data` as an example
- we need columns `x` and `y` which match the units of `patch_size`
    - e.g. if your desired patch size is 10 um and your spatial dimensions are provided in nm, you would want to rescale either the patch size or the spatial dimensions. I elected to rescale the spatial dimensions (for my own sanity) all to um, and so use the `load_data` function to do so.
- we may also need to remap `cell_type_colname` to give integer class encodings
Therefore, let's annotate the test code from `train_aibs_mouse.py:load_data`:
```python
# make sure to encode as str because for whatever reason
# in the AIBS data, the anndata row IDs are strings instead of ints
metadata["cell_label"] = metadata["cell_label"].astype(str)
metadata["x"] = metadata["x_reconstructed"] * 100  # initially x_reconstructed is in the wrong units
metadata["y"] = metadata["y_reconstructed"] * 100
metadata = metadata[  # throw away nonessential columns
    [
        config.data.celltype_colname,
        "cell_label",
        "x",
        "y",
        "brain_section_label",
    ]
]
# label_to_cls is just a thin wrapper over
# `sklearn.preprocessing.LabelEncoder`
metadata["cell_type"] = self.label_to_cls(
    metadata[config.data.celltype_colname]
)
# make sure to integer-encode the classes
metadata["cell_type"] = metadata["cell_type"].astype(int)
metadata = metadata.reset_index(drop=True)
# in the AIBS dataset, some cells for which gene
# expression was measured do not have metadata associated,
# so let's throw those out
adata = adata[metadata["cell_label"]]
```
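After this preprocessing, `metadata` and `adata` should be row-aligned. A quick sanity check mirroring the ValueError conditions from the docstring might look like this (a sketch, continuing from the snippet above):

```python
import pandas as pd

# checks corresponding to the errors the sampler raises
assert isinstance(metadata, pd.DataFrame)
assert {"x", "y", "cell_label", "cell_type"}.issubset(metadata.columns)
assert len(metadata) == adata.shape[0], "metadata and adata must be the same length"
```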
## Outputs of `__getitem__` and `collate`
These are the steps we implement so far:

- extract spatial neighbors (within some radius) for the cell of interest
- extract expression (from the `anndata` object) and the class encoding (`cell_type`) for each cell from the metadata dataframe you should have passed to `CenterMaskSampler`
- partition them into two sets, which are simply lists in a `namedtuple`; see `celltransformer.data.NeighborhoodMetadata` (sketched after this list)
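The actual container is `celltransformer.data.NeighborhoodMetadata`; the field names below are an assumption for illustration, not the real definition:

```python
from collections import namedtuple

# field names are hypothetical -- see celltransformer.data.NeighborhoodMetadata
# for the actual definition
NeighborhoodMetadata = namedtuple(
    "NeighborhoodMetadata",
    ["observed_expression", "observed_cell_types",
     "masked_expression", "masked_cell_types"],
)
```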
Note that we then need to stack these together across batches and create attention mask matrices, which is done in `celltransformer.data.loader_pandas.collate`. We create three attention mask matrices (we use binary adjacency matrices):
| Mask | Description |
|---|---|
| `encoder_mask` | allow the neighborhood tokens to attend only to each other |
| `pooling_mask` | pool into a single query token |
| `decoder_mask` | allow the query token and decoding cell tokens to attend to each other |
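To make the block structure concrete, here is a hedged sketch of how binary masks like these could be assembled for a packed batch; the shapes and names are assumptions, not the actual `collate` implementation:

```python
import torch

# hypothetical neighborhood sizes after packing (see the packed format above)
lengths = [12, 7, 31]
total = sum(lengths)

# encoder_mask: block-diagonal, so cells in one neighborhood attend
# only to each other (1 = attention allowed)
encoder_mask = torch.block_diag(*[torch.ones(n, n) for n in lengths])

# pooling_mask: one query token per neighborhood, pooling over its own cells
pooling_mask = torch.zeros(len(lengths), total)
offset = 0
for i, n in enumerate(lengths):
    pooling_mask[i, offset : offset + n] = 1
    offset += n

# decoder_mask would follow the same block pattern, letting each query
# token and its decoding-cell tokens attend to each other
```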
See the function documentation and the next section (model) for more information.
My first-shot implementation of this workflow (in pandas) was ~1.3X faster than my next attempt, in polars. I welcome any feedback on why my polars code may have been suboptimal!