# data + dataloader

## Dataloader information
We use a packed sequence format, so the values returned from this function are not, in general, a `(batch_size, length, features)` tensor but a `(length, features)` matrix.
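For illustration, here is a minimal sketch of what the packed format looks like (the shapes and names are hypothetical): variable-length neighborhoods are concatenated along the cell axis rather than padded into a batch dimension.

```python
import torch

# three hypothetical neighborhoods with different numbers of cells
n_features = 500  # e.g. number of genes measured (hypothetical)
neighborhoods = [torch.randn(n, n_features) for n in (12, 7, 31)]

# packed: concatenate along the cell axis instead of padding to a batch dim
packed = torch.cat(neighborhoods, dim=0)
lengths = torch.tensor([len(nb) for nb in neighborhoods])

print(packed.shape)  # torch.Size([50, 500]) -- (length, features), no batch dim
```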
Here is the docstring for the loader:
Sampler that returns the gene expression matrices and cell-type identity vectors
for two groups of cells: the observed cells and the masked/reference cells. The observed cells
are the cells in the neighborhood of the reference cell (set by `patch_size`).
We assume the type will be found at `cell_type_colname` and that `cell_id` can be used
successfully to index into the `adata` object.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `metadata` | `DataFrame` | Metadata for the cells. Must contain columns for cell_id, cell_type, x, y, and tissue_section. | required |
| `adata` | `AnnData` | Expression-containing (as `.X`) anndata. We assume input will be log scaled. | required |
| `patch_size` | `Union[List[int], Tuple[int]]` | Size in arbitrary units for the neighborhood calculation. | required |
| `cell_id_colname` | `Optional[str]` | The column to use to index into the anndata, by default `"cell_label"`. | `'cell_label'` |
| `cell_type_colname` | `Optional[str]` | The column to use to access the cell type identities as class-encoded integers, by default `"cell_type"`. | `'cell_type'` |
| `tissue_section_colname` | `Optional[str]` | To simplify computation, group the cells in each section by this column, by default `"brain_section_label"`. | `'brain_section_label'` |
| `max_num_cells` | `Optional[Union[int, None]]` | How many cells to threshold at for the neighborhood size, by default `None`. | `None` |
| `indices` | `Optional[Union[List[int], None]]` | Used to specify train/test sets via subsetting on only these cells. This should be a numeric index compatible with the metadata. | `None` |
Raises:
| Type | Description |
|---|---|
| `TypeError` | If `adata` is not an `AnnData` object |
| `ValueError` | If `metadata` does not contain the necessary columns |
| `TypeError` | If `metadata` is not a pandas `DataFrame` |
| `ValueError` | If `patch_size` is not a tuple of length 2 |
| `ValueError` | If `metadata` and `adata` are not the same length |
| `ValueError` | If `metadata` does not contain columns `"x"` and `"y"` |
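Putting the docstring together, constructing a sampler might look like the sketch below. The keyword names come from the parameter table above, while `metadata` and `adata` are assumed to be prepared already (their required structure is described next):

```python
from celltransformer.data import CenterMaskSampler

# sketch only: assumes `metadata` (pandas DataFrame) and `adata` (log-scaled
# AnnData) are already loaded and row-aligned
sampler = CenterMaskSampler(
    metadata=metadata,
    adata=adata,
    patch_size=(100, 100),  # same units as the x/y columns
    cell_id_colname="cell_label",
    cell_type_colname="cell_type",
    tissue_section_colname="brain_section_label",
    max_num_cells=None,  # no cap on neighborhood size
    indices=None,        # or a numeric index to define a train/test split
)
```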
The required data components are an `anndata` object with keys corresponding to columns in a dataframe of metadata, `metadata`.
The dataframe must have:

- `x`: spatial coordinates [default: `x`]
- `y`: spatial coordinates [default: `y`]
- one column (`cell_id_colname`) which refers to the keys; these will be used directly to subset into the `anndata` object [default: `cell_label`]
- one column (`cell_type_colname`) which is a class-encoded integer corresponding to the single-cell classes from the non-spatial scRNA-seq data clustering [default: `cell_type`]
The units of `patch_size` are not scaled internally, so they must "match" the units of `x` and `y`.
What if you don't have a dataframe with those columns (and it is therefore incorrectly formatted for `celltransformer.data.CenterMaskSampler`)? Your options are:

- just create a new version of the csv/tsv/etc. with the correct column names and metadata (a rename sketch follows below)
- use the `scripts/training/lightning_model.py:BaseTrainer.load_data` function to preprocess your data into a format that satisfies the conditions (see `scripts/training/train_aibs_mouse.py:load_data` as an example). The motivations for this are covered in the TLDR on the main page, but we will also discuss it here.
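For the first option, a quick rename before saving a corrected copy might look like this; the source column names on the left are made up for illustration:

```python
import pandas as pd

# hypothetical original column names on the left, expected names on the right
metadata = pd.read_csv("metadata.csv").rename(
    columns={
        "x_position": "x",
        "y_position": "y",
        "cell_id": "cell_label",
        "section_id": "brain_section_label",
    }
)
metadata.to_csv("metadata_renamed.csv", index=False)
```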
## Expectations for inputs into `celltransformer.data.CenterMaskSampler`, using `scripts/training/train_aibs_mouse.py:load_data` as an example
- we need columns `x` and `y` which match the units of `patch_size`
    - e.g. if your desired patch size is 10 um and your spatial dimensions are provided in nm, you would want to rescale either the patch size or the spatial dimensions. I elected to rescale the spatial dimensions (for my own sanity) all to um, and so use the `load_data` function to do so.
- we may also need to remap `cell_type_colname` to give integer class encodings
Therefore, let's annotate the test code from `train_aibs_mouse.py:load_data`:
```python
# make sure to encode as str because for whatever reason
# in the AIBS data, the anndata row IDs are strings instead of ints
metadata["cell_label"] = metadata["cell_label"].astype(str)
metadata["x"] = metadata["x_reconstructed"] * 100  # initially x_reconstructed is in the wrong units
metadata["y"] = metadata["y_reconstructed"] * 100
metadata = metadata[  # throw away nonessential columns
    [
        config.data.celltype_colname,
        "cell_label",
        "x",
        "y",
        "brain_section_label",
    ]
]
# label_to_cls is just a thin wrapper over
# `sklearn.preprocessing.LabelEncoder`
metadata["cell_type"] = self.label_to_cls(
    metadata[config.data.celltype_colname]
)
# make sure to integer-encode the classes
metadata["cell_type"] = metadata["cell_type"].astype(int)
metadata = metadata.reset_index(drop=True)
# in the AIBS dataset, some cells for which gene
# expression was measured do not have metadata associated,
# so let's throw those out
adata = adata[metadata["cell_label"]]
```
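After this preprocessing, `metadata` and `adata` should be row-aligned. A quick sanity check mirroring the ValueError conditions from the docstring might look like this (a sketch, continuing from the snippet above):

```python
import pandas as pd

# checks corresponding to the errors the sampler raises
assert isinstance(metadata, pd.DataFrame)
assert {"x", "y", "cell_label", "cell_type"}.issubset(metadata.columns)
assert len(metadata) == adata.shape[0], "metadata and adata must be the same length"
```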
## Outputs of `__getitem__` and `collate`
These are the steps we implement so far:

- extract spatial neighbors (within some radius) for the cell of interest
- extract expression (from the `anndata` object) and the class encoding (`cell_type`) for each cell from the metadata dataframe you should have passed to `CenterMaskSampler`
- partition them into two sets, which are simply lists in a `namedtuple`; see `celltransformer.data.NeighborhoodMetadata` (sketched after this list)
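The actual container is `celltransformer.data.NeighborhoodMetadata`; the field names below are an assumption for illustration, not the real definition:

```python
from collections import namedtuple

# field names are hypothetical -- see celltransformer.data.NeighborhoodMetadata
# for the actual definition
NeighborhoodMetadata = namedtuple(
    "NeighborhoodMetadata",
    ["observed_expression", "observed_cell_types",
     "masked_expression", "masked_cell_types"],
)
```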
Note that we then need to stack these together across batches and create attention mask matrices, which is done in `celltransformer.data.loader_pandas.collate`. We create three attention mask matrices (we use binary adjacency matrices):
| Mask | Description |
|---|---|
| `encoder_mask` | allow the neighborhood tokens to attend only to each other |
| `pooling_mask` | pool into a single query token |
| `decoder_mask` | allow the query token and decoding cell tokens to attend to each other |
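To make the block structure concrete, here is a hedged sketch of how binary masks like these could be assembled for a packed batch; the shapes and names are assumptions, not the actual `collate` implementation:

```python
import torch

# hypothetical neighborhood sizes after packing (see the packed format above)
lengths = [12, 7, 31]
total = sum(lengths)

# encoder_mask: block-diagonal, so cells in one neighborhood attend
# only to each other (1 = attention allowed)
encoder_mask = torch.block_diag(*[torch.ones(n, n) for n in lengths])

# pooling_mask: one query token per neighborhood, pooling over its own cells
pooling_mask = torch.zeros(len(lengths), total)
offset = 0
for i, n in enumerate(lengths):
    pooling_mask[i, offset : offset + n] = 1
    offset += n

# decoder_mask would follow the same block pattern, letting each query
# token and its decoding-cell tokens attend to each other
```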
See the function documentation and the next section (model) for more information.
My first-shot implementation of this workflow (in pandas) was ~1.3X faster than my next attempt, in polars. I welcome any feedback on why my polars code may have been suboptimal!