sparrow.tb.preprocess_transcriptomics

sparrow.tb.preprocess_transcriptomics(sdata, labels_layer, table_layer, output_layer, percent_top=(2, 5), min_counts=10, min_genes=0, min_cells=5, size_norm=True, highly_variable_genes=False, highly_variable_genes_kwargs=mappingproxy({}), max_value_scale=10, n_comps=50, update_shapes_layers=True, shapes_layers_to_filter=None, overwrite=False)

Preprocess a table (AnnData) attribute of a SpatialData object for transcriptomics data.

Performs filtering (via scanpy.pp.filter_cells and scanpy.pp.filter_genes), normalization (based on nucleus/cell size or via scanpy.pp.normalize_total), log transformation (scanpy.pp.log1p), optional highly variable gene selection (scanpy.pp.highly_variable_genes), scaling (scanpy.pp.scale), and PCA calculation (scanpy.tl.pca) on the transcriptomics data contained in sdata.tables[table_layer]. QC metrics are added to sdata.tables[output_layer].obs using scanpy.pp.calculate_qc_metrics.
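The sequence of scanpy calls described above can be sketched on a plain AnnData object as follows. This is a minimal illustration, not sparrow's implementation: the helper name preprocess_sketch and the exact order of the steps are assumptions, and the size-based normalization branch as well as all SpatialData bookkeeping (labels, shapes, zarr store) are left out.

```python
def preprocess_sketch(adata, percent_top=(2, 5), min_counts=10, min_genes=0,
                      min_cells=5, highly_variable_genes=False,
                      highly_variable_genes_kwargs=None, max_value_scale=10,
                      n_comps=50):
    """Hypothetical helper mirroring the documented steps on an AnnData."""
    import scanpy as sc  # imported here so scanpy is only needed when called

    # QC metrics are written to adata.obs and adata.var.
    sc.pp.calculate_qc_metrics(adata, percent_top=percent_top, inplace=True)

    # filter_cells accepts a single threshold per call.
    sc.pp.filter_cells(adata, min_counts=min_counts)
    sc.pp.filter_cells(adata, min_genes=min_genes)
    sc.pp.filter_genes(adata, min_cells=min_cells)

    # Library-size normalization (the size_norm=False case), then log1p.
    sc.pp.normalize_total(adata)
    sc.pp.log1p(adata)

    if highly_variable_genes:
        sc.pp.highly_variable_genes(adata, **(highly_variable_genes_kwargs or {}))
        # Keep only the genes flagged as highly variable.
        adata = adata[:, adata.var["highly_variable"]].copy()

    # Scale each gene; values above max_value_scale are clipped.
    sc.pp.scale(adata, max_value=max_value_scale)

    # The default arpack solver needs fewer components than the smallest
    # dimension of the matrix, so the request is capped here (sketch choice).
    n_comps = min(n_comps, min(adata.shape) - 1)
    sc.tl.pca(adata, n_comps=n_comps)
    return adata
```

Calling preprocess_sketch(adata) on a copy of sdata.tables[table_layer] gives a rough idea of what the preprocessed table contains, but only the real function keeps the table, labels, and shapes layers of sdata in sync.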

Parameters:

sdata (SpatialData) – The input SpatialData object.
labels_layer (Union[str, Iterable[str]]) – The labels layer(s) of sdata used to select the cells via the _REGION_KEY in sdata.tables[table_layer].obs. Note that if output_layer is equal to table_layer and overwrite is True, cells in sdata.tables[table_layer] that are linked to other labels layers (via the _REGION_KEY) will be removed from sdata.tables[table_layer] (and also from the backing zarr store, if sdata is backed).
table_layer (str) – The table layer in sdata on which to perform preprocessing.
output_layer (str) – The output table layer in sdata to which the preprocessed table will be written.
percent_top (tuple[int, ...] (default: (2, 5))) – List of ranks (where genes are ranked by expression) at which the cumulative proportion of expression will be reported as a percentage. Passed to scanpy.pp.calculate_qc_metrics.
min_counts (int (default: 10)) – Minimum number of counts a cell should contain to be kept (passed to scanpy.pp.filter_cells).
min_genes (int (default: 0)) – Minimum number of genes a cell should contain to be kept (passed to scanpy.pp.filter_cells).
min_cells (int (default: 5)) – Minimum number of cells a gene should be in to be kept (passed to scanpy.pp.filter_genes).
size_norm (bool (default: True)) – If True, normalization is based on the size of the nucleus/cell. If False, scanpy.pp.normalize_total is used for normalization.
highly_variable_genes (bool (default: False)) – If True, will only retain highly variable genes, as calculated by scanpy.pp.highly_variable_genes.
highly_variable_genes_kwargs (Mapping[str, Any] (default: mappingproxy({}))) – Keyword arguments passed to scanpy.pp.highly_variable_genes. Ignored if highly_variable_genes is False.
max_value_scale (float | None (default: 10)) – The maximum value to which data will be scaled, using scanpy.pp.scale.
n_comps (int (default: 50)) – Number of principal components to calculate.
update_shapes_layers (bool (default: True)) – Whether to filter the shapes layers associated with labels_layer. If set to True, cells that do not appear in the resulting output_layer (with _REGION_KEY equal to labels_layer) will be removed from the shapes layers (via _INSTANCE_KEY) in the sdata object. Filtered shapes will be added to sdata with prefix ‘filtered_low_counts’.
shapes_layers_to_filter (default: None) – List of shapes layers to filter. If None, all shapes layers associated with labels_layer will be filtered, if update_shapes_layers is True.
overwrite (bool (default: False)) – If True, overwrites the output_layer if it already exists in sdata.
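To make the meaning of percent_top and of the three filtering thresholds concrete, the following pure-Python sketch reproduces them on toy data. The function names, the toy matrix, and the choice to filter cells before genes are assumptions made for illustration; the real work is delegated to scanpy.

```python
def pct_counts_in_top(counts, ranks=(2, 5)):
    """Percentage of a cell's counts held by its n most expressed genes."""
    total = sum(counts)
    if total == 0:
        return {n: 0.0 for n in ranks}
    ordered = sorted(counts, reverse=True)  # genes ranked by expression
    return {n: 100.0 * sum(ordered[:n]) / total for n in ranks}


def filter_counts(matrix, min_counts=10, min_genes=0, min_cells=5):
    """Return (kept cell indices, kept gene indices); rows are cells."""
    kept_cells = [
        i for i, row in enumerate(matrix)
        if sum(row) >= min_counts and sum(v > 0 for v in row) >= min_genes
    ]
    n_genes = len(matrix[0]) if matrix else 0
    kept_genes = [
        j for j in range(n_genes)
        if sum(matrix[i][j] > 0 for i in kept_cells) >= min_cells
    ]
    return kept_cells, kept_genes


# One cell with six genes and 100 counts in total.
print(pct_counts_in_top([4, 40, 10, 25, 6, 15]))  # {2: 65.0, 5: 96.0}

# Three cells, three genes; the second cell has only 3 counts.
toy = [[8, 5, 0], [1, 0, 2], [0, 9, 4]]
print(filter_counts(toy, min_counts=10, min_genes=0, min_cells=2))  # ([0, 2], [1])
```

In the toy matrix, the second cell falls below min_counts, and only the second gene is still detected in at least min_cells of the remaining cells.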

Return type:

SpatialData

Returns:

The sdata containing the preprocessed AnnData object as an attribute (sdata.tables[output_layer]).
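A call might look like the sketch below. The layer names are hypothetical placeholders and must be replaced by layers that exist in your SpatialData object; the wrapper run_preprocessing is likewise only an illustration.

```python
# Hypothetical layer names; adapt them to the layers present in your sdata.
PREPROCESS_KWARGS = {
    "labels_layer": "segmentation_mask",
    "table_layer": "table_transcriptomics",
    "output_layer": "table_transcriptomics_preprocessed",
    "min_counts": 10,
    "min_cells": 5,
    "size_norm": True,
    "n_comps": 50,
    "overwrite": True,
}


def run_preprocessing(sdata, **overrides):
    """Run the preprocessing and return (sdata, preprocessed AnnData)."""
    import sparrow as sp  # imported lazily; requires the sparrow package

    kwargs = {**PREPROCESS_KWARGS, **overrides}
    sdata = sp.tb.preprocess_transcriptomics(sdata, **kwargs)
    # The preprocessed table is stored under the requested output layer.
    return sdata, sdata.tables[kwargs["output_layer"]]
```

Writing to an output_layer that differs from table_layer, as done here, leaves the original table untouched, which avoids the cell removal described for labels_layer.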

Raises:

ValueError – If sdata does not have a labels attribute.
ValueError – If sdata does not have a tables attribute.
ValueError – If labels_layer, or one of the elements of labels_layer, is not a labels layer in sdata.
ValueError – If table_layer is not a table layer in sdata.
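The conditions above can be checked up front before starting a long run. The sketch below is an assumption about how such checks could look, not sparrow's own validation code; check_layers is a hypothetical helper, and a SimpleNamespace with dict attributes stands in for a SpatialData object.

```python
from types import SimpleNamespace


def check_layers(sdata, labels_layer, table_layer):
    """Raise ValueError for the documented failure cases; return the labels list."""
    # Accept a single layer name as well as an iterable of names.
    labels = [labels_layer] if isinstance(labels_layer, str) else list(labels_layer)
    if len(sdata.labels) == 0:
        raise ValueError("sdata contains no labels layers.")
    if len(sdata.tables) == 0:
        raise ValueError("sdata contains no table layers.")
    for name in labels:
        if name not in sdata.labels:
            raise ValueError(f"'{name}' is not a labels layer in sdata.")
    if table_layer not in sdata.tables:
        raise ValueError(f"'{table_layer}' is not a table layer in sdata.")
    return labels


# Stand-in object with the two dict-like attributes used by the checks.
fake = SimpleNamespace(labels={"mask": object()}, tables={"counts": object()})
print(check_layers(fake, "mask", "counts"))  # ['mask']
```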

Warning

If max_value_scale is set too low, it may overly constrain the variability of the data, potentially impacting downstream analyses.
If the dimensionality of sdata.tables[table_layer] is smaller than the desired number of principal components, n_comps is set to the minimum dimensionality, and a message is printed.
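Both warnings can be illustrated with a few lines of plain Python. The helper names are hypothetical. clip_upper only shows the upper clipping that scanpy.pp.scale applies through max_value; depending on the scanpy version, strongly negative values may be clipped as well.

```python
def effective_n_comps(n_comps, shape):
    """Cap n_comps at the smallest dimension of the table, as documented."""
    return min(n_comps, min(shape))


def clip_upper(values, max_value):
    """Clip scaled values from above; None disables clipping."""
    if max_value is None:
        return list(values)
    return [min(v, max_value) for v in values]


# A table with 200 cells and 30 genes cannot yield 50 components.
print(effective_n_comps(50, (200, 30)))  # 30

scaled = [-1.2, 0.3, 2.5, 12.0, 30.0]
print(clip_upper(scaled, 10.0))  # [-1.2, 0.3, 2.5, 10.0, 10.0]
print(clip_upper(scaled, 2.0))   # [-1.2, 0.3, 2.0, 2.0, 2.0]
```

With max_value 2.0 the three largest values become indistinguishable, which is the loss of variability the first warning refers to.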
