sparrow.tb.preprocess_transcriptomics


sparrow.tb.preprocess_transcriptomics(sdata, labels_layer, table_layer, output_layer, min_counts=10, min_cells=5, size_norm=True, highly_variable_genes=False, highly_variable_genes_kwargs=mappingproxy({}), max_value_scale=10, n_comps=50, update_shapes_layers=True, overwrite=False)

Preprocess a table (AnnData) attribute of a SpatialData object for transcriptomics data.

Performs filtering (via scanpy.pp.filter_cells and scanpy.pp.filter_genes), normalization (either by cell size or via scanpy.pp.normalize_total), log transformation (scanpy.pp.log1p), optional selection of highly variable genes (scanpy.pp.highly_variable_genes), scaling (scanpy.pp.scale), and PCA (scanpy.tl.pca) on the transcriptomics data contained in sdata. QC metrics are added to sdata.tables[output_layer].obs using scanpy.pp.calculate_qc_metrics.
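
The filtering, normalization, and log steps can be sketched in plain Python on a tiny count matrix. This only illustrates the arithmetic and is not sparrow's or scanpy's implementation: the helper names filter_counts and normalize_log, the target_sum value, and the toy data are all hypothetical, and the sketch uses total-count normalization rather than size-based normalization.

```python
import math


def filter_counts(matrix, min_counts, min_cells):
    """Drop cells (rows) with fewer than min_counts total counts, then drop
    genes (columns) detected in fewer than min_cells of the remaining cells."""
    cells = [row for row in matrix if sum(row) >= min_counts]
    n_genes = len(matrix[0]) if matrix else 0
    kept_genes = [
        g for g in range(n_genes)
        if sum(1 for row in cells if row[g] > 0) >= min_cells
    ]
    return [[row[g] for g in kept_genes] for row in cells], kept_genes


def normalize_log(matrix, target_sum):
    """Rescale every cell to target_sum total counts, then apply log(1 + x)."""
    out = []
    for row in matrix:
        total = sum(row)
        factor = target_sum / total if total else 0.0  # guard against empty cells
        out.append([math.log1p(v * factor) for v in row])
    return out


# Toy data: 4 cells x 3 genes (values are made up for illustration).
counts = [
    [5, 5, 0],
    [1, 0, 0],  # only 1 count in total: removed by min_counts=2
    [2, 2, 0],
    [0, 3, 1],
]
filtered, kept_genes = filter_counts(counts, min_counts=2, min_cells=2)
normalized = normalize_log(filtered, target_sum=10)
```

In this toy example the third gene is detected in a single remaining cell, so it is dropped, and the first two kept cells end up with identical normalized profiles because they differ only in sequencing depth.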

Parameters:
  • sdata (SpatialData) – The input SpatialData object.

  • labels_layer (Union[str, Iterable[str]]) – The labels layer(s) of sdata used to select the cells via the _REGION_KEY in sdata.tables[table_layer].obs. Note that if output_layer equals table_layer and overwrite is True, cells in sdata.tables[table_layer] that are linked to other labels layers (via the _REGION_KEY) will be removed from sdata.tables[table_layer] (and from the backing zarr store, if the table is backed).

  • table_layer (str) – The table layer in sdata on which to perform preprocessing.

  • output_layer (str) – The output table layer in sdata to which the preprocessed table will be written.

  • min_counts (int (default: 10)) – Minimum number of counts a cell should contain to be kept (passed to scanpy.pp.filter_cells).

  • min_cells (int (default: 5)) – Minimum number of cells a gene should be in to be kept (passed to scanpy.pp.filter_genes).

  • size_norm (bool (default: True)) – If True, normalization is based on the size of the nucleus/cell. If False, scanpy.pp.normalize_total is used for normalization.

  • highly_variable_genes (bool (default: False)) – If True, will only retain highly variable genes, as calculated by scanpy.pp.highly_variable_genes.

  • highly_variable_genes_kwargs (Mapping[str, Any] (default: mappingproxy({}))) – Keyword arguments passed to scanpy.pp.highly_variable_genes. Ignored if highly_variable_genes is False.

  • max_value_scale (int (default: 10)) – The maximum value to which data will be scaled, using scanpy.pp.scale.

  • n_comps (int (default: 50)) – Number of principal components to calculate.

  • update_shapes_layers (bool (default: True)) – Whether to filter the shapes layers associated with labels_layer. If True, cells that do not appear in the resulting output_layer (with _REGION_KEY equal to labels_layer) are removed from the shapes layers (via _INSTANCE_KEY) in the sdata object. The filtered shapes are added to sdata with the prefix ‘filtered_low_counts’.

  • overwrite (bool (default: False)) – If True, overwrites the output_layer if it already exists in sdata.
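
The difference between the two normalization modes selected by size_norm can be sketched as follows. This assumes that size-based normalization divides each cell's counts by that cell's size (for example, its area); the exact scaling used by the library is not stated here, and the helper names, target_sum, and data are hypothetical.

```python
def normalize_by_size(counts, cell_sizes):
    """Divide each cell's counts by its size (assumed form of size_norm=True)."""
    if len(counts) != len(cell_sizes):
        raise ValueError("counts and cell_sizes need one entry per cell")
    return [[v / size for v in row] for row, size in zip(counts, cell_sizes)]


def normalize_by_total(counts, target_sum):
    """Rescale each cell so its counts sum to target_sum (total-count normalization)."""
    out = []
    for row in counts:
        total = sum(row)
        out.append([v * target_sum / total if total else 0.0 for v in row])
    return out


# Two cells of equal size; the first holds twice as many transcripts.
counts = [[8, 2], [4, 1]]
cell_sizes = [100.0, 100.0]

by_size = normalize_by_size(counts, cell_sizes)
by_total = normalize_by_total(counts, target_sum=10)
```

Under these assumptions, size-based normalization keeps the twofold difference in transcript density between the two cells, whereas total-count normalization maps both cells to the same profile.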

Return type:

SpatialData

Returns:

The sdata containing the preprocessed AnnData object as an attribute (sdata.tables[output_layer]).
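
A minimal usage sketch, assuming an existing SpatialData object that already contains a segmentation mask and a table. The layer names below are hypothetical placeholders, and the wrapper function run_preprocessing is not part of the library.

```python
# Hypothetical layer names: substitute the ones present in your SpatialData object.
PREPROCESS_KWARGS = {
    "labels_layer": "segmentation_mask",
    "table_layer": "table_transcriptomics",
    "output_layer": "table_transcriptomics_preprocessed",
    "min_counts": 10,
    "min_cells": 5,
    "size_norm": True,
    "n_comps": 50,
    "update_shapes_layers": False,
    "overwrite": False,
}


def run_preprocessing(sdata):
    """Preprocess the table and return the resulting AnnData object."""
    import sparrow  # imported lazily; requires the sparrow package to be installed

    sdata = sparrow.tb.preprocess_transcriptomics(sdata, **PREPROCESS_KWARGS)
    # The preprocessed table is stored under the requested output layer.
    return sdata.tables[PREPROCESS_KWARGS["output_layer"]]
```

Writing to an output_layer that differs from table_layer keeps the original table untouched, which avoids the removal of cells linked to other labels layers described under labels_layer.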

Raises:
  • ValueError – If sdata does not have a labels attribute.

  • ValueError – If sdata does not have a tables attribute.

  • ValueError – If labels_layer, or one of the elements of labels_layer, is not a labels layer in sdata.

  • ValueError – If table_layer is not a table layer in sdata.

Warning

  • If max_value_scale is set too low, it may overly constrain the variability of the data, potentially impacting downstream analyses.

  • If the dimensionality of sdata.tables[table_layer] is smaller than the desired number of principal components, n_comps is set to the minimum dimensionality, and a message is printed.
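
Both warnings can be illustrated with a small plain-Python sketch. It is not the library's code: scale_and_clip standardizes each gene with the sample standard deviation and caps values above max_value (how the negative side is treated is not covered here), and effective_n_comps assumes that the minimum dimensionality means the smaller of the number of cells and the number of genes. Both helper names are hypothetical.

```python
import math


def scale_and_clip(matrix, max_value):
    """Standardize each gene (column) to zero mean and unit variance,
    then cap values above max_value."""
    n_cells = len(matrix)
    columns = list(zip(*matrix))
    means = [sum(col) / n_cells for col in columns]
    stds = [
        math.sqrt(sum((v - m) ** 2 for v in col) / (n_cells - 1))
        for col, m in zip(columns, means)
    ]
    return [
        [min((v - m) / s, max_value) if s > 0 else 0.0
         for v, m, s in zip(row, means, stds)]
        for row in matrix
    ]


def effective_n_comps(n_obs, n_vars, n_comps):
    """Reduce n_comps when the table has fewer cells or genes than requested."""
    limit = min(n_obs, n_vars)
    if n_comps > limit:
        print(f"n_comps reduced from {n_comps} to {limit}")
        return limit
    return n_comps


# One gene measured in 10 cells, with a single strongly expressing cell.
gene = [[0.0]] * 9 + [[10.0]]
loose = scale_and_clip(gene, max_value=10)  # outlier keeps its z-score
tight = scale_and_clip(gene, max_value=2)   # outlier is capped at 2
```

With max_value=2 the outlier cell is pulled down to the cap while the other cells are unchanged, which is the loss of variability the first warning refers to.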

See also

sparrow.tb.allocate

Create an AnnData table in sdata using a points_layer and a labels_layer.