sparrow.tb.kmeans


sparrow.tb.kmeans(sdata, labels_layer, table_layer, output_layer, calculate_umap=True, rank_genes=True, n_neighbors=35, n_pcs=17, n_clusters=5, key_added='kmeans', index_names_var=None, index_positions_var=None, random_state=100, overwrite=False, **kwargs)

Applies KMeans clustering on the table_layer of the SpatialData object with optional UMAP calculation and gene ranking.

This function runs the KMeans clustering algorithm (via sklearn.cluster.KMeans) on the table held by a SpatialData object. It optionally computes a UMAP (Uniform Manifold Approximation and Projection) embedding for dimensionality reduction and ranks genes based on their contributions to the clustering. The clustering results, along with the optional UMAP and gene ranking, are added to sdata.tables[output_layer] for downstream analysis.
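To make the roles of n_clusters and random_state concrete, the following is a minimal pure-Python sketch of the classic k-means iteration (Lloyd's algorithm). It is an illustration only and not what sklearn.cluster.KMeans does internally (sklearn uses k-means++ initialization by default and optimized routines). The function name lloyd_kmeans, its n_iter argument, and the toy points are hypothetical.

```python
import random


def lloyd_kmeans(points, n_clusters, random_state, n_iter=20):
    """Toy k-means on a list of equal-length tuples; returns one label per point."""
    rng = random.Random(random_state)         # seeded, so repeated runs agree
    centers = rng.sample(points, n_clusters)  # random initial centers
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center.
        labels = [
            min(
                range(n_clusters),
                key=lambda k: sum((a - b) ** 2 for a, b in zip(p, centers[k])),
            )
            for p in points
        ]
        # Update step: each center moves to the mean of its members.
        for k in range(n_clusters):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # keep the old center if a cluster ends up empty
                centers[k] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels


# Hypothetical toy data: two well-separated groups of "cells" with two features.
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (5.0, 5.1), (5.2, 4.9), (4.8, 5.0)]
labels_a = lloyd_kmeans(points, n_clusters=2, random_state=100)
labels_b = lloyd_kmeans(points, n_clusters=2, random_state=100)
```

Because the initial centers are drawn from a seeded generator, the two calls return identical labels. The random_state argument plays the same role here as it does for the clustering step of this function.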

Parameters:
  • sdata (SpatialData) – The input SpatialData object.

  • labels_layer (str | list[str] | None) – The labels layer(s) of sdata used to select the cells via the _REGION_KEY in sdata.tables[table_layer].obs. Note that if output_layer is equal to table_layer and overwrite is True, cells in sdata.tables[table_layer] linked to other labels layers (via the _REGION_KEY) will be removed from sdata.tables[table_layer]. If a list of labels layers is provided, the cells they select are clustered together (e.g. multiple samples).

  • table_layer (str) – The table layer in sdata on which to perform clustering.

  • output_layer (str) – The output table layer in sdata to which the table layer, with the clustering results added, will be written.

  • calculate_umap (bool (default: True)) – If True, calculates a UMAP via scanpy.tl.umap for visualization of computed clusters.

  • rank_genes (bool (default: True)) – If True, ranks genes based on their contributions to the clusters via scanpy.tl.rank_genes_groups. TODO: To be moved to a separate function.

  • n_neighbors (int (default: 35)) – The number of neighbors to consider when calculating neighbors via scanpy.pp.neighbors. Ignored if calculate_umap is False.

  • n_pcs (int (default: 17)) – The number of principal components to use when calculating neighbors via scanpy.pp.neighbors. Ignored if calculate_umap is False.

  • n_clusters (int (default: 5)) – The number of clusters to form.

  • key_added (str (default: 'kmeans')) – The key under which the clustering results are added to the SpatialData object (in sdata.tables[output_layer].obs).

  • index_names_var (Optional[Iterable[str]] (default: None)) – List of index names to subset in sdata.tables[table_layer].var. If None, index_positions_var is used instead, provided it is not None.

  • index_positions_var (Optional[Iterable[int]] (default: None)) – List of integer positions to subset in sdata.tables[table_layer].var. Used if index_names_var is None.

  • random_state (int (default: 100)) – A random state for reproducibility of the clustering.

  • overwrite (bool (default: False)) – If True, overwrites the output_layer if it already exists in sdata.

  • **kwargs – Additional keyword arguments passed to the KMeans algorithm (sklearn.cluster.KMeans).
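The precedence between index_names_var and index_positions_var described above can be sketched as follows. This is a hypothetical helper written for illustration, not the library's code. It assumes that no subsetting happens when both arguments are None and that selected names keep their original order; the marker names are invented.

```python
def select_var_names(var_names, index_names_var=None, index_positions_var=None):
    """Hypothetical helper mirroring the documented precedence of the two options."""
    if index_names_var is not None:
        # Names win whenever they are given; positions are ignored.
        wanted = set(index_names_var)
        return [name for name in var_names if name in wanted]
    if index_positions_var is not None:
        # Positions are only consulted when no names were passed.
        return [var_names[i] for i in index_positions_var]
    # Assumption: with neither option set, all variables are used.
    return list(var_names)


var_names = ["CD3", "CD4", "CD8", "DAPI"]  # invented marker names
by_name = select_var_names(var_names, index_names_var=["CD4", "CD8"])
by_position = select_var_names(var_names, index_positions_var=[0, 3])
both = select_var_names(var_names, index_names_var=["CD3"], index_positions_var=[1, 2])
```

In this sketch, passing both arguments behaves exactly like passing index_names_var alone.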

Returns:

  The input sdata with the clustering results added.
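A call might look like the sketch below. The layer names are hypothetical placeholders, and sdata is assumed to be a SpatialData object that has already been preprocessed. The n_init entry is a sklearn.cluster.KMeans parameter, included to show how extra keyword arguments are forwarded.

```python
# Hypothetical layer names; replace them with the ones in your SpatialData object.
kmeans_kwargs = dict(
    labels_layer="segmentation_mask",
    table_layer="table_transcriptomics",
    output_layer="table_transcriptomics_clustered",
    calculate_umap=True,
    rank_genes=False,
    n_clusters=8,
    key_added="kmeans",
    random_state=100,
    overwrite=True,
    n_init=10,  # forwarded to sklearn.cluster.KMeans through **kwargs
)


def cluster(sdata):
    """Run KMeans on an already preprocessed SpatialData object."""
    # Imported lazily so the sketch can be loaded without sparrow installed.
    import sparrow as sp

    return sp.tb.kmeans(sdata, **kmeans_kwargs)
```

Writing to a separate output_layer, as here, leaves the original table layer untouched. After cluster(sdata) returns, the results are expected in sdata.tables["table_transcriptomics_clustered"].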

Notes

  • The function writes a table layer containing the clustering labels and, optionally, UMAP coordinates and gene rankings, for downstream analysis and visualization.

  • Gene ranking based on cluster contributions is intended for identifying marker genes that characterize each cluster.

Warning

  • The function is intended for use with spatial omics data. Input data should be appropriately preprocessed (e.g. via sparrow.tb.preprocess_transcriptomics or sparrow.tb.preprocess_proteomics) to ensure meaningful clustering results.

  • The rank_genes functionality is marked for relocation to enhance modularity and clarity of the codebase.

See also

sparrow.tb.preprocess_transcriptomics

preprocess transcriptomics data.

sparrow.tb.preprocess_proteomics

preprocess proteomics data.