sparrow.tb.kmeans
- sparrow.tb.kmeans(sdata, labels_layer, table_layer, output_layer, calculate_umap=True, rank_genes=True, n_neighbors=35, n_pcs=17, n_clusters=5, key_added='kmeans', index_names_var=None, index_positions_var=None, random_state=100, overwrite=False, **kwargs)
Applies KMeans clustering on the table_layer of the SpatialData object with optional UMAP calculation and gene ranking.

This function executes the KMeans clustering algorithm (via sklearn.cluster.KMeans) on spatial data encapsulated by a SpatialData object. It optionally computes a UMAP (Uniform Manifold Approximation and Projection) for dimensionality reduction and ranks genes based on their contributions to the clustering. The clustering results, along with optional UMAP and gene ranking, are added to sdata.tables[output_layer] for downstream analysis.

- Parameters:
  - sdata (SpatialData) – The input SpatialData object.
  - labels_layer (str | list[str] | None) – The labels layer(s) of sdata used to select the cells via the _REGION_KEY in sdata.tables[table_layer].obs. Note that if output_layer is equal to table_layer and overwrite is True, cells in sdata.tables[table_layer] linked to other labels_layer (via the _REGION_KEY) will be removed from sdata.tables[table_layer]. If a list of labels layers is provided, they will therefore be clustered together (e.g. multiple samples).
  - table_layer (str) – The table layer in sdata on which to perform clustering.
  - output_layer (str) – The output table layer in sdata to which the table layer with the clustering results will be written.
  - calculate_umap (bool (default: True)) – If True, calculates a UMAP via scanpy.tl.umap for visualization of computed clusters.
  - rank_genes (bool (default: True)) – If True, ranks genes based on their contributions to the clusters via scanpy.tl.rank_genes_groups. TODO: To be moved to a separate function.
  - n_neighbors (int (default: 35)) – The number of neighbors to consider when calculating neighbors via scanpy.pp.neighbors. Ignored if calculate_umap is False.
  - n_pcs (int (default: 17)) – The number of principal components to use when calculating neighbors via scanpy.pp.neighbors. Ignored if calculate_umap is False.
  - n_clusters (int (default: 5)) – The number of clusters to form.
  - key_added (default: 'kmeans') – The key under which the clustering results are added to the SpatialData object (in sdata.tables[table_layer].obs).
  - index_names_var (Optional[Iterable[str]] (default: None)) – List of index names to subset in sdata.tables[table_layer].var. If None, index_positions_var will be used if not None.
  - index_positions_var (Optional[Iterable[int]] (default: None)) – List of integer positions to subset in sdata.tables[table_layer].var. Used if index_names_var is None.
  - random_state (int (default: 100)) – A random state for reproducibility of the clustering.
  - overwrite (bool (default: False)) – If True, overwrites the output_layer if it already exists in sdata.
  - **kwargs – Additional keyword arguments passed to the KMeans algorithm (sklearn.cluster.KMeans).
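The precedence between index_names_var and index_positions_var described in the parameter list can be sketched in plain Python. The helper select_var_names below is hypothetical and not part of sparrow; it works on a plain list of variable names rather than on sdata.tables[table_layer].var. The behaviour when both arguments are None (no subsetting) is an assumption, since the parameter descriptions do not state it explicitly.

```python
from typing import Iterable, List, Optional


def select_var_names(
    var_names: List[str],
    index_names_var: Optional[Iterable[str]] = None,
    index_positions_var: Optional[Iterable[int]] = None,
) -> List[str]:
    """Hypothetical helper mirroring the documented precedence rule."""
    if index_names_var is not None:
        # Names take precedence; positions are ignored in this case.
        wanted = set(index_names_var)
        return [name for name in var_names if name in wanted]
    if index_positions_var is not None:
        # Positions are only consulted when no names were given.
        return [var_names[i] for i in index_positions_var]
    # Assumption: with both arguments None, no subsetting takes place.
    return list(var_names)


genes = ["CD3E", "CD4", "CD8A", "MS4A1"]
by_name = select_var_names(genes, index_names_var=["CD4", "MS4A1"])
by_position = select_var_names(genes, index_positions_var=[0, 2])
both_given = select_var_names(
    genes, index_names_var=["CD4"], index_positions_var=[0, 2]
)
```

In this sketch, passing both arguments selects by name only, which matches the statement that index_positions_var is used if index_names_var is None.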
- Returns:
  The input sdata with the clustering results added.
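A call might look like the sketch below. It uses only the parameters listed in the signature; the layer names ("segmentation_mask", "table_transcriptomics", "table_transcriptomics_kmeans") and the wrapper cluster_cells are hypothetical placeholders, and the sketch assumes that sdata has already been segmented and preprocessed. Reading the labels from the output table's obs under key_added is likewise an assumption based on the description of where results are stored.

```python
def cluster_cells(sdata):
    """Run KMeans on a preprocessed table and return the cluster labels."""
    # Imported lazily so the sketch can be loaded without the package installed.
    import sparrow as sp

    sdata = sp.tb.kmeans(
        sdata,
        labels_layer="segmentation_mask",  # hypothetical layer name
        table_layer="table_transcriptomics",  # hypothetical layer name
        output_layer="table_transcriptomics_kmeans",  # hypothetical layer name
        calculate_umap=True,
        rank_genes=True,
        n_clusters=5,
        key_added="kmeans",
        random_state=100,
        overwrite=True,
    )
    # The function returns the input sdata; the labels are assumed to sit in
    # the obs of the output table under the key given by key_added.
    return sdata.tables["table_transcriptomics_kmeans"].obs["kmeans"]
```

Writing to a separate output_layer, as done here, leaves the original table_layer untouched, which avoids the cell removal described under labels_layer.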
Notes
The function adds a table layer containing the clustering labels and, optionally, UMAP coordinates and gene rankings, to facilitate downstream analysis and visualization.
Gene ranking based on cluster contributions is intended for identifying marker genes that characterize each cluster.
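To make the roles of n_clusters and random_state more concrete, the following is a minimal pure-Python sketch of Lloyd's algorithm, the classic iteration behind KMeans. It is not the sklearn.cluster.KMeans implementation that this function calls; the function kmeans_sketch, its n_iter argument, and the toy points are illustrative assumptions only.

```python
import random
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def squared_distance(a: Point, b: Point) -> float:
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


def kmeans_sketch(
    points: Sequence[Point], n_clusters: int, random_state: int, n_iter: int = 20
) -> List[int]:
    """Illustrative Lloyd's algorithm; returns one cluster label per point."""
    # The seed fixes which points are drawn as initial centers.
    rng = random.Random(random_state)
    centers = rng.sample(list(points), n_clusters)
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point joins its closest center.
        labels = [
            min(range(n_clusters), key=lambda k: squared_distance(p, centers[k]))
            for p in points
        ]
        # Update step: each center moves to the mean of its members.
        for k in range(n_clusters):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centers[k] = (
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                )
    return labels


# Two well-separated toy groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels_a = kmeans_sketch(points, n_clusters=2, random_state=100)
labels_b = kmeans_sketch(points, n_clusters=2, random_state=100)
```

Repeating the call with the same random_state gives identical labels, which is the reproducibility that the random_state parameter is meant to provide. The integer assigned to each cluster is arbitrary; only the grouping is meaningful.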
Warning
The function is intended for use with spatial omics data. Input data should be appropriately preprocessed (e.g. via sp.tb.preprocess_transcriptomics or sp.tb.preprocess_proteomics) to ensure meaningful clustering results.

The rank_genes functionality is marked for relocation to enhance modularity and clarity of the codebase.
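As a generic illustration of why preprocessing matters for distance-based clustering such as KMeans, the sketch below compares nearest neighbours before and after column-wise standardization. It does not reproduce what sp.tb.preprocess_transcriptomics or sp.tb.preprocess_proteomics do; the helpers and the toy matrix are hypothetical.

```python
import statistics
from typing import List, Sequence


def standardize(rows: List[List[float]]) -> List[List[float]]:
    """Scale each column to zero mean and unit (population) variance."""
    columns = list(zip(*rows))
    means = [statistics.mean(col) for col in columns]
    stdevs = [statistics.pstdev(col) for col in columns]
    return [
        [(value - m) / s for value, m, s in zip(row, means, stdevs)]
        for row in rows
    ]


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_neighbor(rows: List[List[float]], index: int) -> int:
    """Index of the row closest to rows[index], excluding itself."""
    others = [j for j in range(len(rows)) if j != index]
    return min(others, key=lambda j: squared_distance(rows[index], rows[j]))


# Toy matrix: rows are cells, columns are two features on very different scales.
cells = [
    [1000.0, 1.0],  # cell 0
    [1010.0, 9.0],  # cell 1: similar in feature 0, very different in feature 1
    [1100.0, 1.5],  # cell 2: similar in feature 1
    [2000.0, 5.0],  # cell 3: outlier in feature 0
]

raw_neighbor = nearest_neighbor(cells, 0)
scaled_neighbor = nearest_neighbor(standardize(cells), 0)
```

On the raw values, the large-magnitude feature dominates the Euclidean distance and cell 0 is paired with cell 1; after standardization both features contribute comparably and cell 0 is paired with cell 2. Cluster assignments can shift in the same way.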
See also
sparrow.tb.preprocess_transcriptomics
preprocess transcriptomics data.
sparrow.tb.preprocess_proteomics
preprocess proteomics data.