How to start

How to start#

%load_ext autoreload
%autoreload 2

import sparrow as sp

1. Read in the data#

The first step of the SPArrOW pipeline includes reading in the raw data. This, minimally, includes the stained image(s) and the transcript locations.

The example dataset for this notebook will be downloaded and cached using pooch via sparrow.dataset.registry.

For the input images, there are different options (see API documentation for more details)

input the image as a dask or numpy array
define the path to the location the image is stored at (preferably tiff or zarr)
define a list of paths for different stainings (stored in different channels)
define a pattern representing a collection of z-stacks. (these are currently by default max-projcted to form one image).

The transcript location file is expected to be a txt or csv file. For RESOLVE, StereoSeq and Vizgen, specific dataloaders are created (still in progress for xenium and CosMx). For a new datatype, you can read the API on the function sparrow.io.read_transcripts.

The image is then read in to a SpatialData object (see https://spatialdata.scverse.org/en/latest/ for more information). As the Spatial datasets are often larger-than memory, a path to where all data will be stored is asked.

When working on GPU, it is advisable to define a client to avoid memory issues.

from sparrow.datasets.registry import get_registry
import tempfile

# change this path. It is the directory where the spatialdata .zarr will be saved.
OUTPUT_DIR = tempfile.gettempdir()

registry = get_registry()
path_image = registry.fetch("transcriptomics/resolve/mouse/20272_slide1_A1-1_DAPI.tiff")
path_coordinates = registry.fetch("transcriptomics/resolve/mouse/20272_slide1_A1-1_results.txt")

crop_segment = True  # set to True to run tiling correction and segmentation step on a crop
if crop_segment:
    crop_param = [0, 6432, 0, 6432]  # x0,x1,y0,y1
else:
    crop_param = None

# if you stopped your analysis during the process and want to pick it up where you left it,
# you can read in your sdata object as follows:
# from spatialdata import read_zarr
# sdata_load=read_zarr( os.path.join(  OUTPUT_DIR, 'sdata_allocation.zarr' ) )

Convert to zarr and load the image.#

import os
import uuid

img_layer = "raw_image"

# you can choose any name for your zarr file
zarr_path = os.path.join(OUTPUT_DIR, f"sdata_{uuid.uuid4()}.zarr")

sdata = sp.io.create_sdata(
    input=path_image,
    output_path=zarr_path,
    img_layer=img_layer,
    chunks=1024,
)

INFO     The Zarr backing store has been changed from None the new file path:                                      
         /tmp/nix-shell.wcNPOY/sdata_bb7759bc-5cdf-428e-87bc-7a188a95b5f3.zarr

sdata

SpatialData object, with associated Zarr store: /tmp/nix-shell.wcNPOY/sdata_bb7759bc-5cdf-428e-87bc-7a188a95b5f3.zarr
└── Images
      └── 'raw_image': DataArray[cyx] (1, 12864, 10720)
with coordinate systems:
    ▸ 'global', with elements:
        raw_image (Images)

1.1 Plot the image(s)#

crd = [2000, 6000, 3000, 6000]  # crd used for visualization purposes

sp.pl.plot_image(sdata, img_layer="raw_image", crd=crd, figsize=(8, 8))

../../_images/f922debbfd51f5417709487284fe857fa1ff7cf260074f1779b65edf950b1eaa.png

2. Processing the image(s)#

2.1 tiling correction and inpainting#

When working with RESOLVE data, the data is acquired in tiles, and the illumination within a tile isn’t always constant. Sometimes one side of a tile is more illuminated than the other, influencing the downstream analysis greatly. RESOLVE assured us this isn’t linked to the counts of the transcripts, but this can be checked further on. In general this step is not necessary for other datatypes (you can check this by plotting the complete image).

Basic is a tool that can correct for this, and is used in this function. The size of the tile needs to be known in order to run the function. The dfault value for this function is the tile size of RESOLVe (2144).

This step also corrects for black lines in between the tiles, by using inpainting.

sdata, flatfields = sp.im.tiling_correction(
    sdata=sdata, img_layer="raw_image", crd=crop_param, output_layer="tiling_correction", overwrite=True
)

# sp.pl.flatfield(flatfields[0])
sp.pl.tiling_correction(sdata, crd=crd, img_layer=["raw_image", "tiling_correction"], figsize=(10, 10))

WARNING: Unable to find `y`, `x` or `z` dimension in `('y', 'x')`. Renaming to `('y', 'x', 'z', 'channels')`

../../_images/7e91c27a3045604e743121d710e01036ad76830f95de0f3ae22a73865f19feee.png

2.2 min-max filtering and contrast enhancing#

The second step of the preprocessing the data includes a couple of steps:

A min max filter can be added. The goal of this function is to substract background noise, and make the borders of the nuclei/cells cleaner, plus it will delete the occasional debris. If you take the size too small, smaller then the size of your nuclei, the function will create donuts, with black spots in the center of your cells. If the size of the min max filter is chosen too big, not enough background is substracted, so a tradeoff should be made. This might need some finetuning. For nuclei in RESOLVE data, 45-55 is a great starting point. Bigger for whole cells. Adapt this parameter to make sure you delete debris and HALO’s.
We recommend to perform contrast enhancing on your image. SPArrOW does this by using histogram equalization (CLAHE function). The amount of correction needed can be decided by adapting the contrast_clip value. If the image is already quite bright, 3.5 might be a good starting value. For dark images, you can go up to 10 or even more. Make sure at the end the whole image is evenly illuminated and no cells are dark in the background.

In every cleaning step, a new image is created and save to memory, hat is then used as an input for the next step. Note that it is currently not possible to overwrite these objects. You will have to define a new name every time you run the function, and thne in the end delete the images you don’t need on disk.

If you think you need more image processing, you can perform other steps using our apply function (see API? maybe not yet..) These images can then be added to the SpatialData object too

sdata = sp.im.min_max_filtering(
    sdata=sdata, img_layer="tiling_correction", output_layer="min_max_filtered", size_min_max_filter=45, overwrite=True
)
sp.pl.plot_image(sdata, img_layer="min_max_filtered", crd=crd, figsize=(5, 5))
sdata = sp.im.enhance_contrast(
    sdata=sdata, img_layer="min_max_filtered", output_layer="clahe", contrast_clip=3.5, chunks=20000, overwrite=True
)
sp.pl.plot_image(sdata, img_layer="clahe", crd=crd, figsize=(5, 5))

../../_images/ced04281bf1dfb5c679d733978ce002fcb614766f80792b4d81439a008beaee0.png

../../_images/95ca1ce554d35d601018e537a2ff09d3cd0de684c69926724f30c2d4e0228d8f.png

3. Segmenting the image#

For the segmentation, we here show an example on how to use cellpose, a deep learning network based on a UNET architecture. This is a great tool , because it segments both nuclei and whole cells.

Multiple paramters need to be given as an input to the cellpose algorithm. We recommend tuning for the optimal segmentation quality.

diameter: Includes an estimate of the diameter of a nucleus. If put to none, cellpose will do the estimation by himself, but this estimation might take longer than the actual segmentation, and if often far off. Estimate around 7 micrometer (in this case 50 pixels at 0.138 micrometer per pixel) for a standard nucleus, and more for whole cells. Input is in pixels.This of course is tissue and method dependent. You can run the algorthim on a small piece (I[0:1000,0:1000] for example), to get an estimate of the size. However, this estimate isn’t always accurate. So check the quality at the end. If you see all nuclei/cells are estimated too small, enlarge this parameter.
device: Defines the device you want to work on, if you only have cpu, you can skip this input parameter. If only having CPU, please tune the parameters on a small subset, and then make it to the big one. This might take a while for large images, but it should work.
flow_threshold: Indicates something about the shape of the masks, if you increase it, more masks with less round shapes will be accepted. Up to one: I take it between 0.6 and 0.95, depending on the cell shapes. Higher is less round. Lower it if you start segmenting artefacts, up it if you miss non-round shaped cells.
mask_threshold: Indicates how many of the possible masks are kept. Making it smaller (up to -6), will give you more masks, bigger is less masks. I take it between 0 and -6. Be careful, you can oversegment: always check the quality
min_size: Indicates the minimal size of a nucleus.
If segmenting whole cells instead of nuclei, set the parameter model_type to ‘cyto’.
If using nuclei together with whole cells, put model_type to ‘cyto’, make sure your image is 3D and and that the first channel is you complete cell staining and you second one is the nucleus channel, put the parameter channel to np.array([1,0])

import torch
import cellpose

from packaging import version
from dask.distributed import Client, LocalCluster

cellpose_version = version.parse(cellpose.version)

if torch.backends.mps.is_available() and cellpose_version >= version.parse("4.0"):  # mps bugged in cellpose < 4.0
    device = "mps"
    chunks = 6432  # take 'large' chunks on gpu
elif torch.cuda.is_available():
    device = "cuda"
    chunks = 6432  # take 'large' chunks on gpu
else:
    device = "cpu"  # on cpu we benefit from taking smaller chunks and running in parallel
    chunks = 2048

print(f"Using device: {device}.")

cluster = LocalCluster(
    n_workers=8
    if device == "cpu"
    else 1,  # if cuda or mps device available, it is better to increase chunk size to maximal value that fits on the gpu, and set n_workers to 1.
    threads_per_worker=1,
    memory_limit="32GB",
)

client = Client(cluster)

print(client.dashboard_link)

sdata = sp.im.segment(
    sdata=sdata,
    img_layer="clahe",
    output_labels_layer="segmentation_mask",
    output_shapes_layer="segmentation_mask_boundaries",
    crd=crop_param,
    device=device,
    min_size=80,
    flow_threshold=0.9,
    diameter=50,
    cellprob_threshold=-4,
    pretrained_model="nuclei",
    chunks=chunks,
    depth=(2 * 50, 2 * 50),
    overwrite=True,
)

client.close()

sp.pl.segment(sdata=sdata, crd=crd, img_layer="clahe", shapes_layer="segmentation_mask_boundaries")

Using device: cpu.
http://127.0.0.1:8787/status

../../_images/ce6fbd8881c9fb41649d5ad9c4423e34b7eddc6af221ea7e494738cca3a9fafc.png

This is in general not recommended, but it is possible to expand cells beyond the segmented bodies.

expand_masks = True
if expand_masks:
    sdata = sp.im.expand_labels_layer(
        sdata,
        labels_layer="segmentation_mask",
        distance=10,
        output_labels_layer="segmentation_mask_expanded",
        output_shapes_layer="segmentation_mask_expanded_boundaries",
        overwrite=True,
    )
    sp.pl.segment(sdata=sdata, crd=crd, img_layer="clahe", shapes_layer="segmentation_mask_expanded_boundaries")

../../_images/5b5365eae3bddaaae55a29601a3bd7fd4643c78ae2a9cca4ddacf51958d21bc8.png

4. Allocating the transcripts#

4.1 Creating the count matrix#

In this step we

load in the transcipts: in the case of RESOLVE this is done with a specific loader. If no specific loader exist for your datatype, you can use the general sparrow.io.read_transcripts function.
allocate the transcripts to the correct cell. This allocation step creates the count matrix, saved in an anndata object that is reachable by typing sdata.table
Visual checks

sdata = sp.io.read_resolve_transcripts(
    sdata, output_layer="transcripts", path_count_matrix=path_coordinates, overwrite=True
)

sdata = sp.tb.allocate(
    sdata=sdata,
    labels_layer="segmentation_mask",
    points_layer="transcripts",
    output_layer="table_transcriptomics",
    overwrite=True,
)

This next plot can be used to check counts on specific cells

# some sanity checks
# some sanity checks
matching_indexes = [
    index for index in sdata.tables["table_transcriptomics"].to_df().index if index.startswith("34_segmentation_mask")
]
random_cell = matching_indexes[0]
random_gene = "Ghr"

sdata.tables["table_transcriptomics"][random_cell, random_gene].to_df()

	Ghr
cells
34_segmentation_mask_3671d7c6	4

diameter = 50

if random_cell is not None and random_gene is not None:
    cell_center = sdata.tables["table_transcriptomics"].obsm["spatial"][
        sdata.tables["table_transcriptomics"].to_df().index.get_loc(str(random_cell))
    ]
    crd_sanity = [
        cell_center[0] - (diameter + 50),
        cell_center[0] + (diameter + 50),
        cell_center[1] - (diameter + 50),
        cell_center[1] + (diameter + 50),
    ]

    print(f"plot transcripts matrix for cell {random_cell} and gene {random_gene}")

    sp.pl.sanity(
        sdata=sdata,
        img_layer="clahe",
        points_layer="transcripts",
        shapes_layer="segmentation_mask_boundaries",
        crd=crd_sanity,
        plot_cell_number=True,
        gene=random_gene,
        figsize=(8, 8),
    )

plot transcripts matrix for cell 34_segmentation_mask_3671d7c6 and gene Ghr

../../_images/e8e9c30d209e3e47d464f7967032f3b6f4687af9cac52c489c54131279618bcc.png

In this step, there might be semgented entities that do not contain any transcripts. These entities are filtered out of the semgentation_mask_boundaries shapes layer and aren’t added to the anndata. Instead these shapes are now saved in filtered_semgentation_segmentation_mask_boundaries.

It is possible to visualize these entities on the image with red borders, by making use of the shapes_layer_filtered parameter. As you can see, in this case, only 17 entities were filtered. This will be very dataset dependent. When there are regions with necrosis for example, this number will be way higher.

Visualizing these cells shows clearly on plots that those cells weren’t missed with segmentation, but were missed becasue of no transcripts measured, an issue?

sdata

SpatialData object, with associated Zarr store: /tmp/nix-shell.wcNPOY/sdata_bb7759bc-5cdf-428e-87bc-7a188a95b5f3.zarr
├── Images
│     ├── 'clahe': DataArray[cyx] (1, 6432, 6432)
│     ├── 'min_max_filtered': DataArray[cyx] (1, 6432, 6432)
│     ├── 'raw_image': DataArray[cyx] (1, 12864, 10720)
│     └── 'tiling_correction': DataArray[cyx] (1, 6432, 6432)
├── Labels
│     ├── 'segmentation_mask': DataArray[yx] (6432, 6432)
│     └── 'segmentation_mask_expanded': DataArray[yx] (6432, 6432)
├── Points
│     └── 'transcripts': DataFrame with shape: (<Delayed>, 3) (2D points)
├── Shapes
│     ├── 'filtered_segmentation_segmentation_mask_boundaries': GeoDataFrame shape: (16, 1) (2D shapes)
│     ├── 'filtered_segmentation_segmentation_mask_expanded_boundaries': GeoDataFrame shape: (16, 1) (2D shapes)
│     ├── 'segmentation_mask_boundaries': GeoDataFrame shape: (3360, 1) (2D shapes)
│     └── 'segmentation_mask_expanded_boundaries': GeoDataFrame shape: (3360, 1) (2D shapes)
└── Tables
      └── 'table_transcriptomics': AnnData (3360, 99)
with coordinate systems:
    ▸ 'global', with elements:
        clahe (Images), min_max_filtered (Images), raw_image (Images), tiling_correction (Images), segmentation_mask (Labels), segmentation_mask_expanded (Labels), transcripts (Points), filtered_segmentation_segmentation_mask_boundaries (Shapes), filtered_segmentation_segmentation_mask_expanded_boundaries (Shapes), segmentation_mask_boundaries (Shapes), segmentation_mask_expanded_boundaries (Shapes)

sp.pl.plot_shapes(
    sdata,
    img_layer="clahe",
    shapes_layer="segmentation_mask_boundaries",
    shapes_layer_filtered="filtered_segmentation_segmentation_mask_boundaries",
)

../../_images/50a7370efa66a6f7fc02f04f242079265976408598bdd7b1e22580bc1ae3acdc.png

4.2 Transcript quality plot#

After we have created the anndata object, we control the transcript quality.

First we create a plot to chekc if the transcript density is similar across the whole tissue. If this isn’t the case, it can have multiple reasons. Most likely, there will be regions in which the transcript pick-up was less succesfull. Also gene panel choices can influence this plot.

sdata = sp.im.transcript_density(
    sdata,
    img_layer="clahe",
    points_layer="transcripts",
    output_layer="transcript_density",
    overwrite=True,
)
sp.pl.transcript_density(sdata, img_layer=["clahe", "transcript_density"])

../../_images/1dd3a5a1b99039e28d707799fd94280ed8d252429dda095f323f30b54b349cbe.png

As not all of the image surface is segmented, there will be most likely transcripts that weren’t assigned to a cell. For sure in the case of nucleus segmentation (like this example), this will be the case.

In general, we hope to not lose any genes. So we hope there aren’t genes with low abundances and a low proportion kept. In general we see a downward trend. The more a gene is measured, the less it is located in cells (in ratio).

We also provide a table with the genes that are the least located in cells. If a lot of these genes arem akrers for the same celltype, the staining might bem issing this celltype and you should for sure check this. However, it might also be the case that is celltype just has a lot of cytoplasm and you are only semgenting the nucleus.

filtered = sp.pl.analyse_genes_left_out(
    sdata,
    labels_layer="segmentation_mask",
    table_layer="table_transcriptomics",
    points_layer="transcripts",
)

../../_images/5b818856fcc37b6e0b3c156dd37708dafb411b8f45dcd5c83cfbc4d1a63eacb7.png

../../_images/199d617010507e159321692602f8f2088d27e34943ff7c4bdcac227abee3f49d.png

5. Preprocess adata#

5.1 Filtering and Normalization#

The anndata object is now processed:

calculate QC metrics
filter cells with less then 10 gene counts and genes with less then 5 cells (adaptations possible by adapting the function). These filtered cells are again filtered out of the shapes layer and the anndata obejct and saved in an extra shapes layer.
Normalization: For small gene panel (<500), we recommend to normalize the data based on the size of the segmented object (size_norm=True). For transcriptome-wide methods, we recommend standard library size normalization (size_norm=False).

The last plot shows the size of the nucleus related to the counts. When working with whole cells, if there are some really big xcells with really low counts, they are probably not real cells and you should filter based on max size.

sdata

SpatialData object, with associated Zarr store: /tmp/nix-shell.wcNPOY/sdata_bb7759bc-5cdf-428e-87bc-7a188a95b5f3.zarr
├── Images
│     ├── 'clahe': DataArray[cyx] (1, 6432, 6432)
│     ├── 'min_max_filtered': DataArray[cyx] (1, 6432, 6432)
│     ├── 'raw_image': DataArray[cyx] (1, 12864, 10720)
│     ├── 'tiling_correction': DataArray[cyx] (1, 6432, 6432)
│     └── 'transcript_density': DataArray[cyx] (1, 6432, 6432)
├── Labels
│     ├── 'segmentation_mask': DataArray[yx] (6432, 6432)
│     └── 'segmentation_mask_expanded': DataArray[yx] (6432, 6432)
├── Points
│     └── 'transcripts': DataFrame with shape: (<Delayed>, 3) (2D points)
├── Shapes
│     ├── 'filtered_segmentation_segmentation_mask_boundaries': GeoDataFrame shape: (16, 1) (2D shapes)
│     ├── 'filtered_segmentation_segmentation_mask_expanded_boundaries': GeoDataFrame shape: (16, 1) (2D shapes)
│     ├── 'segmentation_mask_boundaries': GeoDataFrame shape: (3360, 1) (2D shapes)
│     └── 'segmentation_mask_expanded_boundaries': GeoDataFrame shape: (3360, 1) (2D shapes)
└── Tables
      └── 'table_transcriptomics': AnnData (3360, 99)
with coordinate systems:
    ▸ 'global', with elements:
        clahe (Images), min_max_filtered (Images), raw_image (Images), tiling_correction (Images), transcript_density (Images), segmentation_mask (Labels), segmentation_mask_expanded (Labels), transcripts (Points), filtered_segmentation_segmentation_mask_boundaries (Shapes), filtered_segmentation_segmentation_mask_expanded_boundaries (Shapes), segmentation_mask_boundaries (Shapes), segmentation_mask_expanded_boundaries (Shapes)

# Perform preprocessing.
sdata = sp.tb.preprocess_transcriptomics(
    sdata,
    labels_layer="segmentation_mask",
    table_layer="table_transcriptomics",
    output_layer="table_transcriptomics_preprocessed",  # write results to a new slot, we could also write to the same slot (when passing overwrite==True).
    min_counts=10,
    min_cells=5,
    size_norm=True,
    n_comps=50,
    overwrite=True,
)

the raw counts are saved in sdata.tables['table_transcriptomics_preprocessed'].layers['raw_counts']. The size normalized data in .raw.

sdata.tables["table_transcriptomics_preprocessed"].raw.to_adata().to_df()

	Acta2	Adamtsl2	Adgre1	Adgrg6	Atp6v0d2	Axl	C5ar1	Ccr2	Ccr7	Cd14	...	Svep1	Timd4	Tmem119	Trem2	Vsig4	Vwf	Wnt2	Wnt9b	Wt1	Xcr1
cells
17_segmentation_mask_3671d7c6	0.0	0.027087	0.000000	0.0	0.00000	0.027087	0.000000	0.0	0.0	0.0	...	0.000000	0.0	0.000000	0.0	0.0	0.0	0.0	0.0	0.053460	0.0
18_segmentation_mask_3671d7c6	0.0	0.000000	0.438225	0.0	0.00000	0.168322	0.087699	0.0	0.0	0.0	...	0.000000	0.0	0.242928	0.0	0.0	0.0	0.0	0.0	0.000000	0.0
32_segmentation_mask_3671d7c6	0.0	0.000000	0.000000	0.0	0.00000	0.039360	0.000000	0.0	0.0	0.0	...	0.000000	0.0	0.000000	0.0	0.0	0.0	0.0	0.0	0.000000	0.0
33_segmentation_mask_3671d7c6	0.0	0.333588	0.000000	0.0	0.00000	0.035364	0.000000	0.0	0.0	0.0	...	0.000000	0.0	0.134521	0.0	0.0	0.0	0.0	0.0	0.134521	0.0
34_segmentation_mask_3671d7c6	0.0	0.158128	0.024179	0.0	0.00000	0.115442	0.000000	0.0	0.0	0.0	...	0.070851	0.0	0.047787	0.0	0.0	0.0	0.0	0.0	0.070851	0.0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
7534_segmentation_mask_3671d7c6	0.0	0.000000	0.000000	0.0	0.00000	0.000000	0.000000	0.0	0.0	0.0	...	0.000000	0.0	0.000000	0.0	0.0	0.0	0.0	0.0	0.000000	0.0
7548_segmentation_mask_3671d7c6	0.0	0.000000	0.000000	0.0	0.00000	0.000000	0.000000	0.0	0.0	0.0	...	0.000000	0.0	0.000000	0.0	0.0	0.0	0.0	0.0	0.000000	0.0
7560_segmentation_mask_3671d7c6	0.0	0.000000	0.000000	0.0	0.00000	0.000000	0.000000	0.0	0.0	0.0	...	0.000000	0.0	0.000000	0.0	0.0	0.0	0.0	0.0	0.000000	0.0
7563_segmentation_mask_3671d7c6	0.0	0.000000	0.000000	0.0	0.00000	0.000000	0.000000	0.0	0.0	0.0	...	0.000000	0.0	0.000000	0.0	0.0	0.0	0.0	0.0	0.000000	0.0
7576_segmentation_mask_3671d7c6	0.0	0.000000	0.000000	0.0	0.05891	0.000000	0.000000	0.0	0.0	0.0	...	0.000000	0.0	0.000000	0.0	0.0	0.0	0.0	0.0	0.000000	0.0

3253 rows × 98 columns

sdata.tables["table_transcriptomics_preprocessed"].layers["raw_counts"]

<3253x98 sparse matrix of type '<class 'numpy.uint32'>'
	with 44138 stored elements in Compressed Sparse Row format>

sp.pl.preprocess_transcriptomics(
    sdata,
    table_layer="table_transcriptomics_preprocessed",
)

../../_images/de5b45d1a32e10b1f3c07f80ff4017e6241ec0726c07aa40af58a88aa5c2ed2a.png

../../_images/a10521d94f1d301d53b44686d6877850bd2810e7fa716979878697dc393de280.png

../../_images/b11d9d8bb402be031b765e7c70f8b3dbdb77e2702ef2de248e0705b690937012.png

../../_images/2dd77d241d1306978c290cc45157095721be6e2263619b0182913f2cc6cb3369.png

sp.pl.plot_shapes(
    sdata,
    img_layer="clahe",
    table_layer="table_transcriptomics_preprocessed",
    column="total_counts",
    shapes_layer="segmentation_mask_boundaries",
)

../../_images/29ae89ad67f1619794594a73fb9e0788a8eb43b0b62ed6067df9c8822eddebc7.png

In this step you can filter cells based on their size: are you sure cells need to be bigger, or sure your cells can not be larger than X?

You can delete them with this function by defining min_size and max_size.

sdata = sp.tb.filter_on_size(
    sdata,
    labels_layer="segmentation_mask",
    table_layer="table_transcriptomics_preprocessed",
    output_layer="table_transcriptomics_filter",
    min_size=500,
    max_size=100000,
    overwrite=True,
)

sp.pl.plot_shapes(
    sdata,
    img_layer="clahe",
    table_layer="table_transcriptomics_filter",
    column="shapeSize",
    shapes_layer="segmentation_mask_boundaries",
)

../../_images/aca8f93b04bddd11fc416e94e7550282fdff2bf9980c532fd811ea00af486b13.png

You can see all filtered layers present in your sdata object under shapes

sdata

SpatialData object, with associated Zarr store: /tmp/nix-shell.wcNPOY/sdata_bb7759bc-5cdf-428e-87bc-7a188a95b5f3.zarr
├── Images
│     ├── 'clahe': DataArray[cyx] (1, 6432, 6432)
│     ├── 'min_max_filtered': DataArray[cyx] (1, 6432, 6432)
│     ├── 'raw_image': DataArray[cyx] (1, 12864, 10720)
│     ├── 'tiling_correction': DataArray[cyx] (1, 6432, 6432)
│     └── 'transcript_density': DataArray[cyx] (1, 6432, 6432)
├── Labels
│     ├── 'segmentation_mask': DataArray[yx] (6432, 6432)
│     └── 'segmentation_mask_expanded': DataArray[yx] (6432, 6432)
├── Points
│     └── 'transcripts': DataFrame with shape: (<Delayed>, 3) (2D points)
├── Shapes
│     ├── 'filtered_low_counts_segmentation_mask_boundaries': GeoDataFrame shape: (107, 1) (2D shapes)
│     ├── 'filtered_low_counts_segmentation_mask_expanded_boundaries': GeoDataFrame shape: (107, 1) (2D shapes)
│     ├── 'filtered_segmentation_segmentation_mask_boundaries': GeoDataFrame shape: (16, 1) (2D shapes)
│     ├── 'filtered_segmentation_segmentation_mask_expanded_boundaries': GeoDataFrame shape: (16, 1) (2D shapes)
│     ├── 'filtered_size_segmentation_mask_boundaries': GeoDataFrame shape: (74, 1) (2D shapes)
│     ├── 'filtered_size_segmentation_mask_expanded_boundaries': GeoDataFrame shape: (74, 1) (2D shapes)
│     ├── 'segmentation_mask_boundaries': GeoDataFrame shape: (3179, 1) (2D shapes)
│     └── 'segmentation_mask_expanded_boundaries': GeoDataFrame shape: (3179, 1) (2D shapes)
└── Tables
      ├── 'table_transcriptomics': AnnData (3360, 99)
      ├── 'table_transcriptomics_filter': AnnData (3179, 98)
      └── 'table_transcriptomics_preprocessed': AnnData (3253, 98)
with coordinate systems:
    ▸ 'global', with elements:
        clahe (Images), min_max_filtered (Images), raw_image (Images), tiling_correction (Images), transcript_density (Images), segmentation_mask (Labels), segmentation_mask_expanded (Labels), transcripts (Points), filtered_low_counts_segmentation_mask_boundaries (Shapes), filtered_low_counts_segmentation_mask_expanded_boundaries (Shapes), filtered_segmentation_segmentation_mask_boundaries (Shapes), filtered_segmentation_segmentation_mask_expanded_boundaries (Shapes), filtered_size_segmentation_mask_boundaries (Shapes), filtered_size_segmentation_mask_expanded_boundaries (Shapes), segmentation_mask_boundaries (Shapes), segmentation_mask_expanded_boundaries (Shapes)

5.2 Clustering#

This function performs the neighborhood analysis and the leiden clustering and the UMAP calculations using standar scanpy functions.

You need to define 2 parameters:

the amount of PC’s used: I normally choose something between 15-20 based on the plot of PC’s.
The amount of neighbors used: Normally I go for 35. Less neighbors means more spread, more means everything tighter, in general.

It returns the UMAP and marker gene list per cluster, that can be looked at for finding celltypes.

sdata = sp.tb.leiden(
    sdata,
    labels_layer="segmentation_mask",
    table_layer="table_transcriptomics_filter",
    output_layer="table_transcriptomics_clustered",
    calculate_umap=True,
    calculate_neighbors=True,
    n_pcs=17,
    n_neighbors=35,
    resolution=0.8,
    rank_genes=True,
    key_added="leiden",
    overwrite=True,
)

sp.pl.cluster(
    sdata,
    table_layer="table_transcriptomics_clustered",
    key_added="leiden",
)

sp.pl.plot_shapes(
    sdata,
    img_layer="clahe",
    table_layer="table_transcriptomics_clustered",
    column="leiden",
    shapes_layer="segmentation_mask_boundaries",
)

../../_images/31bf1f27bb8d1dfa9d926456215bd6fb443b8982201c16e2415284a51d77ed1a.png

../../_images/7dafe1324cf769850d6d397622ad0682b34016314c6297feb3f6bed5974825de.png

../../_images/a47c937181df4ded891d8490543cc2ca3dc0fb29d310e450b766bb16e52c6a75.png

Here we reuse the plot_shapes function to actually plot a subfigure of our image with the cells colored based on the clustering.

6. Annotating the cells#

Afte we created the count matrix, we are going to annotate the cells. SPArrOW has its own annotation algorithm, that starts fro m a marker gene list. It is an enrichment score algorithm, that compares the expression of marker genes in cells with a celltype weighted average. For this reason, the annotation happens iteratively.

The input marker gene list should be a csv file, with the marker genes in the rows and the celltypes in the columns. The weight given to a celltype/ marker gene combination should reflect the marker gene specifity of the gene to the celltype. binary inputs (zero for no marker, 1 for marker) are also accepted. A tutorial on how to create such marker genes lists from an atlas will follow.

A cell is annotated to be unknown if its score is lower than zero for all celltypes.

As the approach is iterative, the results of every step are shown.

import pandas as pd
path_mg = registry.fetch("transcriptomics/resolve/mouse/markerGeneListMartinNoLow.csv")

path_mg = registry.fetch("transcriptomics/resolve/mouse/markerGeneListMartinNoLow.csv")

df = pd.read_csv(path_mg, index_col=0, delimiter=",")
df.columns = df.columns.str.replace(" ", "_", regex=False)  # whitespaces no longer allowed since spatialdata>=0.3.0

# Inspect annotation file containing markers
display(
    df.head()
)  # This is one-hot encoded matrix with cell types listed in the first row, and marker genes in the first column.

sdata, celltypes_scored, celltypes_all = sp.tb.score_genes_iter(
    sdata,
    labels_layer="segmentation_mask",
    table_layer="table_transcriptomics_clustered",
    output_layer="table_transcriptomics_score_genes_iter",
    path_marker_genes=df,
    overwrite=True,
)

	portal_vein_EC45	LSEC45	LSEC_Portal	LSEC_Central	central_vein_EC45	stellateAll	stellate_PeriPortal	stellate_PeriCentral	FibroblastAll	fibroblastPortal	...	ILC1s	T_cells	pDCs	B_cells	cDC1s	Kupffer_cells	Capsule_and_Central_Vein_Mac	Portal_LAM_macrophages	Monocytes	LECs
Column1
Acta2	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
Adamtsl2	0	0	0	0	0	1	0	1	1	1	...	0	0	0	0	0	0	0	0	0	0
Adgre1	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	1	1	1	1	0
Adgrg6	1	1	1	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
Atp6v0d2	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0

5 rows × 33 columns

../../_images/b09a1d64f09119e803853882e9066d74edbc8ccf4dbc248e9b718a94ebb9846f.png

../../_images/a644a08035697df565fa1609ff00b2627b070e497065259001324d283b780b2f.png

../../_images/76843ae0a9a110ab4d48ffbcc52a3407c2c94499394ecb4fd87e03c8ed2af235.png

../../_images/8b830d99601dfeef66efd743d4415b55ccbb3d0093901a04235218085bd88e93.png

../../_images/bd47be7298efcece9ab73cb9c45d99dee1b421a3ba80e85ab775d6752d6c7cf5.png

../../_images/9a7da1c95461442228bf13c317b24a254e0a679617cc784f9342d21dd6a650ee.png

../../_images/82ce1c1e9d7620c631efe1c64f24864bf5973dac3ecb21bf70c4dc5914e964dd.png

../../_images/90f9863568ad19f20f7bf116bfc2c65c90eaa2284557a0d6b91ce1b3d3f8cbdb.png

../../_images/bbb9b8c34510e3ce23f922c93bf05d027322af6fb942d90bfae15eca434e29da.png

../../_images/16fa273c7ff5937ec25eaf71a38de479111e50ee8150fe17f35d884d1e0402bc.png

../../_images/f8217eff3204aca04479701be51fd46ec3660076bcfe6eda1a0e0d5c3f3c7bfd.png

sdata.tables["table_transcriptomics_score_genes_iter"]

AnnData object with n_obs × n_vars = 3179 × 98
    obs: 'cell_ID', 'fov_labels', 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_2_genes', 'pct_counts_in_top_5_genes', 'n_counts', 'shapeSize', 'leiden', 'Cleanliness', 'annotation', 'portal_vein_EC45', 'LSEC45', 'LSEC_Portal', 'LSEC_Central', 'central_vein_EC45', 'stellateAll', 'stellate_PeriPortal', 'stellate_PeriCentral', 'FibroblastAll', 'fibroblastPortal', 'FibroblastsCentral', 'VSMC', 'capsular_fibroblasts45', 'Mesothelial_cells', 'Hepatocytes', 'HepatocytesPortal', 'HepatocytesCentral', 'Cholangiocytes', 'cDC2s', 'Mig._cDCs', 'Neutrophils', 'Basophils', 'NK_cells', 'ILC1s', 'T_cells', 'pDCs', 'B_cells', 'cDC1s', 'Kupffer_cells', 'Capsule_and_Central_Vein_Mac', 'Portal_LAM_macrophages', 'Monocytes', 'LECs'
    var: 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts', 'n_cells', 'mean', 'std'
    uns: 'leiden', 'leiden_colors', 'log1p', 'neighbors', 'pca', 'rank_genes_groups', 'spatialdata_attrs', 'umap'
    obsm: 'X_pca', 'X_umap', 'spatial'
    varm: 'PCs'
    layers: 'raw_counts'
    obsp: 'connectivities', 'distances'

from sparrow.utils._keys import _ANNOTATION_KEY

import scanpy as sc

sc.pl.umap(sdata.tables["table_transcriptomics_score_genes_iter"], color=_ANNOTATION_KEY)

filtered_layers = [
    "filtered_segmentation_segmentation_mask_boundaries",
    "filtered_low_counts_segmentation_mask_boundaries",
    "filtered_size_segmentation_mask_boundaries",
]

sp.pl.plot_shapes(
    sdata,
    column=_ANNOTATION_KEY,
    img_layer="clahe",
    table_layer="table_transcriptomics_score_genes_iter",
    shapes_layer="segmentation_mask_boundaries",
    alpha=0.7,
    shapes_layer_filtered=filtered_layers,
)

../../_images/629060589e989012616de034f335f1a4f37b60f3ad1d9e49da655609b8d6cbde.png

Funcitonality from squidpy can be easily used from within SPArrOW, just run it on sdata.table. We impemented the neighborhood enrichment from scratch.

sdata = sp.tb.nhood_enrichment(
    sdata,
    labels_layer="segmentation_mask",
    table_layer="table_transcriptomics_score_genes_iter",
    output_layer="table_transcriptomics_score_genes_enrichemnt",
    overwrite=True,
)

sp.pl.nhood_enrichment(
    sdata, table_layer="table_transcriptomics_score_genes_enrichemnt", celltype_column=_ANNOTATION_KEY
)

../../_images/453f05bd93a74d08845238e8e6205c4fe0b017166dd2838ff5ef2c9fa76b64ee.png

Use squidpy to see the enrichement between the different celltypes. For more info, see squidpy!