sparrow.io.read_transcripts

sparrow.io.read_transcripts(sdata, path_count_matrix, path_transform_matrix=None, output_layer='transcripts', overwrite=False, debug=False, column_x=0, column_y=1, column_z=None, column_gene=3, column_midcount=None, delimiter=',', header=None, comment=None, crd=None, to_coordinate_system='global', filter_gene_names=None, blocksize='64MB')

Reads transcript information from a file with each row listing the x and y coordinates, along with the gene name.

If a transform matrix is provided, the corresponding affine transformation is applied to the coordinates of the transcripts. The transformation is applied to the Dask dataframe before it is added to sdata. The SpatialData object is augmented with a points layer named output_layer that contains the transcripts.
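As a rough illustration of what applying a 3x3 matrix to 2D transcript coordinates can look like, the sketch below uses plain NumPy. It assumes the common homogeneous-coordinate convention with the translation in the last column; the documentation above does not state the convention or the file format, and the matrix values and variable names here are made up for the example.

```python
import numpy as np

# Hypothetical 3x3 affine matrix in homogeneous coordinates (assumed convention):
# scale x and y by 2, then shift by (10, 20). Values are illustrative only.
transform = np.array(
    [
        [2.0, 0.0, 10.0],
        [0.0, 2.0, 20.0],
        [0.0, 0.0, 1.0],
    ]
)

# Transcript coordinates as (x, y) rows.
xy = np.array([[0.0, 0.0], [1.0, 3.0], [5.0, 2.0]])

# Append a column of ones so the translation becomes part of the matrix product.
xy_h = np.column_stack([xy, np.ones(len(xy))])

# Apply the matrix to every row and drop the homogeneous coordinate.
xy_transformed = (xy_h @ transform.T)[:, :2]

# With the identity matrix (the documented default) the coordinates are unchanged.
xy_identity = (xy_h @ np.eye(3).T)[:, :2]
```

In this sketch the point (1, 3) maps to (12, 26): each coordinate is doubled and then shifted.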

Parameters:
  • sdata (SpatialData) – The SpatialData object to which the transcripts will be added.

  • path_count_matrix (str | Path) – Path to a .parquet file or .csv file containing the transcript information. Each row should contain an x and y coordinate and a gene name. Optionally, a midcount column can be provided (see column_midcount), in which case rows are repeated according to its value.

  • path_transform_matrix (Union[str, Path, None] (default: None)) – Path to a file containing a 3x3 affine transformation matrix, which is applied to the coordinates of the transcripts. If no transform matrix is specified, the identity matrix is used.

  • output_layer (str, default='transcripts') – Name of the points layer of the SpatialData object to which the transcripts will be added.

  • overwrite (bool, default=False) – If True, overwrites the output_layer (a points layer) if it already exists.

  • debug (bool, default=False) – If True, a sample of the data is processed for debugging purposes.

  • column_x (int (default: 0)) – Column index of the X coordinate in the count matrix.

  • column_y (int (default: 1)) – Column index of the Y coordinate in the count matrix.

  • column_z (Optional[int] (default: None)) – Column index of the Z coordinate in the count matrix.

  • column_gene (int (default: 3)) – Column index of the gene information in the count matrix.

  • column_midcount (Optional[int] (default: None)) – Column index for the count value to repeat rows in the count matrix. Ignored when set to None.

  • delimiter (str (default: ',')) – Delimiter used to separate values in the .csv file. Ignored if path_count_matrix is a .parquet file.

  • header (Optional[int] (default: None)) – Row number to use as the header in the CSV file. If None, no header is used. Ignored if path_count_matrix is a .parquet file.

  • comment (Optional[str] (default: None)) – Character indicating that the remainder of line should not be parsed. If found at the beginning of a line, the line will be ignored altogether. This parameter must be a single character. Ignored if path_count_matrix is a .parquet file.

  • crd (Optional[tuple[int, int, int, int]] (default: None)) – The coordinates (in pixels) for the region of interest in the format (xmin, xmax, ymin, ymax). If None, all transcripts are considered.

  • to_coordinate_system (str (default: 'global')) – Coordinate system to which output_layer will be added.

  • filter_gene_names (Union[str, list, None] (default: None)) – Regular expression(s) matching gene names to filter out (via str.contains), typically control genes that should be excluded from the analysis. If a list of strings is given, each item is treated as a regular expression. Filtering is case-insensitive.

  • blocksize (str (default: '64MB')) – Block size of the partitions of the Dask dataframe stored as the points layer in sdata.
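The column, region and gene-filter parameters above can be pictured with a small pandas sketch. This is not the library's implementation: it only mirrors the documented defaults (column_x=0, column_y=1, column_gene=3, delimiter=',', header=None) on a made-up, headerless CSV. The sample data, the inclusive treatment of the crd bounds, and the way several patterns are combined are all assumptions made for illustration.

```python
import io

import pandas as pd

# Hypothetical headerless file content; column 2 holds a z value that is not used here.
raw = """# exported by an imaginary pipeline
10.0,5.0,0,GeneA
12.5,7.5,0,NegControl1
300.0,8.0,0,geneb
40.0,900.0,0,GeneC
"""

column_x, column_y, column_gene = 0, 1, 3
crd = (0, 100, 0, 100)  # (xmin, xmax, ymin, ymax)
filter_gene_names = ["negcontrol"]  # treated as regular expressions

# header=None gives integer column labels; lines starting with '#' are skipped.
df = pd.read_csv(io.StringIO(raw), sep=",", header=None, comment="#")

# Select the coordinate and gene columns by integer index.
transcripts = df[[column_x, column_y, column_gene]].copy()
transcripts.columns = ["x", "y", "gene"]

# Keep transcripts inside the region of interest (bounds assumed inclusive).
xmin, xmax, ymin, ymax = crd
in_region = transcripts["x"].between(xmin, xmax) & transcripts["y"].between(ymin, ymax)

# Drop genes matching any pattern, ignoring case.
pattern = "|".join(filter_gene_names)
is_filtered = transcripts["gene"].str.contains(pattern, case=False, regex=True)

kept = transcripts[in_region & ~is_filtered].reset_index(drop=True)
```

Only the GeneA row survives here: NegControl1 matches the case-insensitive pattern, and the other two rows fall outside the region.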

Return type:

SpatialData

Returns:

The updated SpatialData object containing the transcripts.

Notes

This function reads the transcripts file using Dask and applies a transformation matrix to the coordinates. It can also repeat rows based on the value in the midcount column (see column_midcount), and it can run in a debug mode that processes only a sample of the data.
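Repeating rows by a count column can be sketched in pandas as follows. This is a minimal illustration of the idea rather than the function's actual Dask code, and the column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical transcripts with a count column: each row stands for `midcount` molecules.
df = pd.DataFrame(
    {
        "x": [1.0, 2.0, 3.0],
        "y": [4.0, 5.0, 6.0],
        "gene": ["A", "B", "C"],
        "midcount": [1, 3, 2],
    }
)

# Repeat each row index `midcount` times, then drop the count column
# so that every remaining row represents a single transcript.
expanded = (
    df.loc[df.index.repeat(df["midcount"])]
    .drop(columns="midcount")
    .reset_index(drop=True)
)
```

After expansion the frame has six rows, one per transcript, with gene B appearing three times.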