spin's Introduction

SPIN: spatial integration of spatially resolved transcriptomics (SRT) data

⬅️ manuscript
⬅️ data

SPIN is a simple, Scanpy-based implementation of the subsampling and smoothing approach described in the manuscript "Mitigating autocorrelation during spatially resolved transcriptomics data analysis". It enables the alignment and analysis of transcriptionally defined tissue regions across multiple SRT datasets, regardless of morphology or experimental technology, using conventional single-cell tools. Here we include information regarding:

A conceptual overview of the approach
Package requirements
Installation instructions
Basic usage principles

For examples of downstream analysis (e.g. differentially expressed gene analysis and trajectory inference), see the tutorial notebook. For further details on SPIN parameters, import SPIN into Python as shown below and run help(spin).

1. Conceptual overview

Conventional single-cell analysis can identify molecular cell types by considering each cell individually.
However, it does not incorporate spatial information.

Arguably the simplest way to incorporate spatial information and identify molecular tissue regions is to spatially smooth gene expression features across neighboring cells in the tissue.
This can be done by setting the features of each cell to the average of its spatial neighborhood.

However, a problem arises when smoothed representations of each cell are compared to one another.
Physically adjacent cells will have almost identical neighborhoods and thus almost identical smoothed representations.
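
The overlap can be made concrete with a small sketch. The code below is purely illustrative and not part of SPIN: it places hypothetical cells on a regular grid, finds each cell's k nearest spatial neighbors by brute force, and compares the neighborhoods of two adjacent cells and of two distant cells using the Jaccard overlap. The grid size, the value of k, and the helper names are all assumptions chosen for the example.

```python
import math

def knn_indices(coords, i, k):
    """Return the indices of the k cells closest to cell i (cell i included)."""
    order = sorted(
        range(len(coords)),
        key=lambda j: (math.dist(coords[i], coords[j]), j),
    )
    return set(order[:k])

def jaccard(a, b):
    """Jaccard overlap between two sets of cell indices."""
    return len(a & b) / len(a | b)

# Hypothetical tissue: cells on a 20 x 20 grid with unit spacing.
coords = [(x, y) for x in range(20) for y in range(20)]
center = coords.index((10, 10))
adjacent = coords.index((11, 10))
distant = coords.index((2, 2))

k = 49
nbrs_center = knn_indices(coords, center, k)
overlap_adjacent = jaccard(nbrs_center, knn_indices(coords, adjacent, k))
overlap_distant = jaccard(nbrs_center, knn_indices(coords, distant, k))
print(f"adjacent cells: {overlap_adjacent:.2f}, distant cells: {overlap_distant:.2f}")
```

In this toy setting the two adjacent cells share most of their neighbors, so averaging over the full neighborhoods would give them nearly the same smoothed features, while the distant pair shares none.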

Thus, we end up with nearest neighbors in feature space that are just nearest neighbors in physical space.
Because conventional methods for downstream analysis rely on the nearest-neighbor graph in feature space, this leads to a reconstruction of physical space in latent space rather than a representation of the true underlying large-scale molecular patterns.
Here, we implement an approach in which each cell's spatial neighborhood is randomly subsampled before averaging, allowing the exact neighborhood composition to vary while still maintaining the general molecular composition.
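
As a rough illustration of the idea, the following NumPy sketch averages each cell's features over a random subset of its spatial neighbors. It is a minimal sketch under stated assumptions, not SPIN's implementation: the function name, the brute-force neighbor search, and the values n_nbrs=30 and n_samples=10 are all hypothetical choices for the example. For the package's actual parameters, run help(spin).

```python
import numpy as np

def subsample_and_smooth(X, coords, n_nbrs=30, n_samples=10, seed=0):
    """Average each cell's features over a random subset of its spatial neighbors.

    X: (n_cells, n_genes) expression matrix; coords: (n_cells, 2) positions.
    """
    rng = np.random.default_rng(seed)
    # Brute-force pairwise squared distances; only suitable for a toy example.
    diffs = coords[:, None, :] - coords[None, :, :]
    sq_dists = (diffs ** 2).sum(axis=-1)
    # Indices of the n_nbrs nearest cells for every cell (the cell itself included).
    nbrs = np.argsort(sq_dists, axis=1)[:, :n_nbrs]
    smoothed = np.empty_like(X, dtype=float)
    for i in range(X.shape[0]):
        # Each cell draws its own random subset, so neighboring cells
        # no longer average over nearly identical sets of cells.
        chosen = rng.choice(nbrs[i], size=n_samples, replace=False)
        smoothed[i] = X[chosen].mean(axis=0)
    return smoothed

# Synthetic data standing in for a real SRT dataset.
rng = np.random.default_rng(1)
coords = rng.uniform(0, 100, size=(500, 2))
X = rng.poisson(2.0, size=(500, 20)).astype(float)
X_smooth = subsample_and_smooth(X, coords)
print(X_smooth.shape)
```

Because every subset is drawn from the same local neighborhood, the smoothed profile still reflects the local molecular composition, but two adjacent cells rarely end up averaging over exactly the same cells.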

Ultimately, this approach enables the application of conventional single-cell tools to spatial molecular features in SRT data, yielding regional analogies for each tool. For more details and examples, please refer to the manuscript and tutorial.

2. Requirements

Software:

Tested on macOS (Monterey, Ventura) and Linux (Red Hat Enterprise Linux 7).
Command Line Tools is required for pip installing this package from GitHub. While it comes standard on most machines, those without it may encounter an xcrun: error when following the installation instructions below. See here for simple instructions on how to install it.
Python >= 3.9
The only dependency is Scanpy. For details, see pyproject.toml.

Data:

One or more SRT datasets in .h5ad format
An expression matrix under .X (both sparse and dense representations supported)
Spatial coordinates under .obsm (key can be specified with argument spatial_key)
Batch information
- If multiple batches in single dataset, batch labels provided under column in .obs with column name batch_key.
- If multiple batches in separate datasets, batch labels for each dataset provided as input.
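
Before running SPIN, it can be handy to confirm that a dataset meets these requirements. The helper below is a hypothetical convenience function, not part of the package: it only checks that the fields listed above are present on an AnnData-like object. The default spatial_key="spatial" is an assumption; pass whichever key your data uses. The demo uses a small stand-in object so that the snippet runs without any data files.

```python
from types import SimpleNamespace

def check_spin_input(adata, spatial_key="spatial", batch_key=None):
    """Return a list of problems found in an AnnData-like object (empty if none)."""
    problems = []
    if getattr(adata, "X", None) is None:
        problems.append("no expression matrix under .X")
    if spatial_key not in adata.obsm:
        problems.append(f"no spatial coordinates under .obsm[{spatial_key!r}]")
    # batch_key only matters when one dataset holds several batches.
    if batch_key is not None and batch_key not in adata.obs:
        problems.append(f"no batch labels under .obs[{batch_key!r}]")
    return problems

# Stand-in for an AnnData object, used here only to make the example runnable.
toy = SimpleNamespace(
    X=[[1.0, 0.0]],
    obsm={"spatial": [[0.0, 0.0]]},
    obs={"species": ["mouse"]},
)
ok = check_spin_input(toy, batch_key="species")
missing = check_spin_input(toy, batch_key="sample")
print(ok)
print(missing)
```

With a real dataset, the same call would be check_spin_input(sc.read('data/mouse.h5ad'), batch_key=...), since membership tests on .obsm and .obs behave the same way for AnnData objects.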

3. Installation

From GitHub:

pip install git+https://github.com/wanglab-broad/spin@main

Installation takes about 5 minutes.

4. Usage

In Python:

Consider the marmoset and mouse data from the manuscript, which we provide as a demo:

import scanpy as sc

adata_marmoset = sc.read(
    'data/marmoset.h5ad',
    backup_url='https://zenodo.org/record/8092024/files/marmoset.h5ad?download=1'
)
adata_mouse = sc.read(
    'data/mouse.h5ad',
    backup_url='https://zenodo.org/record/8092024/files/mouse.h5ad?download=1'
)

These datasets can be spatially integrated and clustered using spin. The batch_key argument corresponds to the name of a new column in adata.obs that stores the batch labels for each dataset. The batch_labels argument is a list of these batch labels in the same order as the input AnnDatas:

from spin import spin

adata = spin(
    adatas=[adata_marmoset, adata_mouse],
    batch_key='species',
    batch_labels=['marmoset', 'mouse'],
    resolution=0.7
)

This performs the following steps:

integrate:
1. Subsampling and smoothing of each dataset individually (stored under adata.layers['smooth'])
2. Joint PCA across both smoothed datasets
3. Integration of the resulting PCs using Harmony (stored under adata.obsm['X_pca_spin'])
cluster:
1. Latent nearest neighbor search
2. Leiden clustering with a resolution of 0.7 (stored under adata.obs['region'])
3. UMAP (stored under adata.obsm['X_umap_spin'])
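
To illustrate what "joint PCA" means in the integrate stage, here is a minimal NumPy sketch that fits one PCA on the concatenation of two smoothed matrices and then splits the shared embedding back per batch. It is an assumption-laden toy, not SPIN's code: the function name is hypothetical, the inputs are random matrices, any preprocessing SPIN applies is omitted, and the Harmony integration step is not shown.

```python
import numpy as np

def joint_pca(smoothed_batches, n_pcs=10):
    """Fit one PCA on the row-wise concatenation of (cells x genes) matrices."""
    stacked = np.vstack(smoothed_batches)
    centered = stacked - stacked.mean(axis=0)
    # Rows of vt are the principal axes, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    pcs = centered @ vt[:n_pcs].T
    # Split the shared embedding back into one block per batch.
    sizes = [batch.shape[0] for batch in smoothed_batches]
    return np.split(pcs, np.cumsum(sizes)[:-1])

# Random stand-ins for two smoothed datasets that share the same 50 genes.
rng = np.random.default_rng(0)
batch_a = rng.normal(size=(300, 50))
batch_b = rng.normal(size=(200, 50))
pcs_a, pcs_b = joint_pca([batch_a, batch_b])
print(pcs_a.shape, pcs_b.shape)
```

Fitting a single set of principal axes on all batches places every cell in the same coordinate system, which is what allows a batch-correction method to align the batches afterwards.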

Note that spin can equivalently take as input a single AnnData containing multiple labeled batches. It can also take a single AnnData containing one batch for finding regions in a single dataset. For examples, see the tutorial.

The resulting region clusters can then be visualized using standard Scanpy functions:

# In physical space
sc.set_figure_params(figsize=(7,5))
sc.pl.embedding(adata, basis='spatial', color='region')

# In UMAP space
sc.set_figure_params(figsize=(4,4))
sc.pl.embedding(adata, basis='X_umap_spin', color='region')

Downstream analysis (e.g. DEG analysis, trajectory inference) can then be performed using standard Scanpy functions as well. For examples of downstream analysis, see the tutorial. For further details on the parameters of spin, import SPIN into Python as shown above and run help(spin).

From the shell:

SPIN can be executed from the shell using the spin command as shown below (the path is identified automatically; see spin_cli and pyproject.toml).

Shell submission requires a read path to the relevant dataset(s) as well as a write path for the output dataset. Otherwise, provide the same parameters you would when running in Python as above:

spin \
--adata_paths data/marmoset.h5ad data/mouse.h5ad \
--write_path data/marmoset_mouse_spin.h5ad \
--batch_key species \
--batch_labels marmoset mouse \
--resolution "0.7"

Just as when running in Python, a single AnnData containing multiple batches can be passed in instead, as well as just a single dataset containing a single batch.
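
For readers curious how such flags could map onto the Python arguments, the sketch below parses the same command line with argparse. It is a hypothetical illustration, not the package's actual spin_cli: only the flag names are taken from the command above, while the defaults and required settings are assumptions.

```python
import argparse

def build_parser():
    """Build a parser that accepts the flags used in the shell example."""
    parser = argparse.ArgumentParser(prog="spin")
    # One or more input .h5ad files.
    parser.add_argument("--adata_paths", nargs="+", required=True)
    parser.add_argument("--write_path", required=True)
    parser.add_argument("--batch_key", default=None)
    # One label per input dataset, in the same order as --adata_paths.
    parser.add_argument("--batch_labels", nargs="+", default=None)
    parser.add_argument("--resolution", type=float, default=None)
    return parser

args = build_parser().parse_args([
    "--adata_paths", "data/marmoset.h5ad", "data/mouse.h5ad",
    "--write_path", "data/marmoset_mouse_spin.h5ad",
    "--batch_key", "species",
    "--batch_labels", "marmoset", "mouse",
    "--resolution", "0.7",
])
print(args.adata_paths, args.batch_labels, args.resolution)
```

Note that quoting the resolution as "0.7" in the shell makes no difference to a parser like this one, because the value arrives as a string either way and is converted to a float.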

spin's Issues

connectivity_key error - tutorial not working

Hello, I am interested in trying SPIN, but the code isn't working on my end, either with my own data or with the tutorial.
I have installed SPIN into a completely new virtual environment.

Here is the error (seems to be something with the adata structure). Not sure if this is an issue with a newer version of scanpy or something else.

adata_marmoset = sc.read('data/spin_paper/test/marmoset.h5ad', backup_url='https://zenodo.org/record/8092024/files/marmoset.h5ad?download=1')

adata_marmoset = spin(adata_marmoset,  n_pcs=20, resolution=0.3 )

2024-04-19 09:17:19,269 - SPIN - INFO - Smoothing

2024-04-19 09:17:20,658 - SPIN - INFO - Performing PCA
2024-04-19 09:17:21,690 - SPIN - INFO - Finding latent neighbors
OMP: Info #276: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.
2024-04-19 09:18:00,296 - SPIN - INFO - Leiden clustering
/Users/inofechmozes/Documents/venvs/spin/lib/python3.9/site-packages/spin/spin.py:295: FutureWarning: In the future, the default backend for leiden will be igraph instead of leidenalg.

 To achieve the future defaults please pass: flavor="igraph" and n_iterations=2.  directed must also be False to work with igraph's implementation.
  sc.tl.leiden(
2024-04-19 09:18:32,336 - SPIN - INFO - Performing UMAP
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/inofechmozes/Documents/venvs/spin/lib/python3.9/site-packages/spin/spin.py", line 113, in spin
    adata = _cluster(
  File "/Users/inofechmozes/Documents/venvs/spin/lib/python3.9/site-packages/spin/spin.py", line 304, in _cluster
    umap = sc.tl.umap(
  File "/Users/inofechmozes/Documents/venvs/spin/lib/python3.9/site-packages/legacy_api_wrap/__init__.py", line 80, in fn_compatible
    return fn(*args_all, **kw)
  File "/Users/inofechmozes/Documents/venvs/spin/lib/python3.9/site-packages/scanpy/tools/_umap.py", line 160, in umap
    neighbors = NeighborsView(adata, neighbors_key)
  File "/Users/inofechmozes/Documents/venvs/spin/lib/python3.9/site-packages/scanpy/_utils/__init__.py", line 1019, in __init__
    self._conns_key = self._neighbors_dict["connectivities_key"]
KeyError: 'connectivities_key'

Let me know how I can fix this if possible.

Thank you.

Quick R implementation

Hi Kamal,

I saw your talk and I really liked the simplicity and scalability of the method you described. I also tried it on some other tissues except brain and it seems to be actually capturing structure! Since I mostly use R, I transcribed the one sample (no harmony) implementation to it and it scales pretty well.

Leaving this here in case you're interested.

Cheers!

library(Seurat)
library(data.table)
maher_smooth <- function(seur_obj,nn_graph,n_samples=NULL,n_nbrs=30,assay="Spatial",
                         layer="counts"){
    print("Converting data to data.table")
    #Grab expression data
    mat = GetAssayData(seur_obj,assay=assay,slot=layer)
    #Dense matrix has better column access...
    mat = as.data.table(mat)
    print("Conversion complete.")
    if(is.null(n_samples)){
        n_samples = round(n_nbrs/3,0)
    }
    #Sample neighbors for each cell
    print("Sampling neighbors")
    sampled = apply(nn_graph,1,function(x) sample(x,n_samples))
    #Turn into list
    sampled = as.list(as.data.frame(sampled))
    #Calculate new representation. Can take a while, using DT for it now. <1 min for 55k spots
    new_mat = lapply(sampled, function(x) rowMeans(mat[,..x]))
    new_mat = do.call(cbind, new_mat)
    return(new_mat)
}
#Not implementing the integration part for now. Suppose we only have 1 sample
maher_get_neighbors <- function(seur_obj,n_nbrs){
    #Grab coordinates. If your coordinates are not there, just add them to 
    #@images$slice_1@coordinates as a named df (cells x c(x,y))
    coords = GetTissueCoordinates(seur_obj)[,c(1,2)]
    #Includes self for now. Use seurat to get the kNN, grab indices of top k
    nns = FindNeighbors(as.matrix(coords),k.param = n_nbrs,compute.SNN=F,return.neighbor = TRUE)
    nns = nns@nn.idx[,1:n_nbrs]
    return(nns)
}
#Wrapper
maher_spin <- function(seur_obj,n_nbrs = 30, n_samples=NULL,n_pcs=30,
                       random_state= 0,assay="Spatial",layer="counts",
                      resolution=0.5){
    set.seed(random_state)
    print("Computing nearest neighbors")
    nns = maher_get_neighbors(seur_obj,n_nbrs = n_nbrs)
    print("Computing Smoothed representations")
    new_repr = maher_smooth(seur_obj,nns,n_nbrs=n_nbrs,assay = assay,layer=layer,n_samples=n_samples)
    #Add new representation as Assay
    rownames(new_repr) = rownames(seur_obj)
    print("Normalizing")
    #Sparsify
    new_repr = as.sparse(new_repr)
    seur_obj_spin = CreateSeuratObject(counts = new_repr,meta.data = seur_obj@meta.data,verbose=F)
    seur_obj_spin = NormalizeData(seur_obj_spin,verbose=F)
    seur_obj_spin = FindVariableFeatures(seur_obj_spin,verbose=F)
    seur_obj_spin = ScaleData(seur_obj_spin,assay="RNA",verbose=F)
    print("PCA, SNN, UMAP, Louvain.")
    seur_obj_spin <- RunPCA(seur_obj_spin,verbose=F) 
    seur_obj_spin <- FindNeighbors(seur_obj_spin,dims=1:n_pcs,verbose=F)
    seur_obj_spin <- RunUMAP(seur_obj_spin,
                        dims=1:n_pcs,verbose=F)
    seur_obj_spin <- FindClusters(seur_obj_spin,resolution=resolution,verbose=F)
    domains = seur_obj_spin$seurat_clusters
    names(domains) = NULL
    seur_obj$SPINDomain = domains
    return(list(seur_obj,seur_obj_spin))
}
