Giter Club home page Giter Club logo

spavae's Introduction

Spatial dependency-aware deep generative models

SpaVAE, spaPeakVAE, spaMultiVAE, spaLDVAE and spaPeakLDVAE are dependency-aware deep generative models for multitasking analysis of spatial genomics data. Different models are designed for different analytical tasks of spatial genomics data.
spaVAE is a negative binomial (NB) model-based variational autoencoder (VAE) with a hybrid embedding of Gaussian process (GP) prior and Gaussian prior. The model is for multitasking analysis of spatially resolved transcriptomics (SRT) data, including dimensionality reduction, visualization, clustering, batch integration, denoising, differential expression, spatial interpolation, and resolution enhancement.
spaPeakVAE is a variant model of spaVAE, which uses a Bernoulli decoder to characterize spatial ATAC-seq binary data. The analytical tasks in spaVAE can also be fulfilled by spaPeakVAE for spatial ATAC-seq data.
spaMultiVAE characterizes spatial multi-omics data, which profiles gene expression and surface protein intensity simultaneously. Besides the analyses aforementioned, spaMultiVAE uses a NB mixture decoder to denoise backgrounds in proteins.
spaLDVAE and spaPeakLDVAE are spaVAE variants with a linear decoder, which also contains two hybrid latent embedding components, one follows GP prior and the other follows standard normal prior. The model can be used for detecting spatial variable genes and peaks.

Table of contents

Network diagram

Diagram of spaVAE (a), spaPeakVAE (a), spaMultiVAE (b), spaLDVAE (c), and spaPeakLDVAE (c) networks:

Requirements

Python: 3.9.7
PyTorch: 1.11.0 (https://pytorch.org)
Scanpy: 1.9.1 (https://scanpy.readthedocs.io/en/stable)
Numpy: 1.21.5 (https://numpy.org)
Pandas: 1.4.2 (https://pandas.pydata.org)
h5py: 3.6.0 (https://pypi.org/project/h5py)

Folders

Tutorial: we provide Jupyter notebooks demonstrating the functionalities of the spaVAE models.
src: source code of spaVAE models.

Usage

For human DLPFC dataset:

python run_spaVAE.py --data_file HumanDLPFC_151673.h5 --inducing_point_steps 6

For integrating 4 human DLPFC samples:

python run_spaVAE_Batch.py --data_file 151673_151674_151675151676_samples_union.h5 --inducing_point_steps 6

For mouse hippocampus Slide-seq V2 dataset:

python run_spaVAE.py --data_file Mouse_hippocampus.h5 --grid_inducing_points False --inducing_point_nums 400 --loc_range 40

For spatial ATAC-seq dataset of mouse embryonic (E15.5) brain tissues in the MISAR-seq dataset:

python run_spaPeakVAE.py --data_file MISAR_seq_mouse_E15_brain_ATAC_data.h5 --inducing_point_steps 19

For spatial multi-omics Spatial-CITE-seq data:

python run_spaMultiVAE.py --data_file Multiomics_Spatial_Human_tonsil_SVG_data.h5 --inducing_point_steps 19

--data_file specifies the data file name, in the h5 file. For SRT data, spot-by-gene count matrix is stored in "X" and 2D location is stored in "pos". For spatial ATAC-seq data, "X" represents spot-by-peak count matrix. For spatial multi-omics data, "X_gene" represents spot-by-gene count matrix, and "X_protein" represents spot-by-protein count matrix.

API documents

API documents

Parameters

--data_file: data file name.
--select_genes: number of selected genes for analysis, default = 0 means no filtering. It will use the mean-variance relationship to select informative genes.
--batch_size: mini-batch size, default = "auto", which means if sample size <= 1024 then batch size = 128, if 1024 < sample size <= 2048 then batch size = 256, if sample size > 2048 then batch size = 512.
--maxiter: number of max training iterations, default = 5000.
--train_size: proportion of training set, others will be validating set, default = 0.95. In small datasets, e.g. there are only hundreds of spots, we recommend to set train_size to 1, and fix the maxiter to 1000.
--patience: patience of early stopping when using validating set, default = 200.
--lr: learning rate, default = 1e-3 for spaVAE and spaPeakVAE, and defualt = 5e-3 for spaMultiVAE.
--weight_decay: weight decay coefficient, default = 1e-6.
--noise: coefficient of random Gaussian noise for the encoder, default = 0.
--dropoutE: dropout probability for encoder, default = 0.
--dropoutD: dropout probability for decoder, default = 0.
--encoder_layers: hidden layer sizes of encoder, default = [128, 64].
--GP_dim: dimension of the latent Gaussian process embedding, default = 2 for spaVAE and spaMultiVAE, and default = 5 for spaPeakVAE.
--Normal_dim: dimension of the latent standard Gaussian embedding, default = 8 for spaVAE and spaPeakVAE, and 18 for spaMultiVAE.
--decoder_layers: hidden layer sizes of decoder, default = [128].
--dynamicVAE: whether to use dynamicVAE to tune the value of beta, if setting to false, then beta is fixed to initial value.
--init_beta: initial coefficient of the KL loss, default = 10.
--min_beta: minimal coefficient of the KL loss, default = 4.
--max_beta: maximal coefficient of the KL loss, default = 25. min_beta, max_beta, and KL_loss are used for dynamic VAE algorithm.
--KL_loss: desired KL_divergence value (GP and standard normal combined), default = 0.025.
--num_samples: number of samplings of the posterior distribution of latent embedding during training, default = 1.
--fix_inducing_points: fixed or trainable inducing points, default = True, which means inducing points are fixed.
--grid_inducing_points: whether to use 2D grid inducing points or k-means centroids of positions as inducing points, default = True. "True" for 2D grid, "False" for k-means centroids.
--inducing_point_steps: if using 2D grid inducing points, set the number of 2D grid steps, default = None. Needed when grid_inducing_points = True.
--inducing_point_nums: if using k-means centroids on positions, set the number of inducing points, default = None. Needed when grid_inducing_points = False.
--fixed_gp_params: kernel scale is fixed or not, default = False, which means kernel scale is trainable.
--loc_range: positional locations will be scaled to the specified range. For example, loc_range = 20 means x and y locations will be scaled to the range 0 to 20, default = 20. This value can be set larger if it isn't numerical stable during training.
--kernel_scale: initial kernel scale, default = 20.
--model_file: file name to save weights of the model, default = model.pt
--final_latent_file: file name to output final latent representations, default = final_latent.txt.
--denoised_counts_file: file name to output denoised counts, default = denoised_mean.txt.
--device: pytorch device, default = cuda.

Datasets

Datasets used in the study can be found

https://figshare.com/articles/dataset/Spatial_genomics_datasets/21623148

Reference

Contact

Tian Tian [email protected]

spavae's People

Contributors

ttgump avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar

Forkers

jingmingcn

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.