beyondpie / cemba_wmb_snatac
Whole mouse brain snATAC-seq analysis
License: MIT License
TL;DR
In doing differential peak analysis, should pp.make_peak_matrix be given union_bed_file or rnd_bed_file?
Original post
Hi, @beyondpie. Thanks for making this remarkable pipeline publicly available.
I'd like to work with processed files from catlas.org, subset out my cell types of interest, create a peak matrix and perform differential peak analysis.
I downloaded files with the _rm_dlt.h5ad suffix under snapatac2_anndata_per_sample_with_fragment_dir, and used the following code to get the peak matrix:
import snapatac2 as sa2

# build a cell-by-peak matrix over the union peak set
peak_mat = sa2.pp.make_peak_matrix(
    adata=snap2,
    inplace=False,
    file=outdir / "sa2pmat.union_pmat.h5ad",
    backend="hdf5",
    peak_file=union_bed_file,
    chunk_size=10000,
    use_x=False,
)
(ref. sa2pmay.py lines 66-72)
where union_bed_file = 'mba.whole.sa2.final.peak.srt.bed'. I then plan to do differential peak analysis with tl.marker_regions or tl.diff_test.
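For reference, a minimal sketch of that downstream call, assuming the peak matrix written above and a hypothetical grouping column named "subclass" in .obs (the actual column name may differ):

import snapatac2 as sa2

# hypothetical downstream step: per-group marker peaks from the
# peak matrix built above (the "subclass" column is an assumption)
peak_mat = sa2.read(outdir / "sa2pmat.union_pmat.h5ad")
marker_peaks = sa2.tl.marker_regions(
    peak_mat,
    groupby="subclass",
    pvalue=0.01,
)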
However, I'm unsure if this is correct, as I noticed two peak files, union_bed_file and rnd_bed_file, where rnd_bed_file = 'mba.whole.shuffle.removeovlp.bed'. What's the purpose of having both bed files, and which one should I use for diff-peak analysis?
Thanks!
Ray
Organize the files in this directory like what we have in the others.
Dear CEMBA team,
Thank you for this great resource. I am attempting to save subclass-specific fragment files (as BED files) for one of the h5ad files available under this folder (specifically, "CEMBA180111_4E_rm_dlt.h5ad") and am getting the following error (I am using SnapATAC 2.5.1):
RuntimeError Traceback (most recent call last)
Cell In[29], line 1
----> 1 snapatac2.ex.export_fragments(anndata_test,
2 groupby='Subclass',
3 out_dir=anndata_dir)
File ~/miniconda3/envs/snapatac2/lib/python3.9/site-packages/snapatac2/export/__init__.py:68, in export_fragments(adata, groupby, selections, ids, out_dir, prefix, suffix, compression, compression_level)
65 elif suffix.endswith(".zst"):
66 compression = "zstandard"
---> 68 return internal.export_fragments(
69 adata, list(ids), list(groupby), out_dir, prefix, suffix, selections, compression, compression_level,
70 )
RuntimeError: neither 'fragment_single' nor 'fragment_paired' is present in the '.obsm'
when running
snapatac2.ex.export_fragments(
    adata,
    groupby='Subclass',
    out_dir=anndata_dir,
)
where 'adata' is loaded with snapatac2.read(), and 'Subclass' is a column of subclass labels retrieved from Supplementary Table 2 (downloaded from this folder). The error occurs with both backed=None and backed='r+'.
I verified that the keys are indeed missing by checking anndata_test.obsm, which returns AxisArrays with keys: insertion. I am wondering if there is an error in my data loading/handling?
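A self-contained version of that check (the interpretation in the comments is my reading of the error, not a confirmed diagnosis):

import snapatac2 as snap

# export_fragments needs 'fragment_single' or 'fragment_paired' in
# .obsm; these h5ad files appear to store only 'insertion'
# (per-base insertion counts), which cannot be turned back into
# fragment BED records
adata = snap.read("CEMBA180111_4E_rm_dlt.h5ad", backed="r+")
print(list(adata.obsm.keys()))  # -> ['insertion']
has_fragments = {"fragment_single", "fragment_paired"} & set(adata.obsm.keys())
print("can export fragments:", bool(has_fragments))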
Best,
Jay
Some supplementary tables are big or lack a description of how to read them; for example, Supplementary Table 6 is 3.3 GB and a huge binary matrix.
We need to:
Dear CEMBA team,
I am trying to write fragment bed files to disk through an AnnDataSet object (all 2.3M cells from all .h5ad sample files; AnnDataSet object with n_obs x n_vars = 2355842 x 5451055) using SnapATAC 2.4; the call is shown after the sketch below. 'Class' is a manually added column, using AIT21_annotation_freeze_081523.tsv (under this folder) for subclass-to-class conversion.
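A hypothetical reconstruction of that column (the AIT21 column names subclass_label and class_label are assumptions about the tsv's schema, as is a pre-existing 'Subclass' column on the AnnDataSet):

import pandas as pd

# map each cell's subclass label to its class using the AIT21 table
# (column names are assumptions; "Subclass" is assumed to be in .obs)
ait21 = pd.read_csv("AIT21_annotation_freeze_081523.tsv", sep="\t")
subclass_to_class = dict(zip(ait21["subclass_label"], ait21["class_label"]))
adataset.obs["Class"] = [
    subclass_to_class[s] for s in adataset.obs["Subclass"]
]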
snapatac2.ex.export_bed(
    adataset,
    groupby='Class',
    suffix='.bed.gz',
    out_dir=class_level_fragments_dir,
)
and encountered the following error (I've replaced my personal paths with /path/to/):
thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: H5Dread(): can't read data: file read failed: time = Sun Dec 17 07:33:29 2023
, filename = '/path/to/CEMBA180129_3A_rm_dlt.h5ad', file descriptor = 116, errno = 5, error message = 'Input/output error', buf = 0x50871390, total read size = 2096, bytes this sub-read = 2096, bytes actually read = 18446744073709551615, offset = 0', /root/.cargo/git/checkouts/anndata-rs-087569d311f7caf2/d54b207/anndata/src/container/base.rs:1160:85
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
---------------------------------------------------------------------------
PanicException Traceback (most recent call last)
Cell In[51], line 5
3 print(f'Writing outputs to f{class_level_fragments_dir}')
4 print('='*50)
----> 5 snapatac2.ex.export_bed(adataset,
6 groupby='Class',
7 suffix='.bed.gz',
8 out_dir = class_level_fragments_dir
9 )
File ~/miniconda3/envs/conda_env/lib/python3.8/site-packages/snapatac2/export/__init__.py:55, in export_bed(adata, groupby, selections, ids, out_dir, prefix, suffix)
52 elif isinstance(ids, str):
53 ids = adata.obs[ids]
---> 55 return internal.export_bed(
56 adata, list(ids), list(groupby), selections, out_dir, prefix, suffix,
57 )
PanicException: called `Result::unwrap()` on an `Err` value: H5Dread(): can't read data: file read failed: time = Sun Dec 17 07:33:29 2023
, filename = '/path/to/CEMBA180129_3A_rm_dlt.h5ad', file descriptor = 116, errno = 5, error message = 'Input/output error', buf = 0x50871390, total read size = 2096, bytes this sub-read = 2096, bytes actually read = 18446744073709551615, offset = 0
I have verified that each cell is assigned the appropriate class label. At the time of this error (which occurred after hours of writing to disk), the line counts of the written bed.gz files do not match the per-class fragment counts in the metadata (Supplementary Table 2 of the paper), which is expected given the interruption. Counting the number of lines (at the time of the error) using zcat <filename> | wc -l yields:
{'HY_GABA.bed.gz': 226103206,
'CTX-CGE_GABA.bed.gz': 290152494,
'CTX-MGE_GABA.bed.gz': 550481310,
'MH-LH_Glut.bed.gz': 34759628,
'NP-CT-L6b_Glut.bed.gz': 996287794,
'MY_Glut.bed.gz': 79333970,
'Astro-Epen.bed.gz': 1258567744,
'CNU-HYa_Glut.bed.gz': 166911254,
'LSX_GABA.bed.gz': 147192400,
'DG-IMN_Glut.bed.gz': 336481046,
'HY_Gnrh1_Glut.bed.gz': 783042,
'OB-CR_Glut.bed.gz': 6790320,
'CNU-MGE_GABA.bed.gz': 93358728,
'MB_Glut.bed.gz': 815605023,
'MY_GABA.bed.gz': 113820670,
'HY_Glut.bed.gz': 177754869,
'TH_Glut.bed.gz': 455732012,
'CNU-LGE_GABA.bed.gz': 403055016,
'P_GABA.bed.gz': 133969424,
'Immune.bed.gz': 234100458,
'Vascular.bed.gz': 262067892,
'CB_Glut.bed.gz': 476247064,
'OPC-Oligo.bed.gz': 1912232980,
'IT-ET_Glut.bed.gz': 4824967792,
'CNU-HYa_GABA.bed.gz': 285071090,
'CB_GABA.bed.gz': 57209872,
'MB_Dopa.bed.gz': 50937284,
'P_Glut.bed.gz': 233308761,
'MB_GABA.bed.gz': 572258267,
'OB-IMN_GABA.bed.gz': 50274352,
'MB-HB_Sero.bed.gz': 25413152,
'OEC.bed.gz': 509394,
'Pineal_Glut.bed.gz': 1446016}
while grouping the metadata by class (for all 2.3M cells) yields
Class
IT-ET Glut 3911482310
OPC-Oligo 1837739725
Astro-Epen 1147969857
NP-CT-L6b Glut 784204001
MB Glut 750316389
CNU-LGE GABA 493293499
MB GABA 448868487
CTX-MGE GABA 443671252
CB Glut 427319942
TH Glut 356853925
DG-IMN Glut 339200608
P Glut 264327471
CNU-HYa GABA 238441380
Vascular 237619853
CTX-CGE GABA 233809950
Immune 210318453
HY GABA 205488208
LSX GABA 194601248
MY GABA 162226871
HY Glut 161648292
CNU-HYa Glut 143951176
P GABA 126838072
CNU-MGE GABA 115760498
MY Glut 105435880
OB-IMN GABA 72832099
MB Dopa 54655953
CB GABA 50524434
MH-LH Glut 34059796
MB-HB Sero 26239541
OB-CR Glut 7518986
Pineal Glut 2123205
OEC 851273
HY Gnrh1 Glut 785959
Name: # of Fragments, dtype: int64
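A small sketch of the cross-check being described, assuming Supplementary Table 2 can be loaded as a tsv with 'Class' and '# of Fragments' columns ("supp_table_2.tsv" is a placeholder name):

import pandas as pd

# expected per-class fragment totals from the metadata
meta = pd.read_csv("supp_table_2.tsv", sep="\t")
expected = meta.groupby("Class")["# of Fragments"].sum()

# observed line counts from `zcat <file> | wc -l`, keyed like the
# written files (underscores where the class names have spaces)
observed = {"HY_GABA": 226103206, "MH-LH_Glut": 34759628}

for name, n_lines in observed.items():
    n_expected = expected.get(name.replace("_", " "))
    status = "ok" if n_lines == n_expected else "mismatch"
    print(f"{name}: written={n_lines} expected={n_expected} [{status}]")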
Strangely, I do notice that for the HY GABA class, the number of written lines (226103206) exceeds the number of fragments assigned to that class in the metadata (205488208) by ~20M entries, so I'm not sure what is going on there. I'm also not sure if this is due to the particularly large files, but the error occurred only after much of the data had already been written. Would appreciate any thoughts on this!
Best,
Jay
Hello, may I ask what pipeline was used in this paper (https://www.nature.com/articles/s41586-023-06824-9#data-availability) to go from FASTQ files to fragment files and the other released files? Is there a corresponding tutorial? I need files such as the fragment files produced by 10x Genomics tools, but I did not find them in the data you provided. Alternatively, could you provide the fragment files for each sample? Thank you very much!
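For context, a 10x-style fragments file is a (block-)gzipped, tab-separated, BED-like table; a minimal sketch of reading one ("fragments.tsv.gz" is a placeholder path):

import gzip

# each record: chrom, start, end, cell barcode, read-support count
with gzip.open("fragments.tsv.gz", "rt") as fh:
    for line in fh:
        if line.startswith("#"):  # 10x files begin with comment headers
            continue
        chrom, start, end, barcode, support = line.rstrip("\n").split("\t")
        print(chrom, int(start), int(end), barcode, int(support))
        break  # show only the first record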
After we change the directories to 00, 01, 02, we need to modify the code accordingly.