beyondpie / cemba_wmb_snatac
Whole mouse brain snATAC-seq analysis
License: MIT License
TL;DR
In doing differential peak analysis, should pp.make_peak_matrix be given union_bed_file or rnd_bed_file?
Original post
Hi, @beyondpie. Thanks for making this remarkable pipeline publicly available.
I'd like to work with processed files from catlas.org, subset out my cell types of interest, create a peak matrix and perform differential peak analysis.
I downloaded files with the _rm_dlt.h5ad suffix under snapatac2_anndata_per_sample_with_fragment_dir, and used the following code to get the peak matrix:
import snapatac2 as sa2

# build a cell-by-peak matrix over the union peak set
peak_mat = sa2.pp.make_peak_matrix(
    adata=snap2,
    inplace=False,
    file=outdir / "sa2pmat.union_pmat.h5ad",
    backend="hdf5",
    peak_file=union_bed_file,
    chunk_size=10000,
    use_x=False,
)
(ref. sa2pmay.py lines 66-72)
where union_bed_file = 'mba.whole.sa2.final.peak.srt.bed'. I then plan to do differential peak analysis with tl.marker_regions or tl.diff_test.
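For reference, a minimal sketch of that downstream call, assuming the peak matrix written above and a hypothetical grouping column named "subclass" in .obs (the actual column name may differ):

import snapatac2 as sa2

# hypothetical downstream step: per-group marker peaks from the
# peak matrix built above (the "subclass" column is an assumption)
peak_mat = sa2.read(outdir / "sa2pmat.union_pmat.h5ad")
marker_peaks = sa2.tl.marker_regions(
    peak_mat,
    groupby="subclass",
    pvalue=0.01,
)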
However, I'm unsure if this is correct, as I noticed two peak files, union_bed_file and rnd_bed_file, where rnd_bed_file = 'mba.whole.shuffle.removeovlp.bed'. What's the purpose of having both bed files, and which one should I use for diff-peak analysis?
Thanks!
Ray
Organize the files in this directory like what we have in the others.
Dear CEMBA team,
Thank you for this great resource. I am attempting to save subclass-specific fragment files (as BED files) for one of the h5ad files available under this folder (specifically, "CEMBA180111_4E_rm_dlt.h5ad") and am getting the following error (I am using SnapATAC 2.5.1):
RuntimeError Traceback (most recent call last)
Cell In[29], line 1
----> 1 snapatac2.ex.export_fragments(anndata_test,
2 groupby='Subclass',
3 out_dir=anndata_dir)
File ~/miniconda3/envs/snapatac2/lib/python3.9/site-packages/snapatac2/export/__init__.py:68, in export_fragments(adata, groupby, selections, ids, out_dir, prefix, suffix, compression, compression_level)
65 elif suffix.endswith(".zst"):
66 compression = "zstandard"
---> 68 return internal.export_fragments(
69 adata, list(ids), list(groupby), out_dir, prefix, suffix, selections, compression, compression_level,
70 )
RuntimeError: neither 'fragment_single' nor 'fragment_paired' is present in the '.obsm'
when running
snapatac2.ex.export_fragments(
    adata,
    groupby='Subclass',
    out_dir=anndata_dir,
)
where 'adata' is loaded with snapatac2.read(), and 'Subclass' is a column of subclass labels retrieved from Supplementary Table 2 (downloaded from this folder). The error occurs with both backed=None and backed='r+'.
I verified that the keys are indeed missing by checking anndata_test.obsm, which returns AxisArrays with keys: insertion. I am wondering if there is an error in my data loading/handling?
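A self-contained version of that check (the interpretation in the comments is my reading of the error, not a confirmed diagnosis):

import snapatac2 as snap

# export_fragments needs 'fragment_single' or 'fragment_paired' in
# .obsm; these h5ad files appear to store only 'insertion'
# (per-base insertion counts), which cannot be turned back into
# fragment BED records
adata = snap.read("CEMBA180111_4E_rm_dlt.h5ad", backed="r+")
print(list(adata.obsm.keys()))  # -> ['insertion']
has_fragments = {"fragment_single", "fragment_paired"} & set(adata.obsm.keys())
print("can export fragments:", bool(has_fragments))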
Best,
Jay
Some supplementary tables are big or lack a description of how to read them; for example, Supplementary Table 6 is 3.3 GB and a huge binary matrix.
We need to:
Dear CEMBA team,
I am trying to write fragment bed files to disk through an AnnDataSet object (all 2.3M cells from all .h5ad sample files; AnnDataSet object with n_obs x n_vars = 2355842 x 5451055) using SnapATAC 2.4; the call is shown after the sketch below. 'Class' is a manually added column, using AIT21_annotation_freeze_081523.tsv (under this folder) for subclass-to-class conversion.
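A hypothetical reconstruction of that column (the AIT21 column names subclass_label and class_label are assumptions about the tsv's schema, as is a pre-existing 'Subclass' column on the AnnDataSet):

import pandas as pd

# map each cell's subclass label to its class using the AIT21 table
# (column names are assumptions; "Subclass" is assumed to be in .obs)
ait21 = pd.read_csv("AIT21_annotation_freeze_081523.tsv", sep="\t")
subclass_to_class = dict(zip(ait21["subclass_label"], ait21["class_label"]))
adataset.obs["Class"] = [
    subclass_to_class[s] for s in adataset.obs["Subclass"]
]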
snapatac2.ex.export_bed(
    adataset,
    groupby='Class',
    suffix='.bed.gz',
    out_dir=class_level_fragments_dir,
)
and encountered the following error (I've replaced my personal paths with /path/to/):
thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: H5Dread(): can't read data: file read failed: time = Sun Dec 17 07:33:29 2023
, filename = '/path/to/CEMBA180129_3A_rm_dlt.h5ad', file descriptor = 116, errno = 5, error message = 'Input/output error', buf = 0x50871390, total read size = 2096, bytes this sub-read = 2096, bytes actually read = 18446744073709551615, offset = 0', /root/.cargo/git/checkouts/anndata-rs-087569d311f7caf2/d54b207/anndata/src/container/base.rs:1160:85
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
---------------------------------------------------------------------------
PanicException Traceback (most recent call last)
Cell In[51], line 5
3 print(f'Writing outputs to f{class_level_fragments_dir}')
4 print('='*50)
----> 5 snapatac2.ex.export_bed(adataset,
6 groupby='Class',
7 suffix='.bed.gz',
8 out_dir = class_level_fragments_dir
9 )
File ~/miniconda3/envs/conda_env/lib/python3.8/site-packages/snapatac2/export/__init__.py:55, in export_bed(adata, groupby, selections, ids, out_dir, prefix, suffix)
52 elif isinstance(ids, str):
53 ids = adata.obs[ids]
---> 55 return internal.export_bed(
56 adata, list(ids), list(groupby), selections, out_dir, prefix, suffix,
57 )
PanicException: called `Result::unwrap()` on an `Err` value: H5Dread(): can't read data: file read failed: time = Sun Dec 17 07:33:29 2023
, filename = '/path/to/CEMBA180129_3A_rm_dlt.h5ad', file descriptor = 116, errno = 5, error message = 'Input/output error', buf = 0x50871390, total read size = 2096, bytes this sub-read = 2096, bytes actually read = 18446744073709551615, offset = 0
I have verified that each cell is assigned the appropriate class label. At the time of this error (which occurred after hours of writing to disk), the line counts of the written bed.gz files do not match the per-class fragment counts in the metadata (Supplementary Table 2 of the paper), which is expected given the interruption. Counting the number of lines (at the time of the error) using zcat <filename> | wc -l yields:
{'HY_GABA.bed.gz': 226103206,
'CTX-CGE_GABA.bed.gz': 290152494,
'CTX-MGE_GABA.bed.gz': 550481310,
'MH-LH_Glut.bed.gz': 34759628,
'NP-CT-L6b_Glut.bed.gz': 996287794,
'MY_Glut.bed.gz': 79333970,
'Astro-Epen.bed.gz': 1258567744,
'CNU-HYa_Glut.bed.gz': 166911254,
'LSX_GABA.bed.gz': 147192400,
'DG-IMN_Glut.bed.gz': 336481046,
'HY_Gnrh1_Glut.bed.gz': 783042,
'OB-CR_Glut.bed.gz': 6790320,
'CNU-MGE_GABA.bed.gz': 93358728,
'MB_Glut.bed.gz': 815605023,
'MY_GABA.bed.gz': 113820670,
'HY_Glut.bed.gz': 177754869,
'TH_Glut.bed.gz': 455732012,
'CNU-LGE_GABA.bed.gz': 403055016,
'P_GABA.bed.gz': 133969424,
'Immune.bed.gz': 234100458,
'Vascular.bed.gz': 262067892,
'CB_Glut.bed.gz': 476247064,
'OPC-Oligo.bed.gz': 1912232980,
'IT-ET_Glut.bed.gz': 4824967792,
'CNU-HYa_GABA.bed.gz': 285071090,
'CB_GABA.bed.gz': 57209872,
'MB_Dopa.bed.gz': 50937284,
'P_Glut.bed.gz': 233308761,
'MB_GABA.bed.gz': 572258267,
'OB-IMN_GABA.bed.gz': 50274352,
'MB-HB_Sero.bed.gz': 25413152,
'OEC.bed.gz': 509394,
'Pineal_Glut.bed.gz': 1446016}
while grouping the metadata by class (for all 2.3M cells) yields
Class
IT-ET Glut 3911482310
OPC-Oligo 1837739725
Astro-Epen 1147969857
NP-CT-L6b Glut 784204001
MB Glut 750316389
CNU-LGE GABA 493293499
MB GABA 448868487
CTX-MGE GABA 443671252
CB Glut 427319942
TH Glut 356853925
DG-IMN Glut 339200608
P Glut 264327471
CNU-HYa GABA 238441380
Vascular 237619853
CTX-CGE GABA 233809950
Immune 210318453
HY GABA 205488208
LSX GABA 194601248
MY GABA 162226871
HY Glut 161648292
CNU-HYa Glut 143951176
P GABA 126838072
CNU-MGE GABA 115760498
MY Glut 105435880
OB-IMN GABA 72832099
MB Dopa 54655953
CB GABA 50524434
MH-LH Glut 34059796
MB-HB Sero 26239541
OB-CR Glut 7518986
Pineal Glut 2123205
OEC 851273
HY Gnrh1 Glut 785959
Name: # of Fragments, dtype: int64
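A small sketch of the cross-check being described, assuming Supplementary Table 2 can be loaded as a tsv with 'Class' and '# of Fragments' columns ("supp_table_2.tsv" is a placeholder name):

import pandas as pd

# expected per-class fragment totals from the metadata
meta = pd.read_csv("supp_table_2.tsv", sep="\t")
expected = meta.groupby("Class")["# of Fragments"].sum()

# observed line counts from `zcat <file> | wc -l`, keyed like the
# written files (underscores where the class names have spaces)
observed = {"HY_GABA": 226103206, "MH-LH_Glut": 34759628}

for name, n_lines in observed.items():
    n_expected = expected.get(name.replace("_", " "))
    status = "ok" if n_lines == n_expected else "mismatch"
    print(f"{name}: written={n_lines} expected={n_expected} [{status}]")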
Strangely, I do notice that for the HY GABA class, the number of written lines (226103206) exceeds the number of fragments assigned to that class in the metadata (205488208) by ~20M entries, so I'm not sure what is going on there. I'm also not sure if this is due to the particularly large files, but the error occurred only after much of the data had already been written. Would appreciate any thoughts on this!
Best,
Jay
Hello, may I ask what pipeline was used in this paper (https://www.nature.com/articles/s41586-023-06824-9#data-availability) to go from FASTQ files to fragment files and the other released files? Is there a corresponding tutorial? I need files such as the fragment files produced by 10x Genomics tools, but I did not find them in the data you provided. Alternatively, could you provide the fragment files for each sample? Thank you very much!
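For context, a 10x-style fragments file is a (block-)gzipped, tab-separated, BED-like table; a minimal sketch of reading one ("fragments.tsv.gz" is a placeholder path):

import gzip

# each record: chrom, start, end, cell barcode, read-support count
with gzip.open("fragments.tsv.gz", "rt") as fh:
    for line in fh:
        if line.startswith("#"):  # 10x files begin with comment headers
            continue
        chrom, start, end, barcode, support = line.rstrip("\n").split("\t")
        print(chrom, int(start), int(end), barcode, int(support))
        break  # show only the first record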
After we change the directories to 00, 01, 02, we need to modify the code accordingly.