
Comments (9)

LKremer commented on August 25, 2024

Hi @chooliu ,
thanks for the nice feedback! Sounds like you got quite an impressive data set.

In the long run, we want to re-write the whole prepare script to make it faster and more memory efficient. Currently, we first read the methylation files and write them to a sparse matrix in COO format, and then convert the COO matrix to CSR format (stored in the form of .npz files). I'm sure there must be a way to either skip COO entirely and write straight to CSR, or to convert COO to CSR without reading the whole COO matrix into memory. I'll think about it and see if I can find a better way.
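
Just to illustrate what I mean, here is a minimal sketch of that current pipeline, assuming scipy is used for the sparse matrices. The file names, the simplified coverage-file layout and the positions-as-rows orientation are placeholders for illustration; this is not the actual scbs code:

import scipy.sparse as sp

rows, cols, vals = [], [], []  # genomic position, cell index, methylation value
for cell_idx, cov_file in enumerate(["cell1.cov", "cell2.cov"]):  # placeholder file names
    for line in open(cov_file):
        chrom, pos, value = line.split()[:3]  # simplified, assumed column layout
        rows.append(int(pos))                 # per-chromosome splitting omitted for brevity
        cols.append(cell_idx)
        vals.append(float(value))

# the entire COO matrix is held in memory here, which is the expensive step
coo = sp.coo_matrix((vals, (rows, cols)))
sp.save_npz("chr1.npz", coo.tocsr())  # CSR matrix stored as an .npz file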

Your suggestion to process the data in chunks would also work, of course. I'm not sure yet what the best solution to this problem is.

We didn't implement a function to resume the prepare script starting from COO files, so the easiest way would be to re-run scbs prepare. Manually recovering the COO file is possible in theory, but I think it's a little tricky, so I can't really recommend it. If you still want to give it a shot, you can use Python to read the COO file, convert it to CSR format, and store it as an .npz file. Have a look at _load_csr_from_coo() to see how to load a COO file; you can then save the converted matrix with scipy.sparse.save_npz().

But this seems pretty tedious and error-prone, and you would still need to get the .npz files of the other chromosomes somehow. Another problem is that scbs prepare produces a bunch of metadata (a file listing the cell names, quality metrics, etc.) that you wouldn't get if you do everything manually. And if you re-run without chromosome 1, the quality metrics would of course be missing some values. So yeah, I think re-running is the best way, even if it takes a while.
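
In case it helps, here is roughly what that manual route could look like. The three-column text layout of the COO file (row, column, value per line) and the file names are assumptions on my part, so double-check against _load_csr_from_coo() before trusting this:

import numpy as np
import scipy.sparse as sp

# assumed layout: one "row col value" triplet per line of the COO file
rows, cols, vals = np.loadtxt("1.coo", unpack=True)  # placeholder file name
csr = sp.coo_matrix((vals, (rows.astype(int), cols.astype(int)))).tocsr()
sp.save_npz("1.npz", csr)  # what scbs prepare would normally write for this chromosome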

Thanks for using scbs and reporting this issue.
I will let you know once I've found a way to decrease the memory requirements.


LKremer commented on August 25, 2024

I did some testing on a large data set with 2568 cells. I quantified GpC sites, which are ~10x more frequent than CpG sites, so this data set is almost as big as the one you tried.

I measured the peak memory usage of different scbs versions and these are the results:

scbs 0.3.3:   43.72 gigabytes
scbs 0.4.0:   14.75 gigabytes

Surprisingly, lowering --chunksize didn't decrease memory usage further, so I think these ~15 GB are used by another part of the code that I didn't change. In any case, 15 GB seems manageable. For your larger data sets it may be bigger than that, but you can definitely fit many more cells into your 64 gigs of RAM now!
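
In case you want to reproduce this kind of measurement yourself, one simple way (just a sketch; the scbs prepare arguments are placeholders) is to run the command as a child process from Python and read the peak resident set size of finished children afterwards:

import resource
import subprocess

# run the command whose memory footprint you want to measure; arguments are placeholders
subprocess.run(["scbs", "prepare", "--help"], check=True)

# on Linux, ru_maxrss of finished child processes is reported in kilobytes
peak_kb = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
print(f"peak RSS of child processes: {peak_kb / 1e6:.2f} GB")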


LKremer commented on August 25, 2024

No problem, and thanks for making me aware of this issue! I agree, CH methylation is interesting and we also had a look at it with scbs. We didn't notice the memory issues though, because we had fewer cells and were using a machine with 126 GB of RAM. So thanks again, and also thanks for sharing scbs with your peers :)


LKremer commented on August 25, 2024

You're right, we didn't discuss it in the preprint. But of course you can also input other data types such as CH methylation data. Good point actually, maybe we should discuss it in the next version of the paper.


chooliu commented on August 25, 2024

I'm obviously biased here, but I think that inclusion would be interesting!

In particular, our group typically uses both CG & CH together for clustering, as CH is very useful for most brain datasets (https://lhqing.github.io/ALLCools/intro.html, from collaborators in San Diego, joins separate CH-PCs and CG-PCs as input features). One direction I've been exploring in my methods development work is whether CH-DMR calling requires distinct considerations.

Cheers,
Choo


LKremer commented on August 25, 2024

Hi @chooliu,

I rewrote the memory-inefficient part of scbs prepare that caused your crash. The COO file is now read in chunks instead of reading the whole chromosome at once. Before I release this version, could you please try it and tell me if it fixed your problem? I tested it on our own data and it seems to work.

scbs-0.4.0.tar.gz

After downloading the .tar.gz file, you can install it like this:

python3 -m pip install --upgrade scbs-0.4.0.tar.gz

Then check if you have the correct version (0.4.0) by just typing scbs.

After updating to 0.4.0 you can just use scbs prepare like you did before, but it should use less memory.

If you're still running out of memory, you can also lower the size of the chunks now. By default, each chromosome is now read in chunks of 10 megabases each, so that e.g. mouse chr1 consists of 20 chunks. If you want to lower the memory requirements even further, you can set e.g. --chunksize 1000000, which would give 200 chunks of 1 Mb each. This might be a little slower, but it will save more RAM.
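
To make the chunking idea a bit more concrete, here is a rough sketch of converting a per-chromosome COO file to CSR one genomic window at a time. This is only an illustration, not the actual scbs code; the three-column file layout, the positions-as-rows orientation and the assumption that the file is sorted by position are simplifications:

import scipy.sparse as sp

chunksize = 10_000_000  # 10 Mb windows, the default described above
n_cells = 2568          # number of cells (columns), assumed known from the data set

chunks, rows, cols, vals = [], [], [], []
chunk_start = 0

def flush():
    # convert the triplets collected for the current window into a small CSR chunk
    chunks.append(sp.coo_matrix((vals, (rows, cols)), shape=(chunksize, n_cells)).tocsr())
    rows.clear(); cols.clear(); vals.clear()

for line in open("1.coo"):  # placeholder file name; assumed sorted by genomic position
    r, c, v = line.split()
    r = int(r)
    while r >= chunk_start + chunksize:  # position left the current window: flush it
        flush()
        chunk_start += chunksize
    rows.append(r - chunk_start)
    cols.append(int(c))
    vals.append(float(v))
flush()  # last, possibly partial, window

# the full chromosome is assembled from small CSR pieces and never held as one big COO matrix
sp.save_npz("1.npz", sp.vstack(chunks, format="csr"))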

Please let me know if it solved your issue :)


LKremer commented on August 25, 2024

closing for now, since this was addressed in release 0.4.0


chooliu commented on August 25, 2024

Sorry Lucas, I could have sworn that I had responded to your message from way back when; I got re-notified when the issue was closed.

Thanks so much for looking into memory requirements! I think non-CpG methylation is somewhat niche, but also vitally important to a lot of folks working in brain, development, etc. (where a lot of single-cell methylation work is being done).

I shared the scbs preprint with my group last year, and more folks besides me are now playing with it. I'll let you know how it goes, and thanks again :)


chooliu commented on August 25, 2024

Oops, my apologies on that: my recollection was that the preprint exclusively discussed CpGs. Very excited to see how our field moves forward as larger cell count datasets emerge. :)
