
Comments (9)

LKremer commented on August 25, 2024

Hi @chooliu ,
thanks for the nice feedback! Sounds like you got quite an impressive data set.

In the long run, we want to re-write the whole prepare script to make it faster and more memory efficient. Currently, we first read the methylation files and write them to a sparse matrix in COO format, and then convert the COO matrix to CSR format (stored in the form of .npz files). I'm sure there must be a way to either skip COO entirely and write straight to CSR, or to convert COO to CSR without reading the whole COO matrix into memory. I'll think about it and see if I can find a better way.
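
Just to illustrate what I mean, here is a minimal sketch of that current pipeline, assuming scipy is used for the sparse matrices. The file names, the simplified coverage-file layout and the positions-as-rows orientation are placeholders for illustration; this is not the actual scbs code:

import scipy.sparse as sp

rows, cols, vals = [], [], []  # genomic position, cell index, methylation value
for cell_idx, cov_file in enumerate(["cell1.cov", "cell2.cov"]):  # placeholder file names
    for line in open(cov_file):
        chrom, pos, value = line.split()[:3]  # simplified, assumed column layout
        rows.append(int(pos))                 # per-chromosome splitting omitted for brevity
        cols.append(cell_idx)
        vals.append(float(value))

# the entire COO matrix is held in memory here, which is the expensive step
coo = sp.coo_matrix((vals, (rows, cols)))
sp.save_npz("chr1.npz", coo.tocsr())  # CSR matrix stored as an .npz file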

Your suggestion to process the data in chunks would also work, of course. I'm not sure yet what the best solution to this problem is.

We didn't implement a function to resume the prepare script starting from COO files, so the easiest way would be to re-run scbs prepare. Manually recovering the COO file is possible in theory, but I think it's a little tricky, so I can't really recommend it. If you still want to give it a shot, you can use Python to read the COO file, convert it to CSR format, and store it as an .npz file. Have a look at _load_csr_from_coo() to see how to load a COO file; you can then save the converted matrix with scipy.sparse.save_npz().

But this seems pretty tedious and error-prone, and you would still need to get the .npz files of the other chromosomes somehow. Another problem is that scbs prepare produces a bunch of metadata (a file listing the cell names, quality metrics, etc.) that you wouldn't get if you do everything manually. And if you re-run without chromosome 1, the quality metrics would of course be missing some values. So yeah, I think re-running is the best way, even if it takes a while.
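
In case it helps, here is roughly what that manual route could look like. The three-column text layout of the COO file (row, column, value per line) and the file names are assumptions on my part, so double-check against _load_csr_from_coo() before trusting this:

import numpy as np
import scipy.sparse as sp

# assumed layout: one "row col value" triplet per line of the COO file
rows, cols, vals = np.loadtxt("1.coo", unpack=True)  # placeholder file name
csr = sp.coo_matrix((vals, (rows.astype(int), cols.astype(int)))).tocsr()
sp.save_npz("1.npz", csr)  # what scbs prepare would normally write for this chromosome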

Thanks for using scbs and reporting this issue.
I will let you know once I've found a way to decrease the memory requirements.


LKremer commented on August 25, 2024

I did some testing on a large data set with 2568 cells. I quantified GpC sites, which are ~10x more frequent than CpG sites, so this data set is almost as big as the one you tried.

I measured the peak memory usage of different scbs versions and these are the results:

scbs 0.3.3:   43.72 gigabytes
scbs 0.4.0:   14.75 gigabytes

Surprisingly, lowering --chunksize didn't decrease memory usage further, so I think these ~15 GB are used by another part of the code that I didn't change. In any case, 15 GB seems manageable. For your larger data sets it may be bigger than that, but you can definitely fit many more cells into your 64 gigs of RAM now!
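
In case you want to reproduce this kind of measurement yourself, one simple way (just a sketch; the scbs prepare arguments are placeholders) is to run the command as a child process from Python and read the peak resident set size of finished children afterwards:

import resource
import subprocess

# run the command whose memory footprint you want to measure; arguments are placeholders
subprocess.run(["scbs", "prepare", "--help"], check=True)

# on Linux, ru_maxrss of finished child processes is reported in kilobytes
peak_kb = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
print(f"peak RSS of child processes: {peak_kb / 1e6:.2f} GB")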


LKremer commented on August 25, 2024

No problem, and thanks for making me aware of this issue! I agree, CH methylation is interesting and we also had a look at it with scbs. We didn't notice the memory issues though, because we had fewer cells and were using a machine with 126 GB of RAM. So thanks again, and also thanks for sharing scbs with your peers :)


LKremer commented on August 25, 2024

You're right, we didn't discuss it in the preprint. But of course you can also input other data types such as CH methylation data. Good point actually, maybe we should discuss it in the next version of the paper.


chooliu commented on August 25, 2024

I'm obviously biased here, but I think that inclusion would be interesting!

In particular, our group typically uses both CG & CH together for clustering, as CH is very useful for most brain datasets (https://lhqing.github.io/ALLCools/intro.html, from collaborators in San Diego, joins separate CH-PCs and CG-PCs as input features). One direction I've been exploring in my methods development work is whether CH-DMR calling requires distinct considerations.

Cheers,
Choo


LKremer commented on August 25, 2024

Hi @chooliu,

I rewrote the memory-inefficient part of scbs prepare that caused your crash. The COO file is now read in chunks instead of reading the whole chromosome at once. Before I release this version, could you please try it and tell me if it fixed your problem? I tested it on our own data and it seems to work.

scbs-0.4.0.tar.gz

After downloading the .tar.gz file, you can install it like this:

python3 -m pip install --upgrade scbs-0.4.0.tar.gz

Then check if you have the correct version (0.4.0) by just typing scbs.

After updating to 0.4.0 you can just use scbs prepare like you did before, but it should use less memory.

If you're still running out of memory, you can also lower the size of the chunks now. By default, each chromosome is now read in chunks of 10 megabases each, so that e.g. mouse chr1 consists of 20 chunks. If you want to lower the memory requirements even further, you can set e.g. --chunksize 1000000, which would give 200 chunks of 1 Mb each. This might be a little slower, but it will save more RAM.
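
To make the chunking idea a bit more concrete, here is a rough sketch of converting a per-chromosome COO file to CSR one genomic window at a time. This is only an illustration, not the actual scbs code; the three-column file layout, the positions-as-rows orientation and the assumption that the file is sorted by position are simplifications:

import scipy.sparse as sp

chunksize = 10_000_000  # 10 Mb windows, the default described above
n_cells = 2568          # number of cells (columns), assumed known from the data set

chunks, rows, cols, vals = [], [], [], []
chunk_start = 0

def flush():
    # convert the triplets collected for the current window into a small CSR chunk
    chunks.append(sp.coo_matrix((vals, (rows, cols)), shape=(chunksize, n_cells)).tocsr())
    rows.clear(); cols.clear(); vals.clear()

for line in open("1.coo"):  # placeholder file name; assumed sorted by genomic position
    r, c, v = line.split()
    r = int(r)
    while r >= chunk_start + chunksize:  # position left the current window: flush it
        flush()
        chunk_start += chunksize
    rows.append(r - chunk_start)
    cols.append(int(c))
    vals.append(float(v))
flush()  # last, possibly partial, window

# the full chromosome is assembled from small CSR pieces and never held as one big COO matrix
sp.save_npz("1.npz", sp.vstack(chunks, format="csr"))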

Please let me know if it solved your issue :)


LKremer commented on August 25, 2024

closing for now, since this was addressed in release 0.4.0


chooliu commented on August 25, 2024

Sorry Lucas, I could have sworn that I had responded to your message from way back when; I got re-notified when the issue was closed.

Thanks so much for looking into memory requirements! I think non-CpG methylation is somewhat niche, but also vitally important to a lot of folks working in brain, development, etc. (where a lot of single-cell methylation work is being done).

I shared the scbs preprint with my group last year, and more folks besides me are now playing with it. I'll let you know how it goes, and thanks again :)


chooliu commented on August 25, 2024

Oops, my apologies on that: my recollection was that the preprint exclusively discussed CpGs. Very excited to see how our field moves forward as larger cell count datasets emerge. :)
