lkremer / scbs

Python package with CLI for downstream analysis of single cell methylation data.

License: GNU General Public License v3.0

Python 100.00%

dna-methylation dna-methylation-data methylation single-cell single-cell-methylation single-cell-rna-seq

scbs's Introduction

`scbs`: A Command Line Tool for the Analysis of Single-Cell Bisulfite-Sequencing Data

Note

This package was renamed to MethSCAn.
All further development will take place at its new home at https://github.com/anders-biostat/MethSCAn.

scbs's People

ast87 anders-biostat ashutoshtomar

scbs's Issues

make "scbs filter" crash when you filter all cells

If you set your filtering thresholds too strict, scbs filter currently produces a directory containing only empty matrices. This is useless and confusing. Instead, it should raise an exception when the thresholds would filter out 100% of cells.

Thanks @gleb-gavrish for the suggestion!
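A minimal sketch of the requested behavior, assuming a hypothetical per-cell quality score and threshold (the names `cells_passing_filter`, `cell_quality` and `min_quality` are illustrative and not taken from scbs):

```python
def cells_passing_filter(cell_quality, min_quality):
    """Return the names of cells whose quality score meets the threshold.

    cell_quality: dict mapping cell name -> numeric quality score (hypothetical).
    Raises ValueError instead of silently returning an empty selection.
    """
    kept = [name for name, score in cell_quality.items() if score >= min_quality]
    if not kept:
        # Fail early rather than writing a directory of empty matrices.
        raise ValueError(
            f"All {len(cell_quality)} cells were filtered out "
            f"(threshold: {min_quality}). Consider relaxing the thresholds."
        )
    return kept
```

Raising before anything is written to disk means a too-strict threshold never leaves a half-usable output directory behind.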

feature request: merging files for memory issues?

Hi there--I'm an analyst in the Luo lab (associated with one of the single-cell methylomes datasets used in the preprint!). Thanks for developing this software-- I love the user-friendliness and the VMR concept in scbs, and running scbs on CpG methylation is quite smooth.

However, I'm running into some issues as I frequently need to work with non-CpG methylation (CH methylation), which can be associated with ~20-fold more loci. Using 64GB memory and ~2,000 cells, larger chromosomes will fail at the end of the scbs prepare step in what looks like the .coo to .npz conversion.

While I'm currently attempting to re-run with more memory, this is a relatively low cell count dataset for us. It'd be great to somehow merge sets of cells (so maybe I could run on 1,000 cells of the dataset at a time?) or somehow read/write each chromosome in blocks to use less memory, though I'm not sure either is possible given the details of the sparse format. Are there any ways to "recover" a run by converting the 1.coo file (which is successfully created) without re-running prepare, or any other recommendations?

Cheers!
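For context on the numbers in the log that follows: pandas tries to allocate a single int64 array of shape (3, 2313401180), and 3 × 2,313,401,180 × 8 bytes is about 51.7 GiB. The block-wise reading idea from the post could be sketched roughly as below. This is not how scbs does it. It assumes the .coo file is a headerless CSV with three integer columns (row index, column index, value) and that the values fit into int8. The function name and the chunk size are made up for illustration.

```python
import numpy as np
import pandas as pd
from scipy import sparse


def load_csr_in_chunks(coo_path, n_rows, n_cols, chunksize=10_000_000):
    """Build a CSR matrix from a 3-column CSV without loading it all at once."""
    mat = sparse.csr_matrix((n_rows, n_cols), dtype=np.int8)
    reader = pd.read_csv(
        coo_path,
        header=None,
        names=["row", "col", "value"],
        # Compact dtypes: 13 bytes per entry instead of 3 x 8 = 24.
        dtype={"row": np.int64, "col": np.int32, "value": np.int8},
        chunksize=chunksize,
    )
    for chunk in reader:
        part = sparse.coo_matrix(
            (
                chunk["value"].to_numpy(),
                (chunk["row"].to_numpy(), chunk["col"].to_numpy()),
            ),
            shape=(n_rows, n_cols),
        ).tocsr()
        # Only one chunk plus the running total is held in memory.
        mat = mat + part
    return mat
```

Each per-chunk conversion builds a full-height CSR matrix, so for very long chromosomes this has its own overhead. The sketch has not been benchmarked on real data.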

Populating 57334996 x 2165 matrix for chromosome 19...
Converting from COO to CSR...
Writing to scbs_out_CH/19.npz ...
Populating 260521582 x 2165 matrix for chromosome 1...
Traceback (most recent call last):
  File "/u/home/c/lib/python3.8/site-packages/scbs/prepare.py", line 156, in _load_csr_from_coo
    coo = pd.read_csv(coo_path, delimiter=",", header=None).values
  File "/u/home/c/lib/python3.8/site-packages/pandas/util/_decorators.py", line 311, in wrapper
    return func(*args, **kwargs)
  File "/u/home/c/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 680, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/u/home/c/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 581, in _read
    return parser.read(nrows)
  File "/u/home/c/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 1269, in read
    df = DataFrame(col_dict, columns=columns, index=index)
  File "/u/home/c/lib/python3.8/site-packages/pandas/core/frame.py", line 636, in __init__
    mgr = dict_to_mgr(data, index, columns, dtype=dtype, copy=copy, typ=manager)
  File "/u/home/c/lib/python3.8/site-packages/pandas/core/internals/construction.py", line 502, in dict_to_mgr
    return arrays_to_mgr(arrays, columns, index, dtype=dtype, typ=typ, consolidate=copy)
  File "/u/home/c/lib/python3.8/site-packages/pandas/core/internals/construction.py", line 156, in arrays_to_mgr
    return create_block_manager_from_column_arrays(
  File "/u/home/c/lib/python3.8/site-packages/pandas/core/internals/managers.py", line 1954, in create_block_manager_from_column_arrays
    blocks = _form_blocks(arrays, consolidate)
  File "/u/home/c/lib/python3.8/site-packages/pandas/core/internals/managers.py", line 2028, in _form_blocks
    values, placement = _stack_arrays(list(tup_block), dtype)
  File "/u/home/c/lib/python3.8/site-packages/pandas/core/internals/managers.py", line 2067, in _stack_arrays
    stacked = np.empty(shape, dtype=dtype)
numpy.core._exceptions.MemoryError: Unable to allocate 51.7 GiB for an array with shape (3, 2313401180) and data type int64

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/u/home/c/bin/scbs", line 8, in <module>
    sys.exit(cli())
  File "/u/home/c/lib/python3.8/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/u/home/c/lib/python3.8/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/u/home/c/lib/python3.8/site-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/u/home/c/lib/python3.8/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/u/home/c/lib/python3.8/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/u/home/c/lib/python3.8/site-packages/scbs/cli.py", line 157, in prepare_cli
    prepare(**kwargs)
  File "/u/home/c/lib/python3.8/site-packages/scbs/prepare.py", line 44, in prepare
    mat = _load_csr_from_coo(coo_path, chrom_size, n_cells)
  File "/u/home/c/lib/python3.8/site-packages/scbs/prepare.py", line 165, in _load_csr_from_coo
    raise type(exc)(f"{exc} (problematic file: {coo_path})").with_traceback(
TypeError: __init__() missing 1 required positional argument: 'dtype'
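The final TypeError in the traceback looks like a side effect of the re-raise pattern rather than of the memory problem itself: `type(exc)(message)` only works when the exception class accepts a single message argument. The following self-contained sketch reproduces that failure mode with a stand-in exception class. `ArrayAllocationError` is hypothetical and merely imitates an exception whose constructor needs two arguments. Chained re-raising with `raise ... from exc` is shown as one possible alternative, not as the fix adopted by the authors.

```python
class ArrayAllocationError(MemoryError):
    """Hypothetical stand-in for an exception that requires two arguments."""

    def __init__(self, shape, dtype):
        super().__init__(shape, dtype)
        self.shape = shape
        self.dtype = dtype


def fragile_reraise(path):
    try:
        raise ArrayAllocationError((3, 10), "int64")
    except Exception as exc:
        # Rebuilding the same exception type from one string fails with
        # TypeError because the constructor also expects 'dtype'.
        raise type(exc)(f"{exc} (problematic file: {path})").with_traceback(
            exc.__traceback__
        )


def chained_reraise(path):
    try:
        raise ArrayAllocationError((3, 10), "int64")
    except Exception as exc:
        # A fixed wrapper type always accepts a message, and the original
        # exception stays available as __cause__.
        raise RuntimeError(f"{exc} (problematic file: {path})") from exc
```

With the chained variant, the user still sees the original allocation error in the traceback, followed by the message naming the problematic file.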

add more VMR info to the scbs scan output

Similar to #15, scbs scan should report more info about the resulting VMRs, such as the number of cells with coverage in this VMR, the number of CpG sites, etc.
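As a rough illustration of how such per-region statistics could be derived, the sketch below assumes a position × cell sparse matrix in which any non-zero entry means "this cell has coverage at this site". Both the encoding and the function name `summarize_region` are assumptions made for this example, not the scbs implementation.

```python
import numpy as np
from scipy import sparse


def summarize_region(mat, start, end):
    """Count covered sites and covered cells in rows [start, end) of a CSR matrix."""
    region = mat[start:end]      # row slicing returns a new matrix
    region.eliminate_zeros()     # ignore explicitly stored zeros
    n_sites = int((region.getnnz(axis=1) > 0).sum())  # rows with any coverage
    n_cells = int((region.getnnz(axis=0) > 0).sum())  # columns with any coverage
    return {"n_sites": n_sites, "n_cells": n_cells}
```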

allow `scbs filter` to overwrite the input directory

Currently, scbs filter creates a new directory containing the filtered data. Some users might want to filter "in place", which would mean that the unfiltered input directory is overwritten with the new, filtered directory. Any attempts to do this currently result in buggy behavior. So I should either implement this feature, or, if it's too complicated to do, explicitly forbid this behavior by raising an exception if the output and input directory are the same.
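The "explicitly forbid" option could be as small as the following check. This is a sketch under the assumption that both directories are passed as plain paths. The function name and argument names are illustrative.

```python
from pathlib import Path


def ensure_distinct_dirs(data_dir, filtered_dir):
    """Raise if the output directory resolves to the same path as the input."""
    src = Path(data_dir).resolve()
    dst = Path(filtered_dir).resolve()
    if src == dst:
        raise ValueError(
            f"Input and output directory are identical ({src}). "
            "Filtering in place is not supported."
        )
    return src, dst
```

Resolving both paths first catches cases such as `data` versus `./data/`, which differ as strings but point to the same directory.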

wrong repo

scbs matrix: Allow user to choose between long or wide matrix format

Currently, the output of scbs matrix is a long table, but for most purposes you need a regular cell x region matrix.
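Until such an option exists, a long table can be reshaped with pandas. The sketch below assumes the long table has columns named `cell`, `region` and `value`. These column names are placeholders and may not match the actual scbs matrix output.

```python
import pandas as pd


def long_to_wide(long_table, value_column="value"):
    """Reshape a long (cell, region, value) table into a cell x region matrix."""
    # pivot raises ValueError if a (cell, region) pair occurs more than once;
    # pairs that are absent from the long table become NaN.
    return long_table.pivot(index="cell", columns="region", values=value_column)
```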

Add a method to allow for easy cell filtering (e.g. after QC)

Currently, you have to do this manually by re-running scbs prepare on high-quality cells only. It would be nice if you could just filter low-quality or uninteresting cells from DATA_DIR directly.

add more DMR information to the scbs diff output

It would be useful to add more information to the resulting DMR list. For instance, the mean methylation % in both cell groups, the number of CpG sites, the number of cells that have coverage in this DMR, and the raw p-value (for volcano plots etc.).
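One of the requested columns, the methylation percentage per cell group, could be computed along these lines. The sketch assumes a small dense sites × cells array with 1 for methylated, -1 for unmethylated and 0 for no coverage, and it pools all covered observations per group. Both the encoding and the pooling are assumptions for this example. Averaging per cell first would be an equally plausible definition.

```python
import numpy as np


def group_methylation_percent(values, groups):
    """Pooled methylation % per group for one region.

    values: 2D array (sites x cells) with 1, -1 or 0 (assumed encoding).
    groups: 1D array with one group label per cell.
    """
    values = np.asarray(values)
    groups = np.asarray(groups)
    result = {}
    for label in np.unique(groups):
        sub = values[:, groups == label]
        n_covered = int((sub != 0).sum())
        n_methylated = int((sub == 1).sum())
        # NaN marks groups without any coverage in this region.
        result[str(label)] = (
            100.0 * n_methylated / n_covered if n_covered else float("nan")
        )
    return result
```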

custom format specification

For option '--input-format', column 4 is defined as follows:

 4. The column number that contains either unmethylated counts (u) or the total coverage (c) followed by either 'm' or 'c', e.g. '4c' to denote that the 4th column contains the coverage

Why do I have to suffix the number with 'm' and not 'u'?
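To make the shape of such a specification concrete, here is a sketch of how a value like '4c' could be split into a column number and a one-letter suffix. It deliberately takes the set of allowed suffixes as a parameter, because the quoted help text leaves open which letters are valid. The parser is hypothetical and not the one used by scbs.

```python
import re


def parse_count_column(spec, allowed_suffixes):
    """Split a spec such as '4c' into (column_number, suffix)."""
    match = re.fullmatch(r"(\d+)([A-Za-z])", spec)
    if match is None:
        raise ValueError(f"Expected a number followed by one letter, got {spec!r}")
    column, suffix = int(match.group(1)), match.group(2)
    if suffix not in allowed_suffixes:
        raise ValueError(
            f"Unknown suffix {suffix!r} in {spec!r}, "
            f"expected one of {sorted(allowed_suffixes)}"
        )
    return column, suffix
```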

scbs filter should preserve the run_info log file

scbs prepare writes a data_directory and a log file recording the scbs version number and the command line parameters that were used.
scbs filter takes a data_directory and creates a new, filtered data_directory, but it doesn't copy the run_info.txt file of the unfiltered data. It should at least copy the log file and maybe append one line to log the filter parameters.
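A minimal sketch of that behavior, assuming the log is a plain text file named run_info.txt at the top level of the data directory (the file name comes from the issue, the function and its arguments are illustrative):

```python
from pathlib import Path


def carry_over_run_info(data_dir, filtered_dir, filter_note):
    """Copy run_info.txt to the filtered directory and append one line.

    Returns False if the unfiltered directory has no run_info.txt.
    """
    src = Path(data_dir) / "run_info.txt"
    dst = Path(filtered_dir) / "run_info.txt"
    if not src.exists():
        return False
    text = src.read_text()
    if text and not text.endswith("\n"):
        text += "\n"  # keep the appended note on its own line
    dst.write_text(text + filter_note + "\n")
    return True
```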
