Comments (23)
Ah - I think the error occurs when one is running CellBender using the CellRanger matrix format (a dir with barcodes.tsv.gz, features.tsv.gz, and matrix.mtx.gz) rather than h5 format as input. I am testing out using the CellRanger h5 file now and will report back.
from cellbender.
Yes - this is it. I can confirm loading the CellBender output in scanpy works just fine when I run CellBender using the h5 input format.
from cellbender.
I also ran into this problem. Maybe a good solution would be to always add a genome
, just empty strings if we don't know what it was?
If you support this decision, I'd be happy to sketch a pull-request
For now I ended up fixing cellbender h5 files with this:
tables.copy_file(orig_h5, fixed_h5)
with tables.open_file(fixed_h5, "r+") as f:
n = f.get_node("/matrix/features")
n_genes = f.get_node("/matrix/shape")[0]
if "genome" not in n:
f.create_array(n, "genome", np.repeat("GRCh38", n_genes))
from cellbender.
@sjfleming thank you for posting this, it is appreciated. Will keep you posted on the progress.
from cellbender.
I also used mtx fille generated from h5ad as input, since I do not have CellRanger file.
from cellbender.
Thank you, Stephen. I just want to clarify, that the problem for me was not in v2 CellRanger, but in v3 CellRanger that started with mtx
folder. In the folder, there's no record of genome
.
from cellbender.
This problem is easily alleviable with a custom import function
def readCellbenderH5(filename):
import h5py
import scanpy as sc
import scipy
f = h5py.File(filename, 'r')
mat=f['matrix']
cols=['latent_cell_probability','latent_RT_efficiency',"latent_dirichlet_precision"]
obsdict={x:mat[x] for x in cols}
ad=sc.AnnData(X=scipy.sparse.csr_matrix((mat['data'][:],
mat['indices'][:],
mat['indptr'][:]),
shape=(mat['shape'][1],mat['shape'][0])),
var=pd.DataFrame(dict(mat['features'])),
obs=pd.DataFrame(obsdict,index=[x.decode('ascii') for x in mat['barcodes']]),
uns={'test_elbo':list(mat['test_elbo']),'test_epoch':list(mat['test_epoch'])})
ad.var.index=[x.decode('ascii') for x in ad.var['name']]
return(ad)
from cellbender.
Thank you CellBender group for developing such a useful and user-friendly tool. recently I have the same error when loading the CellBender h5 file to Scanpy. I'm using the latest CellBender 2.0 and Scanpy 1.6.0. And @mtvector 's suggestion also doesn't work for me.
adata = readCellbenderH5('Day9_10x_cellbender_filtered.h5')
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-11-248abec2df25> in <module>
----> 1 adata = readCellbenderH5('Day9_10x_cellbender_filtered.h5')
<ipython-input-4-9b15cfa9372a> in readCellbenderH5(filename)
6 mat=f['matrix']
7 cols=['latent_cell_probability','latent_RT_efficiency',"latent_dirichlet_precision"]
----> 8 obsdict={x:mat[x] for x in cols}
9 ad=sc.AnnData(X=scipy.sparse.csr_matrix((mat['data'][:],
10 mat['indices'][:],
<ipython-input-4-9b15cfa9372a> in <dictcomp>(.0)
6 mat=f['matrix']
7 cols=['latent_cell_probability','latent_RT_efficiency',"latent_dirichlet_precision"]
----> 8 obsdict={x:mat[x] for x in cols}
9 ad=sc.AnnData(X=scipy.sparse.csr_matrix((mat['data'][:],
10 mat['indices'][:],
h5py/_objects.pyx in h5py._objects.with_phil.wrapper()
h5py/_objects.pyx in h5py._objects.with_phil.wrapper()
~/anaconda3/envs/scanpy/lib/python3.8/site-packages/h5py/_hl/group.py in __getitem__(self, name)
262 raise ValueError("Invalid HDF5 object reference")
263 else:
--> 264 oid = h5o.open(self.id, self._e(name), lapl=self._lapl)
265
266 otype = h5i.get_type(oid)
h5py/_objects.pyx in h5py._objects.with_phil.wrapper()
h5py/_objects.pyx in h5py._objects.with_phil.wrapper()
h5py/h5o.pyx in h5py.h5o.open()
KeyError: "Unable to open object (object 'latent_dirichlet_precision' doesn't exist)"
My operation system is MacOS Catalina 10.15.7
Will appreciate any suggestion.
from cellbender.
I had the same issue with cellbender version downloaded a few days ago. I tried to modify the import function. - Could someone let me know if I modified it correctly?
import h5py
import scanpy as sc
import scipy
def readCellbenderH5(filename):
f = h5py.File(filename, 'r')
mat=f['background_removed']
cols=['latent_cell_probability','latent_RT_efficiency']
obsdict={x:mat[x] for x in cols}
ad=sc.AnnData(X=scipy.sparse.csr_matrix((mat['data'][:],
mat['indices'][:],
mat['indptr'][:]),
shape=(mat['shape'][1],mat['shape'][0])),
# I had only one column of gene names
var=pd.DataFrame(mat['gene_names']),
obs=pd.DataFrame(obsdict,index=[x.decode('ascii') for x in mat['barcodes']]),
uns={'test_elbo':list(mat['test_elbo']),'test_epoch':list(mat['test_epoch'])})
ad.var.index=[x.decode('ascii') for x in ad.var[0]]
# Remove non decoded gene names from var
ad.var.drop(0,axis=1,inplace=True)
return(ad)
from cellbender.
I am having the same issue right now, reading the new output .h5 file. can't be read by scanpy. Besides the above mentioned 'fix', is there a different way to resolve the key error on 'genome'?
from cellbender.
For me, running the master branch on a 10x Genomics CellRanger v3 file (pbmc_1k_v3) produces an output file that will load in scanpy 1.5 automatically:
Scanpy 1.6 will also load it:
How about a CellRanger v2 format file? Like 10x's pbmc_8k_v2? Scanpy 1.5 loads it okay:
Scanpy 1.6 will also load it:
Clearly this isn't working for everybody... it seems to me that scanpy
loading is a bit fragile.
Here is an immediate fix for scanpy
loading, for output files you already have. (This script will be part of the 0.2.1 release, but for now feel free to copy and paste the code.)
"""Functions for downstream work with outputs of remove-background."""
import tables
import numpy as np
import scipy.sparse as sp
from typing import Dict
def dict_from_h5(file: str) -> Dict[str, np.ndarray]:
"""Read in everything from an h5 file and put into a dictionary."""
d = {}
with tables.open_file(file) as f:
# read in everything
for array in f.walk_nodes("/", "Array"):
d[array.name] = array.read()
return d
def anndata_from_h5(file: str,
analyzed_barcodes_only: bool = True) -> 'anndata.AnnData':
"""Load an output h5 file into an AnnData object for downstream work.
Args:
file: The h5 file
analyzed_barcodes_only: False to load all barcodes, so that the size of
the AnnData object will match the size of the input raw count matrix.
True to load a limited set of barcodes: only those analyzed by the
algorithm. This allows relevant latent variables to be loaded
properly into adata.obs and adata.obsm, rather than adata.uns.
Returns:
adata: The anndata object, populated with inferred latent variables
and metadata.
"""
try:
import anndata
except ImportError:
raise ImportError('The anndata package must be installed to use the '
'function anndata_from_h5()')
d = dict_from_h5(file)
X = sp.csc_matrix((d.pop('data'), d.pop('indices'), d.pop('indptr')),
shape=d.pop('shape')).transpose().tocsr()
if analyzed_barcodes_only:
if 'barcodes_analyzed_inds' in d.keys():
X = X[d['barcodes_analyzed_inds'], :]
d['barcodes'] = d['barcodes'][d['barcodes_analyzed_inds']]
elif 'barcode_indices_for_latents' in d.keys():
X = X[d['barcode_indices_for_latents'], :]
d['barcodes'] = d['barcodes'][d['barcode_indices_for_latents']]
else:
print('Warning: analyzed_barcodes_only=True, but the key '
'"barcodes_analyzed_inds" or "barcode_indices_for_latents" '
'is missing from the h5 file. '
'Will output all barcodes, and proceed as if '
'analyzed_barcodes_only=False')
# Construct the count matrix.
adata = anndata.AnnData(X=X,
obs={'barcode': d.pop('barcodes').astype(str)},
var={'gene_name': (d.pop('gene_names') if 'gene_names' in d.keys()
else d.pop('name')).astype(str)})
adata.obs.set_index('barcode', inplace=True)
adata.var.set_index('gene_name', inplace=True)
# Add other information to the adata object in the appropriate slot.
for key, value in d.items():
try:
value = np.asarray(value)
if len(value.shape) == 0:
adata.uns[key] = value
elif value.shape[0] == X.shape[0]:
if (len(value.shape) < 2) or (value.shape[1] < 2):
adata.obs[key] = value
else:
adata.obsm[key] = value
elif value.shape[0] == X.shape[1]:
if value.dtype.name.startswith('bytes'):
adata.var[key] = value.astype(str)
else:
adata.var[key] = value
else:
adata.uns[key] = value
except Exception:
print('Unable to load data into AnnData: ', key, value, type(value))
if analyzed_barcodes_only:
for col in adata.obs.columns[adata.obs.columns.str.startswith('barcodes_analyzed')
| adata.obs.columns.str.startswith('barcode_indices')]:
try:
del adata.obs[col]
except Exception:
pass
return adata
def load_anndata_from_input_and_output(input_file: str,
output_file: str,
analyzed_barcodes_only: bool = True,
input_layer_key: str = 'cellranger') -> 'anndata.AnnData':
"""Load remove-background output count matrix into an anndata object,
together with remove-background metadata and the raw input counts.
Args:
input_file: Raw h5 file used as input for remove-background.
output_file: Output h5 file created by remove-background (can be
filtered or not).
analyzed_barcodes_only: Argument passed to anndata_from_h5().
False to load all barcodes, so that the size of
the AnnData object will match the size of the input raw count matrix.
True to load a limited set of barcodes: only those analyzed by the
algorithm. This allows relevant latent variables to be loaded
properly into adata.obs and adata.obsm, rather than adata.uns.
input_layer_key: Key of the anndata.layer that is created for the raw
input count matrix.
Return:
adata_out: AnnData object with counts before and after remove-background,
as well as inferred latent variables from remove-background.
"""
# Load input data.
adata_raw = anndata_from_h5(input_file, analyzed_barcodes_only=False)
# Load remove-background output data.
adata_out = anndata_from_h5(output_file, analyzed_barcodes_only=analyzed_barcodes_only)
# Subset the raw dataset to the relevant barcodes.
adata_raw = adata_raw[adata_out.obs.index]
# Put count matrices into 'layers' in anndata for clarity.
adata_out.layers[input_layer_key] = adata_raw.X.copy()
adata_out.layers['cellbender'] = adata_out.X.copy()
# Pre-compute a bit of metadata.
adata_out.var['n_cellranger'] = np.array(adata_out.layers['cellranger'].sum(axis=0)).squeeze()
adata_out.var['n_cellbender'] = np.array(adata_out.layers['cellbender'].sum(axis=0)).squeeze()
return adata_out
Example usage:
Further example of loading CellBender output and the original raw data:
from cellbender.
@ckrilow @Hrovatin @qrzhang
Please let me know if the above code works for you.
from cellbender.
@letaylor When you run into this issue during the Cellbender V2 update, the above fix will come in handy.
from cellbender.
I am running into this bug as well @sjfleming will try out your solution momentarily.
However looking at the scanpy code:
https://github.com/theislab/scanpy/blob/master/scanpy/readwrite.py
The error is driven by line 279: genome=dsets['genome'].astype(str),
. It looks like scanpy expects there to be a genome slot that is empty. If I look at the dsets
after manually running _read_v3_10x_h5
, genome is missing:
This did not occur in CellBender v1 because scanpy was reading the data using _read_legacy_10x_h5
rather than _read_v3_10x_h5
, the later is default if matrix
is the first level in the h5 file.
from cellbender.
Thanks @letaylor
Yes, it seems that scanpy
does expect a "genome"
entry in /matrix/features/genome
in the h5 file, if it's produced by CellRanger v3+ (which it determines by checking if there's a "/matrix"
entry).
What I am puzzled by is that if I run the 0.2.0 release of CellBender, I do get an entry in /matrix/features/genome
, as long as the input file had one. For example, the public 10x Genomics pbmc_1k dataset (CellRanger v3) gives this output from CellBender 0.2.0
which includes the genome
entry and loads fine using sc.read_10x_h5()
.
Is it possible your input file does not have genome
?
from cellbender.
@letaylor You are totally right, I did not handle genome
properly for the v3 MTX format! I will look into this.
from cellbender.
Thanks, @letaylor, I think you are correct. The CellRanger V3 raw_matrix folder is my input to CellBender which leads to this error. And the error message means the genome
is missing in the output h5 file. I'll try to use the CellRnage h5 file as input and try again.
from cellbender.
Hi @mxposed , that's an interesting thought. We might be able to always add a "genome" field, even if the input format is CellRanger v2 format which does not have the genome. The potential problem is that if some other file reader for CellRanger v2 fails because we have this extra field, then that would have been caused by us. But I might still add it... I don't think it would cause a problem with loading in scanpy
or Seurat
, which are our major use cases, so it might be worth it.
I am working on a larger update, so I will do some testing and possibly incorporate the addition of "genome" to v2 format files. But it might be counterproductive to make a PR for the code as it currently stands, since there will be some sizable changes coming soon.
Thanks!
from cellbender.
Perfect, okay, thanks for that clarification!
from cellbender.
This issue will be handled in #97
from cellbender.
I also ran into this problem. Maybe a good solution would be to always add a
genome
, just empty strings if we don't know what it was? If you support this decision, I'd be happy to sketch a pull-requestFor now I ended up fixing cellbender h5 files with this:
tables.copy_file(orig_h5, fixed_h5) with tables.open_file(fixed_h5, "r+") as f: n = f.get_node("/matrix/features") n_genes = f.get_node("/matrix/shape")[0] if "genome" not in n: f.create_array(n, "genome", np.repeat("GRCh38", n_genes))
I've bumped into this issue a few times by now, and the quoted solution is the most practical one until the bug gets fixed. I just added a call to the above, modified to skip copying, after running cellbender. Now the files respond to an sc.read_10x_h5()
and future people don't need to worry about it anymore.
from cellbender.
I appear to still be running into this issue. Specifically ran CellBender using CellRanger output folder (raw_feature_bc_matrix/ folder contents which include matrix.mtx.gz, barcodes.tsv.gz and features.tsv.gz files). Can you point me to the appropriate solution to load in the ambient-corrected h5 file if CellBender was (successfully) run with this starting folder input, as mentioned by others above?
Error highlighted below:
genome=dsets['genome'].astype(str),
KeyError: 'genome'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/vinay/anaconda3/envs/vinay_bbknn/lib/python3.10/site-packages/scanpy/readwrite.py", line 183, in read_10x_h5
adata = _read_v3_10x_h5(filename, start=start)
File "/home/vinay/anaconda3/envs/vinay_bbknn/lib/python3.10/site-packages/scanpy/readwrite.py", line 294, in _read_v3_10x_h5
raise Exception('File is missing one or more required datasets.')
Exception: File is missing one or more required datasets.
from cellbender.
Hi @vkartha , sorry for the delayed response. I think the solution mentioned above by @mxposed (#57 (comment)) would work for you.
Another potential solution would be to use the function anndata_from_h5()
from cellbender/remove_background/downstream.py
here
This will load the cellbender output as an AnnData
object, bypassing scanpy's loading code (which is causing a problem by demanding a genome
field when the raw data does not have a genome
field).
That option is shown in more detail here (#57 (comment))
from cellbender.
Related Issues (20)
- Index Error in priors.py HOT 2
- ERROR: Could not build wheels for pyzmq, which is required to install pyproject.toml-based projects HOT 1
- Fix readthedocs HOT 1
- HTML output fails due to lxml change HOT 5
- Feature: tool for users to rescue v0.3.1 runs and to re-compute output counts in general HOT 1
- Problem installing cellbender HOT 2
- Is it possible to adjust the priors so that cellbender can work with shallow-sequenced data?
- How to run CellBender for a pooling library?
- Numpy build failed when installing cell bender from source
- cellbender v3.0 doesn't generate most of the output files, but doesn't have any errors
- "Trying to use CUDA, " \ AssertionError: Trying to use CUDA, but CUDA is not available.
- Number of cells after cellbender much more than number of cells from cellranger (filtered)
- New fileformat output from BD rhapsody HOT 1
- Should I keep decreasing the learning rate?
- Question about input h5 file HOT 1
- Computing the output in asynchronous chunks in parallel takes longer than 144 hours HOT 1
- Unhandled division by zero
- Never mind
- Can't Computing target noise counts per gene for MCKP estimator HOT 1
- Importance of model loss
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from cellbender.