On most recent commit "c051c44" on v2, the output .h5 files can't be read by scanpy. S

<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

10x H5 Format Error,about broadinstitute/cellbender

Comments (23)

letaylor commented on September 16, 2024 3

Ah - I think the error occurs when one is running CellBender using the CellRanger matrix format (a dir with barcodes.tsv.gz, features.tsv.gz, and matrix.mtx.gz) rather than h5 format as input. I am testing out using the CellRanger h5 file now and will report back.

from cellbender.

letaylor commented on September 16, 2024 2

Yes - this is it. I can confirm loading the CellBender output in scanpy works just fine when I run CellBender using the h5 input format.

from cellbender.

mxposed commented on September 16, 2024 2

I also ran into this problem. Maybe a good solution would be to always add a genome, just empty strings if we don't know what it was?
If you support this decision, I'd be happy to sketch a pull-request

For now I ended up fixing cellbender h5 files with this:

tables.copy_file(orig_h5, fixed_h5)
with tables.open_file(fixed_h5, "r+") as f:
    n = f.get_node("/matrix/features")
    n_genes = f.get_node("/matrix/shape")[0]
    if "genome" not in n:
        f.create_array(n, "genome", np.repeat("GRCh38", n_genes))

from cellbender.

ckrilow commented on September 16, 2024 1

@sjfleming thank you for posting this, it is appreciated. Will keep you posted on the progress.

from cellbender.

Hrovatin commented on September 16, 2024 1

I also used mtx fille generated from h5ad as input, since I do not have CellRanger file.

from cellbender.

mxposed commented on September 16, 2024 1

Thank you, Stephen. I just want to clarify, that the problem for me was not in v2 CellRanger, but in v3 CellRanger that started with mtx folder. In the folder, there's no record of genome.

from cellbender.

mtvector commented on September 16, 2024

This problem is easily alleviable with a custom import function

def readCellbenderH5(filename):
    import h5py
    import scanpy as sc
    import scipy
    f = h5py.File(filename, 'r')
    mat=f['matrix']
    cols=['latent_cell_probability','latent_RT_efficiency',"latent_dirichlet_precision"]
    obsdict={x:mat[x] for x in cols}
    ad=sc.AnnData(X=scipy.sparse.csr_matrix((mat['data'][:], 
                                          mat['indices'][:], 
                                          mat['indptr'][:]),
                                        shape=(mat['shape'][1],mat['shape'][0])),
              var=pd.DataFrame(dict(mat['features'])),
              obs=pd.DataFrame(obsdict,index=[x.decode('ascii') for x in mat['barcodes']]),
                uns={'test_elbo':list(mat['test_elbo']),'test_epoch':list(mat['test_epoch'])})
    ad.var.index=[x.decode('ascii') for x in ad.var['name']]
    return(ad)

from cellbender.

qrzhang commented on September 16, 2024

Thank you CellBender group for developing such a useful and user-friendly tool. recently I have the same error when loading the CellBender h5 file to Scanpy. I'm using the latest CellBender 2.0 and Scanpy 1.6.0. And @mtvector 's suggestion also doesn't work for me.

adata = readCellbenderH5('Day9_10x_cellbender_filtered.h5')

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-11-248abec2df25> in <module>
----> 1 adata = readCellbenderH5('Day9_10x_cellbender_filtered.h5')

<ipython-input-4-9b15cfa9372a> in readCellbenderH5(filename)
      6     mat=f['matrix']
      7     cols=['latent_cell_probability','latent_RT_efficiency',"latent_dirichlet_precision"]
----> 8     obsdict={x:mat[x] for x in cols}
      9     ad=sc.AnnData(X=scipy.sparse.csr_matrix((mat['data'][:], 
     10                                           mat['indices'][:],

<ipython-input-4-9b15cfa9372a> in <dictcomp>(.0)
      6     mat=f['matrix']
      7     cols=['latent_cell_probability','latent_RT_efficiency',"latent_dirichlet_precision"]
----> 8     obsdict={x:mat[x] for x in cols}
      9     ad=sc.AnnData(X=scipy.sparse.csr_matrix((mat['data'][:], 
     10                                           mat['indices'][:],

h5py/_objects.pyx in h5py._objects.with_phil.wrapper()

h5py/_objects.pyx in h5py._objects.with_phil.wrapper()

~/anaconda3/envs/scanpy/lib/python3.8/site-packages/h5py/_hl/group.py in __getitem__(self, name)
    262                 raise ValueError("Invalid HDF5 object reference")
    263         else:
--> 264             oid = h5o.open(self.id, self._e(name), lapl=self._lapl)
    265 
    266         otype = h5i.get_type(oid)

h5py/_objects.pyx in h5py._objects.with_phil.wrapper()

h5py/_objects.pyx in h5py._objects.with_phil.wrapper()

h5py/h5o.pyx in h5py.h5o.open()

KeyError: "Unable to open object (object 'latent_dirichlet_precision' doesn't exist)"

My operation system is MacOS Catalina 10.15.7

Will appreciate any suggestion.

from cellbender.

Hrovatin commented on September 16, 2024

I had the same issue with cellbender version downloaded a few days ago. I tried to modify the import function. - Could someone let me know if I modified it correctly?

import h5py
import scanpy as sc
import scipy
def readCellbenderH5(filename):
    f = h5py.File(filename, 'r')
    mat=f['background_removed']
    cols=['latent_cell_probability','latent_RT_efficiency']
    obsdict={x:mat[x] for x in cols}
    ad=sc.AnnData(X=scipy.sparse.csr_matrix((mat['data'][:], 
                                          mat['indices'][:], 
                                          mat['indptr'][:]),
                                        shape=(mat['shape'][1],mat['shape'][0])),
                  # I had only one column of gene names
              var=pd.DataFrame(mat['gene_names']),
              obs=pd.DataFrame(obsdict,index=[x.decode('ascii') for x in mat['barcodes']]),
                uns={'test_elbo':list(mat['test_elbo']),'test_epoch':list(mat['test_epoch'])})
    ad.var.index=[x.decode('ascii') for x in ad.var[0]]
    # Remove non decoded gene names from var
    ad.var.drop(0,axis=1,inplace=True)
    return(ad)

from cellbender.

ckrilow commented on September 16, 2024

I am having the same issue right now, reading the new output .h5 file. can't be read by scanpy. Besides the above mentioned 'fix', is there a different way to resolve the key error on 'genome'?

from cellbender.

sjfleming commented on September 16, 2024

For me, running the master branch on a 10x Genomics CellRanger v3 file (pbmc_1k_v3) produces an output file that will load in scanpy 1.5 automatically:

Scanpy 1.6 will also load it:

How about a CellRanger v2 format file? Like 10x's pbmc_8k_v2? Scanpy 1.5 loads it okay:

Scanpy 1.6 will also load it:

Clearly this isn't working for everybody... it seems to me that scanpy loading is a bit fragile.

Here is an immediate fix for scanpy loading, for output files you already have. (This script will be part of the 0.2.1 release, but for now feel free to copy and paste the code.)

"""Functions for downstream work with outputs of remove-background."""

import tables
import numpy as np
import scipy.sparse as sp
from typing import Dict


def dict_from_h5(file: str) -> Dict[str, np.ndarray]:
    """Read in everything from an h5 file and put into a dictionary."""
    d = {}
    with tables.open_file(file) as f:
        # read in everything
        for array in f.walk_nodes("/", "Array"):
            d[array.name] = array.read()
    return d


def anndata_from_h5(file: str,
                    analyzed_barcodes_only: bool = True) -> 'anndata.AnnData':
    """Load an output h5 file into an AnnData object for downstream work.

    Args:
        file: The h5 file
        analyzed_barcodes_only: False to load all barcodes, so that the size of
            the AnnData object will match the size of the input raw count matrix.
            True to load a limited set of barcodes: only those analyzed by the
            algorithm. This allows relevant latent variables to be loaded
            properly into adata.obs and adata.obsm, rather than adata.uns.

    Returns:
        adata: The anndata object, populated with inferred latent variables
            and metadata.

    """

    try:
        import anndata
    except ImportError:
        raise ImportError('The anndata package must be installed to use the '
                          'function anndata_from_h5()')

    d = dict_from_h5(file)
    X = sp.csc_matrix((d.pop('data'), d.pop('indices'), d.pop('indptr')),
                      shape=d.pop('shape')).transpose().tocsr()

    if analyzed_barcodes_only:
        if 'barcodes_analyzed_inds' in d.keys():
            X = X[d['barcodes_analyzed_inds'], :]
            d['barcodes'] = d['barcodes'][d['barcodes_analyzed_inds']]
        elif 'barcode_indices_for_latents' in d.keys():
            X = X[d['barcode_indices_for_latents'], :]
            d['barcodes'] = d['barcodes'][d['barcode_indices_for_latents']]
        else:
            print('Warning: analyzed_barcodes_only=True, but the key '
                  '"barcodes_analyzed_inds" or "barcode_indices_for_latents" '
                  'is missing from the h5 file. '
                  'Will output all barcodes, and proceed as if '
                  'analyzed_barcodes_only=False')

    # Construct the count matrix.
    adata = anndata.AnnData(X=X,
                            obs={'barcode': d.pop('barcodes').astype(str)},
                            var={'gene_name': (d.pop('gene_names') if 'gene_names' in d.keys()
                                               else d.pop('name')).astype(str)})
    adata.obs.set_index('barcode', inplace=True)
    adata.var.set_index('gene_name', inplace=True)

    # Add other information to the adata object in the appropriate slot.
    for key, value in d.items():
        try:
            value = np.asarray(value)
            if len(value.shape) == 0:
                adata.uns[key] = value
            elif value.shape[0] == X.shape[0]:
                if (len(value.shape) < 2) or (value.shape[1] < 2):
                    adata.obs[key] = value
                else:
                    adata.obsm[key] = value
            elif value.shape[0] == X.shape[1]:
                if value.dtype.name.startswith('bytes'):
                    adata.var[key] = value.astype(str)
                else:
                    adata.var[key] = value
            else:
                adata.uns[key] = value
        except Exception:
            print('Unable to load data into AnnData: ', key, value, type(value))

    if analyzed_barcodes_only:
        for col in adata.obs.columns[adata.obs.columns.str.startswith('barcodes_analyzed')
                                     | adata.obs.columns.str.startswith('barcode_indices')]:
            try:
                del adata.obs[col]
            except Exception:
                pass

    return adata


def load_anndata_from_input_and_output(input_file: str,
                                       output_file: str,
                                       analyzed_barcodes_only: bool = True,
                                       input_layer_key: str = 'cellranger') -> 'anndata.AnnData':
    """Load remove-background output count matrix into an anndata object,
    together with remove-background metadata and the raw input counts.

    Args:
        input_file: Raw h5 file used as input for remove-background.
        output_file: Output h5 file created by remove-background (can be
            filtered or not).
        analyzed_barcodes_only: Argument passed to anndata_from_h5().
            False to load all barcodes, so that the size of
            the AnnData object will match the size of the input raw count matrix.
            True to load a limited set of barcodes: only those analyzed by the
            algorithm. This allows relevant latent variables to be loaded
            properly into adata.obs and adata.obsm, rather than adata.uns.
        input_layer_key: Key of the anndata.layer that is created for the raw
            input count matrix.

    Return:
        adata_out: AnnData object with counts before and after remove-background,
            as well as inferred latent variables from remove-background.

    """

    # Load input data.
    adata_raw = anndata_from_h5(input_file, analyzed_barcodes_only=False)

    # Load remove-background output data.
    adata_out = anndata_from_h5(output_file, analyzed_barcodes_only=analyzed_barcodes_only)

    # Subset the raw dataset to the relevant barcodes.
    adata_raw = adata_raw[adata_out.obs.index]

    # Put count matrices into 'layers' in anndata for clarity.
    adata_out.layers[input_layer_key] = adata_raw.X.copy()
    adata_out.layers['cellbender'] = adata_out.X.copy()

    # Pre-compute a bit of metadata.
    adata_out.var['n_cellranger'] = np.array(adata_out.layers['cellranger'].sum(axis=0)).squeeze()
    adata_out.var['n_cellbender'] = np.array(adata_out.layers['cellbender'].sum(axis=0)).squeeze()

    return adata_out

Example usage:

Further example of loading CellBender output and the original raw data:

from cellbender.

sjfleming commented on September 16, 2024

@ckrilow @Hrovatin @qrzhang
Please let me know if the above code works for you.

from cellbender.

ckrilow commented on September 16, 2024

@letaylor When you run into this issue during the Cellbender V2 update, the above fix will come in handy.

from cellbender.

letaylor commented on September 16, 2024

I am running into this bug as well @sjfleming will try out your solution momentarily.

However looking at the scanpy code:
https://github.com/theislab/scanpy/blob/master/scanpy/readwrite.py

The error is driven by line 279: genome=dsets['genome'].astype(str),. It looks like scanpy expects there to be a genome slot that is empty. If I look at the dsets after manually running _read_v3_10x_h5, genome is missing:

This did not occur in CellBender v1 because scanpy was reading the data using _read_legacy_10x_h5 rather than _read_v3_10x_h5, the later is default if matrix is the first level in the h5 file.

from cellbender.

sjfleming commented on September 16, 2024

Thanks @letaylor
Yes, it seems that scanpy does expect a "genome" entry in /matrix/features/genome in the h5 file, if it's produced by CellRanger v3+ (which it determines by checking if there's a "/matrix" entry).

What I am puzzled by is that if I run the 0.2.0 release of CellBender, I do get an entry in /matrix/features/genome, as long as the input file had one. For example, the public 10x Genomics pbmc_1k dataset (CellRanger v3) gives this output from CellBender 0.2.0

which includes the genome entry and loads fine using sc.read_10x_h5().

Is it possible your input file does not have genome?

from cellbender.

sjfleming commented on September 16, 2024

@letaylor You are totally right, I did not handle genome properly for the v3 MTX format! I will look into this.

CellBender/cellbender/remove_background/data/dataset.py

Line 891 in 41cf6ec

'genomes': None, # TODO: check if this info is available in either version

from cellbender.

qrzhang commented on September 16, 2024

Thanks, @letaylor, I think you are correct. The CellRanger V3 raw_matrix folder is my input to CellBender which leads to this error. And the error message means the genome is missing in the output h5 file. I'll try to use the CellRnage h5 file as input and try again.

from cellbender.

sjfleming commented on September 16, 2024

Hi @mxposed , that's an interesting thought. We might be able to always add a "genome" field, even if the input format is CellRanger v2 format which does not have the genome. The potential problem is that if some other file reader for CellRanger v2 fails because we have this extra field, then that would have been caused by us. But I might still add it... I don't think it would cause a problem with loading in scanpy or Seurat, which are our major use cases, so it might be worth it.

I am working on a larger update, so I will do some testing and possibly incorporate the addition of "genome" to v2 format files. But it might be counterproductive to make a PR for the code as it currently stands, since there will be some sizable changes coming soon.

Thanks!

from cellbender.

sjfleming commented on September 16, 2024

Perfect, okay, thanks for that clarification!

from cellbender.

sjfleming commented on September 16, 2024

This issue will be handled in #97

from cellbender.

ktpolanski commented on September 16, 2024

I also ran into this problem. Maybe a good solution would be to always add a genome, just empty strings if we don't know what it was? If you support this decision, I'd be happy to sketch a pull-request

For now I ended up fixing cellbender h5 files with this:
tables.copy_file(orig_h5, fixed_h5)
with tables.open_file(fixed_h5, "r+") as f:
    n = f.get_node("/matrix/features")
    n_genes = f.get_node("/matrix/shape")[0]
    if "genome" not in n:
        f.create_array(n, "genome", np.repeat("GRCh38", n_genes))

I've bumped into this issue a few times by now, and the quoted solution is the most practical one until the bug gets fixed. I just added a call to the above, modified to skip copying, after running cellbender. Now the files respond to an sc.read_10x_h5() and future people don't need to worry about it anymore.

from cellbender.

vkartha commented on September 16, 2024

I appear to still be running into this issue. Specifically ran CellBender using CellRanger output folder (raw_feature_bc_matrix/ folder contents which include matrix.mtx.gz, barcodes.tsv.gz and features.tsv.gz files). Can you point me to the appropriate solution to load in the ambient-corrected h5 file if CellBender was (successfully) run with this starting folder input, as mentioned by others above?

Error highlighted below:

genome=dsets['genome'].astype(str),
KeyError: 'genome'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/vinay/anaconda3/envs/vinay_bbknn/lib/python3.10/site-packages/scanpy/readwrite.py", line 183, in read_10x_h5
    adata = _read_v3_10x_h5(filename, start=start)
  File "/home/vinay/anaconda3/envs/vinay_bbknn/lib/python3.10/site-packages/scanpy/readwrite.py", line 294, in _read_v3_10x_h5
    raise Exception('File is missing one or more required datasets.')
Exception: File is missing one or more required datasets.

from cellbender.

sjfleming commented on September 16, 2024

Hi @vkartha , sorry for the delayed response. I think the solution mentioned above by @mxposed (#57 (comment)) would work for you.

Another potential solution would be to use the function anndata_from_h5() from cellbender/remove_background/downstream.py here

CellBender/cellbender/remove_background/downstream.py

Line 29 in 02dd763

def anndata_from_h5(file: str,

This will load the cellbender output as an AnnData object, bypassing scanpy's loading code (which is causing a problem by demanding a genome field when the raw data does not have a genome field).

That option is shown in more detail here (#57 (comment))

from cellbender.

10x H5 Format Error about cellbender HOT 23 CLOSED

Comments (23)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent