Hi, While CellBender works as expected on 10X hgmm12k (v2), on 10X h

Strange results on 10X hgmm10k_v3 dataset about cellbender HOT 9 CLOSED

broadinstitute commented on August 13, 2024

Strange results on 10X hgmm10k_v3 dataset

from cellbender.

Comments (9)

sjfleming commented on August 13, 2024

This run might not have totally converged, but this is the result of running

cellbender remove-background --input 10k_hgmm_v3_nextgem_raw_feature_bc_matrix.h5 --output 10k_hgmm_v3_nextgem_out.h5 --cuda --expected-cells 10000 --total-droplets-included 20000 --epochs 300 --z-dim 20 --z-layers 100

mtvector commented on August 13, 2024

Just wanted to add that the HgMm mix datasets are sometimes problematic for assessing ambient RNA: I've found that 0%-3% of UMIs (depending on cell type) end up in cells of the wrong species simply because of mismapping caused by genome, annotation, and sequencing errors.

sjfleming commented on August 13, 2024

Found it. It was coming from the use of the datatype uint16 to store gene indices during the creation of the output sparse count matrix...
I guess at some point way back, I thought, "There won't be transcriptomes with more than 65k genes, right?" Not right.

I will push a fix for this soon.
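
To illustrate the kind of failure described above, here is a minimal sketch of what happens when an index larger than 65535 is stored in a 16-bit unsigned integer. It uses the standard-library `ctypes.c_uint16` as a stand-in for the uint16 datatype mentioned in the comment; the helper `as_uint16` and the example indices are hypothetical and are not taken from the CellBender code.

```python
import ctypes

UINT16_MAX = 2**16 - 1  # 65535, the largest value a uint16 can hold


def as_uint16(index: int) -> int:
    """Store an index in a 16-bit unsigned integer and read it back.

    ctypes performs no overflow checking, so values above UINT16_MAX
    silently wrap around modulo 65536.
    """
    return ctypes.c_uint16(index).value


# Hypothetical gene indices for a reference with more than 65k genes.
gene_indices = [0, 65_535, 65_536, 70_000]
stored = [as_uint16(i) for i in gene_indices]

for original, wrapped in zip(gene_indices, stored):
    flag = "" if original == wrapped else "  <-- wrapped"
    print(f"{original:>6} -> {wrapped:>6}{flag}")
```

Under this assumption, counts belonging to any gene whose index exceeds 65535 would be written against a low-index gene instead, which would garble the output count matrix without raising an error.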

sjfleming commented on August 13, 2024

I believe I have a parsing error in the newer v3 format HDF5 files from CellRanger that involve multiple genomes! I will track down this bug asap.

Thanks for reporting. I think what you're seeing there is essentially a garbled output due to an input parsing error.

sjfleming commented on August 13, 2024

For an urgent workaround, I believe you can input your data using the CellRanger mtx directory format, and then even the v3 multiple-genome data should be parsed correctly. But this is a hunch, and I still need to try it myself. Either way, I will be working on fixing that bug soon.

nh3 commented on August 13, 2024

Thank you for the reply! If what I saw was simply garbled output, I would expect to see some cells with high mouse gene counts. The fact that 1) 90% of mouse gene counts are removed from all cells, 2) tens of thousands of human gene counts are added to cells that originally had only hundreds, and 3) the inferred priors and cutoffs look correct makes me suspect it might be due to something else.

nh3 commented on August 13, 2024

Tried supplying mtx and got exactly the same result.

The code for loading/parsing input seems alright. Though it doesn't read "/matrix/features/genome", the inference shouldn't care about an extra gene label, should it?

CellBender/cellbender/remove_background/data/dataset.py

Lines 845 to 871 in d68bf9d

    elif cellranger_version == 3:

        # Read in data for this genome, and put it into a
        # scipy.sparse.csc.csc_matrix
        barcodes = getattr(f.root.matrix, 'barcodes').read()
        data = getattr(f.root.matrix, 'data').read()
        indices = getattr(f.root.matrix, 'indices').read()
        indptr = getattr(f.root.matrix, 'indptr').read()
        shape = getattr(f.root.matrix, 'shape').read()
        csc_list.append(sp.csc_matrix((data, indices, indptr),
                                      shape=shape))

        # Read in 'feature' information
        feature_group = f.get_node(f.root.matrix, 'features')
        feature_types = getattr(feature_group,
                                'feature_type').read()
        feature_names = getattr(feature_group, 'name').read()
        feature_ids = getattr(feature_group, 'id').read()

        # The only 'feature' we want is 'Gene Expression'
        is_gene_expression = (feature_types == b'Gene Expression')
        gene_names.extend(feature_names[is_gene_expression])
        gene_ids.extend(feature_ids[is_gene_expression])

        # Excise other 'features' from the count matrix
        gene_feature_inds = np.where(is_gene_expression)[0]
        csc_list[-1] = csc_list[-1][gene_feature_inds, :]
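
The filtering step in the excerpt can be mimicked with plain Python lists, which makes it easier to see that a per-feature genome label could be carried through the same selection without affecting the counts. This is only a sketch under stated assumptions: the function `keep_gene_expression`, the feature names, and the genome labels are all hypothetical, and the real loader works on PyTables, NumPy, and SciPy objects rather than lists.

```python
# Hypothetical per-feature metadata, mimicking the byte strings read from
# a v3 HDF5 file. All values are made up for illustration.
feature_types = [b'Gene Expression', b'Gene Expression', b'Antibody Capture']
feature_names = [b'GENE_A', b'Gene_b', b'AB_1']
feature_genomes = [b'human', b'mouse', b'']  # stand-in for /matrix/features/genome

# Toy dense count matrix: one row per feature, one column per barcode.
counts = [
    [5, 0, 1],
    [0, 7, 2],
    [3, 3, 3],
]


def keep_gene_expression(types, *columns):
    """Return the row indices of 'Gene Expression' features, plus each
    metadata column restricted to those rows."""
    keep = [i for i, t in enumerate(types) if t == b'Gene Expression']
    return keep, [[col[i] for i in keep] for col in columns]


keep, (gene_names, gene_genomes) = keep_gene_expression(
    feature_types, feature_names, feature_genomes)

# Excise the non-gene rows from the count matrix using the same indices.
gene_counts = [counts[i] for i in keep]

print(keep)          # row indices that survive the filter
print(gene_genomes)  # genome label for each surviving gene
```

Because the same indices select the names, the genome labels, and the matrix rows, an extra label column changes nothing about the counts that reach the inference step.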

What I notice is that in hgmm5k_v3 and hgmm10k_v3, nUMI per cell is distinctly lower in mouse cells than in human cells, whereas in hgmm12k_v2 the difference is smaller. See plots below (blue: human, green: mouse, green: empty droplets)

[Plots: hgmm10k_v3, hgmm5k_v3, hgmm12k_v2]

Could it be that this distribution somehow confused the method to (partially) model empty droplets out of mouse cells?
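
One way to check a per-species difference like this is to sum UMIs per barcode separately for human and mouse genes. The sketch below does that on toy data with plain lists. The function `umis_by_species` is hypothetical, the counts are invented, and the gene-name prefixes `GRCh38_` and `mm10___` are an assumption about how the combined reference labels its genes; adjust them to the reference actually used.

```python
def umis_by_species(gene_names, counts, human_prefix, mouse_prefix):
    """Sum UMIs per barcode separately for human and mouse genes.

    counts is a list of rows (one per gene); each row lists the UMI
    count for every barcode. Genes matching neither prefix are ignored.
    """
    n_barcodes = len(counts[0])
    human = [0] * n_barcodes
    mouse = [0] * n_barcodes
    for name, row in zip(gene_names, counts):
        if name.startswith(human_prefix):
            target = human
        elif name.startswith(mouse_prefix):
            target = mouse
        else:
            continue
        for j, c in enumerate(row):
            target[j] += c
    return human, mouse


# Toy data: a human-like cell, a mouse-like cell, and a near-empty droplet.
gene_names = ['GRCh38_A', 'GRCh38_B', 'mm10___a', 'mm10___b']
counts = [
    [900, 10, 2],
    [600, 5, 1],
    [20, 300, 3],
    [10, 150, 1],
]

human, mouse = umis_by_species(gene_names, counts, 'GRCh38_', 'mm10___')
total = [h + m for h, m in zip(human, mouse)]
mouse_fraction = [m / t for m, t in zip(mouse, total)]

for h, m, frac in zip(human, mouse, mouse_fraction):
    print(f"human={h:>5}  mouse={m:>5}  mouse fraction={frac:.2f}")
```

Computing the same totals on the raw input and on the remove-background output would show directly whether counts are being removed from one species and added to the other.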

sjfleming commented on August 13, 2024

I will look into that, but the fact that you gave it the --expected-cells parameter should enable it to figure out a good prior on cell counts that can cover both human and mouse...

nh3 commented on August 13, 2024

Thank you for the quick fix!
