Ambrose says normalization and integration methods perform better when conditioned on the "batch" in which the data were produced, as different batches tend to vary due to technical (nuisance) factors.
Note: Capturing #single-cell-data-wrangling thread [with minor edits] before it evanesces.
@ambrosejcarr
batch is categorical. Most often I find they are integer valued; unless batches have been dropped, they are usually sequential integers.
For normalization (and integration), batch is as important as donor or disease, and should therefore be mandatory.
batch can sometimes take a scalar value when you run all the data on a single lane of a sequencer -- so sometimes it's not a useful thing to record (batch is just all "0"). Of course, this isn't unique to batch -- you can also have datasets generated from a single donor.
@mckinsel
So the thing is, I don't think anybody has submitted data that has a clear concept of batch... Like, it's in there in some sense, and if we wanted to we could make batch a mandatory field. But if we leave it an optional field, then what exactly are we saying? "You can have a metadata field called batch and it can take on whatever values you want" -- that's already true. Everything not prohibited is permitted.
@bkmartinjr
a) Is there a practical way to normalize and/or integrate datasets if we do not know which metadata is associated with batch (i.e., can be used to condition models)? If we do not have batch, is there an alternative where we can do without it, or automatically detect it, to achieve the same result? I have been operating on the assumption that the answer is "no", and that we must add some support for this if we want to hit our longer-term goal of enabling integration. Is this misinformed?
b) I imagine that the concept of batch will often be ambiguous in scope, at least when used for model building (e.g., scArches label transfer). It really boils down to "which metadata do I condition the model on?", which will often be more than traditional "lab batch" identity. It might even be multiple metadata fields -- I have seen several examples where the conditioning required multiple "batch ids". Can we actually mandate a single field with a single meaning, or do we need something more flexible?
I am wondering if an alternative is a dataset-wide field that encodes which per-cell metadata are (together) the batch/condition variable. I.e., if adata.obs["patient"] and adata.obs["seqBatch"] exist, then adata.uns["conditions"] = ["patient", "seqBatch"].
@ambrosejcarr
Agree with "I imagine that the concept of batch will often be ambiguous in scope" from (b). What we're lacking is a batch-other field to capture ideas like seqBatch that aren't in our schema. I anticipate that conditioning on multiple metadata fields will be critical in the future.
@bkmartinjr
So, why not simply let people encode which fields are the "conditions", i.e., which should be used in combination as a batch? Mandate that, but don't mandate the actual encoding of the individual fields.
@ambrosejcarr
That may be the right path. Making sure I'm understanding -- some of the fields referenced as "conditions" may be non-standard fields, right?
@bkmartinjr
Actually, I was proposing adding a "meta field" that names the fields which are suggested conditions, i.e., adata.uns["condition_fields"] = ["batch", "patient", "some_other_column"], and you could make that condition_fields dataset attribute mandatory.
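A mandatory condition_fields attribute implies a simple validation rule: the attribute must be present, and every name it lists must exist as a per-cell metadata column. A minimal sketch of such a check, using plain dicts in place of AnnData (the validator and its error messages are hypothetical; the schema itself is only under discussion here):

```python
def validate_condition_fields(obs, uns):
    """Check the proposed mandatory attribute: uns["condition_fields"]
    must exist and every name it lists must be a column in obs."""
    fields = uns.get("condition_fields")
    if not fields:
        raise ValueError('dataset must declare uns["condition_fields"]')
    missing = [f for f in fields if f not in obs]
    if missing:
        raise ValueError(f"condition_fields reference missing obs columns: {missing}")
    return fields
```

Usage: `validate_condition_fields({"batch": [0], "patient": ["P1"]}, {"condition_fields": ["batch", "patient"]})` returns the declared field list, while a dataset missing one of the named columns raises an error.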
@ambrosejcarr
I think the existence of some_other_column in the reference answers the question I had -- I wasn't articulating it well. I wanted to know if you saw any requirement that fields referenced in the "meta-field" (batch, patient, some_other_column in your example) needed to be defined in our schema -- I think the answer is no, based on your responses.
@bkmartinjr
Correct, I didn't see any reason to mandate which columns are batch, only that the indirect pointer to the batch/condition columns exists. Likewise, I didn't see a reason to mandate the type of the batch/condition columns -- while they are often enumerated types, I don't think they will always be so (and the algos don't really care).
@ambrosejcarr
"...and the algos don't really care" -- say more about this? My understanding is that so long as the algo can convert the column into an enumerated type, the algo is happy. Does that match your understanding?
@bkmartinjr
I think a lot of the algos don't even care how big the enumeration is, ie, it can effectively be continuous and it will work OK. TL;DR - you can feed it almost anything. Probably worth confirming this if we end up doubling down on this schema.
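For context on why the column types don't matter much: a common pattern (illustrative, not from this thread) for tools that accept only a single batch key is to collapse several condition columns into one combined label and then enumerate it -- the model only ever sees the integer codes, regardless of the original value types. A minimal sketch:

```python
# Two condition columns of different types (strings and ints) -- the
# hypothetical names mirror the thread's example fields.
patient = ["P1", "P1", "P2", "P2"]
seq_batch = [0, 1, 0, 1]

# Collapse into one combined batch label per cell, then enumerate.
combined = [f"{p}_{b}" for p, b in zip(patient, seq_batch)]
levels = sorted(set(combined))               # the enumerated levels
codes = [levels.index(v) for v in combined]  # integer codes the model sees
```

Because everything is reduced to codes, the original columns can hold strings, integers, or anything hashable -- which is one way to read "you can feed it almost anything."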