chanzuckerberg / single-cell-curation Goto Github PK

Code and documentation for the curation of cellxgene datasets

License: MIT License

Python 57.96% Makefile 0.60% Jupyter Notebook 40.64% Jinja 0.80%

single-cell-curation's Introduction

cellxgene curation tools

This repository contains documents and code used by cellxgene's curation team. Issues/suggestions pertaining to datasets and how they interact with cellxgene should be created here.

For information/issues about cellxgene and its portal please refer to:

Installation

The primary curation tool is the cellxgene-schema CLI. It enables curators to perform schema validation for datasets to be hosted on the cellxgene Data Portal.

It requires Python >= 3.8. It is available through pip:

pip install cellxgene-schema

It can also be installed from the source by cloning this repository and running:

make install

And you can run the tests with:

make unit-test

Usage

The CLI validates an AnnData file (*.h5ad) to ensure that it addresses the schema requirements.

Datasets can be validated using the following command line:

cellxgene-schema validate input.h5ad

If the validation succeeds, the command returns a zero exit code; otherwise, it returns a non-zero exit code and prints validation failure messages.

The data portal runs the following in the backend:

cellxgene-schema validate --add-labels output.h5ad input.h5ad

This execution validates the dataset as above AND adds the human-readable labels for the ontology and gene IDs as defined in the schema. If the validation is successful, a new AnnData file (output.h5ad) is written to disk with the labels appended.

This option SHOULD NOT be used by data contributors.

Contributing

Please read our contributing guidelines and make sure adhere to the Contributor Covenant code of conduct.

Reporting Security Issues

Please read our security reporting policy

single-cell-curation's People

Contributors

Stargazers

Watchers

Forkers

signechambers1 ghastvhenry rcannood kishorikonwar baevermann wkt8 clevercanary swangam emrutherford liaanym newlimit evanbiederstedt ubyndr dosumis whitelabgenomics lucidapp brianraymor

single-cell-curation's Issues

Out-of-memory error during conversion from AnnData to Seurat if X is numpy array.

TLDR: An upload to the corpora data portal can fail due to an out-of-memory error during the conversion from AnnData to Seurat when X is a numpy array. Currently, there's no error message that indicates so. This could be avoided if X is always a scipy sparse matrix.

The issue was discovered while processing this:

And the process of discovery was the following

The dataset upload failed, however the explorer was working ok. The downloads were available for the AnnData and Loom files but not the Seurat file -- indicating that conversion to Seurat was causing the issue. The error was raised when calling the update_db() function from process_cxg() during the processing of the upload.
Trying to manually run the conversion in a 700+G machine gave this error from R 'Realloc' could not re-allocate memory (18446744070851829760 bytes).
Upon closer inspection, the count matrix was a numpy array. This leads to R trying to load the entire matrix inefficiently and ends up running out of memory. Changing it to a scipy sparse matrix solves the issue as R loads it as sparse object as well.
Processing in the portal works after shifting to scipy sparse matrix.

There are two potential solutions I thought of:

We require users to always use scipy sparse matrices when creating the h5ad file (and implement this as well in cellxgene-schema apply). Then the validator should check for this requirement
We transform numpy arrays to scipy sparse matrices if any are found during the processing of uploads.

It's worth noting that while these solutions will likely reduce the number of times we run into out-of-memory issues, they don't guarantee the issues won't happen again. Moreover there's no error message delivered when this happens, neither in the backend or frontend.

Update dataset metadata in A single-cell transcriptional roadmap of the mouse and human lymph node lymphatic vasculature

The Homo sapiens dataset in A single-cell transcriptional roadmap of the mouse and human lymph node lymphatic vasculature should be corrected to:

Use 'human adult stage' rather than 'adult' for the development_stage label as indicated by HsapDv:0000087
Use 'HsapDv:0000087' rather than 'EFO:0001272' for the development_stage_ontology_term_id per the schema HsapDv term if human, child of EFO:0000399 otherwise

Note that 'adult' is correct for the Mus musculus dataset.

Remove footnotes from Tabula Muris Senis

The Description for Tabula Muris Senis includes footnote numbers.

See bolded cases below:

Ageing is characterized by a progressive loss of physiological integrity, leading to impaired function and increased vulnerability to death1. Despite rapid advances over recent years, many of the molecular and cellular processes that underlie the progressive loss of healthy physiology are poorly understood2. To gain a better insight into these processes, here we generate a single-cell transcriptomic atlas across the lifespan of Mus musculus that includes data from 23 tissues and organs. We found cell-specific changes occurring across multiple cell types and organs, as well as age-related changes in the cellular composition of different organs. Using single-cell transcriptomic data, we assessed cell-type-specific manifestations of different hallmarks of ageing—such as senescence3, genomic instability4 and changes in the immune system2. This transcriptomic atlas—which we denote Tabula Muris Senis, or ‘Mouse Ageing Cell Atlas’—provides molecular information about how the most important hallmarks of ageing are reflected in a broad range of tissues and cell types.

How to encode datasets where raw and normalized count matrices have different shapes.

@MaximilianLombardo and @pablo-gar identify that we often receive datasets where the "normalized" count data contain a subset of the gene features of the "raw" count data. This occurs because toolchains tend to filter genes to generate better clusters and low dimensional embeddings. This filtering typically removes low variance genes, and genes that are thought to explain more technical variation (ribosomal) or confounding stress information (mitochondrial) than interesting biological information.

While filtering is helpful for downstream steps in the initial analysis, it produces a data reuse problem when other scientists want to explore the expression of a specific gene or set of genes. Many times their genes of interest are not present in the filtered, normalized matrix. It also causes a visualization problem for cellxgene. Since normalized counts are the data that we visualize, and users may only visualize features that are present in the matrix.

I would prefer that data submitters use distance metrics that are aware of feature variance, so they don't need to filter variable genes. We'd also prefer that they regress out confounding signatures instead of filtering "troublesome genes" like mitochondrial and ribosomal genes. However, this is not common practice and the schema's primary goal is to best represent scientific data.

Proposed solution: @MaximilianLombardo suggests the following solution:

Our schema should require that raw and normalized data have the same set of features.
To meet this requirement, when we receive submissions, any feature present in raw but not in normalized should be added to normalized and its values filled with np.nan
An amendment is written to the schema explaining why including all features in the normalized dataset increases the breadth of users who will be able to reuse the data, and maximizes the value of submitted data, and explaining how step (2) above is applied to data that do not meet this requirement.

Corollary: This decision will affect the format of our downloaded files, which can't support matrices of different shapes.

Unresolved question: How do we visualize these np.nan columns in the explorer? cc @signechambers1 @colinmegill

cc @jahilton we're considering this approach instead of the one I suggested to you. The difference here is instead of subsetting to the features of the normalized matrix, we expand to the features of the raw matrix.

Amend cell type colors for "Cells of the adult human heart"

The following dataset has 13 colors assigned to the cell types under AnnData.uns["cell_type_colors"]. These colors correspond to the cell_type_original column in AnnData.obs and not to cell_type.

Dataset (All — Cells of the adult human heart):
https://cellxgene.cziscience.com/e/d4e69e01-3ba2-4d6b-a15d-e7048f78f22e.cxg/
Collection:
https://cellxgene.cziscience.com/collections/b52eb423-5d0d-4645-b217-e1c6d38b2e72

Solution:

Rename AnnData.uns["cell_type_colors"] to AnnData.uns["cell_type_original_colors"]
Re-upload dataset when revisions of public datasets is available.

Example of Configuration.yaml in Corpora Schema guide is outdated

@brianraymor commented on Wed Jan 20 2021

This was reported by @jahilton.

The schema version is 1.1.0, but the sample configuration.yaml in the Corpora Schema Guide has not been updated to reflect the changes, including some deprecations of fields.

@mckinsel commented on Tue Feb 16 2021

We're moving this to the single-cell-curation repo (which will be public shortly), and the guide has been updated.

@brianraymor commented on Tue Feb 16 2021

You can use Transfer issue to accomplish that task.

Amend assay ontology term and id for "A molecular cell atlas of the human lung from single cell RNA sequencing"

This collection has a dataset from a Smart-seq experiment. The correct AnnData.obs["assay"] ontology term is "Smart-seq" (EFO:0008930), currently set as "Smart-seq2 protocol" (EFO:0008442)

Human Development Stages (HsapDv) must be downloaded and prepared

The following ontology dependency is pinned for this version of the schema.

Ontology	OBO Prefix	Required version
Human Developmental Stages	HsapDv	hsapdv.owl : 2016-07-06 (0.1)

Parsing requirements:

Extract the term identifier and its human-readable label

There are recommended terms for different stages in development_stage_ontology_term_id, but enforcement should rely more on curation review by Stanford than validation warnings.

Note: The ontology.json could annotate STRONGLY RECOMMENDED HsapDv terms with a recommended flag. This could be used to warn curators when development stage values did not fall into the recommended ranges.

HGNC Database version is unknown

The HGNC database version is unknown. Its version should be tracked and documented somewhere.

https://github.com/chanzuckerberg/single-cell-curation/blob/main/cellxgene_schema_cli/cellxgene_schema/hgnc_complete_set.txt.gz

Separate schema and encoding

Description: cellxgene maintains one schema, one ingestion format (AnnData) and three download formats. The team decided to create three classes of documents to support its data model.

The Schema, which should be unopinionated about implementation
The encodings, which should describe AnnData (ingestion and download) and Seurat v3 (download)
A curation tutorial, which describes how to generate an AnnData file that adheres to our schema, validate it, and upload it.

The work will be broken down into two tasks:

Separating the schema and AnnData encoding into two documents (currently there is AnnData logic in the schema)
Writing a separate document that describes the Seurat encoding.

This issue tracks task 1, separation of schema and encoding.

Update tabula muris brain datasets to better reflect differences

In the Tabula Muris Senis collection, there are two different brain datasets with identical titles (stored in AnnData.uns['title']).

When the feature "Revision of public collections" is ready, these datasets should be updated to reflect their differences

Current dataset title: "Brain — A single-cell transcriptomic atlas characterizes ageing tissues in the mouse"

Updated titles:
"Brain myeloid cells — A single-cell transcriptomic atlas characterizes ageing tissues in the mouse"
https://cellxgene.cziscience.com/e/c08f8441-4a10-4748-872a-e70c0bcccdba.cxg/

"Brain non-myeloid cells — A single-cell transcriptomic atlas characterizes ageing tissues in the mouse"
https://cellxgene.cziscience.com/e/66ff82b4-9380-469c-bc4b-cfa08eacd325.cxg/

Potential schema entry: Post-mortem interval

The time that elapses between sample extraction and processing (post-mortem interval) was a highly mentioned sample quality metadata. Anecdotally, it often correlates with poor sample quality and scientists may use it as a filter when selecting data to reuse.

When sufficient data are accumulated, it would be valuable to test the extent to which post-mortem interval is predictable from commonly computed QC metrics, stress, and cell death pathway expression. If this metadata is predictable then we can continue to disregard it.

Change ownership in portal of collection: "Single-Cell RNAseq analysis of diffuse neoplastic infiltrating cells at the migrating front of human glioblastoma"

Due to technical issues at UCSC, this collection was uploaded to the portal by me but it was contributed by Rachel Schwartz ([email protected])

https://cellxgene.cziscience.com/collections/558385a4-b7b7-4eca-af0c-9e54d010e8dc

When the feature of ownership change is available in the portal, Rachel must be turned into the owner of this collection.

Validator should check for the existence of raw layer

In the schema v1.1.0 we require to have a "raw" matrix and suggest to store it AnnData.raw.X.

The validator does not currently check for this requirement.

We should:

Include a check in the validator for the existence of a raw layer.
[Suggestion] check for int as the type in that layer.
[Suggestion] to make integration processes easily streamable and to make checks easier, we should require to have the raw layer in AnnData.raw, as opposed to just have this as a suggestion. And added plus to this is that AnnData has some extra checks/restrictions on AnnData.raw

The schema references outdated PII content

The schema references an earlier version of the PII content. It should be direct personal identifiers

Editorial: "indentifier" -> "identifier" in schema 2.0.0

Doh. In the Changelog:

Curators are responsible for annotating ontology and gene identifiers. The cellxgene Data Portal adds the assigned human-readable names for all indentifiers.

Remove footnotes from Molecular, spatial and projection diversity of neurons in primary motor cortex revealed by in situ single-cell transcriptomics

The Description for Molecular, spatial and projection diversity of neurons in primary motor cortex revealed by in situ single-cell transcriptomics includes footnote numbers.

See bolded cases below:

A mammalian brain is comprised of numerous cell types organized in an intricate manner to form functional neural circuits. Single-cell RNA sequencing provides a powerful approach to identify cell types based on their gene expression profiles and has revealed many distinct cell populations in the brain1-3. Single-cell epigenomic profiling4,5 further provides information on gene-regulatory signatures of different cell types. Understanding how different cell types contribute to brain function, however, requires knowledge of their spatial organization and connectivity, which is not preserved in sequencing-based methods that involve cell dissociation3,6. Here, we used an in situ single-cell transcriptome-imaging method, multiplexed error-robust fluorescence in situ hybridization (MERFISH)7 ...

Update dataset metadata in Single cell transcriptional and chromatin accessibility profiling redefine cellular heterogeneity in the adult human kidney

The datasets in Single cell transcriptional and chromatin accessibility profiling redefine cellular heterogeneity in the adult human kidney should be corrected to:

Use 'human adult stage' rather than 'adult' for the development_stage label as indicated by HsapDv:0000087

Cell Ontology (CL) must be downloaded and prepared

The following ontology dependency is pinned for this version of the schema.

Ontology	OBO Prefix	Required version
Cell Ontology	CL	cl.owl : 2021-06-21

Parsing requirement:

Extract the term identifier and its human-readable label

`organism_ontology_term_id` and `organism` in `uns` are insufficient in schema v1.1.0

@pablo-gar commented on Tue Dec 22 2020

Refers to this

Quick fix: allow for "multiple" in organism and empty string in organism_ontology_term_id

Long-term solution: move organism_ontology_term_id and organism from uns to obs in anndata

enrichment strategies [Jackson Labs]

@ambrosejcarr

I spoke with Bill Flynn this morning who is a staff scientist running the Single Cell Biology Facility at Jackson Labs, which generates ~15-30 libraries each week. He’s trying to stand up publication pages for datasets that they generate as a service to their researchers

Metadata schema made sense to him, thought there were two missing fields: enrichment strategies, (i.e. “cd45+ cells only”) and dissociation. I think we can safely ignore dissociation, but enrichment strategies is a good one for us to think about more ...

Empty values in `cell_type` should pass validation

According to our schema definition

The tissue field must be appended with " (cell culture)" or " (organoid)" if appropriate. Also, if the source of cells is cell culture or organiod, the cell_type field can be left empty.

But the validator throws an error if empty strings are found for certain fields, including cell_type:

single-cell-curation/cellxgene_schema_cli/cellxgene_schema/validate.py

Lines 89 to 93 in f2b872e

 if "nullable" in schema_def and not schema_def["nullable"]: 

 if any(_is_null(v) for v in column): 

 errors.append( 

 f"Column {column_name} in dataframe {df_name} contains empty values." 

 )

The error looks like this:
Column cell_type in dataframe obs contains empty values.

We should either update the schema definition to not allow empty strings in cell_type or add an exception in the validator that allows empty strings in cell_type

Should the schema include a counts field

@ambrosejcarr commented on Thu Aug 13 2020

Appetite: ?

This question is limited to 10x scRNA/snRNA and Smart-Seq2-like assays.

Should the schema include a counts field? If so, how is it modeled per framework/assay? UMI counts from 10x for example.

@ambrosejcarr commented on Fri Sep 11 2020

Addressing @mckinsel questions:

What's the technical cost of leaving this optional? The requirement could be "observation IDs for processed cells must be contained in the set of IDs of raw cells". I do not think we should require unfiltered barcodes to be present, so if the cost of being unopinionated is high, I would suggest that raw and processed observation sets should match.
Do you mean the supplementary table that 10x generates? (example link) Do archives capture these data? If we confirm, I'd support an optional "links to more data" section, and decline to hold these data.
I think we need to treat transcripts like a separate data modality that we currently do not support. If we are getting data from users who want to retain transcript information, we should tell them they can choose to collapse their data by gene, but we recognize that decision may compromise their experiment. We should not enforce any Science Program submission requirements for those data at this time, assuming this doesn't become a recognized loophole around data submission. Some thoughts on this below which could seed that epic.

"Detected molecules of RNA per gene" (typical 10x 3' processing), "detected molecules of RNA per transcript" (transcript-aware RNA-seq processing, more commonly associated with SS2), "detected molecules of protein" (CITE-seq, CyTOF, MIBI), and "sequencing reads from promoter regions adjacent to genes" (sc-ATAC-seq) are separate data modalities and we should be aware of that in some way.

They can all be reduced to "observations of gene", and we may want to enable that conversion, but we should be careful, deliberate, and have a separate set of rules for each modality. When we get to CITE-seq data, those naturally correspond better to transcript-level data. the PTPRC gene is a good example of where we'll get tripped up, and in the future I expect we'll start to see phospho (active) and non-phospho (inactive) forms of proteins detected with CITE-seq, introducing additional complexity beyond what's captured at the transcript level.

@ambrosejcarr commented on Fri Sep 11 2020

Created chanzuckerberg/single-cell#56 to track support for other data modalities.

@brianraymor commented on Tue Oct 20 2020

@ambrosejcarr to follow up on Do we want unfiltered barcodes from 10x? We actually got some feedback from one person when shopping around the schema that the answer is yes, thought it was a nice-to-have. The problem is this would not be proper layer in any format as its dimensions are different. and open a new issue as needed. The current position is that the answer is "no".

Amend title for "Shared and distinct transcriptomic cell types across neocortical areas"

The title of the only dataset in this collection is incorrect.

https://cellxgene.cziscience.com/collections/45f0f67d-4b69-4a3c-a4e8-a63b962e843f

Currently it is "Adult mouse cortical cell taxonomy revealed by single cell transcriptomics". It needs to be corrected to "Shared and distinct transcriptomic cell types across neocortical areas"

Datasets with multiple species break current schema

Current schema v1.1.0 requires that species is indicated in the uns object of the anndata.

Datasets that have integrated single-cell data from multiple organisms contain cells for each of those. As uns is dataset-level metadata it will be impossible to properly annotate this type of datasets.

Moving the organism_ontology_term_id and organism slots from uns to obs in the schema will solve this problem.

Mine the single cell portal for datasets

The Broad Single Cell Portal has at least 4 colon datasets in it that were not in our list. Likely there are many more.

Decide on a search strategy
Mine the Single Cell Portal for datasets.

Update dataset metadata in Time-resolved Systems Immunology Reveals a Late Juncture Linked to Fatal COVID-19

The datasets in Time-resolved Systems Immunology Reveals a Late Juncture Linked to Fatal COVID-19 should be corrected to:

Use 'blood' rather than 'Blood' as the tissue label
Use 'normal' and rather than 'Normal' as the disease label as indicated by PATO:0000461

Validator should check that embeddings in `AnnData.obsm` are numpy arrays

Cell embeddings are stored in AnnData.obsm which is a dictionary, for each key it stores a two or higher-dimensional ndarray (numpy array) of length AnnData.n_obs.

If something other than a numpy array is stored in the AnnData.obsm (use case has been pandas data frame), the conversion to loom and Seurat will fail when processing an upload in the portal, moreover the conversion error message in the logs is vague.

While this is an AnnData requirement (see here), it is not checked nor enforced.

Solution

cellxgene-schema apply tries to transform elements of AnnData.obsm to numpy arrays if they aren't already numpy arrays.
cellxgene-schema validate raises an error and appropriate message if elements of AnnData.obsm are not numpy arrays.

Link to encoding reference in schema is broken

https://github.com/chanzuckerberg/single-cell-curation/blob/main/docs/corpora_schema.md#schema-version

Datasets in the Data Portal must store the version of the schema they follow (that is, the version of this document) as well as the version of the particular encoding used. The encoding is documented elsewhere and describes techincal details of how the schema should be serialized in a particular file format.

techincal -> technical
elsewhere returns a 404

Out-of-memory issues when standardizing gene symbols with cellxgene-schema apply

cellxgene-schema apply updates gene symbols in an AnnData object to a static snapshot of the HGCN database.

While updating gene symbols, cellxgene-schema apply combines genes (columns) when necessary, and to do so it loads the entire count matrix into a pandas data frame. For datasets with large number of cells, this raises a python memory allocation error or the process gets killed. Depending on the number of cells this error occurs with machines that have up to 3TB of memory.

A proven but not implemented solution is to only load those genes (columns) that need to be updated into a pandas data frame, combine them as necessary, and then merge back to the expression matrix.

Cross-dataset queries (like "where's my gene") will need to filter out duplicated datasets that result from meta-analysis to provide accurate results.

Story: As a user, I want to evaluate where a gene is expressed across tissues by examining the expression of that gene across datasets hosted in cellxgene.

Problem: As we publish meta-analyses, datasets will begin to be represented multiple times in our database, and the above query will place higher weight on datasets that appear more often (are more popular) unless the original dataset can be distinguished from derivative publications.

Candidate Solution:

Create a schema flag that marks datasets "primary" when the authors generated the data themselves, and flag data as secondary when data are derivative uses of primary datasets.
Strongly encourage authors submitting meta-analyses to also submit the primary collections.

@pablo-gar can you please note the Kang/Aronow use case in this issue?

The schema Implementations section is outdated

Implementations describes the outdated plan to support uploads in multiple formats:

The Data Portal requires submitted count matrices and associated metadata to be in one of three formats: AnnData, Loom, or a Seurat v3 RDS. Other formats are rejected. Each of these formats has a way to include metadata along with the count data, so a submission can be entirely contained within a single file.

The portal only supports AnnData.

The sections on the Loom and Seurat implementations should be updated. Options:

Include in curation tutorials as recommendations for formatting seurat or loom prior to manual conversion to anndata
Rewrite to indicate that the portal performs automated conversions to these formats
Move to Portal documentation for download formats

2 dimensional spatial representation of MERFISH data

It has come to our attention that there's a need from cellxgene users to visualize cells using embeddings that encode the x,y,z coordinates available in MERFISH and alike data.

A solution to this issue seems to live in the curation process rather than adding a new feature to cellxgene. An example of what could be done:

Select a method that embeds x,y,z coordinates into two axes.
Add an extra embedding to AnnData.obsm, this can then visualized in cellxgene
[Optional] Make this process available in cellxgene-schema apply

What we need to figure out:

Is this something that we want to support or should we delegate to users the creation of these embeddings?
If we want to support it, there needs to be a discussion about how we want to approach it.

Validator should check for validity of ontology term id fields

Schema v 1.1.1 and 1.2.0 do not allow empty strings in ontology term id fields. The validator should raise an error if an empty string or an incorrect term is found.

The current allowed prefixes for ontology term ids are :

CL: (cell type)
UBERON: (tissue)
MONDO: or PATO:0000461 (disease)
EFO: (assay)
HANCESTRO (ethnicity)
HsapDv or EFO:0000399 (development stage)

Update Gene Sets File Format to reflect simplified design

The following statements are no longer true:

When new gene sets are being added to a data collection on the portal, validation MUST detect gene_set_name collisions with current gene sets in the collection, display an error message, and fail the upload.

gene_set_name and gene_set_description are presented to users viewing data collections in the portal.

Gene sets are now added to a collection as a dataset property.

Delete duplicate Lung dataset in Tabula Muris Senis

Tabula Muris Senis has a duplicate Lung dataset that needs to be deleted when revisions are available:

The gene set data format MUST be documented

Per the requirements in A curator can include gene sets with published datasets., the gene set data format must be documented for curators and for reference in the Gene Sets tab in the portal.

Review MmusDv term for mouse development stages

Regarding

development_stage_ontology_term_id	HsapDv term if human, child of EFO:0000399 otherwise

Would it be more appropriate to specify MmusDv term if mouse, child of EFO:0000399 otherwise ?

Pablo:

This also relates to expanding to species-specific info and our availability to support schema guidelines for non-human data (e.g. our rolling discussion on gene symbols).
I tend to lean towards our current system, i.e. enforcing human-specific guidelines and being loose with other species. To that end children of EFO:0000399 would be sufficient for non-human species.

David Fischer:

I found MMusdev a bit easier to navigate as EFO:0000399 is very broad, so mostly user experience, they both have a lot of the relevant terms i think (often under the same name also). So unless you guys decide differently I would go with Mmusdev probably

Create Seurat v3 encoding documentation

Description: cellxgene maintains one schema, one ingestion format (AnnData) and three download formats. The team decided to create three classes of documents to support its data model.

The Schema, which should be unopinionated about implementation
The encodings, which should describe AnnData (ingestion and download) and Seurat v3 (download)
A curation tutorial, which describes how to generate an AnnData file that adheres to our schema, validate it, and upload it.

The work will be broken down into two tasks:

Separating the schema and AnnData encoding into two documents (currently there is AnnData logic in the schema)
Writing a separate document that describes the Seurat encoding.

This issue tracks task 2, the creation of a separate Seurat encoding document.

It is blocked by #50.

Amend collection information of "A human cell atlas of fetal gene expression"

Requests from Jay Shendure about this collection:

is it possible to add Cole Trapnell [email protected] and Junyue
Cao [email protected] to the list of contacts as well?

can website be changed to https://descartes.brotmanbaty.org/

cellxgene schema apply does not permit empty strings as mapping keys

If a column being used to populate required schema fields has empty strings values, the empty string cannot be used as a mapping key. I believe this is because cellxgene schema apply automatically converts the values to nan. The workaround for this is to use nan as a mapping key. It would be great if curation software could handle empty values in the future.

Experimental Factor Ontology (EFO) must be downloaded and prepared

The following ontology dependency is pinned for this version of the schema.

Ontology	OBO Prefix	Required version
Experimental Factor Ontology	EFO	efo.owl : 2021-06-15 EFO 3.31.0

Parsing requirements:

Extract the term identifier and its human-readable label
Extract the following children per assay_ontology_term_id for warnings during validation:

An assay based on 10X Genomics products SHOULD either be "EFO:0008995" for 10x technology or preferably its most accurate child. Other assays SHOULD be the most accurate child of either EFO:0002772 for assay by molecule or EFO:0010183 for single cell library construction

Editorial: assay -> development_stage

Jenny Chien noted:

I was just going over schema 2.0.0, and saw a typo for "Key" (should be "development_stage" rather than "assay"): https://github.com/chanzuckerberg/single-cell-curation/blob/main/schema/2.0.0/corpora_schema.md#development_stage

Amend assay type for methylation data in "An integrated transcriptomic and epigenomic atlas of mouse primary motor cortex cell types""

The assay type for the methylation datasets (see below) was set as "methylation profiling (EFO:0000751)". However, according to the paper the specific assay is snmC-seq for which there is an appropriate ontology term "snmC-seq (EFO:0008939)"

Solution

When revisions of published collections are available, change assay type to "snmC-seq" with term id "EFO:0008939" for the following datasets:

An integrated transcriptomic and epigenomic atlas of mouse primary motor cortex cell types: DNA methylation (CGN)
An integrated transcriptomic and epigenomic atlas of mouse primary motor cortex cell types: DNA methylation (CHN)

Which are from this collection:

https://cellxgene.cziscience.com/collections/ae1420fe-6630-46ed-8b3d-cc6056a66467

Renaming metadata collisions to _original needs to be refined and documented

Original slack conversation

Its current behavior is not documented. @pablo-gar suggests adding to the cellxgene schema guide or its successor. Please note @MaximilianLombardo.
The granularity for renaming is coarse. In some cases, the value is exactly the same. In others, an ontology has been added, but the label value is the same. For example in https://cellxgene.cziscience.com/e/cfa3c355-ee77-4fc8-9a00-78e61d23024c.cxg/, the only major difference is between tissue and tissue_original. @pablo-gar notes:

We could do some extra checks in cellxgene-schema apply to avoid redundancy when found (i.e. not creating an "_original" when redundant) ...

Add mandatory "batch" field

Ambrose says Normalization and integration methods perform better when conditioned on the "batch" that data were produced in, as different batches tend to vary due to technical (nuisance) factors.

Note: Capturing #single-cell-data-wrangling thread [with minor edits] before it evanesces.

@ambrosejcarr

batch is categorical. Most often I find they are integer valued. Unless batches have been dropped, they are usually sequential integers.

For normalization (and integration), batch is as important as donor or disease, and should therefore be mandatory.
batch can sometimes take a scalar value when you run all the data on a single lane of a sequencer -- so, sometimes it’s not a useful thing to record (batch is just all “0") Of course, this isn’t unique to batch -- you can also have datasets generated from a single donor.

@mckinsel

so the thing is, i don’t think anybody has submitted data that has a clear concept of batch ... like it’s in there in some sense, and if we wanted to we could make batch a mandatory field. but if we leave it an optional field, then what exactly are we saying? “you can have a meta[data] field called batch and it can take on whatever values you want” that’s already true. everything not prohibited is permitted

@bkmartinjr

a) is there a practical way to normalize and/or integrate datasets if we do not know which metadata is associated with batch (ie, can be used to condition models)? If we do not have batch, is there an alternative where we can do without, or automatically detect it, to achieve the same result? I have been operating on the assumption that the answer is "no", and that we must add some support for this if we want to hit our longer-term goal of enabling integration. Is this misinformed?

b) I imagine that the concept of batch will often be ambiguous in scope, at least when used for model building (eg, scarches label transfer). It really boils down to "which metadata do I condition the model on", which will often be more than traditional "lab batch" identity. It might even be multiple metadata fields - I have seen several examples where the conditioning required multiple "batch ids". Can we actually mandate a single field, with a single meaning or do we need something more flexible.

I am wondering if an alternative is a DataSet-wide field that encodes which per-cell metadata are (together) the batch / condition variable? Ie. if adata.obs["patient"] and adata.obs["seqBatch"] exist, then adata.uns["conditions"] = ["patient", "seqBatch"]

@ambrosejcarr

Agree with "I imagine that the concept of batch will often be ambiguous in scope," from (b). What we're lacking is a batch-other field to capture ideas like seqBatch that aren't in our schema. I anticipate that conditioning on multiple metadata will be critical in the future.

@bkmartinjr

So, why not simply let people encode which fields are the "conditions", ie, should be used in combination as a batch? Mandate that, but don't mandate the actual encoding of the individual fields.

@ambrosejcarr

That may be the right path. Making sure I'm understanding -- some of the fields referenced as "conditions" may be non-standard fields, right?

@bkmartinjr

actually, I was proposing adding a "meta field", which names the fields that are suggested conditions. ie, adata.uns["condition_fields"] = ["batch", "patient", "some_other_column"]and you could make that condition_fields dataset attribute mandatory.

@ambrosejcarr

I think the existence of some_other_column in the reference answers the question I had -- I wasn't articulating it well. I wanted to know if you saw any requirement that fields referenced in the "meta-field" (batch, patient, some_other_field) in your example, needed to be defined in our schema -- I think the answer is no, based on your responses.

@bkmartinjr

correct, I didn't see any reason to mandate which columns are batch, only that the indirect pointer to the batch/condition columns exist. Likewise, I didn't see a reason to mandate the type of the batch/condition columns - while they are often enumerated types, I don't think they will always be so (and the algos don't really care).

@ambrosejcarr

(and the algos don't really care). Say more about this? My understanding is that so long as the algo can convert the column into an enumerated type, the algo is happy. Does that match your understanding?

@bkmartinjr

I think a lot of the algos don't even care how big the enumeration is, ie, it can effectively be continuous and it will work OK. TL;DR - you can feed it almost anything. Probably worth confirming this if we end up doubling down on this schema.

Amend PCA coordinate label for "Molecular characterization of foveal versus peripheral human retina by single-cell RNA sequencing."

PCA coordinates are labeled as UMAP when they should be tSNE.

https://cellxgene.cziscience.com/collections/7edef704-f63a-462c-8636-4bc86a9472bd

I have already fixed the h5ad and can re-submit when revisions are possible.

Human Ancestry Ontology (HANCESTRO) must be downloaded and prepared

The following ontology dependency is pinned for this version of the schema.

Ontology	OBO Prefix	Required version
Human Ancestry Ontology	HANCESTRO	hancestro.owl : 2021-01-04 (2.5)

Parsing requirement:

Extract the term identifier and its human-readable label

Raw Data in Phenotypic variation of transcriptomic cell types in mouse motor cortex returns Not Found

The Raw Data link in Phenotypic variation of transcriptomic cell types in mouse motor cortex returns Not Found.

Validator should check that one item of dict AnnData.uns["layer_descriptions"] is "raw"

According to schema v1.1.0 all layers must be described in AnnData.uns["layer_descriptions"] which is a dictionary mapping from {layer_name: layer_description, ...}

One of layer_description must be "raw" (not "Raw", "Raw counts", etc). Otherwise there will be an error when processing the data in the corpora data portal.

The validator should check for this.

	if "nullable" in schema_def and not schema_def["nullable"]:
	if any(_is_null(v) for v in column):
	errors.append(
	f"Column {column_name} in dataframe {df_name} contains empty values."
	)

chanzuckerberg / single-cell-curation Goto Github PK

single-cell-curation's Introduction

cellxgene curation tools

Installation

Usage

Contributing

Reporting Security Issues

single-cell-curation's People

Contributors

Stargazers

Watchers

Forkers

single-cell-curation's Issues

Recommend Projects

Recommend Topics

Recommend Org