cognoma / cancer-data

TCGA data acquisition and processing for Project Cognoma

License: Other

Jupyter Notebook 92.95% Shell 0.09% Python 6.96%

tcga xena xena-browser cancer mutation data-acquisition gene-expression dataphilly

cancer-data's Introduction

Project Cognoma

Putting machine learning in the hands of cancer biologists.

Project Cognoma is an open source project to create a webapp for analyzing cancer data. We're a community-driven philanthropic project that began as a collaboration between the Greene Lab, DataPhilly, and Code for Philly. Our contributors are primarily based in the Philadelphia area, but anyone anywhere is welcome. This GitHub repository is the administrative and informational home of Cognoma.

The Meetup phase of Cognoma is now complete! The Childhood Cancer Data Lab of Alex's Lemonade Stand Foundation will be providing long-term maintenance. Public contributions are still welcome through GitHub. The main priority is enhancements and bug fixes to improve http://cognoma.org. For a nice overview of the project, see its coverage by The Philadelphia Citizen.

Teams and Repositories

The project is composed of four teams with their own corresponding repositories:

Team Name	Repositories	Description
Cancer Data	`cancer-data`, `genes`, `figshare`	processing the underlying cancer data to the formats required for this project.
Machine Learning	`machine-learning`, `cognoml`	building classifiers to predict mutation status from gene expression data.
Backend	`core-service`, `task-service`, `ml-workers`, `infrastructure`	creating the infrastructure to power the webapp and glue the components together.
Frontend	`frontend`, `uiux`	building the webapp that users interact with.

New Here?

If you are a new user and would like to get involved, please introduce yourself. Contributions are made through GitHub, so if you are unfamiliar with git or GitHub, check out the sandbox for a place to learn by doing.

Meetup Schedule

We hold project meetups. Our usual meeting spot is at Industrious (where CandiDate is located). The address is 230 S Broad St, Floor 17, Philadelphia.

📅 Date	⌚ Time	🗺 Location	ℹ️ Meetup Details	💰 Sponsor
~~Wednesday, October 11, 2017~~	6:00 PM	MilkBoy	DataPhilly	Alex’s Lemonade Stand Foundation
~~Tuesday, August 15, 2017~~	6:00 PM	CandiDate	DataPhilly	Penn Institute for Biomedical Informatics
~~Tuesday, July 11, 2017~~	6:00 PM	CandiDate	DataPhilly	Penn Institute for Biomedical Informatics
~~Tuesday, June 27, 2017~~	6:00 PM	CandiDate	DataPhilly	Penn Institute for Biomedical Informatics
~~Tuesday, May 30, 2017~~	6:00 PM	CandiDate	DataPhilly	Penn Institute for Biomedical Informatics
~~Tuesday, April 25, 2017~~	6:00 PM	CandiDate	DataPhilly	Penn Institute for Biomedical Informatics
~~Tuesday, April 4, 2017~~	6:00 PM	CandiDate	DataPhilly	Penn Institute for Biomedical Informatics
~~Tuesday, February 28, 2017~~	6:00 PM	CandiDate	DataPhilly	Penn Institute for Biomedical Informatics
~~Monday, February 13, 2017~~	6:00 PM	CandiDate	DataPhilly	Penn Institute for Biomedical Informatics
~~Tuesday, January 31, 2017~~	6:00 PM	CandiDate	DataPhilly	Penn Institute for Biomedical Informatics
~~Monday, January 16, 2017~~	9:00 AM	Philly Think Space	Frontend Only	MLK Day Volunteers from Think Company
~~Tuesday, January 10, 2017~~	6:00 PM	CandiDate	DataPhilly	Penn Institute for Biomedical Informatics
~~Tuesday, December 20, 2016~~	6:00 PM	CandiDate	DataPhilly	Penn Institute for Biomedical Informatics
~~Tuesday, December 6, 2016~~	6:00 PM	CandiDate	DataPhilly	Penn Institute for Biomedical Informatics
~~Tuesday, November 15, 2016~~	6:00 PM	CandiDate	DataPhilly	Penn Institute for Biomedical Informatics
~~Tuesday, November 1, 2016~~	6:00 PM	CandiDate	DataPhilly	Penn Institute for Biomedical Informatics
~~Tuesday, October 18, 2016~~	6:00 PM	CandiDate	DataPhilly	Penn Institute for Biomedical Informatics
~~Tuesday, October 4, 2016~~	6:00 PM	CandiDate	DataPhilly	Penn Institute for Biomedical Informatics
~~Monday, September 19, 2016~~	6:00 PM	CandiDate	DataPhilly	Penn Institute for Biomedical Informatics
~~Tuesday, September 6, 2016~~	6:00 PM	CandiDate	DataPhilly	Penn Institute for Biomedical Informatics
~~Tuesday, August 23, 2016~~	6:00 PM	CandiDate	DataPhilly	Penn Institute for Biomedical Informatics
~~Tuesday, August 9, 2016~~	6:00 PM	CandiDate	DataPhilly	Penn Institute for Biomedical Informatics
~~Tuesday, July 26, 2016~~	6:00 PM	CandiDate	DataPhilly	Penn Institute for Biomedical Informatics
~~Tuesday, July 19, 2016~~	6:00 PM	CandiDate	DataPhilly	Penn Institute for Biomedical Informatics
~~Tuesday, July 12, 2016~~	6:00 PM	CandiDate	DataPhilly	MilkBoy
~~Tuesday, July 5, 2016~~	6:00 PM	CandiDate	DataPhilly	Neo Technology
~~Tuesday, June 28, 2016~~	6:00 PM	MilkBoy	DataPhilly / Code for Philly	MilkBoy

Contributing

Community contributions are the driving force behind Cognoma. The heatmap below shows which users have contributed to which repositories:

See the guidelines for contributing for more information.

Maintainers

Cognoma relies on our generous community maintainers to assist with contributions. Thanks to the following maintainers for their help:

Cancer Data: Claire McLeod (@clairemcleod)
Machine Learning: Patrick Miller (@patrick-miller), Ryan Velazquez (@rdvelazquez), Jesse Prestwood-Taylor (@jessept), Yichuan Liu (@yl565)
Backend: Derek Goss (@dcgoss), Andrew Madonna (@awm33), Kurt Wheeler (@kurtwheeler)
Frontend: Benjamin Dolly (@bdolly)
Community: Karin Wolok (@KarinSpiderwoman)
Wildcards: Daniel Himmelstein (@dhimmel), Gregory Way (@gwaygenomics), Casey Greene (@cgreene)

cancer-data's Issues

Current Xena PANCAN_mutation dataset is missing some samples and variables from a previous release

I have had this issue in the past (see zenodo file), and it looks like the current PANCAN_mutation file from Xena has fewer samples and fewer columns than a previous version.

One of the missing columns is the specific nucleotide mutation, and its absence is preventing us from completing #15.

It may be good to ask a direct question to the UCSC Xena Google Group. They have been helpful in the past (see #14)

Repository licensing options that are appropriate for non-code content

Currently, this repository is licensed under a BSD 3-Clause License, which is the default license for Project Cognoma repositories. This license was chosen for its permissive and open nature as well as its compatibility with other Greene Lab software products. However, the BSD 3-Clause License is intended for code. Rather than referencing all content, the license specifically mentions "source code".

However, as a data repository, cognoma/cancer-data will hold much more than source code including data, visualizations, writing, documentation, and notebooks which combine all of the aforementioned content types into a single file.

As far as openly releasing content that is not only code, there's currently a short list of recommended licenses. My favorite is CC0 because it waives all copyrights and effectively places work in the public domain. This means anyone can use CC0 content without agonizing over license compatibility.

Despite being one of the most liberating licensing options available, CC0 is not an OSI-approved open source license for complicated reasons. We're still investigating the license of TCGA and Xena Browser data (see #3), but it's possible our upstream data providers may impose restrictions on our use that may require attribution, in which case CC BY is an option. However, Creative Commons recommends against using its licenses (besides CC0) for software.

Hence, switching the entire repository to a Creative Commons license doesn't make sense, since we want to remain OSI conformant. However, we also need a licensing arrangement that is appropriate for content that is not "source code". One option I can think of is multi-licensing where we apply both a CC0/CC BY License and a BSD 3-Clause License to the repository. Users would then be able to choose either license based on their use case. I want to make sure that this is a legally valid approach. What do people think (also asked on Twitter)?

exploring the data

An issue was raised in today's meeting regarding visualizations of the clinical data. Other data visualizations are also being considered. More immediately, however, we need visualization schemes for the clinical data to support assessment and covariate selection.

Gene names converted to dates in Xena's PANCAN_mutation dataset

I've noticed that some gene names have been converted to dates in PANCAN_mutation (version info, Xena Browser). Here are some of the affected rows:

sample	chr	start	end	reference	alt	gene	effect	DNA_VAF	Amino_Acid_Change
TCGA-KK-A8IH-01	chr4	164534558	164534558	G	C	1-Mar	Missense_Mutation	0.320754716981	p.N33K
TCGA-EJ-7125-01	chr16	4829717	4829717	C	A	12-Sep	Missense_Mutation	0.0357142857143	p.R266L
TCGA-CH-5762-01	chr7	55874871	55874871	T	C	14-Sep	Missense_Mutation	0.0251256281407	p.T300A
TCGA-G9-6351-01	chrX	118767429	118767429	C	A	6-Sep	Missense_Mutation	0.0280373831776	p.R328M
TCGA-G9-6342-01	chr5	132098260	132098260	C	A	8-Sep	Missense_Mutation	0.0485436893204	p.M204I

The gene-to-date conversion is a well-documented feature of Microsoft Excel. While the number of corrupted rows in PANCAN_mutation looked minimal, it's disturbing that the data has passed through Excel, since workflows that use Excel tend to be manual rather than scripted and thus error-prone and irreproducible.
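One way to flag the affected rows is a pattern check on the gene column. The following is a minimal sketch, assuming the corrupted values always take the day-month form seen in the rows above (such as 1-Mar or 12-Sep); the helper name `is_date_like` is hypothetical and not part of the repository.

```python
import re

# Excel-style day-month strings such as "1-Mar" or "12-Sep".
MONTHS = "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"
DATE_LIKE = re.compile(rf"^\d{{1,2}}-(?:{MONTHS})$")


def is_date_like(symbol):
    """Return True if a gene symbol looks like an Excel-mangled date."""
    return bool(DATE_LIKE.match(symbol))


# Gene values from the rows above, plus two ordinary symbols for contrast.
genes = ["1-Mar", "12-Sep", "14-Sep", "6-Sep", "8-Sep", "TP53", "KRAS"]
corrupted = [gene for gene in genes if is_date_like(gene)]
print(corrupted)
```

A check like this only detects the corruption. Recovering the original symbol would need an unambiguous mapping, which the mangled values alone do not provide.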

Retain Metastatic Tumors

Currently (in Cell 11 of 2.TCGA-process.ipynb), we retain only Primary Solid Tumor and Primary Blood Derived Cancer - Peripheral Blood. In #44 it was determined that 389 samples (with mutation and gene expression data) were missing clinical annotations. It is likely that many of these samples were removed from the clinical matrix by Cell 11 above.

We should consider adding Metastatic and to Cell 11.
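As a rough illustration of the proposed change, here is a sketch of a sample-type filter with and without Metastatic. It is not the actual Cell 11 code: the `sample_type` field name, the `filter_samples` helper, and the sample IDs are all assumptions made for the example.

```python
# Sample types retained today, as described above.
RETAINED_TYPES = {
    "Primary Solid Tumor",
    "Primary Blood Derived Cancer - Peripheral Blood",
}


def filter_samples(rows, retained=RETAINED_TYPES):
    """Keep only rows whose (assumed) sample_type field is in `retained`."""
    return [row for row in rows if row["sample_type"] in retained]


# Hypothetical clinical rows for illustration only.
rows = [
    {"sample_id": "S1", "sample_type": "Primary Solid Tumor"},
    {"sample_id": "S2", "sample_type": "Metastatic"},
    {"sample_id": "S3", "sample_type": "Solid Tissue Normal"},
]

current = filter_samples(rows)
proposed = filter_samples(rows, RETAINED_TYPES | {"Metastatic"})
print(len(current), len(proposed))
```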

Acronyms for diseases

In another discussion @gwaygenomics shared acronyms for TCGA diseases as a text file (tcga_dictionary.txt). The contents are:

tissue	acronym
adrenocortical cancer	ACC
bladder urothelial carcinoma	BLCA
breast invasive carcinoma	BRCA
cervical & endocervical cancer	CESC
cholangiocarcinoma	CHOL
colon adenocarcinoma	COAD
diffuse large B-cell lymphoma	DLBC
esophageal carcinoma	ESCA
glioblastoma multiforme	GBM
head & neck squamous cell carcinoma	HNSC
kidney chromophobe	KICH
kidney clear cell carcinoma	KIRC
kidney papillary cell carcinoma	KIRP
acute myeloid leukemia	LAML
brain lower grade glioma	LGG
liver hepatocellular carcinoma	LIHC
lung adenocarcinoma	LUAD
lung squamous cell carcinoma	LUSC
mesothelioma	MESO
ovarian serous cystadenocarcinoma	OV
pancreatic adenocarcinoma	PAAD
pheochromocytoma & paraganglioma	PCPG
prostate adenocarcinoma	PRAD
rectum adenocarcinoma	READ
sarcoma	SARC
skin cutaneous melanoma	SKCM
stomach adenocarcinoma	STAD
testicular germ cell tumor	TGCT
thyroid carcinoma	THCA
thymoma	THYM
uterine corpus endometrioid carcinoma	UCEC
uterine carcinosarcoma	UCS
uveal melanoma	UVM

Which types of mutation effects should be ignored?

The PANCAN_mutation dataset (online doc) contains several types of mutations under the effect column. My processing of the dataset (notebook) yielded the following mutation effect and frequencies (as counts and percentages):

Effect	Count	Percent
Missense_Mutation	1,044,846	58.152%
Silent	432,995	24.099%
Nonsense_Mutation	81,092	4.513%
RNA	71,493	3.979%
Frame_Shift_Del	46,941	2.613%
Splice_Site	43,262	2.408%
Frame_Shift_Ins	22,546	1.255%
missense_variant	20,241	1.127%
In_Frame_Del	11,455	0.638%
synonymous_variant	7,907	0.440%
Translation_Start_Site	3,258	0.181%
In_Frame_Ins	3,052	0.170%
stop_gained	1,573	0.088%
3_prime_UTR_variant	1,420	0.079%
Nonstop_Mutation	1,318	0.073%
exon_variant	945	0.053%
EXON	420	0.023%
5_prime_UTR_variant	395	0.022%
splice_acceptor_variant	294	0.016%
splice_region_variant	255	0.014%
3'UTR	211	0.012%
splice_donor_variant	203	0.011%
Intron	148	0.008%
5_prime_UTR_premature_start_codon_gain_variant	110	0.006%
NON_SYNONYMOUS_CODING	95	0.005%
INTRAGENIC	57	0.003%
UTR_3_PRIME	38	0.002%
SYNONYMOUS_CODING	36	0.002%
start_lost	32	0.002%
5'UTR	28	0.002%
UTR_5_PRIME	22	0.001%
stop_lost	19	0.001%
IGR	16	0.001%
stop_retained_variant	7	0.000%
STOP_GAINED	6	0.000%
initiator_codon_variant	2	0.000%
SPLICE_SITE_ACCEPTOR	2	0.000%
SYNONYMOUS_STOP	1	0.000%
5'Flank	1	0.000%

It appears that certain effects are duplicates — such as 5_prime_UTR_variant, 5'UTR, UTR_5_PRIME — which if true represents a poor case of standardization. If we want to improve the standardization, we can create our own mapping, or we can report the issue to the upstream creators (although these fixes usually take a long time).
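If we go the route of our own mapping, it could start as a simple alias table. The sketch below merges only the 5' UTR labels named above and, by analogy, the three 3' UTR labels from the table. Treating these as true duplicates is an assumption that still needs confirmation from someone who knows the annotation pipelines, and `EFFECT_ALIASES` and `consolidate` are hypothetical names.

```python
from collections import Counter

# Assumed duplicates mapped onto one canonical label each (unconfirmed).
EFFECT_ALIASES = {
    "5'UTR": "5_prime_UTR_variant",
    "UTR_5_PRIME": "5_prime_UTR_variant",
    "3'UTR": "3_prime_UTR_variant",
    "UTR_3_PRIME": "3_prime_UTR_variant",
}


def consolidate(counts):
    """Sum counts of aliased effects under their canonical label."""
    merged = Counter()
    for effect, n in counts.items():
        merged[EFFECT_ALIASES.get(effect, effect)] += n
    return merged


# Counts copied from the table above.
raw_counts = {
    "5_prime_UTR_variant": 395,
    "5'UTR": 28,
    "UTR_5_PRIME": 22,
    "3_prime_UTR_variant": 1420,
    "3'UTR": 211,
    "UTR_3_PRIME": 38,
    "Silent": 432995,
}
merged = consolidate(raw_counts)
print(merged["5_prime_UTR_variant"], merged["3_prime_UTR_variant"])
```

Effects without an alias pass through unchanged, so the table can grow one confirmed duplicate at a time.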

Anyways, we'll have to decide which types of effects to consider as functionally relevant mutations. For example, a "Silent" mutation generally does not have a biological effect. We could also let users decide for themselves, but that adds complexity.

@clairemcleod, @mp8, @DCousminer, @gwaygenomics, @cgreene, @stephenshank — I thought you may have a better understanding than I do of the biology here. Can any of these categories be discarded as irrelevant to a tumor's function and classification? Are you interested in creating a consolidated set of effects with duplicates merged?

Reshape mutation matrix for use by core-service repository

The current format of the mutation matrix leads to some complications in the core-service repository. A more desirable format to work with for the purpose of populating the core-service mutation model would be of the form:

sample_id	entrez_gene_id
TCGA-18-3406-01	1
TCGA-38-4631-01	1
...
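A possible reshaping step is sketched below. It assumes the current matrix is a wide TSV with one row per sample, one column per Entrez gene ID, and 1 marking a mutated gene; that layout, the toy values, and the `wide_to_long` name are assumptions for illustration.

```python
import csv
import io

# Toy wide matrix: first column is the sample, remaining headers are gene IDs.
wide_tsv = (
    "sample_id\t1\t2\n"
    "TCGA-18-3406-01\t1\t0\n"
    "TCGA-38-4631-01\t1\t1\n"
)


def wide_to_long(handle):
    """Return (sample_id, entrez_gene_id) pairs for every mutated cell."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    gene_ids = header[1:]
    pairs = []
    for row in reader:
        sample_id = row[0]
        for gene_id, status in zip(gene_ids, row[1:]):
            if status == "1":
                pairs.append((sample_id, int(gene_id)))
    return pairs


pairs = wide_to_long(io.StringIO(wide_tsv))
for sample_id, gene_id in pairs:
    print(sample_id, gene_id, sep="\t")
```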

Treehouse Childhood Cancer Initiative

New, publicly available dataset of 11,078 RNAseq + clinical childhood cancer tumors.

Xena data

Blog Post

This will open up a lot of analysis opportunities, and it is exciting that it is now available!

Filtering samples is (potentially) too strict

Data is currently processed in https://github.com/cognoma/cancer-data/blob/master/2.TCGA-process.ipynb and the final matrices used in downstream analyses include samples that have mutation, expression, and clinical measurements and were not filtered for other reasons.

@kurtwheeler pointed out in cognoma/core-service#99 a potential issue that the current implementation is not finding samples it should. @dhimmel discovered that this was not an issue (at least not primarily an issue) of the backend, but of the data itself.

I outlined current problems with the data in cognoma/core-service#99 (comment) but we can continue this discussion here.

TCGA PanCanAtlas Paper/Data Release

The PanCanAtlas released 27 open access papers and updated data last week.

The UCSC Xena team also added this version to their database! An overview of the updated data is here.

We should update our download and the figshare so that cognoma runs with the most recent PanCanAtlas version

Much Richer Sample Clinical Data

Stumbled upon snaptron today and eventually found my way to this resource.

There are many variables curated here measured on each sample (in samples.tsv), including treatment (both the specific therapeutic agent and the class of therapy, e.g. chemotherapy, immunotherapy, etc.). I know that @yigalron was very interested in this particular data...

Precomputing a sample × mutation-in-gene-set matrix

At the 8/23 meetup, @dhimmel expressed interest in incorporating metabolic pathway information by combining the dataset that we have and the hetnet database that was described at the first meetup. The hetnet has information on what pathways the mutated genes in the current dataset participate in.

I figured I'd open this issue to get the conversation started. Initially, I am wondering what this dataset would look like, and do we envision it being created from what we already have? And how much tweaking will the classifier of the machine learning group (for instance, that provided by @gwaygenomics) require?

Licensing of the Xena TCGA Pan-Cancer Data

Currently, the licensing of the TCGA Pan-Cancer Data available via UCSC's Xena Browser is unclear. I've messaged their listserv and will update this issue with any progress.

Invalid gene in mutation-matrix.tsv.bz2 on figshare v5

See cognoma/core-service#42 (comment), where @stephenshank has discovered a gene with an invalid entrez_gene_id in mutation-matrix.tsv.bz2 from https://doi.org/10.6084/m9.figshare.3487685.v5.

The gene is 117153, which is not included in genes.tsv.
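A consistency check along these lines could catch such cases before a release. This is only a sketch: the ID lists are toy stand-ins for the matrix column headers and the entrez_gene_id column of genes.tsv, and `find_unknown_genes` is a hypothetical helper.

```python
def find_unknown_genes(matrix_gene_ids, known_gene_ids):
    """Return gene IDs used in the matrix but absent from the gene table."""
    return sorted(set(matrix_gene_ids) - set(known_gene_ids))


# Toy stand-ins: only 117153 is taken from the report above.
matrix_gene_ids = [1, 2, 117153]
known_gene_ids = [1, 2, 9]

unknown = find_unknown_genes(matrix_gene_ids, known_gene_ids)
print(unknown)
```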

Persistent storage of matrices that enables quick indexed lookup

Currently, we're storing our datasets (which are matrices) as compressed TSVs, which are great for long-term interoperable storage. However, we'd love a way to look up specific rows and columns without having to read the entire dataset. We began discussing options at cognoma/cognoma#17 (comment). We want a persistent storage format (i.e. file) that allows reading only specified rows and columns into a numpy array/matrix or a pandas dataframe.

A primary benchmark for judging implementations is how much time they save, across a variety of setups, compared with reading the entire bzipped TSV into Python via pandas.
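To make the benchmark concrete, here is a sketch of a timing harness for the baseline case. It uses only the standard library rather than pandas, builds a small synthetic matrix, and times a full scan that keeps a few rows and columns. The matrix contents, the `read_subset` name, and the file name are assumptions; a candidate format would be timed the same way for comparison.

```python
import bz2
import csv
import os
import tempfile
import time


def read_subset(path, wanted_rows, wanted_cols):
    """Baseline: decompress and scan the whole file, keeping only a subset."""
    subset = {}
    with bz2.open(path, "rt", newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        header = next(reader)
        col_idx = [header.index(col) for col in wanted_cols]
        for row in reader:
            if row[0] in wanted_rows:
                subset[row[0]] = [row[i] for i in col_idx]
    return subset


with tempfile.TemporaryDirectory() as tmp:
    # Write a synthetic 200 x 50 matrix as a bzipped TSV.
    path = os.path.join(tmp, "matrix.tsv.bz2")
    with bz2.open(path, "wt", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(["sample_id"] + [f"g{j}" for j in range(50)])
        for i in range(200):
            writer.writerow([f"s{i}"] + [(i * j) % 7 for j in range(50)])

    # Time the lookup of two rows and two columns via a full scan.
    start = time.perf_counter()
    subset = read_subset(path, {"s3", "s10"}, ["g2", "g5"])
    elapsed = time.perf_counter() - start

print(subset, f"{elapsed:.4f}s")
```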

Generate comprehensive comparisons between RNAseq, Mutation, and Clinical Matrix

We need to generate a comparison between the sample IDs that exist in all three data sources. It will be good to subset the clinical matrix to only samples that are measured by RNAseq and to file a pull request with this report.
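The core of such a report is a handful of set operations. A minimal sketch follows, with made-up sample IDs and a hypothetical `compare_sample_ids` helper; the real inputs would be the sample ID columns of the three matrices.

```python
def compare_sample_ids(expression, mutation, clinical):
    """Summarize overlap between the three sets of sample IDs."""
    expression, mutation, clinical = set(expression), set(mutation), set(clinical)
    return {
        "all_three": expression & mutation & clinical,
        "expression_without_clinical": expression - clinical,
        "mutation_without_expression": mutation - expression,
        # Clinical rows kept after subsetting to RNAseq-measured samples.
        "clinical_with_expression": clinical & expression,
    }


# Made-up sample IDs for illustration only.
report = compare_sample_ids(
    expression=["A", "B", "C"],
    mutation=["B", "C", "D"],
    clinical=["A", "B", "E"],
)
for name, ids in report.items():
    print(name, sorted(ids))
```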

Export a gene information table

Similar to how we have sample information in samples.tsv, it would be nice to create a table with gene information. The primary identifier is entrez_gene_id. Additional columns could be:

symbol
name
chromosome
n_mutations - number of mutated samples
median_expression - median gene expression
mad_expression - median absolute deviation of gene expression
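For the three computed columns, a sketch of the per-gene calculation is shown below. It assumes the mutation calls for a gene are 0/1 values per sample and takes "median absolute deviation" as the unscaled median of absolute deviations from the median; `summarize_gene` and the input values are hypothetical.

```python
import statistics


def summarize_gene(mutation_calls, expression_values):
    """Compute the proposed summary columns for a single gene."""
    median = statistics.median(expression_values)
    # Unscaled MAD: median of absolute deviations from the median.
    mad = statistics.median(abs(x - median) for x in expression_values)
    return {
        "n_mutations": sum(mutation_calls),
        "median_expression": median,
        "mad_expression": mad,
    }


# Made-up values for one gene across five samples.
summary = summarize_gene([1, 0, 1, 0, 0], [2.0, 4.0, 6.0, 8.0, 100.0])
print(summary)
```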

I'm leaning towards a combined dataset for mutation and expression genes. But I could be convinced that splitting the datasets would be better.

We should probably get this information from entrez gene as @clairemcleod did in #12.

Labeling this issue a task awaiting a claimer.

Identify the types of clinical data fields for the django team

There are a number of questions around how best to represent various items that are important for sample selection. Can someone help out the django-cognoma team to specify the best way to put these into the database that the front end can access?

See cognoma/core-service#2

Recurrence and Distant Metastasis

Which column in the clinical_data should I consider to know whether the tumor has recurred?

Does _RFS_IND=1 mean the tumor has definitely recurred?

How do I know if the tumor is primary or second primary?

How do I know if the tumor has recurred locally or distally?

Does clinical_M=M1 or pathologic_M=M1 mean the tumor has definitely metastasized?

Variable documentation for Xena Browser's PANCAN_clinicalMatrix

Hi @dhimmel ,

The documentation links provided for the 3 datasets did not explain the variables involved clearly. It would be great if you could share some links around that.

Thanks,
Roshan

Extract detailed mutation information for TCGA samples

In speaking with a cancer biologist and collaborator about cognoma it was discovered that a huge win we could relatively easily deliver is classification performance (or classification scores) across different mutation types for an input gene. This would be extremely useful for a researcher who is interested in determining the pathogenicity of a particular mutation.

I believe that cognoma is an ideal way of approaching this problem. Typically, when genes mutate there is a range of severity regarding how the particular mutation impacts downstream changes. For a particularly virulent mutation, a classifier trained to detect an inactivation signature may output a higher score for those groups of samples, than other samples with a less virulent mutation.

In my eyes, this particular issue bypasses the machine learning group - they will still work with the previously defined Y matrices. However, in order for the backend to serve the frontend information from the database about each sample's mutation so that the frontend can visualize the results we need to know how to parse this information.

I looked briefly at the information embedded in the PANCAN mutation data, particularly the columns labeled HGVSc and HGVSp. These columns hold standard ways of storing specific mutation calls. More information about these standards is provided by the HGVS website.
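As a starting point for parsing, here is a sketch that handles only the simplest case: a one-letter protein substitution written like p.N33K, the short form that appears in the Amino_Acid_Change values of PANCAN_mutation. Whether the HGVSp column uses this form or the three-letter form is not something I have verified, and `parse_substitution` is a hypothetical helper that returns None for anything more complex (frameshifts, indels, splice changes).

```python
import re

# One-letter substitution such as "p.N33K"; "*" denotes a stop codon.
SUBSTITUTION = re.compile(r"^p\.([A-Z*])(\d+)([A-Z*])$")


def parse_substitution(change):
    """Return (reference, position, alternate), or None if not a simple substitution."""
    match = SUBSTITUTION.match(change)
    if match is None:
        return None
    ref, pos, alt = match.groups()
    return ref, int(pos), alt


print(parse_substitution("p.N33K"))
print(parse_substitution("p.R266L"))
print(parse_substitution("p.N33fs"))
```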

Process the clinical matrix to extract sample attributes

An issue has been raised in today's meeting.

The clinical matrix should be carefully analyzed to select a specific covariate or a set of covariates we can use for analyses.

The relevant notebook is the TCGA notebook for data download, and the dataset is named PANCAN-clinicalMatrix.

Converting Xena datasets to standard identifiers rather than gene symbols

Xena datasets (as retrieved in #1) use symbols to identify genes rather than standardized identifiers, such as Entrez GeneIDs, ensembl gene IDs, HGNC IDs, or UCSC gene IDs. This has led to upstream data quality issues such as #4. Hence, I think it makes sense to code our databases using standardized identifiers.

Currently, we use the HiSeqV2 and TCGA.PANCAN.sampleMap datasets which both use symbols. Does anyone have a preferred identifier? I like Entrez GeneIDs.
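Whichever identifier we pick, the conversion step would look roughly like the sketch below. The two-entry lookup is an illustrative excerpt only; a real table would come from a resource such as Entrez Gene. The `convert_symbols` name is hypothetical, and symbols that cannot be mapped (including date-mangled ones) are collected rather than silently dropped.

```python
def convert_symbols(symbols, symbol_to_entrez):
    """Split symbols into mapped Entrez GeneIDs and an unmapped remainder."""
    mapped, unmapped = {}, []
    for symbol in symbols:
        if symbol in symbol_to_entrez:
            mapped[symbol] = symbol_to_entrez[symbol]
        else:
            unmapped.append(symbol)
    return mapped, unmapped


# Illustrative excerpt of a symbol-to-Entrez lookup, not a complete table.
symbol_to_entrez = {"TP53": 7157, "KRAS": 3845}

mapped, unmapped = convert_symbols(["TP53", "KRAS", "1-Mar"], symbol_to_entrez)
print(mapped, unmapped)
```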