cancer-data's Introduction

Project Cognoma

Putting machine learning in the hands of cancer biologists.

Project Cognoma is an open source project to create a webapp for analyzing cancer data. We're a community-driven philanthropic project that began as a collaboration between the Greene Lab, DataPhilly, and Code for Philly. Our contributors are primarily based in the Philadelphia area, but anyone anywhere is welcome. This GitHub repository is the administrative and informational home of Cognoma.

The Meetup phase of Cognoma is now complete! The Childhood Cancer Data Lab of Alex's Lemonade Stand Foundation will be providing long-term maintenance. Public contributions are still welcome through GitHub. The main priority is enhancements and bug fixes to improve http://cognoma.org. For a nice overview of the project, see its coverage by The Philadelphia Citizen.

Teams and Repositories

The project is composed of four teams with their own corresponding repositories:

Team Name	Repositories	Description
Cancer Data	cancer-data, genes, figshare	processing the underlying cancer data to the formats required for this project.
Machine Learning	machine-learning, cognoml	building classifiers to predict mutation status from gene expression data.
Backend	core-service, task-service, ml-workers, infrastructure	creating the infrastructure to power the webapp and glue the components together.
Frontend	frontend, uiux	building the webapp that users interact with.

New Here?

If you are a new user and would like to get involved, please introduce yourself. Contributions are made through GitHub, so if you are unfamiliar with git or GitHub, check out the sandbox for a place to learn by doing.

Meetup Schedule

We hold project meetups. Our usual meeting spot is at Industrious (where CandiDate is located). The address is 230 S Broad St, Floor 17, Philadelphia.

📅 Date ⌚ Time 🗺 Location ℹ️ Meetup Details 💰 Sponsor
Wednesday, October 11, 2017 6:00 PM MilkBoy DataPhilly Alex's Lemonade Stand Foundation
Tuesday, August 15, 2017 6:00 PM CandiDate DataPhilly Penn Institute for Biomedical Informatics
Tuesday, July 11, 2017 6:00 PM CandiDate DataPhilly Penn Institute for Biomedical Informatics
Tuesday, June 27, 2017 6:00 PM CandiDate DataPhilly Penn Institute for Biomedical Informatics
Tuesday, May 30, 2017 6:00 PM CandiDate DataPhilly Penn Institute for Biomedical Informatics
Tuesday, April 25, 2017 6:00 PM CandiDate DataPhilly Penn Institute for Biomedical Informatics
Tuesday, April 4, 2017 6:00 PM CandiDate DataPhilly Penn Institute for Biomedical Informatics
Tuesday, February 28, 2017 6:00 PM CandiDate DataPhilly Penn Institute for Biomedical Informatics
Monday, February 13, 2017 6:00 PM CandiDate DataPhilly Penn Institute for Biomedical Informatics
Tuesday, January 31, 2017 6:00 PM CandiDate DataPhilly Penn Institute for Biomedical Informatics
Monday, January 16, 2017 9:00 AM Philly Think Space Frontend Only MLK Day Volunteers from Think Company
Tuesday, January 10, 2017 6:00 PM CandiDate DataPhilly Penn Institute for Biomedical Informatics
Tuesday, December 20, 2016 6:00 PM CandiDate DataPhilly Penn Institute for Biomedical Informatics
Tuesday, December 6, 2016 6:00 PM CandiDate DataPhilly Penn Institute for Biomedical Informatics
Tuesday, November 15, 2016 6:00 PM CandiDate DataPhilly Penn Institute for Biomedical Informatics
Tuesday, November 1, 2016 6:00 PM CandiDate DataPhilly Penn Institute for Biomedical Informatics
Tuesday, October 18, 2016 6:00 PM CandiDate DataPhilly Penn Institute for Biomedical Informatics
Tuesday, October 4, 2016 6:00 PM CandiDate DataPhilly Penn Institute for Biomedical Informatics
Monday, September 19, 2016 6:00 PM CandiDate DataPhilly Penn Institute for Biomedical Informatics
Tuesday, September 6, 2016 6:00 PM CandiDate DataPhilly Penn Institute for Biomedical Informatics
Tuesday, August 23, 2016 6:00 PM CandiDate DataPhilly Penn Institute for Biomedical Informatics
Tuesday, August 9, 2016 6:00 PM CandiDate DataPhilly Penn Institute for Biomedical Informatics
Tuesday, July 26, 2016 6:00 PM CandiDate DataPhilly Penn Institute for Biomedical Informatics
Tuesday, July 19, 2016 6:00 PM CandiDate DataPhilly Penn Institute for Biomedical Informatics
Tuesday, July 12, 2016 6:00 PM CandiDate DataPhilly MilkBoy
Tuesday, July 5, 2016 6:00 PM CandiDate DataPhilly Neo Technology
Tuesday, June 28, 2016 6:00 PM MilkBoy DataPhilly / Code for Philly MilkBoy

Contributing

Community contributions are the driving force behind Cognoma. The heatmap below shows which users have contributed to which repositories:

Contribution Heatmap

See the guidelines for contributing for more information.

Maintainers

Cognoma relies on our generous community maintainers to assist with contributions. Thanks to the following maintainers for their help:

cancer-data's People

Contributors

clairemcleod, dhimmel, gwaybio, stephenshank

cancer-data's Issues

Repository licensing options that are appropriate for non-code content

Currently, this repository is licensed under a BSD 3-Clause License, which is the default license for Project Cognoma repositories. This license was chosen for its permissive and open nature as well as its compatibility with other Greene Lab software products. However, the BSD 3-Clause License is intended for code. Rather than referencing all content, the license specifically mentions "source code".

However, as a data repository, cognoma/cancer-data will hold much more than source code including data, visualizations, writing, documentation, and notebooks which combine all of the aforementioned content types into a single file.

As far as openly releasing content that is not only code, there's currently a short list of recommended licenses. My favorite is CC0 because it waives all copyrights and effectively places work in the public domain. This means anyone can use CC0 content without agonizing over license compatibility.

Despite being one of the most liberating licensing options available, CC0 is not an OSI-approved open source license for complicated reasons. We're still investigating the license of TCGA and Xena Browser data (see #3), but it's possible our upstream data providers may impose restrictions on our use that may require attribution, in which case CC BY is an option. However, Creative Commons recommends against using its licenses (besides CC0) for software.

Hence, switching the entire repository to a Creative Commons license doesn't make sense, since we want to remain OSI conformant. However, we also need a licensing arrangement that is appropriate for content that is not "source code". One option I can think of is multi-licensing where we apply both a CC0/CC BY License and a BSD 3-Clause License to the repository. Users would then be able to choose either license based on their use case. I want to make sure that this is a legally valid approach. What do people think (also asked on Twitter)?

exploring the data

An issue was raised in today's meeting regarding visualizations of the clinical data. Other data visualizations are also being considered. More immediately, however, we need visualization schemes for the clinical data to support assessments and covariate selection.

Gene names converted to dates in Xena's PANCAN_mutation dataset

I've noticed that some gene names have been converted to dates in PANCAN_mutation (version info, Xena Browser). Here are some of the affected rows:

sample chr start end reference alt gene effect DNA_VAF RNA_VAF Amino_Acid_Change
TCGA-KK-A8IH-01 chr4 164534558 164534558 G C 1-Mar Missense_Mutation 0.320754716981 p.N33K
TCGA-EJ-7125-01 chr16 4829717 4829717 C A 12-Sep Missense_Mutation 0.0357142857143 p.R266L
TCGA-CH-5762-01 chr7 55874871 55874871 T C 14-Sep Missense_Mutation 0.0251256281407 p.T300A
TCGA-G9-6351-01 chrX 118767429 118767429 C A 6-Sep Missense_Mutation 0.0280373831776 p.R328M
TCGA-G9-6342-01 chr5 132098260 132098260 C A 8-Sep Missense_Mutation 0.0485436893204 p.M204I

The gene-to-date conversion is a well-documented feature of Microsoft Excel. While the number of corrupted rows in PANCAN_mutation looked minimal, it's disturbing that the data has passed through Excel, since workflows that use Excel tend to be manual rather than scripted and thus error prone and irreproducible.
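A scripted check can flag rows like the ones above. The following is a minimal sketch, assuming the rows have already been read into dictionaries with a `gene` key; the helper name `find_date_like_genes` and the third sample ID are made up for illustration, and the pattern is a heuristic rather than a full list of every way a spreadsheet can mangle a symbol.

```python
import re

# Excel is known to turn symbols such as MARCH1 into "1-Mar".
# This pattern matches that day-month shape only.
DATE_LIKE = re.compile(
    r"^\d{1,2}-(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)$"
)


def find_date_like_genes(rows, gene_field="gene"):
    """Return the rows whose gene field looks like an Excel date."""
    return [row for row in rows if DATE_LIKE.match(row[gene_field])]


rows = [
    {"sample": "TCGA-KK-A8IH-01", "gene": "1-Mar"},
    {"sample": "TCGA-EJ-7125-01", "gene": "12-Sep"},
    {"sample": "TCGA-XX-0000-01", "gene": "TP53"},  # made-up sample ID
]
suspect = find_date_like_genes(rows)
print([row["gene"] for row in suspect])
```

Running a check like this on every download would at least make the corruption visible, even though it cannot recover the original symbol.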

Retain Metastatic Tumors

Currently (in Cell 11 of 2.TCGA-process.ipynb), we retain only Primary Solid Tumor and Primary Blood Derived Cancer - Peripheral Blood. In #44 it was determined that 389 samples (with mutation and gene expression data) were missing clinical annotations. It is likely that many of these samples were removed from the clinical matrix by Cell 11 above.

We should consider adding Metastatic and to Cell 11.
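To make the proposed change concrete, here is a rough sketch of a sample-type filter. It is not the code in Cell 11: the `sample_type` field name, the `filter_samples` helper, and the toy sample IDs are assumptions for illustration. In the notebook itself, the same idea would likely be expressed with a pandas `isin` filter.

```python
# Sample types currently retained, as named in the issue text.
RETAINED = {
    "Primary Solid Tumor",
    "Primary Blood Derived Cancer - Peripheral Blood",
}


def filter_samples(rows, retained, field="sample_type"):
    """Keep only rows whose sample type is in the retained set."""
    return [row for row in rows if row[field] in retained]


# Toy rows with made-up sample IDs.
rows = [
    {"sample_id": "S1", "sample_type": "Primary Solid Tumor"},
    {"sample_id": "S2", "sample_type": "Metastatic"},
    {"sample_id": "S3", "sample_type": "Solid Tissue Normal"},
]

before = filter_samples(rows, RETAINED)
after = filter_samples(rows, RETAINED | {"Metastatic"})
print(len(before), len(after))
```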

Acronyms for diseases

In another discussion @gwaygenomics shared acronyms for TCGA diseases as a text file (tcga_dictionary.txt). The contents are:

tissue acronym
adrenocortical cancer ACC
bladder urothelial carcinoma BLCA
breast invasive carcinoma BRCA
cervical & endocervical cancer CESC
cholangiocarcinoma CHOL
colon adenocarcinoma COAD
diffuse large B-cell lymphoma DLBC
esophageal carcinoma ESCA
glioblastoma multiforme GBM
head & neck squamous cell carcinoma HNSC
kidney chromophobe KICH
kidney clear cell carcinoma KIRC
kidney papillary cell carcinoma KIRP
acute myeloid leukemia LAML
brain lower grade glioma LGG
liver hepatocellular carcinoma LIHC
lung adenocarcinoma LUAD
lung squamous cell carcinoma LUSC
mesothelioma MESO
ovarian serous cystadenocarcinoma OV
pancreatic adenocarcinoma PAAD
pheochromocytoma & paraganglioma PCPG
prostate adenocarcinoma PRAD
rectum adenocarcinoma READ
sarcoma SARC
skin cutaneous melanoma SKCM
stomach adenocarcinoma STAD
testicular germ cell tumor TGCT
thyroid carcinoma THCA
thymoma THYM
uterine corpus endometrioid carcinoma UCEC
uterine carcinosarcoma UCS
uveal melanoma UVM

Which types of mutation effects should be ignored?

The PANCAN_mutation dataset (online doc) contains several types of mutations under the effect column. My processing of the dataset (notebook) yielded the following mutation effect and frequencies (as counts and percentages):

Effect Count Percent
Missense_Mutation 1,044,846 58.152%
Silent 432,995 24.099%
Nonsense_Mutation 81,092 4.513%
RNA 71,493 3.979%
Frame_Shift_Del 46,941 2.613%
Splice_Site 43,262 2.408%
Frame_Shift_Ins 22,546 1.255%
missense_variant 20,241 1.127%
In_Frame_Del 11,455 0.638%
synonymous_variant 7,907 0.440%
Translation_Start_Site 3,258 0.181%
In_Frame_Ins 3,052 0.170%
stop_gained 1,573 0.088%
3_prime_UTR_variant 1,420 0.079%
Nonstop_Mutation 1,318 0.073%
exon_variant 945 0.053%
EXON 420 0.023%
5_prime_UTR_variant 395 0.022%
splice_acceptor_variant 294 0.016%
splice_region_variant 255 0.014%
3'UTR 211 0.012%
splice_donor_variant 203 0.011%
Intron 148 0.008%
5_prime_UTR_premature_start_codon_gain_variant 110 0.006%
NON_SYNONYMOUS_CODING 95 0.005%
INTRAGENIC 57 0.003%
UTR_3_PRIME 38 0.002%
SYNONYMOUS_CODING 36 0.002%
start_lost 32 0.002%
5'UTR 28 0.002%
UTR_5_PRIME 22 0.001%
stop_lost 19 0.001%
IGR 16 0.001%
stop_retained_variant 7 0.000%
STOP_GAINED 6 0.000%
initiator_codon_variant 2 0.000%
SPLICE_SITE_ACCEPTOR 2 0.000%
SYNONYMOUS_STOP 1 0.000%
5'Flank 1 0.000%

It appears that certain effects are duplicates (such as 5_prime_UTR_variant, 5'UTR, and UTR_5_PRIME), which, if true, represents a poor case of standardization. If we want to improve the standardization, we can create our own mapping, or we can report the issue to the upstream creators (although these fixes usually take a long time).
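If we went the route of our own mapping, it could start as a small lookup table. The sketch below merges only the three 5' UTR labels named above; the choice of `5_prime_UTR_variant` as the canonical name and the helper `consolidate` are assumptions for illustration, and whether these labels are truly equivalent still needs confirmation from someone who knows the annotation pipelines.

```python
from collections import Counter

# Hypothetical synonym table: maps an alternate label to the label we
# choose to keep. Only the apparent duplicates named in the text appear.
EFFECT_SYNONYMS = {
    "5'UTR": "5_prime_UTR_variant",
    "UTR_5_PRIME": "5_prime_UTR_variant",
}


def consolidate(effect_counts, synonyms=EFFECT_SYNONYMS):
    """Sum counts of labels that map to the same canonical effect."""
    merged = Counter()
    for effect, count in effect_counts.items():
        merged[synonyms.get(effect, effect)] += count
    return merged


# Counts copied from the table above.
counts = {
    "5_prime_UTR_variant": 395,
    "5'UTR": 28,
    "UTR_5_PRIME": 22,
    "Silent": 432995,
}
merged = consolidate(counts)
print(merged["5_prime_UTR_variant"])  # 395 + 28 + 22 = 445
```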

Anyways, we'll have to decide which types of effects to consider as functionally relevant mutations. For example, a "Silent" mutation generally does not have a biological effect. We could also let users decide for themselves, but that adds complexity.

@clairemcleod, @mp8, @DCousminer, @gwaygenomics, @cgreene, @stephenshank: I thought you may have a better understanding than I do of the biology here. Can any of these categories be discarded as irrelevant to a tumor's function and classification? Are you interested in creating a consolidated set of effects with duplicates merged?

Reshape mutation matrix for use by core-service repository

The current format of the mutation matrix leads to some complications in the core-service repository. A more desirable format to work with for the purpose of populating the core-service mutation model would be of the form:

sample_id	entrez_gene_id
TCGA-18-3406-01	1
TCGA-38-4631-01	1
...
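One way to produce that layout is sketched below. It assumes the current matrix is a wide TSV with one row per sample, one column per Entrez gene ID, and `1` marking a mutation; that layout, the helper name `wide_to_long`, and the 0/1 values in the toy input are assumptions, not a description of the actual file.

```python
import csv
import io


def wide_to_long(wide_tsv):
    """Turn a wide sample-by-gene 0/1 TSV into (sample_id, gene_id) pairs."""
    reader = csv.reader(io.StringIO(wide_tsv), delimiter="\t")
    header = next(reader)
    gene_ids = header[1:]  # first column holds the sample ID
    pairs = []
    for row in reader:
        sample_id, values = row[0], row[1:]
        for gene_id, value in zip(gene_ids, values):
            if value == "1":  # keep only mutated sample-gene pairs
                pairs.append((sample_id, int(gene_id)))
    return pairs


# Toy input; the 0/1 values are invented for illustration.
wide = (
    "sample_id\t1\t2\n"
    "TCGA-18-3406-01\t1\t0\n"
    "TCGA-38-4631-01\t1\t1\n"
)
pairs = wide_to_long(wide)
print("sample_id", "entrez_gene_id", sep="\t")
for sample_id, gene_id in pairs:
    print(sample_id, gene_id, sep="\t")
```

Because only mutated pairs are emitted, the long table stays small when the matrix is sparse.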

Filtering samples is (potentially) too strict

Data is currently processed in https://github.com/cognoma/cancer-data/blob/master/2.TCGA-process.ipynb and the final matrices used in downstream analyses include samples that have mutation, expression, and clinical measurements and were not filtered for other reasons.

@kurtwheeler pointed out in cognoma/core-service#99 a potential issue that the current implementation is not finding samples it should. @dhimmel discovered that this was not an issue (at least not primarily an issue) of the backend, but of the data itself.

I outlined current problems with the data in cognoma/core-service#99 (comment) but we can continue this discussion here.

TCGA PanCanAtlas Paper/Data Release

The PanCanAtlas released 27 open access papers and updated data last week.

The UCSC Xena team also added this version to their database! An overview of the updated data is here.

We should update our download and the figshare so that Cognoma runs with the most recent PanCanAtlas version.

Much Richer Sample Clinical Data

Stumbled upon snaptron today and eventually found my way to this resource.

There are many variables curated here measured on each sample (in samples.tsv), including treatment (both the specific therapeutic agent and the class of therapy, e.g. chemotherapy, immunotherapy, etc.). I know that @yigalron was very interested in this particular data...

Precomputing a sample × mutation-in-gene-set matrix

At the 8/23 meetup, @dhimmel expressed interest in incorporating metabolic pathway information by combining the dataset that we have and the hetnet database that was described at the first meetup. The hetnet has information on what pathways the mutated genes in the current dataset participate in.

I figured I'd open this issue to get the conversation started. Initially, I am wondering what this dataset would look like, and do we envision it being created from what we already have? And how much tweaking will the classifier of the machine learning group (for instance, that provided by @gwaygenomics) require?
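As a starting point for what the dataset could look like, the sketch below derives a sample × gene-set indicator matrix from per-sample mutated genes and a gene-set membership table. The pathway names, the gene IDs, and the helper `gene_set_matrix` are all hypothetical; real gene sets would come from the hetnet or another pathway resource.

```python
def gene_set_matrix(sample_to_genes, gene_sets):
    """Return (set_names, rows), where rows[sample][i] is 1 if the sample
    has at least one mutated gene in gene set set_names[i], else 0."""
    set_names = sorted(gene_sets)
    rows = {}
    for sample, mutated in sample_to_genes.items():
        rows[sample] = [
            int(bool(mutated & gene_sets[name])) for name in set_names
        ]
    return set_names, rows


# Hypothetical inputs: gene IDs and pathway names are placeholders.
sample_to_genes = {
    "sample_1": {1, 2},
    "sample_2": {3},
    "sample_3": set(),
}
gene_sets = {
    "pathway_A": {1, 9},
    "pathway_B": {3, 4},
}

set_names, rows = gene_set_matrix(sample_to_genes, gene_sets)
print(set_names)
for sample, values in rows.items():
    print(sample, values)
```

A matrix built this way has the same samples as the existing mutation matrix, so it could in principle be used as an alternative set of outcomes without changing the sample index.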

Persistent storage of matrices that enables quick indexed lookup

Currently, we're storing our datasets (which are matrices) as compressed TSVs, which are great for long-term interoperable storage. However, we'd love a way to look up specific rows and columns without having to read the entire dataset. We began discussing options at cognoma/cognoma#17 (comment). We want a persistent storage format (i.e. a file) that allows reading only specified rows and columns into a numpy array/matrix or a pandas dataframe.

A primary benchmark for judging implementations is how much time they save, across a variety of setups, compared with reading the entire bzipped TSV into Python via pandas.
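To illustrate what an indexed lookup could look like, here is one possible option sketched with SQLite from the Python standard library. This is not a format the project has chosen: the table name `expression`, the long (sample, gene, value) layout, and both helper functions are assumptions, and the result would still need converting to a numpy array or pandas dataframe.

```python
import sqlite3


def build_store(path, matrix):
    """Write a {sample_id: {gene_id: value}} matrix to a SQLite file.

    The composite primary key gives an index on (sample_id, gene_id).
    """
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS expression ("
        "sample_id TEXT, gene_id INTEGER, value REAL, "
        "PRIMARY KEY (sample_id, gene_id))"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO expression VALUES (?, ?, ?)",
        [(s, g, v) for s, genes in matrix.items() for g, v in genes.items()],
    )
    conn.commit()
    return conn


def lookup(conn, sample_ids, gene_ids):
    """Read only the requested rows (samples) and columns (genes)."""
    s_marks = ",".join("?" * len(sample_ids))
    g_marks = ",".join("?" * len(gene_ids))
    query = (
        "SELECT sample_id, gene_id, value FROM expression "
        f"WHERE sample_id IN ({s_marks}) AND gene_id IN ({g_marks})"
    )
    return {(s, g): v for s, g, v in conn.execute(query, [*sample_ids, *gene_ids])}


# Toy data; ":memory:" keeps the demo from writing a file.
matrix = {
    "sample_1": {1: 0.5, 2: 1.5},
    "sample_2": {1: 2.5, 2: 3.5},
}
conn = build_store(":memory:", matrix)
subset = lookup(conn, ["sample_2"], [2])
print(subset)
```

Any candidate format could be dropped into the same two-function shape and timed against the full bzipped TSV read.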

Export a gene information table

Similar to how we have sample information in samples.tsv, it would be nice to create a table with gene information. The primary identifier is entrez_gene_id. Additional columns could be:

  • symbol
  • name
  • chromosome
  • n_mutations - number of mutated samples
  • median_expression - median gene expression
  • mad_expression - median absolute deviation of gene expression
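The three computed columns can be sketched as follows. This is a minimal illustration, assuming per-gene lists of expression values and 0/1 mutation flags across samples; the helper names `mad` and `summarize_gene` and the toy numbers are made up.

```python
from statistics import median


def mad(values):
    """Median absolute deviation: median of |x - median(values)|."""
    center = median(values)
    return median([abs(v - center) for v in values])


def summarize_gene(expression, mutation_flags):
    """Compute the per-gene summary columns proposed above."""
    return {
        "n_mutations": sum(mutation_flags),
        "median_expression": median(expression),
        "mad_expression": mad(expression),
    }


# Toy values for a single gene across five samples.
expression = [1.0, 2.0, 4.0, 7.0, 10.0]
mutation_flags = [0, 1, 1, 0, 1]
summary = summarize_gene(expression, mutation_flags)
print(summary)
```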

I'm leaning towards a combined dataset for mutation and expression genes. But I could be convinced that splitting the datasets would be better.

We should probably get this information from entrez gene as @clairemcleod did in #12.

Labeling this issue a task awaiting a claimer.

Recurrence and Distant Metastasis

Which column in the clinical_data should I consider to know whether the tumor has recurred?

Does _RFS_IND=1 mean it definitely recurred?

How do I know if the tumor is primary or second primary?

How do I know if the tumor has recurred locally or distally?

Does clinical_M=M1 or pathologic_M=M1 mean it definitely metastasized?

Extract detailed mutation information for TCGA samples

In speaking with a cancer biologist and collaborator about Cognoma, it became clear that a huge win we could deliver relatively easily is classification performance (or classification scores) across different mutation types for an input gene. This would be extremely useful for a researcher who is interested in determining the pathogenicity of a particular mutation.

I believe that cognoma is an ideal way of approaching this problem. Typically, when genes mutate there is a range of severity regarding how the particular mutation impacts downstream changes. For a particularly virulent mutation, a classifier trained to detect an inactivation signature may output a higher score for those groups of samples, than other samples with a less virulent mutation.

In my eyes, this particular issue bypasses the machine learning group - they will still work with the previously defined Y matrices. However, in order for the backend to serve the frontend information from the database about each sample's mutation so that the frontend can visualize the results we need to know how to parse this information.

I looked briefly at the information embedded in the PANCAN mutation data, particularly the columns labeled HGVSc and HGVSp. These columns hold standard ways of storing specific mutation calls. More information about these standards is provided by the HGVS website.
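As a first step toward parsing this information, the sketch below handles only the simplest protein-level form: a one-letter amino acid substitution such as p.N33K, the style seen in the Amino_Acid_Change column of PANCAN_mutation. HGVS notation covers many more cases (frameshifts, deletions, three-letter codes), so the pattern and the helper `parse_substitution` are a deliberately narrow illustration, not a full parser.

```python
import re

# Matches "p." + reference residue + position + new residue (or "*" for stop).
SUBSTITUTION = re.compile(r"^p\.([A-Z])(\d+)([A-Z*])$")


def parse_substitution(change):
    """Return (reference, position, alternate) or None if not a simple
    one-letter substitution."""
    match = SUBSTITUTION.match(change)
    if match is None:
        return None
    ref, pos, alt = match.groups()
    return ref, int(pos), alt


print(parse_substitution("p.N33K"))
print(parse_substitution("p.R266L"))
print(parse_substitution("not-a-substitution"))
```

Grouping samples by the parsed position or residue change would then give the per-mutation-type groups whose classifier scores could be compared.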

Converting Xena datasets to standard identifiers rather than gene symbols

Xena datasets (as retrieved in #1) use symbols to identify genes rather than standardized identifiers, such as Entrez GeneIDs, ensembl gene IDs, HGNC IDs, or UCSC gene IDs. This has led to upstream data quality issues such as #4. Hence, I think it makes sense to code our databases using standardized identifiers.

Currently, we use the HiSeqV2 and TCGA.PANCAN.sampleMap datasets which both use symbols. Does anyone have a preferred identifier? I like Entrez GeneIDs.
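A conversion step could look roughly like the sketch below, which maps symbols to Entrez GeneIDs and reports anything it cannot map. The two-entry lookup table is only a stand-in for a real mapping built from Entrez Gene, and the helper name `symbols_to_entrez` is an assumption.

```python
# Tiny illustrative lookup; a real table would be built from Entrez Gene.
SYMBOL_TO_ENTREZ = {
    "TP53": 7157,
    "KRAS": 3845,
}


def symbols_to_entrez(symbols, mapping=SYMBOL_TO_ENTREZ):
    """Split symbols into a {symbol: entrez_id} dict and an unmapped list."""
    converted = {}
    unmapped = []
    for symbol in symbols:
        if symbol in mapping:
            converted[symbol] = mapping[symbol]
        else:
            unmapped.append(symbol)
    return converted, unmapped


converted, unmapped = symbols_to_entrez(["TP53", "KRAS", "1-Mar"])
print(converted)
print(unmapped)  # date-mangled symbols surface here instead of passing silently
```

Keeping the unmapped list explicit means corrupted or outdated symbols are reported rather than silently dropped.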
