mskcc-epi-bio / gnomer Goto Github PK

View Code? Open in Web Editor NEW

26.0 26.0 16.0 63.61 MB

Package to wrangle and visualize genomic data in R

Home Page: https://mskcc-epi-bio.github.io/gnomeR/

License: Other

R 100.00%

genomics mutations r r-package

gnomer's People

Contributors

Stargazers

Watchers

Forkers

mljaniczek lurkingvariable arorarshi jalavery christinez-msk carokos michaelcurry1123 allisonvmitch rnaimehaom hfuchs5 kfwins2022 slb2240 whitec4 toumban1 jflynn264 akriti21

gnomer's Issues

Plotting issues still outstanding

Following up on PR #93 (and meeting 2020-08-17)

For now I removed all of the scale_fill_manual() statements from plotting functions and am just using out of the box ggplot defaults. As discussed with @arorarshi, we may want to consider adding default already loaded palettes, however since it is very simple to customize ggplot objects (see code above) we may not need this.

We agreed this is fine for now, but if someone feels inclined they can add pancan palettes to the default plots in the future.

The previous version of the plotting function had the following filtering. Currently, I only included this in the SNV plotting functions. Should all plotting functions have this filtering?

 maf <- maf %>%
    filter(.data$Variant_Type == "SNP",
           .data$HGVSc != "") %>%
    mutate(SNV_Class = substrRight(.data$HGVSc,3)

This filtering code is only included in the ggsnvclass() function in the newest version. We may want to add a line to documentation about this, but may not be needed

Arshi and I discussed potentially removing p.variant.bp, p.variant.dist and p.variant.dist.bar and p.snv.dist to simplify the functions, as these may not be needed. We can discuss more in the meeting.

These are removed in the current version of code

I’m a little confused about p.corr versus p.comut. Is this just two ways to filter which genes are displayed? one shows top co-occuring genes and one shows correlations for overall top genes? If so, maybe we can put this in one function with an arg option? I may be misunderstanding this though.

Decided to keep both cor and comut functions. We need to confirm that these are being calculated correctly in plotting functions

Also just a note that we added a n_genes argument to plotting functions to allow user to control how many genes are displayed in plots.

If you make any changes relating to the above, please link to this issue

@arorarshi @AxelitoMartin @michaelcurry1123

Change binary matrix function to output a matrix instead of a list

We discussed potentially changing the binary matrix function to output a binary matrix instead of a list. Potentially throw the patient IDs as a warning instead of returning them.

@AxelitoMartin what do you prefer?

wishlist

TCGA pancan data prep
Common checks and visualizations for TCGA pancan
Helper functions to prepare omic data for iCluster
Helper functions to visualize iCluster results

@karissawhiting @AxelitoMartin any others you want to add to wishlist?

Percentags include missing data

Hi, Using gnomeR for a project (thanks for your efforts!) - and noticed that the percentages generated by the gen.summary function use all the samples in the matrix as the denominator rather than only the ones are sequenced for those genes. (This is in the context where you have data from multiple IMPACT panels.) So even if I filter at 5%, if I have genes that are sequenced in only a small number of patients, it will show percentages lower than 5%.

Relatedly, it would be great to have an option to filter out genes that are sequenced in less than some threshold of samples.

Let me know if you want specific examples.
Thank you!

Update oncoprint function to properly deal with NAs

Two options:

Show missing as explicit factor level (and calculate percentages with this as a separate group)
Omit NAs (like cbioportal oncoprints do)

@arorarshi @AxelitoMartin

Change the way mutation status argument is dealt with- soft check and warning for SOMATIC mutation status.

@arorarshi or @AxelitoMartin can you add description and a reprex to this issue?

`patients` in `binmat`

Need to check that samples passed through patients in binmat actually exist in input file

spe.plat argument

Need to modify this so it throws warning that not all samples were mutated on IM3-6 but still annotated those that were rather than overwriting the argument to FALSE

Rewrite oncoprint function

Potentially using ggplot.

Add IMPACT Heme panel (IH3 & IH4)

talk to @edrill about this.

Create vignette for oncokb annotation

The package needs better documentation for oncoKB annotation. We should add a separate vignette for this that includes instructions on how to get, save and use an oncoKB token.

As part of this, we need to test the oncoKB functionality within the package.

Edit pathway annotation using gene-level data in `add_pathways()` function

The oncogenic signaling pathways taken from the paper can only be annotated using an alteration-level matrix.
We should prevent users from annotating the default pathways with a gene-level alteration matrix, but keep the functionality for user who have a custom pathway (i.e. for exploratory analyses, or for other pathways with less-strict definitions).

Review spe.palt argument in binary matrix function

Consider renaming
Throw a warning when spe.palt = FALSE if there are multiple platforms detected

For the second item, we can look at the patient ID's IMX codes to tell if there are multiple platforms.

Check column names of input df and allow create_binary_matrix() to accept maf, api or snakecase version of maf names

We need to improve handling of column names in create_binary_matrix(). The function should by default support: traditional maf names (see link below), lower snakecase version of maf names, and API column names.

This needs to be done for mutation (maf), cna, and structural variant data eventually but we will start with mutation (maf).

Look at maf data specifications and compare to API specifications. You can pull API data using cbioportalR. Make sure you install the development version of the package

library(cbioportalR)
set_cbioportal_db("msk")
msk_2017 <- get_genetics_by_study("msk_impact_2017")

ss <- available_samples("msk_impact_2017")
h_ss <- head(ss, n = 100)

msk_2017 <- get_genetics_by_sample(sample_id = h_ss$sampleId)
msk_2017$mutation

Make a table for internal documentation of the column names and how they match up between maf and API (double check cbioportal docs as well regarding their data formats). (See #158 and maybe integrate the table directly in this vignette eventually)
Discuss above table and decide what level of flexibility we want to accept with column names
Create function that checks and guesses column names before passing the data into create_binary_matrix()
Decide on minimum columns needed and which columns support which create_binary_matrix() arguments. Improve errors/messaging for when certain columns are missing (e.g. variant type = "germline")

fix rm.empty and spe.plat

Simple function to create matrix for gene pathways

Have user specify genes in pathway, and output matrix columns with those tallied.

Rework the Visualizations Vignette

Code needs to be tested and language reviewed.

Previous vignettes have been removed from the package. Please let me know if you plan to work on this and I can share the previous version to work off of.

Improve FACETS Functionality

@arorarshi @AxelitoMartin Can you add some details about what you are envisioning for this?

Some imports should be moved to suggests?

Several imports are in the package that I'm not sure are necessary.

Imports should be reserved for packages whose functions we need to do our functions.

For example:
ComplexHeatmap
GGally
plotly

Do we actually use functions from these packages within any of our functions? If so, they should clearly be marked within our functions as ComplexHeatmap::add_heatmap (should never just be "naked" like add_heatmap). But, I don't see functions from these packages being called this way anywhere in the package.

To make the package "leaner" I would recommend we cut down on unnecessary imports, or else make more clear why we require these inputs!

@AxelitoMartin @arorarshi @karissawhiting any thoughts on this?

tests - check for missing columns,

test on seq.plat, samples with no mutation

Separate out panel-specific NA annotation from binmat() and make it's own function

Let's make this into a separate function that can be called within binmat(). This function should work with IMPACT or GENIE panels (add tests to check).

Should also allow you to pass sample/panel pairs so you can differentially annotate for missingness depending on the panel that sample was tested on.

Add travis CI and testthat code coverage

Check CNA data type and CNA with oncoKB annotations

Currently gnomeR expects a matrix for CNA data. This causes the oncoKB functionality to not work.

move this check to after we have checked that a Variant_Classification column exists.

https://github.com/AxelitoMartin/gnomeR/blob/539ad9fb01c1105c443ec0e5af872889dcbb3cc3/R/make-bin-mat.R#L61

L61-L69

Improve Test Coverage

If there are already unit tests commented out on any file you are working on, feel free to use/revamp them or replace them

‼️ ~ Please comment below on this issue or add your name to a task on this list below if you plan to work on a task so we don't duplicate effort ~ ‼️

OVERALL TEST COVERAGE: 76.8%

tbl_genomic() tests - priority

Add tests for deprecation errors
Passing gene_binary explicitly: tbl_genomic(gene_binary = gene_binary) may cause error in some cases. I commented out this test for now, but we should revisit.
Review rules for wrapper functions for tbl_summary (see {gtreg} as an example).
May need more tests of other_vars arguments
Read vignette updates to make sure they make sense

Test sanitize_data() functions in utils.R

line 24 check if cli_abort works for missing columns
line 37 clin_abort fusions in mutation data frame
lines 44-45 mutation status in column names warning
all of if variant_type is in column names loop
lines 111 abort if no sample ID found
lines 117 abort if no hugo symbol column found
159-160 abort if number of missing columns is > 0

Test pathway functions

Here are some steps to help you get started contributing to this effort at the hackathon: (left from previous version of issue in case we need it later)

If you've never written unit tests, here is a great resource to skim in order to get an introduction to testing https://r-pkgs.org/tests.html
Fork, clone and branch the repository (see github trainings)
Once you've done the above to get your own copy, check that you can install the package and run checks ("Build" tab in RStudio). Try checking the package test coverage with withr::with_envvar(new = c("NOT_CRAN" = "true"), covr::report()). This will show you the test coverage of each file and the package overall. One of our goals is to increase this coverage percentage.
Read the documentation for the argument you are testing (it might also need updating).
Start writing tests with correctly specified args passed and also incorrect args passed. Check the following, as well as any common user behavior you can think of:
- When you specify the argument correctly (according to the documentation) does it return what you would expect ?
- When you don’t specify the argument at all, does it return what you expect?
- When you incorrectly specify the arg, does it give you a useful warning or message?

Additionally, I've added some suggestions for tests in the test files themselves (in the tests folder of the package)

Please add tests to the tests R files (in test folder). For tests you can use internal data (e.g gnomerR::mut, gnomerR::fusion, gnomerR::cna), a small sample of real data, or create your own fake data.
Additionally, if the function is working as expected but the function documentation is not clear, please edit the documentation (R folder in package). If any functions don’t work as expected, please make or suggest a change to the documentation or the function itself. If you can fix it yourself, please submit a PR with a change. For example, if you think the error/warning messages are unclear, please add warnings to the function. If you can’t fix it yourself, no problem, but please file an issue for your suggestion so we can make sure to address it!

Review names of all functions and files

Discussed doing a review of function names and file names. Here are some tips from the tidyverse style guide.

For file names, we discussed changing file names to be same as function names when there is only one function in the file.

Update all the vignettes

rewrite readme
Need tutorial for oncokb annotation
and get token for oncokb

To-dos for 2021-02-24 Release

@arorarshi @AxelitoMartin

Review `gen_summary()` and `gen_uni_cox()` and consider switching to use {gtsummary} functions

gen_summary() and gen_uni_cox() functionality overlaps with {gtsummary} tbl_summary() and tbl_uvregression() functions. Since {gtsummary} is very fully-featured, we may want to consider making these functions wrap existing {gtsummary} functions, or consider removing these functions altogether.

One thing to consider is speed since genomic data sets are often very large. We should test {gtsummary} functions with large genomic data sets first to see if it's worth keeping these less fully featured, but faster versions.

Add IMPACT 505 to impact_gene_info (internal package data)

The newest IMPACT panel needs to be added to the internal pacakge dataframe gnomeR::impact_gene_info. Review the impact-gene-info.R script (in data-raw folder) and add code there to do so. The code to add the previous panels is here and can be used as reference.

You will need to:

pull new panel from cbioportalR::get_gene_pane("IMPACT505")
Look up aliases for all genes with cbioportalR::get_alias()
Add this information to gnomeR::impact_gene_info dataframe.

Troubleshoot CNA data with panel NA annotation & gene name alias functionality

reflect change of gene names in CN and fusion data types
when cna.binary = FALSE - return the rownames of data.frame when running with cna alone - binmat(cna=cna)
when cna.binary= FALSE - specify.plat is not counted for? genes need to be censored for different IM platforms

Create {pkgdown} website

Create a package website

@margarethannum

CNA events - process and analysis

Processing CNA files

Have the gene columns as gene_CN and then have the discrete states as is. Have an option to also return CN events as binary. (This can be in make-bin-mat.R or make-oncoprint-dat.R)
In cna file, there are no NA, even if the sample was sequenced on a different platform for a gene that didn't exist in the platform. (Find out)
What type of an event is -1.5 in CN? Maybe there is a new file for CN (Find out)

Analysis
4. Analyze Copy Number Alterations (CNA) wrt to an entire gene and perform association test on all observed CN states of the gene

Review oncoKB Functionality

oncoKB removed from binary_matrix() for now. Need to work this back into the package and test.

oncoKB as in an R environment was fully validated with its python counterpart. We get the same annotations. However, oncoKB versions should be kept in mind, and we should try and update with their latest version.

oncoKb version should be noted or should be a part of gnomeR output
an option to provide clinical file (improvement)
passing of other oncoKB parameters

Modularize binary matrix function- split out separate sub functions for mutation, CN and fusion

We discussed making the binary matrix function more modular by splitting it out into 3 pre-processing functions for mutations, CNA and fusions. These would then be called within the overarching matrix function. This will help with future testing and troubleshooting.

Additionally, should the user have an option in the binary matrix function to combine mutations and CNAs and make an "alterations" matrix? If so, these could be first computed separately by the individually cleaning functions and then combined before being returned to user.

We discussed possibly making the binary matrix function into a method function in the future, but TBD.

oncoKB additions

@arorarshi
@AxelitoMartin

Integrate with cbioPortal Web API to gnomeR

I think a good start on this task will be to adjust the create.bin.matrix() function to accept column names other than MAF standard names as arguments. The MAF standards can be default.

Rework the Data Processing Vignette

Code needs to be tested and language reviewed.

Previous vignettes have been removed from the package. Please let me know if you plan to work on this and I can share the previous version to work off of.

Mislabeling of legend in oncoprint

Following the vignette, I run the command plot_oncoPrint(gen.dat = select(bmat, all_of(genes))). I get the following messages in console:
All mutation types: MUT, AMP, DEL.
Following at are removed: FUS, MIS, because no color was defined for them.
Following at are removed: FUS, MIS, because no color was defined for them.
Following at are removed: FUS, MIS, because no color was defined for them.

Attaching resulting oncoprint:

The correct labeling should be:
Blue square = Deletion
Red square = Amplification
Green line = Mutation

I ran the above code with both the regular and development versions of gnomeR.

Thank you!

Figure out how to fix all the package notes

This isn't urgent but should be done eventually, especially if we are planning to submit to CRAN. There are a couple in particular that I can't figure out like the one about the large size of the docs folder or how to bind the python functions so we don't get the no visible bindings note for those.

Export and add documentation for a list of IMPACT genes

Allow users to easily access the list of IMPACT genes and panels. Export this list to users and add documentation for it with a source reference.

Add a check for genes that are in user-provided MAF but not in IMPACT panel gene list

Thrown an error if a gene isn't part of any IMPACT platform. Additionally, there should be a check for any genes that have been renamed.

The check should probably be based off list from issue #53

Review binmat function and make more modular

Consider breaking up function to make it more modular. This will make it easier to test and troubleshoot.

Combine alias warning messages in create_gene_binary()

Currently create_binary_matrix() outputs separate sets of notes for which genes are relabeled as aliases. It would be best to capture these and concatenate into 1 message/list (can probably use {purrr} functions for this).

Change mut_type argument default to keep NAs and only eliminate GERMLINE as default

Currently the mut_type argument of binary_matrix() returns only mutations marked as SOMATIC in maf. This also removes any unknown mutation types (usually in file as NA or ""). Often when you pull data from the API, mutation type is blank for many observations. These are almost always actually SOMATIC so we should not exclude these by default.

Change the binary_matrix() function to remove GERMLINE by default but keep everything else including blanks. Maybe we can call this option NO_GERMLINE?
Add a warning (cli::cli_warn()) that notes you have X number of mutations marked as blank. Let user know these are retained in resulting binary matrix.
Still allow user to change to SOMATIC or GERMLINE or ALL
Edit documentation to make this edit clear
Add tests for the argument and the changes ensuring NA/blanks are returned by default (try to find a small example from API data to test a well, see {cbioportalR})

Add pathways function

I removed the pathways argument from the binary matrix function, but we still have the custom_pathway() function. I'm thinking we should remove this as well and have one unified function called add_pathway() that can annotate pre-set pathways included in the package, or allow you to pass a custom pathway. It could be used as follows:

binary_matrix(mutation = mut, cna = cna,  fusions = fus) %>%
    add_pathway(pathway = "all")

Where the pathway argument would be "all" be default, allowing you to annotate all package-included pathways. (see gnomeR::pathways). Otherwise you could specify a specific pathway_id like "RTK-RAS".

binary_matrix(mutation = mut, cna = cna,  fusions = fus) %>%
    add_pathway(pathway = "RTK-RAS")

Additionally, you could add a custom pathway with:

binary_matrix(mutation = mut, cna = cna,  fusions = fus) %>%
    add_pathway(custom_pathway = c("TP53", "APC", "FGFR3"))

First step of this will be updating the gnomeR::pathways internal data which is currently incorrect. See pathways.R which pulls data from pathways.csv to create this internal data frame. You can update pathways.csv directly.

Figure out the best way to deal with non MSK (-IM, -IH) samples

Right now if I query a non-MSK sample ID using the API and then feed the genetic data into binary matrix it looks as if that sample has no mutations even though it may have mutations that were just not accessible in the IMPACT database. It also overwrites spe.plat which is a pain.

I'm not sure the best way to deal with these non MSK samples yet, and I also haven't been able to find a way to query all sample mutation data within a study in a way that would allow access to these non MSK samples through the API. I would think it would be possible considering the data is on cbioportal somewhere.

Review `gen.summary()` and univariate functions

Consider removing these summary functions and instead encouraging people to use {gtsummary}, as this functionality is more fully featured already there. We can deprecate these and if someone calls these functions, we can throw a message with suggestion/code to use tbl_uvregression()

Create Vignette that uses {cbioportalR} to pull data and {gnomeR} to process data

We should have a case study vignette that uses data pulled directly from cBioPortal in {gnomeR} functions. See {cbioportalR} vignette for instructions on how to pull data: https://github.com/karissawhiting/cbioportalR

Rework the Analyzing Genomic Data Vignette

Code needs to be tested and language reviewed.

Previous vignettes have been removed from the package. Please let me know if you plan to work on this and I can share the previous version to work off of.

mskcc-epi-bio / gnomer Goto Github PK

gnomer's People

Contributors

Stargazers

Watchers

Forkers

gnomer's Issues

tbl_genomic() tests - priority

Test sanitize_data() functions in utils.R

Test pathway functions

Here are some steps to help you get started contributing to this effort at the hackathon: (left from previous version of issue in case we need it later)

Recommend Projects

Recommend Topics

Recommend Org