Giter Club home page Giter Club logo

gnomer's People

Contributors

akriti21 avatar alrein-05 avatar arorarshi avatar axelitomartin avatar carokos avatar christinez-msk avatar edrill avatar hfuchs5 avatar jalavery avatar jflynn264 avatar karissawhiting avatar michaelcurry1123 avatar mljaniczek avatar slb2240 avatar toumban1 avatar whitec4 avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar

gnomer's Issues

Plotting issues still outstanding

Following up on PR #93 (and meeting 2020-08-17)

  • For now I removed all of the scale_fill_manual() statements from plotting functions and am just using out of the box ggplot defaults. As discussed with @arorarshi, we may want to consider adding default already loaded palettes, however since it is very simple to customize ggplot objects (see code above) we may not need this.

We agreed this is fine for now, but if someone feels inclined they can add pancan palettes to the default plots in the future.

  • The previous version of the plotting function had the following filtering. Currently, I only included this in the SNV plotting functions. Should all plotting functions have this filtering?
 maf <- maf %>%
    filter(.data$Variant_Type == "SNP",
           .data$HGVSc != "") %>%
    mutate(SNV_Class = substrRight(.data$HGVSc,3)

This filtering code is only included in the ggsnvclass() function in the newest version. We may want to add a line to documentation about this, but may not be needed

  • Arshi and I discussed potentially removing p.variant.bp, p.variant.dist and p.variant.dist.bar and p.snv.dist to simplify the functions, as these may not be needed. We can discuss more in the meeting.

These are removed in the current version of code

  • I’m a little confused about p.corr versus p.comut. Is this just two ways to filter which genes are displayed? one shows top co-occuring genes and one shows correlations for overall top genes? If so, maybe we can put this in one function with an arg option? I may be misunderstanding this though.

Decided to keep both cor and comut functions. We need to confirm that these are being calculated correctly in plotting functions

Also just a note that we added a n_genes argument to plotting functions to allow user to control how many genes are displayed in plots.

If you make any changes relating to the above, please link to this issue

@arorarshi @AxelitoMartin @michaelcurry1123

wishlist

  • TCGA pancan data prep

  • Common checks and visualizations for TCGA pancan

  • Helper functions to prepare omic data for iCluster

  • Helper functions to visualize iCluster results

@karissawhiting @AxelitoMartin any others you want to add to wishlist?

Percentags include missing data

Hi, Using gnomeR for a project (thanks for your efforts!) - and noticed that the percentages generated by the gen.summary function use all the samples in the matrix as the denominator rather than only the ones are sequenced for those genes. (This is in the context where you have data from multiple IMPACT panels.) So even if I filter at 5%, if I have genes that are sequenced in only a small number of patients, it will show percentages lower than 5%.

Relatedly, it would be great to have an option to filter out genes that are sequenced in less than some threshold of samples.

Let me know if you want specific examples.
Thank you!

`patients` in `binmat`

Need to check that samples passed through patients in binmat actually exist in input file

spe.plat argument

Need to modify this so it throws warning that not all samples were mutated on IM3-6 but still annotated those that were rather than overwriting the argument to FALSE

Create vignette for oncokb annotation

The package needs better documentation for oncoKB annotation. We should add a separate vignette for this that includes instructions on how to get, save and use an oncoKB token.

As part of this, we need to test the oncoKB functionality within the package.

Edit pathway annotation using gene-level data in `add_pathways()` function

The oncogenic signaling pathways taken from the paper can only be annotated using an alteration-level matrix.
We should prevent users from annotating the default pathways with a gene-level alteration matrix, but keep the functionality for user who have a custom pathway (i.e. for exploratory analyses, or for other pathways with less-strict definitions).

Review spe.palt argument in binary matrix function

  • Consider renaming
  • Throw a warning when spe.palt = FALSE if there are multiple platforms detected

For the second item, we can look at the patient ID's IMX codes to tell if there are multiple platforms.

Check column names of input df and allow create_binary_matrix() to accept maf, api or snakecase version of maf names

We need to improve handling of column names in create_binary_matrix(). The function should by default support: traditional maf names (see link below), lower snakecase version of maf names, and API column names.

This needs to be done for mutation (maf), cna, and structural variant data eventually but we will start with mutation (maf).

  1. Look at maf data specifications and compare to API specifications. You can pull API data using cbioportalR. Make sure you install the development version of the package
library(cbioportalR)
set_cbioportal_db("msk")
msk_2017 <- get_genetics_by_study("msk_impact_2017")

ss <- available_samples("msk_impact_2017")
h_ss <- head(ss, n = 100)

msk_2017 <- get_genetics_by_sample(sample_id = h_ss$sampleId)
msk_2017$mutation
  1. Make a table for internal documentation of the column names and how they match up between maf and API (double check cbioportal docs as well regarding their data formats). (See #158 and maybe integrate the table directly in this vignette eventually)
  2. Discuss above table and decide what level of flexibility we want to accept with column names
  3. Create function that checks and guesses column names before passing the data into create_binary_matrix()
  4. Decide on minimum columns needed and which columns support which create_binary_matrix() arguments. Improve errors/messaging for when certain columns are missing (e.g. variant type = "germline")

Rework the Visualizations Vignette

Code needs to be tested and language reviewed.

Previous vignettes have been removed from the package. Please let me know if you plan to work on this and I can share the previous version to work off of.

Some imports should be moved to suggests?

Several imports are in the package that I'm not sure are necessary.

Imports should be reserved for packages whose functions we need to do our functions.

For example:
ComplexHeatmap
GGally
plotly

Do we actually use functions from these packages within any of our functions? If so, they should clearly be marked within our functions as ComplexHeatmap::add_heatmap (should never just be "naked" like add_heatmap). But, I don't see functions from these packages being called this way anywhere in the package.

To make the package "leaner" I would recommend we cut down on unnecessary imports, or else make more clear why we require these inputs!

@AxelitoMartin @arorarshi @karissawhiting any thoughts on this?

Improve Test Coverage

If there are already unit tests commented out on any file you are working on, feel free to use/revamp them or replace them

‼️ ~ Please comment below on this issue or add your name to a task on this list below if you plan to work on a task so we don't duplicate effort ~ ‼️

OVERALL TEST COVERAGE: 76.8%

tbl_genomic() tests - priority

  • Add tests for deprecation errors
  • Passing gene_binary explicitly: tbl_genomic(gene_binary = gene_binary) may cause error in some cases. I commented out this test for now, but we should revisit.
  • Review rules for wrapper functions for tbl_summary (see {gtreg} as an example).
  • May need more tests of other_vars arguments
  • Read vignette updates to make sure they make sense

Test sanitize_data() functions in utils.R

  • line 24 check if cli_abort works for missing columns
  • line 37 clin_abort fusions in mutation data frame
  • lines 44-45 mutation status in column names warning
  • all of if variant_type is in column names loop
  • lines 111 abort if no sample ID found
  • lines 117 abort if no hugo symbol column found
  • 159-160 abort if number of missing columns is > 0

Test pathway functions

Here are some steps to help you get started contributing to this effort at the hackathon: (left from previous version of issue in case we need it later)

  1. If you've never written unit tests, here is a great resource to skim in order to get an introduction to testing https://r-pkgs.org/tests.html
  2. Fork, clone and branch the repository (see github trainings)
  3. Once you've done the above to get your own copy, check that you can install the package and run checks ("Build" tab in RStudio). Try checking the package test coverage with withr::with_envvar(new = c("NOT_CRAN" = "true"), covr::report()). This will show you the test coverage of each file and the package overall. One of our goals is to increase this coverage percentage.
  4. Read the documentation for the argument you are testing (it might also need updating).
  5. Start writing tests with correctly specified args passed and also incorrect args passed. Check the following, as well as any common user behavior you can think of:
    • When you specify the argument correctly (according to the documentation) does it return what you would expect ?
    • When you don’t specify the argument at all, does it return what you expect?
    • When you incorrectly specify the arg, does it give you a useful warning or message?

Additionally, I've added some suggestions for tests in the test files themselves (in the tests folder of the package)

  1. Please add tests to the tests R files (in test folder). For tests you can use internal data (e.g gnomerR::mut, gnomerR::fusion, gnomerR::cna), a small sample of real data, or create your own fake data.
  2. Additionally, if the function is working as expected but the function documentation is not clear, please edit the documentation (R folder in package). If any functions don’t work as expected, please make or suggest a change to the documentation or the function itself. If you can fix it yourself, please submit a PR with a change. For example, if you think the error/warning messages are unclear, please add warnings to the function. If you can’t fix it yourself, no problem, but please file an issue for your suggestion so we can make sure to address it!

Review names of all functions and files

Discussed doing a review of function names and file names. Here are some tips from the tidyverse style guide.

For file names, we discussed changing file names to be same as function names when there is only one function in the file.

To-dos for 2021-02-24 Release

  • test and confirm binmat
  • test and confirm pathways functionality
  • test and confirm alias functionalities
  • test and confirm plotting functions
  • add unit tests as needed to the above functions
  • review all documentation
  • @karissawhiting - make vignette for cbioportalr
  • once everything is tested and confirmed, push all changes to master branch
  • clear warnings and notes
  • mark functions as experimental? (do we want to do this for TMB function?) - @michaelcurry1123 may help with this
  • make official release

@arorarshi @AxelitoMartin

Review `gen_summary()` and `gen_uni_cox()` and consider switching to use {gtsummary} functions

gen_summary() and gen_uni_cox() functionality overlaps with {gtsummary} tbl_summary() and tbl_uvregression() functions. Since {gtsummary} is very fully-featured, we may want to consider making these functions wrap existing {gtsummary} functions, or consider removing these functions altogether.

One thing to consider is speed since genomic data sets are often very large. We should test {gtsummary} functions with large genomic data sets first to see if it's worth keeping these less fully featured, but faster versions.

Add IMPACT 505 to impact_gene_info (internal package data)

The newest IMPACT panel needs to be added to the internal pacakge dataframe gnomeR::impact_gene_info. Review the impact-gene-info.R script (in data-raw folder) and add code there to do so. The code to add the previous panels is here and can be used as reference.

You will need to:

  1. pull new panel from cbioportalR::get_gene_pane("IMPACT505")
  2. Look up aliases for all genes with cbioportalR::get_alias()
  3. Add this information to gnomeR::impact_gene_info dataframe.

CNA events - process and analysis

Processing CNA files

  1. Have the gene columns as gene_CN and then have the discrete states as is. Have an option to also return CN events as binary. (This can be in make-bin-mat.R or make-oncoprint-dat.R)
  2. In cna file, there are no NA, even if the sample was sequenced on a different platform for a gene that didn't exist in the platform. (Find out)
  3. What type of an event is -1.5 in CN? Maybe there is a new file for CN (Find out)

Analysis
4. Analyze Copy Number Alterations (CNA) wrt to an entire gene and perform association test on all observed CN states of the gene

Review oncoKB Functionality

oncoKB removed from binary_matrix() for now. Need to work this back into the package and test.

oncoKB as in an R environment was fully validated with its python counterpart. We get the same annotations. However, oncoKB versions should be kept in mind, and we should try and update with their latest version.

  • oncoKb version should be noted or should be a part of gnomeR output
  • an option to provide clinical file (improvement)
  • passing of other oncoKB parameters

Modularize binary matrix function- split out separate sub functions for mutation, CN and fusion

We discussed making the binary matrix function more modular by splitting it out into 3 pre-processing functions for mutations, CNA and fusions. These would then be called within the overarching matrix function. This will help with future testing and troubleshooting.

Additionally, should the user have an option in the binary matrix function to combine mutations and CNAs and make an "alterations" matrix? If so, these could be first computed separately by the individually cleaning functions and then combined before being returned to user.

We discussed possibly making the binary matrix function into a method function in the future, but TBD.

Integrate with cbioPortal Web API to gnomeR

I think a good start on this task will be to adjust the create.bin.matrix() function to accept column names other than MAF standard names as arguments. The MAF standards can be default.

Rework the Data Processing Vignette

Code needs to be tested and language reviewed.

Previous vignettes have been removed from the package. Please let me know if you plan to work on this and I can share the previous version to work off of.

Mislabeling of legend in oncoprint

Following the vignette, I run the command plot_oncoPrint(gen.dat = select(bmat, all_of(genes))). I get the following messages in console:
All mutation types: MUT, AMP, DEL.
Following at are removed: FUS, MIS, because no color was defined for them.
Following at are removed: FUS, MIS, because no color was defined for them.
Following at are removed: FUS, MIS, because no color was defined for them.

Attaching resulting oncoprint:
oncoprint

The correct labeling should be:
Blue square = Deletion
Red square = Amplification
Green line = Mutation

I ran the above code with both the regular and development versions of gnomeR.

Thank you!

Figure out how to fix all the package notes

This isn't urgent but should be done eventually, especially if we are planning to submit to CRAN. There are a couple in particular that I can't figure out like the one about the large size of the docs folder or how to bind the python functions so we don't get the no visible bindings note for those.

Combine alias warning messages in create_gene_binary()

Currently create_binary_matrix() outputs separate sets of notes for which genes are relabeled as aliases. It would be best to capture these and concatenate into 1 message/list (can probably use {purrr} functions for this).

Change mut_type argument default to keep NAs and only eliminate GERMLINE as default

Currently the mut_type argument of binary_matrix() returns only mutations marked as SOMATIC in maf. This also removes any unknown mutation types (usually in file as NA or ""). Often when you pull data from the API, mutation type is blank for many observations. These are almost always actually SOMATIC so we should not exclude these by default.

  • Change the binary_matrix() function to remove GERMLINE by default but keep everything else including blanks. Maybe we can call this option NO_GERMLINE?
  • Add a warning (cli::cli_warn()) that notes you have X number of mutations marked as blank. Let user know these are retained in resulting binary matrix.
  • Still allow user to change to SOMATIC or GERMLINE or ALL
  • Edit documentation to make this edit clear
  • Add tests for the argument and the changes ensuring NA/blanks are returned by default (try to find a small example from API data to test a well, see {cbioportalR})

Add pathways function

I removed the pathways argument from the binary matrix function, but we still have the custom_pathway() function. I'm thinking we should remove this as well and have one unified function called add_pathway() that can annotate pre-set pathways included in the package, or allow you to pass a custom pathway. It could be used as follows:

binary_matrix(mutation = mut, cna = cna,  fusions = fus) %>%
    add_pathway(pathway = "all")

Where the pathway argument would be "all" be default, allowing you to annotate all package-included pathways. (see gnomeR::pathways). Otherwise you could specify a specific pathway_id like "RTK-RAS".

binary_matrix(mutation = mut, cna = cna,  fusions = fus) %>%
    add_pathway(pathway = "RTK-RAS")

Additionally, you could add a custom pathway with:

binary_matrix(mutation = mut, cna = cna,  fusions = fus) %>%
    add_pathway(custom_pathway = c("TP53", "APC", "FGFR3"))

First step of this will be updating the gnomeR::pathways internal data which is currently incorrect. See pathways.R which pulls data from pathways.csv to create this internal data frame. You can update pathways.csv directly.

Figure out the best way to deal with non MSK (-IM, -IH) samples

Right now if I query a non-MSK sample ID using the API and then feed the genetic data into binary matrix it looks as if that sample has no mutations even though it may have mutations that were just not accessible in the IMPACT database. It also overwrites spe.plat which is a pain.

I'm not sure the best way to deal with these non MSK samples yet, and I also haven't been able to find a way to query all sample mutation data within a study in a way that would allow access to these non MSK samples through the API. I would think it would be possible considering the data is on cbioportal somewhere.

Review `gen.summary()` and univariate functions

Consider removing these summary functions and instead encouraging people to use {gtsummary}, as this functionality is more fully featured already there. We can deprecate these and if someone calls these functions, we can throw a message with suggestion/code to use tbl_uvregression()

Rework the Analyzing Genomic Data Vignette

Code needs to be tested and language reviewed.

Previous vignettes have been removed from the package. Please let me know if you plan to work on this and I can share the previous version to work off of.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.