msk-access / access_data_analysis

Scripts for downstream analysis and plotting of pipeline output

Home Page: https://cmo-ci.gitbook.io/cmo-access-data-analysis/

License: Apache License 2.0

Languages: R 61.81%, Python 36.92%, Shell 0.89%, CSS 0.39%

Topics: analysis, data-analysis, msk-access, python, r

access_data_analysis's Issues

This has to be unique.

access_data_analysis/R/filter_calls.R

Line 62 in f774b31

sample.type = sample.sheet[Sample_Barcode == sample.name]$Sample_Type

To avoid mutate function erroring out:

❯ Rscript ./filter_calls.R -m $PWD/access_data_analysis_inputs.tsv -o $PWD/result_27Jan2022
---------------
Arguments input:
/home/shahr2/bergerlab/Project_12672_B/small_variants/access_data_analysis_inputs.tsv
/home/shahr2/bergerlab/Project_12672_B/small_variants/result_27Jan2022
/juno/work/access/production/resources/dmp_signedout_CH/current/signedout_CH.txt
stringent
---------------
[1] "Processing patient C-E3C1KC"
[1] "list"
Error: Column `Tumor_Sample_Barcode` must be length 36 (the number of rows) or one, not 2
$`suppressWarnings(filter_calls(fread(master.ref), results.dir, chlist, crite`
<environment: 0x5560b0cd5958>

$`withCallingHandlers(expr, warning = function(w) if (inherits(w, classes)) t`
<environment: 0x5560b0cdb398>

$`filter_calls(fread(master.ref), results.dir, chlist, criteria)`
<environment: 0x5560b0cd9f08>

$`lapply(unique(master.ref$cmo_patient_id), function(x) {\n    print(paste0("P`
<environment: 0x5560b0cdd360>

$`FUN(X[[i]], ...)`
<environment: 0x5560b0d22688>

$`do.call(rbind, lapply(fillouts.filenames, function(y) {\n    sample.name = g`
<environment: 0x5560b0d22148>

$`eval(lhs, parent, parent)`
<environment: 0x5560b0d2c308>

$`eval(lhs, parent, parent)`
<environment: 0x5560b0d22688>

$`do.call(rbind, lapply(fillouts.filenames, function(y) {\n    sample.name = g`
<environment: 0x5560b0d2fd90>

$`lapply(fillouts.filenames, function(y) {\n    sample.name = gsub(".*./|-ORG.`
<environment: 0x5560b0d33c08>

$`FUN(X[[i]], ...)`
<environment: 0x5560b0d3b690>

$`maf.file %>% mutate(Tumor_Sample_Barcode = paste0(sample.name, "___", sampl`
<environment: 0x5560b0d3b348>

$`withVisible(eval(quote(`_fseq`(`_lhs`)), env, env))`
<environment: 0x5560b0d3d150>

$`eval(quote(`_fseq`(`_lhs`)), env, env)`
<environment: 0x5560b0d3cd98>

$`eval(quote(`_fseq`(`_lhs`)), env, env)`
<environment: 0x5560b0d3ad28>

$``_fseq`(`_lhs`)`
<environment: 0x5560b0d40420>

$`freduce(value, `_function_list`)`
<environment: 0x5560b0d40298>

$`function_list[[i]](value)`
<environment: 0x5560b0d3fce8>

$`mutate(., Tumor_Sample_Barcode = paste0(sample.name, "___", sample.type))`
<environment: 0x5560b0d3f9d8>

$`mutate.data.frame(., Tumor_Sample_Barcode = paste0(sample.name, "___", samp`
<environment: 0x5560b0d3f3b8>

$`as.data.frame(mutate(tbl_df(.data), ...))`
<environment: 0x5560b0d3ef20>

$`mutate(tbl_df(.data), ...)`
<environment: 0x5560b0d3eaf8>

$`mutate.tbl_df(tbl_df(.data), ...)`
<environment: 0x5560b0d42490>

$`mutate_impl(.data, dots, caller_env())`
<environment: 0x5560b0d41c40>

$`stop(list("Column `Tumor_Sample_Barcode` must be length 36 (the number of r`
<environment: 0x5560b0d413f0>

attr(,"error.message")
[1] "Error: Column `Tumor_Sample_Barcode` must be length 36 (the number of rows) or one, not 2\n"
attr(,"class")

Solution:

sample.type = unique(sample.sheet[Sample_Barcode == sample.name]$Sample_Type)
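
The error above arises because the lookup returns one value per matching sample-sheet row, so a barcode listed twice yields two sample types where a single value is expected. The same guard can be sketched in Python. This is a minimal illustration, not the repository's code: the function name `lookup_sample_type` and the example rows are hypothetical, and only the column names come from the R snippet.

```python
def lookup_sample_type(sample_sheet, sample_name):
    """Return the single Sample_Type recorded for sample_name.

    sample_sheet is a list of dicts with 'Sample_Barcode' and 'Sample_Type'
    keys (a hypothetical in-memory stand-in for the sample sheet).
    """
    # Collect every type listed for this barcode; duplicated rows collapse
    # to one entry, mirroring unique() in the R fix.
    types = {
        row["Sample_Type"]
        for row in sample_sheet
        if row["Sample_Barcode"] == sample_name
    }
    if len(types) != 1:
        # Deduplication alone would still pass two conflicting types through,
        # so fail with a clear message instead of erroring later on.
        raise ValueError(
            f"expected exactly one Sample_Type for {sample_name}, "
            f"found {sorted(types)}"
        )
    return types.pop()


# Hypothetical sample sheet in which barcode S1 is listed twice.
sheet = [
    {"Sample_Barcode": "S1", "Sample_Type": "Plasma"},
    {"Sample_Barcode": "S1", "Sample_Type": "Plasma"},
    {"Sample_Barcode": "S2", "Sample_Type": "Normal"},
]
print(lookup_sample_type(sheet, "S1"))
```

Note that `unique()` only helps when the duplicated rows agree; the explicit length check covers the case where the same barcode carries two different types.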

Wait for all `compile_reads` jobs to complete

Add command line arguments for patient report R markdown

DMP ID

For compile_reads.R:

If the DMP ID is missing, what should the Master REF contain in that column?
If the DMP ID is present but not found in the 12-245 key file, the script should exit gracefully with a proper error message.
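
The two cases above can be told apart before any processing starts. The Python sketch below is only an illustration of that control flow: the function names, the example IDs, and the choice to treat an empty string or `NA` as "no DMP ID" are all assumptions, since the issue leaves the placeholder value open.

```python
import sys


def resolve_dmp_id(dmp_patient_id, key_ids):
    """Return the DMP ID, or None when the column is blank.

    Raises KeyError when an ID is given but absent from the key file.
    Treating '' and 'NA' as blank is an assumption of this sketch.
    """
    value = (dmp_patient_id or "").strip()
    if value == "" or value.upper() == "NA":
        return None
    if value not in key_ids:
        raise KeyError(f"DMP ID {value} not found in key file")
    return value


def check_patient(dmp_patient_id, key_ids):
    """Return a process exit code: 0 on success, 1 on an unknown DMP ID."""
    try:
        resolved = resolve_dmp_id(dmp_patient_id, key_ids)
    except KeyError as err:
        # Report the problem on stderr and stop cleanly, without a traceback.
        print(f"ERROR: {err.args[0]}", file=sys.stderr)
        return 1
    if resolved is None:
        print("No DMP ID given; skipping DMP samples")
    return 0


# Hypothetical IDs standing in for the contents of the key file.
keys = {"P-0000001", "P-0000002"}
exit_code = check_patient("P-9999999", keys)
```

A command-line wrapper would pass the returned code to `sys.exit`.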

Discrepancy between the README on GitHub and --help: --id is --sid on the develop branch

(base) python get_cbioportal_variants.py --help
Usage: get_cbioportal_variants.py [OPTIONS]

  Tool to do the following operations: A. Get subset of variants based on
  Tumor_Sample_Barcode in MAF file B. Mark the variants as overlapping with
  BED file as covered [yes/no], by appending "covered" column to the subset
  MAF

  Requirement: pandas; typing; typer; bed_lookup(https://github.com/msk-
  access/python_bed_lookup)

Options:
  -m, --maf FILE        MAF file generated by cBioportal repo  [default: /work
                        /access/production/resources/cbioportal/current/msk_so
                        lid_heme/data_mutations_extended.txt]

  -i, --ids PATH        List of ids to search for in the
                        'Tumor_Sample_Barcode' column. Header of this file is
                        'sample_id'  [default: ]

  --sid TEXT            Identifiers to search for in the
                        'Tumor_Sample_Barcode' column. Can be given multiple
                        times  [default: ]

  -b, --bed FILE        BED file to find overlapping variants  [default:
                        /work/access/production/resources/msk-
                        access/current/regions_of_interest/current/MSK-
                        ACCESS-v1_0-probe-A.sorted.bed]

  -n, --name TEXT       Name of the output file  [default: output.maf]
  --install-completion  Install completion for the current shell.
  --show-completion     Show completion for the current shell, to copy it or
                        customize the installation.

  --help                Show this message and exit.
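
For orientation, the two operations described in the help text (subsetting by `Tumor_Sample_Barcode`, then appending a `covered` column) can be sketched in plain Python. This is not the tool's implementation: the real script relies on pandas and `bed_lookup`, whereas this sketch uses a linear scan, and the function name `mark_covered`, the example rows, and the coordinate handling (0-based half-open BED intervals, 1-based MAF positions) are assumptions.

```python
def mark_covered(variants, sample_ids, bed_intervals):
    """Keep variants from the requested samples and add a 'covered' column.

    variants: list of dicts with Tumor_Sample_Barcode, Chromosome and
    Start_Position keys. bed_intervals: list of (chrom, start, end) tuples.
    """
    wanted = set(sample_ids)
    subset = []
    for variant in variants:
        if variant["Tumor_Sample_Barcode"] not in wanted:
            continue
        pos = int(variant["Start_Position"])
        # Assumed convention: BED is 0-based half-open and MAF is 1-based,
        # so position p lies in an interval when start < p <= end.
        hit = any(
            chrom == variant["Chromosome"] and start < pos <= end
            for chrom, start, end in bed_intervals
        )
        subset.append({**variant, "covered": "yes" if hit else "no"})
    return subset


# Hypothetical inputs: three variants, one BED interval, one requested sample.
variants = [
    {"Tumor_Sample_Barcode": "S1", "Chromosome": "1", "Start_Position": "101"},
    {"Tumor_Sample_Barcode": "S1", "Chromosome": "1", "Start_Position": "201"},
    {"Tumor_Sample_Barcode": "S2", "Chromosome": "1", "Start_Position": "150"},
]
bed = [("1", 100, 200)]
subset = mark_covered(variants, ["S1"], bed)
```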

Remove use of `master` from all scripts and docs, substitute with `metadata`

Incorporation of Hotspot list and CCF information

Problem: There are multiple mutations to view in plot_events.

Solutions:

It would be better to only color the hotspot ones rather than all.
Using CCF to determine clonal vs subclonal mutations and coloring them based on that.
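
One way to combine the two proposals is to assign each mutation a legend group before plotting. The sketch below is hypothetical throughout: the function name, the group labels, and in particular the 0.8 CCF cutoff for calling a mutation clonal are illustrative assumptions, not values taken from the repository.

```python
def color_group(is_hotspot, ccf=None, clonal_cutoff=0.8):
    """Return the legend group used to color one mutation.

    clonal_cutoff is an illustrative assumption; pick a value that suits
    the cohort before relying on the clonal/subclonal split.
    """
    if is_hotspot:
        return "hotspot"
    if ccf is None:
        # No CCF estimate available, so the mutation stays uncolored.
        return "other"
    return "clonal" if ccf >= clonal_cutoff else "subclonal"


# Hypothetical mutations: (label, is_hotspot, ccf)
mutations = [
    ("mutation_a", True, 0.95),
    ("mutation_b", False, 0.90),
    ("mutation_c", False, 0.30),
    ("mutation_d", False, None),
]
groups = {name: color_group(hot, ccf) for name, hot, ccf in mutations}
```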

Need to test input mafs from completed pipeline

Right now, the testing input files are intermediate mafs. Need to test on Ronak's folder

compile_reads error: "can't set ALTREP truelength"

I get the following error message when I try to run compile_reads.R

Error in .shallow(x, cols = cols, retain.key = TRUE) :
can't set ALTREP truelength

My command:
Rscript R/compile_reads.R -m /juno/work/bergerm1/bergerlab/access_projects/Project_06302_TDM1/metadata/for_access_data_analysis.2020-08-20.csv -o /juno/work/bergerm1/bergerlab/access_projects/Project_06302_TDM1/analysis_workflow_results

cmo_sample_id_normal empty column

Test what happens when the cmo_sample_id_normal column is empty.

Can we use https://github.com/jokergoo/bsub for bsub?

Re-check labelling of mutations as covered, genotyped and not covered.

@peteryzheng I have seen inconsistent labeling from filter_reads.R a couple of times. Can you please check that the same criteria are used throughout?

create_report script template.Rmd

Lines 239 and 241

final[is.na(final$HGVSp_Short) & nchar(final$Reference_Allele)>5,"VarName"] <- paste0(final$Hugo_Symbol, " ", final$Chromosome, ":", final$Start_Position, " ", substr(final$Reference_Allele,1,3),"..", ">", final$Tumor_Seq_Allele2)[is.na(final$HGVSp_Short) & nchar(final$Reference_Allele)>5 ]

final[is.na(final$HGVSp_Short) & nchar(final$Tumor_Seq_Allele2)>5,"VarName"] <- paste0(final$Hugo_Symbol, " ", final$Chromosome, ":", final$Start_Position, " ", final$Reference_Allele,1,3, ">", substr(final$Tumor_Seq_Allele2,1,3),"..")[is.na(final$HGVSp_Short) & nchar(final$Tumor_Seq_Allele2)>5]

Obfuscate the sample collection dates in the patient reports
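
One possible way to obfuscate the dates, offered only as an assumption since the issue does not name a method, is to replace each collection date with the number of days since the patient's first sample. The function name and the example dates below are hypothetical.

```python
from datetime import date


def days_from_first(collection_dates):
    """Replace absolute dates with day offsets from the earliest date."""
    first = min(collection_dates)
    return [(d - first).days for d in collection_dates]


# Hypothetical collection dates for one patient.
dates = [date(2020, 1, 10), date(2020, 1, 1), date(2020, 2, 1)]
offsets = days_from_first(dates)
```

The intervals between samples are preserved, which is what a timeline plot needs, while the calendar dates themselves no longer appear in the report.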

Add mutation called status for each IMPACT sample

This would be useful for patients with multiple IMPACT samples. E.g. if a mutation was called in one IMPACT and genotyped in the other, we currently cannot easily tell that from the excel files.

Generate average ALT ALLELE RATIO at each time point

Generate average ALT ALLELE RATIO at each time point only for MSK-IMPACT mutations and compare with Tumor Fraction from MSK-IMPACT
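
A minimal Python sketch of the requested summary, assuming each row carries `t_alt_count` and `t_total_count` (the column names used elsewhere in these issues) plus two hypothetical fields, `timepoint` and `called_in_impact`. The example counts are made up.

```python
from collections import defaultdict


def mean_alt_ratio_by_timepoint(rows):
    """Average t_alt_count / t_total_count per time point.

    Only rows flagged as MSK-IMPACT mutations are used; rows with zero
    total depth are skipped to avoid dividing by zero.
    """
    ratios = defaultdict(list)
    for row in rows:
        if not row["called_in_impact"] or row["t_total_count"] == 0:
            continue
        ratios[row["timepoint"]].append(
            row["t_alt_count"] / row["t_total_count"]
        )
    return {tp: sum(vals) / len(vals) for tp, vals in ratios.items()}


# Hypothetical rows for two time points.
rows = [
    {"timepoint": "T1", "called_in_impact": True, "t_alt_count": 5, "t_total_count": 100},
    {"timepoint": "T1", "called_in_impact": True, "t_alt_count": 15, "t_total_count": 100},
    {"timepoint": "T1", "called_in_impact": False, "t_alt_count": 90, "t_total_count": 100},
    {"timepoint": "T2", "called_in_impact": True, "t_alt_count": 1, "t_total_count": 50},
    {"timepoint": "T2", "called_in_impact": True, "t_alt_count": 0, "t_total_count": 0},
]
averages = mean_alt_ratio_by_timepoint(rows)
```

The per-time-point averages can then be set beside the tumor fraction reported by MSK-IMPACT.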

compile_reads.R: Sample without DMP ID, runs for all DMP samples

Here is the example folder: /work/bergerm1/bergerlab/shahr2/Project_10619_B/results_26June2020
Input Master: /work/bergerm1/bergerlab/shahr2/Project_10619_B/master_file_hc_file.tsv
Sample with issues: C-PXVUM9 and C-CDVA88

Installation README

Here is what I did once you have conda installed using this guide

conda create --name access_data_analysis python=3
conda activate access_data_analysis
conda install r-essentials r-base
conda install r-argparse
pip install genotype-variants

have report show multiple IMPACT samples

Some patients have multiple IMPACT samples. One way to handle this is to give each IMPACT sample its own tab, since it is not clear how to correct the VAFs when there are multiple IMPACT samples.

compile reads issues

Reported by @kanika-arora. Tried with the master branch. To reproduce:

Rscript ~/tools/access_data_analysis/R/compile_reads.R \
  -m /juno/work/bergerm1/bergerlab/access_projects/Project_06302_TDM1/metadata/C-F38KR6_for-access-data-analysis.csv \
  -o /home/murphyc4/test/ \
  -pid Project_06302_TDM1

The error message:

Error in rbindlist(l, use.names, fill, idcol) : 
 Column 150 ['C-F38KR6-L002-DUPLEX'] of item 2 is missing in item 1. Use fill=TRUE to fill with NA (NULL for list columns), or use.names=FALSE to ignore column names.

I got it working by commenting them out and doing

maf.file <- data.frame(maf.file)
maf.file$Tumor_Sample_Barcode <- paste0(sample.name, '___',sample.type)
maf.file <- cbind(maf.file,data.frame(t_alt_count= maf.file$t_alt_count_standard))
maf.file <- cbind(maf.file,data.frame(t_total_count= maf.file$t_total_count_standard))
maf.file <- data.table(maf.file)

Not sure why similar things work at other places but not here.
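
The `rbindlist` message above points at tables whose columns differ. What its suggested `fill=TRUE` does, namely taking the union of the columns and filling the gaps with a missing value, can be sketched in Python. This only illustrates the behavior and is not the fix adopted in the repository; the function name and example rows are hypothetical.

```python
def bind_rows_fill(tables):
    """Stack lists of row dicts, filling columns a table lacks with None."""
    columns = []
    for table in tables:
        for row in table:
            for name in row:
                if name not in columns:
                    # Keep first-seen order so the output is predictable.
                    columns.append(name)
    return [
        {name: row.get(name) for name in columns}
        for table in tables
        for row in table
    ]


# Hypothetical tables: the second has a per-sample column the first lacks.
item1 = [{"Hugo_Symbol": "TP53"}]
item2 = [{"Hugo_Symbol": "KRAS", "C-F38KR6-L002-DUPLEX": "3/100"}]
combined = bind_rows_fill([item1, item2])
```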

collection_date column

Need to make sure the plot_all_event script accommodates both date and character types
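
A hedged sketch of one way to accept both kinds of value: try to parse the entry as a date and fall back to the raw string otherwise. The function name, the `%Y-%m-%d` format, and the example values are assumptions; the format actually used in the metadata file may differ.

```python
from datetime import datetime


def parse_collection_date(value, fmt="%Y-%m-%d"):
    """Return a date when value matches fmt, else the original string."""
    try:
        return datetime.strptime(value, fmt).date()
    except ValueError:
        # Not a date in the assumed format: keep it as a character label.
        return value


# Hypothetical column mixing a real date with a free-text label.
parsed = [parse_collection_date(v) for v in ["2020-08-20", "baseline"]]
```

Downstream plotting code can then branch on the type, ordering dates chronologically and labels in the order they were given.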
