msk-access / access_data_analysis
Scripts for downstream analysis plotting of pipeline output
Home Page: https://cmo-ci.gitbook.io/cmo-access-data-analysis/
License: Apache License 2.0
access_data_analysis/R/filter_calls.R
Line 62 in f774b31
To avoid mutate function erroring out:
❯ Rscript ./filter_calls.R -m $PWD/access_data_analysis_inputs.tsv -o $PWD/result_27Jan2022
---------------
Arguments input:
/home/shahr2/bergerlab/Project_12672_B/small_variants/access_data_analysis_inputs.tsv
/home/shahr2/bergerlab/Project_12672_B/small_variants/result_27Jan2022
/juno/work/access/production/resources/dmp_signedout_CH/current/signedout_CH.txt
stringent
---------------
[1] "Processing patient C-E3C1KC"
[1] "list"
Error: Column `Tumor_Sample_Barcode` must be length 36 (the number of rows) or one, not 2
$`suppressWarnings(filter_calls(fread(master.ref), results.dir, chlist, crite`
<environment: 0x5560b0cd5958>
$`withCallingHandlers(expr, warning = function(w) if (inherits(w, classes)) t`
<environment: 0x5560b0cdb398>
$`filter_calls(fread(master.ref), results.dir, chlist, criteria)`
<environment: 0x5560b0cd9f08>
$`lapply(unique(master.ref$cmo_patient_id), function(x) {\n print(paste0("P`
<environment: 0x5560b0cdd360>
$`FUN(X[[i]], ...)`
<environment: 0x5560b0d22688>
$`do.call(rbind, lapply(fillouts.filenames, function(y) {\n sample.name = g`
<environment: 0x5560b0d22148>
$`eval(lhs, parent, parent)`
<environment: 0x5560b0d2c308>
$`eval(lhs, parent, parent)`
<environment: 0x5560b0d22688>
$`do.call(rbind, lapply(fillouts.filenames, function(y) {\n sample.name = g`
<environment: 0x5560b0d2fd90>
$`lapply(fillouts.filenames, function(y) {\n sample.name = gsub(".*./|-ORG.`
<environment: 0x5560b0d33c08>
$`FUN(X[[i]], ...)`
<environment: 0x5560b0d3b690>
$`maf.file %>% mutate(Tumor_Sample_Barcode = paste0(sample.name, "___", sampl`
<environment: 0x5560b0d3b348>
$`withVisible(eval(quote(`_fseq`(`_lhs`)), env, env))`
<environment: 0x5560b0d3d150>
$`eval(quote(`_fseq`(`_lhs`)), env, env)`
<environment: 0x5560b0d3cd98>
$`eval(quote(`_fseq`(`_lhs`)), env, env)`
<environment: 0x5560b0d3ad28>
$``_fseq`(`_lhs`)`
<environment: 0x5560b0d40420>
$`freduce(value, `_function_list`)`
<environment: 0x5560b0d40298>
$`function_list[[i]](value)`
<environment: 0x5560b0d3fce8>
$`mutate(., Tumor_Sample_Barcode = paste0(sample.name, "___", sample.type))`
<environment: 0x5560b0d3f9d8>
$`mutate.data.frame(., Tumor_Sample_Barcode = paste0(sample.name, "___", samp`
<environment: 0x5560b0d3f3b8>
$`as.data.frame(mutate(tbl_df(.data), ...))`
<environment: 0x5560b0d3ef20>
$`mutate(tbl_df(.data), ...)`
<environment: 0x5560b0d3eaf8>
$`mutate.tbl_df(tbl_df(.data), ...)`
<environment: 0x5560b0d42490>
$`mutate_impl(.data, dots, caller_env())`
<environment: 0x5560b0d41c40>
$`stop(list("Column `Tumor_Sample_Barcode` must be length 36 (the number of r`
<environment: 0x5560b0d413f0>
attr(,"error.message")
[1] "Error: Column `Tumor_Sample_Barcode` must be length 36 (the number of rows) or one, not 2\n"
attr(,"class")
Solution:
sample.type = unique(sample.sheet[Sample_Barcode == sample.name]$Sample_Type)
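The reasoning behind this fix can be sketched as follows. This is a minimal base-R illustration, not the code in filter_calls.R: the sample sheet contents and the barcodes `SAMPLE_A` and `SAMPLE_B` are hypothetical, and it assumes the error arose because one barcode matched two rows of the sample sheet.

```r
# Hypothetical sample sheet; SAMPLE_A appears on two rows (assumed cause of the error).
sample.sheet <- data.frame(
  Sample_Barcode = c("SAMPLE_A", "SAMPLE_A", "SAMPLE_B"),
  Sample_Type    = c("plasma", "plasma", "normal"),
  stringsAsFactors = FALSE
)
sample.name <- "SAMPLE_A"

# Without unique(): two matching rows give a length-2 vector, so the pasted
# barcode also has length 2 and cannot be recycled over the rows of the MAF.
raw.type <- sample.sheet$Sample_Type[sample.sheet$Sample_Barcode == sample.name]
length(paste0(sample.name, "___", raw.type))

# With unique(): identical values collapse to a single sample type.
sample.type <- unique(raw.type)

# Fail early if the sheet lists conflicting types for one barcode.
stopifnot(length(sample.type) == 1)
barcode <- paste0(sample.name, "___", sample.type)
```

Note that `unique()` only helps when the duplicated rows agree on `Sample_Type`. If they disagree, the vector still has length 2, which is why the sketch adds an explicit length check.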
For compile_reads.R:
(base) python get_cbioportal_variants.py --help
Usage: get_cbioportal_variants.py [OPTIONS]
  Tool to do the following operations:
  A. Get subset of variants based on Tumor_Sample_Barcode in MAF file
  B. Mark the variants as overlapping with BED file as covered [yes/no], by appending "covered" column to the subset MAF
  Requirement: pandas; typing; typer; bed_lookup(https://github.com/msk-access/python_bed_lookup)
Options:
  -m, --maf FILE        MAF file generated by cBioportal repo [default: /work/access/production/resources/cbioportal/current/msk_solid_heme/data_mutations_extended.txt]
  -i, --ids PATH        List of ids to search for in the 'Tumor_Sample_Barcode' column. Header of this file is 'sample_id' [default: ]
  --sid TEXT            Identifiers to search for in the 'Tumor_Sample_Barcode' column. Can be given multiple times [default: ]
  -b, --bed FILE        BED file to find overlapping variants [default: /work/access/production/resources/msk-access/current/regions_of_interest/current/MSK-ACCESS-v1_0-probe-A.sorted.bed]
  -n, --name TEXT       Name of the output file [default: output.maf]
  --install-completion  Install completion for the current shell.
  --show-completion     Show completion for the current shell, to copy it or customize the installation.
  --help                Show this message and exit.
Problem: There are multiple mutations to view in plot_events.
Solutions:
Right now, the testing input files are intermediate MAFs. Need to test on Ronak's folder.
I get the following error message when I try to run compile_reads.R
Error in .shallow(x, cols = cols, retain.key = TRUE) :
can't set ALTREP truelength
My command:
Rscript R/compile_reads.R -m /juno/work/bergerm1/bergerlab/access_projects/Project_06302_TDM1/metadata/for_access_data_analysis.2020-08-20.csv -o /juno/work/bergerm1/bergerlab/access_projects/Project_06302_TDM1/analysis_workflow_results
Test what happens when cmo_sample_id_normal is empty.
@peteryzheng I have seen inconsistent labeling from filter_reads.R a couple of times. Can you please check that the same criteria are used throughout?
Lines 239 and 241
final[is.na(final$HGVSp_Short) & nchar(final$Reference_Allele)>5,"VarName"] <- paste0(final$Hugo_Symbol, " ", final$Chromosome, ":", final$Start_Position, " ", substr(final$Reference_Allele,1,3),"..", ">", final$Tumor_Seq_Allele2)[is.na(final$HGVSp_Short) & nchar(final$Reference_Allele)>5 ]
final[is.na(final$HGVSp_Short) & nchar(final$Tumor_Seq_Allele2)>5,"VarName"] <- paste0(final$Hugo_Symbol, " ", final$Chromosome, ":", final$Start_Position, " ", final$Reference_Allele,1,3, ">", substr(final$Tumor_Seq_Allele2,1,3),"..")[is.na(final$HGVSp_Short) & nchar(final$Tumor_Seq_Allele2)>5]
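One detail in the second quoted line: `final$Reference_Allele,1,3` is passed to `paste0` without a surrounding `substr(...)`, so the literal `1` and `3` are pasted into the name. The sketch below shows that effect and one possible way to build the label with a shared helper. The helper names `shorten_allele` and `var_name`, the gene `GENE1`, and the choice to shorten both alleles the same way are assumptions for illustration, not the repository's code. The quoted lines also apply only where `HGVSp_Short` is missing, which the sketch leaves out.

```r
# Stray numeric arguments are pasted as text, as in the second quoted line.
paste0("ACGTACG", 1, 3, ">", "T")   # "ACGTACG13>T"

# Hypothetical helper: keep the first `keep` bases of alleles longer than `max_len`.
shorten_allele <- function(allele, max_len = 5, keep = 3) {
  ifelse(nchar(allele) > max_len, paste0(substr(allele, 1, keep), ".."), allele)
}

# Hypothetical label builder in the same "GENE chrom:pos REF>ALT" layout.
var_name <- function(gene, chrom, pos, ref, alt) {
  paste0(gene, " ", chrom, ":", pos, " ", shorten_allele(ref), ">", shorten_allele(alt))
}

var_name("GENE1", "1", 100, "ACGTACGT", "A")   # "GENE1 1:100 ACG..>A"
```

Using one helper for both alleles avoids keeping two near-identical `paste0` calls in sync, at the cost of changing behavior when both alleles are long.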
This would be useful for patients with multiple IMPACT samples. E.g. if a mutation was called in one IMPACT sample and genotyped in the other, we currently cannot easily tell that from the Excel files.
Here is the example folder: /work/bergerm1/bergerlab/shahr2/Project_10619_B/results_26June2020
Input Master: /work/bergerm1/bergerlab/shahr2/Project_10619_B/master_file_hc_file.tsv
Samples with issues: C-PXVUM9 and C-CDVA88
Here is what I did, once conda is installed using this guide:
conda create --name access_data_analysis python=3
conda activate access_data_analysis
conda install r-essentials r-base
conda install r-argparse
pip install genotype-variants
Some patients have multiple IMPACT samples. One way to handle this is to have a separate tab for each IMPACT sample, since it is not clear how to correct the VAFs when there are multiple IMPACT samples.
Reported by @kanika-arora. Tried with master branch. To reproduce:
Rscript ~/tools/access_data_analysis/R/compile_reads.R \
-m /juno/work/bergerm1/bergerlab/access_projects/Project_06302_TDM1/metadata/C-F38KR6_for-access-data-analysis.csv \
-o /home/murphyc4/test/ \
-pid Project_06302_TDM1
The error message.
Error in rbindlist(l, use.names, fill, idcol) :
Column 150 ['C-F38KR6-L002-DUPLEX'] of item 2 is missing in item 1. Use fill=TRUE to fill with NA (NULL for list columns), or use.names=FALSE to ignore column names.
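The error message itself points at `fill=TRUE`. A minimal sketch of what that option does is below. The two tables and their column names are hypothetical, and whether filling with NA is the right fix here is an assumption: a column present for one item and absent for another may also signal that a sample is missing upstream.

```r
library(data.table)

# Two hypothetical per-item tables; the second has an extra count column.
item1 <- data.table(Hugo_Symbol = "GENE1", `SAMPLE_A-DUPLEX` = 10L)
item2 <- data.table(Hugo_Symbol = "GENE2", `SAMPLE_A-DUPLEX` = 4L, `SAMPLE_B-DUPLEX` = 7L)

# fill = TRUE matches columns by name and fills the missing column with NA
# instead of stopping with the "is missing in item 1" error.
combined <- rbindlist(list(item1, item2), fill = TRUE)
print(combined)
```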
Have ymax set to 2% when all mutations are below 2%.
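One way this rule could be expressed is sketched below. The helper name `vaf_ymax` and its `min_ymax` argument are hypothetical, and the sketch assumes VAFs are stored as fractions (0.02 for 2%).

```r
# Hypothetical helper; assumes VAFs are fractions, so 0.02 means 2%.
# Returns the largest VAF, but never less than `min_ymax`.
vaf_ymax <- function(vaf, min_ymax = 0.02) {
  max(min_ymax, vaf, na.rm = TRUE)
}

vaf_ymax(c(0.001, 0.004))  # 0.02, all mutations are below 2%
vaf_ymax(c(0.01, 0.15))    # 0.15, the data decide the limit
```

The returned value would then be passed to whatever plotting call sets the upper y-axis limit.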
I just realized I am genotyping duplex BAMs in /ifs/work/bergerm1/ACCESS-Projects/novaseq_curated_duplex_v2/ as standard BAMs...
If we are going to genotype donor BAMs as actual plasma samples, do we need both duplex and simplex BAMs?
This may be due to R version inconsistencies (warning: package 'dplyr' was built under R version 3.6.3). filter_calls.sh stops at lines 99-102 with the error message: Error in .shallow(x, cols = cols, retain.key = TRUE) : can't set ALTREP truelength
https://github.com/msk-access/access_data_analysis/blob/master/R/filter_calls.R#L99
I got it working by commenting them out and doing
maf.file <- data.frame(maf.file)
maf.file$Tumor_Sample_Barcode <- paste0(sample.name, '___',sample.type)
maf.file <- cbind(maf.file,data.frame(t_alt_count= maf.file$t_alt_count_standard))
maf.file <- cbind(maf.file,data.frame(t_total_count= maf.file$t_total_count_standard))
maf.file <- data.table(maf.file)
Not sure why similar things work at other places but not here.
Need to make sure the plot_all_event script accommodates both dates and character types.
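A possible way to accept both kinds of input is sketched below. The helper name `normalize_timepoint` is hypothetical, as is the assumption that date strings use the YYYY-MM-DD format and that non-date labels should keep their input order on the axis.

```r
# Hypothetical helper: return a Date vector when every non-missing value
# parses as YYYY-MM-DD (assumed format); otherwise keep the labels as a
# factor whose levels follow the input order.
normalize_timepoint <- function(x) {
  if (inherits(x, "Date")) return(x)
  x <- as.character(x)
  parsed <- as.Date(x, format = "%Y-%m-%d")
  if (!anyNA(parsed[!is.na(x)])) return(parsed)
  factor(x, levels = unique(x))
}

class(normalize_timepoint(c("2020-01-15", "2020-02-01")))  # "Date"
class(normalize_timepoint(c("C1D1", "C2D1")))              # "factor"
```

Deciding per column rather than per value keeps the axis type consistent: a single unparseable entry turns the whole column into labels instead of silently dropping a time point.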