scramble's People

Contributors

carlosborroto, hackdna, mfinelli, your-highness

scramble's Issues

What to evaluate should be a mandatory argument

Issue
If neither --eval-meis nor --eval-dels is set, the script runs without errors and without generating any output.
This makes the problem hard to debug.

# Error while running:
Rscript --vanilla /path/to/scramble/cluster_analysis/bin/SCRAMble.R \
            --out-name ...

Suggestions
Either:

  1. If neither flag is set, tell the user why nothing happened and exit without any other message.
  2. Make MEI detection the default behaviour and retire the argument. I'd assume that most people use the tool for that anyway.
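
The first suggestion could be sketched as a small pre-flight check in a wrapper script. This is an illustration only and not part of SCRAMble: the function name `check_eval_flags` is hypothetical, and only the flag names `--eval-meis` and `--eval-dels` come from the issue.

```python
import argparse


def check_eval_flags(argv):
    """Parse the two evaluation flags and stop early if neither is set."""
    parser = argparse.ArgumentParser(add_help=False, allow_abbrev=False)
    parser.add_argument("--eval-meis", action="store_true")
    parser.add_argument("--eval-dels", action="store_true")
    # parse_known_args leaves all other options (e.g. --out-name) untouched.
    known, _rest = parser.parse_known_args(argv)
    if not (known.eval_meis or known.eval_dels):
        raise SystemExit(
            "Nothing to do: pass --eval-meis and/or --eval-dels "
            "to choose what to evaluate."
        )
    return known.eval_meis, known.eval_dels
```

A wrapper could call `check_eval_flags(sys.argv[1:])` before handing the same arguments to Rscript, so that a run without either flag fails with a readable message instead of finishing silently.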

SCRAMble.R reports errors

Hi~
I am trying to run SCRAMble.R.
I get an error with each of two different --ref paths.
Neither attempt produces a successful report.
1.

    Rscript --vanilla ${bin}/SCRAMble.R \
        --out-name $OUTPUT_PATH/SCRAMble_${hg}/${ID} \
        --cluster-file $OUTPUT_PATH/SCRAMble_${hg}/clusters/${ID}.clusters.txt \
        --install-dir ${bin} \
        --mei-refs ${MEI_consensus_seqs} \
        --ref /staging/biology/zxc898977/writeCodeing/debugs/ref_hg19/ucsc.hg19.blastdb.fa \
        --eval-meis \
        --eval-dels

image
2.

    Rscript --vanilla ${bin}/SCRAMble.R \
        --out-name $OUTPUT_PATH/SCRAMble_${hg}/${ID} \
        --cluster-file $OUTPUT_PATH/SCRAMble_${hg}/clusters/${ID}.clusters.txt \
        --install-dir ${bin} \
        --mei-refs ${MEI_consensus_seqs} \
        --ref /staging/biology/zxc898977/writeCodeing/debugs/ref_hg19/ucsc.hg19.blastdb \
        --eval-meis \
        --eval-dels

image

P.S. ucsc.hg19.blastdb.fa is identical to ucsc.hg19.fasta.
My full file structure:
image

I am curious what a successful report looks like.
I hope there is a way to produce correct reports.
Thanks a lot!

Help with MEI annotation

Hi, SCRAMble is a first-tier tool for MEI detection.
Can the output txt file be converted into a VCF file for further gene-based annotation or 1000G frequency annotation (dbRIP)?
Thank you!

negative length vectors are not allowed

Hi.

I'm trying to run scramble on some WGS data. Some of the samples were processed correctly, but some give the following error during the analysis step:

Error in .Call2("PairwiseAlignmentsSingleSubject_align_aligned", x, gapCode,  : 
  negative length vectors are not allowed
Calls: do.meis ... as.matrix -> aligned -> aligned -> .local -> .Call2
Execution halted

Please let me know if you need any further information. I will now try and see if I can isolate the row in the cluster file that causes this error.
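
One way to isolate such a row is to bisect the cluster file. The sketch below assumes that exactly one row triggers the error and that the failure is reproducible. Both `find_failing_row` and the predicate `fails` are hypothetical names: in practice `fails` would write the given rows to a temporary cluster file, run SCRAMble.R on it, and return True when the run exits with an error.

```python
def find_failing_row(rows, fails):
    """Return the index of the single row that makes `fails(subset)` true.

    Assumes exactly one bad row and a deterministic `fails` predicate.
    Returns None when the full set of rows does not fail at all.
    """
    if not fails(rows):
        return None
    lo, hi = 0, len(rows)  # invariant: the bad row lies in rows[lo:hi]
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if fails(rows[lo:mid]):
            hi = mid  # bad row is in the first half
        else:
            lo = mid  # otherwise it must be in the second half
    return lo
```

With this approach the number of SCRAMble.R runs grows with the logarithm of the number of clusters, not with the number of clusters itself.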

Error in writing VCF

Hi,

I'm trying to run my samples through SCRAMble. 44 of 293 samples failed with this error; the rest ran smoothly.

The output is a .vcf file, 3321 in size, that contains the title row.

Writing VCF file to sample.vcf...
Error in .Call2("C_solve_user_SEW", refwidths, start, end, width, translate.negative.coord, :
solving row 1: 'allow.nonnarrowing' is FALSE and the supplied start (0) is < 1
Calls: write.scramble.vcf ... make_IRanges_from_windows_args -> solveUserSEW -> .Call2
Execution halted

Question: Is this still maintained?

Hi,

I am working on a research project on MEIs and was wondering if this is still being maintained or developed for future use cases?

Best wishes,
Robert Wilson

Rare clusters have all "n"s in 4th field, causing R to fail

Of the 359,984 clusters produced by processing the 1000 Genomes CRAM for sample HG00150, 5 have nothing but "n" calls in the 4th field. Here are two examples:
chr1:24599821 left 7 nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn ccccccccccccccccccttcgttaacgacgctattaccaactaaa
chr2:232427156 left 5 nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn ttttttttttttttttttttttttttttttttttttctctccccccccccccccccctctccctttcttgtgtttttttttttttttttttttccgcctcccccccacctcccaggtt

When the R script attempts to find deletions using such a cluster, the analysis fails with this error:
Error in width(strings) : NAs in 'x' are not supported
Calls: do.dels ... .charToXStringSet -> solveUserSEW -> width -> width

My workaround is to clean the cluster file of these few bad guys before processing, but perhaps do.dels.R could exclude them.
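
The clean-up step described above might look like the following sketch. It assumes the cluster file is whitespace-separated with no empty fields, as in the two example lines; the function names are hypothetical.

```python
def is_all_n(field):
    """True when a sequence field is non-empty and consists only of 'n'/'N'."""
    return bool(field) and set(field.lower()) == {"n"}


def drop_all_n_clusters(lines):
    """Yield cluster lines whose 4th whitespace-separated field is not all 'n'."""
    for line in lines:
        fields = line.split()
        if len(fields) >= 4 and is_all_n(fields[3]):
            continue  # skip clusters like the two examples above
        yield line
```

To clean a file, open the input and output cluster files and pass the input handle through the filter, for example `dst.writelines(drop_all_n_clusters(src))`.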

Thanks for the software.

Empty vcf file after Error: subscript contains invalid names; result.txt_MEIs.txt shows MEI calls

Hi, I tried running scramble on some exome GIAB NA12878 sample data. It stops with the following message:

Sample had 22 MEI(s)
Done analyzing MEIs
Loading required package: GenomicRanges
Error: subscript contains invalid names
Execution halted

The output file result.txt_MEIs.txt shows these 22 MEIs, but the corresponding VCF file result.txt.vcf contains only the header.

If I understand this correctly, it failed within an R script. The installed version of R in the used environment is 4.1.3; GenomicRanges used in the R environment is version 1.46.1.

What further information can I provide to enable you to propose a solution?

Thank you in advance both for helping and for having developed this amazing tool,
Vinzenz

Is it okay if I make a bioconda package for scramble?

Hi everyone,

Hope you are well-- would you all be up for me making a bioconda package for scramble?

We are adding support for MEI calling in bcbio (https://github.com/bcbio/bcbio-nextgen) using scramble-- we were originally going to use MELT, but the license is one of the most insane licenses I've ever seen for software, and made it so we couldn't use it at all in bcbio.

On the topic of licenses, we were also wondering if you would consider a change in the license-- right now it's non-commercial only, but a lot of companies use bcbio and it would be awesome to extend MEI calling for them. Titus had a nice blog post a while back about the benefits of having a freer license for academics: http://ivory.idyll.org/blog/2015-on-licensing-in-bioinformatics.html and managed to change Lior's mind about it here: https://liorpachter.wordpress.com/2017/08/03/i-was-wrong-part-2/ in case that might sway your feelings. As a huge open source contributor I understand wanting to get compensated somehow for lots of hard, thankless work in the ditches, I totally get it. I think most of the time though, the benefits of having a tool get more widely adopted outweigh whatever amount of money you could squeeze out of companies.

Anyway, no worries if the license switch is a no, we can work around it as we support some other non-free software as well.

Thanks so much. Please let me know if I can help you all with anything.

Segmentation fault

Dear scramble developers,

I am running the cluster_identifier and get a Segmentation fault.
I tried to backtrace with gdb and got: "Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x000055bf4e8e2c4e in handle_cluster.isra ()"

I had the same error on multiple BAM files and also using both the Docker and the installed versions (on Ubuntu 20.04 with all installed packages needed).

I would be very grateful for your help.

Mathieu

Fail to write VCF file - negative coordinates

Issue
For a couple of my samples, writing the MEIs to VCF failed with a negative-coordinate error. Library kit: 'Agilent SureSelect Human All Exon V8'. Could you please help me deal with this issue?

scramble.sh --ref /path/to/hs37d5.fa --out-name /path/to/targeted_seq_mei_calling/work/bwa.scramble.<SAMPLE_ID>-N1-DNA1-WES1/out/bwa.scramble.<SAMPLE_ID>-N1-DNA1-WES1 --cluster-file /path/to/targeted_seq_mei_calling/work/bwa.scramble.<SAMPLE_ID>-N1-DNA1-WES1/out/bwa.scramble.<SAMPLE_ID>-N1-DNA1-WES1_cluster.txt --nCluster 5 --mei-score 50 --indel-score 80 --poly-a-frac 0.75 --eval-meis
Running sample: /path/to/targeted_seq_mei_calling/work/bwa.scramble.<SAMPLE_ID>-N1-DNA1-WES1/out/bwa.scramble.<SAMPLE_ID>-N1-DNA1-WES1_cluster.txt
Running scramble with options:
INSTALL.DIR : /path/to/targeted_seq_mei_calling/.snakemake/conda/9154f892d04f9bfe82a4d010855d834d/share/scramble/bin
blastRef : /path/to/hs37d5.fa
clusterFile : /path/to/targeted_seq_mei_calling/work/bwa.scramble.<SAMPLE_ID>-N1-DNA1-WES1/out/bwa.scramble.<SAMPLE_ID>-N1-DNA1-WES1_cluster.txt
deletions : FALSE
indelScore : 80
mei.refs : /path/to/targeted_seq_mei_calling/.snakemake/conda/9154f892d04f9bfe82a4d010855d834d/share/scramble/resources/MEI_consensus_seqs.fa
meiScore : 50
meis : TRUE
minDelLen : 50
nCluster : 5
no.vcf : FALSE
outFilePrefix : /path/to/targeted_seq_mei_calling/work/bwa.scramble.<SAMPLE_ID>-N1-DNA1-WES1/out/bwa.scramble.<SAMPLE_ID>-N1-DNA1-WES1
pctAlign : 90
polyAFrac : 0.75
polyAdist : 100
Useful Functions Loaded
Loading required package: BiocGenerics

Attaching package: 'BiocGenerics'

The following objects are masked from 'package:stats':

    IQR, mad, sd, var, xtabs

The following objects are masked from 'package:base':

    Filter, Find, Map, Position, Reduce, anyDuplicated, append,
    as.data.frame, basename, cbind, colnames, dirname, do.call,
    duplicated, eval, evalq, get, grep, grepl, intersect, is.unsorted,
    lapply, mapply, match, mget, order, paste, pmax, pmax.int, pmin,
    pmin.int, rank, rbind, rownames, sapply, setdiff, sort, table,
    tapply, union, unique, unsplit, which.max, which.min

Loading required package: S4Vectors
Loading required package: stats4

Attaching package: 'S4Vectors'

The following objects are masked from 'package:base':

    I, expand.grid, unname

Loading required package: IRanges
Loading required package: XVector
Loading required package: GenomeInfoDb

Attaching package: 'Biostrings'

The following object is masked from 'package:base':

    strsplit

Done analyzing l1
Done analyzing sva
Done analyzing alu
Done analyzing l1
Done analyzing sva
Done analyzing alu
Sample had 38 MEI(s)
Done analyzing MEIs
Writing VCF file to /path/to/targeted_seq_mei_calling/work/bwa.scramble.<SAMPLE_ID>-N1-DNA1-WES1/out/bwa.scramble.<SAMPLE_ID>-N1-DNA1-WES1.vcf...
Error in .Call2("C_solve_user_SEW", refwidths, start, end, width, translate.negative.coord,  :
  solving row 1: 'allow.nonnarrowing' is FALSE and the supplied start (0) is < 1
Calls: write.scramble.vcf ... make_IRanges_from_windows_args -> solveUserSEW -> .Call2

Getting "Segmentation fault (core dumped)" when running against some bam files.

Hi
When running the scramble Docker image, it seems to fail randomly with a segmentation fault for some BAM files.
This does not happen for all BAM files, but I can't identify any common aspect that separates the ones that fail from the ones that succeed.

I don't see any core dump file to include (where would these be saved?).

Docker command used:

docker run -v /mnt/qsg-results-3/:/LIfolder -it --rm scramble:latest bash

Command used:

root@2704cc4af852:/# /app/cluster_identifier/src/build/cluster_identifier /LIfolder/LI6073/Sorted_LI6073.bam

output:


chr1:189991284 right 6 cctgccacaaccactccccagtgcctttaagagtttctacacctgcatccagatgtttaaatacaggaaactgctgt ttttc 
chr1:200911586 right 6 cctgccactcctgctcagaagacagtggctctgacgtctccagcatctcccaccccacttcgccgggcagcagcagccccgacatctcctttctgca ccc
chr1:201048681 right 10 ccagccgctgtacagggagacgcagtggcctgccgctgagctgggaccaggccaagcttggcaagtcatcctcacccggtctgtggaccgggagg cccc
chr1:201066388 right 9 ccagccatatcctgccctccacaccagctgcctttctgcctgaaaacactcccaccttccccttccctttcctccagccgtgagagtgtgccc ccc
chr1:201782348 right 12 cctgccacaggcacagccactgtcatgcaaactggtggttcagccactctcagcaagatccagaagtcctcaggcatccctgtca ccc
chr1:203498891 right 10 ccctgccactaggacagtcaccaacactgtgttagtgccccccgtgttagctctttcctgttggctctcagtccctccagctgtcaaaggga ccc
chr1:204249208 right 6 ccagccggtcctgctccctcaccaccttgttctgctcacacaattttcccagcagcttctagggacccagagagtggagaaggagagggagaaa cc
chr1:204460083 right 7 tgaaggagggagagtcttt gccctgggaaacccattttctccctccctctcctcagctcacactctgatttaaaggagttcccactctttctatatgtcctgtgaagac
chr1:205065724 left 15 tcatgtctcatggcctgctgcactggtcagggccggtggtc tggcctgctgcactggtcagggccggtggtctgacctgctgcccctaactgtccccgtgtgcagaaggagaccattggggatctgacca
chr1:205065766 right 11 actggtcagggccggtggtctgacc ccatggatgtcatggagtaggggactcccaagcgctgcctcatgtctcatggcctgctgcactggtcagggccggtggtctggcctgctgc
chr1:205069461 right 7 aagccatatcggtgtcccttggcccttgacagccccctcggtggcaccctcaggactcagcggaggaggtggagcccccggagagct aaa
chr1:205339789 right 8 cctgccgcatatactcggtaatctaagaagaaaggccatacatgcccctggcttagctca ccc
chr1:205339790 right 9 cctgccgcatatactcggtaatctaagaagaaaggccatacatgcccctggcttagctcacaggtacagcaagacaggcccaccagtatctattg cccc
chr1:205662050 right 7 ccagccggtccatgaccagagagaagaccagggagatggcgcactgcaggaacagccccaggctgcccatccgaacgcctgcagagggagaggggcc cc
chr1:206730053 right 5 ccagccacaactctttgaccactccttgttatacaccgtactatgtgggtaagtccacagggggcccagggacctaggcttttcccagaactttt ccc
chr1:208103054 left 8 aggacaattcctct ctctctctgcaaagtactgtcatatcccatcattcccggaaagccccggtcttctgcatgccaaacccttttccagtataccccaaactta
chr1:208267465 left 10 tcataagattactttccagcagcagcagcagcagcagcagcagcagcagcagcag cagcagcagcagcagcagcagcagcagcgatgtaattgacccccatttacagatgatgcagctttaaggcagagaattccatggctg
chr1:209432319 right 6 agagtaactctg catggagctgacaaccatgaggcctcggcagccaccgccaccaccgccgccgccaccaccgtagcagcagcagcagcagcagcagcagca
chr1:209788627 right 5 gagccactactggaatgacctgttcaggacacagaacacaggtgtatcctctgaggaaaaggtatttttaaatagcacaatggacccaagatt cg
chr1:214464832 right 13 gcctatcctggcgcacacgcccctgagatggccttagcagtttcgtgactggaaaattacactatcacctgtgctcctccaggcaggga cct
chr1:214464835 right 8 gcctatcctggcgcacacgcccctgagatggccttagcagtttcgtgactggaaaattacactatcacctgtgctcctccaggcagggaaaagg ccgcct
chr1:215786813 right 12 cctgccacaatgttctgtggcttccatagatgctgggcagaggatcctgcactctttggtttcctgagtcaagtggcag cc
chr1:224114452 right 5 gaggagaaggga ttatgtcctacgacgaaattagccagctccgcctggtgaggcccccgcagaactcctgcctccctctccccccggccgaggtctgggagat
chr1:224330143 right 6 caagccacagctcggaccgccagctcctagtcaaccgggggcctcgtaggggttgcccgccgcgttcgccgggccagttgcacctgaaa ttt
chr1:227925467 right 14 gccgcctctggctccagggtcagcgggaggatggtcaggggctcgctgcccgtcagcctgggcacagagaggccagcatgagcccggcccc gggcg
chr1:229647493 left 5 ttttcctctatc ttattttgccctttagctcttaaaccgagaagcttctcaggagcagcctgtgtccctcacagtggtcgggcctgtcttagatgtcctggct
chr1:230710052 right 11 tcagggagcagccagtcttccatcctgtcacagcctgcatgaacctgtcaatcttctcagcagcaacatccagttctgtgaagtccagagagcgt cccg
chr1:232515127 right 5 ttgggcacagctggggtaccattagccggaccaccaccgccagtctcattggaattcgaggcatttaaagaagtagtgggtcccatgttgccat cccatt
chr1:233379357 right 5 gtggggaggccagcagccccccctccctgccactgtcaagtgccctgggcatcctctccacaccttctttctccacaaagtgcctgctgcagatggac g
chr1:233614150 left 10 cggcgttggccttggctttggct ttggcggcggcggtggagaagatgctgcagtccctggccggcagctcgtgcgtgcgcctggtggagcggcaccgctcggcctggtgctt
chr1:236540468 left 6 caagcacacgcacatatacttatgactgcctgtttgtctggggagagacag ggacgcaaggaaacatttaaatttggataataagttaatttattaactgtttttttttggtggcgggggggg
chr1:240207330 right 6 agccacgaacactctgtttcctctgcctttaaaaacagctgtaacatcccatctccaccacctctgccttgcacagag g
chr1:247434047 left 11 gtgttctgaggccttctctattcca gagctctctggtcagatgtgttctgatgctttctgcctctgttcttggcatgaaggttggggcgctgtggcctctcgcatgagtgctgctt
chr1:248858387 right 11 cctgccgttagggcctcagtttcctcatcagtgaactggggcaagactaaactatttcaatagcagtggcaggtgtggagccaaaccccgtcctt ccc
chr2:676478 left 12 cagaactcctgtaa gtgtcactactctctgctggggaccgcagcggcttctccagagccgcgccatgacataaggacacaagcgcatctactcccatcaatgcac
chr2:1638773 left 11 ggaataaaacgttatacg cagaaggttcgggcagggctgtgctgctgtggaatcttggagtggggggacacaggccgccaggcacctcacctgtgttctgaggtctg
chr2:6910472 right 7 acattagtgggtgcagcgcaccagcatggcac gtagaaagagagagagagaggagtttttaagtactgtatgtattttaaggagattgaataatctaaggtgaggagcatttaaaataata
chr2:9843465 right 7 gcttctccagcctttcccggaagctgcgctcgc aaggtttccctgccgcgcaggcgcacggaatcctaggcgcggatctcgcgtttgcggccggaag
chr2:10122796 right 5 ccactatgctctccctccgtgtcccgctcgcgcccatcacggacccgcagcagctgcagctctcgccgctgaaggggctcagcttggtcgacaagg ccg
chr2:15393610 left 6 gtctgagagaaatgaaagcgtatgtctacac aaatgaaagcgtatgtctacacaaacacttgcatatgaaaagtcatagcaactttatttgtaaaagccaaaactcaaaataacccaaat
chr2:15942065 right 6 gagccgatgccgagctgctccacgtccaccatgccgggcatgatctgcaagaacccagacctcgagtttgactcgctacagccctgcttct ggggggg
chr2:16239536 left 7 acaacaagctacaacagcagcagcagcagcag cagcagcagcagcagcagcagcagcagcagcagcagcagcagcagcagaatgaaggaatgaatgaatgaatgagcgagtgagtgag
chr2:20102729 left 11 atggggagaaggagaagaagaagaagaagaaga agaagaagaagaagaagaagaagaagaagaagaagacgacgacaacggtggtgagggggatggtaccagtctgaggttcgacaggcagtt
chr2:20618698 left 5 ggggctgggcatcta gttggcaaggcatccccacaccctccctccccttcatgtccacggggaataagacacattgggctctggctcctagggtgagagccgctcc
chr2:20640843 right 6 cctgcctcaccagcctggctgaagcctccaggctgcagaggcagctgtggacatgctcccactggggcacggcagcggggcctagttctgggc ccc
chr2:24054101 right 6 cccagccagtctagtgggaatgataaaggaggcttggaaggccaactctttccctcttctaccagcaagggccatccatggtgccagcttctaggt ccc
chr2:27205506 right 6 ccctgcaaggaaagcacagcaaccctgccacagaggccttctaaacccagcttgtccaacctgccttattttgttgttgc cccc
chr2:27629250 right 13 ggtcttcgaggatttggagggt ttttgtacaggtgacgtacacagcatgggtgtagtaggggagcgcaaaaggttgcctccggcaggcggaaggccaggaagaaagggaggga
chr2:28632222 right 5 tttttttttttttaaccatctctctccaagaggattcctgagggtggctttttccacattacctccttt t
chr2:29144009 right 6 acagacagtatg gatcgtgttgttattgcaggacagaaggtacagtaagtaactgcagtctctgaagccagggttgttatgtccatgacctatgttcaaggac
Segmentation fault (core dumped)

The BAM files are built against GRCh38 and aligned with novoalign.

Row names warning message

When running SCRAMBLE-MEI on the example data, I'm getting the following warnings:

Sample had 1 MEI(s)
Warning messages:
1: In data.frame(df.all, alignments_fwd, stringsAsFactors = F) :
row names were found from a short variable and have been discarded
2: In data.frame(df.all, alignments_rev, stringsAsFactors = F) :
row names were found from a short variable and have been discarded
Done analyzing MEIs
R version 3.2.3 (2015-12-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.5 LTS

Feature request: Add option for reference to cluster_identifier

Hi,

Is it possible to add the option to specify the reference file in cluster_identifier? I'm trying to run it using a CRAM file and the tool looks for the reference but can't find it automatically.

Or do you know of a way I can specify this myself?
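
One possible workaround, not confirmed by the maintainers, is to convert the CRAM to BAM first, since `samtools view` accepts an explicit reference FASTA through `-T`. The sketch below only wraps that samtools call; the function names and file paths are hypothetical.

```python
import subprocess


def cram_to_bam_command(cram_path, reference_fasta, bam_path):
    """Build the samtools command that decodes a CRAM into a BAM."""
    return [
        "samtools", "view",
        "-b",                   # write BAM output
        "-T", reference_fasta,  # reference FASTA used to decode the CRAM
        "-o", bam_path,
        cram_path,
    ]


def convert(cram_path, reference_fasta, bam_path):
    """Run the conversion; raises CalledProcessError if samtools fails."""
    subprocess.run(
        cram_to_bam_command(cram_path, reference_fasta, bam_path),
        check=True,
    )
```

The resulting BAM can then be passed to cluster_identifier in the usual way, at the cost of extra disk space and conversion time.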

Thanks
-Nicolas

Update scramble to 1.0.2 in bioconda

Issue
We would like to include scramble in our MEI analysis workflow, but our workflow relies on Conda and requires VCF outputs.
The changes that fix VCF output are in 1.0.2, but that version has not been integrated into bioconda-recipes.

Suggestion
Update bioconda-recipes. We tried in bioconda/bioconda-recipes#36929, but something is missing. Perhaps the version defined in scramble/cluster_identifier/src/cluster_identifier.c?

filtering calls

Do you have any suggestions for filtering calls, such as a coverage threshold, i.e. the minimum number of reads needed to consider a call reliable?

Error in write.table

Hi,

I'm using the Docker version of scramble.
I get the following error when I run this command on my clusters file.

/bin/SCRAMble.R \
--out-name ${PWD}/test \
--cluster-file ${PWD}/MEN_CGH200860-I.sorted.clusters.txt \
--install-dir /app/cluster_analysis/bin/ \
--mei-refs /app/cluster_analysis/resources/MEI_consensus_seqs.fa \
--ref /app/validation/test.fa \
--eval-meis
Done analyzing MEIs
Writing VCF file to /data/share/genmol/sacha/projects/ALU/test.vcf...
Error in write.table(fixed, paste0(outFilePrefix, ".vcf"), row.names = F,  :
  unimplemented type 'list' in 'EncodeElement'
Execution halted

It seems the fixed data frame contains a list in the REF column. Here is the output of print(fixed) and print(str(fixed)):

print(fixed)

   #CHROM       POS     ID  REF          ALT      QUAL FILTER
11  chr13  18212144 INS:ME NULL <INS:ME:ALU>  79.24415   PASS
10  chr13  38878403 INS:ME NULL <INS:ME:ALU>  60.81471   PASS
9   chr14  57041288 INS:ME NULL <INS:ME:ALU>  78.32513   PASS
8   chr15  40808120 INS:ME NULL <INS:ME:ALU>  73.71777   PASS
7   chr16  89224981 INS:ME NULL <INS:ME:ALU>  78.32513   PASS
3    chr2  97185492 INS:ME NULL  <INS:ME:L1>  57.59200   PASS
4    chr2 102706037 INS:ME NULL  <INS:ME:L1>  62.65276   PASS
6   chr22  23928275 INS:ME NULL <INS:ME:ALU> 103.66561   PASS
5   chr22  43928709 INS:ME NULL  <INS:ME:L1>  96.75457   PASS
2    chr4 185440770 INS:ME NULL <INS:ME:ALU>  86.15520   PASS
1    chr5  62561291 INS:ME NULL <INS:ME:ALU> 103.66561   PASS
                                                   INFO
11   MEINFO=chr13:18212144_ALU_Plus,18212144,18212145,+
10   MEINFO=chr13:38878403_ALU_Plus,38878403,38878404,+
9    MEINFO=chr14:57041288_ALU_Plus,57041288,57041289,+
8   MEINFO=chr15:40808120_ALU_Minus,40808120,40808121,-
7    MEINFO=chr16:89224981_ALU_Plus,89224981,89224982,+
3      MEINFO=chr2:97185492_L1_Plus,97185492,97185493,+
4   MEINFO=chr2:102706037_L1_Plus,102706037,102706038,+
6    MEINFO=chr22:23928275_ALU_Plus,23928275,23928276,+
5    MEINFO=chr22:43928709_L1_Minus,43928709,43928710,-
2  MEINFO=chr4:185440770_ALU_Plus,185440770,185440771,+
1     MEINFO=chr5:62561291_ALU_Plus,62561291,62561292,+

print(str(fixed))

'data.frame':   11 obs. of  8 variables:
 $ #CHROM: chr  "chr13" "chr13" "chr14" "chr15" ...
 $ POS   : int  18212144 38878403 57041288 40808120 89224981 97185492 102706037 23928275 43928709 185440770 ...
 $ ID    : chr  "INS:ME" "INS:ME" "INS:ME" "INS:ME" ...
 $ REF   :List of 11
  ..$ : NULL
  ..$ : NULL
  ..$ : NULL
  ..$ : NULL
  ..$ : NULL
  ..$ : NULL
  ..$ : NULL
  ..$ : NULL
  ..$ : NULL
  ..$ : NULL
  ..$ : NULL
 $ ALT   : chr  "<INS:ME:ALU>" "<INS:ME:ALU>" "<INS:ME:ALU>" "<INS:ME:ALU>" ...
 $ QUAL  : num  79.2 60.8 78.3 73.7 78.3 ...
 $ FILTER: chr  "PASS" "PASS" "PASS" "PASS" ...
 $ INFO  : chr  "MEINFO=chr13:18212144_ALU_Plus,18212144,18212145,+" "MEINFO=chr13:38878403_ALU_Plus,38878403,38878404,+" "MEINFO=chr14:57041288_ALU_Plus,57041288,57041289,+" "MEINFO=chr15:40808120_ALU_Minus,40808120,40808121,-" ...

First lines of my cluster file

chr1:931134     right   6       gtgcccccccccccccccccccccgggccaccggttgggtggggagggg       tgggacgtgaacatctctttccgagaggcgtcctgcaggtaggagccgtgctgtgcgtgcataagagggggccgtgactcccc
chr1:939446     left    6       tgctccttgtgttggcccggtagcgcctctaccacctggg        cctccccagccacggtgaggacccaccctggcatgatctcccctcatcacctccccagccacatgtactcggccattcctgttgctga
chr1:955902     right   9       atgccccccaccccgcgtaacagcgggaatacatttgcaccaataaaaaaaacaaaatatgtagaaatccaaaaatgt  ctctgttgccatgtctctgtcctagccacaaggcctctggcttctcctgtgtgtggtcccgacccaccttccaccctacccccc
chr1:971019     right   10      ggggggggggggggggggggggggggggggggggggggggggg     gctggctttaccacctggagaagcagacggccctcctcggggggccgcggcgctgccactcggcacccccacaggtcagtgccgggg
chr1:1046488    right   10      cgccccccccccccggggccccccccaaacccccacaaccccaaccccccacccccc       ccagcactcacccgacatctgcctccgtgactgtgaccaccccagggctcctcctgagccaggcactgccggcccccccc
chr1:1046501    right   8       aggcgccccccaagaccccacccacccccacccccccaccccccacaaagcgaacgcggaccacaaaca   cccgacatctgcctccgtgactgtgaccaccccagggctcctcctgagccaggcactgccggcccccccccgcgcccaccccc
chr1:1048421    left    6       tgtggccgtttttgttagtgggtatgggttccccccgcctttggtggggggggcggccgccggggggggccatgtttg  ggggggggggctaagccaccatcaggctttgagttgggggcaggagcccggattaaggcggggtttcggccagatgcggtggc
chr1:1049076    left    5       gggggtattgtatttctggttttgggggttttttttgggcggggtgctgctcgggggggggggggggggcg ggggcgggggcagctcaggtgggcggggagggg
chr1:1050063    left    17      tgttttggggggggccccggggggggttggggccactttggccctccggggggggggggggggggctgggggggg     gggggggggggggggttgaacgtttgggcgggtacaggttccaggtagcattgcagttaggatgcggctcagtctagtctgggttttgag
chr1:1050070    left    6       cggggcggggccccgggggggggtggggcccctttcgcccccccggggggggggggggggctcgggggggggggggtt  ggggggggttgaacgtttgggcgggtacaggttccaggtagcattgcagttaggatgcggctcagtctagtctgggttttgag

Help with call filtering

We ran SCRAMble on the NA12878 dataset and compared the output with NA12878 validation data.

  • Using original data from ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/integrated_sv_map/ALL.wgs.mergedSV.v8.20130502.svs.genotypes.vcf.gz
    original-scramble-1k
  • Using phase 3 data from ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage_SV/working/20190906_Devine_MELT/
    updated-scramble-1k
  • After excluding calls within low complexity regions
    lcrfilter-scramble-1k
  • After setting alignment score cutoff to 80
    score80-scramble-1k

Could you help us explain the remaining differences in calls? Could you recommend any additional filters to use for calls made by SCRAMble?
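
For reference, a post-hoc filter like the score cutoff mentioned above could be sketched as follows. This is only an illustration, not a recommendation from the maintainers. The column names `Alignment_Score` and `Clipped_Reads_In_Cluster` are assumptions that should be checked against the header of the actual MEI output table, and the thresholds are caller-supplied placeholders.

```python
import csv
import io


def filter_calls(tsv_text, min_score, min_reads,
                 score_col="Alignment_Score",            # assumed column name
                 reads_col="Clipped_Reads_In_Cluster"):  # assumed column name
    """Keep rows whose score and supporting-read count meet both cutoffs."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [
        row for row in reader
        if float(row[score_col]) >= min_score
        and int(row[reads_col]) >= min_reads
    ]
```

Passing the column names as parameters keeps the sketch usable if the real header differs from the names assumed here.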

VCF not valid

The VCF format specification (https://samtools.github.io/hts-specs/VCFv4.2.pdf) strongly recommends (but does not require) that the header include tags describing the contigs referred to in the VCF file. With samtools, the output VCF of SCRAMble cannot be viewed correctly because of the contig definitions:

##contig=<ID=chr1>

Samtools command:

$ samtools view scramble.vcf
[main_samview] fail to read the header from "scramble.vcf"

Suggestion:
Add length and assembly as tags:

##contig=<ID=chr1,length=249250621,assembly=hg19>
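
Header lines in the suggested form could be generated from a FASTA index (.fai), whose first two tab-separated columns are the sequence name and its length. The sketch below is an assumption about how this might be done, not SCRAMble's own code; `contig_header_lines` is a hypothetical name.

```python
def contig_header_lines(fai_lines, assembly=None):
    """Build ##contig header lines from the lines of a FASTA .fai index."""
    headers = []
    for line in fai_lines:
        if not line.strip():
            continue  # ignore blank lines
        name, length = line.split("\t")[:2]
        tags = [f"ID={name}", f"length={int(length)}"]
        if assembly:
            tags.append(f"assembly={assembly}")
        headers.append("##contig=<" + ",".join(tags) + ">")
    return headers
```

Given an index created with `samtools faidx`, the function can be fed the open .fai file directly, since iterating over a file handle yields its lines.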

VCF Creation Issue

I used the following command to generate the reference files (*.nhr, *.nin, and *.nsq) needed for VCF creation, for both GRCh37 and GRCh38.

makeblastdb -in file.fasta -input_type fasta -dbtype nucl

However, when I run the Cluster analysis, I get the following error:

Done analyzing MEIs
Writing VCF file
Loading required package: GenomeInfoDb
Loading required package: GenomicRanges
Error: subscript contains invalid names
Execution halted

This occurs with either reference (37 or 38).

Any ideas how I could go about troubleshooting this step?

Thanks in advance
