cgroza / graffite

GraffiTE is a pipeline that finds polymorphic transposable elements in genome assemblies and/or long reads, and genotypes the discovered polymorphisms in read sets using genome-graphs.

License: Other

Nextflow 11.13% Shell 22.37% Python 0.87% R 38.29% Perl 27.35%

bioinformatics structural-variation transposons

graffite's Issues

max divergence options

Hello!
Thanks for developing this tool!

I was wondering what the reasoning is behind the maximum of 5% divergence in the first step (pseudo-alignment with minimap2), especially as minimap2 supports other divergence thresholds through the asm5/asm10/asm20 presets:

https://manpages.ubuntu.com/manpages/kinetic/en/man1/minimap2.1.html

Given that the TE family rule is the infamous 80/80/80, I could see asm20 working better for divergent populations? Although same TE family != same allele.

Thanks!
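
For readers unfamiliar with the 80/80/80 rule of thumb mentioned in this post, it can be written as a simple predicate. This is only a minimal sketch: the function name and its three inputs (aligned length in bp, percent identity, percent of the sequence covered) are illustrative assumptions and are not part of GraffiTE or minimap2.

```python
def same_family_80_80_80(aligned_bp, pct_identity, pct_covered):
    """Return True when an alignment meets the 80/80/80 rule of thumb:
    at least 80 bp aligned, at least 80% identity, over at least 80% of the sequence."""
    return aligned_bp >= 80 and pct_identity >= 80.0 and pct_covered >= 80.0


# Two copies that are 88% identical satisfy the family rule,
# even though they are more than 5% divergent from each other.
print(same_family_80_80_80(aligned_bp=300, pct_identity=88.0, pct_covered=95.0))
```

The example illustrates the gap the post points at: a family-level threshold tolerates far more divergence than a 5% cap applied to alleles.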

Provide variants coordinates for all input genomes, not only reference (as in the VCF)

Hi Cristian,
I was wondering if it is possible to have the polymorphic TE coordinates in all genomes and not only in the reference? It's quite useful when we map RNA-seq and other data to the assemblies rather than to the reference.

Thanks!
Rita

Singularity issue

I am using Singularity (3.8.6).
When I ran the following commands:

singularity remote add --no-login SylabsCloud cloud.sycloud.io
singularity remote use SylabsCloud

this error occurred:

FATAL:   name collision while syncing: SylabsCloud

How can I solve it?

Replace `pangenome.vcf` with a `presence-absence.vcf` as main output, but keep it to build the graph genomes

Replace pangenome.vcf with a presence-absence.vcf in the 3_TSD_Search/ output folder. This new file will show one genotype column per sample, but the calls are only 1 or 0 (i.e. identical to the SUPP_VEC field). We still need to output pangenome.vcf for compatibility with the option --graffite-vcf (which skips SV search and annotation and uses the provided VCF to build the graph and map reads). Alternatively, don't output pangenome.vcf, but keep it internally to build the graph if needed. This would require modifying the routines for --graffite-vcf in order to strip the genotype columns and replace them with a single column with all variants 1|0.

I anticipate a possible source of confusion, as "presence-absence" could be interpreted as the presence or absence of a TE rather than the presence/absence of the variant. Perhaps a solution is to output two files: one in VCF format, respecting the VCF convention and called GraffiTE_variants_presence-absence.vcf, and the other a TSV table, identical to the non-header lines of the VCF but with the DEL calls reverted to match the presence/absence pattern of the TEs for each sample. We could call this file GraffiTE_TE_presence-absence.tsv.

Of course, we will need to update the documentation accordingly.

This change has several advantages:

It is more explicit and easier to interpret: one sees either 1 (alt allele) or 0 (ref allele) for each variant/sample combination in the VCF, or 1 (TE presence) or 0 (TE absence) for each TE/sample combination in the TSV.
It should be easier to parse than the SUPP_VEC.
It avoids having to pull the vcf.txt file in order to know which position of the SUPP_VEC corresponds to which sample.
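
The VCF-to-TSV conversion proposed above could look roughly like the sketch below. It is not GraffiTE code: the function name is hypothetical, and it assumes that each sample column holds a plain 0 or 1 and that deletions carry SVTYPE=DEL in the INFO column.

```python
def te_presence_rows(vcf_lines):
    """Yield TSV rows in which 1 means 'TE present' and 0 means 'TE absent'.

    Assumption (not confirmed by the pipeline): sample columns contain a bare
    0 or 1, and deletions are tagged SVTYPE=DEL in the INFO column.
    """
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # the TSV keeps only the non-header lines
        fields = line.rstrip("\n").split("\t")
        info_tags = fields[7].split(";")
        calls = fields[9:]
        if "SVTYPE=DEL" in info_tags:
            # For a deletion, the alt allele (1) means the TE is absent,
            # so the variant call is flipped to obtain TE presence.
            flip = {"0": "1", "1": "0"}
            calls = [flip.get(call, call) for call in calls]
        yield "\t".join(fields[:9] + calls)


# Hypothetical two-sample example: one insertion, one deletion.
example = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsampleA\tsampleB",
    "chr1\t100\tins1\tA\tATTTT\t.\tPASS\tSVTYPE=INS\tGT\t1\t0",
    "chr1\t500\tdel1\tATTTT\tA\t.\tPASS\tSVTYPE=DEL\tGT\t1\t0",
]
for row in te_presence_rows(example):
    print(row)
```

Insertion calls pass through unchanged, while deletion calls are inverted, so both rows read as TE presence per sample.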

New variations after genotyping.

Hi,

Thank you for your excellent pipeline. I used a custom-made VCF file for genotyping. However, some new variants appeared during the genotyping step.
Here are some examples:
Screenshot 2024-07-01 164546.pdf
The rows in the ID column that do not contain 'Chr' represent the newly discovered variants.
Can these newly generated variants simply be removed?

ERROR : Failed to create user namespace: user namespace disabled

Hello,

I ran the test data on the login node of an HPC cluster:

nextflow run /public/home/zyqi/pan-TE-analysis/GraffiTE/GraffiTE/main.nf \
   --assemblies assemblies.csv \
   --TE_library human_DFAM3.6.fasta \
   --reference hs37d5.chr22.fa \
    --reads reads.csv -with-singularity /public/home/zyqi/pan-TE-analysis/GraffiTE/graffite_latest.sif

It gave me an error in the first step map_asm: FATAL: while extracting /public/home/zyqi/pan-TE-analysis/GraffiTE/graffite_latest.sif: root filesystem extraction failed: extract command failed: ERROR : Failed to create user namespace: user namespace disabled
: exit status 1

What can I do to solve the error?

N E X T F L O W   ~  version 24.04.3

Launching `/public/home/zyqi/pan-TE-analysis/GraffiTE/GraffiTE/main.nf` [compassionate_hamilton] DSL2 - revision: 6d6ae7414d



▄████  ██▀███   ▄▄▄        █████▒ █████▒██▓▄▄▄█████▓▓█████
██▒ ▀█▒▓██ ▒ ██▒▒████▄    ▓██   ▒▓██           ██▒ ▓▒▓█   ▀
▒██░▄▄▄░▓██ ░▄█ ▒▒██  ▀█▄  ▒████ ░▒████ ░▒██▒▒ ▓██░ ▒░▒███
░▓█  ██▓▒██▀▀█▄  ░██▄▄▄▄██ ░▓█▒  ░░▓█▒  ░░██░░ ▓██▓ ░ ▒▓█  ▄
░▒▓███▀▒░██▓ ▒██▒  █   ▓██▒░▒█░   ░▒█░   ░██░  ▒██▒ ░ ░▒████▒
░▒   ▒ ░ ▒▓ ░▒▓░ ▒▒   ▓▒█░ ▒ ░    ▒ ░   ░▓    ▒ ░░   ░░ ▒░ ░
░   ░   ░▒ ░ ▒░  ▒   ▒▒ ░ ░      ░      ▒ ░    ░     ░ ░  ░
░ ░   ░   ░░   ░   ░   ▒    ░ ░    ░ ░    ▒ ░  ░         ░
░    ░           ░  ░               ░              ░  ░

V . null

Find and Genotype Transposable Elements Insertion Polymorphisms
in Genome Assemblies using a Pangenomic Approach

Authors: Cristian Groza and Clément Goubert
Bug/issues: https://github.com/cgroza/GraffiTE/issues


executor >  local (1)
[d9/e7bbb9] map_asm (1)    [100%] 1 of 1, failed: 1 ✘
[-        ] svim_asm       -
[-        ] survivor_merge -
[-        ] repeatmask_VCF -
[-        ] tsd_prep       -
[-        ] tsd_search     -
[-        ] tsd_report     -
[-        ] pangenie       -
[-        ] merge_VCFs     -
ERROR ~ Error executing process > 'map_asm (1)'

Caused by:
  Process `map_asm (1)` terminated with an error exit status (255)


Command executed:

  minimap2 -a -x asm5 --cs -r2k -t 1 -K 500M hs37d5.chr22.fa HG002.mat.cur.20211005_chr22.fasta.gz | samtools sort -m4G -@4 -o asm.sorted.bam -

Command exit status:
  255

Command output:
  (empty)

Command error:
  INFO:    Converting SIF file to temporary sandbox...
  FATAL:   while extracting /public/home/zyqi/pan-TE-analysis/GraffiTE/graffite_latest.sif: root filesystem extraction failed: extract command failed: ERROR  : Failed to create user namespace: user namespace disabled
  : exit status 1

Work dir:
  /public/home/zyqi/pan-TE-analysis/GraffiTE/GraffiTE/test/GraffiTE_testset/work/d9/e7bbb9304f96f7914424bc1d3e9d97

Tip: you can replicate the issue by changing to the process work dir and entering the command `bash .command.run`

 -- Check '.nextflow.log' file for details
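
One way to start diagnosing the "user namespace disabled" message is to look at the kernel setting that caps user namespaces. The sketch below is an assumption-laden check, not a fix: it reads /proc/sys/user/max_user_namespaces, where a value of 0 means user namespaces are unavailable. Some distributions use additional settings, so a positive value is not conclusive, and the function name is hypothetical.

```python
from pathlib import Path


def user_namespaces_enabled(proc_file="/proc/sys/user/max_user_namespaces"):
    """Return True or False if the kernel limit can be read, otherwise None."""
    try:
        return int(Path(proc_file).read_text().strip()) > 0
    except (OSError, ValueError):
        return None  # file missing or unreadable (e.g. not a Linux host)


print(user_namespaces_enabled())
```

If this reports False on the node where the job runs, the setting is controlled by the cluster administrators rather than by the pipeline.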

Inquiry about Integrating Minigraph-cactus into the Pipeline for Enhanced SV Analysis

Dear Guillaume Bourque's group,

Firstly, I would like to express my sincere gratitude for your remarkable software. It is indeed a significant contribution to the field. I've been using your pipeline for SV (Structural Variant) analysis, but I've encountered some intriguing differences in the results when comparing two methods.

I have been generating pan-genome SVs using Minigraph-cactus and also calling SVs based on your pipeline that utilizes svim for comparison against a reference genome. In my tests, I've observed that the alignment results from Minigraph-cactus seem to be more reliable.

Given these observations, I am curious about the possibility of integrating the Minigraph-cactus process into your pipeline. Specifically, I am interested in using the SVs generated by Minigraph-cactus and then applying your TE (Transposable Element) identification process for further analysis and genotyping.

I believe that the incorporation of Minigraph-cactus into your pipeline could enhance the accuracy and efficacy of SV analysis. Could you please let me know if such integration is feasible in your future development plans?

Your guidance and insights on this matter would be greatly appreciated.

Thank you for your time and consideration.

Best regards,
Yfchen

does GraffiTE work with 10x linked reads?

Hi there,

Thank you so much for developing such a good tool to identify the polymorphic TEs.

I am wondering if GraffiTE can work with 10x linked reads? Contig-level assemblies from 10x linked reads are much shorter than those from PacBio/Nanopore and might not have contigs spanning highly repetitive regions. Given this, do you think GraffiTE can work with a contig-level assembly from 10x linked-read data?

Thanks in advance!

Best,
Lin

The need for gzipped files

It would be awesome if the reads could be used in gzipped form.

Error for PanGenie-index

Description:

I'm encountering an issue when running the following command:

PanGenie-index -v pangenome.vcf -r GXFZ.fa -t 8 -o pangenome

The error message I'm receiving is:
https://private-user-images.githubusercontent.com/24269804/300379928-55105a77-470e-4216-91e5-1ca02ff4decc.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MDY1NDE2NDUsIm5iZiI6MTcwNjU0MTM0NSwicGF0aCI6Ii8yNDI2OTgwNC8zMDAzNzk5MjgtNTUxMDVhNzctNDcwZS00MjE2LTkxZTUtMWNhMDJmZjRkZWNjLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNDAxMjklMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjQwMTI5VDE1MTU0NVomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTYxYWJlMjBiMGI5NDk3ZmNiMjZlMDJlZWQwNmExOWUyMTEyMzc0YjAwYjRhNjIxMmFkYjUxMzRhOWJmYzg4MzkmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0JmFjdG9yX2lkPTAma2V5X2lkPTAmcmVwb19pZD0wIn0.6O0tLJA1oM935GJ4ROz_T3uzJdwk_EUcIUnI7bZajtU
https://private-user-images.githubusercontent.com/24269804/300379990-e6160b36-8c80-47d0-9d81-4b50561d0384.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MDY1NDE2NDUsIm5iZiI6MTcwNjU0MTM0NSwicGF0aCI6Ii8yNDI2OTgwNC8zMDAzNzk5OTAtZTYxNjBiMzYtOGM4MC00N2QwLTlkODEtNGI1MDU2MWQwMzg0LnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNDAxMjklMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjQwMTI5VDE1MTU0NVomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTUyZjkxNTg3ODAwMDEwY2JiNmY1NjYzYTYxZDVhZGQ3ZTRhZTU2OWQ2MjVmZDc5ZjM5OGIzYzM2ZmU0OTc2ZGImWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0JmFjdG9yX2lkPTAma2V5X2lkPTAmcmVwb19pZD0wIn0.Jf80qTBPsT8JxUS1JpKwPh7A2tOk7GanTxhVG45ELbA

Here, pangenome.vcf file is as follows:

https://private-user-images.githubusercontent.com/24269804/300380587-ec64c1fc-1f47-4ba3-8a62-4f4d61cd7c92.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MDY1NDE2NDUsIm5iZiI6MTcwNjU0MTM0NSwicGF0aCI6Ii8yNDI2OTgwNC8zMDAzODA1ODctZWM2NGMxZmMtMWY0Ny00YmEzLThhNjItNGY0ZDYxY2Q3YzkyLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNDAxMjklMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjQwMTI5VDE1MTU0NVomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPWI5MTVmYjM2ZWU2NmIyYzg2NjU4Y2ZkMDRjNzgzMTUzNTIyYWE0ZGZlOGNiOTllZmQwMWVmODAzYzQ5NzdjNDQmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0JmFjdG9yX2lkPTAma2V5X2lkPTAmcmVwb19pZD0wIn0.0P-sdiE4-iXqQeCUTSNVhW1QCyfR28Hu4bhVGuPgL1E

So, how can I handle it?

GraffiTE stops at the tsd_prep step when bypassing SV discovery

Hi,

I'm trying to run GraffiTE in the mode that bypasses the SV calling steps but it seems to get stuck at the tsd_prep step. I was wondering if you had any idea why?

Loading nextflow/23.04.4
  Loading requirement: Java/17.0.4
N E X T F L O W  ~  version 23.04.4
Launching `/workspace/Repo/GraffiTE/main.nf` [voluminous_sanger] DSL2 - revision: 20270181eb


▄████  ██▀███   ▄▄▄        █████▒ █████▒██▓▄▄▄█████▓▓█████
██▒ ▀█▒▓██ ▒ ██▒▒████▄    ▓██   ▒▓██           ██▒ ▓▒▓█   ▀
▒██░▄▄▄░▓██ ░▄█ ▒▒██  ▀█▄  ▒████ ░▒████ ░▒██▒▒ ▓██░ ▒░▒███
░▓█  ██▓▒██▀▀█▄  ░██▄▄▄▄██ ░▓█▒  ░░▓█▒  ░░██░░ ▓██▓ ░ ▒▓█  ▄
░▒▓███▀▒░██▓ ▒██▒  █   ▓██▒░▒█░   ░▒█░   ░██░  ▒██▒ ░ ░▒████▒
░▒   ▒ ░ ▒▓ ░▒▓░ ▒▒   ▓▒█░ ▒ ░    ▒ ░   ░▓    ▒ ░░   ░░ ▒░ ░
░   ░   ░▒ ░ ▒░  ▒   ▒▒ ░ ░      ░      ▒ ░    ░     ░ ░  ░
░ ░   ░   ░░   ░   ░   ▒    ░ ░    ░ ░    ▒ ░  ░         ░
░    ░           ░  ░               ░              ░  ░

V . null

Find and Genotype Transposable Elements Insertion Polymorphisms
in Genome Assemblies using a Pangenomic Approach

Authors: Cristian Groza and Clément Goubert
Bug/issues: https://github.com/cgroza/GraffiTE/issues


[-        ] process > repeatmask_VCF -
[-        ] process > tsd_prep       -

[-        ] process > repeatmask_VCF -
[-        ] process > tsd_prep       -
[-        ] process > tsd_search     -
[-        ] process > tsd_report     -

[-        ] process > repeatmask_VCF [  0%] 0 of 1
[-        ] process > tsd_prep       -
[-        ] process > tsd_search     -
[-        ] process > tsd_report     -

executor >  local (1)
[a0/da2e6a] process > repeatmask_VCF (1) [  0%] 0 of 1
[-        ] process > tsd_prep           -
[-        ] process > tsd_search         -
[-        ] process > tsd_report         -

executor >  local (2)
[a0/da2e6a] process > repeatmask_VCF (1) [100%] 1 of 1 ✔
[1c/8d8948] process > tsd_prep (1)       [  0%] 0 of 1
[-        ] process > tsd_search         -
[-        ] process > tsd_report         -

executor >  local (2)
[a0/da2e6a] process > repeatmask_VCF (1) [100%] 1 of 1 ✔
[1c/8d8948] process > tsd_prep (1)       [  0%] 0 of 1
[-        ] process > tsd_search         -
[-        ] process > tsd_report         -

executor >  local (2)
[a0/da2e6a] process > repeatmask_VCF (1) [100%] 1 of 1 ✔
[1c/8d8948] process > tsd_prep (1)       [100%] 1 of 1 ✔
[-        ] process > tsd_search         -
[-        ] process > tsd_report         -
Completed at: 23-May-2024 17:17:41
Duration    : 1m 10s
CPU hours   : 0.1
Succeeded   : 2

I'm using a VCF file from Sniffles2, but we also saw the same error with a VCF from SVIM-asm.

My reads.csv contains:

path,sample,type
./guppy_v6.4.6_sup.fq.gz,<tag>,ont

The commands I tried:

nextflow run main.nf \
    --vcf $vcfFile --genotype false \
    --reference $referenceGenome \
    --TE_library $TElib \
    --reads ./reads.csv \
    --graph_method graphaligner \
    --cores 4 \
    --repeatmasker_memory 24G \
    --graph_align_memory 24G \
    --vg_call_memory 24G

nextflow run main.nf \
    -profile cluster \
    -resume \
    --TE_library $TElib \
    --reference $referenceGenome \
    --reads ./reads.csv \
    --graph_method graphaligner \
    --vcf $vcfFile \
    --cores 1 \
    --repeatmasker_memory 24G \
    --graph_align_memory 24G \
    --vg_call_memory 24G

Current singularity image bug: `biomartr` missing

The latest Singularity image (sha256.aa0c18cb743d243bec4b18c16ed147d9bc7f4493c98603b7a886c7601a7beaa5) for GraffiTE is missing the R package biomartr, which causes the RepeatMasker process to fail when annotating the VCF. The next process, tsd_prep, then produces no output, and the pipeline ends prematurely.

Will update as soon as it is fixed!

Possible issue related to $TMPDIR

Hi,

I seem to have an issue at the RepeatMasker stage of the pipeline that could be related to how $TMPDIR is used. I am trying to run GraffiTE on a Slurm-based cluster where $TMPDIR is assigned per job. It contains a path to a folder in /scratch, following the pattern /scratch/user-ID_job_354286_o07c02. The job can write to scratch only within this path. The error says files cannot be created in this path:

compute repeat proportion for each SVs...
sort: cannot create temporary file in '/scratch/fr_de1013_job_354286_o07c02': No such file or directory
sort: cannot read: span: No such file or directory
sort: cannot create temporary file in '/scratch/fr_de1013_job_354286_o07c02': No such file or directory
Mammalian filters ON. Filtering...
awk: cmd. line:1: warning: regexp escape sequence '\#' is not a known regexp operator
sort: cannot create temporary file in '/scratch/fr_de1013_job_354286_o07c02': No such file or directory
sort: cannot read: TwP.txt: No such file or directory
writing vcf...
mktemp: failed to create file via template '/scratch/fr_de1013_job_354286_o07c02/tmp.XXXXXXXXXX': No such file or directory
/home/fr/fr_fr/fr_de1013/bin/GraffiTE/bin/repmask_vcf.sh: line 121: ${HDR_FILE}: ambiguous redirect

etc. From there on it keeps failing. Running the script interactively shows that $TMPDIR is in fact writeable, since I could write and read files there even after the pipeline failed.

From searching for this and similar errors, I found that the 'sort' command makes use of TMPDIR but cannot access it. It might be necessary to bind the location of $TMPDIR in Singularity (but I am not sure if this step runs in Singularity).
nf-core/chipseq#123 (comment)

At the same time, the system-set $TMPDIR appears to be overwritten in the pipeline (excerpt from main.nf lines 323 to 334, newest code from the repo):

switch(params.graph_method) {
                case "giraffe":
                    prep + """
                    vg autoindex --tmp-dir \$PWD  -p index/index -w giraffe -v sorted.vcf.gz -r ${fasta}
                    """ + finish
                    break
                case "graphaligner":
                    prep + """
                    export TMPDIR=$PWD
                    vg construct -a  -r ${fasta} -v ${vcf} -m 1024 > index/index.vg
                    """ + finish
                    break

I am not sure if the pipeline is at this stage yet, but that might be another potential issue in any case, since all programs that access $TMPDIR will not find it after that change.

Here is the .command.sh

#!/bin/bash -ue
ls *.vcf > vcfs.txt
SURVIVOR merge vcfs.txt 0.1 0 0 0 0 100 genotypes.vcf
repmask_vcf.sh genotypes.vcf genotypes_repmasked.vcf.gz combi_repmod_repbase_26_01_dfam_3_5_insecta.lib
bcftools view -G genotypes_repmasked.vcf.gz |   awk -v FS='  ' -v OFS='   '   '{if($0 ~ /#CHROM/) {$9 = "FORMAT"; $10 = "ref"; print $0} else if(substr($0, 1, 1) == "#") {print $0} else {$9 = "GT"; $10 = "1|0"; print $0}}' |   awk 'NR==1{print; print "##FORMAT=<ID=GT,Number=1,Type=String,Description=\"Genotype\">"} NR!=1' |   bcftools view -i 'INFO/total_match_span > 0.80' -o genotypes_repmasked_temp.vcf
fix_vcf.py --ref hifiasm_scaff10x_arks.fa.masked --vcf_in genotypes_repmasked_temp.vcf --vcf_out genotypes_repmasked_filtered.vcf

I hope I was clear in my description, if you need additional information please let me know. Thanks for your help.

Edit: Legibility of code
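
The symptom described in this issue (a $TMPDIR that exists on the host but cannot be found by `sort` and `mktemp` during the run) can be checked from inside the execution environment with a small helper like the one below. This is a sketch under assumptions: the function name, the fallback behaviour, and the example path are hypothetical and not part of GraffiTE.

```python
import os
import tempfile


def usable_tmpdir(fallback, env=os.environ):
    """Return $TMPDIR if it exists and is writable in this environment,
    otherwise return the absolute path of the fallback directory."""
    candidate = env.get("TMPDIR")
    if candidate and os.path.isdir(candidate) and os.access(candidate, os.W_OK):
        return candidate
    return os.path.abspath(fallback)


# Hypothetical per-job scratch path that is not visible in this environment.
fake_env = {"TMPDIR": "/scratch/job_dir_not_bound_here"}
print(usable_tmpdir(fallback=os.getcwd(), env=fake_env))
```

Running the same check on the host and inside the container shows whether the scratch path is visible in both places, which is what a bind mount would change.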

short reads question

Hi,

I started a new issue because I am not sure how the short reads should be handled, and others may also be interested in the answer.

The documentation says: "Paired-end reads must be interleaved in the same file (Pangenie)". Can you advise on how to interleave the files? Is a simple zcat sufficient?
zcat sampleR1.fq.gz sampleR2.fq.gz
How much coverage should the short reads have to be suitable? I have 20X; is this too much?
I am comparing assemblyA to assemblyB. Your manual says to add short reads to aid with genotyping. Should the short reads come from the same individual that assemblyB was based on, in order to avoid any bias?

Many thanks!
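
On the interleaving question: concatenating the two files with zcat places all R1 records before all R2 records, whereas an interleaved file alternates each R1 record with its R2 mate. The sketch below shows one way to do that. It is a minimal example with a hypothetical function name, and it assumes standard four-line FASTQ records and mates listed in the same order in both files.

```python
import gzip
from itertools import islice


def interleave_fastq(r1_path, r2_path, out_path):
    """Write records alternately: one R1 record, then its R2 mate."""
    with gzip.open(r1_path, "rt") as r1, \
         gzip.open(r2_path, "rt") as r2, \
         gzip.open(out_path, "wt") as out:
        while True:
            rec1 = list(islice(r1, 4))  # assumes 4 lines per FASTQ record
            rec2 = list(islice(r2, 4))
            if not rec1 and not rec2:
                break  # both files ended together
            if len(rec1) != 4 or len(rec2) != 4:
                raise ValueError("R1 and R2 do not hold the same number of complete records")
            out.writelines(rec1)
            out.writelines(rec2)


# Example call with hypothetical file names:
# interleave_fastq("sampleR1.fq.gz", "sampleR2.fq.gz", "sample.interleaved.fq.gz")
```

The function raises an error when the two files differ in record count, since a silent mismatch would pair the wrong mates.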

prepTSD filters

In line 24 of prepTSD, we explicitly filter VCF records by presence of "DEL" and "INS".

awk '/n_hits=1/ && /INS/ {print $1"\t"$2"\t"($2)+1"\t"$3; next} /n_hits=1/ && /DEL/ {print $1"\t"$2"\t"($2+length($4))"\t"$3; next} /n_hits=2/ && /INS/ && /5P_INV/ {print $1"\t"$2"\t"($2)+1"\t"$3; next} /n_hits=2/ && /DEL/ && /5P_INV/ {print $1"\t"$2"\t"($2+length($4))"\t"$3}' > oneHit_SV_coordinates.bed

This works with VCFs created by GraffiTE, but VCFs from other sources may not be annotated quite the same way.
For example, the HPRC VCFs are annotated with "del" and "ins".
Therefore, a more general approach is necessary to filter deletions and insertions. Perhaps with bcftools and its ILEN field.
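
The length-based idea suggested here can be sketched without relying on how a VCF spells "INS" or "DEL". The function below mirrors the meaning of the bcftools ILEN value (positive for insertions, negative for deletions) by comparing allele lengths directly. It is an illustration, not prepTSD code: the function name is hypothetical and it assumes sequence-resolved records with a single ALT allele.

```python
def classify_indel(ref, alt):
    """Classify a record as 'INS' or 'DEL' from its REF and ALT sequences.

    Returns None for symbolic alleles (e.g. <INS>), breakends, or
    records whose alleles have equal length.
    """
    if alt.startswith("<") or "[" in alt or "]" in alt:
        return None  # no sequence available to measure
    ilen = len(alt) - len(ref)  # same sign convention as ILEN
    if ilen > 0:
        return "INS"
    if ilen < 0:
        return "DEL"
    return None


print(classify_indel("A", "ATTTT"))
print(classify_indel("ATTTT", "A"))
```

Because the decision uses allele lengths only, it gives the same answer whether the source VCF annotates variants as "INS"/"DEL", "ins"/"del", or not at all.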

ERROR ~ Error executing process > 'graph_align_reads (41)'

Hi,

I'm attempting to run GraffiTE in GT-sv-GA mode, but I encounter an error during the execution of the graph_align_reads (41) process, as shown in the message below. (However, when I run this step manually using GraphAligner, it completes successfully but the job gets killed.) Do you have any idea what might be causing this issue?

Tip: you can replicate the issue by changing to the process work dir and entering the command bash .command.run

-- Check '.nextflow.log' file for details

executor > local (507)
[21/3da086] process > map_asm (1) [100%] 115 of 115 ✔
[cb/7c0a52] process > svim_asm (115) [100%] 115 of 115 ✔
[13/513669] process > survivor_merge [100%] 1 of 1 ✔
[b8/15025a] process > repeatmask_VCF (1) [100%] 1 of 1 ✔
[a8/d2e2df] process > tsd_prep (1) [100%] 1 of 1 ✔
[94/e3d127] process > tsd_search (148) [100%] 157 of 157 ✔
[60/be72ec] process > tsd_report (1) [100%] 1 of 1 ✔
[77/79f066] process > make_graph (1) [100%] 1 of 1 ✔
[1a/75fbbf] process > graph_align_reads (49) [100%] 115 of 115, failed: 89 ✘
[- ] process > vg_call -
[- ] process > merge_VCFs -
ERROR ~ Error executing process > 'graph_align_reads (41)'

Caused by:
Process graph_align_reads (41) terminated with an error exit status (134)

Command executed:

GraphAligner -t 1 -x vg -g index/index.vg -f X.fastq.gz -a X.gam

vg pack -x index/index.vg -g X.gam -o X.pack -Q 0

Command exit status:
134

Command output:
GraphAligner bioconda 1.0.13-
Load graph from index/index.vg
Build minimizer seeder from the graph
Minimizer seeds, length 15, window size 20, density 10
Seed cluster size 1
Alignment bandwidth 10
Clip alignment ends with identity < 66%
X-drop DP score cutoff 14705
write alignments to X.gam
Align

Command error:
INFO: Environment variable SINGULARITYENV_TMPDIR is set, but APPTAINERENV_TMPDIR is preferred
INFO: fuse2fs not found, will not be able to mount EXT3 filesystems
GraphAligner bioconda 1.0.13-
GraphAligner bioconda 1.0.13-
Load graph from index/index.vg
Build minimizer seeder from the graph
Minimizer seeds, length 15, window size 20, density 10
Seed cluster size 1
Alignment bandwidth 10
Clip alignment ends with identity < 66%
X-drop DP score cutoff 14705
write alignments to X.gam
Align
Signal 11. Read: 2a9030c3-0fd1-4219-8496-ad2d7f1fb33b runid=b8d4812b34a5a9fdf693319045b7ee7e26775c97 read=30068 ch=2021 start_time=2019-11-30T05:26:38Z flow_cell_id=PAD94007 protocol_group_id=191128-CLIMARES sample_id=MULTIPLEX6-2-M40-S00-L14 barcode=barcode07. Seed: 34132+,3021,15,752
.command.sh: line 2: 16 Aborted GraphAligner -t 1 -x vg -g index/index.vg -f X.fastq.gz -a X.gam

This is from .nextflow log file:

Jul-12 03:21:25.913 [Task monitor] DEBUG nextflow.processor.TaskProcessor - Handling unexpected condition for
task: name=graph_align_reads (35); work-dir=/raven/ptmp/ykaya/Pangenome_project/SV/TE_SV/work/58/afee1f76707eac941eba261f63cc7d
error [nextflow.exception.ProcessFailedException]: Process graph_align_reads (35) failed
Jul-12 03:21:25.914 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[id: 426; name: graph_align_reads (34); status: COMPLETED; exit: -; error: nextflow.exception.ProcessException: Process exceeded running time limit (12h); workDir: /raven/ptmp/ykaya/Pangenome_project/SV/TE_SV/work/39/b2a0764669a366a56a5e598a98e656]
Jul-12 03:21:25.914 [Task monitor] DEBUG nextflow.processor.TaskProcessor - Handling unexpected condition for
task: name=graph_align_reads (34); work-dir=/raven/ptmp/ykaya/Pangenome_project/SV/TE_SV/work/39/b2a0764669a366a56a5e598a98e656
error [nextflow.exception.ProcessFailedException]: Process graph_align_reads (34) failed
Jul-12 03:21:25.915 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[id: 440; name: graph_align_reads (48); status: COMPLETED; exit: -; error: nextflow.exception.ProcessException: Process exceeded running time limit (12h); workDir: /raven/ptmp/ykaya/Pangenome_project/SV/TE_SV/work/dd/a38cb1d0b16fac105668ec2bf9234a]
Jul-12 03:21:25.915 [Task monitor] DEBUG nextflow.processor.TaskProcessor - Handling unexpected condition for
task: name=graph_align_reads (48); work-dir=/raven/ptmp/ykaya/Pangenome_project/SV/TE_SV/work/dd/a38cb1d0b16fac105668ec2bf9234a
error [nextflow.exception.ProcessFailedException]: Process graph_align_reads (48) failed
Jul-12 03:21:25.915 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[id: 396; name: graph_align_reads (4); status: COMPLETED; exit: -; error: nextflow.exception.ProcessException: Process exceeded running time limit (12h); workDir: /raven/ptmp/ykaya/Pangenome_project/SV/TE_SV/work/97/785f099c658f091c3b16c7b0abe50e]
Jul-12 03:21:25.915 [Task monitor] DEBUG nextflow.processor.TaskProcessor - Handling unexpected condition for
task: name=graph_align_reads (4); work-dir=/raven/ptmp/ykaya/Pangenome_project/SV/TE_SV/work/97/785f099c658f091c3b16c7b0abe50e
error [nextflow.exception.ProcessFailedException]: Process graph_align_reads (4) failed
Jul-12 03:21:25.916 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[id: 434; name: graph_align_reads (42); status: COMPLETED; exit: -; error: nextflow.exception.ProcessException: Process exceeded running time limit (12h); workDir: /raven/ptmp/ykaya/Pangenome_project/SV/TE_SV/work/24/4cc79fecfa1f92029986601a22bd88]
Jul-12 03:21:25.916 [Task monitor] DEBUG nextflow.processor.TaskProcessor - Handling unexpected condition for
task: name=graph_align_reads (42); work-dir=/raven/ptmp/ykaya/Pangenome_project/SV/TE_SV/work/24/4cc79fecfa1f92029986601a22bd88
error [nextflow.exception.ProcessFailedException]: Process graph_align_reads (42) failed
Jul-12 03:21:25.929 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[id: 483; name: graph_align_reads (91); status: COMPLETED; exit: -; error: nextflow.exception.ProcessException: Process exceeded running time limit (12h); workDir: /raven/ptmp/ykaya/Pangenome_project/SV/TE_SV/work/fd/a12ae6a90d2858d7514f4992c60e24]
Jul-12 03:21:25.929 [Task monitor] DEBUG nextflow.processor.TaskProcessor - Handling unexpected condition for
task: name=graph_align_reads (91); work-dir=/raven/ptmp/ykaya/Pangenome_project/SV/TE_SV/work/fd/a12ae6a90d2858d7514f4992c60e24
error [nextflow.exception.ProcessFailedException]: Process graph_align_reads (91) failed
Jul-12 03:21:25.953 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[id: 405; name: graph_align_reads (13); status: COMPLETED; exit: -; error: nextflow.exception.ProcessException: Process exceeded running time limit (12h); workDir: /raven/ptmp/ykaya/Pangenome_project/SV/TE_SV/work/58/5a8dbb35508e8365cbef0793e7f1f1]
Jul-12 03:21:25.954 [Task monitor] DEBUG nextflow.processor.TaskProcessor - Handling unexpected condition for
task: name=graph_align_reads (13); work-dir=/raven/ptmp/ykaya/Pangenome_project/SV/TE_SV/work/58/5a8dbb35508e8365cbef0793e7f1f1
error [nextflow.exception.ProcessFailedException]: Process graph_align_reads (13) failed
Jul-12 03:21:25.954 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[id: 462; name: graph_align_reads (70); status: COMPLETED; exit: -; error: nextflow.exception.ProcessException: Process exceeded running time limit (12h); workDir: /raven/ptmp/ykaya/Pangenome_project/SV/TE_SV/work/58/6143156afdcd9ead22067413ce58d3]
Jul-12 03:21:25.954 [Task monitor] DEBUG nextflow.processor.TaskProcessor - Handling unexpected condition for
task: name=graph_align_reads (70); work-dir=/raven/ptmp/ykaya/Pangenome_project/SV/TE_SV/work/58/6143156afdcd9ead22067413ce58d3
error [nextflow.exception.ProcessFailedException]: Process graph_align_reads (70) failed
Jul-12 03:21:25.955 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[id: 484; name: graph_align_reads (92); status: COMPLETED; exit: -; error: nextflow.exception.ProcessException: Process exceeded running time limit (12h); workDir: /raven/ptmp/ykaya/Pangenome_project/SV/TE_SV/work/50/5f76f2b85046ea4410442f61413fb8]
Jul-12 03:21:25.955 [Task monitor] DEBUG nextflow.processor.TaskProcessor - Handling unexpected condition for
task: name=graph_align_reads (92); work-dir=/raven/ptmp/ykaya/Pangenome_project/SV/TE_SV/work/50/5f76f2b85046ea4410442f61413fb8
error [nextflow.exception.ProcessFailedException]: Process graph_align_reads (92) failed
Jul-12 03:21:25.955 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[id: 441; name: graph_align_reads (49); status: COMPLETED; exit: -; error: nextflow.exception.ProcessException: Process exceeded running time limit (12h); workDir: /raven/ptmp/ykaya/Pangenome_project/SV/TE_SV/work/1a/75fbbffbd47b5475851ec1121ee26c]
Jul-12 03:21:25.956 [Task monitor] DEBUG nextflow.processor.TaskProcessor - Handling unexpected condition for
task: name=graph_align_reads (49); work-dir=/raven/ptmp/ykaya/Pangenome_project/SV/TE_SV/work/1a/75fbbffbd47b5475851ec1121ee26c
error [nextflow.exception.ProcessFailedException]: Process graph_align_reads (49) failed
Jul-12 03:21:26.093 [main] DEBUG nextflow.Session - Session await > all barriers passed
Jul-12 03:21:26.103 [main] DEBUG nextflow.util.ThreadPoolManager - Thread pool 'PublishDir' shutdown completed (hard=false)
Jul-12 03:21:26.210 [main] DEBUG nextflow.trace.WorkflowStatsObserver - Workflow completed > WorkflowStats[succeededCount=418; failedCount=89; ignoredCount=0; cachedCount=0; pendingCount=0; submittedCount=0; runningCount=0; retriesCount=0; abortedCount=0; succeedDuration=9d 21h 44m 9s; failedDuration=43d 16h 36m 38s; cachedDuration=0ms;loadCpus=0; loadMemory=0; peakRunning=115; peakCpus=115; peakMemory=480 GB; ]
Jul-12 03:21:26.517 [main] DEBUG nextflow.cache.CacheDB - Closing CacheDB done
Jul-12 03:21:26.571 [main] DEBUG nextflow.util.ThreadPoolManager - Thread pool 'FileTransfer' shutdown completed (hard=false)
Jul-12 03:21:26.571 [main] DEBUG nextflow.script.ScriptRunner - > Execution complete -- Goodbye

The command I tried:

/u/ykaya/nextflow run https://github.com/cgroza/GraffiTE --genotype true \
   --assemblies assemblies.csv \
   --TE_library TE.nonredun.fa \
   --reference /ptmp/X.fasta \
   --graph_method graphaligner \
   --reads longreads.csv
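
The console output in this issue reports exit status 134, while the .nextflow.log entries for the listed tasks report "Process exceeded running time limit (12h)". To see how many tasks failed for each reason, the "Task completed" lines can be tallied. The sketch below is a hypothetical helper, not a Nextflow feature; it assumes each TaskHandler entry sits on a single, unwrapped line in the format shown in the log above.

```python
import re
from collections import Counter

# Matches the TaskHandler[...] summary that Nextflow writes for completed tasks.
TASK_RE = re.compile(
    r"TaskHandler\[id: \d+; name: (?P<name>[^;]+); status: \w+; "
    r"exit: [^;]*; error: (?P<error>.*?); workDir:"
)


def failure_reasons(log_text):
    """Count the error messages attached to completed tasks in a .nextflow.log dump."""
    errors = (m.group("error") for m in TASK_RE.finditer(log_text))
    return Counter(e for e in errors if e != "-")


# One entry in the format seen above; the workDir is a placeholder.
sample = (
    "Jul-12 03:21:25.915 [Task monitor] DEBUG n.processor.TaskPollingMonitor - "
    "Task completed > TaskHandler[id: 396; name: graph_align_reads (4); "
    "status: COMPLETED; exit: -; error: nextflow.exception.ProcessException: "
    "Process exceeded running time limit (12h); workDir: /path/to/work/dir]"
)
for reason, count in failure_reasons(sample).items():
    print(count, reason)
```

Separating time-limit failures from crashes indicates whether a longer time allowance or a closer look at the aligner crash is the more relevant next step.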

TE library construction and Input file quality control

Dear cgroza

Thank you for developing the GraffiTE software; it will obviously be cited a lot in the future, and it is very helpful and inspiring to me at the moment!
I currently have Nanopore data for nearly 100 samples and 30 high-quality genomes of this species. The genetic diversity between individuals is very high, so my initial idea is to use EDTA + RepeatModeler + Repbase to predict TEs in these thirty genomes and, after removing redundancy, build a comprehensive TE_library. Would you recommend this strategy? I ask because it may determine the accuracy of the subsequent ONT data genotyping.
In addition, it is well known that ONT data contain quite a few sequencing errors. Will this affect the accuracy of TE detection later on? Do I need to use second-generation sequencing data to correct the ONT data, and would this have a big impact on the results?

Sincerely
yulong

GraffiTE exiting at minimap due to "Unknown option: --no-home"

I've been trying to run GraffiTE on a cluster, and it fails at the alignment step. Do you have any pointers on how I can get it running? I'm not sure if this is a bug with Nextflow/Singularity or with GraffiTE.

Here's the command I've used with the test data, and the log output below

nextflow run /ceph/users/jgalbraith/Programs/GraffiTE/main.nf \
   --reference hs37d5.chr22.fa --assemblies assemblies.csv --reads reads.csv --TE_library human_DFAM3.6.fasta \
   --cores 40 -with-singularity /ceph/users/jgalbraith/Programs/GraffiTE/graffite_latest.sif

N E X T F L O W  ~  version 23.10.1
Launching `/ceph/users/jgalbraith/Programs/GraffiTE/main.nf` [fabulous_mccarthy] DSL2 - revision: 20270181eb


▄████  ██▀███   ▄▄▄        █████▒ █████▒██▓▄▄▄█████▓▓█████
██▒ ▀█▒▓██ ▒ ██▒▒████▄    ▓██   ▒▓██           ██▒ ▓▒▓█   ▀
▒██░▄▄▄░▓██ ░▄█ ▒▒██  ▀█▄  ▒████ ░▒████ ░▒██▒▒ ▓██░ ▒░▒███
░▓█  ██▓▒██▀▀█▄  ░██▄▄▄▄██ ░▓█▒  ░░▓█▒  ░░██░░ ▓██▓ ░ ▒▓█  ▄
░▒▓███▀▒░██▓ ▒██▒  █   ▓██▒░▒█░   ░▒█░   ░██░  ▒██▒ ░ ░▒████▒
░▒   ▒ ░ ▒▓ ░▒▓░ ▒▒   ▓▒█░ ▒ ░    ▒ ░   ░▓    ▒ ░░   ░░ ▒░ ░
░   ░   ░▒ ░ ▒░  ▒   ▒▒ ░ ░      ░      ▒ ░    ░     ░ ░  ░
░ ░   ░   ░░   ░   ░   ▒    ░ ░    ░ ░    ▒ ░  ░         ░
░    ░           ░  ░               ░              ░  ░

V . null

Find and Genotype Transposable Elements Insertion Polymorphisms
in Genome Assemblies using a Pangenomic Approach

Authors: Cristian Groza and Clément Goubert
Bug/issues: https://github.com/cgroza/GraffiTE/issues


[-        ] process > map_asm        -
[-        ] process > svim_asm       -
[-        ] process > survivor_merge -
[-        ] process > repeatmask_VCF -
[-        ] process > tsd_prep       -
[-        ] process > tsd_search     -
[-        ] process > tsd_report     -
[-        ] process > pangenie       -
[-        ] process > merge_VCFs     -

executor >  local (1)
[ef/f8e87c] process > map_asm (1)    [  0%] 0 of 1
[-        ] process > svim_asm       -
[-        ] process > survivor_merge -
[-        ] process > repeatmask_VCF -
[-        ] process > tsd_prep       -
[-        ] process > tsd_search     -
[-        ] process > tsd_report     -
[-        ] process > pangenie       -
[-        ] process > merge_VCFs     -

ERROR ~ Error executing process > 'map_asm (1)'

Caused by:
  Process `map_asm (1)` terminated with an error exit status (1)

Command executed:

  minimap2 -a -x asm5 --cs -r2k -t 40 -K 500M hs37d5.chr22.fa HG002.mat.cur.20211005_chr22.fasta.gz | samtools sort -m4G -@4 -o asm.sorted.bam -

Command exit status:
  1

Command output:

Command error:
  ERROR: Unknown option: --no-home

Work dir:
  /ceph/users/jgalbraith/Programs/GraffiTE/test/GraffiTE_testset/work/ef/f8e87c34a7cc24ce498d825f4b43f8

Tip: you can replicate the issue by changing to the process work dir and entering the command `bash .command.run`

 -- Check '.nextflow.log' file for details

executor >  local (1)
[ef/f8e87c] process > map_asm (1)    [100%] 1 of 1, failed: 1 ✘
[-        ] process > svim_asm       -
[-        ] process > survivor_merge -
[-        ] process > repeatmask_VCF -
[-        ] process > tsd_prep       -
[-        ] process > tsd_search     -
[-        ] process > tsd_report     -
[-        ] process > pangenie       -
[-        ] process > merge_VCFs     -

Error "Argument of `file` function cannot be null"

We are trying to run the pipeline but after giving the arguments:

nextflow run cgroza/GraffiTE --assemblies SeaTurtles.csv --TE_library /srv/public/users/tomas/STR/rDerCor1/Checking_11_21/RepeatModeller2/rDerCor1_database2-families.fa --reference /srv/public/users/tomas/STR/rDerCor1/Checking_11_21/RepeatModeller2/rDerCor1.pri.cur.20210524.fasta --reads Reads.csv --cores 40

nextflow run /srv/public/users/tomas/programs/GraffiTE/main.nf --assemblies SeaTurtles.csv --TE_library /srv/public/users/tomas/STR/rDerCor1/Checking_11_21/RepeatModeller2/rDerCor1_database2-families.fa --reference /srv/public/users/tomas/STR/rDerCor1/Checking_11_21/RepeatModeller2/rDerCor1.pri.cur.20210524.fasta --reads Reads.csv --cores 40

We get the error:
Argument of `file` function cannot be null

-- Check script '/srv/public/users/tomas/programs/GraffiTE/main.nf' at line: 59 or see '.nextflow.log' file for more details

Checking the .nextflow.log:

Nov-08 11:38:02.199 [Actor Thread 4] ERROR nextflow.extension.OperatorImpl - @unknown
java.lang.IllegalArgumentException: Argument of `file` function cannot be null
	at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:77)
	at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.base/java.lang.reflect.Constructor.newInstanceWithCaller(Constructor.java:499)
	at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:480)
	at org.codehaus.groovy.reflection.CachedConstructor.invoke(CachedConstructor.java:72)
	at org.codehaus.groovy.reflection.CachedConstructor.doConstructorInvoke(CachedConstructor.java:59)
	at org.codehaus.groovy.runtime.callsite.ConstructorSite$ConstructorSiteNoUnwrap.callConstructor(ConstructorSite.java:84)
	at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCallConstructor(CallSiteArray.java:59)
	at org.codehaus.groovy.runtime.callsite.AbstractCallSite.callConstructor(AbstractCallSite.java:263)
	at org.codehaus.groovy.runtime.callsite.AbstractCallSite.callConstructor(AbstractCallSite.java:277)
	at nextflow.Nextflow.file(Nextflow.groovy:146)
	at nextflow.Nextflow$file.callStatic(Unknown Source)
	at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCallStatic(CallSiteArray.java:55)
	at org.codehaus.groovy.runtime.callsite.AbstractCallSite.callStatic(AbstractCallSite.java:217)
	at org.codehaus.groovy.runtime.callsite.AbstractCallSite.callStatic(AbstractCallSite.java:240)
	at Script_1fa1c3aa$_runScript_closure2.doCall(Script_1fa1c3aa:59)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:568)
	at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:107)
	at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:323)
	at org.codehaus.groovy.runtime.metaclass.ClosureMetaClass.invokeMethod(ClosureMetaClass.java:274)
	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1035)
	at org.codehaus.groovy.runtime.callsite.PogoMetaClassSite.call(PogoMetaClassSite.java:38)
	at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCall(CallSiteArray.java:47)
	at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:125)
	at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:139)
	at nextflow.extension.MapOp$_apply_closure1.doCall(MapOp.groovy:57)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:568)
	at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:107)
	at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:323)
	at org.codehaus.groovy.runtime.metaclass.ClosureMetaClass.invokeMethod(ClosureMetaClass.java:274)
	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1035)
	at groovy.lang.Closure.call(Closure.java:412)
	at groovyx.gpars.dataflow.operator.DataflowOperatorActor.startTask(DataflowOperatorActor.java:120)
	at groovyx.gpars.dataflow.operator.DataflowOperatorActor.onMessage(DataflowOperatorActor.java:108)
	at groovyx.gpars.actor.impl.SDAClosure$1.call(SDAClosure.java:43)
	at groovyx.gpars.actor.AbstractLoopingActor.runEnhancedWithoutRepliesOnMessages(AbstractLoopingActor.java:293)
	at groovyx.gpars.actor.AbstractLoopingActor.access$400(AbstractLoopingActor.java:30)
	at groovyx.gpars.actor.AbstractLoopingActor$1.handleMessage(AbstractLoopingActor.java:93)
	at groovyx.gpars.util.AsyncMessagingCore.run(AsyncMessagingCore.java:132)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at java.base/java.lang.Thread.run(Thread.java:833)
Nov-08 11:38:02.273 [Actor Thread 4] DEBUG nextflow.Session - Session aborted -- Cause: Argument of `file` function cannot be null

I have no idea what is wrong. Help, please :)

singularity issue

I am trying to get GraffiTE installed on our server, but I am having some issues with singularity. I am not sure if this is on my side and I need to update something? I can clone the github repository and all files are downloaded. When I then do the second step I get an error:

singularity pull --arch amd64 graffite_latest.sif library://cgroza/collection/graffite:latest

FATAL: Unable to get library client configuration: remote has no library client (see https://apptainer.org/docs/user/latest/endpoint.html#no-default-remote)

singularity guide

3_TSD_search output folder missing

Hello,

First of all: thank you very much for the pipeline.

I ran the command like this:

nextflow run cgroza/GraffiTE --assemblies assemblies.csv --TE_library nrTREP20 --reference Bgt_genome_v3_16 --graph_method pangenie --genotype false

To detect SVs in 10 fully assembled genomes (so I didn't add any reads for mapping so far).

As far as I understand, the third output folder (3_TSD_search), and especially pangenome.vcf, should still be written. Is that right?

I only get:
[rest_of_path]/out$ ls
1_SV_search 2_Repeat_Filtering

The output files in these folders look "normal" as far as I can tell. For example, the file "indels.fa.masked" has many sequences, most of which are at least partially repeat-masked.

However the file:
genotypes_repmasked_filtered.vcf somehow has no variants, even if there are many in the per sample vcfs.

Also, there is no error message at the end of the run:

executor > local (21)
[d3/6be117] process > map_asm (5) [100%] 9 of 9 ✔
[a0/d40a97] process > svim_asm (9) [100%] 9 of 9 ✔
[32/f5dfe6] process > survivor_merge [100%] 1 of 1 ✔
[9c/720983] process > repeatmask_VCF (1) [100%] 1 of 1 ✔
[4f/1f4338] process > tsd_prep (1) [100%] 1 of 1 ✔
[- ] process > tsd_search -
[- ] process > tsd_report -
Completed at: 25-Apr-2024 11:58:28
Duration : 26m 27s
CPU hours : 2.3
Succeeded : 21

Thank you very much for your help!

GraffiTE ended at repeatmask vcf: Error executing process > 'repeatmask_VCF (1)' Caused by: Process `repeatmask_VCF (1)` terminated with an error exit status (255)

Hi, I hit an error at the repeatmask step:

nextflow run .../Software/GraffiTE/main.nf \
   --assemblies assemblies.csv \
   --TE_library Libraries/RepbaseforRepeatMasker.fasta \
   --reference gadMor3.0_shorten.fa \
   --reads reads.csv \
   --cores 16

executor > local (3)
[e6/7ef80e] process > svim_asm (1) [100%] 1 of 1 ✔
[e6/ad812a] process > survivor_merge [100%] 1 of 1 ✔
[13/56bd72] process > repeatmask_VCF (1) [ 0%] 0 of 1
[- ] process > tsd_prep -
[- ] process > tsd_search -
[- ] process > tsd_report -
[- ] process > pangenie -
[- ] process > merge_VCFs -
Error executing process > 'repeatmask_VCF (1)'

Caused by:
Process repeatmask_VCF (1) terminated with an error exit status (255)

Command executed:

repmask_vcf.sh genotypes.vcf genotypes_repmasked.vcf.gz RepbaseForRepeatMasker.fasta
bcftools view -G genotypes_repmasked.vcf.gz | awk -v FS=' ' -v OFS=' ' '{if($0 ~ /#CHROM/) {$9 = "FORMAT"; $10 = "ref"; print $0} else if(substr($0, 1, 1) == "#") {print $0} else {$9 = "GT"; $10 = "1|0"; print $0}}' | awk 'NR==1{print; print "##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">"} NR!=1' | bcftools view -i 'INFO/total_match_span > 0.80' -o genotypes_repmasked_temp.vcf
fix_vcf.py --ref gadMor3.0_shorten.fa --vcf_in genotypes_repmasked_temp.vcf --vcf_out genotypes_repmasked_filtered.vcf

Command exit status:
255

Command output:
column count: 10

Meta line 260 read in.
All meta lines processed.
gt matrix initialized.
Character matrix gt created.
Character matrix gt rows: 31290
Character matrix gt cols: 10
skip: 0
nrows: 31290
row_num: 0

Processed variant 1000
Processed variant 2000
Processed variant 3000
Processed variant 4000
Processed variant 5000
Processed variant 6000
Processed variant 7000
Processed variant 8000
Processed variant 9000
Processed variant 10000
Processed variant 11000
Processed variant 12000
Processed variant 13000
Processed variant 14000
Processed variant 15000
Processed variant 16000
Processed variant 17000
Processed variant 18000
Processed variant 19000
Processed variant 20000
Processed variant 21000
Processed variant 22000
Processed variant 23000
Processed variant 24000
Processed variant 25000
Processed variant 26000
Processed variant 27000
Processed variant 28000
Processed variant 29000
Processed variant 30000
Processed variant 31000
Processed variant: 31290
All variants processed
[1] "CHROM" "POS" "qry_id" "n_hits"
[5] "fragmts" "match_lengths" "repeat_ids" "matching_classes"
[9] "strands" "RM_id"
compute repeat proportion for each SVs...
Mammalian filters OFF, writing vcf...

Command error:
Phase 1 : exact matches
#################################
85 matches found in non-fuzzy phase

#################################
4680 elements found without match
#################################

#################################
Output file should be manually edited to take into account all specificities of the considered organism!
#################################

Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

  filter, lag

The following objects are masked from 'package:base':

  intersect, setdiff, setequal, union


 *****       ***   vcfR   ***       *****
 This is vcfR 1.13.0 
   browseVignettes('vcfR') # Documentation
   citation('vcfR') # Citation
 *****       *****      *****       *****

Warning message:
The x argument of as_tibble.matrix() must have unique column names if
.name_repair is omitted as of tibble 2.0.0.
i Using compatibility .name_repair.
awk: cmd. line:1: (FILENAME=- FNR=1) fatal: division by zero attempted
mktemp: failed to create file via template '/scratch/SlurmTMP/user.7524169/tmp.XXXXXXXXXX': No such file or directory
.../Software/GraffiTE/bin/repmask_vcf.sh: line 146: ${HDR_FILE}: ambiguous redirect
.../Software/GraffiTE/bin/repmask_vcf.sh: line 147: ${HDR_FILE}: ambiguous redirect
.../Software/GraffiTE/bin/repmask_vcf.sh: line 148: ${HDR_FILE}: ambiguous redirect
.../Software/GraffiTE/bin/repmask_vcf.sh: line 149: ${HDR_FILE}: ambiguous redirect
.../Software/GraffiTE/bin/repmask_vcf.sh: line 150: ${HDR_FILE}: ambiguous redirect
.../Software/GraffiTE/bin/repmask_vcf.sh: line 151: ${HDR_FILE}: ambiguous redirect
.../Software/GraffiTE/bin/repmask_vcf.sh: line 152: ${HDR_FILE}: ambiguous redirect
.../Software/GraffiTE/bin/repmask_vcf.sh: line 153: ${HDR_FILE}: ambiguous redirect
.../Software/GraffiTE/bin/repmask_vcf.sh: line 154: ${HDR_FILE}: ambiguous redirect
.../Software/GraffiTE/bin/repmask_vcf.sh: line 155: ${HDR_FILE}: ambiguous redirect
sort: cannot create temporary file in '/scratch/SlurmTMP/user.7524169': No such file or directory
[E::hts_open_format] Failed to open file "CHROM,POS,~ID,INFO/n_hits,INFO/fragmts,INFO/match_lengths,INFO/repeat_ids,INFO/matching_classes,INFO/RM_hit_strands,INFO/RM_hit_IDs,INFO/total_match_length,INFO/total_match_span" : No such file or directory
Failed to read from CHROM,POS,~ID,INFO/n_hits,INFO/fragmts,INFO/match_lengths,INFO/repeat_ids,INFO/matching_classes,INFO/RM_hit_strands,INFO/RM_hit_IDs,INFO/total_match_length,INFO/total_match_span: No such file or directory
Failed to read from standard input: unknown file type

Work dir:
/.../work/13/56bd72c9c56695ebc011428b61aac4

Tip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named .command.sh

.command.sh:

#!/bin/bash -ue
repmask_vcf.sh genotypes.vcf genotypes_repmasked.vcf.gz RepbaseForRepeatMasker.fasta
bcftools view -G genotypes_repmasked.vcf.gz | awk -v FS=' ' -v OFS=' ' '{if($0 ~ /#CHROM/) {$9 = "FORMAT"; $10 = "ref"; print $0} else if(substr($0, 1, 1) == "#") {print $0} else {$9 = "GT"; $10 = "1|0"; print $0}}' | awk 'NR==1{print; print "##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">"} NR!=1' | bcftools view -i 'INFO/total_match_span > 0.80' -o genotypes_repmasked_temp.vcf
fix_vcf.py --ref gadMor3.0_shorten.fa --vcf_in genotypes_repmasked_temp.vcf --vcf_out genotypes_repmasked_filtered.vcf
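
In the log above, the first hard failures are `mktemp` and `sort` being unable to create files under /scratch/SlurmTMP/..., which suggests that the temporary directory set by the scheduler may not exist (or may not be visible) where the process actually runs. The Python sketch below is not part of GraffiTE; it is only an assumed way to test whether a directory can hold temporary files. The function name `check_tmpdir` is made up for illustration.

```python
import os
import tempfile


def check_tmpdir(path):
    """Return (ok, message) describing whether `path` can hold temporary files."""
    if not path:
        return False, "no directory given (TMPDIR is unset or empty)"
    if not os.path.isdir(path):
        return False, f"{path}: no such directory"
    try:
        # Like mktemp, try to create a uniquely named file in the directory.
        fd, name = tempfile.mkstemp(dir=path)
    except OSError as err:
        return False, f"{path}: cannot create files ({err.strerror})"
    os.close(fd)
    os.remove(name)
    return True, f"{path}: usable"


if __name__ == "__main__":
    # Check whatever TMPDIR points to in the current environment.
    ok, message = check_tmpdir(os.environ.get("TMPDIR", ""))
    print("OK" if ok else "PROBLEM", "-", message)
```

Running it inside the same job, and if possible inside the same container, as the failing process should show whether the temporary directory is the problem.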

GraffiTE ends prematurely at "tsd_prep" with no output. [FIXED: check chromosome names]

    Thanks for the tip, I ended up removing the "--contain" flag altogether, since it always seemed to insist on going to the $TMPDIR path whatever I tried to bind. That worked, however, now the pipeline stops after the first TSD step. The job finished as successful, no error was generated. But the pipeline is not complete I think, as there is no 3_... folder in the output. The last result folder generated is `2_Repeat_Filtering` with

genotypes_repmasked_filtered.vcf repeatmasker_dir

The job output:

executor >  local (1)
[ba/f7b952] process > svim_asm (5)     [100%] 16 of 16, cached: 16 ✔
[06/c2782e] process > repeatmasker (1) [100%] 1 of 1, cached: 1 ✔
[bd/7be51f] process > tsd_prep (1)     [100%] 1 of 1 ✔
[-        ] process > tsd_search       -
[-        ] process > tsd_report       -

The .command.sh:

#!/bin/bash -ue
ls *.vcf > vcfs.txt
SURVIVOR merge vcfs.txt 0.1 0 0 0 0 100 genotypes.vcf
repmask_vcf.sh genotypes.vcf genotypes_repmasked.vcf.gz combi_repmod_repbase_26_01_dfam_3_5_insecta.lib
bcftools view -G genotypes_repmasked.vcf.gz |     awk -v FS='	' -v OFS='	'     '{if($0 ~ /#CHROM/) {$9 = "FORMAT"; $10 = "ref"; print $0} else if(substr($0, 1, 1) == "#") {print $0} else {$9 = "GT"; $10 = "1|0"; print $0}}' |     awk 'NR==1{print; print "##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">"} NR!=1' |     bcftools view -i 'INFO/total_match_span > 0.80' -o genotypes_repmasked_temp.vcf
fix_vcf.py --ref hifiasm_scaff10x_arks.fa.masked --vcf_in genotypes_repmasked_temp.vcf --vcf_out genotypes_repmasked_filtered.vcf

Originally posted by @dewuem in #8 (comment)
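
The title of this issue points to chromosome names as the fix. The thread excerpt does not say exactly which names mismatched, so the following is only a sketch of one plausible check: list the CHROM values of a VCF that have no record of the same name in a FASTA file. All function names are hypothetical, and the file names in the usage comment are taken from the commands above.

```python
import gzip


def open_text(path):
    """Open plain or gzip-compressed text files transparently."""
    return gzip.open(path, "rt") if str(path).endswith(".gz") else open(path)


def fasta_names(path):
    """Sequence names in a FASTA file (text after '>' up to the first whitespace)."""
    with open_text(path) as handle:
        return {
            line[1:].split()[0]
            for line in handle
            if line.startswith(">") and len(line.strip()) > 1
        }


def vcf_chroms(path):
    """Values of the CHROM column in a VCF file, header lines skipped."""
    with open_text(path) as handle:
        return {
            line.split("\t", 1)[0]
            for line in handle
            if line.strip() and not line.startswith("#")
        }


def missing_contigs(vcf_path, fasta_path):
    """CHROM values used in the VCF that have no matching FASTA record name."""
    return sorted(vcf_chroms(vcf_path) - fasta_names(fasta_path))


# Example call (paths are placeholders for your own files):
# print(missing_contigs("genotypes_repmasked_filtered.vcf",
#                       "hifiasm_scaff10x_arks.fa.masked"))
```

An empty list means every contig named in the VCF exists in the FASTA; any name that is printed is worth comparing by eye against the FASTA headers (extra descriptions, prefixes, or special characters are common culprits).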

[E::parse_cigar] CIGAR length too long at position 1 (274808464H)

Me again.

Managed to get the software to run but it only ran for four minutes before hitting this CIGAR error.

A quick google search suggests that the read lengths are too long to handle (samtools/samtools#1667).

However, I'm not dealing with reads; this is a job where I'm analyzing whole assemblies. I'm sure this is a mistake on my part somewhere, given that the documentation specifically says GraffiTE can be run using whole assemblies.

My command line:

nextflow run https://github.com/cgroza/GraffiTE \
   --assemblies cTho_assemblies.csv \
   --TE_library mammals.plus.covid_bats2.14072022.fa \
   --reference ../assemblies/cTho_A.fa \
   --graph_method pangenie \
   --genotype false \
   --cores 12 \
   --mammal \
   --svim_asm_threads 12 \
   --asm_divergence asm5 \
   --svim_asm_time 2h

The error:


[-        ] process > svim_asm       -
[-        ] process > survivor_merge -
[-        ] process > repeatmask_VCF -
[-        ] process > tsd_prep       -
[-        ] process > tsd_search     -
[-        ] process > tsd_report     -

executor >  local (1)
[11/563cf2] process > svim_asm (1)   [  0%] 0 of 2
[-        ] process > survivor_merge -
[-        ] process > repeatmask_VCF -
[-        ] process > tsd_prep       -
[-        ] process > tsd_search     -
[-        ] process > tsd_report     -

executor >  local (2)
[12/90fee5] process > svim_asm (2)   [  0%] 0 of 2
[-        ] process > survivor_merge -
[-        ] process > repeatmask_VCF -
[-        ] process > tsd_prep       -
[-        ] process > tsd_search     -
[-        ] process > tsd_report     -
ERROR ~ Error executing process > 'svim_asm (1)'

Caused by:
  Process `svim_asm (1)` terminated with an error exit status (1)

Command executed:

  mkdir asm
  minimap2 -a -x asm5 --cs -r2k -t 12 -K 500M cTho_A.fa cTho_B.fa | samtools sort -m4G -@4 -o asm/asm.sorted.bam -
  samtools index asm/asm.sorted.bam
  svim-asm haploid --min_sv_size 100 --types INS,DEL --sample cTho_B asm/ asm/asm.sorted.bam cTho_A.fa
  sed 's/svim_asm\./cTho_B\.svim_asm\./g' asm/variants.vcf > cTho_B.vcf

Command exit status:
  1

Command output:
  (empty)

Command error:
  [M::mm_idx_gen::31.011*1.36] collected minimizers
  [M::mm_idx_gen::37.874*1.71] sorted minimizers
  [M::main::37.874*1.71] loaded/built the index for 812 target sequence(s)
  [M::mm_mapopt_update::40.498*1.66] mid_occ = 168
  [M::mm_idx_stat] kmer size: 19; skip: 19; is_hpc: 0; #seq: 812
  [M::mm_idx_stat::42.456*1.63] distinct minimizers: 161244371 (94.15% are singletons); average occurrences: 1.421; average spacing: 9.923; total length: 2273669687
  [E::parse_cigar] CIGAR length too long at position 1 (274808464H)
  [E::parse_cigar] CIGAR length too long at position 877 (272289946H)
  [E::parse_cigar] CIGAR length too long at position 4012 (275636627H)
  samtools sort: truncated file. Aborting

Any insight would be appreciated.

David
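
For context on the error message: the BAM format packs each CIGAR operation into a 32-bit integer, with 4 bits for the operation and 28 bits for its length, so a single operation cannot be longer than 2^28 - 1 = 268,435,455 bases. The three hard-clip lengths in the log above all exceed that, which can happen when very long contigs are aligned, not only long reads. The short Python sketch below just redoes that arithmetic; the helper name `oversized_ops` is made up for illustration.

```python
import re

# BAM stores each CIGAR operation as (length << 4 | op) in 32 bits,
# which leaves 28 bits for the length.
BAM_MAX_CIGAR_OP_LEN = (1 << 28) - 1  # 268,435,455

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def oversized_ops(cigar):
    """Return the (length, op) pairs in a CIGAR string that do not fit in BAM."""
    return [
        (int(length), op)
        for length, op in CIGAR_OP.findall(cigar)
        if int(length) > BAM_MAX_CIGAR_OP_LEN
    ]


# The three values come from the log above; the last CIGAR is an arbitrary short one.
for cigar in ("274808464H", "272289946H", "275636627H", "1500M20I300M"):
    print(cigar, oversized_ops(cigar))
```

Each of the three hard clips from the log is reported as oversized, while the short example CIGAR is not.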

genotyping with assemblies only

Note: This was also sent to Clement via e-mail. Then I realized I should ask through here instead. Sorry for the duplication.

I saw your GraffiTE package a few days ago and just finished installing for a test run.

The test data ran successfully, so now it's time for a test using our data.

I recently came into possession of two haplotypes for a single individual and thought this might be a useful scenario, trying to identify polymorphisms in the two haplotypes of the diploid genome.

According to the documentation on github, all of these are required:

nextflow run cgroza/GraffiTE \
   --assemblies assemblies.csv \
   --TE_library library.fa \
   --reference reference.fa \
   --graph_method pangenie \
   --reads reads.csv

No problem with nearly all of these. But, the documentation also says that you can perform the genotyping using only assemblies, as is the case I want to try.

From the paper: "pMEs can be detected from genome assemblies or any type of long-read data, and genotyping can be performed using short- and long-read sets. This flexibility allows researchers to get the most out of their data; for example, by performing the initial SV search with high-quality – though perhaps less abundant – data, such as chromosome-level assemblies and long-read sequences, while genotyping in larger cohorts or populations using cost-effective short-read sets."

I haven't tried the run with the two assemblies yet but, given the wording on github, I'm going to get an error if I don't include the --reads option.

Is this something I'm going to need to worry about? How do I get around this, if possible?

Just noticed another potential problem:

--graph_method: can be pangenie, giraffe or graphaligner, select which graph method will be used to genotyped TEs. Default is pangenie and it is optimized for short-reads. giraffe can handle both short and long reads, and graphaligner is optimized for long reads.

None of these mention using only assemblies? Assuming what I'm asking is possible, which, if any of these, should I choose? graphaligner?

David

Typo L481 of main.nf

Hi,

While using GraffiTE with Giraffe on a cluster, I ran into a bug that seems to stem from a typo at L481 of the main.nf file:

vg giraffe -t ${graph__align_threads} -Z index/index.giraffe.gbz -m index/index.min -d index/index.dist -i -f ${sample_reads} > ${sample_name}.gam

I believe graph__align_threads should be graph_align_threads instead.
Thank you for the very nice work :)
All the best,
Yann

Question about setting own tmp dir

Hi~

Recently, I have been struggling with having to set my own tmp directory while running GraffiTE, because of limited access permissions on our group server. I used ‘export NXF_TEMP=’ in my slurm script to set the tmp dir. However, squeue showed the job in a normal running state while the output dir contained nothing. I also tried the approach you mention in the ‘important note' of revising nextflow.config, but then the slurm job failed with an error the moment I submitted it with sbatch. Any idea what could be causing this?

Thank you so much!

How to interpret the results of the TIPs genotyping

Hi,

I was lucky to have the TIPs genotyping results after running the GraffiTE code. My results are as follows:

Some questions are as follows:

What does the dot in the ID column in the figure mean? Additionally, I checked the ID and found that it was not in the pangenome.vcf file generated by the previous step.
In the INFO column, the records with a dot as ID do not have any TE annotations. How can we solve this?
I would like to perform gene annotation for the TIPs genotyping file (VCF) based on the reference genome GTF file. Do you have any recommendations on how to do that?

Looking forward to hearing from you!
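
On the last question, dedicated tools for intersecting variants with gene annotations exist (for example `bedtools intersect`). As a minimal illustration of the idea only, and not a GraffiTE feature, the Python sketch below reports which GTF `gene` features contain the POS of each VCF record. It assumes the VCF and the GTF use the same chromosome names, reads the gene identifier from the `gene_id` attribute, and ignores the variant's length. All function names are hypothetical.

```python
from collections import defaultdict


def load_genes(gtf_lines):
    """Collect (start, end, gene_id) per chromosome from GTF 'gene' features."""
    genes = defaultdict(list)
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "gene":
            continue
        gene_id = "."
        for attribute in fields[8].split(";"):
            key, _, value = attribute.strip().partition(" ")
            if key == "gene_id":
                gene_id = value.strip('"')
        genes[fields[0]].append((int(fields[3]), int(fields[4]), gene_id))
    return genes


def genes_at(genes, chrom, pos):
    """IDs of genes whose [start, end] interval (1-based, inclusive) contains pos."""
    return [gid for start, end, gid in genes.get(chrom, []) if start <= pos <= end]


def annotate(vcf_lines, genes):
    """Yield (chrom, pos, variant_id, overlapping_gene_ids) for each VCF record."""
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue
        chrom, pos, variant_id = line.split("\t")[:3]
        yield chrom, int(pos), variant_id, genes_at(genes, chrom, int(pos))


# Example call (file names are placeholders):
# with open("annotation.gtf") as gtf, open("genotypes.vcf") as vcf:
#     for record in annotate(vcf, load_genes(gtf)):
#         print(*record, sep="\t")
```

The linear scan per variant is fine for a quick look; for large annotations an interval index or an existing tool would be the more practical choice.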