parklab / xtea Goto Github PK

Comprehensive TE insertion identification with WGS/WES data from multiple sequencing technics

License: Other

Shell 0.01% Dockerfile 0.13% Common Workflow Language 0.09% Python 99.77%

transposable-elements linked-reads illumina nanopore pacbio mobile-element-insertion

xtea's Introduction

xTea

xTea (comprehensive transposable element analyzer) is designed to identify TE insertions from paired-end Illumina reads, barcode linked-reads, long reads (PacBio or Nanopore), or hybrid data from different sequencing platforms and takes whole-exome sequencing (WES) or whole-genome sequencing (WGS) data as input.

Download

short reads (Illumina and Linked-Reads)

1.1 latest version

git clone https://github.com/parklab/xTea.git

1.2 cloud binary version

git clone --single-branch --branch release_xTea_cloud_1.0.0-beta  https://github.com/parklab/xTea.git

long reads (PacBio or Nanopore)

git clone --single-branch --branch xTea_long_release_v0.1.0 https://github.com/parklab/xTea.git

pre-processed repeat library used by xTea (this library is used for both short and long reads)
```
wget https://github.com/parklab/xTea/raw/master/rep_lib_annotation.tar.gz
```
gene annotation files are downloaded from GENCODE. Decompressed gff3 files are required.
- For GRCh38 (or hg38), gff3 files are downloaded and decompressed from https://www.gencodegenes.org/human/release_33.html ;
- For GRCh37 (or hg19), gff3 files are downloaded and decompressed from https://www.gencodegenes.org/human/release_33lift37.html ;
- For CHM13v2, gff3 files are downloaded from https://s3-us-west-2.amazonaws.com/human-pangenomics/T2T/CHM13/assemblies/annotation/chm13.draft_v2.0.gene_annotation.gff3;
- Or use the latest version

Dependencies

bwa (version 0.7.17 or later, which requires the -o option), which can be downloaded from https://github.com/lh3/bwa.
samtools (version 1.0 or later), which can be downloaded from https://github.com/samtools.
minimap2 (for long reads only), which can be downloaded from https://github.com/lh3/minimap2.
wtdbg2 (for long reads only), which can be downloaded from https://github.com/ruanjue/wtdbg2.
Python 2.7+/3.6+
- For the following packages, only a conda-based installation in shown. You may also install these in other ways, such as pip.
- pysam (https://github.com/pysam-developers/pysam, version 0.12 or later) is required.
  - Install pysam:
```
 conda config --add channels r
 conda config --add channels bioconda
 conda install pysam -y
```
- sortedcontainers
  - Install sortedcontainers conda install sortedcontainers -y
- numpy, scikit-learn, and pandas
  - Install numpy, scikit-learn and pandas conda install numpy scikit-learn=0.18.1 pandas -y
- DF21 (this is used to replease scikit-learn, which is complained by several users for version incompatible)
  - Install DF21 pip install deep-forest
Note: bwa and samtools need to be added to the $PATH.

Install

Use Conda

xtea is a bioconda package, to install first make sure the bioconda channel has been added:
```
 conda config --add channels defaults
 conda config --add channels bioconda
 conda config --add channels conda-forge
```
Then, install xtea (while creating a new enviroment):
```
 conda create -n your_env xtea=0.1.6
```
Or install directly via: conda install -y xtea=0.1.6
Install-free

If the dependencies have already been installed, then install-free mode is recommended. One can directly run the downloaded python scripts.

Run xTea

Input
- A sample id file list, e.g. a file named sample_id.txt with content as follows (each line represents one unique sample id):
```
 NA12878
 NA12877
```
- A file of listed alignments:
  - An Illumina bam/cram file (sorted and indexed) list, e.g. a file named illumina_bam_list.txt with content as follows (two columns separated by a space or tab: sample-id bam-path):
```
 NA12878 /path/na12878_illumina_1_sorted.bam
 NA12877 /path/na12877_illumina_1_sorted.bam
```
  - A 10X bam/cram file (sorted and indexed, see BarcodeMate regarding barcode-based indicies) list, e.g. a file named 10X_bam_list.txt with content as follows (three columns separated by a space or tab: sample-id bam-path barcode-index-bam-path):
```
NA12878 /path/na12878_10X_1_sorted.bam /path/na12878_10X_1_barcode_indexed.bam
NA12877 /path/na12877_10X_1_sorted.bam /path/na12877_10X_1_barcode_indexed.bam
```
  - A case-ctrl bam/cram file list (three columns separated by a space or tab: sample-id case-bam-path ctrl-bam-path
```
DO0001 /path/DO001_case_sorted.bam /path/DO001_ctrl_sorted.bam
DO0002 /path/DO002_case_sorted.bam /path/DO002_ctrl_sorted.bam
```

Run the pipeline from local cluster or machine

2.1 Generate the running script (if it is install-free, then use the full path of the downloaded bin/xtea instead.)

Run on a cluster or a single node (by default xtea assumes the reference genome is GRCh38 or hg38. For hg19 or GRCh37, please use xtea_hg19; for CHM13, please use gnrt_pipeline_local_chm13.py)

Here, the slurm system is used as an example. If using LSF, replace --slurm with --lsf. For those using clusters other than slurm or LSF, users must adjust the generated shell script header accordingly. Users also must adjust the number of cores (-n) and memory (-m) accordingly. In general, each core will require 2-3G memory to run. For very high depth bam files, runtime (denoted by -t) may take longer.
Note that --xtea is a required option that points to the exact folder containing python scripts.

Using only Illumina data

 xtea -i sample_id.txt -b illumina_bam_list.txt -x null -p ./path_work_folder/ -o submit_jobs.sh -l /home/rep_lib_annotation/ -r /home/reference/genome.fa -g /home/gene_annotation_file.gff3 --xtea /home/ec2-user/xTea/xtea/ -f 5907 -y 7  --slurm -t 0-12:00 -q short -n 8 -m 25

Using only 10X data

 xtea -i sample_id.txt -b null -x 10X_bam_list.txt -p ./path_work_folder/ -o submit_jobs.sh -l /home/ec2-user/rep_lib_annotation/ -r /home/ec2-user/reference/genome.fa -g /home/gene_annotation_file.gff3 --xtea /home/ec2-user/xTea/xtea/ -y 7 -f 5907 --slurm -t 0-12:00 -q short -n 8 -m 25

Using hybrid data of 10X and Illumina

 xtea -i sample_id.txt -b illumina_bam_list.txt -x 10X_bam_list.txt -p ./path_work_folder/ -o submit_jobs.sh -l /home/ec2-user/rep_lib_annotation/ -r /home/ec2-user/reference/genome.fa -g /home/gene_annotation_file.gff3 --xtea /home/ec2-user/xTea/xtea/ -y 7 -f 5907 --slurm -t 0-12:00 -q short -n 8 -m 25

Using case-ctrl mode

 xtea --case_ctrl --tumor -i sample_id.txt -b case_ctrl_bam_list.txt -p ./path_work_folder/ -o submit_jobs.sh -l /home/ec2-user/rep_lib_annotation/ -r /home/ec2-user/reference/genome.fa -g /home/gene_annotation_file.gff3 --xtea /home/ec2-user/xTea/xtea/ -y 7 -f 5907 --slurm -t 0-12:00 -q short -n 8 -m 25

Working with long reads (non case-ctrl; more detailed steps please check the "xTea_long_release_v0.1.0" branch)

 xtea_long -i sample_id.txt -b long_read_bam_list.txt -p ./path_work_folder/ -o submit_jobs.sh --rmsk ./rep_lib_annotation/LINE/hg38/hg38_L1_larger_500_with_all_L1HS.out -r /home/ec2-user/reference/genome.fa --cns ./rep_lib_annotation/consensus/LINE1.fa --rep /home/ec2-user/rep_lib_annotation/ --xtea /home/ec2-user/xTea_long/xtea_long/ -f 31 -y 15 -n 8 -m 32 --slurm -q long -t 2-0:0:0

Parameters:

 Required:
 	-i: samples id list (one sample id per line);
 	-b: Illumina bam/cram file list (sorted and indexed — one file per line);
 	-x: 10X bam file list (sorted and indexed — one file per line);
 	-p: working directory, where the results and temporary files will be saved;
 	-l: repeat library directory (directory which contains decompressed files from "rep_lib_annotation.tar.gz");
 	-r: reference genome fasta/fa file;
 	-y: type of repeats to process (1-L1, 2-Alu, 4-SVA, 8-HERV; sum the number corresponding to the repeat type to process multiple repeats. 
 	    For example, to run L1 and SVA only, use `-y 5`. 
 	    Each repeat type will be processed separately, however some of the early processing steps are common to multiple repeat types.
 	    Thus, when analyzing a large cohort, to improve the efficiency (and save money on the cloud), 
 	    it is highly recommended to run the tool on one repeat type first, and subsequently on the rest. 
 	    For example, first use '-y 1', and for then use '-y 6' in a second run);
 	-f: steps to run. (5907 means run all the steps);
 	--xtea: this is the full path of the xTea/xtea folder (or the xTea_long_release_v0.1.0 folder for long reads module), 
 	        where the python scripts reside in;
 	-g: gene annotation file in gff3 format;
 	-o: generated running scripts under the working folder;
 Optional:
 	-n: number of cores (default: 8, should be an integer);
 	-m: maximum memory in GB (default: 25, should be an integer);
 	-q: cluster partition name;
 	-t: job runtime;
 	--flklen: flanking region length;
 	--lsf: add this option if using an LSF cluster (by default, use of the slurm scheduler is assumed);
 	--tumor: indicates the tumor sample in a case-ctrl pair;
 	--purity: tumor purity (by default 0.45);
 	--blacklist: blacklist file in bed format. Listed regions will be filtered out;
 	--slurm: runs using the slurm scheduler. Generates a script header fit for this scheduler;
 
 The following cutoffs will be automatically set based on read depth (and also purity in the case of a tumor sample); 
 These parameters have been thoroughly tuned based on the use of benchmark data and also on a large cohort analysis. 
 For advanced users (optional major cutoffs):
 	--user: by default, this is turned off. If this option is set, then a user-specific cutoff will be used;
 	--nclip: minimum number of clipped reads;
 	--cr: minimum number of clipped reads whose mates map to repetitive regions;
 	--nd: minimum number of discordant pairs;

 Specific parameters for long reads module:
     --rmsk: this is a reference full-length L1 annotation file from RepeatMasker only for the "ghost" L1 detection module. 
             One file named "hg38_L1_larger2K_with_all_L1HS.out" within the downloaded library could be directly used;
     --cns: this is the L1 concensus sequence needed only by the "ghost" L1 detection module. 
            One file named "LINE1.fa" within the downloaded library could be directly used;
     --rep: repeat library folder (folder contain files decompressed from the downloaded "rep_lib_annotation.tar.gz");
     --clean: clean the intermediate files;

2.2 The previous step will generate a shell script called run_xTea_pipeline.sh under WFOLDER/sample_id/L1(or other types of repeats), where WFOLDER is specified by -p option.

To run on the script: sh run_xTea_pipeline.sh or users can submit the jobs (where each line corresponds to one job) to a cluster.

Run from the Cloud
- A docker file and a cwl file are provided for running the tool on AWS/GCP/FireCloud.
Output

A gVCF file will be generated for each sample.
- For germline TE insertion calling on short reads, the orphan transduction module usually has a higher false positive rate. Users can filter out false positive events with a command such as grep -v "orphan" out.vcf > new_out.vcf to retrieve higher confidence events.

Citation and accompany scripts If you are using xTea for your project, please cite:

Chu, C., Borges-Monroy, R., Viswanadham, V.V. et al. Comprehensive identification of transposable element insertions using multiple sequencing technologies. Nat Commun 12, 3836 (2021). https://doi.org/10.1038/s41467-021-24041-8

The accompany scripts for re-produce the results in the paper could be found here: https://github.com/parklab/xTea_paper

Update log
- 06/11/23 Add gnrt_pipeline_local_chm13.py for CHM13_v2.0 reference genome .
- 06/09/22 Update the Dockerfile and cwl for germline module (hg38).
- 04/20/22 A fatal error was noticed at the genotyping step. The machine learing model was trained with features extracted with a old version of xTea, and this will introduce bias to predict the features extracted with the latest version of xTea. A new model is uploaded for non-conda version.
- 04/20/22 The scikit-learn version issue is complained by several users. To solve this issue, the new genotype classification model is trained with DF21 (https://github.com/LAMDA-NJU/Deep-Forest). Users need to install with command pip install deep-forest. For now, this is only for the non-conda version. I'll update the conda version soon.

xtea's People

Contributors

Stargazers

Watchers

Forkers

alongalor pennybebraver ealeelab altingia abhijeetrpatil andurill kijong-yi zm-git-dev zengxi-hada pavanreddy0 crankycrank mikecuoco xtmgah drtconway wangdi2016 hoonbiolab yuliamostovoy wigasper corinsexton

xtea's Issues

MELT in your article

Hi,
Could you send me vcf file of MELT from HG002. I have run melt in HG002 but I got far fewer insertion than what your paper show. I run melt-single with default parameter. That will be nice if you could tell me how you run melt.
Thanks!
xuxf

rep_lib_annotation.tar.gz not a gzip file

Hi,

I've run git clone to retrieve the files from this repository and tried to decompress rep_lib_annotation.tar.gz (I believe it says the path to the decompressed version is required). I tried to unpack it with tar -zvxf and get the following error:

gzip: stdin: not in gzip format
tar: Child returned status 1
tar: Error is not recoverable: exiting now

I also notice that cloning the file (on various computers) results in a different much smaller file size than the one shown on github (134 bytes vs 300 MB).

Would it be possible to share this file through other means? Thanks.

TypeError: 'str' object does not support item assignment

Hi,

I tried to run xTea on L1, Alu, SVA and HERV respectively. But I got two types of errors.

For HERV , I got the error below. And the final vcf was not outputted.
Traceback (most recent call last):
File "/path/xTea/xTea/xtea/x_TEA_main.py", line 1117, in
xmutation=XMutation(s_working_folder)
File "/path/xTea/xTea/xtea/x_mutation.py", line 25, in init
self.working_folder[-1] += "/"
TypeError: 'str' object does not support item assignment_

The script is shown below.
PREFIX=/workdir/Sample1/HERV/
ANNOTATION=/path/xTea/rep_lib_annotation/HERV/hg38/hg38_HERV.out
REF=/path/GCA_000001405.15_GRCh38_no_alt_plus_hs38d1_analysis_set.fna
ANNOTATION1=/path/xTea/rep_lib_annotation/HERV/hg38/hg38_HERV.out
GENE=/path/GENCODE/GRCh38_v38/gencode.v38.annotation.gff3
BLACK_LIST=null
L1_COPY_WITH_FLANK=/path/xTea/rep_lib_annotation/HERV/hg38/hg38_HERV_copies_with_flank.fa
SF_FLANK=null
L1_CNS=/path/xTea/rep_lib_annotation/consensus/HERV.fa
XTEA_PATH=/path/xTea/xTea/xtea/
BAM_LIST=${PREFIX}"bam_list.txt"
BAM1=${PREFIX}"10X_phased_possorted_bam.bam"
BARCODE_BAM=${PREFIX}"10X_barcode_indexed.sorted.bam"
TMP=${PREFIX}"tmp/"
TMP_CLIP=${PREFIX}"tmp/clip/"
TMP_CNS=${PREFIX}"tmp/cns/"
TMP_TNSD=${PREFIX}"tmp/transduction/"
/path/Python-3.6.9/bin/python3.6 ${XTEA_PATH}"x_TEA_main.py" -C --dna -i ${BAM_LIST} --lc 3 --rc 3 --cr 1 -r ${L1_COPY_WITH_FLANK} -a ${ANNOTATION} --cns ${L1_CNS} --ref ${REF} -p ${TMP} -o ${PREFIX}"candidate_list_from_clip.txt" -n 1 --cp /ldfssz1/ST_MCHRI/CLINIC/USER/lijianbiao/tmp/software_test_20211020/xteagermline/HBN1/pub_clip/
/path/Python-3.6.9/bin/python3.6 ${XTEA_PATH}"x_TEA_main.py" -D --dna -i ${PREFIX}"candidate_list_from_clip.txt" --nd 5 --ref ${REF} -a ${ANNOTATION} -b ${BAM_LIST} -p ${TMP} -o ${PREFIX}"candidate_list_from_disc.txt" -n 1
/path/Python-3.6.9/bin/python3.6 ${XTEA_PATH}"x_TEA_main.py" -N --dna --cr 3 --nd 5 -b ${BAM_LIST} -p ${TMP_CNS} --fflank ${SF_FLANK} --flklen 3000 -n 1 -i ${PREFIX}"candidate_list_from_disc.txt" -r ${L1_CNS} --ref ${REF} -a ${ANNOTATION} -o ${PREFIX}"candidate_disc_filtered_cns.txt"
/path/Python-3.6.9/bin/python3.6 ${XTEA_PATH}"x_TEA_main.py" -I --dna -p ${TMP} -n 1 -i ${PREFIX}"candidate_disc_filtered_cns.txt" -r ${L1_CNS} --teilen 50 -o ${PREFIX}"internal_snp.vcf.gz"
/path/Python-3.6.9/bin/python3.6 ${XTEA_PATH}"x_TEA_main.py" --gene -a ${GENE} -i ${PREFIX}"candidate_disc_filtered_cns.txt" -n 1 -o ${PREFIX}"candidate_disc_filtered_cns_with_gene.txt"

The second error appeared in analysis of all four repeat types.
"Error happen at merge clip and disc feature step: chrEBV not exist"
"Error happen at merge clip and disc feature step: chrY not exist"
However, the final vcfs of L1, Alu and SVA were successfully generated. Are these vcfs usable?

Thank you,
Jianbiao Li

VCF output for xtea_long?

Hello and thank you for sharing this tool.
I ran xtea_long, per the instructions on the xtea_long branch, on Pac Bio data.
The final output was only txt files.
Is it possible to generate the VCF files mentioned in the article that can aid in determining the zygosity of the insertions?
Or any way to extract from the xtea_long output how many reads in the insertion location support the insertion and how many reads do not support it?
Thanks.

xtea temp file undersatand

hello,how can I got the number of tranposable element clipped reads and dicount pais reads under the cutoff. Is there some intermediate file you can tell me? Because there's nothing in the temp file about what each column represents ?thanks

failed to run demo

Hi,

Thank you for xTea.

I am interested in using the xTea for identifying TEs in my data. I installed the tool and I was testing with the script in demo folder.

I changed the paths of the corresponding files in run_gnrt_pipeline.sh and ran it using 'sh run_gnrt_pipeline.sh'. Rest is default parameters.

But it gave me the following error. Can you please help me with that.

Please let me know if you need any other info

Thank you
Jainy

sh run_gnrt_pipeline.sh
Traceback (most recent call last):
File "/home/jainy/software/xTea/xtea/gnrt_pipeline_local.py", line 1170, in
smemory, ncores, sf_sbatch_sh, "null", b_lsf, b_slurm, b_resume)
File "/home/jainy/software/xTea/xtea/gnrt_pipeline_local.py", line 756, in gnrt_running_shell
split_bam_list(m_id, sf_bams, sf_10X_bams, l_rep_type, s_wfolder, sf_case_control_bam_list)
File "/home/jainy/software/xTea/xtea/gnrt_pipeline_local.py", line 840, in split_bam_list
sid = fields[0]
IndexError: list index out of range

SVLEN=0 for some TEs

Hello.

I have run xTea on lots of samples (Illumina paired-end reads) successfully. Thank you for developing the program.

I have a question about the length of TEs. For some of them, length is 0. May I ask what it means?
Do I need to filter out TEs like these from the output?

And a short question: any filtering options after running the program? (besides grep -v "orphan" output.vcf > output_new.vcf)

Thank you in advance,

Best regards,

error: cannot open Packages database in /var/lib/rpm

Hi,

I tried using xTea, but I got below error message.

**python2 /home/users/sijaewoo/bin/xTea/gnrt_pipeline_local_v38.py -i /home/users/sijaewoo/sample_id.txt -b /home/users/sijaewoo/bamfile_list.txt -x null -p /home/users/sijaewoo/test_xTea/ -o submit_jobs.sh -l /home/users/sijaewoo/bin/xTea/rep_lib_annotation -r /home/users/sijaewoo/common_files/hg38/hg38.fa -xtea /home/users/sijaewoo/bin/xTea/ -y 15 -m 32 -n 8

error: cannot open Packages database in /var/lib/rpm
Traceback (most recent call last):
File "/home/users/sijaewoo/bin/xTea/gnrt_pipeline_local_v38.py", line 916, in
sf_ref, sf_gene, sf_black_list, sf_folder_xtea, spartition, stime, smemory, ncores, sf_sbatch_sh)
File "/home/users/sijaewoo/bin/xTea/gnrt_pipeline_local_v38.py", line 605, in gnrt_running_shell
gnrt_lib_config(l_rep_type, sf_folder_rep, sf_ref, sf_gene, sf_black_list, sf_folder_xtea, sf_sample_folder)
File "/home/users/sijaewoo/bin/xTea/gnrt_pipeline_local_v38.py", line 481, in gnrt_lib_config
if sf_folder_xtea[-1] != "/":
TypeError: 'NoneType' object has no attribute 'getitem'**

If you know how to solve this problem, please tell me know.

Thanks,

Sijae

Running on local machine (without scheduler)

Hi,

Is there a straightforward way to run your software on a local server where no scheduler is available?

By using your command for illumina data, I get one one script similar to this:
sbatch < /script/for/alu.sh
sbatch < /script/for/line.sh
etc.

I tried manually replacing "sbatch <" with "bash", and it runs for a while, but now it's stuck (hasn't proceeded in a long time and it is using 0 CPU). This is the last output from the software:

[...]
[re-select step]:Filtered out: chrY:16507137 fall in repetitive region.
[re-select step]:Filtered out: chrY:19413366 fall in repetitive region.
[re-select step]:Filtered out: chrY:19678053 fall in repetitive region.
[re-select step]:Filtered out: chrY:20877263 fall in repetitive region.
[re-select step]:Filtered out: chrY:21318615 fall in repetitive region.
[re-select step]:Filtered out: chrY:26345156 fall in repetitive region.
[re-select step]:Filtered out: chrY:26345891 fall in repetitive region.
Running command with output: cat xxx/tools/xTea/LINE/hg38/hg38_FL_L1_flanks_3k.fa xxx/workdirs/xTea/HG00512/HG00512/L1/tmp/transduction/all_with_polymerphic_flanks.fa.polymerphic_only.fa

Running command: bwa index xxx/workdirs/xTea/HG00512/HG00512/L1/tmp/transduction/all_with_polymerphic_flanks.fa

[bwa_index] Pack FASTA... 0.02 sec
[bwa_index] Construct BWT for the packed sequence...
[bwa_index] 0.38 seconds elapse.
[bwa_index] Update BWT... 0.01 sec
[bwa_index] Pack forward-only FASTA... 0.01 sec
[bwa_index] Construct SA from BWT and Occ... 0.21 sec
[main] Version: 0.7.17-r1188
[main] CMD: bwa index xxx/workdirs/xTea/HG00512/HG00512/L1/tmp/transduction/all_with_polymerphic_flanks.fa
[main] Real time: 0.644 sec; CPU: 0.641 sec

How to proceed?
Thanks!

xtea_long question

hello! could you tell me whether xtea_long could call reference TE or divide TE into subfamily(such as SVA_A;SVA_B)?

User request CHM13 libs

Question from #20

Hi, I tried using >5900bp as the cutoff for the full length L1. I run hg38 first to see whether I can reproduce the result in the provided hg38 rep_lib_annotation data. It turned out that the result I got was much larger than the annotation file provided. For example, the hg38_FL_L1_flanks.fa file I got is 53MB (using -e 100), while the size of hg38_FL_L1_flanks_3k.fa in the provided rep_lib_annotation file is 2MB. I attached my code here, any idea where is incorrect? The hg38 reference genome and repeatmasker output file are all from UCSC.

#########
grep "LINE1" hg38.fa.out > hg38.fa_L1.out
cat hg38.fa_L1.out | while read line
do
eval{line}|awk '{printf("var_9=%s;var_12=%s;var_13=%s;var_14=%s;",$9,$12,$13,$14)}')
if [ $var_9 == "C" ];then
i_length=$(($var_13 - $var_14))
else
i_length=$(($var_13 - $var_12))
fi
if [ $i_length -gt 5900 ];then
echo "$line"
fi
done >hg38.fa_L1_full_length.out ### this is to select out the LINE1 >5900bp

python x_TEA_main.py -P -K -p ./ -r hg38.fa -a hg38.fa_L1_full_length.out -o hg38.fa_L1_full_length_with_flank_e100.fa -e 100
#########

And is it reasonable to set cutoff for full-length Alu, SVA, HERV as 250bp, 1900bp, 8900bp?

It would be super helpful if you could kindly add chm13 into the rep_lib_annotation data. Thank you!

Case-control mode

Hi,
I am trying to run xTea in case-control mode but I get the following error when running the first step:

"Usage: xtea [options]

xtea: error: no such option: --case_ctrl"

Any help would be greatly appreciated.
Thanks!

Difference between classified_results.txt.merged_LINE1.txt and classified_results.txt.LINE1.txt

Hi,

Thanks for developing this tool!
I am running the xTea long v0.1.0.
Under the autogenerated directory which contains all the output, there are two classified_results.txt. for each TE/retroelement, and what is the difference between the two output.

ValueError: file does not contain alignment data

Hello.

Thank you for developing xTea.

I made shell scripts using the following command:
xTea/bin/xtea -i B-ID-list-for-xtea.txt -b A-CRAM-list-for-xtea.txt -x null -p xtea_results/ -o C-shell-script-for-running-xtea.sh -l /path/to/xTea/rep_lib_annotation/ -r /path/to/Homo_sapiens_assembly38.fasta -g /path/to/xTea/gencode.v33.annotation.gff3 --xtea /path/to/data/xTea/xTea/xtea/ -f 5907 -y 15 --slurm -q nc -n 8 -m 8

My input file is CRAM.
I have an error that is happening every time I run the program.
This is the log:

Working on "clip" step!
Ave coverage is 31.544: automatic parameters (clip, disc, clip-disc) with value (3, 4 ,1)

Clip cutoff: 3, 3, 1 are used!!!
Input bam /home/data/results/GM000036/GM000036.cram is sequenced from illumina platform!
Collected clipped reads file xtea_results/GM000036/pub_clip/0/GM000036.cram.clipped.fq doesn't exist. Generate it now!
Initial minimum clip cutoff is 2
('Output info: Collect clipped parts for file ', '/home/data/results/GM000036/GM000036.cram')

And this is the error message when it crashes:

[M::bwa_idx_load_from_disk] read 0 ALT contigs
[E::main_mem] fail to open file `xtea_results/GM000036/pub_clip/0/GM000036.cram.clipped.fq'.
Traceback (most recent call last):
File "/home/ncbn-pg/users/saeidehashouri-pg/data/xTea/xTea/xtea/x_TEA_main.py", line 523, in
tem_locator.call_TEI_candidate_sites_from_multiple_alignmts(sf_annotation, sf_rep_cns, sf_rep, b_se,
File "/lustre8/home/ncbn-pg/users/saeidehashouri-pg/data/xTea/xTea/xtea/x_TEI_locator.py", line 107, in call_TEI_candidate_sites_from_multiple_alignmts
caller.call_TEI_candidate_sites_from_clip_reads_v2(sf_annotation, sf_rep_cns, sf_ref, b_se,
File "/lustre8/home/ncbn-pg/users/saeidehashouri-pg/data/xTea/xTea/xtea/x_TEI_locator.py", line 633, in call_TEI_candidate_sites_from_clip_reads_v2
bwa_align.two_stage_realign(sf_rep_cns, sf_ref, sf_all_clip_fq, sf_algnmt)
File "/lustre8/home/ncbn-pg/users/saeidehashouri-pg/data/xTea/xTea/xtea/bwa_align.py", line 263, in two_stage_realign
self.get_fully_mapped_algnmts(sf_sam_cns, sf_ref_cns, max_clip_len, sf_fully_sam, sf_unmap_fa, sf_polyA_fa)
File "/lustre8/home/ncbn-pg/users/saeidehashouri-pg/data/xTea/xTea/xtea/bwa_align.py", line 167, in get_fully_mapped_algnmts
bamfile = pysam.AlignmentFile(sf_sam, "r", reference_filename=sf_reference) #
File "pysam/libcalignmentfile.pyx", line 751, in pysam.libcalignmentfile.AlignmentFile.cinit
File "pysam/libcalignmentfile.pyx", line 956, in pysam.libcalignmentfile.AlignmentFile._open
ValueError: file does not contain alignment data

Any help?

Please let me know if more information is needed.

Best regards,

Error in SVA calling(ValueError: file does not contain alignment data)

Hi,
Thank you for making this useful pipeline! While I was running xTEA from short reads, somehow I couldn't produce VCF file for SVA although I successfully get output vcf for L1 and Alu. The first error generated was "ValueError: file does not contain alignment data" which wasn't in other TE running. Do you have any idea to solve this problem? This is the stdout:


k.nahyun@compute1-exec-136:~$ sh /storage1/fs1/jin810/Active/Nahyun/THESIS_PROJECT/TE/xTEA/JOBS/20220531_CP_trio/UDN963226P/SVA/run_xTEA_pipeline.sh
Working on "clip" step!
Ave coverage is 49.706: automatic parameters (clip, disc, clip-disc) with value (4, 7 ,1)

Clip cutoff: 4, 4, 1 are used!!!
Input bam /storage1/fs1/jin810/Active/Nahyun/THESIS_PROJECT/data/UDN963226-1682-8d92b331-2f92-472c-bb16-dd79a2ed7767.bam is sequenced from illumina platform!
Collected clipped reads file /storage1/fs1/jin810/Active/Nahyun/THESIS_PROJECT/TE/xTEA/JOBS/20220531_CP_trio/UDN963226P/pub_clip/0/UDN963226-1682-8d92b331-2f92-472c-bb16-dd79a2ed7767.bam.clipped.fq already exist!
('Output info: Re-align clipped parts for file ', '/storage1/fs1/jin810/Active/Nahyun/THESIS_PROJECT/data/UDN963226-1682-8d92b331-2f92-472c-bb16-dd79a2ed7767.bam')
Running command: bwa mem -t 15 -T 9 -k 9 -o /storage1/fs1/jin810/Active/Nahyun/THESIS_PROJECT/TE/xTEA/JOBS/20220531_CP_trio/UDN963226P/SVA/tmp/UDN963226-1682-8d92b331-2f92-472c-bb16-dd79a2ed7767.bam.clipped.sam_cns.sam /storage1/fs1/jin810/Active/Nahyun/THESIS_PROJECT/TE/xTEA/rep_lib_annotation/consensus/SVA.fa /storage1/fs1/jin810/Active/Nahyun/THESIS_PROJECT/TE/xTEA/JOBS/20220531_CP_trio/UDN963226P/pub_clip/0/UDN963226-1682-8d92b331-2f92-472c-bb16-dd79a2ed7767.bam.clipped.fq

[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 3601324 sequences (150000045 bp)...
[M::process] read 2969640 sequences (126020436 bp)...
Killed
Traceback (most recent call last):
  File "/storage1/fs1/jin810/Active/Nahyun/THESIS_PROJECT/TE/xTEA/xtea/x_TEA_main.py", line 422, in <module>
    wfolder_pub_clip, b_force, max_cov_cutoff, sf_out)
  File "/storage1/fs1/jin810/Active/Nahyun/THESIS_PROJECT/TE/xTEA/xtea/x_TEI_locator.py", line 109, in call_TEI_candidate_sites_from_multiple_alignmts
    sf_new_pub, i_idx_bam, b_force, max_cov, sf_out_tmp)
  File "/storage1/fs1/jin810/Active/Nahyun/THESIS_PROJECT/TE/xTEA/xtea/x_TEI_locator.py", line 633, in call_TEI_candidate_sites_from_clip_reads_v2
    bwa_align.two_stage_realign(sf_rep_cns, sf_ref, sf_all_clip_fq, sf_algnmt)
  File "/storage1/fs1/jin810/Active/Nahyun/THESIS_PROJECT/TE/xTEA/xtea/bwa_align.py", line 263, in two_stage_realign
    self.get_fully_mapped_algnmts(sf_sam_cns, sf_ref_cns, max_clip_len, sf_fully_sam, sf_unmap_fa, sf_polyA_fa)
  File "/storage1/fs1/jin810/Active/Nahyun/THESIS_PROJECT/TE/xTEA/xtea/bwa_align.py", line 167, in get_fully_mapped_algnmts
    bamfile = pysam.AlignmentFile(sf_sam, "r", reference_filename=sf_reference)  #
  File "pysam/libcalignmentfile.pyx", line 751, in pysam.libcalignmentfile.AlignmentFile.__cinit__
  File "pysam/libcalignmentfile.pyx", line 956, in pysam.libcalignmentfile.AlignmentFile._open
ValueError: file does not contain alignment data
Working on "disc" step!
Traceback (most recent call last):
  File "/storage1/fs1/jin810/Active/Nahyun/THESIS_PROJECT/TE/xTEA/xtea/x_TEA_main.py", line 439, in <module>
    m_original_sites = xfilter.load_in_candidate_list(sf_candidate_list)
  File "/storage1/fs1/jin810/Active/Nahyun/THESIS_PROJECT/TE/xTEA/xtea/x_intermediate_sites.py", line 149, in load_in_candidate_list
    with open(sf_candidate_list) as fin_candidate_sites:
FileNotFoundError: [Errno 2] No such file or directory: '/storage1/fs1/jin810/Active/Nahyun/THESIS_PROJECT/TE/xTEA/JOBS/20220531_CP_trio/UDN963226P/SVA/candidate_list_from_clip.txt'
Working on "clip-disc-filtering" step!
Current working folder is: /storage1/fs1/jin810/Active/Nahyun/THESIS_PROJECT/TE/xTEA/JOBS/20220531_CP_trio/UDN963226P/SVA/tmp/cns/

Ave coverage is 49.982: automatic parameters (clip, disc, clip-disc) with value (4, 7 ,1)

Mean insert size is: 456.79731168186396

Standard derivation is: 176.74872197608607

Read length is: 151.0

Maximum insert size is: 987

Average coverage is: 49.982

Filter (on cns) cutoff: 4 and 7 are used!!!
Traceback (most recent call last):
  File "/storage1/fs1/jin810/Active/Nahyun/THESIS_PROJECT/TE/xTEA/xtea/x_TEA_main.py", line 534, in <module>
    n_clip_cutoff, n_disc_cutoff, sf_output)
  File "/storage1/fs1/jin810/Active/Nahyun/THESIS_PROJECT/TE/xTEA/xtea/x_clip_disc_filter.py", line 1540, in call_MEIs_consensus
    sf_disc_fa, sf_raw_disc)
  File "/storage1/fs1/jin810/Active/Nahyun/THESIS_PROJECT/TE/xTEA/xtea/x_clip_disc_filter.py", line 3324, in collect_clipped_disc_reads
    sf_disc_fa_tmp)
  File "/storage1/fs1/jin810/Active/Nahyun/THESIS_PROJECT/TE/xTEA/xtea/x_clip_disc_filter.py", line 296, in collect_clipped_disc_reads_of_given_list
    with open(sf_candidate_list) as fin_list:
FileNotFoundError: [Errno 2] No such file or directory: '/storage1/fs1/jin810/Active/Nahyun/THESIS_PROJECT/TE/xTEA/JOBS/20220531_CP_trio/UDN963226P/SVA/candidate_list_from_disc.txt'
Current working folder is: /storage1/fs1/jin810/Active/Nahyun/THESIS_PROJECT/TE/xTEA/JOBS/20220531_CP_trio/UDN963226P/SVA/tmp/transduction/

Ave coverage is 49.870000000000005: automatic parameters (clip, disc, clip-disc) with value (4, 7 ,1)

Mean insert size is: 456.0272645946843

Standard derivation is: 176.628666089216

Traceback (most recent call last):
  File "/storage1/fs1/jin810/Active/Nahyun/THESIS_PROJECT/TE/xTEA/xtea/x_TEA_main.py", line 594, in <module>
    i_rep_type, b_tumor, sf_tmp_slct)
  File "/storage1/fs1/jin810/Active/Nahyun/THESIS_PROJECT/TE/xTEA/xtea/x_transduction.py", line 54, in re_slct_with_clip_raw_disc_sites
    m_te_candidates=xintmdt.load_in_candidate_list_str_version(sf_candidates) #TE candidates
  File "/storage1/fs1/jin810/Active/Nahyun/THESIS_PROJECT/TE/xTEA/xtea/x_intermediate_sites.py", line 184, in load_in_candidate_list_str_version
    with open(sf_candidate_list) as fin_candidate_sites:
FileNotFoundError: [Errno 2] No such file or directory: '/storage1/fs1/jin810/Active/Nahyun/THESIS_PROJECT/TE/xTEA/JOBS/20220531_CP_trio/UDN963226P/SVA/candidate_disc_filtered_cns.txt'
Current working folder is: /storage1/fs1/jin810/Active/Nahyun/THESIS_PROJECT/TE/xTEA/JOBS/20220531_CP_trio/UDN963226P/SVA/tmp/transduction/

Traceback (most recent call last):
  File "/storage1/fs1/jin810/Active/Nahyun/THESIS_PROJECT/TE/xTEA/xtea/x_TEA_main.py", line 897, in <module>
    sf_new_out, b_tumor)
  File "/storage1/fs1/jin810/Active/Nahyun/THESIS_PROJECT/TE/xTEA/xtea/x_post_filter.py", line 416, in run_post_filtering
    l_old_rcd=xtea_parser.load_in_xTEA_rslt(sf_xtea_rslt)
  File "/storage1/fs1/jin810/Active/Nahyun/THESIS_PROJECT/TE/xTEA/xtea/x_post_filter.py", line 495, in load_in_xTEA_rslt
    with open(sf_rslt) as fin_in:
FileNotFoundError: [Errno 2] No such file or directory: '/storage1/fs1/jin810/Active/Nahyun/THESIS_PROJECT/TE/xTEA/JOBS/20220531_CP_trio/UDN963226P/SVA/candidate_disc_filtered_cns2.txt'
Traceback (most recent call last):
  File "/storage1/fs1/jin810/Active/Nahyun/THESIS_PROJECT/TE/xTEA/xtea/x_TEA_main.py", line 897, in <module>
    sf_new_out, b_tumor)
  File "/storage1/fs1/jin810/Active/Nahyun/THESIS_PROJECT/TE/xTEA/xtea/x_post_filter.py", line 416, in run_post_filtering
    l_old_rcd=xtea_parser.load_in_xTEA_rslt(sf_xtea_rslt)
  File "/storage1/fs1/jin810/Active/Nahyun/THESIS_PROJECT/TE/xTEA/xtea/x_post_filter.py", line 495, in load_in_xTEA_rslt
    with open(sf_rslt) as fin_in:
FileNotFoundError: [Errno 2] No such file or directory: '/storage1/fs1/jin810/Active/Nahyun/THESIS_PROJECT/TE/xTEA/JOBS/20220531_CP_trio/UDN963226P/SVA/candidate_disc_filtered_cns2.txt.high_confident'
Traceback (most recent call last):
  File "/storage1/fs1/jin810/Active/Nahyun/THESIS_PROJECT/TE/xTEA/xtea/x_TEA_main.py", line 1033, in <module>
    gff.annotate_results(sf_input, sf_output)
  File "/storage1/fs1/jin810/Active/Nahyun/THESIS_PROJECT/TE/xTEA/xtea/x_gene_annotation.py", line 238, in annotate_results
    with open(sf_ori_rslt) as fin_rslt, open(sf_out, "w") as fout_rslt:
FileNotFoundError: [Errno 2] No such file or directory: '/storage1/fs1/jin810/Active/Nahyun/THESIS_PROJECT/TE/xTEA/JOBS/20220531_CP_trio/UDN963226P/SVA/candidate_disc_filtered_cns.txt.high_confident.post_filtering.txt'
/usr/local/lib/python3.6/site-packages/sklearn/base.py:315: UserWarning: Trying to unpickle estimator LabelEncoder from version 1.0.1 when using version 0.24.2. This might lead to breaking code or invalid results. Use at your own risk.
  UserWarning)
Traceback (most recent call last):
  File "/storage1/fs1/jin810/Active/Nahyun/THESIS_PROJECT/TE/xTEA/xtea/x_TEA_main.py", line 932, in <module>
    gc.predict_for_site(sf_model, sf_xTEA, sf_new)
  File "/storage1/fs1/jin810/Active/Nahyun/THESIS_PROJECT/TE/xTEA/xtea/x_genotype_classify.py", line 146, in predict_for_site
    with open(sf_xTEA) as fin_xTEA, open(sf_new, "w") as fout_new:
FileNotFoundError: [Errno 2] No such file or directory: '/storage1/fs1/jin810/Active/Nahyun/THESIS_PROJECT/TE/xTEA/JOBS/20220531_CP_trio/UDN963226P/SVA/candidate_disc_filtered_cns.txt.high_confident.post_filtering_with_gene.txt'
Running command: sort -k1,1V -k2,2n -o /storage1/fs1/jin810/Active/Nahyun/THESIS_PROJECT/TE/xTEA/JOBS/20220531_CP_trio/UDN963226P/SVA/candidate_disc_filtered_cns.txt.high_confident.post_filtering_with_gene_gntp.txt.sorted /storage1/fs1/jin810/Active/Nahyun/THESIS_PROJECT/TE/xTEA/JOBS/20220531_CP_trio/UDN963226P/SVA/candidate_disc_filtered_cns.txt.high_confident.post_filtering_with_gene_gntp.txt

sort: cannot read: /storage1/fs1/jin810/Active/Nahyun/THESIS_PROJECT/TE/xTEA/JOBS/20220531_CP_trio/UDN963226P/SVA/candidate_disc_filtered_cns.txt.high_confident.post_filtering_with_gene_gntp.txt: No such file or directory
Traceback (most recent call last):
  File "/storage1/fs1/jin810/Active/Nahyun/THESIS_PROJECT/TE/xTEA/xtea/x_TEA_main.py", line 964, in <module>
    gvcf.cvt_raw_rslt_to_gvcf(s_sample_id, sf_bam, sf_raw_rslt, i_rep_type, sf_ref, sf_vcf)
  File "/storage1/fs1/jin810/Active/Nahyun/THESIS_PROJECT/TE/xTEA/xtea/x_gvcf.py", line 199, in cvt_raw_rslt_to_gvcf
    with open(sf_raw_rslt_sorted) as fin_rslt:
FileNotFoundError: [Errno 2] No such file or directory: '/storage1/fs1/jin810/Active/Nahyun/THESIS_PROJECT/TE/xTEA/JOBS/20220531_CP_trio/UDN963226P/SVA/candidate_disc_filtered_cns.txt.high_confident.post_filtering_with_gene_gntp.txt.sorted'
Usage: x_TEA_main.py [options]

x_TEA_main.py: error: no such option: --bamsnap
Usage: x_TEA_main.py [options]

x_TEA_main.py: error: no such option: --bamsnap

how to merge gVCF

how to merge the output gVCF into a single VCF;

does gatk genotype is ok for this procedure.

output understand

hello,everyone,after run xTea,I got a lot of file,what is the meaning of each file?

Does xTea only make "non-reference" calls?

Hi, thanks for writing this interesting pipeline!

Just a quick question -- does xTea only produce non-reference calls, or does it call everything? Does xTea filter these out? I'm asking because I see the reference annotation file (which is a bwa-index fasta file), but I'm unclear on how this is used. What I'd like to do is see if xTea is properly calling some canonical sites that I know should be in my sequenced sample.

Billy

Missing python scripts from the current package

Hi,
The following py script is missing from the directory xTea/xtea
Traceback (most recent call last):
File "/work/greenbaum/users/suns3/xTea/xtea/x_TEA_main.py", line 128, in
from x_TE_associated_sv import *

After comment out this package in the main py, another py script was found to be missing "x_BamSnap". Could you please upload those functions if they are necessary?

Thank you so much.

Best,
Siyu

error in HERV

Hi,
xTea works well for L1/Alu/SVA, but I got some errors in HERV.
First error is like below.
Traceback (most recent call last): File "/home/sijaewoo/tools/xTea/xtea/x_TEA_main.py", line 1118, in <module> xmutation.call_mutations_from_reads_algnmt(sf_sites, sf_cns, n_len_cutoff, n_jobs, sf_merged_vcf) File "/home/sijaewoo/tools/xTea/xtea/x_mutation.py", line 39, in call_mutations_from_reads_algnmt m_sites=xsites.load_in_qualified_sites_from_xTEA_output(i_len_cutoff) File "/home/sijaewoo/tools/xTea/xtea/x_sites.py", line 49, in load_in_qualified_sites_from_xTEA_output is_lth=int(fields[-1]) ValueError: invalid literal for int() with base 10: '867.0'
I change ' is_lth=int(fields[-1])' to ' is_lth=int(float(fields[-1]))'.
Second error was in "x_mutation.py".
"pysam/libcalignmentfile.pyx", line 990, in pysam.libcalignmentfile.AlignmentFile._open ValueError: file has no sequences defined (mode='rb') - is it SAM/BAM format? Consider opening with check_sq=False
So, I changed the code like this.
165 bam_in = pysam.AlignmentFile(sf_algnmt, 'rb', reference_filename=sf_ref, check_sq=False)
After that, I ran the xtea HERV again. But it seems it's not work because the final file is internal_snp.vcf.gz.
If you know the solution, please let me know.
Thank you,
Sijae

hg38_L1_larger2K_with_all_L1HS.out file missing

Hi, I am running the xTea long read pipeline. While parsing the err file I found it reported an error related to the missing file hg38_L1_larger2K_with_all_L1HS.out. I could not find this file in the rep_lib annotation but seems all the output files are there. Can you please provide the missing file or how should I generate it myself?

Thanks!

How to create hg38_FL_L1_flanks_3k files

Hi,
I’m interested in replicate the files created with hg38.
How do you generate files like hg38_FL_L1_flanks_3k.fa?

Thank you.

Should I use bwa mem -M -Y options?

Hello,
I am working on some human paired-end short-read data and I have two questions about how to prepare proper bam file with bwa and one about how to compare xTea and MELT results.
Firstly, I am using MarkDuplicates(Picard) to remove duplicates and it is recommended to use bwa mem -M mark shorter split hits as secondary for facilitating the working of MarkDuplicates. I wonder if -M will disturb xTea?
Secondly, xTea realigns clipped and discordant reads to the consensus. Should bwa mem -Y use soft clipping for supplementary alignments be used to make sure xTea extracts sequences required for realignment?
What's more, I plan to compare xTea and MELT results just as xTea paper did, but I am confused about the criterion of overlapping TE insertions. Should overlapping TE insertions be defined as insertions whose breakpoint are exactly the same in both xTea and MELT results or some kind of difference is allowed?
Many thanks.

ModuleNotFoundError: No module named 'sklearn.ensemble.forest'

Hi
Sorry, Its me again, Thank you for helping!
Now I have this error now. I installed the latest version of scikit-learn v0.41. But it says "ModuleNotFoundError: No module named 'sklearn.ensemble.forest'" Do you have any recommendation of a particular version for scikit-learn?
Traceback (most recent call last):
File "/home/jainy/software/xTea/xtea/x_TEA_main.py", line 1031, in
pkl_model = gc.load_model_from_file(sf_model)
File "/home/jainy/software/xTea/xtea/x_genotype_classify.py", line 182, in load_model_from_file
pickle_model = pickle.load(file, encoding='latin1')
ModuleNotFoundError: No module named 'sklearn.ensemble.forest'

Thanks again
Jainy

ModuleNotFoundError: No module named 'pysam'

Hi,

Installed XTea using bioconda (Anaconda3/5.3.0) on a linux system to run on an lsf cluster. I am getting a module not found error when submitting the submit_jobs.sh script to the cluster. The error suggests there is no module called pysam. I can confirm I have installed pysam as per the instructions using: "conda install pysam -y". This installed pysam version 0.17.0 with samtools, bcftools and htslib version 1.13. I have also checked that pysam exists within the environment, currently installed in "/home/user1/.conda/envs/your_env/lib/python3.7/site-packages (0.17.0).". Any suggestions would be greatly appreciated.

Full error:
Traceback (most recent call last):
File "home/user1/.conda/envs/your_env/lib/x_TEA_main.py" line 114 in
from x_TEI_locator import *
File "home/user1/.conda/envs/your_env/lib/x_TEI_locator" line 7 in
import pysam
ModuleNotFoundError: No module named 'pysam'

Any suggestions would be greatly appreciated!

hg38_mitochondrion_copies_with_flank.fa does not exist

Hi,
Thanks for making this tool!

I am running version 0.1.7 installed using conda.

I am trying to run repeat analysis for Mitochondrial using -y 16. The pipeline scripts are generated, I am able to launch the main script, the pipeline runs through part of the analysis. Eventually it runs into an issue and reports an issue with the L1_COPY_WITH_FLANK input not existing. This variable is set in the run_xTEA_pipeline.sh script.

The generator scripts here and here report it being located within the rep annotation directory, ${rep_lib_annotation}/Mitochondrion/hg38/hg38_mitochondrion_copies_with_flank.fa. There is no directory called Mitochondrion within the rep_lib_annotation tar ball. Looks like some of the other variables are also set to non existing files, ANNOTATION and ANNOTATION1.

Can you point me to where I can download these necessary input files?

Thanks,
-Alex

Regarding input bam files

Hello,
Thank you for creating xTea!
Does xTea create bam files or do I have to have bam files ready to run xTea? If I need to have bam files to run xTea, what are they aligned to? reference genome or ref genome merged with TE sequences?
Thanks!

xTea repeat library preparation

Hi,
I would like create new repeat library for xTEA using command:
xtea -P -K -p ./ -r path-of-reference-genome.fa -a path-to-rep-lib-folder/full-length-TE-type_rmsk.out -o path-output-folder/TE_copies_with_flank.fa -e 100

However, I get:
xtea: error: no such option: -P
xtea: error: no such option: -K

Could you help me?
Thanks

repeat library preparation

Hello. xTea is a wonderful tool. I also want to use it in plants.
I want to know the version you used in https://github.com/parklab/xTea/tree/master/xtea/rep_lib_prep?
I used xTea v0.1 for short and linked reads on hg38, but I didn't find -P and -K.

FileNotFoundError: [Errno 2] No such file or directory and x_TEA_main.py: error: no such option: --bamsnap

Hi,
Thanks for your great tools!
I am running version xTea=0.1.6 installed using conda and installed short reads with the latest version by "git clone https://github.com/parklab/xTea.git"
we generate the running script in case-ctrl model by command:
xtea --case_control --tumor -i sample_id/sample_id.txt -b bam_config/case_ctrl_bam_list.txt -x null -p working/ -o submit_jobs.sh -l rep_lib_annotation/ -r /ada/reference/Homo_sapiens_assembly38.fasta -g gff3/gencode.v33.annotation.gff3 --xtea /ada/tools/xTea/xtea -y 1 -f 10 --slurm -t 0-12:00 -q intel-e5 -n 4 -m 2

after generating run script successfully, we run:
sh submit_jobs.sh

I encountered the following problems, maily in two aspects: file or directory not found and no --bamsnap option:

Filter (on cns) cutoff: 1 and 3 are used!!!
Traceback (most recent call last):
File "/ada/tools/xTea/xtea/x_TEA_main.py", line 638, in
n_clip_cutoff, n_disc_cutoff, sf_output)
File "/ada/tools/xTea/xtea/x_clip_disc_filter.py", line 1540, in call_MEIs_consensus
sf_disc_fa, sf_raw_disc)
File "/ada/tools/xTea/xtea/x_clip_disc_filter.py", line 3324, in collect_clipped_disc_reads
sf_disc_fa_tmp)
File "/ada/tools/xTea/xtea/x_clip_disc_filter.py", line 296, in collect_clipped_disc_reads_of_given_list
with open(sf_candidate_list) as fin_list:
FileNotFoundError: [Errno 2] No such file or directory: 'working/166/L1/candidate_list_barcode.txt'
/ada/tools/miniconda3/envs/mutpipe/lib/python3.6/site-packages/sklearn/externals/joblib/externals/cloudpickle/cloudpickle.py:47: De
precationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
import imp
Ave coverage is 6.930000000000001: automatic parameters (clip, disc, clip-disc) with value (1, 1 ,0)
clip,disc,polyA-cutoff is (1, 1, 0)
Traceback (most recent call last):
File "/ada/tools/xTea/xtea/x_TEA_main.py", line 888, in
n_polyA_cutoff, sf_rep_cns, sf_flank, i_flk_len, bin_size, sf_output, b_tumor)
File "/ada/tools/xTea/xtea/x_somatic_calling.py", line 93, in call_somatic_TE_insertion
xclip_disc.sprt_TEI_to_td_orphan_non_td(sf_case_candidates, sf_non_td, sf_td, sf_td_sibling, sf_orphan)
File "/ada/tools/xTea/xtea/x_clip_disc_filter.py", line 823, in sprt_TEI_to_td_orphan_non_td
with open(sf_cns) as fin_cns:
FileNotFoundError: [Errno 2] No such file or directory: 'working/166/L1/candidate_disc_filtered_cns_post_filtering.txt'
/ada/tools/miniconda3/envs/mutpipe/lib/python3.6/site-packages/sklearn/externals/joblib/externals/cloudpickle/cloudpickle.py:47: De
precationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
import imp
Traceback (most recent call last):
File "/ada/tools/xTea/xtea/x_TEA_main.py", line 895, in
ccm.parse_high_confident_somatic(sf_candidate_list, sf_raw_somatic, sf_output)
File "/ada/tools/xTea/xtea/x_somatic_calling.py", line 502, in parse_high_confident_somatic
with open(sf_raw_somatic) as fin_somatic:
FileNotFoundError: [Errno 2] No such file or directory: 'working/166/L1/candidate_disc_filtered_cns_post_filtering_somatic.txt'
/ada/tools/miniconda3/envs/mutpipe/lib/python3.6/site-packages/sklearn/externals/joblib/externals/cloudpickle/cloudpickle.py:47: De
precationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
import imp
Traceback (most recent call last):
File "/ada/tools/xTea/xtea/x_TEA_main.py", line 1089, in
sf_sites=x_igv.gnrt_sites_single_sample(sf_sites, sf_bam_list)
File "/ada/tools/xTea/xtea/x_igv.py", line 52, in gnrt_sites_single_sample
with open(sf_sites) as fin_sites, open(sf_new_sites,"w") as fout_new:
FileNotFoundError: [Errno 2] No such file or directory: 'working/166/L1/candidate_disc_filtered_cns2.txt'
/ada/tools/miniconda3/envs/mutpipe/lib/python3.6/site-packages/sklearn/externals/joblib/externals/cloudpickle/cloudpickle.py:47: De
precationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
import imp
Traceback (most recent call last):
File "/ada/tools/xTea/xtea/x_TEA_main.py", line 1089, in
sf_sites=x_igv.gnrt_sites_single_sample(sf_sites, sf_bam_list)
File "/ada/tools/xTea/xtea/x_igv.py", line 52, in gnrt_sites_single_sample
with open(sf_sites) as fin_sites, open(sf_new_sites,"w") as fout_new:
FileNotFoundError: [Errno 2] No such file or directory: 'working/166/L1/candidate_disc_filtered_cns_high_confident_post_filtering_somatic.txt'
/ada/tools/miniconda3/envs/mutpipe/lib/python3.6/site-packages/sklearn/externals/joblib/externals/cloudpickle/cloudpickle.py:47: De
precationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
import imp
Usage: x_TEA_main.py [options]

x_TEA_main.py: error: no such option: --bamsnap
/ada/tools/miniconda3/envs/mutpipe/lib/python3.6/site-packages/sklearn/externals/joblib/externals/cloudpickle/cloudpickle.py:47: De
precationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
import imp
Usage: x_TEA_main.py [options]
x_TEA_main.py: error: no such option: --bamsnap

Can you point me where the problem is? Thank you!

insertion sequences

Hi.

Thank you for developing xTea.
I ran xTea on all my samples, successfully.
Now, I have a question: Seems that xTea does not report the insertions sequences. Right?
If so, is there any way to extract the sequence of each ME?

Thank you in advance,
Best,

No output for HERV

When running xtea with a case/control:

$xteaPath/bin/xtea --case_control --tumor -i $output/sample_id.txt -b $output/case_ctrl_bam_list.txt -x null -p $output -o /hpc/pmc_gen/rvanamerongen/scripts/xtea_run_tumor_all.sh -l $xteaPath/data/rep_lib_annotation -r $hg38Path -g $xteaPath/data/known_genes_hg38.gff3 --xtea $xteaPath/xtea -f 5907 -y 15 --slurm -t 0-48:00 -n 16 -m 200

I am not getting any output for HERV and it shows the following errors:

multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/home/pmc_research/rvanamerongen/.conda/envs/xtea/lib/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/home/pmc_research/rvanamerongen/.conda/envs/xtea/lib/python3.6/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "/hpc/pmc_gen/rvanamerongen/software/callers/xtea/xtea/x_mutation.py", line 16, in unwrap_self_algn_read
    return XMutation.align_reads_to_cns_one_site(*arg, **kwarg)
  File "/hpc/pmc_gen/rvanamerongen/software/callers/xtea/xtea/x_mutation.py", line 104, in align_reads_to_cns_one_site
    self.select_qualified_alignments(ssam_hap1_tmp, sf_cns, ssam_hap1_slct)
  File "/hpc/pmc_gen/rvanamerongen/software/callers/xtea/xtea/x_mutation.py", line 165, in select_qualified_alignments
    bam_in = pysam.AlignmentFile(sf_algnmt, "rb", reference_filename=sf_ref)
  File "pysam/libcalignmentfile.pyx", line 741, in pysam.libcalignmentfile.AlignmentFile.__cinit__
  File "pysam/libcalignmentfile.pyx", line 990, in pysam.libcalignmentfile.AlignmentFile._open
ValueError: file has no sequences defined (mode='rb') - is it SAM/BAM format? Consider opening with check_sq=False
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/hpc/pmc_gen/rvanamerongen/software/callers/xtea/xtea/x_TEA_main.py", line 1118, in <module>
    xmutation.call_mutations_from_reads_algnmt(sf_sites, sf_cns, n_len_cutoff, n_jobs, sf_merged_vcf)
  File "/hpc/pmc_gen/rvanamerongen/software/callers/xtea/xtea/x_mutation.py", line 46, in call_mutations_from_reads_algnmt
    self.align_reads_to_cns_in_parallel(l_records, n_jobs)
  File "/hpc/pmc_gen/rvanamerongen/software/callers/xtea/xtea/x_mutation.py", line 53, in align_reads_to_cns_in_parallel
    pool.map(unwrap_self_algn_read, list(zip([self] * len(l_records), l_records)), 1)
  File "/home/pmc_research/rvanamerongen/.conda/envs/xtea/lib/python3.6/multiprocessing/pool.py", line 266, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/home/pmc_research/rvanamerongen/.conda/envs/xtea/lib/python3.6/multiprocessing/pool.py", line 644, in get
    raise self._value
ValueError: file has no sequences defined (mode='rb') - is it SAM/BAM format? Consider opening with check_sq=False
Ave coverage is 42.336: automatic parameters (clip, disc, clip-disc) with value (3, 5 ,1)

clip,disc,polyA-cutoff is (3, 5, 1)
Traceback (most recent call last):
  File "/hpc/pmc_gen/rvanamerongen/software/callers/xtea/xtea/x_TEA_main.py", line 888, in <module>
    n_polyA_cutoff, sf_rep_cns, sf_flank, i_flk_len, bin_size, sf_output, b_tumor)
  File "/hpc/pmc_gen/rvanamerongen/software/callers/xtea/xtea/x_somatic_calling.py", line 93, in call_somatic_TE_insertion
    xclip_disc.sprt_TEI_to_td_orphan_non_td(sf_case_candidates, sf_non_td, sf_td, sf_td_sibling, sf_orphan)
  File "/hpc/pmc_gen/rvanamerongen/software/callers/xtea/xtea/x_clip_disc_filter.py", line 823, in sprt_TEI_to_td_orphan_non_td
    with open(sf_cns) as fin_cns:
FileNotFoundError: [Errno 2] No such file or directory: '/hpc/pmc_gen/rvanamerongen/output/xtea/tumor/PMCID205AAA/HERV/candidate_disc_filtered_cns_post_filtering.txt'
Traceback (most recent call last):
  File "/hpc/pmc_gen/rvanamerongen/software/callers/xtea/xtea/x_TEA_main.py", line 895, in <module>
    ccm.parse_high_confident_somatic(sf_candidate_list, sf_raw_somatic, sf_output)
  File "/hpc/pmc_gen/rvanamerongen/software/callers/xtea/xtea/x_somatic_calling.py", line 502, in parse_high_confident_somatic
    with open(sf_raw_somatic) as fin_somatic:
FileNotFoundError: [Errno 2] No such file or directory: '/hpc/pmc_gen/rvanamerongen/output/xtea/tumor/PMCID205AAA/HERV/candidate_disc_filtered_cns_post_filtering_somatic.txt'
Traceback (most recent call last):
  File "/hpc/pmc_gen/rvanamerongen/software/callers/xtea/xtea/x_TEA_main.py", line 1089, in <module>
    sf_sites=x_igv.gnrt_sites_single_sample(sf_sites, sf_bam_list)
  File "/hpc/pmc_gen/rvanamerongen/software/callers/xtea/xtea/x_igv.py", line 52, in gnrt_sites_single_sample
    with open(sf_sites) as fin_sites, open(sf_new_sites,"w") as fout_new:
FileNotFoundError: [Errno 2] No such file or directory: '/hpc/pmc_gen/rvanamerongen/output/xtea/tumor/PMCID205AAA/HERV/candidate_disc_filtered_cns2.txt'
Traceback (most recent call last):
  File "/hpc/pmc_gen/rvanamerongen/software/callers/xtea/xtea/x_TEA_main.py", line 1089, in <module>
    sf_sites=x_igv.gnrt_sites_single_sample(sf_sites, sf_bam_list)
  File "/hpc/pmc_gen/rvanamerongen/software/callers/xtea/xtea/x_igv.py", line 52, in gnrt_sites_single_sample
    with open(sf_sites) as fin_sites, open(sf_new_sites,"w") as fout_new:
FileNotFoundError: [Errno 2] No such file or directory: '/hpc/pmc_gen/rvanamerongen/output/xtea/tumor/PMCID205AAA/HERV/candidate_disc_filtered_cns_high_confident_post_filtering_somatic.txt'
Usage: x_TEA_main.py [options]

x_TEA_main.py: error: no such option: --bamsnap
Usage: x_TEA_main.py [options]

x_TEA_main.py: error: no such option: --bamsnap
Traceback (most recent call last):
  File "/hpc/pmc_gen/rvanamerongen/software/callers/xtea/xtea/x_TEA_main.py", line 1130, in <module>
    gff.annotate_results(sf_input, sf_output)
  File "/hpc/pmc_gen/rvanamerongen/software/callers/xtea/xtea/x_gene_annotation.py", line 238, in annotate_results
    with open(sf_ori_rslt) as fin_rslt, open(sf_out, "w") as fout_rslt:
FileNotFoundError: [Errno 2] No such file or directory: '/hpc/pmc_gen/rvanamerongen/output/xtea/tumor/PMCID205AAA/HERV/candidate_disc_filtered_cns_high_confident_post_filtering_somatic.txt'
/home/pmc_research/rvanamerongen/.conda/envs/xtea/lib/python3.6/site-packages/sklearn/base.py:306: UserWarning: Trying to unpickle estimator DecisionTreeClassifier from version 0.18.1 when using version 0.21.3. This might lead to breaking code or invalid results. Use at your own risk.
  UserWarning)
/home/pmc_research/rvanamerongen/.conda/envs/xtea/lib/python3.6/site-packages/sklearn/base.py:306: UserWarning: Trying to unpickle estimator RandomForestClassifier from version 0.18.1 when using version 0.21.3. This might lead to breaking code or invalid results. Use at your own risk.
  UserWarning)
Traceback (most recent call last):
  File "/hpc/pmc_gen/rvanamerongen/software/callers/xtea/xtea/x_TEA_main.py", line 1033, in <module>
    gc.predict_for_site(pkl_model, sf_xTEA, sf_new)
  File "/hpc/pmc_gen/rvanamerongen/software/callers/xtea/xtea/x_genotype_classify.py", line 149, in predict_for_site
    site_features = self.prepare_arff_from_xTEA_output(sf_xTEA, sf_arff)
  File "/hpc/pmc_gen/rvanamerongen/software/callers/xtea/xtea/x_genotype_classify.py", line 130, in prepare_arff_from_xTEA_output
    l_features = self.load_in_feature_from_xTEA_output(sf_xTEA, b_train)
  File "/hpc/pmc_gen/rvanamerongen/software/callers/xtea/xtea/x_genotype_classify.py", line 192, in load_in_feature_from_xTEA_output
    with open(sf_xtea) as fin_xtea:
FileNotFoundError: [Errno 2] No such file or directory: '/hpc/pmc_gen/rvanamerongen/output/xtea/tumor/PMCID205AAA/HERV/candidate_disc_filtered_cns.txt.high_confident.post_filtering_somatic_with_gene.txt'
sort: cannot read: /hpc/pmc_gen/rvanamerongen/output/xtea/tumor/PMCID205AAA/HERV/candidate_disc_filtered_cns.txt.high_confident.post_filtering_somatic_with_gene_gntp.txt: No such file or directory
Running command: sort -k1,1V -k2,2n -o /hpc/pmc_gen/rvanamerongen/output/xtea/tumor/PMCID205AAA/HERV/candidate_disc_filtered_cns.txt.high_confident.post_filtering_somatic_with_gene_gntp.txt.sorted /hpc/pmc_gen/rvanamerongen/output/xtea/tumor/PMCID205AAA/HERV/candidate_disc_filtered_cns.txt.high_confident.post_filtering_somatic_with_gene_gntp.txt

Traceback (most recent call last):
  File "/hpc/pmc_gen/rvanamerongen/software/callers/xtea/xtea/x_TEA_main.py", line 1065, in <module>
    gvcf.cvt_raw_rslt_to_gvcf(s_sample_id, sf_bam, sf_raw_rslt, i_rep_type, sf_ref, sf_vcf)
  File "/hpc/pmc_gen/rvanamerongen/software/callers/xtea/xtea/x_gvcf.py", line 199, in cvt_raw_rslt_to_gvcf
    with open(sf_raw_rslt_sorted) as fin_rslt:
FileNotFoundError: [Errno 2] No such file or directory: '/hpc/pmc_gen/rvanamerongen/output/xtea/tumor/PMCID205AAA/HERV/candidate_disc_filtered_cns.txt.high_confident.post_filtering_somatic_with_gene_gntp.txt.sorted'

The x_TEA_main.py: error: no such option: --bamsnap, however, is also present in the output for the other transposon types but they did have a vcf file.

issue(ModuleNotFoundError Traceback (most recent call last) /tmp/ipykernel_2662/967368989.py in <module> ----> 1 from sklearn.ensemble import RandomForestClassifier ModuleNotFoundError: No module named 'sklearn')

ModuleNotFoundError Traceback (most recent call last)
/tmp/ipykernel_3995/1439741196.py in
----> 1 import sklearn

ModuleNotFoundError: No module named 'sklearn'

.when i check sklearn are installed or not .. by using commont !pip list.
there i cant found it and itry to import other model all of other are prepectly work except scikit learn

i try many time but i can't understand about any more ..plz help me ..i try install scikit-learn directly by terminal and also install in conda project env. but not solved this issue.
i used
pip install -U pip
pip install scikit-learn
conda install scikit-learn
after this... error are same as before

plz help if anyone who know how to fix this issue..plz help

File not found

Traceback (most recent call last):
File "/root/xTea/xtea/x_TEA_main.py", line 1031, in
pkl_model = gc.load_model_from_file(sf_model)
File "/root/xTea/xtea/x_genotype_classify.py", line 182, in load_model_from_file
pickle_model = pickle.load(file, encoding='latin1')
ModuleNotFoundError: No module named 'sklearn.ensemble.forest'
Running command: sort -k1,1V -k2,2n -o ./xtea/HG002/Alu/candidate_disc_filtered_cns.txt.high_confident.post_filtering_with_gene_gntp.txt.sorted ./xtea/HG002/Alu/candidate_disc_filtered_cns.txt.high_confident.post_filtering_with_gene_gntp.txt

sort: cannot read: ./xtea/HG002/Alu/candidate_disc_filtered_cns.txt.high_confident.post_filtering_with_gene_gntp.txt: No such file or directory
Traceback (most recent call last):
File "/root/xTea/xtea/x_TEA_main.py", line 1065, in
gvcf.cvt_raw_rslt_to_gvcf(s_sample_id, sf_bam, sf_raw_rslt, i_rep_type, sf_ref, sf_vcf)
File "/root/xTea/xtea/x_gvcf.py", line 199, in cvt_raw_rslt_to_gvcf
with open(sf_raw_rslt_sorted) as fin_rslt:
FileNotFoundError: [Errno 2] No such file or directory: './xtea/HG002/Alu/candidate_disc_filtered_cns.txt.high_confident.post_filtering_with_gene_gntp.txt.sorted'

xTea on plant species

Hi,
We wanted to try xTea on a plant species. Is that possible or does it require a human reference?

Agnieszka

error: no such option: -P -K

i prepare te ref file from https://github.com/parklab/xTea/tree/master/xtea/rep_lib_prep

xtea -P -K -p ./ -r path-of-reference-genome.fa -a path-to-rep-lib-folder/full-length-TE-type_rmsk.out -o path-output-folder/TE_copies_with_flank.fa -e 100

but i get error, there is no such opthin: -K -P

Can't install on Mac

Hi, trying to install via miniconda3 on Mac with OS Mojave. I though any python greater than 3.6 would do, but it seems not to allow my python (3.8)- or maybe there are other issues which i don't fully understand. I guess i can try to go to 3.6

conda install numpy scikit-learn=0.18.1 pandas -y
Collecting package metadata (current_repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Collecting package metadata (repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Solving environment: |
Found conflicts! Looking for incompatible packages.
This can take several minutes. Press CTRL-C to abort.
failed

UnsatisfiableError: The following specifications were found
to be incompatible with the existing python installation in your environment:

Specifications:

scikit-learn=0.18.1 -> python[version='2.7.|3.5.|3.6.*']

Your python: python=3.8

If python is on the left-most side of the chain, that's the version you've asked for.
When python appears to the right, that indicates that the thing on the left is somehow
not available for the python version you are constrained to. Note that conda will not
change your python version to a different minor version unless you explicitly specify
that.

The following specifications were found to be incompatible with each other:

Output in format: Requested package -> Available versions

Package libgfortran4 conflicts for:
scikit-learn=0.18.1 -> scipy -> libgfortran4[version='>=7.5.0']
numpy -> libblas[version='>=3.8.0,<4.0a0'] -> libgfortran4[version='>=7.5.0']

Package numpy conflicts for:
pandas -> numpy[version='1.10.|1.11.|1.12.|1.13.|>=1.11|>=1.11.|>=1.12.1,<2.0a0|>=1.14.6,<2.0a0|>=1.15.4,<2.0a0|>=1.16.5,<2.0a0|>=1.16.6,<2.0a0|>=1.17.5,<2.0a0|>=1.19.5,<2.0a0|>=1.19.4,<2.0a0|>=1.18.5,<2.0a0|>=1.19.2,<2.0a0|>=1.18.4,<2.0a0|>=1.18.1,<2.0a0|>=1.9.3,<2.0a0|>=1.9.|>=1.9|>=1.8|>=1.7|>=1.13.3,<2.0a0|>=1.11.3,<2.0a0|>=1.9.3,<1.10.0a0']
scikit-learn=0.18.1 -> scipy -> numpy[version='1.10.|1.13.|>=1.11|>=1.11.3,<2.0a0|>=1.14.6,<2.0a0|>=1.16.5,<2.0a0|>=1.16.6,<2.0a0|>=1.17.5,<2.0a0|>=1.19.5,<2.0a0|>=1.19.4,<2.0a0|>=1.18.5,<2.0a0|>=1.19.2,<2.0a0|>=1.18.1,<2.0a0|>=1.9.3,<2.0a0|>=1.9|>=1.15.1,<2.0a0']
scikit-learn=0.18.1 -> numpy[version='1.11.|1.12.']

Package libcxxabi conflicts for:
python=3.8 -> libcxx[version='>=4.0.1'] -> libcxxabi[version='4.0.1|4.0.1|8.0.0|8.0.0|8.0.0|8.0.0|8.0.1',build='hebd6815_0|hcfea43d_1|1|3|0|4|2']
pandas -> libcxx[version='>=4.0.1'] -> libcxxabi[version='4.0.1|4.0.1|8.0.0|8.0.0|8.0.0|8.0.0|8.0.1',build='hebd6815_0|hcfea43d_1|1|3|0|4|2']
numpy -> libcxx[version='>=4.0.1'] -> libcxxabi[version='4.0.1|4.0.1|8.0.0|8.0.0|8.0.0|8.0.0|8.0.1',build='hebd6815_0|hcfea43d_1|1|3|0|4|2']

Package mkl_fft conflicts for:
numpy -> mkl_fft[version='>=1.0.14,<2.0a0|>=1.0.4|>=1.0.6,<2.0a0|>=1.2.1,<2.0a0']
pandas -> numpy[version='>=1.19.5,<2.0a0'] -> mkl_fft[version='>=1.0.14,<2.0a0|>=1.0.6,<2.0a0|>=1.2.1,<2.0a0|>=1.0.4']

TypeError: stat: path should be string, bytes, os.PathLike or integer, not NoneType

Hi,
Thanks for making this wonderful tool!
I am running version 0.1.6 installed using conda.
we run the case-ctrl model using the following comand:
xtea --case_control --tumor -i sample_id/sample_id.txt -b bam_config/case_ctrl_bam_list.txt -p working/ -o submit_jobs.sh -l rep_lib_annotation/ -r reference/Homo_sapiens_assembly38.fasta -g gff3/gencode.v33.annotation.gff3 --xtea tools/xTea/xtea -y 7 -f 10 --slurm -t 0-12:00 -q indel-debug -n 4 -m 2
but we got the error bellow:
Traceback (most recent call last):
File tools/miniconda3/envs/mutpipe/bin/xtea", line 1114, in
if os.path.isfile(sf_bams_10X) == False:
File "tools/miniconda3/envs/mutpipe/lib/python3.6/genericpath.py", line 30, in isfile
st = os.stat(path)
TypeError: stat: path should be string, bytes, os.PathLike or integer, not NoneType
The case_ctrl_bam_list.txt file is sth like this:
sample_166 tumor/Panel19-LZ-166T/tumor_deduped.sorted.bam normal/Panel152-LZ-166B/tumor_deduped.sorted.bam

Can you help me out?
Thanks
Ada

Broken relative symbolic link

I have been trying to get xTea running on a HPC, but I am encountering several errors:
First, the program tries to access a file with the clipped reads but cannot find it and therefore creates a symbolic link:

[M::bwa_idx_load_from_disk] read 0 ALT contigs
[E::main_mem] fail to open file `output/xtea/PMLBM000AAF_PMCRZ500LPO_WGS.subset.chr1/pub_clip/0/PMLBM000AAF_PMCRZ500LPO_WGS.subset.chr1.bam.clipped.fq'.
Running command: ln -s output/xtea/PMLBM000AAF_PMCRZ500LPO_WGS.subset.chr1/HERV/tmp/clip/0/PMLBM000AAF_PMCRZ500LPO_WGS.subset.chr1.bam.clipped.fq output/xtea/PMLBM000AAF_PMCRZ500LPO_WGS.subset.chr1/pub_clip/0/PMLBM000AAF_PMCRZ500LPO_WGS.subset.chr1.bam.clipped.fq

However, as I had provided a relative path to the '-p' option, it also tries to create a symbolic link with relative path, which is sadly broken.
As a result, bwa cannot access the reads via the symbolic link and produces the following error:
ValueError: file has no sequences defined (mode='r') - is it SAM/BAM format? Consider opening with check_sq=False.

If you think something else is at play, I would be happy to provide any additional info.

pseudogenes insertion from pacbio bam

Hi,
I am facing a problem in identifying "pseudogenes" insertion from pacbio reads; I am using the following command:

python3 xTea/bin/xtea_long -i sample_id -b bam -r GRCh38_latest_genomic.primary.fa -g GRCh38_latest_genomic.gff --xtea=tools/xTea/xtea_long -o exons -y 64 --rep=tools/xTea/

sh run_xTEA_pipeline.sh

And I am getting the following error:

Running command: minimap2 -k11 -w5 --sr --frag=yes -A2 -B4 -O4,8 -E2,1 -r150 -p.5 -N5 -n1 -m20 -s30 -g200 -2K50m --MD --heap-sort=yes --secondary=no --cs -a -t 10 tools/xTea/consensus_mask_lrd/Psudogene.fa ./all_ins_seqs.fa.after_polyA_round.fa | samtools view -hSb - | samtools sort -o ./tmp/classification/Psudogene_cns.bam -

Running command: samtools index ./tmp/classification/Psudogene_cns.bam

Traceback (most recent call last):
File "tools/xTea/xtea_long/l_main.py", line 311, in
lrc.classify_ins_seqs(sf_rep_ins, sf_ref, flk_lenth, sf_rslt)
File "tools/xTea/xtea_long/l_rep_classification.py", line 175, in classify_ins_seqs
self.get_unmasked_seqs(sf_rep_ins_tmp, sf_tmp_out, sf_new_tmp)
File "ools/xTea/xtea_long/l_rep_classification.py", line 288, in get_unmasked_seqs
with open(sf_slcted) as fin_slcted:
FileNotFoundError: [Errno 2] No such file or directory: ''

Feedback is appreciated.
Thanks.

ModuleNotFoundError: No module named 'sklearn.ensemble.forest'

Hi Simon,

I followed up my question here in case the closed issue won't send you a notification. Sorry if I send the twice.

I downloaded the latest version xTea 1.1 recently but got a similar error message as Jainy's (issue #5) as below:

Traceback (most recent call last):
File "/n/data1/bch/genetics/lee/penny/xTea_1_1/xtea/x_TEA_main.py", line 1031, in
pkl_model = gc.load_model_from_file(sf_model)
File "/n/data1/bch/genetics/lee/penny/xTea_1_1/xtea/x_genotype_classify.py", line 182, in load_model_from_file
pickle_model = pickle.load(file, encoding='latin1')
ModuleNotFoundError: No module named 'sklearn.ensemble.forest'
sort: cannot read: /n/data1/bch/genetics/lee/penny/xTea_1000_genome/working_dir/HG00740/SVA/candidate_disc_filtered_cns.txt.high_confident.post_filtering_with_gene_gntp.txt: No such file or directory
Running command: sort -k1,1V -k2,2n -o /n/data1/bch/genetics/lee/penny/xTea_1000_genome/working_dir/HG00740/SVA/candidate_disc_filtered_cns.txt.high_confident.post_filtering_with_gene_gntp.txt.sorted /n/data1/bch/genetics/lee/penny/xTea_1000_genome/working_dir/HG00740/SVA/candidate_disc_filtered_cns.txt.high_confident.post_filtering_with_gene_gntp.txt

Traceback (most recent call last):
File "/n/data1/bch/genetics/lee/penny/xTea_1_1/xtea/x_TEA_main.py", line 1065, in
gvcf.cvt_raw_rslt_to_gvcf(s_sample_id, sf_bam, sf_raw_rslt, i_rep_type, sf_ref, sf_vcf)
File "/n/data1/bch/genetics/lee/penny/xTea_1_1/xtea/x_gvcf.py", line 199, in cvt_raw_rslt_to_gvcf
with open(sf_raw_rslt_sorted) as fin_rslt:
FileNotFoundError: [Errno 2] No such file or directory: '/n/data1/bch/genetics/lee/penny/xTea_1000_genome/working_dir/HG00740/SVA/candidate_disc_filtered_cns.txt.high_confident.post_filtering_with_gene_gntp.txt.sorted'

I am not sure whether the following errors are causing by "ModuleNotFoundError: No module named 'sklearn.ensemble.forest'". Also, the scikit-learn version that I installed is 0.21.3 and I am using Python 3.7.4. Would you help take a look at that?

Thank you so much!

Best,
Penny

Insertion detected by MELT in HG002

For SLURM your default email address should change

Running with SLURM, looks like email is still set:

#SBATCH [email protected]

Would be nice to have -e option for email.

Thanks,
Phil

xTea for long reads cannot find some temporary files

Hi,

I'm trying to run xTea on ONT long reads for a sample with 10x coverage and this is my batch file:

#!/bin/bash

#SBATCH --account=my_account
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=20
#SBATCH -t 1-0:0:0
#SBATCH --mem=32G
#SBATCH -p my_partition
#SBATCH -o 104_hg00733_%j.out
PREFIX=/my/prefix/
############
############
REF=/address/of/fasta/file/hg38.fasta
XTEA_PATH=/address/to/xtea_long/
BAM_LIST=${PREFIX}"bam_list.txt"
TMP=${PREFIX}"tmp/"
REP_LIB=/address/to/rep/libarary/xTea/data/
SVA_REF_COPY=null
############
############
python ${XTEA_PATH}"l_main.py" -C -b ${BAM_LIST} -r ${REF} -p ${TMP} -o ${PREFIX}"candidate_list_from_clip.txt"  -n 20 -w 75  
python ${XTEA_PATH}"l_main.py" -A -b ${BAM_LIST} -r ${REF} -p ${TMP} -i ${PREFIX}"candidate_list_from_clip.txt" -o ${PREFIX}"all_ins_seqs.fa" --rep ${REP_LIB} -n 20 
python ${XTEA_PATH}"l_main.py" -N -b ${BAM_LIST} -r ${REF} -p ${TMP}"ghost" -o ${PREFIX}"ghost_reads.fa" --rmsk /somthing/LINE/hg38/hg38_L1_larger_500_with_all_L1HS.out --cns /something/consensus/LINE1.fa --min 4000 -n 20
python ${XTEA_PATH}"l_main.py" -Y -i ${PREFIX}"all_ins_seqs.fa" -r ${REF} -p ${TMP}"classification" --rep ${REP_LIB} -y 15 -o ${PREFIX}"classified_results.txt" -n 20
python ${XTEA_PATH}"l_main.py" --clean -b ${BAM_LIST} -r ${REF} -p ${TMP} -i ${PREFIX}"candidate_list_from_clip.txt"  -n 20

It ran for about 6.5 hours but after that it stoped and gave me this error message:

Traceback (most recent call last):
  File ".../xTea/xtea_long/l_main.py", line 311, in <module>
    lrc.classify_ins_seqs(sf_rep_ins, sf_ref, flk_lenth, sf_rslt)
  File ".../xTea/xtea_long/l_rep_classification.py", line 139, in classify_ins_seqs
    polyA_masker.parse_polyA_from_algnmt(sf_algnmt, sf_ref, sf_tmp_out)
  File ".../xTea/xtea_long/l_polyA_masker.py", line 15, in parse_polyA_from_algnmt
    samfile = pysam.AlignmentFile(sf_algnmt, s_open_fmt)  # read in the sam file
  File "pysam/libcalignmentfile.pyx", line 742, in pysam.libcalignmentfile.AlignmentFile.__cinit__
  File "pysam/libcalignmentfile.pyx", line 941, in pysam.libcalignmentfile.AlignmentFile._open
FileNotFoundError: [Errno 2] could not open alignment file `.../104_hg00733/tmp/classification/polyA_cns.bam`: No such file or directory

before this message there are multiple messages also with "Failed to open file" message for these files:

.../104_hg00733/tmp/classification/all_tei_seq_2_ref.bam
.../104_hg00733/tmp/classification/polyA_cns.bam

These seems to be temporary files.
Also I should mention that I didn't include option -g for gff3 file because it was not included in the example run for long reads, although it is listed as required options.

If you can please help me figure out the problem,
Thanks,
Milad

True TE-insertion of HG002

Hi,
Could you share nonreference TE-insertion set !

ValueError: unknown field code 'AH' in record 'SQ'

Hi,
I am facing a problem in running 'sh run_xTEA_pipeline.sh '.

I am getting the following error:

Working on "clip" step!
Traceback (most recent call last):
File "/home2/wangyx/miniconda3/envs/xtea/lib/x_TEA_main.py", line 505, in
b_force, b_tumor, f_purity)
File "/home2/wangyx/miniconda3/envs/xtea/lib/x_TEA_main.py", line 364, in automatic_gnrt_parameters
rcd=x_basic_info.get_cov_is_rlth(sf_bam_list, sf_ref, search_win, b_force)
File "/home2/wangyx/miniconda3/envs/xtea/lib/x_basic_info.py", line 47, in get_cov_is_rlth
m_info=self.collect_dump_basic_info_samples(sf_bam_list, sf_ref, search_win)
File "/home2/wangyx/miniconda3/envs/xtea/lib/x_basic_info.py", line 60, in collect_dump_basic_info_samples
m_sample_info=self.collect_basic_info_samples(sf_bam_list, sf_ref, search_win)
File "/home2/wangyx/miniconda3/envs/xtea/lib/x_basic_info.py", line 117, in collect_basic_info_samples
l_sites=self._random_slct_site(sf_bam, sf_ref, n_sites)
File "/home2/wangyx/miniconda3/envs/xtea/lib/x_basic_info.py", line 428, in _random_slct_site
m_chrm_info=bam_info.get_all_chrom_name_length()
File "/home2/wangyx/miniconda3/envs/xtea/lib/x_alignments.py", line 56, in get_all_chrom_name_length
header = bamfile.header
File "pysam/calignmentfile.pyx", line 1258, in pysam.calignmentfile.AlignmentFile.header.get (pysam/calignmentfile.c:15426)
ValueError: unknown field code 'AH' in record 'SQ'

The input file 'sample.Mdup.recal.bam' is processed by gatk. Can you help me? Thank you so much!

te library format

i want runin te for non-human wgs
how should i prepare the library for xTea
i mean how to prepare the format of library than xTea need, ps, repeatmasker output format -> xTea library format; edta format -> xTea format

annotaion file

Dear author,

I am trying to run xTEA locally, there is an error showing
IOError: [Errno 2] No such file or directory: '/scratch/yangyang0110/xTea/rep_lib_annotation/LINE/hg19/hg19_L1.fa.out'

I can't find the hg19_L1.fa.out in the annotation folder neither, may I ask if the annotation folder is missing some necessary files?

Looking forward to your reply. Thank you for sharing.

Yang

FileNotFoundError

I tried to reinstall xtea or install free same result were return. Python is 2.7.1 and 'from sklearn.ensemble import RandomForestClassifier' do not show any error or warnings!

[Dockerfile]installing error:chmod: cannot access '.pyc': No such file or directory The command '/bin/sh -c chmod +x .pyc' returned a non-zero code: 1

Hi,
Very nice tools! I tried to install on my system using Dockerfile. But the process was stopped. And the terminal showed

Step 14/17 : ENV LANG=C.UTF-8
 ---> Using cache
 ---> 2472a041ff3d
Step 15/17 : COPY *.pyc *.sh ./
 ---> Using cache
 ---> 57417a2fde8f
Step 16/17 : RUN chmod +x *.pyc
 ---> Running in 168d71410252
**chmod: cannot access '*.pyc': No such file or directory
The command '/bin/sh -c chmod +x *.pyc' returned a non-zero code: 1**

I tried to fixed it but I did not find whereis the problem. Could you give me a hand?

Bests,
Nan