
epiteome's Introduction

INSTALL

Dependencies
    All dependencies must be executable and findable in the user's PATH.

    perl (version 5.x): Generally installed on Linux and macOS by default. Expected to be installed at /usr/bin/perl.

    perl libraries (Bio::SeqIO, Bio::Tools::GFF, Bio::DB::Sam, Set::IntervalTree, Statistics::Descriptive)

    samtools (version 1.3.1 or higher)

    bedtools (version v2.26.0 or higher)

    ngsutils (version 0.5.7 or higher)

    segemehl (version 0.2.0 or higher)

    gzip/bzip2: Generally installed on Linux and macOS by default.

Install
    epiTEome is a perl program that does not need to be compiled.
    Make sure it is executable. For convenience, it can be added to your PATH.
    epiTEome assumes your perl installation is at /usr/bin/perl.

Tips
    Each perl library can be installed through CPAN, for example `perl -MCPAN -e 'install Bio::SeqIO'`
    ngsutils can be installed using `pip install ngsutils`

Test environment
    This release of epiTEome (v1) was tested on Mac OS X (10.11.6) with perl
    5.18.2, samtools 1.3.1, bedtools v2.26.0, ngsutils 0.5.7 and segemehl 0.2.0.

CITATION

If you use epiTEome in your work, please cite one of the following:

USAGE

INDEX: The reference FASTA file must be indexed in the segemehl index format.
       Before indexing, idxEpiTEome.pl masks the 3’ edge of the LTR5
       and the 5’ edge of the LTR3 to avoid multi-mapping read competition within
       a single TE. Because the LTRs of a single TE are identical at
       the time of insertion, this light masking prevents split-reads that
       could map at the junctions between the TE and its flanking DNA from
       mapping inside the TE instead.

    Usage: idxEpiTEome.pl -l [max read length] -gff <gff3> -t <target> -ref <fasta>

    <gff3>    TE annotation in gff3 format.
        
    <target>  List of TEids of interest.

    <ref>     FASTA-formatted (.fa, .fna or .fasta) genome file.

    -l        Maximum read length present in the FASTQ file.


EPITEOME: Identify new TE insertion sites and quantify their methylation level from MethylC-seq datasets.

    Usage: epiTEome.pl [options] -gff <gff3> -t <target> -ref <fasta> -un <fastq>

    <gff3>    TE annotation in gff3 format.
        
    <target>  List of TEids of interest.

    <ref>     FASTA-formatted (.fa, .fna or .fasta) genome file.

    <un>      FASTQ file of reads that failed to map to the reference genome (unmapped reads).

OPTIONS
  EpiTEome Specific Options:
    -chop [integer] : length of the chopped read ends (default 25,30,40).
                      Using several lengths improves epiTEome sensitivity.
    -b    [integer] : number of TEs per batch (default 5000).
    -w    [integer] : window size for methylation metaplot analysis (default 10 bp).

  Alignment Options:
    -E    [integer] : segemehl max evalue (default 5).
    -p    [integer] : number of threads used by segemehl (default 1).
                      All other portions of epiTEome are single-threaded.

OUTPUT
  epiTEome outputs four files: .newInsertionSite.tab, .newInsertionSite.sam, .met.meta.tab and .met.row.tab

GFF3 INPUT FILE FORMAT

GFF3 input files follow the standard GFF3 format, except for columns 3 and 9, which require specific tags.
TE annotated features (mandatory):
    - Column 3 (type) should be 'te'
    - Column 9 (attributes) should have the following attributes: ID=teid, sF=superfamily_name, fam=family_name.
LTR annotated features (optional):
    - Column 3 (type) should be LTR5 or LTR3
    - Column 9 (attributes) should have the tag Parent=teid.
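The tag requirements above can be checked before running the pipeline. The following Python sketch is not part of epiTEome: the function names are hypothetical, and it assumes attributes in column 9 are separated by semicolons as in standard GFF3. It only reports missing tags and validates nothing else.

```python
def parse_attributes(column9):
    """Split a GFF3 attribute column into a dict (assumes ';' separators)."""
    attrs = {}
    for field in column9.split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def check_gff3_line(line):
    """Return a list of problems for one feature line (empty list if none)."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        return ["expected 9 tab-separated columns, found %d" % len(cols)]
    feature_type = cols[2]
    attrs = parse_attributes(cols[8])
    problems = []
    if feature_type == "te":
        # TE features need ID, sF (superfamily) and fam (family).
        for tag in ("ID", "sF", "fam"):
            if tag not in attrs:
                problems.append("te feature is missing attribute " + tag)
    elif feature_type in ("LTR5", "LTR3"):
        # LTR features need a Parent tag pointing at the teid.
        if "Parent" not in attrs:
            problems.append(feature_type + " feature is missing attribute Parent")
    return problems


if __name__ == "__main__":
    # Made-up example line; the identifiers are illustrative only.
    example = "Chr2\tsrc\tte\t100\t900\t.\t+\t.\tID=TE_A;sF=SF_A;fam=FAM_A"
    print(check_gff3_line(example))
```

Lines starting with `#` (GFF3 comments and directives) should be skipped by the caller before passing lines to `check_gff3_line`.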

OUTPUT FILE FORMAT

.newInsertionSite.tab: coordinates of non-reference TEs
    1.  chrom - name of the chromosome or scaffold
    2.  chromStart - start position of the feature containing the new insertion site (0-based)
    3.  chromEnd - end position of the feature containing the new insertion site
    4.  name - feature name
    5.  mapping type [uniq|multi] - the feature was identified using split-reads that map uniquely to the reference sequence (uniq)
    or that map to the reference sequence multiple times (multi).
    6.  strand
    7.  tsdStart - Start of the TSD (target-site duplication)
    8.  tsdEnd - End of the TSD
    9.  nubReads - Number of split-reads aligned to identify this feature
    10. family - TE family name
    11. teid - teid name
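As an illustration of the column layout above, the following Python sketch reads one line of a .newInsertionSite.tab file into a named record. It is not part of epiTEome. It assumes the file is tab-separated with exactly the 11 columns listed and no header line; the field names and the `keep_site` filter are hypothetical.

```python
from collections import namedtuple

# Field names mirror the 11 documented columns; the Python names are illustrative.
InsertionSite = namedtuple(
    "InsertionSite",
    "chrom chrom_start chrom_end name mapping_type strand "
    "tsd_start tsd_end n_reads family teid",
)


def parse_insertion_line(line):
    """Parse one tab-separated line of .newInsertionSite.tab."""
    f = line.rstrip("\n").split("\t")
    if len(f) != 11:
        raise ValueError("expected 11 columns, found %d" % len(f))
    return InsertionSite(
        f[0], int(f[1]), int(f[2]), f[3], f[4], f[5],
        int(f[6]), int(f[7]), int(f[8]), f[9], f[10],
    )


def keep_site(site, min_reads=2, uniq_only=True):
    """Example filter: enough supporting split-reads, optionally unique only."""
    if uniq_only and site.mapping_type != "uniq":
        return False
    return site.n_reads >= min_reads
```

A typical use would be `[s for s in map(parse_insertion_line, open(path)) if keep_site(s)]`, where the thresholds are arbitrary starting points rather than recommended values.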

.newInsertionSite.sam: standard SAM alignment file displaying the split-read alignment profile. Note that all
                       split-reads that could detect a non-reference TE are stored in this file,
                       allowing the user to identify false negative predictions.

.met.row.tab: methylation level at each cytosine position (used for barplot, Figure 4A)
    1. methylation context [CG|CHG|CHH]
    2. location [neo|te] - neo (flanking DNA at the new insertion site) or te (at the TE)
    3. edge [5|3|8] - 5prime (5), 3prime (3), both (8)
    4. nbCm - number of cytosines methylated
    5. nbR - number of reads mapped
    6. name - feature name
    7. teid - teid name
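A short Python sketch of how the .met.row.tab columns might be summarised follows. It is not part of epiTEome. It assumes the file is tab-separated with no header, and it assumes that a pooled methylation level can be taken as the sum of nbCm divided by the sum of nbR within each (context, location) group; that formula is an assumption of this sketch, not a documented epiTEome computation.

```python
from collections import defaultdict


def pooled_methylation(lines):
    """Return {(context, location): percent} from .met.row.tab lines.

    Assumed formula: 100 * sum(nbCm) / sum(nbR) per group.
    """
    totals = defaultdict(lambda: [0, 0])  # group -> [methylated, reads]
    for line in lines:
        f = line.rstrip("\n").split("\t")
        if len(f) < 5:
            continue  # skip blank or malformed lines
        group = (f[0], f[1])  # (methylation context, location)
        totals[group][0] += int(f[3])  # column 4: nbCm
        totals[group][1] += int(f[4])  # column 5: nbR
    return {
        group: 100.0 * methylated / reads
        for group, (methylated, reads) in totals.items()
        if reads > 0
    }
```

Groups with zero mapped reads are omitted rather than reported as 0%, so that missing coverage is not mistaken for an unmethylated site.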

.met.meta.tab: processed methylation levels for metaplot analysis (used for metaplot, Figure 4B)
    1. methylation context [CG|CHG|CHH]
    2. location [neo|te] - neo (flanking DNA at the new insertion site) or te (at the TE)
    3. edge: [5|3|8] - 5prime (5), 3prime (3), both (8)
    4. window id
    5. methylation level (%)
    6. confidence interval (95%)

TEST

Test / demonstration data for epiTEome.
- Step 1: Index the reference file
   $ idxEpiTEome.pl -l 85 -gff tair10TEs.gff3 -t subteid.lst -fasta Chr2.fasta

- Step 2: Run the epiTEome analysis
   $ epiTEome.pl -gff tair10TEs.gff3 -ref Chr2.epiTEome.masked.fasta -un unmapped.fastq -t teid.lst

- Output: epiTEome automatically generates four files (unmapped.newInsertionSite.tab, unmapped.newInsertionSite.sam, unmapped.met.meta.tab and unmapped.met.row.tab) in the $CWD. To check whether epiTEome ran successfully, compare these files to the reference output files in the test folder (refOutput_*).
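The comparison against the reference output can be scripted. The Python sketch below is not part of epiTEome: the function name is hypothetical, and the caller supplies both paths because the exact names of the refOutput_* files are not listed here. It treats two files as equivalent when they contain the same lines regardless of order, which is a deliberate simplification.

```python
def same_records(path_a, path_b, ignore_prefix=None):
    """Return True if both files hold the same lines, ignoring line order.

    ignore_prefix: optional string; lines starting with it are skipped
    (for example "@" to skip SAM header lines).
    """
    def load(path):
        with open(path) as handle:
            lines = [line.rstrip("\n") for line in handle]
        if ignore_prefix is not None:
            lines = [l for l in lines if not l.startswith(ignore_prefix)]
        return sorted(lines)

    return load(path_a) == load(path_b)
```

Ignoring line order avoids false alarms if records are written in a different order between runs. For a strict byte-for-byte check, `filecmp.cmp(path_a, path_b, shallow=False)` from the standard library can be used instead.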

epiteome's People

Contributors

jdaron

epiteome's Issues

Using paired-end reads

I believe epiTEome does not support paired-end reads. Do you have any suggestions how paired-end reads could be supported? It seems this could significantly boost the power for detecting insertions.

Can't locate object method "qstring" via package "Bio::DB::Bam::AlignWrapper" at epiTEome.pl line 2275.

Dear Professor,
I encountered an error when running the test program in step 2. I hope you can give me some advice. My operating system is Ubuntu 16.04.
The error is:
"yangqing@yangqing-Lenovo-G480:~/epiTEome-master/test$ perl epiTEome.pl -gff tair10TEs.gff3 -ref Chr2.epiTEome.masked.fasta -un unmapped.fastq -t teid.lst
Possible precedence issue with control flow operator at epiTEome.pl line 482.
INFO epiTEome.pl Sun May 20 09:29:55 2018 Start program!
INFO epiTEome.pl Sun May 20 09:29:55 2018 Run Module: readGffFile!

INFO epiTEome.pl Sun May 20 09:29:58 2018 STEP 1: read ends mapping.
INFO epiTEome.pl: --> Skiping Step 1: unmapped.step1.sort.bam allready exist.
INFO epiTEome.pl Sun May 20 09:29:58 2018 Run Module: splitFastq!

INFO epiTEome.pl Sun May 20 09:29:58 2018 STEP 2: split-reads mapping.

INFO epiTEome.pl Sun May 20 09:29:58 2018 BATCH #1.
INFO epiTEome.pl Sun May 20 09:29:58 2018 Run Module: selectPairedEndReads !
AT2TE41090 284
INFO epiTEome.pl Sun May 20 09:29:58 2018 Run Module: pairedEndGrab !
/home/yangqing/miniconda2/bin/bamutils filter unmapped.step1.sort.bam unmapped.matePairedEnd.bam -whitelist unmapped.fishReads.list
Done! (0:00)
208 kept
775 failed
/home/yangqing/miniconda2/bin/samtools index unmapped.matePairedEnd.bam
INFO epiTEome.pl Sun May 20 09:29:58 2018 Run Module: checkPairedEndPosition !
/home/yangqing/miniconda2/bin/fastqutils filter -whitelist unmapped.step1.list unmapped.fastq > unmapped.step1.fastq
112 reads in whitelist
Done! (0:00)
Criteria Kept Altered Removed
FASTQReader 121 0 0
WhitelistFilter 112 0 9
INFO epiTEome.pl Sun May 20 09:29:58 2018 Run Module: splitFastq!
INFO epiTEome.pl Sun May 20 09:29:58 2018 Run Module: Segemehl mapping !
/usr/bin/segemehl.x --silent -t 1 -D 1 -E 5 -i Chr2.epiTEome.masked.fasta.ctidx -j Chr2.epiTEome.masked.fasta.gaidx -d Chr2.epiTEome.masked.fasta -q unmapped.step1.sr.fastq -o unmapped.step2.sam -F 1 2>log
/home/yangqing/miniconda2/bin/samtools view -b unmapped.step2.sam | /home/yangqing/miniconda2/bin/samtools sort - -o unmapped.step2.sort.bam
/home/yangqing/miniconda2/bin/samtools index unmapped.step2.sort.bam
INFO epiTEome.pl Sun May 20 09:30:02 2018 Run Module: selectSplitReads !
INFO epiTEome.pl Sun May 20 09:30:02 2018 Run Module: splitReadsGrab !
/home/yangqing/miniconda2/bin/bamutils filter unmapped.step2.sort.bam unmapped.mateSplitReads.bam -whitelist unmapped.fishReads.list
Done! (0:00)
759 kept
3411 failed
/home/yangqing/miniconda2/bin/samtools index unmapped.mateSplitReads.bam
INFO epiTEome.pl Sun May 20 09:30:03 2018 Run Module: groupOverlapingmNeoReads, 103 to process.
INFO epiTEome.pl Sun May 20 09:30:03 2018 Run Module: neoInsertionFinder !
INFO epiTEome.pl: Cleaning file: /bin/rm unmapped.fishReads.list unmapped.matePairedEnd.bam unmapped.matePairedEnd.bam.bai unmapped.mateSplitReads.bam unmapped.mateSplitReads.bam.bai unmapped.pe.fastq unmapped.step1.fastq unmapped.step1.list unmapped.step1.sr.fastq unmapped.step2.sort.bam unmapped.step2.sort.bam.bai
Can't locate object method "qstring" via package "Bio::DB::Bam::AlignWrapper" at epiTEome.pl line 2275."
thank you very much!

Possible precedence issue with control flow operator at epiTEome.pl line 482.

root@sc-Latitude-E6230:/usr/bin/epiTEome# idxEpiTEome.pl -l 85 -gff tair10TEs.gff3 -t teid.lst -fasta Chr2.fasta
idxEpiTEome.pl: command not found
(base) root@sc-Latitude-E6230:/usr/bin/epiTEome# perl idxEpiTEome.pl -l 85 -gff test/tair10TEs.gff3 -t test/teid.lst -fasta test/Chr2.fasta
INFO idxEpiTEome.pl: All programs have been found succesfully.
INFO idxEpiTEome.pl: Run Module: Tue Jun 4 14:32:52 2019 readGffFile !
INFO idxEpiTEome.pl: Reading GFF file ... done !
INFO idxEpiTEome.pl: Run Module: Tue Jun 4 14:32:54 2019 maskFastaIndex !
/root/anaconda3/bin/maskFastaFromBed -fi test/Chr2.fasta -bed /usr/bin/epiTEome/prepRefSeq.bed -fo test/Chr2.epiTEome.masked.fasta
/root/anaconda3/bin/segemehl.x --silent -x test/Chr2.epiTEome.masked.fasta.ctidx -y test/Chr2.epiTEome.masked.fasta.gaidx -d test/Chr2.epiTEome.masked.fasta -F 1 2> log
All files have been created succesfully !

root@sc-Latitude-E6230:/usr/bin/epiTEome# perl epiTEome.pl -gff test/tair10TEs.gff3 -ref test/Chr2.epiTEome.masked.fasta -un test/unmapped.fastq -t test/teid.lst
Possible precedence issue with control flow operator at epiTEome.pl line 482.
INFO epiTEome.pl Tue Jun 4 14:33:55 2019 Start program!
DIE epiTEome.pl: Could not find segemehl index file .ctidx

HELP: epiTEome.pl

AUTHOR: Josquin DARON, Slotkin Lab, Ohio State University
VERSION: 1.0 -- 2017-02-01

PURPOSE: Identify non-reference TE insertion sites and their methylation level.

USAGE: epiTEome.pl —l [max read length] -gff -t —ref -un

    <gff>     TE annotation should be given in gff3 format.
              For TE annotated features, column 9 should have following list of tags:
              ID (teid), sF (superfamily name), fam (family name).
              For LTR annotated features, they should be referred as LTR5 or LTR3 in column 3,
              column 9 should have tags Parent (teid).
        
    <target>  list of TEid of interest

    <fasta>   FASTA formated (.fa, .fna or fasta) genome file.

    <fastq>   FASTQ file of reads that failed to map the reference genome (unmapped reads)

OPTIONS
  EpiTEome Specific Options:
    -chop [integer] : read ends length of chopped (defaut 25,30,40).
                      Use of several length will improve epiTEome sensitivity. 
    -b    [integer] : number of TE per batch (defaut 5000).
    -w    [integer] : window size for methylation metaplot analysis.

  Alignment Options:
    -E    [integer] : segemehl max evalue (default:5)
    -p    [integer] : number of threads use in segemehl (defaut 1).
                      All other portions are single-threaded.

OUTPUT
  epiTEome output 4 different files such as .newInsertionSite.tab, .newInsertionSite.sam, .met.meta.tab and .met.row.ta

Any help would be much appreciated.

Thank you!
Regards

CS791

Can't create .newInsertionSite.tab. at epiTEome.pl line 2209.

Hi jdaron, I have run into the following problem:
INFO epiTEome.pl Mon Jan 22 12:50:38 2018 BATCH #675.
INFO epiTEome.pl: --> Skiping Step 2: LZH369L_unmap_ambiguous_reads.step2.sort.bam allready exist.
Use of uninitialized value $out in scalar chomp at /home/sunyb/biosoft/epiTEome-master/epiTEome.pl line 283.
Use of uninitialized value $out in numeric eq (==) at /home/sunyb/biosoft/epiTEome-master/epiTEome.pl line 284.
INFO epiTEome.pl: LZH369L_unmap_ambiguous_reads.step2.sort.bam header troncated.
INFO epiTEome.pl: Cleaning file: /usr/bin/rm LZH369L_unmap_ambiguous_reads.fishReads.list LZH369L_unmap_ambiguous_reads.matePairedEnd.bam LZH369L_unmap_ambiguous_reads.matePairedEnd.bam.bai LZH369L_unmap_ambiguous_reads.mateSplitReads.bam LZH369L_unmap_ambiguous_reads.mateSplitReads.bam.bai LZH369L_unmap_ambiguous_reads.step1.fastq LZH369L_unmap_ambiguous_reads.step1.list LZH369L_unmap_ambiguous_reads.step1.sr.fastq LZH369L_unmap_ambiguous_reads.step2.sort.bam LZH369L_unmap_ambiguous_reads.step2.sort.bam.bai
Use of uninitialized value $dir in concatenation (.) or string at /home/sunyb/biosoft/epiTEome-master/epiTEome.pl line 2208.
Can't create /LZH369L_unmap_ambiguous_reads.newInsertionSite.tab. at /home/sunyb/biosoft/epiTEome-master/epiTEome.pl line 2209.
I don't know what causes this problem or how to solve it.
This result is very important for my paper, and I have been waiting for it for two weeks.

Not able to run it on Linux/Ubuntu

Hello,
Is there an installation procedure for Ubuntu users as well? I tried the recommended procedure, but it is not working.
It would be nice if you could help me.

How to handle read1 and read2 of unmapped fq

I have used Bismark to get the unmapped read1 (_1.fq.gz) and read2 (_2.fq.gz) files. However, the epiTEome -un option cannot take the read1 and read2 fq files at the same time. How can this be solved?
Thanks!

which samtools version should be used ?

In your installation instructions, you write:

samtools (version 1.3.1 or higher)

But you also write that

Bio::DB::Sam

is required.

Samtools 1.3.1 and higher neither install bam.h nor libbam.a, but these files are required by Bio::DB::Sam. Most people install Bio::DB::Sam only with samtools releases 0.1.x (the old versions that still come with bam.h and libbam.a).

This is very confusing.

Help needed to install epiTEome in Ubuntu

Hi,

I am planning to run epiTEome on my PE bisulphite sequencing data. I am using an Ubuntu 14.04 LTS workstation with perl 5.18.2 in my PATH. It seems that all the perl library modules are installed. The other dependencies are samtools 1.9 (in PATH, using htslib 1.9), bedtools v2.28.0 (in PATH), ngsutils 0.5.9 (in PATH after running source venv/bin/activate) and segemehl v0.3.4 (which I was unable to add to my PATH). I have downloaded epiTEome-master. When I try to run:

icar@icar-crijaf:~$ source venv/bin/activate
(venv) icar@icar-crijaf:~$ cd /home/icar/Programs/epiTEome-master
(venv) icar@icar-crijaf:~/Programs/epiTEome-master$ export PATH=$PATH:/home/icar/Programs/segemehl-0.3.4
(venv) icar@icar-crijaf:~/Programs/epiTEome-master$ perl epiTEome.pl

This produced the following error:
Can't locate Bio/DB/Sam.pm in @INC (you may need to install the Bio::DB::Sam module) (@INC contains: /lib/bioperl-1.2.3 /lib/perl_modules /home/icar/perl5/lib/perl5/5.18.2/x86_64-linux-gnu-thread-multi /home/icar/perl5/lib/perl5/5.18.2 /home/icar/perl5/lib/perl5/x86_64-linux-gnu-thread-multi /home/icar/perl5/lib/perl5 /etc/perl /usr/local/lib/perl/5.18.2 /usr/local/share/perl/5.18.2 /usr/lib/perl5 /usr/share/perl5 /usr/lib/perl/5.18 /usr/share/perl/5.18 /home/icar/perl5/lib/perl5/5.18.1 /usr/local/lib/site_perl .) at epiTEome.pl line 10.
BEGIN failed--compilation aborted at epiTEome.pl line 10.

It seems there is a problem with the installation of 'Bio::DB::Sam'.
So I tried to install it by typing: perl -MCPAN -e 'install Bio::DB::Sam'

Reading '/home/icar/.cpan/Metadata'
Database was generated on Wed, 18 Aug 2021 00:17:02 GMT
Fetching with LWP:
http://mirror.ox.ac.uk/sites/www.cpan.org/authors/01mailrc.txt.gz
Reading '/home/icar/.cpan/sources/authors/01mailrc.txt.gz'
............................................................................DONE
Fetching with LWP:
http://mirror.ox.ac.uk/sites/www.cpan.org/modules/02packages.details.txt.gz
Reading '/home/icar/.cpan/sources/modules/02packages.details.txt.gz'
Database was generated on Thu, 19 Aug 2021 00:17:03 GMT
.............
New CPAN.pm version (v2.28) available.
[Currently running version is v2.00]
You might want to try
install CPAN
reload cpan
to both upgrade CPAN.pm and run the new version without leaving
the current session.

...............................................................DONE
Fetching with LWP:
http://mirror.ox.ac.uk/sites/www.cpan.org/modules/03modlist.data.gz
Reading '/home/icar/.cpan/sources/modules/03modlist.data.gz'
DONE
Writing /home/icar/.cpan/Metadata
Running install for module 'Bio::DB::Sam'
Running make for L/LD/LDS/Bio-SamTools-1.43.tar.gz
Checksum for /home/icar/.cpan/sources/authors/id/L/LD/LDS/Bio-SamTools-1.43.tar.gz ok

CPAN.pm: Building L/LD/LDS/Bio-SamTools-1.43.tar.gz

This module requires samtools 0.1.10 or higher (samtools.sourceforge.net).
Please enter the location of the bam.h and compiled libbam.a files:

I provided the location as "/home/icar/Programs/samtools-1.9".

The following error was produced, which I could not resolve.

Found /home/icar/Programs/samtools-1.9/bam.h and /home/icar/Programs/samtools-1.9/libbam.a.
Created MYMETA.yml and MYMETA.json
Creating new 'Build' script for 'Bio-SamTools' version '1.43'
Building Bio-SamTools
cc -I/home/icar/Programs/samtools-1.9 -Ic_bin -I/usr/lib/perl/5.18/CORE -fPIC -D_IOLIB=2 -D_FILE_OFFSET_BITS=64 -Wformat=0 -c -D_REENTRANT -D_GNU_SOURCE -DDEBIAN -fstack-protector -fno-strict-aliasing -pipe -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -O2 -g -o c_bin/bam2bedgraph.o c_bin/bam2bedgraph.c
cc -I/home/icar/Programs/samtools-1.9 -Ic_bin -I/usr/lib/perl/5.18/CORE -DXS_VERSION="1.43" -DVERSION="1.43" -fPIC -D_IOLIB=2 -D_FILE_OFFSET_BITS=64 -Wformat=0 -c -D_REENTRANT -D_GNU_SOURCE -DDEBIAN -fstack-protector -fno-strict-aliasing -pipe -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -O2 -g -o lib/Bio/DB/Sam.o lib/Bio/DB/Sam.c
lib/Bio/DB/Sam.xs:29:19: fatal error: khash.h: No such file or directory
#include "khash.h"
^
compilation terminated.
error building lib/Bio/DB/Sam.o from 'lib/Bio/DB/Sam.c' at /usr/share/perl/5.18/ExtUtils/CBuilder/Base.pm line 177.
LDS/Bio-SamTools-1.43.tar.gz
./Build -- NOT OK
Running Build test
Can't test without successful make
Running Build install
Make had returned bad status, install seems impossible

It seems that somewhere there is a file, lib/Bio/DB/Sam.xs, that needs khash.h through #include "khash.h". I am unable to find that file. I am not sure whether these errors are caused by faulty dependencies or by some other issue. I am completely clueless.

Can you please help me install a working copy of epiTEome on my system? I am not an expert in bioinformatics, so I may need detailed advice.

Thanks in advance

~DipSaha

file: libs/merge.c, line: 553: Multiple alignments for read 100257_Chr2:8770108-8770192_30/1 with same HI tag value found. Exit forced

Dear Professor,
I encountered an error when running the test program in step 2. I hope you can give me some advice.
The command is:
perl ../../epiTEome.pl -gff ../tair10TEs.gff3 -ref ../Chr2.epiTEome.masked.fasta -un ../unmapped.fastq.bz2 -t ../te.list
and the error is:
[SEGEMEHL] file: libs/merge.c, line: 553: Multiple alignments for read 100257_Chr2:8770108-8770192_30/1 with same HI tag value found. Exit forced.
