Comments (9)
I know how to configure RepeatMasker to merge the RepBase library and then export LIBDIR PATH to that Library before running RepeatModeler, though I don't know if I should do anything else to configure the RepeatModeler?
RepeatModeler does not recognize a LIBDIR parameter. You could point REPEATMASKER_DIR to an alternative RepeatMasker installation with customized contents in Libraries/, though.
Also, I am wondering what the command is for adding RepBase to the RepeatMasker library inside tetools-1.1, because it doesn't recognize addRepBase.pl. Should I pull the library out of the container with cp -r /opt/RepeatMasker/Libraries/ ./ and then try ./configure inside the shell?
RepeatMasker 4.1.0 did not have a script to do that step separately; it was all done through configure. RM 4.1.0 does support -libdir/LIBDIR, so it should be possible to install RepBase RepeatMasker Edition to Libraries/ in a non-container installation and configure it, then run RepeatMasker in the container with that directory as a custom LIBDIR.
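A rough sketch of that last step, wrapping a container run of RepeatMasker with -libdir pointed at a host directory. The image file name, the host path and the function name are assumptions for illustration; only the -libdir option itself comes from the answer above.

```shell
#!/usr/bin/env bash
# Sketch: run the container's RepeatMasker against a Libraries/ directory on the host.
# Assumptions: a Singularity image file (e.g. tetools.sif) and a host directory
# that was already configured by a non-container RepeatMasker 4.1.0 installation.
mask_with_host_libdir() {
    local image=$1 libdir=$2
    shift 2
    # Bind the host directory at the same path inside the container, then
    # point RepeatMasker at it with -libdir; remaining arguments pass through.
    singularity exec --bind "$libdir:$libdir" "$image" \
        RepeatMasker -libdir "$libdir" "$@"
}
```

It could be called as mask_with_host_libdir tetools.sif "$PWD/Libraries" followed by the usual RepeatMasker options and the genome file.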
from tetools.
RepeatMasker 4.1.2 is released, fixing the problem with classifications in RepeatMasker 4.1.1 that probably caused most of these differences. This version of RepeatMasker is included in the latest release of the Dfam TE Tools container, 1.3! (CHANGELOG for version 1.3)
I think nearly all the questions in this thread have been answered, so I'm closing the issue. Please re-open this or a new issue if you have further problems or questions!
from tetools.
This is most likely caused by these two issues:
- Genome sampling (likely small impact)
RepeatModeler uses a randomized sampling approach, so two runs on the same genome, even from the same version of RepeatModeler, will produce different output depending on the size and abundance of different elements discovered in different orders between runs. It looks like you used -LTRStruct in one run, which will also affect this. The -srand parameter is available if you need to reproduce prior runs, using the "seed number" reported in the RepeatModeler log output.
Even so, the overall composition of the library should not be too different, especially since this genome is small enough to be completely analyzed over the course of all 6 rounds of RepeatModeler.
- Increase in "Unclassified" sequences (likely large impact)
This is due to the following bug in RepeatMasker, which has a larger impact than I first realized in some genomes:
We recently identified a bug in RepeatMasker 4.1.1 which affects classifications from RepeatClassifier that are based on similarity to TEs. (Classifications based on homology with known protein sequences are unaffected.) The bug can be rectified by running this command inside the RepeatMasker directory:
./famdb.py -i ./Libraries/RepeatMaskerLib.h5 families --descendants 1 --curated --format fasta_name --include-class-in-name > ./Libraries/RepeatMasker.lib
and then re-running RepeatClassifier on the consensus library (re-running all of RepeatModeler is not necessary).
Re-running RepeatClassifier configured for RepeatMasker 4.1.0 instead of RepeatMasker 4.1.1 should be roughly equivalent to fixing the error as well. This version of RepeatMasker was included in the dfam/tetools:1.1 version of the image.
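The fix-and-reclassify route could be scripted roughly as follows. This is only a sketch: it assumes a writable RepeatMasker 4.1.1 installation (not the read-only copy inside the container), that RepeatClassifier is configured to use that same installation, and that the consensus FASTA from the earlier RepeatModeler run is still available. The function name and the temporary .new file are illustrative choices.

```shell
#!/usr/bin/env bash
# Sketch: regenerate RepeatMasker.lib with class names, then re-run only RepeatClassifier.
# Assumptions: rm_dir is a WRITABLE RepeatMasker 4.1.1 installation and
# consensi is the consensus FASTA produced by an earlier RepeatModeler run.
fix_lib_and_reclassify() {
    local rm_dir=$1 consensi=$2
    # Rebuild the library with the command quoted above, run from inside the
    # RepeatMasker directory. Write to a temporary file first so a failed run
    # does not leave a truncated RepeatMasker.lib behind.
    ( cd "$rm_dir" &&
      ./famdb.py -i ./Libraries/RepeatMaskerLib.h5 families \
          --descendants 1 --curated --format fasta_name --include-class-in-name \
          > ./Libraries/RepeatMasker.lib.new &&
      mv ./Libraries/RepeatMasker.lib.new ./Libraries/RepeatMasker.lib ) || return 1
    # Re-classify the existing consensus library; RepeatModeler itself is not re-run.
    RepeatClassifier -consensi "$consensi"
}
```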
P.P.S. I have added and merged the RepBase library into RepeatMasker 4.1.1 inside the container, but the output table file doesn't indicate it. Is that normal?
This looks like it is simply different/missing information being reported by the different versions of the tools, possibly a bug. Since you are using the -lib option instead of -species, the results should not be affected whether or not RepBase RepeatMasker Edition was installed. RepBase RepeatMasker Edition can, however, influence the classifications produced by RepeatClassifier which are based on similarity to already-known TEs.
from tetools.
The older version was run by a former colleague on another server. I only have her command lines to compare with mine, so I don't have the parameters to reproduce it.
So is it normal that the RepeatMasker-4.1.0 library (RepeatMasker.lib) is larger in size than the RepeatMasker-4.1.1 one?
Also, I tried the debugging strategy and the result is almost the same (even a little less!), so I decided to pull the tetools:1.1 image and run from scratch, but when running RepeatModeler, I receive the error:
RepeatModeler dependency missing or incorrectly set for TRF_PRGM!
Rerun ./configure or check your command line to ensure that RepeatModeler
has access to and the correct version of this dependency.
Running configure from inside the RepeatModeler doesn't solve the issue:
Singularity> configure
The RepeatMasker configure script must be run from
inside the RepeatMasker installation directory:
/opt/RepeatMasker
Perhaps this is not the "configure" you are looking for?
How should I fix this?
from tetools.
So is it normal that the RepeatMasker-4.1.0 library (RepeatMasker.lib) is larger in size than the RepeatMasker-4.1.1 one?
This is mostly due to the bug in question: the smaller file size is due to the missing information that RepeatClassifier needed. It is normal for the size to fluctuate between versions for other reasons, though.
It does seem strange that you have a RepeatModeler/Libraries directory at all, though. The Libraries directory should normally be in the RepeatMasker directory.
also I tried the debugging strategy, the result is almost the same (even a little less!)
Which debugging strategy do you mean? (My two suggestions were to fix the RepeatMasker.lib file in the newer version of RepeatMasker or to use the older version of RepeatMasker.) If you still have them, the exact commands you ran would be helpful in case there was something wrong or missing in my explanation or a typo in one of the commands.
so I decided to pull the tetools:1.1 image and run from scratch, but when running RepeatModeler, I receive the error:
That's my fault - I forgot that the 1.1 version of the container required TRF to be installed in the host, which makes it much more complicated to use. I can pull up older versions of the scripts and instructions, but one of these other approaches should be easier.
from tetools.
Sorry for not being clear. Your first suggestion was to rectify the bug in the newer version of RepeatMasker using your command, so I did this:
singularity exec --bind $PWD:$PWD tetools.sif famdb.py -i ./Libraries/RepeatMaskerLib.h5 families --descendants 1 --curated --format fasta_name --include-class-in-name > ./Libraries/RepeatMasker.lib
Regarding the directory name, I am told that the directories in the container are read-only, so I can't really modify them, right? So I have to create a directory on the host and bind it to the same directory inside the container. I named that directory "RepeatModeler" (in your example: /work), then using the cp -r /opt/RepeatMasker/Libraries/ ./ command I pulled "Libraries" out of "RepeatMasker" and put it in my newly created "RepeatModeler" dir. I know it seems confusing now, but I didn't really think about the naming when I first did it. So all the changes (RepBase merging and later the debugging command) were made to this library. Am I correct? If I have done it right, then it didn't really work, given the still noticeable size differences and the same result.
Then you suggested re-running it using RepeatMasker 4.1.0, which is included in dfam/tetools:1.1. I did, and the result was an error. I have already installed RepeatMasker-4.1.0 and its dependencies, e.g. TRF, manually.
The reason why I wanted to try the container is that maybe the RepeatModeler and/or its search engines (rmblast, Recon, RepeatScout) are of different versions and, who knows, sometimes older versions do work better! For instance, I really expected that using -LTRStruct would recover more LTRs in the new output than in the older one!
So I would appreciate it if you could send me the instructions for making 1.1 work.
from tetools.
Oh, and I forgot to ask: if I use different search tools for RepeatMasker-4.1.0 (rmblast, hmmer and abblast), is there any syntax for RepeatMasker to merge them all together and then remove the redundancies so that I could recover more TEs?
from tetools.
so I have to create a directory in the host and bind it with the same directory inside the container, I named that directory "RepeatModeler" (in your example: /work) then using cp -r /opt/RepeatMasker/Libraries/ ./ command I pulled out the "Libraries" from "RepeatMasker" and put in my newly created "RepeatModeler" dir, I know it seems confusing now, but I didn't really think about the naming when I first did it. so all the changes (RepBase merging and later the debugging command) is done on this library. Am I correct? if I have done it right, then it didn't really work! due to the still noticeable size differences
Sorry, I answered your question about file sizes as if you had checked it before running the command to generate a corrected file. Since you did, I think it's more likely that the size differences were unrelated (different versions, whether or not you merged RepBase and what version, number of blank spaces or line breaks that happened to be in the file, etc.). To easily tell whether or not the file is impacted by that bug in 4.1.1, you can check the first line of the file. It should read >something#class/subclass instead of simply >something.
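That first-line check can be scripted. The following is a minimal sketch using only shell built-ins; the function name is made up for illustration, and it looks only at the first header, exactly as described above.

```shell
#!/usr/bin/env bash
# Sketch: succeed (exit 0) when the first FASTA header in a library has the
# ">name#class/subclass" form, fail (exit 1) when it is a bare ">name".
has_class_in_first_header() {
    local first_line name
    # Read just the first line; return 2 if the file is missing or empty.
    IFS= read -r first_line < "$1" || [ -n "$first_line" ] || return 2
    # Keep only the identifier, dropping any description after whitespace.
    name=${first_line%%[[:space:]]*}
    case "$name" in
        '>'?*'#'?*) return 0 ;;   # class suffix present
        *)          return 1 ;;   # bare name: likely affected by the 4.1.1 bug
    esac
}
```

For example: has_class_in_first_header Libraries/RepeatMasker.lib && echo "class names present".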
and having the same result.
Based on what you said so far I am not sure you actually ran with the fixed library. To do this you would need to configure RepeatModeler to use an installation of RepeatMasker that was itself configured to use the alternative Libraries/ directory. This is why I mentioned it would probably be easier to use RepeatMasker 4.1.0 at least for the classification step, since the only difference between 4.1.0 and 4.1.1 that explains the large difference in your results is this bug I mentioned.
The reason why I wanted to try the container is that maybe the RepeatModeler and/or its searching engines (rmblast, Recon, RepeatScout) are of different versions and who knows, sometimes older versions do work better! for instance I really expected using -LTRStruct would recover more LTRs in the new output than the older one!!
so I would appreciate it if you could send me the instruction of how to make 1.1 work.
Sure. You can use this version of the script to get the necessary --trf_prgm option: https://raw.githubusercontent.com/Dfam-consortium/TETools/1.1/dfam-tetools.sh, and a command line along these lines:
dfam-tetools-1.1.sh --container dfam/tetools:1.1 --trf_prgm=/path/to/trf -- RepeatClassifier ...
oh and I forgot to ask, If I use different blasting tools for RepeatMasker-4.1.0 : rmblast, hmmer and abblast, is there any syntax for RepeatMasker to merge them all together and then remove the redundancies so that I could recover more TEs?
@rmhubley should be able to answer this part much better than I can. HMMER will be harder to tackle since it works on profile hidden Markov models, but RepBase and your custom library both contain only consensus sequences (used by RMBlast and AB-BLAST). We don't provide a script to merge them, and it's a difficult thing to do because of the sheer number of corner cases to consider when trying to merge output (e.g. whether to pick "longer" or "better" annotations, and how to handle differently overlapping or merged annotations).
There are a lot of steps and versions at play. Here is my understanding of what you are running and where the large differences in output might be coming from, from start to finish. Hopefully it makes a few things more clear:
1. Configure RepeatMasker (including installing RepBase RepeatMasker Edition if wanted) and RepeatModeler. This is where the bug in RepeatMasker 4.1.1 is, which affects step 3 later.
2. Run RepeatModeler to get a species-specific library. RepeatModeler 2 introduced the -LTRStruct option, which is one big difference. The version of LTR_retriever may matter a bit, but it doesn't explain most of your discrepancies. Every RepeatModeler run (unless you use -srand) will also yield a slightly different library.
3. RepeatModeler automatically runs RepeatClassifier, which can also be run separately. This step uses the RepeatMasker.lib file from whichever RepeatMasker directory RepeatModeler is configured to use. The classifications will be better if you installed RepBase RepeatMasker Edition since it gets included in RepeatMasker.lib. This is the step where many of your elements are incorrectly labeled as "Unclassified", if step 1 was done with RepeatMasker 4.1.1.
4. Extract a library of ancestral repeats from the RepeatMasker libraries (with or without RepBase RepeatMasker Edition!) via queryRepeatDatabase.pl or famdb.py. The programs involved did change, but this step should produce equivalent files (the same sequences, with the same labels, but maybe in a different ordering).
5. Combine the ancestral library with the one produced in steps 2+3.
6. Run RepeatMasker with the -lib option and any search engine. The results of this step should not be affected by any of the problems we've discussed, and I would not expect changing the RepeatMasker version or search engine here to explain much more than a few percent difference.
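Steps 2 through 6 above could look roughly like this on the command line. Treat it as a sketch under assumptions, not the exact commands used in this thread: the database name mydb, the species term, the file names ancestral.fa and combined.fa, and the use of the famdb.py --ancestors flag for the "ancestral repeats" step are all illustrative choices.

```shell
#!/usr/bin/env bash
# Sketch of steps 2-6. Placeholders: database name "mydb", the species term,
# the output names ancestral.fa / combined.fa, and rm_dir (RepeatMasker directory).
build_library_and_mask() {
    local genome=$1 species=$2 rm_dir=$3
    # Steps 2+3: build a species-specific library. RepeatClassifier runs
    # automatically at the end of RepeatModeler. Adding -srand <seed> here
    # would reproduce the sampling of an earlier run.
    BuildDatabase -name mydb "$genome" || return 1
    RepeatModeler -database mydb -LTRStruct || return 1
    # Step 4: extract ancestral repeats from the RepeatMasker libraries.
    "$rm_dir/famdb.py" -i "$rm_dir/Libraries/RepeatMaskerLib.h5" families \
        --format fasta_name --include-class-in-name --ancestors "$species" \
        > ancestral.fa || return 1
    # Step 5: combine the ancestral and species-specific libraries.
    cat ancestral.fa mydb-families.fa > combined.fa || return 1
    # Step 6: mask the genome with the combined library.
    RepeatMasker -lib combined.fa "$genome"
}
```

Wrapping the steps in a function keeps the order explicit and stops at the first failing step, which makes it easier to see which stage a discrepancy comes from.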
from tetools.
1. Configure RepeatMasker (including installing RepBase RepeatMasker Edition if wanted) and RepeatModeler. This is where the bug in RepeatMasker 4.1.1 is, which affects step 3 later.
3. RepeatModeler automatically runs RepeatClassifier, which can also be run separately. This step uses the RepeatMasker.lib file from whichever RepeatMasker directory RepeatModeler is configured to use. The classifications will be better if you installed RepBase RepeatMasker Edition since it gets included in RepeatMasker.lib. This is the step where many of your elements are incorrectly labeled as "Unclassified", if step 1 was done with RepeatMasker 4.1.1.
I know how to configure RepeatMasker to merge the RepBase library and then export LIBDIR PATH to that Library before running RepeatModeler, though I don't know if I should do anything else to configure the RepeatModeler?
Also, I am wondering what the command is for adding RepBase to the RepeatMasker library inside tetools-1.1, because it doesn't recognize addRepBase.pl. Should I pull the library out of the container with cp -r /opt/RepeatMasker/Libraries/ ./ and then try ./configure inside the shell?
from tetools.