
Comments (9)

jebrosen commented on August 11, 2024

> I know how to configure RepeatMasker to merge the RepBase library and then export LIBDIR PATH to that Library before running RepeatModeler, though I don't know if I should do anything else to configure RepeatModeler?

RepeatModeler does not recognize a LIBDIR parameter. You could point REPEATMASKER_DIR to an alternative RepeatMasker installation with customized contents in Libraries/, though.

> Also, I am wondering what the command is for adding RepBase to the RepeatMasker library inside tetools-1.1, because it doesn't recognize addRepBase.pl. Should I pull the library out of the container (cp -r /opt/RepeatMasker/Libraries/ ./) and then try ./configure inside the shell?

RepeatMasker 4.1.0 did not have a script to do that step separately; it was all through configure. RM 4.1.0 does support -libdir/LIBDIR, so it should be possible to install RepBase RepeatMasker Edition to Libraries/ in a non-container installation and configure it, then run RepeatMasker in the container with that directory as a custom LIBDIR.

from tetools.

jebrosen commented on August 11, 2024

RepeatMasker 4.1.2 is released, fixing the problem with classifications in RepeatMasker 4.1.1 that probably caused most of these differences. This version of RepeatMasker is included in the latest release of the Dfam TE Tools container, 1.3! (CHANGELOG for version 1.3)

I think nearly all the questions in this thread have been answered, so I'm closing the issue. Please re-open this or a new issue if you have further problems or questions!


jebrosen commented on August 11, 2024

This is most likely caused by these two issues:

  1. Genome sampling (likely small impact)

RepeatModeler uses a randomized sampling approach, so two runs on the same genome even from the same version of RepeatModeler will produce different output depending on the size and abundance of different elements discovered in different orders between runs. It looks like you used -LTRStruct in one run, which will also affect this. The -srand parameter is available if you need to reproduce prior runs, using the "seed number" reported in the RepeatModeler log output.

Even so, the overall composition of the library should not be too different, especially since this genome is small enough to be completely analyzed over the course of all 6 rounds of RepeatModeler.

  2. Increase in "Unclassified" sequences (likely large impact)

This is due to the following bug in RepeatMasker, which has a larger impact than I first realized in some genomes:

We recently identified a bug in RepeatMasker 4.1.1 which affects classifications from RepeatClassifier that are based on similarity to TEs. (Classifications based on homology with known protein sequences are unaffected.) The bug can be rectified by running this command inside the RepeatMasker directory:
./famdb.py -i ./Libraries/RepeatMaskerLib.h5 families --descendants 1 --curated --format fasta_name --include-class-in-name > ./Libraries/RepeatMasker.lib
and then re-running RepeatClassifier on the consensus library (re-running all of RepeatModeler is not necessary).

Re-running RepeatClassifier configured for RepeatMasker 4.1.0 instead of RepeatMasker 4.1.1 should be roughly equivalent to fixing the error as well. This version of RepeatMasker was included in the dfam/tetools:1.1 version of the image.
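
To gauge how much a library is affected, one option is to tally its FASTA headers by the class that follows the "#" in each name. This is only a rough sketch: the function name tally_classes is made up for illustration, and it assumes that unclassified families carry a label such as "Unknown" after the "#", which you should confirm against your own output.

```shell
# Tally FASTA headers by the class that follows '#' in the sequence name.
# Headers without a '#' are counted as "(no class)".
# Assumption: each header looks like ">name#class/subclass optional text".
tally_classes() {
    grep '^>' "$1" \
      | awk '{
            name = $1                      # first token, e.g. ">fam1#LTR/Gypsy"
            n = index(name, "#")           # position of "#", or 0 if absent
            cls = (n > 0) ? substr(name, n + 1) : "(no class)"
            count[cls]++
        }
        END { for (c in count) printf "%d\t%s\n", count[c], c }' \
      | sort -k1,1nr                       # most frequent class first
}
```

Running tally_classes on the classified library from each run (the path depends on your setup) prints one "count, class" line per class, which makes a jump in unclassified families easy to spot.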


> P.P.S. I have added and merged the RepBase library into RepeatMasker_4.1.1 inside the container, but the output table file doesn't indicate it. Is that normal?

This looks like it is simply different/missing information being reported by the different versions of the tools, possibly a bug. Since you are using the -lib option instead of -species, the results should not be affected whether or not RepBase RepeatMasker Edition was installed. RepBase RepeatMasker Edition can, however, influence the classifications produced by RepeatClassifier which are based on similarity to already-known TEs.


xyz0o commented on August 11, 2024

The older version was run by a former colleague on another server. I only have her command lines to compare with mine, so I don't have the parameters to reproduce it.

So is it normal that the RepeatMasker-4.1.0 library (RepeatMasker.lib) is larger than the RepeatMasker-4.1.1 one?

[Screenshots: Capture3, Capture2]

Also, I tried the debugging strategy, and the result is almost the same (even a little less!), so I decided to pull the tetools:1.1 image and run from scratch. But when running RepeatModeler, I receive this error:

  RepeatModeler dependency missing or incorrectly set for TRF_PRGM!
  Rerun ./configure or check your command line to ensure that RepeatModeler
  has access to and the correct version of this dependency.

Running configure from inside the RepeatModeler directory doesn't solve the issue:

Singularity> configure

    The RepeatMasker configure script must be run from
    inside the RepeatMasker installation directory:

       /opt/RepeatMasker

    Perhaps this is not the "configure" you are looking for?

How should I fix this?


jebrosen commented on August 11, 2024

> So is it normal that the RepeatMasker-4.1.0 library (RepeatMasker.lib) is larger than the RepeatMasker-4.1.1 one?

This is mostly due to the bug in question: the smaller file is missing the information that RepeatClassifier needed. It is normal for the size to fluctuate between versions for other reasons, though.

Although it seems strange that you have a RepeatModeler/Libraries directory at all. The Libraries directory should normally be in the RepeatMasker directory.

> Also, I tried the debugging strategy, and the result is almost the same (even a little less!)

Which debugging strategy do you mean? (My two suggestions were to fix the RepeatMasker.lib file in the newer version of RepeatMasker or to use the older version of RepeatMasker.) If you still have them, the exact commands you ran would be helpful in case there was something wrong or missing in my explanation or a typo in one of the commands.

> so I decided to pull the tetools:1.1 image and run from scratch. But when running RepeatModeler, I receive this error:

That's my fault - I forgot that the 1.1 version of the container required TRF to be installed in the host, which makes it much more complicated to use. I can pull up older versions of the scripts and instructions, but one of these other approaches should be easier.


xyz0o commented on August 11, 2024

Sorry for not being clear. Your first suggestion was to rectify the bug in the newer version of RepeatMasker using your command, so I did this:
singularity exec --bind $PWD:$PWD tetools.sif famdb.py -i ./Libraries/RepeatMaskerLib.h5 families --descendants 1 --curated --format fasta_name --include-class-in-name > ./Libraries/RepeatMasker.lib

Regarding the directory name: I am told that the directories in the container are read-only, so I can't really modify them, right? So I have to create a directory on the host and bind it to the same directory inside the container. I named that directory "RepeatModeler" (/work in your example). Then, using cp -r /opt/RepeatMasker/Libraries/ ./, I copied "Libraries" out of "RepeatMasker" and into my newly created "RepeatModeler" directory. I know it seems confusing now, but I didn't really think about the naming when I first did it. So all the changes (the RepBase merge and, later, the debugging command) were made on this library. Am I correct? If I have done it right, then it didn't really work, given the still noticeable size differences and the same result.

Then you suggested re-running it using RepeatMasker 4.1.0, which is included in dfam/tetools:1.1. I did, and the result was an error. I have already installed RepeatMasker-4.1.0 and its dependencies (e.g. TRF) manually.
The reason I wanted to try the container is that RepeatModeler and/or its search engines (rmblast, Recon, RepeatScout) may be of different versions and, who knows, sometimes older versions do work better! For instance, I really expected that using -LTRStruct would recover more LTRs in the new output than in the older one!
So I would appreciate it if you could send me the instructions for making 1.1 work.


xyz0o commented on August 11, 2024

Oh, and I forgot to ask: if I use different search engines with RepeatMasker-4.1.0 (rmblast, hmmer and abblast), is there any syntax for RepeatMasker to merge their results and then remove the redundancies so that I could recover more TEs?


jebrosen commented on August 11, 2024

> So I have to create a directory on the host and bind it to the same directory inside the container. I named that directory "RepeatModeler" (/work in your example). Then, using cp -r /opt/RepeatMasker/Libraries/ ./, I copied "Libraries" out of "RepeatMasker" and into my newly created "RepeatModeler" directory. I know it seems confusing now, but I didn't really think about the naming when I first did it. So all the changes (the RepBase merge and, later, the debugging command) were made on this library. Am I correct? If I have done it right, then it didn't really work, given the still noticeable size differences

Sorry, I answered your question about file sizes as if you had checked it before running the command to generate a corrected file. Since you did, I think it's more likely that the size differences were unrelated (different versions, whether or not you merged RepBase and what version, number of blank spaces or line breaks that happened to be in the file, etc.). To easily tell whether or not the file is impacted by that bug in 4.1.1, you can check the first line of the file. It should read >something#class/subclass instead of simply >something.
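
A minimal sketch of that first-line check, assuming the library is a plain FASTA file whose first line is a header. The function name first_header_has_class is hypothetical and not part of RepeatMasker:

```shell
# Succeed (exit status 0) if the first line of a FASTA library has the form
# ">name#class...", i.e. the name is followed directly by '#' and a class.
first_header_has_class() {
    head -n 1 "$1" | grep -q '^>[^#[:space:]]*#[^[:space:]]'
}
```

For example, `first_header_has_class ./Libraries/RepeatMasker.lib && echo "class labels present"` prints the message only when the first header carries a class label.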

> and having the same result.

Based on what you said so far I am not sure you actually ran with the fixed library. To do this you would need to configure RepeatModeler to use an installation of RepeatMasker that was itself configured to use the alternative Libraries/ directory. This is why I mentioned it would probably be easier to use RepeatMasker 4.1.0 at least for the classification step, since the only difference between 4.1.0 and 4.1.1 that explains the large difference in your results is this bug I mentioned.

> The reason I wanted to try the container is that RepeatModeler and/or its search engines (rmblast, Recon, RepeatScout) may be of different versions and, who knows, sometimes older versions do work better! For instance, I really expected that using -LTRStruct would recover more LTRs in the new output than in the older one!
> So I would appreciate it if you could send me the instructions for making 1.1 work.

Sure. You can use this version of the script to get the necessary --trf_prgm option: https://raw.githubusercontent.com/Dfam-consortium/TETools/1.1/dfam-tetools.sh, and a command line along these lines:

dfam-tetools-1.1.sh --container dfam/tetools:1.1 --trf_prgm=/path/to/trf -- RepeatClassifier ...

> Oh, and I forgot to ask: if I use different search engines with RepeatMasker-4.1.0 (rmblast, hmmer and abblast), is there any syntax for RepeatMasker to merge their results and then remove the redundancies so that I could recover more TEs?

@rmhubley should be able to answer this part much better than I can. HMMER will be harder to tackle since it works on profile hidden Markov models, but RepBase and your custom library both contain only consensus sequences (used by RMBlast and AB-BLAST). We don't provide a script to merge them, and it's a difficult thing due to the sheer number of corner cases to consider when trying to merge output (e.g. whether to pick "longer" or "better" annotations, and how to handle differently overlapping or merged annotations).


There are a lot of steps and versions at play. Here is my understanding of what you are running and where the large differences in output might be coming from, from start to finish. Hopefully it makes a few things clearer:

  1. Configure RepeatMasker (including installing RepBase RepeatMasker Edition if wanted) and RepeatModeler. This is where the bug in RepeatMasker 4.1.1 is, which affects step 3 later.
  2. Run RepeatModeler to get a species-specific library. RepeatModeler 2 introduced the -LTRStruct option, which is one big difference. The version of LTR_retriever may matter a bit, but it doesn't explain most of your discrepancies. Every RepeatModeler run (unless you use -srand) will also yield a slightly different library.
  3. RepeatModeler automatically runs RepeatClassifier, which can also be run separately. This step uses the RepeatMasker.lib file from whichever RepeatMasker directory that RepeatModeler is configured to use. The classifications will be better if you installed RepBase RepeatMasker Edition since it gets included in RepeatMasker.lib. This is the step where many of your elements are incorrectly labeled as "Unclassified", if step 1 was done with RepeatMasker 4.1.1.
  4. Extract a library of ancestral repeats from the RepeatMasker libraries (with or without RepBase RepeatMasker Edition!) via queryRepeatDatabase.pl or famdb.py. The programs involved did change, but this step should produce equivalent files (the same sequences, with the same labels, but maybe in a different ordering).
  5. Combine the ancestral library with the one produced in steps 2+3.
  6. Run RepeatMasker with the -lib option and any search engine. The results of this step should not be affected by any of the problems we've discussed, and I would not expect changing the RepeatMasker version or search engine here to explain much more than a few percent difference.
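
Step 5 can be as simple as concatenating the two FASTA files. The sketch below does that and then checks that no records were lost, which can happen if the first file does not end with a newline. The function name combine_libraries and the file names in the usage note are placeholders, not names produced by any of the tools:

```shell
# Concatenate an ancestral library and a species-specific library, then
# verify that the combined file has as many FASTA records as both inputs.
combine_libraries() {
    ancestral=$1
    species=$2
    out=$3
    cat "$ancestral" "$species" > "$out" || return 1
    expected=$(( $(grep -c '^>' "$ancestral") + $(grep -c '^>' "$species") ))
    actual=$(grep -c '^>' "$out")
    # A mismatch means a header was glued onto the end of a sequence line.
    [ "$expected" -eq "$actual" ]
}
```

Called as `combine_libraries ancestral.fa species.fa combined.fa`, it returns a non-zero status when the record counts disagree, so the combined file can be checked before it is passed to RepeatMasker with -lib.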


xyz0o commented on August 11, 2024
>   1. Configure RepeatMasker (including installing RepBase RepeatMasker Edition if wanted) and RepeatModeler. This is where the bug in RepeatMasker 4.1.1 is, which affects step 3 later.
>   3. RepeatModeler automatically runs RepeatClassifier, which can also be run separately. This step uses the RepeatMasker.lib file from whichever RepeatMasker directory that RepeatModeler is configured to use. The classifications will be better if you installed RepBase RepeatMasker Edition since it gets included in RepeatMasker.lib. This is the step where many of your elements are incorrectly labeled as "Unclassified", if step 1 was done with RepeatMasker 4.1.1.

I know how to configure RepeatMasker to merge the RepBase library and then export LIBDIR PATH to that Library before running RepeatModeler, though I don't know if I should do anything else to configure RepeatModeler?

Also, I am wondering what the command is for adding RepBase to the RepeatMasker library inside tetools-1.1, because it doesn't recognize addRepBase.pl. Should I pull the library out of the container (cp -r /opt/RepeatMasker/Libraries/ ./) and then try ./configure inside the shell?

