ranwez / macse_v2_pipelines Goto Github PK

This repository provides source code for several pipelines dedicated to the alignment of nucleotide coding sequences that are based on MACSE. These pipelines are mostly bash scripts encapsulated within singularity containers and sometimes combined in nextflow workflows.

Home Page: https://bioweb.supagro.inra.fr/macse/

Shell 57.21% Nextflow 22.06% Perl 19.51% HTML 0.39% Python 0.84%

bioinformatics-pipeline barcoding sequence-alignment sequence-alignments singularity-container nextflow

macse_v2_pipelines's Introduction

MACSE pipelines

This repository provides source code for several pipelines dedicated to the alignment of nucleotide coding sequences that are based on MACSE. These pipelines are mostly bash scripts encapsulated within Singularity containers and sometimes combined into NextFlow workflows.

Pipeline overview

pipelines to align CDS/exons

alfix: this pipeline uses MACSE and HmmCleaner to produce a high quality alignment of nucleotide (NT) coding sequences using their amino acid (AA) translations. It is well suited for datasets containing a few dozen of sequences of a few Kb.
OMM_MACSE: this pipeline also produces a codon-aware alignment thanks to MACSE, which could be filtered by HmmCleaner, but it can handle larger datasets by relying on MAFFT, MUSCLE or PRANK to scale up.

These two pipelines are described in our MACSE tutorial paper [ranwez et al. 2020]

pipelines dedicated to barcoding

macse_barcode this nextflow pipeline allows to aligns hundred of thousands of barcoding sequences
build_ref_align this nextflow pipeline identifies a small subset of sequences that are representative of the diversity of the barcoding input sequence dataset
enrich_align this nextflow pipeline aligns barcoding sequences based on a reference alignment.
representative_sequences A bash script that identifies a small subset of sequences that are representative of the diversity of the barcoding input sequence dataset and is chained with OMM_MACSE in the build_ref_align workflow.

These pipelines are detailed in our book chapter dedicated to MACSE and barcoding datasets [Delscuc & Ranwez, 2020]. While using macse_barcode is the easiest solution, chaining build_ref_align and enrich_align allows to check the quality of the proposed reference alignment and to manually curate it, if needed, before using it to align the barcode sequences.

We used the macse_barcode pipeline to align COI, matK and rbcL sequences for numerous taxonomic groups, all resulting alignments are available here.

MACSE overview

MACSE: Multiple Alignment of Coding SEquences Accounting for Frameshifts and Stop Codons.

A wide range of molecular analyses relies on multiple sequence alignments (MSA). Until now the most efficient solution to align nucleotide (NT) sequences containing open reading frames was to use indirect procedures that align amino acid (AA) translation before reporting the inferred gap positions at the codon level. There are two important pitfalls with this approach. Firstly, any premature stop codon impedes using such a strategy. Secondly, each sequence is translated with the same reading frame from beginning to end, so that the presence of a single additional nucleotide leads to both aberrant translation and alignment.

MACSE [Ranwez et al, 2011] aligns coding NT sequences with respect to their AA translation while allowing NT sequences to contain multiple frameshifts and/or stop codons. MACSE is hence the first automatic solution to align protein-coding gene datasets containing non-functional sequences (pseudogenes) without disrupting the underlying codon structure. It has also proved useful in detecting undocumented frameshifts in public database sequences and in aligning next-generation sequencing reads/contigs against a reference coding sequence.

For further details about the underlying algorithm see the original publication: MACSE: Multiple Alignment of Coding SEquences accounting for frameshifts and stop codons. Vincent Ranwez, Sébastien Harispe, Frédéric Delsuc, Emmanuel JP Douzery PLoS One 2011, 6(9): e22594.

More information (including documentations and tutorials) are available on the MACSE website

Singularity overview

A singularity container [Kurtzer, 2017] contains everything that is needed to execute a specific task. The person building the container has to handle dependencies and environment configuration so that the end-user do not need to bother. The file specifying the construction of the container is a simple text file called a recipe (we provide the recipe of our container as well as the containers). As our scripts/pipelines often relies on several other scripts and external tools (e.g. MAFFT) singularity container is very handy as the end user just need to install singularity and download the container without having to care for installing dependencies or setting environment variables.

A brief introduction to singularity is available here.

If you got an error message stating that your input file does not exist it is probably related to the fact that the folder containing them is not visible from the singularity container. A solution found by one user is to use the SINGULARITY_BINDPATH variable:

export SINGULARITY_BINDPATH="/path/to/fasta"

Nextflow overview

Nextflow [Di Tommaso, 2017] enables scalable and reproducible scientific workflowsusing software containers allowing the adaptation of pipelines written in the most commonscripting languages.

Nextflow separates the workflow itself from the directive regarding the correct way to execute it in the environment. One key advantage of Nextflow is that, by changing slightly the “nextflow.config” file, the same workflow will be parallelized and launched to exploit the full resources of a high performance computing (HPC) cluster.

References

Di Tommaso, P., Chatzou, M., Floden, E. W., Barja, P. P., Palumbo, E., and Notredame, C.(2017). Nextflow enables reproducible computational workflows. Nature Biotechnology,35(4):316–319. Nextflow web site

Kurtzer, G. M., Sochat, V., and Bauer, M. W. (2017). Singularity: Scientific containers formobility of compute. PloS One, 12(5):e0177459. singularity web site

MACSE: Multiple Alignment of Coding SEquences accounting for frameshifts and stop codons. Vincent Ranwez, Sébastien Harispe, Frédéric Delsuc, Emmanuel JP Douzery PLoS One 2011, 6(9): e22594. MACSE web site

Aligning protein-coding nucleotide sequences with MACSE. V. Ranwez, N. Chantret, F Delsuc. To appear in Methods in Molecular Biology (2020).

Accurate alignment of (meta)barcoding datasets using MACSE. Frédéric Delsuc and Vincent Ranwez (2020). In Scornavacca, C., Delsuc, F., and Galtier, N., editors, Phylogenetics in the Genomic Era, chapter No. 2.3, pp. 2.3:1–2.3:30. No commercial publisher | Authors open access book. The book is freely available at https://hal.inria.fr/PGE.

macse_v2_pipelines's People

Contributors

Stargazers

Watchers

Forkers

pythseq rubimeza hj1994412 wangdi2014 cnyuanh yuzhenpeng simonll planetypus nylander

macse_v2_pipelines's Issues

Can't seem to find the .sif file

I am a total newbie to singularity, but I was trying to follow the instructions in the 2020 paper, and it refers to using .sif file with singularity (for example, singularity run ./container.sif) but none of the files are .sif. Am I missing something? The .def files have different format for singularity. Thanks for any clarification.

Minimum number of threads needed per MACSE job

Hello @ranwez ,

I have been attempting to scale up MACSE's use in a high performance computing environment, but recently ran into a few cases that threw a few different errors I believe are related. I only generate these errors if I try to incorporate MACSE as a function, then export that MACSE function with GNU Parallel. My goal is to run as many parallel MACSE operations as the computing resources allow.

I was curious if the minimum number of threads required for MACSE to run is 1, 2, or some other value. It appears that the Java garbage collector requires a dedicated thread in addition to the thread running MACSE, so my guess is the minimum number of threads per MACSE job is at least 2. Thus, if my machine has 32 cores, perhaps MACSE can run no more than 16 parallel tasks? Or, if MACSE requires 4 threads per job, then I shouldn't specify more than 8 parallel tasks on a machine with 32 cores.

Thank you for any information you can provide

specifying different mafft options in OMM_MACSE

Is there a way to provide different mafft options (or MUSCLE, or any of the other aligners) in OMM_MACSE besides the one it uses? For example, if I wanted to include --adjustdirection.

handling of degenerate nucleotides in amino acid and nucleotide output files

In the second section of the documentation regarding genetic codes, it is clear that when a degenerate nucleotide character would encode the same amino acid character, MACSE will by default translate this into the expected amino acid residue:

When a codon contains a single ambiguous nucleotide (e.g. N, R, Y) and all its possible translation lead to the same amino acid, MACSE uses this unambiguous translation despite the nucleotide ambiguity

My error concerns a related, but different concern, in which certain, but not all degenerate nucleotide characters are modified in the output nucleotide alignment file. For example, consider a file (called in.fna) containing just two-sample input of DNA characters, in which the second sample has a series of degenerate bases in certain codons:

>Sample1
ATGGCCCAAATCTCCTAA
>Sample2
ATGGCKSARATHTCRTAA

then run MACSE's alignSequences program:

java -jar macse.jar -prog alignSequences -seq in.fna -out_AA out.faa -out_NT out.fna

In Sample2, we have four different codons with one or more degenerate bases to consider:

second codon: GCK. This is a 4-way degenerate site for Alanine, thus if staying true to MACSE's intentions as noted in their documentation, this should be translated into Alanine.
third codon: SAR. Unlike the second codon, this could be translated into either Gln or Glu, thus I would expect MACSE to translate this as "X", because there is more than one amino acid possible among the four possible ways to create this codon (CAA, CAG; GAA, GAG).
fourth codon: ATH. This is a 3-way degenerate site, all encoding the same amino acid - Isoleucine. I would expect MACSE to translate this codon.
fifth codon: TCR. Similar to the second codon, this is a 4-way degenerate base for Serine. In fact, this codon is the codon used in the MACSE documentation as an example of how a degenerate base might be translated.

Here is the translated amino acid output alignment (file out.faa):

>Sample1
MAQIS*
>Sample2
MAXXS*

I find that:

The second (GCK) and fifth (TCR) codons are translated as expected, in that they retain the expected amino acid despite having a degenerate base.
The third amino acid is given an ambiguous amino acid character (X) as expected, because there were multiple possible amino acids possible given the degenerate nucleotide input characters.
I'm unclear why the fourth codon, (ATH) is not translated to the expected amino acid I. The degenerate H nucleotide character should produce an unambiguous codon that translates to isoleucine, but this does not appear to be the case.

Question 1
I would like to understand why ATH is not translated to I in this amino acid output alignment.

Next, if I review the aligned nucleotide output alignment (file out.fna), I notice that some, but not all of the degenerate nucleotides are masked. This was unexpected to me, because I could not find any documentation in MACSE that noted that nucleotide character states may change between input and output files (while understanding that additional characters like gaps and frameshifts may be introduced):

>Sample1
ATGGCCCAAATCTCCTAA
>Sample2
ATGGCNNARATNTCRTAA

To compare the input and output nucleotide sequences for Sample2 for easier viewing, I've lined up the input and output sequences and denoted with a | symbol where the input degenerate codon was swapped for an ambiguous nucleotide character, N:

                 .M..A..X..I..S..*.
Sample2-input:   ATGGCKSARATHTCRTAA
                      ||    |
Sample2-output:  ATGGCNNARATNTCRTAA

Question(s) 2
2a. I would have thought that no degenerate codons would be replaced by N in the output nucleotide alignment. Is there documentation stating that this is the expected behavior?
2b. I would have predicted that, if degenerate bases were masked, then ALL degenerate bases would be swapped. But this is clearly not the case. Is there a rule that can predict when this would or would not occur?
2c. Given the fact that the second codon is currently translated (as Alanine) with MACSE, I would have thought that the degenerate bases in that codon would not be masked. However, it appears that the K in GCK is indeed masked, despite that codon being properly translated. Is this expected?
2d. Given the fact that the third codon is NOT unambiguously translated, I would have thought that the degenerate bases may be masked because the aligned amino acid for that codon is X. However, this is only true for one, but not both, of the degenerate codons in this position (the S, but not R, in SAR. Is this the expected behavior?

No FS option

Hi,

My only aim is having easy codon alignment without back-translation. And the alignment output file I need should have gaps in triplets only (which will be the input for another tool). Is it possible to align sequences without "!" insertion for frameshifts? If I replace ! with - this will not be useful for me since I can only have gaps in triplets.

I can insert three - per frameshift site, or delete ! . But then, alignment lengths will not be equal, and this is not favorable as well. If there would be a -noFS option which would do codon alignment without detecting frameshift sites, this would solve the problem for me.

Any suggestions?

Thanks a million!

file does not exist

I am having another issue with running OMM_MACSE in a Singularity container although I think it is related to the get_in_file_param() function.

When I try to run OMM_MACSE, I get the error that all.fa does not exist:

[aguang@login004 msa]$ singularity run omm_macse_v10.02.sif --in_seq_file all.fa  --out_dir omm_macse --out_file_prefix all_macse --java_mem 4000m


Problem with option --in_seq_file, File all.fa does not exist

This script aligns sequences using: 1) MACSE pre-filtering, 2) MACSE alignment to find frameshifts, 3) MAFFT for aligning the resulting AA sequences, and 4) HMMcleaner for cleaning resulting alignments.

your command line is incorrect please check your options
 usage example:
S_OMM_MACSE_V10.02.sh --out_dir out_dir --out_file_prefix PREFIX --in_seq_file seq_file.fasta [--genetic_code_number code_number] [--alignAA_soft MAFFT/MUSCLE/PRANK] ][--aligner_extra_option "--localpair --maxiterate 1000"] [--min_percent_NT_at_ends 0.7] [--out_detail_dir SAVE_DETAILS/] [--in_seq_lr_file less_reliable_seq_file.fasta] [--java_mem 500m] [--no_prefiltering] [--no_FS_detection] [--no_filtering] [--no_postfiltering] [--min_seqToKeepSite] [--replace_FS_by_gaps] [--save_details] [--debug]


For further details please check the documentation on MACSE website: https://bioweb.supagro.inra.fr/macse

However it very clearly does:

[aguang@login004 msa]$ ls
all.fa  omm_macse_v10.02.sif

I see that it is for some reason triggering the error message in get_in_file_param(), however I cannot figure out why, since the file is clearly there.

which program is parameter -alphabet_AA applicable?

Hello,
Could you help clarify whether or not the -alphabet_AA parameter is intended for us in the alignSequences program (or not)?

If I run the following command, the program executes successfully:

java -jar macse.jar -prog alignSequences -out_AA {aa.out} -out_NT {nt.out} -seq {input.fa}

Likewise, if I run the translateNT2AA program, and add in the -alphabet_AA parameter with a non-default option, the program completes without error:

java -jar macse.jar -prog translateNT2AA -out_AA {aa.translate.out} -seq {input.fa} -alphabet_AA Dayhoff_6

However, if I run the alignSequences program and add the -alphabet_AA parameter, I get an error

java -jar macse.jar -prog alignSequences -out_AA {aa.out} -out_NT {nt.out} -seq {input.fa} -alphabet_AA Dayhoff_6

This error is repeated for different input .fa sequences, and different alphabete_AA options:

java.lang.ArrayIndexOutOfBoundsException: Index -1 out of bounds for length 15
	at programs.align.NodeST.getChild(NodeST.java:80)
	at programs.align.SuffixTree.buildSuffixTree(SuffixTree.java:194)
	at programs.align.SuffixTree.<init>(SuffixTree.java:54)
	at programs.align.Aligner.computeInitialSequenceDistanceMEM(Aligner.java:129)
	at programs.align.Aligner.alignMultiSequences(Aligner.java:78)
	at programs.align.Aligner.alignSequenceSet(Aligner.java:48)
	at programs.align.AlignSequences.execute(AlignSequences.java:89)
	at cli.CLI_program.parse(CLI_program.java:110)
	at cli.CLI_api.parse(CLI_api.java:234)
	at main.MacseMain.parse(MacseMain.java:111)
	at main.MacseMain.filterCommands(MacseMain.java:54)
	at main.MacseMain.main(MacseMain.java:203)

Because the alphabet_AA parameter was listed in both programs, I assumed it would be applicable to both program. Perhaps I am mistaken and should not be applying it for the alignSequences program?

Thank you for your assistance

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space

I keep getting this error when trying to run MACSE on my university's HPC:

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space at align.ProfileAligner.<init>(ProfileAligner.java:120) at align.CodingMSA.<init>(CodingMSA.java:64) at align.CodingMSA.buildAlignmentReliable(CodingMSA.java:633) at align.CodingMSA.run(CodingMSA.java:659) at utils.MacseMain.main(MacseMain.java:426) srun: error: cn28: task 0: Exited with exit code 1

This is what the log files say before failing:
13 sequence(s) with genetic code The_Standard_Code Build draft alignment using greedy strategy based on UPGMA tree

This is my job script (an array- one job per concatenated FASTA):

`#!/bin/bash
#SBATCH --job-name=MACSE
#SBATCH --array=1-7426
#SBATCH --mem=350000
#SBATCH --cpus-per-task=12
#SBATCH --time=2-00:00:00
#SBATCH --export=ALL

module load anaconda3
source /packages/anaconda3/2022.10/etc/profile.d/conda.sh
conda activate macse

busco_id=$(sed -n "$SLURM_ARRAY_TASK_ID"p /projects/tollis_lab/crocs/analysis/BUSCO_croc.bird.turt/01_BUSCO_meta/busco_nt_merged/mafft_croc_input)
cd /projects/tollis_lab/crocs/analysis/BUSCO_croc.bird.turt

srun macse -prog alignSequences -seq 01_BUSCO_meta/busco_nt_merged/${busco_id} -out_NT 02_macse/${busco_id}_outNT.fa -out_AA 02_macse/${busco_id}_outAA.fa &&
srun macse -prog exportAlignment -align 02_macse/${busco_id}_outNT.fa -codonForInternalStop NNN -codonForInternalFS --- -charForRemainingFS - -out_NT 02_macse/${busco_id}_NT.macse.fa -out_AA 02_macse/${busco_id}_AA.macse.fa`

I've tried different amounts of CPUs and have increased the memory to the max. Am I misspecifying something?

Issue with prefiltering

HI,
I wanted to run omm_macse on ~ 1200 sequences.

omm_macse_v12.01.sif --out_dir macseC2 --out_file_prefix C --in_seq_file C_all.ffn --genetic_code_number 11 --java_mem 2000m --min_percent_NT_at_ends 0.8 --aligner_extra_option "--thread 3 --retree 1"

However I see from the program std.out and the final alignment that over 200 sequences are discarded directly from the start.

848 sequences with genetic code The_Bacterial_Archaeal_and_Plant_Plastid_Code

I also tried running with --no_prefiltering option but the same happen.

Does the program remove identical sequences or what else is going on? I already dereplicated the sequences prior to alignment

How to specify the reading frame in MACSE

Dear Concern,

Sometimes MACSE ignores the reading frame of the original protein sequence provided, and introduces frameshifts in the beginning. Is there a way of getting around that? How can I tell the program to consider the first base to be where the first codon starts?

Thanks and best regards,
Hassan

Am I reaching the limits of omm-macse?

@ranwez et al

Hello everyone,

I am not sure if this is the right place since no other issues have been raised on this git.

I have target capture sequence data for >1,000 nuclear (coding) genes, and created fasta files for ~2,000 samples each.
I have successfully been using the omm-macse pipeline using singularity on our bare metal machine as follows and I have been very happy with the results so far.

./omm_macse_v10.02.sif
--in_seq_file ${gene}_contig.fasta
--out_dir ${gene}_omm_macse_results
--out_file_prefix ${gene}
--java_mem 30g

This worked when including up to 1,000 to 1,500 samples in a single fasta file. However, now that my dataset is growing and exceeding 2,000 samples, it appears that omm-macse has a really hard time finishing, which may take many hours or some days for a single gene. Given the high number of genes I want to process, I am facing too long a total time.

Is there any way to speed up the process, maybe using further arguments for omm-macse?
I already tried upscaling to 100g, but this seems not really to help, and I 'only' have 378g, meaning I can not run multiple genes in parallel (like I was used to).

Looking forward to your help. It is much appreciated. :-)

Best wishes,

Kasper