metagenlab / zamp Goto Github PK

View Code? Open in Web Editor NEW

8.0 4.0 4.0 202.72 MB

zAMP is a bioinformatic pipeline designed for convenient, reproducible and scalable amplicon-based metagenomics

Home Page: https://zamp.readthedocs.io/en/latest/

License: MIT License

Python 11.90% R 84.03% Dockerfile 3.72% Shell 0.35%

snakemake pipeline metagenomics amplicon

zamp's People

Contributors

Stargazers

Watchers

Forkers

vikash84 sedreh gunzivan28 ayooluwab

zamp's Issues

AttributeError with snakemake <7.32.4

Error :

AttributeError in file zAMP/Snakefile, line 11:
can't set attribute 'singularity_args'
 File "zAMP/Snakefile", line 11, in <module>

line:

## When using singularity
if "--use-singularity" in sys.argv:
    ### Bind the directory of the database to the singularity containers.
    workflow.singularity_args += f' -B {config["tax_DB_path"]}:{config["tax_DB_path"]}'

Add tool versions and references

Add references for tools used in docs

Use cli wrapper

NameError in make_input_lists.py

Error :

NameError in file bin/zAMP/rules/0_preprocessing/scripts/make_input_lists.py, line 132:                                                                                        
name 'sample_name' is not defined

alpha diversity plot is looping over all facors several times

https://github.com/metagenlab/microbiome16S_pipeline/blob/master/rules/common_visualization/scripts/alpha_diversity.R

Alpha-diversity plot is looping around the different factors of the grouping_column several times because of the expand in the general_rule already and the loop is a loop in the script

Copy reference databases hash codes for traceability.

The Snakemake rules for reference database preprocessing generates hashes of the reference database. These hashes should be copied in the log folder to ensure traceability.

Database processing with latest SILVA and greengenes versions

When using SILVA v138.1 (wSpecies_train_set) I get an error in Derep_and_merge_taxonomy.

fasta file

>1
AACTGAAGAGTTTGATCATGGCTCAGATTGAACGCTGGCGGCAGGCCTAACACATGCAAG
TCGAGCGGCAGCACGGGTACTTGTACCTGGTGGCGAGCGGCGGACGGGTGAGTAATGCCT

taxonomy

>Bacteria;Proteobacteria;Gammaproteobacteria;Pseudomonadales;Pseudomonadaceae;Pseudomonas;amygdali;
>Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Pectobacteriaceae;Dickeya;phage;
>Bacteria;Actinobacteriota;Actinobacteria;Actinomycetales;Actinomycetaceae;F0332;
>Bacteria;Firmicutes;Bacilli;Lactobacillales;Streptococcaceae;Streptococcus;equi;
>Bacteria;Firmicutes;Bacilli;Lactobacillales;Streptococcaceae;Streptococcus;porcinus;
>Bacteria;Actinobacteriota;Actinobacteria;Pseudonocardiales;Pseudonocardiaceae;Saccharomonospora;
>Bacteria;Proteobacteria;Gammaproteobacteria;Pseudomonadales;Pseudomonadaceae;Pseudomonas;
>Bacteria;Firmicutes;Clostridia;Peptostreptococcales-Tissierellales;Anaerovoracaceae;[Eubacterium] nodatum group;
>Bacteria;Proteobacteria;Alphaproteobacteria;Rhizobiales;Xanthobacteraceae;Bradyrhizobium;
>Bacteria;Proteobacteria;Gammaproteobacteria;Pseudomonadales;Porticoccaceae;Porticoccus;hydrocarbonoclasticus;

Inspecting the log:

[1] ‘1.0.2’

Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

[1] ‘0.8.5’
[1] ‘0.20.41’
[1] ‘1.4.0’
[1] ‘0.5.0’
Error in `$<-.data.frame`(`*tmp*`, V2, value = character(0)) :
  replacement has 0 rows, data has 452064
Calls: $<- -> $<-.data.frame
Execution halted

Reads in the wrong direction are going through the DADA2 pipeline

It appears that reads wrongly orientated were going through the DADA2 pipeline, causing spurious alignements in phylogenetic tree, which could cause distortion in Unifrac distances for instance.

The best solution would be to reverse complement the reads if needed. Yet cutadapt current version of cutadapt 2.10 is not capable to do that for paired-end read.

This is not a problem with vsearch since PANDAseq removes sequences not presenting the forward primer on forward reads and the reverse primer on reverse reads.

Sample with less reads than the rarefaction cut-off are dropped

Using phyloseq rarefaction function, the samples containing less reads than the cut-off are dropped.

Aldex2 transformed clr counts are not giving the expected output

I thought that the ouput of:

aldex_obj <- ALDEx2::aldex.clr(reads = OTU)
reads_trfs <- aldex_obj@reads

would give me the output of the clr transformation. Instead, it is giving the input + a constant (0.5)

Dereplication of the V1V2 region of ref DB EzBioCloud

Our reference database pre-processing pipeline is based on in silico amplification to extract only the region of interest. However, it appears that most of the priming region for the primer 27F (which is used with the 338R primer to cover the regions V1V2) is not included into the full-length 16S rRNA reference database of EzBioCloud. Thus, this pre-processing won't work with this pre-processing pipeline and this database.

27F:
5′-AGMGTTYGATYMTGGCTCAG-3′

338R:
5′-GCTGCCTCCCGTAGGAGT-3′

A "Rplots.pdf" at the root of the project directory

Could be explained by :

I had a similar problem recently after upgrading to R-3.0.3 (yes we're a little behind!). It turns out that palette("default") opens a device now, though it didn't used to.

https://stackoverflow.com/questions/6535927/how-do-i-prevent-rplots-pdf-from-being-generated

Improve alpha diversity plots

https://github.com/metagenlab/microbiome16S_pipeline/blob/master/rules/common_visualization/scripts/alpha_diversity.R

Needs more color (Otherwise group are plotted in white)
Needs better adaptation of the width
Needs to filter PCR-neg samples

config file

config file is missing of logs

Database processing

Regarding the database "my" version of the EzBioCloud databse was prepared as indicated here. In the future, we would need to have a script capable of:

Detecting the identical sequences on the V3V4 stretch in order to have their taxonomic tag grouped (e.g. Staphylococcus epidermidis/capitis/...). This is necessary so that when reading the results we know that they cannot be distinguished based on this sequence.
Preparing the database on its own so that we don't have to include the database on github but just the recipe to make it.

Key error when running the pipeline from SRA samples + SRA toolkits outdated

/home/valentinscherz/bin/microbiome16S_pipeline2/zAMP/rules/1_2_DADA2_ASVs/2_DADA2_denoising.rules:
'SRR6309209'
  File "/home/valentinscherz/bin/microbiome16S_pipeline2/zAMP/Snakefile", line 27, in <module>
  File "/home/valentinscherz/bin/microbiome16S_pipeline2/zAMP/rules/1_2_DADA2_ASVs/2_DADA2_denoising.rules", line 242, in <module>
  File "/home/valentinscherz/bin/microbiome16S_pipeline2/zAMP/rules/1_2_DADA2_ASVs/2_DADA2_denoising.rules", line 229, in sample_list_run_DADA2_stats

Generate a sequence length plot with assignment for QC purpose

The length filtering is arbitrary: it would be interesting to have a plot showing the length disperstion with the taxonomic assignement. It would allow for instance to see if we could exlcue more sequence based on their length. I had made one previously (draft of code hereunder) but not adapted to the current output of the pipeline.

Dada2 sequences

Filter to have the sequences of all unassigned features to blast them.

Blast commands in the ipynb in the same folder


## Obtain  summary by otu of the frequencey

### read the table created with the barplot function where we kept all unassigned features of all samples
all_unassigned_table <- read.table("r_figures/quantitative_barplots/Absolute/Absolute/Table/Kingdom_Unclassified_Absolute>20_study_name_noyes_CURML_sample_source_abundancy_table.tsv", sep = "\t", header = T)

### Use summarise to have only once each OTU and the sum of the abundance for each
summarized_unassiged_table <- all_unassigned_table %>%
  group_by(OTU) %>%
  summarise(sum_abudance = sum(Abundance))

### Filter to keep only the ones with  more than 20 reads, to reduce the number of sequences to blast
filtered_unassigned_table <- summarized_unassiged_table %>%
  filter(sum_abudance >= 20)

### Read the fasta file of all sequences denoised by dada
all_fasta_seq <- read_fasta(file_path = "1_dada_denoised/dna-sequences.fasta")

### Transform the rownames as first column, needed to join with deplyr
all_fasta_seq <- as.data.frame(all_fasta_seq)
all_fasta_seq <- rownames_to_column(all_fasta_seq, var = "OTU")

### Keep only the sequences in the filtered table of all unassigned sequences
filtered_seq <- semi_join(all_fasta_seq, filtered_unassigned_table, by = c("OTU"))

### Write in into a fasta format table
df.fasta = dataframe2fas(filtered_seq, file="filtered_seq.fasta")

Create a function to have the length of the sequences

   
## Rename the sequence column in the all_fasta_seq table
names(all_fasta_seq)[2]<-paste("sequences")

## Calculate the sequences length
sequence_length <- all_fasta_seq %>%
  mutate(length = nchar(as.character(all_fasta_seq$sequences)))

## Get a table with the overall abundance of all features et their assignment
all_sequences_counts <- physeq_df %>%
  group_by(OTU, Kingdom) %>%
  summarise(sum_abundance = sum(Abundance))

## Join the two just created table with all sequences and abundance
joined_sequences_tables <- full_join(all_sequences_counts, sequence_length, by = "OTU")

## Join by length and by kingdom retaining the sum of the abundance for plotting
joined_sequences_tables_length_sum <- joined_sequences_tables %>%
  group_by(length, Kingdom) %>%
  summarise(length_sum_abundance = sum(sum_abundance))

## Create a version without the 0
joined_sequences_no_zero <- joined_sequences_tables_length_sum %>%
  ungroup() %>%
  filter(length_sum_abundance > 0)
  ### Write it as tsv
  write.table(x = joined_sequences_no_zero, file = "length_of_sequences.tsv", append = F, quote = F, sep = "\t", row.names = F, col.names = TRUE)

## Plot the size by Kingdom
read_length_plot <- joined_sequences_tables_length_sum %>%
  ggplot(aes(x = length, y = length_sum_abundance, fill = Kingdom)) +
  geom_col(position = "stack", width = 1) 

##Save this plots 
ggsave(read_length_plot, filename = "amplicons_length_distribution.png", width = 10, height = 5)

## Zoom to lower < values for better visualization

read_length_plot_zoomed <- read_length_plot + coord_cartesian(
xlim = c(250, 450), ylim = c(0, 50000))

##Save this plots 
ggsave(read_length_plot_zoomed, filename = "amplicons_length_distribution_zoomed.png", width = 10, height = 5)

To compare sequences length and results, same analysis with vsearch join paired sequences

Load objects into R


## Function to load in R environnement the taxonomy table, the metadata table, the features counts table, the tree, the Shannon and the PCO objects


load_objects_fct <- function(taxonomy_class_table, replace_empty_tax = TRUE, features_counts_table , metadata_table, tree, shannon, pco) {

    
  ## Qiime2 .qza features counts table
    
    ### Load Qiime2 .qza counts table into R environment
    features_counts_table<-read_qza(features_counts_table)
    
    ### Shows the unique identifier of the file
    print("features counts table")
    print(features_counts_table$uuid)
    
    ### Print information about the object
    names(features_counts_table)
      
    ### Shows the unique identifier of the file
    print("features count table")
    print(features_counts_table$uuid)
    
    ### Show the first 5 samples and features
    print(features_counts_table$data[1:5,1:5]) 
    
    ### Save this table in global environnement
    features_counts_table <<- features_counts_table

    
}



## Run the function to load all needed object into R

load_objects_fct( features_counts_table =  "1b_vsearch_joining/dereplicate/dereplicated_table.qza")


## Create a phyloseq object from what has been loaded in the previous chunks

  ### Generate a function to create a phyloseq object
  create_phyloseq_fct <- function(physeq_name = "physeq", melted_df_name = "physeq_df", otu_table){

      ### Create the phyloseq object
      physeq_obj<-phyloseq(
      otu_table(otu_table$data, taxa_are_rows = T))
      
      ### Assign the object into the environnement 
      assign(value = physeq_obj, x = physeq_name, envir =  .GlobalEnv)
      
      ### Melt the physeq objet into a dataframe with one row per feature.id and per sample, needed later
      physeq_df <- psmelt(physeq_obj)
      
      ### Assign the object into the environnement 
      assign(value = physeq_df, x = melted_df_name, envir =  .GlobalEnv)
      
    
    }

  ### Use the function
  create_phyloseq_fct(physeq_name = "physeq_vsearch", melted_df_name = "physeq_df_vsearch", otu_table = features_counts_table)

Filter to have the sequences of all unassigned features



### Read the fasta file of all sequences denoised by dada
all_fasta_seq_vsearch <- read_fasta(file_path = "1b_vsearch_joining/dereplicate/dna-sequences.fasta")

### Transform the rownames as first column, needed to join with deplyr
all_fasta_seq_vsearch <- as.data.frame(all_fasta_seq_vsearch)
all_fasta_seq_vsearch <- rownames_to_column(all_fasta_seq_vsearch, var = "OTU")

## Rename the sequence column in the all_fasta_seq table
names(all_fasta_seq_vsearch)[2]<-paste("sequences")


## Separate the first column where the feature ID and the sample name are merged
all_fasta_seq_vsearch <- separate(all_fasta_seq_vsearch, sep = " ", col = OTU, into = "OTU")

Create a function to have the length of the sequences



## Calculate the sequences length
sequence_length_vsearch <- all_fasta_seq_vsearch %>%
  mutate(length = nchar(as.character(all_fasta_seq_vsearch$sequences)))

## Get a table with the overall abundance of all features et their assignment
all_sequences_counts_vsearch <- physeq_df_vsearch %>%
  group_by(OTU) %>%
  summarise(sum_abundance = sum(Abundance))

## Join the two just created table with all sequences and abundance
joined_sequences_tables_vsearch <- full_join(all_sequences_counts_vsearch, sequence_length_vsearch, by = "OTU")

### Write it as tsv
  write.table(x = joined_sequences_tables_vsearch, file = "length_of_sequences_vsearch.tsv", append = F, quote = F, sep = "\t", row.names = F, col.names = TRUE)
  
  
## Join by length and by kingdom retaining the sum of the abundance for plotting
joined_sequences_tables_length_sum_vsearch <- joined_sequences_tables_vsearch %>%
  group_by(length) %>%
  summarise(length_sum_abundance = sum(sum_abundance))
  

## Plot the size by Kingdom
read_length_plot <- joined_sequences_tables_length_sum_vsearch %>%
  ggplot(aes(x = length, y = length_sum_abundance)) +
  geom_col(position = "stack", width = 1) 

##Save this plots 
ggsave(read_length_plot, filename = "amplicons_length_distribution_vsearch.png", width = 10, height = 5)

## Zoom to lower < values for better visualization

read_length_plot_zoomed <- read_length_plot + coord_cartesian(
xlim = c(0, 600), ylim = c(0, 50000))

##Save this plots 
ggsave(read_length_plot_zoomed, filename = "amplicons_length_distribution_zoomed_vsearch.png", width = 10, height = 5)

Improve DADA2 script

The DADA2 script microbiome16S_pipeline/rules/DADA2_ASV/scripts/2_DADA2_denoising.R could benefit from updates to:

Split the script in several with saved intermediate file, not to have to rerun all steps if stopped midway (since it is very long)
Implement parallelization (https://benjjneb.github.io/dada2/bigdata.html) - to be tested first
Include errors models plotting (https://benjjneb.github.io/dada2/tutorial.html)

Key error with DADA2

Random error, probably due to special as "-" characters in "sample_label"

InputFunctionException in line 110 of /home/valentinscherz/bin/microbiome16S_pipeline/rules/1_2_DADA2_ASVs/2_DADA2_denoising.rules:
Error:
  KeyError: 'MN-MOCK-20180809'
Wildcards:
  RUN=runa549
  sample=MN-MOCK-20180809

Can't open /miniconda3/622b4111/lib/R/etc/ldpaths

The following error sometimes occurs near the end of the pipeline, when running the visualizations scripts. A rerun of the same snakemake command usually solves it.

Activating conda environment: /home/pipeline_user/miniconda3/622b4111
/home/pipeline_user/miniconda3/622b4111/lib/R/bin/R: 238: .: Can't open /home/pipeline_user/miniconda3/622b4111/lib/R/etc/ldpaths

metagenlab / zamp Goto Github PK

zamp's People

Contributors

Stargazers

Watchers

Forkers

zamp's Issues

Dada2 sequences

Filter to have the sequences of all unassigned features to blast them.

Create a function to have the length of the sequences

To compare sequences length and results, same analysis with vsearch join paired sequences

Load objects into R

Filter to have the sequences of all unassigned features

Create a function to have the length of the sequences

Recommend Projects

Recommend Topics

Recommend Org