Giter Club home page Giter Club logo

zamp's People

Contributors

da2798 avatar farchaab avatar julietcharlotte avatar sedreh avatar tpillone avatar valscherz avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar

zamp's Issues

AttributeError with snakemake <7.32.4

Error :

AttributeError in file zAMP/Snakefile, line 11:
can't set attribute 'singularity_args'
 File "zAMP/Snakefile", line 11, in <module>

line:

## When using singularity
if "--use-singularity" in sys.argv:
    ### Bind the directory of the database to the singularity containers.
    workflow.singularity_args += f' -B {config["tax_DB_path"]}:{config["tax_DB_path"]}'

NameError in make_input_lists.py

Error :

NameError in file bin/zAMP/rules/0_preprocessing/scripts/make_input_lists.py, line 132:                                                                                        
name 'sample_name' is not defined

Database processing with latest SILVA and greengenes versions

When using SILVA v138.1 (wSpecies_train_set) I get an error in Derep_and_merge_taxonomy.

  • fasta file
>1
AACTGAAGAGTTTGATCATGGCTCAGATTGAACGCTGGCGGCAGGCCTAACACATGCAAG
TCGAGCGGCAGCACGGGTACTTGTACCTGGTGGCGAGCGGCGGACGGGTGAGTAATGCCT
  • taxonomy
>Bacteria;Proteobacteria;Gammaproteobacteria;Pseudomonadales;Pseudomonadaceae;Pseudomonas;amygdali;
>Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Pectobacteriaceae;Dickeya;phage;
>Bacteria;Actinobacteriota;Actinobacteria;Actinomycetales;Actinomycetaceae;F0332;
>Bacteria;Firmicutes;Bacilli;Lactobacillales;Streptococcaceae;Streptococcus;equi;
>Bacteria;Firmicutes;Bacilli;Lactobacillales;Streptococcaceae;Streptococcus;porcinus;
>Bacteria;Actinobacteriota;Actinobacteria;Pseudonocardiales;Pseudonocardiaceae;Saccharomonospora;
>Bacteria;Proteobacteria;Gammaproteobacteria;Pseudomonadales;Pseudomonadaceae;Pseudomonas;
>Bacteria;Firmicutes;Clostridia;Peptostreptococcales-Tissierellales;Anaerovoracaceae;[Eubacterium] nodatum group;
>Bacteria;Proteobacteria;Alphaproteobacteria;Rhizobiales;Xanthobacteraceae;Bradyrhizobium;
>Bacteria;Proteobacteria;Gammaproteobacteria;Pseudomonadales;Porticoccaceae;Porticoccus;hydrocarbonoclasticus;

Inspecting the log:

[1] ‘1.0.2’

Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

[1] ‘0.8.5’
[1] ‘0.20.41’
[1] ‘1.4.0’
[1] ‘0.5.0’
Error in `$<-.data.frame`(`*tmp*`, V2, value = character(0)) :
  replacement has 0 rows, data has 452064
Calls: $<- -> $<-.data.frame
Execution halted

Reads in the wrong direction are going through the DADA2 pipeline

It appears that reads wrongly orientated were going through the DADA2 pipeline, causing spurious alignements in phylogenetic tree, which could cause distortion in Unifrac distances for instance.

The best solution would be to reverse complement the reads if needed. Yet cutadapt current version of cutadapt 2.10 is not capable to do that for paired-end read.

This is not a problem with vsearch since PANDAseq removes sequences not presenting the forward primer on forward reads and the reverse primer on reverse reads.

Dereplication of the V1V2 region of ref DB EzBioCloud

Our reference database pre-processing pipeline is based on in silico amplification to extract only the region of interest. However, it appears that most of the priming region for the primer 27F (which is used with the 338R primer to cover the regions V1V2) is not included into the full-length 16S rRNA reference database of EzBioCloud. Thus, this pre-processing won't work with this pre-processing pipeline and this database.

27F:
5′-AGMGTTYGATYMTGGCTCAG-3′

338R:
5′-GCTGCCTCCCGTAGGAGT-3′

Database processing

Regarding the database "my" version of the EzBioCloud databse was prepared as indicated here. In the future, we would need to have a script capable of:

  • Detecting the identical sequences on the V3V4 stretch in order to have their taxonomic tag grouped (e.g. Staphylococcus epidermidis/capitis/...). This is necessary so that when reading the results we know that they cannot be distinguished based on this sequence.
  • Preparing the database on its own so that we don't have to include the database on github but just the recipe to make it.

Key error when running the pipeline from SRA samples + SRA toolkits outdated

/home/valentinscherz/bin/microbiome16S_pipeline2/zAMP/rules/1_2_DADA2_ASVs/2_DADA2_denoising.rules:
'SRR6309209'
  File "/home/valentinscherz/bin/microbiome16S_pipeline2/zAMP/Snakefile", line 27, in <module>
  File "/home/valentinscherz/bin/microbiome16S_pipeline2/zAMP/rules/1_2_DADA2_ASVs/2_DADA2_denoising.rules", line 242, in <module>
  File "/home/valentinscherz/bin/microbiome16S_pipeline2/zAMP/rules/1_2_DADA2_ASVs/2_DADA2_denoising.rules", line 229, in sample_list_run_DADA2_stats

Generate a sequence length plot with assignment for QC purpose

The length filtering is arbitrary: it would be interesting to have a plot showing the length disperstion with the taxonomic assignement. It would allow for instance to see if we could exlcue more sequence based on their length. I had made one previously (draft of code hereunder) but not adapted to the current output of the pipeline.

Dada2 sequences

Filter to have the sequences of all unassigned features to blast them.

Blast commands in the ipynb in the same folder


## Obtain  summary by otu of the frequencey

### read the table created with the barplot function where we kept all unassigned features of all samples
all_unassigned_table <- read.table("r_figures/quantitative_barplots/Absolute/Absolute/Table/Kingdom_Unclassified_Absolute>20_study_name_noyes_CURML_sample_source_abundancy_table.tsv", sep = "\t", header = T)

### Use summarise to have only once each OTU and the sum of the abundance for each
summarized_unassiged_table <- all_unassigned_table %>%
  group_by(OTU) %>%
  summarise(sum_abudance = sum(Abundance))

### Filter to keep only the ones with  more than 20 reads, to reduce the number of sequences to blast
filtered_unassigned_table <- summarized_unassiged_table %>%
  filter(sum_abudance >= 20)

### Read the fasta file of all sequences denoised by dada
all_fasta_seq <- read_fasta(file_path = "1_dada_denoised/dna-sequences.fasta")

### Transform the rownames as first column, needed to join with deplyr
all_fasta_seq <- as.data.frame(all_fasta_seq)
all_fasta_seq <- rownames_to_column(all_fasta_seq, var = "OTU")

### Keep only the sequences in the filtered table of all unassigned sequences
filtered_seq <- semi_join(all_fasta_seq, filtered_unassigned_table, by = c("OTU"))

### Write in into a fasta format table
df.fasta = dataframe2fas(filtered_seq, file="filtered_seq.fasta")


Create a function to have the length of the sequences

   
## Rename the sequence column in the all_fasta_seq table
names(all_fasta_seq)[2]<-paste("sequences")

## Calculate the sequences length
sequence_length <- all_fasta_seq %>%
  mutate(length = nchar(as.character(all_fasta_seq$sequences)))

## Get a table with the overall abundance of all features et their assignment
all_sequences_counts <- physeq_df %>%
  group_by(OTU, Kingdom) %>%
  summarise(sum_abundance = sum(Abundance))

## Join the two just created table with all sequences and abundance
joined_sequences_tables <- full_join(all_sequences_counts, sequence_length, by = "OTU")

## Join by length and by kingdom retaining the sum of the abundance for plotting
joined_sequences_tables_length_sum <- joined_sequences_tables %>%
  group_by(length, Kingdom) %>%
  summarise(length_sum_abundance = sum(sum_abundance))

## Create a version without the 0
joined_sequences_no_zero <- joined_sequences_tables_length_sum %>%
  ungroup() %>%
  filter(length_sum_abundance > 0)
  ### Write it as tsv
  write.table(x = joined_sequences_no_zero, file = "length_of_sequences.tsv", append = F, quote = F, sep = "\t", row.names = F, col.names = TRUE)

## Plot the size by Kingdom
read_length_plot <- joined_sequences_tables_length_sum %>%
  ggplot(aes(x = length, y = length_sum_abundance, fill = Kingdom)) +
  geom_col(position = "stack", width = 1) 

##Save this plots 
ggsave(read_length_plot, filename = "amplicons_length_distribution.png", width = 10, height = 5)

## Zoom to lower < values for better visualization

read_length_plot_zoomed <- read_length_plot + coord_cartesian(
xlim = c(250, 450), ylim = c(0, 50000))

##Save this plots 
ggsave(read_length_plot_zoomed, filename = "amplicons_length_distribution_zoomed.png", width = 10, height = 5)

To compare sequences length and results, same analysis with vsearch join paired sequences

Load objects into R


## Function to load in R environnement the taxonomy table, the metadata table, the features counts table, the tree, the Shannon and the PCO objects


load_objects_fct <- function(taxonomy_class_table, replace_empty_tax = TRUE, features_counts_table , metadata_table, tree, shannon, pco) {

    
  ## Qiime2 .qza features counts table
    
    ### Load Qiime2 .qza counts table into R environment
    features_counts_table<-read_qza(features_counts_table)
    
    ### Shows the unique identifier of the file
    print("features counts table")
    print(features_counts_table$uuid)
    
    ### Print information about the object
    names(features_counts_table)
      
    ### Shows the unique identifier of the file
    print("features count table")
    print(features_counts_table$uuid)
    
    ### Show the first 5 samples and features
    print(features_counts_table$data[1:5,1:5]) 
    
    ### Save this table in global environnement
    features_counts_table <<- features_counts_table

    
}



## Run the function to load all needed object into R

load_objects_fct( features_counts_table =  "1b_vsearch_joining/dereplicate/dereplicated_table.qza")



## Create a phyloseq object from what has been loaded in the previous chunks

  ### Generate a function to create a phyloseq object
  create_phyloseq_fct <- function(physeq_name = "physeq", melted_df_name = "physeq_df", otu_table){

      ### Create the phyloseq object
      physeq_obj<-phyloseq(
      otu_table(otu_table$data, taxa_are_rows = T))
      
      ### Assign the object into the environnement 
      assign(value = physeq_obj, x = physeq_name, envir =  .GlobalEnv)
      
      ### Melt the physeq objet into a dataframe with one row per feature.id and per sample, needed later
      physeq_df <- psmelt(physeq_obj)
      
      ### Assign the object into the environnement 
      assign(value = physeq_df, x = melted_df_name, envir =  .GlobalEnv)
      
    
    }

  ### Use the function
  create_phyloseq_fct(physeq_name = "physeq_vsearch", melted_df_name = "physeq_df_vsearch", otu_table = features_counts_table)

Filter to have the sequences of all unassigned features



### Read the fasta file of all sequences denoised by dada
all_fasta_seq_vsearch <- read_fasta(file_path = "1b_vsearch_joining/dereplicate/dna-sequences.fasta")

### Transform the rownames as first column, needed to join with deplyr
all_fasta_seq_vsearch <- as.data.frame(all_fasta_seq_vsearch)
all_fasta_seq_vsearch <- rownames_to_column(all_fasta_seq_vsearch, var = "OTU")

## Rename the sequence column in the all_fasta_seq table
names(all_fasta_seq_vsearch)[2]<-paste("sequences")


## Separate the first column where the feature ID and the sample name are merged
all_fasta_seq_vsearch <- separate(all_fasta_seq_vsearch, sep = " ", col = OTU, into = "OTU")

Create a function to have the length of the sequences



## Calculate the sequences length
sequence_length_vsearch <- all_fasta_seq_vsearch %>%
  mutate(length = nchar(as.character(all_fasta_seq_vsearch$sequences)))

## Get a table with the overall abundance of all features et their assignment
all_sequences_counts_vsearch <- physeq_df_vsearch %>%
  group_by(OTU) %>%
  summarise(sum_abundance = sum(Abundance))

## Join the two just created table with all sequences and abundance
joined_sequences_tables_vsearch <- full_join(all_sequences_counts_vsearch, sequence_length_vsearch, by = "OTU")

### Write it as tsv
  write.table(x = joined_sequences_tables_vsearch, file = "length_of_sequences_vsearch.tsv", append = F, quote = F, sep = "\t", row.names = F, col.names = TRUE)
  
  
## Join by length and by kingdom retaining the sum of the abundance for plotting
joined_sequences_tables_length_sum_vsearch <- joined_sequences_tables_vsearch %>%
  group_by(length) %>%
  summarise(length_sum_abundance = sum(sum_abundance))
  

## Plot the size by Kingdom
read_length_plot <- joined_sequences_tables_length_sum_vsearch %>%
  ggplot(aes(x = length, y = length_sum_abundance)) +
  geom_col(position = "stack", width = 1) 

##Save this plots 
ggsave(read_length_plot, filename = "amplicons_length_distribution_vsearch.png", width = 10, height = 5)

## Zoom to lower < values for better visualization

read_length_plot_zoomed <- read_length_plot + coord_cartesian(
xlim = c(0, 600), ylim = c(0, 50000))

##Save this plots 
ggsave(read_length_plot_zoomed, filename = "amplicons_length_distribution_zoomed_vsearch.png", width = 10, height = 5)

Key error with DADA2

Random error, probably due to special as "-" characters in "sample_label"

InputFunctionException in line 110 of /home/valentinscherz/bin/microbiome16S_pipeline/rules/1_2_DADA2_ASVs/2_DADA2_denoising.rules:
Error:
  KeyError: 'MN-MOCK-20180809'
Wildcards:
  RUN=runa549
  sample=MN-MOCK-20180809

Can't open /miniconda3/622b4111/lib/R/etc/ldpaths

The following error sometimes occurs near the end of the pipeline, when running the visualizations scripts. A rerun of the same snakemake command usually solves it.

Activating conda environment: /home/pipeline_user/miniconda3/622b4111
/home/pipeline_user/miniconda3/622b4111/lib/R/bin/R: 238: .: Can't open /home/pipeline_user/miniconda3/622b4111/lib/R/etc/ldpaths

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.