Giter Club home page Giter Club logo

kneaddata's Introduction

ATTENTION

Before opening a new issue here, please check the appropriate help channel on the KneadData bioBakery Support Forum and consider opening or commenting on a thread there.

For additional information, visit the KneadData Tutorial.


KneadData User Manual

KneadData is a tool designed to perform quality control on metagenomic and metatranscriptomic sequencing data, especially data from microbiome experiments. In these experiments, samples are typically taken from a host in hopes of learning something about the microbial community on the host. However, sequencing data from such experiments will often contain a high ratio of host to bacterial reads. This tool aims to perform principled in silico separation of bacterial reads from these "contaminant" reads, be they from the host, from bacterial 16S sequences, or other user-defined sources. Additionally, KneadData can be used for other filtering tasks. For example, if one is trying to clean data derived from a human sequencing experiment, KneadData can be used to separate the human and the non-human reads.

If you use the KneadData software, please cite our manuscript: TBA

Contents

Requirements

  1. Trimmomatic (version == 0.33) (automatically installed)
  2. Bowtie2 (version >= 2.2) (automatically installed)
  3. Python (version >= 2.7)
  4. Java Runtime Environment
  5. TRF (optional)
  6. Fastqc (optional)
  7. SAMTools (only required if input file is in BAM format)
  8. Memory (>= 4 Gb if using Bowtie2, >= 8 Gb if using BMTagger)
  9. Operating system (Linux or Mac)

Optionally, BMTagger can be used instead of Bowtie2.

The executables for the required software packages should be installed in your $PATH. Alternatively, you can provide the location of the Bowtie2 install ($BOWTIE2_DIR) with the following KneadData option “--bowtie2 $BOWTIE2_DIR”.

kneaddata_workflow

Installation

Before installing KneadData, please install the Java Runtime Environment (JRE). First download the JRE for your platform. Then follow the instructions for your platform: Linux 64-bit or Mac OS. At the end of the installation, add the location of the java executable to your $PATH.

Download KneadData

You can download the latest KneadData release or the development version. The source contains example files. If installing with pip, it is optional to first download the KneadData source.

Option 1: Latest Release (Recommended)

Option 2: Development Version

  • Create a clone of the repository:

    $ git clone https://github.com/biobakery/kneaddata.git

    Note: Creating a clone of the repository requires Git to be installed.

Install KneadData

Install with pip

  • $ pip install kneaddata
  • This command will automatically install Trimmomatic and Bowtie2. To bypass the install of dependencies, add the option "--install-option='--bypass-dependencies-install'".
  • If you do not have write permissions to '/usr/lib/', then add the option "--user" to the install command. This will install the python package into subdirectories of '$HOME/.local'. Please note when using the "--user" install option on some platforms, you might need to add '$HOME/.local/bin/' to your $PATH as it might not be included by default. You will know if it needs to be added if you see the following message kneaddata: command not found when trying to run KneadData after installing with the "--user" option.

Install from source

  1. Follow the instructions to download KneadData
  2. Move to the KneadData source directory: $ cd kneaddata
  3. Install KneadData
    • $ python setup.py install
    • This command will automatically install Trimmomatic and Bowtie2. To bypass the install of dependencies, add the option "--bypass-dependencies-install".
    • If you do not have write permissions to '/usr/lib/', then add the option "--user" to the install command. This will install the python package into subdirectories of '$HOME/.local'. Please note when using the "--user" install option on some platforms, you might need to add '$HOME/.local/bin/' to your $PATH as it might not be included by default. You will know if it needs to be added if you see the following message kneaddata: command not found when trying to run KneadData after installing with the "--user" option.

Download the database

It is recommended that you download the human (hg37_and_human_contamination) reference database (approx. size = 3.5 GB). However, this step is not required if you are using your own custom reference database or if you will not be running with a reference database.

This database is based on the Decoy Genome (http://www.cureffi.org/2013/02/01/the-decoy-genome/) and contaminants taken from “Human contamination in bacterial genomes has created thousands of spurious proteins” (Salzberg et. al. 2019)

  • $ kneaddata_database --download human_genome bowtie2 $DIR
  • When running this command, $DIR should be replaced with the full path to the directory you have selected to store the database.

If you are running with bmtagger instead of bowtie2, then download the bmtagger database instead of the bowtie2 database with the following command.

  • $ kneaddata_database --download human_genome bmtagger $DIR
  • When running this command, $DIR should be replaced with the full path to the directory you have selected to store the database.

The human transcriptome (hg38) reference database is also available for download (approx. size = 254 MB).

  • $ kneaddata_database --download human_transcriptome bowtie2 $DIR

The SILVA Ribosomal RNA reference database is also available for download (approx. size = 11 GB).

  • $ kneaddata_database --download ribosomal_RNA bowtie2 $DIR

The mouse (C57BL) reference database is also available for download (approx. size = 3 GB).

  • $ kneaddata_database --download mouse_C57BL bowtie2 $DIR

Create a Custom Database

A reference database can be downloaded to use when running KneadData. Alternatively, you can create your own custom reference database.

Select Reference Sequences

First you must select reference sequences for the contamination you are trying to remove. Say you wish to filter reads from a particular "host." Broadly defined, the host can be an organism, or a set of organisms, or just a set of sequences. Then, you simply must generate a reference database for KneadData from a FASTA file containing these sequences. Usually, researchers want to remove reads from the human genome, the human transcriptome, or ribosomal RNA. You can access some of these FASTA files using the resources below:

  • Ribosomal RNA: Silva provides a comprehensive database for ribosomal RNA sequences spanning all three domains of life (Bacteria, Archaea, and Eukarya).

  • Human Genome & Transcriptome: Information about the newest assembly of human genomic data can be found at the NCBI project page. USCS provides a convenient website to download this data.

Generating KneadData Databases

KneadData requires that your reference sequences (FASTA files) be indexed to form KneadData databases beforehand. This only needs to be done once per reference sequence.

For certain common databases, we provide indexed files. If you use these, you can skip the manual build steps below. Alternatively if you would like to bypass the reference alignment portion of the workflow, a database does not need to be provided when running KneadData.

To download the indexed human reference database, run the following command:

  • $ kneaddata_database --download human bowtie2 $DIR
  • When running this command, $DIR should be replaced with the full path to the directory you have selected to store the database.

Creating a Bowtie2 Database

Simply run the bowtie2-build indexer included with Bowtie2 as follows:

$ bowtie2-build <reference> <db-name>

Where <reference> is the reference FASTA file, and <db-name> is the name you wish to call your Bowtie2 database. For more details, refer to the bowtie2-build-documentation

Note: Creating SILVA ribosomal_RNA Database

Creating the SILVA ribosomal_RNA database requires one additional step. Run the following python program before bowtie2-build command which converts the "U"s to "T"s in the fasta sequences.
Script link: modify_RNA_to_DNA.py $ python -u modify_RNA_to_DNA.py input.fasta output.fa

Creating a BMTagger Database

KneadData includes kneaddata_build_database, an executable that will automatically generate these databases for BMTagger. Simply run

$ kneaddata_build_database reference.fasta

By default, this will generate the reference databases, whose names are prefixed with reference.fasta.

A note on PATH: The above command will fail if the tools in the BMTagger suite (specifically, bmtool and srprism) and the NCBI BLAST executables are not in your PATH. If this is the case, you can specify a path to these tools using the -b, -s, and -m options. Run

$ kneaddata_build_database --help

for more details.

Example Custom Database Build

Say you want to remove human reads from your metagenomic sequencing data. You downloaded the human genome in a file called Homo_sapiens.fasta.

Then, you can generate the KneadData database by executing:

$ bowtie2-build Homo_sapiens.fasta -o Homo_sapiens_db

for Bowtie2, or

$ kneaddata_build_database Homo_sapiens.fasta -o Homo_sapiens_db

All of the required KneadData database files will have file names prefixed by Homo_sapiens_db and have various file extensions.

Note: For creating SILVA ribosomal_RNA database

Run the following python program before bowtie2-build command which converts the "U"s to "T"s in the fasta sequences for creating SILVA ribosomal_RNA database.
Script link: modify_RNA_to_DNA.py $ python -u modify_RNA_to_DNA.py input.fasta output.fa

How to Run

After downloading or generating your database file, you can start to remove contaminant reads. As input, KneadData requires FASTQ files. It supports both single end and paired end reads. KneadData uses either Bowtie2 (default) or BMTagger to identify the contaminant reads.

Single End Run

To run KneadData in single end mode, run

$ kneaddata --unpaired seq.fastq --reference-db $DATABASE --output kneaddata_output

This will create files in the folder kneaddata_output named

  • seq_kneaddata_$DATABASE_bowtie2_contam.fastq: FASTQ file containing reads that were identified as contaminants from the database (named $DATABASE).
  • seq_kneaddata.fastq: This file includes reads that were not in the reference database.
  • seq_kneaddata.trimmed.fastq: This file has trimmed reads.
  • seq_kneaddata.log

To run KneadData in single end mode with BMTagger, run

$ kneaddata --unpaired seq.fastq --reference-db $DATABASE --run-bmtagger

By default, this will create the same four files as running with bowtie2. The only differences are the contaminants file will have "bmtagger" in the name instead of "bowtie2" and the included $DATABASE name would differ.

If you wanted to use BMTagger and the BMTagger executable was located at $HOME/bmtagger/bmtagger.sh which is not in your $PATH you would add the option "--bmtagger $HOME/bmtagger/bmtagger.sh" to the command.

If you wanted to select the basenames of the output files, you would add the option "--output-prefix $NAME", replacing $NAME with the name you would like used.

Paired End Run

To run KneadData in paired end mode with Bowtie2, run

$ kneaddata --input1 seq1.fastq --input2 seq2.fastq -db $DATABASE --output kneaddata_output

To run KneadData in paired end mode with BMTagger, run

$ kneaddata --input seq1.fastq --input seq2.fastq -db $DATABASE --run-bmtagger --output kneaddata_output

  • seq1.fastq: Your input FASTQ file, first mate
  • seq2.fastq: Your input FASTQ file, second mate
  • $DATABASE: Prefix for the KneadData database.
  • kneaddata_output: The folder to write the output files.

The outputs depend on what happens during the quality filtering and trimming part of the pipeline.

When performing quality filtering and trimming for paired end files, three things can happen:

  1. Both reads in the pair pass.
  2. The read in the first mate passes, and the one in the second does not pass.
  3. The read in the second mate passes, and the one in the first does not pass.

The number of outputs are a function of the read quality.

KneadData + Bowtie2 (or BMTagger) Outputs: There can be up to 8 outputs per reference database, plus up to 5 aggregate outputs.

Instead of single end reads, say you have paired end reads and you want to separate the reads that came from bacterial mRNA, bacterial rRNA, and human RNA. You have two databases, one prefixed bact_rrna_db and the other prefixed human_rna_db, and your sequence files are seq1.fastq and seq2.fastq. To run with Bowtie2, execute

$ kneaddata --input1 seq1.fastq --input2 seq2.fastq -db bact_rrna_db -db human_rna_db --output seq_out

This will output files in the folder seq_out named:

Files for just the bact_rrna_db database:

  • seq_kneaddata_paired_bact_rrna_db_bowtie2_contam_1.fastq: Reads from the first mate in situation (1) above that were identified as belonging to the bact_rrna_db database.
  • seq_kneaddata_paired_bact_rrna_db_bowtie2_contam_2.fastq: Reads from the second mate in situation (1) above that were identified as belonging to the bact_rrna_db database.
  • seq_kneaddata_paired_bact_rrna_db_bowtie2_clean_1.fastq: Reads from the first mate in situation (1) above that were identified as NOT belonging to the bact_rrna_db database.
  • seq_kneaddata_paired_bact_rrna_db_bowtie2_clean_2.fastq: Reads from the second mate in situation (1) above that were identified as NOT belonging to the bact_rrna_db database.

Depending on the input FASTQ, one or more of the following may be output:

  • seq_kneaddata_unmatched_1_bact_rrna_db_bowtie2_contam.fastq: Reads from the first mate in situation (2) above that were identified as belonging to the bact_rrna_db database.
  • seq_kneaddata_unmatched_1_bact_rrna_db_bowtie2_clean.fastq: Reads from the first mate in situation (2) above that were identified as NOT belonging to the bact_rrna_db database.
  • seq_kneaddata_unmatched_2_bact_rrna_db_bowtie2_contam.fastq: Reads from the second mate in situation (3) above that were identified as belonging to the bact_rrna_db database.
  • seq_kneaddata_unmatched_2_bact_rrna_db_bowtie2_clean.fastq: Reads from the second mate in situation (3) above that were identified as NOT belonging to the bact_rrna_db database.

Files for just the human_rna_db database:

  • seq_kneaddata_paired_human_rna_db_bowtie2_contam_1.fastq: Reads from the first mate in situation (1) above that were identified as belonging to the human_rna_db database.
  • seq_kneaddata_paired_human_rna_db_bowtie2_contam_2.fastq: Reads from the second mate in situation (1) above that were identified as belonging to the human_rna_db database.
  • seq_kneaddata_paired_human_rna_db_bowtie2_clean_1.fastq: Reads from the first mate in situation (1) above that were identified as NOT belonging to the human_rna_db database.
  • seq_kneaddata_paired_human_rna_db_bowtie2_clean_2.fastq: Reads from the second mate in situation (1) above that were identified as NOT belonging to the human_rna_db database.

Depending on the input FASTQ, one or more of the following may be output:

  • seq_kneaddata_unmatched_1_human_rna_db_bowtie2_contam.fastq: Reads from the first mate in situation (2) above that were identified as belonging to the human_rna_db database.
  • seq_kneaddata_unmatched_1_human_rna_db_bowtie2_clean.fastq: Reads from the first mate in situation (2) above that were identified as NOT belonging to the human_rna_db database.
  • seq_kneaddata_unmatched_2_human_rna_db_bowtie2_contam.fastq: Reads from the second mate in situation (2) above that were identified as belonging to the human_rna_db database.
  • seq_kneaddata_unmatched_2_human_rna_db_bowtie2_clean.fastq: Reads from the second mate in situation (2) above that were identified as NOT belonging to the human_rna_db database.

Note, the files named "*_clean.fastq" will only be written if running with the option "--store-temp-output".

Aggregated files:

  • seq_kneaddata.log: Log file containing statistics about the run.
  • seq_kneaddata_paired_1.fastq: Reads from the first mate in situation (1) identified as NOT belonging to any of the reference databases.
  • seq_kneaddata_paired_2.fastq: Reads from the second mate in situation (1) identified as NOT belonging to any of the reference databases.
  • seq_kneaddata_unmatched_1.fastq: Reads from the first mate in situation (2) identified as NOT belonging to any of the reference databases.
  • seq_kneaddata_unmatched_2.fastq: Reads from the second mate in situation (3) identified as NOT belonging to any of the reference databases.

Demo Run

The examples folder contains a demo input file. This file is a single read, fastq format.

$ kneaddata --unpaired examples/demo.fastq --reference-db examples/demo_db --output kneaddata_demo_output

This will create four output files:

  1. kneaddata_demo_output/demo_kneaddata.fastq
  2. kneaddata_demo_output/demo_kneaddata_demo_db_bowtie2_contam.fastq
  3. kneaddata_demo_output/demo_kneaddata.log
  4. kneaddata_demo_output/demo_kneaddata.trimmed.fastq

Sequencer Source for trimming Adapter Contents

Kneaddata will use "NexteraPE" adapters provided by trimomatic to trim the adapter contents by default.

The other available options are: ["NexteraPE", "TruSeq2", "TruSeq3","none"]. Based on the source of the sequencer and the FASTQC report, it is highly reccommended to choose the correct sequencer source to ensure the removal of adapter contents by Kneaddata.

Example: Trimmming adapter sequence using TruSeq3 sequencer adapters in the workflow:
kneaddata --unpaired demo.fastq -db demo_db -o kneaddata_output --sequencer-source TruSeq3 --fastqc FastQC
Example: Skipping adapter trimming in the workflow:
kneaddata --unpaired demo.fastq -db demo_db -o kneaddata_output --sequencer-source none --fastqc FastQC

Trim Overrepresented/Repetitive sequences

It is highly recommeded to use --run-trim-repetitive flag for Shotgun sequences (Metatranscriptomics-MTX, Metagenomics-MGX) to trim the overrepresented sequences if shown in FASTQC reports.

However, Kneaddata will not trim the overrepresented sequences by default as Amplicon sequences usually have a large number of repetitive reads resulting in depletion of the read count.

Example: Trimming overrepresented sequences using the Fastqc reports:
kneaddata --unpaired demo.fastq -db demo_db -o kneaddata_output --run-trim-repetitive --fastqc FastQC
Example: Trimming overrepresented sequences and TruSeq3 adapters:
kneaddata --unpaired demo.fastq -db demo_db -o kneaddata_output --run-trim-repetitive --sequencer-source TruSeq3 --fastqc FastQC

Additional Arguments

If you want to specify additional arguments for Bowtie2 using the --bowtie2-options flag, you will need to use the equals sign along with quotes. Add additional flags for each option.

For example:

$ kneaddata --unpaired demo.fastq --output kneaddata_output --reference-db database_folder --bowtie2-options="--very-fast" --bowtie2-options="-p 2"

A similar approach is used to specify additional arguments for Trimmomatic:

$ kneaddata --unpaired demo.fastq --output kneaddata_output --reference-db database_folder --trimmomatic-options="LEADING:3" --trimmomatic-options="TRAILING:3"

NOTE: Manually specifying additional arguments will completely override the defaults.

Also more than one database can be provided for each run. The database argument can contain the folder that includes the database or the prefix of the database files.

For example:

$ kneaddata --unpaired demo.fastq --output kneaddata_output --reference-db database_folder --reference-db database_folder2/demo

Contributions

Thanks go to these wonderful people:

Complete Option List

All options can be accessed with $ kneaddata --help.

usage: kneaddata [-h] [--version] [-v] [-i1 INPUT1] [-i2 INPUT2]
                 [-un UNPAIRED]  -o OUTPUT_DIR
                 [-db REFERENCE_DB] [--bypass-trim] [--run-trim-repetitive]
                 [--output-prefix OUTPUT_PREFIX] [-t <1>] [-p <1>]
                 [-q {phred33,phred64}] [--run-bmtagger]
                 [--run-fastqc-start] [--run-fastqc-end] [--store-temp-output]
                 [--cat-final-output]
                 [--log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL}] [--log LOG]
                 [--trimmomatic TRIMMOMATIC_PATH] [--max-memory MAX_MEMORY]
                 [--trimmomatic-options TRIMMOMATIC_OPTIONS]
                 [--bowtie2 BOWTIE2_PATH] [--bowtie2-options BOWTIE2_OPTIONS]
                 [--bmtagger BMTAGGER_PATH] [--trf TRF_PATH] [--match MATCH]
                 [--mismatch MISMATCH] [--delta DELTA] [--pm PM] [--pi PI]
                 [--minscore MINSCORE] [--maxperiod MAXPERIOD]
                 [--fastqc FASTQC_PATH]

KneadData

optional arguments:
  -h, --help            show this help message and exit
  -v, --verbose         additional output is printed

global options:
  --version             show program's version number and exit
  -i INPUT, --input INPUT
                        input FASTQ file (add a second argument instance to run with paired input files)
  -o OUTPUT_DIR, --output OUTPUT_DIR
                        directory to write output files
  -db REFERENCE_DB, --reference-db REFERENCE_DB
                        location of reference database (additional arguments add databases)
  --run-trim-repetitive Option to trim repetitive/overrepresented sequences generated by FASTQC reports 
  --bypass-trim         bypass the trim step
  --output-prefix OUTPUT_PREFIX
                        prefix for all output files
                        [ DEFAULT : $SAMPLE_kneaddata ]
  -t <1>, --threads <1>
                        number of threads
                        [ Default : 1 ]
  -p <1>, --processes <1>
                        number of processes
                        [ Default : 1 ]
  -q {phred33,phred64}, --quality-scores {phred33,phred64}
                        quality scores
                        [ DEFAULT : phred33 ]
  --run-bmtagger        run BMTagger instead of Bowtie2 to identify contaminant reads
  --bypass-trf          option to bypass the removal of tandem repeats
  --run-fastqc-start    run fastqc at the beginning of the workflow
  --run-fastqc-end      run fastqc at the end of the workflow
  --store-temp-output   store temp output files
                        [ DEFAULT : temp output files are removed ]
  --cat-final-output    concatenate all final output files
                        [ DEFAULT : final output is not concatenated ]
  --log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL}
                        level of log messages
                        [ DEFAULT : DEBUG ]
  --log LOG             log file
                        [ DEFAULT : $OUTPUT_DIR/$SAMPLE_kneaddata.log ]

trimmomatic arguments:
  --trimmomatic TRIMMOMATIC_PATH
                        path to trimmomatic
                        [ DEFAULT : $PATH ]
  --max-memory MAX_MEMORY
                        max amount of memory
                        [ DEFAULT : 500m ]
  --trimmomatic-options TRIMMOMATIC_OPTIONS
                        options for trimmomatic
                        [ DEFAULT : SLIDINGWINDOW:4:20 MINLEN:50 ]
                        MINLEN is set to 50 percent of total input read length. The user can alternatively specify a length (in bases) for MINLEN.
  --sequencer-source    options for sequencer-source
                        [ DEFAULT: NexteraPE]
                        Available sequencers: ["NexteraPE","TruSeq2","TruSeq3"]

bowtie2 arguments:
  --bowtie2 BOWTIE2_PATH
                        path to bowtie2
                        [ DEFAULT : $PATH ]
  --bowtie2-options BOWTIE2_OPTIONS
                        options for bowtie2
                        [ DEFAULT : --very-sensitive ]

bmtagger arguments:
  --bmtagger BMTAGGER_PATH
                        path to BMTagger
                        [ DEFAULT : $PATH ]

trf arguments:
  --bypass-trf          bypass the TRF step
  --trf TRF_PATH        path to TRF
                        [ DEFAULT : $PATH ]
  --match MATCH         matching weight
                        [ DEFAULT : 2 ]
  --mismatch MISMATCH   mismatching penalty
                        [ DEFAULT : 7 ]
  --delta DELTA         indel penalty
                        [ DEFAULT : 7 ]
  --pm PM               match probability
                        [ DEFAULT : 80 ]
  --pi PI               indel probability
                        [ DEFAULT : 10 ]
  --minscore MINSCORE   minimum alignment score to report
                        [ DEFAULT : 50 ]
  --maxperiod MAXPERIOD
                        maximum period size to report
                        [ DEFAULT : 500 ]

fastqc arguments:
  --fastqc FASTQC_PATH  path to fastqc
                        [ DEFAULT : $PATH ]

kneaddata's People

Contributors

carze avatar chuttenh avatar jjensen44 avatar ljmciver avatar sagun98 avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

kneaddata's Issues

version of human genome

Hi,
I read "It is recommended that you download the human (hg37_and_human_contamination) reference database (approx. size = 3.5 GB)". May I ask the version of human genome? Does hg37 mean "GRCh37", which is also known as "hg19"?
Many thanks
Wang

kneaddata outputs disordered reads pairs

Hi,

When I used the kneaddata to remove host reads, I found the output reads pairs are disodered. And BWA aligner gave errors. Please see this issue https://github.com/lh3/bwa/issues/228.

Bowtie 2 doesn't require the condition of mates in the same order, but while aligning disordered FASTQs with Bowtie 2 (as of my experience) you get less mapped reads.

I think it is better to reorder the reads pairs inside the software.

Feature request: add option to compress output fastq files

I cannot find the option in kneaddata to compress the output fastq files. When running kneaddata in parallel, the output uncompressed fastq files can take up >1T disk space in a day.
Could you add an option to kneaddata to compress the output fastq files?

All paired-end MiSeq reads end up unmatched

Running KneadData against MiSeq paired-end reads results in all reads ending up in *_unmatched_1.fq and *_unmatched_2.fq.

This appears to be a regression introduced in commit ecf589d. In particular, the deletion of these lines:

    if " 1:" in lines[0]:
        lines[0]=lines[0].replace(" 1","").rstrip()+"#0/1\n"
    elif " 2:" in lines[0]:
        lines[0]=lines[0].replace(" 2","").rstrip()+"#0/2\n"

The replace(" 1","") and replace(" 2","") used to strip the Illumina read index, so that reformatted read headers became:

@M01336:233:000000000-AJKD1:1:1101:15356:1498:N:0:15#0/1
@M01336:233:000000000-AJKD1:1:1101:15356:1498:N:0:15#0/2

Since commit ecf589d the headers are rewritten to:

@M01336:233:000000000-AJKD1:1:1101:15356:1498.1:N:0:15#0/1
@M01336:233:000000000-AJKD1:1:1101:15356:1498.2:N:0:15#0/2

and therefore don't match up.

Description in FASTA title prevents correct reformatting

I have a file with reads formatted like:

@SRR3132037.1 1 length=142
AAGGCGGCTGTTGTTATGTTATCGGCTCCATGCGAAACAGCAAAATTATTACAGCCCAGCGCATGTTACTGGTTCGCCTGTACCAGTTAGCAGAAAGGCGCGACCGTGTAACAGACGTACCGCGCCCCATCGCTATTTATCT
+SRR3132037.1 1 length=142
/EEEEEEAEE/EEAEAEEEEEEEEEEE/EEEEEEEE6EEEEEEEEAEEEEEAEEAAA/AEEEEEAEEEEEEE/EE/EEAAE/EEEEEEEEAEE//EEE<EEEA/<E</<E/<EAEEAAA<EEEEEEA6EEEEA<E/EEEE<E

It's not a bad file according to https://en.wikipedia.org/wiki/FASTQ_format, and it's a default format from fastq-dump.

It's reformatted as such:

head -n4 reordered_c1gxnbm__reformatted_identifierslaj96u_u_reads reordered_6p35sxlp_reformatted_identifiersz4yibe76_reads_R
==> reordered_c1gxnbm__reformatted_identifierslaj96u_u_reads <==
@SRR3132037.1.11length=142
AAGGCGGCTGTTGTTATGTTATCGGCTCCATGCGAAACAGCAAAATTATTACAGCCCAGCGCATGTTACTGGTTCGCCTGTACCAGTTAGCAGAAAGGCGCGACCGTGTAACAGACGTACCGCGCCCCATCGCTATTTATCT
+SRR3132037.1.1 1 length=142
/EEEEEEAEE/EEAEAEEEEEEEEEEE/EEEEEEEE6EEEEEEEEAEEEEEAEEAAA/AEEEEEAEEEEEEE/EE/EEAAE/EEEEEEEEAEE//EEE<EEEA/<E</<E/<EAEEAAA<EEEEEEA6EEEEA<E/EEEE<E

==> reordered_6p35sxlp_reformatted_identifiersz4yibe76_reads_R <==
@SRR3132037.1.21length=142
AACCAGTTTGAATTTGGCATTTTCAACGACCGCACCGACAACGGCATCGCCTTTCAAATCTGCCGGGCAGTCTTTTAAATCGGCAATGGACTGGTGATCCATAATAATATGTACGCCTTTATCCCGCGCCGCACCTAATCCC
+SRR3132037.1.2 1 length=142
/A/EEEEAEEEEEEEEEEEEEEAEEEEAAEEAEE<6E<EEEAEEEEEEAEAEEEAEAEEEEEAE/AEAEE<EE/EEEAEE/6EEA<EEEE//EEAAEA/E//EEA/A<EA/E/EEEAAE/AE</<6<<6AAE/EAAE<<A66

I've found this issue when running a fork under #16, which assumes the mates of reads are labelled. I think it's still an issue otherwise but not very severe - the kneaddata_bowtie2_discordant_pairs would, I guess, not match SRR3132037.1.11length=142 and SRR3132037.1.21length=142, as a consequence put them in a wrong file and the --strict option there wouldn't work correctly.

I have a workaround for the issue on my end, but if you like I am happy to fix it in the codebase. I see that utilities.get_reformatted_identifiers does some reformatting. I would change it to remove the description, and also format the quality line to just "+".

Running into a python error when running Kneaddata

Hi

I am trying to run kneaddata with the following command:
kneaddata --bypass-trim --input reads_1.fastq.gz --input reads_2.fastq.gz -db ref/test-db --output kneaddata_output --bowtie2-options="--very-sensitive" --bowtie2-options="-p 4"

I am running into the following error:

  Traceback (most recent call last):
    File "/venv/bin/kneaddata", line 11, in <module>
      load_entry_point('kneaddata==0.10.0', 'console_scripts', 'kneaddata')()
    File "/venv/lib/python3.8/site-packages/kneaddata/knead_data.py", line 453, in main
      args.input[index]=utilities.get_reformatted_identifiers(args.input[index],args.output_dir, temp_output_files, args.input)
    File "/venv/lib/python3.8/site-packages/kneaddata/utilities.py", line 362, in get_reformatted_identifiers
      reformat_file=check_sequence_identifier_format(file)
    File "/venv/lib/python3.8/site-packages/kneaddata/utilities.py", line 318, in check_sequence_identifier_format
      first_seq_identifiers_list=get_first_n_seq_identifiers(file,num_seq_to_check)
    File "/venv/lib/python3.8/site-packages/kneaddata/utilities.py", line 351, in get_first_n_seq_identifiers
      lines=next(all_lines)
  StopIteration

I am using python3.8 , and I am running kneaddata via Docker .

Please let me know if you need any other details from me to resolve this.

Thanks
R

kneaddata: output file(paired file) is empty

Hello, I run

"for i in $(ls /home/yxtu/Data/CRC_AOM/*_1.fq.gz); do j=basename $i;
echo "kneaddata -i ${i} -i ${i%%_1.fq.gz}_1.fq.gz -o /home/yxtu/Data/CRC_AOM/kneaddata/output/ -v
--reference-db /home/yxtu/Data/CRC_Apc/KneaddataDB/mouse_C57BL_bowtie2/
--trimmomatic /home/yxtu/software/Trimmomatic-0.38/
--trimmomatic-options 'SLIDINGWINDOW:4:20 MINLEN:50'
-t 9 --bowtie2-options '--very-sensitive --dovetail' --remove-intermediate-output" > /home/yxtu/Data/CRC_AOM/kneaddata/output/${j%%_1.fq.gz}.pbs;done"

but I found my paired file is empty:
image.
How can I solve it?

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x85 in position 28: invalid start byte

Hello,

I created a conda environment where I successfully installed kneaddata 0.12.0, bowtie2 2.5.3, trf 4.09.1 and python 3.10.14.
However, I ran into this error when executing the command "kneaddata -un seq_fastq.zip --reference-db ./databases -t 15 --output ./output_dir":

Traceback (most recent call last):
File "/miniconda3/envs/meta_env/bin/kneaddata", line 8, in
sys.exit(main())
File "/miniconda3/envs/meta_env/lib/python3.10/site-packages/kneaddata/knead_data.py", line 480, in main
file_format=utilities.get_file_format(args.input[0])
File "/miniconda3/envs/meta_env/lib/python3.10/site-packages/kneaddata/utilities.py", line 552, in get_file_format
first_line = file_handle.readline()
File "/miniconda3/envs/meta_env/lib/python3.10/codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x85 in position 28: invalid start byte

I tried different solutions proposed on the web but they didn't work. Do you have any suggestions?

Thank you very much,
Paola Maragno

Options for paired input files

I guess the options (--input1 --input2) should be replaced by --input.

https://huttenhower.sph.harvard.edu/kneaddata/

Paired-End Inputs:
$ kneaddata --input1 $INPUT1 --input2 $INPUT2 --reference-db $DATABASE --output $OUTPUT_DIR

https://github.com/biobakery/kneaddata

$ kneaddata --input1 seq1.fastq --input2 seq2.fastq -db $DATABASE --output kneaddata_output

$ kneaddata --input1 seq1.fastq --input2 seq2.fastq -db bact_rrna_db -db human_rna_db --output seq_out

usage: kneaddata [-h] [--version] [-v] [-i1 INPUT1] [-i2 INPUT2]
$kneaddata -h

usage: kneaddata [-h] [--version] [-v] -i INPUT -o OUTPUT_DIR

  -i INPUT, --input INPUT
                        input FASTQ file (add a second argument instance to run with paired input files)

Trimmomatic does not perform the quality control step

Hi,

I am using kneaddata to remove human contamination from microbial samples. I use the exact argumets for trimmomatic as I run trimmomatic directly on my dataset (-q phred33, --sequencer-source TruSeq3, --trimmomatic-options SLIDINGWINDOW:4:15). But I get different results with kneaddata, meaning that I still have bad quality reads in the output. Whereas if I run trimmomatic directly on the samples, it gets rid of those bad quality reads. Any suggestions?

error in kneaddata

Hi all
I am new to kneaddata
I installed kneaddata using the
https://huttenhower.sph.harvard.edu/kneaddata/
but when I ran the first code, i got an error
kneaddata --input1 /media/sf_Shared_folder/ovas/Final/81_R1.fastq --input2 /media/sf_Shared_folder/ovas/Final/81_R2.fastq --reference-db /media/sf_Shared_folder/Homo_sapiens/Homo_sapiens/UCSC/hg19/Sequence/BowtieIndex/ --output /home/qiime2/Desktop/ovas/
ERROR: Unable to find trimmomatic. Please provide the full path to trimmomatic with --trimmomatic.
then i checked with kneaddata, I have trimmomatic installed
kindly guide
thanks in advance

kneaddata's version can not match

I have the kneaddata installed by running "pip install kneaddata==0.9.0", and it says "Successfully installed kneaddata-0.9.0". However, when I run "kneaddata --version", get the following message "kneaddata v0.8.0", which doesn't match the version I specified in the installation command.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.