qiime2 / q2-quality-control Goto Github PK

View Code? Open in Web Editor NEW

4.0 4.0 17.0 1.04 MB

A QIIME 2 plugin for quality control of sequence and feature data

License: BSD 3-Clause "New" or "Revised" License

Python 95.83% Makefile 0.13% HTML 3.21% TeX 0.82%

hacktoberfest

q2-quality-control's Introduction

qiime2 (the QIIME 2 framework)

Source code repository for the QIIME 2 framework.

QIIME 2™ is a powerful, extensible, and decentralized microbiome bioinformatics platform that is free, open source, and community developed. With a focus on data and analysis transparency, QIIME 2 enables researchers to start an analysis with raw DNA sequence data and finish with publication-quality figures and statistical results.

Visit https://qiime2.org to learn more about the QIIME 2 project.

Installation

Detailed instructions are available in the documentation.

Users

Head to the user docs for help getting started, core concepts, tutorials, and other resources.

Just have a question? Please ask it in our forum.

Developers

Please visit the contributing page for more information on contributions, documentation links, and more.

Citing QIIME 2

If you use QIIME 2 for any published research, please include the following citation:

Bolyen E, Rideout JR, Dillon MR, Bokulich NA, Abnet CC, Al-Ghalith GA, Alexander H, Alm EJ, Arumugam M, Asnicar F, Bai Y, Bisanz JE, Bittinger K, Brejnrod A, Brislawn CJ, Brown CT, Callahan BJ, Caraballo-Rodríguez AM, Chase J, Cope EK, Da Silva R, Diener C, Dorrestein PC, Douglas GM, Durall DM, Duvallet C, Edwardson CF, Ernst M, Estaki M, Fouquier J, Gauglitz JM, Gibbons SM, Gibson DL, Gonzalez A, Gorlick K, Guo J, Hillmann B, Holmes S, Holste H, Huttenhower C, Huttley GA, Janssen S, Jarmusch AK, Jiang L, Kaehler BD, Kang KB, Keefe CR, Keim P, Kelley ST, Knights D, Koester I, Kosciolek T, Kreps J, Langille MGI, Lee J, Ley R, Liu YX, Loftfield E, Lozupone C, Maher M, Marotz C, Martin BD, McDonald D, McIver LJ, Melnik AV, Metcalf JL, Morgan SC, Morton JT, Naimey AT, Navas-Molina JA, Nothias LF, Orchanian SB, Pearson T, Peoples SL, Petras D, Preuss ML, Pruesse E, Rasmussen LB, Rivers A, Robeson MS, Rosenthal P, Segata N, Shaffer M, Shiffer A, Sinha R, Song SJ, Spear JR, Swafford AD, Thompson LR, Torres PJ, Trinh P, Tripathi A, Turnbaugh PJ, Ul-Hasan S, van der Hooft JJJ, Vargas F, Vázquez-Baeza Y, Vogtmann E, von Hippel M, Walters W, Wan Y, Wang M, Warren J, Weber KC, Williamson CHD, Willis AD, Xu ZZ, Zaneveld JR, Zhang Y, Zhu Q, Knight R, and Caporaso JG. 2019. Reproducible, interactive, scalable and extensible microbiome data science using QIIME 2. Nature Biotechnology 37:852–857. https://doi.org/10.1038/s41587-019-0209-9

q2-quality-control's People

Contributors

Stargazers

Watchers

Forkers

gregcaporaso nbokulich jakereps jairideout ebolyen epruesse chriskeefe turanoo gwarmstrong andrewsanchez oddant1 misialq lizgehret alienzj jordenrabasco

q2-quality-control's Issues

plugin documentation suggestions:

threads parameter should be a range bound on the lower end. Also, note that this is only relevant for vsearch?

https://github.com/nbokulich/q2-quality-control/blob/6d89fbebe9557fde5c7a1ffef1c542564d8e551d/q2_quality_control/plugin_setup.py#L38

The following:
https://github.com/nbokulich/q2-quality-control/blob/6d89fbebe9557fde5c7a1ffef1c542564d8e551d/q2_quality_control/plugin_setup.py#L50

would be more clear as: BLAST expectation (E) value threshold

This description should mention the E-value threshold too, and how that interacts with percent identity (i.e., do both need to be above their respective thresholds)? Sort of related, but should it be possible to disable the percent identity filter, and only filter based on E-value?

https://github.com/nbokulich/q2-quality-control/blob/6d89fbebe9557fde5c7a1ffef1c542564d8e551d/q2_quality_control/plugin_setup.py#L64

exclude-seqs vsearch fails to parse with imported fasta

Bug Description
Not sure what the issue is yet, but see forum xref for steps to reproduce and error report.

I can reproduce this error locally with the data.

vsearch is probably outputting blank lines or for some other reason the lines are not being parsed correctly:

File “/home/qiime2/miniconda/envs/qiime2-2019.1/lib/python3.6/site-packages/q2_quality_control/_blast.py”, line 85, in _extract_hits
query_id, subject_id, query_len, start, end = line.split(’\t’)
ValueError: not enough values to unpack (expected 5, got 1)

`evaluate_composition`: multiple score plots in viz?

Improvement Description
from @gregcaporaso :

I think the plot-observed-features* data should be in another plot - when this is added to the first plot the scale of the y-axis doesn't really make sense anymore, so if a user wants to view this info and the other info in the plot, they're probably going to need to run this command twice, once with and once without this parameter. This would be nice to fix in this PR, but isn't essential.

test additions

@nbokulich, the additional tests that you need here include (you might think of others as you go):

vary input and output (or parameters) so all queries hit
vary input and output (or parameters) so all queries miss
vary input and output (or parameters) so some queries hit and some miss (this may be what you already have)
increasing E-value threshold causes some queries to hit that previously missed
decreasing E-value threshold causes some queries to miss that previously hit
increasing percent id threshold causes some queries to hit that previously missed
decreasing percent id threshold causes some queries to miss that previously hit
searching query sequences against query sequences results in all queries hitting
searching query sequences against non-homologous subject sequences results in all queries missing
in no cases does the same sequence show up as a hit and a miss
in all cases, the number of hits plus the number of misses is equal to the number of queries

All of these should be performed for blast and vsearch, as applicable.

remove commented line(s)

Found this example:

https://github.com/nbokulich/q2-quality-control/blob/6d89fbebe9557fde5c7a1ffef1c542564d8e551d/q2_quality_control/quality_control.py#L10

filter-reads adds too many file extensions

Bug Description
filter-reads writes out demux fastq filenames with an extra fastq.gz on the end: sample_b_S02_L001_R2_001.fastq.gz.fastq.gz

Steps to reproduce the behavior

https://busywork.qiime2.org/teams/main/pipelines/master-distribution/jobs/unit-test-q2-quality-control/builds/97

Comments

This was caught by @Oddant1's qiime2/qiime2#538, which tightened up validation of filenames in formats

`evaluate_seqs`: add `max-accepts` parameter

Improvement Description
add max-accepts option, defaulting to 1, which shows the top max-accepts pairwise alignment results

`evaluate_seqs`: add optional input of `FeatureData[Taxonomy]` for labeling features

Improvement Description
add optional input for FeatureData[Taxonomy] corresponding to the reference sequences, so the user can see what the taxonomy of the best matches is.

Comments
this could be useful with larger values of max-accepts, as it lets the user see to what extend the best matching alignments share their taxonomic annotation (i.e., they can evaluate the "consensus taxonomy" on their own, which is something we want to encourage for features of interest)

`evaluate_composition`: add bray-curtis dissimilarity

remove private import(s)

Found this example: https://github.com/nbokulich/q2-quality-control/blob/6d89fbebe9557fde5c7a1ffef1c542564d8e551d/q2_quality_control/quality_control.py#L12

This method could disappear or change its API at any time since it's private, so it's not safe to import this. If there is not a public API for the functionality that you need, another option would be to copy that function (adding a note about where it came from) to this repo, and post an issue on the source repo to see if the API could be made public in the future.

remove `vsearch` option

It looks like vsearch isn't supported anymore, so this parameter should be removed:

https://github.com/nbokulich/q2-quality-control/blob/6d89fbebe9557fde5c7a1ffef1c542564d8e551d/q2_quality_control/plugin_setup.py#L35

Add Citations

Should use the new citation API in qiime2/qiime2#387

Exact sequence start match filter for exlude-seqs

In order to implement a pipeline that can perform the steps of filtering sequences and a table for bloom sequences (See qiime2/q2-feature-table#218), I need an "alignment" method that matches query sequences to reference sequences only if the entire query sequence (length n) matches the first n bases in the reference sequence.

I do not believe this is currently possible with exclude-seqs, or at least, I am not aware of how to manipulate the BLAST or VSEARCH settings to make this possible.

I have two potential ideas for how to solve this problem:

Implement my search as an alignment method available to _blast._search_seqs, and consequently to exclude-seqs
Implement a unilateral exclude-seqs-like method for the filtering pipeline that performs the type of matching I need.

Can you let me know which is preferred?

optimize creation of result objects in exclude_seqs

This section of exclude_seqs:

https://github.com/nbokulich/q2-quality-control/blob/6d89fbebe9557fde5c7a1ffef1c542564d8e551d/q2_quality_control/quality_control.py#L33

should probably be optimized as it's going to require iterating multiple times over query_series, and it's also going to read all of res into memory. Could you rewrite this to read all of the hit identifiers from res into a set using Python's csv.reader, and then iterate over feature_sequences (maybe seek to the beginning of the file and use the scikit-bio fasta parser so you don't need the private import I mentioned in #2), storing each sequence in lists of hit_seqs or misses_seqs as you go, and then create the pd.Series objects on return.

For example, something like (this isn't tested):

hits_seqs = {}
misses_seqs={}
hit_ids = {e[0] for e in csv.reader(res)}
feature_sequences.seek(0)
for seq in skbio.io.read(feature_sequences):
    seq_id = seq.metadata['id']
    seq = str(seq)
    if seq_id in hit_ids:
        hits_seqs[seq_id] = seq
    else:
        misses_seqs[seq_id] = seq

return pd.Series(hits_seqs), pd.Series(misses_seqs)

`evaluate-seqs`: replace histograms with scatterplot of similarity X coverage X abundance

Improvement Description
Replace (or supplement) histograms of mismatches X count with similarity X coverage X abundance

Current Behavior
Histograms of mismatches X count are less effective if reads have different lengths (and hence mismatches indicate different % similarity)

Proposed Behavior
Replace or supplement with a better plot (see forum link below)

References
forum xref

`filter_reads` returns no sequences on single-end reads input

Bug Description
When using filter_reads to remove human sequences from the single-end reads, the artifact containing filtered sequences is empty. The same does not happen when filtering paired-end reads (i.e., everything works as expected). When examining bowties2's output I can see that the majority of my reads were not human and so they should be included in the output.

Steps to reproduce the behavior

Fetch the sequences from here and the reference from here.
Run qiime quality-control filter-reads using sample-reads-single.qza as input and human_ref_grch38.qza as database.
Summarize and visualize the resulting artifact using qiime demux summarize.
Look at the charts in the visualization.
Repeat steps 2-4 with the paired sequences (provided artifact contains the very same sequences as the single-end one, just duplicated to create a fake paired-end artifact).

Expected behavior
Samples should contain reads in both cases.

Actual behavior
No reads were found when single-reads were used as input. Zero.
When paired-end reads were filtered, the majority of those was retained (expected).

Computation Environment

OS: macOS Monterey
QIIME 2 Release: 2021.11

Comments
After a closer look I can see that the problem arises when samtools fastq is invoked. The two additional outputs there are redirected to /dev/null but that's exactly where the output goes (and hence is lost). I reproduced this behavior with a couple of different single-end inputs - every time with the same result. It would seem that the SAM flags are not set correctly (by bowtie2?) and so the fastq command sends those to a different file (some more discussion on that here).

Questions
Should all these outputs maybe just be stored as artifacts instead?

BUG: TAR/TDR values are getting switched

forum xref

`evaluate_composition`: fix _validate_metadata_values_are_subset error message

table_ids.difference(metadata_ids) should read metadata_ids.difference(table_ids)

Otherwise, if _validate_metadata_values_are_subset raises an error, it reports an empty set! (instead of the metadata values that are not represented in the expected composition table)

Better yet, this function should be improved to check only metadata rows that intersect with the observed composition table. Otherwise, metadata cannot be a superset of the observed composition table, which is a nuisance.

Full error:

Traceback (most recent call last):
  File "/Users/nbokulich/miniconda3/envs/qiime2-2017.12/lib/python3.5/site-packages/q2cli/commands.py", line 224, in __call__
    results = action(**arguments)
  File "<decorator-gen-165>", line 2, in evaluate_composition
  File "/Users/nbokulich/miniconda3/envs/qiime2-2017.12/lib/python3.5/site-packages/qiime2/sdk/action.py", line 228, in bound_callable
    output_types, provenance)
  File "/Users/nbokulich/miniconda3/envs/qiime2-2017.12/lib/python3.5/site-packages/qiime2/sdk/action.py", line 424, in _callable_executor_
    ret_val = self._callable(output_dir=temp_dir, **view_args)
  File "/Users/nbokulich/miniconda3/envs/qiime2-2017.12/lib/python3.5/site-packages/q2_quality_control/quality_control.py", line 69, in evaluate_composition
    plot_observed_features_ratio=plot_observed_features_ratio)
  File "/Users/nbokulich/miniconda3/envs/qiime2-2017.12/lib/python3.5/site-packages/q2_quality_control/_utilities.py", line 83, in _evaluate_composition
    _validate_metadata_values_are_subset(metadata, exp)
  File "/Users/nbokulich/miniconda3/envs/qiime2-2017.12/lib/python3.5/site-packages/q2_quality_control/_utilities.py", line 45, in _validate_metadata_values_are_subset
    table_ids.difference(metadata_ids))
ValueError: Missing samples in table: set()

convert `evaluate_*` visualizers to pipelines

IMP: expose 'strand' argument to `exclude-seqs`

Improvement Description
For the plugin, incorporate a strand argument to be fed to blast or vsearch that can limit whether the reverse complement is aligned against: 'strand': Str % Choices(['both', 'plus'])
This is invoked in the following lines

https://github.com/qiime2/q2-feature-classifier/blob/35a518104a2f71fe2c72e0897345ccfe776dfadc/q2_feature_classifier/_vsearch.py#L131-L137

And I think makes reasonable sense to include as a functionality in this plugin.

Current Behavior
Only both strand search is enabled for vsearch and blast

Proposed Behavior
Provide an additional parameter that allows searching the plus strand with these commands.

Questions

Do you think this feature is a reasonable fit for this plugin?

ENH: add MCC to evaluation metrics

Improvement Description
MCC is useful for scoring imbalanced classification problems and should be added to the panel of eval metrics.

Current Behavior
No MCC

Proposed Behavior
add sklearn implementation of MCC metric

_blast.py suggestions

I strongly recommend making all of the parameters required (i.e., don't define defaults) in the private functions in quality_control.py (and elsewhere, if applicable). Otherwise, it's really easy to accidentally forget to pass one of these when you call one of these functions from elsewhere in the code base, which in turn would make it not possible to override defaults. Defining defaults on private functions is something I now try to avoid.

https://github.com/nbokulich/q2-quality-control/blob/6d89fbebe9557fde5c7a1ffef1c542564d8e551d/q2_quality_control/_blast.py#L15

For clarity, rename _blast_seqs.py to something else (e.g., _search_seqs.py) - I thought this only supported BLAST.

https://github.com/nbokulich/q2-quality-control/blob/6d89fbebe9557fde5c7a1ffef1c542564d8e551d/q2_quality_control/_blast.py#L14

Could _extract_hits just return a set of ids that hit the reference? I don't think you use the other information at this time.

https://github.com/nbokulich/q2-quality-control/blob/6d89fbebe9557fde5c7a1ffef1c542564d8e551d/q2_quality_control/_blast.py#L59

evaluate-composition: R2 values not computing correctly

Bug Description
Linear regression R2 values appear to be miscalculating at level 1. Other levels appear correct (based on the data provided, see xref below).

scipy’s linear regression function (which is being used to calculate R here) fails if there is only one measurement each in the expected and observed

> linregress([1], [1])
LinregressResult(slope=nan, intercept=nan, rvalue=0.0, pvalue=nan, stderr=nan)
> linregress([1, 2], [1, 2])
LinregressResult(slope=1.0, intercept=0.0, rvalue=1.0, pvalue=0.0, stderr=0.0)

rvalue in the first example should be nan, not 0.0

Screenshots
Level 1 should be R2=1.0 in this plot (since all observed and expected taxa are bacteria).