
q2-quality-control's Introduction

qiime2 (the QIIME 2 framework)

Source code repository for the QIIME 2 framework.

QIIME 2™ is a powerful, extensible, and decentralized microbiome bioinformatics platform that is free, open source, and community developed. With a focus on data and analysis transparency, QIIME 2 enables researchers to start an analysis with raw DNA sequence data and finish with publication-quality figures and statistical results.

Visit https://qiime2.org to learn more about the QIIME 2 project.

Installation

Detailed instructions are available in the documentation.

Users

Head to the user docs for help getting started, core concepts, tutorials, and other resources.

Just have a question? Please ask it in our forum.

Developers

Please visit the contributing page for more information on contributions, documentation links, and more.

Citing QIIME 2

If you use QIIME 2 for any published research, please include the following citation:

Bolyen E, Rideout JR, Dillon MR, Bokulich NA, Abnet CC, Al-Ghalith GA, Alexander H, Alm EJ, Arumugam M, Asnicar F, Bai Y, Bisanz JE, Bittinger K, Brejnrod A, Brislawn CJ, Brown CT, Callahan BJ, Caraballo-Rodríguez AM, Chase J, Cope EK, Da Silva R, Diener C, Dorrestein PC, Douglas GM, Durall DM, Duvallet C, Edwardson CF, Ernst M, Estaki M, Fouquier J, Gauglitz JM, Gibbons SM, Gibson DL, Gonzalez A, Gorlick K, Guo J, Hillmann B, Holmes S, Holste H, Huttenhower C, Huttley GA, Janssen S, Jarmusch AK, Jiang L, Kaehler BD, Kang KB, Keefe CR, Keim P, Kelley ST, Knights D, Koester I, Kosciolek T, Kreps J, Langille MGI, Lee J, Ley R, Liu YX, Loftfield E, Lozupone C, Maher M, Marotz C, Martin BD, McDonald D, McIver LJ, Melnik AV, Metcalf JL, Morgan SC, Morton JT, Naimey AT, Navas-Molina JA, Nothias LF, Orchanian SB, Pearson T, Peoples SL, Petras D, Preuss ML, Pruesse E, Rasmussen LB, Rivers A, Robeson MS, Rosenthal P, Segata N, Shaffer M, Shiffer A, Sinha R, Song SJ, Spear JR, Swafford AD, Thompson LR, Torres PJ, Trinh P, Tripathi A, Turnbaugh PJ, Ul-Hasan S, van der Hooft JJJ, Vargas F, Vázquez-Baeza Y, Vogtmann E, von Hippel M, Walters W, Wan Y, Wang M, Warren J, Weber KC, Williamson CHD, Willis AD, Xu ZZ, Zaneveld JR, Zhang Y, Zhu Q, Knight R, and Caporaso JG. 2019. Reproducible, interactive, scalable and extensible microbiome data science using QIIME 2. Nature Biotechnology 37:852–857. https://doi.org/10.1038/s41587-019-0209-9

q2-quality-control's People

Contributors

andrewsanchez, chriskeefe, david-rod, ebolyen, epruesse, gregcaporaso, gwarmstrong, jairideout, lizgehret, misialq, nbokulich, q2d2, thermokarst, turanoo


q2-quality-control's Issues

plugin documentation suggestions:

The threads parameter should be range-bounded on the lower end. Also, should its description note that this is only relevant for vsearch?

https://github.com/nbokulich/q2-quality-control/blob/6d89fbebe9557fde5c7a1ffef1c542564d8e551d/q2_quality_control/plugin_setup.py#L38

The following:
https://github.com/nbokulich/q2-quality-control/blob/6d89fbebe9557fde5c7a1ffef1c542564d8e551d/q2_quality_control/plugin_setup.py#L50

would be clearer as: BLAST expectation (E) value threshold

This description should also mention the E-value threshold and how it interacts with percent identity (i.e., do both need to pass their respective thresholds?). Somewhat related: should it be possible to disable the percent identity filter and filter only on E-value?

https://github.com/nbokulich/q2-quality-control/blob/6d89fbebe9557fde5c7a1ffef1c542564d8e551d/q2_quality_control/plugin_setup.py#L64

exclude-seqs vsearch fails to parse with imported fasta

Bug Description
Not sure what the issue is yet, but see forum xref for steps to reproduce and error report.

I can reproduce this error locally with the data.

vsearch is probably outputting blank lines or for some other reason the lines are not being parsed correctly:

File "/home/qiime2/miniconda/envs/qiime2-2019.1/lib/python3.6/site-packages/q2_quality_control/_blast.py", line 85, in _extract_hits
query_id, subject_id, query_len, start, end = line.split('\t')
ValueError: not enough values to unpack (expected 5, got 1)
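
If blank lines really are the cause, one way to make the parsing step tolerant of them is sketched below. This is only an illustration: `extract_hit_ids` is a hypothetical helper, not the plugin's `_extract_hits`, and the five-column layout is assumed from the unpacking shown in the traceback.

```python
def extract_hit_ids(lines):
    """Return the set of query IDs found in tab-separated search output.

    Assumes five columns per record (as in the traceback above). Blank
    lines are skipped; any other malformed line raises an error that
    names the offending line, which makes the failure easier to debug.
    """
    hit_ids = set()
    for line_number, line in enumerate(lines, start=1):
        line = line.rstrip('\n')
        if not line.strip():
            continue  # tolerate blank lines instead of failing to unpack
        fields = line.split('\t')
        if len(fields) != 5:
            raise ValueError(
                'line %d: expected 5 tab-separated fields, got %d: %r'
                % (line_number, len(fields), line))
        hit_ids.add(fields[0])  # first column is the query ID
    return hit_ids


example = ['q1\ts1\t100\t1\t100\n', '\n', 'q2\ts9\t120\t5\t110\n']
print(sorted(extract_hit_ids(example)))
```

Raising on other malformed lines, rather than skipping them silently, keeps unexpected output from the search tool visible.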

`evaluate_composition`: multiple score plots in viz?

Improvement Description
from @gregcaporaso :

I think the plot-observed-features* data should be in another plot - when this is added to the first plot the scale of the y-axis doesn't really make sense anymore, so if a user wants to view this info and the other info in the plot, they're probably going to need to run this command twice, once with and once without this parameter. This would be nice to fix in this PR, but isn't essential.

test additions

@nbokulich, the additional tests that you need here include (you might think of others as you go):

  • vary input and output (or parameters) so all queries hit
  • vary input and output (or parameters) so all queries miss
  • vary input and output (or parameters) so some queries hit and some miss (this may be what you already have)
  • increasing E-value threshold causes some queries to hit that previously missed
  • decreasing E-value threshold causes some queries to miss that previously hit
  • increasing percent id threshold causes some queries to hit that previously missed
  • decreasing percent id threshold causes some queries to miss that previously hit
  • searching query sequences against query sequences results in all queries hitting
  • searching query sequences against non-homologous subject sequences results in all queries missing
  • in no cases does the same sequence show up as a hit and a miss
  • in all cases, the number of hits plus the number of misses is equal to the number of queries

All of these should be performed for blast and vsearch, as applicable.
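
The last two items in the list are invariants that can be checked after every search, whatever the method or thresholds. A minimal sketch of such a check follows; `check_hits_and_misses` is a hypothetical test helper and takes plain collections of IDs rather than the plugin's actual return types.

```python
def check_hits_and_misses(query_ids, hit_ids, miss_ids):
    """Assert that hits and misses form a clean partition of the queries."""
    queries, hits, misses = set(query_ids), set(hit_ids), set(miss_ids)
    both = hits & misses
    # No sequence may be reported as both a hit and a miss.
    assert not both, 'reported as hit and miss: %s' % sorted(both)
    # Hits plus misses must account for every query exactly once.
    assert len(hits) + len(misses) == len(queries), (
        '%d hits + %d misses != %d queries'
        % (len(hits), len(misses), len(queries)))
    assert hits | misses == queries, 'hit/miss IDs differ from query IDs'


# Example with made-up IDs: two hits, one miss, three queries.
check_hits_and_misses(['a', 'b', 'c'], ['a', 'c'], ['b'])
```

Calling the same helper at the end of each blast and vsearch test case would cover both invariants without repeating the assertions.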

`evaluate_seqs`: add optional input of `FeatureData[Taxonomy]` for labeling features

Improvement Description
Add an optional input for FeatureData[Taxonomy] corresponding to the reference sequences, so the user can see the taxonomy of the best matches.

Comments
This could be useful with larger values of max-accepts, as it lets the user see to what extent the best-matching alignments share their taxonomic annotation (i.e., they can evaluate the "consensus taxonomy" on their own, which is something we want to encourage for features of interest).

remove private import(s)

Found this example: https://github.com/nbokulich/q2-quality-control/blob/6d89fbebe9557fde5c7a1ffef1c542564d8e551d/q2_quality_control/quality_control.py#L12

This method could disappear or change its API at any time since it's private, so it's not safe to import this. If there is not a public API for the functionality that you need, another option would be to copy that function (adding a note about where it came from) to this repo, and post an issue on the source repo to see if the API could be made public in the future.

Exact sequence start match filter for exclude-seqs

In order to implement a pipeline that can perform the steps of filtering sequences and a table for bloom sequences (See qiime2/q2-feature-table#218), I need an "alignment" method that matches query sequences to reference sequences only if the entire query sequence (length n) matches the first n bases in the reference sequence.

I do not believe this is currently possible with exclude-seqs, or at least, I am not aware of how to manipulate the BLAST or VSEARCH settings to make this possible.

I have two potential ideas for how to solve this problem:

  1. Implement my search as an alignment method available to _blast._search_seqs, and consequently to exclude-seqs
  2. Implement a unilateral exclude-seqs-like method for the filtering pipeline that performs the type of matching I need.

Can you let me know which is preferred?
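
For reference, the matching rule described above (a query of length n hits only if it equals the first n bases of some reference) can be sketched in plain Python. The function names and the dict-of-strings inputs are assumptions made for illustration; this is not tied to either proposed option.

```python
def matches_reference_start(query, references):
    """True if the entire query equals the leading bases of any reference."""
    return any(ref.startswith(query) for ref in references)


def split_by_start_match(queries, references):
    """Partition {id: sequence} queries into (hits, misses) dicts."""
    hits, misses = {}, {}
    for seq_id, seq in queries.items():
        target = hits if matches_reference_start(seq, references) else misses
        target[seq_id] = seq
    return hits, misses


references = ['ACGTACGT', 'TTGACC']
queries = {
    'a': 'ACGT',     # equals the first 4 bases of the first reference
    'b': 'CGTA',     # occurs inside a reference, but not at its start
    'c': 'TTGACCA',  # longer than the reference it begins like
}
hits, misses = split_by_start_match(queries, references)
print(sorted(hits), sorted(misses))
```

Note that this compares strings exactly, so it does not handle ambiguous bases or reverse complements, and it scans every reference for every query.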

optimize creation of result objects in exclude_seqs

This section of exclude_seqs:

https://github.com/nbokulich/q2-quality-control/blob/6d89fbebe9557fde5c7a1ffef1c542564d8e551d/q2_quality_control/quality_control.py#L33

should probably be optimized, as it's going to require iterating multiple times over query_series, and it's also going to read all of res into memory. Could you rewrite this to read all of the hit identifiers from res into a set using Python's csv.reader, then iterate over feature_sequences (maybe seek to the beginning of the file and use the scikit-bio FASTA parser, so you don't need the private import I mentioned in #2), storing each sequence in hits_seqs or misses_seqs as you go, and then create the pd.Series objects on return?

For example, something like (this isn't tested):

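import csv

import pandas as pd
import skbio

hits_seqs = {}
misses_seqs = {}
# Assumes res is tab-separated (as parsed in _extract_hits), with the ID
# of the query that hit in the first column.
hit_ids = {row[0] for row in csv.reader(res, delimiter='\t') if row}
feature_sequences.seek(0)
for seq in skbio.io.read(feature_sequences, format='fasta'):
    seq_id = seq.metadata['id']
    if seq_id in hit_ids:
        hits_seqs[seq_id] = str(seq)
    else:
        misses_seqs[seq_id] = str(seq)

# Build each Series once, after a single pass over the sequences.
return pd.Series(hits_seqs), pd.Series(misses_seqs)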

`filter_reads` returns no sequences on single-end reads input

Bug Description
When using filter_reads to remove human sequences from single-end reads, the artifact containing the filtered sequences is empty. The same does not happen when filtering paired-end reads (i.e., everything works as expected). When examining bowtie2's output, I can see that the majority of my reads were not human, so they should be included in the output.

Steps to reproduce the behavior

  1. Fetch the sequences from here and the reference from here.
  2. Run qiime quality-control filter-reads using sample-reads-single.qza as input and human_ref_grch38.qza as database.
  3. Summarize and visualize the resulting artifact using qiime demux summarize.
  4. Look at the charts in the visualization.
  5. Repeat steps 2-4 with the paired sequences (provided artifact contains the very same sequences as the single-end one, just duplicated to create a fake paired-end artifact).

Expected behavior
Samples should contain reads in both cases.

Actual behavior
No reads were found when single-end reads were used as input. Zero.
When paired-end reads were filtered, the majority of them were retained (expected).

Computation Environment

  • OS: macOS Monterey
  • QIIME 2 Release: 2021.11

Comments
After a closer look I can see that the problem arises when samtools fastq is invoked. The two additional outputs there are redirected to /dev/null but that's exactly where the output goes (and hence is lost). I reproduced this behavior with a couple of different single-end inputs - every time with the same result. It would seem that the SAM flags are not set correctly (by bowtie2?) and so the fastq command sends those to a different file (some more discussion on that here).
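
To illustrate why the flags matter, here is a simplified model of how samtools fastq picks an output file from a read's SAM flag, as I understand its documentation: reads with only the READ1 bit (0x40) go to `-1`, reads with only the READ2 bit (0x80) go to `-2`, and reads with neither or both bits go to `-0`. The helper name is hypothetical and the sketch ignores the `-s` singleton option.

```python
READ1 = 0x40  # SAM flag bit: first segment in the template
READ2 = 0x80  # SAM flag bit: last segment in the template


def fastq_destination(flag):
    """Name the samtools fastq output option a read with this flag would use.

    Simplified model: the -s (singleton) option is not considered.
    """
    is_read1 = bool(flag & READ1)
    is_read2 = bool(flag & READ2)
    if is_read1 and not is_read2:
        return '-1'
    if is_read2 and not is_read1:
        return '-2'
    return '-0'  # neither or both bits set


# 4 = unmapped single-end read; 77 and 141 = unmapped pair (mate 1, mate 2).
for flag in (4, 77, 141):
    print(flag, fastq_destination(flag))
```

Under this model, an unmapped single-end read (flag 4) carries neither bit and lands in the `-0` output, which would explain the loss if that output is the one redirected to /dev/null.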

Questions
Should all these outputs maybe just be stored as artifacts instead?

`evaluate_composition`: fix _validate_metadata_values_are_subset error message

table_ids.difference(metadata_ids) should read metadata_ids.difference(table_ids)

Otherwise, if _validate_metadata_values_are_subset raises an error, it reports an empty set! (instead of the metadata values that are not represented in the expected composition table)

Better yet, this function should be improved to check only metadata rows that intersect with the observed composition table. Otherwise, metadata cannot be a superset of the observed composition table, which is a nuisance.
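
A small example with made-up sample IDs shows why the current direction of the difference reports an empty set, and what the proposed direction reports instead. The helper name `missing_from_table` is hypothetical.

```python
def missing_from_table(metadata_ids, table_ids):
    """Return the metadata values that the table does not contain."""
    return set(metadata_ids).difference(table_ids)


metadata_ids = {'sample1', 'sample2', 'sample3'}
table_ids = {'sample1', 'sample2'}

# Direction used in the traceback: table IDs absent from the metadata.
print(table_ids.difference(metadata_ids))          # set()
# Proposed direction: metadata values absent from the table.
print(missing_from_table(metadata_ids, table_ids))  # {'sample3'}
```

Restricting the check to the metadata rows that intersect the observed table, as suggested above, would be a separate step applied to `metadata_ids` before this comparison.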

Full error:

Traceback (most recent call last):
  File "/Users/nbokulich/miniconda3/envs/qiime2-2017.12/lib/python3.5/site-packages/q2cli/commands.py", line 224, in __call__
    results = action(**arguments)
  File "<decorator-gen-165>", line 2, in evaluate_composition
  File "/Users/nbokulich/miniconda3/envs/qiime2-2017.12/lib/python3.5/site-packages/qiime2/sdk/action.py", line 228, in bound_callable
    output_types, provenance)
  File "/Users/nbokulich/miniconda3/envs/qiime2-2017.12/lib/python3.5/site-packages/qiime2/sdk/action.py", line 424, in _callable_executor_
    ret_val = self._callable(output_dir=temp_dir, **view_args)
  File "/Users/nbokulich/miniconda3/envs/qiime2-2017.12/lib/python3.5/site-packages/q2_quality_control/quality_control.py", line 69, in evaluate_composition
    plot_observed_features_ratio=plot_observed_features_ratio)
  File "/Users/nbokulich/miniconda3/envs/qiime2-2017.12/lib/python3.5/site-packages/q2_quality_control/_utilities.py", line 83, in _evaluate_composition
    _validate_metadata_values_are_subset(metadata, exp)
  File "/Users/nbokulich/miniconda3/envs/qiime2-2017.12/lib/python3.5/site-packages/q2_quality_control/_utilities.py", line 45, in _validate_metadata_values_are_subset
    table_ids.difference(metadata_ids))
ValueError: Missing samples in table: set()

IMP: expose 'strand' argument to `exclude-seqs`

Improvement Description
For the plugin, incorporate a strand argument, passed to blast or vsearch, that controls whether the reverse complement is also searched: 'strand': Str % Choices(['both', 'plus'])
This is invoked in the following lines:

https://github.com/qiime2/q2-feature-classifier/blob/35a518104a2f71fe2c72e0897345ccfe776dfadc/q2_feature_classifier/_vsearch.py#L131-L137

I think it makes sense to include this functionality in this plugin.

Current Behavior
Only both-strand search is enabled for vsearch and blast.

Proposed Behavior
Provide an additional parameter that allows searching the plus strand with these commands.
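
A rough sketch of how the parameter might be translated into command-line arguments is shown below. `strand_args` is a hypothetical helper, not existing plugin code; it relies on vsearch accepting `--strand plus|both` and blastn accepting `-strand plus|both` (blastn also accepts `minus`, which is left out here to match the proposed choices).

```python
STRAND_CHOICES = ('both', 'plus')


def strand_args(method, strand):
    """Return the command-line arguments that select the search strand."""
    if strand not in STRAND_CHOICES:
        raise ValueError('strand must be one of %s, got %r'
                         % (', '.join(STRAND_CHOICES), strand))
    if method == 'vsearch':
        return ['--strand', strand]
    if method == 'blast':
        return ['-strand', strand]
    raise ValueError('unknown search method: %r' % method)


print(strand_args('vsearch', 'plus'))
print(strand_args('blast', 'both'))
```

The returned list would be appended to whatever command the plugin already builds for each method.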

Questions

  1. Do you think this feature is a reasonable fit for this plugin?

_blast.py suggestions

I strongly recommend making all of the parameters required (i.e., don't define defaults) in the private functions in quality_control.py (and elsewhere, if applicable). Otherwise, it's really easy to accidentally forget to pass one of these when you call one of these functions from elsewhere in the code base, which in turn would make it not possible to override defaults. Defining defaults on private functions is something I now try to avoid.

https://github.com/nbokulich/q2-quality-control/blob/6d89fbebe9557fde5c7a1ffef1c542564d8e551d/q2_quality_control/_blast.py#L15

For clarity, rename _blast_seqs.py to something else (e.g., _search_seqs.py) - I thought this only supported BLAST.

https://github.com/nbokulich/q2-quality-control/blob/6d89fbebe9557fde5c7a1ffef1c542564d8e551d/q2_quality_control/_blast.py#L14

Could _extract_hits just return a set of ids that hit the reference? I don't think you use the other information at this time.

https://github.com/nbokulich/q2-quality-control/blob/6d89fbebe9557fde5c7a1ffef1c542564d8e551d/q2_quality_control/_blast.py#L59

evaluate-composition: R2 values not computing correctly

Bug Description
Linear regression R2 values appear to be miscalculated at level 1. Other levels appear correct (based on the data provided, see xref below).

scipy's linear regression function (which is being used to calculate R here) fails if there is only one measurement each in the expected and observed data:

> linregress([1], [1])
LinregressResult(slope=nan, intercept=nan, rvalue=0.0, pvalue=nan, stderr=nan)
> linregress([1, 2], [1, 2])
LinregressResult(slope=1.0, intercept=0.0, rvalue=1.0, pvalue=0.0, stderr=0.0)

rvalue in the first example should be nan, not 0.0
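
One possible guard is to treat the correlation as undefined whenever there are fewer than two points or either variable has no variance. The sketch below does this in plain Python rather than with scipy; `r_squared` is a hypothetical helper and not the plugin's code.

```python
import math


def r_squared(expected, observed):
    """Squared Pearson correlation, or nan when it is undefined."""
    n = len(expected)
    if n != len(observed):
        raise ValueError('expected and observed must have the same length')
    if n < 2:
        return math.nan  # a single point cannot define a regression line
    mean_x = sum(expected) / n
    mean_y = sum(observed) / n
    sxx = sum((x - mean_x) ** 2 for x in expected)
    syy = sum((y - mean_y) ** 2 for y in observed)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(expected, observed))
    if sxx == 0 or syy == 0:
        return math.nan  # no variance in one variable: r is undefined
    return (sxy * sxy) / (sxx * syy)


print(r_squared([1], [1]))
print(r_squared([1, 2], [1, 2]))
```

Whether the plot should then show nan or some other value for a single-taxon level is a separate decision for the caller.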

Screenshots
Level 1 should be R2=1.0 in this plot (since all observed and expected taxa are bacteria).
image

References
forum xref
