qiime2 / q2-quality-control Goto Github PK
View Code? Open in Web Editor NEWA QIIME 2 plugin for quality control of sequence and feature data
License: BSD 3-Clause "New" or "Revised" License
A QIIME 2 plugin for quality control of sequence and feature data
License: BSD 3-Clause "New" or "Revised" License
Improvement Description
add optional input for FeatureData[Taxonomy] corresponding to the reference sequences, so the user can see what the taxonomy of the best matches is.
Comments
this could be useful with larger values of max-accepts, as it lets the user see to what extend the best matching alignments share their taxonomic annotation (i.e., they can evaluate the "consensus taxonomy" on their own, which is something we want to encourage for features of interest)
This section of exclude_seqs
:
should probably be optimized as it's going to require iterating multiple times over query_series
, and it's also going to read all of res
into memory. Could you rewrite this to read all of the hit identifiers from res
into a set
using Python's csv.reader
, and then iterate over feature_sequences
(maybe seek
to the beginning of the file and use the scikit-bio fasta parser so you don't need the private import I mentioned in #2), storing each sequence in lists of hit_seqs
or misses_seqs
as you go, and then create the pd.Series
objects on return.
For example, something like (this isn't tested):
hits_seqs = {}
misses_seqs={}
hit_ids = {e[0] for e in csv.reader(res)}
feature_sequences.seek(0)
for seq in skbio.io.read(feature_sequences):
seq_id = seq.metadata['id']
seq = str(seq)
if seq_id in hit_ids:
hits_seqs[seq_id] = seq
else:
misses_seqs[seq_id] = seq
return pd.Series(hits_seqs), pd.Series(misses_seqs)
Bug Description
Not sure what the issue is yet, but see forum xref for steps to reproduce and error report.
I can reproduce this error locally with the data.
vsearch is probably outputting blank lines or for some other reason the lines are not being parsed correctly:
File “/home/qiime2/miniconda/envs/qiime2-2019.1/lib/python3.6/site-packages/q2_quality_control/_blast.py”, line 85, in _extract_hits
query_id, subject_id, query_len, start, end = line.split(’\t’)
ValueError: not enough values to unpack (expected 5, got 1)
Improvement Description
from @gregcaporaso :
I think the plot-observed-features* data should be in another plot - when this is added to the first plot the scale of the y-axis doesn't really make sense anymore, so if a user wants to view this info and the other info in the plot, they're probably going to need to run this command twice, once with and once without this parameter. This would be nice to fix in this PR, but isn't essential.
table_ids.difference(metadata_ids) should read metadata_ids.difference(table_ids)
Otherwise, if _validate_metadata_values_are_subset
raises an error, it reports an empty set! (instead of the metadata values that are not represented in the expected composition table)
Better yet, this function should be improved to check only metadata rows that intersect with the observed composition table. Otherwise, metadata cannot be a superset of the observed composition table, which is a nuisance.
Full error:
Traceback (most recent call last):
File "/Users/nbokulich/miniconda3/envs/qiime2-2017.12/lib/python3.5/site-packages/q2cli/commands.py", line 224, in __call__
results = action(**arguments)
File "<decorator-gen-165>", line 2, in evaluate_composition
File "/Users/nbokulich/miniconda3/envs/qiime2-2017.12/lib/python3.5/site-packages/qiime2/sdk/action.py", line 228, in bound_callable
output_types, provenance)
File "/Users/nbokulich/miniconda3/envs/qiime2-2017.12/lib/python3.5/site-packages/qiime2/sdk/action.py", line 424, in _callable_executor_
ret_val = self._callable(output_dir=temp_dir, **view_args)
File "/Users/nbokulich/miniconda3/envs/qiime2-2017.12/lib/python3.5/site-packages/q2_quality_control/quality_control.py", line 69, in evaluate_composition
plot_observed_features_ratio=plot_observed_features_ratio)
File "/Users/nbokulich/miniconda3/envs/qiime2-2017.12/lib/python3.5/site-packages/q2_quality_control/_utilities.py", line 83, in _evaluate_composition
_validate_metadata_values_are_subset(metadata, exp)
File "/Users/nbokulich/miniconda3/envs/qiime2-2017.12/lib/python3.5/site-packages/q2_quality_control/_utilities.py", line 45, in _validate_metadata_values_are_subset
table_ids.difference(metadata_ids))
ValueError: Missing samples in table: set()
I strongly recommend making all of the parameters required (i.e., don't define defaults) in the private functions in quality_control.py
(and elsewhere, if applicable). Otherwise, it's really easy to accidentally forget to pass one of these when you call one of these functions from elsewhere in the code base, which in turn would make it not possible to override defaults. Defining defaults on private functions is something I now try to avoid.
For clarity, rename _blast_seqs.py
to something else (e.g., _search_seqs.py
) - I thought this only supported BLAST.
Could _extract_hits
just return a set
of ids that hit the reference? I don't think you use the other information at this time.
Improvement Description
add max-accepts option, defaulting to 1, which shows the top max-accepts pairwise alignment results
Improvement Description
MCC is useful for scoring imbalanced classification problems and should be added to the panel of eval metrics.
Current Behavior
No MCC
Proposed Behavior
add sklearn implementation of MCC metric
Found this example: https://github.com/nbokulich/q2-quality-control/blob/6d89fbebe9557fde5c7a1ffef1c542564d8e551d/q2_quality_control/quality_control.py#L12
This method could disappear or change its API at any time since it's private, so it's not safe to import this. If there is not a public API for the functionality that you need, another option would be to copy that function (adding a note about where it came from) to this repo, and post an issue on the source repo to see if the API could be made public in the future.
Should use the new citation API in qiime2/qiime2#387
Bug Description
When using filter_reads
to remove human sequences from the single-end reads, the artifact containing filtered sequences is empty. The same does not happen when filtering paired-end reads (i.e., everything works as expected). When examining bowties2's output I can see that the majority of my reads were not human and so they should be included in the output.
Steps to reproduce the behavior
qiime quality-control filter-reads
using sample-reads-single.qza
as input and human_ref_grch38.qza
as database.qiime demux summarize
.Expected behavior
Samples should contain reads in both cases.
Actual behavior
No reads were found when single-reads were used as input. Zero.
When paired-end reads were filtered, the majority of those was retained (expected).
Computation Environment
Comments
After a closer look I can see that the problem arises when samtools fastq
is invoked. The two additional outputs there are redirected to /dev/null but that's exactly where the output goes (and hence is lost). I reproduced this behavior with a couple of different single-end inputs - every time with the same result. It would seem that the SAM flags are not set correctly (by bowtie2?) and so the fastq
command sends those to a different file (some more discussion on that here).
Questions
Should all these outputs maybe just be stored as artifacts instead?
Bug Description
filter-reads
writes out demux fastq filenames with an extra fastq.gz
on the end: sample_b_S02_L001_R2_001.fastq.gz.fastq.gz
Steps to reproduce the behavior
Comments
It looks like vsearch
isn't supported anymore, so this parameter should be removed:
@nbokulich, the additional tests that you need here include (you might think of others as you go):
All of these should be performed for blast
and vsearch
, as applicable.
Improvement Description
For the plugin, incorporate a strand argument to be fed to blast
or vsearch
that can limit whether the reverse complement is aligned against: 'strand': Str % Choices(['both', 'plus'])
This is invoked in the following lines
And I think makes reasonable sense to include as a functionality in this plugin.
Current Behavior
Only both
strand search is enabled for vsearch
and blast
Proposed Behavior
Provide an additional parameter that allows searching the plus
strand with these commands.
Questions
In order to implement a pipeline that can perform the steps of filtering sequences and a table for bloom sequences (See qiime2/q2-feature-table#218), I need an "alignment" method that matches query sequences to reference sequences only if the entire query sequence (length n
) matches the first n
bases in the reference sequence.
I do not believe this is currently possible with exclude-seqs
, or at least, I am not aware of how to manipulate the BLAST
or VSEARCH
settings to make this possible.
I have two potential ideas for how to solve this problem:
_blast._search_seqs
, and consequently to exclude-seqs
exclude-seqs
-like method for the filtering pipeline that performs the type of matching I need.Can you let me know which is preferred?
Improvement Description
Replace (or supplement) histograms of mismatches X count with similarity X coverage X abundance
Current Behavior
Histograms of mismatches X count are less effective if reads have different lengths (and hence mismatches indicate different % similarity)
Proposed Behavior
Replace or supplement with a better plot (see forum link below)
References
forum xref
Bug Description
Linear regression R2 values appear to be miscalculating at level 1. Other levels appear correct (based on the data provided, see xref below).
scipy’s linear regression function (which is being used to calculate R here) fails if there is only one measurement each in the expected and observed
> linregress([1], [1])
LinregressResult(slope=nan, intercept=nan, rvalue=0.0, pvalue=nan, stderr=nan)
> linregress([1, 2], [1, 2])
LinregressResult(slope=1.0, intercept=0.0, rvalue=1.0, pvalue=0.0, stderr=0.0)
rvalue in the first example should be nan, not 0.0
Screenshots
Level 1 should be R2=1.0 in this plot (since all observed and expected taxa are bacteria).
References
forum xref
threads
parameter should be a range bound on the lower end. Also, note that this is only relevant for vsearch
?
The following:
https://github.com/nbokulich/q2-quality-control/blob/6d89fbebe9557fde5c7a1ffef1c542564d8e551d/q2_quality_control/plugin_setup.py#L50
would be more clear as: BLAST expectation (E) value threshold
This description should mention the E-value threshold too, and how that interacts with percent identity (i.e., do both need to be above their respective thresholds)? Sort of related, but should it be possible to disable the percent identity filter, and only filter based on E-value?
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.