qiime2 / q2-types Goto Github PK


License: BSD 3-Clause "New" or "Revised" License

Python 99.80% Makefile 0.04% TeX 0.16%

q2-types's Issues

add tests for per_sample_sequences subpackage

See distance_matrix subpackage tests (added in #40) for a guide on how to write and structure these tests.

ReferenceFeatures and SSU types should be removed

This is only referenced in q2-types and docs, so we should drop this. We initially thought we needed this for q2-feature-classifier, but ended up replacing it with using various FeatureData types instead.

disallow duplicate sample IDs in demux types/formats

Came up on the forum: duplicate sample IDs should be disallowed with types SampleData[SequencesWithQuality] and SampleData[PairedEndSequencesWithQuality] (the fix would be implemented on those types' transformers).
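A minimal sketch of the kind of check this would need, assuming the sample IDs have already been collected into a list of strings. The function names are hypothetical and not part of q2-types; where the check would live in the transformers is left open.

```python
from collections import Counter


def find_duplicate_sample_ids(sample_ids):
    """Return the sample IDs that occur more than once, sorted."""
    counts = Counter(sample_ids)
    return sorted(sid for sid, n in counts.items() if n > 1)


def validate_unique_sample_ids(sample_ids):
    """Raise ValueError if any sample ID is duplicated."""
    duplicates = find_duplicate_sample_ids(sample_ids)
    if duplicates:
        raise ValueError(
            'Duplicate sample IDs are not allowed: %s' % ', '.join(duplicates))
```

Reporting every duplicated ID at once (rather than failing on the first) would probably save users a few import attempts.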

add AlphaDiversity semantic type

add tests for sample_data subpackage

See distance_matrix subpackage tests (added in #40) for a guide on how to write and structure these tests.

add tests for tree subpackage

See distance_matrix subpackage tests (added in #40) for a guide on how to write and structure these tests.

BUG: non-functional pd.Series -> DNAFASTAFormat transformer in feature_data subpackage

I've included an example code block showing the error at the bottom of this issue. Pulling the _16 transformer into a local function and adding a skbio.DNA wrapper around the sequence string allowed it to partially work as expected (sans header ids).

In [11]: def _16(data: pd.Series) -> DNAFASTAFormat:
    ...:     ff = DNAFASTAFormat()
    ...:     with ff.open() as f:
    ...:         for sequence in data:
    ...:             skbio.io.write(skbio.DNA(sequence), format='fasta', into=f)
    ...:     return ff
    ...: 

In [12]: f = _16(features.loc[data.columns, 'DenoisedSequenceVariant'])

In [13]: !head {f.path}
>
GCGAGCGT...
>
GCAAGCGT...
>
GCAAGCGT...
>
GCAAGCGT...
>
GCAAGCGT...

The funny(/ironic?) part is that it's the only transformer in the sub-package without any tests.

In [6]: qiime2.Artifact.import_data('FeatureData[Sequence]', features.loc[data.columns, 'DenoisedSequenceVariant'])
---------------------------------------------------------------------------
UnrecognizedFormatError                   Traceback (most recent call last)
<ipython-input-6-f8fdc74db9db> in <module>()
----> 1 qiime2.Artifact.import_data('FeatureData[Sequence]', features.loc[data.columns, 'DenoisedSequenceVariant'])

~/dev/mc3/envs/[redacted]/lib/python3.5/site-packages/qiime2/sdk/result.py in import_data(cls, type, view, view_type)
    190 
    191         provenance_capture = archive.ImportProvenanceCapture(format_, md5sums)
--> 192         return cls._from_view(type_, view, view_type, provenance_capture)
    193 
    194     @classmethod

~/dev/mc3/envs/[redacted]/lib/python3.5/site-packages/qiime2/sdk/result.py in _from_view(cls, type, view, view_type, provenance_capture)
    215         transformation = from_type.make_transformation(to_type,
    216                                                        recorder=recorder)
--> 217         result = transformation(view)
    218 
    219         artifact = cls.__new__(cls)

~/dev/mc3/envs/[redacted]/lib/python3.5/site-packages/qiime2/core/transform.py in transformation(view)
     57             self.validate(view)
     58 
---> 59             new_view = transformer(view)
     60 
     61             new_view = other.coerce_view(new_view)

~/dev/mc3/envs/[redacted]/lib/python3.5/site-packages/qiime2/core/transform.py in wrapped(view)
    205         def wrapped(view):
    206             new_view = self._view_type()
--> 207             file_view = transformer(view)
    208             if transformer is not identity_transformer:
    209                 self.set_user_owned(file_view, False)

~/dev/mc3/envs/[redacted]/lib/python3.5/site-packages/q2_types/feature_data/_transformer.py in _16(data)
    339     with ff.open() as f:
    340         for sequence in data:
--> 341             skbio.io.write(sequence, format='fasta', into=f)
    342     return ff
    343 

~/dev/mc3/envs/[redacted]/lib/python3.5/site-packages/skbio/io/registry.py in write(obj, format, into, **kwargs)
   1164 @wraps(IORegistry.write)
   1165 def write(obj, format, into, **kwargs):
-> 1166     return io_registry.write(obj, format, into, **kwargs)
   1167 
   1168 

~/dev/mc3/envs/[redacted]/lib/python3.5/site-packages/skbio/io/registry.py in write(self, obj, format, into, **kwargs)
    615             raise UnrecognizedFormatError(
    616                 "Cannot write %r into %r, no %s writer found." %
--> 617                 (format, into, obj.__class__.__name__))
    618 
    619         writer(obj, into, **kwargs)

UnrecognizedFormatError: Cannot write 'fasta' into <_io.TextIOWrapper name='/tmp/q2-DNAFASTAFormat-ohosxny8' mode='r+' encoding='utf8'>, no str writer found.

In [7]:

I can work on this if necessary!
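The empty `>` headers in the workaround above come from iterating over the sequence values only, so the IDs in the index never reach the writer. Here is a rough sketch of the idea using only the standard library: iterate over (id, sequence) pairs via `.items()`, which both `dict` and `pd.Series` provide. `write_fasta` is a hypothetical helper, not the q2-types transformer; a real fix would presumably keep using skbio and attach the ID to each `skbio.DNA` object.

```python
import io


def write_fasta(records, fh):
    """Write (id, sequence) pairs from ``records.items()`` as FASTA."""
    for seq_id, sequence in records.items():
        # One header line carrying the ID, then the sequence itself.
        fh.write('>%s\n%s\n' % (seq_id, sequence))


# Hypothetical feature IDs and sequences, for illustration only.
features = {'feature-1': 'GCGAGCGT', 'feature-2': 'GCAAGCGT'}
buffer = io.StringIO()
write_fasta(features, buffer)
fasta_text = buffer.getvalue()
```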

Explicitly set newline and encoding parameters for datalayouts/formats

This will make the resulting artifacts system agnostic. Related to qiime2/qiime2#95
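For reference, a small sketch of what explicit parameters look like with the built-in `open`. Passing `newline='\n'` turns off translation to the platform line separator on write, and `encoding='utf-8'` avoids depending on the locale default. The helper name and file contents are made up for the example.

```python
import os
import tempfile


def write_text_portably(path, text):
    """Write text with a fixed encoding and newline, regardless of platform."""
    # newline='\n' disables translation of '\n' to os.linesep on write.
    with open(path, 'w', encoding='utf-8', newline='\n') as fh:
        fh.write(text)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'example.tsv')
    write_text_portably(path, 'Feature ID\tTaxon\nO1\tk__Bacteria\n')
    # Read back as bytes to inspect exactly what landed on disk.
    with open(path, 'rb') as fh:
        raw_bytes = fh.read()
```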

Add transformers/formats for TSV-style OTU tables (in either orientation)

Improvement Description
There are two styles of TSV that would be useful, and two orientations.

Styles:

Matrix style (R-lang): a labelled matrix with no first (corner) cell, i.e. there are N columns but only N-1 labels.
Record style (most everyone else): a more or less standard TSV with N columns and N labels for those columns.

Orientations:

Rows are samples (more intuitive)
Rows are OTUs/features (has some historical precedent)

Transformers should convert the BIOMV210Format into the 4 combinations above. I don't have good names for these, but some examples might be:

MatrixTSVBySampleFormat
MatrixTSVByFeatureFormat
RecordTSVBySampleFormat
RecordTSVByFeatureFormat

In the future we could do smarter things with TSVs and schemas, but for now, the above would help a lot of people with a pretty mundane conversion.
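To make the four combinations concrete, here is a sketch that serializes a small dense table in either style, with a transpose helper for the orientation. It works on plain lists rather than a biom table, and the function names and the `corner_label` default are assumptions made for the example.

```python
import csv
import io


def table_to_tsv(rows, row_ids, col_ids, style, corner_label='id'):
    """Serialize a dense table as TSV.

    style='record': header has one label per column (N labels, N columns).
    style='matrix': header omits the first cell (N-1 labels, N columns).
    """
    if style not in ('record', 'matrix'):
        raise ValueError('Unknown style: %r' % style)
    out = io.StringIO()
    writer = csv.writer(out, delimiter='\t', lineterminator='\n')
    header = list(col_ids)
    if style == 'record':
        header = [corner_label] + header
    writer.writerow(header)
    for row_id, values in zip(row_ids, rows):
        writer.writerow([row_id] + list(values))
    return out.getvalue()


def transpose(rows):
    """Swap orientation (samples as rows <-> features as rows)."""
    return [list(col) for col in zip(*rows)]


samples = ['S1', 'S2']
features = ['O1', 'O2', 'O3']
by_sample = [[1, 0, 4], [2, 3, 0]]  # rows are samples

record_by_sample = table_to_tsv(by_sample, samples, features, 'record')
matrix_by_feature = table_to_tsv(
    transpose(by_sample), features, samples, 'matrix')
```

The only difference between the two styles is whether the header row carries a label for the ID column, so one writer with a flag may be enough to back all four formats.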

Rename `FeatureTableDirectoryFormat` to `BIOMV1DirFmt`

In preparation for supporting BIOM V2 files.

Format `SingleEndFastqManifestPhred33` and friends do not validate that files are gzipped

It seems to permit both .fastq and .fastq.gz files as "input" for the manifest format.

It doesn't look like the FastqGzFormat used in SingleLanePerSampleSingleEndFastqDirFmt (or its paired variant) verifies this either. Its sniff method should reject files that aren't gzipped.

It would be nice to be able to gzip in the transformers from the .*FastqManifest.* formats if possible.
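A sketch of both halves, using only the standard library: a cheap check for the two-byte gzip magic number (the kind of test a sniffer could run), and a helper that compresses a plain file. Both function names are hypothetical. The magic-number check does not prove the whole file is valid gzip, only that it starts like one.

```python
import gzip
import os
import tempfile

GZIP_MAGIC = b'\x1f\x8b'


def is_gzipped(path):
    """Return True if the file starts with the gzip magic number."""
    with open(path, 'rb') as fh:
        return fh.read(2) == GZIP_MAGIC


def gzip_copy(src, dest):
    """Write a gzip-compressed copy of ``src`` to ``dest``."""
    with open(src, 'rb') as fin, gzip.open(dest, 'wb') as fout:
        # Copy in chunks so large fastq files are never fully in memory.
        for chunk in iter(lambda: fin.read(64 * 1024), b''):
            fout.write(chunk)


with tempfile.TemporaryDirectory() as tmp:
    plain = os.path.join(tmp, 'reads.fastq')
    zipped = os.path.join(tmp, 'reads.fastq.gz')
    with open(plain, 'w', encoding='utf-8', newline='\n') as fh:
        fh.write('@r1\nACGT\n+\nIIII\n')
    gzip_copy(plain, zipped)
    plain_is_gz = is_gzipped(plain)
    zipped_is_gz = is_gzipped(zipped)
    with gzip.open(zipped, 'rt', encoding='utf-8') as fh:
        round_trip = fh.read()
```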

support FeatureData[Sequences] (OTU Map)

Improvement Description
It'd be useful to support FeatureData[Sequences], i.e. analogous to QIIME 1's "OTU Map". This type/format describes the sequences in each feature (e.g. sequences that clustered into an OTU).

Comments
We had planned to add this type but deferred until we could come up with a reasonable file format (the QIIME 1 OTU Map format is un-parsable in Python when the lines are too long).

References
This type was requested on the QIIME 2 forum here.

Commas in Sample IDs break the internal manifests for per-sample fastq formats

References
This recently came up on the forum.
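A small illustration of the failure mode, assuming the manifest is written by joining fields with commas and read back with `split(',')`. The sample ID and filename are invented. Writing and reading through the `csv` module is one possible way to make such IDs round-trip, since it quotes fields that contain the delimiter.

```python
import csv
import io

# Hypothetical manifest record whose sample ID contains a comma.
row = ['sample,1', 'sample1_R1.fastq.gz', 'forward']

# Naive joining produces four fields on the way back in.
naive_line = ','.join(row)
naive_fields = naive_line.split(',')

# csv.writer quotes fields that contain the delimiter, so they round-trip.
buffer = io.StringIO()
csv.writer(buffer, lineterminator='\n').writerow(row)
quoted_line = buffer.getvalue()
parsed_fields = next(csv.reader(io.StringIO(quoted_line)))
```

The alternative is to reject commas in sample IDs at import time, which is simpler but pushes the renaming work onto users.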

add tests for feature_data subpackage

See distance_matrix subpackage tests (added in #40) for a guide on how to write and structure these tests.

replace phylogeny qza files

The current tests/data/phylogeny-rooted.qza and tests/data/phylogeny-unrooted.qza files are really big and don't work with the wiki tutorial. We should replace them with these files:

https://dl.dropboxusercontent.com/u/2868868/phylogeny-rooted.qza
https://dl.dropboxusercontent.com/u/2868868/phylogeny-unrooted.qza

It would be nice to re-write the history to remove the current files that are in there, since they are much larger than anything else.

Thanks @jairideout for catching this issue!

add citation_text and user_support_text

Mostly for example purposes in this repository, but it'll be useful since this will have a release today.

Support MiSeq demultiplexed data when importing

This format doesn't have a lane identifier so we would need another format to support this.

It would be much easier to use this format than to create a fastq-manifest with potentially hundreds of lines.

Issues with HeaderlessTSVTaxonomyFormat

@thermokarst I think the issue I was having with the HeaderlessTSVTaxonomyFormat was possibly related to the wrong base class being used? I'm not sure but it is obviously not like the rest.
https://github.com/qiime2/q2-types/blob/master/q2_types/feature_data/_format.py#L69

I noticed this because I was getting this error and I didn't know why it was trying to do a transformation to a HeaderlessTSVTaxonomyFormat when I am positive I already gave it a HeaderlessTSVTaxonomyFormat.

Well, it's also not defined yet, so I'll just make my own for now. Thanks for all your help!

add transformer for AlignedDNAFASTAFormat to pd.Series

TaxonomyFormat assumes first line is header

When reading TaxonomyFormat with any of the transformers, the first line is assumed to be a (non-comment) header, followed by the taxonomy mapping lines. The sniffer is very lenient and only cares that the file is two-column TSV.

Not all taxonomy files include a header (for example, Greengenes). When a transformer is invoked to read the file, the first line is interpreted as a header, causing the feature ID to be set as Index.name and the taxonomy string to be set as Series.name.

For example, suppose we have the following taxonomy.tsv file (I used the first few lines from the Greengenes taxonomy map):

228054  k__Bacteria; p__Cyanobacteria; c__Synechococcophycideae; o__Synechococcales; f__Synechococcaceae; g__Synechococcus; s__
228057  k__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; o__Rickettsiales; f__Pelagibacteraceae; g__; s__
73627   k__Bacteria; p__Actinobacteria; c__Actinobacteria; o__Actinomycetales; f__Mycobacteriaceae; g__Mycobacterium; s__
378462  k__Bacteria; p__Firmicutes; c__Bacilli; o__Bacillales; f__Staphylococcaceae; g__Staphylococcus; s__
89370   k__Bacteria; p__Firmicutes; c__Bacilli; o__Bacillales; f__Bacillaceae; g__Anoxybacillus; s__kestanbolensis

Reading this file into a pd.Series (all transformers are affected, it's not limited to the pd.Series transformer):

In [1]: from qiime2.plugin.util import transform

In [2]: from q2_types.feature_data import TaxonomyFormat

In [3]: import pandas as pd

In [4]: taxonomy_series = transform('taxonomy.tsv', from_type=TaxonomyFormat, to_type=pd.Series)

In [5]: taxonomy_series
Out[5]:
228054
228057    k__Bacteria; p__Proteobacteria; c__Alphaproteo...
73627     k__Bacteria; p__Actinobacteria; c__Actinobacte...
378462    k__Bacteria; p__Firmicutes; c__Bacilli; o__Bac...
89370     k__Bacteria; p__Firmicutes; c__Bacilli; o__Bac...
Name: k__Bacteria; p__Cyanobacteria; c__Synechococcophycideae; o__Synechococcales; f__Synechococcaceae; g__Synechococcus; s__, dtype: object

In [6]: # :(

In [7]:

The Series has its Index.name set to "228054" and its Series.name to "k__Bacteria; p__Cyanobacteria; c__Synechococcophycideae; o__Synechococcales; f__Synechococcaceae; g__Synechococcus; s__".

During classification, if a query sequence is assigned to the first reference sequence taxonomy (i.e. the one that's misinterpreted as a header), I think the code will error with an IndexError coming from pandas. This happened with @nbokulich's vsearch classifier (not yet in master), and adding a header line to the Greengenes file appears to fix the issue (he's not receiving the error anymore at least).

I don't think we've seen this error with the existing classifiers because we've never had a query sequence assigned to the first reference sequence (e.g. just by chance). I suspect (and hope) that the code would fail in a similar way with an IndexError, but haven't confirmed.

I propose that we require TaxonomyFormat (both when sniffing and reading) to have the following header (we've been using this header in the unit tests, maybe elsewhere):

Feature ID<tab>Taxon

... optionally followed by other columns that are ignored. We could finesse the column names a little -- something like feature_id and taxon would be easier to access from pandas objects. This is a minor detail we can work out later.

If we go with a stricter format, such as what I'm proposing, then importing files without the appropriate header (e.g. Greengenes and other reference databases) will raise an error and the header will have to be added to the file in order to import. This is annoying, but I really think we should stop supporting tabular files without headers, especially because we want the .qza file formats to be as self-documenting as possible.
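A rough sketch of what the stricter reading could look like, written against plain file handles rather than the real TaxonomyFormat and transformers. `read_taxonomy` and `REQUIRED_HEADER` are hypothetical names, and returning a dict is only for the example.

```python
import io

REQUIRED_HEADER = ['Feature ID', 'Taxon']


def read_taxonomy(fh):
    """Parse a taxonomy TSV, requiring the 'Feature ID<tab>Taxon' header."""
    header_line = fh.readline().rstrip('\n')
    header = header_line.split('\t')
    if header[:2] != REQUIRED_HEADER:
        raise ValueError(
            'Expected header starting with %r, found %r'
            % ('\t'.join(REQUIRED_HEADER), header_line))
    taxonomy = {}
    for line in fh:
        if not line.strip():
            continue
        fields = line.rstrip('\n').split('\t')
        if len(fields) < 2:
            raise ValueError('Expected at least two columns: %r' % line)
        # Columns after the first two are ignored, as proposed.
        taxonomy[fields[0]] = fields[1]
    return taxonomy


with_header = io.StringIO(
    'Feature ID\tTaxon\n'
    '228054\tk__Bacteria; p__Cyanobacteria\n'
    '228057\tk__Bacteria; p__Proteobacteria\n')
parsed = read_taxonomy(with_header)
```

With this check, a headerless file such as the Greengenes map fails immediately with a message naming the expected header, instead of silently losing its first record.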

Thoughts? cc: @gregcaporaso, @nbokulich, @BenKaehler, @ebolyen, @thermokarst, @jakereps

Thanks @nbokulich for finding and reporting this bug!

transformer for OrdinationFormat --> metadata

This would be useful to use principal coordinates (or other ordination results) as input metadata. I am working on some methods that could employ this, e.g., to test whether samples change over PC1 before/after treatment.

The transformation of OrdinationFormat --> pd.DataFrame can be achieved with something like this (in a Jupyter notebook, at least; I suppose the first line might be unnecessary in a transformer):

beta_div = beta_div.view(skbio.OrdinationResults)
beta_div = beta_div.samples.loc[:, 0:2]
beta_div.columns = ['unweighted-unifrac-pc1', 'unweighted-unifrac-pc2', 'unweighted-unifrac-pc3']

and then I assume the beta_div DataFrame can be converted to metadata with

qiime2.Metadata(beta_div)

I would find this extremely useful. Any interest?

add tests for ordination subpackage

See distance_matrix subpackage tests (added in #40) for a guide on how to write and structure these tests.

add Rooted and Unrooted subtypes of Phylogeny

Importing fastq with wc -l % 4 == 0 doesn't fail

A user on the forum was able to import fastq files using one of the manifest formats. Downstream in the analysis it appears that the sequences don't have quality scores associated with them, e.g.:

    Traceback (most recent call last):
      File "/home/qiime2/miniconda/envs/qiime2-2017.8/lib/python3.5/site-packages/q2cli/commands.py", line 222, in __call__
        results = action(**arguments)
      File "<decorator-gen-207>", line 2, in summarize
      File "/home/qiime2/miniconda/envs/qiime2-2017.8/lib/python3.5/site-packages/qiime2/sdk/action.py", line 201, in callable_wrapper
        output_types, provenance)
      File "/home/qiime2/miniconda/envs/qiime2-2017.8/lib/python3.5/site-packages/qiime2/sdk/action.py", line 392, in _callable_executor_
        ret_val = callable(output_dir=temp_dir, **view_args)
      File "/home/qiime2/miniconda/envs/qiime2-2017.8/lib/python3.5/site-packages/q2_demux/_summarize/_visualizer.py", line 114, in summarize
        for seq in _read_fastq_seqs(file):
      File "/home/qiime2/miniconda/envs/qiime2-2017.8/lib/python3.5/site-packages/q2_demux/_demux.py", line 36, in _read_fastq_seqs
        qual.strip())
    AttributeError: 'NoneType' object has no attribute 'strip'
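The traceback suggests a record whose quality line was never read. Below is a sketch of a structural check that could catch this at import time, assuming unwrapped four-line records. `validate_fastq` is a hypothetical helper, and reading every line into memory is only acceptable for a sketch.

```python
import io


def validate_fastq(fh):
    """Check four-line FASTQ records and return the number of records.

    Assumes unwrapped records: header, sequence, separator, quality.
    """
    lines = [line.rstrip('\n') for line in fh]
    if len(lines) % 4 != 0:
        raise ValueError(
            'Line count %d is not a multiple of 4' % len(lines))
    for i in range(0, len(lines), 4):
        header, seq, sep, qual = lines[i:i + 4]
        record = i // 4 + 1
        if not header.startswith('@'):
            raise ValueError('Record %d: header must start with "@"' % record)
        if not sep.startswith('+'):
            raise ValueError(
                'Record %d: separator must start with "+"' % record)
        if len(seq) != len(qual):
            raise ValueError(
                'Record %d: sequence and quality lengths differ' % record)
    return len(lines) // 4


record_count = validate_fastq(io.StringIO('@r1\nACGT\n+\nIIII\n'))
```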

Should DNAIterator support lowercase fasta sequences?

Bug Description
When feature_classifier.extract_reads encounters a sequence with a lowercase letter in it, it throws the error below.

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-7-674d2044b0eb> in <module>()
      7         seqs_in = qiime.Artifact.import_data("FeatureData[Sequence]", seqs)
      8         reads = feature_classifier.methods.extract_reads(seqs_in, read_length,
----> 9                                                          fwd_primer, rev_primer)
     10         reads.save(reads_out)

<decorator-gen-204> in extract_reads(sequences, read_length, f_primer, r_primer, method, direction, n_sample)

/Users/nbokulich/miniconda3/envs/qiime2-06/lib/python3.5/site-packages/qiime-2.0.6-py3.5.egg/qiime/core/callable.py in callable_wrapper(*args, **kwargs)
    225 
    226             outputs = self._callable_executor_(self._callable, view_args,
--> 227                                                output_types, provenance)
    228             # `outputs` matches a Python function's return: either a single
    229             # value is returned, or it is a tuple of return values. Treat both

/Users/nbokulich/miniconda3/envs/qiime2-06/lib/python3.5/site-packages/qiime-2.0.6-py3.5.egg/qiime/core/callable.py in _callable_executor_(self, callable, view_args, output_types, provenance)
    350                     (view_type.__name__, type(output_view).__name__))
    351             artifact = qiime.sdk.Artifact._from_view(
--> 352                 semantic_type, output_view, view_type, provenance.fork())
    353             output_artifacts.append(artifact)
    354 

/Users/nbokulich/miniconda3/envs/qiime2-06/lib/python3.5/site-packages/qiime-2.0.6-py3.5.egg/qiime/sdk/result.py in _from_view(cls, type, view, view_type, provenance_capture)
    214         transformation = from_type.make_transformation(to_type,
    215                                                        recorder=recorder)
--> 216         result = transformation(view)
    217 
    218         artifact = cls.__new__(cls)

/Users/nbokulich/miniconda3/envs/qiime2-06/lib/python3.5/site-packages/qiime-2.0.6-py3.5.egg/qiime/core/transform.py in transformation(view)
     57             self.validate(view)
     58 
---> 59             new_view = transformer(view)
     60 
     61             new_view = other.coerce_view(new_view)

/Users/nbokulich/miniconda3/envs/qiime2-06/lib/python3.5/site-packages/qiime-2.0.6-py3.5.egg/qiime/core/transform.py in wrapped(view)
    188         def wrapped(view):
    189             new_view = self._view_type()
--> 190             file_view = transformer(view)
    191             if transformer is not identity_transformer:
    192                 self.set_user_owned(file_view, False)

/Users/nbokulich/miniconda3/envs/qiime2-06/lib/python3.5/site-packages/q2_types-0.0.6-py3.5.egg/q2_types/feature_data/_transformer.py in _10(data)
     89 def _10(data: DNAIterator) -> DNAFASTAFormat:
     90     ff = DNAFASTAFormat()
---> 91     skbio.io.write(data.generator, format='fasta', into=str(ff))
     92     return ff
     93 

/Users/nbokulich/miniconda3/envs/qiime2-06/lib/python3.5/site-packages/skbio/io/registry.py in write(obj, format, into, **kwargs)
   1164 @wraps(IORegistry.write)
   1165 def write(obj, format, into, **kwargs):
-> 1166     return io_registry.write(obj, format, into, **kwargs)
   1167 
   1168 

/Users/nbokulich/miniconda3/envs/qiime2-06/lib/python3.5/site-packages/skbio/io/registry.py in write(self, obj, format, into, **kwargs)
    617                 (format, into, obj.__class__.__name__))
    618 
--> 619         writer(obj, into, **kwargs)
    620         return into
    621 

/Users/nbokulich/miniconda3/envs/qiime2-06/lib/python3.5/site-packages/skbio/io/registry.py in wrapped_writer(obj, file, encoding, newline, **kwargs)
   1080                 with open_files(files, mode='w', **io_kwargs) as fhs:
   1081                     kwargs.update(zip(file_keys, fhs[:-1]))
-> 1082                     writer_function(obj, fhs[-1], **kwargs)
   1083 
   1084             self._add_writer(cls, wrapped_writer, monkey_patch, override)

/Users/nbokulich/miniconda3/envs/qiime2-06/lib/python3.5/site-packages/skbio/io/format/fasta.py in _generator_to_fasta(obj, fh, qual, id_whitespace_replacement, description_newline_replacement, max_width, lowercase)
    772         obj, id_whitespace_replacement, description_newline_replacement,
    773         qual is not None, lowercase)
--> 774     for header, seq_str, qual_scores in formatted_records:
    775         if max_width is not None:
    776             seq_str = chunk_str(seq_str, max_width, '\n')

/Users/nbokulich/miniconda3/envs/qiime2-06/lib/python3.5/site-packages/skbio/io/format/_base.py in _format_fasta_like_records(generator, id_whitespace_replacement, description_newline_replacement, require_qual, lowercase)
    144             "sequence IDs, nor to replace newlines in sequence descriptions.")
    145 
--> 146     for idx, seq in enumerate(generator):
    147 
    148         if len(seq) < 1:

/Users/nbokulich/miniconda3/envs/qiime2-06/lib/python3.5/site-packages/q2_feature_classifier-0.0.6-py3.5.egg/q2_feature_classifier/_cutter.py in read_seqs()
    129 
    130     def read_seqs():
--> 131         for single_sequence_tuple in result:
    132             yield single_sequence_tuple[0]
    133     return DNAIterator(read_seqs())

/Users/nbokulich/miniconda3/envs/qiime2-06/lib/python3.5/site-packages/q2_feature_classifier-0.0.6-py3.5.egg/q2_feature_classifier/_gregex.py in extract_reads_by_position(aln, readlength, f_primer, r_primer, endedness, sample)
     56     query_cache = []
     57     i = 0
---> 58     for query in aln:
     59         query_cache.append(query)
     60         gaps = query.gaps()

/Users/nbokulich/miniconda3/envs/qiime2-06/lib/python3.5/site-packages/skbio/io/registry.py in <genexpr>(.0)
    504             # GeneratorType
    505             try:
--> 506                 return (x for x in itertools.chain([next(gen)], gen))
    507             except StopIteration:
    508                 # If the error was a StopIteration, then we want to return an

/Users/nbokulich/miniconda3/envs/qiime2-06/lib/python3.5/site-packages/skbio/io/registry.py in _read_gen(self, file, fmt, into, verify, kwargs)
    529             reader, kwargs = self._init_reader(file, fmt, into, verify, kwargs,
    530                                                io_kwargs)
--> 531             yield from reader(file, **kwargs)
    532 
    533     def _find_io_kwargs(self, kwargs):

/Users/nbokulich/miniconda3/envs/qiime2-06/lib/python3.5/site-packages/skbio/io/registry.py in wrapped_reader(file, encoding, newline, **kwargs)
   1006                     with open_files(files, mode='r', **io_kwargs) as fhs:
   1007                         kwargs.update(zip(file_keys, fhs[:-1]))
-> 1008                         yield from reader_function(fhs[-1], **kwargs)
   1009 
   1010             self._add_reader(cls, wrapped_reader, monkey_patch, override)

/Users/nbokulich/miniconda3/envs/qiime2-06/lib/python3.5/site-packages/skbio/io/format/fasta.py in _fasta_to_generator(fh, qual, constructor, **kwargs)
    675                                                FASTAFormatError):
    676             yield constructor(seq, metadata={'id': id_, 'description': desc},
--> 677                               **kwargs)
    678     else:
    679         fasta_gen = _parse_fasta_raw(fh, _parse_sequence_data,

/Users/nbokulich/miniconda3/envs/qiime2-06/lib/python3.5/site-packages/skbio/sequence/_grammared_sequence.py in __init__(self, sequence, metadata, positional_metadata, lowercase, validate)
    334 
    335         if validate:
--> 336             self._validate()
    337 
    338     def _validate(self):

/Users/nbokulich/miniconda3/envs/qiime2-06/lib/python3.5/site-packages/skbio/sequence/_grammared_sequence.py in _validate(self)
    358                    [str(b.tostring().decode("ascii")) for b in bad] if
    359                    len(bad) > 1 else bad[0],
--> 360                    list(self.alphabet)))
    361 
    362     @stable(as_of='0.4.0')

ValueError: Invalid character in sequence: b't'. 
Valid characters: ['G', 'C', '.', 'Y', 'W', 'B', 'R', 'V', 'N', 'K', 'D', 'S', '-', 'A', 'H', 'T', 'M']
Note: Use `lowercase` if your sequence contains lowercase characters not in the sequence's alphabet.

Comments
According to @BenKaehler, it looks like the error comes from skbio when we attempt to write out lowercase sequences, which is called indirectly from q2_types. Hence I am posting this issue here.

support importing multiplexed seqs containing barcodes

Came up on the forum a few times (e.g. here, here, and here). Users need to be able to import multiplexed sequence data that contains barcodes in the sequences (we currently support data that has the barcodes extracted in a separate file, i.e. the "EMP protocol multiplexed data"). For now, a workaround is to use QIIME 1's extract_barcodes.py to extract the barcodes into their own file.

add format for QIIME 1 demultiplexed fasta/fna

This will simplify importing data from QIIME 1.

Super type for Phylogeny

Right now Phylogeny implies that it will only be allowed to handle phylogenetic trees. But there are many tree-like structures that could be made, for instance hierarchical clusterings.

Could we create a super type, for example Hierarchy that could encompass both Phylogenies and Clusterings?

create demultiplexed sequences type

rename library q2_types

`PerSampleDNAIterators` doesn't take comments into account

SingleLanePerSampleSingleEndFastqDirFmt and SingleLanePerSamplePairedEndFastqDirFmt -> PerSampleDNAIterators transformers don't take the MANIFEST comments into account and crash when attempting to view the artifact as an iterator.

In [1]: import qiime2
In [2]: from q2_types.per_sample_sequences import PerSampleDNAIterators
In [3]: a = qiime2.Artifact.load('20170626_1/demux.qza')
In [4]: a.view(PerSampleDNAIterators)    
...
~/Developer/mc3/envs/biota/lib/python3.5/site-packages/q2_types/per_sample_sequences/_transformer.py in _1(dirfmt) 
     44     next(fh)
     45     for line in fh:
---> 46         sample_id, filename, _ = line.split(',')
     47         filepath = str(dirfmt.path / filename)
     48         result[sample_id] = skbio.io.read(filepath, format='fastq',

ValueError: not enough values to unpack (expected 3, got 1)
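One possible shape for a fix, sketched against a plain file handle: drop comment and blank lines before consuming the header, so the position of the header no longer matters. `read_manifest` is a hypothetical helper, and the manifest contents below are invented for the example (they are not the autogenerated file from this report).

```python
import csv
import io


def read_manifest(fh):
    """Yield (sample_id, filename, direction) tuples from a MANIFEST.

    Lines starting with '#' and blank lines are skipped before the header
    is consumed, so leading comments do not shift the header position.
    """
    data_lines = (line for line in fh
                  if line.strip() and not line.startswith('#'))
    reader = csv.reader(data_lines)
    next(reader, None)  # header: sample-id,filename,direction
    for fields in reader:
        if len(fields) != 3:
            raise ValueError('Expected 3 fields, found %d: %r'
                             % (len(fields), fields))
        yield tuple(fields)


manifest = io.StringIO(
    '# hypothetical comment line\n'
    'sample-id,filename,direction\n'
    '# another comment\n'
    'S1,S1_0_L001_R1_001.fastq.gz,forward\n')
records = list(read_manifest(manifest))
```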

Autogenerated MANIFEST file:

Prevent reading/writing empty `FeatureTable`s

Current Behavior
Currently methods like feature-table filter-samples can filter out all values in a table, resulting in a successfully created (albeit empty) artifact. This then causes problems in other methods, where they expect to have some data in the table. As well, without this centralized check, the burden of checking for data in these tables falls on the plugin developer, which creates some extra work.

References
This is a more specific case of qiime2/qiime2#294.

Define new `Metadata` transformers

Related to qiime2/qiime2#271
Related to qiime2/qiime2#269

add tests for feature_table subpackage

See distance_matrix subpackage tests (added in #40) for a guide on how to write and structure these tests.

add format for QIIME 1 demultiplexed fastq

Comments
This will simplify importing data from QIIME 1.

add pd.DataFrame as a view for FeatureTable objects

detect and support phred offsets other than 33

Bug Description
The current directory formats assume the phred offset is 33, but this is not always the case. Labeling as a bug since it is possible to create a phred offset 33 directory format with phred offset 64 data, and no error is displayed until quality scores are requested in a transformer (e.g. loading into skbio Sequence objects).
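A heuristic sketch of offset detection from the quality characters alone. It relies on the fact that characters below `@` (ASCII 64) cannot occur in Phred+64 data. The function name is hypothetical, older Solexa+64 encodings are ignored, and a result of 64 is only a guess because high-quality offset-33 data may also lack low characters.

```python
def guess_phred_offset(quality_lines):
    """Guess the PHRED offset (33 or 64) from quality strings.

    Returns 33 if any character is below '@' (ASCII 64), since such
    characters cannot occur with offset 64. Otherwise returns 64.
    """
    lowest = min(ord(ch) for line in quality_lines for ch in line)
    if lowest < 33:
        raise ValueError('Character below ASCII 33 in quality scores')
    return 33 if lowest < 64 else 64


offset_a = guess_phred_offset(['IIII', '#5?A'])
offset_b = guess_phred_offset(['hhhh', 'ffdB'])
```

Sampling a bounded number of records is probably enough in practice, but the more records inspected, the less likely a Phred+33 file is mistaken for Phred+64.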

References
Original issue reported here.

Remove package level tests

add tests for reference_features subpackage

See distance_matrix subpackage tests (added in #40) for a guide on how to write and structure these tests.

Change Iterator to Iterable

Iterators must return themselves when __iter__ is called, and DNAIterator is non-conforming. An Iterator is probably overkill as well; an Iterable would be perfectly fine.
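To illustrate the distinction, here is a minimal Iterable that hands out a fresh iterator on every `iter()` call. `SequenceIterable` is a made-up class that stores a list; the real DNAIterator wraps a generator, so a direct port would need some way to restart or re-create that generator.

```python
from collections.abc import Iterable, Iterator


class SequenceIterable:
    """An Iterable: each call to iter() starts a fresh pass over the data."""

    def __init__(self, sequences):
        self._sequences = list(sequences)

    def __iter__(self):
        # Returning a new iterator (not self) makes this an Iterable only.
        return iter(self._sequences)


seqs = SequenceIterable(['ACGT', 'GGCC'])
first_pass = list(seqs)
second_pass = list(seqs)
is_iterable = isinstance(seqs, Iterable)
is_iterator = isinstance(seqs, Iterator)
```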

Normalization not applied to `BIOMV210Format` files on import

Comments
This happens because data whose format matches the semantic type's artifact_format is imported without any transformer. While this is generally a convenience for developers, it is a pain when we need to apply additional transforms (see the feature table munging in this plugin, which strips out biom metadata, among other things). Not sure if this needs to be solved more generally in the framework, or if we can just define a BIOMV210Format -> BIOMV210Format transformer here.

References
This issue came up on the forum.

`csv` number of cols not actually being validated

I do not believe that we are successfully validating csv data from MANIFEST files.

We have some code which validates the csv from a MANIFEST. Suppose I put that code into a function def _validate_manifest_csv(manifest).

def _validate_manifest_csv(manifest):
    try:
        # Read from the `manifest` argument (the original referenced an
        # undefined `manifest_fh` here).
        manifest = pd.read_csv(manifest, comment='#', header=0,
                               skip_blank_lines=True, dtype=object)
    except pd.io.common.CParserError as e:
        raise ValueError('All records in manifest must contain '
                         'exactly three comma-separated fields, but it '
                         'appears that at least one record contains more. '
                         'Original error message:\n %s' % str(e))

Then if I put the following test into test_transformer.py:

def test_validate_manifest_csv(self):
    manifest = io.StringIO(
        'sample-id,filename,direction\n'
        'banana,/hello/world,forward,hotdog\n'  # <-- important, notice the hotdog
        'banana,/hello/world,forward\n'
        'banana,/hello/world,reverse\n'
        'banana,/hello/world,reverse\n')
    with self.assertRaisesRegex(ValueError, 'at least one record contains more.'):
        _validate_manifest_csv(manifest)

... the test fails. But I think it seems like it should succeed (i.e, the error should occur), otherwise, in what scenario are we expecting that error to happen?

In fact, I don't think pandas has any issue with jagged data, such as in the above hotdog example:

>>> import io
>>> manifest = io.StringIO(
...             'sample-id,filename,direction\n'
...             'banana,/hello/world,forward,hotdog\n'
...             'banana,/hello/world,forward\n'
...             'banana,/hello/world,reverse\n'
...             'banana,/hello/world,reverse\n')
>>> manifest = pd.read_csv(manifest, comment='#', header=0, skip_blank_lines=True, dtype=object)
>>> print(manifest)
           sample-id filename direction
banana  /hello/world  forward    hotdog
banana  /hello/world  forward       NaN
banana  /hello/world  reverse       NaN
banana  /hello/world  reverse       NaN

Am I missing something, or is the current behavior incorrect?
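If the parser cannot be relied on to reject jagged records, one option is to count the fields explicitly. This sketch uses the standard `csv` module instead of pandas; `validate_manifest_fields` is a hypothetical name, and treating `#` lines as comments mirrors the `comment='#'` argument used above.

```python
import csv
import io


def validate_manifest_fields(fh, expected=3):
    """Raise ValueError unless every record has ``expected`` fields."""
    lines = (line for line in fh
             if line.strip() and not line.startswith('#'))
    # start=1 counts retained (non-comment, non-blank) lines, header included.
    for number, fields in enumerate(csv.reader(lines), start=1):
        if len(fields) != expected:
            raise ValueError(
                'Record %d has %d comma-separated fields, expected %d'
                % (number, len(fields), expected))


good_manifest = io.StringIO(
    'sample-id,filename,direction\n'
    'banana,/hello/world,forward\n'
    'banana,/hello/world,reverse\n')
validate_manifest_fields(good_manifest)  # passes silently
```

Run against the hotdog manifest, this raises on the second retained line, which is the behavior the test above expected.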

better error message when importing DataFrame to FeatureTable

This came up on the forum here. If indices are not strings, users will get a traceback with a cryptic error message. We should improve this error message.

require fwd/rev reads for each sample with paired-end Casava 1.8 demux format

Attempting to import a directory of paired-end, per-sample fastq files using the CasavaOneEightSingleLanePerSampleDirFmt format should raise an error if there are any unpaired fwd/rev reads files. It is currently possible to create a .qza with unpaired files, which can cause issues in downstream methods/visualizers (e.g. see this forum post about qiime demux summarize).
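A sketch of the pairing check on filenames alone. The pattern follows the Casava naming convention (sample, barcode, lane, read, set number); the named capture groups and the `find_unpaired` helper are assumptions made for the example. Grouping by sample ID only, and ignoring barcode and lane, is a simplification.

```python
import re

# Named groups are added for illustration; they are not part of q2-types.
CASAVA_RE = re.compile(
    r'(?P<sample>.+)_(?P<barcode>.+)_L(?P<lane>[0-9]{3})'
    r'_R(?P<read>[12])_001\.fastq\.gz')


def find_unpaired(filenames):
    """Return sorted sample IDs lacking either an R1 or an R2 file."""
    reads = {}
    for name in filenames:
        match = CASAVA_RE.fullmatch(name)
        if match is None:
            raise ValueError('Not a Casava 1.8 filename: %r' % name)
        reads.setdefault(match.group('sample'), set()).add(
            match.group('read'))
    return sorted(s for s, seen in reads.items() if seen != {'1', '2'})


unpaired = find_unpaired([
    'S1_0_L001_R1_001.fastq.gz',
    'S1_0_L001_R2_001.fastq.gz',
    'S2_1_L001_R1_001.fastq.gz',
])
```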

Provide Common File Formats w/ sniffers

Proposed Behavior
Maybe in a subpackage (common_filefmts):

JSON
YAML
MD
Pickle

add less restrictive directory format for use in importing SampleData[SequencesWithQuality]

Currently the two formats that we support, CasavaOneEightSingleLanePerSampleDirFmt and SingleLanePerSampleSingleEndFastqDirFmt, require files to be named in the Casava convention (i.e., matching the regular expression r'.+_.+_L[0-9][0-9][0-9]_R[12]_001\.fastq\.gz'). We should add another format that uses a MANIFEST file to relax this restriction on the filenames so that, for example, the filenames could just be sample-id.fastq.gz.
