
q2-types's Issues

ReferenceFeatures and SSU types should be removed

These types are only referenced in q2-types and the docs, so we should drop them. We initially thought we needed them for q2-feature-classifier, but ended up using various FeatureData types instead.

BUG: non-functional pd.Series -> DNAFASTAFormat transformer in feature_data subpackage

I've included an example code block showing the error at the bottom of this issue. Pulling the _16 transformer into a local function and adding a skbio.DNA wrapper around the sequence string allowed it to partially work as expected (sans header ids).

In [11]: def _16(data: pd.Series) -> DNAFASTAFormat:
    ...:     ff = DNAFASTAFormat()
    ...:     with ff.open() as f:
    ...:         for sequence in data:
    ...:             skbio.io.write(skbio.DNA(sequence), format='fasta', into=f)
    ...:     return ff
    ...: 

In [12]: f = _16(features.loc[data.columns, 'DenoisedSequenceVariant'])

In [13]: !head {f.path}
>
GCGAGCGT...
>
GCAAGCGT...
>
GCAAGCGT...
>
GCAAGCGT...
>
GCAAGCGT...

The funny(/ironic?) part is that it's the only transformer in the sub-package without any tests.

In [6]: qiime2.Artifact.import_data('FeatureData[Sequence]', features.loc[data.columns, 'DenoisedSequenceVariant'])
---------------------------------------------------------------------------
UnrecognizedFormatError                   Traceback (most recent call last)
<ipython-input-6-f8fdc74db9db> in <module>()
----> 1 qiime2.Artifact.import_data('FeatureData[Sequence]', features.loc[data.columns, 'DenoisedSequenceVariant'])

~/dev/mc3/envs/[redacted]/lib/python3.5/site-packages/qiime2/sdk/result.py in import_data(cls, type, view, view_type)
    190 
    191         provenance_capture = archive.ImportProvenanceCapture(format_, md5sums)
--> 192         return cls._from_view(type_, view, view_type, provenance_capture)
    193 
    194     @classmethod

~/dev/mc3/envs/[redacted]/lib/python3.5/site-packages/qiime2/sdk/result.py in _from_view(cls, type, view, view_type, provenance_capture)
    215         transformation = from_type.make_transformation(to_type,
    216                                                        recorder=recorder)
--> 217         result = transformation(view)
    218 
    219         artifact = cls.__new__(cls)

~/dev/mc3/envs/[redacted]/lib/python3.5/site-packages/qiime2/core/transform.py in transformation(view)
     57             self.validate(view)
     58 
---> 59             new_view = transformer(view)
     60 
     61             new_view = other.coerce_view(new_view)

~/dev/mc3/envs/[redacted]/lib/python3.5/site-packages/qiime2/core/transform.py in wrapped(view)
    205         def wrapped(view):
    206             new_view = self._view_type()
--> 207             file_view = transformer(view)
    208             if transformer is not identity_transformer:
    209                 self.set_user_owned(file_view, False)

~/dev/mc3/envs/[redacted]/lib/python3.5/site-packages/q2_types/feature_data/_transformer.py in _16(data)
    339     with ff.open() as f:
    340         for sequence in data:
--> 341             skbio.io.write(sequence, format='fasta', into=f)
    342     return ff
    343 

~/dev/mc3/envs/[redacted]/lib/python3.5/site-packages/skbio/io/registry.py in write(obj, format, into, **kwargs)
   1164 @wraps(IORegistry.write)
   1165 def write(obj, format, into, **kwargs):
-> 1166     return io_registry.write(obj, format, into, **kwargs)
   1167 
   1168 

~/dev/mc3/envs/[redacted]/lib/python3.5/site-packages/skbio/io/registry.py in write(self, obj, format, into, **kwargs)
    615             raise UnrecognizedFormatError(
    616                 "Cannot write %r into %r, no %s writer found." %
--> 617                 (format, into, obj.__class__.__name__))
    618 
    619         writer(obj, into, **kwargs)

UnrecognizedFormatError: Cannot write 'fasta' into <_io.TextIOWrapper name='/tmp/q2-DNAFASTAFormat-ohosxny8' mode='r+' encoding='utf8'>, no str writer found.

In [7]: 

I can work on this if necessary!
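The empty `>` headers in the partial fix above come from iterating over the Series values alone, which drops the index labels. A minimal sketch of pairing each ID with its sequence, in plain Python rather than skbio so it runs anywhere; `write_fasta` and the sample records are hypothetical names, and in the real transformer the pairs would presumably come from `data.items()`:

```python
import io


def write_fasta(records, fh):
    """Write (feature_id, sequence) pairs to fh as FASTA records."""
    for feature_id, sequence in records:
        if not feature_id:
            raise ValueError("every sequence needs a non-empty ID")
        fh.write(">%s\n%s\n" % (feature_id, sequence))


# Hypothetical feature IDs and sequences, standing in for a Series' items.
records = [("feature-1", "GCGAGCGT"), ("feature-2", "GCAAGCGT")]
buffer = io.StringIO()
write_fasta(records, buffer)
fasta_text = buffer.getvalue()
```

With skbio, the same idea would likely be expressed by passing `metadata={'id': feature_id}` when constructing each `skbio.DNA`, though that is untested here.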

Add transformers/formats for TSV-style OTU tables (in either orientation)

Improvement Description
There are two styles of TSV that would be useful, and two orientations.

Styles:

  • Matrix style (R-lang): a labelled matrix with no first (corner) cell, i.e. there are N columns but only N-1 labels.
  • Record style (most everyone else): a more or less standard TSV with N columns and N labels for those columns.

Orientations:

  • Rows are samples (more intuitive)
  • Rows are OTUs/features (has some historical precedent)

Transformers should convert the BIOMV210Format into the 4 combinations above. I don't have good names for these, but some examples might be:

MatrixTSVBySampleFormat
MatrixTSVByFeatureFormat
RecordTSVBySampleFormat
RecordTSVByFeatureFormat


In the future we could do smarter things with TSVs and schemas, but for now, the above would help a lot of people with a pretty mundane conversion.
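To make the four combinations concrete, here is a rough sketch that renders a small table in any style/orientation pair. It works on a plain dict of dicts rather than a BIOM table, and `table_to_tsv`, its parameters, and the `sample-id`/`feature-id` corner labels are all assumptions for illustration:

```python
import io


def table_to_tsv(table, sample_ids, feature_ids, by="sample", style="record"):
    """Render a {sample_id: {feature_id: count}} mapping as TSV text."""
    if by not in ("sample", "feature") or style not in ("record", "matrix"):
        raise ValueError("unsupported layout: by=%r, style=%r" % (by, style))

    if by == "sample":
        row_ids, col_ids, corner = sample_ids, feature_ids, "sample-id"
    else:
        row_ids, col_ids, corner = feature_ids, sample_ids, "feature-id"

    def count(row_id, col_id):
        # Map (row, column) back to (sample, feature) for the lookup.
        sample, feature = (row_id, col_id) if by == "sample" else (col_id, row_id)
        return table[sample].get(feature, 0)

    # Record style labels every column; matrix style drops the corner label.
    header = list(col_ids) if style == "matrix" else [corner] + list(col_ids)
    out = io.StringIO()
    out.write("\t".join(header) + "\n")
    for row_id in row_ids:
        cells = [row_id] + [str(count(row_id, col_id)) for col_id in col_ids]
        out.write("\t".join(cells) + "\n")
    return out.getvalue()


table = {"s1": {"otu1": 3, "otu2": 0}, "s2": {"otu1": 1, "otu2": 5}}
record_by_sample = table_to_tsv(table, ["s1", "s2"], ["otu1", "otu2"])
matrix_by_feature = table_to_tsv(table, ["s1", "s2"], ["otu1", "otu2"],
                                 by="feature", style="matrix")
```

The only difference between the two styles is the header line, so a single writer with two switches may be enough to back all four formats.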

Format `SingleEndFastqManifestPhred33` and friends do not validate that files are gzipped

It seems to permit both .fastq and .fastq.gz files as "input" for the manifest format.

It doesn't look like the FastqGzFormat used in SingleLanePerSampleSingleEndFastqDirFmt (or its paired variant) verifies this either. It should reject files that aren't gzipped in its sniff method.

It would be nice to be able to gzip in the transformers from the .*FastqManifest.* formats if possible.
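One cheap way a sniffer could check this is to look for the two-byte gzip magic number at the start of the file. A minimal sketch using only the standard library; `is_gzipped` is a hypothetical helper, not an existing q2-types function:

```python
import gzip
import os
import tempfile

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip member


def is_gzipped(path):
    """Return True if the file at path starts with the gzip magic number."""
    with open(path, "rb") as fh:
        return fh.read(2) == GZIP_MAGIC


with tempfile.TemporaryDirectory() as tmp:
    plain_path = os.path.join(tmp, "reads.fastq")
    gz_path = os.path.join(tmp, "reads.fastq.gz")
    record = b"@read1\nACGT\n+\nIIII\n"
    with open(plain_path, "wb") as fh:
        fh.write(record)
    with gzip.open(gz_path, "wb") as fh:
        fh.write(record)
    plain_is_gz = is_gzipped(plain_path)
    gz_is_gz = is_gzipped(gz_path)
```

This only inspects the header bytes; it does not prove the rest of the stream decompresses cleanly.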

support FeatureData[Sequences] (OTU Map)

Improvement Description
It'd be useful to support FeatureData[Sequences], i.e. analogous to QIIME 1's "OTU Map". This type/format describes the sequences in each feature (e.g. sequences that clustered into an OTU).

Comments
We had planned to add this type but deferred until we could come up with a reasonable file format (the QIIME 1 OTU Map format is un-parsable in Python when the lines are too long).

References
This type was requested on the QIIME 2 forum here.

replace phylogeny qza files

The current tests/data/phylogeny-rooted.qza and tests/data/phylogeny-unrooted.qza files are really big and don't work with the wiki tutorial. We should replace them with these files:

https://dl.dropboxusercontent.com/u/2868868/phylogeny-rooted.qza
https://dl.dropboxusercontent.com/u/2868868/phylogeny-unrooted.qza

It would be nice to re-write the history to remove the current files that are in there, since they are much larger than anything else.

Thanks @jairideout for catching this issue!

Support MiSeq demultiplexed data when importing

This format doesn't have a lane identifier so we would need another format to support this.

It would be much easier to use this format than to create a fastq-manifest with potentially hundreds of lines.

Issues with HeaderlessTSVTaxonomyFormat

@thermokarst I think the issue I was having with the HeaderlessTSVTaxonomyFormat was possibly related to the wrong base class being used? I'm not sure but it is obviously not like the rest.
https://github.com/qiime2/q2-types/blob/master/q2_types/feature_data/_format.py#L69

I noticed this because I was getting this error and I didn't know why it was trying to do a transformation to a HeaderlessTSVTaxonomyFormat when I am positive I already gave it a HeaderlessTSVTaxonomyFormat.
[Screenshot of the error message, taken 2017-06-23; image not preserved]

Well, it's also not defined yet, so I'll just make my own for now. Thanks for all your help!

TaxonomyFormat assumes first line is header

When reading TaxonomyFormat with any of the transformers, the first line is assumed to be a (non-comment) header, followed by the taxonomy mapping lines. The sniffer is very lenient and only cares that the file is two-column TSV.

Not all taxonomy files include a header (for example, Greengenes). When a transformer is invoked to read the file, the first line is interpreted as a header, causing the feature ID to be set as Index.name and the taxonomy string to be set as Series.name.

For example, suppose we have the following taxonomy.tsv file (I used the first few lines from the Greengenes taxonomy map):

228054  k__Bacteria; p__Cyanobacteria; c__Synechococcophycideae; o__Synechococcales; f__Synechococcaceae; g__Synechococcus; s__
228057  k__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; o__Rickettsiales; f__Pelagibacteraceae; g__; s__
73627   k__Bacteria; p__Actinobacteria; c__Actinobacteria; o__Actinomycetales; f__Mycobacteriaceae; g__Mycobacterium; s__
378462  k__Bacteria; p__Firmicutes; c__Bacilli; o__Bacillales; f__Staphylococcaceae; g__Staphylococcus; s__
89370   k__Bacteria; p__Firmicutes; c__Bacilli; o__Bacillales; f__Bacillaceae; g__Anoxybacillus; s__kestanbolensis

Reading this file into a pd.Series (all transformers are affected, it's not limited to the pd.Series transformer):

In [1]: from qiime2.plugin.util import transform

In [2]: from q2_types.feature_data import TaxonomyFormat

In [3]: import pandas as pd

In [4]: taxonomy_series = transform('taxonomy.tsv', from_type=TaxonomyFormat, to_type=pd.Series)

In [5]: taxonomy_series
Out[5]:
228054
228057    k__Bacteria; p__Proteobacteria; c__Alphaproteo...
73627     k__Bacteria; p__Actinobacteria; c__Actinobacte...
378462    k__Bacteria; p__Firmicutes; c__Bacilli; o__Bac...
89370     k__Bacteria; p__Firmicutes; c__Bacilli; o__Bac...
Name: k__Bacteria; p__Cyanobacteria; c__Synechococcophycideae; o__Synechococcales; f__Synechococcaceae; g__Synechococcus; s__, dtype: object

In [6]: # :(

In [7]:

The Series has its Index.name set to "228054" and its Series.name to "k__Bacteria; p__Cyanobacteria; c__Synechococcophycideae; o__Synechococcales; f__Synechococcaceae; g__Synechococcus; s__".

During classification, if a query sequence is assigned to the first reference sequence taxonomy (i.e. the one that's misinterpreted as a header), I think the code will error with an IndexError coming from pandas. This happened with @nbokulich's vsearch classifier (not yet in master), and adding a header line to the Greengenes file appears to fix the issue (he's not receiving the error anymore at least).

I don't think we've seen this error with the existing classifiers because we've never had a query sequence assigned to the first reference sequence (e.g. just by chance). I suspect (and hope) that the code would fail in a similar way with an IndexError, but haven't confirmed.

I propose that we require TaxonomyFormat (both when sniffing and reading) to have the following header (we've been using this header in the unit tests, maybe elsewhere):

Feature ID<tab>Taxon

... optionally followed by other columns that are ignored. We could finesse the column names a little -- something like feature_id and taxon would be easier to access from pandas objects. This is a minor detail we can work out later.

If we go with a stricter format, such as what I'm proposing, then importing files without the appropriate header (e.g. Greengenes and other reference databases) will raise an error and the header will have to be added to the file in order to import. This is annoying, but I really think we should stop supporting tabular files without headers, especially because we want the .qza file formats to be as self-documenting as possible.
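A rough sketch of what the stricter read could look like, using a plain dict instead of a pd.Series so it stays dependency-free. `read_taxonomy` and `EXPECTED_HEADER` are hypothetical names, and the taxonomy strings are shortened from the Greengenes example above:

```python
import io

EXPECTED_HEADER = ("Feature ID", "Taxon")


def read_taxonomy(fh):
    """Parse a headered taxonomy TSV into a {feature_id: taxon} dict."""
    header = fh.readline().rstrip("\n").split("\t")
    # Extra columns after the first two are allowed and ignored.
    if tuple(header[:2]) != EXPECTED_HEADER:
        raise ValueError("expected header %r, found %r"
                         % ("\t".join(EXPECTED_HEADER), "\t".join(header)))
    taxonomy = {}
    for line in fh:
        line = line.rstrip("\n")
        if not line:
            continue
        fields = line.split("\t")
        if len(fields) < 2:
            raise ValueError("expected at least two columns: %r" % line)
        taxonomy[fields[0]] = fields[1]
    return taxonomy


with_header = io.StringIO(
    "Feature ID\tTaxon\n228054\tk__Bacteria; p__Cyanobacteria\n")
without_header = io.StringIO("228054\tk__Bacteria; p__Cyanobacteria\n")

parsed = read_taxonomy(with_header)
try:
    read_taxonomy(without_header)
    headerless_rejected = False
except ValueError:
    headerless_rejected = True
```

Failing loudly on a missing header means the first record can never be silently consumed as column names.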

Thoughts? cc: @gregcaporaso, @nbokulich, @BenKaehler, @ebolyen, @thermokarst, @jakereps

Thanks @nbokulich for finding and reporting this bug!

transformer for OrdinationFormat --> metadata

This would make it possible to use principal coordinates (or other ordination results) as input metadata. I am working on some methods that could employ this, e.g., to test whether samples change over PC1 before/after treatment.

The transformation of OrdinationFormat --> pd.DataFrame can be achieved with something like this (in a Jupyter notebook, at least; I suppose the first line might be unnecessary in a transformer):

beta_div = beta_div.view(skbio.OrdinationResults)
beta_div = beta_div.samples.loc[:, 0:2]
beta_div.columns = ['unweighted-unifrac-pc1', 'unweighted-unifrac-pc2', 'unweighted-unifrac-pc3']

and then I assume the beta_div DataFrame can be converted to metadata with

qiime2.Metadata(beta_div)

I would find this extremely useful — any interest?

Importing fastq with wc -l % 4 != 0 doesn't fail

A user on the forum was able to import fastq files using one of the manifest formats. Downstream in the analysis it appears that the sequences don't have quality scores associated with them, e.g.:

    Traceback (most recent call last):
      File "/home/qiime2/miniconda/envs/qiime2-2017.8/lib/python3.5/site-packages/q2cli/commands.py", line 222, in __call__
        results = action(**arguments)
      File "<decorator-gen-207>", line 2, in summarize
      File "/home/qiime2/miniconda/envs/qiime2-2017.8/lib/python3.5/site-packages/qiime2/sdk/action.py", line 201, in callable_wrapper
        output_types, provenance)
      File "/home/qiime2/miniconda/envs/qiime2-2017.8/lib/python3.5/site-packages/qiime2/sdk/action.py", line 392, in _callable_executor_
        ret_val = callable(output_dir=temp_dir, **view_args)
      File "/home/qiime2/miniconda/envs/qiime2-2017.8/lib/python3.5/site-packages/q2_demux/_summarize/_visualizer.py", line 114, in summarize
        for seq in _read_fastq_seqs(file):
      File "/home/qiime2/miniconda/envs/qiime2-2017.8/lib/python3.5/site-packages/q2_demux/_demux.py", line 36, in _read_fastq_seqs
        qual.strip())
    AttributeError: 'NoneType' object has no attribute 'strip'
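A `None` quality line is what one would expect from a record that is cut short, so a structural check at import time might catch this earlier. A minimal sketch, assuming strict four-line FASTQ records (no wrapped sequence lines); `validate_fastq` is a hypothetical helper and it reads the whole file into memory for simplicity:

```python
import io


def validate_fastq(fh):
    """Return the record count, raising ValueError on a malformed file."""
    lines = [line.rstrip("\n") for line in fh]
    if len(lines) % 4 != 0:
        raise ValueError("line count %d is not a multiple of 4" % len(lines))
    for start in range(0, len(lines), 4):
        header, seq, sep, qual = lines[start:start + 4]
        if not header.startswith("@") or not sep.startswith("+"):
            raise ValueError("malformed record at line %d" % (start + 1))
        if len(seq) != len(qual):
            raise ValueError("sequence/quality length mismatch at line %d"
                             % (start + 1))
    return len(lines) // 4


complete = io.StringIO("@r1\nACGT\n+\nIIII\n")
truncated = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\n")

n_records = validate_fastq(complete)
try:
    validate_fastq(truncated)
    truncated_rejected = False
except ValueError:
    truncated_rejected = True
```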

Should DNAIterator support lowercase fasta sequences?

Bug Description
When feature_classifier.extract_reads encounters a sequence with a lowercase letter in it, it throws the error below.

Screenshots

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-7-674d2044b0eb> in <module>()
      7         seqs_in = qiime.Artifact.import_data("FeatureData[Sequence]", seqs)
      8         reads = feature_classifier.methods.extract_reads(seqs_in, read_length,
----> 9                                                          fwd_primer, rev_primer)
     10         reads.save(reads_out)

<decorator-gen-204> in extract_reads(sequences, read_length, f_primer, r_primer, method, direction, n_sample)

/Users/nbokulich/miniconda3/envs/qiime2-06/lib/python3.5/site-packages/qiime-2.0.6-py3.5.egg/qiime/core/callable.py in callable_wrapper(*args, **kwargs)
    225 
    226             outputs = self._callable_executor_(self._callable, view_args,
--> 227                                                output_types, provenance)
    228             # `outputs` matches a Python function's return: either a single
    229             # value is returned, or it is a tuple of return values. Treat both

/Users/nbokulich/miniconda3/envs/qiime2-06/lib/python3.5/site-packages/qiime-2.0.6-py3.5.egg/qiime/core/callable.py in _callable_executor_(self, callable, view_args, output_types, provenance)
    350                     (view_type.__name__, type(output_view).__name__))
    351             artifact = qiime.sdk.Artifact._from_view(
--> 352                 semantic_type, output_view, view_type, provenance.fork())
    353             output_artifacts.append(artifact)
    354 

/Users/nbokulich/miniconda3/envs/qiime2-06/lib/python3.5/site-packages/qiime-2.0.6-py3.5.egg/qiime/sdk/result.py in _from_view(cls, type, view, view_type, provenance_capture)
    214         transformation = from_type.make_transformation(to_type,
    215                                                        recorder=recorder)
--> 216         result = transformation(view)
    217 
    218         artifact = cls.__new__(cls)

/Users/nbokulich/miniconda3/envs/qiime2-06/lib/python3.5/site-packages/qiime-2.0.6-py3.5.egg/qiime/core/transform.py in transformation(view)
     57             self.validate(view)
     58 
---> 59             new_view = transformer(view)
     60 
     61             new_view = other.coerce_view(new_view)

/Users/nbokulich/miniconda3/envs/qiime2-06/lib/python3.5/site-packages/qiime-2.0.6-py3.5.egg/qiime/core/transform.py in wrapped(view)
    188         def wrapped(view):
    189             new_view = self._view_type()
--> 190             file_view = transformer(view)
    191             if transformer is not identity_transformer:
    192                 self.set_user_owned(file_view, False)

/Users/nbokulich/miniconda3/envs/qiime2-06/lib/python3.5/site-packages/q2_types-0.0.6-py3.5.egg/q2_types/feature_data/_transformer.py in _10(data)
     89 def _10(data: DNAIterator) -> DNAFASTAFormat:
     90     ff = DNAFASTAFormat()
---> 91     skbio.io.write(data.generator, format='fasta', into=str(ff))
     92     return ff
     93 

/Users/nbokulich/miniconda3/envs/qiime2-06/lib/python3.5/site-packages/skbio/io/registry.py in write(obj, format, into, **kwargs)
   1164 @wraps(IORegistry.write)
   1165 def write(obj, format, into, **kwargs):
-> 1166     return io_registry.write(obj, format, into, **kwargs)
   1167 
   1168 

/Users/nbokulich/miniconda3/envs/qiime2-06/lib/python3.5/site-packages/skbio/io/registry.py in write(self, obj, format, into, **kwargs)
    617                 (format, into, obj.__class__.__name__))
    618 
--> 619         writer(obj, into, **kwargs)
    620         return into
    621 

/Users/nbokulich/miniconda3/envs/qiime2-06/lib/python3.5/site-packages/skbio/io/registry.py in wrapped_writer(obj, file, encoding, newline, **kwargs)
   1080                 with open_files(files, mode='w', **io_kwargs) as fhs:
   1081                     kwargs.update(zip(file_keys, fhs[:-1]))
-> 1082                     writer_function(obj, fhs[-1], **kwargs)
   1083 
   1084             self._add_writer(cls, wrapped_writer, monkey_patch, override)

/Users/nbokulich/miniconda3/envs/qiime2-06/lib/python3.5/site-packages/skbio/io/format/fasta.py in _generator_to_fasta(obj, fh, qual, id_whitespace_replacement, description_newline_replacement, max_width, lowercase)
    772         obj, id_whitespace_replacement, description_newline_replacement,
    773         qual is not None, lowercase)
--> 774     for header, seq_str, qual_scores in formatted_records:
    775         if max_width is not None:
    776             seq_str = chunk_str(seq_str, max_width, '\n')

/Users/nbokulich/miniconda3/envs/qiime2-06/lib/python3.5/site-packages/skbio/io/format/_base.py in _format_fasta_like_records(generator, id_whitespace_replacement, description_newline_replacement, require_qual, lowercase)
    144             "sequence IDs, nor to replace newlines in sequence descriptions.")
    145 
--> 146     for idx, seq in enumerate(generator):
    147 
    148         if len(seq) < 1:

/Users/nbokulich/miniconda3/envs/qiime2-06/lib/python3.5/site-packages/q2_feature_classifier-0.0.6-py3.5.egg/q2_feature_classifier/_cutter.py in read_seqs()
    129 
    130     def read_seqs():
--> 131         for single_sequence_tuple in result:
    132             yield single_sequence_tuple[0]
    133     return DNAIterator(read_seqs())

/Users/nbokulich/miniconda3/envs/qiime2-06/lib/python3.5/site-packages/q2_feature_classifier-0.0.6-py3.5.egg/q2_feature_classifier/_gregex.py in extract_reads_by_position(aln, readlength, f_primer, r_primer, endedness, sample)
     56     query_cache = []
     57     i = 0
---> 58     for query in aln:
     59         query_cache.append(query)
     60         gaps = query.gaps()

/Users/nbokulich/miniconda3/envs/qiime2-06/lib/python3.5/site-packages/skbio/io/registry.py in <genexpr>(.0)
    504             # GeneratorType
    505             try:
--> 506                 return (x for x in itertools.chain([next(gen)], gen))
    507             except StopIteration:
    508                 # If the error was a StopIteration, then we want to return an

/Users/nbokulich/miniconda3/envs/qiime2-06/lib/python3.5/site-packages/skbio/io/registry.py in _read_gen(self, file, fmt, into, verify, kwargs)
    529             reader, kwargs = self._init_reader(file, fmt, into, verify, kwargs,
    530                                                io_kwargs)
--> 531             yield from reader(file, **kwargs)
    532 
    533     def _find_io_kwargs(self, kwargs):

/Users/nbokulich/miniconda3/envs/qiime2-06/lib/python3.5/site-packages/skbio/io/registry.py in wrapped_reader(file, encoding, newline, **kwargs)
   1006                     with open_files(files, mode='r', **io_kwargs) as fhs:
   1007                         kwargs.update(zip(file_keys, fhs[:-1]))
-> 1008                         yield from reader_function(fhs[-1], **kwargs)
   1009 
   1010             self._add_reader(cls, wrapped_reader, monkey_patch, override)

/Users/nbokulich/miniconda3/envs/qiime2-06/lib/python3.5/site-packages/skbio/io/format/fasta.py in _fasta_to_generator(fh, qual, constructor, **kwargs)
    675                                                FASTAFormatError):
    676             yield constructor(seq, metadata={'id': id_, 'description': desc},
--> 677                               **kwargs)
    678     else:
    679         fasta_gen = _parse_fasta_raw(fh, _parse_sequence_data,

/Users/nbokulich/miniconda3/envs/qiime2-06/lib/python3.5/site-packages/skbio/sequence/_grammared_sequence.py in __init__(self, sequence, metadata, positional_metadata, lowercase, validate)
    334 
    335         if validate:
--> 336             self._validate()
    337 
    338     def _validate(self):

/Users/nbokulich/miniconda3/envs/qiime2-06/lib/python3.5/site-packages/skbio/sequence/_grammared_sequence.py in _validate(self)
    358                    [str(b.tostring().decode("ascii")) for b in bad] if
    359                    len(bad) > 1 else bad[0],
--> 360                    list(self.alphabet)))
    361 
    362     @stable(as_of='0.4.0')

ValueError: Invalid character in sequence: b't'. 
Valid characters: ['G', 'C', '.', 'Y', 'W', 'B', 'R', 'V', 'N', 'K', 'D', 'S', '-', 'A', 'H', 'T', 'M']
Note: Use `lowercase` if your sequence contains lowercase characters not in the sequence's alphabet.

Comments
According to @BenKaehler: it looks like it's coming from skbio when we attempt to write out lowercase sequences, which is called indirectly from q2_types. Hence I am posting this issue here.

support importing multiplexed seqs containing barcodes

Came up on the forum a few times (e.g. here, here, and here). Users need to be able to import multiplexed sequence data that contains barcodes in the sequences (we currently support data that has the barcodes extracted in a separate file, i.e. the "EMP protocol multiplexed data"). For now, a workaround is to use QIIME 1's extract_barcodes.py to extract the barcodes into their own file.

Super type for Phylogeny

Right now Phylogeny implies that it will only be allowed to handle phylogenetic trees. But there are many tree-like structures that could be made, for instance hierarchical clusterings.

Could we create a super type, for example Hierarchy, that could encompass both Phylogenies and Clusterings?

`PerSampleDNAIterators` doesn't take comments into account

SingleLanePerSampleSingleEndFastqDirFmt and SingleLanePerSamplePairedEndFastqDirFmt -> PerSampleDNAIterators transformers don't take the MANIFEST comments into account and crash when attempting to view the artifact as an iterator.

In [1]: import qiime2
In [2]: from q2_types.per_sample_sequences import PerSampleDNAIterators
In [3]: a = qiime2.Artifact.load('20170626_1/demux.qza')
In [4]: a.view(PerSampleDNAIterators)    
...
~/Developer/mc3/envs/biota/lib/python3.5/site-packages/q2_types/per_sample_sequences/_transformer.py in _1(dirfmt) 
     44     next(fh)
     45     for line in fh:
---> 46         sample_id, filename, _ = line.split(',')
     47         filepath = str(dirfmt.path / filename)
     48         result[sample_id] = skbio.io.read(filepath, format='fastq',

ValueError: not enough values to unpack (expected 3, got 1)   

Autogenerated MANIFEST file:
[Screenshot of the autogenerated MANIFEST file; image not preserved]
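The unpacking error is consistent with a line that contains no commas, such as a `#` comment. A sketch of a reader that drops comment and blank lines before parsing; `read_manifest` is a hypothetical helper, and the manifest text below (including where the comment sits) is made up for illustration:

```python
import csv
import io


def read_manifest(fh):
    """Return the data rows of a MANIFEST, skipping comments and the header."""
    data_lines = [line for line in fh
                  if line.strip() and not line.lstrip().startswith("#")]
    rows = list(csv.reader(data_lines))
    return rows[1:]  # the first remaining line is the column header


manifest = io.StringIO(
    "sample-id,filename,direction\n"
    "# a comment line that the transformer currently trips over\n"
    "s1,s1_R1.fastq.gz,forward\n"
    "s2,s2_R1.fastq.gz,forward\n")
rows = read_manifest(manifest)
```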

Prevent reading/writing empty `FeatureTable`s

Current Behavior
Currently methods like feature-table filter-samples can filter out all values in a table, resulting in a successfully created (albeit empty) artifact. This then causes problems in other methods, where they expect to have some data in the table. As well, without this centralized check, the burden of checking for data in these tables falls on the plugin developer, which creates some extra work.

References
This is a more specific case of qiime2/qiime2#294.

detect and support phred offsets other than 33

Bug Description
The current directory formats assume the phred offset is 33, but this is not always the case. Labeling as a bug since it is possible to create a phred offset 33 directory format with phred offset 64 data, and no error is displayed until quality scores are requested in a transformer (e.g. loading into skbio Sequence objects).
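Detection can only ever be a heuristic, since many quality characters are valid under both offsets. A rough sketch, assuming the common rule of thumb that characters below `;` (ASCII 59) only occur with offset 33 and characters above `J` (ASCII 74) point to offset 64; the thresholds and `guess_phred_offset` are assumptions, not something the formats currently do:

```python
def guess_phred_offset(quality_lines):
    """Return 33, 64, or None when the observed characters are ambiguous."""
    codes = [ord(ch) for line in quality_lines for ch in line]
    if not codes:
        return None
    if min(codes) < 59:
        return 33  # too low to be valid under any 64-based encoding
    if max(codes) > 74:
        return 64  # higher than typical offset-33 output
    return None  # every character fits both encodings


offset_33 = guess_phred_offset(["IIII##"])
offset_64 = guess_phred_offset(["hhhh"])
ambiguous = guess_phred_offset(["FFFF"])
```

Returning `None` for the ambiguous case leaves room to fall back on an explicit user-supplied offset.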

References
Original issue reported here.

Change Iterator to Iterable

Iterators must return themselves when __iter__ is called. DNAIterator is non-conforming. An Iterator is probably overkill as well; an Iterable would be perfectly fine.
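For reference, the distinction in plain Python: an iterable only needs `__iter__`, and may hand out a fresh iterator on every call. The class below is a hypothetical illustration, not DNAIterator's actual implementation:

```python
from collections.abc import Iterable, Iterator


class SequenceCollection:
    """An iterable: each __iter__ call returns a new, independent iterator."""

    def __init__(self, sequences):
        self._sequences = list(sequences)

    def __iter__(self):
        return iter(self._sequences)


seqs = SequenceCollection(["ACGT", "GGCC"])
first_pass = list(seqs)
second_pass = list(seqs)  # a second pass works; nothing was exhausted
is_iterable = isinstance(seqs, Iterable)
is_iterator = isinstance(seqs, Iterator)
```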

Normalization not applied to `BIOMV210Format` files on import

Comments
This is because imported data that shares the same artifact_format as the semantic type enjoys transformerless imports. While this is generally a convenience for developers, it makes things a pain when we need to apply additional transforms (see: feature table munging in this plugin, which strips out biom metadata, among other things). Not sure if this needs to be solved more generally in the framework, or if we can just define a BIOMV210Format -> BIOMV210Format transformer here.

References
This issue came up on the forum.

`csv` number of cols not actually being validated

I do not believe that we are successfully validating csv data from MANIFEST files.

We have some code which validates the csv from a MANIFEST. Suppose I put that code into a function def _validate_manifest_csv(manifest).

def _validate_manifest_csv(manifest):
    try:
        # `manifest` is the open MANIFEST file handle passed in by the caller.
        manifest = pd.read_csv(manifest, comment='#', header=0,
                               skip_blank_lines=True, dtype=object)
    except pd.io.common.CParserError as e:
        raise ValueError('All records in manifest must contain '
                         'exactly three comma-separated fields, but it '
                         'appears that at least one record contains more. '
                         'Original error message:\n %s' % str(e))

Then if I put the following test into test_transformer.py:

def test_validate_manifest_csv(self):
    manifest = io.StringIO(
        'sample-id,filename,direction\n'
        'banana,/hello/world,forward,hotdog\n'  # <-- important, notice the hotdog
        'banana,/hello/world,forward\n'
        'banana,/hello/world,reverse\n'
        'banana,/hello/world,reverse\n')
    with self.assertRaisesRegex(ValueError,
                                'at least one record contains more.'):
        _validate_manifest_csv(manifest)

... the test fails. I think it should pass (i.e., the error should be raised); otherwise, in what scenario are we expecting that error to happen?

In fact, I don't think pandas has any issue with jagged data, such as in the above hotdog example:

>>> import io
>>> manifest = io.StringIO(
...             'sample-id,filename,direction\n'
...             'banana,/hello/world,forward,hotdog\n'
...             'banana,/hello/world,forward\n'
...             'banana,/hello/world,reverse\n'
...             'banana,/hello/world,reverse\n')
>>> manifest = pd.read_csv(manifest, comment='#', header=0, skip_blank_lines=True, dtype=object)
>>> print(manifest)
           sample-id filename direction
banana  /hello/world  forward    hotdog
banana  /hello/world  forward       NaN
banana  /hello/world  reverse       NaN
banana  /hello/world  reverse       NaN

Am I missing something, or is the current behavior incorrect?
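In the output above, pandas appears to absorb the extra field by treating the first column as the index, which would explain why no parser error is raised. If the goal is an exact field count, an explicit check may be more dependable. A minimal sketch with the standard `csv` module; `validate_manifest_fields` is a hypothetical helper and comment handling is left out for brevity:

```python
import csv
import io


def validate_manifest_fields(fh, expected=3):
    """Raise ValueError unless every record has exactly `expected` fields."""
    for number, fields in enumerate(csv.reader(fh), start=1):
        if len(fields) != expected:
            raise ValueError("record %d has %d fields, expected %d"
                             % (number, len(fields), expected))


jagged = io.StringIO(
    "sample-id,filename,direction\n"
    "banana,/hello/world,forward,hotdog\n"
    "banana,/hello/world,forward\n")
well_formed = io.StringIO(
    "sample-id,filename,direction\n"
    "banana,/hello/world,forward\n")

try:
    validate_manifest_fields(jagged)
    jagged_rejected = False
except ValueError as error:
    jagged_rejected = True
    jagged_message = str(error)

validate_manifest_fields(well_formed)  # passes silently
```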

add less restrictive directory format for use in importing SampleData[SequencesWithQuality]

Currently the two formats that we support, CasavaOneEightSingleLanePerSampleDirFmt and SingleLanePerSampleSingleEndFastqDirFmt, require files to be named in the Casava convention (i.e., matching the regular expression r'.+_.+_L[0-9][0-9][0-9]_R[12]_001\.fastq\.gz'). We should add another format that uses a MANIFEST file to relax this restriction on the filenames so that, for example, the filenames could just be sample-id.fastq.gz.
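To illustrate the restriction, the expression quoted above can be tried against two filenames. The filenames are made up, and whether the formats apply the pattern with `fullmatch` or some other call is an assumption:

```python
import re

# Pattern copied from the description above.
CASAVA_PATTERN = re.compile(r'.+_.+_L[0-9][0-9][0-9]_R[12]_001\.fastq\.gz')

casava_ok = CASAVA_PATTERN.fullmatch(
    "sampleA_S1_L001_R1_001.fastq.gz") is not None
plain_ok = CASAVA_PATTERN.fullmatch("sample-id.fastq.gz") is not None
```

A MANIFEST-based format would sidestep this by mapping sample IDs to filenames explicitly instead of parsing them out of the name.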
