biocore / deblur


Deblur is a greedy deconvolution algorithm based on known read error profiles.

License: BSD 3-Clause "New" or "Revised" License

Python 100.00%

deblur's Issues

Use a lookup table to translate the sequence to an np array

See here
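
One way this could look, as a rough sketch: build a 256-entry NumPy table indexed by ASCII code, so a whole sequence is translated in a single indexing step. The function name seq_to_array and the A/C/G/T to 0..3 encoding (with 4 for anything else) are assumptions for illustration, not deblur's actual encoding.

```python
import numpy as np

# Hypothetical encoding: A, C, G, T -> 0..3; any other character -> 4.
_LOOKUP = np.full(256, 4, dtype=np.uint8)
for code, base in enumerate("ACGT"):
    _LOOKUP[ord(base)] = code
    _LOOKUP[ord(base.lower())] = code


def seq_to_array(seq):
    """Translate a nucleotide string to a uint8 array via the lookup table."""
    raw = np.frombuffer(seq.encode("ascii"), dtype=np.uint8)
    # Integer-array indexing maps every ASCII code to its numeric code at once.
    return _LOOKUP[raw]
```

The table is built once at import time, so the per-sequence cost is a single vectorized lookup instead of a Python-level loop.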

access supplied positive and negative data files by default

positive (gg 88% rep set) and negative (phix+adapter) fasta files are part of the git repository
deblur needs to access them by default, based on the mode (-n)
also need to check whether they are indexed and, if not, index them (the first time, or as part of the install), then use the indexed versions by default to avoid re-indexing on each run

implement remove_artifacts_seqs

Documentation bits for release

example install via pip or conda
basic command shown should reflect the click command structure
add change log

Improve readability of deblurred ids with pandas

Having the 100 character names makes it very difficult to read the biom tables when converted into pandas. We really should be using some sort of unique hash generated from the sequence as OTU ids, for the sake of readability.
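
A minimal sketch of the hashing idea, assuming an MD5 hex digest is an acceptable id (the choice of MD5 and the name seq_to_id are illustrative, not something decided in this issue):

```python
import hashlib


def seq_to_id(seq):
    """Return a short, deterministic identifier for a sequence (MD5 hex digest)."""
    # Upper-casing first makes the id independent of the input's letter case.
    return hashlib.md5(seq.upper().encode("ascii")).hexdigest()
```

The digest is always 32 hex characters regardless of read length. The mapping from id back to sequence would have to be kept somewhere, for example in the table's observation metadata.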

add parallelization per sample

Implement dereplicate_seqs

Give a warning if trimming to less than max read length

need to warn the user if they deblur with a trim length longer than the run's read length (e.g. -t 150 when the sequencing is 100 bp long)

saving to json (when h5py isn't present) doesn't actually work

Strongly recommend removing biom 1.x format support and requiring h5py.

problem if after trimming+dereplication we have 1 sequence left

mafft fails (only 1 sequence to align)
so a sample containing only 1000 identical reads will fail in the mafft stage

Don't throw away error reads

Don't throw away the read error reads, but instead add back to the original sequence

Implement trim_seqs

Check the result variation if adding back the frequency given to the neighbors

We currently correct only by reducing the neighbors' frequencies; it is unclear how the results change if we also add that frequency back to the current sequence.

Implement remove_singletons_seqs

pip installable

index the 88_otus.fasta file on installation if possible

otherwise user needs to supply/compile the 88_otus.fasta

progress indication

every X% (default 5) of samples (and also indexing of GG and splitting...)
have a flag for X, 0 to suppress

python 2.7 compatibility

just need to change the one StringIO import to be conditional on the Python version
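
The conditional import could look roughly like this; the sample FASTA text is made up for the demonstration:

```python
import sys

if sys.version_info[0] < 3:
    from StringIO import StringIO  # Python 2.7
else:
    from io import StringIO  # Python 3

# Either class gives a file-like object over an in-memory string.
buf = StringIO(u">seq1\nACGT\n")
first_line = buf.readline().strip()
```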

cropped taxonomy strings

Looking at some of the taxonomy metadata, the last few characters are cropped off. Here is an example of such a taxonomy string.

['k__Bacteria',  'p__Proteobacteria',  'c__Alphaproteobacteria',  'o__Sphingomonadales', 'f__Sphingomonadaceae']

Notice how the last few items are cropped off (i.e. genus and species). This makes it more difficult to load the taxonomies into a pandas dataframe.

It would be great if some padding could be done in the post processing.
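
A possible shape for that padding step, assuming the seven Greengenes-style rank prefixes and that an empty placeholder such as 'g__' is acceptable for a missing rank (pad_taxonomy is a hypothetical helper, not an existing deblur function):

```python
RANK_PREFIXES = ["k__", "p__", "c__", "o__", "f__", "g__", "s__"]


def pad_taxonomy(lineage):
    """Pad a lineage to seven ranks using empty rank placeholders."""
    padded = list(lineage)
    # Append the prefixes of whichever trailing ranks are missing.
    padded.extend(RANK_PREFIXES[len(padded):])
    return padded
```

With every lineage at the same length, the list of lineages can be passed to pandas.DataFrame with one column per rank.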

install the support files

install the artifacts.fa and 88_otus.fasta files

Test how the optimization techniques impact the deblurring results

I.e. do the math even if there are low-level sequences.

Dependencies

It's not clear exactly what dependencies are required to run this.

After talking to @josenavas, I realized that MAFFT and SortMeRNA are required.
What other dependencies are required? The documentation will need to be updated with this information, so we can turn this into a conda recipe.

Use thread id instead of exposing a parameter to the user

See original discussion here

change number of threads flag

maybe -j or -n instead of -O?
(but -O is in qiime?)

File not found error

I'm getting a weird error when I tried to run the deblur workflow.

I'm running the following command

deblur workflow \
  --seqs-fp 648_seqs.fasta \
  --output-dir deblurred \
  --ref-fp /home/mortonjt/deblur_db/artifacts.fa \
  -n -w -d 1,0.06,0.02,0.02,0.01,0.005,0.005,0.005,0.001,0.001,0.001,0.0005 \
  -t 150 -O 32 \
  --log-level 2 \
  --log-file log.newmonkies.neg \
  --min-reads 25

And I get the following error

discarding /home/mortonjt/miniconda/bin from PATH
prepending /home/mortonjt/miniconda/envs/deblur/bin to PATH
Traceback (most recent call last):
  File "/home/mortonjt/miniconda/envs/deblur/bin/deblur", line 572, in <module>
    deblur_cmds()
  File "/home/mortonjt/miniconda/envs/deblur/lib/python2.7/site-packages/click/core.py", line 716, in __call__
    return self.main(*args, **kwargs)
  File "/home/mortonjt/miniconda/envs/deblur/lib/python2.7/site-packages/click/core.py", line 696, in main
    rv = self.invoke(ctx)
  File "/home/mortonjt/miniconda/envs/deblur/lib/python2.7/site-packages/click/core.py", line 1060, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/mortonjt/miniconda/envs/deblur/lib/python2.7/site-packages/click/core.py", line 889, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/mortonjt/miniconda/envs/deblur/lib/python2.7/site-packages/click/core.py", line 534, in invoke
    return callback(*args, **kwargs)
  File "/home/mortonjt/miniconda/envs/deblur/bin/deblur", line 543, in workflow
    parallel_deblur(input_file_list, sys.argv, ref_db_fp, jobs_to_start)
  File "/home/mortonjt/miniconda/envs/deblur/lib/python2.7/site-packages/deblur/parallel_deblur.py", line 130, in parallel_deblur
    es))
RuntimeError: stdout: 
stderr: Traceback (most recent call last):
  File "/home/mortonjt/miniconda/envs/deblur/bin/deblur", line 572, in <module>
    deblur_cmds()
  File "/home/mortonjt/miniconda/envs/deblur/lib/python2.7/site-packages/click/core.py", line 716, in __call__
    return self.main(*args, **kwargs)
  File "/home/mortonjt/miniconda/envs/deblur/lib/python2.7/site-packages/click/core.py", line 696, in main
    rv = self.invoke(ctx)
  File "/home/mortonjt/miniconda/envs/deblur/lib/python2.7/site-packages/click/core.py", line 1060, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/mortonjt/miniconda/envs/deblur/lib/python2.7/site-packages/click/core.py", line 889, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/mortonjt/miniconda/envs/deblur/lib/python2.7/site-packages/click/core.py", line 534, in invoke
    return callback(*args, **kwargs)
  File "/home/mortonjt/miniconda/envs/deblur/bin/deblur", line 535, in workflow
    delim=delim)
  File "/home/mortonjt/miniconda/envs/deblur/lib/python2.7/site-packages/deblur/workflow.py", line 570, in launch_workflow
    min_size=min_size, threads=threads_per_sample)
  File "/home/mortonjt/miniconda/envs/deblur/lib/python2.7/site-packages/deblur/workflow.py", line 82, in dereplicate_seqs
    sout, serr, res = _system_call(params)
  File "/home/mortonjt/miniconda/envs/deblur/lib/python2.7/site-packages/deblur/workflow.py", line 679, in _system_call
    stderr=subprocess.PIPE)
  File "/home/mortonjt/miniconda/envs/deblur/lib/python2.7/subprocess.py", line 711, in __init__
    errread, errwrite)
  File "/home/mortonjt/miniconda/envs/deblur/lib/python2.7/subprocess.py", line 1343, in _execute_child
    raise child_exception
OSError: [Errno 2] No such file or directory

exit: 1

The last few steps in log.newmonkies.neg were as follows

INFO(139964386215680)2016-08-09 13:19:46,210:launch_workflow for file /home/mortonjt/Documents/potato_16S/data/illumina_reads/deblurred/split/1706.F54.7.prev.fasta
INFO(140568600663808)2016-08-09 13:19:46,315:dereplicate seqs file /home/mortonjt/Documents/potato_16S/data/illumina_reads/deblurred/deblur_working_dir/1706.F51.0.prev.fasta.trim
INFO(139964386215680)2016-08-09 13:19:46,732:dereplicate seqs file /home/mortonjt/Documents/potato_16S/data/illumina_reads/deblurred/deblur_working_dir/1706.F54.7.prev.fasta.trim
INFO(140204884723456)2016-08-09 13:19:46,739:dereplicate seqs file /home/mortonjt/Documents/potato_16S/data/illumina_reads/deblurred/deblur_working_dir/1706.F6.2.prev.fasta.trim

I'm not exactly sure what is going on. Any ideas about how I can start debugging this? Thanks!
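
A possible starting point: when subprocess raises OSError: [Errno 2] from _execute_child, it usually means the program being launched was not found on the PATH, rather than that an input file is missing. The traceback does not show which binary dereplicate_seqs calls, so the tool names below are assumptions. The sketch uses shutil.which from Python 3; on Python 2.7, distutils.spawn.find_executable plays a similar role.

```python
import shutil


def missing_executables(names):
    """Return the subset of `names` that cannot be found on the current PATH."""
    return [name for name in names if shutil.which(name) is None]


# Tool names are assumptions; substitute whatever the failing step launches.
print(missing_executables(["mafft", "sortmerna"]))
```

Running this inside the same conda environment as deblur shows whether an external tool is missing from that environment's PATH.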

Hamming distances are computed twice

The deblur function computes the Hamming distance between the sequences twice. Check whether caching those values is worthwhile, i.e. test the trade-off between memory usage and running time.
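
To make the trade-off concrete, here is a sketch of the caching variant: each pairwise distance is computed once and stored in a dictionary keyed by index pair. The names hamming and hamming_cache are hypothetical and this is not the current deblur code.

```python
from itertools import combinations


def hamming(a, b):
    """Number of mismatching positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def hamming_cache(seqs):
    """Compute every pairwise distance once, keyed by (i, j) with i < j."""
    return {(i, j): hamming(seqs[i], seqs[j])
            for i, j in combinations(range(len(seqs)), 2)}
```

The cache holds n(n-1)/2 entries for n sequences, which is the memory cost to weigh against the time saved by not recomputing each distance.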

explain run deblur after split_libraries_fastq

implement generate_biom_table

update deblur workflow defaults

need to set:

read length?
error profile
positive or negative filtering
fasta file for negative filtering
indexed 88_otus.fasta or the 88_otus.fasta for positive filtering

sortmerna 2.1 changes cli

following @ekopylova's response, need to use sortmerna v2.1 and update the cli call inside workflow

biom dependency upper version limit

why?

add taxonomy script/embed in pipeline

the people want taxonomy in their biom table
we can:

supply an additional script (using qiime rdp/other option?) to add taxonomy
add this script as an optional step in the pipeline (and have it as optional package dependency)

Getting the output and summary of the artifact filtering

Yay, i'm writing an issue on github :)

When running the HMMER (or the phix detection) artifact filtering, we should output a summary of how many sequences were filtered out
Also add a flag/option to get a file with all the filtered out sequences (to see who they are...)

Implement launch_workflow

Depends on:

implement multiple_sequence_alignment

auto read length detection

if -t is not specified, automatically detect the read length (test the first XX or XX random reads, take the 80th percentile, and show an error if the lengths are too varied)
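
A sketch of how the detection could work on a list of sampled read lengths. The nearest-rank percentile and the "too varied" rule (relative spread between shortest and longest read above max_spread) are assumptions; the issue does not define either.

```python
def detect_read_length(lengths, percentile=80, max_spread=0.1):
    """Guess the trim length from sampled read lengths.

    Uses the nearest-rank percentile. `max_spread` is a hypothetical
    threshold on (longest - shortest) / longest.
    """
    if not lengths:
        raise ValueError("no reads sampled")
    ordered = sorted(lengths)
    if (ordered[-1] - ordered[0]) / float(ordered[-1]) > max_spread:
        raise ValueError("read lengths too varied; please specify -t")
    # Nearest-rank percentile: ceil(percentile * n / 100), 1-based.
    rank = (percentile * len(ordered) + 99) // 100
    return ordered[rank - 1]
```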

log file special mark if not thread

to enable log file analysis

Fastq files?

Right now, this pipeline only takes in fasta files. Is there any interest in reading from fastq instead?

Implement remove_chimeras_denovo_from_seqs

remove skbio dependency

need only parse_fasta, remove_files
can copy instead

Python 3 compatibility

...and make sure to update the classifiers in setup.py

coverage is reported as unknown by travis

BUG : too many open files when splitting on mac

Got this from a user:
ANOTHER ERROR ( ONLY solution seems to be to divide the file up into pieces)

(deblurenv)MacQIIME dhcp238043:~ $ deblur workflow --seqs-fp /Volumes/Samsung_T1/MadaRaw/Mada_AS13JF14ND14_Splitlib_r3n0p.90q5/Mada_AS13JF14ND14_Splitlib_r3n0p.90q5_seqs250_254new.fasta --output-dir Madajul16_deblurtest -O 3
Traceback (most recent call last):
  File "/macqiime/anaconda/envs/deblurenv/bin/deblur", line 621, in <module>
    deblur_cmds()
  File "/macqiime/anaconda/envs/deblurenv/lib/python3.5/site-packages/click/core.py", line 716, in __call__
    return self.main(*args, **kwargs)
  File "/macqiime/anaconda/envs/deblurenv/lib/python3.5/site-packages/click/core.py", line 696, in main
    rv = self.invoke(ctx)
  File "/macqiime/anaconda/envs/deblurenv/lib/python3.5/site-packages/click/core.py", line 1060, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/macqiime/anaconda/envs/deblurenv/lib/python3.5/site-packages/click/core.py", line 889, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/macqiime/anaconda/envs/deblurenv/lib/python3.5/site-packages/click/core.py", line 534, in invoke
    return callback(*args, **kwargs)
  File "/macqiime/anaconda/envs/deblurenv/bin/deblur", line 561, in workflow
    out_dir_split)
  File "/macqiime/anaconda/envs/deblurenv/lib/python3.5/site-packages/deblur/workflow.py", line 445, in split_sequence_file_on_sample_ids_to_files
    outputs[sample] = open(join(outdir, sample + '.fasta'), 'w')
OSError: [Errno 24] Too many open files: '/Users/venceslab/Madajul16_deblurtest/split/J8.3.fasta'
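
The traceback shows one output handle being opened per sample and kept in a dictionary, which runs into the per-process open-file limit when there are many samples. One possible way around it, sketched below, is to buffer records in memory and append them in batches so that only one file is open at a time. The function split_by_sample, its (sample, header, seq) record format, and the flush_every parameter are hypothetical, not the existing split_sequence_file_on_sample_ids_to_files.

```python
import os
from collections import defaultdict


def split_by_sample(records, outdir, flush_every=10000):
    """Write (sample, header, seq) records to per-sample FASTA files.

    Records are buffered and appended in batches, so only one output
    file is open at any moment regardless of the number of samples.
    """
    buffers = defaultdict(list)
    pending = 0

    def flush():
        for sample, lines in buffers.items():
            # Append mode: earlier batches for this sample are preserved.
            with open(os.path.join(outdir, sample + ".fasta"), "a") as out:
                out.writelines(lines)
        buffers.clear()

    for sample, header, seq in records:
        buffers[sample].append(">%s\n%s\n" % (header, seq))
        pending += 1
        if pending >= flush_every:
            flush()
            pending = 0
    flush()
```

Because the files are opened in append mode, the output directory should be empty before the run; otherwise records would be added to files left over from a previous run.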

deblur from pip fails

I performed a pip install of deblur (pip install deblur) today and received the following traceback from a test run:

Traceback (most recent call last):
  File "/home/mcdonadt/miniconda3/envs/deblur/bin/deblur", line 621, in <module>
    deblur_cmds()
  File "/home/mcdonadt/miniconda3/envs/deblur/lib/python3.5/site-packages/click/core.py", line 716, in __call__
    return self.main(*args, **kwargs)
  File "/home/mcdonadt/miniconda3/envs/deblur/lib/python3.5/site-packages/click/core.py", line 696, in main
    rv = self.invoke(ctx)
  File "/home/mcdonadt/miniconda3/envs/deblur/lib/python3.5/site-packages/click/core.py", line 1060, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/mcdonadt/miniconda3/envs/deblur/lib/python3.5/site-packages/click/core.py", line 889, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/mcdonadt/miniconda3/envs/deblur/lib/python3.5/site-packages/click/core.py", line 534, in invoke
    return callback(*args, **kwargs)
  File "/home/mcdonadt/miniconda3/envs/deblur/lib/python3.5/contextlib.py", line 77, in __exit__
    self.gen.throw(type, value, traceback)
  File "/home/mcdonadt/miniconda3/envs/deblur/lib/python3.5/site-packages/click/core.py", line 86, in augment_usage_errors
    yield
  File "/home/mcdonadt/miniconda3/envs/deblur/lib/python3.5/site-packages/click/core.py", line 534, in invoke
    return callback(*args, **kwargs)
  File "/home/mcdonadt/miniconda3/envs/deblur/bin/deblur", line 570, in workflow
    working_dir=working_dir)
  File "/home/mcdonadt/miniconda3/envs/deblur/lib/python3.5/site-packages/deblur/workflow.py", line 199, in build_index_sortmerna
    raise RuntimeError('Cannot index database file %s' % db)
RuntimeError: Cannot index database file /home/mcdonadt/miniconda3/envs/deblur/lib/python3.5/site-packages/deblur/support_files/artifacts.fa
