Giter Club home page Giter Club logo

rcrux's People

Contributors

jungbluth avatar limey-bean avatar lunagal avatar lunavicta avatar marinednadude avatar ocstringham avatar ramongallego avatar shaunson26 avatar zjgold avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar

rcrux's Issues

Use of `paste0()` for file paths, can/should use `file.path()`

I've noticed heavy use of paste0 for file path building .. there is a base R file.path() function to do this.. i think the latter is better practice, although you need to keep consistent with it's use and idiosyncrasies (there will never be a trailing / ) .. so i did a quick change of the few that exist in get_seeds_local() but it didn't play nice with save_output_as_csv() ๐Ÿ˜… given the later is pasting things again and it expected a trailing / in a path generated within get_seeds_local()

Depending on how interconnected different functions are that are pasting file.paths together will determine how easy of an update this would be.

NCBI download code

NCBI just added some smaller nucleotide databases to the FTP site, so you can get one for just eukaryotes if you want (see here. This also means the current code on the readme to download the nt database isn't quite right. If you just want the full nt database (not the partial databases), use this:

wget "ftp://ftp.ncbi.nlm.nih.gov/blast/db/nt.??.tar.gz*"

Having only the nt_euk DB might make COI a lot faster though...

get_seeds_local() compares strings instead of numbers

Problem

line 351:
append_table <- read.csv(append_table_path, colClasses = "character")
lines 389-390:

f_and_r <- dplyr::mutate(f_and_r, product_length = dplyr::case_when((forward_start < reverse_start & forward_start < forward_stop & reverse_stop < reverse_start ) ~ (as.numeric(reverse_start) - as.numeric(forward_start)),
                                                                     (forward_start > reverse_start & forward_start > forward_stop & reverse_stop > reverse_start) ~ (as.numeric(forward_start) - as.numeric(reverse_start)),))

Solution

Something like this needs to go just before lines 389-390 ... or import those columns earlier as integers (would need to use readr::read_csv, however, as it lets you specify individual column classes unlike read.csv..

 f_and_r <-
    dplyr::mutate(f_and_r,
                  dplyr::across(c('forward_start', 'forward_stop', 'reverse_start', 'reverse_stop'), .fns = as.integer)
    )

Readme error

Likely the example pipeline section discussing "blast_seeds_local() or blast_seeds_remote()" is supposed to refer to get_seeds_local/remote?

Thanks!

consistent use of packages

I saw readr was referenced once in get_seeds_local() .. it's a great package, so why not use it for all read_/write_ functions? or exclude it as a dependency?

Imports:
tibble,
dplyr,
magrittr,
taxonomizr,
data.table,
tidyr,
stringr,
lubridate,
primerTree,
progress,
zoo,
readr,
XML,
httr,
plyr,
phylotools

โฏ checking dependencies in R code ... WARNING
'::' or ':::' imports not declared from:
'purrr' 'rlang'
Namespaces in Imports field not imported from:
'primerTree' 'zoo'
All declared Imports should be used.

Improve `run_primer_blastn()` + ncbi_bin parameter

Can improve the code in run_primer_blastn() to deal with if the binary is in the PATH or a user supplied path. Currently the code is copied for each instance, but one has an extra "-num_alignments", "10000000",.

Goal is to have 1 system2() call, where the command is build given what we know about the PATH or a user supplied path

Default evalue for run_primer_blastn is 3e+07

The documentation states that the default e-value for run_primer_blast is 3e-07, but looking at the run_primer_blastn code it is 3e+07, which seems really high. Which one should it be?

ncbi path variable not carried forward?

Getting this error in blast_seeds() after it successfully ran blastn; error happened after about an hour of runtime:

The number of unsampled indices is less than or equal to the maximum number to be blasted Running blastdbcmd on 1000 samples. Calling blastn. This may take a long time. 1969595 blast hits returned. 21715 unique blast hits after this round. Error in system2("blastdbcmd", args = c("-db", blast_db_path, "-dbtype", : error in running command

I'm thinking the ncbi_bin argument to blast_seeds() doesn't get scoped quite right, and so doesn't make it down to the function in which system2("blastdbcmd") gets called.

Suggestion: multithread

A suggestion to make -num_threads an argument to blast_seeds(). It's otherwise buried; I think the right core function would be run_blastn(), which is doing the actual system2() call to blastn.

Referencing magrittr::`%>%` pipe in functions

Use #' @importFrom magrittr %>%`` or usethis::use_pipe() instead of `%>%` <- magrittr::`%>%`. The usethis approach might be best, as it only needs to be stated once in the whole package (a file is created) ...

Address `check()` issues

There are quite a few from:
package imports
variable bindings
if() conditions comparing class() to string ..
documention of functions
examples in functions being run - either 'dont run' or use the mock-db
various others

just a matter of clean up

update examples

Examples are run/executed on package checking .. so either:

  1. wrap in \dontrun{...}
  2. Use mock-db() in examples

Length of sequence headers after derep and clean

I was trying to create a blastable database from the pre-compiled COI database CO1_combined_derep_and_clean.fasta . The header of the fasta sequnces after cleanup reads something like >KM252950.1_representative_of_12_identical_accessions. When I tried to use this file as the input fasta for makeblastdb , with the command
makeblastdb -in CO1_combined_derep_and_clean.fasta -dbtype nucl -parse_seqids -out rCRUX_COI -title "rCRUX_Leray"
I got an error message complaining about the length of the sequence headers
"BLAST Database creation error: Near line 1, the local id is too long. Its length is 52 but the maximum allowed local id length is 50. Please find and correct all local ids that are too long."
I checked this was the issue by shortening the sequence headers with sed
sed 's\_representative_of\\g' CO1_combined_derep_and_clean.fasta > CO1_ready.fasta
To save a headache to future Moncho, I also shortened the headers in the taxonomy file
sed 's\_representative_of\\g' CO1_combined_derep_and_clean_taxonomy.txt > CO1_tax_ready.txt

Now the headers read >KM252950.1_12_identical_accessions

I think we could modify derep_and_clean to account for this

Consistent use of text output formats - csv, tsv

There are varying output formats across the package - write.table('*.txt', sep = ","), write.table('*.txt', sep = "\t"), write.csv('*.csv'), write.csv('*.txt')

We should decided on the preferred output and/or just be explicit when outputting - comma-delimited = .csv, tab-delimited =.txt ..i.e. should not have write.table to a .txt with a sep=","

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.