calcofi / rcrux Goto Github PK

View Code? Open in Web Editor NEW

8.0 4.0 5.0 33.34 MB

A repository for my work on rCRUX

License: GNU General Public License v3.0

R 98.88% JavaScript 0.30% Dockerfile 0.82%

noaa-omics-data noaa-omics-software noaa-omics-study

rcrux's People

Contributors

Stargazers

Watchers

Forkers

guoyuh jungbluth ocstringham sformel-usgs

rcrux's Issues

Simplify logic in `blast_datatable` -> `run_blastdbcmd_blastn_and_aggregate_resuts()`

Use of `paste0()` for file paths, can/should use `file.path()`

I've noticed heavy use of paste0 for file path building .. there is a base R file.path() function to do this.. i think the latter is better practice, although you need to keep consistent with it's use and idiosyncrasies (there will never be a trailing / ) .. so i did a quick change of the few that exist in get_seeds_local() but it didn't play nice with save_output_as_csv() 😅 given the later is pasting things again and it expected a trailing / in a path generated within get_seeds_local()

Depending on how interconnected different functions are that are pasting file.paths together will determine how easy of an update this would be.

ReadMe install pointing to old github repo

Went to update the rCrux package and noticed in the readme for the install of it that it is pointing to LunaGal/rCrux not the new location.

NCBI download code

NCBI just added some smaller nucleotide databases to the FTP site, so you can get one for just eukaryotes if you want (see here. This also means the current code on the readme to download the nt database isn't quite right. If you just want the full nt database (not the partial databases), use this:

wget "ftp://ftp.ncbi.nlm.nih.gov/blast/db/nt.??.tar.gz*"

Having only the nt_euk DB might make COI a lot faster though...

A `set.seed()` in randomisation functions e.g. species blast

Would be good to have the ability to reproduce randomisation parts .. would allow reproducible results for paper writing where randomisation occurs - means if analysis needs to be rerun, you should get the same numbers

get_seeds_local() compares strings instead of numbers

Problem

line 351:
append_table <- read.csv(append_table_path, colClasses = "character")
lines 389-390:

f_and_r <- dplyr::mutate(f_and_r, product_length = dplyr::case_when((forward_start < reverse_start & forward_start < forward_stop & reverse_stop < reverse_start ) ~ (as.numeric(reverse_start) - as.numeric(forward_start)),
                                                                     (forward_start > reverse_start & forward_start > forward_stop & reverse_stop > reverse_start) ~ (as.numeric(forward_start) - as.numeric(reverse_start)),))

Solution

Something like this needs to go just before lines 389-390 ... or import those columns earlier as integers (would need to use readr::read_csv, however, as it lets you specify individual column classes unlike read.csv..

 f_and_r <-
    dplyr::mutate(f_and_r,
                  dplyr::across(c('forward_start', 'forward_stop', 'reverse_start', 'reverse_stop'), .fns = as.integer)
    )

Readme error

Likely the example pipeline section discussing "blast_seeds_local() or blast_seeds_remote()" is supposed to refer to get_seeds_local/remote?

Thanks!

consistent use of packages

I saw readr was referenced once in get_seeds_local() .. it's a great package, so why not use it for all read_/write_ functions? or exclude it as a dependency?

Imports:
tibble,
dplyr,
magrittr,
taxonomizr,
data.table,
tidyr,
stringr,
lubridate,
primerTree,
progress,
zoo,
readr,
XML,
httr,
plyr,
phylotools

❯ checking dependencies in R code ... WARNING
'::' or ':::' imports not declared from:
'purrr' 'rlang'
Namespaces in Imports field not imported from:
'primerTree' 'zoo'
All declared Imports should be used.

Improve `run_primer_blastn()` + ncbi_bin parameter

Can improve the code in run_primer_blastn() to deal with if the binary is in the PATH or a user supplied path. Currently the code is copied for each instance, but one has an extra "-num_alignments", "10000000",.

Goal is to have 1 system2() call, where the command is build given what we know about the PATH or a user supplied path

Default evalue for run_primer_blastn is 3e+07

The documentation states that the default e-value for run_primer_blast is 3e-07, but looking at the run_primer_blastn code it is 3e+07, which seems really high. Which one should it be?

ncbi path variable not carried forward?

Getting this error in blast_seeds() after it successfully ran blastn; error happened after about an hour of runtime:

The number of unsampled indices is less than or equal to the maximum number to be blasted Running blastdbcmd on 1000 samples. Calling blastn. This may take a long time. 1969595 blast hits returned. 21715 unique blast hits after this round. Error in system2("blastdbcmd", args = c("-db", blast_db_path, "-dbtype", : error in running command

I'm thinking the ncbi_bin argument to blast_seeds() doesn't get scoped quite right, and so doesn't make it down to the function in which system2("blastdbcmd") gets called.

Add `biocViews:` to description for bioconductor package installation

The packages "DECIPHER", "psadd", and "phyloseq" are required.

I believe these are from bioconductor

Suggestion: multithread

A suggestion to make -num_threads an argument to blast_seeds(). It's otherwise buried; I think the right core function would be run_blastn(), which is doing the actual system2() call to blastn.

Referencing magrittr::`%>%` pipe in functions

Use #' @importFrom magrittr %>%`` or usethis::use_pipe() instead of `%>%` <- magrittr::`%>%`. The usethis approach might be best, as it only needs to be stated once in the whole package (a file is created) ...

clean/document modifiedPrimerTreeFunctions.R

Would be good to have proper documentation here.. where are these functions used? only in get_seeds_remote?

Address `check()` issues

There are quite a few from:
package imports
variable bindings
if() conditions comparing class() to string ..
documention of functions
examples in functions being run - either 'dont run' or use the mock-db
various others

just a matter of clean up

update examples

Examples are run/executed on package checking .. so either:

wrap in \dontrun{...}
Use mock-db() in examples

Length of sequence headers after derep and clean

I was trying to create a blastable database from the pre-compiled COI database CO1_combined_derep_and_clean.fasta . The header of the fasta sequnces after cleanup reads something like >KM252950.1_representative_of_12_identical_accessions. When I tried to use this file as the input fasta for makeblastdb , with the command
makeblastdb -in CO1_combined_derep_and_clean.fasta -dbtype nucl -parse_seqids -out rCRUX_COI -title "rCRUX_Leray"
I got an error message complaining about the length of the sequence headers
"BLAST Database creation error: Near line 1, the local id is too long. Its length is 52 but the maximum allowed local id length is 50. Please find and correct all local ids that are too long."
I checked this was the issue by shortening the sequence headers with sed
sed 's\_representative_of\\g' CO1_combined_derep_and_clean.fasta > CO1_ready.fasta
To save a headache to future Moncho, I also shortened the headers in the taxonomy file
sed 's\_representative_of\\g' CO1_combined_derep_and_clean_taxonomy.txt > CO1_tax_ready.txt

Now the headers read >KM252950.1_12_identical_accessions

I think we could modify derep_and_clean to account for this

Add `Remotes` in description files for github repo files

Remotes:
cpauvert/psadd

Consistent use of text output formats - csv, tsv

There are varying output formats across the package - write.table('*.txt', sep = ","), write.table('*.txt', sep = "\t"), write.csv('*.csv'), write.csv('*.txt')

We should decided on the preferred output and/or just be explicit when outputting - comma-delimited = .csv, tab-delimited =.txt ..i.e. should not have write.table to a .txt with a sep=","