calcofi / rcrux Goto Github PK
View Code? Open in Web Editor NEWA repository for my work on rCRUX
License: GNU General Public License v3.0
A repository for my work on rCRUX
License: GNU General Public License v3.0
I've noticed heavy use of paste0
for file path building .. there is a base R file.path()
function to do this.. i think the latter is better practice, although you need to keep consistent with it's use and idiosyncrasies (there will never be a trailing /
) .. so i did a quick change of the few that exist in get_seeds_local()
but it didn't play nice with save_output_as_csv()
๐
given the later is pasting things again and it expected a trailing /
in a path generated within get_seeds_local()
Depending on how interconnected different functions are that are pasting file.paths together will determine how easy of an update this would be.
Went to update the rCrux package and noticed in the readme for the install of it that it is pointing to LunaGal/rCrux not the new location.
NCBI just added some smaller nucleotide databases to the FTP site, so you can get one for just eukaryotes if you want (see here. This also means the current code on the readme to download the nt database isn't quite right. If you just want the full nt database (not the partial databases), use this:
wget "ftp://ftp.ncbi.nlm.nih.gov/blast/db/nt.??.tar.gz*"
Having only the nt_euk DB might make COI a lot faster though...
Would be good to have the ability to reproduce randomisation parts .. would allow reproducible results for paper writing where randomisation occurs - means if analysis needs to be rerun, you should get the same numbers
line 351:
append_table <- read.csv(append_table_path, colClasses = "character")
lines 389-390:
f_and_r <- dplyr::mutate(f_and_r, product_length = dplyr::case_when((forward_start < reverse_start & forward_start < forward_stop & reverse_stop < reverse_start ) ~ (as.numeric(reverse_start) - as.numeric(forward_start)),
(forward_start > reverse_start & forward_start > forward_stop & reverse_stop > reverse_start) ~ (as.numeric(forward_start) - as.numeric(reverse_start)),))
Something like this needs to go just before lines 389-390 ... or import those columns earlier as integers (would need to use readr::read_csv
, however, as it lets you specify individual column classes unlike read.csv
..
f_and_r <-
dplyr::mutate(f_and_r,
dplyr::across(c('forward_start', 'forward_stop', 'reverse_start', 'reverse_stop'), .fns = as.integer)
)
Likely the example pipeline section discussing "blast_seeds_local() or blast_seeds_remote()" is supposed to refer to get_seeds_local/remote?
Thanks!
I saw readr
was referenced once in get_seeds_local()
.. it's a great package, so why not use it for all read_/write_ functions? or exclude it as a dependency?
Imports:
tibble,
dplyr,
magrittr,
taxonomizr,
data.table,
tidyr,
stringr,
lubridate,
primerTree,
progress,
zoo,
readr,
XML,
httr,
plyr,
phylotools
โฏ checking dependencies in R code ... WARNING
'::' or ':::' imports not declared from:
'purrr' 'rlang'
Namespaces in Imports field not imported from:
'primerTree' 'zoo'
All declared Imports should be used.
Can improve the code in run_primer_blastn()
to deal with if the binary is in the PATH or a user supplied path. Currently the code is copied for each instance, but one has an extra "-num_alignments", "10000000",
.
Goal is to have 1 system2()
call, where the command is build given what we know about the PATH or a user supplied path
The documentation states that the default e-value for run_primer_blast is 3e-07, but looking at the run_primer_blastn code it is 3e+07, which seems really high. Which one should it be?
Getting this error in blast_seeds()
after it successfully ran blastn; error happened after about an hour of runtime:
The number of unsampled indices is less than or equal to the maximum number to be blasted Running blastdbcmd on 1000 samples. Calling blastn. This may take a long time. 1969595 blast hits returned. 21715 unique blast hits after this round. Error in system2("blastdbcmd", args = c("-db", blast_db_path, "-dbtype", : error in running command
I'm thinking the ncbi_bin
argument to blast_seeds()
doesn't get scoped quite right, and so doesn't make it down to the function in which system2("blastdbcmd")
gets called.
The packages "DECIPHER", "psadd", and "phyloseq" are required.
I believe these are from bioconductor
A suggestion to make -num_threads
an argument to blast_seeds()
. It's otherwise buried; I think the right core function would be run_blastn()
, which is doing the actual system2()
call to blastn
.
Use #' @importFrom magrittr
%>%`` or usethis::use_pipe()
instead of `%>%` <- magrittr::`%>%`. The usethis approach might be best, as it only needs to be stated once in the whole package (a file is created) ...
Would be good to have proper documentation here.. where are these functions used? only in get_seeds_remote
?
There are quite a few from:
package imports
variable bindings
if() conditions comparing class() to string ..
documention of functions
examples in functions being run - either 'dont run' or use the mock-db
various others
just a matter of clean up
Examples are run/executed on package checking .. so either:
\dontrun{...}
mock-db()
in examplesI was trying to create a blastable database from the pre-compiled COI database CO1_combined_derep_and_clean.fasta
. The header of the fasta sequnces after cleanup reads something like >KM252950.1_representative_of_12_identical_accessions
. When I tried to use this file as the input fasta for makeblastdb
, with the command
makeblastdb -in CO1_combined_derep_and_clean.fasta -dbtype nucl -parse_seqids -out rCRUX_COI -title "rCRUX_Leray"
I got an error message complaining about the length of the sequence headers
"BLAST Database creation error: Near line 1, the local id is too long. Its length is 52 but the maximum allowed local id length is 50. Please find and correct all local ids that are too long."
I checked this was the issue by shortening the sequence headers with sed
sed 's\_representative_of\\g' CO1_combined_derep_and_clean.fasta > CO1_ready.fasta
To save a headache to future Moncho, I also shortened the headers in the taxonomy file
sed 's\_representative_of\\g' CO1_combined_derep_and_clean_taxonomy.txt > CO1_tax_ready.txt
Now the headers read >KM252950.1_12_identical_accessions
I think we could modify derep_and_clean to account for this
Remotes:
cpauvert/psadd
There are varying output formats across the package - write.table('*.txt', sep = ",")
, write.table('*.txt', sep = "\t")
, write.csv('*.csv')
, write.csv('*.txt')
We should decided on the preferred output and/or just be explicit when outputting - comma-delimited = .csv
, tab-delimited =.txt
..i.e. should not have write.table
to a .txt
with a sep=","
A declarative, efficient, and flexible JavaScript library for building user interfaces.
๐ Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. ๐๐๐
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google โค๏ธ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.