claraqin / neonmicrobe
Processing NEON soil microbe marker gene sequence data into ASV tables.
License: GNU Lesser General Public License v3.0
This would also involve making organizeRawSequenceData
obsolete, as it renames the files, making the existing-file check more challenging.
The downloadRawSoilData
function and subsequent code were developed assuming an outdated NEON soil data structure, in which DP1.10086.001 was physical properties and DP1.10078.001 was chemical properties. Now, DP1.10086.001 contains the combined chemical and physical properties. We should check the add_environmental_variables and downloadRawSoilData vignettes to ensure that they are still working properly.
The current ITS and 16S sequence processing vignettes remove chimeras BEFORE merging sequence tables together into one big sequence table. Ben Callahan recommends removing chimeras AFTER merging them. Zoey suggests that this could reduce Type 1 errors in chimera identification.
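Under dada2's standard API, the recommended order can be sketched as follows. mergeSequenceTables and removeBimeraDenovo are real dada2 functions; the helper name and the input list of per-run ASV tables are illustrative:

```r
# Sketch of the recommended order: merge per-run sequence tables first,
# then remove chimeras once on the combined table. Requires the dada2
# package; seqtab_list is a list of per-run sequence tables.
merge_then_remove_chimeras <- function(seqtab_list) {
  merged <- do.call(dada2::mergeSequenceTables, seqtab_list)
  dada2::removeBimeraDenovo(merged, method = "consensus", multithread = TRUE)
}
```

Running chimera removal once on the merged table also means the consensus has more samples to draw on, which is where the reduction in Type 1 errors would come from.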
To ensure that all users know to modify the params.R
file, have a notice appear on the screen when, for example, loading the library, or before using infrequent but crucial functions.
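One way to surface such a notice, assuming a standard R package layout, is a .onAttach hook; the message text and the params.R path here are illustrative, not the package's actual wording:

```r
# Sketch: emit a reminder whenever the package is attached.
# .onAttach is the standard R hook for startup notices; the message
# wording and the 'code/params.R' path are assumptions for illustration.
.onAttach <- function(libname, pkgname) {
  packageStartupMessage(
    "Reminder: review and edit 'code/params.R' (output paths, preset sites, ",
    "target gene) before running the vignettes."
  )
}
```

The same message could be repeated inside the infrequent-but-crucial functions via packageStartupMessage() or message().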
We need to make the following changes to the workflow, particularly in the Download NEON Data vignette, to prevent QC-related issues from complicating processes downstream.
In addition, @lstanish suggests that it could be good to reorganize the columns in the metadata table so the most important columns come first. What are some columns to put first in the metadata table?
Migrating this from an email thread for easier viewing.
For now, we will be testing the steps in the figure below between "NEON Data Products" and "ASV tables with taxonomy," except that we will not be generating the taxonomy tables because this takes too much processing time.
Our Technical Working Group has suggested that this testing should occur in two phases. In Phase 1, we test the pipeline to ensure that the pipeline is simply able to run from start to finish on a variety of operating systems. In Phase 2, we will ask volunteers to read through the docs and provide suggestions on how to make the package more flexible and user-friendly. For now, we are only asking you to conduct Phase 1 testing.
Start by pulling the codebase from https://github.com/claraqin/NEON_soil_microbe_processing.
Install cutadapt if you have not previously done so.
Update the parameters in the "params.R" file, which can be found in the "code" subdirectory.
Download the sequence metadata for testing at this Google Drive link, decompress the zipfile, and drop the contents (two files) into the project directory (the base directory of the repository that you just cloned).
The code for testing can be found in the "testing" subdirectory. This subdirectory contains temporary versions of our vignettes that I made for testing only. Start with the download-neon-data-metadataworkaround.Rmd vignette.
Then move to either the process-its-sequence-to-seqtabs.Rmd or process-16s-sequence-to-seqtabs.Rmd vignettes, depending on which subset of the data you selected.
If any issues or fatal errors arise, please let me know by replying to me individually (unless of course it seems obvious that it would affect all testers).
Whether you run into a fatal error or are able to complete the pipeline error-free, please report back on this thread and include in your post the output of devtools::session_info().
c("B69PP", "B69RF", "B69RN", "B9994", "BDR3T", "BF8M2", "BF8W6", "BFDG8", "BMCBD", "BMCC4", "BNBWL")
.c("B69PP", "B69RF", "B69RN", "B9994", "BDNB6", "BF462", "BF8M2", "BFDG8", "BJ8RK", "BMC64", "BMCBJ"
.Hi everyone,
It seems that some of the reads in the R1 files are actually reverse reads, and some of the reads in the R2 files are actually forward reads. See this table of primer orientations in R1 and R2 files for runB69PP:
Forward Complement Reverse RevComp
FWDPrimer.R1.reads 423 0 0 3
FWDPrimer.R2.reads 26 0 0 13
REVPrimer.R1.reads 647 0 0 298
REVPrimer.R2.reads 737 0 0 4
This is an issue common to Illumina sequencing known as mixed orientation. DADA2 does not have a built-in way to handle this, but see benjjneb/dada2#938 for discussion.
The dada2_to_phyloseq_[its/16s].Rmd
vignettes need to be revised to match the metadata format which was revised by Lee
and Clara on (or before) Dec 11, 2020. The necessary changes may not be very much – I believe the main differences in the format of the metadata are related to which columns it includes, and where it was saved – but we simply need to make sure these vignettes are compatible with the upstream vignettes, download-neon-data.Rmd
and process-[its/16s]-sequences.Rmd
.
Some of the vignettes contain code that looks like this:
unique_runs <- unique(unlist(
regmatches(list.files(PATH_RAW), gregexpr("^run[A-Za-z0-9]*", list.files(PATH_RAW)))
))
Rather than parsing information from the filenames, we should parse it from the associated metadata instead.
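For example, instead of regex-matching run IDs out of filenames, the run ID can be looked up by joining on the metadata's file-name column. This is a sketch; the column names mimic the mmg_soilRawDataFiles table, and the helper name is an assumption:

```r
# Sketch: recover run IDs by matching downloaded files to metadata
# rather than parsing the filenames. 'meta' mimics a subset of the
# mmg_soilRawDataFiles table.
runs_from_metadata <- function(files, meta) {
  idx <- match(basename(files), meta$rawDataFileName)
  unique(meta$sequencerRunID[idx[!is.na(idx)]])
}

meta <- data.frame(
  rawDataFileName = c("runB69PP_R1.fastq.gz", "runBFDG8_R1.fastq.gz"),
  sequencerRunID  = c("B69PP", "BFDG8"),
  stringsAsFactors = FALSE
)
runs_from_metadata("some/dir/runB69PP_R1.fastq.gz", meta)  # "B69PP"
```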
Sometimes, the rawDataFilePath
column of the mmg_soilRawDataFiles
table (in product ID 10108) will contain bad URLs. This will continue for a while during NEON's file renaming effort. For example:
trying URL 'https://neon-microbial-raw-seq-files.s3.data.neonscience.org/2017/BMI_B69PP_ITS_R1_fastq.tar.gz'
Error in download.file(url = as.character(u.urls[i]), destfile = ifelse(dir.exists(outdir), :
cannot open URL 'https://neon-microbial-raw-seq-files.s3.data.neonscience.org/2017/BMI_B69PP_ITS_R1_fastq.tar.gz'
In addition: Warning message:
In download.file(url = as.character(u.urls[i]), destfile = ifelse(dir.exists(outdir), :
cannot open URL 'https://neon-microbial-raw-seq-files.s3.data.neonscience.org/2017/BMI_B69PP_ITS_R1_fastq.tar.gz': HTTP status was '404 Not Found'
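Until the renaming effort settles, one defensive pattern is to wrap each download in tryCatch so a single bad URL does not abort the whole loop. A sketch; the function name is illustrative:

```r
# Sketch: skip bad URLs instead of aborting the download loop.
# Returns TRUE on success and FALSE (with a warning) on failure, so
# callers can assemble a per-file download_success vector.
safe_download <- function(url, destfile) {
  tryCatch({
    utils::download.file(url, destfile, mode = "wb", quiet = TRUE)
    TRUE
  }, error = function(e) {
    warning("Skipping ", url, ": ", conditionMessage(e), call. = FALSE)
    FALSE
  })
}
```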
I hadn't used downloadRawSoilData()
for a few months. Checking on it today, I found that it produces an error when I attempt to download all soil data. While I didn't manage to record the error/warning messages from downloading all data, I did receive the following warning when trying to download DP1.10086 (soil physical properties) for all sites and dates:
Warning message:
In value[[3L]](cond) :
No data was found for data product DP1.10086.001 at the specified sites and dates.
I used sites="all", startYrMo=NA, and endYrMo=NA.
I plan to investigate this later, but if anyone else has experience with this, please let me know!
The .rmd file will not knit. I added my file directory in the knitr::opts_knit$set function, but I think the params.R file is causing the problem, since the file paths are hard-coded. Is it worth updating so that I can knit the .rmd, or should I instead just send text edits to you?
I'm just starting to go through the new server setup instructions, and realized that it pertains specifically to downloading data from the Zhu lab space on the UCSB socs-stats server. Do you want us to directly test that (in which case I'll somehow need access to the server) or can the utils.R code be generalized for direct download from NEON via neonUtilities? Thanks!
@lstanish I tried to read a fastq.gz file. ShortRead::readFastq() doesn't support reading directly from https, unlike base::read.csv(). A workaround is to download the file to a temp location and read it locally. It should work for Windows/Mac/Linux.
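The workaround can be sketched like this (the function name is illustrative; it requires the ShortRead package at call time):

```r
# Sketch of the workaround: download an https fastq.gz to a temp file,
# then read it locally with ShortRead::readFastq().
read_fastq_url <- function(url) {
  tmp <- tempfile(fileext = ".fastq.gz")
  on.exit(unlink(tmp))
  # mode = "wb" matters on Windows so the gzip stream isn't corrupted
  utils::download.file(url, tmp, mode = "wb", quiet = TRUE)
  ShortRead::readFastq(tmp)
}
```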
The sensitivity analysis currently only exists as an R file, and it needs to be converted into an Rmd file with better documentation. The sensitivity analysis must also be changed to an analysis on the 16S data, to get around some of the methodological ambiguities that were brought up in previous Technical WG meetings about the ITS data.
I'll work on this.
The directory structure of this repository must be updated to match standard R package conventions before being made available to the public. The directories I suggest keeping are the following:
R – to contain new R code, mainly utils.R
man – to contain object documentation for the functions in utils.R
vignettes – to contain all .Rmd vignettes

Where should params.R go?
By default, downloadRawSequenceData()
prints one line for each file that is downloaded. There is currently no option to suppress this.
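A minimal way to make that optional is a quiet argument guarding each progress line. This is a sketch; the function name and signature are illustrative, not neonMicrobe's actual API:

```r
# Sketch: gate per-file progress lines behind a quiet argument.
report_downloads <- function(files, quiet = FALSE) {
  for (f in files) {
    if (!quiet) message("Downloaded: ", f)
  }
  invisible(files)
}
```

Using message() rather than print() also lets callers silence the output with suppressMessages() even without the argument.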
The soil data download functions (downloadRawSoilData()
and downloadAllRawSoilData()
) throw an error if there is nothing to retrieve from soil chemistry (DP1.10078.001) or soil physical properties (DP1.10086.001). Since the output of the downloading script is passed into the SQLite database in create_soil_database.R
, it will cause downstream issues if the function output is missing columns. An error occurs for these preset parameters, which have no soil chemistry data:
PRESET_SITES = c("OSBS", "CPER", "CLBJ")
PRESET_START_YR_MO = "2018-03"
PRESET_END_YR_MO = "2018-07"
TARGET_GENE = "16S"
SEQUENCING_RUNS = c("C5B2R")
The name of the function downloadAllRawSoilData()
can be misleading, since it relies on the preset parameters rather than downloading all of NEON's available soil data. One option could be for downloadAllRawSoilData() to download, by default, all data for the preset sites, or even for all sites (though I'm not sure how long that would take).
Another approach may be to create a full SQLite database and store it 1) remotely, or 2) within the R package, and then the database can be called to match the sequencing files. I'm not sure how much memory this full database would take up at this point in time, but it might be a manageable size. One implementation of the remote approach is within the metScanR package:
"The DB is updated frequently and hosted externally of the metScanR package. Upon loading the metScanR package via library(metScanR), the DB is accessed via internet connection and installed locally to the user's computer via metScanR's updateDatabase() function."
To create a more robust file downloading process, the downloadRawSequenceData
function should use neonUtilities::zipsByURI
instead of the base R function download.file
. Since zipsByURI
only accepts metadata from a downloaded file, this may also require changes to the default behavior of downloadSequenceMetadata
.
The organizeRawSequenceData
function, which appears at the end of the "Download NEON Data" vignette and moves downloaded files from a "docking" location to an ITS- or 16S-specific subfolder, should reference the metadata produced by downloadSequenceMetadata
to sort files into their subfolder rather than search for the "ITS" and "16S" character strings in the filenames.
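A metadata-driven version of that move step might look like the following sketch. The targetGene and rawDataFileName column names follow the NEON marker gene metadata; the function name and directory layout are assumptions:

```r
# Sketch: sort downloaded fastq files into ITS/16S subfolders using the
# metadata's targetGene column instead of grepping the filenames.
sort_by_target_gene <- function(files, meta, base_dir) {
  gene <- meta$targetGene[match(basename(files), meta$rawDataFileName)]
  dest <- file.path(base_dir, gene, basename(files))
  keep <- !is.na(gene)  # leave files with no metadata match in place
  for (d in unique(dirname(dest[keep]))) {
    dir.create(d, recursive = TRUE, showWarnings = FALSE)
  }
  file.rename(files[keep], dest[keep])
  dest[keep]
}
```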
The code to create a directory structure, in the server setup script, could be simplified by wrapping into a function and using system()
calls. mkdir -p
will create a directory if it does not exist, but will not overwrite or throw errors if there is already an existing directory. Below is an example of how I have used this before (not updated for the specific server setup of this pipeline).
# Load "tree" command by JennyBC
# twee vignette here: https://gist.github.com/jennybc/2bf1dbe6eb1f261dfe60
temp <- RCurl::getURL("https://gist.githubusercontent.com/jennybc/2bf1dbe6eb1f261dfe60/raw/c53fba8a861f82f90d895458e941a1056c8118f5/twee.R", ssl.verifypeer=F)
eval(parse(text = temp))
# Create data directory
# NB: the {a,b} brace expansion below requires bash; system() invokes
# /bin/sh, which on some systems does not expand braces.
create_data_directory <- function(path = ".", amplicon = c("ITS", "16S"), ...) {
if (length(amplicon) == 1) {
cmd <- paste0("mkdir -p data/{filt_seqs/",amplicon,",raw_seqs/",amplicon,",trimmed_seqs/",amplicon,",seq_tables/",amplicon,",output_files/",amplicon,"}")
} else {
cmd <- "mkdir -p data/{filt_seqs/{ITS,16S},raw_seqs/{ITS,16S},trimmed_seqs/{ITS,16S},seq_tables/{ITS,16S},output_files/{ITS,16S}}"
}
system(cmd)
twee("data/", level = 2)
}
The twee()
function prints a directory tree to illustrate file structures, for example:
> twee("./data/NEON_DOB/", level=2)
Illumina
|__NEON
-- sequence_metadata
|__filesToStack10108
|__mmg_soilMetadata_16S_2020-08-06.csv
|__mmg_soilMetadata_ITS_2020-08-06.csv
-- soil
|__filesToStack10078
|__filesToStack10086
-- soilDatabase.rds
-- soilDB.db
There should be some code to allow the user to exclude legacy data and other dataQF tags from the sequence files they download.
In the Technical Working Group meeting on Oct. 5, we discussed doing this with a simple grep
command to subset the metadata object. However, there is a planned revision to download files using zipsByURI
, which only accepts metadata from a file location, so after this revision is made, slightly more code must be added in order to write the subsetted metadata to file.
I don't have a strong preference between creating a new function for this, versus just producing a few lines of code that we could put into the "Download NEON Data" vignette.
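Either way, the core of it is a subset on dataQF plus a write-out so zipsByURI can consume the result. A sketch; the flag value "legacyData" and the function name are assumptions:

```r
# Sketch: drop flagged records from the metadata and write the subset
# to file so that neonUtilities::zipsByURI can read it from disk.
# The dataQF value "legacyData" is an assumption about NEON's flagging.
exclude_flagged <- function(meta, out_csv, flags = "legacyData") {
  keep <- is.na(meta$dataQF) | !(meta$dataQF %in% flags)
  utils::write.csv(meta[keep, , drop = FALSE], out_csv, row.names = FALSE)
  meta[keep, , drop = FALSE]
}
```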
The filterAndTrim
step of the DADA2 processing pipeline removes a majority of ITS reads from most samples in most sequencing runs. For example, after filterAndTrim
with the default parameters (maxN = 0, maxEE = c(2, 2), truncQ = 2, minLen = 50
), the median percentage of reads remaining in a sample from sequencing run B69PP is only 16%.
@mykophile commented:
Hi Clara - one of the graduate students in my lab, Glade, found that one of the keys for him was setting the TruncQ parameter higher than the default. While this may not seem critical, his reasoning was that with ITS seqs there is a long, low quality tail that can cause reads to get rejected later. If you are more aggressive with trimming early that actually saves more down the line. Here are the settings he found worked best:
out <- filterAndTrim(cutFs, filtFs, cutRs, filtRs, maxN = 0, maxEE = c(2, 2),
truncQ = 9, minLen = 50, rm.phix = TRUE, compress = TRUE, multithread = TRUE)
Best,
Kabir
Currently, the error tolerance is assumed to be the cutadapt default, -e 0.1. Changing the error tolerance could allow more primer sequences to be identified (and thus removed) by cutadapt.
See https://cutadapt.readthedocs.io/en/stable/guide.html#error-tolerance
To be fair, this might not be the most consequential parameter. Error tolerance of 0.1 corresponds to maximum mismatch of 2 bases. Allowing up to 3 mismatched bases would have removed an additional 138 reverse-primer matches from the reverse reads – approximately 10% of all the matches (with max.mismatch=3
) in the pre-cutadapt reverse reads.
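If we do expose this, the change amounts to threading an error-rate parameter into the cutadapt invocation. A sketch of the argument construction only; the primer strings and file names are placeholders, and the helper name is an assumption:

```r
# Sketch: build paired-end cutadapt arguments with a configurable
# error tolerance (-e). Primer sequences here are placeholders,
# not NEON's actual primers.
cutadapt_args <- function(fwd, rev, in_r1, in_r2, out_r1, out_r2,
                          error_rate = 0.1) {
  c("-g", fwd, "-G", rev,
    "-e", format(error_rate),
    "-o", out_r1, "-p", out_r2,
    in_r1, in_r2)
  # run with: system2("cutadapt", args = cutadapt_args(...))
}
```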
runB69PP, first R1 sample
Forward Complement Reverse RevComp
FWDPrimer.R1.reads 423 0 0 3
FWDPrimer.R2.reads 26 0 0 13
REVPrimer.R1.reads 647 0 0 298
REVPrimer.R2.reads 737 0 0 4
runB69PP, first R1 sample, setting max.mismatch = 2
Forward Complement Reverse RevComp
FWDPrimer.R1.reads 731 0 0 4
FWDPrimer.R2.reads 31 0 0 314
REVPrimer.R1.reads 1518 0 0 807
REVPrimer.R2.reads 1196 0 0 13
runB69PP, first R1 sample, setting max.mismatch = 3
Forward Complement Reverse RevComp
FWDPrimer.R1.reads 757 0 0 4
FWDPrimer.R2.reads 31 0 0 343
REVPrimer.R1.reads 1528 0 0 944
REVPrimer.R2.reads 1342 0 0 19
runB69PP, post cutadapt (1 of 2), first R1 sample
Forward Complement Reverse RevComp
FWDPrimer.R1.reads 0 0 0 3
FWDPrimer.R2.reads 26 0 0 0
REVPrimer.R1.reads 646 0 0 0
REVPrimer.R2.reads 0 0 0 4
runB69PP, post cutadapt (1 of 2), first R1 sample, setting max.mismatch = 2
Forward Complement Reverse RevComp
FWDPrimer.R1.reads 0 0 0 4
FWDPrimer.R2.reads 31 0 0 0
REVPrimer.R1.reads 1509 0 0 0
REVPrimer.R2.reads 0 0 0 13
runB69PP, post cutadapt (1 of 2), first R1 sample, setting max.mismatch = 3
Forward Complement Reverse RevComp
FWDPrimer.R1.reads 1 0 0 4
FWDPrimer.R2.reads 31 0 0 5
REVPrimer.R1.reads 1519 0 0 132
REVPrimer.R2.reads 138 0 0 19
Improvements should be made to the sequence data downloading functions in utils.R
to make them more robust. Currently they assume a standard file naming structure, which means that they may not be robust to minute changes in naming conventions.
Suggestions from @lstanish :
In my experience the best way to ensure you are filtering for the correct sequence data is to start by downloading the metadata and doing some table joining and filtering. Here’s some example code using neonUtilities that combines the respective sequencing metadata table (16S or ITS) with the raw data files table. What you end up with is a data.frame containing the rawDataFileName values for just the 16S or ITS data.
library(neonUtilities)
library(plyr)
library(dplyr)
mmgL1 <- loadByProduct('DP1.10108.001', package = 'expanded', check.size = F) # output is a list of each metadata file
# extract list elements into data.frames
seq16S <- mmgL1$mmg_soilMarkerGeneSequencing_16S
seqITS <- mmgL1$mmg_soilMarkerGeneSequencing_ITS
raw <- mmgL1$mmg_soilRawDataFiles
# double check that strings are of class character and not factor/logical
# Join 16S metadata
raw16S <- raw[!grepl("ITS", raw$rawDataFileName), ] # grepl() is safe even when there are no matches
joined16S <- left_join(raw16S, seq16S, by=c('dnaSampleID', 'sequencerRunID', 'internalLabID'))
joined16S <- joined16S[!is.na(joined16S$uid.y), ]
# Join ITS metadata
rawITS <- raw[!grepl("16S", raw$rawDataFileName), ]
joinedITS <- left_join(rawITS, seqITS, by=c('dnaSampleID', 'sequencerRunID', 'internalLabID'))
joinedITS <- joinedITS[!is.na(joinedITS$uid.y), ]
From here you can subset the data to include only the run or samples of interest.
Per a conversation between @zoey-rw and me, the .fastq
files that are reorganized into the 0_raw
subfolders at the end of the download-neon-data.Rmd
vignette need to be turned into .fastq.gz
files, or else this error is reached during the process-[16s/its]-sequences.Rmd
vignette:
dnaio.exceptions.FastqFormatError: Error in FASTQ file at line 1: Line expected to start with '@', but found '\x1f'
The .fastq
files are actually already compressed, so this can be addressed by including a line at the end of the download-neon-data.Rmd
vignette that simply appends ".gz" to each filename.
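That rename step could be as small as the following sketch; it only touches files that don't already end in .gz, and the helper name is illustrative:

```r
# Sketch: append ".gz" to already-compressed fastq files so downstream
# tools (e.g. cutadapt via dnaio) recognize the gzip format.
add_gz_extension <- function(files) {
  needs <- !grepl("\\.gz$", files)
  file.rename(files[needs], paste0(files[needs], ".gz"))
  ifelse(needs, paste0(files, ".gz"), files)
}
```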
Change to:
Data (formerly “NEON”)
* Raw_sequence
* Sequence_metadata
* Outputs
* Soil
* Tax_ref
For some reason, I've been missing some GitHub notifications. I've just changed some settings that should address it, but I can't test it on my own. If you're reading this, can you please respond with a comment to this issue thread?
@claraqin I suggest having two folders: data-raw
and data
. data-raw
stores the data directly downloaded from NEON, etc. using code, and it is a symbolic link on the server and won't be uploaded to GitHub. data
stores processed data that are needed to run the models, and should be relatively small and will be version-controlled on GitHub. For example, the rda
files can be put in the data
folder. That way, collaborators only need the GitHub repo without accessing the server.
On Socs-Stats Server, devtools::install_github("claraqin/neonMicrobe")
hits a roadblock when installing the Rhtslib
package. I followed the instruction on https://bioconductor.org/packages/release/bioc/html/Rhtslib.html
, but here are the error messages. R 3.6 produces similar messages.
> BiocManager::install("Rhtslib", type = "source")
Bioconductor version 3.12 (BiocManager 1.30.10), R 4.0.2 (2020-06-22)
Installing package(s) 'Rhtslib'
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 1437k 100 1437k 0 0 16.5M 0 --:--:-- --:--:-- --:--:-- 16.5M
Microsoft R Open 4.0.2
The enhanced R distribution from Microsoft
Microsoft packages Copyright (C) 2020 Microsoft Corporation
Using the Intel MKL for parallel mathematical computing (using 44 cores).
Default CRAN mirror snapshot taken on 2020-07-16.
See: https://mran.microsoft.com/.
* installing *source* package ‘Rhtslib’ ...
** using non-staged installation via StagedInstall field
** libs
cd "htslib-1.7" && make -f "/opt/microsoft/ropen/4.0.2/lib64/R/etc/Makeconf" -f "Makefile.Rhtslib"
make[1]: Entering directory '/tmp/Rtmptxl4sR/R.INSTALLecb550d6d643/Rhtslib/src/htslib-1.7'
Makefile.Rhtslib:128: warning: overriding recipe for target '.c.o'
/opt/microsoft/ropen/4.0.2/lib64/R/etc/Makeconf:159: warning: ignoring old recipe for target '.c.o'
gcc -std=gnu99 -DU_STATIC_IMPLEMENTATION -O2 -g -fpic -I. -DU_STATIC_IMPLEMENTATION -D_FILE_OFFSET_BITS=64 -c -o kfunc.o kfunc.c
gcc -std=gnu99 -DU_STATIC_IMPLEMENTATION -O2 -g -fpic -I. -DU_STATIC_IMPLEMENTATION -D_FILE_OFFSET_BITS=64 -c -o knetfile.o knetfile.c
gcc -std=gnu99 -DU_STATIC_IMPLEMENTATION -O2 -g -fpic -I. -DU_STATIC_IMPLEMENTATION -D_FILE_OFFSET_BITS=64 -c -o kstring.o kstring.c
gcc -std=gnu99 -DU_STATIC_IMPLEMENTATION -O2 -g -fpic -I. -DU_STATIC_IMPLEMENTATION -D_FILE_OFFSET_BITS=64 -c -o bcf_sr_sort.o bcf_sr_sort.c
gcc -std=gnu99 -DU_STATIC_IMPLEMENTATION -O2 -g -fpic -I. -DU_STATIC_IMPLEMENTATION -D_FILE_OFFSET_BITS=64 -c -o bgzf.o bgzf.c
gcc -std=gnu99 -DU_STATIC_IMPLEMENTATION -O2 -g -fpic -I. -DU_STATIC_IMPLEMENTATION -D_FILE_OFFSET_BITS=64 -c -o errmod.o errmod.c
gcc -std=gnu99 -DU_STATIC_IMPLEMENTATION -O2 -g -fpic -I. -DU_STATIC_IMPLEMENTATION -D_FILE_OFFSET_BITS=64 -c -o faidx.o faidx.c
gcc -std=gnu99 -DU_STATIC_IMPLEMENTATION -O2 -g -fpic -I. -DU_STATIC_IMPLEMENTATION -D_FILE_OFFSET_BITS=64 -c -o hfile.o hfile.c
gcc -std=gnu99 -DU_STATIC_IMPLEMENTATION -O2 -g -fpic -I. -DU_STATIC_IMPLEMENTATION -D_FILE_OFFSET_BITS=64 -c -o hfile_net.o hfile_net.c
echo '#define HTS_VERSION "1.7"' > version.h
gcc -std=gnu99 -DU_STATIC_IMPLEMENTATION -O2 -g -fpic -I. -DU_STATIC_IMPLEMENTATION -D_FILE_OFFSET_BITS=64 -c -o hts.o hts.c
gcc -std=gnu99 -DU_STATIC_IMPLEMENTATION -O2 -g -fpic -I. -DU_STATIC_IMPLEMENTATION -D_FILE_OFFSET_BITS=64 -c -o hts_os.o hts_os.c
gcc -std=gnu99 -DU_STATIC_IMPLEMENTATION -O2 -g -fpic -I. -DU_STATIC_IMPLEMENTATION -D_FILE_OFFSET_BITS=64 -c -o md5.o md5.c
gcc -std=gnu99 -DU_STATIC_IMPLEMENTATION -O2 -g -fpic -I. -DU_STATIC_IMPLEMENTATION -D_FILE_OFFSET_BITS=64 -c -o multipart.o multipart.c
gcc -std=gnu99 -DU_STATIC_IMPLEMENTATION -O2 -g -fpic -I. -DU_STATIC_IMPLEMENTATION -D_FILE_OFFSET_BITS=64 -c -o probaln.o probaln.c
gcc -std=gnu99 -DU_STATIC_IMPLEMENTATION -O2 -g -fpic -I. -DU_STATIC_IMPLEMENTATION -D_FILE_OFFSET_BITS=64 -c -o realn.o realn.c
gcc -std=gnu99 -DU_STATIC_IMPLEMENTATION -O2 -g -fpic -I. -DU_STATIC_IMPLEMENTATION -D_FILE_OFFSET_BITS=64 -c -o regidx.o regidx.c
gcc -std=gnu99 -DU_STATIC_IMPLEMENTATION -O2 -g -fpic -I. -DU_STATIC_IMPLEMENTATION -D_FILE_OFFSET_BITS=64 -c -o sam.o sam.c
gcc -std=gnu99 -DU_STATIC_IMPLEMENTATION -O2 -g -fpic -I. -DU_STATIC_IMPLEMENTATION -D_FILE_OFFSET_BITS=64 -c -o synced_bcf_reader.o synced_bcf_reader.c
gcc -std=gnu99 -DU_STATIC_IMPLEMENTATION -O2 -g -fpic -I. -DU_STATIC_IMPLEMENTATION -D_FILE_OFFSET_BITS=64 -c -o vcf_sweep.o vcf_sweep.c
gcc -std=gnu99 -DU_STATIC_IMPLEMENTATION -O2 -g -fpic -I. -DU_STATIC_IMPLEMENTATION -D_FILE_OFFSET_BITS=64 -c -o tbx.o tbx.c
gcc -std=gnu99 -DU_STATIC_IMPLEMENTATION -O2 -g -fpic -I. -DU_STATIC_IMPLEMENTATION -D_FILE_OFFSET_BITS=64 -c -o textutils.o textutils.c
gcc -std=gnu99 -DU_STATIC_IMPLEMENTATION -O2 -g -fpic -I. -DU_STATIC_IMPLEMENTATION -D_FILE_OFFSET_BITS=64 -c -o thread_pool.o thread_pool.c
gcc -std=gnu99 -DU_STATIC_IMPLEMENTATION -O2 -g -fpic -I. -DU_STATIC_IMPLEMENTATION -D_FILE_OFFSET_BITS=64 -c -o vcf.o vcf.c
gcc -std=gnu99 -DU_STATIC_IMPLEMENTATION -O2 -g -fpic -I. -DU_STATIC_IMPLEMENTATION -D_FILE_OFFSET_BITS=64 -c -o vcfutils.o vcfutils.c
gcc -std=gnu99 -DU_STATIC_IMPLEMENTATION -O2 -g -fpic -I. -DU_STATIC_IMPLEMENTATION -D_FILE_OFFSET_BITS=64 -c -o cram/cram_codecs.o cram/cram_codecs.c
gcc -std=gnu99 -DU_STATIC_IMPLEMENTATION -O2 -g -fpic -I. -DU_STATIC_IMPLEMENTATION -D_FILE_OFFSET_BITS=64 -c -o cram/cram_decode.o cram/cram_decode.c
gcc -std=gnu99 -DU_STATIC_IMPLEMENTATION -O2 -g -fpic -I. -DU_STATIC_IMPLEMENTATION -D_FILE_OFFSET_BITS=64 -c -o cram/cram_encode.o cram/cram_encode.c
gcc -std=gnu99 -DU_STATIC_IMPLEMENTATION -O2 -g -fpic -I. -DU_STATIC_IMPLEMENTATION -D_FILE_OFFSET_BITS=64 -c -o cram/cram_external.o cram/cram_external.c
gcc -std=gnu99 -DU_STATIC_IMPLEMENTATION -O2 -g -fpic -I. -DU_STATIC_IMPLEMENTATION -D_FILE_OFFSET_BITS=64 -c -o cram/cram_index.o cram/cram_index.c
gcc -std=gnu99 -DU_STATIC_IMPLEMENTATION -O2 -g -fpic -I. -DU_STATIC_IMPLEMENTATION -D_FILE_OFFSET_BITS=64 -c -o cram/cram_io.o cram/cram_io.c
cram/cram_io.c:61:18: fatal error: lzma.h: No such file or directory
compilation terminated.
make[1]: *** [Makefile.Rhtslib:128: cram/cram_io.o] Error 1
make[1]: Leaving directory '/tmp/Rtmptxl4sR/R.INSTALLecb550d6d643/Rhtslib/src/htslib-1.7'
make: *** [Makevars.common:24: htslib] Error 2
ERROR: compilation failed for package ‘Rhtslib’
* removing ‘/raid/users/kai.zhu/R/x86_64-pc-linux-gnu-library/4.0/Rhtslib’
The downloaded source packages are in
‘/tmp/RtmppVM9FP/downloaded_packages’
Installation path not writeable, unable to update packages: BiocGenerics,
BiocManager, broom, callr, cli, cpp11, data.table, dbplyr, DelayedArray, desc,
devtools, diffobj, dplyr, DT, e1071, farver, formatR, GenomeInfoDb, gert, gh,
haven, highr, IRanges, isoband, knitr, lubridate, lwgeom, magick,
MatrixGenerics, mgcv, neonUtilities, nngeo, packrat, parallelly, pillar,
pkgload, processx, ps, RcppArmadillo, RcppParallel, RCurl, remotes, reprex,
reticulate, rmarkdown, rsconnect, rvest, S4Vectors, sf, spatstat,
spatstat.data, spatstat.geom, spatstat.sparse, spatstat.utils, surveillance,
survival, tensorflow, tfruns, tibble, tidyr, tidyverse, tinytex, units, utf8,
vctrs, viridisLite, waldo, withr, wk, xfun, zoo
Warning message:
In install.packages(...) :
installation of package ‘Rhtslib’ had non-zero exit status
The current ITS sequence processing vignette only supports the processing of R1 sequences, following recommendations by Pauvert et al. (2019). This can be updated to handle paired-end merging by matching some of the code in the 16S sequence processing vignette.
This should be considered a secondary priority to completing vignettes that currently do not exist.
> devtools::session_info()
- Session info ----------------------------------------------------------------------------------------------
setting value
version R version 4.0.3 (2020-10-10)
os Windows 10 x64
system x86_64, mingw32
ui RStudio
language (EN)
collate English_United States.1252
ctype English_United States.1252
tz America/Los_Angeles
date 2020-11-04
- Packages --------------------------------------------------------------------------------------------------
package * version date lib source
assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.0.3)
backports 1.1.10 2020-09-15 [1] CRAN (R 4.0.3)
BiocManager 1.30.10 2019-11-16 [1] CRAN (R 4.0.3)
callr 3.5.1 2020-10-13 [1] CRAN (R 4.0.3)
cli 2.1.0 2020-10-12 [1] CRAN (R 4.0.3)
crayon 1.3.4 2017-09-16 [1] CRAN (R 4.0.3)
desc 1.2.0 2018-05-01 [1] CRAN (R 4.0.3)
devtools 2.3.2 2020-09-18 [1] CRAN (R 4.0.3)
digest 0.6.27 2020-10-24 [1] CRAN (R 4.0.3)
ellipsis 0.3.1 2020-05-15 [1] CRAN (R 4.0.3)
fansi 0.4.1 2020-01-08 [1] CRAN (R 4.0.3)
fs 1.5.0 2020-07-31 [1] CRAN (R 4.0.3)
glue 1.4.2 2020-08-27 [1] CRAN (R 4.0.3)
knitr 1.30 2020-09-22 [1] CRAN (R 4.0.3)
magrittr 1.5 2014-11-22 [1] CRAN (R 4.0.3)
memoise 1.1.0 2017-04-21 [1] CRAN (R 4.0.3)
pkgbuild 1.1.0 2020-07-13 [1] CRAN (R 4.0.3)
pkgload 1.1.0 2020-05-29 [1] CRAN (R 4.0.3)
prettyunits 1.1.1 2020-01-24 [1] CRAN (R 4.0.3)
processx 3.4.4 2020-09-03 [1] CRAN (R 4.0.3)
ps 1.4.0 2020-10-07 [1] CRAN (R 4.0.3)
R6 2.5.0 2020-10-28 [1] CRAN (R 4.0.3)
remotes 2.2.0 2020-07-21 [1] CRAN (R 4.0.3)
rlang 0.4.8 2020-10-08 [1] CRAN (R 4.0.3)
rprojroot 1.3-2 2018-01-03 [1] CRAN (R 4.0.3)
rstudioapi 0.11 2020-02-07 [1] CRAN (R 4.0.3)
sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 4.0.3)
testthat 3.0.0 2020-10-31 [1] CRAN (R 4.0.3)
usethis 1.6.3 2020-09-17 [1] CRAN (R 4.0.3)
withr 2.3.0 2020-09-22 [1] CRAN (R 4.0.3)
xfun 0.19 2020-10-30 [1] CRAN (R 4.0.3)
[1] C:/Users/kai.zhu/Documents/R/win-library/4.0
[2] C:/Program Files/R/R-4.0.3/library
The remove_unmatched_files()
function discards from the 16S processing pipeline any R1 fastq files that do not appear to have a matching R2 fastq file, and vice versa. It currently performs this matching using the filenames only. This causes problems when there are slight mismatches in the file naming conventions between the R1 and R2 fastq files of the same sequencing run, as for run BFDG8:
> head(fnFs)
[1] "/hb/home/claraqin/projects/NEON_soil_microbe_processing/NEON/raw_sequence/16S/0_raw/runBFDG8_BMI_Plate13WellA12_16S_R1.fastq"
[2] "/hb/home/claraqin/projects/NEON_soil_microbe_processing/NEON/raw_sequence/16S/0_raw/runBFDG8_BMI_Plate13WellA3_16S_R1.fastq"
[3] "/hb/home/claraqin/projects/NEON_soil_microbe_processing/NEON/raw_sequence/16S/0_raw/runBFDG8_BMI_Plate13WellA4_16S_R1.fastq"
[4] "/hb/home/claraqin/projects/NEON_soil_microbe_processing/NEON/raw_sequence/16S/0_raw/runBFDG8_BMI_Plate13WellA5_16S_R1.fastq"
[5] "/hb/home/claraqin/projects/NEON_soil_microbe_processing/NEON/raw_sequence/16S/0_raw/runBFDG8_BMI_Plate13WellA9_16S_R1.fastq"
[6] "/hb/home/claraqin/projects/NEON_soil_microbe_processing/NEON/raw_sequence/16S/0_raw/runBFDG8_BMI_Plate13WellB1_16S_R1.fastq"
> length(fnFs)
[1] 142
> head(fnRs)
[1] "/hb/home/claraqin/projects/NEON_soil_microbe_processing/NEON/raw_sequence/16S/0_raw/runBFDG8_BMI_Plate13WellA10_16S_BFDG8_R2.fastq"
[2] "/hb/home/claraqin/projects/NEON_soil_microbe_processing/NEON/raw_sequence/16S/0_raw/runBFDG8_BMI_Plate13WellA11_16S_BFDG8_R2.fastq"
[3] "/hb/home/claraqin/projects/NEON_soil_microbe_processing/NEON/raw_sequence/16S/0_raw/runBFDG8_BMI_Plate13WellA12_16S_BFDG8_R2.fastq"
[4] "/hb/home/claraqin/projects/NEON_soil_microbe_processing/NEON/raw_sequence/16S/0_raw/runBFDG8_BMI_Plate13WellA2_16S_BFDG8_R2.fastq"
[5] "/hb/home/claraqin/projects/NEON_soil_microbe_processing/NEON/raw_sequence/16S/0_raw/runBFDG8_BMI_Plate13WellA3_16S_BFDG8_R2.fastq"
[6] "/hb/home/claraqin/projects/NEON_soil_microbe_processing/NEON/raw_sequence/16S/0_raw/runBFDG8_BMI_Plate13WellA4_16S_BFDG8_R2.fastq"
> length(fnRs)
[1] 188
> matched_fn <- remove_unmatched_files(fnFs, fnRs)
> matched_fn
$R1
character(0)
$R2
character(0)
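A more forgiving approach is to match on a normalized sample key instead of the full filename; for run BFDG8 the duplicated run-ID token then no longer breaks the pairing. A sketch; the regex is tuned to the example names above and is an assumption about the general naming pattern:

```r
# Sketch: pair R1/R2 files on a normalized key (everything before the
# target-gene token), tolerating extra tokens like a duplicated run ID.
pair_key <- function(paths) {
  sub("_(16S|ITS).*$", "", basename(paths))
}

match_fastq_pairs <- function(fnFs, fnRs) {
  common <- intersect(pair_key(fnFs), pair_key(fnRs))
  list(R1 = fnFs[pair_key(fnFs) %in% common],
       R2 = fnRs[pair_key(fnRs) %in% common])
}
```

The longer-term fix, of course, is to pair files via the metadata rather than via filenames at all.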
A generalization of Issue #23.
There are many places in the processing pipeline that attempt to parse information from the file name, or to store information in the file name. We should try to trim away these pieces, replacing them with a metadata-based identification system, which is now possible thanks to Lee's new version of downloadSequenceMetadata()
. Some of these pieces are:
remove_unmatched_files()
process_*_sequences.Rmd vignettes
functions (e.g. qualityFilter16S() and trimPrimersITS()) that differentiate between forward and reverse sequence files by looking for a regex pattern in their filenames

In testing analyze-neon-greatplains-16s.R
on CyVerse, the following chunk:
filter_trackReads <- qualityFilter16S(
fl_nm, in_subdir = "1_trimmed", out_subdir = "2_filtered",
meta = meta, truncLen = 220, maxEE = 8, multithread = TRUE
)
The error message is
Using user-provided truncLen: 220
Error in names(answer) <- names1 :
'names' attribute [86] must be the same length as the vector [75]
In addition: Warning messages:
1: In mclapply(seq_len(n), do_one, mc.preschedule = mc.preschedule, :
scheduled cores 2, 11 did not deliver results, all values of the jobs will be affected
2: In mclapply(seq_len(n), do_one, mc.preschedule = mc.preschedule, :
scheduled cores 6, 10, 13, 14, 7, 9, 15 encountered errors in user code, all values of the jobs will be affected
At the end of the download-neon-data.Rmd
vignette, rather than retrieving the names of the downloaded files (fn
) using list.files()
, simply reference the rawDataFileNames
column in the metadata, and filter using the download_success
vector output from downloadRawSequenceData()
. Note that this also requires the download_success
vector to be formatted correctly, though.
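In code, that replacement amounts to a subset of the metadata rather than a directory listing. A sketch; the rawDataFileName column follows the NEON metadata, and the helper name is illustrative:

```r
# Sketch: derive the processed file list from metadata plus the
# download_success logical vector, instead of list.files().
files_from_metadata <- function(meta, download_success) {
  stopifnot(length(download_success) == nrow(meta))
  meta$rawDataFileName[download_success]
}
```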
Update the downloadSequenceMetadata()
function so it also downloads the variable name definitions, for easy reference.
Many of the function revisions currently being developed require the user to specify a metadata file that will be referenced to determine which files to process for a given step, e.g. matchFastqToMetadata()
. (This is an alternative to parsing the filenames for this information, as the original functions did.) While this new convention is more robust to differences in file naming convention, it also means that there ought to be an easy way to remember which metadata files are associated with which set(s) of fastq files. Currently, metadata files which are downloaded to the sequence_metadata
subdirectory are simply identified by a timestamp.
@zoey-rw suggested having one "complete" copy of the metadata, containing the metadata for all soil microbe marker gene sequence records. This would be compatible with any fastq files that we attempt to match to it, and would remain in a static location (perhaps defined by params.R
) so it can be easily accessed.
I've started testing project management tools under the "Projects" tab. It's possible to have to do, in progress, and done categories. Give it a try. @claraqin