
neonmicrobe's People

Contributors

claraqin, lstanish, zoey-rw


neonmicrobe's Issues

Ensure that code involving NEON soil data products still works

The downloadRawSoilData function and subsequent code were developed against an outdated NEON soil data structure, in which DP1.10086.001 held physical properties and DP1.10078.001 held chemical properties. DP1.10086.001 now contains the combined chemical and physical properties. We should check the add_environmental_variables and downloadRawSoilData vignettes to ensure that they still work properly.
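
A quick way to verify the new combined structure, sketched here with neonUtilities (the site and date range are arbitrary test values):

library(neonUtilities)

# Spot-check: DP1.10086.001 should now return both physical and chemical tables
soil <- loadByProduct("DP1.10086.001", site = "HARV",
                      startdate = "2018-01", enddate = "2018-06",
                      check.size = FALSE)
names(soil)  # inspect which tables are bundled in the combined product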

Update the vignettes to remove chimeras AFTER merging sequence tables

The current ITS and 16S sequence processing vignettes remove chimeras BEFORE merging sequence tables together into one big sequence table. Ben Callahan recommends removing chimeras AFTER merging the sequence tables. Zoey suggests that this could reduce Type I errors in chimera identification.
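
For reference, a minimal sketch of the recommended order using dada2 (seqtab1 and seqtab2 stand in for per-run sequence tables):

library(dada2)

# Merge per-run sequence tables first...
merged <- mergeSequenceTables(seqtab1, seqtab2)

# ...then remove chimeras from the merged table
merged.nochim <- removeBimeraDenovo(merged, method = "consensus",
                                    multithread = TRUE)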

Metadata/downloading QC checklist

We need to make the following changes to the workflow, particularly in the Download NEON Data vignette, to prevent QC-related issues from complicating processes downstream.

  1. Metadata file(s) should be saved by default
  2. Check for pre-existing downloads
  3. Check for duplicate sample IDs (duplicates may arise from either re-sampling or labeling errors; in either case, choose which file to retain)
  4. Remove QC-flagged data
  5. Separate metadata into 16S or ITS - this could occur at the phyloseq step, or before downloading raw sequences

In addition, @lstanish suggests that it could be good to reorganize the columns in the metadata table so the most important columns come first. What are some columns to put first in the metadata table?

Instructions for pipeline testing

Migrating this from an email thread for easier viewing.

Scope of Testing

For now, we will be testing the steps in the figure below between "NEON Data Products" and "ASV tables with taxonomy," except that we will not be generating the taxonomy tables because this takes too much processing time.

[Figure: pipeline diagram (screenshot dated 2020-10-26)]

Our Technical Working Group has suggested that this testing should occur in two phases. In Phase 1, we test whether the pipeline can simply run from start to finish on a variety of operating systems. In Phase 2, we will ask volunteers to read through the docs and suggest how to make the package more flexible and user-friendly. For now, we are only asking you to conduct Phase 1 testing.

Instructions for Phase 1 Testing

Start by pulling the codebase from https://github.com/claraqin/NEON_soil_microbe_processing.

Install cutadapt if you have not previously done so.

  • Installation instructions can be found here: https://cutadapt.readthedocs.io/en/stable/installation.html
  • This is where many people run into issues because of Python dependencies. If you cannot install cutadapt, then ignore the ITS pipeline and test the 16S processing pipeline only. (The 16S pipeline does not require cutadapt.)

Update the parameters in the "params.R" file, which can be found in the "code" subdirectory.

  • Most of the parameters will not need to be updated, because they are either adaptable or not referenced within this scope of testing.
  • However, you may need to update the CUTADAPT_PATH parameter if you are testing the ITS pipeline.
  • If you are using a Mac, you may also wish to update the MULTITHREAD parameter. By default, multithreading is turned off for Windows computers.

Download the sequence metadata for testing at this Google Drive link, decompress the zipfile, and drop the contents (two files) into the project directory (the base directory of the repository that you just cloned).

  • In the future, this step will be replaced by a function made specifically for downloading sequence metadata from NEON. But for now, we need to use a workaround because of compatibility issues on NEON's end which will be resolved later this year.

The code for testing can be found in the "testing" subdirectory. This subdirectory contains temporary versions of our vignettes that I made for testing only. Start with the download-neon-data-metadataworkaround.Rmd vignette.

  • You will probably have to update the "root.dir" RMarkdown parameter at the top of the script. It should refer to the absolute filepath of the project root directory (e.g. .../neonSoilMicrobeProcessing).
  • Note that the R package dependencies, specified in lines 32-36, must be installed before this vignette will run properly.
  • In lines 81-82, you will have the option to download either the metadata for ITS sequences or the metadata for 16S sequences (or both). Please respond to this Issue thread to let the other testers know which target gene(s) you will test.
  • In lines 89-101, several subsetting options are provided. You could attempt to download and process the entire dataset if you'd like, but I don't even have an estimate of the full download size, because these metadata tables include both published and pre-published NEON data. If you do subset the data, please respond to this Issue thread to let the other testers know which subset(s) you will test.

Then move to either the process-its-sequence-to-seqtabs.Rmd or process-16s-sequence-to-seqtabs.Rmd vignettes, depending on which subset of the data you selected.

  • You will probably have to update the "root.dir" RMarkdown parameter at the top of the script. It should refer to the absolute filepath of the project root directory (e.g. .../neonSoilMicrobeProcessing).
  • Note that the R package dependencies, specified in lines 30-34, must be installed before this vignette will run properly.
  • Both vignettes contain a header which says "All code below is NOT run in this version of the vignette." Please run only the code above this header.
  • Note that each sequencing run (the unit by which we are subsetting) takes anywhere between 1 and 4 hours to process, depending on the size of the run and the speed of your processor. I've found that 8 GB of RAM is usually sufficient for running this pipeline, but occasionally more RAM is needed.

Reporting Back

If any issues or fatal errors arise, please let me know by replying to me individually (unless of course it seems obvious that it would affect all testers).

Whether you run into a fatal error or are able to complete the pipeline error-free, please report back on this thread and include in your post the output of devtools::session_info().

Current Volunteer Assignments

  • Kabir has tested the ITS pipeline on a Mac for the following subset of data: c("B69PP", "B69RF", "B69RN", "B9994", "BDR3T", "BF8M2", "BF8W6", "BFDG8", "BMCBD", "BMCC4", "BNBWL").
  • Dan is currently testing the 16S pipeline on a Mac for the following subset of data: c("B69PP", "B69RF", "B69RN", "B9994", "BDNB6", "BF462", "BF8M2", "BFDG8", "BJ8RK", "BMC64", "BMCBJ").
  • Kai is currently testing the 16S pipeline on a Windows VM and printing the results in this Issue thread: #26

Mixed orientation of reads in R1 and R2 files

Hi everyone,

It seems that some of the reads in the R1 files are actually reverse reads, and some of the reads in the R2 files are actually forward reads. See this table of primer orientations in R1 and R2 files for run B69PP:

                   Forward Complement Reverse RevComp
FWDPrimer.R1.reads     423          0       0       3
FWDPrimer.R2.reads      26          0       0      13
REVPrimer.R1.reads     647          0       0     298
REVPrimer.R2.reads     737          0       0       4

This is an issue common to Illumina sequencing known as mixed orientation. DADA2 does not have a built-in way to handle this, but see benjjneb/dada2#938
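
For reference, orientation counts like the table above can be produced with a sketch along these lines, adapted from the DADA2 ITS workflow (FWD is assumed to hold the forward primer string, and fnFs[1] the first R1 file):

library(Biostrings)
library(ShortRead)

# Generate all four orientations of a primer sequence
allOrients <- function(primer) {
  dna <- DNAString(primer)
  c(Forward = toString(dna), Complement = toString(complement(dna)),
    Reverse = toString(reverse(dna)), RevComp = toString(reverseComplement(dna)))
}

# Count reads in a fastq file that contain a given primer orientation
primerHits <- function(primer, fn) {
  nhits <- vcountPattern(primer, sread(readFastq(fn)), fixed = FALSE)
  sum(nhits > 0)
}

sapply(allOrients(FWD), primerHits, fn = fnFs[1])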

Revise "dada2_to_phyloseq_[its/16s].Rmd" (aka "Add Environmental Variables") vignettes to match new metadata format

The dada2_to_phyloseq_[its/16s].Rmd vignettes need to be revised to match the metadata format that Lee and Clara revised on (or before) Dec 11, 2020. The necessary changes may not be extensive – I believe the main differences in the metadata format concern which columns it includes and where it is saved – but we need to make sure these vignettes remain compatible with the upstream vignettes, download-neon-data.Rmd and process-[its/16s]-sequences.Rmd.

Some fastq file URLs return 404 error

Sometimes, the rawDataFilePath column of the mmg_soilRawDataFiles table (in product ID 10108) will contain bad URLs. This will continue for a while during NEON's file renaming effort. For example:

trying URL 'https://neon-microbial-raw-seq-files.s3.data.neonscience.org/2017/BMI_B69PP_ITS_R1_fastq.tar.gz'
Error in download.file(url = as.character(u.urls[i]), destfile = ifelse(dir.exists(outdir),  : 
  cannot open URL 'https://neon-microbial-raw-seq-files.s3.data.neonscience.org/2017/BMI_B69PP_ITS_R1_fastq.tar.gz'
In addition: Warning message:
In download.file(url = as.character(u.urls[i]), destfile = ifelse(dir.exists(outdir),  :
  cannot open URL 'https://neon-microbial-raw-seq-files.s3.data.neonscience.org/2017/BMI_B69PP_ITS_R1_fastq.tar.gz': HTTP status was '404 Not Found'
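
Until the renaming effort is complete, a per-URL guard could keep one bad URL from aborting the whole batch. A hedged sketch (u.urls and outdir follow the names in the traceback above):

# Skip URLs that 404 instead of stopping the whole download loop
ok <- vapply(as.character(u.urls), function(u) {
  tryCatch({
    download.file(u, destfile = file.path(outdir, basename(u)))
    TRUE
  }, error = function(e) {
    warning("Skipping unreachable URL: ", u)
    FALSE
  })
}, logical(1))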

Error in downloadRawSoilData()

I hadn't used downloadRawSoilData() for a few months. Checking on it today, I found that it produces an error when I attempt to download all soil data. While I didn't manage to record the error/warning messages from downloading all data, I did receive the following warning when trying to download DP1.10086 (soil physical properties) for all sites and dates:

Warning message:
In value[[3L]](cond) :
  No data was found for data product DP1.10086.001 at the specified sites and dates.

I used sites="all", startYrMo=NA, and endYrMo=NA.

I plan to investigate this later, but if anyone else has experience with this, please let me know!

can't knit vignette download-neon-data

The .rmd file will not knit. I added my file directory in the knitr::opts_knit$set function, but I think the params.R file is causing the problem, since the file paths are hard-coded. Is it worth updating so that I can knit the .rmd, or should I instead just send text edits to you?
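
For context, this is the kind of line I added at the top of the .Rmd (the path shown is a placeholder for my local directory):

knitr::opts_knit$set(root.dir = "/path/to/neonSoilMicrobeProcessing")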

New server setup data download

I'm just starting to go through the new server setup instructions, and realized that it pertains specifically to downloading data from the Zhu lab space on the UCSB socs-stats server. Do you want us to directly test that (in which case I'll somehow need access to the server) or can the utils.R code be generalized for direct download from NEON via neonUtilities? Thanks!

Create vignette for sensitivity analysis

The sensitivity analysis currently only exists as an R file, and it needs to be converted into an Rmd file with better documentation. The sensitivity analysis must also be changed to an analysis on the 16S data, to get around some of the methodological ambiguities that were brought up in previous Technical WG meetings about the ITS data.

I'll work on this.

Follow R package conventions for directory structure

The directory structure of this repository must be updated to match standard R package conventions before being made available to the public. The directories I suggest keeping are the following:

  • R – to contain new R code, mainly utils.R
  • man – to contain object documentation for the functions in utils.R
  • vignettes – to contain all .Rmd vignettes

Where should params.R go?

Additional resources:

Soil database construction cannot handle missing data

The soil data download functions (downloadRawSoilData() and downloadAllRawSoilData()) throw an error if there is nothing to retrieve from soil chemistry (DP1.10107.001) or soil physical properties (DP1.10108.001). Since the output of the downloading script is passed into the SQLite database in create_soil_database.R, missing columns in the function output will cause downstream issues. An error occurs for these preset parameters, which have no soil chemistry data:

PRESET_SITES = c("OSBS", "CPER", "CLBJ")
PRESET_START_YR_MO = "2018-03"
PRESET_END_YR_MO = "2018-07"
TARGET_GENE = "16S"
SEQUENCING_RUNS = c("C5B2R")
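
A hedged sketch of a guard that could be added before the database write (the object and column names are assumptions, not the actual create_soil_database.R code):

# If no chemistry records were returned, warn and supply a minimal frame so
# the SQLite insert still sees a consistent schema
if (is.null(soilChem) || nrow(soilChem) == 0) {
  warning("No soil chemistry records for the requested sites/dates; ",
          "filling soil chemistry columns with NA")
  soilChem <- data.frame(dnaSampleID = soilPhys$dnaSampleID)
}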

The name of the function downloadAllRawSoilData() can be misleading, since it relies on the preset parameters rather than downloading all of NEON's available soil data. One option could be for downloadAllRawSoilData() to download, by default, all the data for the preset sites, or for all sites (although I'm not sure how long that would take).

Another approach may be to create a full SQLite database and store it 1) remotely, or 2) within the R package, and then the database can be called to match the sequencing files. I'm not sure how much memory this full database would take up at this point in time, but it might be a manageable size. One implementation of the remote approach is within the metScanR package:

"The DB is updated frequently and hosted externally of the metScanR package. Upon loading the metScanR package via library(metScanR), the DB is accessed via internet connection and installed locally to the user's computer via metScanR's updateDatabase() function."

Sequence downloading functions should utilize neonUtilities::zipsByURI

To create a more robust file downloading process, the downloadRawSequenceData function should use neonUtilities::zipsByURI instead of the base R function download.file. Since zipsByURI only accepts metadata from a downloaded file, this may also require changes to the default behavior of downloadSequenceMetadata.
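
A hedged sketch of the intended replacement (per the zipsByURI documentation, filepath must point to a folder of previously downloaded metadata; the paths here are assumptions):

library(neonUtilities)

# Download the raw sequence files referenced by URI in the stacked metadata
zipsByURI(filepath = "filesToStack10108",
          savepath = "raw_sequence",
          check.size = FALSE, unzip = TRUE)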

organizeRawSequenceData should sort ITS/16S based on metadata

The organizeRawSequenceData function, which appears at the end of the "Download NEON Data" vignette, moves downloaded files from a "docking" location to an ITS- or 16S-specific subfolder. Rather than searching for the "ITS" and "16S" character strings in the filenames, it should reference the metadata produced by downloadSequenceMetadata to sort files into their subfolders.
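
A minimal sketch of the metadata-based sort (meta is assumed to be the downloadSequenceMetadata() output; the targetGene and rawDataFileName column names are assumptions based on the NEON mmg tables):

# Move each downloaded file into the subfolder for its target gene
for (i in seq_len(nrow(meta))) {
  dest <- file.path("raw_sequence", meta$targetGene[i], "0_raw")
  if (!dir.exists(dest)) dir.create(dest, recursive = TRUE)
  file.rename(from = file.path("docking", meta$rawDataFileName[i]),
              to   = file.path(dest, meta$rawDataFileName[i]))
}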

Simplify directory construction using system commands

The code to create a directory structure, in the server setup script, could be simplified by wrapping it into a function and using system() calls. mkdir -p will create a directory if it does not exist, but will not overwrite or throw errors if the directory already exists. Below is an example of how I have used this before (not updated for the specific server setup of this pipeline).

# Load "twee" command by jennybc
# twee vignette here: https://gist.github.com/jennybc/2bf1dbe6eb1f261dfe60
temp <- RCurl::getURL("https://gist.githubusercontent.com/jennybc/2bf1dbe6eb1f261dfe60/raw/c53fba8a861f82f90d895458e941a1056c8118f5/twee.R", ssl.verifypeer=F)
eval(parse(text = temp))

# Create data directory
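# Note: the {a,b} brace expansion used below requires a shell that supports
# it (e.g. bash); system() may dispatch to /bin/sh, where it will not expand.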
create_data_directory <- function(path = ".", amplicon = c("ITS", "16S"), ...) {
  if (length(amplicon) == 1) {
    cmd <- paste0("mkdir -p data/{filt_seqs/",amplicon,",raw_seqs/",amplicon,",trimmed_seqs/",amplicon,",seq_tables/",amplicon,",output_files/",amplicon,"}")
  } else {
    cmd <- "mkdir -p data/{filt_seqs/{ITS,16S},raw_seqs/{ITS,16S},trimmed_seqs/{ITS,16S},seq_tables/{ITS,16S},output_files/{ITS,16S}}"
  }
  system(cmd)
  twee("data/", level = 2)
}

The twee() function prints a directory tree to illustrate file structures, for example:

> twee("./data/NEON_DOB/", level=2)
 Illumina
   |__NEON
-- sequence_metadata
   |__filesToStack10108
   |__mmg_soilMetadata_16S_2020-08-06.csv
   |__mmg_soilMetadata_ITS_2020-08-06.csv
-- soil
   |__filesToStack10078
   |__filesToStack10086
-- soilDatabase.rds
-- soilDB.db

Option to exclude legacy data and other dataQF tags

There should be some code to allow the user to exclude legacy data and other dataQF tags from the sequence files they download.

In the Technical Working Group meeting on Oct. 5, we discussed doing this with a simple grep command to subset the metadata object. However, there is a planned revision to download files using zipsByURI, which only accepts metadata from a file location, so after this revision is made, slightly more code must be added in order to write the subsetted metadata to file.

I don't have a strong preference between creating a new function for this, versus just producing a few lines of code that we could put into the "Download NEON Data" vignette.
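
A minimal sketch of the grep-based option (meta is assumed to be the downloadSequenceMetadata() output with a dataQF column; the "legacyData" tag value is an assumption):

# Drop records carrying QF tags the user wants to exclude
meta_clean <- meta[!grepl("legacyData", meta$dataQF), ]

# Once downloading moves to zipsByURI(), the subsetted metadata must first be
# written back to file, since zipsByURI() reads metadata from disk
write.csv(meta_clean, "sequence_metadata_noLegacy.csv", row.names = FALSE)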

DADA2::filterAndTrim removes majority of ITS reads

The filterAndTrim step of the DADA2 processing pipeline removes a majority of ITS reads from most samples in most sequencing runs. For example, after filterAndTrim with the default parameters (maxN = 0, maxEE = c(2, 2), truncQ = 2, minLen = 50), the median percentage of reads remaining in a sample from sequencing run B69PP is only 16%.

@mykophile commented:

Hi Clara - one of the graduate students in my lab, Glade, found that one of the keys for him was setting the TruncQ parameter higher than the default. While this may not seem critical, his reasoning was that with ITS seqs there is a long, low quality tail that can cause reads to get rejected later. If you are more aggressive with trimming early that actually saves more down the line. Here are the settings he found worked best:

out <- filterAndTrim(cutFs, filtFs, cutRs, filtRs, maxN = 0, maxEE = c(2, 2),
                     truncQ = 9, minLen = 50, rm.phix = TRUE, compress = TRUE,
                     multithread = TRUE)
Best,
Kabir

Error tolerance of cutadapt should also be a parameter

Currently, the error tolerance is left at cutadapt's default of -e 0.1. Raising the error tolerance could allow more primer sequences to be identified (and thus removed) by cutadapt.

See https://cutadapt.readthedocs.io/en/stable/guide.html#error-tolerance

To be fair, this might not be the most consequential parameter. Error tolerance of 0.1 corresponds to maximum mismatch of 2 bases. Allowing up to 3 mismatched bases would have removed an additional 138 reverse-primer matches from the reverse reads – approximately 10% of all the matches (with max.mismatch=3) in the pre-cutadapt reverse reads.

runB69PP, first R1 sample
                   Forward Complement Reverse RevComp
FWDPrimer.R1.reads     423          0       0       3
FWDPrimer.R2.reads      26          0       0      13
REVPrimer.R1.reads     647          0       0     298
REVPrimer.R2.reads     737          0       0       4

runB69PP, first R1 sample, setting max.mismatch = 2
                   Forward Complement Reverse RevComp
FWDPrimer.R1.reads     731          0       0       4
FWDPrimer.R2.reads      31          0       0     314
REVPrimer.R1.reads    1518          0       0     807
REVPrimer.R2.reads    1196          0       0      13

runB69PP, first R1 sample, setting max.mismatch = 3
                   Forward Complement Reverse RevComp
FWDPrimer.R1.reads     757          0       0       4
FWDPrimer.R2.reads      31          0       0     343
REVPrimer.R1.reads    1528          0       0     944
REVPrimer.R2.reads    1342          0       0      19

runB69PP, post cutadapt (1 of 2), first R1 sample
                   Forward Complement Reverse RevComp
FWDPrimer.R1.reads       0          0       0       3
FWDPrimer.R2.reads      26          0       0       0
REVPrimer.R1.reads     646          0       0       0
REVPrimer.R2.reads       0          0       0       4

runB69PP, post cutadapt (1 of 2), first R1 sample, setting max.mismatch = 2
                   Forward Complement Reverse RevComp
FWDPrimer.R1.reads       0          0       0       4
FWDPrimer.R2.reads      31          0       0       0
REVPrimer.R1.reads    1509          0       0       0
REVPrimer.R2.reads       0          0       0      13

runB69PP, post cutadapt (1 of 2), first R1 sample, setting max.mismatch = 3
                   Forward Complement Reverse RevComp
FWDPrimer.R1.reads       1          0       0       4
FWDPrimer.R2.reads      31          0       0       5
REVPrimer.R1.reads    1519          0       0     132
REVPrimer.R2.reads     138          0       0      19
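
A hedged sketch of how the tolerance could be exposed as a parameter in the cutadapt call (CUTADAPT_PATH follows the params.R convention; the primer and file variables are placeholders, and -e is cutadapt's documented error-tolerance flag):

CUTADAPT_ERR <- 0.15  # proposed new parameter; the cutadapt default is 0.1

system2(CUTADAPT_PATH,
        args = c("-g", FWD_PRIMER, "-G", REV_PRIMER,
                 "-e", CUTADAPT_ERR,
                 "-o", fnFs.cut[i], "-p", fnRs.cut[i],
                 fnFs[i], fnRs[i]))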

Make selective downloading more robust

Improvements should be made to the sequence data downloading functions in utils.R to make them more robust. Currently they assume a standard file naming structure, which means that they may not be robust to minute changes in naming conventions.

Suggestions from @lstanish:

In my experience, the best way to ensure you are filtering for the correct sequence data is to start by downloading the metadata and doing some table joining and filtering. Here's some example code using neonUtilities that combines the respective sequencing data table (16S or ITS) with the raw data files table. What you end up with is a data.frame containing the rawDataFileName values for just the 16S or ITS data.

library(neonUtilities)
library(plyr)
library(dplyr)

mmgL1 <- loadByProduct('DP1.10108.001', package = 'expanded', check.size = F) # output is a list of each metadata file

# extract list elements into data.frames
seq16S <- mmgL1$mmg_soilMarkerGeneSequencing_16S
seqITS <- mmgL1$mmg_soilMarkerGeneSequencing_ITS
raw <- mmgL1$mmg_soilRawDataFiles

# double check that strings are of class character and not factor/logical

# Join 16S metadata
raw16S <- raw[-grep("ITS", raw$rawDataFileName), ]
joined16S <- left_join(raw16S, seq16S, by=c('dnaSampleID', 'sequencerRunID', 'internalLabID'))
joined16S <- joined16S[!is.na(joined16S$uid.y), ]

# Join ITS metadata
rawITS <- raw[-grep("16S", raw$rawDataFileName), ]
joinedITS <- left_join(rawITS, seqITS, by=c('dnaSampleID', 'sequencerRunID', 'internalLabID'))
joinedITS <- joinedITS[!is.na(joinedITS$uid.y), ]

From here you can subset the data to include only the run or samples of interest.

Fastq files must be in fastq.gz format for sequence processing pipeline

Per a conversation between @zoey-rw and me, the .fastq files that are reorganized into the 0_raw subfolders at the end of the download-neon-data.Rmd vignette need to be turned into .fastq.gz files, or else this error occurs during the process-[16s/its]-sequences.Rmd vignette:

dnaio.exceptions.FastqFormatError: Error in FASTQ file at line 1: Line expected to start with '@', but found '\x1f'

The .fastq files are actually already compressed, so this can be addressed by including a line at the end of the download-neon-data.Rmd vignette that simply appends ".gz" to the end of each filename.
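
A short sketch of that fix (PRESET_OUTDIR_SEQUENCE is an assumed params.R path variable):

# The files are already gzip-compressed; renaming them is enough
fastqs <- list.files(PRESET_OUTDIR_SEQUENCE, pattern = "\\.fastq$",
                     recursive = TRUE, full.names = TRUE)
file.rename(fastqs, paste0(fastqs, ".gz"))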

Help Clara test her GitHub notifications!

For some reason, I've been missing some GitHub notifications. I've just changed some settings that should address it, but I can't test it on my own. If you're reading this, can you please respond with a comment to this issue thread?

Data folders

@claraqin I suggest having two folders: data-raw and data. data-raw stores the data downloaded directly from NEON, etc., using code; it is a symbolic link on the server and won't be uploaded to GitHub. data stores processed data needed to run the models; it should be relatively small and will be version-controlled on GitHub. For example, the rda files can be put in the data folder. That way, collaborators only need the GitHub repo, without needing access to the server.
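
A minimal sketch of setting that up (the folder names follow this suggestion; usethis is optional):

dir.create("data-raw")  # raw NEON downloads; symlinked on the server
dir.create("data")      # small processed objects (e.g. .rda), tracked on GitHub
usethis::use_git_ignore("data-raw/")  # keep raw data out of the repository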

Install Rhtslib error

On the Socs-Stats server, devtools::install_github("claraqin/neonMicrobe") hits a roadblock when installing the Rhtslib package. I followed the instructions at https://bioconductor.org/packages/release/bioc/html/Rhtslib.html, but here are the error messages. R v3.6 gives similar messages.

> BiocManager::install("Rhtslib", type = "source")
Bioconductor version 3.12 (BiocManager 1.30.10), R 4.0.2 (2020-06-22)
Installing package(s) 'Rhtslib'
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1437k  100 1437k    0     0  16.5M      0 --:--:-- --:--:-- --:--:-- 16.5M
Microsoft R Open 4.0.2
The enhanced R distribution from Microsoft
Microsoft packages Copyright (C) 2020 Microsoft Corporation

Using the Intel MKL for parallel mathematical computing (using 44 cores).

Default CRAN mirror snapshot taken on 2020-07-16.
See: https://mran.microsoft.com/.

* installing *source* package ‘Rhtslib’ ...
** using non-staged installation via StagedInstall field
** libs
cd "htslib-1.7" && make -f "/opt/microsoft/ropen/4.0.2/lib64/R/etc/Makeconf" -f "Makefile.Rhtslib"
make[1]: Entering directory '/tmp/Rtmptxl4sR/R.INSTALLecb550d6d643/Rhtslib/src/htslib-1.7'
Makefile.Rhtslib:128: warning: overriding recipe for target '.c.o'
/opt/microsoft/ropen/4.0.2/lib64/R/etc/Makeconf:159: warning: ignoring old recipe for target '.c.o'
gcc -std=gnu99 -DU_STATIC_IMPLEMENTATION -O2 -g  -fpic -I. -DU_STATIC_IMPLEMENTATION -D_FILE_OFFSET_BITS=64 -c -o kfunc.o kfunc.c
gcc -std=gnu99 -DU_STATIC_IMPLEMENTATION -O2 -g  -fpic -I. -DU_STATIC_IMPLEMENTATION -D_FILE_OFFSET_BITS=64 -c -o knetfile.o knetfile.c
gcc -std=gnu99 -DU_STATIC_IMPLEMENTATION -O2 -g  -fpic -I. -DU_STATIC_IMPLEMENTATION -D_FILE_OFFSET_BITS=64 -c -o kstring.o kstring.c
gcc -std=gnu99 -DU_STATIC_IMPLEMENTATION -O2 -g  -fpic -I. -DU_STATIC_IMPLEMENTATION -D_FILE_OFFSET_BITS=64 -c -o bcf_sr_sort.o bcf_sr_sort.c
gcc -std=gnu99 -DU_STATIC_IMPLEMENTATION -O2 -g  -fpic -I. -DU_STATIC_IMPLEMENTATION -D_FILE_OFFSET_BITS=64 -c -o bgzf.o bgzf.c
gcc -std=gnu99 -DU_STATIC_IMPLEMENTATION -O2 -g  -fpic -I. -DU_STATIC_IMPLEMENTATION -D_FILE_OFFSET_BITS=64 -c -o errmod.o errmod.c
gcc -std=gnu99 -DU_STATIC_IMPLEMENTATION -O2 -g  -fpic -I. -DU_STATIC_IMPLEMENTATION -D_FILE_OFFSET_BITS=64 -c -o faidx.o faidx.c
gcc -std=gnu99 -DU_STATIC_IMPLEMENTATION -O2 -g  -fpic -I. -DU_STATIC_IMPLEMENTATION -D_FILE_OFFSET_BITS=64 -c -o hfile.o hfile.c
gcc -std=gnu99 -DU_STATIC_IMPLEMENTATION -O2 -g  -fpic -I. -DU_STATIC_IMPLEMENTATION -D_FILE_OFFSET_BITS=64 -c -o hfile_net.o hfile_net.c
echo '#define HTS_VERSION "1.7"' > version.h
gcc -std=gnu99 -DU_STATIC_IMPLEMENTATION -O2 -g  -fpic -I. -DU_STATIC_IMPLEMENTATION -D_FILE_OFFSET_BITS=64 -c -o hts.o hts.c
gcc -std=gnu99 -DU_STATIC_IMPLEMENTATION -O2 -g  -fpic -I. -DU_STATIC_IMPLEMENTATION -D_FILE_OFFSET_BITS=64 -c -o hts_os.o hts_os.c
gcc -std=gnu99 -DU_STATIC_IMPLEMENTATION -O2 -g  -fpic -I. -DU_STATIC_IMPLEMENTATION -D_FILE_OFFSET_BITS=64 -c -o md5.o md5.c
gcc -std=gnu99 -DU_STATIC_IMPLEMENTATION -O2 -g  -fpic -I. -DU_STATIC_IMPLEMENTATION -D_FILE_OFFSET_BITS=64 -c -o multipart.o multipart.c
gcc -std=gnu99 -DU_STATIC_IMPLEMENTATION -O2 -g  -fpic -I. -DU_STATIC_IMPLEMENTATION -D_FILE_OFFSET_BITS=64 -c -o probaln.o probaln.c
gcc -std=gnu99 -DU_STATIC_IMPLEMENTATION -O2 -g  -fpic -I. -DU_STATIC_IMPLEMENTATION -D_FILE_OFFSET_BITS=64 -c -o realn.o realn.c
gcc -std=gnu99 -DU_STATIC_IMPLEMENTATION -O2 -g  -fpic -I. -DU_STATIC_IMPLEMENTATION -D_FILE_OFFSET_BITS=64 -c -o regidx.o regidx.c
gcc -std=gnu99 -DU_STATIC_IMPLEMENTATION -O2 -g  -fpic -I. -DU_STATIC_IMPLEMENTATION -D_FILE_OFFSET_BITS=64 -c -o sam.o sam.c
gcc -std=gnu99 -DU_STATIC_IMPLEMENTATION -O2 -g  -fpic -I. -DU_STATIC_IMPLEMENTATION -D_FILE_OFFSET_BITS=64 -c -o synced_bcf_reader.o synced_bcf_reader.c
gcc -std=gnu99 -DU_STATIC_IMPLEMENTATION -O2 -g  -fpic -I. -DU_STATIC_IMPLEMENTATION -D_FILE_OFFSET_BITS=64 -c -o vcf_sweep.o vcf_sweep.c
gcc -std=gnu99 -DU_STATIC_IMPLEMENTATION -O2 -g  -fpic -I. -DU_STATIC_IMPLEMENTATION -D_FILE_OFFSET_BITS=64 -c -o tbx.o tbx.c
gcc -std=gnu99 -DU_STATIC_IMPLEMENTATION -O2 -g  -fpic -I. -DU_STATIC_IMPLEMENTATION -D_FILE_OFFSET_BITS=64 -c -o textutils.o textutils.c
gcc -std=gnu99 -DU_STATIC_IMPLEMENTATION -O2 -g  -fpic -I. -DU_STATIC_IMPLEMENTATION -D_FILE_OFFSET_BITS=64 -c -o thread_pool.o thread_pool.c
gcc -std=gnu99 -DU_STATIC_IMPLEMENTATION -O2 -g  -fpic -I. -DU_STATIC_IMPLEMENTATION -D_FILE_OFFSET_BITS=64 -c -o vcf.o vcf.c
gcc -std=gnu99 -DU_STATIC_IMPLEMENTATION -O2 -g  -fpic -I. -DU_STATIC_IMPLEMENTATION -D_FILE_OFFSET_BITS=64 -c -o vcfutils.o vcfutils.c
gcc -std=gnu99 -DU_STATIC_IMPLEMENTATION -O2 -g  -fpic -I. -DU_STATIC_IMPLEMENTATION -D_FILE_OFFSET_BITS=64 -c -o cram/cram_codecs.o cram/cram_codecs.c
gcc -std=gnu99 -DU_STATIC_IMPLEMENTATION -O2 -g  -fpic -I. -DU_STATIC_IMPLEMENTATION -D_FILE_OFFSET_BITS=64 -c -o cram/cram_decode.o cram/cram_decode.c
gcc -std=gnu99 -DU_STATIC_IMPLEMENTATION -O2 -g  -fpic -I. -DU_STATIC_IMPLEMENTATION -D_FILE_OFFSET_BITS=64 -c -o cram/cram_encode.o cram/cram_encode.c
gcc -std=gnu99 -DU_STATIC_IMPLEMENTATION -O2 -g  -fpic -I. -DU_STATIC_IMPLEMENTATION -D_FILE_OFFSET_BITS=64 -c -o cram/cram_external.o cram/cram_external.c
gcc -std=gnu99 -DU_STATIC_IMPLEMENTATION -O2 -g  -fpic -I. -DU_STATIC_IMPLEMENTATION -D_FILE_OFFSET_BITS=64 -c -o cram/cram_index.o cram/cram_index.c
gcc -std=gnu99 -DU_STATIC_IMPLEMENTATION -O2 -g  -fpic -I. -DU_STATIC_IMPLEMENTATION -D_FILE_OFFSET_BITS=64 -c -o cram/cram_io.o cram/cram_io.c
cram/cram_io.c:61:18: fatal error: lzma.h: No such file or directory
compilation terminated.
make[1]: *** [Makefile.Rhtslib:128: cram/cram_io.o] Error 1
make[1]: Leaving directory '/tmp/Rtmptxl4sR/R.INSTALLecb550d6d643/Rhtslib/src/htslib-1.7'
make: *** [Makevars.common:24: htslib] Error 2
ERROR: compilation failed for package ‘Rhtslib’
* removing ‘/raid/users/kai.zhu/R/x86_64-pc-linux-gnu-library/4.0/Rhtslib’

The downloaded source packages are in
	‘/tmp/RtmppVM9FP/downloaded_packages’
Installation path not writeable, unable to update packages: BiocGenerics,
  BiocManager, broom, callr, cli, cpp11, data.table, dbplyr, DelayedArray, desc,
  devtools, diffobj, dplyr, DT, e1071, farver, formatR, GenomeInfoDb, gert, gh,
  haven, highr, IRanges, isoband, knitr, lubridate, lwgeom, magick,
  MatrixGenerics, mgcv, neonUtilities, nngeo, packrat, parallelly, pillar,
  pkgload, processx, ps, RcppArmadillo, RcppParallel, RCurl, remotes, reprex,
  reticulate, rmarkdown, rsconnect, rvest, S4Vectors, sf, spatstat,
  spatstat.data, spatstat.geom, spatstat.sparse, spatstat.utils, surveillance,
  survival, tensorflow, tfruns, tibble, tidyr, tidyverse, tinytex, units, utf8,
  vctrs, viridisLite, waldo, withr, wk, xfun, zoo
Warning message:
In install.packages(...) :
  installation of package ‘Rhtslib’ had non-zero exit status

Create paired-end options for the ITS sequence processing vignette

The current ITS sequence processing vignette only supports the processing of R1 sequences, following recommendations by Pauvert et al. (2019). This can be updated to handle paired-end merging by matching some of the code in the 16S sequence processing vignette.
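
For reference, a sketch of the paired-end merge step as it appears in standard dada2 usage (the dadaFs/dadaRs and derepFs/derepRs objects are assumed upstream outputs):

library(dada2)

# Merge denoised forward and reverse reads, then tabulate ASVs
merged <- mergePairs(dadaFs, derepFs, dadaRs, derepRs, verbose = TRUE)
seqtab <- makeSequenceTable(merged)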

This should be considered a secondary priority to completing vignettes that currently do not exist.

Pipeline testing on Windows

> devtools::session_info()
- Session info ----------------------------------------------------------------------------------------------
 setting  value                       
 version  R version 4.0.3 (2020-10-10)
 os       Windows 10 x64              
 system   x86_64, mingw32             
 ui       RStudio                     
 language (EN)                        
 collate  English_United States.1252  
 ctype    English_United States.1252  
 tz       America/Los_Angeles         
 date     2020-11-04                  

- Packages --------------------------------------------------------------------------------------------------
 package     * version date       lib source        
 assertthat    0.2.1   2019-03-21 [1] CRAN (R 4.0.3)
 backports     1.1.10  2020-09-15 [1] CRAN (R 4.0.3)
 BiocManager   1.30.10 2019-11-16 [1] CRAN (R 4.0.3)
 callr         3.5.1   2020-10-13 [1] CRAN (R 4.0.3)
 cli           2.1.0   2020-10-12 [1] CRAN (R 4.0.3)
 crayon        1.3.4   2017-09-16 [1] CRAN (R 4.0.3)
 desc          1.2.0   2018-05-01 [1] CRAN (R 4.0.3)
 devtools      2.3.2   2020-09-18 [1] CRAN (R 4.0.3)
 digest        0.6.27  2020-10-24 [1] CRAN (R 4.0.3)
 ellipsis      0.3.1   2020-05-15 [1] CRAN (R 4.0.3)
 fansi         0.4.1   2020-01-08 [1] CRAN (R 4.0.3)
 fs            1.5.0   2020-07-31 [1] CRAN (R 4.0.3)
 glue          1.4.2   2020-08-27 [1] CRAN (R 4.0.3)
 knitr         1.30    2020-09-22 [1] CRAN (R 4.0.3)
 magrittr      1.5     2014-11-22 [1] CRAN (R 4.0.3)
 memoise       1.1.0   2017-04-21 [1] CRAN (R 4.0.3)
 pkgbuild      1.1.0   2020-07-13 [1] CRAN (R 4.0.3)
 pkgload       1.1.0   2020-05-29 [1] CRAN (R 4.0.3)
 prettyunits   1.1.1   2020-01-24 [1] CRAN (R 4.0.3)
 processx      3.4.4   2020-09-03 [1] CRAN (R 4.0.3)
 ps            1.4.0   2020-10-07 [1] CRAN (R 4.0.3)
 R6            2.5.0   2020-10-28 [1] CRAN (R 4.0.3)
 remotes       2.2.0   2020-07-21 [1] CRAN (R 4.0.3)
 rlang         0.4.8   2020-10-08 [1] CRAN (R 4.0.3)
 rprojroot     1.3-2   2018-01-03 [1] CRAN (R 4.0.3)
 rstudioapi    0.11    2020-02-07 [1] CRAN (R 4.0.3)
 sessioninfo   1.1.1   2018-11-05 [1] CRAN (R 4.0.3)
 testthat      3.0.0   2020-10-31 [1] CRAN (R 4.0.3)
 usethis       1.6.3   2020-09-17 [1] CRAN (R 4.0.3)
 withr         2.3.0   2020-09-22 [1] CRAN (R 4.0.3)
 xfun          0.19    2020-10-30 [1] CRAN (R 4.0.3)

[1] C:/Users/kai.zhu/Documents/R/win-library/4.0
[2] C:/Program Files/R/R-4.0.3/library

Make remove_unmatched_files() robust to differences in file naming convention

The remove_unmatched_files() function discards from the 16S processing pipeline any R1 fastq files that do not appear to have a matching R2 fastq file, and vice versa. It currently performs this matching using the filenames only. This causes problems when there are slight mismatches in the file naming conventions between R1 and R2 fastq files of the same sequencing run, as for run BFDG8:

> head(fnFs)
[1] "/hb/home/claraqin/projects/NEON_soil_microbe_processing/NEON/raw_sequence/16S/0_raw/runBFDG8_BMI_Plate13WellA12_16S_R1.fastq"
[2] "/hb/home/claraqin/projects/NEON_soil_microbe_processing/NEON/raw_sequence/16S/0_raw/runBFDG8_BMI_Plate13WellA3_16S_R1.fastq" 
[3] "/hb/home/claraqin/projects/NEON_soil_microbe_processing/NEON/raw_sequence/16S/0_raw/runBFDG8_BMI_Plate13WellA4_16S_R1.fastq" 
[4] "/hb/home/claraqin/projects/NEON_soil_microbe_processing/NEON/raw_sequence/16S/0_raw/runBFDG8_BMI_Plate13WellA5_16S_R1.fastq" 
[5] "/hb/home/claraqin/projects/NEON_soil_microbe_processing/NEON/raw_sequence/16S/0_raw/runBFDG8_BMI_Plate13WellA9_16S_R1.fastq" 
[6] "/hb/home/claraqin/projects/NEON_soil_microbe_processing/NEON/raw_sequence/16S/0_raw/runBFDG8_BMI_Plate13WellB1_16S_R1.fastq" 
> length(fnFs)
[1] 142
> head(fnRs)
[1] "/hb/home/claraqin/projects/NEON_soil_microbe_processing/NEON/raw_sequence/16S/0_raw/runBFDG8_BMI_Plate13WellA10_16S_BFDG8_R2.fastq"
[2] "/hb/home/claraqin/projects/NEON_soil_microbe_processing/NEON/raw_sequence/16S/0_raw/runBFDG8_BMI_Plate13WellA11_16S_BFDG8_R2.fastq"
[3] "/hb/home/claraqin/projects/NEON_soil_microbe_processing/NEON/raw_sequence/16S/0_raw/runBFDG8_BMI_Plate13WellA12_16S_BFDG8_R2.fastq"
[4] "/hb/home/claraqin/projects/NEON_soil_microbe_processing/NEON/raw_sequence/16S/0_raw/runBFDG8_BMI_Plate13WellA2_16S_BFDG8_R2.fastq" 
[5] "/hb/home/claraqin/projects/NEON_soil_microbe_processing/NEON/raw_sequence/16S/0_raw/runBFDG8_BMI_Plate13WellA3_16S_BFDG8_R2.fastq" 
[6] "/hb/home/claraqin/projects/NEON_soil_microbe_processing/NEON/raw_sequence/16S/0_raw/runBFDG8_BMI_Plate13WellA4_16S_BFDG8_R2.fastq" 
> length(fnRs)
[1] 188
> matched_fn <- remove_unmatched_files(fnFs, fnRs)
> matched_fn
$R1
character(0)

$R2
character(0)
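
A hedged sketch of pairing on a parsed sample key rather than the raw filename (the regex is an assumption that covers the BFDG8 example above):

# Reduce each filename to a sample key, tolerating extra tokens (e.g. the
# run ID) between the target gene and the R1/R2 tag
key <- function(f) sub("_(16S|ITS).*_R[12]\\.fastq(\\.gz)?$", "", basename(f))

common <- intersect(key(fnFs), key(fnRs))
fnFs_matched <- fnFs[key(fnFs) %in% common]
fnRs_matched <- fnRs[key(fnRs) %in% common]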

Make pipeline robust to changing file naming conventions

A generalization of Issue #23.

There are many places in the processing pipeline that attempt to parse information from the file name, or to store information in the file name. We should try to trim away these pieces, replacing them with a metadata-based identification system, which is now possible thanks to Lee's new version of downloadSequenceMetadata(). Some of these pieces are:

  • remove_unmatched_files()
  • retrieving all sequence run IDs at the beginning of the process_*_sequences.Rmd vignettes
  • (potentially) all custom functions (e.g. qualityFilter16S(), trimPrimersITS()) that differentiate between forward and reverse sequence files by looking for a regex pattern in their filenames

CyVerse test using `analyze-neon-greatplains-16s.R`

In testing analyze-neon-greatplains-16s.R on CyVerse, the following chunk produced an error:

filter_trackReads <- qualityFilter16S(
  fl_nm, in_subdir = "1_trimmed", out_subdir = "2_filtered",
  meta = meta, truncLen = 220, maxEE = 8, multithread = TRUE
)

The error message is:

Using user-provided truncLen: 220
Error in names(answer) <- names1 : 
  'names' attribute [86] must be the same length as the vector [75]
In addition: Warning messages:
1: In mclapply(seq_len(n), do_one, mc.preschedule = mc.preschedule,  :
  scheduled cores 2, 11 did not deliver results, all values of the jobs will be affected
2: In mclapply(seq_len(n), do_one, mc.preschedule = mc.preschedule,  :
  scheduled cores 6, 10, 13, 14, 7, 9, 15 encountered errors in user code, all values of the jobs will be affected
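
A hedged workaround sketch, assuming the multithread argument accepts a worker count (as in dada2's filterAndTrim, which these wrappers resemble):

filter_trackReads <- qualityFilter16S(
  fl_nm, in_subdir = "1_trimmed", out_subdir = "2_filtered",
  meta = meta, truncLen = 220, maxEE = 8,
  multithread = 4  # fewer workers can avoid mclapply core failures
)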

Refer to downloaded sequence data using metadata, not list.files

At the end of the download-neon-data.Rmd vignette, rather than retrieving the names of the downloaded files (fn) using list.files(), simply reference the rawDataFileName column in the metadata, and filter using the download_success vector output from downloadRawSequenceData(). Note that this also requires the download_success vector to be formatted correctly.
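
A minimal sketch of that change (download_success is assumed to be a logical vector parallel to the metadata rows, and the path layout is an assumption):

# Reference the metadata instead of list.files()
fn <- meta$rawDataFileName[download_success]
fn <- file.path(PRESET_OUTDIR_SEQUENCE, "0_raw", fn)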

Better associations between metadata and sequence data

Many of the function revisions currently being developed require the user to specify a metadata file that will be referenced to determine which files to process for a given step, e.g. matchFastqToMetadata(). (This is an alternative to parsing the filenames for this information, as the original functions did.) While this new convention is more robust to differences in file naming convention, it also means that there ought to be an easy way to remember which metadata files are associated with which set(s) of fastq files. Currently, metadata files which are downloaded to the sequence_metadata subdirectory are simply identified by a timestamp.

@zoey-rw suggested having one "complete" copy of the metadata, containing the metadata for all soil microbe marker gene sequence records. This would be compatible with any fastq files that we attempt to match to it, and would remain in a static location (perhaps defined by params.R) so it can be easily accessed.

Project management

I've started testing project management tools under the "Projects" tab. It's possible to have To Do, In Progress, and Done categories. Give it a try. @claraqin
