sneumann / xcms

This is the git repository matching the Bioconductor package xcms: LC/MS and GC/MS Data Analysis

License: Other

R 85.56% C 3.63% C++ 10.53% TeX 0.22% HTML 0.06%

bioconductor metabolomics mass-spectrometry peak-detection feature-detection r

xcms's Introduction

The `xcms` package: pre-processing GC/LC-MS/MS data

Please see the package documentation for more information and examples and news for the latest changes.

Version 4

Version 4 adds native support for the Spectra package to xcms and allows pre-processing to be performed on MsExperiment objects (from the MsExperiment package). The newly supported data containers (Spectra, MsExperiment and XcmsExperiment) allow more flexible analyses and seamless future extensions to additional types of data (such as ion mobility data). Ultimately, these changes will also allow easier integration of xcms with other R packages such as MsFeatures or MetaboAnnotation.

While it is suggested that users switch to the newer data and result objects, all functionality from version 3 and before remains fully supported.

Version 3

Versions >= 3 of the xcms package are updated and partially re-written versions of the original xcms package. The version number 3 was selected to avoid confusion with the xcms2 (http://pubs.acs.org/doi/abs/10.1021/ac800795f) software. While providing all of the original software's functionality, xcms version >= 3 aims at:

Better integration into the Bioconductor framework:

Make use of and extend classes defined in the MSnbase package.
Implement class versioning (Biobase's Versioned class).
Use BiocParallel for parallel processing.

Implementation of validation methods for all classes to ensure data integrity.
Easier and faster access to raw spectra data.
Cleanup of the source code:

Remove obsolete and redundant functionality (getEIC, rawEIC etc).
Unify interfaces, i.e. implement a layer of base functions accessing all analysis methods (which are implemented in C, C++ or R).

Using a more consistent naming scheme of methods that follows established naming conventions (e.g. correspondence instead of grouping).
Update, improve and extend the documentation.
Establishing a layer of base R-functions that interface all analysis methods. These should take M/Z, retention time (or scan index) and intensity values as input along with optional arguments for the downstream functions (implemented in C, C++ or R). The input arguments should be basic R objects (numeric vectors) thus enabling easy integration of analysis methods in other R packages.
The user interface's analysis methods should take the (raw) data object and a parameter class, that is used for dispatching to the corresponding analysis algorithm.
Add unit tests.
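
The two-layer design listed above (a parameter class used for dispatch, on top of base functions that take plain numeric vectors) can be sketched with S4 from base R. This is only an illustration under stated assumptions: `ThresholdParam`, `do_detect_threshold` and `detectPeaksSketch` are hypothetical names, and the "algorithm" is a trivial intensity filter, not one of the xcms methods.

```r
library(methods)

## Hypothetical parameter class; the validity method guards data integrity.
ThresholdParam <- setClass(
    "ThresholdParam",
    representation(threshold = "numeric"),
    prototype(threshold = 0),
    validity = function(object) {
        if (length(object@threshold) != 1L || object@threshold < 0)
            "'threshold' must be a single non-negative number"
        else TRUE
    })

## Base-level function: plain numeric vectors in, plain matrix out.
do_detect_threshold <- function(mz, int, rt, threshold = 0) {
    keep <- int > threshold
    cbind(mz = mz[keep], rt = rt[keep], into = int[keep])
}

## User-level generic: the class of 'param' selects the algorithm.
setGeneric("detectPeaksSketch",
           function(object, param, ...) standardGeneric("detectPeaksSketch"))
setMethod("detectPeaksSketch",
          signature(object = "list", param = "ThresholdParam"),
          function(object, param, ...) {
              do_detect_threshold(object$mz, object$int, object$rt,
                                  threshold = param@threshold)
          })

## Toy "raw data" stored as a plain list of numeric vectors.
raw <- list(mz = c(100.1, 100.2, 150.5),
            int = c(5, 50, 500),
            rt = c(1, 2, 3))
res <- detectPeaksSketch(raw, ThresholdParam(threshold = 10))
res
```

Because `do_detect_threshold` only needs numeric vectors, another package could call it directly without depending on the S4 layer.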

Discussions and suggestions are welcome: https://github.com/sneumann/xcms/issues

Contribution

Contributions to the xcms package are more than welcome, whether in the form of ideas, documentation, code, packages, ... For a contribution guideline please see the guideline for the RforMassSpectrometry initiative. For a seamless integration, contributors are expected to adhere to the RforMassSpectrometry coding style.

Code of Conduct

As contributors and maintainers of the package, we pledge to respect all people who contribute through reporting issues, posting feature requests, updating documentation, submitting pull requests or patches, and other activities. See the RforMassSpectrometry Code of Conduct for more information.

xcms's Issues

fix MZ sort assumption commit got lost

@sneumann by looking through the commits I realized that the commit 9874bea was somehow lost. Specifically, the +1 and -1 is not present in the current code in devel.
I will add them.

In the long run we should work around that problem anyway (see lgatto/MSnbase#135).

calculation of maxo or into by getPeaks() fails for unit-mass LECO GCMS on subset of the matrix returned by findPeaks()

Hi XCMS developers,
I've been using getPeaks() on .cdf files generated from a unit-mass LECO GCMS, where all masses are stored as integers. This means findPeaks always returns identical masses in the mz, mzmin, and mzmax columns. This is all fine, and peaks integrate correctly in findPeaks. However, if I run getPeaks() on a subset of the matrix returned by findPeaks() (or my own custom matrix), where some or all of the masses are not unique, calculation of maxo or into by getPeaks() fails. To give a reproducible example, I've munged the findPeaks() result from the faahKO wt15 cdf file to demonstrate. Code output follows. The amendment I've made seems to prevent these errors - I can only guess that the profile indexing offsets fail when there are identical masses. In cases where integration fails, rt is also set to the rtmin value, suggesting buffer indexing is not extending properly. Could anyone implement my suggested fix in the getPeaks code?
Thanks
Tony

SEGV in plotEIC()

Not sure whether the Issue is with the prior split(xr) or getEIC()
is not robust enough.

Yours,
Steffen

> plotEIC(xrs[[1]],mzrange=c(findMz(3)[[1]],findMz(3)[[2]]),rtrange=c(360,385))

 *** caught segfault ***
address 0x160130b0, cause 'memory not mapped'

Traceback:
 1: .Call("getEIC", object@env$mz, object@env$intensity, object@scanindex,     as.double(mzrange), as.integer(scanrange), as.integer(length(object@scantime)),     PACKAGE = "xcms")
 2: .local(object, ...)
 3: rawEIC(object, mzrange = mzrange, rtrange = rtrange, scanrange = scanrange)
 4: rawEIC(object, mzrange = mzrange, rtrange = rtrange, scanrange = scanrange)
 5: .local(object, ...)
 6: plotEIC(xrs[[1]], mzrange = c(findMz(3)[[1]], findMz(3)[[2]]),     rtrange = c(360, 385))
 7: plotEIC(xrs[[1]], mzrange = c(findMz(3)[[1]], findMz(3)[[2]]),     rtrange = c(360, 385))

findPeaks.centwave appears to run single-threaded on a virtual Windows machine.

This may be an issue rather than a bug.

I am running R 3.3.1 on a 24 core 64 bit Windows virtual machine.
I don't know whether the issue is the Windows build or virtualization, but I don't get multithreaded peak-finding. Indeed, when I was working with someone else with a physical multi-core machine, we didn't see any performance gain (or change in behavior) when we changed nSlaves.

Today I fetched xcms with
source("https://bioconductor.org/biocLite.R")
biocLite("xcms")

No matter what parameters I pass to xcmsSet, and no matter whether I use a GUI or
R --vanilla < threadtest.R
it always seems to pick peaks one file at a time, e.g.:

fewset <- xcmsSet(files = fewfiles, nSlaves = 22, scanrange = c(1184,6046),
        method="centWave", ppm=2.5, peakwidth=c(2.5,9), mzdiff=-0.001,
        noise=1e5, snthresh=10)
Processing on 22 cores.
Detecting mass traces at 2.5 ppm ...
% finished: 0 10 20 30 40 50 60 70 80 90 100
1300 m/z ROI's.
.
Detecting chromatographic peaks ...
% finished: 0 10 20 30 40 50 60 70 80 90 100
315 Peaks.
. . .
which is to say that chromatographic peak detection always reaches 100% before the next mass trace detection ensues.
CPU usage for the thread holds at 4%, i.e., roughly a 22nd of the total processing power.
Neither
nSlaves = 22
nor
sleep = 0
helps (either alone or in combination).
I get the same result with nSlaves = 3.

By contrast, when I run this under Linux on a two-core, two-thread-per-core physical machine, I get three cores engaged and chromatographic peak detection overlaps with mass trace detection, e.g.:
Detecting mass traces at 2.5 ppm ...
% finished: 0 10 20 30 40 50 60 70 80 90 100
1300 m/z ROI's.

Detecting chromatographic peaks ...
% finished: 0
Detecting mass traces at 2.5 ppm ...
% finished: 0 10 20 30 40 50 60 70 80 90 100
4751 m/z ROI's.

Detecting chromatographic peaks ...
% finished: 0 10 20 10 30
Detecting mass traces at 2.5 ppm ...
% finished: 0 10 20 30 40 50 60 70 80 90 40 100
3703 m/z ROI's.

Detecting chromatographic peaks ...
% finished: 0 20 50 60 10 30 70 20 40 80 30 50 60 40 70 50 90 80 90 60 100
315 Peaks.
70 80 90 100
861 Peaks.
100
800 Peaks.

Perhaps there is something else that I should try differently.
Could this possibly be an issue with the Bioconductor build of XCMS for windows?

profBinLin's value in first bin is wrong

The profBinLin function is supposed to calculate bins on M/Z values and select the largest among all intensity values for which the corresponding M/Z value falls within a certain bin. For bins without values it interpolates the value based on the neighboring bins/values.

The function does in fact select for each bin the largest value, except for the first bin, for which it seems to take the smallest one.
Simple Example:
Assuming we have the following numeric vector that we want to bin into 5 bins and select the largest of the values falling within each bin:

X <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

The bin-size is calculated in xcms as (max(X) - min(X)) / (5 - 1) resulting in a bin-size of 2.25. Bins are centered in xcms around the mid-points, starting from the first value (1 in our case). The bins will thus be:
[-0.125, 2.125), [2.125, 4.375), [4.375, 6.625), [6.625, 8.875), [8.875, 11.125)

Binning now X results in:

bin1: 1, 2, max is 2.
bin2: 3, 4, max is 4.
bin3: 5, 6, max is 6.
bin4: 7, 8, max is 8.
bin5: 9, 10, max is 10.
So, no bin is empty, thus no interpolation is required; the results from profBinLin should be identical to profBin (which just takes the max and does not do any interpolation):

>  xcms:::profBin(X, X, 5L)
[1]  2  4  6  8 10
>  xcms:::profBinLin(X, X, 5L)
[1]  1  4  6  8 10
>  xcms:::profBinLinBase(X, X, 5L)  ## OK
[1]  2  4  6  8 10

As we can see, profBin and also profBinLinBase return the correct answer, while profBinLin is wrong.
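
The binning rule described above can be reproduced in a few lines of base R. This is a minimal sketch and not the xcms implementation: `bin_max_sketch` is a hypothetical helper, and it assumes half-open bins `[lower, upper)` centred on the first value, with the bin size computed as stated in the text.

```r
## Hypothetical helper: maximum of y per bin, bins centred on min(x).
bin_max_sketch <- function(x, y, n_bins) {
    bin_size <- (max(x) - min(x)) / (n_bins - 1)
    ## n_bins + 1 breaks, starting half a bin below the smallest value.
    breaks <- min(x) - bin_size / 2 + bin_size * (0:n_bins)
    ## findInterval() uses [lower, upper) intervals by default.
    idx <- findInterval(x, breaks)
    out <- rep(NA_real_, n_bins)
    per_bin <- tapply(y, idx, max)
    out[as.integer(names(per_bin))] <- as.vector(per_bin)
    out
}

X <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
bin_max_sketch(X, X, 5L)
## 2 4 6 8 10
```

Empty bins are returned as `NA` here, so any interpolation would be a separate step.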

In fact it seems profBinLin always selects the smallest, not the largest value for the first bin:

> X <- sort(abs(rnorm(5000, mean = 500, sd = 300)))
> a <- xcms:::profBinLinBase(X, X, 200)
> b <- xcms:::profBinLin(X, X, 200)
> head(a)
[1]  4.236901 11.332273 19.527723 27.314610 34.755302 42.463770
> head(b)
[1]  0.7609717 11.3322728 19.5277233 27.3146104 34.7553024 42.4637702
> 
> min(X)
[1] 0.7609717

Implement feature detection API/analysis functions

Implement the low level API/analysis functions for the feature detection step (a.k.a. findPeaks). I would opt to call them detectFeatures instead of findPeaks to be also more consistent with the literature (e.g. http://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-15-S7-S9).

As outlined in issue #30 these functions should take plain R arguments.

I thought of using a naming scheme do_detectFeatures_<method name> to easily identify these functions, but that could be changed later.

centWave: do_detectFeatures_centWave in commit 1fbabb9
massifquant
matchedFilter: do_detectFeatures_matchedFilter in commit c70e5ba
MSW: do_detectFeatures_MSW in commit 330ae95
MS1
centWave with predicted isotope ROIs (issue #82) in commit 515b814.

xcms fillPeaks.MSW

I have been trying to use the fillPeaks.MSW function in your xcms package, to look at some direct infusion data. The code is as follows below, and I keep getting the same error which is

Xset<-xcmsSet(method="MSW", files=msFiles, scales=c(1,4,9),nearbyPeak=T, verbose.columns = FALSE, winSize.noise=500, SNR.method="data.mean", snthr=5)
XsetG<-group(Xset, method="mzClust", mzppm=5)
XsetGf<-fillPeaks(XsetG, method="MSW")

Error in seq.default((mmzpos - mrange[1]), (mmzpos + mrange[2])) :
'from' must be of length 1

I opened the code for this function in R to troubleshoot this, but I couldn't fix the problem and only managed to change the error. I am not sure what mrange is, but it appears to be an argument that defaults to c(0,0) in the function. mmzpos seems to vary, having either one or two values, and is extracted using
mmzpos <- mzpos[which(lcraw@env$intensity[mzpos] == max(lcraw@env$intensity[mzpos]))].

If you need any further information please feel free to ask
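
The error message is consistent with ties in the intensity maximum: `which(x == max(x))` returns every tied position, and `seq()` rejects a `from` of length two. The toy example below reproduces that with made-up values; using `which.max()` is only one possible workaround and is not confirmed as the fix adopted in xcms.

```r
## Toy data with a tied maximum (values are made up for illustration).
intensity <- c(10, 50, 50, 20)
mzpos <- seq_along(intensity)
mrange <- c(0, 0)

## Pattern quoted in the issue: returns both tied positions.
mmzpos_all <- mzpos[which(intensity[mzpos] == max(intensity[mzpos]))]

## seq() then fails because 'from' has length 2.
seq_result <- tryCatch(
    seq(mmzpos_all - mrange[1], mmzpos_all + mrange[2]),
    error = function(e) conditionMessage(e))

## which.max() always returns a single index (the first maximum).
mmzpos_first <- mzpos[which.max(intensity[mzpos])]
idx <- seq(mmzpos_first - mrange[1], mmzpos_first + mrange[2])
```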

Problem with getEICs in xcms

Alan Smith reported:

We ran into a problem with the newer versions of xcms (1.38, 1.40, and 1.43 have been tested) after trying to upgrade from version 1.26.1. Interestingly, the newer versions of the getEIC function work fine on xcmsSet objects generated under xcms version 1.26.1. I just wanted to let you know that if this problem interests you and you would like real examples of our data I would be happy to provide it to you if the example below is not helpful. Any advice on how to handle this issue will be greatly appreciated, especially if it is a user error on my part.

We are experiencing problems generating EICs using the getEIC function with xcmsSet objects and group IDs when the "rt" argument is set to "cor". Recently we upgraded xcms from version 1.26.1 to version 1.40 and began to experience the issue where all of the EICs generated by getEIC do not make sense when rt="cor". The problem occurs with or without data that has been rt corrected when rt is set to cor. The EICs look as expected when using rt = raw. This problem occurs with every independent data set we have evaluated. I have been able to reproduce the problem with the faahKO package [example code following session info]. I have attached a pdf from the example code below where page 1 and 3 correspond to the same feature. We have rolled back to version 1.26.1, but would like to use the current version of xcms. Any information on the origin of this problem and advice on how to solve this problem would be appreciated so that we can evaluate features with RT corrected data using the current builds of xcms and the getEIC function.

xcms3 blueprint proposal

The goal is to provide high modularity and hence easier maintenance and better integration into other packages.
Two tier implementation:

User interface:
- One method for each metabolomics data analysis step. Dispatching of the method (different analysis algorithms) is performed on a parameter class.
- Each method should also have a BPPARAM argument to enable parallel processing setup.
API/analysis algorithms:
- One function for each analysis algorithm with standard R-objects as input arguments.

@sneumann @lgatto: discussion, notes, ideas etc welcome.

I will add below a reference implementation for the findPeaks.matchedFilter method from xcms to illustrate what it might look like.

Error in file(con, "w") : all connections are in use

Hi,

This happens when snow is invoked by xcms with 128 threads under Windows. The error output differs between R versions. XCMS library Bioconductor version: Release (3.1). 64 threads work fine.
With 65 threads everything starts but never finishes, and with higher numbers the error below appears. Rmpi is not installed and not used in this case.

Starting snow cluster with 128 local sockets.
Error in file(con, "w") : all connections are in use
Timing stopped at: 1.47 0.35 69.5 
>   sessionInfo()
Error in gzfile(file, "rb") : all connections are in use

Solution: none so far. It's neither a Windows threading limitation nor a memory limitation.

To reproduce the error, just ramp the number of CPUs up from 32 to 64 to 128 and so on.

This is currently not relevant for me, but there are a number of systems with 256 or 512 CPUs running Windows, as well as other many-core systems.

Cheers
Tobias

unable to find an inherited method for function 'findPeaks.centWave' for signature '"NULL"'

I am aware that some have been able to process many samples with xcms, e.g., https://github.com/tobigithub/R-parallel/wiki/R-parallel-Examples.

However, whether I use Debian or Windows and whether I use devel or 3.3, I always encounter an issue when I try to use xcmsSet to pick peaks for many samples (e.g., 430) but not few (e.g., 6). The process always aborts with an error:
unable to find an inherited method for function 'findPeaks.centWave' for signature '"NULL"'
For example:

> sessionInfo()
R version 3.3.1 (2016-06-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 8 (jessie)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
[1] BiocParallel_1.7.8   xcms_1.49.5          Biobase_2.33.3      
[4] ProtGenerics_1.5.1   BiocGenerics_0.19.2  mzR_2.7.4           
[7] Rcpp_0.12.7          BiocInstaller_1.23.9

loaded via a namespace (and not attached):
 [1] RANN_2.5               lattice_0.20-33        codetools_0.2-14      
 [4] MASS_7.3-45            MassSpecWavelet_1.39.0 grid_3.3.1            
 [7] plyr_1.8.4             stats4_3.3.1           S4Vectors_0.11.13     
[10] Matrix_1.2-7.1         splines_3.3.1          RColorBrewer_1.1-2    
[13] tools_3.3.1            survival_2.39-5        multtest_2.29.0       

> myset <- xcmsSet(files = myfiles, scanrange = c(1184,6046),
+                 method="centWave", ppm=2.5, peakwidth=c(2.5,9),
+                 mzdiff=-0.001, noise=1e5, snthresh=10
+                 )

 Detecting mass traces at 2.5 ppm ... 
 % finished: 0 10 20 30 40 50 60 70 80 90 100 
 12269 m/z ROI's.

 Detecting chromatographic peaks ... 
 % finished: 0 10 20 30 40 50 60 70 80 90 100 
 4711  Peaks.

 Detecting mass traces at 2.5 ppm ... 
 % finished: 0 10 20 30 40 50 60 70 80 90 100 
 12933 m/z ROI's.

 Detecting chromatographic peaks ... 
 % finished: 0 10 20 30 40 50 60 70 80 90 100 
 7507  Peaks.

 Detecting mass traces at 2.5 ppm ... 
 % finished: 0 10 20 30 40 50 60 70 80 90 100 
 15724 m/z ROI's.

 Detecting chromatographic peaks ... 
 % finished: 0 10 20 30 40 50 60 70 80 90 100 
 5410  Peaks.

 Detecting mass traces at 2.5 ppm ... 
 % finished: 0 10 20 30 40 50 60 70 80 90 100 
 13084 m/z ROI's.

 Detecting chromatographic peaks ... 
 % finished: 0 10 20 30 40 50 60 70 80 90 100 
 5109  Peaks.

 Detecting mass traces at 2.5 ppm ... 
 % finished: 0 10 20 30 40 50 60 70 80 90 100 
 11766 m/z ROI's.

 Detecting chromatographic peaks ... 
 % finished: 0 10 20 30 40 50 60 70 80 90 100 
 3965  Peaks.

 Detecting mass traces at 2.5 ppm ... 
 % finished: 0 10 20 30 40 50 60 70 80 90 100 
 10699 m/z ROI's.


 Detecting chromatographic peaks ... 
 % finished: 0 10 20 30 40 50 60 70 80 90 100 
 3997  Peaks.
Error: BiocParallel errors
  element index: 7, 8, 9, 10, 11, 12, ...
  first error: unable to find an inherited method for function ‘findPeaks.centWave’ for signature ‘"NULL"’
In addition: Warning message:
stop worker failed:
  'clear_cluster' receive data failed:
  reached elapsed time limit 
Execution halted

msdata needs updating

The objects are too old:

library(msdata)
data(xs)
validObject(xs) # Fails
updateObject(xs) # Fails

Tests and validation results for findPeaks.matchedFilter replacement functions

In the course of remodeling and updating the xcms package I aim to replace parts of the code of findPeaks.matchedFilter with more efficient and robust code. In detail, the profBin* methods should be replaced by binYonX (performs binning of values, selecting the maximal value within each bin) and imputeLinInterpol (performs linear interpolation of missing values, as e.g. in profBinLin). This should fix issues #46 and #49.
Also, the iterative buffering (i.e. binning of subsequent chunks of the full data, by default 100 bins in one iteration) performed in findPeaks.matchedFilter should be dropped and the binning should be done on the full data matrix (see issue #47 for details).

I've implemented the function dontrun_test_do_detectFeatures_matchedFilter_impl in test_do_detectFeatures_matchedFilter.R that performs a detailed comparison of different approaches for which I report the results in this issue. The function compares the results of:

"orig": the original findPeaks.matchedFilter method
"A": uses .matchedFilter_orig, the original code from findPeaks.matchedFilter wrapped into a function.
"B": uses .matchedFilter_binYonX_iter, which uses binYonX for binning and imputeLinInterpol for imputation employing the same iterative buffering as in the original function.
"C": uses .matchedFilter_no_iter, same as "A", but without iterative buffering and
"D": uses .matchedFilter_binYonX_no_iter, same as "B", but without iterative buffering

on a set of 4 test files with varying input parameters.

Consider moving MassSpecWavelet from Suggests to Imports

Was there any reason for putting MassSpecWavelet into DESCRIPTION's Suggests instead of Imports?
Actually this also applies to the multtest and RANN packages.
I would find it cleaner to put all the packages that are required for data analysis methods into Imports. That way these packages will also be installed and all analysis methods will work without errors.

Bug in rawMat method for xcmsRaw

The subsetting by rtrange is wrong in rawMat for xcmsRaw objects:

> library(xcms)
> library(RUnit)
> file <- system.file('cdf/KO/ko15.CDF', package = "faahKO")
> xraw <- xcmsRaw(file)
> scnt <- xraw@scantime
> ## Define the scan range that is within the specified retention time range:
> scnrange <- range(which((scnt >= 2500) & (scnt <= 3000)))
> ## Just to ensure: are the scantimes within our range?
> checkTrue(all((scnt[scnrange] >= 2500 & scnt[scnrange] <= 3000)))
[1] TRUE
> ## Now get the values falling into this scan range:
> ## scanindex records the start index for each scan/spectrum -1.
> scanStart <- xraw@scanindex[scnrange[1]] + 1
> ## To get the end index we also append the total number of mz values and
> ## pick the entry for the next scan outside our scnrange (its start
> ## index - 1), which is the last index of the final scan in scnrange
> scanEnd <- c(xraw@scanindex, length(xraw@env$mz))[scnrange[2] + 1]
> ## Now get all mz values that are within the specified rt range
> mzVals <- xraw@env$mz[scanStart:scanEnd]
> ## Check if that's what we get from the function.
> res_scnr <- rawMat(xraw, scanrange = scnrange)
> checkIdentical(res_scnr[, "mz"], mzVals)
[1] TRUE

So far that's fine. The problem comes if we specify the retention time range directly with the rtrange argument:

> ## Is that the same if we used the rtrange instead?
> res_rtr <- rawMat(xraw, rtrange = c(2500, 3000))
> checkIdentical(res_rtr[, "mz"], mzVals)
Error in checkIdentical(res_rtr[, "mz"], mzVals) : FALSE 
> ## Although using the same range, the results differ:
> checkIdentical(res_scnr, res_rtr)
Error in checkIdentical(res_scnr, res_rtr) : FALSE

Expanding peakrange in fillpeaks

Hi. I don't know if this is now the proper place to post suggestions, but here it goes.

As has been noted previously (http://metabolomics-forum.com/viewtopic.php?f=25&t=148), the peak range for the initially picked peaks can be too narrow to allow proper re-integration for all samples.

I suggest that a parameter to expand the peakrange in getPeaks is added to fillPeaks.chrom and passed on to fillPeaksChromPar.
I don't know what the most appropriate way to pass the parameter would be, hence no pull request.

I suggest the multiplier is used like this:
you set M=0.1 if you want to increase the peakrange interval by 10%

the peakrange to use in getPeaks then becomes:

mzmax_new = mzmax + ((mzmax - mzmin) * M) / 2
mzmin_new = mzmin - ((mzmax - mzmin) * M) / 2
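
As a quick check of the multiplier rule, here is a small base R sketch. `expand_peak_range` is a hypothetical helper written for this illustration, not an existing xcms function or argument.

```r
## Hypothetical helper: widen [mzmin, mzmax] symmetrically by a fraction M.
expand_peak_range <- function(mzmin, mzmax, M = 0) {
    delta <- ((mzmax - mzmin) * M) / 2
    c(mzmin = mzmin - delta, mzmax = mzmax + delta)
}

## A range of width 10 grows by 10% to a width of 11.
expanded <- expand_peak_range(100, 110, M = 0.1)
expanded
```

The same helper could be applied to the retention time range, since it only depends on the interval width.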

xcms startup warnings on R-3.4-devel

Art Eschenlauer reported this:

Warning messages:
1: In .recacheSubclasses(def@className, def, doSubclasses, env) :
  undefined subclass "netCdfSource" of class "characterORNULL"; definition not updated
2: In .recacheSubclasses(def@className, def, doSubclasses, env) :
  undefined subclass "rampSource" of class "characterORNULL"; definition not updated
3: In .recacheSubclasses(def@className, def, doSubclasses, env) :
  undefined subclass "netCdfSource" of class "index"; definition not updated
4: In .recacheSubclasses(def@className, def, doSubclasses, env) :
  undefined subclass "netCdfSource" of class "atomicVector"; definition not updated
5: In .recacheSubclasses(def@className, def, doSubclasses, env) :
  undefined subclass "netCdfSource" of class "vectorORfactor"; definition not updated

might eventually have to do with the order in which the R files are sourced; will check.

findPeaks.matchedFilter completely ignores the scanrange argument

Despite being documented as a way to restrict the analysis to a subset of scans, findPeaks.matchedFilter ignores the scanrange argument unless the provided scanrange extends beyond the total number of scans in the file:

library(xcms)
library(RUnit)
library(faahKO)
fs <- system.file('cdf/KO/ko15.CDF', package = "faahKO")
xraw <- xcmsRaw(fs, profstep = 0)

## On the full file:
res_full <- findPeaks.matchedFilter(xraw)
## On a subset:
res_sub <- findPeaks.matchedFilter(xraw, scanrange = c(90, 345))
## Results are identical
checkEquals(res_full, res_sub)
[1] TRUE

## If we use a scanrange that is on one side outside of the allowed range:
res_sub <- findPeaks.matchedFilter(xraw, scanrange = c(90, 50000))
nrow(res_sub)
[1] 438
nrow(res_full)
[1] 470

potential problem with xcms binning/profile matrix generation

While generating a unit test for the profBin binning function I was puzzled that the results didn't fit with what I expected:

I realized that:

xcms defines bins e.g. from x[1] - dx/2 to x[1] + dx/2 with dx being the bin size; I expected that binning was done from x[1] to x[1] + dx, but that's OK.
What really puzzled me was that xcms calculates the bin size as (max(x) - min(x)) / (nBin - 1), with nBin being the number of bins. The -1 could be really problematic, as the bins are thus slightly overlapping.

One reason for this increase in bin size could be to also include max(x) in the binning, considering that by default values >= the lower break and < the upper break are used in each bin.

I will further investigate on this.
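
The two conventions can be compared on a toy vector. The sketch below only computes break points in base R under the assumptions stated in this issue (half-open bins, centred bins with the `nBin - 1` divisor versus bins starting at `x[1]` with divisor `nBin`); it does not call any xcms code.

```r
x <- 0:10
n_bin <- 5

## Convention described for xcms: bins centred on x[1], divisor nBin - 1.
dx_centred <- (max(x) - min(x)) / (n_bin - 1)
breaks_centred <- min(x) - dx_centred / 2 + dx_centred * (0:n_bin)

## Expected convention: bins start at x[1], divisor nBin.
dx_plain <- (max(x) - min(x)) / n_bin
breaks_plain <- min(x) + dx_plain * (0:n_bin)

## With [lower, upper) bins, where does max(x) end up?
bin_centred <- findInterval(max(x), breaks_centred)  # 5: inside the last bin
bin_plain <- findInterval(max(x), breaks_plain)      # 6: past the last bin
```

With the centred breaks the covered range is `[min(x) - dx/2, max(x) + dx/2)`, so the largest value is included without special handling of the final break.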

Documentation: roxygen'ified documentation

Would it make sense to add all forthcoming documentation
in roxygen format ?

xcmsSet crashes session when loading mzXML files

Happens when running the xcms package from the bioconductor site, as well as when running the current github version.

Using mzXML files, I tried to group them into sets following the folder structure recommended in the documentation. I even tried a pair of files instead of the whole batch and got the same result.

Simply calling xcmsSet('my_file') caused the bomb icon to pop up with the message
"R Session Aborted
R encountered a fatal error.
The session was terminated."

A test file that crashes for me is:
http://csb.stanford.edu/nirka/100505_NKalisman_Freidman_S1.mzXML

Add import(Biobase) and import(BiocGenerics) to the NAMESPACE

Dear Steffen,

I suggest to add the lines

import(Biobase)
import(BiocGenerics)

to the NAMESPACE file and remove the lines

setGeneric("phenoData", function(object) standardGeneric("phenoData"))
setGeneric("phenoData<-", function(object, value) standardGeneric("phenoData<-"))

in xcmsSet.R
This avoids the warning messages "replacing previous import 'Biobase::phenoData<-' by 'xcms::phenoData<-" if both xcms and Biobase packages are loaded.

best, jo

xcms3 as a branch or a completely rewritten package?

@lgatto @sneumann
So far discussions about that have been mostly email-based; time to have them in public.
The main question is whether it might be better to a) develop xcms3 as a branch of xcms replacing it in the long run or b) develop xcms3 as a completely new package.

Pros for a):

keep the large user-community of xcms.

Cons for a):

need to keep old functionality, then deprecate etc.

Pros for b):

xcms would be untouched, users can still use it without any deprecation warnings etc. Both packages could live in parallel.

Cons for b):

New users might choose xcms3 over xcms; old users might stick with xcms.

Migrate to netCDF4

On So, 2015-11-29 at 22:38 +0100, Prof Brian Ripley wrote:
xcms depends on CRAN package ncdf. The latter is now deprecated and
scheduled to be removed from CRAN in Jan 2016. It has only been kept
this long because ncdf4 was not available for Windows: it now is (pro
tem on CRAN extras, a default repository for binary installs).

Please convert your package to use either ncdf4 or RNetCDF as soon as
possible.

Usage of MSnbase objects in xcms3

This issue discusses inclusion/usage of objects and functionality from the https://github.com/lgatto/MSnbase package in xcms3. Ideally we should re-use as much as possible to reduce the complexity (i.e. number of methods etc). Methods, functionality and concepts shared by both packages might even go into a third package providing just low level functionality.

Address Jo's comments on the new centWave extensions

Jo has some very good comments on this pull request (#50) that we need to address in the coming days.

Check and fix scanrange sub-setting in findPeaks methods

If scanrange argument is defined, the feature detection should only be performed in the specified scan range. This does not seem to be the case for all findPeaks.* methods!
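
On plain vectors, restricting the data to a scan range before detection could look like the sketch below. It assumes the `scanindex` layout described for xcmsRaw elsewhere on this page (each entry is the start index of a scan minus one) and that the selected scans are not empty; `subset_by_scanrange` is a hypothetical helper, not part of xcms.

```r
## Hypothetical helper: keep only the values of the scans in 'scanrange'.
subset_by_scanrange <- function(mz, intensity, scanindex, scanrange) {
    starts <- scanindex + 1L
    ends <- c(scanindex[-1L], length(mz))
    keep <- seq(starts[scanrange[1]], ends[scanrange[2]])
    ## Number of values per retained scan, used to rebuild scanindex.
    n_per_scan <- (ends - scanindex)[scanrange[1]:scanrange[2]]
    list(mz = mz[keep],
         intensity = intensity[keep],
         scanindex = as.integer(cumsum(c(0L, head(n_per_scan, -1L)))))
}

## Three toy scans with 3, 2 and 4 values.
mz <- c(100, 101, 102, 200, 201, 300, 301, 302, 303)
intensity <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)
scanindex <- c(0L, 3L, 5L)

sub <- subset_by_scanrange(mz, intensity, scanindex, scanrange = c(2, 3))
```

A detection function that takes `mz`, `intensity` and `scanindex` as plain arguments would then see only the requested scans.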

Bump version number and commit to bioc

@sneumann I've now (commit e1d915d) merged branch xcms3 into devel. You can now push it to bioc. Note however that R CMD check has an error:

ERROR in test.writeMSn.mzdata: Error in .local(object, ...) : xcmsSource: file not found:

I didn't fix that one, because I assumed that you're still working on that.

Additional problems in profBinLin

The profBinLin method has some more problems:

Interpolation is not performed at the beginning and end of the data values.
Apart from the wrong first bin value (issue #46), also at the end there can be some problems (see example below).

X <- 1:11
set.seed(123)
Y <- sort(rnorm(11, mean = 20, sd = 10))
## removing some of the values in the middle
Y[5:9] <- NA
nas <- is.na(Y)
## Do interpolation with profBinLin:
resX <- xcms:::profBinLin(X[!nas], Y[!nas], 5, xstart = min(X),
                          xend = max(X))
resX
[1]  7.349388 15.543380 22.745803 29.948227 37.150650

## Do the same with binYonX and imputeLinInterpol.
res <- xcms:::binYonX(X, Y, nBins = 5L, shiftByHalfBinSize = TRUE)
resM <- xcms:::imputeLinInterpol(res$y, method = "lin")
resM
[1] 13.13147 15.54338 22.74580 29.94823 37.15065
## First bin value from profBinLin is wrong (issue #46)

## Visualize
plot(x = X, y = Y, pch = 16)
## Plot the breaks
abline(v = xcms:::breaks_on_nBins(min(X), max(X), 5L, TRUE), col = "grey")
## Result from profBinLin:
points(x = res$x, y = resX, col = "blue", type = "b")
## Results from imputeLinInterpol
points(x = res$x, y = resM, col = "green", type = "b",
       pch = 4, lty = 2)

The plot:
Black points indicate the actual data values and the grey vertical lines the breaks defining the bins. Blue and green colored points and lines represent the results from profBinLin and imputeLinInterpol (after binYonX), respectively.

Now to the issues above:

## Remove values at the end
set.seed(123)
Y <- sort(rnorm(11, mean = 20, sd = 10))
Y[9:11] <- NA
nas <- is.na(Y)
## Do interpolation with profBinLin:
resX <- xcms:::profBinLin(X[!nas], Y[!nas], 5, xstart = min(X),
                          xend = max(X))
resX
[1]  7.349388 15.543380 21.292877  0.000000  0.000000
## No interpolation performed at the end.

## Do the same with binYonX and imputeLinInterpol.
res <- xcms:::binYonX(X, Y, nBins = 5L, shiftByHalfBinSize = TRUE)
resM <- xcms:::imputeLinInterpol(res$y, method = "lin")
resM
[1] 13.13147 15.54338 21.29288 24.60916 12.30458
## Interpolation also performed at the end. In addition, the 4th bin also contains
## a value, which is not reported by profBinLin; see plot below.

## Visualize
plot(x = X, y = Y, pch = 16, ylim = c(0, max(Y, na.rm = TRUE)))
## Plot the breaks
abline(v = xcms:::breaks_on_nBins(min(X), max(X), 5L, TRUE), col = "grey")
## Result from profBinLin:
points(x = res$x, y = resX, col = "blue", type = "b")
## Results from imputeLinInterpol
points(x = res$x, y = resM, col = "green", type = "b",
       pch = 4, lty = 2)

Thus, profBinLin not only fails to interpolate at the ends of the data, it also drops the value for the 4th bin in the example above, which clearly has a value assigned to it.
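The bin-then-impute behavior compared above can be illustrated with a small base R sketch. This is not the xcms implementation: bin_mean and impute_linear are hypothetical helpers, and treating the positions just outside the first and last bin as zero is an assumption chosen only to mimic the tail-off visible in the imputeLinInterpol output.

```r
## Hypothetical helper: mean of y per equal-width bin of x (NA for empty bins).
bin_mean <- function(x, y, n_bins) {
    keep <- !is.na(y)
    x_ok <- x[keep]
    y_ok <- y[keep]
    ## Breaks span the full x range, including x values whose y is NA.
    brks <- seq(min(x), max(x), length.out = n_bins + 1L)
    idx <- findInterval(x_ok, brks, rightmost.closed = TRUE)
    res <- rep(NA_real_, n_bins)
    agg <- tapply(y_ok, idx, mean)
    res[as.integer(names(agg))] <- as.numeric(agg)
    res
}

## Hypothetical helper: fill NA bins by linear interpolation.
## Assumption: the virtual bins at positions 0 and n + 1 have value 0,
## so leading and trailing gaps are interpolated toward zero.
impute_linear <- function(v) {
    n <- length(v)
    padded <- c(0, v, 0)
    pos <- 0:(n + 1L)
    ok <- !is.na(padded)
    approx(x = pos[ok], y = padded[ok], xout = seq_len(n))$y
}

x <- 1:10
y <- c(10, 10, NA, NA, 30, 30, NA, NA, NA, NA)
binned <- bin_mean(x, y, n_bins = 5L)   # bins 2, 4 and 5 are empty
imputed <- impute_linear(binned)        # empty bins are filled in
```

Keeping binning and imputation as two separate steps makes it easy to inspect which bins were empty before any value was imputed.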

findPeaks.massifquant returns different objects based on withWave setting

The findPeaks.massifquant method for xcmsRaw objects returns a matrix by default, but if withWave = 1 it returns an xcmsPeaks object.
Given that all other findPeaks.* methods return an xcmsPeaks object, I think that, to be consistent, we should change this behavior and always return an xcmsPeaks object (which in the end just extends matrix anyway).
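A minimal sketch of the "always return the same class" idea, using only the methods package. PeaksSketch and as_peaks are hypothetical stand-ins (not xcms classes or functions); the point is only that an S4 class extending matrix can wrap a plain matrix while still behaving like one.

```r
library(methods)

## Hypothetical stand-in for a peaks class that extends matrix.
setClass("PeaksSketch", contains = "matrix")

## Wrap a result so callers always get the same class back,
## regardless of which code path produced it.
as_peaks <- function(x) {
    if (is(x, "PeaksSketch")) x else new("PeaksSketch", x)
}

m <- matrix(c(100.1, 200.2, 30, 45), ncol = 2,
            dimnames = list(NULL, c("mz", "rt")))
p <- as_peaks(m)
is(p, "matrix")   # the wrapped object can still be used as a matrix
nrow(p)
```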

Clean up package

Before starting to implement new stuff I'd like to clean up the package a bit:

Move methods into methods-.R files. for xcmsSet: 6cbe435 for xcmsRaw: 09d92f2 xcmsEIC xcmsFragments 623c6fd
Move functions into functions-.R files. for xcmsSet: 6cbe435 for xcmsRaw: 09d92f2 xcmsEIC xcmsFragments 623c6fd
Fix R CMD check notes.

Error processing xcmsDirect.Rnw vignette

Processing of xcmsDirect.Rnw vignette fails now:

Error: processing vignette 'xcmsDirect.Rnw' failed with diagnostics:
 chunk 3 (label = ProcessData) 
Error in xcmsSet(method = "MSW", files = mzdatafiles, scales = c(1, 4,  : 
  Feature detection failed for all files! The first error was: Error in names(ridgeList) <- paste(1, names(ridgeList), sep = "_"): 'names' attribute [1] must be the same length as the vector [0]

sessionInfo:

> sessionInfo()
R version 3.3.1 (2016-06-21)
Platform: x86_64-apple-darwin16.0.0/x86_64 (64-bit)
Running under: OS X 10.12 (Sierra)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
[1] msdata_0.12.1          MassSpecWavelet_1.39.0 waveslim_1.7.5        
[4] xcms_1.49.6            Biobase_2.33.3         ProtGenerics_1.5.1    
[7] BiocGenerics_0.19.2    mzR_2.7.4              Rcpp_0.12.7           

loaded via a namespace (and not attached):
 [1] RANN_2.5           lattice_0.20-34    codetools_0.2-14   MASS_7.3-45       
 [5] grid_3.3.1         plyr_1.8.4         stats4_3.3.1       S4Vectors_0.11.16 
 [9] Matrix_1.2-7.1     splines_3.3.1      BiocParallel_1.7.8 RColorBrewer_1.1-2
[13] tools_3.3.1        compiler_3.3.1     survival_2.39-5    multtest_2.29.0

xcmsRaw reads correct polarity of scans from Orbitrap data?

Hi all,

I am reading in a centroided mzXML file from an Orbitrap instrument using

xfile_raw = xcmsRaw(mzXMLfiles[1])

where mzXMLfiles[1] is this file: https://github.com/vanmooylipidomics/LOBSTAHS/blob/master/Pt_H2O2_mzXML_ms1_pos/0_uM_H2O2/0uM_24h_Orbi_0468.mzXML

This is a file containing only positive-mode scans, created using MSConvert. I can verify that it contains centroided, (only) positive mode scans by opening the file in a text reader. In all scans, the value for polarity looks like the first scan, here:

However, when I run:

xfile_raw@polarity

I see:

  [1] negative negative negative negative negative negative negative negative negative negative negative negative negative
 [14] negative negative negative negative negative negative negative negative negative negative negative negative negative
 [27] negative negative negative negative negative negative negative negative negative negative negative negative negative
 [40] negative negative negative negative negative negative negative negative negative negative negative negative negative
 [53] negative negative negative negative negative negative negative negative negative negative negative negative negative
 [66] negative negative negative negative negative negative negative negative negative negative negative negative negative
 [79] negative negative negative negative negative negative negative negative negative negative negative negative negative
 [92] negative negative negative negative negative negative negative negative negative negative negative negative negative
[105] negative negative negative negative negative negative negative negative negative negative negative negative negative
[118] negative negative negative negative negative negative negative negative negative negative negative negative negative
[131] negative negative negative negative negative negative negative negative negative negative negative negative negative
[144] negative negative negative negative negative negative negative negative negative negative negative negative negative
[157] negative negative negative negative negative negative negative negative negative negative negative negative negative
[170] negative negative negative negative negative negative negative negative negative negative negative negative negative
[183] negative negative negative negative negative negative negative negative negative negative negative negative negative
[196] negative negative negative negative negative negative negative negative negative negative negative negative negative
[209] negative negative negative negative negative negative negative negative negative negative negative negative negative
[222] negative negative negative negative negative negative negative negative negative negative negative negative negative
[235] negative negative negative negative negative negative negative negative negative negative negative negative negative
[248] negative negative negative negative negative negative negative negative negative negative negative negative negative
[261] negative negative negative negative negative negative negative negative negative negative negative negative negative
[274] negative negative negative negative negative negative negative negative negative negative negative negative negative
[287] negative negative negative negative negative negative negative negative negative negative negative negative negative
[300] negative negative negative negative negative negative negative negative negative negative negative negative negative
[313] negative negative negative negative negative negative negative negative negative negative negative negative negative
[326] negative negative negative negative negative negative negative negative negative negative negative negative negative
[339] negative negative negative negative negative negative negative negative negative negative negative negative negative
[352] negative negative negative negative negative negative negative negative negative negative negative negative negative
[365] negative negative negative negative negative negative negative negative negative negative negative negative negative
[378] negative negative negative negative negative negative negative negative negative negative negative negative negative
[391] negative negative negative negative negative negative negative negative negative negative negative negative negative
[404] negative negative negative negative negative negative negative negative negative negative negative negative negative
[417] negative negative negative negative negative negative negative negative negative negative negative negative negative
[430] negative negative negative negative negative negative negative negative negative negative negative negative negative
[443] negative negative negative negative negative negative negative negative negative negative negative negative negative
[456] negative negative negative negative negative negative negative negative negative negative negative negative negative
[469] negative negative negative negative negative negative negative negative negative negative negative negative negative
[482] negative negative negative negative negative negative negative negative negative negative negative negative negative
[495] negative negative negative negative negative negative negative negative negative negative negative negative negative
[508] negative negative negative negative negative negative negative negative negative negative negative negative negative
[521] negative negative negative negative negative negative negative negative negative negative negative negative negative
[534] negative negative negative negative negative negative negative negative negative negative negative negative negative
[547] negative negative negative negative negative negative negative negative negative negative negative negative negative
[560] negative negative negative negative negative negative negative negative negative negative negative negative negative
[573] negative negative negative negative negative negative negative negative negative negative negative negative negative
[586] negative negative negative negative negative negative negative negative negative negative negative negative negative
[599] negative negative negative
Levels: negative positive unknown

I am concerned that whatever function is reading in the file is looking for 0, 1, or -1 to determine polarity, whereas either Thermo or MSConvert is using + and - in the files that I am working with.

Or, is there a solution I am missing?

Thanks in advance, really appreciate how responsive the xcms team is.

Jamie Collins
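The concern raised above, that a reader expecting numeric polarity codes may misread "+" and "-", can be sketched with a tolerant normalization step. normalize_polarity is a hypothetical helper, and the mapping 1 = positive, 0 = negative, anything else = unknown is an assumption for illustration, not a confirmed description of what xcms or mzR do internally.

```r
## Hypothetical helper: map several polarity encodings onto one factor.
## Assumption: "+" / 1 mean positive, "-" / 0 mean negative,
## every other value (including -1) is treated as unknown.
normalize_polarity <- function(x) {
    x <- as.character(x)
    out <- rep("unknown", length(x))
    out[x %in% c("+", "1", "positive")] <- "positive"
    out[x %in% c("-", "0", "negative")] <- "negative"
    factor(out, levels = c("negative", "positive", "unknown"))
}

pol_symbols <- normalize_polarity(c("+", "+", "-"))
pol_codes <- normalize_polarity(c(1, 0, -1))
table(pol_symbols)
```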

Problem with scanrange sub-setting in findPeaks.centWave

findPeaks.centWave does support the scanrange argument and performs the feature detection only on the specified scans, but the results are not as expected (note: the [ method is only available in the current xcms3 branch):

library(xcms)
library(faahKO)
fs <- system.file('cdf/KO/ko15.CDF', package = "faahKO")
xraw <- xcmsRaw(fs, profstep = 0)

## Perform feature detection on a subset of scans:
res_1 <- findPeaks.centWave(xraw, scanrange = c(90, 345))

## Perform the feature detection on an xcmsRaw object sub-setted to these scans
xsub <- xraw[90:345]
res_2 <- findPeaks.centWave(xsub)

## Results are DIFFERENT
nrow(res_1)
[1] 383
nrow(res_2)
[1] 374

In the findPeaks.centWave code the scanrange parameter is considered in the findmzROI function, but the @scantime slot from the full object is used throughout the method (i.e. not a scantime that was sub-setted to correspond only to the scans in scanrange).
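The index mismatch described above can be shown with plain vectors. The scan times below are made up for illustration; the sketch only demonstrates why a scan index that is local to the sub-set must be looked up in the sub-setted scantime vector and not in the full one.

```r
## Made-up scan times for a file with 1278 scans.
scantime <- seq(2501, by = 1.565, length.out = 1278)
scanrange <- c(90, 345)

## Scan times restricted to the requested scan range.
idx <- seq(scanrange[1], scanrange[2])
scantime_sub <- scantime[idx]

## A peak apex reported at local scan index 10 within the sub-set:
local_scan <- 10L
rt_correct <- scantime_sub[local_scan]  # time of scan 99 in the full file
rt_wrong <- scantime[local_scan]        # time of scan 10 in the full file
rt_correct - rt_wrong
```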

getEIC gives wrong result for zero intensity scans

XCMS version: 1.49.2

I was writing a function to make EICs for many ranges (I wanted to use the raw data and rawEIC supports only one, but my function turned out rather slow).
When I went to getEIC (thinking I could disable the profile matrix with step = 0), which is faster, and compared the results, I noticed that they were not the same as my manually calculated results.

So I investigated and it looks like getEIC does something wrong sometimes with scans where there are no mz peaks in range.

I shall attempt to explain below. My test file: xraw.zip (an RDS so use readRDS)

I tried both getEICOld and getEICNew but they give the same result.

Range:

a = 101.07781
b = 101.08387

Old:

opts <- options()$BioC
opts$xcms$getEIC.method <- "getEICOld"
options(BioC=opts)

EIC_xcms_old <- getEIC( xraw,
                        cbind(mzmin = a, mzmax = b),    
                        cbind(rep(0,length(a)), mzmax = rep(Inf,length(b))  , 
                              step=0)    
                        )

New:

opts <- options()$BioC
opts$xcms$getEIC.method <- "getEICNew"
options(BioC=opts)

EIC_xcms_new <- getEIC( xraw,
                        cbind(mzmin = a, mzmax = b),    
                        cbind(rep(0,length(a)), mzmax = rep(Inf,length(b))  , 
                              step=0)    
)

My own function:

EIC_manual <- get_EICs(xraw, as.data.frame(cbind(mz_lower = a, mz_upper = b)))

Now let's look at the output.
EIC_manual (empty scans are dropped, which was a bug but instructive here):

[[1]]
# A tibble: 62 x 3
    scan    scan_rt intensity
   <int>      <dbl>     <dbl>
1      1 0.04880000  1759.401
2      2 0.05708333  1391.995
3      3 0.06536667  1404.903
4      4 0.07365000  1596.921
5      6 0.09021667  1986.700
6      7 0.09850000  1135.041
7      9 0.11506667  1445.067
8     10 0.12335000  1491.382
9     11 0.13163333  1694.377
10    12 0.13991667  1590.738
# ... with 52 more rows

So scans 5 and 8 are empty.

head(EIC_xcms_old@eic$xcmsRaw[[1]], 10):

         rt intensity
 [1,] 2.928  1759.401
 [2,] 3.425  1391.995
 [3,] 3.922  1404.903
 [4,] 4.419  1596.921
 [5,] 4.916  1982.329
 [6,] 5.413  1986.700
 [7,] 5.910  1135.041
 [8,] 6.407  1577.644
 [9,] 6.904  1445.067
[10,] 7.401  1491.382

head(EIC_xcms_new@eic$xcmsRaw[[1]], 10):

         rt intensity
 [1,] 2.928  1759.401
 [2,] 3.425  1391.995
 [3,] 3.922  1404.903
 [4,] 4.419  1596.921
 [5,] 4.916  1982.329
 [6,] 5.413  1986.700
 [7,] 5.910  1135.041
 [8,] 6.407  1577.644
 [9,] 6.904  1445.067
[10,] 7.401  1491.382

Note that getEIC also reports an intensity for scans 5 and 8, but otherwise the results agree on the intensity.
I checked who is right by looking at the scan directly, which seems to confirm that my function is right and getEIC is somehow wrong:

scandata <- getScan(xraw,8)

between <- scandata[,1] > a & scandata[,1] < b

sum(between) # --> 0
sum(scandata[between,"intensity"]) # --> 0

Any idea why this is happening? Am I misunderstanding something?

I thought this might have to do with the profile matrix but it doesn't appear so since I disabled that with step=0 both in the xcmsRaw and in getEIC.

rawEIC gives the correct result too.
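The expectation in this report, that a scan without any m/z value in range should contribute an intensity of 0, can be written as a small base R sketch. raw_eic is a hypothetical helper operating on a list of two-column matrices (mz, intensity); it is neither the reporter's get_EICs function nor the xcms implementation, and the toy scans are invented.

```r
## Hypothetical helper: summed intensity per scan within an m/z window.
## `scans` is a list of matrices with columns "mz" and "intensity".
raw_eic <- function(scans, mzmin, mzmax) {
    vapply(scans, function(s) {
        keep <- s[, "mz"] >= mzmin & s[, "mz"] <= mzmax
        ## sum() of an empty selection is 0, so empty scans yield 0.
        sum(s[keep, "intensity"])
    }, numeric(1))
}

scans <- list(
    cbind(mz = c(100.5, 101.08), intensity = c(5, 10)),
    cbind(mz = c(100.5, 102), intensity = c(7, 3)),        # nothing in range
    cbind(mz = c(101.079, 101.083), intensity = c(1, 2))
)
eic <- raw_eic(scans, mzmin = 101.07781, mzmax = 101.08387)
eic
```

Because every scan produces exactly one value, the result stays aligned with the scan index, which avoids both dropped scans and intensities carried over from neighboring scans.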

Import isCentroided from ProtGenerics

Import isCentroided from ProtGenerics (1.5.1) and check whether it would make sense to use the isCentroided method implemented in MSnbase (see issue lgatto/MSnbase#131) also on xcmsSet objects (same results? performance?).

Undocumented S4 method: generic '[' and siglist 'xcmsSet,ANY,ANY,ANY'

Happened in 1.49.3 http://bioconductor.org/checkResults/devel/bioc-LATEST/xcms/zin1-checksrc.html

* checking for missing documentation entries ... WARNING
Undocumented S4 methods:
  generic '[' and siglist 'xcmsSet,ANY,ANY,ANY'

Warnings: Undocumented code objects

Hi @Treutler, can you please address these:

* checking for missing documentation entries ... WARNING
Undocumented code objects:
  ‘findPeaks.centWaveWithPredictedIsotopeROIs’
Undocumented S4 methods:
  generic 'findPeaks.centWaveWithPredictedIsotopeROIs' and siglist
    'xcmsRaw'

as they lead to Warnings in https://bioconductor.org/checkResults/3.4/bioc-LATEST/xcms/zin1-checksrc.html
Thanks, Steffen

Support for SRM chromatograms

I am not sure how this could be handled in an xcmsRaw object, but it would be nice to be able to do something with SRM data. Right now there is the ms2 slot for precursor ions, but here we could have several transitions, so I don't know how that should be handled.

Probably support is lacking all the way from mzR. I get this error if I try to read an SRM file:

Error in validObject(.Object) : 
  invalid class “rampSource” object: Could not open mzML/mzXML/mzData file

--srmAsSpectra in msconvert doesn't do anything.

Any thoughts/hints?

Iterative generation of profile matrix subsets in findPeaks.matchedFilter required?

I was about to simplify the code in the findPeaks.matchedFilter method and I am puzzled by the iterative creation of the binned profile matrix from only sub-sets of the data. More specifically, the method generates a profile matrix for a sub-set of 100 M/Z bins and identifies peaks in this sub-set; the buffer is then updated and the next 100 M/Z bins are analyzed. This is repeated until the full M/Z range in the data is covered.

To me this looks both conceptually and computationally problematic:

Conceptually: if a profile matrix generation method such as profBinLin or profBinLinBase is used, which after binning the data also performs a linear interpolation to impute values for empty bins, the result can obviously be affected by the data on which it is calculated. Imputation might miss neighboring non-empty bins that lie just outside of the currently analyzed sub-set.
Computationally: the repeated profile matrix generation comes at the cost of lower performance. Generating the profile matrix once and identifying peaks in that would be faster.

@sneumann @hpbenton @stanstrup : do you know of any particular reason why this was implemented like that (apart from being less memory demanding)?
Also, what do you think of that? Because I would really like to replace this iterative approach by something simpler (which will also be less error prone).
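The conceptual concern above can be reproduced with a toy vector of bin values. This sketch does not use the xcms buffer logic; interpolate_interior is a hypothetical helper and the chunk size of 3 is arbitrary. It only shows that interpolating each chunk on its own can leave gaps that a single pass over all bins would fill.

```r
## Hypothetical helper: linear interpolation of NA values that lie
## between two known values; leading or trailing NAs are left as NA.
interpolate_interior <- function(v) {
    ok <- !is.na(v)
    if (sum(ok) < 2L) return(v)   # nothing to interpolate between
    approx(x = which(ok), y = v[ok], xout = seq_along(v))$y
}

bins <- c(10, NA, NA, 40, NA, 60)

## One pass over the full vector of bins.
full <- interpolate_interior(bins)

## Same data processed in independent chunks of 3 bins.
chunks <- split(bins, ceiling(seq_along(bins) / 3))
chunked <- unlist(lapply(chunks, interpolate_interior), use.names = FALSE)

rbind(full = full, chunked = chunked)
```

In the chunked run, bins 2 and 3 stay empty because their right-hand neighbor (bin 4) belongs to the next chunk.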

xcms build failure - socket clusters

Reported by Dan Tenenbaum:

Date: Fri, Oct 10, 2014 at 11:43 AM
Subject: xcms build failure - socket clusters
...
The reason xcms is failing on windows:
http://www.bioconductor.org/checkResults/devel/bioc-LATEST/xcms/moscato1-checksrc.html
...is that you have a makeCluster() call without a corresponding stopCluster() call. Windows is sensitive about this and produces the error you see at that link.
I would fix this myself but I'm not sure where the stopCluster() call should go.
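One common way to pair every makeCluster() with a stopCluster() is to register the teardown with on.exit() right after the cluster is created. The sketch below uses the parallel package that ships with R; run_on_cluster is a hypothetical wrapper, not a function from xcms, and where exactly such a call belongs in the xcms code is not addressed here.

```r
library(parallel)

## Hypothetical wrapper: the cluster is always stopped when the function
## exits, even if the computation throws an error.
run_on_cluster <- function(x, fun, workers = 2L) {
    cl <- makeCluster(workers)
    on.exit(stopCluster(cl), add = TRUE)
    parLapply(cl, x, fun)
}

res <- run_on_cluster(1:4, function(i) i^2)
unlist(res)
```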

findPeaks.massifquant with scanrange and withWave = 1 fails

The scanrange argument is ignored in the original code from findPeaks.massifquant, but is passed to a call to findPeaks.centWave that is used to further filter the peaks identified in the first step if argument withWave = 1. This however can result in an error like the one below:

 Detecting  mass traces at 20 ppm ... 

 Detecting chromatographic peaks ... 
 % finished: 0 Error in continuousPtsAboveThreshold(fd, threshold = noise, num = minPtsAboveBaseLine) : 
  NA/NaN/Inf in foreign function call (arg 1)

The reason for this is that the peaks that findPeaks.centWave should refine are outside of the data/scan range that is considered by the method.
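One possible guard, sketched here with a made-up peak table, is to keep only those peaks whose scan boundaries fall inside the requested scan range before handing them to a refinement step. The column names scmin and scmax are assumptions for this illustration, and this is not the fix that was applied in xcms.

```r
## Made-up peak table; scmin/scmax are assumed to hold the first and
## last scan index of each peak.
peaks <- cbind(mz = c(200.1, 310.2, 415.3),
               scmin = c(50, 120, 340),
               scmax = c(60, 140, 360))
scanrange <- c(90, 345)

## Keep peaks that lie completely within the scan range.
inside <- peaks[, "scmin"] >= scanrange[1] &
    peaks[, "scmax"] <= scanrange[2]
peaks_in_range <- peaks[inside, , drop = FALSE]
peaks_in_range
```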

New vignette benchmarking.Rmd requires new dependency

* checking re-building of vignette outputs ... NOTE
Error in re-building vignettes:
  ...

    IQR, mad, xtabs

Quitting from lines 48-52 (benchmarking.Rmd) 
Error: processing vignette 'benchmarking.Rmd' failed with diagnostics:
there is no package called 'microbenchmark'
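Besides declaring the package as a dependency, code that relies on an optional package can guard its use with requireNamespace(). The sketch below is an assumption about how such a guard could look, not what the benchmarking vignette does; time_expr is a hypothetical helper that falls back to system.time() when microbenchmark is not installed.

```r
## Hypothetical helper: median run time of fun(), using microbenchmark
## if it is installed and base R timing otherwise. The unit of the
## returned number differs between the two branches.
time_expr <- function(fun, times = 5L) {
    if (requireNamespace("microbenchmark", quietly = TRUE)) {
        mb <- microbenchmark::microbenchmark(fun(), times = times)
        summary(mb)$median
    } else {
        elapsed <- replicate(times, system.time(fun())[["elapsed"]])
        median(elapsed)
    }
}

med <- time_expr(function() sum(sqrt(1:1000)))
```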

Rare "non-numeric matrix extent" error in obiwarp

Yashwant Kumar <y.kumar at ncl.res.in> reported:

During the obiwarp step, I am getting the following error for a few samples, while others are running well:

Processing: DVC4_1_1T_POS  Found gaps: cut scantime-vector at  0.558566 seconds 
changing 0 num_internal_anchors to -1 DVC4_1_2T_POS  Found gaps: cut scantime-vector at  0.828551 seconds 
Error in matrix(0, seqlen, dim(obj2@env$profile)[2]) : 
  non-numeric matrix extent

This requires more information and a reproducible example.

Update parallel processing to BiocParallel

Hi,

xcms currently uses several packages (snow, Rmpi) for parallel processing,
and handles cluster setup and teardown manually.

Enhancement: migrate both findPeaks() and fillPeaks() handling to BiocParallel
https://bioconductor.org/packages/release/bioc/html/BiocParallel.html

Yours,
Steffen

Suggestion: let sampclass use selected column from the phenoData

Dear all,
I am not completely happy with the behavior of the xcmsSet sampclass method. I think it works nicely when a single-column phenoData is provided or if the class labels are estimated from directories, but it's not really intuitive given the way I'm used to working with phenoData objects (from microarray or RNA-seq experiments). Basically, my phenoData always has multiple columns, each describing one parameter of a file. So I do have the sample grouping in the phenoData data.frame (usually a column named "class" or "group"), but the sampclass method does not honor that; it just returns one class per sample.

My suggestion now would be to check in the sampclass method whether the phenoData contains a column "class" and, if present, return its content; otherwise do as it does now (i.e. interaction(object@phenoData)). This would also fit with the sampclass<- method, which actually replaces the content of an (existing or new) "class" column in the phenoData.

Do you think that might be OK? If so, I can change the method and update the documentation.

cheers, jo
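The proposed lookup rule can be sketched on a plain data.frame. samp_class is a hypothetical function that takes the phenoData data.frame directly (not an xcmsSet object): it returns the "class" column if one exists and otherwise falls back to the interaction of all columns, as described above.

```r
## Hypothetical helper mirroring the suggested behavior.
samp_class <- function(pd) {
    if ("class" %in% colnames(pd)) {
        factor(pd$class)
    } else {
        ## Fallback: one level per observed combination of all columns.
        interaction(pd, drop = TRUE)
    }
}

pd_with_class <- data.frame(class = c("KO", "KO", "WT"),
                            batch = c("a", "b", "a"))
pd_without <- data.frame(genotype = c("KO", "KO", "WT"),
                         batch = c("a", "b", "a"))

samp_class(pd_with_class)   # two groups: KO and WT
samp_class(pd_without)      # one group per genotype/batch combination
```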

Reduce the number of methods

Some of the methods should be changed to functions, such as

specDist... methods.
group... methods.
findPeaks... methods.

This also facilitates documentation.
So in the end there will be only one method, e.g. specDist, that calls different functions based on the specified method.
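The "one entry point, several plain functions" layout could look roughly like the sketch below. All names (spec_dist, spec_dist_cosine, spec_dist_euclidean) and the two distance measures are hypothetical and chosen for brevity; they are not the algorithms behind the specDist methods in xcms.

```r
## Plain functions, each implementing one algorithm.
spec_dist_cosine <- function(a, b) {
    1 - sum(a * b) / sqrt(sum(a^2) * sum(b^2))
}
spec_dist_euclidean <- function(a, b) {
    sqrt(sum((a - b)^2))
}

## Single entry point that selects the function by name.
spec_dist <- function(a, b, method = c("cosine", "euclidean")) {
    method <- match.arg(method)
    switch(method,
           cosine = spec_dist_cosine(a, b),
           euclidean = spec_dist_euclidean(a, b))
}

d_cos <- spec_dist(c(1, 0), c(1, 0))
d_euc <- spec_dist(c(0, 0), c(3, 4), method = "euclidean")
```

With this layout each algorithm can be documented and unit-tested as an ordinary function, and match.arg() rejects unknown method names early.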

Potential bug in the function getXcmsRaw

Hi. Recently, I was working with XCMS taken from Bioconductor-mirror/xcms (branch release-3.3) and noticed a bug. When I align two samples and plot the aligned peaks with the function getEIC, the returned EICs are the same for both rt="corrected" and rt="raw" parameters. Then I read the source code and, I believe, the problem is in the function getXcmsRaw:

for(i in 1:length(ret)){
    ## I'm skipping some lines here
    if(rt == "corrected"){
        ## check if there is any need to apply correction...
        if(all(object@rt$corrected[[i]] == object@rt$raw[[i]])){
            message("No need to perform retention time correction,",
                    " raw and corrected rt are identical for ", fn[i])
            ret[[i]]@scantime <- object@rt$raw[[sampidx[i]]]
        }else{
            message(paste0("Applying retention time correction to ", fn[i]))
            ret[[i]]@scantime <- object@rt$corrected[[sampidx[i]]]
        }
    }
}

I think that the line
if(all(object@rt$corrected[[i]] == object@rt$raw[[i]])){
should be replaced by
if(all(object@rt$corrected[[sampidx[i]]] == object@rt$raw[[sampidx[i]]])){

Otherwise, if I have only two samples and the first one is the reference sample, then length(ret) is always equal to 1 and the condition all(object@rt$corrected[[i]] == object@rt$raw[[i]]) is always satisfied, even if I'm using getEIC to plot the second (aligned) sample.

It's interesting that the branch devel from sneumann/xcms doesn't use getXcmsRaw in the function getEIC, so it plots the samples correctly. However, getXcmsRaw still has that line if(all(object@rt$corrected[[i]] == object@rt$raw[[i]])){

As I understand, you are rewriting XCMS now, and you might not use getXcmsRaw at all in the newer versions of XCMS, but I just wanted to let you know about this issue.

Thank you for your work on XCMS,

Aleksandr
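The indexing problem reported above can be reproduced with plain lists. The retention times are invented; rt and sampidx only imitate the structure referenced in the report, where the loop index i runs over the requested samples while sampidx[i] is the position of that sample in the full object.

```r
## Two samples: the first is the reference (raw == corrected),
## the second has adjusted retention times.
rt <- list(raw = list(c(1, 2, 3), c(1, 2, 3)),
           corrected = list(c(1, 2, 3), c(1.1, 2.2, 3.3)))

## Only the second sample is requested, so the loop index is 1.
sampidx <- 2L
i <- 1L

## Check as reported: compares sample 1, so no correction is detected.
buggy_check <- all(rt$corrected[[i]] == rt$raw[[i]])

## Proposed check: compares the requested sample.
fixed_check <- all(rt$corrected[[sampidx[i]]] == rt$raw[[sampidx[i]]])

c(buggy = buggy_check, fixed = fixed_check)
```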

group.nearest() has problems similar to EICs when the scan range is reduced prior to peak detection

Alain Smith reported a follow-up to issue #7.

Hi Steffen, It seems like the function group.nearest also suffers from problems similar to EICs when the scan range is reduced prior to peak detection.  I did not test the problem, just letting you know about the observation.
I get an error similar to the one below.

5: In matrix(unlist(scoreList[idx]), ncol = 5, nrow = length(idx),  :
  data length [23890] is not a sub-multiple or multiple of the number of rows [24165]

Implement/fix getEIC method

The current getEIC method implementation in xcms repeatedly causes problems (see issues #39 and #7). We should replace it with a more stable version (i.e. without the buffer).
The method should also allow returning the raw values instead of just the data from the profile matrix (e.g. when argument step = 0). The new implementation should thus combine both the xcmsRaw and getEIC methods.

The method should take arguments

mzrange
scanrange
rtrange (which is internally translated to scanrange).

Some additional notes/properties:

If mzrange is equal to the total M/Z range getEIC returns the TIC.
Should have the possibility to choose between the sum of intensities across M/Z or the max (thus also enabling to extract the base peak chromatogram BPC).

It will however be pretty tricky to not break any existing functionality.
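The interface outlined above (mzrange, scanrange, rtrange, and a choice between sum and max) can be sketched on a list of raw scans. extract_chrom is a hypothetical base R function, not the implementation that ended up in xcms; scans are assumed to be two-column matrices (mz, intensity), and a scan without values in range is assumed to yield 0.

```r
## Hypothetical sketch of the proposed interface.
extract_chrom <- function(scans, scantime, mzrange = c(-Inf, Inf),
                          scanrange = NULL, rtrange = NULL,
                          aggregate = c("sum", "max")) {
    aggregate <- match.arg(aggregate)
    agg_fun <- switch(aggregate,
                      sum = sum,
                      max = function(z) if (length(z)) max(z) else 0)
    ## rtrange is translated into scan indices; it wins over scanrange.
    if (!is.null(rtrange)) {
        idx <- which(scantime >= rtrange[1] & scantime <= rtrange[2])
    } else if (!is.null(scanrange)) {
        idx <- seq(scanrange[1], scanrange[2])
    } else {
        idx <- seq_along(scans)
    }
    inten <- vapply(scans[idx], function(s) {
        keep <- s[, "mz"] >= mzrange[1] & s[, "mz"] <= mzrange[2]
        agg_fun(s[keep, "intensity"])
    }, numeric(1))
    cbind(rt = scantime[idx], intensity = inten)
}

scans <- list(
    cbind(mz = c(100.5, 101.08), intensity = c(5, 10)),
    cbind(mz = c(100.5, 102), intensity = c(7, 3)),
    cbind(mz = c(101.079, 101.083), intensity = c(1, 2))
)
scantime <- c(1, 2, 3)

tic <- extract_chrom(scans, scantime)                     # full m/z range, sum
bpc <- extract_chrom(scans, scantime, aggregate = "max")  # full m/z range, max
eic <- extract_chrom(scans, scantime, mzrange = c(101.07, 101.09),
                     rtrange = c(2, 3))
```

With the default mzrange the summed trace corresponds to the TIC and the max trace to the BPC, matching the two properties listed above.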