
dimspy's People

Contributors

akialbz, nienkevanunen, rjmw, tomnl


dimspy's Issues

skip-stitching bug

When using the --skip-stitching option with the CLI, we get the following error:

/usr/local/lib/python2.7/dist-packages/numpy/lib/function_base.py:4016: RuntimeWarning: All-NaN slice encountered

Traceback (most recent call last):
  File "/usr/local/bin/dimspy", line 11, in <module>
    load_entry_point('dimspy==1.0.0', 'console_scripts', 'dimspy')()
  File "/usr/local/lib/python2.7/dist-packages/dimspy/__main__.py", line 443, in main
    ncpus=args.ncpus)
  File "/usr/local/lib/python2.7/dist-packages/dimspy/tools.py", line 128, in process_scans
    pl = join_peaklists("{}#{}".format(os.path.basename(filenames[i]), pl.metadata["header"][0]), [pl])
AttributeError: 'list' object has no attribute 'metadata'

The full command was as follows:

dimspy process-scans --input /mnt/hgfs/RDS/users/tnl495/galaxy_test//[email protected]/20171011_SS_integration_Elite/sample --output "/home/tnl495/galaxy_dimspy_testing/galaxy/database/files/010/dataset_10384.dat" --function-noise noise_packets --snr-threshold 3.0 --ppm 3.5 --min_scans 1 --min-fraction 0.8 --skip-stitching --report "/home/tnl495/galaxy_dimspy_testing/galaxy/database/files/010/dataset_10385.dat"
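
The traceback suggests that with --skip-stitching the per-file result is a list of peaklists rather than a single one, so the .metadata access fails. A minimal sketch of a guard for this case (hypothetical names, not the real dimspy internals):

```python
# Hypothetical stand-in for dimspy's PeakList, holding only metadata.
class PeakList:
    def __init__(self, metadata):
        self.metadata = metadata

def label_peaklists(pl_or_pls, filename):
    # Assumed fix: unwrap a list before touching .metadata, so the same
    # code path works with and without --skip-stitching.
    pls = pl_or_pls if isinstance(pl_or_pls, list) else [pl_or_pls]
    return ["{}#{}".format(filename, pl.metadata["header"][0]) for pl in pls]
```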

Weird peak matrix from Align-samples

Hi @Albert500 & @RJMW ,

Looks like we get some weird behaviour when we align samples with 2 files of different sample classes.

The resulting peak matrix is created without error but when we perform blank_subtraction we get an error.

e.g. with a filelist of:

filename classLabel
A01_Blank.mzML blank
A02_Sample.mzML sample

and running

peaks1 = process_scans(source='data', function_noise="median",
                            snr_thres=3.0, ppm=2.0, min_fraction=0.7, rsd_thres=None, filelist="filelist_01.txt",
                            block_size=2000, ncpus=None, skip_stitching=False,
                       filter_scan_events={'include': [[50.0, 1000.0, 'full']]})
pm = align_samples(peaks1, ppm=2)
pm_bf = blank_filter(pm, blank_label='blank')

This gives the following error message:

Traceback (most recent call last):
  File "/home/tnl495/galaxy_testing/dimspy2/tests/merge_test_temp.py", line 13, in <module>
    pm_bf = blank_filter(pm, blank_label='blank')
  File "/home/tnl495/galaxy_testing/dimspy2/dimspy/workflow.py", line 255, in blank_filter
    return filter_blank_peaks(peak_matrix, blank_label, min_fraction, min_fold_change, function, rm_samples)
  File "/home/tnl495/galaxy_testing/dimspy2/dimspy/process/peak_filters.py", line 176, in filter_blank_peaks
    m.add_flag(flag_name, ~((ints > 0) & faild_int))
  File "/home/tnl495/galaxy_testing/dimspy2/dimspy/models/peak_matrix.py", line 513, in add_flag
    if not len(flag_values) == self.shape[1]: raise ValueError('flag values and peak matrix shape not match')
ValueError: flag values and peak matrix shape not match


Files used are attached

test.zip

No metadata for peaklists

Would be nice to get metadata (e.g. mz range, polarity, general header information) in the peaklist output.

The metadata is empty when analysing non-SIM-stitch data.

If you would like to recreate this, see the script below, which can be run on data on the RDS:


import os
from dimspy.workflow import process_scans
from dimspy.workflow import align_peaks
from dimspy.workflow import blank_filter
from dimspy.portals.hdf5_portal import save_peaklists_as_hdf5, save_peak_matrix_as_hdf5
from dimspy.process import peak_filters


def run_workflow(sources, snr, ppm, minfracB, minfracS, output, fracid):
    output_frac = os.path.join(output, '{}_ppm{}_snr{}_minfracS{}_minfracB{}'.format(fracid, ppm, snr, minfracS,
                                                                            minfracB))
    if not os.path.exists(output_frac):
        os.makedirs(output_frac)

    dimsn = process_scans(source=sources['dimsn'], nscans=0, function_noise="noise_packets",
                          snr_thres=snr, ppm=ppm, min_fraction=minfracS, rsd_thres=None,
                          filelist=None,
                          mzrs_to_remove=[], scan_events=[[50.0, 1000.0, 'full']],
                          block_size=2000, ncpus=1)
    dimsn[0].tags.add_tags(class_label='DIMSn')

    sample = process_scans(source=sources['sample'], nscans=0, function_noise="noise_packets",
                           snr_thres=3.0, ppm=ppm, min_fraction=minfracS, rsd_thres=None,
                           filelist=None,
                           mzrs_to_remove=[], scan_events=[[50.0, 1000.0, 'full']],
                           block_size=2000, ncpus=1)
    sample[0].tags.add_tags(class_label='Sample')

    blank = process_scans(source=sources['blank'], nscans=0, function_noise="noise_packets",
                          snr_thres=3.0, ppm=ppm, min_fraction=0.4, rsd_thres=None,
                          filelist=None,
                          mzrs_to_remove=[], scan_events=[[50.0, 1000.0, 'full']],
                          block_size=2000, ncpus=1)
    blank[0].tags.add_tags(class_label='Blank')

    peaks = [dimsn[0], sample[0], blank[0]]

    peaks = [peak_filters.filter_attr(p, attr_name='intensity', min_threshold=5000) for p in peaks]

    pm = align_peaks(peaks, ppm=ppm)
    write_out_pm(os.path.join(output_frac, 'peak_matrix.txt'), pm)

    save_peak_matrix_as_hdf5(pm=pm, fname=os.path.join(output_frac, 'peak_matrix.hdf5'))

    pm_bf = blank_filter(pm, "Blank", min_fraction=minfracB, min_fold_change=10.0, function="mean", rm_samples=False)

    save_peak_matrix_as_hdf5(pm=pm_bf, fname=os.path.join(output_frac, 'peak_matrix_bf.hdf5'))

    write_out_pm(os.path.join(output_frac, 'peak_matrix_blank_filtered.txt'), pm_bf)

    save_peaklists_as_hdf5(pkls=peaks, fname=os.path.join(output_frac, "peaklists.hdf5"))
    for pl in peaks:
        print pl.ID, pl.shape
        with open(os.path.join(output_frac, pl.ID + ".txt"), "w") as out:
            out.write(pl.to_str())

    return pm_bf

def write_out_pm(out_file, pm):
    with open(out_file, "w") as out:
        out.write(pm.to_str(transpose=True))

output = '/mnt/hgfs/DATA/dimspy_dma_test/'
rds = '/mnt/hgfs/RDS/users/tnl495/data/for_ralf/'

sources = {
    'sample': os.path.join(rds, 'A01_Polar_Daph_WAX1_Phenyl_LCMS_Pos_DIMS.raw'),
    'blank': os.path.join(rds, 'A01_Polar_Blank_WAX1_Phenyl_LCMS_Pos_DIMS.raw'),
    'dimsn': os.path.join(rds, 'A01_Polar_Daph_WAX1_Phenyl_LCMS_Pos_DIMSn.raw')

}


Blank peak flag

As well as the option for removing peaks using 'blank_filter', it would be useful to have an option to just flag the peaks based on the 'blank_filter' criteria.

This would mean a user can easily check what peaks are going to be removed. Would be useful for me at least.
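
A flag-only mode could compute the same criteria without removing anything. A minimal sketch, modelled loosely on the blank_filter parameters (min_fraction, min_fold_change) and not the actual dimspy implementation:

```python
import numpy as np

def blank_flags(sample_ints, blank_ints, min_fraction=1.0, min_fold_change=10.0):
    """Return a boolean array per peak: True = peak would survive the blank filter.

    A peak fails when the blank intensity is non-zero and fewer than
    min_fraction of samples reach min_fold_change * the blank intensity.
    (Illustrative criteria, assumed from the blank_filter signature.)
    """
    thresholds = blank_ints * min_fold_change
    frac_above = np.mean(sample_ints >= thresholds, axis=0)
    fails = (blank_ints > 0) & (frac_above < min_fraction)
    return ~fails
```

The returned mask could then be attached as a flag rather than used for removal, letting a user inspect which peaks would go.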

E flagged peaks (remove or keep)

Hi all,

It looks like we include the E-flagged peaks when processing.

See below for the E-flag description from Thermo; however, we are not sure whether these are also just electrical-noise peaks.

(@M-R-JONES is going to email thermo for clarification)

(image: Thermo's E-flag description)

This can be problematic, as these peaks are not seen by default in Xcalibur, and we are not sure whether it is valid to include them at all.

One option is to just remove them completely from any processing. This replicates what msconvert does, and @RJMW has mentioned this is an easy fix, but maybe we should wait for clarification from Thermo?

Update align_peaks

  • Allow input of a single peaklist, including to_peaklist()
  • Skip clustering for single peaklist

Align peaklists with different attributes

In some cases one might need to align peaklists that have different attributes e.g. see the following peaklist pl_to_align:

In [43]: pl_to_align[0].attributes
Out[43]: ('mz', 'intensity')

In [44]: pl_to_align[1].attributes
Out[44]: ('mz', 'intensity', 'present', 'fraction', 'occurrence', 'purity')

The following error will occur when trying to align

In [73]: align_peaks(pl_to_align)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-73-fd90ecd30488> in <module>
----> 1 align_peaks(pl_to_align)

~/miniconda3/envs/[email protected]/lib/python3.7/site-packages/dimspy/process/peak_alignment.py in align_peaks(peaks, ppm, block_size, fixed_block, edge_extend, ncpus)
    232         raise AttributeError('PANIC: peak attributes in wrong order')
    233     if not all([attrs == x.attributes for x in peaks]):
--> 234         raise ValueError('peak attributes not the same')
    235     if 'intra_count' in attrs:
    236         raise AttributeError('preserved attribute name [intra_count] already exists')

ValueError: peak attributes not the same

One option is that I just add the missing attributes to one of the peaklists (or remove the attributes from the other peaklist). Perhaps it is better, though, for this to be handled within the "align_peaks" function.
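
One illustrative way such handling could work (not the dimspy API; names are made up) is to pad every peaklist's attribute table to the union of attribute names before aligning, filling missing columns with NaN:

```python
import numpy as np

def harmonise_attributes(tables):
    """tables: list of dicts mapping attribute name -> 1-D numpy array.

    Returns copies padded to the union of attribute names, with missing
    columns filled with NaN so every table has the same attributes.
    """
    all_attrs = []
    for t in tables:
        for a in t:
            if a not in all_attrs:
                all_attrs.append(a)
    padded = []
    for t in tables:
        n = len(next(iter(t.values())))  # number of peaks in this table
        padded.append({a: t.get(a, np.full(n, np.nan)) for a in all_attrs})
    return padded
```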

Known bug: Failed building wheel for pythonnet

pythonnet 2.3.0 does not build with Mono 5. I have relaxed the requirement in the dev branch:

- pythonnet==2.3.0
+ pythonnet>=2.3.0

If you encounter this problem during the installation, please run:

pip install git+https://github.com/pythonnet/pythonnet.git

to obtain the latest pythonnet (currently it's 2.4.0.dev0) and reinstall the dev branch of dimspy.

Add tool to bio.tools

Hi! I came across your tool as part of a general review of Galaxy Metabolomics resources (GCC21 cofest).

It would be great if the tool could be added to bio.tools with the appropriate identifiers, along with links to the Galaxy tool, to improve its visibility and allow it to be categorized among other resources.

hdf5-pl-to-txt no output

Hi,

The hdf5-pl-to-txt does not seem to output any files at all at the moment.

Tried with a few files using the following command:

python -m dimspy hdf5-pl-to-txt --input batch04_QC17_rep01_262.hdf5  --output .

Unique ID for averaged peaks for multiple peaklists

Would be useful to have an option to provide a unique ID for each 'averaged' peak across multiple peaklist objects.

Personally I think it makes it a bit nicer to trace back to peaks rather than using the unique m/z and filename.

It is currently possible to do this through the add_attribute() function, but Ralf mentioned it might be nice if it were a built-in feature.


See below for implementation using add_attribute()

For list of peaklist objects called 'peaks'

[<dimspy.models.peaklist.PeakList at 0x7f95dbbe3c10>,
<dimspy.models.peaklist.PeakList at 0x7f95dbd74610>,
<dimspy.models.peaklist.PeakList at 0x7f95dbd60450>]

We add a UID

c=1
for p in peaks:
    p.add_attribute('uid', range(c, c+p.full_shape[0]), flagged_only=False)
    c = c+p.full_shape[0]

We can then extract the UID from the peak matrix

pm = align_peaks(peaks, 3)
pm.attr_matrix('uid')

SIGSEGV when running examples.py

Hi,

I get the following output when I run examples.py or run.sh on my Mac.

python examples.py

Process Scans.....
3 sample(s) with 3 replicate(s)
Batch numbers: [1]
Number of samples in each Batch: {1: 9}
Classes: {'sample': 3, 'QC': 3, 'blank': 3}

batch04_B02_rep01_301.mzML
Reading scans....
SIM-Stitch experiment - Overlapping m/z windows.....
Removing 'edges' from SIM windows.....
Removing noise.....
WARNING:root:all peaks are removed for peaklist [88]
Aligning, averaging and filtering peaks.....

=================================================================
	Native Crash Reporting
=================================================================
Got a SIGSEGV while executing native code. This usually indicates
a fatal error in the mono runtime or one of the native libraries 
used by your application.
=================================================================

=================================================================
	Basic Fault Adddress Reporting
=================================================================
Memory around native instruction pointer (0x107517100):0x1075170f0  
=================================================================

I've tried reinstalling the Mono runtime, but no luck. The address it reports (in this case 0x107517100) is different with every run.

Any ideas what else I can try?

Running into an error: IOError('input database missing crucial attribute [mz]')

Hi,

I installed dimspy as a conda environment, and when I try to run the command
"dimspy hdf5-pm-to-txt --input 'MTBLS79__01_262_02_263.hdf5' --output 'test' --delimiter tab --attribute_name intensity --representation-samples rows" (the input file is one of the test datasets from git), I run into the following error:

Namespace(attribute_name='intensity', class_label_rsd=(), comprehensive=False, delimiter='tab', input='MTBLS79__01_262_02_263.hdf5', output='test', representation_samples='rows', step='hdf5-pm-to-txt')
Traceback (most recent call last):
  File "/N/soft/rhel7/dimspy/dimspy-env/bin/dimspy", line 6, in <module>
    sys.exit(dimspy.__main__.main())
  File "/N/soft/rhel7/dimspy/dimspy-env/lib/python2.7/site-packages/dimspy/__main__.py", line 542, in main
    comprehensive=args.comprehensive)
  File "/N/soft/rhel7/dimspy/dimspy-env/lib/python2.7/site-packages/dimspy/tools.py", line 337, in hdf5_peak_matrix_to_txt
    obj = hdf5_portal.load_peak_matrix_from_hdf5(filename)
  File "/N/soft/rhel7/dimspy/dimspy-env/lib/python2.7/site-packages/dimspy/portals/hdf5_portal.py", line 170, in load_peak_matrix_from_hdf5
    raise IOError('input database missing crucial attribute [mz]')
IOError: input database missing crucial attribute [mz]

Any suggestions on how to get past this error?
Thank You,
Bhavya

Multi-list for merge peaklist

When using DIMSpy, we would like to update the merge_peaklists function slightly so that it can output lists of lists of peaklist objects,

e.g.
[ [ peaklist, peaklist ], [ peaklist, peaklist ] ]

This is so we can perform the Deep Metabolome Annotation (DMA) pipeline for data processing. See below

(image: DMA pipeline diagram)

I have currently updated the code for this in the Python package here, but I am just testing on my local Galaxy instance to see if it is compatible with the Galaxy pipeline.

Hopefully this will all make sense to @RJMW as we have been discussing this but happy to explain in more detail if required for @Albert500

I will make a pull request when happy with the changes

Cheers,
Tom

Non SIM data without # identifiers

When running process_scans on a dataset where we just want to look at full-scan data, the resulting peaklist ID will have the header appended after the filename, e.g.

A01_Polar_Daph_WAX1_Phenyl_LCMS_Pos_DIMS.mzML#FTMS + p NSI Full ms [50.00-1000.00]

This is useful if we had multiple scans with different headers but is not required if the file only consists of one header in the first place.

Additionally, the # symbol causes an error if we later use this as the identifier in a filelist read with np.genfromtxt,

e.g.

Fatal error: Exit code 1 ()
Traceback (most recent call last):
  File "/home/tnl495/galaxy_dimspy_testing/galaxy/database/dependencies/_conda/envs/[email protected]/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/home/tnl495/galaxy_dimspy_testing/galaxy/database/dependencies/_conda/envs/[email protected]/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/home/tnl495/galaxy_dimspy_testing/galaxy/database/dependencies/_conda/envs/[email protected]/lib/python2.7/site-packages/dimspy/__main__.py", line 9, in <module>
    main()
  File "/home/tnl495/galaxy_dimspy_testing/galaxy/database/dependencies/_conda/envs/[email protected]/lib/python2.7/site-packages/dimspy/dimspy.py", line 451, in main
    pls_merged = workflow.merge_peaklists(source=args.input, filelist=args.filelist)
  File "/home/tnl495/galaxy_dimspy_testing/galaxy/database/dependencies/_conda/envs/[email protected]/lib/python2.7/site-packages/dimspy/workflow.py", line 351, in merge_peaklists
    fl = check_metadata(filelist)
  File "/home/tnl495/galaxy_dimspy_testing/galaxy/database/dependencies/_conda/envs/[email protected]/lib/python2.7/site-packages/dimspy/experiment.py", line 121, in check_metadata
    fm = np.genfromtxt(fn_tsv.encode('string-escape'), dtype=None, delimiter="\t", names=True)
  File "/home/tnl495/galaxy_dimspy_testing/galaxy/database/dependencies/_conda/envs/[email protected]/lib/python2.7/site-packages/numpy/lib/npyio.py", line 1867, in genfromtxt
    raise ValueError(errmsg)
ValueError: Some errors were detected !
    Line #2 (got 1 columns instead of 3)
    Line #3 (got 1 columns instead of 3)
    Line #4 (got 1 columns instead of 3)
    Line #5 (got 1 columns instead of 3)
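
The underlying cause appears to be that np.genfromtxt treats '#' as a comment marker by default, so a filename column containing '#' truncates every row to one column. Passing comments=None disables comment handling; a possible workaround, not verified against dimspy's check_metadata itself:

```python
import io
import numpy as np

# A filelist whose filename column contains '#' (illustrative data).
tsv = ("filename\tclassLabel\treplicate\n"
       "A01.mzML#FTMS + p NSI Full ms\tsample\t1\n"
       "A02.mzML#FTMS + p NSI Full ms\tblank\t1\n")

# comments=None stops genfromtxt from cutting each row at the '#'.
fm = np.genfromtxt(io.StringIO(tsv), dtype=None, delimiter="\t",
                   names=True, comments=None, encoding="utf-8")
```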

add_attribute to peaklists with strings: Fails on peak_align

If I wanted to add a string attribute to a list of peaklist objects, I would do the following:

for p in peaks:
    uuid4_values = [str(uuid.uuid4()) for _ in range(p.full_shape[0])]
    p.add_attribute('uuid4_values', uuid4_values, flagged_only=False)

But if I later try to align the peaks like so:

align_peaks(peaks, ppm=3)

I get this error:

/usr/local/lib/python2.7/dist-packages/dimspy/process/peak_alignment.pyc in align_peaks(peaks, ppm, block_size, fixed_block, edge_extend, ncpus)
    199 
    200     # align
--> 201     a_pids, a_attrms = _align_peaks(cids, s_pids, *s_attrs)
    202     attrs += ('intra_count', )  # for cM
    203 

/usr/local/lib/python2.7/dist-packages/dimspy/process/peak_alignment.pyc in _align_peaks(cids, pids, *attrs)
    156         aM[np.isnan(aM)] = 0
    157         return aM
--> 158     attrMs = map(_fillam, attrs)
    159 
    160     # sort mz values, ensure mzs matrix be the first

/usr/local/lib/python2.7/dist-packages/dimspy/process/peak_alignment.pyc in _fillam(a)
    151         aM = np.zeros(map(len, (upids, ucids)))
    152         for p, v in zip(zip(mpids, mcids), a):
--> 153             aM[p] += v
    154         with np.errstate(divide = 'ignore', invalid = 'ignore'):
    155             aM /= cM

TypeError: ufunc 'add' did not contain a loop with signature matching types dtype('S36') dtype('S36') dtype('S36')

I tried setting the np.dtype directly, but I still get the same error, e.g.

for p in peaks:
    uuid4_values = [str(uuid.uuid4()) for _ in range(p.full_shape[0])]
    uuid4_values = np.array(uuid4_values, dtype=np.dtype('S36'))
    p.add_attribute('uuid4_values', uuid4_values, flagged_only=False, attr_dtype=np.dtype('S36'))

align_peaks(peaks, ppm=3)
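
The error arises because the alignment step averages each attribute across clustered peaks, and numpy has no 'add' loop for fixed-width string arrays. One illustrative workaround (not the dimspy API) is to keep only numeric attributes before calling align_peaks and re-attach the strings afterwards:

```python
import numpy as np

def numeric_attrs(table):
    """Keep only the attribute columns with a numeric dtype.

    table: dict mapping attribute name -> numpy array. String-typed
    attributes (e.g. 'S36' uuids) are dropped, since they cannot be
    averaged during alignment.
    """
    return {a: v for a, v in table.items()
            if np.issubdtype(v.dtype, np.number)}
```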

Blank subtraction fails when samples removed

Hi all,

It looks to me like the blank subtraction fails when some samples are removed (see below).

Good news: it seems to be an easy fix.

Change peak_matrix.py line 245, see #20

logging.warning('empty peaklists [%s] automatically removed' % join([str(self.peaklist_ids[i]) for i in rmsids[0]], ', '))

to

logging.warning('empty peaklists [%s] automatically removed' % join([str(self.peaklist_ids[i]) for i in rmsids], ', '))

i.e. just change rmsids[0] to rmsids


The error message:

TypeError                                 Traceback (most recent call last)
<ipython-input-71-4fae4b4baea9> in <module>()
----> 1 blank_filter(pm, "Blank", min_fraction=0.8, min_fold_change=10)

/usr/local/lib/python2.7/dist-packages/dimspy/workflow.pyc in blank_filter(peak_matrix, blank_label, min_fraction, min_fold_change, function, rm_samples, class_labels)
    203         raise IOError("Blank label ({}) does not exist".format(blank_label))
    204 
--> 205     return filter_blank_peaks(peak_matrix, blank_label, min_fraction, min_fold_change, function, rm_samples)
    206 
    207 

/usr/local/lib/python2.7/dist-packages/dimspy/process/peak_filters.pyc in filter_blank_peaks(pm, blank_label, fraction_threshold, fold_threshold, method, rm_blanks)
     99         faild_int = np.sum(pm.intensity_matrix >= ints, axis = 0) < (fraction_threshold * pm.shape[0])
    100         rmids = np.where(np.logical_and(ints > 0, faild_int))
--> 101     pm.remove_peaks(rmids)
    102 
    103     if rm_blanks:

/usr/local/lib/python2.7/dist-packages/dimspy/models/peak_matrix.pyc in remove_peaks(self, ids, remove_empty_samples)
    240                 rmsids = np.where(np.sum(pm.intensity_matrix, axis = 1) == 0)[0]
    241             if len(rmsids) > 0:
--> 242                 logging.warning('empty peaklists [%s] automatically removed' % join([str(self.peaklist_ids[i]) for i in rmsids[0]], ', '))
    243                 self.remove_samples(rmsids, False, False)
    244         if self.is_empty(): logging.warning('matrix is empty after removal')

TypeError: 'numpy.int64' object is not iterable
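
The failure can be reproduced without dimspy: remove_peaks has already unpacked the np.where tuple, so rmsids is a 1-D array and rmsids[0] is a scalar numpy.int64, which cannot be iterated over:

```python
import numpy as np

# Toy intensity matrix where the first sample (row) is empty.
intensity_matrix = np.array([[0, 0], [1, 2]])

# As in remove_peaks: np.where(...)[0] already yields a 1-D index array,
# so indexing it again with [0] produces a scalar, not a sequence.
rmsids = np.where(np.sum(intensity_matrix, axis=1) == 0)[0]
```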

Installation on python3+

Hi, I am trying to install on Python 3, but it is still throwing an error that 2.7.8 is required. Is it still necessary to install it on 2.7?

Exception: No scan data to process. Check filter_scan_events

I'm experiencing a strange "no scan data to process" error:

Exception: No scan data to process. Check filter_scan_events

keo7@keo7-OptiPlex-790:~/Data/dimspy-denisa/mzmlsplits$ dimspy process-scans -i positive.zip -o positive.s -m "median" -s 0.4
Executing dimspy version 1.1.0.
Namespace(block_size=5000, exclude_scan_events=[], filelist=None, function_noise='median', include_scan_events=[], input=['positive.zip'], min_fraction=0.5, min_scans=1, ncpus=None, output='positive.s', ppm=2.0, remove_mz_range=[], report=None, ringing_threshold=None, rsd_threshold=None, skip_stitching=False, snr_threshold=0.4, step='process-scans')

0001-A002-160824-a.mzML
Reading scans....
Traceback (most recent call last):
  File "/home/keo7/VirtualEnvs/pythonnet-build/bin/dimspy", line 11, in <module>
    sys.exit(main())
  File "/home/keo7/VirtualEnvs/pythonnet-build/local/lib/python2.7/site-packages/dimspy/__main__.py", line 443, in main
    ncpus=args.ncpus)
  File "/home/keo7/VirtualEnvs/pythonnet-build/local/lib/python2.7/site-packages/dimspy/tools.py", line 71, in process_scans
    pls_scans = read_scans(filenames[i], source, function_noise, min_scans, filter_scan_events)
  File "/home/keo7/VirtualEnvs/pythonnet-build/local/lib/python2.7/site-packages/dimspy/process/replicate_processing.py", line 111, in read_scans
    raise Exception("No scan data to process. Check filter_scan_events")
Exception: No scan data to process. Check filter_scan_events

Blank file peaklist being removed even when rm_sample=False

When using the "blank_filter" function, the blank file is removed even when the "rm_samples" parameter is "False".


sources = {
    'sample':'/mnt/hgfs/DMA/daphnia_magna/DIMS/data/P_WAX_1_PHE/pos/di-ms/passed/A06_Polar_Daph_WAX1_Phenyl_LCMS_Pos_DIMS.raw',
    'blank':'/mnt/hgfs/DMA/daphnia_magna/DIMS/data/P_WAX_1_PHE/pos/di-ms/passed/A06_Polar_Blank_WAX1_Phenyl_LCMS_Pos_DIMS.raw',
    'dimsn':'/mnt/hgfs/DMA/daphnia_magna/DIMS/data/P_WAX_1_PHE/pos/di-msn/passed/A06_Polar_Daph_WAX1_Phenyl_LCMS_Pos_DIMSn.raw',
}

snr = 3
ppm = 4
minfracB = .5
minfracS = .7
output = '/mnt/hgfs/DATA/dimspy_dma_test'
fracid = 'WAX1_A06'

dimsn = process_scans(source=sources['dimsn'], nscans=0, function_noise="noise_packets",
                      snr_thres=snr, ppm=ppm, min_fraction=minfracS, rsd_thres=None,
                      filelist=None,
                      mzrs_to_remove=[], scan_events=[[50.0, 1000.0, 'full']],
                      block_size=2000, ncpus=1)
dimsn[0].tags.add_tags(class_label='DIMSn')

sample = process_scans(source=sources['sample'], nscans=0, function_noise="noise_packets",
                       snr_thres=3.0, ppm=ppm, min_fraction=minfracS, rsd_thres=None,
                       filelist=None,
                       mzrs_to_remove=[], scan_events=[[50.0, 1000.0, 'full']],
                       block_size=2000, ncpus=1)
sample[0].tags.add_tags(class_label='Sample')

blank = process_scans(source=sources['blank'], nscans=0, function_noise="noise_packets",
                      snr_thres=3.0, ppm=ppm, min_fraction=0.4, rsd_thres=None,
                      filelist=None,
                      mzrs_to_remove=[], scan_events=[[50.0, 1000.0, 'full']],
                      block_size=2000, ncpus=1)
blank[0].tags.add_tags(class_label='Blank')

peaks = [dimsn[0], sample[0], blank[0]]

pm = align_peaks(peaks, ppm=ppm)
pm_bf = blank_filter(pm, "Blank", min_fraction=minfracB, min_fold_change=10.0, function="mean", rm_samples=False)

Linking Replicate filter to Align Samples - error in Replicate Filter metadata output format

When running the dimspy processing pipeline on the Galaxy server, an error arises when linking the "Replicate Filter" and "Align Samples" tools. This error is believed to be caused by incorrect formatting of the "Sample Metadata" file output from the "Replicate Filter" tool.

Example error log:

Fatal error: Exit code 1 ()
/gpfs/apps/galaxy/viantm-dev/galaxy/tool_dependencies/_conda/envs/[email protected]/lib/python2.7/site-packages/dimspy/experiment.py:138: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison

Traceback (most recent call last):
  File "/gpfs/apps/galaxy/viantm-dev/galaxy/tool_dependencies/_conda/envs/[email protected]/bin/dimspy", line 11, in <module>
    load_entry_point('dimspy==1.0.0', 'console_scripts', 'dimspy')()
  File "/gpfs/apps/galaxy/viantm-dev/galaxy/tool_dependencies/_conda/envs/[email protected]/lib/python2.7/site-packages/dimspy/__main__.py", line 463, in main
    ncpus=args.ncpus)
  File "/gpfs/apps/galaxy/viantm-dev/galaxy/tool_dependencies/_conda/envs/[email protected]/lib/python2.7/site-packages/dimspy/tools.py", line 262, in align_samples
    fl = check_metadata(filelist)
  File "/gpfs/apps/galaxy/viantm-dev/galaxy/tool_dependencies/_conda/envs/[email protected]/lib/python2.7/site-packages/dimspy/experiment.py", line 141, in check_metadata
    idxs_replicates = idxs_reps_from_filelist(fm["replicate"])
  File "/gpfs/apps/galaxy/viantm-dev/galaxy/tool_dependencies/_conda/envs/[email protected]/lib/python2.7/site-packages/dimspy/experiment.py", line 254, in idxs_reps_from_filelist
    raise ValueError("Incorrect numbering for replicates. Row {}".format(i))
ValueError: Incorrect numbering for replicates. Row 1

Based on the above, it seems that the "replicate" column of the "Sample Metadata" file may be in an inappropriate format for subsequent use with the "Align Samples" tool.
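
For reference, a hedged sketch of the numbering rule the error message implies (this mimics, but is not, dimspy's idxs_reps_from_filelist): within the "replicate" column, numbering must start at 1 and either increase by 1 or restart at 1 for the next sample, e.g. 1,2,3,1,2,3.

```python
def check_replicate_numbering(replicates):
    """Raise ValueError at the first row that breaks the assumed rule:
    each value is either prev + 1 (next replicate) or 1 (new sample)."""
    prev = 0
    for i, r in enumerate(replicates):
        if r != prev + 1 and r != 1:
            raise ValueError("Incorrect numbering for replicates. Row {}".format(i))
        prev = r
```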

Missing root path in check_paths()

I believe the following code should be improved:

paths.py, ln 23-24
peaklists = hdf5_portal.load_peaklists_from_hdf5(source)
filenames = [pl.ID for pl in peaklists]

Because pl.ID doesn't contain the source path information, the files in filenames cannot be found unless they are in the current Python working directory. I suggest revising it as:

filenames = [os.path.join(os.path.abspath(os.path.dirname(source)), pl.ID) for pl in peaklists]

Similar problems should be fixed throughout the source code.
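
A quick demonstration of the proposed resolution (the paths here are hypothetical):

```python
import os

# Resolve a peaklist ID against the directory of the source HDF5 file
# rather than the current working directory.
source = "/data/project/peaklists.hdf5"   # hypothetical source path
pl_id = "A01_sample.mzML"                 # hypothetical peaklist ID
resolved = os.path.join(os.path.abspath(os.path.dirname(source)), pl_id)
```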

Citation?

I'm looking to cite this package, but am unable to find any documentation on where and what citation I should use.
