galaxytools's People

Contributors

bernt-matthias, cat-bro, hechth, ljocha, martenson, maximskorik, smartx-usman, trachtok, wverastegui, xtracko, xtrojak, zargham-ahmad

galaxytools's Issues

Reporting tool for GC annotation

A tool has to be created which takes outputs from RAMClustR (msp & csv) and matchms (scores, number of matches) and creates an output reporting in the mzTab format (https://pubmed.ncbi.nlm.nih.gov/30688441/).

Other tools around mzTab are https://github.com/lifs-tools/jmzTab-m and https://github.com/lifs-tools/pymzTab-m

Another possibility for reporting is via the MS-DIAL output format (example) which contains the intensities of the annotated features + the sample metadata.

Update 09.05.2024:

The tool should take 2 inputs:

  • the Spec abundance table from the RAMClustR tool outputs
  • the scores table from matchms formatter output.

The output should be the Spec abundance table as a tabular file, with some columns renamed depending on the inputs and parameters.

We need a single select which states whether to use single or multiple annotation.

If single is selected, then if there are multiple rows having the same query, choose the reference with the maximum score.

If multiple is allowed, concatenate all reference identifiers separated by , and use the concatenated string as the column header.
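The renaming rule above can be sketched as follows. This is only a minimal illustration, assuming the scores table can be read as (query, reference, score) rows and that the abundance table's column headers are the query names; the function names and the sample identifiers are hypothetical.

```python
from collections import defaultdict


def build_column_names(score_rows, mode="single"):
    """Map each query (feature) name to its new column header.

    score_rows: iterable of (query, reference, score) tuples.
    mode: "single" keeps the best-scoring reference per query,
          "multiple" joins all references with a comma.
    """
    by_query = defaultdict(list)
    for query, reference, score in score_rows:
        by_query[query].append((float(score), reference))

    names = {}
    for query, hits in by_query.items():
        if mode == "single":
            # max() compares the score first; ties fall back to the reference name.
            names[query] = max(hits)[1]
        elif mode == "multiple":
            names[query] = ",".join(reference for _, reference in hits)
        else:
            raise ValueError(f"unknown mode: {mode}")
    return names


def rename_header(header, names):
    """Rename columns that have an annotation; leave the others untouched."""
    return [names.get(column, column) for column in header]


# Made-up example data, not taken from a real run.
scores = [
    ("C001", "caffeine", 0.91),
    ("C001", "theobromine", 0.84),
    ("C002", "glucose", 0.77),
]
header = ["sample", "C001", "C002", "C003"]

print(rename_header(header, build_column_names(scores, "single")))
print(rename_header(header, build_column_names(scores, "multiple")))
```

Columns without any match (C003 here) keep their original name, which seems the least surprising behaviour but is a design choice the issue does not settle.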

advanced xmsannotator issues

Beware that I do not know which files are correct to test with, so I am testing on files that are surely invalid. Nevertheless, there seems to be an issue with running the advanced version of the annotator.

This is the Galaxy traceback when trying to execute the tool:

Traceback (most recent call last):
  File "/mnt/volume/shared/galaxy/server/lib/galaxy/tools/__init__.py", line 1657, in handle_single_execution
    flush_job=flush_job,
  File "/mnt/volume/shared/galaxy/server/lib/galaxy/tools/__init__.py", line 1745, in execute
    return self.tool_action.execute(self, trans, incoming=incoming, set_output_hid=set_output_hid, history=history, **kwargs)
  File "/mnt/volume/shared/galaxy/server/lib/galaxy/tools/actions/__init__.py", line 311, in execute
    history, inp_data, inp_dataset_collections, preserved_tags, all_permissions = self._collect_inputs(tool, trans, incoming, history, current_user_roles, collection_info)
  File "/mnt/volume/shared/galaxy/server/lib/galaxy/tools/actions/__init__.py", line 263, in _collect_inputs
    inp_data, all_permissions = self._collect_input_datasets(tool, incoming, trans, history=history, current_user_roles=current_user_roles, collection_info=collection_info)
  File "/mnt/volume/shared/galaxy/server/lib/galaxy/tools/actions/__init__.py", line 209, in _collect_input_datasets
    tool.visit_inputs(param_values, visitor)
  File "/mnt/volume/shared/galaxy/server/lib/galaxy/tools/__init__.py", line 1532, in visit_inputs
    visit_input_values(self.inputs, values, callback)
  File "/mnt/volume/shared/galaxy/server/lib/galaxy/tools/parameters/__init__.py", line 168, in visit_input_values
    visit_input_values(input.inputs, values, callback, new_name_prefix, label_prefix, parent_prefix=name_prefix, **payload)
  File "/mnt/volume/shared/galaxy/server/lib/galaxy/tools/parameters/__init__.py", line 170, in visit_input_values
    callback_helper(input, input_values, name_prefix, label_prefix, parent_prefix=parent_prefix, context=context)
  File "/mnt/volume/shared/galaxy/server/lib/galaxy/tools/parameters/__init__.py", line 131, in callback_helper
    new_value = callback(**args)
  File "/mnt/volume/shared/galaxy/server/lib/galaxy/tools/actions/__init__.py", line 159, in visitor
    target_dict[conversion_name] = conversion_data.id  # a more robust way to determine JSONable value is desired
AttributeError: 'NoneType' object has no attribute 'id'

These lines are suspicious because they are inconsistent, but they do not seem to be the cause.
https://github.com/RECETOX/galaxytools/blob/master/tools/xmsannotator/xmsannotator_advanced.xml#L33-L36

Removing the <conversion> steps from the tool allows me to execute it. The issue could be with the application of conversion on the optional steps.

Further research with actual valid inputs is required. From what I understand of the code, the conversions are not essential as long as we supply CSV inputs, so maybe we can just drop them.

cc @xtracko @hechth

Query escapes SQL too aggressively

When running a trivial SQL query like:

select * from r where rtp > 0

I get:

pandasql.sqldf.PandaSQLException: (sqlite3.OperationalError) near "gt": syntax error
[SQL: select * from r where rtp gt 0]

IMHO this is due to some aggressive escaping, probably on the Galaxy side, which replaces "dangerous" characters like >.

The escaping should either be disabled or reversed in the Python script.
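Reversing the escaping in the script could look roughly like the sketch below. The token table is an assumption about what Galaxy's default parameter sanitizer emits (for example `__gt__` for `>`); the error above shows a bare `gt`, so the exact tokens reaching the script should be verified before relying on this. The function name is hypothetical.

```python
# Assumed mapping of sanitizer tokens back to the original characters.
GALAXY_ESCAPES = {
    "__gt__": ">",
    "__lt__": "<",
    "__sq__": "'",
    "__dq__": '"',
    "__ob__": "[",
    "__cb__": "]",
    "__oc__": "{",
    "__cc__": "}",
    "__at__": "@",
    "__pd__": "#",
}


def unescape_query(query):
    """Replace every known sanitizer token with the character it stands for."""
    for token, char in GALAXY_ESCAPES.items():
        query = query.replace(token, char)
    return query


print(unescape_query("select * from r where rtp __gt__ 0"))
```

Unescaping in the script keeps the tool XML untouched; relaxing the sanitizer for this one parameter in the tool definition may be the cleaner fix, if Galaxy allows it for this parameter type.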

retip: better input type check/conversion of retip_apply

retip_apply accepts 'h5' and 'tabular' input types. If it is given an input whose type is literally 'tabular', it runs spell.R, otherwise spell_h5.R. This strategy breaks with subtypes of tabular, which fall through to the h5 branch although they are not h5. Technically, the non-h5 version accepts tsv only.

Consequently, the tool must be given either h5 or a tsv explicitly retyped to tabular; otherwise it fails.

I'm not sure whether we can benefit from some automatic type conversion, so I'm leaving the right solution to experts. But it should be made more robust.
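One way to make the dispatch independent of the Galaxy datatype name is to look at the file content instead. The sketch below checks for the HDF5 format signature at the start of the file and picks the script accordingly. It is only an illustration: it ignores HDF5 files with a user block (where the signature sits at a later offset), and the helper names are hypothetical; only the two script names come from the description above.

```python
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"


def is_hdf5(path):
    """Return True if the file starts with the HDF5 format signature."""
    with open(path, "rb") as handle:
        return handle.read(len(HDF5_SIGNATURE)) == HDF5_SIGNATURE


def choose_script(path):
    """Pick the R script by file content rather than by declared datatype."""
    return "spell_h5.R" if is_hdf5(path) else "spell.R"
```

With this approach any subtype of tabular would be routed to spell.R, because the decision no longer depends on an exact type-name match.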

wrap XCMS

Almost everything seems to be done already. We need to check whether the existing Galaxy tools from wf4metabolomics cover all our use cases.

bioconductor package: https://www.bioconductor.org/packages/release/bioc/html/xcms.html
bioconda package: https://bioconda.github.io/recipes/bioconductor-xcms/README.html#package-bioconductor-xcms
MTS Galaxy Tools suite from W4Metabolomics: https://toolshed.g2.bx.psu.edu/view/lecorguille/suite_xcms/a520686a9627
Custom wf4metabolomics datatypes: https://toolshed.g2.bx.psu.edu/view/lecorguille/rdata_xcms_datatypes/544f6d2329ac
Galaxy tools development repo: https://github.com/workflow4metabolomics/tools-metabolomics/tree/master/tools

Run Thermo raw conversion in parallel on collections.

The current converter (https://github.com/galaxyproteomics/tools-galaxyp/blob/master/tools/ThermoRawFileParser/thermo_converter.xml) puts all input files in a folder and invokes a single instance of the converter binary, which processes the files from the folder sequentially. This is not optimal for larger batches.

The ultimate option is running a single job per single file, which might not be optimal, but it's worth trying.

If the overhead is too high, it can be done by a job which requests multiple cores, groups the files into several folders, and runs multiple instances of the binary (one per folder).
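The multi-core variant could be sketched as below. This is a rough outline under stated assumptions: `split_into_groups` and `run_groups` are hypothetical helpers, and `build_command` is a caller-supplied function that stages a group of files into its own folder and returns the converter command line, since the exact invocation is not given here.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def split_into_groups(files, n_groups):
    """Distribute files round-robin into at most n_groups non-empty groups."""
    groups = [files[i::n_groups] for i in range(n_groups)]
    return [group for group in groups if group]


def run_groups(groups, build_command, max_workers):
    """Run one converter process per group and return the exit codes.

    build_command(group) must return an argv list for subprocess.run.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            pool.submit(subprocess.run, build_command(group), check=True)
            for group in groups
        ]
        return [future.result().returncode for future in futures]


print(split_into_groups(["a.raw", "b.raw", "c.raw", "d.raw", "e.raw"], 2))
```

Threads are sufficient here because the work happens in the external converter processes; the number of groups would normally match the number of cores the job requests.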

Add wrapper for xMSanalyzer

Provide a wrapper for the xMSanalyzer software package.

  1. xMSwrapper.apLCMS
     • Dummy tool wrapper which calls function with default arguments
     • Implement parameters in galaxy tool
     • Forward arguments to function call
  2. xMSwrapper.XCMS.centWave
     • ???
  3. xMSwrapper.XCMS.matchedFilter
     • ???

fix aplcms

The old version deployment works fine, but the new version 6.6.6+galaxy0 using the h5 format throws the following error on the same set of data.

cc @xtracko @trachtok

**** feature extraction ****
**** time correction ****
**** performing time correction ****
m/z tolerance level:  1.13783763305855e-05
time tolerance level: 528.630506001489
the template is sample 1
**** feature alignemnt ****
**** performing feature alignment ****
m/z tolerance level:  1.13783763305855e-05
time tolerance level: 528.666449313702
**** weaker signal recovery ****
Warning messages:
1: In rgl.init(initValue, onlyNULL) : RGL: unable to open X11 display
2: 'rgl.init' failed, running with 'rgl.useNULL = TRUE'. 
Error in H5Fopen(file, native = native) : 
  HDF5. File accessibilty. Unable to open file.
Calls: <Anonymous> -> h5write.default -> h5checktypeOrOpenLoc -> H5Fopen
Execution halted

UMSA Galaxy homescreen bug

I think I found a small bug on our Galaxy homescreen. When I click the links in the red circles, I am not redirected. The same happens in both Chrome and Firefox.

(Attached screenshots: home_screen, error, error_firefox)

matchms: Implement missing similarity metrics

Description

Implement a matchms wrapper which exposes the basic functionality of computing match scores between two msp files.

Subtasks

Galaxy

  • Provide container/package to run as Galaxy tool
  • Write wrapper script which handles IO
  • Write tool script with tests
  • Write documentation
  • Implement choosing similarity metric
  • Add option to mark as symmetric

Features

Implement missing similarity metrics which require special handling

After the metadata match is implemented, it could also provide a better implementation of the RI filtering: perform the filtering via the metadata match, then use a separate tool to combine the scores.

Add EDAM identifiers to Galaxy tools

database: Annotation with various logP descriptors

The database entries have to be annotated with various logP implementations (see here) as an additional property. This should be a galaxy tool taking a parquet file with InChI and SMILES and adding these properties to the database.

The same functionality will also be needed for fragment libraries/tables, so the tool should be designed to be extensible for different inputs and outputs.

RAMClustR: enhancements

Description

Add a galaxy tool wrapping RAMClustR.
The tool can be run with XCMS objects or CSV tables as input. These are handled in two different tools for ease of use, but share the same code/container.

Subtasks

General

  • Forward parameters to script from galaxy environment for xcmsSet object use
  • Implement container to run RAMClustR
  • Split tool into two separate parts, one for XCMS object and one for CSV input
  • Switch from container to bioconda package once available (continues in #114)

Interface Galaxy -> Tool

  • Write wrapper script in tool directory which transforms and handles input - ideally an R script, so that the Galaxy tool only calls the function in the R script and we have an interface between the RAMClustR library and the Galaxy tool

Handle batch, qc and order input parameters by taking a metadata CSV similar to XCMS & WaveICA

  • Find proper data type for input - some type of array, text file, numpy array, R array etc
  • Transform to R vectors and pass to tool

Define Experiment Tool

CSV Tool

  • Implement parameter handling/forwarding using script in tool directory

Tests:

  • Basic test for XCMS based tool
  • Basic test for csv based tool once #72 is resolved

Documentation

  • Add upstream/downstream tools
  • Add basic background information
  • Describe outputs

Create tool which extracts CSV spectra from apLCMS output

apLCMS currently outputs an hdf file which contains the original R data. This output should be exported from the hdf to a CSV table with a specific format which fits RAMClustR.

Also see here for initial communication about HDF

Notes

HDF files can be read into a pandas DataFrame, and a DataFrame can be written to CSV, via

import pandas

# Read a table from HDF into a DataFrame
dataframe = pandas.read_hdf("path/to/file.h5", "name_of_the_table_or_internal_hdf_path")

# Write the DataFrame to CSV (semicolon-separated)
dataframe.to_csv("path/to/output.csv", sep=';')

Subtasks

  • Make selectable which table from the output to export
  • (Optional) Choose how to format the file (rows/cols etc.)

matchMS fails if msp file contains tabs.

Matching spectral libraries with matchMS fails if the .msp file contains tab-separated values.

See error message below:

Traceback (most recent call last):
  File "/mnt/volume/shared/galaxy/var/shed_tools/testtoolshed.g2.bx.psu.edu/repos/recetox/matchms/6a736abe431f/matchms/matchms_wrapper.py", line 60, in <module>
    main(argv=sys.argv[1:])
  File "/mnt/volume/shared/galaxy/var/shed_tools/testtoolshed.g2.bx.psu.edu/repos/recetox/matchms/6a736abe431f/matchms/matchms_wrapper.py", line 41, in main
    reference_spectra = [
  File "/mnt/volume/shared/galaxy/var/shed_tools/testtoolshed.g2.bx.psu.edu/repos/recetox/matchms/6a736abe431f/matchms/matchms_wrapper.py", line 41, in <listcomp>
    reference_spectra = [
  File "/mnt/volume/shared/galaxy/var/dependencies/_conda/envs/mulled-v1-9acac30a99c41c49633116b3db32cd657baeb355a349a258bc3d42833c66c0fe/lib/python3.8/site-packages/matchms/importing/load_from_msp.py", line 87, in load_from_msp
    for spectrum in parse_msp_file(filename):
  File "/mnt/volume/shared/galaxy/var/dependencies/_conda/envs/mulled-v1-9acac30a99c41c49633116b3db32cd657baeb355a349a258bc3d42833c66c0fe/lib/python3.8/site-packages/matchms/importing/load_from_msp.py", line 43, in parse_msp_file
    masses.append(float(splitted_line[0]))
ValueError: could not convert string to float: '116.05576\t29277'
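The traceback shows the peak line being split in a way that leaves the tab inside the first field. A tolerant split could look like the sketch below. This is not the matchms parser, only an illustration of splitting on any whitespace; `parse_peak_line` is a hypothetical helper and it handles just one m/z-intensity pair per line.

```python
def parse_peak_line(line):
    """Parse one m/z-intensity pair separated by spaces and/or tabs."""
    fields = line.split()  # no argument: splits on any run of whitespace
    if len(fields) < 2:
        raise ValueError(f"expected 'mz intensity', got: {line!r}")
    return float(fields[0]), float(fields[1])


print(parse_peak_line("116.05576\t29277"))
print(parse_peak_line("116.05576 29277"))
```

Until the parser itself is fixed, replacing tabs with spaces in the .msp file before matching would be a possible workaround.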

automate obtaining pathways from PathBank

Make a tool which downloads pathways from https://pathbank.org/downloads, filters out the pathways which don't include any compound from our database (reduced PubChem) and produces a pathway-compound pairing.

Most likely, the PathBank-PubChem pairing of compounds could be done by InChI. As of now, PathBank contains 23 compounds with no assigned InChI. For the time being, my wild guess is to ignore them, but please write their PathBank IDs verbosely to the standard output to notify the user. In this phase, don't bother with whether the pairing is correct or not.

The resulting output should be a CSV file with the following columns: recetox_pathway_id, pathbank_id, recetox_cid. The recetox_cid is the foreign key to our compound database, pathbank_id is the original pathway ID (note that it may be null), and recetox_pathway_id is our own generated ID of the pathway, since we might have a pathway database aggregated from multiple sources.
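The pairing step could be sketched as below, assuming the PathBank download has already been parsed into rows and our compound database into an InChI lookup. The dictionary keys, the helper name and the identifiers in the example are all made up for illustration, and generating recetox_pathway_id is left out.

```python
def pair_pathways(pathbank_rows, inchi_to_cid):
    """Pair PathBank pathways with our compound IDs via InChI.

    pathbank_rows: iterable of dicts with the hypothetical keys
        'pathbank_id', 'compound_id' and 'inchi'.
    inchi_to_cid: dict mapping InChI -> recetox_cid.
    Returns (pairs, skipped): pairs holds (pathbank_id, recetox_cid) tuples,
    skipped holds the PathBank compound IDs that have no InChI.
    """
    pairs, skipped = [], []
    for row in pathbank_rows:
        inchi = row.get("inchi")
        if not inchi:
            skipped.append(row["compound_id"])
            continue
        cid = inchi_to_cid.get(inchi)
        if cid is not None:
            pairs.append((row["pathbank_id"], cid))
    return pairs, skipped


# Made-up example rows.
rows = [
    {"pathbank_id": "SMP0000001", "compound_id": "PW_C000001", "inchi": "InChI=1S/H2O/h1H2"},
    {"pathbank_id": "SMP0000001", "compound_id": "PW_C000002", "inchi": ""},
    {"pathbank_id": "SMP0000002", "compound_id": "PW_C000003", "inchi": "InChI=1S/CH4/h1H4"},
]
compounds = {"InChI=1S/H2O/h1H2": 42}

pairs, skipped = pair_pathways(rows, compounds)
print(pairs)
print("Compounds without InChI:", skipped)
```

Pathways with no compound in our database simply produce no pairs, so the filtering falls out of the join.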

apLCMS fails if given larger amounts of data.

apLCMS fails if run with large amounts of data (50 samples) - see error from a run on Galaxy below.

Error in apLCMS::save_peaks_to_hdf("/mnt/volume/shared/ces-nya/nfs4/home/umsa/object_store_cache/021/dataset_21267.dat",  : 
  object 'x' not found
In addition: Warning messages:
1: In rgl.init(initValue, onlyNULL) : RGL: unable to open X11 display
2: 'rgl.init' failed, running with 'rgl.useNULL = TRUE'. 
Execution halted

A different error appears if run on 20 samples - see error from a run on Galaxy below.

**** feature extraction ****
Error in unserialize(node$con) : error reading from connection
Calls: <Anonymous> ... FUN -> recvData -> recvData.SOCKnode -> unserialize
In addition: Warning messages:
1: In rgl.init(initValue, onlyNULL) : RGL: unable to open X11 display
2: 'rgl.init' failed, running with 'rgl.useNULL = TRUE'. 
Execution halted

I also ran 20 samples locally, which results in an error in https://github.com/RECETOX/apLCMS/blob/7760a36d600ed07fb4e072f53aff8286218c1e6d/apLCMS/R/adjust.time.R#L69.
The initial file reading and feature detection, though, run without problems for a cluster size of 8 and 20 samples.
It then fails due to some error regarding namespace unavailability - might be something local, but still the program proceeds further than on our Galaxy.

Starting R session...
ℹ Loading apLCMS
Loading required package: MASS
Loading required package: rgl
This build of rgl does not include OpenGL functions.  Use
 rglwidget() to display results, e.g. via options(rgl.printRglwidget = TRUE).
Loading required package: mzR
Loading required package: Rcpp
Loading required package: splines
Loading required package: doParallel
Loading required package: foreach
Loading required package: iterators
Loading required package: parallel
Loading required package: snow

Attaching package: ‘snow’

The following objects are masked from ‘package:parallel’:

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, clusterSplit, makeCluster, parApply,
    parCapply, parLapply, parRapply, parSapply, splitIndices,
    stopCluster

Loading required package: gbm
Loaded gbm 2.1.8
Loading required package: e1071
Loading required package: randomForest
randomForest 4.6-14
Type rfNews() to see new features/changes/bug fixes.
Loading required package: ROCR
Loading required package: ROCS
Loading required package: poibin
load("~/filenames_mzml")

x <- apLCMS::unsupervised(filenames_mzml[1:20], cluster=8)
**** feature extraction ****
 unsupervised.R
**** time correction ****
 unsupervised.R
**** performing time correction ****
 adjust.time.R
m/z tolerance level:  8.9213055807511e-06
 adjust.time.R
time tolerance level: 344.054109620073
 adjust.time.R
the template is sample 20
 adjust.time.R
                to.flip <- which(this.comb[sel, 3] == j)
                temp <- all.ftr.table[to.flip, 2]
                all.ftr.table[to.flip, 2] <- all.ftr.table[to.flip,
                  1]
                all.ftr.table[to.flip, 1] <- temp
                cat(c("sample", j, "using", nrow(all.ftr.table),
                  ","))
                if (j%%3 == 0)
                  cat("\n")
                all.ftr.table <- all.ftr.table[order(all.ftr.table[,
                  2]), ]
                this.dev <- all.ftr.table[, 2]
                aver.time <- all.ftr.table[, 1] - this.dev
                this.feature <- this.feature[order(this.feature[,
                  2], this.feature[, 1]), ]
                this.corrected <- this.old <- this.feature[,
                  2]
                to.correct <- this.old[this.old >= min(this.dev) &
                  this.old <= max(this.dev)]
                this.smooth <- ksmooth(this.dev, aver.time, kernel = "normal",
                  bandwidth = (max(this.dev) - min(this.dev))/5,
                  x.points = to.correct)
                this.corrected[this.old >= min(this.dev) & this.old <=
                  max(this.dev)] <- this.smooth$y + to.correct
                this.corrected[this.old < min(this.dev)] <- this.corrected[this.old <
                  min(this.dev)] + mean(this.smooth$y[this.smooth$x ==
                  min(this.smooth$x)])
                this.corrected[this.old > max(this.dev)] <- this.corrected[this.old >
                  max(this.dev)] + mean(this.smooth$y[this.smooth$x ==
                  max(this.smooth$x)])
                this.feature[, 2] <- this.corrected
                this.feature <- this.feature[order(this.feature[,
                  1], this.feature[, 2]), ]
            }
        }
        this.feature
    }
}

Error in frameTypes(env) :
  non-namespace found within namespace environments
        if (is.null(err)) {
            message <- geterrmessage()
        }
        else {
            attributes(err) <- list()
            message <- err
        }
        body <- list(message = message)
        sendWriteToStdinEvent("", when = "browserPrompt", count = 0)
        session$clearStackTree <- TRUE
        sendStoppedEvent("exception", description = "Stopped on Exception",
            text = message)
        browser()
    }
    else {
        logPrint("showing traceback!!!")
        traceback()
        unregisterEntryFrame()
    }
})()

I initially thought that the failure occurs because all files are loaded into RAM and then processed, but this is not the case: each core loads a file and then does peak detection, so only the features of all files have to be held in RAM, which is feasible.

Create a tool for batch normalization

Research

  • Select an appropriate algorithm for normalization (options: ComBat, WaveICA, NormAE)
    -> WaveICA (R package)
  • Identify the proper input format

Integration

  • Create a Docker image
  • Create a galaxy wrapper

Further steps

  • Change inputs for the tool once we handle metadata in a workflow
  • Add Galaxy tests
  • Add "Workflow position" section once we have all the up/downstream tools

Change MatchMS CSV output identifier column

The first column - containing the reference compound identifier - currently contains the name, which can contain commas and therefore breaks the CSV structure. The name could instead be appended as the last column, and the first column could be the chemical formula, which doesn't contain commas.
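For comparison, the sketch below shows how Python's standard csv module handles a name containing a comma by quoting the field. It only illustrates the quoting behaviour; whether the downstream tools parse quoted CSV fields correctly is an open assumption, and the compound name and score are made up.

```python
import csv
import io

buffer = io.StringIO()
writer = csv.writer(buffer)  # default QUOTE_MINIMAL quotes only fields that need it
writer.writerow(["reference", "query_1"])
writer.writerow(["1,2-Dichloroethane", 0.87])
text = buffer.getvalue()
print(text)

# Reading the text back with csv.reader restores the original field boundaries.
rows = list(csv.reader(io.StringIO(text)))
print(rows)
```

Tools that split lines on every comma without honouring quotes would still break, which is an argument in favour of the column reordering proposed above.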

Zenodo Authors

I had a look at the Zenodo DOI and noticed that I'm listed as hechth - which is somewhat not my real, full name (though sounds very charming!) :D

Would it be possible to change that somehow?

Add tolerance parameter to matchMS

Some similarity metrics implemented in matchms take a tolerance parameter for the peak matching.
It should be possible to set the value for this parameter in our Galaxy tool.

apLCMS to RAMClustR: Tool fails on actual apLCMS output

The tool fails if I try to get the "Aligned Peaks" table or the "Peaks" table; see the attached log files.
2020_12_16_apLCMS_RAMClustR_error_log_aligned_peaks.txt
2020_12_16_apLCMS_RAMClustR_error_log_peaks.txt

@martenson I used an input imported from another history - might that cause this problem?

Update 2020/12/16/10:11
The error was related to an invalid history item.
I replaced it with a valid h5 file, but the tool fails without an error message.

apLCMS: Improve documentation to give example parameters

The tolerance is given in absolute numbers, not scaled, so an example would be good (e.g. for a 10 ppm tolerance, pass 1e-05).

<param name="align_mz_tol" type="float" optional="true"
label="align_mz_tol (optional)"
help="The m/z tolerance level for peak alignment. The default is NA, which allows the program to search for the tolerance level based on the data. This value is expressed as the percentage of the m/z value. This value, multiplied by the m/z value, becomes the cutoff level." />
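The conversion behind such an example is small enough to write out. The sketch below assumes the parameter is a plain fraction of the m/z value, as the 10 ppm example suggests; the function names are hypothetical.

```python
def ppm_to_relative(ppm):
    """Convert a tolerance in parts per million to a plain fraction."""
    return ppm / 1_000_000


def mz_cutoff(mz, ppm):
    """Absolute m/z cutoff: the relative tolerance multiplied by the m/z value."""
    return mz * ppm_to_relative(ppm)


print(ppm_to_relative(10))
print(mz_cutoff(200.0, 10))
```

For a peak at m/z 200, a 10 ppm tolerance therefore corresponds to a cutoff of about 0.002.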

Add workflows to merge metadata with samples data

As @martenson pointed out I should've looked for existing tools before developing our own. There is an existing tool to merge metadata and sample data (Toolshed link). And with some changes to a few existing tools (just change separators) it should work just fine in our workflows.

It will throw an error if there are any inconsistencies between the data matrix and the metadata table, though. I don't know how critical that is. Is it possible that some samples will be missing metadata?
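A quick pre-check could answer that question before the merge is run. The sketch below compares the sample names from both tables; it assumes the names have already been extracted into lists, and the helper name and sample names are hypothetical.

```python
def check_sample_metadata(matrix_samples, metadata_samples):
    """Return (missing_metadata, unused_metadata) as sorted lists.

    missing_metadata: samples in the data matrix without a metadata row.
    unused_metadata: metadata rows that match no sample in the data matrix.
    """
    matrix, metadata = set(matrix_samples), set(metadata_samples)
    return sorted(matrix - metadata), sorted(metadata - matrix)


missing, unused = check_sample_metadata(
    ["sample_1", "sample_2", "sample_3"],
    ["sample_1", "sample_3", "blank_1"],
)
print("Samples without metadata:", missing)
print("Metadata without samples:", unused)
```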

investigate and fix retip docker image

Currently, there are platforms where the Galaxy tool fails with a C stack overflow error, including the GitHub Actions CI (hence the tests will never pass here).

Logs are available in the CI run at #13.
