phenomecentre / npyc-toolbox
The nPYc-Toolbox defines objects for representing, and implements functions to manipulate and display, metabolic profiling datasets.
License: MIT License
If an xlim is specified, bins are still calculated on the full range of the data, not on the subset.
Batch correction fails when a batch contains a set of samples without SR (i.e. dummy conditioning samples). Instead, it should output a warning and apply drift correction only to the valid batches.
A deprecation warning is raised when importing numpy.matlib on Python 3.9 with the latest version of numpy. At the moment it is only a warning, not an error, but we should fix this as soon as possible.
Here is the warning, for reference.
"""
.../PycharmProjects/nPYc-Toolbox/nPYc/multivariate/multivariateUtilities.py:2: PendingDeprecationWarning:
Importing from numpy.matlib is deprecated since 1.19.0. The matrix subclass is not the recommended way to represent matrices or deal with linear algebra (see https://docs.scipy.org/doc/numpy/user/numpy-for-matlab-users.html). Please adjust your code to use regular ndarray.
"""
At the moment the JSON SOP files for MS targeted assays are awkward for most users to edit. It would be better to also support reading these as CSV files, so users could edit the compound names, equations, etc. in Excel, or even merge this information with the compound calibration report.
This should be done after refactoring TargetedDataset, though.
In nPYc.reports.generateReport(data, 'batch correction assessment'):
879 if sum(lrMask) == 0:
880     raise ValueError('No %s samples defined with an AssayRole of %s' % (sampleType, assayRole))
This seems a little restrictive.
Would be good to adapt the TIC plotting functions to:
The batch correction module should have an argument to specify which specific sample of "Precision Reference" type should be used for LOWESS correction.
The Y axis sometimes does not seem to scale with the data?
Needs a fallback scatter plot when spectrum / ion map is not appropriate.
Fix axis labels in the correlation to dilution and RSD heatmap (LC-MS feature summary report)
The seaborn swarmplot function now automatically downsamples the number of points displayed to avoid overcrowding the plots.
This can cause some confusion, as the total number of points is arbitrarily downsampled on the fly.
A straightforward solution is to replace the calls to swarmplot() with stripplot().
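A minimal sketch of the suggested swap; the data frame and column names here are illustrative, not the toolbox's own:

```python
import matplotlib
matplotlib.use('Agg')  # headless backend for scripted report generation
import pandas as pd
import seaborn as sns

# Illustrative data: two sample types with 50 points each.
df = pd.DataFrame({'SampleType': ['SS'] * 50 + ['SR'] * 50,
                   'Intensity': list(range(100))})

# Before: sns.swarmplot(...) may silently downsample overcrowded points.
# After: stripplot jitters overlapping points but always draws every one.
ax = sns.stripplot(data=df, x='SampleType', y='Intensity', jitter=True)
```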
The import and QC features for both targeted NMR (Bruker IVDr methods) and LC-QqQ MS assays use the same general TargetedDataset. However, NMR targeted methods are conceptually very simple, while LC-QqQ requires a set of specific extra attributes. Maintaining both feature sets in a single object makes the targeted QC features much harder to debug and improve, so these should be split into separate TargetedDataset objects (which may or may not inherit from an abstract TargetedDataset base).
When no element in the 'LOD' column is '-', import crashes on line 1113 in _loadBrukerXMLDataset():
self.featureMetadata.loc[self.featureMetadata['LOD'] == '-', 'LOD'] = numpy.nan
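A hedged alternative that avoids the string comparison entirely: pandas.to_numeric with errors='coerce' maps '-' (or any other non-numeric marker) to NaN whether or not one is present. This is a sketch, not a drop-in patch for _loadBrukerXMLDataset:

```python
import numpy as np
import pandas as pd

# Works whether 'LOD' contains '-' placeholders or is already fully numeric.
featureMetadata = pd.DataFrame({'LOD': ['0.5', '-', '1.2']})
featureMetadata['LOD'] = pd.to_numeric(featureMetadata['LOD'], errors='coerce')
```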
Passing "reference" to the PQN normaliser is not implemented.
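For reference, a minimal numpy sketch of PQN with an explicit reference spectrum. The pqn_normalise function and its signature are illustrative and do not reflect the nPYc Normaliser API:

```python
import numpy as np

def pqn_normalise(X, reference=None):
    """Probabilistic quotient normalisation.

    X: (n_samples, n_features) intensity matrix.
    reference: optional 1-D reference spectrum; defaults to the median spectrum.
    """
    if reference is None:
        reference = np.median(X, axis=0)
    quotients = X / reference               # per-feature fold changes
    factors = np.median(quotients, axis=1)  # most probable dilution per sample
    return X / factors[:, np.newaxis]

X = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0]])  # second sample is a 2x dilution of the first
Xn = pqn_normalise(X, reference=np.array([1.0, 2.0, 3.0]))
```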
https://npyc-toolbox.readthedocs.io/en/latest/enumerations.html
Some enumerations have no default value for when the expected enum option is unknown.
For example, when importing sample metadata, if the 'Sample Type' is blank or not in the expected list (StudySample, StudyPool, ExternalReference, MethodReference, ProceduralBlank), it may be set incorrectly to StudySample, potentially making any downstream analysis imprecise or inaccurate.
Please add another Enum choice:
Unknown = 'Unknown' for all Enums with string values
Unknown = 0 for all Enums with integer values
Please make sure in the toolbox code that whenever a blank, NA, or null value is encountered, the correct Unknown is assigned.
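A sketch of one way to implement the fallback with the standard library's enum module. The member list mirrors the sample types above, but the _missing_ hook is a suggestion, not the toolbox's current behaviour:

```python
from enum import Enum

class SampleType(Enum):
    StudySample = 'StudySample'
    StudyPool = 'StudyPool'
    ExternalReference = 'ExternalReference'
    MethodReference = 'MethodReference'
    ProceduralBlank = 'ProceduralBlank'
    Unknown = 'Unknown'

    @classmethod
    def _missing_(cls, value):
        # Blank, None, or unrecognised values map to Unknown instead of
        # raising ValueError (or silently defaulting to StudySample).
        return cls.Unknown
```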
This may be an issue in pyChemometrics: cross-validation with small sample numbers?
======================================================================
ERROR: test_plotScores_raises (test_plotting.test_plotting)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/Users/jtpearce/Dropbox (Personal)/Development/nPYc-toolbox/Tests/test_plotting.py", line 849, in test_plotScores_raises
pcaModel = nPYc.multivariate.exploratoryAnalysisPCA(dataset)
File "../nPYc/multivariate/exploratoryAnalysisPCA.py", line 86, in exploratoryAnalysisPCA
raise exp
File "../nPYc/multivariate/exploratoryAnalysisPCA.py", line 60, in exploratoryAnalysisPCA
scree_cv = PCAmodel._screecv_optimize_ncomps(data, total_comps=maxComponents, stopping_condition=minQ2, **kwargs)
File "/Users/jtpearce/Dropbox (Personal)/Development/pyChemometrics/pyChemometrics/ChemometricsPCA.py", line 572, in _screecv_optimize_ncomps
currmodel.cross_validation(x, outputdist=False, cv_method=cv_method, press_impute=False)
File "/Users/jtpearce/Dropbox (Personal)/Development/pyChemometrics/pyChemometrics/ChemometricsPCA.py", line 498, in cross_validation
cv_loads.append(np.array([x[comp] for x in loadings]))
File "/Users/jtpearce/Dropbox (Personal)/Development/pyChemometrics/pyChemometrics/ChemometricsPCA.py", line 498, in <listcomp>
cv_loads.append(np.array([x[comp] for x in loadings]))
IndexError: index 8 is out of bounds for axis 0 with size 8
----------------------------------------------------------------------
Either horizontally:
VariableType.Discrete datasets
Or vertically:
Can build on functionality already in TargetedDataset
The current release fails when reading NMR data from different Windows drives.
This seems to be caused by issues with relative paths in _getMetadataFromBruker:
localPath = os.path.normpath(os.path.join(os.path.relpath(path), inputFile))
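One possible fix is to build an absolute path instead: os.path.relpath raises ValueError on Windows when the path and the current working directory sit on different drives, while os.path.abspath does not. A sketch (locate_input_file is a hypothetical helper, not toolbox code):

```python
import os

def locate_input_file(path, inputFile):
    # os.path.relpath(path) fails on Windows if `path` is on a different
    # drive from the working directory; an absolute path avoids this.
    return os.path.normpath(os.path.join(os.path.abspath(path), inputFile))
```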
This has now been published: https://academic.oup.com/bioinformatics/article/35/24/5359/5539689
Add to welcome page and documentation
These columns are optional in sampleMetadata and should not be hard-coded in ISATAB export.
Configure the GitHub Actions for this repository
It would be good to have interactive plots showing the intensity of an individual feature across run order, similar to the static plots in the batch correction assessment.
Refactor current featureMetadata fields to add improved "formal" fields for compound annotations.
The goal is to allow a fuller text description of ion and compound IDs at the same time. For example, the Feature Name/Unique ID could refer to a specific annotation: 'Histidine [M+H]+'. Then, Chemical/Compound ID and Name fields could store unique identifiers for the annotated chemical compounds (Compound ID: ChEBI/PubChem ID; Compound Name: Histidine).
This would require the following tasks:
Failure occurs in approx 1% of runs:
======================================================================
ERROR: test_lineWidth_sf (test_utilities.test_utilities_linewidth)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/Users/jtpearce/Dropbox (Personal)/Development/nPYc-toolbox/Tests/test_utilities.py", line 1016, in test_lineWidth_sf
calculatedLW = lineWidth(x, self.y, sf, [-5, 5], multiplicity='singlet')
File "../nPYc/utilities/_lineWidth.py", line 29, in lineWidth
fit = fitPeak(X, ppm, peakRange, multiplicity, parameters=parameters, maxLW=maxLW, estLW=estLW, shiftTollerance=shiftTollerance)
File "../nPYc/utilities/_fitPeak.py", line 305, in fitPeak
fit = peak.fit(spec, pars, x=localPPM)
File "/Users/jtpearce/anaconda/lib/python3.6/site-packages/lmfit/model.py", line 736, in fit
output.fit(data=data, weights=weights)
File "/Users/jtpearce/anaconda/lib/python3.6/site-packages/lmfit/model.py", line 951, in fit
_ret = self.minimize(method=self.method)
File "/Users/jtpearce/anaconda/lib/python3.6/site-packages/lmfit/minimizer.py", line 1649, in minimize
return function(**kwargs)
File "/Users/jtpearce/anaconda/lib/python3.6/site-packages/lmfit/minimizer.py", line 1408, in leastsq
eval_stderr(par, uvars, result.var_names, params)
File "/Users/jtpearce/anaconda/lib/python3.6/site-packages/lmfit/minimizer.py", line 108, in eval_stderr
uval = wrap_ueval(*uvars, _obj=obj, _names=_names, _pars=_pars)
File "/Users/jtpearce/anaconda/lib/python3.6/site-packages/lmfit/uncertainties/__init__.py", line 696, in f_with_affine_output
if arg.derivatives
File "/Users/jtpearce/anaconda/lib/python3.6/site-packages/lmfit/uncertainties/__init__.py", line 492, in partial_derivative_of_f
return (shifted_f_plus - shifted_f_minus)/2/step
ZeroDivisionError: float division by zero
----------------------------------------------------------------------
Implement an Include (sample/feature) method for Dataset objects which would erase the Exclusion Details information and allow re-inclusion directly based on name or other metadata (same syntax as excludeSamples/excludeFeatures).
Hi, nPYC team!
I encounter this issue when running LC-MS pipeline:
dataset.excludeFeatures(dataset.featureMetadata[dataset.featureMetadata['Retention Time'] < 0.6]['Feature Name'], on='Feature Name', message='Outside RT limits')
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
This arises because there are duplicated feature names, which are created when the XCMS m/z and RT values are merged into a string; there are 40 pairs of duplicate names in my dataset.
I'll fix it for now by creating a new 'Feature Name' column.
Thanks!
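Until the importer handles this, a hedged workaround is to make the names unique before calling excludeFeatures, e.g. by suffixing repeats with a counter (the metadata below is illustrative):

```python
import pandas as pd

# Illustrative metadata with duplicated XCMS-derived names.
featureMetadata = pd.DataFrame(
    {'Feature Name': ['100.05_1.2', '100.05_1.2', '200.10_3.4']})

# Suffix repeated names with a per-group counter so every feature
# becomes uniquely addressable by 'Feature Name'.
counts = featureMetadata.groupby('Feature Name').cumcount()
dup = featureMetadata['Feature Name'].duplicated(keep=False)
featureMetadata.loc[dup, 'Feature Name'] = (
    featureMetadata.loc[dup, 'Feature Name'] + '_' + counts[dup].astype(str))
```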
Each build in the matrix tries to push to PyPI, but only one can succeed.
Need to investigate Travis build stages to resolve: https://docs.travis-ci.com/user/build-stages/
Hi NPC,
It would be great to be able to export the SOP object at the end of an analysis, so that a record of the overwritten method SOP options could be kept. This would aid reproducibility and general record keeping of analyses. Essentially, a JSON dump of the updated SOP attributes.
Thanks!
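A minimal sketch of such a dump, assuming the SOP options live in a plain dictionary (e.g. a Dataset's Attributes); the file name and example keys are illustrative:

```python
import json

# `attributes` stands in for the dataset's SOP dictionary; entries that are
# not directly JSON-serialisable are stringified via `default=str`.
attributes = {'methodName': 'GenericNMRurine', 'LWpeakRange': [-0.1, 0.1]}
with open('sop_record.json', 'w') as fh:
    json.dump(attributes, fh, indent=2, default=str)
```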
There are no unit tests covering parsing of Bruker IVDr BI-QUANT v2.0 files.
We should:
The Bruker IVDr BI-QUANT-PS input functions need to be refactored for version 2.0.
The 'nonposy' argument in matplotlib axes set_xscale and set_yscale has been deprecated in favour of 'nonpositive'. This causes issues with many of the current plotting functions, for example, _plotTIC or _plotRDS.
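A hedged compatibility sketch that works on both sides of the deprecation, trying the new keyword first and falling back to the old one:

```python
import matplotlib
matplotlib.use('Agg')  # headless backend for scripted use
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
try:
    # matplotlib >= 3.3 accepts only 'nonpositive'
    ax.set_yscale('log', nonpositive='clip')
except (TypeError, ValueError):
    # older releases still expect the deprecated 'nonposy' keyword
    ax.set_yscale('log', nonposy='clip')
```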
To apply a function to all samples or features
The NMR data format produces a lot of small files which are very inconvenient for storage and data transfer. It would be good to add functionality to import a dataset directly from a .zip or other compressed archives.
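A sketch of the idea using the standard library: extract the archive to a temporary directory and point the existing importer at it (the unpack_archive helper and the NMRDataset call in the comment are illustrative):

```python
import tempfile
import zipfile

def unpack_archive(archive_path):
    """Extract a zipped Bruker experiment tree; return the extraction root."""
    extractDir = tempfile.mkdtemp()
    with zipfile.ZipFile(archive_path) as zf:
        zf.extractall(extractDir)
    return extractDir

# e.g. nPYc.NMRDataset(unpack_archive('study.zip'), pulseProgram='noesygppr1d')
```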
There is a lot of LC-QqQ quality control functionality which is not represented in the tutorials. We also have no example dataset, so it would be good to:
Probably should be done more intensively after #34
And fix where necessary
Tutorial.rst is not as clear as it could be; possibly split the import instructions from the NMR- and MS-specific parts?
Hi Team,
Let's discuss this, but it would be useful to save multiple versions of the MV analytical report (for example on different sample types) in the same folder. At the moment any previous files are deleted (line 121 in multivariateReport.py) when a new report is generated.
Cheers,
Caroline
Some of the reports output a plain text message (not a warning) when there are no linearity reference samples. This is particularly noticeable when running the feature selection report. A warning should be raised instead, which can then be suppressed on repeated runs.
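A sketch of the replacement, assuming a hypothetical check site (check_linearity_reference is illustrative, not a toolbox function); a warning raised with warnings.warn can be silenced by callers on repeated runs:

```python
import warnings

def check_linearity_reference(nLRSamples):
    # Replace the bare print with a suppressible UserWarning.
    if nLRSamples == 0:
        warnings.warn('No linearity reference samples found; '
                      'linearity sections will be skipped.', UserWarning)

# Callers re-running a report can then silence the repeat, e.g.:
# warnings.filterwarnings('ignore', message='No linearity reference samples.*')
```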
Add functionality to facilitate exporting large datasets to Metabolights DB.
The sklearn.decomposition.base module is deprecated in version 0.22 and will be removed in version 0.24. The corresponding classes / functions should instead be imported from sklearn.decomposition. Anything that cannot be imported from sklearn.decomposition is now part of the private API.
To avoid generating gigantic figures that trip:
ValueError: Image size of 3300x119700 pixels is too large. It must be less than 2^16 in each direction.
To improve performance
When v2-M is released
Add parsers for Skyline's data format in the LC-QqQ targeted datasets.
My data table was made using XCMS's peakTable function. Importing this data with nPYc.MSDataset excluded the first 5 samples in my case, due to the default in _loadXCMSDataset (line 495): def _loadXCMSDataset(self, path, noFeatureParams=14).
This number varies depending on the number of classes in the data and whether peakTable or diffreport is used, so it might need to be user-specified in order to reduce import issues and inadvertent data exclusion.
Missing optional dependency 'xlrd'. Install xlrd >= 1.0.0 for Excel support Use pip or conda to install xlrd.
When repeating the 'feature summary' report after sample exclusions (made with updateMasks or manually) in an NMRDataset, the final summary table contains wrong information the second time.