phenomecentre / npyc-toolbox
The nPYc-Toolbox defines objects for representing, and implements functions to manipulate and display, metabolic profiling datasets.
License: MIT License
If an xlim is specified, bins are still calculated on the full range of the data, not on the subset.
Batch correction fails when a batch contains a set of samples without SR (i.e. dummy conditioning samples). Instead, it should output a warning and apply drift correction only to the valid batches.
A deprecation warning is raised when importing numpy.matlib on Python 3.9 with the latest version of numpy. At the moment it is only a warning, not an error, but we should fix this as soon as possible.
Here is the warning, for reference.
"""
.../PycharmProjects/nPYc-Toolbox/nPYc/multivariate/multivariateUtilities.py:2: PendingDeprecationWarning:
Importing from numpy.matlib is deprecated since 1.19.0. The matrix subclass is not the recommended way to represent matrices or deal with linear algebra (see https://docs.scipy.org/doc/numpy/user/numpy-for-matlab-users.html). Please adjust your code to use regular ndarray.
"""
At the moment the JSON SOP files for MS targeted assays are awkward for most users to edit. It would be better to also support reading these as CSV files, so users could edit the compound names, equations, etc. in Excel, or even merge this information with the compound calibration report.
This should be done after refactoring TargetedDataset, though.
In nPYc.reports.generateReport(data, 'batch correction assessment'):
879 if sum(lrMask) == 0:
880     raise ValueError('No %s samples defined with an AssayRole of %s' % (sampleType, assayRole))
This seems a little restrictive.
Would be good to adapt the TIC plotting functions to:
The batch correction module should have an argument to specify which specific sample of "Precision Reference" type should be used for LOWESS correction.
The Y axis sometimes does not seem to scale with the data?
Needs a fallback scatter plot when spectrum / ion map is not appropriate.
Fix axis labels in the correlation to dilution and RSD heatmap (LC-MS feature summary report)
The seaborn swarmplot function now automatically downsamples the number of points displayed to avoid overcrowding the plots.
This can cause some confusion, as the total number of points is arbitrarily downsampled on the fly.
A straightforward solution is to replace the calls to swarmplot() with stripplot().
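A minimal sketch of the suggested swap; the data frame and column names here are illustrative, not the toolbox's own:

```python
import matplotlib
matplotlib.use('Agg')  # headless backend for scripted report generation
import pandas as pd
import seaborn as sns

# Illustrative data: two sample types with 50 points each.
df = pd.DataFrame({'SampleType': ['SS'] * 50 + ['SR'] * 50,
                   'Intensity': list(range(100))})

# Before: sns.swarmplot(...) may silently downsample overcrowded points.
# After: stripplot jitters overlapping points but always draws every one.
ax = sns.stripplot(data=df, x='SampleType', y='Intensity', jitter=True)
```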
The import and QC features for both targeted NMR (Bruker IVDr methods) and LC-QqQ MS assays use the same general TargetedDataset. However, NMR targeted methods are conceptually very simple, while LC-QqQ requires a set of specific extra attributes. Maintaining both feature sets in a single object makes the targeted QC features much harder to debug and improve, so these should be split into separate TargetedDataset objects (which may or may not inherit from an abstract TargetedDataset base).
When no element in the 'LOD' column is '-', import crashes on line 1113 in _loadBrukerXMLDataset():
self.featureMetadata.loc[self.featureMetadata['LOD'] == '-', 'LOD'] = numpy.nan
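A hedged alternative that avoids the string comparison entirely: pandas.to_numeric with errors='coerce' maps '-' (or any other non-numeric marker) to NaN whether or not one is present. This is a sketch, not a drop-in patch for _loadBrukerXMLDataset:

```python
import numpy as np
import pandas as pd

# Works whether 'LOD' contains '-' placeholders or is already fully numeric.
featureMetadata = pd.DataFrame({'LOD': ['0.5', '-', '1.2']})
featureMetadata['LOD'] = pd.to_numeric(featureMetadata['LOD'], errors='coerce')
```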
Passing "reference" to the PQN normaliser is not implemented.
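For reference, a minimal numpy sketch of PQN with an explicit reference spectrum. The pqn_normalise function and its signature are illustrative and do not reflect the nPYc Normaliser API:

```python
import numpy as np

def pqn_normalise(X, reference=None):
    """Probabilistic quotient normalisation.

    X: (n_samples, n_features) intensity matrix.
    reference: optional 1-D reference spectrum; defaults to the median spectrum.
    """
    if reference is None:
        reference = np.median(X, axis=0)
    quotients = X / reference               # per-feature fold changes
    factors = np.median(quotients, axis=1)  # most probable dilution per sample
    return X / factors[:, np.newaxis]

X = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0]])  # second sample is a 2x dilution of the first
Xn = pqn_normalise(X, reference=np.array([1.0, 2.0, 3.0]))
```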
https://npyc-toolbox.readthedocs.io/en/latest/enumerations.html
Some enumerations have no default value for when the expected enum option is unknown.
For example, when importing sample metadata, if the 'Sample Type' is blank or not in the expected list (StudySample, StudyPool, ExternalReference, MethodReference, ProceduralBlank), it may be set incorrectly to StudySample, potentially making any downstream analysis imprecise or inaccurate.
Please add another Enum choice:
Unknown = 'Unknown' for all Enums with string values
Unknown = 0 for all Enums with integer values
Please make sure in the toolbox code that whenever a blank, NA, or null value is encountered, the correct Unknown is assigned.
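A sketch of one way to implement the fallback with the standard library's enum module. The member list mirrors the sample types above, but the _missing_ hook is a suggestion, not the toolbox's current behaviour:

```python
from enum import Enum

class SampleType(Enum):
    StudySample = 'StudySample'
    StudyPool = 'StudyPool'
    ExternalReference = 'ExternalReference'
    MethodReference = 'MethodReference'
    ProceduralBlank = 'ProceduralBlank'
    Unknown = 'Unknown'

    @classmethod
    def _missing_(cls, value):
        # Blank, None, or unrecognised values map to Unknown instead of
        # raising ValueError (or silently defaulting to StudySample).
        return cls.Unknown
```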
This may be an issue in pyChemometrics: cross-validation with small sample numbers?
======================================================================
ERROR: test_plotScores_raises (test_plotting.test_plotting)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/Users/jtpearce/Dropbox (Personal)/Development/nPYc-toolbox/Tests/test_plotting.py", line 849, in test_plotScores_raises
pcaModel = nPYc.multivariate.exploratoryAnalysisPCA(dataset)
File "../nPYc/multivariate/exploratoryAnalysisPCA.py", line 86, in exploratoryAnalysisPCA
raise exp
File "../nPYc/multivariate/exploratoryAnalysisPCA.py", line 60, in exploratoryAnalysisPCA
scree_cv = PCAmodel._screecv_optimize_ncomps(data, total_comps=maxComponents, stopping_condition=minQ2, **kwargs)
File "/Users/jtpearce/Dropbox (Personal)/Development/pyChemometrics/pyChemometrics/ChemometricsPCA.py", line 572, in _screecv_optimize_ncomps
currmodel.cross_validation(x, outputdist=False, cv_method=cv_method, press_impute=False)
File "/Users/jtpearce/Dropbox (Personal)/Development/pyChemometrics/pyChemometrics/ChemometricsPCA.py", line 498, in cross_validation
cv_loads.append(np.array([x[comp] for x in loadings]))
File "/Users/jtpearce/Dropbox (Personal)/Development/pyChemometrics/pyChemometrics/ChemometricsPCA.py", line 498, in <listcomp>
cv_loads.append(np.array([x[comp] for x in loadings]))
IndexError: index 8 is out of bounds for axis 0 with size 8
----------------------------------------------------------------------
Either horizontally:
VariableType.Discrete datasets
Or vertically:
Can build on functionality already in TargetedDataset
The current release fails when reading NMR data from different Windows drives.
This seems to be caused by issues with relative paths in _getMetadataFromBruker:
localPath = os.path.normpath(os.path.join(os.path.relpath(path), inputFile))
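One possible fix is to build an absolute path instead: os.path.relpath raises ValueError on Windows when the path and the current working directory sit on different drives, while os.path.abspath does not. A sketch (locate_input_file is a hypothetical helper, not toolbox code):

```python
import os

def locate_input_file(path, inputFile):
    # os.path.relpath(path) fails on Windows if `path` is on a different
    # drive from the working directory; an absolute path avoids this.
    return os.path.normpath(os.path.join(os.path.abspath(path), inputFile))
```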
This has now been published: https://academic.oup.com/bioinformatics/article/35/24/5359/5539689
Add to welcome page and documentation
These columns are optional in sampleMetadata and should not be hard-coded in ISATAB export.
Configure the GitHub Actions for this repository
It would be good to have interactive plots showing the intensity of an individual feature across run order, similar to the static plots in the batch correction assessment.
Refactor current featureMetadata fields to add improved "formal" fields for compound annotations.
The goal is to allow a fuller text description of ion and compound IDs at the same time. For example, the Feature Name/Unique ID could refer to a specific annotation: 'Histidine [M+H]+'. Then, Chemical/Compound ID and Name fields could store unique identifiers for the annotated chemical compounds (Compound ID: ChEBI/PubChem ID; Compound Name: Histidine).
This would require the following tasks:
Failure occurs in approx 1% of runs:
======================================================================
ERROR: test_lineWidth_sf (test_utilities.test_utilities_linewidth)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/Users/jtpearce/Dropbox (Personal)/Development/nPYc-toolbox/Tests/test_utilities.py", line 1016, in test_lineWidth_sf
calculatedLW = lineWidth(x, self.y, sf, [-5, 5], multiplicity='singlet')
File "../nPYc/utilities/_lineWidth.py", line 29, in lineWidth
fit = fitPeak(X, ppm, peakRange, multiplicity, parameters=parameters, maxLW=maxLW, estLW=estLW, shiftTollerance=shiftTollerance)
File "../nPYc/utilities/_fitPeak.py", line 305, in fitPeak
fit = peak.fit(spec, pars, x=localPPM)
File "/Users/jtpearce/anaconda/lib/python3.6/site-packages/lmfit/model.py", line 736, in fit
output.fit(data=data, weights=weights)
File "/Users/jtpearce/anaconda/lib/python3.6/site-packages/lmfit/model.py", line 951, in fit
_ret = self.minimize(method=self.method)
File "/Users/jtpearce/anaconda/lib/python3.6/site-packages/lmfit/minimizer.py", line 1649, in minimize
return function(**kwargs)
File "/Users/jtpearce/anaconda/lib/python3.6/site-packages/lmfit/minimizer.py", line 1408, in leastsq
eval_stderr(par, uvars, result.var_names, params)
File "/Users/jtpearce/anaconda/lib/python3.6/site-packages/lmfit/minimizer.py", line 108, in eval_stderr
uval = wrap_ueval(*uvars, _obj=obj, _names=_names, _pars=_pars)
File "/Users/jtpearce/anaconda/lib/python3.6/site-packages/lmfit/uncertainties/__init__.py", line 696, in f_with_affine_output
if arg.derivatives
File "/Users/jtpearce/anaconda/lib/python3.6/site-packages/lmfit/uncertainties/__init__.py", line 492, in partial_derivative_of_f
return (shifted_f_plus - shifted_f_minus)/2/step
ZeroDivisionError: float division by zero
----------------------------------------------------------------------
Implement an Include (sample/feature) method for Dataset objects which would erase the Exclusion Details information and allow re-inclusion directly based on name or other metadata (same syntax as excludeSamples/excludeFeatures).
Hi, nPYC team!
I encounter this issue when running LC-MS pipeline:
dataset.excludeFeatures(dataset.featureMetadata[dataset.featureMetadata['Retention Time'] < 0.6]['Feature Name'], on='Feature Name', message='Outside RT limits')
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
This arises because there are duplicated feature names, which are created when the XCMS m/z and RT values are merged into a string; there are 40 pairs of duplicate names in my dataset.
I'll fix it for now by creating a new 'Feature Name' column.
Thanks!
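Until the importer handles this, a hedged workaround is to make the names unique before calling excludeFeatures, e.g. by suffixing repeats with a counter (the metadata below is illustrative):

```python
import pandas as pd

# Illustrative metadata with duplicated XCMS-derived names.
featureMetadata = pd.DataFrame(
    {'Feature Name': ['100.05_1.2', '100.05_1.2', '200.10_3.4']})

# Suffix repeated names with a per-group counter so every feature
# becomes uniquely addressable by 'Feature Name'.
counts = featureMetadata.groupby('Feature Name').cumcount()
dup = featureMetadata['Feature Name'].duplicated(keep=False)
featureMetadata.loc[dup, 'Feature Name'] = (
    featureMetadata.loc[dup, 'Feature Name'] + '_' + counts[dup].astype(str))
```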
Each build in the matrix tries to push to PyPI, but only one can succeed.
Need to investigate Travis build stages to resolve: https://docs.travis-ci.com/user/build-stages/
Hi NPC,
It would be great to be able to export the SOP object at the end of an analysis, so that a record of the overwritten method SOP options could be kept. This would aid reproducibility and general record keeping of analyses. Essentially, a JSON dump of the updated SOP attributes.
Thanks!
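A minimal sketch of such a dump, assuming the SOP options live in a plain dictionary (e.g. a Dataset's Attributes); the file name and example keys are illustrative:

```python
import json

# `attributes` stands in for the dataset's SOP dictionary; entries that are
# not directly JSON-serialisable are stringified via `default=str`.
attributes = {'methodName': 'GenericNMRurine', 'LWpeakRange': [-0.1, 0.1]}
with open('sop_record.json', 'w') as fh:
    json.dump(attributes, fh, indent=2, default=str)
```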
There are no unit tests covering parsing of Bruker IVDr BI-QUANT v2.0 files.
We should:
The Bruker IVDr BI-QUANT-PS input functions need to be refactored for version 2.0.
The 'nonposy' argument in matplotlib axes set_xscale and set_yscale has been deprecated in favour of 'nonpositive'. This causes issues with many of the current plotting functions, for example, _plotTIC or _plotRDS.
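A hedged compatibility sketch that works on both sides of the deprecation, trying the new keyword first and falling back to the old one:

```python
import matplotlib
matplotlib.use('Agg')  # headless backend for scripted use
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
try:
    # matplotlib >= 3.3 accepts only 'nonpositive'
    ax.set_yscale('log', nonpositive='clip')
except (TypeError, ValueError):
    # older releases still expect the deprecated 'nonposy' keyword
    ax.set_yscale('log', nonposy='clip')
```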
To apply a function to all samples or features
The NMR data format produces a lot of small files which are very inconvenient for storage and data transfer. It would be good to add functionality to import a dataset directly from a .zip or other compressed archives.
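A sketch of the idea using the standard library: extract the archive to a temporary directory and point the existing importer at it (the unpack_archive helper and the NMRDataset call in the comment are illustrative):

```python
import tempfile
import zipfile

def unpack_archive(archive_path):
    """Extract a zipped Bruker experiment tree; return the extraction root."""
    extractDir = tempfile.mkdtemp()
    with zipfile.ZipFile(archive_path) as zf:
        zf.extractall(extractDir)
    return extractDir

# e.g. nPYc.NMRDataset(unpack_archive('study.zip'), pulseProgram='noesygppr1d')
```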
There is a lot of LC-QqQ quality control functionality which is not represented in the tutorials. We also have no example dataset, so it would be good to:
Probably should be done more intensively after #34
And fix where necessary
Tutorial.rst is not as clear as it could be; possibly split the import instructions from the NMR- and MS-specific parts?
Hi Team,
Let's discuss this, but it would be useful to save multiple versions of the MV analytical report (for example on different sample types) in the same folder. At the moment any previous files are deleted (line 121 in multivariateReport.py) when a new report is generated.
Cheers,
Caroline
Some of the reports output a plain text message (not a warning) when there are no linearity reference samples. This is particularly noticeable when running the feature selection report. A warning should be raised instead, which can then be suppressed on repeated runs.
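A sketch of the replacement, assuming a hypothetical check site (check_linearity_reference is illustrative, not a toolbox function); a warning raised with warnings.warn can be silenced by callers on repeated runs:

```python
import warnings

def check_linearity_reference(nLRSamples):
    # Replace the bare print with a suppressible UserWarning.
    if nLRSamples == 0:
        warnings.warn('No linearity reference samples found; '
                      'linearity sections will be skipped.', UserWarning)

# Callers re-running a report can then silence the repeat, e.g.:
# warnings.filterwarnings('ignore', message='No linearity reference samples.*')
```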
Add functionality to facilitate exporting large datasets to Metabolights DB.
The sklearn.decomposition.base module is deprecated in version 0.22 and will be removed in version 0.24. The corresponding classes / functions should instead be imported from sklearn.decomposition. Anything that cannot be imported from sklearn.decomposition is now part of the private API.
To avoid generating gigantic figures that trip:
ValueError: Image size of 3300x119700 pixels is too large. It must be less than 2^16 in each direction.
To improve performance
When v2-M is released
Add parsers for Skyline's data format in the LC-QqQ targeted datasets.
My data table was made using XCMS's peakTable function. Importing this data with nPYc.MSDataset excluded the first 5 samples in my case, due to the default in _loadXCMSDataset (line 495): def _loadXCMSDataset(self, path, noFeatureParams=14).
This number varies depending on the number of classes in the data and whether peakTable or diffreport is used, so it might need to be user-specified in order to reduce import issues and inadvertent data exclusion.
Missing optional dependency 'xlrd'. Install xlrd >= 1.0.0 for Excel support Use pip or conda to install xlrd.
When repeating the 'feature summary' report after sample exclusions (made with updateMasks or manually) in an NMRDataset, the final summary table contains wrong information the second time.