nickmckay / lipd-utilities

Input/output and manipulation utilities for LiPD files in Matlab, R and Python

Home Page: http://nickmckay.github.io/LiPD-utilities/

License: GNU General Public License v2.0

MATLAB 54.20% Makefile 0.04% Python 36.57% Jupyter Notebook 0.36% R 8.78% Batchfile 0.05%

lipd-utilities's Issues

lipd.readLipd.py needs improvement -- here's some simple things!

@ericsteig writes:

The default for lipd.readLipd should be to expect a filename, not open a GUI.

The path should by default assume './'.

Details:

The default for lipd.readLipd is to open a GUI (which I would never want). Also, the GUI doesn't open on my machine. I don't care since I'll never use it!

I have a lipd file called GISP2.lpd.
If I'm already in the right directory, I should be able to read it with:

D = lipd.readLipd('GISP2.lpd')

but that doesn't work. I have to say:

D = lipd.readLipd('./GISP2.lpd')

which is just silly.
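
The requested behaviour can be sketched as follows. This is only an illustration: `resolve_lipd_path` is a hypothetical helper, not a function in the lipd package, and it assumes that a bare filename should be resolved against the current working directory.

```python
import os


def resolve_lipd_path(path="", base_dir=None):
    """Hypothetical helper: turn a user-supplied path into a normalized one.

    A bare filename such as 'GISP2.lpd' is resolved against base_dir,
    which defaults to the current working directory.
    """
    base_dir = os.getcwd() if base_dir is None else base_dir
    if not path:
        # No filename given: fall back to the directory itself.
        return os.path.abspath(base_dir)
    path = os.path.expanduser(path)
    if os.path.isabs(path):
        return os.path.normpath(path)
    return os.path.normpath(os.path.join(base_dir, path))
```

With this kind of normalization, 'GISP2.lpd' and './GISP2.lpd' resolve to the same file, so the './' prefix is no longer needed.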

R: warn user about writing ageEnsemble data in paleoData

writeLipd() should warn people about writing ageEnsembles that have been mapped into paleoData. This is a common procedure in geoChronR and will come up often. However, it can greatly increase the size of the LiPD file, and the mapping is quickly and easily replicated upon loading with geoChronR::mapAgeEnsembleToPaleoData().

Perhaps a warning, and then a yes/no about deleting the ageEnsemble from the paleoData?

Matlab: global qc sheet

We need a global QC sheet function in Matlab, and the create and update functions should use the same translator.

NOAA Updates

List of issues to address

1. Study name on the first line is missing. This name should be the same as Study_Name in the Title section (see #5).
Action : Python script has a function called __generate_study_name() that finds or creates a study name. If study name doesn't exist, it attempts to use <geo_siteName>.<pub0_pubYear>.<pub0_author>. If those keys don't exist, creating the study name fails. Need test files to see if they meet the requirements for creating the study name and/or why it may be failing in this function.

2. Online_Resource has the last part of URL repeated twice.
Action: Find why it's adding the URL twice.

3. The Online_Resource for LiPD files should be https://www1.ncdc.noaa.gov/pub/data/paleo/reconstructions/climate12k/temperature/version1.0.0/Temp12k_v1.0.0.LiPD. LiPD files will be in their own directory, separate from the NOAA Templates.
Action : Correct the online resource link template

4. Need Contribution_Date. This can be the same as the Modified_Date.
Action : Make the contribution date a timestamp of when the file is created.

5. Need Study_Name. If it is possible to programmatically generate a study name, it should generally follow: Where, When, What. We need to create this programmatically, maybe (geo_siteName + paleoData_minYear - paleoData_maxYear + pub1_title)? Might be weird sometimes, but something like that.
Action: This already exists. Refer to Issue # 1. Might be a bug or files are missing the necessary data.

6. Investigators are sometimes missing, and other times not consistently formatted (eg, missing first initial). Maybe just always pull this from pub1_authors?
Action: This already partially exists. When investigators is empty, it creates the investigators field using the FIRST publication available with author data. Generally this is pub0. When the author entry is a list of authors, it will create the investigator string as "LastName; LastName;..." However, if the author data is a single string of multiple author names, it gets trickier. I'm not positive this case is working. Since investigators is sometimes missing completely, there may be a bug in this function.

7. Investigators should be split with semicolons instead of commas.
Action: The function mentioned in issue # 6 does this when generating investigators. However, this does not cover existing investigator data. I'll make a function to check existing data and format it as necessary.

8. Descriptions are random (eg, “Ian Walker (he could not send the data)” or “cannot validate elevation”). What do you think about a boilerplate description related to Temperature 12k here instead? WDS-Paleo could draft the description.
Action: Nick is handling this.

9. Some publications are missing. This should be fixed.
Action: Bug. Find out why.

10. Site_Names are missing.
Action: Check the mapping. Data may be getting lost.

11. Location is missing. The NASA GCMD location keywords (provided in Table S1) go in this field.
Action: Nick is handling this.

12. Many files are missing variable “what” terms. The shortname could be used for the “what.”
Action: Map the paleoData_variableName to "what"

13. Variables seasonality is missing.
Action: Possible mapping issue? Nick - "This should come from interpretation1_seasonality"

14. Variables C or N designation is mostly missing
Action : Autofill this based on a sample of the table column data.

15. Column headings in data table should be tab delimited (not space delimited).
Action: Fixed. Removed fixed 'spaces' spacing.

16. Shortnames listed in Variables section do not always match data column headings. This seems like it is usually caused by repeated shortnames (eg, d18O in "893A.Kennet.2007-1.txt")
Action: Need Lipd file to recreate the issue. Will investigate.

17. Data tables should not have # at the start of their lines.
Action: This requirement has changed back and forth. Formerly, it was requested to have #, then no #, then # again. Can remove.

18. Many variables that are uncertainties are either missing units or have units designated as “unitless” when they are not unitless (eg, file “Wonderkrater.Scott.2016-2.txt”)
Action: Nick is handling this. Data problem.
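
The fallback logic described in items 1, 6 and 7 can be sketched as below. This is not the actual NOAA converter code: the function names are hypothetical, and it assumes the metadata is available as a flat dictionary using the keys named above (`geo_siteName`, `pub0_pubYear`, `pub0_author`) plus a `studyName` key.

```python
def generate_study_name(metadata):
    """Hypothetical sketch: reuse studyName, else build site.year.author."""
    existing = metadata.get("studyName")
    if existing:
        return existing
    keys = ("geo_siteName", "pub0_pubYear", "pub0_author")
    parts = [metadata.get(key) for key in keys]
    if any(part in (None, "") for part in parts):
        # A required key is missing, so no name can be generated.
        return None
    return ".".join(str(part) for part in parts)


def format_investigators(authors):
    """Join a list of author names with semicolons.

    A single string is returned unchanged, because splitting
    'Last, First, Last, First' on commas is ambiguous.
    """
    if isinstance(authors, str):
        return authors.strip()
    return "; ".join(str(name).strip() for name in authors)
```

Returning `None` rather than raising makes it easy to log which test files lack the keys needed to build a study name.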

R: collapseTs : Build dataset structure based from TS data

Remove get_table
Remove get_crumbs
Build new dataset structure based on the paleoNumber, modelNumber, tableNumber, etc, and do not rely on original raw data. This is in case the user changes the amount of tables, switches a table's type, or other things that would alter the collapsed structure from the original structure.

R : Package won't install or load properly

The package wouldn't install/load with the most recent version. I had to remove the empty addTable() function, and then it seemed to work.

Nick

filterTs() limitation

Function doesn't appear to work with the OR symbol.

EX) filterTs(TS, 'interpretation1_variable == M | interpretation1_variable == M')
returns list()
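
As a language-neutral illustration of the expected OR semantics, here is a minimal Python sketch. It is not the filterTs() implementation: `filter_ts_any` is a hypothetical name, the time series is assumed to be a list of dictionaries, and only `key == value` clauses are handled.

```python
def filter_ts_any(ts, expression):
    """Keep entries that satisfy at least one 'key == value' clause."""
    clauses = []
    for clause in expression.split("|"):
        key, sep, value = clause.partition("==")
        if not sep:
            raise ValueError("unsupported clause: " + clause.strip())
        clauses.append((key.strip(), value.strip()))
    return [
        entry
        for entry in ts
        if any(str(entry.get(key)) == value for key, value in clauses)
    ]
```

Each clause is evaluated independently and an entry is kept as soon as one clause matches, which is the behaviour the example expression appears to expect.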

Python: Coordinate value imported as NumPy data type

ODP664.Raymo.1997.zip

The values in coordinates are being swapped to NumPy data types. Possibly because this one is more precise than usual. NumPy data types have only been in inferred_data.py, but somehow this got converted in readLipd()
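
One way to guard against this is to convert values back to built-in types after reading. The sketch below is an assumption about a possible fix, not the library's code; `to_native` is a hypothetical helper. It relies only on the fact that NumPy scalars and arrays expose a `tolist()` method that returns built-in Python values.

```python
def to_native(value):
    """Recursively convert NumPy-style scalars and arrays to built-in types."""
    if isinstance(value, dict):
        return {key: to_native(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_native(item) for item in value]
    if hasattr(value, "tolist"):
        # NumPy scalars and arrays both provide tolist().
        return value.tolist()
    return value
```

Applied to the metadata dictionary, this leaves ordinary floats and strings untouched and only rewrites values that carry a `tolist()` method.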

Python: Library separate from application?

Is it possible for a general developer-oriented LiPD library to be packaged separately from end-user-oriented LiPD utilities and applications?

I may have use for LiPD within a much larger analysis system. It would be nice to have a library or framework to write project-specific applications that read/write/parse LiPD files as a standardized object.

collapseTs omitting some data

Issue submitted by Jessica via e-mail.

If I take a folder containing 2 lipd files, extract the time series (ts_list), and collapse the time series, then the new lipd files (two of them again) have lost some of the header information. This happens even when I don't add a new time series to ts_list. (Side note: when I do add a new time series to ts_list it does show up in the new lipd file after I use collapseTs.) Thus, it looks to me like collapseTs does not perfectly reverse the process of extractTs because header information is lost in one or both of these transformations. If this is intentional, then I'll need to add header information back into the new lipd files by grabbing it from the old lipd files. Let me know if this doesn't make sense and I'll use code to show you what I mean.

Tested it out and I was able to reproduce it as shown below. The following keys do not get collapsed properly: studyName, proxy, investigator, description. This may be true for other keys, but this is all that showed in this test.

Python: Excel not writing one table's CSV file

I'm having a problem with this file. Maybe it's because the second table has so many columns, but the CSV is not writing even after I remove the units.
khider

KT05-7_PC02.Kawahata.2009.xlsx
KT05-7_PC02.Kawahata.2009.zip

readLipd() in R doesn't handle lat/long coordinates appropriately if there is more than 1 point

Python: MeasurementTable Number doesn't match between Excel and LiPD file

Although the LiPD file returns the right number of tables, the number isn't matching the one in the Excel file.

Example attached

MD97-2121.Marr.2013.xlsx

LiPD file here: http://wiki.linked.earth/MD97-2121.Marr.2013 (sorry GitHub doesn't support .lpd format)

TS objects for ChronDataTables

Not sure if there's more details to this or if it's straightforward. @khider feel free to chime in if needed.

R: PAGES2k LiPD file loading issue

After pulling LiPD files off the linked earth wiki and trying to load using lipdR -

Do you want to load a single file (s) or directory (d)? s
[1] "reading: Arc-Agassiz.Vinther.2008.lpd"
[1] "Error: import_model: Error in idx_col_by_name(table): there should be a columns variable in here\n"

Python: Missing variable and csv file in output

This one doesn't encode the last variable of the paleoTable and doesn't produce a csv for chron
-Deborah

M35003-4.Ruehleman.1999.xlsx
M35003-4.Ruehleman.1999.zip

BadZipFile

Here is an issue submitted by my postdoc Michael Erb:

I'm having a problem opening a lipd file in python. I installed lipd in anaconda and downloaded this file: http://wiki.linked.earth/GeoB12610-2.Rippert.2015.

In python 3, I imported lipd and tried to use the lipd.readLipd(path) command, but I'm getting an error:

reading: GeoB12610-2.Rippert.2015.lpd
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/geovault-02/erbm/programs/anaconda2/envs/py35/lib/python3.5/site-packages/lipd/__init__.py", line 49, in readLipd
__read_file(usr_path, ".lpd")
File "/home/geovault-02/erbm/programs/anaconda2/envs/py35/lib/python3.5/site-packages/lipd/__init__.py", line 680, in __read_file
__universal_load(usr_path, file_type)
File "/home/geovault-02/erbm/programs/anaconda2/envs/py35/lib/python3.5/site-packages/lipd/__init__.py", line 640, in __universal_load
lipd_lib.read_lipd(file_meta)
File "/home/geovault-02/erbm/programs/anaconda2/envs/py35/lib/python3.5/site-packages/lipd/pkg_resources/lipds/LiPD_Library.py", line 231, in read_lipd
lipd_obj.read()
File "/home/geovault-02/erbm/programs/anaconda2/envs/py35/lib/python3.5/site-packages/lipd/pkg_resources/lipds/LiPD.py", line 56, in read
unzipper(self.name_ext, self.dir_tmp)
File "/home/geovault-02/erbm/programs/anaconda2/envs/py35/lib/python3.5/site-packages/lipd/pkg_resources/helpers/zips.py", line 37, in unzipper
with zipfile.ZipFile(name_ext) as f:
File "/home/geovault-02/erbm/programs/anaconda2/envs/py35/lib/python3.5/zipfile.py", line 1026, in __init__
self._RealGetContents()
File "/home/geovault-02/erbm/programs/anaconda2/envs/py35/lib/python3.5/zipfile.py", line 1093, in _RealGetContents
raise BadZipFile("File is not a zip file")
zipfile.BadZipFile: File is not a zip file

Any ideas? I tried a different lipd file and got the same result.
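
A quick way to narrow this down is to check the file before handing it to the reader. The sketch below uses only the standard library; `check_lipd_archive` is a hypothetical helper. One possible cause, not confirmed in the report, is that the saved file is the wiki's web page rather than the archive itself, so the helper also looks for an HTML header.

```python
import zipfile


def check_lipd_archive(path):
    """Raise a descriptive error if path is not a readable zip archive."""
    if not zipfile.is_zipfile(path):
        with open(path, "rb") as handle:
            head = handle.read(64).lstrip().lower()
        hint = ""
        if head.startswith(b"<!doctype html") or head.startswith(b"<html"):
            hint = " (it looks like an HTML page, so the download may have saved a web page instead of the archive)"
        raise ValueError(f"{path} is not a zip archive{hint}")
    return path
```

Calling this before lipd.readLipd(path) would replace the long traceback with a single message that says what is wrong with the file.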

Python version: No module named 'download_lipd'

If I install the latest version (0.2.5.4) with pip install lipd, I am unable to import lipd. Error information below:

      6 import os
----> 7 import lipd as lpd
      8 import pandas as pd
      9 import numpy as np

~/.pyenv/versions/anaconda3-5.0.1/envs/py3.6/lib/python3.6/site-packages/LiPD-0.2.5.4-py3.6.egg/lipd/__init__.py in <module>()
     14 from lipd.regexes import re_url
     15 from lipd.fetch_doi import update_dois
---> 16 from download_lipd import download_from_url, get_download_path
     17 
     18 # Load stock modules

ModuleNotFoundError: No module named 'download_lipd'

I tried an older version (the commit at 2018-03-02 17:00), and it doesn't have this issue.

Can't have python files in the R/R folder

Convention (i.e. R CMD check) is that only R code should be in the R folder of a package. Currently there is a bagit.py file that is required by the package. I suggest moving it to R-PKG-ROOT/exec/bagit.py and adjusting the R code that calls it so that it knows about the new location.

LiPD utilities in Python not compatible with newer version of numpy

import lipd as lpd
//anaconda/lib/python3.5/importlib/_bootstrap.py:222: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
return f(*args, **kwds)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "//anaconda/lib/python3.5/site-packages/lipd/__init__.py", line 13, in <module>
from lipd.json_viewer import viewLipd
File "//anaconda/lib/python3.5/site-packages/lipd/json_viewer.py", line 16, in <module>
from PyQt5 import QtCore
ImportError: dlopen(//anaconda/lib/python3.5/site-packages/PyQt5/QtCore.so, 2): Symbol not found: _PySlice_AdjustIndices
Referenced from: //anaconda/lib/python3.5/site-packages/PyQt5/QtCore.so
Expected in: flat namespace
in //anaconda/lib/python3.5/site-packages/PyQt5/QtCore.so

Python: Incorrect inferred data calculations

"For some reason the min/max... of the resolution has NaN and it breaks the JSON and the LiPD uploader"
MD97-2141.Rosenthal.2003.xlsx
MD97-2141.Rosenthal.2003.zip
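
NaN is not valid JSON, which is why a NaN in the inferred statistics breaks downstream consumers. A minimal sketch of one workaround is to replace non-finite floats with None (written as `null`) before serializing. `replace_non_finite` is a hypothetical helper, and the `hasMinValue`/`hasMaxValue` keys in the sample are assumed for illustration.

```python
import json
import math


def replace_non_finite(value):
    """Recursively replace NaN and infinite floats with None."""
    if isinstance(value, float) and not math.isfinite(value):
        return None
    if isinstance(value, dict):
        return {key: replace_non_finite(item) for key, item in value.items()}
    if isinstance(value, list):
        return [replace_non_finite(item) for item in value]
    return value


# Assumed sample structure, for illustration only.
inferred = {"resolution": {"hasMinValue": float("nan"), "hasMaxValue": 2.5}}

# allow_nan=False makes json.dumps raise if any NaN slipped through.
print(json.dumps(replace_non_finite(inferred), allow_nan=False))
```

This only hides the symptom; the underlying question of why the resolution statistics come out as NaN for this file still needs investigating.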

R: failure to install lipdR in R v 3.6.2

When attempting to install lipdR from Github in R v 3.6.2 (MacBook): devtools::install_github("nickmckay/LiPD-Utilities", subdir = "R")

I get the following error:
Error: Failed to install 'lipdR' from GitHub:
(converted from warning) package ‘Smisc’ is not available (for R version 3.6.2)

Tsid warnings at the validation step from the LiPD utilities

Tsids error when using lipd.validate() but Tsids are being generated nonetheless
ODP659.Tiedemann.1994.zip
ODP659.Tiedemann.1994.xlsx

R: collapseTs, collapsing data added into the time series.

Previously had an issue where data added into the time series was not being collapsed. That issue has been fixed, but the fix caused another issue. There is now only one column being collapsed, "A", and none of the other columns exist.

R: extractTs/collapseTs bug. Losing data.

Nick -

bug report for lipdR:
extractTs(L, whichtables = "meas", mode = "chron")

In chron mode this loses a whole bunch of data: the geo, and maybe the paleoData too, so the original can't be reconstructed.
Dataset: hjort.Schmidt.2011.lpd

Matlab: More documentation

Matlab is lacking in the documentation department. Update the documentation website (the one connected to the repository) and the documentation within the package.

R: Geo linestring not supported

The excel template supports entering 4 unique coordinate values. N lat, S lat, W lon, E lon. Generally, only 2 coordinates have been used in most datasets so far, but now 4 unique coordinates are starting to appear. Python supports the creation of LiPD files with a linestring type, but R does not support reading those LiPD files properly.

The Error:

One longitude value occupies L$geo$geometry$coordinates$longitude, but then the same longitude value mistakenly overwrites the L$geo$geometry$coordinates$latitude value. The other 3 values are dropped.

PupukePiatrunia.2016_2.xlsx
Pupuke.Piatrunia.2016.zip

python: special characters in meas tables crash readLipd()

Per Nick:
Ignore special characters and continue loading whatever possible.

Example files with special characters in chronological meas tables:
PogoniaBog.Cushing.1979.lpd
GeoB10042_1.lpd
Gould.Jacobson.1992.lpd

python: readLipd() fails if there is no pub section

Python: possible to use nested dictionary keys in filterTS() criteria?

Hi,
I am using the LiPD utilities for Python and I would like to filter a data set by the temporal resolution of the records. But 'paleoData_hasResolution' is a dictionary in itself, and using the filterTs() command like this:

highres=lipd.filterTs(alldata,"paleoData_hasResolution['hasMedianValue']<5")

returns "Invalid input expression". Is it possible to use nested dictionary keys (not sure if that's the right term) in the filterTs() command?
I can probably find a way around that problem, but it would be so handy to use the filterTs for this.

Thank you!
Marlene
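
Until nested keys are supported, a plain Python loop gives the same result. This is a workaround sketch, not part of the lipd API: `filter_by_median_resolution` is a hypothetical name, and it assumes the time series is a list of dictionaries in which `paleoData_hasResolution` is itself a dictionary with a `hasMedianValue` entry, as in the question.

```python
def filter_by_median_resolution(ts, max_value):
    """Keep entries whose median resolution is below max_value."""
    kept = []
    for entry in ts:
        resolution = entry.get("paleoData_hasResolution")
        if not isinstance(resolution, dict):
            continue  # No resolution information for this entry.
        median = resolution.get("hasMedianValue")
        if isinstance(median, (int, float)) and median < max_value:
            kept.append(entry)
    return kept
```

For the example above, `highres = filter_by_median_resolution(alldata, 5)` would stand in for the filterTs() call.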

Allow URLs to LiPD files for readLipd

If a user provides a link to a LiPD file hosted online via LinkedEarth Wiki or other, then the utilities should be able to download the file in the background and read it into memory.
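
A minimal sketch of this idea, using only the standard library: detect whether the argument is a URL, download it to a temporary directory, and pass the local path on to the normal reader. The function names are hypothetical and this is not how the utilities implement it.

```python
import os
import tempfile
import urllib.request
from urllib.parse import urlparse


def is_url(path):
    """Return True for http(s) URLs, False for local paths."""
    return urlparse(path).scheme in ("http", "https")


def fetch_if_url(path, dest_dir=None):
    """Download path if it is a URL and return a local file path."""
    if not is_url(path):
        return path
    dest_dir = tempfile.mkdtemp() if dest_dir is None else dest_dir
    filename = os.path.basename(urlparse(path).path) or "download.lpd"
    local_path = os.path.join(dest_dir, filename)
    urllib.request.urlretrieve(path, local_path)
    return local_path
```

The result of `fetch_if_url(path)` could then be given to readLipd unchanged, so local paths keep working exactly as before.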

Python API: CSV name doesn't match filename

This error comes back from lipd.validate(D) on the first try, directly after converting an Excel file to a LiPD file in Python. Every subsequent readLipd followed by validate comes back passing.

Why?

Because the Excel template uses 1-indexed naming for its data sheets, while the Utilities, and all other code, use 0-indexed naming.

Example:

Excel Sheets:
paleo1measurementTable1
chron1measurementTable1

These sheets generate the filenames:
NWG-SL.Lasher.2017.chron1measurement1.csv for the table chron0measurement0
NWG-SL.Lasher.2017.paleo1measurement1.csv for the table paleo0measurement0

Why is it only happening when trying to validate directly after converting an excel file?

Because filenames and table names are not permanent. They are rewritten (to adhere to standard naming) every time you use writeLipd. excel() has a few major steps:

1. Convert the Excel file to LiPD and write the LiPD file.
2. readLipd the file (the one created in step 1) into memory.
3. writeLipd the LiPD data in memory back to disk (to save all the inferred data and file standardization corrections).

The CSV filenames in memory from step 2 are mismatched, and this is what goes to the validator. The filenames saved to file in step 3 are corrected for next time.
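
The renaming that reconciles the two conventions can be sketched with a regular expression. This is an illustration only: `to_zero_indexed` is a hypothetical helper, it handles just the `measurement` pattern shown in the example filenames, and it should only be applied to names known to come from the 1-indexed Excel template.

```python
import re

# Matches e.g. 'chron1measurement1' inside a CSV filename.
_INDEXED = re.compile(r"(paleo|chron)(\d+)(measurement)(\d+)")


def to_zero_indexed(name):
    """Shift 1-indexed table numbers in a filename down to 0-indexed."""

    def shift(match):
        section, outer, table, inner = match.groups()
        if int(outer) == 0 or int(inner) == 0:
            raise ValueError("name is not 1-indexed: " + name)
        return f"{section}{int(outer) - 1}{table}{int(inner) - 1}"

    return _INDEXED.sub(shift, name)
```

Applying the same conversion in memory right after step 1 would let the first validation see the corrected names.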

R : collapseTs and calibration data

Calibration data doesn't process through collapseTs because it isn't indexed like interpretation. (ie. "interpretation1_seasonality" vs "calibration_uncertainty")

Should I make some rules to handle calibration as-is (unindexed) or should calibration data be indexed?

@nickmckay
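
To make the two options concrete, here is a sketch (in Python, although the issue concerns R) of a key parser that accepts both the indexed and the unindexed form. `parse_ts_key` is a hypothetical name and the pattern covers only the two prefixes mentioned above.

```python
import re

_KEY_RE = re.compile(r"^(interpretation|calibration)(\d*)_(.+)$")


def parse_ts_key(key):
    """Split a time series key into (section, index, field).

    The index is None when the key is unindexed, as with
    'calibration_uncertainty'.
    """
    match = _KEY_RE.match(key)
    if match is None:
        return None
    section, index, field = match.groups()
    return section, (int(index) if index else None), field
```

With a parser like this, unindexed calibration keys could be handled as-is, and indexed calibration keys would also work if that convention is adopted later.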

Python: No resolution calculated

For some reason the resolution didn't get calculated on this file.

Deborah

ODP658.Sarnthein.1989.xlsx
ODP658.Sarnthein.1989.zip

Bulk DOI updater from memory

The DOI updater currently reads from a directory and updates LiPD files directly on disk, including overwriting. Switch this to work on LiPD files in memory, and store the results in memory instead.

Python2.7 lite

For backward compatibility with code written in 2.7, a lite version of the utilities that only loads the LiPD files into the workspace (loadLipds()) would be useful.

R: values not read in properly

Having an issue with the R utilities. This file appears mostly valid, but when it gets read into R, the chronTable (which is mostly, maybe entirely, NAs) doesn’t make it into the values in the list in R, which causes problems later.

It should just populate the same number of NAs into the values field

Nick

python: allow underscores in filenames

Currently, the python utilities remove the underscores when saving. Can we allow them to remain in the files?

R : writeLipd not working

"Error: writeLipd: Error in basename(entry): object 'entry' not found\n"
Looks like I'm able to write them one by one, just not a list of them. But I'm getting this warning:

Error appears while using OSX:
“Warning: OS - Windows. Unable to use bagit module on LiPD data. Skipping...”

Nick

Variable "distance_from_top"

The variable "distance_from_top" or "distance" should be interpreted as "depth" in LiPD files.
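
A minimal sketch of such a mapping, assuming a simple alias table; `canonical_variable_name` is a hypothetical helper and the table contains only the two names mentioned here.

```python
# Alias table limited to the names mentioned in this issue.
_ALIASES = {"distance_from_top": "depth", "distance": "depth"}


def canonical_variable_name(name):
    """Map known aliases to 'depth'; leave other names unchanged."""
    return _ALIASES.get(name.strip().lower(), name)
```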

Pypi readme publish error

Upload failed (400): The description failed to render in the default format of reStructuredText.

Nothing has changed, but for some reason the pypi package publishing for LiPD has stopped working. I removed the readme file for now until I find more info.

Python : Min not calculated

The min value of 0 didn't get calculated in this file. Might be a zero problem with python.
@khider

M77-2_056-5.Nurnberg.2015.xlsx
M77-2_056-5.Nurnberg.2015.zip
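
If this is a zero problem, a common cause in Python is a truthiness test that treats 0 like a missing value. The sketch below illustrates that pitfall in general; it is not taken from the utilities' code, and the cause of this particular report is not confirmed.

```python
def min_truthy(values):
    """Buggy pattern: 'if v' drops 0 along with None."""
    kept = [v for v in values if v]
    return min(kept) if kept else None


def min_not_none(values):
    """Safer pattern: only None counts as missing, so 0 is kept."""
    kept = [v for v in values if v is not None]
    return min(kept) if kept else None
```

For a column such as `[0, 1.2, None, 3]`, the first function skips the 0 and the second one keeps it.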

R: the update from 1.2 to 1.3 does not create a scope in the interpretation field

It needs to pull the prefix from the interpretation and assign it to scope
This file can be used for testing.

lonespruce..2012.lpd.zip

Excel : Single chron tables are named "chron" instead of "chron1"

This is causing problems programmatically of tracking indices. Each chron and paleo object needs to have an index.

R: No automatic pcaMethods install

The pcaMethods package cannot be installed automatically during the geoChronR installation. For unknown reasons, it has to be installed separately before installing geoChronR.

There needs to be documentation stating that this is a known bug so users know how to work around it.

Code:
source("https://bioconductor.org/biocLite.R")
biocLite("pcaMethods")

R : collapseTs not working with interpretations

The collapseTs() function isn’t handling interpretations properly, when I’m trying to add new ones and then collapse it back

Nick

python: age ensemble missing one member

When I open up a LiPD file in python using "lipd.readLipd", the first member of the age ensemble is missing. This can be seen when comparing the size of the age ensemble with the length of the "number" field. For example, there may be 1000 values in the "number" field, but only 999 members of the age ensemble. When opening up the file in Matlab, however, all members of the age ensemble are loaded. Unzipping the file in Windows also shows all members.

The error only occurs if there is more than one paleo object.
Running extractTs and then collapseTs is enough to recreate the error, which seems unrelated to the model type.

Debugging with "NamTreeRing021318.RData" file.

It's possible there is some issue with the paleoData objects loop, though it should theoretically be able to handle multiple paleoData objects.