ropensci / EDIutils
An API Client for the Environmental Data Initiative Repository
Home Page: https://docs.ropensci.org/EDIutils/
License: Other
Hey!
I need a hand here.
I have a list with the latest package IDs of multiple datasets (accessed 2023-09-25).
Instead of accessing them one by one, I want to process all of them at once:
# This is a character vector with all the package IDs I am trying to access (n = 81)
package_ids <- data_list$packageid

# Download the data entity names for each package ID in the list
data_entity_names <- vector("list", length(package_ids))
for (i in seq_along(package_ids)) {
  data_entity_names[[i]] <- read_data_entity_names(packageId = package_ids[i])
}
and it returns:
"Error in read_data_entity_names(packageId = package_ids[i]) :
Not Found (HTTP 404). Failed to ."
Has anyone encountered a similar problem before? Do you have a better way/command to work around it?
Thanks,
Paschalis
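One way to keep the loop running past failing IDs (a common cause of HTTP 404s is a stale ID whose revision has since been replaced) is to wrap each call in tryCatch(). A minimal sketch, with hypothetical package IDs:

```r
library(EDIutils)

# Hypothetical package IDs; any that no longer resolve will 404
package_ids <- c("knb-lter-nwt.210.1", "edi.999999.1")

# Wrap each request in tryCatch() so one failing ID is recorded
# instead of aborting the whole loop
data_entity_names <- list()
failed_ids <- character(0)

for (id in package_ids) {
  result <- tryCatch(
    read_data_entity_names(packageId = id),
    error = function(e) {
      message("Skipping ", id, ": ", conditionMessage(e))
      NULL
    }
  )
  if (is.null(result)) {
    failed_ids <- c(failed_ids, id)
  } else {
    data_entity_names[[id]] <- result
  }
}

failed_ids  # IDs that could not be resolved
```

Afterwards, `failed_ids` tells you exactly which entries of your list need to be checked against the repository.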
EAL uses this function, but R is not very good at tracing the problem for you, so scripts have to do it. Here is the message I get if EAL (EDIutils) cannot find a path:
Error in EDIutils::validate_path(fun.args$path) :
The directory specified by the argument "path" does not exist! Please enter the correct path for your dataset working directory.
My script has several paths in it; validate_path() should PRINT OUT the offending path to make it easier to track down.
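A sketch of what that could look like (this is not the actual EDIutils internals, just a stand-in that interpolates the offending path into the error message):

```r
# Sketch only: report WHICH path failed, so scripts with several
# paths are easier to debug
validate_path <- function(path) {
  if (!dir.exists(path)) {
    stop("The directory specified by the argument \"path\" does not exist: '",
         path, "'. Please enter the correct path for your dataset working directory.",
         call. = FALSE)
  }
  invisible(path)
}
```

Calling it on a nonexistent directory then names the directory in the error, while a valid directory passes through silently.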
Hi @clnsmth - I believe that since @servilla changed the LDAP, 'LTER' as an option for the affiliation argument is no longer valid and should be removed. Below is a relevant snippet from api_evaluate_data_package, but this may be relevant in other places within the package as well.
#' @param affiliation
#' (character) Affiliation corresponding with the user.id argument supplied
#' above. Can be: 'LTER' or 'EDI'.
Add functionality that will accommodate less-than-ideal header formats in csv files (e.g., header on line 1, data starting on line 5).
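A sketch of handling one such layout in base R (assumption: header on line 1, lines 2-4 hold notes/units, data start on line 5; the file and column names are hypothetical):

```r
# Build a hypothetical file with the awkward layout
f <- tempfile(fileext = ".csv")
writeLines(c("site,value",        # line 1: header
             "# field notes",     # lines 2-4: metadata to skip
             "# units: mg/L",
             "# collected 2023",
             "A,1",               # data start on line 5
             "B,2"), f)

# Read the header row alone, then the data with the junk lines skipped
header <- names(read.csv(f, nrows = 1))
df <- read.csv(f, skip = 4, header = FALSE, col.names = header)
df  # two rows, columns 'site' and 'value'
```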
Issue a warning when file naming best practices are not followed. Consider implementing in validate_file_names().
@clnsmth this might already exist but I just didn't find it. I think the most common use case is something like this, wherein someone just wants to download ALL the tabular data into a list (or possibly stacked, which should be an option) and have the tables come out with sensible names.
library(EDIutils)
library(magrittr)
library(data.table)

# assume you used the awesome workflow already to get the package you want
pkgid <- 'knb-lter-nwt.210.1'

# get the ids
entities_id <- EDIutils::api_list_data_entities(pkgid)

# sort out the urls
entities_id$url <- paste0('https://pasta.lternet.edu/package/data/eml/',
                          gsub('\\.', '/', pkgid), '/', entities_id$identifier)

# also want the names
entities_name <- sapply(entities_id$identifier, function(x)
  api_read_data_entity_name(pkgid, x, environment = 'production')) %>%
  # spaces = hassle to work with, get rid of them
  gsub(' ', '_', .)

# download all
alldata <- list()
# should add some EML scraping here to make it read_delim based on the delimiter,
# headers TRUE/FALSE, etc.
# note: loop over rows, not length(entities_id), which counts columns
for (k in seq_len(nrow(entities_id))){
  alldata[[entities_name[k]]] <- read.csv(entities_id$url[k])
}

# probably optional whether you want to bindrows them into a single df or not;
# this is an example where it makes sense to, but in other instances you might not
finally <- alldata %>% rbindlist(idcol = TRUE, use.names = TRUE)
Does this exist and I am just not finding it? Or would it be worth wrapping the above into a function with the package?
Vignettes covering common use cases would be helpful. A few examples:
A few weeks ago, EDIutils CRAN checks began failing as a result of issues in the vcr package. After the vcr v1.2.0 upgrade, EDIutils checks returned to a passing state (mostly). Two OS configurations are still failing, namely r-release-windows-x86_64 and r-oldrel-windows-ix86+x86_64.
Investigate the source of these issues and fix them.
Add these to the "Search and Access" vignette.
A great recommendation from @laijasmine and @bozaah to help get users working with EML and XML.
Use AppVeyor for Windows CI.
I am testing the EML Assembly Line, which relies on the detect_delimeter.R script, with .csv files. This CSV file is delimited by ";" (which is the norm when using read.csv2() and write.csv2()).
The error message itself is:
I'm having trouble identifying the field delimeter of SupplementaryTable4.csv. Enter the field delimeter of this file. Valid options are: , \t ; |
Thus indicating that .csv files can have separators among , \t ; |
Hey EDIutils team! I had a conversation with Colin Smith and Greg Maurer recently about creating a make_query function to help make Solr queries for people with some R literacy but limited prior exposure to Solr. The hope is that this new function would make it easier for R users to make good use of EDIutils::search_data_packages.
I've taken a stab at this function and will attach the full code to this issue. Note that I also wrote two helper functions, solr_wild and solrize, to make the internal components of make_query as streamlined as possible. I'm definitely a novice to Solr queries, so make_query may be missing crucial arguments, but I think it's a reasonable starting point; it is built to be semi-modular and could easily support additional arguments. All functions are written in base R (version 4.3.2).
Let me know if this doesn't work on your end and/or if you'd like me to make any changes before it could possibly be built into EDIutils. Thanks!
# Load needed libraries
library(EDIutils)
# Clear environment
rm(list = ls())
# Define helper function
## Swaps human equivalents of wildcards for Solr wildcard
solr_wild <- function(bit){
  # Handle empty `bit`
  if(is.null(bit)){
    # Replace with wildcard
    bit_v2 <- "*"
  } else if(length(bit) == 1){
    # Handle English equivalents for wildcard: replace allowed keywords with "*"
    bit_v2 <- gsub(pattern = "all|any", replacement = "*", x = bit)
  } else {
    # If neither condition is met, return whatever was originally supplied
    bit_v2 <- bit
  }
  # Return finished product
  return(bit_v2)
}
# Example(s)
solr_wild(bit = NULL)
solr_wild(bit = "any")
solr_wild(bit = "something else")
# Define helper function
## Parses English text into Solr syntax (i.e., right delimiters, etc.)
solrize <- function(bit){
  # Replace spaces with hyphens
  bit_v2 <- gsub(pattern = " ", replacement = "-", x = bit)
  # If more than one value, collapse with plus signs
  if(length(bit_v2) > 1){
    bit_v3 <- paste0("(", paste0(bit_v2, collapse = "+"), ")")
  } else {
    bit_v3 <- bit_v2
  }
  # Return finished bit
  return(bit_v3)
}
# Example(s)
solrize(bit = c("primary production", "plants"))
# Define function to generate query
make_query <- function(keywords = NULL, subjects = NULL, authors = NULL,
                       scopes = NULL, excl_scopes = NULL,
                       return_fields = "all", limit = 10){

  ## Error Checking ----
  # Define supported 'return_fields'
  good_fields <- c("*", "all", "abstract", "begindate", "doi", "enddate",
                   "funding", "geographicdescription", "id", "methods",
                   "packageid", "pubdate", "responsibleParties", "scope",
                   "site", "taxonomic", "title", "authors", "spatialCoverage",
                   "sources", "keywords", "organizations", "singledates",
                   "timescales")

  # Error out for unsupported ones
  if(!all(return_fields %in% good_fields))
    stop("Unrecognized return field(s): ",
         paste(base::setdiff(x = return_fields, y = good_fields), collapse = "; "))

  # Coerce a non-numeric limit to the default
  if(!is.numeric(limit)){
    message("`limit` must be numeric, coercing to 10")
    limit <- 10
  }

  ## Solr Query Construction ----
  # Make start of query object
  query_v0 <- "q="

  # If keywords are provided: (1) turn into Solr syntax, (2) add to query
  solr_kw <- solrize(bit = solr_wild(bit = keywords))
  query_v1 <- paste0(query_v0, "keyword:", solr_kw)

  # Handle authors
  solr_aut <- solrize(bit = solr_wild(bit = authors))
  query_v2 <- paste0(query_v1, "&fq=", "author:", solr_aut)

  # Handle subjects
  solr_sub <- solrize(bit = solr_wild(bit = subjects))
  query_v3 <- paste0(query_v2, "&fq=", "subject:", solr_sub)

  # Handle scopes
  solr_scp <- solrize(bit = solr_wild(bit = scopes))
  query_v4 <- paste0(query_v3, "&fq=", "scope:", solr_scp)

  # EXCLUDED scopes
  ## Handled differently because we don't want to swap `NULL` for a wildcard
  if(!is.null(excl_scopes)){
    # Solr-ize and add to query
    solr_excl_scp <- solrize(bit = excl_scopes)
    query_v5 <- paste0(query_v4, "&fq=", "-scope:", solr_excl_scp)
  } else {
    # Or skip
    query_v5 <- query_v4
  }

  # Parse return fields
  ## Solr syntax for multiple entries differs here from other elements of the query
  solr_fl <- paste(solr_wild(bit = return_fields), collapse = ",")
  query_v6 <- paste0(query_v5, "&fl=", solr_fl)

  # Finally, assemble the full query with the row limit
  solr_query <- paste0(query_v6, "&rows=", limit)

  # Return that to the user
  return(solr_query)
}
# Invoke function
( request <- make_query(keywords = "*",
scopes = "knb-lter-fce",
excl_scopes = c("ecotrends", "lter landsat"),
return_fields = c("title", "authors", "id", "doi"),
limit = 10) )
# Test assembled query
EDIutils::search_data_packages(query = request)
# Test use of `make_query` inside of `search_data_packages`
EDIutils::search_data_packages(query = make_query(excl_scopes = "knb-lter-fce",
return_fields = c("title", "id")))
api_get_provenance_metadata is a fantastic resource, but I ran into a case where I needed to access provenance information when I had the doi and/or url of the dataset rather than the package identifier (e.g., knb-lter-xxx.x.x). Below is an R-based MRE using a dataset from BNZ that I used to address this task, but it seems that the utility of api_get_provenance_metadata would be increased if it natively accepted a dataset doi or url in addition to the package identifier.
MRE (in R):
library(rvest)
library(EDIutils)
library(EML)
library(dplyr)
library(stringr)
url <- "https://doi.org/10.6073/pasta/31b32868ddbb099c4b5480fb00eb2481"
landingPage <- read_html(url)
pageSubset <- landingPage %>%
html_nodes(".no-list-style") %>%
html_text()
packageId <- str_extract(grep("knb-lter-", pageSubset, value = TRUE)[[1]], "^\\S*")
packageProv <- emld::as_emld(EDIutils::api_get_provenance_metadata(packageId))
packageProv$`@context` <- NULL
packageProv$`@type` <- NULL
# desired output
packageProv
It is unclear what content to put in '.Renviron' for login(). The Authentication section in vignettes/evaluate_and_upload.Rmd has a more extensive explanation of authenticating, but not for using .Renviron. From https://db.rstudio.com/best-practices/managing-credentials/ it looks like one could put the following in the .Renviron file:
userId = "my_name"
userPass = "my_secret"
And use:
login(userId = Sys.getenv("userId"), userPass = Sys.getenv("userPass"))
Dear EDI,
Using EML Assembly Line through MetaShARK, @earnaud, I have an issue when trying to import a tab-separated data file:
Templating table attributes ...
Warning: Error in detect_errors: occurrence.txt contains an inconsistent number of field delimeters. The correct number of field delimiters for this table appears to be 237. Deviation from this occurs at rows: 141, 209, 520, 1056, 1058, 1059, 1071, 1119, 1120, 1121, 1132, 1140, 1142, 1147, 1153, 1157, 1165, 1185, 1189, 1219, 1238, 3852, 3932 ... Check the number of field delimiters in these rows. All rows of your table must contain a consistent number of fields.
This file is an occurrence.txt file coming from GBIF, and apparently the number of delimiters (\t) is OK when looking at the content of the file through a text editor, notably on lines 140, 141, 142.
occurrence.txt
I'm not fully sure the detect_errors error message is from EDIutils, but it seems to me that it comes from R/validate_fields.R, doesn't it?
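One way to reproduce this kind of check locally is to count delimiters per line. Note that a naive count like the one below also counts tabs inside quoted fields, which is one reason a file can look fine in a text editor yet still trip the validator. A sketch with a hypothetical tab-separated file:

```r
# Hypothetical example file: three tab-delimited columns, with one bad
# row (row 3 is missing a field)
tmp <- tempfile(fileext = ".txt")
writeLines(c("id\tname\tvalue", "1\talpha\t10", "2\tbeta"), tmp)

# Count tab delimiters per line
lines <- readLines(tmp)
n_delims <- lengths(regmatches(lines, gregexpr("\t", lines, fixed = TRUE)))

# Take the most common count as the "correct" one, as the error message does
expected <- as.integer(names(which.max(table(n_delims))))
which(n_delims != expected)  # -> 3
```

Running this against the real occurrence.txt (with the reported deviating rows) would show whether the validator and the text editor disagree about where tabs actually fall.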
The current EDI contact email address uses the deprecated domain @environmentaldatainitiative.org. To ensure that communication flows smoothly, we need to update the contact email to the new domain @edirepository.org.
Please ignore if by design, but otherwise note that EDIutils is not listed in the EDIorg-repository-index.
The order of results returned from the PASTA+ API is not guaranteed among calls to the same method. To link between entity attributes (e.g. name, size, etc.), the entity identifier must be used as a key, but it is not currently returned by the EDIutils api* functions. This needs to be fixed.
The internal function xml2df() drops null elements from the returned data.frame. The expected result is a data.frame with null elements filled with NA. This issue affects several EDIutils functions. See:
https://github.com/EDIorg/EDIutils/blob/5ee913a63c4127938ddadcf7bdbfbd622796fdf3/R/utilities.R#L426
Hilary Dugan suggested writing a vignette that describes the filter="newest" functionality in some functions.
Consider curl::send_mail() as a method of notifying users when data package evaluations and uploads have completed.
For each API call, first check that the resource is available before proceeding. If the resource is unavailable, then return the appropriate output with a warning.
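A sketch of such a pre-flight check, assuming httr (which EDIutils already builds on) and a hypothetical PASTA resource URL:

```r
library(httr)

# Probe the resource with a HEAD request; warn rather than error
# when it is unavailable
resource_available <- function(url) {
  resp <- tryCatch(HEAD(url, timeout(10)), error = function(e) NULL)
  !is.null(resp) && status_code(resp) < 400
}

# Hypothetical PASTA resource URL
url <- "https://pasta.lternet.edu/package/eml/edi/100/1"
if (!resource_available(url)) {
  warning("Resource unavailable, skipping: ", url)
}
```

A HEAD request keeps the probe cheap, since only the status line and headers come back, not the resource body.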
Hi,
when I am using EMLassemblyline::template_taxonomic_coverage(), an error occurs in validate_file_names:
Templating taxonomic coverage ...
Error in EDIutils::validate_file_names(path = fun.args$data.path, data.files = fun.args$taxa.table) :
Invalid data.files entered: /home/pndb-elie/dataPackagesOutput/emlAssemblyLine/test_emldp/decomp.csv
I use a path in the path argument and let data.files take its default value (aka path).
By manual tests, I see that this error occurs at the condition:
sum(use_i) == length(data.files)
But use_i has length equal to the number of files in the path directory, so it cannot have the same length as data.files. I think the condition could be re-written as:
length(which(use_i)) == length(data.files)
I'm experiencing a possible bug when attempting to query FCE package metadata while including the pubdate.
Using the search_data_packages() function to query pubdate returns only the year for that value instead of a full date (I expected something including YYYY-MM-DD). In comparison, including begindate or enddate in the same query returns YYYY-MM-DD for those values.
I am using version 1.0.2 of the package with R version 4.2.2.
An example of the script I'm running and a screenshot of the result are provided below.
library(EDIutils)
query <- search_data_packages(query = 'q=scope:(knb-lter-fce)&fl=doi,title,packageid,begindate,enddate,pubdate')
A recommendation from @laijasmine, but not yet implemented.
Dear EDIutils team,
using a new test version of MetaShARK, we have seen this error message when trying to upload a data file to "infer" attribute names and related metadata:
Error : 'validate_file_names' is not an exported object from 'namespace:EDIutils'
Can you tell us whether there is an issue with the latest version of the R package, or is the error perhaps elsewhere?
Wishing you a very good end of the week!
Cheers,
Yvan
I am calling EDIutils::api_update_data_package() in an R script to update a data package.
The response from the function call is an HTTP 401 error, but the PUT to update the data package is successful, and I can view the package with its new DOI. The error seems to come from the logic after a successful PUT (line 93 of api_update_data_package.R).
> new_doi
[1] "edi.416.6"
> EDIutils::api_update_data_package(
+ path = eml_path,
+ package.id = new_doi,
+ environment = "staging",
+ user.id = usern,
+ user.pass = passw,
+ affiliation = "EDI"
+ )
Error in open.connection(con, "rb") : HTTP error 401.
> traceback()
7: open.connection(con, "rb")
6: open(con, "rb")
5: read_connection(path)
4: datasource_connection(file, skip, skip_empty_rows, comment, skip_quote)
3: datasource(file, skip_empty_rows = FALSE)
2: readr::read_file(paste0(url_env(environment), ".lternet.edu/package/report/eml/",
stringr::str_replace_all(package.id, "\\.", "/")))
1: EDIutils::api_update_data_package(path = eml_path, package.id = new_doi,
environment = "staging", user.id = usern, user.pass = passw,
affiliation = "EDI")
> sessionInfo()
R version 4.0.4 (2021-02-15)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042)
Wrapper function that automates the process of uploading a "staged" data package (in the EDI staging environment) to production.
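A hypothetical sketch of that wrapper. The function names (login(), logout(), evaluate_data_package(), create_data_package()) come from the current EDIutils API, but the evaluate-then-create flow and the report handling here are assumptions, not the package's actual behavior:

```r
library(EDIutils)

# Hypothetical wrapper: evaluate an EML document against production,
# then create the package there if the evaluation is acceptable
promote_to_production <- function(eml_path) {
  login()            # credentials supplied interactively or via the environment
  on.exit(logout())
  # Evaluate first; inspect the returned transaction/report before proceeding
  transaction <- evaluate_data_package(eml = eml_path, env = "production")
  # ... check the evaluation report for errors here ...
  create_data_package(eml = eml_path, env = "production")
}
```

A real implementation would also need to handle re-mapping the package identifier from the staging scope to the production scope before upload.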