
EDIutils's Issues

read_data_entity_names of multiple package_ids

Hey!

I need a hand here.
I have a list with the latest package IDs of multiple datasets (accessed 2023-09-25).
Instead of accessing them one by one, I want to run all of them at once:

# This is a list with all the package IDs I am trying to access (n=81)
package_ids <- data_list$packageid # chr

# Download all of the data entities for the package IDs in the list
data_entity_names <- vector("list", length(package_ids))
for (i in seq_along(package_ids)) {
  data_entity_names[[i]] <- read_data_entity_names(packageId = package_ids[i])
}

and it returns:

Error in read_data_entity_names(packageId = package_ids[i]) :
Not Found (HTTP 404). Failed to .

Has anyone encountered a similar problem before? Is there a better way/command to work around this?
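For what it's worth, wrapping each call in tryCatch should isolate the failing ID(s) (a sketch, assuming the 404 comes from one or more stale package IDs):

results <- lapply(package_ids, function(id) {
  tryCatch(
    read_data_entity_names(packageId = id),
    error = function(e) {
      message("Failed for ", id, ": ", conditionMessage(e))
      NULL
    }
  )
})
names(results) <- package_ids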

Thanks,
Paschalis

function validate_path() should tell me the invalid path

EAL uses this function, but R is not very good at tracing the problem for you, so scripts have to do it. Here is the message I get if EAL (EDIutils) cannot find a path:

Error in EDIutils::validate_path(fun.args$path) : 
  The directory specified by the argument "path" does not exist! Please enter the correct path for your dataset working directory.

My script has several paths in it; validate_path() should print out the offending path to make it easier to track down.
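A minimal sketch of the requested behavior (the real internals of validate_path() may differ):

validate_path <- function(path) {
  if (!dir.exists(path)) {
    stop('The directory specified by the argument "path" does not exist: "',
         path,
         '". Please enter the correct path for your dataset working directory.',
         call. = FALSE)
  }
  invisible(path)
}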

LTER affiliation no longer valid

Hi @clnsmth - I believe that since @servilla changed the LDAP, 'LTER' is no longer a valid option for the affiliation argument and should be removed. Below is a relevant snippet from api_evaluate_data_package, but this may be relevant in other places within the package as well.

#' @param affiliation
#'     (character) Affiliation corresponding with the user.id argument supplied
#'     above. Can be: 'LTER' or 'EDI'.
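If 'LTER' is indeed retired, the docstring might simply become (assuming 'EDI' remains the only valid value):

#' @param affiliation
#'     (character) Affiliation corresponding with the user.id argument supplied
#'     above. Can be: 'EDI'.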

wrapper function to download all tables in a package

@clnsmth this might already exist but I just didn't find it. I think the most common use case is something like this, wherein someone just wants to download ALL the tabular data into a list (or possibly stacked, which should be an option) and have the tables come out with sensible names.

library(EDIutils)
library(magrittr)
library(data.table)

# assume you used the awesome workflow already to get the package you want
pkgid <- 'knb-lter-nwt.210.1'

# get the ids
entities_id <- EDIutils::api_list_data_entities(pkgid)
# build the urls
entities_id$url <- paste0('https://pasta.lternet.edu/package/data/eml/',
                          gsub('\\.', '/', pkgid), '/', entities_id$identifier)
# also want the names
entities_name <- sapply(entities_id$identifier, function(x)
  api_read_data_entity_name(pkgid, x,
                            environment = 'production')) %>%
  # spaces = hassle to work with, get rid of them
  gsub(' ', '_', .)

# download all
alldata <- list()
# should add some EML scraping here to make it read_delim based on the delimiter,
# headers TRUE/FALSE, etc.
for (k in seq_len(nrow(entities_id))) {  # nrow(), not length(): entities_id is a data frame
  alldata[[entities_name[k]]] <- read.csv(entities_id$url[k])
}
# probably optional whether you want to bind rows into a single df or not;
# this is an example where it makes sense to, but in other instances you might not
finally <- alldata %>% rbindlist(idcol = TRUE, use.names = TRUE)

Does this exist and I am just not finding it? Or would it be worth wrapping the above into a function with the package?
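If it doesn't already exist, a hypothetical wrapper (the name and arguments are invented here) could be as thin as:

# Hypothetical wrapper; not (yet) part of EDIutils
read_all_tables <- function(packageId, stack = FALSE) {
  entities <- EDIutils::api_list_data_entities(packageId)
  urls <- paste0('https://pasta.lternet.edu/package/data/eml/',
                 gsub('\\.', '/', packageId), '/', entities$identifier)
  nms <- gsub(' ', '_', sapply(entities$identifier, function(x)
    EDIutils::api_read_data_entity_name(packageId, x, environment = 'production')))
  tables <- setNames(lapply(urls, read.csv), nms)
  if (stack) {
    data.table::rbindlist(tables, use.names = TRUE, fill = TRUE, idcol = TRUE)
  } else {
    tables
  }
}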

Vignettes covering common use cases

Vignettes covering common use cases would be helpful. A few examples:

  • Query for the latest version of a data package and then download all the tables
  • Get the latest data package version number and increment by 1 (see the sketch after this list)
  • Evaluate and upload a data package
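For the second bullet, for example, a sketch along these lines (assuming list_data_package_revisions() returns the revision numbers for a given scope and identifier):

library(EDIutils)

# Newest revision of the edi/412 series, bumped by one (sketch)
revisions <- list_data_package_revisions(scope = "edi", identifier = 412)
next_version <- max(as.integer(revisions)) + 1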

Failing CRAN checks related to vcr

A few weeks ago, EDIutils CRAN checks began failing as a result of issues in the vcr package. After the vcr v1.2.0 upgrade, EDIutils checks returned to a passing state (mostly). Two OS configurations are still failing, namely r-release-windows-x86_64 and r-oldrel-windows-ix86+x86_64.

Investigate the source of these issues and fix.

comma in entity names causes read_data_entity_names to return extra columns

EDIutils::read_data_entity_names("knb-lter-ble.12.1") returns

(screenshot: returned data frame with an extra unnamed column)

See the extra unnamed column. The entity name as listed in EML is "Sediment pigment data, personnel list".

I figure this is not particularly EDIutils' fault but perhaps the PASTA API? Reporting here because EDIutils is what I was using.
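If the underlying response really is one unquoted "identifier,entityName" pair per line, a defensive parse could split on the first comma only (a sketch with an illustrative line, not EDIutils code):

line <- "2,Sediment pigment data, personnel list"
# Split on the *first* comma only, so commas in entity names survive
parts <- regmatches(line, regexpr(",", line, fixed = TRUE), invert = TRUE)[[1]]
parts[1]  # "2"
parts[2]  # "Sediment pigment data, personnel list"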

detect_delimeter.R does not properly detect delimeters

I am testing the EML Assembly Line, which relies on the detect_delimeter.R script, with .csv files. This CSV file is delimited by ";" (the norm when using read.csv2() and write.csv2()).

Edit

The error message itself is :

I'm having trouble identifying the field delimeter of SupplementaryTable4.csv. Enter the field delimeter of this file. Valid options are:  ,  \t  ;  |

The message thus indicates that .csv files may use any of the separators , \t ; |, yet the semicolon in this file is not detected automatically.
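As a workaround (not EDIutils code), one could guess the delimiter by counting candidate characters on the first line of the file:

guess_delimiter <- function(file, candidates = c(",", "\t", ";", "|")) {
  first_line <- readLines(file, n = 1)
  counts <- vapply(candidates, function(d) {
    lengths(regmatches(first_line, gregexpr(d, first_line, fixed = TRUE)))
  }, integer(1))
  candidates[which.max(counts)]
}

guess_delimiter("SupplementaryTable4.csv")  # expected to return ";"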

Draft `make_query` Function for R-Style Solr Queries

Summary

Hey EDIutils team! I had a conversation with Colin Smith and Greg Maurer recently about creating a make_query function to help make Solr queries for people with some R literacy but limited prior exposure to Solr. The hope is that this new function would make it easier for R users to make good use of EDIutils::search_data_packages.

I've taken a stab at this function and will attach the full code to this issue. Note that I also wrote two helper functions, solr_wild and solrize, to keep the internal components of make_query as streamlined as possible. I'm definitely a novice at Solr queries, so make_query may be missing crucial arguments, but I think it's a reasonable starting point; it's built to be semi-modular and could easily support additional arguments. All functions are written in base R (version 4.3.2).

Let me know if this doesn't work on your end and/or if you'd like me to make any changes before it could possibly be built into EDIutils. Thanks!

Function Demo Script

# Load needed libraries
library(EDIutils)

# Clear environment
rm(list = ls())

# Define helper function
## Swaps human equivalents of wildcards for Solr wildcard
solr_wild <- function(bit){
  
  # Handle empty `bit`
  if (is.null(bit)) {
    
    # Replace with wildcard
    bit_v2 <- "*"
  }
  
  # Handle English equivalents for wildcard
  else if(length(bit) == 1){
    
    # Replace allowed keywords with wildcard
    bit_v2 <- gsub(pattern = "all|any", replacement = "*", x = bit)
  } 
  
  # If neither condition is met, return whatever was originally supplied
  else { bit_v2 <- bit }
  
  # Return finished product
  return(bit_v2) }

# Example(s)
solr_wild(bit = NULL)
solr_wild(bit = "any")
solr_wild(bit = "something else")

# Define helper function
## Parses English text into Solr syntax (i.e., right delimiters, etc.)
solrize <- function(bit){
  
  # Replace spaces with hyphens
  bit_v2 <- gsub(pattern = " ", replacement = "-", x = bit)
  
  # If more than one value, handle that
  if(length(bit_v2) > 1){
    
    # Collapse with plus signs
    bit_v3 <- paste0("(", paste0(bit_v2, collapse = "+"), ")")
    
  } else { bit_v3 <- bit_v2 }
  
  # Return finished bit
  return(bit_v3) }

# Example(s)
solrize(bit = c("primary production", "plants"))

# Define function to generate query
make_query <- function(keywords = NULL, subjects = NULL, authors = NULL, 
                       scopes = NULL, excl_scopes = NULL, 
                       return_fields = "all", limit = 10){

  ## Error Checking ----
  # Define supported 'return_fields'
  good_fields <- c("*", "all", "abstract", "begindate", "doi", "enddate",
                   "funding", "geographicdescription", "id", "methods",
                   "packageid", "pubdate", "responsibleParties", "scope",
                   "site", "taxonomic", "title", "authors", "spatialCoverage",
                   "sources", "keywords", "organizations", "singledates",
                   "timescales")
  
  # Error out for unsupported ones
  if (!all(return_fields %in% good_fields))
    stop("Unrecognized return field(s): ", 
         paste(base::setdiff(x = return_fields, y = good_fields), collapse = "; "))
  
  # Warn and coerce a non-numeric limit to the default
  if (!is.numeric(limit)) {
    message("`limit` must be numeric, coercing to 10")
    limit <- 10 }
  
  ## Solr Query Construction ----
  # Make start of query object
  query_v0 <- "q="
  
  # If keywords are provided:
  ### 1. Turn into Solr Syntax
  solr_kw <- solrize(bit = solr_wild(bit = keywords)) 
  
  ### 2. Add to query
  query_v1 <- paste0(query_v0, "keyword:", solr_kw)
  
  # Handle authors
  solr_aut <- solrize(bit = solr_wild(bit = authors))
  query_v2 <- paste0(query_v1, "&fq=", "author:", solr_aut)
  
  # Handle subjects
  solr_sub <- solrize(bit = solr_wild(bit = subjects))
  query_v3 <- paste0(query_v2, "&fq=", "subject:", solr_sub)
  
  # Handle scopes
  solr_scp <- solrize(bit = solr_wild(bit = scopes))
  query_v4 <- paste0(query_v3, "&fq=", "scope:", solr_scp)
  
  # EXCLUDED scopes
  ## Handled differently because we don't want to swap `NULL` for a wildcard
  if (!is.null(excl_scopes)) {
    
    # Solr-ize
    solr_excl_scp <- solrize(bit = excl_scopes)
    
    # Add to query
    query_v5 <- paste0(query_v4, "&fq=", "-scope:", solr_excl_scp)
    
    # Or skip
  } else { query_v5 <- query_v4 }
  
  # Parse return fields
  ## Solr syntax for multiple entries differs here from other elements of query
  solr_fl <- paste(solr_wild(bit = return_fields), collapse=",")
  query_v6 <- paste0(query_v5, "&fl=", solr_fl)
  
  # Finally, assemble full query with row limit
  solr_query <- paste0(query_v6, "&rows=", limit)
  
  # Return that to the user
  return(solr_query) }

#  Invoke function
( request <- make_query(keywords = "*", 
                        scopes = "knb-lter-fce",
                        excl_scopes = c("ecotrends", "lter landsat"),
                        return_fields =  c("title", "authors", "id", "doi"),
                        limit = 10) )

# Test assembled query
EDIutils::search_data_packages(query = request)

# Test use of `make_query` inside of `search_data_packages`
EDIutils::search_data_packages(query = make_query(excl_scopes = "knb-lter-fce",
                                                  return_fields = c("title", "id")))

enhance input to api_get_provenance_metadata to accept urls and dois

api_get_provenance_metadata is a fantastic resource, but I ran into a case where I needed to access provenance information and had the DOI and/or URL of the dataset rather than its package identifier (e.g., knb-lter-xxx.x.x). Below is an R-based MRE, using a dataset from BNZ, that I used to address this task. It seems the utility of api_get_provenance_metadata would be increased if it natively accepted a dataset DOI or URL in addition to the package identifier.

MRE (in R):

library(rvest)
library(EDIutils)
library(EML)
library(dplyr)
library(stringr)

url <- "https://doi.org/10.6073/pasta/31b32868ddbb099c4b5480fb00eb2481"

landingPage <- read_html(url)

pageSubset <- landingPage %>%
  html_nodes(".no-list-style") %>%
  html_text()

packageId <- str_extract(grep("knb-lter-", pageSubset, value = TRUE)[[1]], "^\\S*")

packageProv <- emld::as_emld(EDIutils::api_get_provenance_metadata(packageId))
packageProv$`@context` <- NULL
packageProv$`@type` <- NULL

# desired output
packageProv 
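As a hedged sketch of the requested behavior: if the installed EDIutils provides search_data_packages() and PASTA's Solr index permits filtering on the doi field (both are assumptions here), the scraping step could collapse to a single search:

library(EDIutils)

# Assumed doi format in the Solr index: "doi:10.6073/pasta/..."
doi <- "doi:10.6073/pasta/31b32868ddbb099c4b5480fb00eb2481"
res <- search_data_packages(query = paste0('q=doi:"', doi, '"&fl=packageid'))
packageId <- res$packageid[1]

packageProv <- emld::as_emld(api_get_provenance_metadata(packageId))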

In vignettes/retrieve_downloads.Rmd, the documentation on using .Renviron is confusing.

It is unclear what content to put in .Renviron for login(). The Authentication section in vignettes/evaluate_and_upload.Rmd has a more extensive explanation of authenticating, but not of using .Renviron. From https://db.rstudio.com/best-practices/managing-credentials/ it looks like one could put the following in the .Renviron file:

userId = "my_name"
userPass = "my_secret"

and use:

login(userId = Sys.getenv("userId"), userPass = Sys.getenv("userPass"))

detect_delimeter.R and "inconsistent number of field delimiters"

Dear EDI,

Using the EML Assembly Line through MetaShARK, @earnaud, I have an issue trying to import a tab-separated datafile:

Templating table attributes ...
Warning: Error in detect_errors: occurrence.txt contains an inconsistent number of field delimeters. The correct number of field delimiters for this table appears to be 237. Deviation from this occurs at rows: 141, 209, 520, 1056, 1058, 1059, 1071, 1119, 1120, 1121, 1132, 1140, 1142, 1147, 1153, 1157, 1165, 1185, 1189, 1219, 1238, 3852, 3932 ... Check the number of field delimiters in these rows. All rows of your table must contain a consistent number of fields.

This file is an occurrence.txt file coming from GBIF, and the number of delimiters (\t) appears correct when inspecting the file's content in a text editor, notably on lines 140, 141, and 142.
occurrence.txt

I am not fully sure the detect_errors error message is from EDIutils, but it seems to me that it comes from R/validate_fields.R, doesn't it?
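To check the per-row delimiter counts directly, a simple sketch (note it ignores quoting, which is often the real culprit in such mismatches):

lines <- readLines("occurrence.txt")
tab_counts <- vapply(lines, function(x) {
  lengths(regmatches(x, gregexpr("\t", x, fixed = TRUE)))
}, integer(1))
table(tab_counts)         # distribution of delimiter counts per row
which(tab_counts != 237)  # rows deviating from the expected 237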

Update EDI contact email

The current EDI contact email address uses the deprecated domain @environmentaldatainitiative.org. To ensure that communication flows smoothly, we need to update the contact email to the new domain @edirepository.org.

Return entity identifiers when supported

The order of results returned from the PASTA+ API is not guaranteed across calls to the same method. To link entity attributes (e.g. name, size, etc.), the entity identifier must be used as a key, but it is not currently returned by the EDIutils api* functions. This needs to be fixed.

`validate_file_names` returns "Invalid data.files entered"

Hi,

when I am using EMLassemblyline::template_taxonomic_coverage(), an error occurs in validate_file_names:

Templating taxonomic coverage ...
Error in EDIutils::validate_file_names(path = fun.args$data.path, data.files = fun.args$taxa.table) : 
  Invalid data.files entered: /home/pndb-elie/dataPackagesOutput/emlAssemblyLine/test_emldp/decomp.csv

I pass a path to the path argument and let data.files take its default value (i.e., path).
By testing manually, I see that this error occurs at the condition:

sum(use_i) == length(data.files)

But use_i has length equal to the number of files in the path directory, so it cannot be guaranteed to match length(data.files). I think the condition could be rewritten:

length(which(use_i)) == length(data.files)

Using search_data_packages only returns YYYY for pubdate instead of YYYY-MM-DD

I'm experiencing a possible bug when attempting to query FCE package metadata when including the pubdate.

Using the search_data_packages() function to query pubdate returns only the year for that value instead of a full date (I expect something including YYYY-MM-DD). In comparison, including begindate or enddate in the same query returns YYYY-MM-DD for those values.

I am using version 1.0.2 of the package with R version 4.2.2.

An example of the script I'm running and a screenshot of the result are provided below.

library(EDIutils)

query <- search_data_packages(query = 'q=scope:(knb-lter-fce)&fl=doi,title,packageid,begindate,enddate,pubdate')

(screenshot: query results with pubdate values containing only YYYY)

Error : 'validate_file_names' is not an exported object from 'namespace:EDIutils'

Dear EDIutils team,

using a new test version of MetaShARK, we saw this error message when trying to upload a datafile to "infer" attribute names and related metadata:

Error : 'validate_file_names' is not an exported object from 'namespace:EDIutils'

Can you tell us whether there is an issue with the latest version of the R package, or maybe an error elsewhere?
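If the function still exists internally and was only dropped from the NAMESPACE, a fragile stopgap (relies on internal API; the arguments here are placeholders) might be:

# Hypothetical workaround: call the unexported function directly
EDIutils:::validate_file_names(path = my_data_path, data.files = my_files)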

Wishing you a very good end of the week!

Cheers,

Yvan

api_update_data_package() returning 401 after successful PUT

I am calling EDIutils::api_update_data_package() in an R script to update a data package.

The response from the function call is an HTTP 401 error, but the PUT to update the data package is successful, and I can view the package with its new DOI. The error seems to come from the logic after a successful PUT (line 93 of api_update_data_package.R).

> new_doi
[1] "edi.416.6"
> EDIutils::api_update_data_package(
+     path = eml_path,
+     package.id = new_doi,
+     environment = "staging",
+     user.id = usern,
+     user.pass = passw,
+     affiliation = "EDI"
+ )
Error in open.connection(con, "rb") : HTTP error 401.
> traceback()
7: open.connection(con, "rb")
6: open(con, "rb")
5: read_connection(path)
4: datasource_connection(file, skip, skip_empty_rows, comment, skip_quote)
3: datasource(file, skip_empty_rows = FALSE)
2: readr::read_file(paste0(url_env(environment), ".lternet.edu/package/report/eml/", 
       stringr::str_replace_all(package.id, "\\.", "/")))
1: EDIutils::api_update_data_package(path = eml_path, package.id = new_doi, 
       environment = "staging", user.id = usern, user.pass = passw, 
       affiliation = "EDI")
> sessionInfo()
R version 4.0.4 (2021-02-15)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042)
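Until the post-PUT report read is fixed, a possible interim pattern (a sketch; it assumes the 401 is thrown only while reading the report after the update has already succeeded):

result <- tryCatch(
  EDIutils::api_update_data_package(path = eml_path, package.id = new_doi,
                                    environment = "staging", user.id = usern,
                                    user.pass = passw, affiliation = "EDI"),
  error = function(e) {
    # The package may still have been updated; verify on the portal
    message("Post-PUT report read failed: ", conditionMessage(e))
    invisible(NULL)
  }
)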
