
frictionless-r's Introduction

frictionless


Frictionless is an R package to read and write Frictionless Data Packages. A Data Package is a simple container format and standard to describe and package a collection of (tabular) data. It is typically used to publish FAIR and open datasets.

To get started, see the sections below, the Get started vignette and the function reference.

Installation

Install the latest released version from CRAN:

install.packages("frictionless")

Or the development version from GitHub or R-universe:

# install.packages("devtools")
devtools::install_github("frictionlessdata/frictionless-r")

# Or rOpenSci R-universe
install.packages("frictionless", repos = "https://ropensci.r-universe.dev")

Usage

With frictionless you can read data from a Data Package (local or remote) into your R environment. Here we read bird GPS tracking data from a Data Package published on Zenodo:

library(frictionless)

# Read the datapackage.json file
# This gives you access to all Data Resources of the Data Package without 
# reading them, which is convenient and fast.
package <- read_package("https://zenodo.org/records/10053702/files/datapackage.json")

package
#> A Data Package with 3 resources:
#> • reference-data
#> • gps
#> • acceleration
#> For more information, see <https://doi.org/10.5281/zenodo.10053702>.
#> Use `unclass()` to print the Data Package as a list.

# List resources
resources(package)
#> [1] "reference-data" "gps"            "acceleration"

# Read data from the resource "gps"
# This will return a single data frame, even though the data are split over 
# multiple zipped CSV files.
read_resource(package, "gps")
#> # A tibble: 73,047 × 21
#>     `event-id` visible timestamp           `location-long` `location-lat`
#>          <dbl> <lgl>   <dttm>                        <dbl>          <dbl>
#>  1 14256075762 TRUE    2018-05-25 16:11:37            4.25           51.3
#>  2 14256075763 TRUE    2018-05-25 16:16:41            4.25           51.3
#>  3 14256075764 TRUE    2018-05-25 16:21:29            4.25           51.3
#>  4 14256075765 TRUE    2018-05-25 16:26:28            4.25           51.3
#>  5 14256075766 TRUE    2018-05-25 16:31:21            4.25           51.3
#>  6 14256075767 TRUE    2018-05-25 16:36:09            4.25           51.3
#>  7 14256075768 TRUE    2018-05-25 16:40:57            4.25           51.3
#>  8 14256075769 TRUE    2018-05-25 16:45:55            4.25           51.3
#>  9 14256075770 TRUE    2018-05-25 16:50:49            4.25           51.3
#> 10 14256075771 TRUE    2018-05-25 16:55:36            4.25           51.3
#> # ℹ 73,037 more rows
#> # ℹ 16 more variables: `bar:barometric-pressure` <dbl>,
#> #   `external-temperature` <dbl>, `gps:dop` <dbl>, `gps:satellite-count` <dbl>,
#> #   `gps-time-to-fix` <dbl>, `ground-speed` <dbl>, heading <dbl>,
#> #   `height-above-msl` <dbl>, `location-error-numerical` <dbl>,
#> #   `manually-marked-outlier` <lgl>, `vertical-error-numerical` <dbl>,
#> #   `sensor-type` <chr>, `individual-taxon-canonical-name` <chr>, …

You can also create your own Data Package, add data and write it to disk:

# Create a Data Package and add the "iris" data frame as a resource
my_package <-
  create_package() %>%
  add_resource(resource_name = "iris", data = iris)

my_package
#> A Data Package with 1 resource:
#> • iris
#> Use `unclass()` to print the Data Package as a list.

# Write the Data Package to disk
my_package %>%
  write_package("my_directory")

For more functionality, see the Get started vignette or the function reference.

frictionless vs datapackage.r

datapackage.r is an alternative R package to work with Data Packages. It has an object-oriented design and offers validation.

frictionless on the other hand allows you to quickly read and write Data Packages to and from data frames, getting out of the way for the rest of your analysis. It is designed to be lightweight, follows tidyverse principles and supports piping. Its validation functionality is limited to what is needed for reading and writing, see frictionless-py for extensive validation.

Meta

frictionless-r's People

Contributors

damianooldoni, hansvancalster, khusmann, mpadge, nepito, peterdesmet, pietrh, yihui


frictionless-r's Issues

Create write_package() function

write_package(pkg, "directory") or write_package(df1, df2, df3, "directory")
  • Wrap write_package() in backticks in create_schema.R to create a link (multiple occurrences)
  • Writes the whole package to a datapackage.json file in the directory
  • write_csv() any newly added resources and update their path
  • Leave existing resources (non-df) untouched
  • Add an example
  • Remove full_path, resource_names, directory
  • Use jsonlite::toJSON(schema, pretty = TRUE, auto_unbox = TRUE) for schemas
  • file.copy() any existing resources that are not http URLs to the directory and update their path; this:
    • avoids having ../ and / paths, which are forbidden
    • avoids loading and doing any data transformations: data are just copied as is
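
A rough sketch of the proposed behaviour (write_package_sketch() is an illustrative stand-in, not the final implementation):

# Sketch: copy local resource files and write the descriptor
write_package_sketch <- function(package, directory) {
  dir.create(directory, showWarnings = FALSE, recursive = TRUE)
  for (resource in package$resources) {
    # Copy file-based local resources as is; leave http(s) URLs untouched
    # (data-frame-backed resources are omitted in this sketch)
    local_paths <- grep("^https?://", resource$path, value = TRUE, invert = TRUE)
    if (length(local_paths) > 0) {
      file.copy(local_paths, file.path(directory, basename(local_paths)))
    }
  }
  # Write the full descriptor to datapackage.json
  json <- jsonlite::toJSON(unclass(package), pretty = TRUE, auto_unbox = TRUE)
  writeLines(json, file.path(directory, "datapackage.json"))
}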

Create read_descriptor() function

Create a function to read a descriptor file:

descriptor <- read_descriptor("datapackage.json")

Returns a descriptor object

  • Should check that at least a resources property is available
  • It can check for other required properties

The function could be called read_dp() or read_datapackage().

Add frictionless.Rmd vignette to get started

Read

  • Read package and data (local)
  • Read package and data (remote)
  • Read from data property

Manipulate

  • Create package
  • Add resource
  • Add resource + schema
  • Remove resource
  • create_package() %>% add_resource() %>% add_resource()
  • read_package() %>% add_resource("new") %>% remove_resource("existing")

Write

  • Write package: describe behaviour

Set type for empty CSV fields to `string`, not `boolean` in `create_schema()`

When adding a resource from a CSV, a schema will be created for the file. read_delim() will interpret empty fields in the CSV files as logical (see tidyverse/readr#839). For the translation to a Table Schema, it would be better if those were set to string, not boolean.

The only way I see to do that is to check whether all values in a column are NA and, if so, set that column's type to string.
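
A minimal sketch of that check, assuming the schema structure created by create_schema() (fix_empty_fields() is a hypothetical helper):

# Override the guessed type of all-NA columns to "string"
fix_empty_fields <- function(schema, df) {
  for (i in seq_along(schema$fields)) {
    column <- df[[schema$fields[[i]]$name]]
    if (all(is.na(column))) {
      schema$fields[[i]]$type <- "string"
    }
  }
  schema
}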

@damianooldoni thoughts?

One-digit hours cannot be parsed

With the introduction of readr 2.0.0 three time parsing tests fail:

── Failure (test-read_resource.R:319:3): read_resource() handles times ─────────
resource$tm_any (`actual`) not identical to `expected_value` (`expected`).

  `actual`:    NA
`expected`: 30600

── Failure (test-read_resource.R:320:3): read_resource() handles times ─────────
resource$tm_shortcut (`actual`) not identical to `expected_value` (`expected`).

  `actual`:    NA
`expected`: 30600

── Failure (test-read_resource.R:321:3): read_resource() handles times ─────────
resource$tm_1 (`actual`) not identical to `expected_value` (`expected`).

  `actual`:    NA
`expected`: 30600

I don't understand why yet, because trying the parsing functions directly works:

library(readr)

tm_any <- parse_time("8:30", "%AT")
tm_shortcut <- parse_time("8:30:00", "%X")
tm_1 <- parse_time("8AM30", "%I%p%M")

Create create_package() function

create_package()
  • Creates a minimal list object
  • Technically it should have resources already to be valid :-/
  • It would be nice to allow users to pass some information to start a new data package which is a little more than a dummy. Something like a vector of resource names? Will be implemented with add_resource()
  • Update documentation to point to this function wherever read_package() is mentioned

Warn on header vs fields mismatch

read_resource() will use the schema$fields as column names (ignoring headers). The internal read_delim() will warn if there are (a) more or fewer columns than expected or (b) data types that cannot be cast.

But it is possible that schema$fields silently differ from the headers (e.g. when the number of columns is the same and the types are all character). It would be nice to warn (not error) users when this is the case.
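
An illustrative check (not the package's internal code; check_headers() is a hypothetical helper):

# Compare the CSV header with the schema field names and warn on mismatch
check_headers <- function(path, schema) {
  header <- names(readr::read_csv(path, n_max = 0, show_col_types = FALSE))
  fields <- vapply(schema$fields, function(field) field$name, character(1))
  if (!identical(header, fields)) {
    warning("CSV header does not match schema fields")
  }
}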

Allow adding resources with a path

One might want to create a Data Package from CSV files on disk. In that case it should be possible to use write_package() to just write the datapackage.json to disk and not read/write all the CSV files.

On disk, in directory my_dir:

gps-2018.csv
gps-2019.csv

my_package <- create_package()

# Create resource
gps <- read_csv("my_dir/gps-2018.csv")
my_package <- add_resource(
  my_package,
  resource_name = "gps",
  df = gps, # Not used to write data, only to create the schema
  path = c("my_dir/gps-2018.csv", "my_dir/gps-2019.csv") # Define path
)
# Sets resource "path", does not add inline "data"

# Write package
write_package(my_package, directory = "my_dir")
# Detects that some of the paths contain "my_dir" and:
# - Does not write the files to disk
# - Shortens the path to just the file names

Caveats:

  • CSV dialect, delimiters, etc. not known.

Ideas for write functions

Ideas for write functions:

create_package() from scratch

See #43

create_schema() for a df

See #12

✅ get_schema() from a resource

get_schema("resource_name", pkg)
  • If we order arguments like this, then also do read_resource("gps", pkg)

add_resource() to a package

See #44

✅ remove_resource()

remove_resource("name", pkg)
  • Ask an "are you sure?" question before the removal? No: it blocks the flow, and the user has to save into a new variable anyway

write_package() to a directory

See #42

Support reading inline data with schema

read_resource() supports reading from inline data, but that feature is currently marked as experimental because it completely ignores the schema. It would be good to support the schema, but that likely involves a hefty rewrite of read_resource().
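
A very rough sketch of the direction, assuming inline data is a row-wise list of named values (read_inline_with_schema() is a hypothetical helper, and the type handling is simplified):

# Build a data frame from inline data and coerce columns per the schema
read_inline_with_schema <- function(data, schema) {
  df <- do.call(rbind, lapply(data, as.data.frame))
  for (field in schema$fields) {
    if (field$type %in% c("number", "integer")) {
      df[[field$name]] <- as.numeric(df[[field$name]])
    } else if (field$type == "boolean") {
      df[[field$name]] <- as.logical(df[[field$name]])
    }
  }
  df
}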

Create function `update_schema()` to edit field properties

A created schema will only have the field properties name, type and (sometimes) constraints. I see it as fairly common to add more properties, such as description or required. It is possible to do that with purrr, but it isn't very straightforward (see the example below). Maybe a specific function would be useful.

Create schema:

library(frictionless)
iris_schema <- create_schema(iris)
str(iris_schema)
#> List of 1
#>  $ fields:List of 5
#>   ..$ :List of 2
#>   .. ..$ name: chr "Sepal.Length"
#>   .. ..$ type: chr "number"
#>   ..$ :List of 2
#>   .. ..$ name: chr "Sepal.Width"
#>   .. ..$ type: chr "number"
#>   ..$ :List of 2
#>   .. ..$ name: chr "Petal.Length"
#>   .. ..$ type: chr "number"
#>   ..$ :List of 2
#>   .. ..$ name: chr "Petal.Width"
#>   .. ..$ type: chr "number"
#>   ..$ :List of 3
#>   .. ..$ name       : chr "Species"
#>   .. ..$ type       : chr "string"
#>   .. ..$ constraints:List of 1
#>   .. .. ..$ enum: chr [1:3] "setosa" "versicolor" "virginica"
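
For comparison, editing a field property with purrr currently looks something like this (illustrative):

library(purrr)

# Set a description on one field by mapping over all fields
iris_schema$fields <- map(iris_schema$fields, function(field) {
  if (field$name == "Sepal.Width") {
    field$description <- "Sepal width in cm."
  }
  field
})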

Atomic function

iris_schema <- edit_field_property(iris_schema, "Sepal.Width", "description", "Sepal width in cm.")
# Same as: iris_schema$fields[[2]]$description <- "Sepal width in cm."

Not sure this is super useful, but it is very clear what field you are setting.

Loop function

iris_schema <- edit_fields(
  iris_schema,
  "description",
  c("Sepal length in cm.", "Sepal width in cm.", "Petal length in cm.", "Petal width in cm.", NA_character_)
)
# If value is NA or NULL, don't set property

Faster, but disconnect between field name and value you want to set.

Recode like function

iris_schema <- edit_fields(
  iris_schema,
  "description",
  "Sepal.Length" = "Sepal length in cm.",
  "Sepal.Width" = "Sepal width in cm.",
  "Species" = NA_character_
)
# If field is not listed, don't set property
# If field is listed but NA or NULL, remove it

Note, it should also work for nested properties:

iris_schema <- edit_fields(
  iris_schema,
  "constraints$required",
  "Sepal.Length" = TRUE
)

Create create_schema() function

create_schema() for a df

create_schema(df)
  • Returns a list
  • Lists fields
  • Each field has name = colname
  • Each field has type = translated coltype (see the sketch below)
  • A field could have format
  • Fields have no other properties (e.g. title, description, constraints)
  • Has missingValues: NA, which write_package() will write as ""
  • Link with any original Table Schema is lost (e.g. when df was read with read_resource(package, "resource"))
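
A minimal sketch of the type translation, assuming a simple class-based mapping (the actual implementation may differ):

# Map an R column to a Table Schema type
col_to_type <- function(x) {
  if (is.logical(x)) "boolean"
  else if (is.integer(x)) "integer"
  else if (is.numeric(x)) "number"
  else if (inherits(x, "Date")) "date"
  else if (inherits(x, "POSIXct")) "datetime"
  else "string"
}

# Sketch of create_schema(): one field per column, with name and type
create_schema_sketch <- function(df) {
  fields <- lapply(names(df), function(name) {
    list(name = name, type = col_to_type(df[[name]]))
  })
  list(fields = fields)
}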

Create add_resource() function

add_resource(df, "name", pkg, schema) # schema is optional
  • Adds a resource object to an existing package list object
  • profile: tabular-data-resource
  • name is name of df variable, unless specified
  • path: NULL? reference to df? df itself?
  • schema: hidden call to create_schema()
  • dialect: none, will be default
  • title: optionally set by user?
  • format: csv
  • mediatype: none?
  • encoding: utf-8
  • bytes, hash, sources, licenses: none
  • Do we also need remove_resource()
  • If schema is provided, check that it has same headers as df
  • Update references to function
  • Use the function in tests where data is attached (e.g. read_resource() and write_resource())

Cannot pass empty grouping_mark to locale()

Since the upgrade to readr 2.0.0 (or maybe other packages), the integer and number tests fail, specifically on the property bareNumber:

https://github.com/inbo/datapackage/blob/a195f686a86b46b6042966db69c3cc5ee2c7c237/tests/testthat/types.json#L70

The moment that property is added in the JSON file (whether true or false), it stalls any reading of the file. I haven't figured out why yet. bareNumber is handled here:

https://github.com/inbo/datapackage/blob/a195f686a86b46b6042966db69c3cc5ee2c7c237/R/read_resource.R#L337
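
A minimal illustration of the failing call, assuming readr >= 2.0.0 (bareNumber handling leads to an empty grouping_mark being passed to locale()):

library(readr)

# Per this issue, an empty grouping_mark stalls reading with readr 2.0.0
read_csv(I("x\n1000"), locale = locale(grouping_mark = ""))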

Write CSVs from remotely read package to disk

One can read a Data Package (read_package()) from a URL or download it first and then read it locally.

The current behaviour for writing a remotely read package is to not copy the files to disk, since they are already available online. However, that makes it more difficult to 1) download a whole Data Package using R and 2) update a Data Package (e.g. adding a resource), because the originally local paths are now URLs pointing to an online CSV, and that might not be the version you want to point to.

library(frictionless)
package <- read_package("https://zenodo.org/record/5070086/files/datapackage.json")
#> Please make sure you have the right to access data from this Data Package for your intended use.
#> Follow applicable norms or requirements to credit the dataset and its authors.
#> For more information, see https://doi.org/10.5281/zenodo.5070086
# Paths are local
package$resources[[2]]$path
#> [1] "O_WESTERSCHELDE-gps-2018.csv" "O_WESTERSCHELDE-gps-2019.csv"
#> [3] "O_WESTERSCHELDE-gps-2020.csv"
write_package(package, "my_directory")
list.files("my_directory")
#> [1] "datapackage.json"

written_package <- read_package("my_directory/datapackage.json")
#> Please make sure you have the right to access data from this Data Package for your intended use.
#> Follow applicable norms or requirements to credit the dataset and its authors.
#> For more information, see https://doi.org/10.5281/zenodo.5070086
# Paths are URLs now
written_package$resources[[2]]$path
#> [1] "https://zenodo.org/record/5070086/files/O_WESTERSCHELDE-gps-2018.csv"
#> [2] "https://zenodo.org/record/5070086/files/O_WESTERSCHELDE-gps-2019.csv"
#> [3] "https://zenodo.org/record/5070086/files/O_WESTERSCHELDE-gps-2020.csv"

Created on 2022-01-11 by the reprex package (v2.0.1)

"$ operator is invalid for atomic vectors" error when data package use external dialect property

In the section On dereferencing and descriptor validation the frictionless specs say that

Some properties in the Frictionless Data specifications allow a path (a URL or a POSIX path) that resolves to an object.

The most prominent example of this is the schema property on Tabular Data Resource descriptors.

Allowing such references has practical use for publishers, for example in allowing schema reuse. However, it does introduce difficulties in the validation of such properties. For example, validating a path pointing to a schema rather than the schema object itself will do little to guarantee the integrity of the schema definition. Therefore implementors MUST dereference such "referential" property values before attempting to validate a descriptor. At present, this requirement applies to the following properties in Tabular Data Package and Tabular Data Resource:

  • schema
  • dialect

frictionless-r doesn't dereference the dialect property, resulting in the error $ operator is invalid for atomic vectors in calls to dialect properties such as dialect$delimiter (e.g. https://github.com/frictionlessdata/frictionless-r/blob/main/R/read_resource.R#L312). Here is a reprex:

library(frictionless)

dp <- read_package("https://raw.githubusercontent.com/dados-mg/datapackage-reprex/external-dialect/datapackage.json")
#> Please make sure you have the right to access data from this Data Package for your proposed use.
#> Follow applicable norms or requirements to credit the dataset and its authors.

res <- read_resource("estados", dp)
#> Error: $ operator is invalid for atomic vectors

We make extensive use of this feature in the data packages of the Open Data Portal of Minas Gerais and I would be happy to submit a PR by next week if there is interest.

Support date formats

Date

See https://specs.frictionlessdata.io/table-schema/#date

  • default: YYYY-MM-DD
  • any: some attempts, e.g. 2020/01/01 will work
  • PATTERN

Time

See https://specs.frictionlessdata.io/table-schema/#time

  • default: hh:mm:ss: note that 12:04:03.943 becomes 12:04:03 (floored)
  • any: no real other formats, e.g. T12:03:04 becomes NA
  • PATTERN

Datetime

See https://specs.frictionlessdata.io/table-schema/#datetime

  • default: YYYY-MM-DDThh:mm:ssZ:
    • 2020-01-02T12:35:10Z
    • 2020-12-01T14:45:53+01:00 -> 2020-12-01 13:45:53
  • any: some attempts
    • 2019-05-04 14:26:33 -> 2019-05-04 14:26:33
    • 2019-05-04 14:26:33+01:00 -> 2019-05-04 13:26:33
  • PATTERN

Pattern implementation

Pattern support might be possible by passing col_datetime(format = ...):

cols(
  deployment_id = col_double(),
  longitude = col_double(),
  latitude = col_double(),
  start = col_datetime(format = ""),
  comments = col_character()
)
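
For example, a Table Schema pattern such as %d/%m/%Y %H:%M:%S (an illustrative pattern, not taken from the spec above) could be translated into a readr format string:

library(readr)

read_csv(
  I("start\n02/01/2020 12:35:10"),
  col_types = cols(start = col_datetime(format = "%d/%m/%Y %H:%M:%S"))
)
#> # A tibble: 1 × 1
#>   start
#>   <dttm>
#> 1 2020-01-02 12:35:10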

Allow descriptors represented in yaml

According to the specs, all descriptors (e.g. Data Package, Table Schema) MUST be a JSON object. However, there are some discussions to support a YAML representation, and frictionless-py already supports it.

This is a nice feature because it allows for more readable and diffable documentation, especially for the description property in Table Schema, which can use YAML multiline strings to write Markdown without the pain of JSON's newline handling.

Compare

fields:
- name: COD
  type: integer
  format: default
  description: |
    Código de dois dígitos que identificam o estado

    > Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
  title: Código do Estado
- name: NOME
  type: string
  format: default
  title: Nome do Estado
- name: SIGLA
  type: string
  format: default
  title: Sigla do Estado

with

{
  "fields": [
    {
      "name": "COD",
      "type": "integer",
      "format": "default",
      "description": "Código de dois dígitos que identificam o estado\n\n> Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.\n",
      "title": "Código do Estado"
    },
    {
      "name": "NOME",
      "type": "string",
      "format": "default",
      "title": "Nome do Estado"
    },
    {
      "name": "SIGLA",
      "type": "string",
      "format": "default",
      "title": "Sigla do Estado"
    }
  ]
}

Because some data packages are going to mix both JSON and YAML, it would be nice to be able to write

library("frictionless")

dp <- read_package("https://raw.githubusercontent.com/dados-mg/datapackage-reprex/yaml-schema/datapackage.json")
df <- dp |> read_resource("estados")

which currently fails with

Error in parse_con(txt, bigint_as_char) : 
  lexical error: invalid string in json text.
                                       fields: - name: COD   type: int
                     (right here) ------^
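
A sketch of how support could look, assuming the yaml package as an extra dependency (read_descriptor_file() is a hypothetical helper):

# Parse a descriptor from JSON or YAML based on the file extension
read_descriptor_file <- function(path) {
  if (grepl("\\.ya?ml$", path)) {
    yaml::read_yaml(path)
  } else {
    jsonlite::fromJSON(path, simplifyDataFrame = FALSE)
  }
}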

Write factor to enum

Factor to enum (create_schema()):

If a df column is a factor, add the factor levels as an enum constraint.
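
A minimal sketch of the idea (field_from_factor() is an illustrative helper):

# Build a Table Schema field for a factor column, with levels as enum
field_from_factor <- function(name, x) {
  list(
    name = name,
    type = "string",
    constraints = list(enum = levels(x))
  )
}

str(field_from_factor("Species", iris$Species))
#> List of 3
#>  $ name       : chr "Species"
#>  $ type       : chr "string"
#>  $ constraints:List of 1
#>   ..$ enum: chr [1:3] "setosa" "versicolor" "virginica"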

Overflowing integers

> pkg <- read_package("https://zenodo.org/record/5056105/files/datapackage.json")
> gps <- read_resource(pkg, "gps")
Warning: 2690 parsing failures.
row      col   expected      actual                                                                file
  1 event-id an integer 19193113855 'https://zenodo.org/record/5056105/files/MH_ANTWERPEN-gps-2018.csv'
  2 event-id an integer 19193113856 'https://zenodo.org/record/5056105/files/MH_ANTWERPEN-gps-2018.csv'
  3 event-id an integer 19193113857 'https://zenodo.org/record/5056105/files/MH_ANTWERPEN-gps-2018.csv'
  4 event-id an integer 19193113858 'https://zenodo.org/record/5056105/files/MH_ANTWERPEN-gps-2018.csv'
  5 event-id an integer 19193113859 'https://zenodo.org/record/5056105/files/MH_ANTWERPEN-gps-2018.csv'

Integers (https://specs.frictionlessdata.io/table-schema/#integer) can be valid in Frictionless but too big for R's integer type. The data frame will still be loaded, but the column will contain NA for overflowing integers. It would be better to cast the whole column to numeric.

@damianooldoni Is there a way we can achieve that, but only if there are overflow values?
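
One possible approach, sketched here with a hypothetical helper: parse the column as integer first and fall back to numeric only when values overflow.

# Fall back to numeric when integer parsing overflows R's 32-bit range
parse_integer_safely <- function(x_chr) {
  x_int <- suppressWarnings(as.integer(x_chr))
  if (anyNA(x_int) && !anyNA(suppressWarnings(as.numeric(x_chr)))) {
    as.numeric(x_chr) # Overflowed: keep the values instead of NA
  } else {
    x_int
  }
}

parse_integer_safely(c("19193113855", "19193113856"))
#> [1] 19193113855 19193113856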

CR line endings no longer supported

readr 2.0.0 no longer supports CR-only line endings:

Normalizing newlines in files with just carriage returns \r is no longer supported. The last major OS to use only CR as the newline was ‘classic’ Mac OS, which had its final release in 2001.

This is not a huge issue for datapackage, since LF and CRLF are the expected defaults. Will update the function documentation and tests.

Create function `add_metadata()` or `add_property()`

Add function to add common properties to descriptor file, e.g.:

package <- add_metadata(package, id = "https://doi.org/10.5281/zenodo.5070086")
  • Would the function check if the parameter names are valid? If so, what about optional ones? Maybe add_property() is better
  • How would the function react if a property is already there? Overwrite with warning?
  • Can we use the add_property() function for adding properties to fields or schemas? If so, can the user give a vector of equal length to the number of properties?
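
A sketch of what add_property() could look like (the name and semantics are still open questions, see above):

# Set a property on a package, keeping or overwriting an existing value
add_property <- function(package, name, value, overwrite = TRUE) {
  if (!is.null(package[[name]]) && !overwrite) {
    warning("Property '", name, "' already set; keeping the existing value")
    return(package)
  }
  package[[name]] <- value
  package
}

# package <- add_property(package, "id", "https://doi.org/10.5281/zenodo.5070086")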

Should resource_names and directory be optional?

The current workflow is:

  1. Read datapackage.json file with package <- read_package(), which creates a list object and adds two convenience terms: resource_names and directory. Those terms are not part of the Frictionless spec.
  2. Read a resource with read_resource(package, "resource name"). It will make use of those two convenience terms.

If users want to start a package object from scratch or use another tool to make it, they need to do the following for read_resource() to work:

  • Need to add resource_names, otherwise they get an error message. Alternative: read_resource() can be written so that it doesn't need this term.
  • Need to add directory, otherwise they get an error that resources could not be found in the root (i.e. no directory). Alternative: offer a way to provide a directory: read_resource(package, "resource_name", directory)

@niconoe @damianooldoni what would be your take on this? Can we expect users to always use read_package() and, if not, to add the two convenience terms manually? Or should we allow read_resource() to work without those convenience terms?

Message users to make sure they can use data

When using read_package(), a generic short message could be printed to the console to remind users to follow data licenses and guidelines and to read the metadata. This would be on by default, but maybe we can add an option to silence it.

One add_resource() test causes segfault

The following test in test-add_resource():

test_that("add_resource() creates resource that can be passed to write_package()", {
pkg <- example_package
df <- data.frame(
"col_1" = c(1, 2),
"col_2" = factor(c("a", "b"), levels = c("a", "b", "c"))
)
pkg <- add_resource(pkg, "new", df)
temp_dir <- tempdir()
expect_invisible(write_package(pkg, temp_dir)) # Can write successfully
unlink(temp_dir, recursive = TRUE)
})

... runs fine, but causes a critical error in 2 tests for other functions:

test_that("check_schema() returns TRUE on valid Table Schema", {
pkg <- example_package
# Can't obtain df using read_resource(), because that function uses
# check_schema() (in get_schema()) internally, which is what we want to test
df <- suppressMessages(
readr::read_csv(file.path(pkg$directory, pkg$resources[[1]]$path))
)
schema_get <- get_schema(pkg, "deployments")
schema_create <- create_schema(df)
expect_true(check_schema(schema_get))
expect_true(check_schema(schema_create))
expect_true(check_schema(schema_get, df))
expect_true(check_schema(schema_create, df))
})

test_that("read_resource() returns a tibble", {
pkg <- example_package
df <- data.frame(
"col_1" = c(1, 2),
"col_2" = factor(c("a", "b"), levels = c("a", "b", "c"))
)
pkg <- add_resource(pkg, "new", df)
expect_s3_class(read_resource(pkg, "deployments"), "tbl") # via path
expect_s3_class(read_resource(pkg, "media"), "tbl") # via data
expect_s3_class(read_resource(pkg, "new"), "tbl") # via df
})

The error is:

 *** caught segfault ***
address 0x68, cause 'memory not mapped'
x | 1      13 | check_schema [0.2s]                       
──────────────────────────────────────────────────────────
Error (test-check_schema.R:5:3): check_schema() returns TRUE on valid Table Schema
Error in `vroom_(file, delim = delim %||% col_types$delim, col_names = col_names, 
    col_types = col_types, id = id, skip = skip, col_select = col_select, 
    name_repair = .name_repair, na = na, quote = quote, trim_ws = trim_ws, 
    escape_double = escape_double, escape_backslash = escape_backslash, 
    comment = comment, skip_empty_rows = skip_empty_rows, locale = locale, 
    guess_max = guess_max, n_max = n_max, altrep = vroom_altrep(altrep), 
    num_threads = num_threads, progress = progress)`: R_Reprotect: only 90 protected items, can't reprotect index 117
Backtrace:
 1. base::suppressMessages(...) test-check_schema.R:5:2
 3. readr::read_csv(file.path(pkg$directory, pkg$resources[[1]]$path))
 4. vroom::vroom(...)
 5. vroom:::vroom_(...)
──────────────────────────────────────────────────────────

I don't know what is causing the issue. @damianooldoni any ideas?

For now, I have disabled the add_resource() test that is causing this in cfe03b7. When fixed, revert that commit.

Create read_resource() function

Create a function to read a resource:

descriptor$resources
#> ["deployments", "multimedia", "observations"]
df <- read_resource(descriptor, "deployments")

Returns a data.frame

  • Should use the resource name (not path) as input
  • Will need a descriptor object to understand paths and fields
  • Adapts the internal read function to the provided CSV dialect
  • If multiple paths are provided, merges data in path order (see the sketch below)
  • If a schema is available, assigns df data types based on schema field types
  • Could provide a schema_sync option to allow syncing CSV columns with the schema based on field names
  • Returns a warning if a data type cannot be cast (sets to character)
  • Could potentially validate data against constraints
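
The multi-path merge could boil down to something like this sketch (dialect and schema handling omitted; read_paths() is illustrative):

library(readr)
library(dplyr)

# Read each path and stack the results in path order
read_paths <- function(paths, col_types = NULL) {
  bind_rows(lapply(paths, read_csv, col_types = col_types))
}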

Can we remove rlist dependency?

We currently use rlist for one step:

# Remove elements that are NULL or empty list
schema <- rlist::list.clean(
  schema,
  function(x) is.null(x) | length(x) == 0L,
  recursive = TRUE
)

It might be worth investigating whether this functionality can be replaced with base R or purrr code, e.g.:
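
A possible base R replacement (a sketch; edge cases such as attributes are not handled):

# Recursively drop elements that are NULL or have length zero
list_clean <- function(x) {
  if (!is.list(x)) return(x)
  x <- lapply(x, list_clean)
  x[!vapply(x, function(e) is.null(e) || length(e) == 0L, logical(1))]
}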
