Comments (9)

traversc commented on May 29, 2024

New version on CRAN has this function.

from qs.

mrcaseb commented on May 29, 2024

update: I was able to speed up my function in case anyone is interested

load_qs <- function(url) qs::qdeserialize(curl::curl_fetch_memory(url)$content)

traversc commented on May 29, 2024

It is a good idea, but CRAN doesn't allow using R-connections directly within C code. Glad you found a workaround!

mrcaseb commented on May 29, 2024

Ah, dang CRAN. Before you replied I found what readRDS is actually doing. It should be the code block linked below:

https://github.com/microsoft/microsoft-r-open/blob/d72636113ede1c1d28959c8b8723c15c694957f4/source/src/main/serialize.c#L2236-L2282

I assume it's a CRAN exception for base R

zecojls commented on May 29, 2024

Is there any update on allowing qs::qread to read URLs? Wrapping load_qs inside qs::qread would help a lot.

traversc commented on May 29, 2024

@zecojls Sure, it could go into a future update; I'd just like to think about how it should look.

Could you help me prototype this? Here are my thoughts:

I'd prefer not to have curl as a strict dependency (just to keep requirements at an absolute minimum). Is there a base-R option that's just as performant?

I'm thinking it should be in a separate function such as qread_url, because qread is auto-generated by Rcpp (linking to the C++ code).

zecojls commented on May 29, 2024

I was just googling about it and found this qs_from_url function in the nflverse package. I agree that avoiding dependencies is good, but I think curl is pretty active and well-maintained.

traversc commented on May 29, 2024

curl is great, but it has a system libcurl-dev requirement, which presents a challenge, e.g. if you're on a Linux workstation where you don't have admin privileges.

So I'm considering two options, use curl and add it as a suggested dependency:

qread_url <- function(url, ...) {
  # curl would only be a suggested dependency, so check for it at call time
  if (requireNamespace("curl", quietly = TRUE)) {
    qs::qdeserialize(curl::curl_fetch_memory(url)$content, ...)
  } else {
    stop("qread_url requires curl installed")
  }
}

Or some base R solution such as:

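qread_url <- function(url, ...) {
  # base::url() takes `open`, not `mode`/`raw`; "rb" opens for binary reading
  con <- base::url(url, open = "rb")
  on.exit(close(con))
  buffer_size <- 10000
  chunks <- list()
  repeat {
    x <- readBin(con, what = "raw", n = buffer_size)
    if (length(x) == 0) break  # nothing left to read
    chunks[[length(chunks) + 1]] <- x
  }
  # concatenate the raw chunks and hand them to qdeserialize
  qs::qdeserialize(do.call(c, chunks), ...)
}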

zecojls commented on May 29, 2024

Well, they are pretty much the same, I think (it depends on the internet connection). Reading a 13 MB file from Google Cloud Storage took me around 3 sec in both modes. I think that sticking to base R is great, but I'm not sure how it deals with larger files that exceed the chunk size. Unfortunately, I have no idea how to download the chunks in a loop and append them.

library("qs")
library("curl")
library("tictoc")

options(timeout=240)

qread_url_curl <- function(url, ...) {
  # requireNamespace() checks that curl is installed without attaching it
  if(!requireNamespace("curl", quietly = TRUE)) {
    stop("qread_url requires curl installed")
  } else {
    qs::qdeserialize(curl::curl_fetch_memory(url)$content, ...)
  }
}

qread_url_base <- function(url, ...) {
  con <- file(url, "rb", raw = TRUE)
  buffer_size <- 2^31-1 # limit from readBin help
  x <- readBin(con, what = "raw", n = buffer_size)
  close(con)
  qs::qdeserialize(x, ...) # forward ... as the curl version does
}

target.url <- "https://storage.googleapis.com/soilspec4gg-test/test.qs"

# 2.993 sec
tic()
test1 <- qread_url_curl(target.url)
toc()

# 2.991
tic()
test2 <- qread_url_base(target.url)
toc()
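
On the open question of reading files larger than one chunk: a minimal base-R sketch, not the package's implementation, is to call readBin repeatedly until it returns zero bytes and then concatenate the pieces. The names read_all_raw and qread_url_chunked and the 1e6 default chunk size are assumptions made up for this illustration.

```r
# Read every byte from an open binary connection in fixed-size chunks.
# (Hypothetical helper; works on any connection that supports readBin.)
read_all_raw <- function(con, chunk_size = 1e6) {
  chunks <- list()
  repeat {
    x <- readBin(con, what = "raw", n = chunk_size)
    if (length(x) == 0L) break  # zero bytes back means end of stream
    chunks[[length(chunks) + 1L]] <- x
  }
  # Concatenate once at the end instead of growing a vector in the loop
  if (length(chunks) == 0L) raw(0) else do.call(c, chunks)
}

# Hypothetical wrapper: fetch a .qs payload over HTTP(S) with base R only.
qread_url_chunked <- function(url, ..., chunk_size = 1e6) {
  con <- base::url(url, open = "rb")
  on.exit(close(con))
  qs::qdeserialize(read_all_raw(con, chunk_size), ...)
}
```

Collecting the chunks in a list and concatenating once avoids reallocating the whole buffer on every iteration, and it sidesteps requesting a 2^31-1 byte buffer up front.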
