mlr-org / mlr3db
Data Backends to let mlr3 work transparently with (remote) databases
Home Page: https://mlr3db.mlr-org.com
License: GNU Lesser General Public License v3.0
To work with really large databases.
This is required to combine mlr3db with non-local parallelization. Maybe https://github.com/rstudio/pool solves this (rstudio/pool#72).
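For reference, a minimal sketch of what providing a connector looks like (assuming `DataBackendDplyr$new()` accepts a `connector` argument, as the reconnect error message in the revdep check below suggests):

```r
library(mlr3db)
library(dplyr)
library(DBI)
library(RSQLite)

path = tempfile(fileext = ".sqlite")
con = dbConnect(SQLite(), path)
dbWriteTable(con, "iris", cbind(iris, row_id = seq_len(nrow(iris))))

# A connector is a zero-argument function that (re-)establishes the
# connection; the backend can call it after serialization to a parallel
# worker invalidated the original connection object.
connector = function() DBI::dbConnect(RSQLite::SQLite(), path)

b = DataBackendDplyr$new(tbl(con, "iris"), primary_key = "row_id",
  connector = connector)
```

This only helps for databases reachable from every worker; a pooled solution as in rstudio/pool would additionally cap the number of open connections.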
We're in the process of releasing duckdb 0.9.0. Running your checks with the updated duckdb version revealed problems: https://github.com/duckdb/duckdb-r/blob/main/revdep/problems.md#mlr3db . We would like to send the update to CRAN as soon as possible.
duckdb 0.9.0 might reorder rows more aggressively than previous versions did. If your queries rely on a particular row output order, please make sure that you specify it.
Can you please confirm? The installation instructions at https://github.com/duckdb/duckdb-r/tree/main#installation-from-r-universe are up to date. Thanks!
CC @hannes.
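As an illustration of the ordering caveat (a sketch, not taken from the linked checks): any query whose result order matters should state the order explicitly instead of relying on insertion order.

```r
library(DBI)
library(duckdb)

con = dbConnect(duckdb())
dbWriteTable(con, "iris", iris)

# Do not rely on rows coming back in insertion order; order explicitly.
res = dbGetQuery(con, "SELECT * FROM iris ORDER BY \"Sepal.Length\"")

dbDisconnect(con, shutdown = TRUE)
```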
Hi, I'm running revdep checks of future where I force the default future plan to be 'multisession'. Among other things, this helps detect when globals are not properly exported, or when the parallel code attempts to use non-exportable objects in the parallel workers. Doing this on mlr3db reveals a problem related to this:
$ export R_FUTURE_PLAN=multisession
$ R CMD check mlr3db_0.1.5.tar.gz
* using log directory ‘/tmp/mlr3db.Rcheck’
* using R version 4.0.2 (2020-06-22)
* using platform: x86_64-pc-linux-gnu (64-bit)
* using session charset: UTF-8
* checking for file ‘mlr3db/DESCRIPTION’ ... OK
* this is package ‘mlr3db’ version ‘0.1.5’
* package encoding: UTF-8
* checking package namespace information ... OK
* checking package dependencies ... OK
* checking if this is a source package ... OK
* checking if there is a namespace ... OK
* checking for executable files ... OK
* checking for hidden files and directories ... OK
* checking for portable file names ... OK
* checking for sufficient/correct file permissions ... OK
* checking whether package ‘mlr3db’ can be installed ... OK
* checking installed package size ... OK
* checking package directory ... OK
* checking DESCRIPTION meta-information ... OK
* checking top-level files ... OK
* checking for left-over files ... OK
* checking index information ... OK
* checking package subdirectories ... OK
* checking R files for non-ASCII characters ... OK
* checking R files for syntax errors ... OK
* checking whether the package can be loaded ... OK
* checking whether the package can be loaded with stated dependencies ... OK
* checking whether the package can be unloaded cleanly ... OK
* checking whether the namespace can be loaded with stated dependencies ... OK
* checking whether the namespace can be unloaded cleanly ... OK
* checking loading without being on the library search path ... OK
* checking dependencies in R code ... OK
* checking S3 generic/method consistency ... OK
* checking replacement functions ... OK
* checking foreign function calls ... OK
* checking R code for possible problems ... OK
* checking Rd files ... OK
* checking Rd metadata ... OK
* checking Rd cross-references ... OK
* checking for missing documentation entries ... OK
* checking for code/documentation mismatches ... OK
* checking Rd \usage sections ... OK
* checking Rd contents ... OK
* checking for unstated dependencies in examples ... OK
* checking examples ... OK
* checking for unstated dependencies in ‘tests’ ... OK
* checking tests ...
Running ‘testthat.R’
ERROR
Running the tests in ‘tests/testthat.R’ failed.
Complete output:
> if (requireNamespace("testthat", quietly = TRUE)) {
+ library(testthat)
+ library(mlr3db)
+ test_check("mlr3db")
+ }
── 1. Error: resample work (@test_train_predict.R#16) ─────────────────────────
Invalid connection. Provide a connector during construction to automatically reconnect
Backtrace:
1. mlr3::resample(task, learner, mlr3::rsmp("cv", folds = 3))
2. future.apply::future_lapply(...)
3. future.apply:::future_xapply(...)
5. future:::value.list(fs)
7. future:::resolve.list(...)
8. future:::signalConditionsASAP(obj, resignal = FALSE, pos = ii)
9. future:::signalConditions(...)
══ testthat results ═══════════════════════════════════════════════════════════
[ OK: 604 | SKIPPED: 0 | WARNINGS: 1 | FAILED: 1 ]
1. Error: resample work (@test_train_predict.R#16)
Error: testthat unit tests failed
In addition: Warning message:
call dbDisconnect() when finished working with a connection
Execution halted
* checking PDF version of manual ... OK
* DONE
Status: 1 ERROR
See
‘/tmp/mlr3db.Rcheck/00check.log’
for details.
library(mlr3db)
#> Loading required package: mlr3
library(mlr3)
b = as_duckdb_backend(iris)
b$missings(1:150, "Species")
#> Species
#> 0
b$missings(1:140, "Species")
#> Species
#> 10
Created on 2024-06-21 with reprex v2.0.2
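For comparison, the expected counts on the in-memory data: iris contains no missing values at all, so both calls should return 0. The duckdb backend instead appears to count the 10 rows absent from the selection as missing.

```r
# iris has no NAs, so restricting to rows 1:140 must still give 0
sum(is.na(iris$Species[1:150]))
#> [1] 0
sum(is.na(iris$Species[1:140]))
#> [1] 0
```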
Wait for duckdb/duckdb#1509.
A primary key is not helping w.r.t. performance; instead, insert the data sorted.
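A sketch of the suggested approach: sort the data in R before writing it, instead of declaring a primary key.

```r
library(DBI)
library(RSQLite)

con = dbConnect(SQLite(), ":memory:")

# sort by the lookup column before inserting, so range scans on that
# column touch contiguous pages
df = iris[order(iris$Sepal.Length), ]
dbWriteTable(con, "iris", df)

dbDisconnect(con)
```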
It would be great to have a more general solution that could, for example, also convert duckdb booleans to factors, in case they are to be used as the target variable. Currently this is not really possible. A solution would be to replace the strings_as_factors construction argument of DataBackendDuckDB with as_factors, which would also allow converting booleans (and maybe even other data types) to factors.
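A hypothetical sketch of the proposed API (nothing here exists yet; both the `as_factors` argument and the column names are made up for illustration):

```r
# proposed: as_factors replaces strings_as_factors and accepts either
# TRUE (convert all character and boolean columns) or a character vector
# of column names to convert to factor on retrieval
b = as_duckdb_backend(data, as_factors = c("carrier", "cancelled"))
```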
The step creating the backend and task takes ~30 seconds. Since we do not actually want to fetch any data during these steps but just check for validity, should it take this long? This seems rather slow and could perhaps be drastically sped up via some caching?
library("nycflights13")
library("mlr3")
library("dplyr")
library("RSQLite")
library("DBI")
path = tempfile("nycflights", fileext = ".sqlite")
con = DBI::dbConnect(RSQLite::SQLite(), path)
DBI::dbWriteTable(con, "flights", as.data.frame(flights))
DBI::dbWriteTable(con, "weather", as.data.frame(weather))
DBI::dbWriteTable(con, "airports", as.data.frame(airports))
DBI::dbDisconnect(con)
Data is a join of three tables:
con = DBI::dbConnect(RSQLite::SQLite(), path)
xflights = tbl(con, "flights") %>%
filter(!is.na(arr_delay)) %>%
mutate(row_id = row_number()) %>%
filter(row_id %in% 1:100)
data = xflights %>%
left_join(
tbl(con, "airports") %>% select(faa, lat, lon, alt),
by = c("origin" = "faa") # for the origin
) %>%
left_join(
tbl(con, "airports") %>% select(faa, lat, lon, alt),
by = c("dest" = "faa"), # for the destination
suffix = c("_origin", "_dest")
)
weather = xflights %>%
left_join(
tbl(con, "weather") %>%
select(origin, year, month, day, hour, temp, wind_speed, visib),
by = c("origin", "year", "month", "day"),
suffix = c("", "_weather")
) %>%
filter(hour_weather * 100 >= dep_time & hour_weather * 100 <= arr_time) %>%
group_by(flight, year, month, day) %>%
summarize(across(c(temp, visib, wind_speed), list(mean = ~mean(.x, na.rm = TRUE), max = ~max(.x, na.rm = TRUE))))
data = data %>%
left_join(weather,
by = c("flight", "year", "month", "day"))
library(mlr3db)
b = as_data_backend(data, primary_key = "row_id")
t = TaskRegr$new("flights", b, "arr_delay")
t
profvis points to `distinct` and `head` being called twice and taking quite some time.
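The profile was presumably gathered along these lines (a sketch; `data` is the joined lazy table from the code above):

```r
library(profvis)
library(mlr3)
library(mlr3db)

# open the interactive flame graph showing where backend/task
# construction spends its time
profvis({
  b = as_data_backend(data, primary_key = "row_id")
  t = TaskRegr$new("flights", b, "arr_delay")
})
```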
For convenience (and interop with https://tensorflow.rstudio.com/guide/tfdatasets/introduction/), it would be cool to have a show_query method for SQLite data backends.
db = as_sqlite_backend(tsk("iris"))
show_query(db$.__enclos_env__$private$.data)
This should be only 2-3 lines of code.
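A sketch of what those 2-3 lines could be (assuming the lazy tbl is stored in the private `.data` field, as the workaround above suggests):

```r
# S3 method delegating to dbplyr's show_query() on the stored lazy tbl
show_query.DataBackendDplyr = function(x, ...) {
  dplyr::show_query(x$.__enclos_env__$private$.data)
}
```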
Using as_data_backend on a tbl_df dplyr backend (with an SQLite database) currently yields a backend where backend$connector is NULL. Since the table we create it from contains its connector, this slot should perhaps be auto-filled from data$src$con?
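Auto-filling could look roughly like this for an SQLite source (a sketch; assumes the database file path is recoverable from the connection via dbGetInfo()):

```r
con = data$src$con                  # the tbl already carries its connection
path = DBI::dbGetInfo(con)$dbname   # SQLite: path to the database file
backend$connector = function() DBI::dbConnect(RSQLite::SQLite(), path)
```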