mlr-org / mlr3db
Data Backends to let mlr3 work transparently with (remote) databases
Home Page: https://mlr3db.mlr-org.com
License: GNU Lesser General Public License v3.0
To work with really large databases.
This is required to combine mlr3db with non-local parallelization. Maybe https://github.com/rstudio/pool solves this (rstudio/pool#72).
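For reference, a minimal sketch of what providing a connector looks like (assuming `DataBackendDplyr$new()` accepts a `connector` argument, as the reconnect error message in the revdep check below suggests):

```r
library(mlr3db)
library(dplyr)
library(DBI)
library(RSQLite)

path = tempfile(fileext = ".sqlite")
con = dbConnect(SQLite(), path)
dbWriteTable(con, "iris", cbind(iris, row_id = seq_len(nrow(iris))))

# A connector is a zero-argument function that (re-)establishes the
# connection; the backend can call it after serialization to a parallel
# worker invalidated the original connection object.
connector = function() DBI::dbConnect(RSQLite::SQLite(), path)

b = DataBackendDplyr$new(tbl(con, "iris"), primary_key = "row_id",
  connector = connector)
```

This only helps for databases reachable from every worker; a pooled solution as in rstudio/pool would additionally cap the number of open connections.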
We're in the process of releasing duckdb 0.9.0. Running your checks with the updated duckdb version revealed problems: https://github.com/duckdb/duckdb-r/blob/main/revdep/problems.md#mlr3db . We would like to send the update to CRAN as soon as possible.
duckdb 0.9.0 might reorder rows more aggressively than previous versions did. If your queries rely on a particular row output order, please make sure that you specify it.
Can you please confirm? The installation instructions at https://github.com/duckdb/duckdb-r/tree/main#installation-from-r-universe are up to date. Thanks!
CC @hannes.
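As an illustration of the ordering caveat (a sketch, not taken from the linked checks): any query whose result order matters should state the order explicitly instead of relying on insertion order.

```r
library(DBI)
library(duckdb)

con = dbConnect(duckdb())
dbWriteTable(con, "iris", iris)

# Do not rely on rows coming back in insertion order; order explicitly.
res = dbGetQuery(con, "SELECT * FROM iris ORDER BY \"Sepal.Length\"")

dbDisconnect(con, shutdown = TRUE)
```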
Hi, I'm running revdep checks of future where I force the default future plan to be 'multisession'. Among other things, this helps detect when globals are not properly exported, or when the parallel code attempts to use non-exportable objects in the parallel workers. Doing this on mlr3db reveals a problem related to this:
$ export R_FUTURE_PLAN=multisession
$ R CMD check mlr3db_0.1.5.tar.gz
* using log directory ‘/tmp/mlr3db.Rcheck’
* using R version 4.0.2 (2020-06-22)
* using platform: x86_64-pc-linux-gnu (64-bit)
* using session charset: UTF-8
* checking for file ‘mlr3db/DESCRIPTION’ ... OK
* this is package ‘mlr3db’ version ‘0.1.5’
* package encoding: UTF-8
* checking package namespace information ... OK
* checking package dependencies ... OK
* checking if this is a source package ... OK
* checking if there is a namespace ... OK
* checking for executable files ... OK
* checking for hidden files and directories ... OK
* checking for portable file names ... OK
* checking for sufficient/correct file permissions ... OK
* checking whether package ‘mlr3db’ can be installed ... OK
* checking installed package size ... OK
* checking package directory ... OK
* checking DESCRIPTION meta-information ... OK
* checking top-level files ... OK
* checking for left-over files ... OK
* checking index information ... OK
* checking package subdirectories ... OK
* checking R files for non-ASCII characters ... OK
* checking R files for syntax errors ... OK
* checking whether the package can be loaded ... OK
* checking whether the package can be loaded with stated dependencies ... OK
* checking whether the package can be unloaded cleanly ... OK
* checking whether the namespace can be loaded with stated dependencies ... OK
* checking whether the namespace can be unloaded cleanly ... OK
* checking loading without being on the library search path ... OK
* checking dependencies in R code ... OK
* checking S3 generic/method consistency ... OK
* checking replacement functions ... OK
* checking foreign function calls ... OK
* checking R code for possible problems ... OK
* checking Rd files ... OK
* checking Rd metadata ... OK
* checking Rd cross-references ... OK
* checking for missing documentation entries ... OK
* checking for code/documentation mismatches ... OK
* checking Rd \usage sections ... OK
* checking Rd contents ... OK
* checking for unstated dependencies in examples ... OK
* checking examples ... OK
* checking for unstated dependencies in ‘tests’ ... OK
* checking tests ...
Running ‘testthat.R’
ERROR
Running the tests in ‘tests/testthat.R’ failed.
Complete output:
> if (requireNamespace("testthat", quietly = TRUE)) {
+ library(testthat)
+ library(mlr3db)
+ test_check("mlr3db")
+ }
── 1. Error: resample work (@test_train_predict.R#16) ─────────────────────────
Invalid connection. Provide a connector during construction to automatically reconnect
Backtrace:
1. mlr3::resample(task, learner, mlr3::rsmp("cv", folds = 3))
2. future.apply::future_lapply(...)
3. future.apply:::future_xapply(...)
5. future:::value.list(fs)
7. future:::resolve.list(...)
8. future:::signalConditionsASAP(obj, resignal = FALSE, pos = ii)
9. future:::signalConditions(...)
══ testthat results ═══════════════════════════════════════════════════════════
[ OK: 604 | SKIPPED: 0 | WARNINGS: 1 | FAILED: 1 ]
1. Error: resample work (@test_train_predict.R#16)
Error: testthat unit tests failed
In addition: Warning message:
call dbDisconnect() when finished working with a connection
Execution halted
* checking PDF version of manual ... OK
* DONE
Status: 1 ERROR
See
‘/tmp/mlr3db.Rcheck/00check.log’
for details.
library(mlr3db)
#> Loading required package: mlr3
library(mlr3)
b = as_duckdb_backend(iris)
b$missings(1:150, "Species")
#> Species
#> 0
b$missings(1:140, "Species")
#> Species
#> 10
Created on 2024-06-21 with reprex v2.0.2
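For comparison, the expected counts on the in-memory data: iris contains no missing values at all, so both calls should return 0. The duckdb backend instead appears to count the 10 rows absent from the selection as missing.

```r
# iris has no NAs, so restricting to rows 1:140 must still give 0
sum(is.na(iris$Species[1:150]))
#> [1] 0
sum(is.na(iris$Species[1:140]))
#> [1] 0
```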
Wait for duckdb/duckdb#1509.
A primary key is not helping w.r.t. performance; instead, insert the data sorted.
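A sketch of the suggested approach: sort the data in R before writing it, instead of declaring a primary key.

```r
library(DBI)
library(RSQLite)

con = dbConnect(SQLite(), ":memory:")

# sort by the lookup column before inserting, so range scans on that
# column touch contiguous pages
df = iris[order(iris$Sepal.Length), ]
dbWriteTable(con, "iris", df)

dbDisconnect(con)
```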
It would be great to have a more general solution that could, for example, also convert duckdb booleans to factors, in case they are to be used as the target variable. Currently this is not really possible. A solution would be to replace the strings_as_factors construction argument of DataBackendDuckDB with as_factors, which would also allow converting booleans (and maybe even other data types) to factors.
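A hypothetical sketch of the proposed API (nothing here exists yet; both the `as_factors` argument and the column names are made up for illustration):

```r
# proposed: as_factors replaces strings_as_factors and accepts either
# TRUE (convert all character and boolean columns) or a character vector
# of column names to convert to factor on retrieval
b = as_duckdb_backend(data, as_factors = c("carrier", "cancelled"))
```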
The step creating the backend and task takes ~30 seconds. Since we do not actually want to fetch any data during these steps but just check for validity, should it take this long? This seems rather slow and could perhaps be drastically sped up via some caching?
library("nycflights13")
library("mlr3")
library("dplyr")
library("RSQLite")
library("DBI")
path = tempfile("nycflights", fileext = ".sqlite")
con = DBI::dbConnect(RSQLite::SQLite(), path)
DBI::dbWriteTable(con, "flights", as.data.frame(flights))
DBI::dbWriteTable(con, "weather", as.data.frame(weather))
DBI::dbWriteTable(con, "airports", as.data.frame(airports))
DBI::dbDisconnect(con)
Data is a join of three tables:
con = DBI::dbConnect(RSQLite::SQLite(), path)
xflights = tbl(con, "flights") %>%
filter(!is.na(arr_delay)) %>%
mutate(row_id = row_number()) %>%
filter(row_id %in% 1:100)
data = xflights %>%
left_join(
tbl(con, "airports") %>% select(faa, lat, lon, alt),
by = c("origin" = "faa") # for the origin
) %>%
left_join(
tbl(con, "airports") %>% select(faa, lat, lon, alt),
by = c("dest" = "faa"), # for the destination
suffix = c("_origin", "_dest")
)
weather = xflights %>%
left_join(
tbl(con, "weather") %>%
select(origin, year, month, day, hour, temp, wind_speed, visib),
by = c("origin", "year", "month", "day"),
suffix = c("", "_weather")
) %>%
filter(hour_weather * 100 >= dep_time & hour_weather * 100 <= arr_time) %>%
group_by(flight, year, month, day) %>%
summarize(across(c(temp, visib, wind_speed), list(mean = ~mean(.x, na.rm = TRUE), max = ~max(.x, na.rm = TRUE))))
data = data %>%
left_join(weather,
by = c("flight", "year", "month", "day"))
library(mlr3db)
b = as_data_backend(data, primary_key = "row_id")
t = TaskRegr$new("flights", b, "arr_delay")
t
profvis points to `distinct` and `head` being called twice and taking quite some time.
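The profile was presumably gathered along these lines (a sketch; `data` is the joined lazy table from the code above):

```r
library(profvis)
library(mlr3)
library(mlr3db)

# open the interactive flame graph showing where backend/task
# construction spends its time
profvis({
  b = as_data_backend(data, primary_key = "row_id")
  t = TaskRegr$new("flights", b, "arr_delay")
})
```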
For convenience (and interop with https://tensorflow.rstudio.com/guide/tfdatasets/introduction/), it would be cool to have a show_query method for SQLite data backends.
db = as_sqlite_backend(tsk("iris"))
show_query(db$.__enclos_env__$private$.data)
This should be only 2-3 lines of code.
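A sketch of what those 2-3 lines could be (assuming the lazy tbl is stored in the private `.data` field, as the workaround above suggests):

```r
# S3 method delegating to dbplyr's show_query() on the stored lazy tbl
show_query.DataBackendDplyr = function(x, ...) {
  dplyr::show_query(x$.__enclos_env__$private$.data)
}
```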
Using as_data_backend on a tbl_df dplyr backend (with an SQLite database) currently yields a backend where backend$connector is NULL. Since the table we create it from contains its connector, this slot should perhaps be auto-filled from data$src$con?
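Auto-filling could look roughly like this for an SQLite source (a sketch; assumes the database file path is recoverable from the connection via dbGetInfo()):

```r
con = data$src$con                  # the tbl already carries its connection
path = DBI::dbGetInfo(con)$dbname   # SQLite: path to the database file
backend$connector = function() DBI::dbConnect(RSQLite::SQLite(), path)
```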