
mlr3db

r-cmd-check CRAN Status StackOverflow Mattermost

Package website: release | dev

Extends the mlr3 package with a DataBackend to transparently work with databases. Two additional backends are currently implemented:

  • DataBackendDplyr: Relies internally on the abstraction of dplyr and dbplyr. This allows working on a broad range of DBMS, such as SQLite, MySQL, MariaDB, or PostgreSQL.
  • DataBackendDuckDB: Connector to duckdb. This includes support for Parquet files (see example below).

To construct the backends, you must first establish a connection to the DBMS yourself with the DBI package. For the serverless databases SQLite and DuckDB, the converters as_sqlite_backend() and as_duckdb_backend() are provided.
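For DBMS without a converter, you open the connection yourself and wrap the remote table. The following is a minimal sketch of that workflow, assuming the RSQLite driver and the constructor signature `DataBackendDplyr$new(data, primary_key)`:

```r
library("mlr3db")
library("DBI")

# Open a DBI connection yourself (serverless SQLite here; any DBMS
# supported by dbplyr works the same way)
con = DBI::dbConnect(RSQLite::SQLite(), ":memory:")

# The backend needs a primary key column, so add one before writing
data = cbind(row_id = seq_len(nrow(iris)), iris)
DBI::dbWriteTable(con, "iris", data)

# Wrap the remote table in a dbplyr tbl and construct the backend
tab = dplyr::tbl(con, "iris")
backend = DataBackendDplyr$new(tab, primary_key = "row_id")
```

Remember to call `DBI::dbDisconnect(con)` once you are done working with the backend.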

Installation

You can install the released version of mlr3db from CRAN with:

install.packages("mlr3db")

And the development version from GitHub with:

# install.packages("devtools")
devtools::install_github("mlr-org/mlr3db")

Example

DataBackendDplyr

library("mlr3db")
#> Loading required package: mlr3

# Create a classification task:
task = tsk("spam")

# Convert the task backend from an in-memory backend (DataBackendDataTable)
# to an out-of-memory SQLite backend via DataBackendDplyr.
# A temporary directory is used here to store the database files.
task$backend = as_sqlite_backend(task$backend, path = tempfile())

# Resample a classification tree using a 3-fold CV.
# The requested data will be queried and fetched from the database in the background.
resample(task, lrn("classif.rpart"), rsmp("cv", folds = 3))
#> <ResampleResult> of 3 iterations
#> * Task: spam
#> * Learner: classif.rpart
#> * Warnings: 0 in 0 iterations
#> * Errors: 0 in 0 iterations

DataBackendDuckDB

library("mlr3db")

# Get an example parquet file from the package install directory:
# spam dataset (tsk("spam")) stored as parquet file
file = system.file(file.path("extdata", "spam.parquet"), package = "mlr3db")

# Create a backend on the file
backend = as_duckdb_backend(file)

# Construct classification task on the constructed backend
task = as_task_classif(backend, target = "type")

# Resample a classification tree using a 3-fold CV.
# The requested data will be queried and fetched from the database in the background.
resample(task, lrn("classif.rpart"), rsmp("cv", folds = 3))
#> <ResampleResult> of 3 iterations
#> * Task: backend
#> * Learner: classif.rpart
#> * Warnings: 0 in 0 iterations
#> * Errors: 0 in 0 iterations

mlr3db's Issues

`as_data_backend.tbl_df` should perhaps set its connector

Using as_data_backend() on a dplyr tbl backed by an SQLite database currently yields a backend where backend$connector is NULL. Since the tbl we create the backend from carries its own connection, this slot could perhaps be auto-filled from:
data$src$con
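Until such auto-filling exists, one possible workaround is to supply the connector manually after construction. A sketch, assuming `connector` is a settable active binding that expects a zero-argument function returning a fresh DBIConnection (the path is a placeholder):

```r
library("mlr3db")
library("DBI")

# Construct the backend as before; connector is NULL at this point
backend = as_data_backend(data, primary_key = "row_id")

# Hypothetical workaround: attach a connector function so the backend
# can reconnect, e.g. after being sent to a parallel worker
backend$connector = function() {
  DBI::dbConnect(RSQLite::SQLite(), "path/to/db.sqlite")
}
```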

mlr3db and duckdb 0.9.0

We're in the process of releasing duckdb 0.9.0. Running your checks with the updated duckdb version revealed problems: https://github.com/duckdb/duckdb-r/blob/main/revdep/problems.md#mlr3db . We would like to send the update to CRAN as soon as possible.

duckdb 0.9.0 might reorder rows more aggressively than previous versions did. If your queries rely on a particular row output order, please make sure that you specify it.
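If downstream code depends on a deterministic row order, one way to make it explicit is to sort by the primary key in the dbplyr pipeline. A sketch; `con` (an open DBI connection) and the `row_id` column are assumptions:

```r
library("dplyr")

# Make the row order explicit instead of relying on the DBMS:
# sort by the primary key column before fetching
tab = tbl(con, "spam") %>%
  arrange(row_id)
```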

Can you please confirm? Installation instructions on https://github.com/duckdb/duckdb-r/tree/main#installation-from-r-universe are up to date. Thanks!

CC @hannes.

Creating backends is somewhat slow

Creating the backend and the task takes about 30 seconds.
Since these steps should not fetch any data but only check validity, should they take this long?
This seems rather slow and could perhaps be sped up drastically with some caching.

library("nycflights13")
library("mlr3")
library("dplyr")

library("RSQLite")
library("DBI")
path = tempfile("nycflights", fileext = ".sqlite")
con = DBI::dbConnect(RSQLite::SQLite(), path)
DBI::dbWriteTable(con, "flights", as.data.frame(flights))
DBI::dbWriteTable(con, "weather", as.data.frame(weather))
DBI::dbWriteTable(con, "airports", as.data.frame(airports))
DBI::dbDisconnect(con)

Data is a join of three tables: 
con = DBI::dbConnect(RSQLite::SQLite(), path)
xflights = tbl(con, "flights") %>%
  filter(!is.na(arr_delay)) %>%
  mutate(row_id = row_number()) %>%
  filter(row_id %in% 1:100)

data = xflights %>%
  left_join(
    tbl(con, "airports") %>% select(faa, lat, lon, alt),
    by = c("origin" = "faa") # for the origin
  ) %>%
  left_join(
    tbl(con, "airports") %>% select(faa, lat, lon, alt),
    by = c("dest" = "faa"), # for the destination
    suffix = c("_origin", "_dest")
  )

weather = xflights %>%
  left_join(
    tbl(con, "weather") %>%
      select(origin, year, month, day, hour, temp, wind_speed, visib),
    by = c("origin", "year", "month", "day"),
    suffix = c("", "_weather")
  ) %>%
  filter(hour_weather * 100 >= dep_time & hour_weather * 100 <= arr_time) %>%
  group_by(flight, year, month, day) %>%
  summarize(across(c(temp, visib, wind_speed), list(mean, max), na.rm = TRUE))


data = data %>%
  left_join(weather,
    by = c("flight", "year", "month", "day"))

library(mlr3db)
b = as_data_backend(data, primary_key = "row_id")
t = TaskRegr$new("flights", b, "arr_delay")
t

profvis points to `distinct` and `head` being called twice and taking quite some time.

R CMD check errors with R_FUTURE_PLAN=multisession

Hi, I'm running revdep checks of future where I force the default future plan to be 'multisession'. Among other things, this helps detect when globals are not properly exported or when the parallel code attempts to use non-exportable objects in the parallel workers. Doing this on mlr3db reveals a problem related to this:

$ export R_FUTURE_PLAN=multisession
$ R CMD check mlr3db_0.1.5.tar.gz 

* using log directory '/tmp/mlr3db.Rcheck'
* using R version 4.0.2 (2020-06-22)
* using platform: x86_64-pc-linux-gnu (64-bit)
* using session charset: UTF-8
* checking for file 'mlr3db/DESCRIPTION' ... OK
* this is package 'mlr3db' version '0.1.5'
* package encoding: UTF-8
* checking package namespace information ... OK
* checking package dependencies ... OK
* checking if this is a source package ... OK
* checking if there is a namespace ... OK
* checking for executable files ... OK
* checking for hidden files and directories ... OK
* checking for portable file names ... OK
* checking for sufficient/correct file permissions ... OK
* checking whether package 'mlr3db' can be installed ... OK
* checking installed package size ... OK
* checking package directory ... OK
* checking DESCRIPTION meta-information ... OK
* checking top-level files ... OK
* checking for left-over files ... OK
* checking index information ... OK
* checking package subdirectories ... OK
* checking R files for non-ASCII characters ... OK
* checking R files for syntax errors ... OK
* checking whether the package can be loaded ... OK
* checking whether the package can be loaded with stated dependencies ... OK
* checking whether the package can be unloaded cleanly ... OK
* checking whether the namespace can be loaded with stated dependencies ... OK
* checking whether the namespace can be unloaded cleanly ... OK
* checking loading without being on the library search path ... OK
* checking dependencies in R code ... OK
* checking S3 generic/method consistency ... OK
* checking replacement functions ... OK
* checking foreign function calls ... OK
* checking R code for possible problems ... OK
* checking Rd files ... OK
* checking Rd metadata ... OK
* checking Rd cross-references ... OK
* checking for missing documentation entries ... OK
* checking for code/documentation mismatches ... OK
* checking Rd \usage sections ... OK
* checking Rd contents ... OK
* checking for unstated dependencies in examples ... OK
* checking examples ... OK
* checking for unstated dependencies in 'tests' ... OK
* checking tests ...
  Running 'testthat.R' ERROR
Running the tests in 'tests/testthat.R' failed.
Complete output:
  > if (requireNamespace("testthat", quietly = TRUE)) {
  +   library(testthat)
  +   library(mlr3db)
  +   test_check("mlr3db")
  + }
  ── 1. Error: resample work (@test_train_predict.R#16)  ─────────────────────────
  Invalid connection. Provide a connector during construction to automatically reconnect
  Backtrace:
   1. mlr3::resample(task, learner, mlr3::rsmp("cv", folds = 3))
   2. future.apply::future_lapply(...)
   3. future.apply:::future_xapply(...)
   5. future:::value.list(fs)
   7. future:::resolve.list(...)
   8. future:::signalConditionsASAP(obj, resignal = FALSE, pos = ii)
   9. future:::signalConditions(...)
  
  ══ testthat results  ═══════════════════════════════════════════════════════════
  [ OK: 604 | SKIPPED: 0 | WARNINGS: 1 | FAILED: 1 ]
  1. Error: resample work (@test_train_predict.R#16) 
  
  Error: testthat unit tests failed
  In addition: Warning message:
  call dbDisconnect() when finished working with a connection 
  Execution halted
* checking PDF version of manual ... OK
* DONE

Status: 1 ERROR
See '/tmp/mlr3db.Rcheck/00check.log' for details.

General type conversion for duckdb

It would be great to have a more general solution to, for example, also convert duckdb booleans to factors in case they are to be used as the target variable. Currently this is not really possible.

A solution would be to replace the strings_as_factors construction argument of DataBackendDuckDB with as_factors, which would also allow converting booleans (and maybe even other data types) to factors.
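One possible shape of the proposed argument, purely as an illustration (`as_factors` does not exist in the current API, and the constructor signature is an assumption):

```r
library("mlr3db")

# Hypothetical: convert the listed columns (including booleans) to
# factors when data is fetched from DuckDB, instead of the current
# all-or-nothing strings_as_factors flag.
backend = DataBackendDuckDB$new(
  con, table = "spam", primary_key = "row_id",
  as_factors = c("type", "is_spam_flag")  # columns assumed for the example
)
```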
