Giter Club home page Giter Club logo

Comments (8)

hrbrmstr avatar hrbrmstr commented on June 13, 2024 2

Something like…

ctas_profile <- function(x, new_table_name = "CHANGE____ME") {
  
  stopifnot(inherits(x, "tbl_drill"))
  
  vals <- collect(head(x))
  
  vals <- suppressMessages(type_convert(vals))
  
  data_type <- function(x) {
    switch(
      class(x)[1],
      integer64 = "BIGINT",
      logical = "BOOLEAN",
      integer = "INTEGER",
      numeric = "DOUBLE",
      factor =  "VARCHAR",
      character = "VARCHAR",
      Date = "DATE",
      POSIXct = "TIMESTAMP"
    )
  }
  
  field_types <- vapply(vals, data_type, character(1))
  
  sprintf(
    "  CAST(`%s` AS %s) AS `%s`", 
    names(field_types), 
    field_types, 
    names(field_types)
  ) -> casts
  
  casts <- unlist(strsplit(paste0(casts, collapse=",\n"), "\n"))
  
  case_when(
    grepl("AS TIMESTAMP", casts) ~ sprintf("%s -- ** Change this TO_TIMESTAMP() with a proper format string ** (will eventually auto-convert)", casts),
    grepl("AS DATE", casts) ~ sprintf("%s -- ** Change this TO_DATE() with a proper format string ** (will eventually auto-convert)", casts),
    TRUE ~ casts
  ) -> casts
  
  orig_query <- x$ops$x
  
  if (!grepl("select", orig_query, ignore.case=TRUE)) {
    orig_query <- sprintf("SELECT * FROM %s", orig_query)
  }
  
  sprintf(
    "CREATE TABLE %s AS\nSELECT\n%s\nFROM (%s)\n", 
    new_table_name,
    paste0(casts, collapse="\n"),
    orig_query
  )  
  
}

A cpl light tests.

Bare data source

flt1 <- tbl(db, "dfs.d.`/flights.csvh`")

cat(
  ctas_profile(flt1)
)
CREATE TABLE CHANGE____ME AS
SELECT
  CAST(`year` AS DOUBLE) AS `year`,
  CAST(`month` AS DOUBLE) AS `month`,
  CAST(`day` AS DOUBLE) AS `day`,
  CAST(`dep_time` AS DOUBLE) AS `dep_time`,
  CAST(`sched_dep_time` AS DOUBLE) AS `sched_dep_time`,
  CAST(`dep_delay` AS DOUBLE) AS `dep_delay`,
  CAST(`arr_time` AS DOUBLE) AS `arr_time`,
  CAST(`sched_arr_time` AS DOUBLE) AS `sched_arr_time`,
  CAST(`arr_delay` AS DOUBLE) AS `arr_delay`,
  CAST(`carrier` AS CHARACTER) AS `carrier`,
  CAST(`flight` AS DOUBLE) AS `flight`,
  CAST(`tailnum` AS VARCHAR) AS `tailnum`,
  CAST(`origin` AS VARCHAR) AS `origin`,
  CAST(`dest` AS VARCHAR) AS `dest`,
  CAST(`air_time` AS DOUBLE) AS `air_time`,
  CAST(`distance` AS DOUBLE) AS `distance`,
  CAST(`hour` AS DOUBLE) AS `hour`,
  CAST(`minute` AS DOUBLE) AS `minute`,
  CAST(`time_hour` AS TIMESTAMP) AS `time_hour` -- ** Change this TO_TIMESTAMP() with a proper format string (will eventually auto-convert)
FROM (SELECT * FROM dfs.d.`/flights.csvh`)

Carved out fields with SELECT

flt2 <- tbl(db, sql("SELECT `year`, tailnum, time_hour FROM dfs.d.`/flights.csvh`"))

cat(
  ctas_profile(flt2, "dfs.d.`flights.parquet`")
)
CREATE TABLE dfs.d.`flights.parquet` AS
SELECT
  CAST(`year` AS DOUBLE) AS `year`,
  CAST(`tailnum` AS VARCHAR) AS `tailnum`,
  CAST(`time_hour` AS TIMESTAMP) AS `time_hour` -- ** Change this TO_TIMESTAMP() with a proper format string (will eventually auto-convert)
FROM (SELECT `year`, tailnum, time_hour FROM dfs.d.`/flights.csvh`)

I'll be augmenting it to handle figuring out the timestamp and date formats and replacing the CAST with the appropriate converter and get it into 0.8.0 (just running short on time right now).

from sergeant.

davidski avatar davidski commented on June 13, 2024 1

Sounds like a plan. I don't suppose you know of any dbWriteTable()-like magic that would make writing out a skeleton fully qualified CTAS query easier? I have some relatively wide tables that the current type conversion does a good job against. I'd ideally like a flow for:

  • pulling in the first couple of records of a file w/best guess on types
  • convert the dataframe to a fully qualified create table query
  • inspect the query manually
  • concat the create table with a select for a drill CTAS file conversion.

After a few columns of manual query creation my typo probability increases. 😬

If the rebase item you mention in #29 (comment) isn't something you need tracking, please do feel free to close this issue.

Appreciate the package!

from sergeant.

hrbrmstr avatar hrbrmstr commented on June 13, 2024

Hey David!

Can you share a sanitized version of the … in a tbl(src, …) query?

By default, Drill will treat all columns in a CSV as VARCHAR; e.g: this:

write_csv(mtcars, "~/Data/mtcars.csv")

httr::POST(
  "http://localhost:8047/query.json",
  encode = "json",
  body = list(
    queryType = "SQL", 
    query = "SELECT * FROM dfs.d.`/mtcars.csv` LIMIT 1"
  ) 
) -> res

cat(
  httr::content(res, as = "text")
)

causes this to be passed on down the wire:

{
  "queryId" : "23ccb7be-d896-51d5-c1d3-7e93a32a769f",
  "columns" : [ "mpg", "cyl", "disp", "hp", "drat", "wt", "qsec", "vs", "am", "gear", "carb" ],
  "rows" : [ {
    "carb" : "4",
    "mpg" : "21",
    "qsec" : "16.46",
    "cyl" : "6",
    "hp" : "110",
    "am" : "1",
    "disp" : "160",
    "wt" : "2.62",
    "vs" : "0",
    "drat" : "3.9",
    "gear" : "4"
  } ],
  "metadata" : [ "VARCHAR", "VARCHAR", "VARCHAR", "VARCHAR", "VARCHAR", "VARCHAR", "VARCHAR", "VARCHAR", "VARCHAR", "VARCHAR", "VARCHAR" ]
}

Would an type_convert = TRUE/FALSE option to tbl() be helpful? i.e. force a bypass of the reading the metadata that comes along for the ride? I can add that pretty quick to the 0.8.0 branch (I should likely put back the main branch to pre-0.8.0 dev too since I made the "will dev on branches" decision as part of a new year's resolution :-)

from sergeant.

hrbrmstr avatar hrbrmstr commented on June 13, 2024

AUGHH!!! they've deprecated the … in tbl_sql() (meaning if I make the param change to tbl it will just break at some point). Would making the param to src_drill() be OK? That would impact all query results for that db but it may be a workaround if having R do the heavy lifting (which is really just a call to readr::type_convert() sans any type hints post-query result retrieval) is desired.

from sergeant.

davidski avatar davidski commented on June 13, 2024

Hrm. πŸ€” My inexperience with drill may be showing here. I've been using CTAS queries to convert CSVs (CSVH, technically) to Parquet. Your comments about drill always returning VARCHAR are right in line with https://drill.apache.org/docs/text-files-csv-tsv-psv/, which cautions against using CTAS queries in this fashion.

There does seem to be a difference in how v.0.5.0 and master operate. The following code under the CRAN published v0.5.0 gives a typed-version of this data set, while the same code under master gives me character() types for all of the columns.

library(tidyverse)
library(sergeant)
data(flights, package = "nycflights13")
flights %>% write_csv(here::here("data/flights.csvh"))
ds <- sergeant::src_drill()
dat <- tbl(ds, "dfs.root.`/data/flights.csvh`")
glimpse(dat)

I may have been relying inadvertently on sergeant magic... Would you recommend explicit conversions using drill_query() to go from CSV -> Parquet, in line with the drill guide? Not as slick as I was hoping, but perhaps more stable. 😁

from sergeant.

hrbrmstr avatar hrbrmstr commented on June 13, 2024

Aye. I inadvertently pushed the new code to master and just lazily didn't rebase.

You could also do:

dat <- tbl(ds, sql("
SELECT 
  CAST(`year` AS INT) `year`,
  CAST(`month` as INT) `month`,
  CAST(`day` AS INT) `day`,
  CAST(dep_time AS INT) dep_time,
  CAST(sched_dep_time AS INT) sched_dep_time,
  CAST(dep_delay AS DOUBLE) dep_delay,
  CAST(arr_time AS INT) arr_time,
  CAST(sched_arr_time AS INT) sched_arr_time,
  CAST(arr_delay AS DOUBLE) arr_delay,
  carrier,
  CAST(flight AS VARCHAR) flight,
  tailnum,
  origin
FROM dfs.data.`flights.csvh`
"))

glimpse(dat)
## Observations: ??
## Variables: 13
## $ year           <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, ...
## $ month          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ day            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ dep_time       <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558, 558, 558, 558, 559, 559, 559, 600, ...
## $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600, 600, 600, 600, 600, 559, 600, 600, ...
## $ dep_delay      <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1, 0, -1, 0, 0, 1, -8, -3, -4, -4, 0...
## $ arr_time       <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849, 853, 924, 923, 941, 702, 854, 851,...
## $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851, 856, 917, 937, 910, 706, 902, 858,...
## $ arr_delay      <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -14, 31, -4, -8, -7, 12, -6, -8, 16, ...
## $ carrier        <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "AA", "B6", "B6", "UA", "UA", "AA", ...
## $ flight         <dbl> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, 49, 71, 194, 1124, 707, 1806, 1187, ...
## $ tailnum        <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N39463", "N516JB", "N829AS", "N593JB", ...
## $ origin         <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA", "JFK", "LGA", "JFK", "JFK", "JFK",...

from sergeant.

hrbrmstr avatar hrbrmstr commented on June 13, 2024

The above is likely the best way to do it (apart from running readr::type_convert() ex-post-query-facto) since it's deterministic and actually pulling what is expected (provided folks are running latest Drill). Old Drill shld still behave the old way.

from sergeant.

hrbrmstr avatar hrbrmstr commented on June 13, 2024

I put this WIP ctas_profile() into the 0.8.0 branch as well.

from sergeant.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    πŸ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. πŸ“ŠπŸ“ˆπŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❀️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.