Under sergeant master (20190106), csvh files being read in over DBI (<code class="notr

Something like… <div class="highlight highlight-source-r notranslate position-rela

Hey David! Can you share a sanitized version of the <code class="not

AUGHH!!! they've deprecated the … in <code class="not

Hrm. <g-emoji class="g-emoji" alias="thinking" fallback-src="https://github.githubasse

The above is likely the best way to do it (apart from running <code class="notranslate

Column data types not being parsed under v0.8.0-dev about sergeant HOT 8 CLOSED

hrbrmstr commented on June 13, 2024

Column data types not being parsed under v0.8.0-dev

from sergeant.

Comments (8)

hrbrmstr commented on June 13, 2024 2

Something like…

ctas_profile <- function(x, new_table_name = "CHANGE____ME") {
  
  stopifnot(inherits(x, "tbl_drill"))
  
  vals <- collect(head(x))
  
  vals <- suppressMessages(type_convert(vals))
  
  data_type <- function(x) {
    switch(
      class(x)[1],
      integer64 = "BIGINT",
      logical = "BOOLEAN",
      integer = "INTEGER",
      numeric = "DOUBLE",
      factor =  "VARCHAR",
      character = "VARCHAR",
      Date = "DATE",
      POSIXct = "TIMESTAMP"
    )
  }
  
  field_types <- vapply(vals, data_type, character(1))
  
  sprintf(
    "  CAST(`%s` AS %s) AS `%s`", 
    names(field_types), 
    field_types, 
    names(field_types)
  ) -> casts
  
  casts <- unlist(strsplit(paste0(casts, collapse=",\n"), "\n"))
  
  case_when(
    grepl("AS TIMESTAMP", casts) ~ sprintf("%s -- ** Change this TO_TIMESTAMP() with a proper format string ** (will eventually auto-convert)", casts),
    grepl("AS DATE", casts) ~ sprintf("%s -- ** Change this TO_DATE() with a proper format string ** (will eventually auto-convert)", casts),
    TRUE ~ casts
  ) -> casts
  
  orig_query <- x$ops$x
  
  if (!grepl("select", orig_query, ignore.case=TRUE)) {
    orig_query <- sprintf("SELECT * FROM %s", orig_query)
  }
  
  sprintf(
    "CREATE TABLE %s AS\nSELECT\n%s\nFROM (%s)\n", 
    new_table_name,
    paste0(casts, collapse="\n"),
    orig_query
  )  
  
}

A cpl light tests.

Bare data source

flt1 <- tbl(db, "dfs.d.`/flights.csvh`")

cat(
  ctas_profile(flt1)
)

CREATE TABLE CHANGE____ME AS
SELECT
  CAST(`year` AS DOUBLE) AS `year`,
  CAST(`month` AS DOUBLE) AS `month`,
  CAST(`day` AS DOUBLE) AS `day`,
  CAST(`dep_time` AS DOUBLE) AS `dep_time`,
  CAST(`sched_dep_time` AS DOUBLE) AS `sched_dep_time`,
  CAST(`dep_delay` AS DOUBLE) AS `dep_delay`,
  CAST(`arr_time` AS DOUBLE) AS `arr_time`,
  CAST(`sched_arr_time` AS DOUBLE) AS `sched_arr_time`,
  CAST(`arr_delay` AS DOUBLE) AS `arr_delay`,
  CAST(`carrier` AS CHARACTER) AS `carrier`,
  CAST(`flight` AS DOUBLE) AS `flight`,
  CAST(`tailnum` AS VARCHAR) AS `tailnum`,
  CAST(`origin` AS VARCHAR) AS `origin`,
  CAST(`dest` AS VARCHAR) AS `dest`,
  CAST(`air_time` AS DOUBLE) AS `air_time`,
  CAST(`distance` AS DOUBLE) AS `distance`,
  CAST(`hour` AS DOUBLE) AS `hour`,
  CAST(`minute` AS DOUBLE) AS `minute`,
  CAST(`time_hour` AS TIMESTAMP) AS `time_hour` -- ** Change this TO_TIMESTAMP() with a proper format string (will eventually auto-convert)
FROM (SELECT * FROM dfs.d.`/flights.csvh`)

Carved out fields with `SELECT`

flt2 <- tbl(db, sql("SELECT `year`, tailnum, time_hour FROM dfs.d.`/flights.csvh`"))

cat(
  ctas_profile(flt2, "dfs.d.`flights.parquet`")
)

CREATE TABLE dfs.d.`flights.parquet` AS
SELECT
  CAST(`year` AS DOUBLE) AS `year`,
  CAST(`tailnum` AS VARCHAR) AS `tailnum`,
  CAST(`time_hour` AS TIMESTAMP) AS `time_hour` -- ** Change this TO_TIMESTAMP() with a proper format string (will eventually auto-convert)
FROM (SELECT `year`, tailnum, time_hour FROM dfs.d.`/flights.csvh`)

I'll be augmenting it to handle figuring out the timestamp and date formats and replacing the CAST with the appropriate converter and get it into 0.8.0 (just running short on time right now).

from sergeant.

davidski commented on June 13, 2024 1

Sounds like a plan. I don't suppose you know of any dbWriteTable()-like magic that would make writing out a skeleton fully qualified CTAS query easier? I have some relatively wide tables that the current type conversion does a good job against. I'd ideally like a flow for:

pulling in the first couple of records of a file w/best guess on types
convert the dataframe to a fully qualified create table query
inspect the query manually
concat the create table with a select for a drill CTAS file conversion.

After a few columns of manual query creation my typo probability increases. 😬

If the rebase item you mention in #29 (comment) isn't something you need tracking, please do feel free to close this issue.

Appreciate the package!

from sergeant.

hrbrmstr commented on June 13, 2024

Hey David!

Can you share a sanitized version of the … in a tbl(src, …) query?

By default, Drill will treat all columns in a CSV as VARCHAR; e.g: this:

write_csv(mtcars, "~/Data/mtcars.csv")

httr::POST(
  "http://localhost:8047/query.json",
  encode = "json",
  body = list(
    queryType = "SQL", 
    query = "SELECT * FROM dfs.d.`/mtcars.csv` LIMIT 1"
  ) 
) -> res

cat(
  httr::content(res, as = "text")
)

causes this to be passed on down the wire:

{
  "queryId" : "23ccb7be-d896-51d5-c1d3-7e93a32a769f",
  "columns" : [ "mpg", "cyl", "disp", "hp", "drat", "wt", "qsec", "vs", "am", "gear", "carb" ],
  "rows" : [ {
    "carb" : "4",
    "mpg" : "21",
    "qsec" : "16.46",
    "cyl" : "6",
    "hp" : "110",
    "am" : "1",
    "disp" : "160",
    "wt" : "2.62",
    "vs" : "0",
    "drat" : "3.9",
    "gear" : "4"
  } ],
  "metadata" : [ "VARCHAR", "VARCHAR", "VARCHAR", "VARCHAR", "VARCHAR", "VARCHAR", "VARCHAR", "VARCHAR", "VARCHAR", "VARCHAR", "VARCHAR" ]
}

Would an type_convert = TRUE/FALSE option to tbl() be helpful? i.e. force a bypass of the reading the metadata that comes along for the ride? I can add that pretty quick to the 0.8.0 branch (I should likely put back the main branch to pre-0.8.0 dev too since I made the "will dev on branches" decision as part of a new year's resolution :-)

from sergeant.

hrbrmstr commented on June 13, 2024

AUGHH!!! they've deprecated the … in tbl_sql() (meaning if I make the param change to tbl it will just break at some point). Would making the param to src_drill() be OK? That would impact all query results for that db but it may be a workaround if having R do the heavy lifting (which is really just a call to readr::type_convert() sans any type hints post-query result retrieval) is desired.

from sergeant.

davidski commented on June 13, 2024

Hrm. 🤔 My inexperience with drill may be showing here. I've been using CTAS queries to convert CSVs (CSVH, technically) to Parquet. Your comments about drill always returning VARCHAR are right in line with https://drill.apache.org/docs/text-files-csv-tsv-psv/, which cautions against using CTAS queries in this fashion.

There does seem to be a difference in how v.0.5.0 and master operate. The following code under the CRAN published v0.5.0 gives a typed-version of this data set, while the same code under master gives me character() types for all of the columns.

library(tidyverse)
library(sergeant)
data(flights, package = "nycflights13")
flights %>% write_csv(here::here("data/flights.csvh"))
ds <- sergeant::src_drill()
dat <- tbl(ds, "dfs.root.`/data/flights.csvh`")
glimpse(dat)

I may have been relying inadvertently on sergeant magic... Would you recommend explicit conversions using drill_query() to go from CSV -> Parquet, in line with the drill guide? Not as slick as I was hoping, but perhaps more stable. 😁

from sergeant.

hrbrmstr commented on June 13, 2024

Aye. I inadvertently pushed the new code to master and just lazily didn't rebase.

You could also do:

dat <- tbl(ds, sql("
SELECT 
  CAST(`year` AS INT) `year`,
  CAST(`month` as INT) `month`,
  CAST(`day` AS INT) `day`,
  CAST(dep_time AS INT) dep_time,
  CAST(sched_dep_time AS INT) sched_dep_time,
  CAST(dep_delay AS DOUBLE) dep_delay,
  CAST(arr_time AS INT) arr_time,
  CAST(sched_arr_time AS INT) sched_arr_time,
  CAST(arr_delay AS DOUBLE) arr_delay,
  carrier,
  CAST(flight AS VARCHAR) flight,
  tailnum,
  origin
FROM dfs.data.`flights.csvh`
"))

glimpse(dat)
## Observations: ??
## Variables: 13
## $ year           <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, ...
## $ month          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ day            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ dep_time       <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558, 558, 558, 558, 559, 559, 559, 600, ...
## $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600, 600, 600, 600, 600, 559, 600, 600, ...
## $ dep_delay      <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1, 0, -1, 0, 0, 1, -8, -3, -4, -4, 0...
## $ arr_time       <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849, 853, 924, 923, 941, 702, 854, 851,...
## $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851, 856, 917, 937, 910, 706, 902, 858,...
## $ arr_delay      <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -14, 31, -4, -8, -7, 12, -6, -8, 16, ...
## $ carrier        <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "AA", "B6", "B6", "UA", "UA", "AA", ...
## $ flight         <dbl> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, 49, 71, 194, 1124, 707, 1806, 1187, ...
## $ tailnum        <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N39463", "N516JB", "N829AS", "N593JB", ...
## $ origin         <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA", "JFK", "LGA", "JFK", "JFK", "JFK",...

from sergeant.

hrbrmstr commented on June 13, 2024

The above is likely the best way to do it (apart from running readr::type_convert() ex-post-query-facto) since it's deterministic and actually pulling what is expected (provided folks are running latest Drill). Old Drill shld still behave the old way.

from sergeant.

hrbrmstr commented on June 13, 2024

I put this WIP ctas_profile() into the 0.8.0 branch as well.

from sergeant.

Column data types not being parsed under v0.8.0-dev about sergeant HOT 8 CLOSED

Comments (8)

Bare data source

Carved out fields with `SELECT`

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent

Comments (8)

Bare data source

Carved out fields with SELECT

Related Issues (20)

Recommend Projects

Recommend Topics

Recommend Org

Carved out fields with `SELECT`