Comments (8)
Something like…
ctas_profile <- function(x, new_table_name = "CHANGE____ME") {

  stopifnot(inherits(x, "tbl_drill"))

  # pull a few records and let readr make a best guess at the column types
  vals <- dplyr::collect(head(x))
  vals <- suppressMessages(readr::type_convert(vals))

  # map R classes to Drill SQL types
  data_type <- function(x) {
    switch(
      class(x)[1],
      integer64 = "BIGINT",
      logical   = "BOOLEAN",
      integer   = "INTEGER",
      numeric   = "DOUBLE",
      factor    = "VARCHAR",
      character = "VARCHAR",
      Date      = "DATE",
      POSIXct   = "TIMESTAMP"
    )
  }

  field_types <- vapply(vals, data_type, character(1))

  # one CAST per column, comma-separated except after the last
  casts <- sprintf(
    " CAST(`%s` AS %s) AS `%s`",
    names(field_types), field_types, names(field_types)
  )
  casts <- unlist(strsplit(paste0(casts, collapse = ",\n"), "\n"))

  # flag DATE/TIMESTAMP casts that need a real format-string conversion
  casts <- dplyr::case_when(
    grepl("AS TIMESTAMP", casts) ~ sprintf("%s -- ** Change this TO_TIMESTAMP() with a proper format string ** (will eventually auto-convert)", casts),
    grepl("AS DATE", casts) ~ sprintf("%s -- ** Change this TO_DATE() with a proper format string ** (will eventually auto-convert)", casts),
    TRUE ~ casts
  )

  # wrap a bare table reference in a SELECT so the subquery is valid
  orig_query <- x$ops$x
  if (!grepl("select", orig_query, ignore.case = TRUE)) {
    orig_query <- sprintf("SELECT * FROM %s", orig_query)
  }

  sprintf(
    "CREATE TABLE %s AS\nSELECT\n%s\nFROM (%s)\n",
    new_table_name,
    paste0(casts, collapse = "\n"),
    orig_query
  )

}
A couple of light tests.
Bare data source
flt1 <- tbl(db, "dfs.d.`/flights.csvh`")
cat(
ctas_profile(flt1)
)
CREATE TABLE CHANGE____ME AS
SELECT
CAST(`year` AS DOUBLE) AS `year`,
CAST(`month` AS DOUBLE) AS `month`,
CAST(`day` AS DOUBLE) AS `day`,
CAST(`dep_time` AS DOUBLE) AS `dep_time`,
CAST(`sched_dep_time` AS DOUBLE) AS `sched_dep_time`,
CAST(`dep_delay` AS DOUBLE) AS `dep_delay`,
CAST(`arr_time` AS DOUBLE) AS `arr_time`,
CAST(`sched_arr_time` AS DOUBLE) AS `sched_arr_time`,
CAST(`arr_delay` AS DOUBLE) AS `arr_delay`,
CAST(`carrier` AS VARCHAR) AS `carrier`,
CAST(`flight` AS DOUBLE) AS `flight`,
CAST(`tailnum` AS VARCHAR) AS `tailnum`,
CAST(`origin` AS VARCHAR) AS `origin`,
CAST(`dest` AS VARCHAR) AS `dest`,
CAST(`air_time` AS DOUBLE) AS `air_time`,
CAST(`distance` AS DOUBLE) AS `distance`,
CAST(`hour` AS DOUBLE) AS `hour`,
CAST(`minute` AS DOUBLE) AS `minute`,
CAST(`time_hour` AS TIMESTAMP) AS `time_hour` -- ** Change this TO_TIMESTAMP() with a proper format string ** (will eventually auto-convert)
FROM (SELECT * FROM dfs.d.`/flights.csvh`)
Carved out fields with SELECT
flt2 <- tbl(db, sql("SELECT `year`, tailnum, time_hour FROM dfs.d.`/flights.csvh`"))
cat(
ctas_profile(flt2, "dfs.d.`flights.parquet`")
)
CREATE TABLE dfs.d.`flights.parquet` AS
SELECT
CAST(`year` AS DOUBLE) AS `year`,
CAST(`tailnum` AS VARCHAR) AS `tailnum`,
CAST(`time_hour` AS TIMESTAMP) AS `time_hour` -- ** Change this TO_TIMESTAMP() with a proper format string ** (will eventually auto-convert)
FROM (SELECT `year`, tailnum, time_hour FROM dfs.d.`/flights.csvh`)
I'll be augmenting it to handle figuring out the timestamp and date formats and replacing the CAST with the appropriate converter, and will get it into 0.8.0 (just running short on time right now).
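One possible shape for that auto-conversion (a rough sketch; guess_drill_format() is a hypothetical helper, only two common patterns are covered, and it assumes Drill's TO_TIMESTAMP()/TO_DATE() take Joda-style format strings):

```r
# Hypothetical helper: map a sample value to a Joda-style format string
# suitable for Drill's TO_TIMESTAMP()/TO_DATE(). Anything unrecognized
# falls back to NA so the "change this" comment can be left in place.
guess_drill_format <- function(sample) {
  if (grepl("^[0-9]{4}-[0-9]{2}-[0-9]{2}[T ][0-9]{2}:[0-9]{2}:[0-9]{2}", sample)) {
    "yyyy-MM-dd HH:mm:ss"
  } else if (grepl("^[0-9]{4}-[0-9]{2}-[0-9]{2}$", sample)) {
    "yyyy-MM-dd"
  } else {
    NA_character_
  }
}

guess_drill_format("2013-01-01 05:00:00")
```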
from sergeant.
Sounds like a plan. I don't suppose you know of any dbWriteTable()-like magic that would make writing out a skeleton fully qualified CTAS query easier? I have some relatively wide tables that the current type conversion does a good job against. I'd ideally like a flow for:
- pulling in the first couple of records of a file with a best guess on types
- converting the data frame to a fully qualified CREATE TABLE query
- inspecting the query manually
- concatenating the CREATE TABLE with a SELECT for a Drill CTAS file conversion
After a few columns of manual query creation, my typo probability increases.
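For what it's worth, that flow is roughly what the ctas_profile() above gets at; a sketch (assumes a live Drill dplyr source db and a running Drill on localhost; the file path and target table are hypothetical):

```r
# 1) point at the file; types are best-guessed from the first rows
flt <- tbl(db, "dfs.d.`/wide_table.csvh`")

# 2) turn the guesses into a fully qualified CTAS skeleton
ctas <- ctas_profile(flt, "dfs.tmp.`wide_table`")

# 3) inspect (and hand-edit the flagged TO_TIMESTAMP()/TO_DATE() lines)
cat(ctas)

# 4) run the CTAS for the actual file conversion
# drill_query(drill_connection("localhost"), ctas)
```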
If the rebase item you mention in #29 (comment) isn't something you need tracking, please do feel free to close this issue.
Appreciate the package!
Hey David!
Can you share a sanitized version of the … in a tbl(src, …) query?
By default, Drill will treat all columns in a CSV as VARCHAR; e.g., this:
write_csv(mtcars, "~/Data/mtcars.csv")

httr::POST(
  "http://localhost:8047/query.json",
  encode = "json",
  body = list(
    queryType = "SQL",
    query = "SELECT * FROM dfs.d.`/mtcars.csv` LIMIT 1"
  )
) -> res

cat(
  httr::content(res, as = "text")
)
causes this to be passed on down the wire:
{
"queryId" : "23ccb7be-d896-51d5-c1d3-7e93a32a769f",
"columns" : [ "mpg", "cyl", "disp", "hp", "drat", "wt", "qsec", "vs", "am", "gear", "carb" ],
"rows" : [ {
"carb" : "4",
"mpg" : "21",
"qsec" : "16.46",
"cyl" : "6",
"hp" : "110",
"am" : "1",
"disp" : "160",
"wt" : "2.62",
"vs" : "0",
"drat" : "3.9",
"gear" : "4"
} ],
"metadata" : [ "VARCHAR", "VARCHAR", "VARCHAR", "VARCHAR", "VARCHAR", "VARCHAR", "VARCHAR", "VARCHAR", "VARCHAR", "VARCHAR", "VARCHAR" ]
}
Would a type_convert = TRUE/FALSE option to tbl() be helpful? i.e., force a bypass of reading the metadata that comes along for the ride? I can add that pretty quickly to the 0.8.0 branch (I should likely put the main branch back to pre-0.8.0 dev too, since I made the "will dev on branches" decision as part of a new year's resolution :-)
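In the meantime, the same effect is available after the fact (a sketch; assumes a live Drill source ds and that the flights CSV exists at that path):

```r
library(dplyr)
library(readr)

# collect a (small) result and let readr re-guess the column types,
# which is essentially all the proposed type_convert option would do
dat <- tbl(ds, "dfs.root.`/data/flights.csvh`") %>%
  head(1000) %>%
  collect() %>%
  type_convert()
```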
AUGHH!!! They've deprecated the … in tbl_sql() (meaning if I make the param change to tbl() it will just break at some point). Would making the param to src_drill() be OK? That would impact all query results for that db, but it may be a workaround if having R do the heavy lifting (which is really just a call to readr::type_convert(), sans any type hints, after query-result retrieval) is desired.
Hrm. VARCHAR is right in line with https://drill.apache.org/docs/text-files-csv-tsv-psv/, which cautions against using CTAS queries in this fashion.
There does seem to be a difference in how v0.5.0 and master operate. The following code under the CRAN-published v0.5.0 gives a typed version of this data set, while the same code under master gives me character() types for all of the columns.
library(tidyverse)
library(sergeant)
data(flights, package = "nycflights13")
flights %>% write_csv(here::here("data/flights.csvh"))
ds <- sergeant::src_drill()
dat <- tbl(ds, "dfs.root.`/data/flights.csvh`")
glimpse(dat)
I may have been relying inadvertently on sergeant magic... Would you recommend explicit conversions using drill_query() to go from CSV -> Parquet, in line with the Drill guide? Not as slick as I was hoping, but perhaps more stable.
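A minimal version of that explicit-conversion route, in the spirit of the Drill text-file docs (a sketch; the target table, the writable dfs.tmp workspace, and the column subset are assumptions):

```r
library(sergeant)

# One-off, typed CSV -> Parquet conversion via CTAS over the REST API
dc <- drill_connection("localhost")

drill_query(dc, "
  CREATE TABLE dfs.tmp.`flights_parquet` AS
  SELECT
    CAST(`year` AS INT) AS `year`,
    CAST(dep_delay AS DOUBLE) AS dep_delay,
    carrier
  FROM dfs.root.`/data/flights.csvh`
")
```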
Aye. I inadvertently pushed the new code to master and just lazily didn't rebase.
You could also do:
dat <- tbl(ds, sql("
SELECT
CAST(`year` AS INT) `year`,
CAST(`month` AS INT) `month`,
CAST(`day` AS INT) `day`,
CAST(dep_time AS INT) dep_time,
CAST(sched_dep_time AS INT) sched_dep_time,
CAST(dep_delay AS DOUBLE) dep_delay,
CAST(arr_time AS INT) arr_time,
CAST(sched_arr_time AS INT) sched_arr_time,
CAST(arr_delay AS DOUBLE) arr_delay,
carrier,
CAST(flight AS VARCHAR) flight,
tailnum,
origin
FROM dfs.data.`flights.csvh`
"))
glimpse(dat)
## Observations: ??
## Variables: 13
## $ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, ...
## $ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ dep_time <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558, 558, 558, 558, 559, 559, 559, 600, ...
## $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600, 600, 600, 600, 600, 559, 600, 600, ...
## $ dep_delay <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1, 0, -1, 0, 0, 1, -8, -3, -4, -4, 0...
## $ arr_time <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849, 853, 924, 923, 941, 702, 854, 851,...
## $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851, 856, 917, 937, 910, 706, 902, 858,...
## $ arr_delay <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -14, 31, -4, -8, -7, 12, -6, -8, 16, ...
## $ carrier <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "AA", "B6", "B6", "UA", "UA", "AA", ...
## $ flight <dbl> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, 49, 71, 194, 1124, 707, 1806, 1187, ...
## $ tailnum <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N39463", "N516JB", "N829AS", "N593JB", ...
## $ origin <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA", "JFK", "LGA", "JFK", "JFK", "JFK",...
The above is likely the best way to do it (apart from running readr::type_convert() ex-post-query-facto) since it's deterministic and actually pulls what is expected (provided folks are running the latest Drill). Old Drill should still behave the old way.
I put this WIP ctas_profile() into the 0.8.0 branch as well.