
sergeant's Issues

drill_version(ds) failed

Environment

R version 3.5.0 (2018-04-23)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_Singapore.1252  LC_CTYPE=English_Singapore.1252    LC_MONETARY=English_Singapore.1252
[4] LC_NUMERIC=C                       LC_TIME=English_Singapore.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] forcats_0.3.0   stringr_1.3.1   purrr_0.2.5     readr_1.1.1     tidyr_0.8.1     tibble_1.4.2   
 [7] ggplot2_3.0.0   tidyverse_1.2.1 sergeant_0.5.2  dbplyr_1.2.2    dplyr_0.7.6     DBI_1.0.0      
[13] rJava_0.9-10   

How to replicate

ds <- src_drill(drill_ip) 
ds
>src:  DrillConnection
>tbls: cp.default, dfs.default, dfs.root, dfs.tmp, INFORMATION_SCHEMA, postgres.information_schema,
postgres.pg_catalog, postgres.postgres, postgres.public, postgres, sys

db <- tbl(ds, "cp.`employee.json`")
db
store_id gender department_id birth_date supervisor_id last_name position_title hire_date          
<int> <chr>          <int> <date>             <int> <chr>     <chr>          <dttm>             
1        0 F                  1 1961-08-26             0 Nowmer    President      1994-12-01 00:00:00
2        0 M                  1 1915-07-03             1 Whelply   VP Country Ma~ 1994-12-01 00:00:00
3        0 M                  1 1969-06-20             1 Spence    VP Country Ma~ 1998-01-01 00:00:00
4        0 F                  1 1951-05-10             1 Gutierrez VP Country Ma~ 1998-01-01 00:00:00
5        0 F                  2 1942-10-08             1 Damstra   VP Informatio~ 1994-12-01 00:00:00
6        0 F                  3 1949-03-27             1 Kanagaki  VP Human Reso~ 1994-12-01 00:00:00
7        9 F                 11 1922-08-10             5 Brunner   Store Manager  1998-01-01 00:00:00
8       21 F                 11 1979-06-23             5 Blumberg  Store Manager  1998-01-01 00:00:00
9        0 M                  5 1949-08-26             1 Stanz     VP Finance     1994-12-01 00:00:00
10        1 M                 11 1967-06-20             5 Murraiin  Store Manager  1998-01-01 00:00:00
# ... with more rows, and 8 more variables: management_role <chr>, salary <dbl>, marital_status <chr>,
#   full_name <chr>, employee_id <int>, education_level <chr>, first_name <chr>, position_id <int>


drill_version(ds)
Error in is.url(url) : length(url) == 1 is not TRUE

Investigation

sergeant:::make_server
function (drill_con) 
{
  sprintf("%s://%s:%s", ifelse(drill_con$ssl[1], "https", "http"), 
          drill_con$host, drill_con$port)
}
<bytecode: 0x0000000022b58b98>
<environment: namespace:sergeant>
str(ds)  
List of 1
$ con:Formal class 'DrillConnection' [package "sergeant"] with 5 slots
.. ..@ host    : chr "172.27.141.128"
.. ..@ port    : int 8047
.. ..@ ssl     : logi FALSE
.. ..@ username: chr(0) 
.. ..@ password: chr(0) 
- attr(*, "class")= chr [1:3] "src_drill" "src_sql" "src"

Accessing the ssl field directly on the src object fails:

ds$ssl[1]
NULL

# Correct way
ds$con@ssl
FALSE

This code needs to be fixed in the make_server() function. A corrected version:

function (drill_con) 
{
  sprintf("%s://%s:%s", ifelse(drill_con$con@ssl, "https", "http"), 
          drill_con$con@host, drill_con$con@port)
}
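A slightly more defensive sketch (an assumption on my part, based on the slot layout shown by str(ds) above) would accept either the src_drill wrapper or a bare DrillConnection:

```r
# Defensive sketch: unwrap a src_drill object if given one, otherwise assume
# a bare DrillConnection; slot names follow the str(ds) output above
make_server <- function(drill_con) {
  con <- if (inherits(drill_con, "src_drill")) drill_con$con else drill_con
  sprintf("%s://%s:%s",
          if (isTRUE(con@ssl)) "https" else "http",
          con@host, con@port)
}
```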

Problem with installing sergeant

Hi,

I am encountering problems installing sergeant.

Upon installing

devtools::install_github("hrbrmstr/sergeant")

I am getting the error below:
Error: package or namespace load failed for 'sergeant':
.onLoad failed in loadNamespace() for 'rJava', details:
call: inDL(x, as.logical(local), as.logical(now), ...)
error: unable to load shared object 'C:/Users/herma/Documents/R/win-library/3.5/rJava/libs/i386/rJava.dll':
LoadLibrary failure: %1 is not a valid Win32 application.

Error: loading failed
Execution halted
*** arch - x64
ERROR: loading failed for 'i386'

  • removing 'C:/Users/herma/Documents/R/win-library/3.5/sergeant'
    In R CMD INSTALL

version output:
platform x86_64-w64-mingw32
arch x86_64
os mingw32
system x86_64, mingw32
status
major 3
minor 5.0
year 2018
month 04
day 23
svn rev 74626
language R
version.string R version 3.5.0 (2018-04-23)
nickname Joy in Playing

Switch to rapidjsonr

As of Drill 1.15.0 the API tells us the data types of returned columns, so we can use this to get proper 64-bit integers as well as proper types all 'round.

dplyr operations fail on columns whose name is a reserved word

To translate a query like:

select `Year`,`Month` from dfs.`..../AIRLINE/` LIMIT 5;

I would use something like:

airline %>% select(Year) 

which fails with:

warning message:
In .local(res, ...) : PARSE ERROR: Encountered "Year" at line 1, column 9.
Was expecting one of:
...
...

due to the missing backtick quoting.

Same for

airline %>% select(`Year`) 
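One workaround sketch until the identifier quoting is fixed (assumptions: `ds` is the src_drill connection, and the path is illustrative since the real one is elided above) is to alias the reserved-word columns inside a raw SQL subquery so dplyr never has to quote them:

```r
# Alias `Year`/`Month` to non-reserved names inside a raw SQL subquery
# (dbplyr::sql() marks the string as SQL rather than a table name)
airline <- tbl(ds, dbplyr::sql(
  "(SELECT `Year` AS data_year, `Month` AS data_month
    FROM dfs.`/path/to/AIRLINE/` LIMIT 5)"
))
airline %>% select(data_year)
```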

drill_jdbc() returns error

Hi Bob,

I tried
drill_jdbc("localhost")

and it returns error:

Error in stop("Package `", package, "` required for `", fun, "`.\n", "Please install and try again.",  : 
  argument "fun" is missing, with no default

Looking at the code in utils.r

try_require <- function(package, fun) {
  if (requireNamespace(package, quietly = TRUE)) {
    library(package, character.only = TRUE)
    return(invisible())
  }

  stop("Package `", package, "` required for `", fun , "`.\n", # nocov start
    "Please install and try again.", call. = FALSE) # nocov end
}

The fun argument is required and has no default.

When try_require() is called here, the second argument fun is missing.
I suggest giving fun a default, e.g. fun = "", in the function definition.
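The one-line fix suggested above would look like this (same body as the current utils.r function, only the signature changes):

```r
# Give `fun` a default so the error message still renders when callers
# omit the second argument (as the drill_jdbc() call site does)
try_require <- function(package, fun = "") {
  if (requireNamespace(package, quietly = TRUE)) {
    library(package, character.only = TRUE)
    return(invisible())
  }
  stop("Package `", package, "` required for `", fun, "`.\n",
       "Please install and try again.", call. = FALSE)
}
```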

Regards,
Herman

drill_query errors when the query ends with a semicolon

Hi! Maybe I just lack experience with SQL APIs, but I find it odd that a query passed to drill_query without a terminating semicolon works, while one with a semicolon errors out:

library(sergeant)

dc <- drill_connection() 

dc %>% drill_query('SELECT full_name, salary FROM cp.`employee.json` LIMIT 3')
#> # A tibble: 3 x 2
#>         full_name salary
#> *           <chr>  <dbl>
#> 1    Sheri Nowmer  80000
#> 2 Derrick Whelply  40000
#> 3  Michael Spence  40000

dc %>% drill_query('SELECT full_name, salary FROM cp.`employee.json` LIMIT 3;')
#> Query ==> SELECT full_name, salary FROM cp.`employee.json` LIMIT 3;
#> PARSE ERROR: Encountered ";" at line 1, column 57.
#> Was expecting one of:
#>     <EOF> 
#>     "OFFSET" ...
#>     "FETCH" ...
#>     
#> 
#> SQL Query SELECT full_name, salary FROM cp.`employee.json` LIMIT 3;
#>                                                         ^
#> 
#> 
#> [Error Id: a172e1a9-833c-4d89-9b83-de4e250cd1c5 on 192.168.1.3:31010]

whereas the Drill shell only works with a semicolon (without it expects a continuation of the query):

0: jdbc:drill:zk=local> SELECT full_name, salary FROM cp.`employee.json` LIMIT 3;
+--------------------+----------+
|     full_name      |  salary  |
+--------------------+----------+
| Sheri Nowmer       | 80000.0  |
| Derrick Whelply    | 40000.0  |
| Michael Spence     | 40000.0  |
+--------------------+----------+
3 rows selected (0.102 seconds)

and when run through the web interface both are fine.

Looking at query.r, the whole query is passed via httr::POST, so the error seems to be a product of how the API interprets queries. I'm not sure if this is intended behavior (the Drill API docs don't use a semicolon, but the SQL docs do), but from an R perspective, it seems like a good idea to either

  • make note of the restriction/behavior in ?drill_query or
  • add query <- sub(";$", "", query) or the like to its code so it works regardless.

Thoughts?
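A minimal sketch of the second option (the function name is mine, purely illustrative; the regex is a slight widening of the `sub(";$", "", query)` suggested above to also drop trailing whitespace):

```r
# Strip one trailing semicolon (plus any trailing whitespace) before the
# query is POSTed to the REST endpoint
strip_semicolon <- function(query) {
  sub(";\\s*$", "", query)
}

strip_semicolon("SELECT full_name FROM cp.`employee.json` LIMIT 3;")
#> [1] "SELECT full_name FROM cp.`employee.json` LIMIT 3"
```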

BIGINT column shows up as col_integer()

I am not sure if this is a bug or not yet. But I have a parquet table with columns of datatype BIGINT.
When querying using drill_connection(), they show up as col_integer(). I am not sure of the range of col_integer, but if it is indeed 4 bytes then there will be data loss when querying BIGINT columns whose values exceed the integer min/max.
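For reference, R's native integer type is indeed 32-bit, so the concern is real:

```r
# col_integer() maps to R's 32-bit integer, capped at 2^31 - 1;
# BIGINT values beyond that cannot round-trip losslessly
.Machine$integer.max
#> [1] 2147483647
as.integer(2^31)  # overflows to NA with a coercion warning
#> [1] NA
```

The bit64 package's integer64 class is one way to hold full 64-bit values in R, though whether sergeant can emit it depends on the planned rapidjsonr switch discussed elsewhere in these issues.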

port 8047

Hi, I am testing sergeant against a MapR cluster, and to my knowledge we should be using either 31010 or the ZooKeeper port; 8047 is the drillbit management page in MapR.

It looks like sergeant cannot use 31010, only 8047. Is this correct?

dplyr interface does not work outside of `cp` schema

Hi,

I tried to use the dplyr interface and it seems to fail with "Error in readr::type_convert(out$rows): is.data.frame(df) is not TRUE"

The reprex is below:

#install.packages("sergeant")
library(sergeant)
#> Loading required package: DBI
#> Loading required package: dplyr
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
#> Loading required package: dbplyr
#> 
#> Attaching package: 'dbplyr'
#> The following objects are masked from 'package:dplyr':
#> 
#>     ident, sql
library(tidyverse)

# drill server IP Address
drill_ip <- "xxx.xxx.xxx.xxx"
drill_jdbc_port <- "31010" 

# Connecting to drill server
# src_drill: dplyr connection
# drill_connection: REST API connection
# drill_jdbc: JDBC connection (if remote, needs zk etc)

# Connect using dplyr
ds_dp <- src_drill(drill_ip) 
tbl(ds_dp, "cp.`employee.json`")
#> # Source:   table<cp.`employee.json`> [?? x 16]
#> # Database: DrillConnection
#>    store_id gender department_id birth_date supervisor_id last_name
#>       <dbl> <chr>          <dbl> <date>             <dbl> <chr>    
#>  1        0 F                  1 1961-08-26             0 Nowmer   
#>  2        0 M                  1 1915-07-03             1 Whelply  
#>  3        0 M                  1 1969-06-20             1 Spence   
#>  4        0 F                  1 1951-05-10             1 Gutierrez
#>  5        0 F                  2 1942-10-08             1 Damstra  
#>  6        0 F                  3 1949-03-27             1 Kanagaki 
#>  7        9 F                 11 1922-08-10             5 Brunner  
#>  8       21 F                 11 1979-06-23             5 Blumberg 
#>  9        0 M                  5 1949-08-26             1 Stanz    
#> 10        1 M                 11 1967-06-20             5 Murraiin 
#> # ... with more rows, and 10 more variables: position_title <chr>,
#> #   hire_date <dttm>, management_role <chr>, salary <dbl>,
#> #   marital_status <chr>, full_name <chr>, employee_id <dbl>,
#> #   education_level <chr>, first_name <chr>, position_id <dbl>
tbl(ds_dp, "SELECT * FROM dfs.`c:/apps/apache-drill-1.14.0/sample-data/nation.parquet` LIMIT 5")
#> Error in readr::type_convert(out$rows): is.data.frame(df) is not TRUE

ds_rest <- drill_connection(drill_ip) 
drill_query(ds_rest,"SELECT * FROM dfs.`c:/apps/apache-drill-1.14.0/sample-data/nation.parquet` LIMIT 5")
#> Parsed with column specification:
#> cols(
#>   N_COMMENT = col_character(),
#>   N_NAME = col_character(),
#>   N_NATIONKEY = col_double(),
#>   N_REGIONKEY = col_double()
#> )
#> # A tibble: 5 x 4
#>   N_COMMENT            N_NAME    N_NATIONKEY N_REGIONKEY
#> * <chr>                <chr>           <dbl>       <dbl>
#> 1 haggle. carefully f  ALGERIA             0           0
#> 2 al foxes promise sly ARGENTINA           1           1
#> 3 y alongside of the p BRAZIL              2           1
#> 4 eas hang ironic, sil CANADA              3           1
#> 5 y above the carefull EGYPT               4           4

Created on 2019-01-09 by the reprex package (v0.2.1)

Support for authentication in drill_connection

Hi! I read in your README that sergeant tools were not designed with authentication/encryption in mind due to the lack of need for support. We have been using sergeant at my work to run drill queries against our hdfs for a while, and are currently in the process of migrating to a new drill endpoint that requires authentication. We would love to continue using the sergeant library to run these queries, so if you have the ability to implement authentication we would greatly appreciate it!

SQL Statement error

Hi,
In going through your code in
https://rud.is/rpubs/yelp.html#

I am unable to run this code:

tbl(db, "(SELECT b.name AS name, b.hours[5] AS sat_hrs FROM dfs.d.`/yelp/yelp_academic_dataset_business.json` b)") %>%
  filter(!is.na(sat_hrs)) %>%
  filter(grepl("Saturday", sat_hrs)) %>%
  select(name, sat_hrs) %>%
  collect() %>%
  mutate(sat_hrs = gsub("Saturday ", "", sat_hrs)) %>%
  tidyr::separate(sat_hrs, c("open", "close"), "-") %>%
  filter(open != "0:0" & close != "0:0")

tbl(db, "(SELECT b.name AS name, b.hours[5] AS sat_hrs FROM dfs.d.`/yelp/yelp_academic_dataset_business.json` b)")
returns an error
Error: is.data.frame(df) is not TRUE

tbl() does not work with a SQL statement.
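One possible workaround sketch (version-dependent, and an assumption on my part rather than a confirmed fix) is to mark the statement as SQL with dbplyr::sql() while keeping the outer parentheses so it parses as a subquery:

```r
# Tell tbl() the string is SQL, not a table name; the surrounding
# parentheses make it a derived table in the generated query
q <- dbplyr::sql(
  "(SELECT b.name AS name, b.hours[5] AS sat_hrs
    FROM dfs.d.`/yelp/yelp_academic_dataset_business.json` b)"
)
tbl(db, q)
```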

Herman

drill_up fails (maybe due to Docker Toolkit?)

I am eager to try out what is described in the blog post about the 0.8.0 version, but after installing as instructed

devtools::install_git("https://git.sr.ht/~hrbrmstr/sergeant", ref="0.8.0")

and then running the code below, I got a failure:

> library(sergeant)
> library(tidyverse)
── Attaching packages ───────────────────────────────────────────────────────── tidyverse 1.2.1 ──
✔ ggplot2 3.1.0     ✔ purrr   0.2.5
✔ tibble  2.0.1     ✔ dplyr   0.7.8
✔ tidyr   0.8.2     ✔ stringr 1.3.1
✔ readr   1.3.1     ✔ forcats 0.3.0
── Conflicts ──────────────────────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
> # start Docker in terminal, then open 
> dr <- drill_up(data_dir = here::here("data-raw"))
Drill container started. Waiting for the service to become active (this may take up to 30s).
Error in drill_up(data_dir = here::here("data-raw")) : 
  Could not connect to Drill container.
> 

I have been able to have drill running manually and mapping the local file system to my directory of interest.
On the Docker side I have an old MBP, so I have Docker Toolkit.
(That may be the reason for the failure.)
If I run drill image from the command line I get the following:

> docker run drill/apache-drill
Jan 27, 2019 2:25:49 PM org.jline.utils.Log logr
WARNING: Unable to create a system terminal, creating a dumb terminal (enable debug logging for more information)
Apache Drill 1.15.0
"The only truly happy people are children, the creative minority and Drill users."
0: jdbc:drill:zk=local> Closing: org.apache.drill.jdbc.impl.DrillConnectionImpl

I tried the hello-world Docker example using the stevedore package and it works... so Docker (at least the easy bit) seems to work.

I then decided to dig into the drill_up implementation and executed it line by line to debug.

> docker <- stevedore::docker_client()
> (drill <- docker$container$run(image = "drill/apache-drill:1.15.0", name = "drill", cmd = "/usr/local/bin/bash", volumes = sprintf("%s:/data", here::here("data-raw"))))
E> Jan 27, 2019 3:01:03 PM org.jline.utils.Log logr
E> WARNING: Unable to create a system terminal, creating a dumb terminal (enable debug logging for more information)
E> Apache Drill 1.15.0
E> "Say hello to my little Drill."
E> Closing: org.apache.drill.jdbc.impl.DrillConnectionImpl
O> 0: jdbc:drill:zk=local> <docker_run_output>
  $container:
    <docker_container>
      id: 2db3d96f27f55d35017dd0361531b37e25c80fbbe9e01cf731654c15adb1a773
      name: drill

  $logs:
    Jan 27, 2019 3:01:03 PM org.jline.utils.Log logr
    WARNING: Unable to create a system terminal, creating a dumb terminal (enable debug logging for more information)
    Apache Drill 1.15.0
    "Say hello to my little Drill."
    Closing: org.apache.drill.jdbc.impl.DrillConnectionImpl
    0: jdbc:drill:zk=local> 
> 

Then all 30 drill_active() attempts fail... (here I just show one manual try :-) )

> (drill_con <- drill_connection("localhost"))
<Drill REST API Direct Connection to localhost:8047>
> drill_active(drill_con)
[1] FALSE

I am out of ideas for making it work... any direction is welcome.
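One hypothesis worth checking (my assumption, not confirmed from the logs above): with Docker Toolbox, containers are published on the docker-machine VM's IP rather than on localhost, so drill_active("localhost") can fail even while Drill is up. A manual sketch, using a placeholder IP you would replace with the output of `docker-machine ip`:

```r
# Poll readiness against the Toolbox VM's address instead of localhost
vm_ip <- "192.168.99.100"   # placeholder; substitute your `docker-machine ip`
drill_con <- drill_connection(vm_ip)

for (i in 1:30) {           # longer, manual version of drill_up()'s wait loop
  if (drill_active(drill_con)) break
  Sys.sleep(2)
}
drill_active(drill_con)
```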

Provide an automated downloader and way to start drill-embedded?

You write

If there is sufficient desire for an automated downloader and a way to start the drill-embedded server from within R, please file an issue.

Consider the issue filed!

I think this would be great. I've been impressed by how RStudio's packages (blogdown, tinytex, keras, etc) ship with these mini-installers, and it's amazing how much time is saved by sidestepping external-to-R installation steps when I've introduced those tools to students.

Improve drill_up() interactive experience

I'm thinking through how the drill_up() interactive experience could be made a bit easier. That function returns a stevedore object, which the user is responsible for terminating (as it says in the fine documentation 😉). If the user doesn't capture the stevedore object coming back from the drill_up() call, they have to get() and then stop() the container manually, which could be further complicated if they also failed to capture the container ID from the drill_up() call.

Any thoughts on the best way of documenting a virtuous workflow? We can't protect the user completely here, but I want to make things a bit clearer for users without prior stevedore experience. Perhaps a more verbose message when drill_up() is in interactive mode? A vignette? (To be clear, I'm considering writing the PR myself... not just adding things to your queue!)

This may or may not be motivated by this author's first stumblings through the stevedore interface. 😜
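For anyone hitting this now, a recovery sketch (assuming the container was started with name = "drill", as the drill_up() source appears to do, and that the stevedore container object exposes stop/remove as its docs describe):

```r
# Re-acquire an uncaptured Drill container by name and shut it down
docker <- stevedore::docker_client()
drill  <- docker$container$get("drill")
drill$stop()
drill$remove()
```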

column order not respected

The column order for my table reference does not seem to be respected.

The column order in the CSV files is:

snapshot_id,altitude,heading,latitude,longitude,radar_id,speed,squawk

But I would like to see them as:

snapshot_id longitude latitude altitude heading speed radar_id squawk

I tried the following SELECT in the SQL prompt (it worked):

0: jdbc:drill:> SELECT snapshot_id, longitude, latitude, altitude, heading, speed, CAST(REGEXP_REPLACE(`squawk`, '\r', '') AS int) AS `squawk`, radar_id FROM dfs.fr24.`/*.csv` LIMIT 20;
+--------------+------------+-----------+-----------+----------+--------+---------+-----------+
| snapshot_id  | longitude  | latitude  | altitude  | heading  | speed  | squawk  | radar_id  |
+--------------+------------+-----------+-----------+----------+--------+---------+-----------+
| 1486340102   | 14.27592   | 50.10523  | 0         | 40       | 0      | 0       | 8520      |
| 1486340568   | 14.27612   | 50.10521  | 0         | 58       | 0      | 0       | 8520      |
| 1486340668   | 14.27592   | 50.10512  | 0         | 247      | 0      | 0       | 8520      |
| 1486340773   | 14.27569   | 50.10515  | 0         | 232      | 0      | 0       | 8520      |
| 1486340828   | 14.27578   | 50.10507  | 0         | 77       | 0      | 0       | 8520      |
| 1486340903   | 14.27590   | 50.10528  | 0         | 41       | 0      | 0       | 8520      |
| 1486341363   | 14.27594   | 50.10523  | 0         | 48       | 0      | 0       | 8520      |
| 1486341828   | 14.27573   | 50.10522  | 0         | 241      | 0      | 0       | 8520      |
| 1486342643   | 14.27581   | 50.10522  | 0         | 120      | 0      | 0       | 8520      |
| 1486342870   | 14.27571   | 50.10522  | 0         | 255      | 0      | 0       | 8520      |
| 1486343421   | 14.27588   | 50.10524  | 0         | 63       | 0      | 0       | 8520      |
| 1486343966   | 14.27568   | 50.10509  | 0         | 312      | 0      | 0       | 8520      |
| 1486344251   | 14.27590   | 50.10522  | 0         | 119      | 0      | 0       | 8520      |
| 1486345301   | 14.27586   | 50.10524  | 0         | 294      | 0      | 0       | 8520      |
| 1486345875   | 14.27606   | 50.10507  | 0         | 123      | 0      | 0       | 8520      |
| 1486345931   | 14.27627   | 50.10509  | 0         | 132      | 0      | 0       | 8520      |
| 1486345961   | 14.27619   | 50.10529  | 0         | 339      | 0      | 0       | 8520      |
| 1486345981   | 14.27595   | 50.10538  | 0         | 297      | 0      | 0       | 8520      |
| 1486346091   | 14.27594   | 50.10522  | 0         | 120      | 0      | 0       | 8520      |
| 1486347131   | 14.27592   | 50.10543  | 0         | 12       | 0      | 0       | 8520      |
+--------------+------------+-----------+-----------+----------+--------+---------+-----------+
20 rows selected (9.943 seconds)

But when I use the same SELECT as the definition of the table reference with sergeant/dplyr, I only get the column order of the original CSV files (even if I later try to reorder the columns with dplyr).

Here is the sequence of commands in R:

> db <- src_drill("localhost")
> fr24pos <- tbl(db, "(SELECT snapshot_id, longitude, latitude, altitude, heading, speed, CAST(REGEXP_REPLACE(`squawk`, '\r', '') AS int) AS `squawk`, radar_id FROM dfs.fr24.`/*.csv`)")
> fr24pos
# Source:   table<(SELECT snapshot_id, longitude, latitude, altitude, heading,
#   speed, CAST(REGEXP_REPLACE(`squawk`, '\r', '') AS int) AS `squawk`,
#   radar_id FROM dfs.fr24.`/*.csv`)> [?? x 8]
# Database: DrillConnection
   altitude squawk snapshot_id heading latitude radar_id speed longitude
      <int>  <int>       <int>   <int>    <dbl>    <int> <int>     <dbl>
 1        0      0  1486340102      40     50.1     8520     0      14.3
 2        0      0  1486340568      58     50.1     8520     0      14.3
 3        0      0  1486340668     247     50.1     8520     0      14.3
 4        0      0  1486340773     232     50.1     8520     0      14.3
 5        0      0  1486340828      77     50.1     8520     0      14.3
 6        0      0  1486340903      41     50.1     8520     0      14.3
 7        0      0  1486341363      48     50.1     8520     0      14.3
 8        0      0  1486341828     241     50.1     8520     0      14.3
 9        0      0  1486342643     120     50.1     8520     0      14.3
10        0      0  1486342870     255     50.1     8520     0      14.3
# ... with more rows
> fr24pos %>% filter(altitude > 1500) %>% head()
# Source:   lazy query [?? x 8]
# Database: DrillConnection
  altitude squawk snapshot_id heading latitude radar_id speed longitude
     <int>  <int>       <int>   <int>    <dbl>    <int> <int>     <dbl>
1    39000  30276  1486339252     114     37.9      230   506      44.0
2    39000  30276  1486339315     114     37.8      230   509      44.2
3    38975  20836  1486339327     114     37.8      230   509      44.2
4    39025  20836  1486339361     114     37.8      230   508      44.2
5    39000  20836  1486339452     114     37.7      230   505      44.6
6    39000  20836  1486339513     131     37.6      230   503      44.7
> fr24pos %>% select(snapshot_id, longitude, latitude, heading, speed, everything()) %>% filter(altitude > 1500) %>% head()
# Source:   lazy query [?? x 8]
# Database: DrillConnection
  altitude squawk snapshot_id heading latitude radar_id speed longitude
     <int>  <int>       <int>   <int>    <dbl>    <int> <int>     <dbl>
1    39000  30276  1486339252     114     37.9      230   506      44.0
2    39000  30276  1486339315     114     37.8      230   509      44.2
3    38975  20836  1486339327     114     37.8      230   509      44.2
4    39025  20836  1486339361     114     37.8      230   508      44.2
5    39000  20836  1486339452     114     37.7      230   505      44.6
6    39000  20836  1486339513     131     37.6      230   503      44.7 

I file it here, but maybe it could be an issue with dplyr or Apache Drill...

Example CSV file in attachment (extension in .txt to get into the issue):
20170206_207761844.csv.txt

PS: I have to explicitly remove "\r" from squawk (the last field in the CSV) to deal with line endings... probably there is a config option for that...
PPS: fr24 is a workspace of dfs with CSV files with headers; I followed the "Using Apache Drill with R" recipes to get up and running: THANKS!
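A client-side workaround sketch (my assumption about the cause: the REST API returns each row as a JSON object, where field order is not guaranteed, so the order seen by sergeant is arbitrary regardless of the SELECT) is to re-impose the order locally once the data is collected:

```r
# Re-impose the desired column order after pulling rows into R;
# one_of() keeps this robust if a column is ever absent
wanted <- c("snapshot_id", "longitude", "latitude", "altitude",
            "heading", "speed", "radar_id", "squawk")
fr24pos %>% collect() %>% select(one_of(wanted))
```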

Support field type metadata prior to rapidjsonr switch

Current plan is to use the new metadata field and deliberately convert result sets using the specified field types and provide the following warning (once per R session) whenever BIGINT is encountered:

Warning message:
One or more columns are of type BIGINT. The sergeant package is in the process of switching to the use of the rapidjsonr package in an effort to provide support for this data type. Until then, BIGINT columns will still be converted to numeric since that's how jsonlite::fromJSON() works.

If you really need BIGINT/integer64 support, consider using the R ODBC interface to Apache Drill with the MapR ODBC drivers.

This informational warning will only be shown once per R session.

This will go away once #26 is done.

make_server not getting connection URL correctly

I could PR a fix for this, but I'm not sure how much it has to do with package version compatibility and how much other refactoring is required related to dbplyr/dplyr updates. Currently using CRAN version of everything.

library(sergeant)
ncdr <- src_drill(host = "localhost", port=22023L)

# The connection appears to work
ncdr
#> src:  DrillConnection
#> tbls: INFORMATION_SCHEMA, cp.default, dfs.default, dfs.root, dfs.tmp, sys

#But I can't do a query
snaps <- drill_query(ncdr, "SELECT * FROM INFORMATION_SCHEMA")
#> Error: length(url) == 1 is not TRUE

# It looks like there's no URL to send the query to
sergeant:::make_server(ncdr)
#> character(0)

#This seems to be because `make_server` doesn't get the host url correctly
sergeant:::make_server
#> function (drill_con) 
#> {
#>     sprintf("%s://%s:%s", ifelse(drill_con$ssl[1], "https", "http"), 
#>         drill_con$host, drill_con$port)
#> }
#> <environment: namespace:sergeant>
ncdr$host
#> NULL
ncdr$con@host
#> [1] "localhost"
Session info
devtools::session_info()
#> Session info -------------------------------------------------------------
#>  setting  value                       
#>  version  R version 3.3.2 (2016-10-31)
#>  system   x86_64, linux-gnu           
#>  ui       X11                         
#>  language (EN)                        
#>  collate  en_US.UTF-8                 
#>  tz       <NA>                        
#>  date     2017-07-23
#> Packages -----------------------------------------------------------------
#>  package    * version    date       source                            
#>  assertthat   0.2.0      2017-04-11 cran (@0.2.0)                     
#>  backports    1.1.0      2017-05-22 cran (@1.1.0)                     
#>  base       * 3.3.2      2016-11-01 local                             
#>  bindr        0.1        2016-11-13 CRAN (R 3.3.2)                    
#>  bindrcpp     0.2        2017-06-17 CRAN (R 3.3.2)                    
#>  colorspace   1.3-2      2016-12-14 cran (@1.3-2)                     
#>  curl         2.8.1      2017-07-21 CRAN (R 3.3.2)                    
#>  datasets   * 3.3.2      2016-11-01 local                             
#>  DBI        * 0.7        2017-06-18 CRAN (R 3.3.2)                    
#>  dbplyr     * 1.1.0      2017-06-27 CRAN (R 3.3.2)                    
#>  devtools     1.13.2     2017-06-02 CRAN (R 3.3.2)                    
#>  digest       0.6.12     2017-01-27 CRAN (R 3.3.1)                    
#>  dplyr      * 0.7.2      2017-07-20 CRAN (R 3.3.2)                    
#>  evaluate     0.10       2016-10-11 CRAN (R 3.3.1)                    
#>  glue         1.1.1      2017-06-21 CRAN (R 3.3.2)                    
#>  graphics   * 3.3.2      2016-11-01 local                             
#>  grDevices  * 3.3.2      2016-11-01 local                             
#>  hms          0.3        2016-11-22 cran (@0.3)                       
#>  htmltools    0.3.6      2017-04-28 CRAN (R 3.3.2)                    
#>  httr         1.2.1      2016-07-03 CRAN (R 3.3.2)                    
#>  jsonlite     1.5        2017-06-01 CRAN (R 3.3.2)                    
#>  knitr        1.16       2017-05-18 cran (@1.16)                      
#>  magrittr     1.5        2014-11-22 CRAN (R 3.1.2)                    
#>  memoise      1.0.0.9001 2016-12-13 Github (hadley/memoise@e392c7b)   
#>  methods    * 3.3.2      2016-11-01 local                             
#>  munsell      0.4.3      2016-02-13 CRAN (R 3.3.0)                    
#>  pkgconfig    2.0.1      2017-03-21 CRAN (R 3.3.2)                    
#>  plyr         1.8.4      2016-06-08 CRAN (R 3.3.0)                    
#>  purrr        0.2.2.2    2017-05-11 CRAN (R 3.3.2)                    
#>  R6           2.2.2      2017-06-17 CRAN (R 3.3.2)                    
#>  Rcpp         0.12.12    2017-07-15 CRAN (R 3.3.2)                    
#>  readr        1.1.1      2017-05-16 CRAN (R 3.3.2)                    
#>  rlang        0.1.1      2017-05-18 cran (@0.1.1)                     
#>  rmarkdown    1.3.9002   2017-02-06 Github (rstudio/rmarkdown@c4908dc)
#>  rprojroot    1.2        2017-01-16 CRAN (R 3.3.1)                    
#>  scales       0.4.1      2016-11-09 cran (@0.4.1)                     
#>  sergeant   * 0.5.2      2017-07-17 CRAN (R 3.3.2)                    
#>  stats      * 3.3.2      2016-11-01 local                             
#>  stringi      1.1.5      2017-04-07 cran (@1.1.5)                     
#>  stringr      1.2.0      2017-02-18 cran (@1.2.0)                     
#>  tibble       1.3.3      2017-05-28 CRAN (R 3.3.2)                    
#>  tools        3.3.2      2016-11-01 local                             
#>  utils      * 3.3.2      2016-11-01 local                             
#>  withr        1.0.2      2016-06-20 CRAN (R 3.3.1)                    
#>  yaml         2.1.14     2016-11-12 CRAN (R 3.3.1)

Support dbQuoteIdentifier method for `src_drill` class

Tested under 0.8.0 branch

It looks like the src_drill() class doesn't support the dbQuoteIdentifier() method. This keeps the super nifty glue::glue_sql() function from being as useful as it could be.

library(sergeant)
con <- src_drill()

fields_to_select <- c("first_name", "last_name")

glue::glue_sql("SELECT {`fields_to_select`*} FROM cp.`employee.json` LIMIT 20", .con = con)
#> Error in (function (classes, fdef, mtable) : unable to find an inherited method for function 'dbQuoteIdentifier' for signature '"src_drill", "character"'

Created on 2019-01-10 by the reprex package (v0.2.1)
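A workaround sketch until a method is added (untested assumption: the underlying DrillConnection in the con slot carries the DBI quoting methods the src_drill wrapper lacks) is to hand glue_sql() the DBI connection itself:

```r
library(sergeant)
con <- src_drill()

fields_to_select <- c("first_name", "last_name")

# Pass the wrapped DBI connection, not the src_drill object, to .con
glue::glue_sql("SELECT {`fields_to_select`*} FROM cp.`employee.json` LIMIT 20",
               .con = con$con)
```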

build_sql moved from dplyr to dbplyr

Hi! I saw your posts about this on Twitter and it looked really cool, so I thought I'd try it out. However, that involved another bout in the battle to get rJava to compile from source (which I appear to have won this time!), which led me to a sergeant installation error:

Error : object ‘build_sql’ is not exported by 'namespace:dplyr'

which is true for devel-dplyr, as it's been moved to dbplyr. (It's actually the case for most of the database-focused dplyr imports; build_sql is just the first it runs into.) Reinstalling CRAN-dplyr allows sergeant to be installed, but it still can't be used with devel-dplyr, which is a shame, as the update contains a lot of useful features.

According to the blog post about the dplyr 0.6 release candidate, it was to be submitted to CRAN yesterday. I'm not sure if that happened or quite when it will arrive on the mirrors, but for sergeant to be used at that point sergeant/R/sergeant-package.r will need to point to dbplyr.

If you want to update it now, I'm happy to make a PR if you want, though it's not a very complicated change, really.

Should dplyr idioms work if `tbl()` source is a custom SQL statement?

Are dplyr idioms expected to work if the src of a tbl() call is a drill SELECT statement instead of a bare table name? It looks like column name evaluation isn't working as expected in these cases. Reprex of the behavior I see under v0.8.0 is below:

library(sergeant)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

ds <- src_drill()
emp_full <- tbl(ds, "cp.`employee.json`")
emp_full %>% select(full_name)
#> # Source:   lazy query [?? x 1]
#> # Database: DrillConnection
#>    full_name        
#>    <chr>            
#>  1 Sheri Nowmer     
#>  2 Derrick Whelply  
#>  3 Michael Spence   
#>  4 Maya Gutierrez   
#>  5 Roberta Damstra  
#>  6 Rebecca Kanagaki 
#>  7 Kim Brunner      
#>  8 Brenda Blumberg  
#>  9 Darren Stanz     
#> 10 Jonathan Murraiin
#> # ... with more rows

emp_partial <- tbl(ds, "SELECT full_name from cp.`employee.json`") #edits? we don't need no stinkin' edits!
emp_partial %>% select(full_name)
#> Error in .f(.x[[i]], ...): object 'full_name' not found

Created on 2019-01-18 by the reprex package (v0.2.1)
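In case it helps triage: a possible workaround (untested here) is to mark the statement with `dplyr::sql()`, so dbplyr treats it as a subquery instead of quoting it as a table identifier:

```r
library(sergeant)
library(dplyr)

ds <- src_drill()

# Hypothetical workaround: sql() tells tbl() this is a statement, not a
# table name, so dbplyr should wrap it as a subquery in the generated SQL.
emp_partial <- tbl(ds, sql("SELECT full_name FROM cp.`employee.json`"))
emp_partial %>% select(full_name)
```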

sergeant errors and warnings keep popping up even after removing it

I had installed the sergeant package but could not figure out how to use it.
Meanwhile, strange errors started appearing, mostly when I am inside the debugger and press Q: I was getting sergeant-related errors even though my source file contained no sergeant code at all. So I decided to uninstall it.

Even after uninstalling the package following warning comes on sourcing my R file.

Warning message:
In library(package, lib.loc = lib.loc, character.only = TRUE, logical.return = TRUE,  :
  there is no package called ‘sergeant’

And to top it off, on entering the Q command in the RStudio debugger the sergeant package is loaded, twice. Where is the package coming from? And how is it called by just the Q command?

Does Drill's REST API support session settings?

I'm running into an issue where session settings don't seem to stick when they're submitted through sergeant. The problem seems to be with the REST API itself, not with sergeant.

Here's an example:

# Packages
library(sergeant)
#> Loading required package: DBI
#> Loading required package: dplyr
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
#> Loading required package: dbplyr
#> 
#> Attaching package: 'dbplyr'
#> The following objects are masked from 'package:dplyr':
#> 
#>     ident, sql

# Connect to Drill.
con <- drill_connection()

# Alternatively with drill_use.
drill_use(con, "dfs.tmp")
#> Parsed with column specification:
#> cols(
#>   summary = col_character(),
#>   ok = col_character()
#> )
#> Default schema changed to [dfs.tmp]true
drill_query(con, "CREATE TABLE employee as SELECT * FROM cp.`employee.json`")
#> Query ==> CREATE TABLE employee as SELECT * FROM cp.`employee.json`
#> VALIDATION ERROR: Root schema is immutable. Creating or dropping tables/views is not allowed in root schema.Select a schema using 'USE schema' command.
#> 
#> 
#> [Error Id: d4d341b3-07da-4562-813e-67f0ef89c028 on DESKTOP-73M5KKL.mshome.net:31010]

devtools::session_info()
#> Session info -------------------------------------------------------------
#>  setting  value                       
#>  version  R version 3.4.3 (2017-11-30)
#>  system   x86_64, mingw32             
#>  ui       RTerm                       
#>  language (EN)                        
#>  collate  English_United States.1252  
#>  tz       America/New_York            
#>  date     2018-03-01
#> Packages -----------------------------------------------------------------
#>  package    * version date       source        
#>  assertthat   0.2.0   2017-04-11 CRAN (R 3.4.3)
#>  backports    1.1.2   2017-12-13 CRAN (R 3.4.3)
#>  base       * 3.4.3   2017-12-06 local         
#>  bindr        0.1     2016-11-13 CRAN (R 3.4.3)
#>  bindrcpp     0.2     2017-06-17 CRAN (R 3.4.3)
#>  colorspace   1.3-2   2016-12-14 CRAN (R 3.4.3)
#>  compiler     3.4.3   2017-12-06 local         
#>  curl         3.1     2017-12-12 CRAN (R 3.4.3)
#>  datasets   * 3.4.3   2017-12-06 local         
#>  DBI        * 0.7     2017-06-18 CRAN (R 3.4.3)
#>  dbplyr     * 1.2.1   2018-02-19 CRAN (R 3.4.3)
#>  devtools     1.13.5  2018-02-18 CRAN (R 3.4.3)
#>  digest       0.6.15  2018-01-28 CRAN (R 3.4.3)
#>  dplyr      * 0.7.4   2017-09-28 CRAN (R 3.4.3)
#>  evaluate     0.10.1  2017-06-24 CRAN (R 3.4.3)
#>  glue         1.2.0   2017-10-29 CRAN (R 3.4.3)
#>  graphics   * 3.4.3   2017-12-06 local         
#>  grDevices  * 3.4.3   2017-12-06 local         
#>  hms          0.4.1   2018-01-24 CRAN (R 3.4.3)
#>  htmltools    0.3.6   2017-04-28 CRAN (R 3.4.3)
#>  httr         1.3.1   2017-08-20 CRAN (R 3.4.3)
#>  jsonlite     1.5     2017-06-01 CRAN (R 3.4.3)
#>  knitr        1.20    2018-02-20 CRAN (R 3.4.3)
#>  magrittr     1.5     2014-11-22 CRAN (R 3.4.3)
#>  memoise      1.1.0   2017-04-21 CRAN (R 3.4.3)
#>  methods    * 3.4.3   2017-12-06 local         
#>  munsell      0.4.3   2016-02-13 CRAN (R 3.4.3)
#>  pillar       1.1.0   2018-01-14 CRAN (R 3.4.3)
#>  pkgconfig    2.0.1   2017-03-21 CRAN (R 3.4.3)
#>  plyr         1.8.4   2016-06-08 CRAN (R 3.4.3)
#>  purrr        0.2.4   2017-10-18 CRAN (R 3.4.3)
#>  R6           2.2.2   2017-06-17 CRAN (R 3.4.3)
#>  Rcpp         0.12.15 2018-01-20 CRAN (R 3.4.3)
#>  readr        1.1.1   2017-05-16 CRAN (R 3.4.3)
#>  rlang        0.2.0   2018-02-20 CRAN (R 3.4.3)
#>  rmarkdown    1.8     2017-11-17 CRAN (R 3.4.3)
#>  rprojroot    1.3-2   2018-01-03 CRAN (R 3.4.3)
#>  scales       0.5.0   2017-08-24 CRAN (R 3.4.3)
#>  sergeant   * 0.5.2   2017-07-17 CRAN (R 3.4.3)
#>  stats      * 3.4.3   2017-12-06 local         
#>  stringi      1.1.6   2017-11-17 CRAN (R 3.4.2)
#>  stringr      1.3.0   2018-02-19 CRAN (R 3.4.3)
#>  tibble       1.4.2   2018-01-22 CRAN (R 3.4.3)
#>  tools        3.4.3   2017-12-06 local         
#>  utils      * 3.4.3   2017-12-06 local         
#>  withr        2.1.1   2017-12-19 CRAN (R 3.4.3)
#>  yaml         2.1.16  2017-12-12 CRAN (R 3.4.3)

One workaround is to reference dfs.tmp.employee directly when creating the table, but this makes drill_use not very useful.

Created on 2018-03-01 by the reprex package (v0.2.0).
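The fully-qualified workaround looks like this in practice (a sketch; the REST API appears to treat each request as a fresh session, so the schema has to ride along in the statement itself):

```r
library(sergeant)

con <- drill_connection()

# Qualify the schema in the statement so no session-level USE is needed.
drill_query(
  con,
  "CREATE TABLE dfs.tmp.employee AS SELECT * FROM cp.`employee.json`"
)
```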

Implement write table methods

It would be nice if the sergeant package could create new tables, as I can with other database drivers. Something like this:

library(sergeant)
library(dplyr)

conn = dbConnect(Drill())

# Load original data
original_data = tbl(conn, "`dfs.downloads`.`original_data`")

# Do some operation
edited_data = original_data %>%
  filter(name == "John")

# Write the result to a table without loading it into R
dbWriteTable(conn, table = "`dfs.downloads`.`edited_data`", values = edited_data)
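Until such a method exists, one way to approximate this without pulling data into R might be a CTAS built from the lazy query (a sketch; `dbplyr::sql_render()` produces the SELECT that dplyr would have run, and whether sergeant's connection supports `dbExecute()` is an assumption — `drill_query()` over a REST connection would be the alternative):

```r
library(sergeant)
library(dplyr)

conn <- dbConnect(Drill())

original_data <- tbl(conn, "`dfs.downloads`.`original_data`")
edited_data   <- original_data %>% filter(name == "John")

# Hypothetical workaround: render the lazy query to SQL and wrap it in a
# CREATE TABLE AS so Drill materializes the result server-side.
qry <- dbplyr::sql_render(edited_data)
DBI::dbExecute(
  conn,
  paste0("CREATE TABLE `dfs.downloads`.`edited_data` AS ", qry)
)
```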

table identifier escaping causing dplyr interface errors

Not sure exactly what changed, but some version somewhere did, and I'm now getting errors when using the dplyr interface because it puts double quotes around the table identifier. Reprex:

> library(sergeant)
> tbl(src_drill('localhost'), 'cp.`employee.json`')
Request failed [500]. Retrying in 1 seconds...
Request failed [500]. Retrying in 2 seconds...
Request failed [500]. Retrying in 1.8 seconds...
Request failed [500]. Retrying in 1.7 seconds...
# Source:   table<cp.`employee.json`> [?? x 8]
# Database: DrillConnection
Warning message:
PARSE ERROR: Encountered "FROM \"" at line 1, column 1.
SQL Query: SELECT *
FROM "cp.`employee.json`"
^
LIMIT 11

Original Query:

  1: SELECT *
  2: FROM "cp.`employee.json`"
  3: LIMIT 11

Query Profile Error Link:
http://localhost:8047/profiles/772b3c22-8d04-43d5-b733-9e2ae0bcbb50
> con <- drill_connection('localhost')

If I try to run that query directly, it fails:

> drill_query(
+     con,
+     'SELECT *
+      FROM "cp.`employee.json`"
+      LIMIT 11'
+ )
  |======================================================================| 100%
Request failed [500]. Retrying in 1.4 seconds...
  |======================================================================| 100%
Request failed [500]. Retrying in 1.9 seconds...
  |======================================================================| 100%
Query ==> SELECT *      FROM "cp.`employee.json`"      LIMIT 11
PARSE ERROR: Encountered "FROM \"" at line 1, column 6.

SQL Query: SELECT *
     FROM "cp.`employee.json`"
     ^
     LIMIT 11

[Error Id: dc495019-3219-4b88-b5f8-9a4fd657dc0f ]

whereas if I remove the quotes around cp.`employee.json` it works:

> drill_query(
+     con,
+     'SELECT *
+      FROM cp.`employee.json`
+      LIMIT 11'
+ )
  |======================================================================| 100%
# A tibble: 11 x 16
   employee_id full_name first_name last_name position_id position_title
         <dbl> <chr>     <chr>      <chr>           <dbl> <chr>
 1           1 Sheri No~ Sheri      Nowmer              1 President
 2           2 Derrick ~ Derrick    Whelply             2 VP Country Ma~
 3           4 Michael ~ Michael    Spence              2 VP Country Ma~
 4           5 Maya Gut~ Maya       Gutierrez           2 VP Country Ma~
 5           6 Roberta ~ Roberta    Damstra             3 VP Informatio~
 6           7 Rebecca ~ Rebecca    Kanagaki            4 VP Human Reso~
 7           8 Kim Brun~ Kim        Brunner            11 Store Manager
 8           9 Brenda B~ Brenda     Blumberg           11 Store Manager
 9          10 Darren S~ Darren     Stanz               5 VP Finance
10          11 Jonathan~ Jonathan   Murraiin           11 Store Manager
11          12 Jewel Cr~ Jewel      Creek              11 Store Manager
# … with 10 more variables: store_id <dbl>, department_id <dbl>,
#   birth_date <chr>, hire_date <chr>, salary <dbl>, supervisor_id <dbl>,
#   education_level <chr>, marital_status <chr>, gender <chr>,
#   management_role <chr>

Results are the same in the Drill shell. I assume this is a result of a version or setting changing somewhere, but so far I can't locate it. What I'm running:

software       version
macOS          10.14.6
Apache Drill   1.18.0
R              4.0.3
sergeant       0.9.0
dbplyr         2.0.0
dplyr          1.0.2

Happy to write a PR to fix this if this is something that can be fixed here, but so far I can't figure out what has to change.

"bad pad value" when trying to install from GitHub

Hello and thanks for the package! I can't find a way to install it on my system from GitHub, however:

> devtools::install_github("hrbrmstr/sergeant")
Downloading GitHub repo hrbrmstr/sergeant@master
Error in structure(.External(.C_dotTclObjv, objv), class = "tclObj") : 
  [tcl] bad pad value "2m": must be positive screen distance.
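That tcl error usually comes from the tcltk dialog devtools raises when asking whether to update dependencies, rather than from sergeant itself. Two possible workarounds (a sketch; untested on this exact locale/setup):

```r
# Fall back to console menus instead of the (broken) tcltk dialog:
options(menu.graphics = FALSE)

# Or skip the interactive upgrade prompt entirely:
remotes::install_github("hrbrmstr/sergeant", upgrade = "never")
```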

I can use it from CRAN, however:

> installed.packages() %>% as_tibble() %>% filter(Package == "sergeant") %>% flatten() %>% .$Version
[1] "0.5.2"
> library(sergeant)
> db <- src_drill("localhost")
> drill_connection() %>% drill_version()
[1] "1.16.0"

This is my sessionInfo:

> sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.2 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.2.20.so

locale:
 [1] LC_CTYPE=pt_BR.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=pt_BR.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=pt_BR.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=pt_BR.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=pt_BR.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] nvimcom_0.9-82 colorout_1.2-1 usethis_1.5.0  devtools_2.0.2

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.1        magrittr_1.5      pkgload_1.0.2     R6_2.4.0         
 [5] rlang_0.3.4       tcltk_3.6.0       tools_3.6.0       pkgbuild_1.0.3   
 [9] parallel_3.6.0    sessioninfo_1.1.1 cli_1.1.0         withr_2.1.2      
[13] remotes_2.0.4     assertthat_0.2.1  rprojroot_1.3-2   digest_0.6.18    
[17] crayon_1.3.4      processx_3.3.1    callr_3.2.0       fs_1.3.1         
[21] ps_1.3.0          curl_3.3          testthat_2.1.1    memoise_1.1.0    
[25] glue_1.3.1        compiler_3.6.0    desc_1.2.0        backports_1.1.4  
[29] prettyunits_1.0.2

Out-of-memory issue reading a single column from a parquet file

Thanks for providing the sergeant package!

My use case is reading a single column of data from a fairly large parquet file. The column I want to read is called occurrenceId; the whole column fits into a character vector of length 70M in R, where each value is a string of 41 characters (it is a unique identifier, and I'd like to check it for uniqueness and presence in R). In R the whole column occupies about 700M in memory when I inspect it.

I can do this with sparklyr, but sergeant's approach is appealing, being more lightweight. I am struggling with an out-of-memory issue, though. I have 16 GB available, which I think should suffice given the size of the vector in R, so now I'm wondering if this use case is supported or if I'm doing it wrong.

The dataset I'm using is public and can be viewed here https://www.gbif.org/dataset/38b4c89f-584c-41bb-bd8f-cd1def33e92f and it can be downloaded from here in .zip format: http://www.gbif.se/ipt/archive.do?r=artdata

I first tried using vroom, reading directly from the compressed .zip file (details here: tidyverse/vroom#116), but the promise of reading from a comparatively smaller parquet file, and of reading only the columns I need, made me turn to sergeant. So, in my attempt to read the parquet file, I first converted the .zip to parquet using sparklyr, like so:

library(sparklyr)
library(dplyr)

# first install Spark 2.4.0 / Hadoop 2.7 with sparklyr::spark_install()

Sys.setenv("SPARK_MEM" = "12g")

config <- spark_config()
config$`sparklyr.shell.driver-memory` <- '12G'
config$`sparklyr.shell.executor-memory` <- '4G'
config$sparklyr.defaultPackages <- "com.datastax.spark:spark-cassandra-connector_2.11:2.0.0-M3"
config$spark.cassandra.cassandra.host <- "localhost"
config$spark.driver.maxResultSize <- "4G"
config$spark.executor.cores <- 3

# is pushdown option TRUE?

sc <- spark_connect(master = "local", config = config)

# for this connection, load all records

system.time(
  spark_read_csv(sc, memory = FALSE,
    name = "artdata", path = "file:///home/roger/artdata/artdata.tsv", delimiter = "\t")
)

#user   system  elapsed 
#6.154    7.060 1559.874 

# generate a parquet file based on the dataframe above

system.time(
  spark_write_parquet(
    tbl(sc, "artdata"),  
    "file:///home/roger/artdata/artdata.parquet")
)

#user   system  elapsed 
#14.634   16.586 3816.375 

# the parquet-file is 3.8 GB on disk, smaller than the zip

spark_tbl_handle <- spark_read_parquet(sc, memory = FALSE,
  "artdata", "file:///home/roger/artdata/artdata.parquet")


has_valid_bor <- function() {
  
  bor <- 
    spark_tbl_handle %>%
    count(basisOfRecord) %>%
    collect() %>%
    mutate(is_ok = basisOfRecord %in% c(
        "humanobservation", 
        "machineobservation"
      )
    )

  bor %>% pull(is_ok) %>% all
}

n_rowcount <- function() {
  
  spark_tbl_handle %>%
  summarise(n = n()) %>%
  pull(n)

}


has_valid_id <- function() {
  
  ids <- 
    spark_tbl_handle %>%
    count(occurrenceID) %>%
    filter(n > 1, is.na(occurrenceID)) %>%
    collect()
  
  nrow(ids) == 0
  
}


system.time(
  has_valid_bor()
)

system.time(
  has_valid_id()
)

system.time(
  n_rowcount()
)

sort_artdata <- function() {
  spark_tbl_handle %>%
  arrange(occurrenceID) %>%
  head(10) %>%
  collect()
}

system.time(
  sort_artdata()
)

# sorting in spark takes about 5 minutes...
#user  system elapsed 
#3.182   1.370 282.698

This gives me a parquet file on disk.

I then try to use sergeant to read the occurrenceId column, like so:

library(sergeant)
library(tidyverse)

if (Sys.which("docker") == "")
  stop("Please install docker first - see https://docs.docker.com/install/")

# install and run official Apache Drill software
system("docker stop drill; docker rm drill; docker run -i --name drill -e DRILL_HEAP=10G -v /home/markus/tmp/artdata:/tmp -p 8047:8047 --detach drill/apache-drill:1.16.0 /bin/bash")

dc <- drill_connection("localhost") 

df <- drill_query(dc, "SELECT occurrenceId FROM dfs.`/tmp/artdata.parquet`")

The error message that I get is the following:

  |===========================================================================| 100%
Query ==> SELECT occurrenceId FROM dfs.`/tmp/artdata.parquet`
RESOURCE ERROR: There is not enough heap memory to run this query using the web interface. 

Please try a query with fewer columns or with a filter or limit condition to limit the data returned. 
You can also try an ODBC/JDBC client. 

[Error Id: f42e398d-0afb-4931-a800-2a56baaa074c ]

I tried to set the DRILL_HEAP to 10G.

Is this use case supported by sergeant?

Any advice on how I should proceed?
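Not an answer, but two directions suggested by the error message itself: the JDBC interface, or keeping each REST response small. A paging sketch for the latter (the chunk size is a guess, and note that without an ORDER BY Drill does not guarantee a stable row order across pages, which shouldn't matter for a uniqueness check):

```r
library(sergeant)

dc <- drill_connection("localhost")

# Page through the single column so no one REST response exceeds the web
# interface's heap budget; collect the pieces client-side.
chunk  <- 5e6
offset <- 0
ids    <- character(0)

repeat {
  q <- sprintf(
    "SELECT occurrenceId FROM dfs.`/tmp/artdata.parquet` LIMIT %d OFFSET %d",
    chunk, offset
  )
  part <- drill_query(dc, q)
  if (!is.data.frame(part) || nrow(part) == 0) break
  ids    <- c(ids, part$occurrenceId)
  offset <- offset + chunk
}

length(unique(ids)) == length(ids)  # the uniqueness check
```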

Column data types not being parsed under v0.8.0-dev

Under sergeant master (20190106), csvh files being read in over DBI (src_drill()) are coming across as character columns, while the same file parsed under CRAN v0.5.0 sergeant is properly coerced into R data types.

Apologies for the lack of reprex here. I'm in the thick of a drill-powered project and wanted to throw this early report up in case this was something related to the rapidjsonr conversion progress and the new dev branches. If previously unknown, I'll generate a reprex when my hair is a bit less on fire. ;)
