Giter Club home page Giter Club logo

kudusparklyr's Introduction

####Kudusparklyr ###A Sparklyr extention for Kudu

#####Introduction Sparklyr is an R interface for Apache Spark that is highly extensible. Apache Kudu is a new storage engine that enables fast analytics on fast data and fits use cases in the Hadoop ecosystem where you need to frequently update/delete data while also servicing analytical queries as it comes in (NRT). Kudusparklyr is a Sparklyr extension that leverages the Kudu integration with Spark to make working with Kudu in an R environment easier. Usage requires Spark, Kudu, and Sparklyr and supports Spark 1.6 today.

#####Installation Install through devtools:

library(devtools)
install_github("bkvarda/kudusparklyr")

Alternatively download the repo and load from local:

library(devtools)
load_all('/path/to/kudusparklyr')

#####Functions Create a KuduContext and append it to a sparklyr spark_connection object.

kudu_context(sc,kudu_master)

Read a Kudu table in as a Spark DataFrame

read_kudu_table(sc,kudu_table_name)

Insert, insert(ignore), upsert, update, and delete rows from a table (based on the contents of a DataFrame):

kudu_insert_rows(sc,data_frame,kudu_table)
kudu_insert_ignore_rows(sc,data_frame,kudu_table)
kudu_upsert_rows(sc,data_frame,kudu_table)
kudu_update_rows(sc,data_frame,kudu_table)
kudu_delete_rows(sc,data_frame,kudu_table)

Check if a Kudu table exists:

kudu_table_exists(sc,kudu_table)

Delete a Kudu table:

delete_kudu_table(sc,kudu_table)

Create a Kudu Options object (needed for creating a table):

kudu_table_options(sc)

Changing options properties (partition_columns need to be a list() of columns):

set_num_replicas(options,num_replicas)
add_hash_partitions(options,parition_columns,num_buckets,seed=0)
set_range_partition_columns(options,partition_columns)

Or more elegantly using magrittr pipes:

options <- kudu_table_options(sc) %>% set_num_replicas(3) %>% add_hash_partitions(list("playerID"),16)

Extract the schema of a dataframe (can be used in non-kudu scenarios):

sdf_schema(df)

Create a Kudu table:

create_kudu_table(sc,tbl_name,schema,key_columns,kudu_options)

#Example:
create_kudu_table(sc, 'batting_table',schema,list("playerID","yearID","teamID"),options)

#####Example

library(devtools)
library(sparklyr)
library(DBI)
library(dplyr)
library(kudusparklyr)
library(magrittr)

#Initialize connection to Spark (on YARN)
sc <- spark_connect(master="yarn-client",version="1.6")

#Create a KuduContext for manipulating tables (writing, updating, etc). Appends reference to KuduContext and Kudu Master to your Spark connection object
sc <- kudu_context(sc,"ip-10-0-0-138.ec2.internal:7051")

#Read a Kudu table and create a Spark DataFrame
df <- read_kudu_table(sc, "particle_test")

#Read in a sample of your DataFrame
sdf_sample(df)

#Register as a table (needed for dplyr functionality as far as I can tell)
tbl <- sdf_register(df, "temp_tbl")

#Select only the 'coreid' column through the dplyr interface:
select(tbl,coreid)

#Select only the 'coreid' column through the DBI interface:
dbGetQuery(sc, "Select coreid FROM temp_tbl")

#Select the relative counts of each occurence of coreid through the DBI interface:
counts <- dbGetQuery(sc, "Select coreid, count(coreid) as count FROM temp_tbl GROUP BY coreid ORDER BY count DESC")

#Convert to a SparkDataframe
counts_df <- copy_to(sc,counts)

#Insert the contents of counts_df into "counts_table" - use spark_dataframe(counts_df) to get the Java reference of counts_df:
kudu_insert_rows(sc,spark_dataframe(counts_df),"counts_table")

#Read in the new Kudu table as Spark DF
df <- read_kudu_table(sc, "counts_table")

#Register as a temp table
counts_tbl <- sdf_register(df, "counts_tbl")

#Select specific records through the dplyr interface. This needs to end up being a list of keys to be deleted
records <- select(filter(counts_tbl, count <= 10),coreid)

#Delete these records from the counts_tbl table
kudu_delete_rows(sc,spark_dataframe(records),"counts_table")

#Copy the Batting table into Spark as a DF:
batting_tbl <- copy_to(sc, Lahman::Batting, "batting")

#Sample the DF:
sdf_sample(batting_tbl)

#Let's create a Kudu table. We'll derive a schema from the DataFrame and set some required options for our table...

#Derive the schema:
schema <- sdf_schema(batting_tbl)

#Create a Kudu options object which is needed to create a table. We can use magrittr pipes to make this easier:
options <- kudu_table_options(sc) %>% set_num_replicas(3) %>% add_hash_partitions(list("playerID"),16)

#Create the table:
create_kudu_table(sc, 'batting_table',schema,list("playerID","yearID","teamID"),options)

#Insert rows (this particular dataset has a bunch of duplicate rows):
kudu_insert_ignore_rows(sc,batting_tbl,'batting_table')

#Read the new table, register it and filter it with a magrittr pipe:
df <- read_kudu_table(sc, 'batting_table') %>% sdf_register('batting_temp') %>% filter(yearID > 1990)

#Grab a sample (with ouput as shown in console):
sdf_sample(df)
Source:   query [?? x 22]
Database: spark connection master=yarn-client app=sparklyr local=FALSE

    playerID yearID teamID stint  lgID     G    AB     R     H   X2B   X3B    HR   RBI    SB    CS    BB    SO   IBB
       <chr>  <int>  <chr> <int> <chr> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1  alfonan01   1997    FLO     1    NL    17     3     0     0     0     0     0     0     0     0     0     3     0
2  alfonan01   1997    FLO     1    NL    17     3     0     0     0     0     0     0     0     0     0     3     0
3  alfonan01   1998    FLO     1    NL    58     4     0     0     0     0     0     0     0     0     0     2     0
4  alfonan01   1999    FLO     1    NL    73     2     0     0     0     0     0     0     0     0     0     2     0
5  alfonan01   1999    FLO     1    NL    73     2     0     0     0     0     0     0     0     0     0     2     0
6  alfonan01   1999    FLO     1    NL    73     2     0     0     0     0     0     0     0     0     0     2     0
7  alfonan01   1999    FLO     1    NL    73     2     0     0     0     0     0     0     0     0     0     2     0
8  alfonan01   1999    FLO     1    NL    73     2     0     0     0     0     0     0     0     0     0     2     0
9  alfonan01   2000    FLO     1    NL    68     0     0     0     0     0     0     0     0     0     0     0     0
10 alfonan01   2000    FLO     1    NL    68     0     0     0     0     0     0     0     0     0     0     0     0
 ... with more rows, and 4 more variables: HBP <int>, SH <int>, SF <int>, GIDP <int>

#Delete the table 
delete_kudu_table(sc, 'batting_table')

#Validate:
kudu_table_exists(sc, 'batting_table')
[1] FALSE

#####Limitations

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.