fast-read's Introduction

Data ingestion and manipulation

Download and unpack data

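# Download and decompress the 2008 flights data once; skip if it is already present
if (!file.exists("flights.csv")) {
  download.file(
    "http://stat-computing.org/dataexpo/2009/2008.csv.bz2",
    "flights.csv.bz2",
    mode = "wb"  # binary mode keeps the archive intact on Windows
  )
  R.utils::bunzip2(
    "flights.csv.bz2",
    "flights.csv"
  )
  # Remove the archive if it is still on disk
  unlink("flights.csv.bz2", force = TRUE)
}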

Read data

readr

library(readr)

tr <- system.time(
  flights_readr <- read_csv("flights.csv")  
)
#> Parsed with column specification:
#> cols(
#>   .default = col_double(),
#>   UniqueCarrier = col_character(),
#>   TailNum = col_character(),
#>   Origin = col_character(),
#>   Dest = col_character(),
#>   CancellationCode = col_character()
#> )
#> See spec(...) for full column specifications.

tr[[3]]
#> [1] 21.748
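
The `[[3]]` used for each timing picks the elapsed (wall-clock) seconds out of the `system.time()` result. A minimal sketch on a toy workload (`Sys.sleep(0.1)` is only a stand-in, not part of the benchmark) shows that the same value can be read by name:

```r
# system.time() returns a named numeric vector of class "proc_time";
# its first three entries are user, system, and elapsed time.
timing <- system.time(Sys.sleep(0.1))  # stand-in workload, not from the benchmark

names(timing)[1:3]

elapsed_by_position <- timing[[3]]
elapsed_by_name <- timing[["elapsed"]]

# Both forms return the same number
identical(elapsed_by_position, elapsed_by_name)
```

Reading the value by name makes it harder to pick the wrong entry, since user and system CPU time sit in positions 1 and 2.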

data.table

library(data.table)

tdt <- system.time(
  flights_dt <- fread("flights.csv")  
)

tdt[[3]]
#> [1] 3.717

vroom

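library(vroom)

# Note: newer vroom releases name this argument `altrep`; `altrep_opts` is deprecated there
tva <- system.time(
  flights_vroom_altrep <- vroom("flights.csv", altrep_opts = TRUE)
)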
#> Observations: 7,009,728
#> Variables: 29
#> chr [ 5]: UniqueCarrier, TailNum, Origin, Dest, CancellationCode
#> dbl [24]: Year, Month, DayofMonth, DayOfWeek, DepTime, CRSDepTime, ArrTime, CRSArrTim...
#> 
#> Call `spec()` for a copy-pastable column specification
#> Specify the column types with `col_types` to quiet this message

tva[[3]]
#> [1] 1.996
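
The message above suggests passing `col_types`. A minimal sketch of that, assuming a small made-up CSV written to a temporary file in place of `flights.csv` (the file, its values, and the name `toy_flights` are illustrative only); the specification follows the one printed by readr earlier:

```r
library(vroom)

# Toy stand-in for flights.csv with a few of its columns (values are made up)
toy_path <- tempfile(fileext = ".csv")
writeLines(
  c("Year,Month,ArrDelay,UniqueCarrier,Origin",
    "2008,1,-14,WN,IAD",
    "2008,1,2,WN,IAD",
    "2008,2,NA,XE,TPA"),
  toy_path
)

# Declaring the types up front skips type guessing and quiets the summary message
toy_flights <- vroom(
  toy_path,
  delim = ",",
  col_types = cols(
    .default = col_double(),
    UniqueCarrier = col_character(),
    Origin = col_character()
  )
)

sapply(toy_flights, class)
```

For the real file, the same `cols()` call would also list `TailNum`, `Dest`, and `CancellationCode` as character columns.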

Results

library(tidyverse)

comparison <- tibble(
  readr = tr[[3]],
  `data.table` = tdt[[3]],
  vroom = tva[[3]]
)

comparison 
#> # A tibble: 1 x 3
#>   readr data.table vroom
#>   <dbl>      <dbl> <dbl>
#> 1  21.7       3.72  2.00

comparison %>%
  gather() %>%
  ggplot(aes(key, value, fill = key)) +
  geom_col() +
  geom_label(aes(label = paste0(round(value), " secs")), fill = "white") +
  coord_flip() +
  labs(title = "File read times", x = "", y = "") +
  theme_minimal() +
  theme(legend.position = "none", axis.text.x = element_blank())

Data manipulation

flights_readr %>% 
    group_by(Month) %>% 
    summarise(avg_delay = mean(ArrDelay, na.rm = TRUE))
#> # A tibble: 12 x 2
#>    Month avg_delay
#>    <dbl>     <dbl>
#>  1     1    10.2  
#>  2     2    13.1  
#>  3     3    11.2  
#>  4     4     6.81 
#>  5     5     5.98 
#>  6     6    13.3  
#>  7     7     9.98 
#>  8     8     6.91 
#>  9     9     0.698
#> 10    10     0.415
#> 11    11     2.02 
#> 12    12    16.7

Transformations

mr <- system.time(
  flights_readr %>% 
    group_by(Month) %>% 
    summarise(avg_delay = mean(ArrDelay, na.rm = TRUE))
)
mva <- system.time(
  flights_vroom_altrep %>% 
    group_by(Month) %>% 
    summarise(avg_delay = mean(ArrDelay, na.rm = TRUE))
)
mdt <- system.time(
  flights_dt[!is.na(ArrDelay), .(avg_delay = mean(ArrDelay)), Month]
)
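
The data.table call drops rows with a missing `ArrDelay` before averaging, while the dplyr calls keep every row and pass `na.rm = TRUE`. A small base R sketch on made-up data (`tapply()` stands in for both packages here, and `toy` is a hypothetical data frame) suggests the two approaches give the same monthly averages:

```r
# Toy data (made up): month 2 has one missing delay
toy <- data.frame(
  Month    = c(1, 1, 2, 2, 2),
  ArrDelay = c(10, 20, 5, NA, 7)
)

# Approach A: keep every row and let mean() skip the NAs (as in the dplyr calls)
avg_na_rm <- tapply(toy$ArrDelay, toy$Month, mean, na.rm = TRUE)

# Approach B: drop NA rows first, then take a plain mean (as in the data.table call)
kept <- toy[!is.na(toy$ArrDelay), ]
avg_filtered <- tapply(kept$ArrDelay, kept$Month, mean)

all.equal(avg_na_rm, avg_filtered)
```

The two would differ only for a month whose delays are all missing: approach A reports `NaN` for it, while approach B omits the month entirely.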

Results

comp <- tibble(
  readr = mr[[3]],
  `data.table` = mdt[[3]],
  vroom = mva[[3]]
)

comp
#> # A tibble: 1 x 3
#>   readr data.table vroom
#>   <dbl>      <dbl> <dbl>
#> 1 0.232      0.212 0.536

comp %>%
  gather() %>%
  ggplot(aes(key, value, fill = key)) +
  geom_col() +
  geom_label(aes(label = paste0(round(value, 2), " secs")), fill = "white") +
  coord_flip() +
  labs(title = "Data manipulation times", x = "", y = "") +
  theme_minimal() +
  theme(legend.position = "none", axis.text.x = element_blank())

fast-read's People

Contributors

edgararuiz-zz

fast-read's Issues

Move `master` branch to `main`

The master branch of this repository will soon be renamed to main, as part of a coordinated change across several GitHub organizations (including, but not limited to: tidyverse, r-lib, tidymodels, and sol-eng). We anticipate this will happen by the end of September 2021.

That will be preceded by a release of the usethis package, which will gain some functionality around detecting and adapting to a renamed default branch. There will also be a blog post at the time of this master --> main change.

The purpose of this issue is to:

  • Help us firm up the list of targeted repositories
  • Make sure all maintainers are aware of what's coming
  • Give us an issue to close when the job is done
  • Give us a place to put advice for collaborators re: how to adapt

