
quainjn / etl-language-comparison


This project forked from dimroc/etl-language-comparison


Count the number of times certain words were said in a particular neighborhood. Performed as a basic MapReduce job against 25M tweets. Implemented with different programming languages as an educational exercise.

Home Page: http://blog.dimroc.com/2015/11/14/etl-language-showdown-pt3/

C# 11.37% Elixir 13.09% Erlang 13.59% Shell 6.21% Go 8.00% Mathematica 0.78% Nim 2.69% JavaScript 4.81% Perl 5.50% PHP 1.79% Python 4.92% Ruby 6.91% PowerShell 0.48% Rust 6.36% Scala 13.48%

etl-language-comparison's Introduction

Update

Please see the following blog posts for the latest updates:

  1. ETL Language Showdown - Sept. 2014
  2. ETL Language Showdown Part 2 - Now with Python - May. 2015
  3. ETL Language Showdown Part 3 - 10 Languages and growing - Nov. 2015

Wins

Analyses and discussions done here have led to the following language pull requests:

  1. Add BIF binary:split/2,3 to Erlang
  2. Improve case-insensitive regex in Golang

ETL Language Showdown

This repo implements the same MapReduce ETL (Extract-Transform-Load) task in multiple languages in an effort to compare language productivity, terseness, and readability. The performance comparisons should not be taken seriously. If anything, they are a bigger indication of my skill in each language than of the languages' performance capabilities.

The Task

Count the number of tweets that mention 'knicks' in their message and bucket based on the neighborhood of origin. The ~1GB dataset for this task, sampled below, contains a tweet's message and its NYC neighborhood.

Simply run fetch_tweets in the repo directory, or download the dataset here.

91	west-brighton	Brooklyn	Uhhh
121	turtle-bay-east-midtown	Manhattan	Say anything
175	morningside-heights	Manhattan	It feels half-cheating half-fulfilling to cite myself.
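
As a minimal sketch of the task (not one of the repo's implementations), the Python below assumes each line holds four tab-separated fields as in the sample above: an id, the neighborhood, the borough, and the message. The function name `count_mentions` is hypothetical, and treating a "mention" as a case-insensitive substring match is an assumption.

```python
from collections import Counter


def count_mentions(lines, term="knicks"):
    """Count lines whose message mentions `term`, bucketed by neighborhood."""
    counts = Counter()
    needle = term.lower()
    for line in lines:
        # Assumed layout: id <TAB> neighborhood <TAB> borough <TAB> message
        fields = line.rstrip("\n").split("\t", 3)
        if len(fields) < 4:
            continue  # skip rows that do not match the assumed layout
        neighborhood, message = fields[1], fields[3]
        if needle in message.lower():
            counts[neighborhood] += 1
    return counts


if __name__ == "__main__":
    # Made-up rows in the sample's format, for illustration only.
    rows = [
        "1\twest-brighton\tBrooklyn\tUhhh\n",
        "2\tturtle-bay-east-midtown\tManhattan\tGo KNICKS!\n",
        "3\tturtle-bay-east-midtown\tManhattan\tknicks game tonight\n",
    ]
    for neighborhood, total in count_mentions(rows).most_common():
        print(neighborhood, total)
```

Splitting with a maximum of three splits keeps any tab characters inside the message from shifting the fields.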

Initial Assumption

  • These tasks are not run on Hadoop but do run concurrently. Performance numbers are moot since the CPU mostly sits idle waiting on disk IO.
  • UPDATE: Boy, was the IO-bound assumption wrong.
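
To illustrate what running concurrently without Hadoop can look like, here is a hedged Python sketch of a map step per file and a reduce step that merges the partial counts. It assumes the dataset is split across several tab-separated files (the actual layout produced by fetch_tweets is not stated here), and the names `map_file`, `reduce_counts`, and `run` are hypothetical. Because the work turned out not to be IO bound, the sketch defaults to separate processes rather than threads.

```python
from collections import Counter
from concurrent.futures import ProcessPoolExecutor


def map_file(path):
    """Map step: count 'knicks' mentions per neighborhood in one file."""
    counts = Counter()
    with open(path, encoding="utf-8", errors="replace") as handle:
        for line in handle:
            # Assumed layout: id <TAB> neighborhood <TAB> borough <TAB> message
            fields = line.rstrip("\n").split("\t", 3)
            if len(fields) == 4 and "knicks" in fields[3].lower():
                counts[fields[1]] += 1
    return counts


def reduce_counts(partials):
    """Reduce step: merge the per-file counters into one total."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def run(paths, executor_cls=ProcessPoolExecutor):
    """Run the map step for every path concurrently, then reduce."""
    # Separate processes let a CPU-bound map step use several cores.
    with executor_cls() as executor:
        return reduce_counts(executor.map(map_file, paths))
```

When calling `run` from a script with the default process pool, place the call under an `if __name__ == "__main__":` guard so worker processes can import the module safely.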

The Languages

Below you will find the languages run. Note that frameworks also play a big role: for example, the Scala implementation compares parallel collections to futures and the Akka framework. Click through on each language to read more.

Language    Owner
Ruby
Golang      matttproud
Scala
Nim
Node
PHP
Erlang
Elixir      josevalim
Rust
Python
C#          mganss
shell       mganss
perl        sitaramc
