Giter Club home page Giter Club logo

bashreduce's Introduction

bashreduce : mapreduce in a bash script

bashreduce lets you apply your favorite unix tools in a mapreduce fashion across multiple machines/cores. There’s no installation, administration, or distributed filesystem. You’ll need:

  • br somewhere handy in your path
  • vanilla unix tools: sort, awk, ssh, netcat, pv
  • password-less ssh to each machine you plan to use

Configuration

Edit /etc/br.hosts and enter the machines you wish to use as workers. Or specify your machines at runtime:

br -m "host1 host2 host3"

To take advantage of multiple cores, repeat the host name.

Examples

sorting

br < input > output

word count

br -r "uniq -c" < input > output

great big join

LC_ALL='C' br -r "join - /tmp/join_data" < input > output

Performance

big honkin’ local machine

Let’s start with a simpler scenario: I have a machine with multiple cores and with normal unix tools I’m relegated to using just one core. How does br help us here? Here’s br on an 8-core machine, essentially operating as a poor man’s multi-core sort:

command using time rate
sort -k1,1 -S2G 4gb_file > 4gb_file_sorted coreutils 30m32.078s 2.24 MBps
br -i 4gb_file -o 4gb_file_sorted coreutils 11m3.111s 6.18 MBps
br -i 4gb_file -o 4gb_file_sorted brp/brm 7m13.695s 9.44 MBps

The job completely i/o saturates, but still a reasonable gain!

many cheap machines

Here lies the promise of mapreduce: rather than use my big honkin’ machine, I have a bunch of cheaper machines lying around that I can distribute my work to. How does br behave when I add four cheaper 4-core machines into the mix?

command using time rate
sort -k1,1 -S2G 4gb_file > 4gb_file_sorted coreutils 30m32.078s 2.24 MBps
br -i 4gb_file -o 4gb_file_sorted coreutils 8m30.652s 8.02 MBps
br -i 4gb_file -o 4gb_file_sorted brp/brm 4m7.596s 16.54 MBps

We have a new bottleneck: we’re limited by how quickly we can partition/pump our dataset out to the nodes. awk and sort begin to show their limitations (our clever awk script is a bit cpu bound, and sort -m can only merge so many files at once). So we use two little helper programs written in C (yes, I know! it’s cheating! if you can think of a better partition/merge using core unix tools, contact me) to partition the data and merge it back.

Future work

I’ve tested this on ubuntu/debian, but not on other distros. According to Daniel Einspanjer, netcat has different parameters on Redhat.

br has a poor man’s dfs like so:

br -r "cat > /tmp/myfile" < input

But this breaks if you specify the same host multiple times. Maybe some kind of very basic virtualization is in order. Maybe.

Other niceties would be to more closely mimic the options presented in sort (numeric, reverse, etc).

bashreduce's People

Contributors

edwardbadboy avatar erikfrey avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

bashreduce's Issues

README unclear

Hi there,

I just started reading the README and was wondering about the first table (comparsing sort to br). What is the difference between the second and third row? It looks the same to me.

Question: How to use 'poor man's dfs'?

In README, it is said that we can use bashreduce as a poorman's dfs by doing

br -r "cat > /tmp/myfile" < input

I think I can see a file 'input' will be distributed among the hosts with the file name /tmp/myfile.
But how can we read the file?

br -r "cat /tmp/myfile"

would give us a broken file since those distributed chunks do not have information about in what order the original file was written.
Or it is assumed that input is a file whose lines have numbers in order?

Mac OS X Support

(I'm working on this one — if I find something I'll send a pull request. Consider this as much a development log as a feature request!)

Apple's version of nc doesn't support -q, though -w seems to do something similar. Even with the change, it results in the rather mystifying behavior of reading the input and producing no output, and I have no idea why.

(This is of course after installing pv from Homebrew; one could easily alias pv=cat if one was so inclined!)

get none results

i have run this command to test it:

br < Makefile

and get nothing...
20130616172222

Redhat/CentOS

Any luck with the redhat system? I tried removing the -q0 from nc and that took me a little further.

However, there seem to be other problems with the ssh'ing.

better documentation on how to specify hosts

I don't get how I should define the hosts in my configuration file.

I have access to the following hosts:

  • gioby@localhost
  • gdallolio@server1
  • gioby@server2

I have tried to run bashreduce with the following command line:
$: br -m "gioby@localhost gdallolio@server1" -r 'head' < bigfile.txt
but it asks me the password to connect to localhost.

Could you improve the documentation on how to specifiy hosts, with different usernames?

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.