Giter Club home page Giter Club logo

common_crawl_types's Introduction

common_crawl_types

Ben Nagy wrote the original code for this project, and posted it inline to the
Common Crawl mailing list. I tidied it up and wrote a how-to guide:
http://petewarden.typepad.com/searchbrowser/2012/03/twelve-steps-to-running-your-ruby-code-across-five-billion-web-pages.html

Ben's original message is below.

Pete Warden, [email protected]

-------------------------------------------------------------------

Hi,

So I found this a bit of a pain, so I thought I'd share. If you want
to mess with the Common Crawl stuff but don't feel like learning Java,
this might be for you.

I'm sure that this could be easily adapted for other streaming
languages, once you work out how to read requester-pays buckets.

First up, see this:
http://arfon.org/getting-started-with-elastic-mapreduce-and-hadoop-streaming

Which has basic information and nice screenshots about EMR Streaming,
setting up the job, bootstrapping and such.

To install the AWS Ruby SDK on an EMR instance you'll need to
bootstrap some stuff. Some of the packages might not be necessary, but
it was a bit of a pain to trim down from a working set of basic
packages.

(see setup.sh)

OK, now we're ready for the mapper. This example just collects
mimetypes and URL extensions. The key bits are the ArcFile class and
the monkeypatch to make requester-pays work. I'm not particularly
proud of this monkeypatch, by the way, but the SDK code is a bit
baffling, and it looked like too much work to patch it properly.

This mapper expects a file manifest as input, one arc.gz url to read
per line. By doing this you avoid the problem of weird splits, or
having hadoop automatically trying to gunzip the file and failing. It
should look like:

s3://commoncrawl-crawl-002/2010/09/24/9/1285380159663_9.arc.gz
s3://commoncrawl-crawl-002/2010/09/24/9/1285380179515_9.arc.gz
s3://commoncrawl-crawl-002/2010/09/24/9/1285380199363_9.arc.gz

You can get those names with the SDK, once you add the monkeypatch
below, or with a patched version of s3cmd ls, the instructions for
which have been posted here before.

(see extension_map.rb)

And finally, a trivial reducer

(see extension_reduce.rb)

IMHO you only need one of these puppies, which you can achieve by
adding '-D mapred.reduce.tasks=1' to your job args

If it all worked you should get something like this in your output
directory:

text/html                               : 4365
text/html                .html          : 4256
text/xml                                : 43
text/html                .aspx          : 16
text/html                .com           : 2
text/plain               .txt           : 1

Except with more entries, that is just an example based on one file.

For those interested in costs / timings, I finished 2010/9/24/9 (790
files) in 5h57m, or 30 normalised instance hours of m1.small, with 1
master and 4 core instances. The same job with 1 m1.small master and
2x cc1.4xlarge core was done in 1h31m, for 66 normalised instance
hours. I'll let you do your individual maths and avoid drawing any
conclusions. If anyone has additional (solid) performance data
comparing various cluster configs for identical workloads then that
might be useful. As an aside, my map tasks took from 9 minutes to 45
minutes to complete, but the average was probably ~33 (eyeball).

Anyway, hope this helps someone.

Cheers,

ben

common_crawl_types's People

Contributors

petewarden avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.