
Monkeyshines

Monkeyshines is a tool for doing an algorithmic scrape.

It’s designed to handle large-scale scrapes that can exceed the capabilities of a single-machine relational database, so it plays nicely with Hadoop / Wukong, with distributed databases (MongoDB, tokyocabinet, etc.), and with distributed job queues (e.g. edamame/beanstalk).

Install

Get the code

We’re still actively developing monkeyshines. The newest version is available via Git on github:

$ git clone git://github.com/mrflip/monkeyshines

A gem is available from gemcutter:

$ sudo gem install monkeyshines --source=http://gemcutter.org

(don’t use the gems.github.com version — it’s way out of date.)

You can instead download this project in either zip or tar formats.

Dependencies and setup

To finish setting up, see the detailed setup instructions and then read the usage notes.


Overview

Runner

  • Constructed with a builder pattern
  • Does the running itself:
  1. Set stuff up
  2. Loop until there are no more requests:
    1. Get a request from the #source
    2. Pass that request to the fetcher
      • The fetcher has a #get method,
      • which stuffs the response contents into the request object
    3. If the fetcher gets a successful response, pass it to the destination store
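
The loop above in miniature, as a self-contained Ruby sketch (the class and method names here are illustrative, not the gem's actual API; only #get and the response_code attribute are taken from the notes above):

    # Illustrative sketch of the runner loop -- not the gem's actual classes.
    class TinyRunner
      def initialize(source, fetcher, dest)
        @source, @fetcher, @dest = source, fetcher, dest
      end

      def run!
        # 1. set stuff up (open stores, logs, ...) would happen here
        # 2. loop until the source has no more requests
        @source.each do |request|
          @fetcher.get(request)           # stuffs the response into the request object
          # keep only successful (2xx) responses
          @dest.save(request) if request.response_code.to_s =~ /\A2/
        end
      end
    end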

Bulk URL Scraper

  1. Open a file with URLs, one per line
  2. Loop until there are no more requests:
    1. Get a simple_request from the #source
      • The source is a FlatFileStore;
      • it generates simple_requests (objects of type SimpleRequest), each with a #url and
        attributes holding the contents, response_code and response_message.
    2. Pass that request to an http_fetcher
      • The fetcher has a #get method,
      • which stuffs the body of the response (basically, the HTML for the page) into the
        request object’s contents (and likewise for the response_code and response_message).
    3. If the fetcher gets a successful response,
      • pass it to a flat_file_store,
      • which just writes the request to disk, one line per request, with tab-separated fields:
        url moreinfo scraped_at response_code response_message contents
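
The same loop as a minimal, self-contained Ruby script (the file names are placeholders; the real example composes a FlatFileStore source, an HTTP fetcher and a flat-file dest):

    #!/usr/bin/env ruby
    # Minimal sketch of the bulk URL scrape loop above.
    require 'net/http'
    require 'uri'
    require 'time'

    File.open('scraped_pages.tsv', 'a') do |dest|
      File.foreach('urls.txt') do |line|                 # one URL per line
        url = line.strip
        next if url.empty?
        begin
          response = Net::HTTP.get_response(URI.parse(url))
          code, message, body = response.code, response.message, response.body.to_s
        rescue StandardError => e
          code, message, body = '', e.class.name, ''
        end
        # crudely flatten whitespace so each request stays on one line;
        # see the HttpScraper notes below for proper escaping
        body = body.gsub(/[\t\r\n]/, ' ')
        dest.puts [url, '', Time.now.utc.iso8601, code, message, body].join("\t")
      end
    end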

beanstalk == queue
ttserver == distributed lightweight DB
god == monitoring & restart
shotgun == runs sinatra for development
thin == runs sinatra for production

~/ics/monkeyshines/examples/twitter_search == twitter search scraper
  • The work directory holds everything generated: logs, output, dumps of the scrape queue
  • ./dump_twitter_search_jobs.rb --handle=com.twitter.search --dest-filename=dump.tsv
    serializes the queue to a flat file in work/seed
    (see also load_twitter_search_jobs.rb and scrape_twitter_search.rb)
  • nohup ./scrape_twitter_search.rb --handle=com.twitter.search >> work/log/twitter_search-console-`date "+%Y%m%d%H%M%S"`.log 2>&1 &
  • tail -f work/log/twitter_search-console-20091006.log (replace the date with that of the latest run)
  • To watch the actual file being stored:
    tail -f work/20091013/comtwittersearch+20091013164824-17240.tsv | cutc 150

Request Source

  • runner.source
    • request stream
    • Supplies raw material to initialize a job

  • Example: the Twitter search scraper (examples/twitter_search)

Request Queue

Periodic requests

The request stream can be metered using read-through, scheduled (e.g. cron), or test-and-sleep strategies.

  • Scheduled: requests are triggered externally, for example by cron.
  • Test and sleep: a queue of resources is cyclically polled, sleeping whenever bored (sketched below).
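
A self-contained sketch of the test-and-sleep strategy (the Job struct, URLs and intervals here are illustrative, not the gem's API):

    # Cyclically poll a set of resources; sleep whenever nothing is due yet.
    Job = Struct.new(:resource, :due_at)

    jobs = [
      Job.new('http://example.com/feed_a', Time.now + 2),
      Job.new('http://example.com/feed_b', Time.now + 4),
    ]

    until jobs.empty?
      job = jobs.min_by(&:due_at)        # the next resource to come due
      if job.due_at > Time.now
        sleep(job.due_at - Time.now)     # bored: nothing is due yet, so sleep
      else
        puts "fetching #{job.resource}"  # in the real runner: fetch and store the resource
        jobs.delete(job)                 # or reschedule it: job.due_at = Time.now + interval
      end
    end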

Requests

  • Base: simple fetch and store of a URI. (The URI specifies an immutable, unique resource.)
  • Single resource that you want to check for updates over time.
  • Timeline:
    • Message stream, e.g. a twitter search or user timeline. You want to do paginated requests back to the last-seen item.
    • Feed: poll the resource and extract its contents, storing by GUID. You want to poll frequently enough that a single-page request gives full coverage.

Scraper

  • HttpScraper
    • JSON
    • HTML
      • \0 separates records, \t separates the initial fields;
      • map \ to \\, then tab, cr and newline to \t, \r and \n respectively;
      • alternatively, map tab, cr and newline to &#x9;, &#xD; and &#xA; respectively
        (the control characters in play: 0x9 tab, 0xA newline, 0xD carriage return, 0x7F delete;
        see the sketch after this list)

  • HeadScraper — records the HEAD parameters
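
The two character mappings listed under HTML above read like alternative escaping schemes for flattening page contents onto one tab-separated line; here is a sketch of each under that reading (illustrative helpers, not the gem's code):

    BACKSLASH_ESCAPES = { "\\" => "\\\\", "\t" => "\\t", "\r" => "\\r", "\n" => "\\n" }
    ENTITY_ESCAPES    = { "\t" => "&#x9;", "\r" => "&#xD;", "\n" => "&#xA;" }

    # backslash scheme: \ becomes \\, and tab, cr, newline become \t, \r, \n
    def escape_backslash_style(str)
      str.gsub(/[\\\t\r\n]/) { |ch| BACKSLASH_ESCAPES[ch] }
    end

    # XML-entity scheme: tab, cr and newline become character references
    def escape_entity_style(str)
      str.gsub(/[\t\r\n]/) { |ch| ENTITY_ESCAPES[ch] }
    end

    page = "<html>\n\t<body>cell\tvalue</body>\n</html>"
    puts escape_backslash_style(page)  # newlines and tabs become literal \n and \t
    puts escape_entity_style(page)     # newlines and tabs become &#xA; and &#x9;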

Store

  • Flat file (chunked)
  • Key store
  • Read-through cache
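
For the read-through cache, a generic sketch (not the gem's Store API): look in the local key store first and only hit the upstream source on a miss.

    class ReadThroughCache
      def initialize(store, &fetch_block)
        @store = store          # anything with #[] and #[]=, e.g. a Hash or a key-value DB
        @fetch = fetch_block    # how to get a value we don't have yet
      end

      def get(key)
        @store[key] ||= @fetch.call(key)   # cache hit, or fetch-and-store on a miss
      end
    end

    cache = ReadThroughCache.new({}) { |url| "fetched contents of #{url}" }
    puts cache.get('http://example.com/a')   # miss: calls the block and stores the result
    puts cache.get('http://example.com/a')   # hit: returned straight from the store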

Periodic

  • Log only every N requests, or t minutes, or whatever.
  • Restart session every hour
  • Close file and start new chunk every 4 hours or so. (Mitigates data loss if a file is corrupted, makes for easy batch processing).
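
A sketch of the "every N requests or every t minutes" test behind periodic logging and chunking (illustrative, not the gem's actual periodic helper; the thresholds are examples):

    class EveryNOrT
      def initialize(n_requests, t_seconds)
        @n, @t     = n_requests, t_seconds
        @count     = 0
        @last_fire = Time.now
      end

      # returns true when either threshold has been crossed, then resets
      def ready?
        @count += 1
        return false unless (@count >= @n) || (Time.now - @last_fire >= @t)
        @count, @last_fire = 0, Time.now
        true
      end
    end

    chunker = EveryNOrT.new(10_000, 4 * 60 * 60)  # new output chunk every 10k requests or 4 hours
    # in the scrape loop:  start_new_chunk_file if chunker.ready?   (hypothetical helper)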

Pagination

Session

  • Twitter Search: Each req brings in up to 100 results in strict reverse ID (pseudo time) order. If the last item ID in a request is less than the previous scrape session’s max_id, or if fewer than 100 results are returned, the scrape session is complete. We maintain two scrape_intervals: one spans from the earliest seen search hit to the highest one from the previous scrape; the other ranges backwards from the highest in this scrape session (the first item in the first successful page request) to the lowest in this scrape session (the last item on the most recent successful page request).
  • Set no upper limit on the first request.
  • Request by page, holding the max_id fixed
  • Use the lowest ID from the previous request as the new max_id
  • Use the supplied ‘next page’ parameter
  • Twitter Followers: Each request brings in 100 followers in reverse order of when the relationship formed. A separate call to the user can tell you how many total followers there are, and you can record how many there were at the end of the last scrape, but there’s some slop (if 100 people in the middle of the list unfollow and 100 more people at the front follow, then the total will be the same). High-degree accounts may have as many as 2M followers (20,000 calls).
  • FriendFeed: Up to four pages. Expiry given by result set of <100 results.
  • Paginated: one resource, but it takes one or more requests to retrieve it in full.
    • Paginated + limit (max_id/since_date): rather than requesting by increasing page number, request one page with a limit parameter until the last-on-page overlaps the previous scrape. For example, say you are scraping search results, that the max ID was 120_000 when you last made the request, and that the current max_id is 155_000. Request the first page (no limit), then use the last result on each page as the new limit_id until that last result is less than 120_000 (see the sketch after this list).
    • Paginated + stop_on_duplicate: request pages until the last one on the page matches an already-requested instance.
    • Paginated + velocity_estimate: estimate how many pages to request from an observed rate. For example, say a user acquires on average 4.1 followers/day and it has been 80 days since the last scrape. At 100 followers per request you will want to request ceil(4.1 * 80 / 100) = 4 pages.
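
A self-contained sketch of the "paginated + limit (max_id)" strategy above; the fake_search lambda stands in for a real search API and the IDs are made up:

    ALL_ITEMS = (100_000..155_000).step(7).to_a.reverse  # pretend item IDs, newest first

    fake_search = lambda do |max_id|
      eligible = max_id ? ALL_ITEMS.select { |id| id <= max_id } : ALL_ITEMS
      eligible.first(100)                                # 100 results per page, newest first
    end

    prev_max_id = 120_000   # highest ID seen in the previous scrape session
    max_id      = nil       # no upper limit on the first request
    collected   = []

    loop do
      page = fake_search.call(max_id)
      break if page.empty?
      collected.concat(page)
      lowest_id = page.last
      break if lowest_id <= prev_max_id  # overlapped the previous session: this session is done
      max_id = lowest_id - 1             # page backwards from just below the last item seen
    end

    puts "collected #{collected.size} items back to ID #{collected.last}"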

Rescheduling

We want the next scrape to come due when it will yield a couple of pages, or one mostly-full page, of new items. To do that, track a rate (num_items / timespan) and clamp the resulting interval to min_reschedule / max_reschedule bounds.
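
A sketch of that rescheduling rule, with illustrative bounds and page size: aim the next visit at roughly one full page of new items, clamped to the min/max delays.

    MIN_RESCHEDULE = 10 * 60        # never revisit sooner than 10 minutes (example value)
    MAX_RESCHEDULE = 24 * 60 * 60   # never wait longer than a day (example value)
    ITEMS_PER_PAGE = 100

    def next_delay(num_items, timespan)
      rate  = num_items.to_f / timespan          # items per second seen so far
      delay = rate > 0 ? ITEMS_PER_PAGE / rate : MAX_RESCHEDULE
      [[delay, MIN_RESCHEDULE].max, MAX_RESCHEDULE].min
    end

    # e.g. 450 new items over the last 6 hours => revisit in about 1.3 hours (4800 seconds)
    puts next_delay(450, 6 * 60 * 60)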


More info

There are many useful examples in the examples/ directory.

Credits

Monkeyshines was written by Philip (flip) Kromer ([email protected] / @mrflip) for the infochimps project

Help!

Send monkeyshines questions to the Infinite Monkeywrench mailing list

