
Monkeyshines

Monkeyshines is a tool for doing an algorithmic scrape.

It’s designed to handle large-scale scrapes that can exceed the capabilities of a single-machine relational database, so it plays nicely with Hadoop / Wukong, with distributed databases (MongoDB, tokyocabinet, etc.), and with distributed job queues (e.g. edamame/beanstalk).

Install

Get the code

We’re still actively developing monkeyshines. The newest version is available via Git on github:

$ git clone git://github.com/mrflip/monkeyshines

A gem is available from gemcutter:

$ sudo gem install monkeyshines --source=http://gemcutter.org

(don’t use the gems.github.com version — it’s way out of date.)

You can instead download this project in either zip or tar formats.

Dependencies and setup

To finish setting up, see the detailed setup instructions and then read the usage notes.


Overview

Runner

  • Constructed with a builder pattern
  • Does the running itself:
  1. Set stuff up
  2. Loop until there are no more requests:
    1. Get a request from the #source
    2. Pass that request to the fetcher
      • The fetcher has a #get method,
      • which stuffs the response contents into the request object
    3. If the fetcher gets a successful response, pass it to the destination store
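
The loop above in miniature, as a self-contained Ruby sketch (the class and method names here are illustrative, not the gem's actual API; only #get and the response_code attribute are taken from the notes above):

    # Illustrative sketch of the runner loop -- not the gem's actual classes.
    class TinyRunner
      def initialize(source, fetcher, dest)
        @source, @fetcher, @dest = source, fetcher, dest
      end

      def run!
        # 1. set stuff up (open stores, logs, ...) would happen here
        # 2. loop until the source has no more requests
        @source.each do |request|
          @fetcher.get(request)           # stuffs the response into the request object
          # keep only successful (2xx) responses
          @dest.save(request) if request.response_code.to_s =~ /\A2/
        end
      end
    end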

Bulk URL Scraper

  1. Open a file with URLs, one per line
  2. Loop until there are no more requests:
    1. Get a simple_request from the #source
      • The source is a FlatFileStore;
      • it generates simple_requests (objects of type SimpleRequest), each with a #url and
        attributes holding the contents, response_code and response_message.
    2. Pass that request to an http_fetcher
      • The fetcher has a #get method,
      • which stuffs the body of the response (basically, the HTML for the page) into the
        request object’s contents (and likewise for the response_code and response_message).
    3. If the fetcher gets a successful response,
      • pass it to a flat_file_store,
      • which just writes the request to disk, one line per request, with tab-separated fields:
        url moreinfo scraped_at response_code response_message contents
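
The same loop as a minimal, self-contained Ruby script (the file names are placeholders; the real example composes a FlatFileStore source, an HTTP fetcher and a flat-file dest):

    #!/usr/bin/env ruby
    # Minimal sketch of the bulk URL scrape loop above.
    require 'net/http'
    require 'uri'
    require 'time'

    File.open('scraped_pages.tsv', 'a') do |dest|
      File.foreach('urls.txt') do |line|                 # one URL per line
        url = line.strip
        next if url.empty?
        begin
          response = Net::HTTP.get_response(URI.parse(url))
          code, message, body = response.code, response.message, response.body.to_s
        rescue StandardError => e
          code, message, body = '', e.class.name, ''
        end
        # crudely flatten whitespace so each request stays on one line;
        # see the HttpScraper notes below for proper escaping
        body = body.gsub(/[\t\r\n]/, ' ')
        dest.puts [url, '', Time.now.utc.iso8601, code, message, body].join("\t")
      end
    end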

beanstalk == queue
ttserver == distributed lightweight DB
god == monitoring & restart
shotgun == runs sinatra for development
thin == runs sinatra for production

~/ics/monkeyshines/examples/twitter_search == twitter search scraper
  • The work directory holds everything generated: logs, output, dumps of the scrape queue
  • ./dump_twitter_search_jobs.rb --handle=com.twitter.search --dest-filename=dump.tsv
    serializes the queue to a flat file in work/seed
    (see also load_twitter_search_jobs.rb and scrape_twitter_search.rb)
  • nohup ./scrape_twitter_search.rb --handle=com.twitter.search >> work/log/twitter_search-console-`date "+%Y%m%d%H%M%S"`.log 2>&1 &
  • tail -f work/log/twitter_search-console-20091006.log (replace the date with that of the latest run)
  • To watch the actual file being stored:
    tail -f work/20091013/comtwittersearch+20091013164824-17240.tsv | cutc 150

Request Source

  • runner.source
    • request stream
    • Supplies raw material to initialize a job

  • Example: the Twitter search scraper (examples/twitter_search)

Request Queue

Periodic requests

The request stream can be metered using read-through, scheduled (e.g. cron), or test-and-sleep strategies.

  • Scheduled: requests are triggered externally, for example by cron.
  • Test and sleep: a queue of resources is cyclically polled, sleeping whenever bored (sketched below).
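
A self-contained sketch of the test-and-sleep strategy (the Job struct, URLs and intervals here are illustrative, not the gem's API):

    # Cyclically poll a set of resources; sleep whenever nothing is due yet.
    Job = Struct.new(:resource, :due_at)

    jobs = [
      Job.new('http://example.com/feed_a', Time.now + 2),
      Job.new('http://example.com/feed_b', Time.now + 4),
    ]

    until jobs.empty?
      job = jobs.min_by(&:due_at)        # the next resource to come due
      if job.due_at > Time.now
        sleep(job.due_at - Time.now)     # bored: nothing is due yet, so sleep
      else
        puts "fetching #{job.resource}"  # in the real runner: fetch and store the resource
        jobs.delete(job)                 # or reschedule it: job.due_at = Time.now + interval
      end
    end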

Requests

  • Base: simple fetch and store of a URI. (The URI specifies an immutable, unique resource.)
  • Single resource that you want to check for updates over time.
  • Timeline:
    • Message stream, e.g. a twitter search or user timeline. You want to do paginated requests back to the last-seen item.
    • Feed: poll the resource and extract its contents, storing by GUID. You want to poll frequently enough that a single-page request gives full coverage.

Scraper

  • HttpScraper
    • JSON
    • HTML
      • \0 separates records, \t separates the initial fields;
      • map \ to \\, then tab, cr and newline to \t, \r and \n respectively;
      • alternatively, map tab, cr and newline to &#x9;, &#xD; and &#xA; respectively
        (the control characters in play: 0x9 tab, 0xA newline, 0xD carriage return, 0x7F delete;
        see the sketch after this list)

  • HeadScraper — records the HEAD parameters
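
The two character mappings listed under HTML above read like alternative escaping schemes for flattening page contents onto one tab-separated line; here is a sketch of each under that reading (illustrative helpers, not the gem's code):

    BACKSLASH_ESCAPES = { "\\" => "\\\\", "\t" => "\\t", "\r" => "\\r", "\n" => "\\n" }
    ENTITY_ESCAPES    = { "\t" => "&#x9;", "\r" => "&#xD;", "\n" => "&#xA;" }

    # backslash scheme: \ becomes \\, and tab, cr, newline become \t, \r, \n
    def escape_backslash_style(str)
      str.gsub(/[\\\t\r\n]/) { |ch| BACKSLASH_ESCAPES[ch] }
    end

    # XML-entity scheme: tab, cr and newline become character references
    def escape_entity_style(str)
      str.gsub(/[\t\r\n]/) { |ch| ENTITY_ESCAPES[ch] }
    end

    page = "<html>\n\t<body>cell\tvalue</body>\n</html>"
    puts escape_backslash_style(page)  # newlines and tabs become literal \n and \t
    puts escape_entity_style(page)     # newlines and tabs become &#xA; and &#x9;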

Store

  • Flat file (chunked)
  • Key store
  • Read-through cache
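
For the read-through cache, a generic sketch (not the gem's Store API): look in the local key store first and only hit the upstream source on a miss.

    class ReadThroughCache
      def initialize(store, &fetch_block)
        @store = store          # anything with #[] and #[]=, e.g. a Hash or a key-value DB
        @fetch = fetch_block    # how to get a value we don't have yet
      end

      def get(key)
        @store[key] ||= @fetch.call(key)   # cache hit, or fetch-and-store on a miss
      end
    end

    cache = ReadThroughCache.new({}) { |url| "fetched contents of #{url}" }
    puts cache.get('http://example.com/a')   # miss: calls the block and stores the result
    puts cache.get('http://example.com/a')   # hit: returned straight from the store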

Periodic

  • Log only every N requests, or t minutes, or whatever.
  • Restart session every hour
  • Close file and start new chunk every 4 hours or so. (Mitigates data loss if a file is corrupted, makes for easy batch processing).
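
A sketch of the "every N requests or every t minutes" test behind periodic logging and chunking (illustrative, not the gem's actual periodic helper; the thresholds are examples):

    class EveryNOrT
      def initialize(n_requests, t_seconds)
        @n, @t     = n_requests, t_seconds
        @count     = 0
        @last_fire = Time.now
      end

      # returns true when either threshold has been crossed, then resets
      def ready?
        @count += 1
        return false unless (@count >= @n) || (Time.now - @last_fire >= @t)
        @count, @last_fire = 0, Time.now
        true
      end
    end

    chunker = EveryNOrT.new(10_000, 4 * 60 * 60)  # new output chunk every 10k requests or 4 hours
    # in the scrape loop:  start_new_chunk_file if chunker.ready?   (hypothetical helper)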

Pagination

Session

  • Twitter Search: Each req brings in up to 100 results in strict reverse ID (pseudo time) order. If the last item ID in a request is less than the previous scrape session’s max_id, or if fewer than 100 results are returned, the scrape session is complete. We maintain two scrape_intervals: one spans from the earliest seen search hit to the highest one from the previous scrape; the other ranges backwards from the highest in this scrape session (the first item in the first successful page request) to the lowest in this scrape session (the last item on the most recent successful page request).
  • Set no upper limit on the first request.
  • Request by page, holding the max_id fixed
  • Use the lowest ID from the previous request as the new max_id
  • Use the supplied ‘next page’ parameter
  • Twitter Followers: Each request brings in 100 followers in reverse order of when the relationship formed. A separate call to the user can tell you how many total followers there are, and you can record how many there were at the end of the last scrape, but there’s some slop (if 100 people in the middle of the list unfollow and 100 more people at the front follow, then the total will be the same). High-degree accounts may have as many as 2M followers (20,000 calls).
  • FriendFeed: Up to four pages. Expiry given by result set of <100 results.
  • Paginated: one resource, but it takes one or more requests to retrieve it in full.
    • Paginated + limit (max_id/since_date): rather than requesting by increasing page number, request one page with a limit parameter until the last-on-page overlaps the previous scrape. For example, say you are scraping search results, that the max ID was 120_000 when you last made the request, and that the current max_id is 155_000. Request the first page (no limit), then use the last result on each page as the new limit_id until that last result is less than 120_000 (see the sketch after this list).
    • Paginated + stop_on_duplicate: request pages until the last one on the page matches an already-requested instance.
    • Paginated + velocity_estimate: estimate how many pages to request from an observed rate. For example, say a user acquires on average 4.1 followers/day and it has been 80 days since the last scrape. At 100 followers per request you will want to request ceil(4.1 * 80 / 100) = 4 pages.
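
A self-contained sketch of the "paginated + limit (max_id)" strategy above; the fake_search lambda stands in for a real search API and the IDs are made up:

    ALL_ITEMS = (100_000..155_000).step(7).to_a.reverse  # pretend item IDs, newest first

    fake_search = lambda do |max_id|
      eligible = max_id ? ALL_ITEMS.select { |id| id <= max_id } : ALL_ITEMS
      eligible.first(100)                                # 100 results per page, newest first
    end

    prev_max_id = 120_000   # highest ID seen in the previous scrape session
    max_id      = nil       # no upper limit on the first request
    collected   = []

    loop do
      page = fake_search.call(max_id)
      break if page.empty?
      collected.concat(page)
      lowest_id = page.last
      break if lowest_id <= prev_max_id  # overlapped the previous session: this session is done
      max_id = lowest_id - 1             # page backwards from just below the last item seen
    end

    puts "collected #{collected.size} items back to ID #{collected.last}"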

Rescheduling

We want the next scrape to come due when it will yield a couple of pages, or one mostly-full page, of new items. To do that, track a rate (num_items / timespan) and clamp the resulting interval to min_reschedule / max_reschedule bounds.
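
A sketch of that rescheduling rule, with illustrative bounds and page size: aim the next visit at roughly one full page of new items, clamped to the min/max delays.

    MIN_RESCHEDULE = 10 * 60        # never revisit sooner than 10 minutes (example value)
    MAX_RESCHEDULE = 24 * 60 * 60   # never wait longer than a day (example value)
    ITEMS_PER_PAGE = 100

    def next_delay(num_items, timespan)
      rate  = num_items.to_f / timespan          # items per second seen so far
      delay = rate > 0 ? ITEMS_PER_PAGE / rate : MAX_RESCHEDULE
      [[delay, MIN_RESCHEDULE].max, MAX_RESCHEDULE].min
    end

    # e.g. 450 new items over the last 6 hours => revisit in about 1.3 hours (4800 seconds)
    puts next_delay(450, 6 * 60 * 60)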


More info

There are many useful examples in the examples/ directory.

Credits

Monkeyshines was written by Philip (flip) Kromer ([email protected] / @mrflip) for the infochimps project

Help!

Send monkeyshines questions to the Infinite Monkeywrench mailing list

