Giter Club home page Giter Club logo

rubygems-crawler's Introduction

rubygems-crawler

A little utility to download rubygems.org information - used for an ElasticSearch demo at RubyConf 2013

Install

You can use the rubygems installation with:

gem install rubygems-crawler

Usage

We're assuming you have a running instance of MongoDB on localhost 27017 (if not, set the MONGO_URI environment variable to a mongodb url - like mongodb://10.0.0.1:27010/).

The crawler is made of 2 executables:

rubygems-web-crawler

and

rubygems-gems-crawler

The first one is the seeder of the crawler: it downloads all the available names of the gems and save the names into Mongo (by default on 'rubygems' database, collection 'gems').

The second one iterates through all the gem names saved in Mongo, and enrich each record with new data coming from the RubyGems APIs.

Seeding the database

To seed the Mongo database with all the gems available at RubyGems.org:

rubygems-web-crawler

The seeder will connect online, downloading the /gems page of rubygems.org. You should see a console output like:

Acquiring http://rubygems.org/gems?letter=A
[RubyGems Web Crawler] [http://rubygems.org/gems?letter=A] - Acquired 30 gems
Acquiring http://rubygems.org/gems?letter=A&page=2
[RubyGems Web Crawler] [http://rubygems.org/gems?letter=A&page=2] - Acquired 30 gems
Acquiring http://rubygems.org/gems?letter=A&page=3
[RubyGems Web Crawler] [http://rubygems.org/gems?letter=A&page=3] - Acquired 30 gems
Acquiring http://rubygems.org/gems?letter=A&page=4
...

Several (thousands) pages after this, you will have all gem names saved into MongoDB - database "rubygems", collection "gems".

You can test - using the Mongo console - what's happening in your local database:

ᐅ mongo
MongoDB shell version: 2.2.0
connecting to: test
> use rubygems
switched to db rubygems
> db.gems.count()
64741
> 

Downloading all the gems data

After you have downloaded all gems' names into your local database, let's run the gems-crawler. The gem crawler will use RubyGems.org APIs to enrich all the gem records with extra data.

To run it:

rubygems-gems-crawler 

You should see a similar output:

[RubyGems Web Crawler] Acquiring data for gem a
[RubyGems Web Crawler] Acquiring data for gem a13g
[RubyGems Web Crawler] Acquiring data for gem a2_printer
[RubyGems Web Crawler] Acquiring data for gem a2ws
...

Several thousands gems later, all the data is now loaded into your local database

Let's see what we have in Mongo now:

ᐅ mongo
MongoDB shell version: 2.2.0
connecting to: test
> use rubygems
switched to db rubygems
> db.gems.find(name: 'a').pretty()
Mon Oct 28 19:09:45 SyntaxError: missing ) after argument list (shell):1
> db.gems.find({name: 'a'}).pretty()
{
  "_id" : ObjectId("52684713bc2b36832b000001"),
  "name" : "a",
  "downloads" : 16520,
  "version" : "0.1.1",
  "version_downloads" : 10144,
  "platform" : "ruby",
  "authors" : " Author",
  "info" : "Summary",
  "licenses" : null,
  "project_uri" : "http://rubygems.org/gems/a",
  "gem_uri" : "http://rubygems.org/gems/a-0.1.1.gem",
  "homepage_uri" : "http://google.com",
  "wiki_uri" : null,
  "documentation_uri" : null,
  "mailing_list_uri" : null,
  "source_code_uri" : null,
  "bug_tracker_uri" : null,
  "dependencies" : {
    "development" : [ ],
    "runtime" : [ ]
  },
  "versions" : [
    {
      "authors" : " Author",
      "built_at" : "2010-05-27T16:00:00Z",
      "description" : null,
      "downloads_count" : 10144,
      "number" : "0.1.1",
      "summary" : "Summary",
      "platform" : "ruby",
      "prerelease" : false,
      "licenses" : null,
      "requirements" : null
    },
    {
      "authors" : " Author",
      "built_at" : "2010-05-27T16:00:00Z",
      "description" : null,
      "downloads_count" : 6376,
      "number" : "0.1.0",
      "summary" : "Summary",
      "platform" : "ruby",
      "prerelease" : false,
      "licenses" : null,
      "requirements" : null
    }
  ],
  "owners" : [
    {
      "email" : "[email protected]"
    }
  ]
}

Ain't Nobody Got Time For That

In a little hurry?

Well you can also download the full bson from here

Download the package and untar-bzipit:

$ tar -jxvf mongo_dump.tbz2

Then import the full data into your local MongoDB:

mongorestore -d rubygems mongo_dump/

rubygems-crawler's People

Contributors

openmosix avatar

Watchers

 avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.