Giter Club home page Giter Club logo

charyb's Introduction

h2. What is Charyb?

Charyb is a tool to help suck down data and shove them into a simple datawarehouse.

There are two parts, the web interface and the crawler.  

The web interface makes it easier to suck in data from different mime types.  
Because data scraping isn't yet fully automated, there needs to be a little bit 
of human intervention. 

The crawler makes use of the data humans entered to figure out how to scrape the 
data.  It will keep checking those data sources as updates.

h2. Installation

if you haven't added the github source, do it by doing:

  sudo gem source -a http://gems.github.com
  sudo gem source -a http://gems.rubyforge.org

The install the following packages:

  libopenssl-ruby

Then install the following gems:

  # for web application
  sudo gem install -d sinatra -v 0.9.4
  sudo gem install -d hpricot -v 0.8.1
  # sudo gem install -d couchrest -v 0.33
  sudo gem install -d sqlite3-ruby -v 1.2.5  # only for sqlite3

  # for testing
  sudo gem install -d rack-test -v 0.4.2
  sudo gem install -d webrat -v 0.5.3
  sudo gem install -d thoughtbot-shoulda -v 2.10.2

  # for mocks and stubs
  sudo gem install -d mocha -v 0.9.8

  # for fixtures
  sudo gem install -d notahat-machinist -v 1.0.3
  sudo gem install -d faker -v 0.3.1

h2. Setup

You need to pull in the git submodules first

  git submodule init
  git submodule update

If it tells you that you can access it, ping the owner of the repo to add you as 
a collaborator

Then you must set up the database.

  rake db:schema:load

h2. To run it

To run the web interface:

  rake web:run

And then go to http://localhost:4567

To run the crawler:

  rake crawler:run


h2. Architecture

There are two major components:  The web interface and the crawler.  The web 
interface is used to for the human in the loop to more easily say how to suck 
in data for a datasource.  Right now, it only does html tables.  Eventually, 
it'll also do CSV files as well. It stores this information in the 
SourceTracker, which is a wrapper around an SQLite database.

The crawler is run separately and uses the SourceTracker's information about 
where a datasource can be found and how to suck it in.

When the crawler sucks in the data and cleaned it properly, it then
pushes it to the redis datawarehouse.  Then teabag queries the
datawarehouse when it gets web requests from the outside world.

The web part starts with src/application.rb  it requires 'config/initialize' 
at the top, which sets all the global constants and paths.  you can find it 
in config/initialize.

Then the rest of that code is in sinatra's DSL.  I suggest reading 
http://www.sinatrarb.com/intro.html first, and then using 
http://www.sinatrarb.com/book.html as a reference.

The crawler starts with crawler.rb, but I haven't finished it yet.  Right 
now, the source tracker isn't finalized yet either, so the web interface 
directly uses the activerecord objects.  activerecord is an ORM object.  
http://api.rubyonrails.org/classes/ActiveRecord/Base.html

If you have specific questions, don't hesitate to ask either email or by 
phone.

h2. Import Table Use Case

1. user starts the web interface (rake web:run) from a console.
2. webserver starts, messages scroll by in the console.
3. user opens a browser to http://localhost:4567
4. browser opens showing the charyb front page.
5. user enters the url of a page containing an html table into the
   textfield and clicks "submit."
6. browser shows requested page in left div and list of tables
   imported from that page (none) and a button to add a new table in
   the right div.
7. user clicks "Add a table."
8. a form is displayed in the right div.
9. user fills out the fields from "Table heading" to "Converter."
   Some fields have special formatting rules:
   - "col heading" or "row heading" can be "Category" if no word
     describes the collection of headings.  Most of the time "row
     heading" will be "Year" or "State."
   - "other dims" should be a csv list of "dimension name, dimension
     value" pairs.  For example, "Year, 2006"
   - "default" should match the value of the "row header" "col header"
     or one of the "other dims."
   - "published at" should be a date, eg. "01/01/2009."
   - "units" can be a single value or can be a csv list for each column.
   - "converter" can be blank, or a single value, or a csv list for each
     column.  values must match a converter name. (not implemented yet)
10. user clicks a cell at one corner of the column headers in the table
    on the left div.
11. the cell is colored red.
12. user shift-clicks a cell at the opposite corner of the column headers.
13. the column headers are all colored red.
14. the user clicks the "set" button next to "Col labels."
15. the column headers are all colored green, their values are listed
    in the "Col labels" textarea, one per line.
16. the user does the same for row headers and data.
17. row headers are colored brown; data cells are colored purple.  if
    there are n row headers and m col headers there must be nxm data
    values.
18. user clicks the "Create" button.
19. right div removes the form and lists the table as a link.


h2. Edit Table Use Case
1. user starts the web interface (rake web:run) from a console.
2. webserver starts, messages scroll by in the console.
3. user opens a browser to http://localhost:4567
4. browser opens showing the charyb front page.
5. user clicks on a datasource title link.
6. browser shows requested page in left div and list of tables
   imported from that page with edit buttons, and a button to add a
   new table in the right div.
7. user clicks edit.
8. browser shows partially completed form used to import the table.
9. user clicks the "Finish Loading" button.
10. browser selects cells in the table in the right div, coloring the
   table as it was when it was imported, and fills in the textareas
   with selected values in the form.
11. user makes changes and clicks the "Update" button.
12. right div removes the form and lists the table as a link.


h2. cli commands to look into redis
In the following commands, all spaces are replaced by underscores.
This is only for input.  The result can contain spaces.

> redis-cli smembers datasets
lists datasets by name

> redis-cli smembers "[dsname]||dimensions"
lists the dimensions of dataset dsname.

> redis-cli smembers "[dsname]||[dimname]"
lists the values of the dimension dimname.

> redis-cli smembers "[dsname]||meta"
lists metadata for a dataset.

> redis-cli get "[dsname]||[dimvalue]||[nextdimvalue]"
returns a value.

charyb's People

Contributors

iamwilhelm avatar

Stargazers

 avatar  avatar

Watchers

 avatar  avatar  avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.