freebase-movies

Overview

This project contains a series of command-line tools for processing the Freebase movie data into a data set for use by the Discovery Engine.

Freebase data is available under the Creative Commons Attribution license. See this page for example HTML you can include if you use their data on your web site.

Note that the TSV format that this tool uses appears to no longer be available.

See https://developers.google.com/freebase/data for the current data formats available.

Feeling lazy?

Many of these files are available from our public S3 bucket, s3://t11e.datasets. You can download a complete changeset here. Freebase recently stopped hosting the tab-separated mini data dumps (e.g. http://download.freebase.com/datadumps/latest/browse/film.tar.bz2), but we have an old snapshot of the film.tar.bz2 file in our public S3 bucket.

Prerequisites

  • Python 2.7 or newer
  • An internet connection
  1. Install the requisite Python modules:

    easy_install elementtree # for facet_to_dimension.py and json_to_tree_dimension.py
    easy_install google-api-python-client # for export_genres.py
  2. Obtain the latest Freebase film data dump and extract it locally:

    wget http://download.freebase.com/datadumps/latest/browse/film.tar.bz2
    tar --bzip2 --extract --verbose --file film.tar.bz2
  3. Process the film TSV files into a JSON intermediate form

    time ./parse_tsv.py film > film.jsons
  4. Optionally filter out pornographic movies. If you do not have pv installed, use cat in its place.

    time pv film.jsons | ./jsons_filter.py > filtered.jsons
  5. Then convert the filtered output into a Discovery Engine changeset. As above, use cat if pv is not installed.

    time pv filtered.jsons | ./jsons_to_changeset.py | gzip -9 > changeset.xml.gz
  6. Or do the three steps above in a single pipeline, using tee to retain copies of the intermediate output:

    time ./parse_tsv.py film | tee film.jsons | ./jsons_filter.py | tee filtered.jsons \
    | ./jsons_to_changeset.py | tee changeset.xml | gzip -9 > changeset.xml.gz
  7. To export a keyword dimension to a tree dimension definition

    ./facet_to_dimension.py {keyword_dimension_id} {min_count_filter} | xmllint --format -
  8. To export a tree structure of film genres based on an MQL query of the live Freebase data:

      1. Go to https://code.google.com/apis/console/
      2. Create a project.
      3. Create a new server API key. You will use this below.
      4. Enable the Freebase API for the project.
      5. Retrieve the data using the API:

```sh
time ./export_genres.py API_KEY > genres.json
```
  9. Convert the genre dump to an XML tree dimension for hand editing and inclusion in your dimensions.xml:

    cat genres.json | ./json_to_tree_dimension.py| xmllint --format -
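
The parse and filter steps above pass data between scripts as an intermediate JSON stream (film.jsons, filtered.jsons). The internals of parse_tsv.py and jsons_filter.py are not shown here, so the following is only an illustrative sketch of that general pattern, written for Python 3 and assuming one JSON object per line. The function names, the `name` and `genre` columns, and the excluded genre label are all hypothetical and are not taken from the project's scripts.

```python
import csv
import io
import json


def tsv_to_jsons(tsv_file):
    """Yield one JSON-encoded line per TSV row, keyed by the header row."""
    # QUOTE_NONE treats quote characters as literal text, which suits raw TSV.
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Skip empty cells and any surplus columns (stored under the None key).
        record = {k: v for k, v in row.items() if k is not None and v}
        yield json.dumps(record, sort_keys=True)


def filter_jsons(lines, is_excluded):
    """Pass through only the JSON lines whose record is not excluded."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if not is_excluded(json.loads(line)):
            yield line


def has_excluded_genre(record, excluded=frozenset(["Example Excluded Genre"])):
    """Hypothetical predicate: 'genre' holds comma-separated genre names."""
    genres = {g.strip() for g in record.get("genre", "").split(",")}
    return bool(genres & excluded)


if __name__ == "__main__":
    # Made-up sample rows; real Freebase dumps have different columns.
    sample = io.StringIO(
        "name\tgenre\n"
        "Movie A\tDrama,Comedy\n"
        "Movie B\tExample Excluded Genre\n"
    )
    for line in filter_jsons(tsv_to_jsons(sample), has_excluded_genre):
        print(line)
```

Because both stages are generators, records stream through one at a time, which is the same property that lets the shell pipeline above handle the data without loading whole files into memory.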
