bcs_spa16's Introduction

Real-World Big Data In Action

Code and documentation for a workshop at the BCS SPA2016 Conference in which you will create a Big Data cluster, load it with real data and run queries against it.

Laptops

You will need a laptop for the session. OS X, Linux and Windows are all supported (Windows can be a little tricky!). You will need at least 8GB RAM to run all the software.

We do have a couple of virtual servers available, so you may be able to use one of those.

Documentation

Copies of the slides can be found in the slides directory.

Reference documentation for Hadoop, Spark and Hive can be found in the docs directory.

Course Errata

Slide 8: Install Hadoop, Spark and Hive

You can use 7zip to extract the compressed tar files on Windows. (There is a copy of the 7zip installer in the downloads directory.)

Right-click on a .tgz file and select 7zip -> Extract Here.
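
If 7zip is not available, the same extraction can be done with Python's standard library. The following is a minimal sketch, assuming Python 3 is installed; the function name extract_tgz is hypothetical and not part of the workshop materials.

```python
import tarfile
from pathlib import Path


def extract_tgz(archive, dest):
    """Unpack a gzip-compressed tar archive into dest.

    Returns the sorted top-level names found in the archive.
    """
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(str(archive), "r:gz") as tar:
        tar.extractall(str(dest))
        names = tar.getnames()
    # Report the top-level entries so you can see what was unpacked.
    return sorted({name.split("/")[0] for name in names})
```

Call it with the path to a downloaded .tgz file and the directory you want the contents to land in. Only extract archives you trust, since tar files can contain unexpected paths.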

Slides 14 and 15: Review Spark and Hive Configuration

No editing of configuration files is necessary. Just review them to make sure they are correct for your setup.

Slide 16: Create Hive Metastore Database

You don't need to create the empty Hive metastore database: it is included in the git repository. I was unable to get the metastore command to work on Windows. (The format seems to be portable, so I created it on OS X.)

Slide 18 (and onwards)

Ignore the following warning message from Hadoop on OS X and Unix:

Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

Slide 31: Pyspark

If Pyspark on Windows can't find Python, make sure it is installed, and that the Python directory (typically C:\Python27) is in %PATH%.
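
One way to check the PATH setting is to run a short script with the interpreter's full path (for example C:\Python27\python.exe). This is a sketch only; the helper names path_entries and dir_on_path are hypothetical, and the code uses nothing beyond the standard os module.

```python
import os


def path_entries(path=None):
    """Split a PATH-style string into its individual directories."""
    if path is None:
        path = os.environ.get("PATH", "")
    return [entry for entry in path.split(os.pathsep) if entry]


def dir_on_path(directory, path=None):
    """Return True if directory appears as an entry in the PATH-style string."""
    def norm(p):
        # Ignore trailing separators and (on Windows) letter case.
        return os.path.normcase(os.path.normpath(p))

    wanted = norm(directory)
    return any(norm(entry) == wanted for entry in path_entries(path))


if __name__ == "__main__":
    # C:\Python27 is the typical location mentioned above; adjust as needed.
    print(dir_on_path(r"C:\Python27"))
```

If this prints False, add the Python directory to %PATH% and open a new command prompt before starting Pyspark again.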

Slide 41

The Hadoop load command should read:

$HADOOP_HOME/bin/hadoop fs -put \
    $SPA_2016/datasets/lhp/load/*.csv \
    hdfs://localhost:9000/user/spa16/load/lhp

(i.e. replace lfb with lhp).

Similarly, the Pyspark statement should start with lhp =, not lfb =.
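
To make the lfb/lhp substitution harder to get wrong, the corrected load command can be assembled from a single dataset name. This is a rough sketch, assuming Python 3; build_put_command is a hypothetical helper, and the paths simply mirror the corrected command above.

```python
import glob
import os


def build_put_command(hadoop_home, spa_home, dataset="lhp"):
    """Build the argument list for the corrected 'hadoop fs -put' command."""
    hadoop = os.path.join(hadoop_home, "bin", "hadoop")
    pattern = os.path.join(spa_home, "datasets", dataset, "load", "*.csv")
    # Expand the wildcard here, as the shell would do for the command above.
    files = sorted(glob.glob(pattern))
    target = "hdfs://localhost:9000/user/spa16/load/" + dataset
    return [hadoop, "fs", "-put"] + files + [target]
```

Passing os.environ["HADOOP_HOME"] and os.environ["SPA_2016"] gives a list suitable for subprocess.call. On Windows the launcher name may differ, so treat the first element as an assumption.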

Slide 45: Troubleshooting

For Windows you need to run env.cmd, not env.bat.

If you get lots of errors on startup, make sure you have initialised your Hadoop filesystem!

Look out for rogue back-quotes in scripts and slides. I have checked carefully for these but it is possible one may have slipped through. These generally cause pretty obvious syntax errors.
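
A quick scan can help find stray back-quotes before they cause a syntax error. The sketch below assumes Python 3 and UTF-8 text files; find_backquotes is a hypothetical helper, not part of the workshop scripts.

```python
import io


def find_backquotes(path):
    """Return (line_number, line) pairs for lines in path containing a back-quote."""
    hits = []
    with io.open(path, encoding="utf-8", errors="replace") as handle:
        for number, line in enumerate(handle, start=1):
            if "`" in line:
                hits.append((number, line.rstrip("\n")))
    return hits
```

Back-quotes are also legitimate shell syntax for command substitution, so treat each hit as a candidate to inspect rather than a confirmed error.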

bcs_spa16's People

Contributors

rozanski
