dask-mini-tutorial's Introduction

Dask Live by Coiled Tutorial

The purpose of this tutorial is to introduce folks to Dask and show them how to scale their Python data science and machine learning workflows. The materials cover:

  1. Overview of Dask: How it works and when to use it.
  2. Dask Delayed: How to parallelize existing Python code and your custom algorithms.
  3. Schedulers: Single Machine vs Distributed, and the Dashboard.
  4. From pandas to Dask: How to manipulate bigger-than-memory DataFrames using Dask.
  5. Dask-ML: Scalable machine learning using Dask.

Prerequisites

To follow along and get the most out of this tutorial, it helps if you know:

  • Programming fundamentals in Python (e.g., variables, data structures, for loops).
  • A bit of NumPy, pandas, and scikit-learn.
  • Jupyter Lab or Jupyter Notebooks.
  • Your way around the shell/terminal.

However, the most important prerequisite is a willingness to learn, and everyone is welcome to tag along and enjoy the ride. If you would rather watch than code along, that's not a problem.

Get set up

We have two options for you to follow this tutorial:

  1. Click the Binder button right below. This will spin up the necessary computational environment so you can write and execute the notebooks directly in the browser. Binder is a free service, so resources are not guaranteed, but it usually works. Keep in mind that the available resources are limited, so sometimes you won't be able to see the benefits of parallelism.

    IMPORTANT: If you are joining the live session, make sure to click the button a few minutes before we start so we are ready to go.

    Binder

  2. You can create your own setup locally. To do this, you need to be comfortable with git and GitHub, as well as with installing packages and creating software environments. If so, follow these steps:

    IMPORTANT: If you are joining a live session, please do the setup in advance so you are ready to go once the session starts.

    1. Clone this repository. In your terminal, run:

      git clone https://github.com/coiled/dask-mini-tutorial.git
      

      Alternatively, you can download the repository as a zip file from the top of its main page. This is a good option if you don't have experience with git.

    2. Download Anaconda. If you do not have Anaconda installed already, you will need the Python 3 Anaconda Distribution. If you don't want to install Anaconda, you can install all the packages with pip; if you take this route, you will need to install graphviz separately before installing pygraphviz.

    3. Create a conda environment. In your terminal, navigate to the directory where you cloned or downloaded the dask-mini-tutorial repo and install the required packages by running:

      conda env create -f binder/environment.yml
      

      This will create a new environment called dask-mini-tutorial. To activate the environment do:

      conda activate dask-mini-tutorial
      
    4. Open Jupyter Lab. Once your environment has been activated and you are in the dask-mini-tutorial repository, run in your terminal:

      jupyter lab
      

      You will see a notebooks directory. Click on it and you will be ready to go.
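
If you set things up locally, a quick sanity check of the active environment can save time before a live session. The sketch below is not part of the tutorial materials. It assumes the import names listed in `ASSUMED_MODULES`, which are inferred from the libraries this README mentions; binder/environment.yml remains the authoritative list. The helper names `find_missing` and `available_cores` are hypothetical.

```python
import importlib.util
import os

# Assumption: import names inferred from the libraries mentioned in this README.
# binder/environment.yml is the authoritative list of required packages.
ASSUMED_MODULES = [
    "dask",
    "distributed",
    "dask_ml",
    "numpy",
    "pandas",
    "sklearn",
    "pygraphviz",
    "jupyterlab",
]


def find_missing(modules):
    """Return the module names that cannot be located in the active environment."""
    missing = []
    for name in modules:
        # find_spec looks a top-level module up without importing it.
        if importlib.util.find_spec(name) is None:
            missing.append(name)
    return missing


def available_cores():
    """Return the number of CPU cores reported by the OS (at least 1)."""
    # os.cpu_count() may return None when the count cannot be determined.
    return os.cpu_count() or 1


if __name__ == "__main__":
    missing = find_missing(ASSUMED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All assumed modules were found.")
    print("CPU cores reported:", available_cores())
```

Run it with the dask-mini-tutorial environment activated. A low core count, as is common on free hosted services, is one reason parallel speedups may be hard to observe.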

dask-mini-tutorial's People

Contributors

hendrikmakait, ncclementi

dask-mini-tutorial's Issues

Add section that describes plots of dashboard

Following up on this comment from @pavithraes:

Under "Distributed Scheduler", maybe we can discuss some plots in detail: task stream, progress bar, workers memory?

We should add some images of the dashboard and an explanation of the plots. I am separating this issue from PR #2 and will include it as a follow-up soon. I want to merge the PR and test Binder first.

Point out that graphviz needs to be installed

graphviz is included in the environment.yml, but if the user is installing from pip (which the README says they can do) then they need to install graphviz separately. We should point that out in that case.
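
For readers taking the pip route, one way to check whether Graphviz itself is present is to look for its `dot` executable on the PATH. This is only a heuristic sketch: finding `dot` suggests Graphviz is installed, but it does not guarantee that pygraphviz will build against it. The helper name `graphviz_on_path` is hypothetical.

```python
import shutil


def graphviz_on_path(executable="dot"):
    """Return the full path of the Graphviz executable, or None if it is not on PATH."""
    return shutil.which(executable)


if __name__ == "__main__":
    location = graphviz_on_path()
    if location is None:
        print("Graphviz 'dot' was not found on PATH; install Graphviz before pygraphviz.")
    else:
        print(f"Graphviz 'dot' found at {location}")
```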
