 cuDF - GPU DataFrames

The RAPIDS cuDF library is a GPU DataFrame manipulation library based on Apache Arrow that accelerates loading, filtering, and manipulation of data for model training data preparation. The RAPIDS GPU DataFrame provides a pandas-like API that will be familiar to data scientists, so they can now build GPU-accelerated workflows more easily.

Quick Start

Please see the Demo Docker Repository, choosing a tag based on the NVIDIA CUDA version you’re running. This provides a ready-to-run Docker container with example notebooks and data, showcasing how you can use cuDF.

Install cuDF

Conda

You can get a minimal conda installation with Miniconda or get the full installation with Anaconda.

You can install and update cuDF using the conda command:

conda install -c numba -c conda-forge -c rapidsai -c defaults cudf=0.2.0

You can create and activate a development environment using the conda command:

conda env create --name cudf --file conda_environments/testing_py35.yml
source activate cudf

Pip

Pip support is coming soon; please use conda for the time being.

Development Setup

The following instructions for building from source and setting up a development environment are tested on Ubuntu 16.04 and 18.04. Other operating systems may be compatible, but are not currently supported.

Get libgdf Dependencies

Compiler requirements:

  • g++ 5.4
  • cmake 3.12

CUDA/GPU requirements:

  • CUDA 9.2+
  • NVIDIA driver 396.44+
  • Pascal architecture or better

You can obtain CUDA from https://developer.nvidia.com/cuda-downloads
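
As a rough pre-flight check, the minimum versions listed above can be compared against what is installed. The following Python sketch is not part of cuDF or libgdf; the helper names (`parse_version`, `meets_minimum`, `installed_driver_version`) are made up for illustration, and it assumes `nvidia-smi` is on the PATH whenever a driver is installed.

```python
import re
import shutil
import subprocess

# Minimums copied from the requirements list above.
MIN_DRIVER = (396, 44)
MIN_CUDA = (9, 2)


def parse_version(text):
    """Turn a string such as '396.44' or '9.2.148' into a tuple of ints."""
    return tuple(int(part) for part in re.findall(r"\d+", text))


def meets_minimum(found, minimum):
    """True if version string `found` is at least the `minimum` tuple."""
    return parse_version(found)[: len(minimum)] >= tuple(minimum)


def installed_driver_version():
    """Ask nvidia-smi for the driver version; None if it cannot be queried."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True,
        text=True,
    )
    lines = result.stdout.split()
    return lines[0] if result.returncode == 0 and lines else None


if __name__ == "__main__":
    driver = installed_driver_version()
    if driver is None:
        print("Could not query the NVIDIA driver (is nvidia-smi installed?)")
    else:
        status = "OK" if meets_minimum(driver, MIN_DRIVER) else "too old"
        print(f"NVIDIA driver {driver}: {status}")
```

The same helper works for a CUDA version string, for example `meets_minimum("10.0", MIN_CUDA)`. It does not check the GPU architecture.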

Since cmake will download and build Apache Arrow (version 0.7.1 or 0.8+), you may need to install Boost C++ (version 1.58+) before running cmake:

# Install Boost C++ for Ubuntu 16.04/18.04
$ sudo apt-get install libboost-all-dev

or

# Install Boost C++ for Conda
$ conda install -c conda-forge boost

Build from Source

To install cuDF from source, ensure the dependencies are met and follow the steps below:

  1. Clone the repository
git clone --recurse-submodules https://github.com/rapidsai/cudf.git
cd cudf
  2. Create the conda development environment cudf as detailed above
  3. Build and install libgdf
source activate cudf
mkdir -p libgdf/build
cd libgdf/build
cmake .. -DHASH_JOIN=ON -DCMAKE_INSTALL_PREFIX=$CONDA_PREFIX
make -j install
make copy_python
python setup.py install
  4. Build and install cudf from the root of the repository
cd ../..
python setup.py install

Automated Build in Docker Container

A Dockerfile is provided with a preconfigured conda environment for building and installing cuDF from source based off of the master branch.

Prerequisites

  • Install nvidia-docker2 for Docker + GPU support
  • Verify NVIDIA driver is 396.44 or higher
  • Ensure CUDA 9.2+ is installed

Usage

From the cudf project root, run the following to build with defaults:

docker build -t cudf .

After the image is built, run the container:

docker run --runtime=nvidia -it cudf bash

Activate the conda environment cudf to use the newly built cuDF and libgdf libraries:

root@3f689ba9c842:/# source activate cudf
(cudf) root@3f689ba9c842:/# python -c "import cudf"
(cudf) root@3f689ba9c842:/#
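
A silent return from `python -c "import cudf"`, as in the session above, means the import succeeded. For a more explicit message, a small Python sketch like the following reports whether the package can be located in the active environment. The helper name `is_importable` is hypothetical and not part of cuDF.

```python
import importlib.util


def is_importable(module_name):
    """True if `module_name` can be located without actually importing it."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    if is_importable("cudf"):
        print("cudf is available")
    else:
        print("cudf was not found in this environment")
```

This only confirms that the package is installed. It does not load the native libraries, so the plain `import cudf` remains the stronger check.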

Customizing the Build

Several build arguments are available to customize the build process of the container. These are specified using the Docker --build-arg flag. Below is a list of the available arguments and their purpose:

| Build Argument | Default Value | Other Value(s) | Purpose |
| --- | --- | --- | --- |
| CUDA_VERSION | 9.2 | 10.0 | set CUDA version |
| LINUX_VERSION | ubuntu16.04 | ubuntu18.04 | set Ubuntu version |
| CC & CXX | 5 | 7 | set gcc/g++ version; NOTE: gcc7 requires Ubuntu 18.04 |
| CUDF_REPO | This repo | Forks of cuDF | set git URL to use for git clone |
| CUDF_BRANCH | master | Any branch name | set git branch to checkout of CUDF_REPO |
| NUMBA_VERSION | 0.40.0 | Not supported | set numba version |
| NUMPY_VERSION | 1.14.3 | Not supported | set numpy version |
| PANDAS_VERSION | 0.20.3 | Not supported | set pandas version |
| PYARROW_VERSION | 0.10.0 | 0.8.0+ | set pyarrow version |
| PYTHON_VERSION | 3.5 | 3.6 | set python version |
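
To show how these arguments combine on the command line, here is a minimal Python sketch that assembles a `docker build` invocation with one `--build-arg NAME=value` pair per override. The helper `docker_build_command` is hypothetical and not shipped with cuDF, and it assumes that "CC & CXX" in the table refers to two separate arguments named `CC` and `CXX`.

```python
import shlex

# Argument names from the table above.
# Assumption: "CC & CXX" are two separate arguments.
KNOWN_BUILD_ARGS = {
    "CUDA_VERSION", "LINUX_VERSION", "CC", "CXX", "CUDF_REPO", "CUDF_BRANCH",
    "NUMBA_VERSION", "NUMPY_VERSION", "PANDAS_VERSION", "PYARROW_VERSION",
    "PYTHON_VERSION",
}


def docker_build_command(tag="cudf", context=".", **build_args):
    """Return the `docker build` argv with one --build-arg per override."""
    unknown = sorted(set(build_args) - KNOWN_BUILD_ARGS)
    if unknown:
        raise ValueError(f"unknown build argument(s): {', '.join(unknown)}")
    command = ["docker", "build", "-t", tag]
    for name in sorted(build_args):
        command += ["--build-arg", f"{name}={build_args[name]}"]
    command.append(context)
    return command


if __name__ == "__main__":
    # Example: CUDA 10.0 on Ubuntu 18.04 with gcc/g++ 7.
    argv = docker_build_command(
        CUDA_VERSION="10.0", LINUX_VERSION="ubuntu18.04", CC="7", CXX="7"
    )
    print(" ".join(shlex.quote(part) for part in argv))
```

Rejecting unknown names up front is a design choice of this sketch; it catches typos that could otherwise go unnoticed. Arguments that are not passed keep the defaults listed above.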

Testing

cuDF

This project uses py.test.

In the source root directory and with the development conda environment activated, run:

py.test --cache-clear --ignore=libgdf

libgdf

The libgdf tests require a GPU and CUDA. CUDA can be installed locally or through the conda packages of numba & cudatoolkit. For more details on the requirements needed to run these tests see the libgdf README.

libgdf uses two testing frameworks, py.test and GoogleTest:

# Run py.test command inside the /libgdf folder
py.test

# Run GoogleTest command inside the /libgdf/build folder after cmake
make -j test

Open GPU Data Science

The RAPIDS suite of open source software libraries aims to enable execution of end-to-end data science and analytics pipelines entirely on GPUs. It relies on NVIDIA® CUDA® primitives for low-level compute optimization, but exposes that GPU parallelism and high-bandwidth memory speed through user-friendly Python interfaces.

Apache Arrow on GPU

The GPU version of Apache Arrow is a common API that enables efficient interchange of tabular data between processes running on the GPU. End-to-end computation on the GPU avoids unnecessary copying and converting of data off the GPU, reducing compute time and cost for high-performance analytics common in artificial intelligence workloads. As the name implies, cuDF uses the Apache Arrow columnar data format on the GPU. Currently, a subset of the features in Apache Arrow are supported.
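
To make "columnar data format" concrete, the following Python sketch packs one nullable integer column the way the Arrow format lays it out: a contiguous values buffer plus a validity bitmap in which bit `i % 8` of byte `i // 8` is 1 when element `i` is non-null. This is an illustration in host memory using only the standard library, not cuDF's or libgdf's implementation; the function names are hypothetical.

```python
from array import array


def to_arrow_like_column(values):
    """Pack optional ints into (values buffer, validity bitmap)."""
    # Null slots still occupy space in the values buffer; 0 is a placeholder.
    data = array("q", (0 if v is None else v for v in values))
    bitmap = bytearray((len(values) + 7) // 8)
    for i, v in enumerate(values):
        if v is not None:
            bitmap[i // 8] |= 1 << (i % 8)  # least-significant bit first
    return data, bytes(bitmap)


def is_valid(bitmap, i):
    """True if element `i` is non-null according to the validity bitmap."""
    return bool((bitmap[i // 8] >> (i % 8)) & 1)


if __name__ == "__main__":
    data, bitmap = to_arrow_like_column([1, None, 3])
    for i, value in enumerate(data):
        print(i, value if is_valid(bitmap, i) else None)
```

Because every column is a flat, fixed-layout buffer, it can be handed between processes or libraries without per-row conversion, which is the interchange benefit described above.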
