Giter Club home page Giter Club logo

ukgovdatascience.cookiecutter-data-science's Introduction

cookiecutter-data-science

cookiecutter-data-science contains the boilerplate you need to start a Data Science project.

Project Goals

  • Provide project templates to quickstart Data Science research & development projects
  • Stop the use of filenames such as YYYYMMDD_data_exploration_v1_new_old_new_v34_final_new.sql

Install guide

Get cookiecutter (first time only)

$ pip install cookiecutter

Now run it against this repo

$ cookiecutter https://github.com/jvawdrey/cookiecutter-data-science.git

You will be prompted to answer a set of questions which will be used to fill out the boilerplate. Note - Hitting enter will keep the defaults displayed.

Cloning into 'cookiecutter-data-science'...
remote: Counting objects: 24, done.
remote: Compressing objects: 100% (15/15), done.
remote: Total 24 (delta 9), reused 23 (delta 8), pack-reused 0
Unpacking objects: 100% (24/24), done.
Checking connectivity... done.
full_name [Matthew Upson]:
email [[email protected]]:
client [GDS]:
project_name [My Data Science Project]:
project_slug [my-data-science-project]:
project_description [Data Science cookiecutter contains all the boilerplate you need to create a Data Science project.]:
created_date [2015-11-17]:

Enter project

$ cd <project name>

Create git repo for version control

# initialize project
$ git init

# add all files
$ git add -A

# first commit
$ git commit -m "Initial commit"

Repo layout

  • /data: Utility project data such as mapping tables
  • /docs: Project documents including excel files
  • /img: Visualizations generated for project
  • /python: Python code for project
    • /python/notebooks: Python notebook code
  • /prod: "Operational ready code"
  • /prod/tests: "Tests for prod code"
  • /r: R code for project
  • /sql: SQL scripts
  • /tmp: Temp files (Note - Files in this directory are ignored from Git repo)

File descriptions

  • /data:
    • filename: description
  • /docs:
    • filename: description
  • /img:
    • Note - Relevant images are documented in related powerpoint presentation in the '/docs' directory
  • /python:
    • filename: description
  • /r:
    • data-exploration.R: This code contains data exploration queries used to understand available data for cleaning and feature generation.
  • /sql:
    • data-exploration.sql: This file contains data exploration queries used to audit and understand available data in addition to determining correct data preparation logic
    • data-preparation.sql: This script contains table generation queries used to prepare available data for feature generation.
    • feature-generation.sql: This script creates features (independent variables) for testing as inputs into the models.
    • feature-selection.sql: This script contains a set of queries which calculate summary statistics, correlations etc. to be used for feature selection.
    • model-development.sql: This script is used to build (train) the model
    • model-scoring.sql: This script is used to apply model to new data (scoring)
    • model-update.sql: This script is used to update (retrain) model

Contact

ukgovdatascience.cookiecutter-data-science's People

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.