
cookiecutter-data-science's Introduction

Cookiecutter Data Science

A logical, reasonably standardized but flexible project structure for doing and sharing data science work.

Cookiecutter Data Science (CCDS) is a tool for setting up a data science project template that incorporates best practices. To learn more about CCDS's philosophy, visit the project homepage.

ℹ️ Cookiecutter Data Science v2 has changed from v1. It now requires installing the new cookiecutter-data-science Python package, which extends the functionality of the cookiecutter templating utility. Use the provided ccds command-line program instead of cookiecutter.

Installation

Cookiecutter Data Science v2 requires Python 3.8+. Since this is a cross-project utility application, we recommend installing it with pipx. Installation command options:

# With pipx from PyPI (recommended)
pipx install cookiecutter-data-science

# With pip from PyPI
pip install cookiecutter-data-science

# With conda from conda-forge (coming soon)
# conda install cookiecutter-data-science -c conda-forge

Starting a new project

To start a new project, run:

ccds

The resulting directory structure

The directory structure of your new project will look something like this (depending on the settings that you choose):

├── LICENSE            <- Open-source license if one is chosen
├── Makefile           <- Makefile with convenience commands like `make data` or `make train`
├── README.md          <- The top-level README for developers using this project.
├── data
│   ├── external       <- Data from third party sources.
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
│
├── docs               <- A default mkdocs project; see www.mkdocs.org for details
│
├── models             <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering),
│                         the creator's initials, and a short `-` delimited description, e.g.
│                         `1.0-jqp-initial-data-exploration`.
│
├── pyproject.toml     <- Project configuration file with package metadata for 
│                         {{ cookiecutter.module_name }} and configuration for tools like black
│
├── references         <- Data dictionaries, manuals, and all other explanatory materials.
│
├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures        <- Generated graphics and figures to be used in reporting
│
├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
│                         generated with `pip freeze > requirements.txt`
│
├── setup.cfg          <- Configuration file for flake8
│
└── {{ cookiecutter.module_name }}   <- Source code for use in this project.
    │
    ├── __init__.py             <- Makes {{ cookiecutter.module_name }} a Python module
    │
    ├── config.py               <- Store useful variables and configuration
    │
    ├── dataset.py              <- Scripts to download or generate data
    │
    ├── features.py             <- Code to create features for modeling
    │
    ├── modeling                
    │   ├── __init__.py 
    │   ├── predict.py          <- Code to run model inference with trained models          
    │   └── train.py            <- Code to train models
    │
    └── plots.py                <- Code to create visualizations   

Using v1

If you want to use the old v1 project template, you need to have either the cookiecutter-data-science package or cookiecutter package installed. Then, use either command-line program with the -c v1 option:

ccds https://github.com/drivendataorg/cookiecutter-data-science -c v1
# or equivalently
cookiecutter https://github.com/drivendataorg/cookiecutter-data-science -c v1

Contributing

We welcome contributions! See the docs for guidelines.

Installing development requirements

pip install -r dev-requirements.txt

Running the tests

pytest tests


cookiecutter-data-science's Issues

Make default repo_name lowercase

Right now the default repo_name is simply the provided project_name, replacing spaces with underscores: "repo_name": "{{ cookiecutter.project_name|replace(' ', '_') }}".

It would be nice if the default also converted the project_name to lowercase: {{ cookiecutter.project_name.lower().replace(' ', '_') }}.
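In plain Python terms, the proposed default would behave like this (hypothetical illustration):

# current default: replace spaces with underscores; proposed: also lowercase
project_name = "My Analysis Project"

current_repo_name = project_name.replace(' ', '_')             # "My_Analysis_Project"
proposed_repo_name = project_name.lower().replace(' ', '_')    # "my_analysis_project"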

Thoughts?

Rename IPython NB to Jupyter NB

Alright, this is SUPER nitpicky. Jupyter Notebooks is the new name for IPython Notebooks. The comment above .ipynb_checkpoints/ in the .gitignore should be changed from # IPython NB Checkpoints to # Jupyter NB Checkpoints.

I'm going to submit a PR to make the change (like I said, really nitpicky).

Make separate docs pages instead of one monolithic page

Especially if we're adding more content (e.g., #18), we may want to have a few separate pages. Possible segmentation could be:

  • Project introduction and documentation
  • Directory layout
  • Opinions and philosophy
  • Workflow components and the technologies that are chosen (or are options)
  • Extension strategies (e.g., #16)
  • Links to examples of projects that use the template

Imports not right for dotenv snippet in docs

# src/data/dotenv_example.py
from os.path import join, dirname
from dotenv import load_dotenv, find_dotenv

# find .env automagically by walking up directories until it's found
dotenv_path = find_dotenv()

# load up the entries as environment variables
load_dotenv(dotenv_path)

database_url = os.environ.get("DATABASE_URL")
other_variable = os.environ.get("OTHER_VARIABLE")

from os.path import join, dirname should just be import os (for os.environ.get).
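Applying that fix, the corrected snippet would read:

# src/data/dotenv_example.py (corrected)
import os

from dotenv import load_dotenv, find_dotenv

# find .env automagically by walking up directories until it's found
dotenv_path = find_dotenv()

# load up the entries as environment variables
load_dotenv(dotenv_path)

database_url = os.environ.get("DATABASE_URL")
other_variable = os.environ.get("OTHER_VARIABLE")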

Request for a tutorial demonstrating simple implementation of cookie-cutter-data-science framework

Hello,
I'd like to use the cookiecutter-data-science framework for the project that I'm working on, but unfortunately I'm having trouble getting started. Specifically, I'm having trouble figuring out how to configure the make_dataset.py file to execute any Python data-making scripts. I'm sure the fix is pretty basic, but I've been spinning my wheels for a while trying to figure this out.

It would be great if you could provide a basic tutorial demonstrating a simple implementation of your framework that people like me could use to get started.
Thanks!

Include nosetests out of the box with top level testing dir

One of the main components that is different from my usual data science setup is a top-level directory for unit and integration testing. Once a model moves to production, it is vital that it ship with unit and integration tests and assurance that the work does not break any other models. I recommend adding this directory at the top level of the project so that forked projects can run the testing suite with access to all the proper sub-modules.
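For illustration, a minimal top-level test (hypothetical module and function names, discoverable by nose or pytest) might look like:

# tests/test_features.py -- hypothetical example, assumes the project exposes build_features
from src.features.build_features import build_features


def test_build_features_preserves_row_count():
    sample = [{"value": 1}, {"value": 2}]
    features = build_features(sample)
    assert len(features) == len(sample)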

Great work; I appreciate the organization!

Unclear how to use AWS and `make data`

Analysis is a DAG. The sequence in this DAG is critical, so more prescription would be beneficial.

It's unclear how to incorporate AWS and the make sync_data_(to|from)_s3 commands into make data. In addition, the documentation doesn't describe how AWS should be used with the .env file.

  • Should make data call sync_data_from_s3?
  • How should variables from .env be exported so they are available to make sync_data_(to|from)_s3? A Python script, or something else?

How would this structure change for R?

I'm working on creating a similar standard for R at my company and was hoping to get some thoughts on whether anything warrants changing to be R-specific.

Compatibility of pack to create api driven projects

Hi there! I love the project; it really reflects the maturity of data science projects and where we stand. So good!

I raise this issue because I was wondering whether the current structure can be adapted to an API-driven project, that is, a project in which the analysis and data flow are tied to an API definition.

If so, what would that look like, so we can document it (or point me to where it already is)?
If not, why not? Some books recommend having an API flow for analysis and processing so that our results and analyses are available to our colleagues in engineering, even allowing for an easy scale-up.

Thank you so much!

Swap out Sphinx for mkdocs

Sphinx is really good for projects where documentation lives in docstrings in the code. MkDocs is easier to write from scratch, style, and deploy.

Also, I've got a preference for writing Markdown over RST.

Minor issue with self documenting make

In a fresh project, running make (Ubuntu) gives me:

$ make
/bin/sh: 1: test: Linux: unexpected operator
Available rules:

clean               Delete all compiled Python files 
create_environment  Set up python interpreter environment 
data                Make Dataset 
lint                Lint using flake8 
requirements        Install Python Dependencies 
sync_data_from_s3   Download Data from S3 
sync_data_to_s3     Upload Data to S3 
test_environment    Test python environment is setup correctly

It looks like a problem with the invocation of test (my uname is Linux).

Seems like this comes from the very last line of the self-documenting rule:

	| more $(shell test $(shell uname) == Darwin && echo '--no-init --raw-control-chars')

Changing the == to = seems to get rid of the /bin/sh: 1: test: Linux: unexpected operator.

@pjbull Can you see if this works on OS X?

Add the installation folder to .env

Hi,

I am just starting to use the project. I noticed that most of the commands, like those in the Makefile, use relative paths.

It is sometimes useful to have access to the full path. For example, when running cron jobs on some of the scripts inside the project, getting the proper relative paths may be a bit tricky. One can use something like os.path.abspath(__file__) in the script to find the path, but it would be easier if the project folder were written to .env, with that environment variable then used to build paths to the data or visualization folders.
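For illustration, a script could then resolve paths from that variable instead of from its own location (a sketch assuming the variable is named PROJECT_DIR):

# Hypothetical usage once PROJECT_DIR has been written to .env
import os

from dotenv import load_dotenv, find_dotenv

load_dotenv(find_dotenv())

project_dir = os.environ.get("PROJECT_DIR")
raw_data_dir = os.path.join(project_dir, "data", "raw")
figures_dir = os.path.join(project_dir, "reports", "figures")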

Thanks.

ContextDecodingException

When running cookiecutter https://github.com/drivendata/cookiecutter-data-science in Anaconda 2.3.0 (Python 2.7.11) I get the following exception:

Traceback (most recent call last):
  File "/Users/bencook/anaconda/bin/cookiecutter", line 11, in <module>
    sys.exit(main())
  File "/Users/bencook/anaconda/lib/python2.7/site-packages/click-5.1-py2.7.egg/click/core.py", line 700, in __call__
    return self.main(*args, **kwargs)
  File "/Users/bencook/anaconda/lib/python2.7/site-packages/click-5.1-py2.7.egg/click/core.py", line 680, in main
    rv = self.invoke(ctx)
  File "/Users/bencook/anaconda/lib/python2.7/site-packages/click-5.1-py2.7.egg/click/core.py", line 873, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/bencook/anaconda/lib/python2.7/site-packages/click-5.1-py2.7.egg/click/core.py", line 508, in invoke
    return callback(*args, **kwargs)
  File "/Users/bencook/anaconda/lib/python2.7/site-packages/cookiecutter/cli.py", line 106, in main
    config_file=user_config
  File "/Users/bencook/anaconda/lib/python2.7/site-packages/cookiecutter/main.py", line 130, in cookiecutter
    extra_context=extra_context,
  File "/Users/bencook/anaconda/lib/python2.7/site-packages/cookiecutter/generate.py", line 102, in generate_context
    raise ContextDecodingException(our_exc_message)
cookiecutter.exceptions.ContextDecodingException: JSON decoding error while loading "/Users/bencook/.cookiecutters/cookiecutter-data-science/cookiecutter.json".  Decoding error details: "Expecting property name: line 9 column 1 (char 401)"

I get the same exception in a virtual environment with Python 2.7.9.

Here's what my cookiecutter.json looks like:

{
    "project_name": "project_name",
    "repo_name": "{{ cookiecutter.project_name|replace(' ', '_') }}",
    "author_name": "Your name (or your organization/company/team)",
    "description": "A short description of the project.",
    "year": "2016",
    "open_source_license": ["MIT", "BSD", "Not open source"],
    "s3_bucket": "[OPTIONAL] your-bucket-for-syncing-data (do not include 's3://')",
}
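The trailing comma after the "s3_bucket" entry is a likely culprit: standard JSON does not allow trailing commas, as a quick check illustrates:

# Quick check (Python 3): the json module rejects trailing commas, as standard JSON requires
import json

json.loads('{"a": 1}')    # parses fine

try:
    json.loads('{"a": 1,}')    # trailing comma, like the last entry above
except json.JSONDecodeError as err:
    print(err)    # "Expecting property name enclosed in double quotes ..." -- the same family of error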

Set .gitignore for the data directory

Goal: keep the data/ folder in the project template for illustrative reasons, but by default ignore its contents once the cookiecutter has been instantiated and turned into a git repo.

Slightly adjust commands in Makefile

Just started using your cookiecutter for the first time. Thank you for the effort, it seems very valuable to me!

I had some comments and I'm happy to create pull requests if those are desired changes. I'm talking about targets in the Makefile here.

requirements:

pip now also allows a constraints file, which seems more appropriate for pinning or requiring certain versions of dependencies.

clean:

find can delete directly: find . -iname "*.pyc" -delete seems pretty clear to me. For Python 3 it could be useful to add find . -iname "__pycache__" -exec rm -rf {} +. The + at the end, rather than \;, passes all found instances to rm in one go rather than executing rm for each one individually.

lint:

Typically I only want to run flake8 on source code, so rather than excluding a bunch of directories, why not call flake8 on the src directory only?

Make src home to more than just Python code

I routinely have to use R code in my pipeline rules/targets. I propose changing the src organization as follows:

Change the current:

src
├── data
│   └── make_dataset.py
├── features
│   └── build_features.py
├── __init__.py
├── models
│   ├── predict_model.py
│   └── train_model.py
└── visualization
    └── visualize.py

to something akin to this:

src
├── python
│   ├── data
│   │   ├── __init__.py
│   │   └── make_dataset.py
│   ├── features
│   │   ├── build_features.py
│   │   └── __init__.py
│   ├── __init__.py
│   ├── models
│   │   ├── __init__.py
│   │   ├── predict_model.py
│   │   └── train_model.py
│   ├── rules
│   │   ├── __init__.py
│   │   └── template_python_script.py
│   └── visualization
│       ├── __init__.py
│       └── visualize.py
└── R
    └── rules
        └── template_R_script.R

Thoughts?

Remove --recursive from s3 sync commands in Makefile

At least as of AWS CLI v1.10.32, the aws s3 sync command does not have a --recursive flag. As such, running the sync_data_to_s3 or sync_data_from_s3 make rules throws the error

Unknown options: --recursive
make: *** [sync_data_to_s3] Error 255

The sync operation is recursive by default (see the AWS CLI docs).

The --recursive flag should be removed from the default Makefile.

Add an opinion about making scripts chatty

I'm generally in favor of keeping the opinions section pithy, but I think this may be a fit.

  • Use real logging, not print statements (we have some boilerplate; see the sketch after this list)
    • easy redirect to multiple places
    • timestamps and module for free
    • easy to see what happens on someone else's instance
  • Include tqdm by default
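For illustration, a minimal logging setup of the kind this opinion suggests might look like this (a sketch, not the project's actual boilerplate):

# Hypothetical boilerplate, e.g. src/log_config.py
import logging
import sys


def get_logger(name):
    """Return a logger that writes timestamped, module-tagged messages to stderr."""
    logger = logging.getLogger(name)
    if not logger.handlers:
        handler = logging.StreamHandler(sys.stderr)
        handler.setFormatter(
            logging.Formatter("%(asctime)s - %(name)s - %(levelname)s - %(message)s")
        )
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
    return logger


logger = get_logger(__name__)
logger.info("making final data set from raw data")  # instead of print(...)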

Add default config file to src/

Hi

Should we add a src/config.py or src/settings.py file? I believe this would make it easier to get paths to folders etc. in make_data.py for example.

# src/config.py
"""Store config variables and other settings."""
import inspect
import os
from os.path import join, dirname, abspath

from dotenv import load_dotenv

dotenv_path = join(dirname(__file__), '../.env')
load_dotenv(dotenv_path)

class ParamConfig:
    """Config variables for the project"""
    def __init__(self):
        self.kaggle_username = os.environ.get("KAGGLE_USERNAME")
        self.kaggle_password = os.environ.get("KAGGLE_PASSWORD")
        self.config_dir = dirname(abspath(inspect.getfile(inspect.currentframe())))
        self.root_dir = dirname(self.config_dir)

        # Data directories
        self.data_dir = os.path.join(self.root_dir, 'data')
        self.raw_data_dir = os.path.join(self.data_dir, 'raw')
        self.processed_data_dir = os.path.join(self.data_dir, 'processed')

config = ParamConfig()

I can then import the config variable like so:

# Selective excerpt from src/data/make_dataset.py as an example
import logging
from os import path

import pandas as pd

from src.config import config

def main(output_zip=False):
    """Create data!"""
    logger = logging.getLogger(__name__)
    logger.info('making final data set from raw data')

    # compression = 'gzip' if output_zip is True else

    # Read raw data (auto unzipping files!)
    train_sales = pd.read_csv(path.join(config.raw_data_dir, 'train.csv.zip'))
    test_sales = pd.read_csv(path.join(config.raw_data_dir, 'test.csv.zip'))
    stores = pd.read_csv(path.join(config.raw_data_dir, 'store.csv.zip'),
                         dtype={'CompetitionOpenSinceYear': str,
                                'CompetitionOpenSinceMonth': str,
                                'Promo2SinceWeek': str,
                                'Promo2SinceYear': str})

However, note that importing settings in this way also requires me to change the Makefile from this:

data:
    python src/data/make_dataset.py

to this:

data:
    python -m src.data.make_dataset

I'm not sure if this has any downsides. An alternative is to add src and/or the settings file to the Python path.

I'm still learning both Python and data science, so please bear with me if what I'm suggesting or my code is silly :)

Cookiecutter is now in Conda Forge

This works for installation, so you might want to change your creation process to something like:

$ conda config --add channels conda-forge
$ conda install cookiecutter

Workflow for working with figures and reports

I just started using this cookiecutter and I'm wondering how people are using this directory structure in order to generate figures and reports.

Here's what I'm doing currently:

  • do analysis and generate interesting figure, save them to /reports/figures/
  • write up the final Jupyter notebook report from within /notebooks/reports/; any references to figures point to ../../reports/figures/fig.png
  • export the report as report.html and place in /reports/

The issue now is that when I view report.html, the figures don't have the proper path. How are people getting around this?

docker support

Hi,
Any chance someone can add Docker support plus SQL Docker support, like in this cookiecutter-django project?

The benefits are:

  1. reproducible environment for running the code -> easier to deploy
  2. reproducible database (if needed)

I am new to Docker and Cookiecutter, otherwise I would do this myself.

Test suite

We should set up a simple test suite, mostly focused on config testing, I would guess:

  • CI server that runs tests on multiple OSes: macOS/Linux/Windows
  • Tox for multiple Python versions

how will cookiecutter handle Database driven projects

I see there is S3 syncing, but what about people using SQL databases or HDFS? A few useful thoughts:

  1. There should be a place for database connection strings and for connections to be established (see the sketch after this list)
  2. Inside src/data we should store Python scripts, but we could have a subdirectory, database_scripts, for .sql, .hql, etc. This would cover all database insertion, ETL, in-database data munging, and so on.
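For illustration, one minimal way to centralize the connection string (a sketch assuming SQLAlchemy and a DATABASE_URL entry in .env; the module name is hypothetical):

# src/data/database.py -- hypothetical helper, not part of the current template
import os

from dotenv import load_dotenv, find_dotenv
from sqlalchemy import create_engine

# Load DATABASE_URL (and anything else) from the project's .env file
load_dotenv(find_dotenv())


def get_engine():
    """Return a SQLAlchemy engine built from the DATABASE_URL environment variable."""
    database_url = os.environ.get("DATABASE_URL")
    if database_url is None:
        raise RuntimeError("DATABASE_URL is not set in .env")
    return create_engine(database_url)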

Does this seem sensible?
