
cookiecutter-data-science's Introduction

Cookiecutter Data Science

A logical, reasonably standardized but flexible project structure for doing and sharing data science work.

Cookiecutter Data Science (CCDS) is a tool for setting up a data science project template that incorporates best practices. To learn more about CCDS's philosophy, visit the project homepage.

ℹ️ Cookiecutter Data Science v2 has changed from v1. It now requires installing the new cookiecutter-data-science Python package, which extends the functionality of the cookiecutter templating utility. Use the provided ccds command-line program instead of cookiecutter.

Installation

Cookiecutter Data Science v2 requires Python 3.8+. Since this is a cross-project utility application, we recommend installing it with pipx. Installation command options:

# With pipx from PyPI (recommended)
pipx install cookiecutter-data-science

# With pip from PyPI
pip install cookiecutter-data-science

# With conda from conda-forge (coming soon)
# conda install cookiecutter-data-science -c conda-forge

Starting a new project

To start a new project, run:

ccds

The resulting directory structure

The directory structure of your new project will look something like this (depending on the settings that you choose):

├── LICENSE            <- Open-source license if one is chosen
├── Makefile           <- Makefile with convenience commands like `make data` or `make train`
├── README.md          <- The top-level README for developers using this project.
├── data
│   ├── external       <- Data from third party sources.
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
│
├── docs               <- A default mkdocs project; see www.mkdocs.org for details
│
├── models             <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering),
│                         the creator's initials, and a short `-` delimited description, e.g.
│                         `1.0-jqp-initial-data-exploration`.
│
├── pyproject.toml     <- Project configuration file with package metadata for 
│                         {{ cookiecutter.module_name }} and configuration for tools like black
│
├── references         <- Data dictionaries, manuals, and all other explanatory materials.
│
├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures        <- Generated graphics and figures to be used in reporting
│
├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
│                         generated with `pip freeze > requirements.txt`
│
├── setup.cfg          <- Configuration file for flake8
│
└── {{ cookiecutter.module_name }}   <- Source code for use in this project.
    │
    ├── __init__.py             <- Makes {{ cookiecutter.module_name }} a Python module
    │
    ├── config.py               <- Store useful variables and configuration
    │
    ├── dataset.py              <- Scripts to download or generate data
    │
    ├── features.py             <- Code to create features for modeling
    │
    ├── modeling                
    │   ├── __init__.py 
    │   ├── predict.py          <- Code to run model inference with trained models          
    │   └── train.py            <- Code to train models
    │
    └── plots.py                <- Code to create visualizations   

Using v1

If you want to use the old v1 project template, you need to have either the cookiecutter-data-science package or cookiecutter package installed. Then, use either command-line program with the -c v1 option:

ccds https://github.com/drivendataorg/cookiecutter-data-science -c v1
# or equivalently
cookiecutter https://github.com/drivendataorg/cookiecutter-data-science -c v1

Contributing

We welcome contributions! See the docs for guidelines.

Installing development requirements

pip install -r dev-requirements.txt

Running the tests

pytest tests


cookiecutter-data-science's Issues

Make default repo_name lowercase

Right now the default repo_name is simply the provided project_name, replacing spaces with underscores: "repo_name": "{{ cookiecutter.project_name|replace(' ', '_') }}".

It would be nice if the default also converted the project_name to lowercase: {{ cookiecutter.project_name.lower().replace(' ', '_') }}.
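In plain Python terms, the proposed default would behave like this (hypothetical illustration):

# current default: replace spaces with underscores; proposed: also lowercase
project_name = "My Analysis Project"

current_repo_name = project_name.replace(' ', '_')             # "My_Analysis_Project"
proposed_repo_name = project_name.lower().replace(' ', '_')    # "my_analysis_project"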

Thoughts?

Rename IPython NB to Jupyter NB

Alright, this is SUPER nitpicky. Jupyter Notebooks is the new name for IPython Notebooks. The comment above .ipynb_checkpoints/ in the .gitignore should be changed from # IPython NB Checkpoints to # Jupyter NB Checkpoints.

I'm going to submit a PR to make the change (like I said, really nitpicky).

Make separate docs pages instead of one monolithic page

Especially if we're adding more content (e.g., #18), we may want to have a few separate pages. Possible segmentation could be:

  • Project introduction and documentation
  • Directory layout
  • Opinions and philosophy
  • Workflow components and the technologies that are chosen (or are options)
  • Extension strategies (e.g., #16)
  • Links to examples of projects that use the template

Imports not right for dotenv snippet in docs

# src/data/dotenv_example.py
from os.path import join, dirname
from dotenv import load_dotenv, find_dotenv

# find .env automagically by walking up directories until it's found
dotenv_path = find_dotenv()

# load up the entries as environment variables
load_dotenv(dotenv_path)

database_url = os.environ.get("DATABASE_URL")
other_variable = os.environ.get("OTHER_VARIABLE")

from os.path import join, dirname should just be import os (for os.environ.get).
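Applying that fix, the corrected snippet would read:

# src/data/dotenv_example.py (corrected)
import os

from dotenv import load_dotenv, find_dotenv

# find .env automagically by walking up directories until it's found
dotenv_path = find_dotenv()

# load up the entries as environment variables
load_dotenv(dotenv_path)

database_url = os.environ.get("DATABASE_URL")
other_variable = os.environ.get("OTHER_VARIABLE")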

Request for a tutorial demonstrating simple implementation of cookie-cutter-data-science framework

Hello,
I'd like to use the cookiecutter-data-science framework for the project that I'm working on, but unfortunately I'm having trouble getting started. Specifically, I'm having trouble figuring out how to configure the make_dataset.py file to execute any Python data-making scripts. I'm sure the fix is pretty basic, but I've been spinning my wheels for a while trying to figure this out.

It would be great if you could provide a basic tutorial demonstrating a simple implementation of your framework that people like me could use to get started.
Thanks!

Include nosetests out of the box with top level testing dir

One of the main components that is different from my usual data science setup is a top-level directory for unit and integration testing. Once a model moves to production, it is vital that it ship with unit and integration tests and assurance that the work does not break any other models. I recommend adding this directory at the top level of the project so that forked projects can run the testing suite with access to all the proper sub-modules.
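For illustration, a minimal top-level test (hypothetical module and function names, discoverable by nose or pytest) might look like:

# tests/test_features.py -- hypothetical example, assumes the project exposes build_features
from src.features.build_features import build_features


def test_build_features_preserves_row_count():
    sample = [{"value": 1}, {"value": 2}]
    features = build_features(sample)
    assert len(features) == len(sample)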

Great work; I appreciate the organization!

Unclear how to use AWS and `make data`

Analysis is a DAG. The sequence in this DAG is critical, so more prescription would be beneficial.

It's unclear how to incorporate AWS and the make sync_data_(to|from)_s3 commands into make data. In addition, the documentation doesn't describe how AWS should be used with the .env file.

  • Should make data call sync_data_from_s3?
  • How should variables from .env be exported so they are available to make sync_data_(to|from)_s3? A Python script, or something else?

How would this structure change for R?

I'm working on creating a similar standard for R at my company and was hoping to get some thoughts on whether anything warrants changing to be R-specific.

Compatibility of pack to create api driven projects

Hi there! I love the project; it really reflects the maturity of data science projects and where we stand. So good!

I raise this issue because I was wondering whether the current structure can be adapted to an API-driven project, that is, a project in which the analysis and data flow are tied to an API definition.

If so, what would that look like, so we can document it (or point me to where it already is)?
If not, why not? Some books recommend having an API flow for analysis and processing so that our results and analyses are available to our colleagues in engineering, even allowing for an easy scale-up.

Thank you so much!

Swap out Sphinx for mkdocs

Sphinx is really good for projects where documentation lives in docstrings in the code. MkDocs is easier to write from scratch, style, and deploy.

Also, I've got a preference for writing Markdown over RST.

Minor issue with self documenting make

In a fresh project, running make (Ubuntu) gives me:

$ make
/bin/sh: 1: test: Linux: unexpected operator
Available rules:

clean               Delete all compiled Python files 
create_environment  Set up python interpreter environment 
data                Make Dataset 
lint                Lint using flake8 
requirements        Install Python Dependencies 
sync_data_from_s3   Download Data from S3 
sync_data_to_s3     Upload Data to S3 
test_environment    Test python environment is setup correctly

It looks like a problem with the invocation of test (my uname is Linux).

Seems like this comes from the very last line of the self-documenting rule:

	| more $(shell test $(shell uname) == Darwin && echo '--no-init --raw-control-chars')

Changing the == to = seems to get rid of the /bin/sh: 1: test: Linux: unexpected operator.

@pjbull Can you see if this works on OS X?

Add the installation folder to .env

Hi,

I am just starting to use the project. I noticed that most of the commands, like those in the Makefile, use relative paths.

It is sometimes useful to have access to the full path. For example, when running cron jobs on some of the scripts inside the project, getting the proper relative paths may be a bit tricky. One can use something like os.path.abspath(__file__) in the script to find the path, but it would be easier if the project folder were written to .env, with that environment variable then used to build paths to the data or visualization folders.
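For illustration, a script could then resolve paths from that variable instead of from its own location (a sketch assuming the variable is named PROJECT_DIR):

# Hypothetical usage once PROJECT_DIR has been written to .env
import os

from dotenv import load_dotenv, find_dotenv

load_dotenv(find_dotenv())

project_dir = os.environ.get("PROJECT_DIR")
raw_data_dir = os.path.join(project_dir, "data", "raw")
figures_dir = os.path.join(project_dir, "reports", "figures")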

Thanks.

ContextDecodingException

When running cookiecutter https://github.com/drivendata/cookiecutter-data-science in Anaconda 2.3.0 (Python 2.7.11) I get the following exception:

Traceback (most recent call last):
  File "/Users/bencook/anaconda/bin/cookiecutter", line 11, in <module>
    sys.exit(main())
  File "/Users/bencook/anaconda/lib/python2.7/site-packages/click-5.1-py2.7.egg/click/core.py", line 700, in __call__
    return self.main(*args, **kwargs)
  File "/Users/bencook/anaconda/lib/python2.7/site-packages/click-5.1-py2.7.egg/click/core.py", line 680, in main
    rv = self.invoke(ctx)
  File "/Users/bencook/anaconda/lib/python2.7/site-packages/click-5.1-py2.7.egg/click/core.py", line 873, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/bencook/anaconda/lib/python2.7/site-packages/click-5.1-py2.7.egg/click/core.py", line 508, in invoke
    return callback(*args, **kwargs)
  File "/Users/bencook/anaconda/lib/python2.7/site-packages/cookiecutter/cli.py", line 106, in main
    config_file=user_config
  File "/Users/bencook/anaconda/lib/python2.7/site-packages/cookiecutter/main.py", line 130, in cookiecutter
    extra_context=extra_context,
  File "/Users/bencook/anaconda/lib/python2.7/site-packages/cookiecutter/generate.py", line 102, in generate_context
    raise ContextDecodingException(our_exc_message)
cookiecutter.exceptions.ContextDecodingException: JSON decoding error while loading "/Users/bencook/.cookiecutters/cookiecutter-data-science/cookiecutter.json".  Decoding error details: "Expecting property name: line 9 column 1 (char 401)"

I get the same exception in a virtual environment with Python 2.7.9.

Here's what my cookiecutter.json looks like:

{
    "project_name": "project_name",
    "repo_name": "{{ cookiecutter.project_name|replace(' ', '_') }}",
    "author_name": "Your name (or your organization/company/team)",
    "description": "A short description of the project.",
    "year": "2016",
    "open_source_license": ["MIT", "BSD", "Not open source"],
    "s3_bucket": "[OPTIONAL] your-bucket-for-syncing-data (do not include 's3://')",
}
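The trailing comma after the "s3_bucket" entry is a likely culprit: standard JSON does not allow trailing commas, as a quick check illustrates:

# Quick check (Python 3): the json module rejects trailing commas, as standard JSON requires
import json

json.loads('{"a": 1}')    # parses fine

try:
    json.loads('{"a": 1,}')    # trailing comma, like the last entry above
except json.JSONDecodeError as err:
    print(err)    # "Expecting property name enclosed in double quotes ..." -- the same family of error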

Set .gitignore for the data directory

Goal: keep the data/ folder in the project template for illustrative reasons, but by default ignore its contents once the cookiecutter has been instantiated and turned into a git repo.

Slightly adjust commands in Makefile

Just started using your cookiecutter for the first time. Thank you for the effort, it seems very valuable to me!

I had some comments and I'm happy to create pull requests if those are desired changes. I'm talking about targets in the Makefile here.

requirements:

pip now also allows a constraints file, which seems more appropriate for pinning or requiring certain versions of dependencies.

clean:

find can delete directly: find . -iname "*.pyc" -delete seems pretty clear to me. For Python 3 it could be useful to add find . -iname "__pycache__" -exec rm -rf {} +. The + at the end, rather than \;, passes all found instances to rm in one go rather than executing rm for each one individually.

lint:

Typically I only want to run flake8 on source code, so rather than excluding a bunch of directories, why not call flake8 on the src directory only?

Make src home to more than just Python code

I routinely have to use R code in my pipeline rules/targets. I propose changing the src organization as follows:

Change the current:

src
├── data
│   └── make_dataset.py
├── features
│   └── build_features.py
├── __init__.py
├── models
│   ├── predict_model.py
│   └── train_model.py
└── visualization
    └── visualize.py

to something akin to this:

src
├── python
│   ├── data
│   │   ├── __init__.py
│   │   └── make_dataset.py
│   ├── features
│   │   ├── build_features.py
│   │   └── __init__.py
│   ├── __init__.py
│   ├── models
│   │   ├── __init__.py
│   │   ├── predict_model.py
│   │   └── train_model.py
│   ├── rules
│   │   ├── __init__.py
│   │   └── template_python_script.py
│   └── visualization
│       ├── __init__.py
│       └── visualize.py
└── R
    └── rules
        └── template_R_script.R

Thoughts?

Remove --recursive from s3 sync commands in Makefile

At least as of AWS CLI v1.10.32, the aws s3 sync command does not have a --recursive flag. As such, running the sync_data_to_s3 or sync_data_from_s3 make rules throws the error

Unknown options: --recursive
make: *** [sync_data_to_s3] Error 255

The sync operation is recursive by default (see the AWS CLI docs).

The --recursive flag should be removed from the default Makefile.

Add an opinion about making scripts chatty

I'm generally in favor of keeping the opinions section pithy, but I think this may be a fit.

  • Use real logging, not print statements (we have some boilerplate; see the sketch after this list)
    • easy redirect to multiple places
    • timestamps and module for free
    • easy to see what happens on someone else's instance
  • Include tqdm by default
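For illustration, a minimal logging setup of the kind this opinion suggests might look like this (a sketch, not the project's actual boilerplate):

# Hypothetical boilerplate, e.g. src/log_config.py
import logging
import sys


def get_logger(name):
    """Return a logger that writes timestamped, module-tagged messages to stderr."""
    logger = logging.getLogger(name)
    if not logger.handlers:
        handler = logging.StreamHandler(sys.stderr)
        handler.setFormatter(
            logging.Formatter("%(asctime)s - %(name)s - %(levelname)s - %(message)s")
        )
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
    return logger


logger = get_logger(__name__)
logger.info("making final data set from raw data")  # instead of print(...)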

Add default config file to src/

Hi

Should we add a src/config.py or src/settings.py file? I believe this would make it easier to get paths to folders etc. in make_data.py for example.

# src/config.py
"""Store config variables and other settings."""
import inspect
import os
from os.path import join, dirname, abspath

from dotenv import load_dotenv

dotenv_path = join(dirname(__file__), '../.env')
load_dotenv(dotenv_path)

class ParamConfig:
    """Config variables for the project"""
    def __init__(self):
        self.kaggle_username = os.environ.get("KAGGLE_USERNAME")
        self.kaggle_password = os.environ.get("KAGGLE_PASSWORD")
        self.config_dir = dirname(abspath(inspect.getfile(inspect.currentframe())))
        self.root_dir = dirname(self.config_dir)

        # Data directories
        self.data_dir = os.path.join(self.root_dir, 'data')
        self.raw_data_dir = os.path.join(self.data_dir, 'raw')
        self.processed_data_dir = os.path.join(self.data_dir, 'processed')

config = ParamConfig()

I can then import the config variable like so:

# Selective excerpt from src/data/make_dataset.py as an example
import logging
from os import path

import pandas as pd

from src.config import config

def main(output_zip=False):
    """Create data!"""
    logger = logging.getLogger(__name__)
    logger.info('making final data set from raw data')

    # compression = 'gzip' if output_zip is True else

    # Read raw data (auto unzipping files!)
    train_sales = pd.read_csv(path.join(config.raw_data_dir, 'train.csv.zip'))
    test_sales = pd.read_csv(path.join(config.raw_data_dir, 'test.csv.zip'))
    stores = pd.read_csv(path.join(config.raw_data_dir, 'store.csv.zip'),
                         dtype={'CompetitionOpenSinceYear': str,
                                'CompetitionOpenSinceMonth': str,
                                'Promo2SinceWeek': str,
                                'Promo2SinceYear': str})

However, note that importing settings in this way also requires me to change the Makefile from this:

data:
    python src/data/make_dataset.py

to this:

data:
    python -m src.data.make_dataset

I'm not sure if this has any downsides. An alternative is to add src and/or the settings file to the Python path.

I'm still learning both Python and data science, so please bear with me if what I'm suggesting or my code is silly :)

Cookiecutter is now in Conda Forge

This works for installation, so you might want to change your creation process to something like:

$ conda config --add channels conda-forge
$ conda install cookiecutter

Workflow for working with figures and reports

I just started using this cookiecutter and I'm wondering how people are using this directory structure in order to generate figures and reports.

Here's what I'm doing currently:

  • do analysis and generate interesting figure, save them to /reports/figures/
  • write up the final Jupyter notebook report from within /notebooks/reports/; any references to figures point to ../../reports/figures/fig.png
  • export the report as report.html and place in /reports/

The issue now is that when I view report.html, the figures don't have the proper path. How are people getting around this?

docker support

Hi,
Any chance someone can add Docker support plus SQL Docker support, like in this cookiecutter-django project?

The benefits are:

  1. reproducible environment for running the code -> easier to deploy
  2. reproducible database (if needed)

I am new to Docker and Cookiecutter, otherwise I would do this myself.

Test suite

We should set up a simple test suite, mostly focused on config testing, I would guess:

  • CI server that runs tests on multiple OSes: macOS/Linux/Windows
  • Tox for multiple Python versions

how will cookiecutter handle Database driven projects

I see there is S3 syncing, but what about people using SQL databases or HDFS? A few useful thoughts:

  1. There should be a place for database connection strings and for connections to be established (see the sketch after this list)
  2. Inside src/data we should store Python scripts, but we could have a subdirectory, database_scripts, for .sql, .hql, etc. This would cover all database insertion, ETL, in-database data munging, and so on.
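For illustration, one minimal way to centralize the connection string (a sketch assuming SQLAlchemy and a DATABASE_URL entry in .env; the module name is hypothetical):

# src/data/database.py -- hypothetical helper, not part of the current template
import os

from dotenv import load_dotenv, find_dotenv
from sqlalchemy import create_engine

# Load DATABASE_URL (and anything else) from the project's .env file
load_dotenv(find_dotenv())


def get_engine():
    """Return a SQLAlchemy engine built from the DATABASE_URL environment variable."""
    database_url = os.environ.get("DATABASE_URL")
    if database_url is None:
        raise RuntimeError("DATABASE_URL is not set in .env")
    return create_engine(database_url)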

Does this seem sensible?
