
Triage

ML/Data Science Toolkit for Social Good and Public Policy Problems


Building ML/Data Science systems requires answering many design questions and turning them into modeling choices, which in turn define the machine learning models that get built. Choices such as cohort selection, unit of analysis, outcome definition, feature (explanatory variable or predictor) generation, model/classifier training, evaluation, selection, bias audits, interpretation, and list generation are complicated and hard to make a priori. In addition, once these choices are made, they have to be combined in different ways throughout the course of a project.

Triage is designed to:

  • Guide users (data scientists, analysts, researchers) through these design choices by highlighting critical operational use questions.
  • Provide an integrated interface to components that are needed throughout a ML/data science project workflow.

Getting Started with Triage

Installation

To install Triage locally, you need:

  • Ubuntu/RedHat
  • Python 3.8+
  • A PostgreSQL 9.6+ database with your source data (events, geographical data, etc) loaded.
    • NOTE: If your database is PostgreSQL 11+ you will get some speed improvements. We recommend updating to a recent version of PostgreSQL.
  • Ample space on an available disk (or, for example, in Amazon Web Services' S3) to store the matrices and models that will be created for your experiments

We recommend starting with a new python virtual environment and pip installing triage there.

$ virtualenv triage-env
$ . triage-env/bin/activate
(triage-env) $ pip install triage

If you get an error related to the pg_config executable, run the following command (make sure you have sudo access):

(triage-env) $ sudo apt-get install libpq-dev python3.9-dev

Then rerun pip install triage

(triage-env) $ pip install triage

To test if triage was installed correctly, type:

(triage-env) $ triage -h

Data

Triage needs your data in a PostgreSQL database and a configuration file with credentials for that database. By default, the Triage CLI reads database connection information from a file named 'database.yaml' (see the example in example/database.yaml).
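A minimal database.yaml might look like the following. All values here are placeholders, and the key names are an assumption; check the bundled example/database.yaml for the exact keys Triage expects.

```yaml
# Hypothetical connection settings -- replace with your own values,
# and verify key names against example/database.yaml.
host: localhost
port: 5432
user: triage_user
db: triage_db
pass: your_password
```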

If you don't want to install Postgres yourself, try triage db up to create a vanilla Postgres 12 database using docker. For more details on this command, check out Triage Database Provisioner

Configure Triage for your project

Triage is configured with a config.yaml file that defines parameters for each component. See the sample configuration file, with explanations, for an example of what the configuration looks like.

Using Triage

  1. Via the CLI:
triage experiment example/config/experiment.yaml
  2. Import as a Python package:
from triage.experiments import SingleThreadedExperiment

experiment = SingleThreadedExperiment(
    config=experiment_config, # a dictionary
    db_engine=create_engine(...), # http://docs.sqlalchemy.org/en/latest/core/engines.html
    project_path='/path/to/directory/to/save/data' # could be an S3 path too: 's3://mybucket/myprefix/'
)
experiment.run()

Many options are available for running experiments, affecting parallelization, storage, and more. These options are detailed on the Running an Experiment page.

Development

Triage was initially developed at the University of Chicago's Center for Data Science and Public Policy and is now being maintained at Carnegie Mellon University.

To build this package (without installation), its dependencies may alternatively be installed from the terminal using pip:

pip install -r requirement/main.txt

Testing

To add test (and development) dependencies, use test.txt:

pip install -r requirement/test.txt [-r requirement/dev.txt]

Then, to run tests:

pytest

Development Environment

To quickly bootstrap a development environment, having cloned the repository, invoke the executable develop script from your system shell:

./develop

A "wizard" will suggest set-up steps and optionally execute these, for example:

(install) begin

(pyenv) installed

(python-3.9.10) installed

(virtualenv) installed

(activation) installed

(libs) install?
1) yes, install {pip install -r requirement/main.txt -r requirement/test.txt -r requirement/dev.txt}
2) no, ignore
#? 1

Contributing

If you'd like to contribute to Triage development, see the CONTRIBUTING.md document.


triage's Issues

Support model comment

The pipeline should pass a model_comment to the ModelTrainer (via misc_db_parameters)

Allow dynamic categorical lists

The FeatureGenerator should have some way of dynamically computing choices for categoricals. One possible interface could be to have a choice_query option, for use in place of a choices option.

Parallelize collate feature generation

As projects scale, feature generation takes longer. An easy place to parallelize is to run each aggregation, or even group, in its own process. Since each aggregation/group writes to separate tables, there should be no problems with table locking and it should be able to take advantage of multiple cores on the database level.

Manually run commands within collate instead of execute

We don't want to use the aggregation tables, but by calling execute we still create them. Eventually this will be addressed in collate, but for now we should save time by manually calling the body of .execute and leaving off the aggregation tables.

Incorporate timechop

Timechop is close to done; once it is, we should incorporate it here instead of all the placeholder timechop madness in triage.pipeline.

Standardize PipelineBase on factory pattern

Utilizing the factory pattern for all serializable component arguments (i.e., FeatureGenerator, LabelGenerator, ModelTrainer) makes parallel processing much easier. The prediction/evaluation loop is implemented in a more ad-hoc and verbose way; we can convert it, as well as all other components, in one fell swoop to improve readability and future parallelization.

Support n_jobs for models that allow them

Currently, adding n_jobs to model parameters yields:

INFO:root:Training models
Traceback (most recent call last):
  File "run.py", line 31, in <module>
    pipeline.run()
  File "/mnt/data/economic_development/san_jose_housing/venv/lib/python3.5/site-packages/triage/pipeline.py", line 149, in run
    misc_db_parameters=dict(test=False)
  File "/mnt/data/economic_development/san_jose_housing/venv/lib/python3.5/site-packages/triage/model_trainers.py", line 358, in train_models
    replace
  File "/mnt/data/economic_development/san_jose_housing/venv/lib/python3.5/site-packages/triage/model_trainers.py", line 355, in <listcomp>
    model_id for model_id in self.generate_trained_models(
  File "/mnt/data/economic_development/san_jose_housing/venv/lib/python3.5/site-packages/triage/model_trainers.py", line 295, in generate_trained_models
    for class_path, parameters in self._generate_model_configs(grid_config):
  File "/mnt/data/economic_development/san_jose_housing/venv/lib/python3.5/site-packages/triage/model_trainers.py", line 88, in _generate_model_configs
    for parameters in ParameterGrid(parameter_config):
  File "/mnt/data/economic_development/san_jose_housing/venv/lib/python3.5/site-packages/sklearn/grid_search.py", line 116, in __iter__
    for v in product(*values):
TypeError: 'int' object is not iterable
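The traceback points at scikit-learn's ParameterGrid, which expects every value in the grid dict to be an iterable of candidate settings. A minimal sketch of the underlying mechanics and the fix (the grid keys here are illustrative): wrap scalar parameters like n_jobs in a list.

```python
from itertools import product

def expand_grid(parameter_config):
    """Expand a dict of parameter lists into all combinations,
    mimicking sklearn's ParameterGrid."""
    keys = sorted(parameter_config)
    values = [parameter_config[k] for k in keys]
    # product(*values) raises "TypeError: 'int' object is not iterable"
    # if any value is a bare scalar rather than a list
    return [dict(zip(keys, combo)) for combo in product(*values)]

# Broken: n_jobs given as a bare int
# expand_grid({'n_estimators': [100, 500], 'n_jobs': 4})  # TypeError

# Fixed: wrap scalars in a list
grid = expand_grid({'n_estimators': [100, 500], 'n_jobs': [4]})
```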

Add in feature dictionary generation

Feature Dictionary generation is being removed from timechop and needs to be integrated here. Architect.chop_data now takes in a feature_dictionary

Fix scoring and add new functions

Fix two bugs in the scoring function:

  • the parameters passed to generate_binary_at_x
  • pass the sorted labels to the scores for metrics with thresholds, to keep the order of the binary and labels consistent.

Adding the following functions:

  • accuracy
  • roc_auc
  • average precision
  • true positives@
  • true negatives@
  • false positives@
  • false negatives@
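As a sketch of what the thresholded counts above compute (the function and argument names are illustrative, not Triage's actual API): sort predictions, treat the top k entities as predicted positive, and count confusion-matrix cells against the true labels.

```python
def counts_at_top_k(scores, labels, k):
    """Confusion-matrix counts when the top-k scored entities
    are predicted positive and the rest negative."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    predicted = [0] * len(scores)
    for i in order[:k]:
        predicted[i] = 1
    tp = sum(1 for p, y in zip(predicted, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(predicted, labels) if p == 1 and y == 0)
    tn = sum(1 for p, y in zip(predicted, labels) if p == 0 and y == 0)
    fn = sum(1 for p, y in zip(predicted, labels) if p == 0 and y == 1)
    return tp, fp, tn, fn
```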

Pipeline should allow injectable components

There are some components used by the pipeline that may require different implementations but can implement the same interface, but need to be executed in the model loop. For instance, the LabelGenerator we have is quite simple and may not work that way for everybody. When instantiating the pipeline, the user should be able to inject any needed overriding classes.
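One way to sketch the injection idea (the class names beyond LabelGenerator are hypothetical, and this is not Triage's actual constructor signature): accept an optional component instance in the constructor and fall back to the default implementation.

```python
class DefaultLabelGenerator:
    def generate(self, as_of_date):
        return f"labels as of {as_of_date}"

class Pipeline:
    def __init__(self, label_generator=None):
        # any object with a .generate() method can be injected
        self.label_generator = label_generator or DefaultLabelGenerator()

    def run(self, as_of_date):
        return self.label_generator.generate(as_of_date)

class CustomLabelGenerator:
    """A user-supplied override implementing the same interface."""
    def generate(self, as_of_date):
        return f"custom labels as of {as_of_date}"
```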

ModelTrainer should yield trained models

To support a storage-constrained environment, modeling loops often want to process the model without storing it anywhere. We can in theory support this with the InMemoryModelStorageEngine being added in #9 already, which can be passed to the Predictor.

But to enable these to be processed one at a time, we should yield model_ids from train_models instead of returning a big list.

To do this right, we will probably have to make the InMemoryModelStorageEngine get rid of references somehow, otherwise it will just keep them around by default. The quickest way is probably to expose a .delete_store method that can be called when the project is done with all of the test dates for that model. There's no reason to limit this to in-memory, either. Deleting s3-cached or filesystem-cached models also makes sense if the user is trying to save space/money.
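The shape of the change might look like this (a sketch under assumed names; the real train_models signature differs): yield each model_id as it is trained instead of accumulating a list, and let the caller delete the cached store once it is done with that model.

```python
class ModelTrainer:
    def __init__(self, storage_engine):
        self.storage_engine = storage_engine

    def train_models(self, grid_config):
        """Yield model ids one at a time so callers can process and
        discard each model before the next one is trained."""
        for model_id, params in enumerate(grid_config):
            self.storage_engine.write(model_id, params)
            yield model_id

class InMemoryStore(dict):
    def write(self, model_id, model):
        self[model_id] = model

    def delete_store(self, model_id):
        # free the reference once all test dates are done for this model
        self.pop(model_id, None)

store = InMemoryStore()
trainer = ModelTrainer(store)
trained = []
for model_id in trainer.train_models([{'max_depth': 2}, {'max_depth': 5}]):
    trained.append(model_id)
    store.delete_store(model_id)
```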

model group: Add label_config

Both the db model and stored procedure should support label_config. This is a dict/json meant to encode any configuration used to create labels for this model group. The contents will vary from project to project.

Don't call model.predict

Restrict to binary subsetting of predict_proba. Also use

if (self.model_name == "SGDClassifier" and clf.loss in ("hinge", "perceptron")) \
        or self.model_name == "linear.SVC":
    res = list(clf.decision_function(test_x))
else:
    res = list(clf.predict_proba(test_x)[:, 1])

Switch back to outcome_date

When generating labels, we currently look for a 'date' column, but this is wrong. Use outcome_date to match the YAML comments.

Support parallelism

Although it is possible to support parallelism of triage components by using specific triage components instead of the whole pipeline, it should also support some simple version of this out of the box.

An injected executor could be a good way to do this (ie SimpleExecutor, MultiprocessingExecutor).

A better first pass, now that pipeline.run() is more compact, would be to subclass Pipeline and only reimplement .run().

Add in feature subsetting

We need two classes, that can be configured via the experiment config (names subject to change):

FeatureGroup

FeatureSubsettingStrategy

These are configurable per experiment, and thus their definition must be YAML-serializable. The former defines a way to create feature groups, whether by table name, time, aggregation, etc. To start we are just implementing table names. The class will take in its configuration in the constructor, and have a method that takes in a feature dictionary and outputs the subset of that feature dictionary.

Lists of groups are paired with a strategy (leave one out/leave one in/use all combinations). The strategy, given a list of groups, can output an iterator/list of feature dictionaries.
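A sketch of how groups and a leave-one-out strategy could compose (the names follow the proposal above, but the implementation details are assumptions):

```python
def table_name_group(feature_dictionary, table):
    """Subset a feature dictionary to a single table's features."""
    return {t: cols for t, cols in feature_dictionary.items() if t == table}

def leave_one_out(feature_dictionary, tables):
    """Yield one feature dictionary per group, each omitting one table."""
    for left_out in tables:
        yield {
            t: cols for t, cols in feature_dictionary.items()
            if t != left_out
        }

features = {'zoning': ['count'], 'permits': ['sum', 'avg']}
subsets = list(leave_one_out(features, ['zoning', 'permits']))
```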

Modularize Collate Generation Step

Allow a user to output a list of all the collate objects that would be created from the yaml. This would be useful for debugging and improving queries.

Pipeline should take a ModelStorageEngine instance, not a class

The constructor arguments for different model storage engines differ, so it's not really enough to expect triage.pipeline to instantiate the engine based on a class name. The S3ModelStorageEngine is actually unusable in the pipeline for this reason (it needs an s3_conn). Instead, the pipeline should take an instantiated engine.

Instantiate pipeline components in constructor

Although most individual pipeline components can be instantiated before any work is done, they are still instantiated inside the loop. This makes the function fairly hard to read. They should instead be instantiated in the constructor and just called in .run.

Model hash

The current model hash contains characters like "-0x", which cause trouble with some Unix commands. Let's change it to a more human-readable version of capital letters and numbers, e.g. "J4345GJDK".
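One possible approach (an assumption for illustration, not the adopted implementation): base32-encode a digest of the model's defining metadata, since base32 output uses only capital letters and the digits 2-7, making it safe in filenames and shell commands.

```python
import base64
import hashlib
import json

def human_readable_hash(metadata, length=9):
    """Hash model-defining metadata into capital letters and digits."""
    digest = hashlib.md5(
        json.dumps(metadata, sort_keys=True).encode()
    ).digest()
    # base32 alphabet is A-Z and 2-7; slice before any '=' padding
    return base64.b32encode(digest).decode()[:length]

h = human_readable_hash({'class_path': 'sklearn.tree.DecisionTreeClassifier'})
```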

Make example_experiment_config readability changes

Some feedback:

  • YAML multi-line list format is a bit surprising; it's very easy to miss that high-level dash.
  • Usage of some collate features is not well explained

This mostly applies to the collate section.

Ideas to improve:

  • Use one-line list format (ie, intervals: ['1y', '2y']) whenever possible
  • Use two examples whenever there is a list, especially a multi-line list. For instance, there should be two feature_aggregations here; one that is a single table source, and one that is a multiple table source that is a good example of why you would want to use two tables in one feature aggregation, like similar data that is split between two tables. This helps both explain the collate end and the list end.

Support order by in collate Aggregate

Third argument in collate.Aggregate is an order by clause, which can be used for things like medians. This should be usable via the YAML wrapper.

Update FeatureGenerator to support new collate interface

Collate has gone through a bunch of changes since triage.feature_generators was first implemented. The addition of Categorical, and the overall interface changes are pretty important to access here. The YAML wrapper should support all of this.

Fix scoring memory leak

There appears to be a memory leak in the scoring section of triage. It is using a large amount of memory for scoring some small models/prediction sizes.

Fix requirements

Metta and timechop, not being on PyPI, mess up the installation of triage. They should just be in requirements_dev for now, and projects that need them must install them themselves.

Helpfully report obvious problems with config file

There are some problems with input config files that can be easily and helpfully reported on. For instance: if no aggregates or categoricals are present in a feature_aggregation, that feature_aggregation will produce nothing.

Implementation idea: each component could define a common function to validate a config file without doing anything, so it can, for instance, report a problem with model scoring config before it spends a ton of time training models.

Checklist of items to validate:

  • Ensure that items required to be lists are actually lists (e.g., update windows)
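A minimal sketch of the validation idea (the function and config key names are illustrative): each component exposes a validate function that checks its config section without doing any work, returning problems before any expensive training begins.

```python
def validate_feature_aggregation(config):
    """Return a list of human-readable problems; empty if the config is OK."""
    problems = []
    if not config.get('aggregates') and not config.get('categoricals'):
        problems.append(
            'feature_aggregation has no aggregates or categoricals; '
            'it will produce nothing'
        )
    intervals = config.get('intervals')
    if intervals is not None and not isinstance(intervals, list):
        problems.append('intervals must be a list, e.g. ["1y", "2y"]')
    return problems

problems = validate_feature_aggregation({'intervals': '1y'})
```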

Pass schema onto collate

This is less urgent than when initially created; it turns out that schema in collate refers to the output schema, not input. But this feature in the collate wrapper should still exist.

Test for thresholding bug

It's possible that the code in ModelScorer has a bug with thresholding: when entities share the same prediction value and fall both above and below the threshold line, the ones above may not be included in the thresholded set, and thus any metrics based on that threshold may be inflated.

Write a test for this specifically (and fix a bug if it's there).
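A sketch of such a test (assuming a hypothetical precision_at_k helper; Triage's real scoring API differs): tied scores straddling the cutoff should be handled deterministically and never push a metric outside its valid range.

```python
def precision_at_k(scores, labels, k):
    """Naive top-k precision; ties at the cutoff are broken by index."""
    order = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    top = order[:k]
    return sum(labels[i] for i in top) / k

def test_ties_at_threshold():
    # four entities share the same score; k=2 cuts through the tie
    scores = [0.5, 0.5, 0.5, 0.5]
    labels = [1, 0, 1, 0]
    value = precision_at_k(scores, labels, k=2)
    # whatever tie-breaking rule is used, the result must stay in
    # [0, 1] and be deterministic across runs
    assert 0.0 <= value <= 1.0
    assert value == precision_at_k(scores, labels, k=2)
```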

Add codeclimate badge

Codeclimate integration is free for open-source repos and can help us get a handle on code quality.

Update triage.pipeline to reflect new components

triage.pipeline should serve as a basic example of the pipeline, so it's important to keep this up to date with basic usage of the new components (ModelTrainer, Predictor, ModelScorer)

The README should be updated with these contents as well, and a reference to triage.pipeline module in general.

Create true end-to-end integration test designed to catch insidious pipeline bugs

One of the (many) goals of this project is to prevent data science user errors, like temporal leakage or bad metric calculation. We should take care to avoid reproducing such errors here, or using constituent components (ie collate) in a way that can create those errors.

To help with this, we should create an end-to-end test with a synthetic dataset and grid/scoring config containing some landmines that could trip up a badly designed version of this pipeline. Creating this dataset is not trivial, but it should be worth it.
