
senti_anal's Introduction

Sentiment Analysis

Final project for the MLOps course at DTU.

Badges: codecov · CI pytest · build-docs

Read the docs

Project Description

Overall goal of the project:

Build and run a sentiment analysis model using the pretrained "DistilBERT" model from the huggingface/transformers framework on the amazon_polarity dataset. The dataset contains roughly 35 million Amazon reviews collected up to March 2013 (about 18 years of reviews in total). At the end of the project, the model should analyse new Amazon reviews and classify them as either a positive or a negative rating. The overall goal is to learn to work with the huggingface/transformers library and to apply the various tools/frameworks taught in SkafteNicki/dtu_mlops to set up a proper ML operations project.

As mentioned above, we use the Transformers framework to access the pretrained DistilBERT embeddings and its preprocessing tools (e.g. the tokenizer) for the sentiment analysis. The dataset is loaded directly from the Hugging Face hub. Initially, we used frozen BERT embeddings and added a final classification layer, as proposed in this Jupyter notebook. However, since training took too long with BERT, we switched to DistilBERT, which has only 66 million parameters compared with BERT's 340 million.
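
The core loading step is only a few lines; the snippet below is a minimal sketch of it, assuming the standard `distilbert-base-uncased` checkpoint rather than the project's actual config values:

```python
# Minimal sketch (not the project code): load amazon_polarity and a DistilBERT
# tokenizer/classifier from the Hugging Face hub. The checkpoint name is assumed.
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer

dataset = load_dataset("amazon_polarity")  # splits: train / test; fields: label, title, content
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2  # negative / positive
)

encoded = tokenizer(dataset["train"][0]["content"], truncation=True, return_tensors="pt")
logits = model(**encoded).logits  # shape (1, 2)
```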

Tools planned (or already implemented) to be used in the project:

| Tools / Frameworks / Configurations / Packages | Purpose |
| --- | --- |
| Conda environment | Closed environment to facilitate package handling |
| Wandb | Experiment logging |
| Hydra | Management of config files for training |
| Cookiecutter | Setting up the project environment |
| black, flake8 | Coding style |
| isort | Sorting of imports |
| dvc | Data versioning |
| Google Cloud | File storage, training, deployment |
| Docker | Building train and prediction containers for deployment |
| FastAPI | Project API for the prediction interface |
| Huggingface | Pretrained model, datamodule |

Project Organization

├── LICENSE
├── Makefile           <- Makefile with commands like `make data` or `make train`
├── README.md          <- The top-level README for developers using this project.
├── data
│   ├── external            <- Data from third party sources.
│   ├── interim             <- Intermediate data that has been transformed.
│   ├── processed           <- The final, canonical data sets for modeling.
│   └── raw                 <- The original, immutable data dump.
├── docs               <- A default Sphinx project; see sphinx-doc.org for details
├── models             <- Trained and serialized models, model predictions, or model summaries
├── references         <- Data dictionaries, manuals, and all other explanatory materials.
├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
│                         generated with `pip freeze > requirements.txt`
├── setup.py           <- makes project pip installable (pip install -e .) so src can be imported
├── config             <- Config files for the project
│   ├── data                <- Config files defining the datamodule 
│   ├── hydra               <- Config files defining hydra setup
│   ├── logging             <- Config files defining logging in gcp, wandb
│   ├── model               <- Config files defining used model
│   ├── optim               <- Config files defining model optimizer
│   └── train               <- Config files defining train setup (pl.Trainer, metric, early stopping)
├── models              <- Folder to store pretrained models locally
├── opensentiment      <- Source code for use in this project.
│   ├── __init__.py         <- Makes opensentiment a Python module
│   ├── data                <- Script to download or generate data
│   │   └── make_dataset_pl.py
│   ├── gcp                 <- Scripts defining settings for Google Cloud handling
│   │   └── build_features.py
│   ├── models              <- Scripts to define and train the model, and to use trained
│   │   │                      models to make predictions
│   │   ├── bert_model_pl.py
│   │   ├── predict_model_pl.py
│   │   └── train_model_pl.py
│   └── api                 <- Scripts to create fastAPI
├── setup              <- Files to set up docker (.sh, .yaml, .dockerfile) and pip requirements for CPU and GPU use
│   ├── docker              <- Folder containing all files to build docker images
│   └── pip                 <- Folder containing all files for correct pip setup depending on CPU or GPU
├── requirements.txt       <- General requirements file for the project
├── requirements_gpu.txt   <- Additional requirements file for GPU handling

└── tox.ini             <- tox file with settings for running tox; see tox.readthedocs.io

Minimal Installation

Default configuration (Conda 5.10 / Ubuntu 20.04):

conda create -y --name py39senti python=3.9 pip
conda activate py39senti

# CPU setup
pip install -r requirements.txt
# GPU setup (CUDA 11.3): additionally run
# pip install -r requirements_gpu.txt

# git hooks
pre-commit install
# get data
dvc pull
# verify everything is working
coverage run -m --source=./opensentiment pytest tests

senti_anal's People

Contributors

michaelfeil, johannespischinger, max-27


senti_anal's Issues

create sample dataset

Create a demo dataset of 1-10 MB so we can test / pytest / iterate faster.

Maybe put a copy of it in the repository.
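
A minimal sketch of how such a subset could be produced with the `datasets` library (split sizes and the fixture path are assumptions):

```python
# Hypothetical sketch: carve a small demo subset out of amazon_polarity so
# tests can run without downloading the full dataset. Sizes/paths are assumed.
from datasets import DatasetDict, load_dataset

demo = DatasetDict(
    {
        "train": load_dataset("amazon_polarity", split="train[:2000]"),
        "test": load_dataset("amazon_polarity", split="test[:500]"),
    }
)
demo.save_to_disk("tests/fixtures/amazon_polarity_demo")  # reload with datasets.load_from_disk
```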

hydra model train fails due to relative path

self._accessor.mkdir(self, mode)
PermissionError: [Errno 13] Permission denied: '../../models'
(py39senti) michi@lenovo-michi:~/senti_anal$ 

fails,
but

(py39senti) michi@lenovo-michi:~/senti_anal/opensentiment/models$ python train_model.py
works.

Proposal: we should not override Hydra's run dir path setting and should instead leave it to Hydra.
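
Alternatively, the save path could be resolved against the original working directory via Hydra's utilities; a rough sketch (the config path and the `models` directory are assumptions, not the project's actual values):

```python
# Sketch: build the model output dir from an absolute path so it no longer
# depends on Hydra's per-run working directory. Paths are assumptions.
import os

import hydra
from hydra.utils import to_absolute_path
from omegaconf import DictConfig


@hydra.main(config_path="../../config", config_name="default")
def main(cfg: DictConfig) -> None:
    save_dir = to_absolute_path("models")  # resolves relative to the original cwd
    os.makedirs(save_dir, exist_ok=True)


if __name__ == "__main__":
    main()
```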

protect master branch

amazon_polarity doesn't contain any data file

E FileNotFoundError: The directory at amazon_polarity doesn't contain any data file

../anaconda3/envs/py39senti/lib/python3.9/site-packages/datasets/data_files.py:295: FileNotFoundError
=================================================================== short test summary info ===================================================================
FAILED tests/data/test_make_dataset.py::TestMakeDataset::test_make_dataset - FileNotFoundError: The directory at amazon_polarity doesn't contain any data file

coverage run -m pytest tests -m "not long"

write models/predict_model.py and test_predict_model

idea:

class Predict:
    def __init__(self, hydra_config, ckpt_path):
        self.tokenizer = ...                  # e.g. AutoTokenizer.from_pretrained(...)
        self.model = ...                      # e.g. load the model from ckpt_path

    def predict(self, string):
        tokenized = self.tokenizer(string)
        preds = self.model(tokenized)
        return preds

Use this class in a cloud function, locally, or anywhere else; a rough FastAPI sketch follows below.
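
A hedged sketch of how the (hypothetical) `Predict` class could be wired into an endpoint; the route, query parameter, and checkpoint path are all assumptions:

```python
# Hedged sketch: expose the Predict class through FastAPI. The endpoint name
# and checkpoint path are placeholders, not taken from the project code.
from fastapi import FastAPI

app = FastAPI()
predictor = Predict(hydra_config=None, ckpt_path="models/checkpoint.ckpt")  # assumed path


@app.get("/predict")
def predict(review: str):
    return {"prediction": str(predictor.predict(review))}
```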

Data handling

This issue handles all tasks relevant to data loading and preprocessing.

requested issue tracking

TODOs

Week 1

  • Create a git repository
  • Make sure that all team members have write access to the github repository
  • Create a dedicated environment for your project to keep track of your packages (using conda)
  • Create the initial file structure using cookiecutter
  • Fill out the make_dataset.py file such that it downloads whatever data you need
  • Add a model file and a training script and get that running
  • Remember to fill out the requirements.txt file with whatever dependencies that you are using
  • Remember to comply with good coding practices (pep8) while doing the project
  • Do a bit of code typing and remember to document essential parts of your code
  • Setup version control for your data or part of your data
  • Construct one or multiple docker files for your code
  • Build the docker files locally and make sure they work as intended
  • Write one or multiple configurations files for your experiments
  • Use Hydra to load the configurations and manage your hyperparameters
  • When you have something that works somewhat, remember at some point to do some profiling and see if you can optimize your code
  • Use wandb to log training progress and other important metrics/artifacts in your code
  • Use pytorch-lightning (if applicable) to reduce the amount of boilerplate in your code

Week 2

  • Write unit tests related to the data part of your code
  • Write unit tests related to model construction
  • Calculate the coverage.
  • Get some continuous integration running on the github repository
  • (optional) Create a new project on gcp and invite all group members to it
  • Create a data storage on gcp for your data
  • Create a trigger workflow for automatically building your docker images
  • Get your model training on gcp
  • Play around with distributed data loading
  • (optional) Play around with distributed model training
  • Play around with quantization and compilation for your trained models

Week 3

  • Deployed your model locally using TorchServe
  • Checked how robust your model is towards data drifting
  • Deployed your model using gcp
  • Monitored the system of your deployed model
  • Monitored the performance of your deployed model

Additional

  • Revisit your initial project description. Did the project turn out as you wanted?
  • Make sure all group members have an understanding of all parts of the project
  • Create a presentation explaining your project
  • Uploaded all your code to github
  • (extra) Implemented pre-commit hooks for your project repository
  • [not_targeted] (extra) Used Optuna to run hyperparameter optimization on your model

Additional group defined

  • get docs deployed with every production release
  • log coverage with pipeline runs on codecov.io

open backlog

  • Finalize document tree in readme.md at the end of the project

bug loading dataset from huggingface google drive

When we run training_pl.py inside a docker container, the cached dataset files do not work since they depend on the local path:

File "/usr/local/lib/python3.9/site-packages/datasets/utils/info_utils.py", line 40, in verify_checksums
    raise NonMatchingChecksumError(error_msg + str(bad_urls))
datasets.utils.info_utils.NonMatchingChecksumError: Checksums didn't match for dataset source files:
['https://drive.google.com/u/0/uc?id=0Bz8a_Dbh9QhbaW12WVVZS2drcnM&export=download']

I would suggest saving the train, val, and test datasets in the processed folder as pickle files and loading them with torch.load().
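
A minimal sketch of that suggestion, assuming the `data/processed` folder from the project tree (file names are placeholders):

```python
# Sketch of the suggestion above: persist the dataset splits once with
# torch.save and reload them with torch.load, bypassing the HF cache/download.
from pathlib import Path

import torch
from datasets import load_dataset

processed = Path("data/processed")
processed.mkdir(parents=True, exist_ok=True)

for split in ("train", "test"):
    ds = load_dataset("amazon_polarity", split=split)
    torch.save(ds, processed / f"{split}.pt")  # pickles the Dataset object

train_ds = torch.load(processed / "train.pt")
```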

data loading bug with dvc

The load_dataset function from Hugging Face can't access the dvc-tracked data directory:
--> OSError: [Errno 30] Read-only file system: '/data'

docker-refactoring

Refactor docker files based on new config structure for torch lightning training

deploy this model from GS storage

Trained a distilbert model

@max-27 this is what the output files under ./cache look like:
https://console.cloud.google.com/storage/browser/model_senti_anal/pretrained-distilbert-2022-01-18-13-50-17/

from google.cloud import storage

BUCKET_NAME = ...
MODEL_FILE = ...
local_path = "./model_checkpoint.ckpt"

client = storage.Client()
bucket = client.get_bucket(BUCKET_NAME)
blob = bucket.get_blob(MODEL_FILE)
blob.download_to_filename(local_path)

# load the checkpoint with one of these options
# option 1: directly via the LightningModule class
model2 = model.load_from_checkpoint(local_path)

# option 2: instantiate the model definition via hydra first
model_def: pl.LightningModule = hydra.utils.instantiate(
    cfg.model,
    **{
        "model_name_or_path": cfg.data.datamodule.model_name_or_path,
        "train_batch_size": cfg.data.datamodule.batch_size.train,
    },
    optim=cfg.optim,
    data=cfg.data,
    logging=cfg.logging,
    _recursive_=False,
)

model2 = model_def.load_from_checkpoint(local_path)

The requirements.txt, hydra config, etc. are uploaded as well.

Training model

This issue handles all tasks relevant to training the model.
