
i-tagger's Introduction

Introduction

A simple and modular TensorFlow model development environment for handling sequence-to-sequence models.

Developing a model to solve a problem for the data set at hand requires a lot of trial and error. This includes, but is not limited to:

  • Preparing the ground truth or data set for training and testing
    • Collecting the data from online or open data sources
    • Getting the data from in-house or client database
  • Pre-processing the data set
    • Text cleaning
    • NLP processing
    • Meta feature extraction, etc.
  • Data iterators: loading and looping over the data examples for the model during training and testing
    • In memory: all data is held in RAM and looped over in batches on demand
    • Reading from the disk on demand in batches
    • Maintaining different feature sets (i.e., the number of features and their types) for the model
  • Models
    • Maintaining different models for same set of features
    • A good visualization and debugging environment/tools
    • Start and pause the training at will
  • Model Serving
    • Load a particular model from the pool of available models for a particular data set

Related Work

The following two Git repos got our attention:

Both projects are excellent in their own way. However, they lack a few things, such as modular support for different datasets and models, which plays a key role in customer-facing projects where the nature of the data changes as the project evolves.

Problem Statement

  • To come up with a software architecture for trying different models on different datasets
  • The architecture should take care of:
    • Pre-processing the data
    • Preparing the data iterators for training, validation and testing for a set of features and their types
    • Using a model that aligns with the data iterator's feature types
    • Training the model iteratively, with a fail-safe
    • Using the trained model to predict on new data
  • Keep the model core logic independent of the current architecture
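
The requirement that a model must align with the data iterator's feature types can be sketched with two small abstract base classes. All class and attribute names below (`DataIterator`, `Model`, `feature_types`, `required_features`) are assumptions made for illustration; the actual classes in this repo may look different.

```python
from abc import ABC, abstractmethod


class DataIterator(ABC):
    """Hypothetical base class: declares which features it produces."""

    # Maps feature name to a type tag, e.g. {"word_ids": "int"}.
    feature_types = {}

    @abstractmethod
    def batches(self):
        """Yield batches as dicts keyed by feature name."""


class Model(ABC):
    """Hypothetical base class: declares which features it consumes."""

    required_features = {}

    def check_compatible(self, iterator):
        # Refuse to train when the iterator does not provide a needed feature.
        for name, kind in self.required_features.items():
            found = iterator.feature_types.get(name)
            if found != kind:
                raise TypeError(
                    "feature %r: expected %s, got %s" % (name, kind, found)
                )

    @abstractmethod
    def train(self, iterator):
        """Train on the batches produced by the iterator."""


class WordIterator(DataIterator):
    feature_types = {"word_ids": "int"}

    def __init__(self, rows):
        self.rows = rows

    def batches(self):
        yield {"word_ids": self.rows}


class WordCountModel(Model):
    """Toy model: 'training' just counts the examples it has seen."""

    required_features = {"word_ids": "int"}

    def train(self, iterator):
        self.check_compatible(iterator)
        return sum(len(batch["word_ids"]) for batch in iterator.batches())


class CharModel(Model):
    """Toy model that needs a feature WordIterator does not provide."""

    required_features = {"char_ids": "int"}

    def train(self, iterator):
        self.check_compatible(iterator)
        return 0
```

With this kind of contract, the model's core logic depends only on the declared feature names and types, not on how the dataset was preprocessed.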

Solution or proposal

A few object-oriented principles are used in the Python scripts for ease of extensibility and maintenance.

What did we solve using this code?

  • Top-level accuracies on the open CoNLL-2003 dataset
  • Extracting information from patent documents for form filling, using historical data entries from database records

Current Architecture

  • Handling Dataset and Preprocessing
  • Data iterators
    • A dataset may have one or more features, such as words, characters, positional information of words, etc.
    • Extract those features, convert words/characters to numeric ids, pad them, etc.
    • Enforce the number of features and their types, so that a set of models can work on them down the line
  • Models should agree with the data iterator's feature types and make use of the available features to train on the data
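
The step of converting words to numeric ids and padding them can be sketched as follows. This is a minimal, framework-free illustration; the names `build_vocab` and `encode_and_pad` and the reserved ids are assumptions, not the repo's actual API.

```python
from typing import Dict, List

PAD_ID = 0  # assumed id reserved for padding
UNK_ID = 1  # assumed id reserved for out-of-vocabulary words


def build_vocab(sentences: List[List[str]]) -> Dict[str, int]:
    """Assign an integer id to every distinct word, in order of appearance."""
    vocab = {"<PAD>": PAD_ID, "<UNK>": UNK_ID}
    for sentence in sentences:
        for word in sentence:
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab


def encode_and_pad(
    sentences: List[List[str]], vocab: Dict[str, int], max_len: int
) -> List[List[int]]:
    """Map words to ids, then truncate or pad each sentence to max_len."""
    batch = []
    for sentence in sentences:
        ids = [vocab.get(word, UNK_ID) for word in sentence[:max_len]]
        ids += [PAD_ID] * (max_len - len(ids))
        batch.append(ids)
    return batch
```

The same two functions would work for character features by passing lists of characters instead of lists of words.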

Directory Details

Each experiment starts from a dataset.

Let us use the CoNLL data set, since it is provided as part of this repo:

  • conll_csv_experiments
    • config
      • config.ini # all one time config goes here
    • data
      • train.txt
      • test.txt
      • val.txt
    • preprocessed_data
      • train/
      • val/
      • test/
    • data_iterator_1
      • model_v0
        • config_1
        • config_2
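
To make the layout above concrete, the sketch below builds the paths for one experiment run with `pathlib`. The helper name `experiment_paths` and the dictionary keys are hypothetical; only the directory and file names come from the tree shown above.

```python
from pathlib import Path


def experiment_paths(root, data_iterator, model, config_name):
    """Return the paths used by one experiment run.

    Hypothetical helper that mirrors the directory tree described above.
    """
    root = Path(root)
    return {
        "config": root / "config" / "config.ini",
        "train": root / "data" / "train.txt",
        "val": root / "data" / "val.txt",
        "test": root / "data" / "test.txt",
        "preprocessed": root / "preprocessed_data",
        # Each (data iterator, model, config) combination gets its own folder.
        "model_dir": root / data_iterator / model / config_name,
    }
```

Nesting the model folder under the data iterator folder keeps models that share a feature set together, which matches the idea of maintaining several models for the same set of features.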

Available Models:

Validation

The whole package is tested on the CoNLL data set for software integrity; the results are not tuned yet!

Check here for more details on how to test it on the CoNLL data set.


Setup

Requirements:

  • Python 3.5
  • tensorflow-gpu r1.4
  • spaCy
  • tqdm
  • tmux
  • overrides

How to run on the GPU server: (Imaginea Specific)

# Run the following command for one-time password verification
ssh-copy-id "[email protected]"

ssh [email protected]

# One time setup
tmux new -s your_name
export PATH=/home/rpx/anaconda3/bin:$PATH

### Note: the following environment is already set up;
### there is no need to replicate it unless you want different versions
conda create -n tensorflow-gpu python=3.5 anaconda
export LD_LIBRARY_PATH=/home/rpx/softwares/cudnn6/cuda/lib64:$LD_LIBRARY_PATH
source activate tensorflow-gpu
python --version

Anaconda environment setup: (General Users)

conda create -n tensorflow-gpu python=3.5 anaconda
source activate tensorflow-gpu

Environment setup:

pip install tensorflow_gpu
pip install spacy
python -m spacy download en_core_web_md
pip install tqdm
pip install overrides

Tmux (Imaginea Specific)

cd ~/experiments/
mkdir your_name
cd your_name

git clone https://gitlab.pramati.com/imaginea-labs/i-tagger

Day to day use


tmux a -t your_name

### Run only if your previous tmux session was closed completely
export PATH=/home/rpx/anaconda3/bin:$PATH
export LD_LIBRARY_PATH=/home/rpx/softwares/cudnn6/cuda/lib64:$LD_LIBRARY_PATH
source activate tensorflow-gpu

Learning Materials

!!!!!! WORK IN PROGRESS !!!!!!

Imaginea Patent Tagging

python src/commands/patent_dataset.py --mode=preprocess
python src/commands/patent_dataset.py --mode=train
python src/commands/patent_dataset.py --mode=retrain --model-dir=PATH_TO_MODEL
python src/commands/patent_dataset.py --mode=predict --model-dir=PATH_TO_MODEL --predict-dir=PATH_TO_PREDICTION_FILES
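
For readers who want to see how such a mode-driven command could be wired up, here is a minimal `argparse` sketch. It only mirrors the flags shown above (`--mode`, `--model-dir`, `--predict-dir`); the validation rules and function names are assumptions, and the real `patent_dataset.py` may parse its arguments differently.

```python
import argparse

MODES = ("preprocess", "train", "retrain", "predict")


def build_parser():
    """Build a parser for the flags used in the commands above."""
    parser = argparse.ArgumentParser(description="Patent tagging (sketch)")
    parser.add_argument("--mode", choices=MODES, required=True)
    parser.add_argument("--model-dir", default=None)
    parser.add_argument("--predict-dir", default=None)
    return parser


def parse_args(argv=None):
    """Parse argv and enforce the flag combinations assumed per mode."""
    parser = build_parser()
    args = parser.parse_args(argv)
    # Assumption: retrain and predict need an existing model directory.
    if args.mode in ("retrain", "predict") and not args.model_dir:
        parser.error("--model-dir is required for retrain and predict")
    # Assumption: predict also needs a directory of files to predict on.
    if args.mode == "predict" and not args.predict_dir:
        parser.error("--predict-dir is required for predict")
    return args
```

Calling `parse_args(["--mode=train"])` returns a namespace with `mode == "train"`, while a predict call without `--model-dir` exits with a usage error.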

TODOs:

  • Remove all default params
  • Tune the model for CoNLL dataset
  • Test code and Documentation
  • Cleaning of the code
  • More on LSTM basics/tutorials

Contributors

  • mageswaran1989
  • srikumarks
