reuters-21578's Introduction

Getting Started

There are several ways to get started with the project.

A simple Dockerfile is bundled with the project to make it easier to run. Build a Docker image from the root of the project (where the Dockerfile lives), wait for the build to finish, and then run the image:

docker build . -t whatever
docker run -it whatever --dataset-dir=reuters/dataset

If you'd rather not use Docker, install the dependencies in a virtual environment (or in your global Python installation) and then run the project. The main script is starter.py.

pip install -r requirements/dev.txt 
python starter.py --dataset-dir=reuters/dataset

Keep in mind that the first run takes some time, because all the raw data has to be pre-processed. The result is cached, so subsequent runs of the model should be relatively fast. You can also try swapping Apache Beam's DirectRunner for the DataflowRunner to see how much it speeds up pre-processing.
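The cache-on-first-run behaviour described above can be sketched roughly as follows. This is only an illustration of the idea, not the project's actual code: the helper `load_or_preprocess`, the pickle file name, and the toy `tokenize` step are all hypothetical, and the real pipeline runs in Beam rather than in a plain Python loop.

```python
import pickle
import tempfile
from pathlib import Path


def load_or_preprocess(raw_docs, cache_path, preprocess):
    """Return pre-processed docs, reusing cache_path if it already exists.

    Hypothetical helper, shown only to illustrate the caching idea.
    """
    cache_path = Path(cache_path)
    if cache_path.exists():
        # Subsequent runs: skip the expensive step and read the cached result.
        with cache_path.open("rb") as fh:
            return pickle.load(fh)
    # First run: do the expensive work once, then persist it.
    processed = [preprocess(doc) for doc in raw_docs]
    with cache_path.open("wb") as fh:
        pickle.dump(processed, fh)
    return processed


calls = []  # records how often the expensive step actually runs


def tokenize(doc):
    """Toy stand-in for the real pre-processing step."""
    calls.append(doc)
    return doc.lower().split()


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "preprocessed.pkl"
    docs = ["Wheat prices rise", "Crude oil falls"]
    first = load_or_preprocess(docs, cache, tokenize)   # pre-processes and caches
    second = load_or_preprocess(docs, cache, tokenize)  # served from the cache
```

Here `tokenize` runs only during the first call; the second call reads the pickled result instead.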

The dataset is downloaded on first use.

How to run tests

Run the following in the root of the project. Test coverage is currently very limited.

python -m unittest discover

About

  1. All data first passes through several steps in Beam. This way, features are transformed in a distributed fashion, and the same pipeline can later be put into production to handle billions of rows of data without rewriting the pre-processing steps. Because we're using Beam, TF-Transform can also be employed easily to avoid training-serving skew for real-time or batch loads at large scale.
  2. The SGML format is parsed with BeautifulSoup, a Python library.
  3. One-hot-encoded vectors are used for labels, while an embedding is learned for the input texts.
  4. We do multi-label classification. Training for 1 epoch reaches a top_5 accuracy of about 83%, which is acceptable and would likely improve with more epochs.
  5. A 2-layer stacked LSTM model is used to classify the data, with 2 Dropout layers to reduce overfitting.
  6. There are many ways everything here could be improved. Consider it a proof of concept (PoC).
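To make the label encoding and the top_5 metric from the list above more concrete, here is a minimal pure-Python sketch. It rests on assumptions: the function names are hypothetical, the class list is a small illustrative subset of Reuters topics, and the hit rule (a prediction counts if any true label is among the k highest-scoring classes) is one reasonable reading of top-k accuracy for multi-label data, not necessarily the exact metric the project computes.

```python
def multi_hot(labels, classes):
    """Encode a list of topic labels as a 0/1 vector over `classes`."""
    index = {name: i for i, name in enumerate(classes)}
    vector = [0] * len(classes)
    for label in labels:
        vector[index[label]] = 1
    return vector


def top_k_hit(y_true, scores, k=5):
    """True if any true class is among the k highest-scoring classes."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return any(y_true[i] == 1 for i in ranked[:k])


def top_k_accuracy(y_trues, all_scores, k=5):
    """Fraction of examples with at least one true class in the top k."""
    hits = sum(top_k_hit(t, s, k) for t, s in zip(y_trues, all_scores))
    return hits / len(y_trues)


classes = ["acq", "crude", "earn", "grain", "wheat"]  # illustrative subset
y = [
    multi_hot(["grain", "wheat"], classes),  # a document with two topics
    multi_hot(["earn"], classes),            # a document with one topic
]
scores = [
    [0.10, 0.20, 0.10, 0.50, 0.40],  # "grain" is ranked first
    [0.60, 0.30, 0.05, 0.03, 0.02],  # "earn" is ranked third
]

acc_at_1 = top_k_accuracy(y, scores, k=1)
acc_at_3 = top_k_accuracy(y, scores, k=3)
```

With these toy scores only the first document is a hit at k=1, while both are hits at k=3, which shows why top-k figures are more forgiving than plain accuracy.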

Future Directions

  • Train for more epochs, and on better infrastructure (e.g. GPUs, GCP AI Platform, ...)
  • Use better word/sentence embeddings pre-trained on larger external datasets, like word2vec or Universal Sentence Encoder
  • Use a better recurrent model architecture, probably with windowing
  • Try other models as well: 1D CNNs might be a good fit for this task, as might TF-IDF features or other relevant ideas
  • Use features like "places" or other relevant information in the dataset to improve classification
  • Use BayesianOptimization or a GridSearch to figure out a good set of hyperparameters for the models and vocabularies
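As a rough illustration of the hyperparameter search idea in the last item, an exhaustive grid search can be sketched in a few lines. Everything here is an assumption for illustration: `grid_search` is a hypothetical helper, the parameter names `vocab_size` and `lstm_units` are made up, and `toy_evaluate` is a stand-in for what would really be a full train-and-validate run of the model.

```python
from itertools import product


def grid_search(param_grid, evaluate):
    """Try every combination in param_grid and return the best one.

    `evaluate` maps a dict of parameters to a score (higher is better).
    """
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_evaluate(params):
    """Stand-in for training plus validation; peaks at 20000 / 128."""
    return (
        -abs(params["vocab_size"] - 20000) / 10000
        - abs(params["lstm_units"] - 128) / 128
    )


grid = {
    "vocab_size": [10000, 20000, 40000],  # hypothetical vocabulary sizes
    "lstm_units": [64, 128, 256],         # hypothetical LSTM layer widths
}
best_params, best_score = grid_search(grid, toy_evaluate)
```

The cost grows with the product of the grid sizes (9 evaluations here), which is the main reason to consider Bayesian optimization once each evaluation means training a model.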

Contact

I'm usually quick to respond by email: [email protected]
