helpfulness_prediction's Introduction

Data Description

  • data/json/pos -- All positive files in the original JSON format are saved in this directory. Take a look at the sample data and make sure your data format matches the sample.
  • data/json/neg -- All negative files in the original JSON format are saved in this directory. Take a look at the sample data and make sure your data format matches the sample.
  • data/raw -- Review text is extracted from the JSON files, cleaned, and saved in this directory.
  • data/train -- Data in data/raw is merged and transformed into the format required for fastText training, then saved in this directory.
  • data/model -- Model file will be saved here.
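
As an illustration of the training format mentioned for data/train, here is a minimal sketch. fastText's supervised mode reads one example per line, with the label marked by the `__label__` prefix (its default). The function name `to_fasttext_line` and the label names `pos` and `neg` are assumptions for illustration; the labels and cleaning steps that run_prep.py actually uses may differ.

```python
import re


def to_fasttext_line(text, label):
    """Format one review as a single fastText supervised-training line.

    `label` is a hypothetical class name such as "pos" or "neg".
    """
    # fastText reads one example per line, so collapse newlines, tabs and
    # repeated spaces into single spaces.
    cleaned = re.sub(r"\s+", " ", text).strip()
    return f"__label__{label} {cleaned}"


if __name__ == "__main__":
    print(to_fasttext_line("Great  product,\nworks well. ", "pos"))
```

Each line written to the training file would then start with the label token, followed by the review text.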

How to Prepare Training Data

Run the following commands in a terminal to prepare the training data:

    # under the directory 'data/json'
    # note: 'tail -r' is the BSD/macOS form; on GNU/Linux, 'tac' reverses lines instead
    $ cat pos/* | sed '/\]\[/d' | sed 's/}$/},/g' | tail -r | sed '2s/},$/}/g'| tail -r > reviews_pos.json
    $ cat neg/* | sed '/\]\[/d' | sed 's/}$/},/g' | tail -r | sed '2s/},$/}/g'| tail -r > reviews_neg.json

    # under the directory 'data_prep'
    $ python3 run_prep.py -ip reviews_pos.json -in reviews_neg.json -o reviews_train.txt
    # The first parameter (-ip) is the name of the original positive JSON file.
    # The second parameter (-in) is the name of the original negative JSON file.
    # The third parameter (-o) is the name of the generated training file.
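
For platforms where `tail -r` is unavailable, the `cat | sed | tail -r` merge step above can be mirrored line by line in Python. This is only a sketch of what the shell pipeline does to its input lines: drop every line containing `][`, append a comma to every line ending in `}`, then remove the comma from the second-to-last line. The function name `merge_json_lines` is hypothetical, and the one-object-per-line layout in the demo is an assumption about the sample data.

```python
import json


def merge_json_lines(chunks):
    """Apply the same line edits as the shell pipeline to concatenated files.

    `chunks` is a list of file contents, in the order `cat` would read them.
    """
    lines = "".join(chunks).split("\n")
    # A trailing newline leaves one empty string at the end; drop it.
    if lines and lines[-1] == "":
        lines.pop()
    # sed '/\]\[/d': delete lines where one file's "]" meets the next file's "["
    lines = [ln for ln in lines if "][" not in ln]
    # sed 's/}$/},/g': add a comma after every line that ends with "}"
    lines = [ln + "," if ln.endswith("}") else ln for ln in lines]
    # tail -r | sed '2s/},$/}/g' | tail -r: fix the second-to-last line only
    if len(lines) >= 2 and lines[-2].endswith("},"):
        lines[-2] = lines[-2][:-1]
    return "\n".join(lines) + "\n" if lines else ""


if __name__ == "__main__":
    # Assumed layout: a JSON-like array with one object per line, no commas.
    demo = ['[\n{"a": 1}\n{"a": 2}\n]', '[\n{"a": 3}\n]\n']
    merged = merge_json_lines(demo)
    print(json.loads(merged))
```

To apply it to real files, one could read each file under `pos/` in sorted order (the order a shell glob produces), pass the list of contents to `merge_json_lines`, and write the result to `reviews_pos.json`.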

How to Train the Classifier

First, install the fastText library into your Python 3 environment:

    $ git clone https://github.com/facebookresearch/fastText.git
    $ cd fastText
    $ pip install .

Second, you may need to download a pre-trained word vector file from the official fastText website (the training command below uses wiki-news-300d-1M-subword.vec).

Finally, divide the whole dataset into a training set and a validation set, and train on the training set:

    # under the directory 'data/train'
    $ wc reviews_train.txt
    # assume the output is 1811  100354  555352 reviews_train.txt
    $ head -n 1511 reviews_train.txt > reviews.train
    $ tail -n 300 reviews_train.txt > reviews.valid

    # under the directory 'main'
    $ python3 main.py --mode train --model ../data/model/reviews.bin --train ../data/train/reviews.train --wordvector ../data/wiki-news-300d-1M-subword.vec
    # training hyperparameters are not exposed as command-line options here, but you can modify them directly in main.py
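
The `head`/`tail` split above can also be sketched in Python. The function below holds out the last `n_valid` lines, which matches the shell commands when 300 of 1811 lines are held out. The optional shuffle is an addition, not part of the original steps: it may help if the training file happens to be ordered by class, which the source does not state. `split_lines` and its parameters are hypothetical names.

```python
import random


def split_lines(lines, n_valid, shuffle=False, seed=0):
    """Return (train, valid), holding out the last n_valid lines.

    With shuffle=True the lines are shuffled first with a fixed seed,
    so the split is reproducible.
    """
    if not 0 < n_valid < len(lines):
        raise ValueError("n_valid must be between 1 and len(lines) - 1")
    lines = list(lines)
    if shuffle:
        random.Random(seed).shuffle(lines)
    cut = len(lines) - n_valid
    return lines[:cut], lines[cut:]


if __name__ == "__main__":
    demo = [f"__label__pos review {i}" for i in range(1811)]
    train, valid = split_lines(demo, 300)
    print(len(train), len(valid))
```

Without shuffling, the result is the same partition that `head -n 1511` and `tail -n 300` produce for a file of 1811 lines.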

How to Test the Classifier

    # under the directory 'main'
    $ python3 main.py --mode test --model ../data/model/reviews.bin --test ../data/train/reviews.valid
    # 2 files (correctly predicted and wrongly predicted) will be generated in the same directory as the test file
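
The split into correctly and wrongly predicted examples can be sketched as follows. This is not the code in main.py; it is a minimal illustration that assumes each test line starts with a `__label__...` token followed by the review text. `split_by_correctness` and the `predict` callable are hypothetical names.

```python
def split_by_correctness(lines, predict):
    """Partition labelled lines into (correct, wrong) lists.

    `predict` is any callable that maps review text to a label string
    such as "__label__pos".
    """
    correct, wrong = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # The first space-separated token is the true label.
        label, _, text = line.partition(" ")
        (correct if predict(text) == label else wrong).append(line)
    return correct, wrong


if __name__ == "__main__":
    # Stand-in predictor for the demo; a trained model would go here.
    def fake_predict(text):
        return "__label__pos" if "good" in text else "__label__neg"

    demo = ["__label__pos good stuff", "__label__neg good but broken"]
    print(split_by_correctness(demo, fake_predict))
```

With the fastText Python bindings, `predict` could wrap a loaded model, for example `lambda text: model.predict(text)[0][0]` after `model = fasttext.load_model(...)`. Whether main.py does it this way is an assumption.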

helpfulness_prediction's People

Contributors

rannichan
