
How to Do Text Binary Classification with BERT in Less than Ten Lines of Code?

Motivation

We all know BERT is a compelling language model that has already been applied to a variety of downstream tasks, such as sentiment analysis and question answering (QA).

Have you ever tried it on text binary classification?

Honestly, until the beginning of this week, my answer was still NO.

Why?

Because the example code in BERT's official GitHub repo was not very user-friendly.

Firstly, I want an IPython Notebook, instead of a Python script file, for I want to get instant feedback when I run a code chunk. Of course, a Google Colab Notebook would be better, for I can use the code right away with the free GPU.

Secondly, I don't want to know any details except the ones I care about. For example, I want to control the useful parameters, such as the number of epochs and the batch size. But do I really need to know about all the "processors," "flags," and logging functions?

I have become a spoiled machine learning user after trying other user-friendly frameworks.

For example, in Scikit-learn, if you try to build a tree classifier, here is (almost) all your code.

from sklearn.datasets import load_iris
from sklearn import tree

# Load the built-in iris dataset and fit a decision tree classifier.
iris = load_iris()
clf = tree.DecisionTreeClassifier()
clf = clf.fit(iris.data, iris.target)

If you want to do image classification in fast.ai, you need only these lines.

!git clone https://github.com/wshuyi/demo-image-classification-fastai.git
from fastai.vision import *

# Load images from folders, fine-tune a ResNet-18, and inspect the results.
path = Path("demo-image-classification-fastai/imgs/")
data = ImageDataBunch.from_folder(path, test='test', size=224)
learn = cnn_learner(data, models.resnet18, metrics=accuracy)
learn.fit_one_cycle(1)
interp = ClassificationInterpretation.from_learner(learn)
interp.plot_top_losses(9, figsize=(8, 8))

Not only can you get the classification result, but an activation map as well.

Why on earth can't Google Developers give us a similar interface to use BERT for text classification?

On Monday, I found this Colab Notebook. It's an example of predicting the sentiment of movie reviews.

I was so excited, for I learned that BERT is now included in TensorFlow Hub.

However, when I opened it, I found there were still too many details for a user who only cares about applying BERT to text classification.

So I tried to refactor the code, and I succeeded.

Notebook

Please follow this link, and you will see the .ipynb notebook file on GitHub.

Click the "Open in Colab" Button. Google Colab will be opened automatically.

You need to save a copy to your own Google Drive by clicking on the "COPY TO DRIVE" button.

You only need to do three things after that (a compact preview follows the list).

  1. Prepare the data as pandas DataFrames. I guess that's easy for most deep learning users.
  2. Adjust four parameters if necessary.
  3. Run the notebook and get your results displayed.
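
Here is a minimal sketch of those three steps shown together, so you can see the whole workflow at a glance. It assumes train and test are already loaded as pandas DataFrames, and it uses the helper functions run_on_dfs and pretty_print that the notebook defines for you; each step is explained below.

train = train.sample(len(train))  # step 1: shuffle the training data

myparam = {  # step 2: set the parameters
    "DATA_COLUMN": "text",
    "LABEL_COLUMN": "sentiment",
    "LEARNING_RATE": 2e-5,
    "NUM_TRAIN_EPOCHS": 10
}

result, estimator = run_on_dfs(train, test, **myparam)  # step 3: train and evaluate
pretty_print(result)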

I will explain each of them in detail.

When you open the notebook, you may feel angry.

You liar! You promised fewer than ten lines!

Calm down.

You don't need to change anything before this line.

So go to that line, and click the "Run before" button.

Let us focus on the really important part.

You need to get the data ready.

My example is a sample dataset of IMDB reviews. The training set contains 1,000 positive and 1,000 negative samples, while the testing set contains 500 positive and 500 negative samples.

!wget https://github.com/wshuyi/info-5731-public/raw/master/imdb-sample.pickle

import pickle

# Load the pickled (train, test) pair of pandas DataFrames.
with open("imdb-sample.pickle", 'rb') as f:
    train, test = pickle.load(f)

I used it in my INFO 5731 class at UNT to let students compare the results of the textblob package, a Bag-of-Words model, a simple LSTM with word embeddings, and ULMFiT.

Now I think I can add BERT into the list, finally.

You need to run the following line to make sure the training data is shuffled correctly.

# Sampling all the rows without replacement shuffles the DataFrame.
train = train.sample(len(train))
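
As a side note, an equivalent pandas idiom (just an alternative; the notebook uses the line above) also resets the scrambled index:

# frac=1 samples 100% of the rows; reset_index(drop=True)
# discards the old, now-shuffled index labels.
train = train.sample(frac=1).reset_index(drop=True)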

Now you can look into your data and see if everything looks right.

train.head()

Your dataset should be stored in pandas DataFrames: one training set, called train, and one testing set, called test.

Both of them should contain at least two columns. One column is for the text, and the other is for the binary label. It is highly recommended to use 0 and 1 as the label values.
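
If your raw data is not in this shape yet, here is a minimal sketch of how you might build a compliant DataFrame; the example texts and the "pos"/"neg" labels are made up purely for illustration:

import pandas as pd

# Hypothetical raw data, just for illustration.
texts = ["A wonderful film.", "Utterly boring.", "Loved every minute."]
raw_labels = ["pos", "neg", "pos"]

# Map the string labels to the recommended 0/1 values.
train = pd.DataFrame({
    "text": texts,
    "sentiment": [1 if lab == "pos" else 0 for lab in raw_labels],
})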

Now that your data is ready, you can set the parameters.

myparam = {
    "DATA_COLUMN": "text",
    "LABEL_COLUMN": "sentiment",
    "LEARNING_RATE": 2e-5,
    "NUM_TRAIN_EPOCHS": 10
}

The first two parameters are just the names of the columns in your DataFrame. You can change them accordingly.

The third parameter is the learning rate. You can read the original paper to figure out how to select it wisely, or you can simply use this default setting.

The last parameter sets how many epochs you want BERT to run. I chose 10 here, for the training dataset is very small, and I don't want it to overfit.
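
For example, if your own DataFrame used columns named review and label instead (hypothetical names, just for illustration), the parameters might look like this:

# Hypothetical column names; adjust them to match your own DataFrame.
myparam = {
    "DATA_COLUMN": "review",
    "LABEL_COLUMN": "label",
    "LEARNING_RATE": 2e-5,  # the default used in this post
    "NUM_TRAIN_EPOCHS": 3   # the BERT paper suggests 2-4 epochs for fine-tuning
}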

Okay. Now you can run BERT!

result, estimator = run_on_dfs(train, test, **myparam)

Warning! This line will take some time to run.

When you see a message like this, you know the training phase has finished.

So you can run the last line to get the evaluation result of your classification model (built on BERT) in a pretty format.

pretty_print(result)

For such a small training set, I think the result is quite good.

That's all.

Now you can use this state-of-the-art language modeling technique to train your own text binary classifier too!

By the way, if you are interested, please help me package the code before that line so that it looks even more straightforward. Thanks!

