
deep-learning-language-model's Introduction

Read this in other languages: Español.

Training a Deep Learning Language Model Using Keras and TensorFlow

This Code Pattern will guide you through installing Keras and TensorFlow, downloading Yelp review data, and training a language model using recurrent neural networks (RNNs) to generate text.

This code pattern was inspired by a Hacker Noon blog post and made into a notebook. The approach for generating text is similar to the Keras text generation demo, with a few changes to suit this scenario.

Using deep learning to generate information is a hot area of research and experimentation. In particular, the RNN deep learning approach can be applied to a seemingly boundless range of areas, whether as solutions to current problems or applications in budding future technologies. One of those areas is text generation, which is what this Code Pattern introduces. Text generation is used in machine translation and spelling correction, and all of these are built on something called a language model. This Code Pattern runs through exactly that: a language model using a Long Short-Term Memory (LSTM) RNN. For some context, RNNs use networks with hidden layers of memory to predict the next step with the highest probability. Unlike feed-forward networks such as Convolutional Neural Networks (CNNs), which only move information forward through the pipeline, RNNs feed each step's hidden state back into the network, which is the "memory" mentioned above. By doing this, an RNN can use the input text to learn how to generate the next letters or characters as its output. The output then goes on to form a word, which eventually ends up as a collection of words, sentences and paragraphs. The LSTM part of the model allows you to build an even bigger RNN with improved learning of long-term dependencies, or better memory. This means improved performance for the words and sentences we're generating!
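To make the character-level idea concrete, here is a minimal sketch of how text is typically encoded for such a model. Everything in it (the sample text, the window size, the variable names) is illustrative, not the notebook's exact code:

    import numpy as np

    # the model sees a fixed window of characters and predicts a
    # probability distribution over the next character
    text = "the food was great and the service was friendly"
    chars = sorted(set(text))
    char_indices = {c: i for i, c in enumerate(chars)}
    indices_char = {i: c for i, c in enumerate(chars)}

    maxlen = 10  # how many characters of context the model conditions on
    window = text[:maxlen]

    # one-hot encode the window as model input
    x = np.zeros((1, maxlen, len(chars)))
    for t, ch in enumerate(window):
        x[0, t, char_indices[ch]] = 1.0

    # a trained model would then map x to next-character probabilities:
    #   preds = model.predict(x)[0]
    #   next_char = indices_char[np.argmax(preds)]

Each generated character is appended to the window and the process repeats, which is how single characters accumulate into words, sentences and whole reviews.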

This model is relevant because text generation is increasingly in demand to solve translation, spelling and review problems across several industries. In particular, fighting fake reviews plays a big role in this area of research. Fake reviews are a very real problem for companies such as Amazon and Yelp, which rely on genuine user reviews to vouch for the products and businesses featured on their sites. As of this writing, it is very easy for a business to pay for fake, positive reviews that ultimately elevate its sales and revenue, and it is equally easy to generate negative reviews for competing businesses. Unfortunately this fraudulently steers users toward places and products, and can lead to someone having a negative experience or worse. To combat these abuses, text generation can be used to learn what a generated review looks like versus a genuine review written by an authentic user. This Code Pattern walks you through the steps to create this text generation at a high level. We will approach the problem with a small portion of Yelp review data, but afterwards you can apply the same steps to a much larger set and test what output you get!

The original Yelp data used in this Code Pattern can be found on Kaggle and as a zip file in this repository.

But what are Keras and TensorFlow?

If you've clicked on this Code Pattern, I imagine you range from a deep learning intermediate to a beginner who's interested in learning more. Wherever you are on that spectrum, you probably think machine learning and deep learning are awesome (I agree) and have probably been exposed to some examples and terminology. You probably also know some Python and understand that deep learning is a subset of machine learning (which is itself a subset of artificial intelligence). With those assumptions in mind, I can explain that TensorFlow is an open-source software library for machine learning that lets you keep track of all of your models and inspect them with cool visualizations.

Keras is a deep learning library that you can use in conjunction with TensorFlow and several other deep learning backends. Keras is very user-friendly in that it lets you build a complete model (for example, one using RNNs) with very few lines of code, which makes every data scientist's life (including mine) much easier! This project highlights exactly that feature with its relatively small amount of code. Keras also lets you switch between backends depending on what you're trying to do, saving you a great deal of headache and time. Let's get started, shall we?

Flow

  1. After installing the prerequisites, Keras and TensorFlow, the user executes the notebook.
  2. The training data is used to train a language model.
  3. New text is generated based on the model and returned to the user.

Included components

  • Keras: The Python Deep Learning library.
  • TensorFlow: An open-source software library for Machine Intelligence.

Featured technologies

  • Cloud: Accessing computer and information technology resources through the Internet.
  • Data Science: Systems and scientific methods to analyze structured and unstructured data in order to extract knowledge and insights.
  • Python: Python is a programming language that lets you work more quickly and integrate your systems more effectively.

Prerequisites

  1. Ensure Python 3.0 or greater is installed.

  2. Ensure the following system libraries are installed:

    • pip (to install Python libraries)

    • gcc (to compile SciPy)

      • For a Mac, pip comes installed when you install Python. Example:
      brew install python
      brew install gcc
      
      • For other operating systems, try:
      sudo easy_install pip
      
      and install gcc through your system's package manager. (gfortran is no longer a separate install; GNU Fortran now ships as part of gcc.)
      
  3. Ensure the following Python libraries are installed (TensorFlow and Keras are covered in the Steps below):

    • numpy
    • scipy
    • pandas
    • zipfile36
    • h5py (Keras needs it to load and save model weights)
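With pip available, one way to install all of them at once (assuming pip points at your Python 3 installation):

    pip install numpy scipy pandas zipfile36 h5py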

Steps

This pattern runs through the steps below. Check out the notebook for the code!

  1. Download and install TensorFlow and Keras
  2. Clone the repository
  3. Train a model
  4. Analyze the results

1. Download and install TensorFlow and Keras

  • Install TensorFlow.

    See the TensorFlow install documentation for directions on how to install TensorFlow on all supported operating systems. In most cases, a simple pip install tensorflow will install the TensorFlow Python library.

  • Install Keras.

    See the Keras install documentation for directions on how to install Keras on all supported operating systems. In most cases, a simple pip install keras will install the Keras Python library.
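After both installs, a quick sanity check (assuming python is the same interpreter that pip installed into) confirms the libraries import cleanly:

    python -c "import tensorflow; import keras"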

2. Clone the repository

  • Clone this repository and change into the new directory:
git clone https://github.com/IBM/deep-learning-language-model
cd deep-learning-language-model

A few things to mention about the contents of the repository:

A quick note about the test data: this data lets us use authentic Yelp reviews as input to our language model, which means the model will iterate over the reviews and generate similar Yelp reviews. If a different dataset were used, such as a novel by Hemingway, we would instead generate text that is stylistically similar to Hemingway. The model we build in the notebook considers certain features, their relevance to the English language, and how those features contribute to building reviews. Once you feel familiar enough with the notebook, you can use the entire dataset.

A quick note about the transfer weights model: the weights are what allow us to fine-tune the model and increase accuracy as it learns. You won't have to worry about adjusting the weights here; the model adjusts them automatically and saves them to transfer_weights after it runs.
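In Keras terms, that round trip is just a load before training and a save after it. A minimal sketch, assuming model is the compiled LSTM model defined in the notebook:

    # pick up the previously learned weights before training resumes
    model.load_weights("transfer_weights")

    # ... training (model.fit) runs here ...

    # persist the updated weights so the next run builds on them
    model.save_weights("transfer_weights")

Now let's get to training the model.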

3. Train a model

  • Make sure you collect all of the files that you downloaded into the same folder.
  • Run transfer_learn.ipynb by running the cell with the code in it.
  • Once you've executed the cell, you should see TensorFlow start up, then various epochs running on your screen, followed by generated text at increasing diversities.

The figure above shows a user running the notebook locally.

To help understand what's going on here: in transfer_learn.ipynb we use a Keras sequential model of the LSTM variety mentioned earlier, which lets us include hidden layers of memory and generate more accurate text. Here maxlen is set to None; it refers to the maximum length of each input sequence and can be None or an integer. We then use the Adam optimizer with categorical_crossentropy loss and begin by loading our transfer_weights. We define the sample function with a temperature of 0.6. The temperature is a parameter that can be used in the softmax function to control how much novelty the sampling allows: values near 0 constrict sampling and lead to less diverse, more repetitive text, while values near 1 produce far more diverse text. In this case we lean slightly toward the conservative side, though this particular model generates multiple versions of the output across a sliding scale of diversities, so we can compare them and see which works best for us. You'll also see the use of enumerate, which gives us an automatic counter as we loop over the data. We then train the model and save our weights into transfer_weights. Every time you train the model, you save your learned weights, which helps improve the accuracy of your text generation.
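For reference, the sampling helper follows the pattern of the Keras text generation demo; this is a sketch, and the notebook's version may differ in detail:

    import numpy as np

    def sample(preds, temperature=0.6):
        # rescale the predicted probabilities by the temperature...
        preds = np.asarray(preds).astype('float64')
        preds = np.log(preds) / temperature
        exp_preds = np.exp(preds)
        preds = exp_preds / np.sum(exp_preds)
        # ...then draw a single character index from the rescaled distribution
        probas = np.random.multinomial(1, preds, 1)
        return np.argmax(probas)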

As you can see in the diagram above, the input text is sent through several layers of the LSTM model (forward and backward) and then through a dense layer followed by a softmax layer. The model iterates over the data at the character level: it considers context such as the previous characters to determine the next character, ultimately forming words and then sentences to generate an entire review.
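A sketch of that architecture in Keras follows. The layer sizes, window length and vocabulary size below are illustrative assumptions, not the notebook's exact values:

    from keras.models import Sequential
    from keras.layers import LSTM, Dense, Activation, Bidirectional

    maxlen = 40      # length of each input character window (assumed)
    num_chars = 61   # size of the character vocabulary (assumed)

    model = Sequential()
    # stacked LSTM layers that read the sequence forward and backward
    model.add(Bidirectional(LSTM(128, return_sequences=True),
                            input_shape=(maxlen, num_chars)))
    model.add(Bidirectional(LSTM(128)))
    # dense layer followed by a softmax over the next-character distribution
    model.add(Dense(num_chars))
    model.add(Activation('softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adam')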

4. Analyze the results

As you can see in the image below, you should expect to see text being generated at different diversities, with the learned weights then saved back to transfer_weights. From this output you can compare what the model produces at different diversity settings (more diverse versus more repetitive).

The figure above shows the result of running the notebook: a generated Yelp review.

Congrats! You've now learned how to generate text from the data you feed the model. You can challenge yourself further by trying the entire Yelp dataset or other text data! You got this!

Links

  • Watson Studio
  • Jupyter Notebook: An open source web application that allows you to create and share documents that contain live code, equations, visualizations, and explanatory text.
  • Keras: The Python Deep Learning library.
  • TensorFlow: An open-source software library for Machine Intelligence.

Learn more

  • Data Analytics Code Patterns: Enjoyed this Code Pattern? Check out our other Data Analytics Code Patterns
  • AI and Data Code Pattern Playlist: Bookmark our playlist with all of our Code Pattern videos
  • Watson Studio: Master the art of data science with IBM's Watson Studio
  • Spark on IBM Cloud: Need a Spark cluster? Create up to 30 Spark executors on IBM Cloud with our Spark service

License

This code pattern is licensed under the Apache Software License, Version 2. Separate third party code objects invoked within this code pattern are licensed by their respective providers pursuant to their own separate licenses. Contributions are subject to the Developer Certificate of Origin, Version 1.1 (DCO) and the Apache Software License, Version 2.

Apache Software License (ASL) FAQ

deep-learning-language-model's People

Contributors

dolph, imgbot[bot], jdgomezb, kant, ljbennett62, madisonjmyers, rhagarty, stevemar, stevemart, tqtran7, wwalisa, yil532


deep-learning-language-model's Issues

questions about the files in 'code'

Hey @MadisonJMyers -- I have a few questions about some of the files in 'code', and in general about how this all works.

  1. Is DSX used? I'm not seeing a Jupyter notebook in the repo. Note that DSX isn't a requirement, but I just wanted to make sure since I thought I saw it mentioned in the README.
  2. Can you add comments to https://github.com/MadisonJMyers/deep-learning-language-model/blob/master/code/transfer_learn.py? I can read the code and understand what you're doing, but I don't understand why things are being done. Comments could help in that regard.
  3. The https://github.com/MadisonJMyers/deep-learning-language-model/blob/master/code/transfer_weights_link file is weird -- would it make more sense to just link out to the Box note from the README?
  4. The Box location lets me download a file, which is cool, but the file doesn't have an extension -- it's just transfer_weights -- and it doesn't open in a text editor, so I assume it's a binary file. Can you elaborate on this?

cc @rhagarty

Spanish version for this tutorial

Hi there, is it OK if I fork this project and translate everything to Spanish? I saw there's only support for Chinese.
Can I open a pull request to this repo?

Thanks
JD

Screen shots

Can we add some descriptive text around the screen shots shown in steps 1 and 2? Not sure what they are trying to convey to the user.

Also, screen shot in step 5 is way too small to read.

Get the user to just clone the repo?

In #11 I fixed some syntax and style errors, but as I read through the steps I realized that a lot of them dealt with "download the data folder", "download the notebook", and "download the code folder".

I suggest we place as many things in one folder as possible, and just have the reader clone the repo. It would mean fewer steps and less chance of something going wrong.

With that said, I've also filed #6 and #10 to try and remove the "download the code" and "download the data" steps.

use gcc instead of gfortran

While testing this out, I found that gfortran is no longer installable via brew:

smartinelli-mac deep-learning-language-model $ brew install gfortran
Updating Homebrew...
==> Auto-updated Homebrew!

Error: No available formula with the name "gfortran" 
GNU Fortran is now provided as part of GCC, and can be installed with:
  brew install gcc

Using gcc is recommended. After using gcc, everything else was fine, including the prereqs (tensorflow and keras).

smartinelli-mac deep-learning-language-model $ brew install gcc
==> Installing dependencies for gcc: gmp, mpfr, libmpc, isl
==> Installing gcc dependency: gmp
==> Downloading https://homebrew.bintray.com/bottles/gmp-6.1.2_1.high_sierra.bottle.tar.gz
######################################################################## 100.0%
==> Pouring gmp-6.1.2_1.high_sierra.bottle.tar.gz
🍺  /usr/local/Cellar/gmp/6.1.2_1: 18 files, 3.1MB
==> Installing gcc dependency: mpfr
==> Downloading https://homebrew.bintray.com/bottles/mpfr-4.0.1.high_sierra.bottle.tar.gz
######################################################################## 100.0%
==> Pouring mpfr-4.0.1.high_sierra.bottle.tar.gz
🍺  /usr/local/Cellar/mpfr/4.0.1: 28 files, 4.6MB
==> Installing gcc dependency: libmpc
==> Downloading https://homebrew.bintray.com/bottles/libmpc-1.1.0.high_sierra.bottle.tar.gz
######################################################################## 100.0%
==> Pouring libmpc-1.1.0.high_sierra.bottle.tar.gz
🍺  /usr/local/Cellar/libmpc/1.1.0: 12 files, 353.8KB
==> Installing gcc dependency: isl
==> Downloading https://homebrew.bintray.com/bottles/isl-0.18.high_sierra.bottle.tar.gz
######################################################################## 100.0%
==> Pouring isl-0.18.high_sierra.bottle.tar.gz
🍺  /usr/local/Cellar/isl/0.18: 80 files, 3.8MB
==> Installing gcc
==> Downloading https://homebrew.bintray.com/bottles/gcc-7.3.0.high_sierra.bottle.tar.gz
######################################################################## 100.0%
==> Pouring gcc-7.3.0.high_sierra.bottle.tar.gz
🍺  /usr/local/Cellar/gcc/7.3.0: 1,486 files, 284.9MB
smartinelli-mac deep-learning-language-model $ sudo pip install numpy
Requirement already satisfied: numpy in /usr/local/lib/python2.7/site-packages
smartinelli-mac deep-learning-language-model $ sudo pip install scipy
Collecting scipy
  Downloading scipy-1.0.0-cp27-cp27m-macosx_10_6_intel.macosx_10_9_intel.macosx_10_9_x86_64.macosx_10_10_intel.macosx_10_10_x86_64.whl (16.8MB)
    100% |████████████████████████████████| 16.8MB 93kB/s 
Requirement already satisfied: numpy>=1.8.2 in /usr/local/lib/python2.7/site-packages (from scipy)
Installing collected packages: scipy
Successfully installed scipy-1.0.0
smartinelli-mac deep-learning-language-model $ sudo pip install pandas
Requirement already satisfied: pandas in /usr/local/lib/python2.7/site-packages
Requirement already satisfied: pytz>=2011k in /usr/local/lib/python2.7/site-packages (from pandas)
Requirement already satisfied: python-dateutil in /usr/local/lib/python2.7/site-packages (from pandas)
Requirement already satisfied: numpy>=1.7.0 in /usr/local/lib/python2.7/site-packages (from pandas)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python2.7/site-packages (from python-dateutil->pandas)
smartinelli-mac deep-learning-language-model $ sudo pip install zipfile36
Collecting zipfile36
  Downloading zipfile36-0.1.3.tar.gz (61kB)
    100% |████████████████████████████████| 71kB 1.6MB/s 
Installing collected packages: zipfile36
  Running setup.py install for zipfile36 ... done
Successfully installed zipfile36-0.1.3
smartinelli-mac deep-learning-language-model $ sudo pip install tensorflow
Requirement already satisfied: tensorflow in /usr/local/lib/python2.7/site-packages
Requirement already satisfied: numpy>=1.11.0 in /usr/local/lib/python2.7/site-packages (from tensorflow)
Requirement already satisfied: bleach==1.5.0 in /usr/local/lib/python2.7/site-packages (from tensorflow)
Requirement already satisfied: wheel in /usr/local/lib/python2.7/site-packages (from tensorflow)
Requirement already satisfied: six>=1.10.0 in /usr/local/lib/python2.7/site-packages (from tensorflow)
Requirement already satisfied: mock>=2.0.0 in /usr/local/lib/python2.7/site-packages (from tensorflow)
Requirement already satisfied: html5lib==0.9999999 in /usr/local/lib/python2.7/site-packages (from tensorflow)
Requirement already satisfied: protobuf>=3.2.0 in /usr/local/lib/python2.7/site-packages (from tensorflow)
Requirement already satisfied: backports.weakref==1.0rc1 in /usr/local/lib/python2.7/site-packages (from tensorflow)
Requirement already satisfied: werkzeug>=0.11.10 in /usr/local/lib/python2.7/site-packages (from tensorflow)
Requirement already satisfied: markdown>=2.6.8 in /usr/local/lib/python2.7/site-packages (from tensorflow)
Requirement already satisfied: funcsigs>=1; python_version < "3.3" in /usr/local/lib/python2.7/site-packages (from mock>=2.0.0->tensorflow)
Requirement already satisfied: pbr>=0.11 in /usr/local/lib/python2.7/site-packages (from mock>=2.0.0->tensorflow)
Requirement already satisfied: setuptools in /usr/local/lib/python2.7/site-packages (from protobuf>=3.2.0->tensorflow)
smartinelli-mac deep-learning-language-model $ sudo pip install keras
Collecting keras
  Downloading Keras-2.1.3-py2.py3-none-any.whl (319kB)
    100% |████████████████████████████████| 327kB 2.4MB/s 
Requirement already satisfied: pyyaml in /usr/local/lib/python2.7/site-packages (from keras)
Requirement already satisfied: numpy>=1.9.1 in /usr/local/lib/python2.7/site-packages (from keras)
Requirement already satisfied: six>=1.9.0 in /usr/local/lib/python2.7/site-packages (from keras)
Requirement already satisfied: scipy>=0.14 in /usr/local/lib/python2.7/site-packages (from keras)
Installing collected packages: keras
Successfully installed keras-2.1.3

h5py prereq needs to be mentioned

Ran into this error while running the notebook; apparently h5py is required. It should be added to the prereqs section. Also, this makes me think the file should have a .h5 extension; all of my googling leads me to believe that most Keras models use a .h5 extension: https://machinelearningmastery.com/save-load-keras-deep-learning-models/

anyway, here's the error:

Build model...
---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-1-8324d558f45d> in <module>()
     53 model.compile(loss='categorical_crossentropy', optimizer=optimizer)
     54 
---> 55 model.load_weights("transfer_weights")
     56 
     57 def sample(preds, temperature=.6):

/usr/local/lib/python2.7/site-packages/keras/models.pyc in load_weights(self, filepath, by_name, skip_mismatch)
    721     def load_weights(self, filepath, by_name=False, skip_mismatch=False):
    722         if h5py is None:
--> 723             raise ImportError('`load_weights` requires h5py.')
    724         f = h5py.File(filepath, mode='r')
    725         if 'layer_names' not in f.attrs and 'model_weights' in f:

ImportError: `load_weights` requires h5py.

I installed h5py and everything worked after that.

smartinelli-mac deep-learning-language-model $ sudo pip install h5py
Collecting h5py
  Downloading h5py-2.7.1-cp27-cp27m-macosx_10_6_intel.macosx_10_9_intel.macosx_10_9_x86_64.macosx_10_10_intel.macosx_10_10_x86_64.whl (4.7MB)
    100% |████████████████████████████████| 4.7MB 338kB/s 
Requirement already satisfied: numpy>=1.7 in /usr/local/lib/python2.7/site-packages (from h5py)
Requirement already satisfied: six in /usr/local/lib/python2.7/site-packages (from h5py)
Installing collected packages: h5py
Successfully installed h5py-2.7.1

With h5py installed...

Build model...

--------------------------------------------------
Iteration 1
/usr/local/lib/python2.7/site-packages/keras/models.py:944: UserWarning: The `nb_epoch` argument in `fit` has been renamed `epochs`.
  warnings.warn('The `nb_epoch` argument in `fit` '
Epoch 1/1
  384/23665 [..............................] - ETA: 1:10:24 - loss: 1.5173
In [ ]:

I'll submit another issue for the warning...

remove char_index and index_char files

Currently the char_indices.txt and indices_char.txt files are loaded from disk.

char_indices = json.loads(open('char_indices.txt').read())
indices_char = json.loads(open('indices_char.txt').read())
chars = sorted(char_indices.keys())

Upon looking at them, they are the same key & value pairs swapped. I think these can be removed and placed in the notebook, then transposed. Kinda like this:

smartinelli-mac $ python
Python 2.7.10 (default, Jun 10 2015, 19:42:47)
[GCC 4.2.1 Compatible Apple LLVM 6.1.0 (clang-602.0.53)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> indices_char ={
"0": "\n",
"1": " ",
"2": "!",
"3": "\"",
"4": "#",
"5": "$",
"6": "%",
"7": "&",
"8": "'",
"9": "(",
"10": ")",
"11": "*",
"12": "+",
"13": ",",
"14": "-",
"15": ".",
"16": "/",
"17": "0",
"18": "1",
"19": "2",
"20": "3",
"21": "4",
"22": "5",
"23": "6",
"24": "7",
"25": "8",
"26": "9",
"27": ":",
"28": ";",
"29": "=",
"30": "?",
"31": "[",
"32": "]",
"33": "a",
"34": "b",
"35": "c",
"36": "d",
"37": "e",
"38": "f",
"39": "g",
"40": "h",
"41": "i",
"42": "j",
"43": "k",
"44": "l",
"45": "m",
"46": "n",
"47": "o",
"48": "p",
"49": "q",
"50": "r",
"51": "s",
"52": "t",
"53": "u",
"54": "v",
"55": "w",
"56": "x",
"57": "y",
"58": "z",
"59": "{",
"60": "}"
}
>>>
>>> print indices_char
{'60': '}', '24': '7', '25': '8', '26': '9', '27': ':', '20': '3', '21': '4', '22': '5', '23': '6', '28': ';', '29': '=', '0': '\n', '2': '!', '4': '#', '6': '%', '8': "'", '59': '{', '58': 'z', '11': '*', '10': ')', '13': ',', '12': '+', '15': '.', '14': '-', '17': '0', '16': '/', '19': '2', '18': '1', '57': 'y', '56': 'x', '51': 's', '50': 'r', '53': 'u', '52': 't', '55': 'w', '54': 'v', '48': 'p', '49': 'q', '46': 'n', '47': 'o', '44': 'l', '45': 'm', '42': 'j', '43': 'k', '40': 'h', '41': 'i', '1': ' ', '3': '"', '5': '$', '7': '&', '9': '(', '39': 'g', '38': 'f', '33': 'a', '32': ']', '31': '[', '30': '?', '37': 'e', '36': 'd', '35': 'c', '34': 'b'}
>>> char_indices = {y:int(x) for x,y in indices_char.iteritems()}
>>> print char_indices
{'\n': 0, '!': 2, ' ': 1, '#': 4, '"': 3, '%': 6, '$': 5, "'": 8, '&': 7, ')': 10, '(': 9, '+': 12, '*': 11, '-': 14, ',': 13, '/': 16, '.': 15, '1': 18, '0': 17, '3': 20, '2': 19, '5': 22, '4': 21, '7': 24, '6': 23, '9': 26, '8': 25, ';': 28, ':': 27, '=': 29, '?': 30, '[': 31, ']': 32, 'a': 33, 'c': 35, 'b': 34, 'e': 37, 'd': 36, 'g': 39, 'f': 38, 'i': 41, 'h': 40, 'k': 43, 'j': 42, 'm': 45, 'l': 44, 'o': 47, 'n': 46, 'q': 49, 'p': 48, 's': 51, 'r': 50, 'u': 53, 't': 52, 'w': 55, 'v': 54, 'y': 57, 'x': 56, '{': 59, 'z': 58, '}': 60}
>>>

where indices_char is https://github.com/MadisonJMyers/deep-learning-language-model/blob/master/code/indices_char.txt
and char_indices is https://github.com/MadisonJMyers/deep-learning-language-model/blob/master/code/char_indices.txt

prevent file lookup issues

In the notebook, the single data file is loaded by assuming the notebook is run from the same location as the single data file, like so:

path = 'yelp_100_5.txt'
text = open(path).read().lower()

You can prevent this with a little bit of Python that pulls the data from a URL instead:

import requests
link = "https://raw.githubusercontent.com/IBM/deep-learning-language-model/master/data/yelp_100_3.txt"
f = requests.get(link)
text = f.text.lower()

Just an idea / option.
