minimaxir / gpt-2-simple Goto Github PK

Python package to easily retrain OpenAI's GPT-2 text-generating model on new texts

License: Other

Python 100.00%

text-generation tensorflow openai textgenrnn

gpt-2-simple's Introduction

gpt-2-simple

A simple Python package that wraps existing model fine-tuning and generation scripts for OpenAI's GPT-2 text generation model (specifically the "small" 124M and "medium" 355M hyperparameter versions). Additionally, this package allows easier generation of text, generating to a file for easy curation, allowing for prefixes to force the text to start with a given phrase.

This package incorporates and makes minimal low-level changes to:

Model management from OpenAI's official GPT-2 repo (MIT License)
Model finetuning from Neil Shepperd's fork of GPT-2 (MIT License)
Text generation output management from textgenrnn (MIT License / also created by me)

For finetuning, it is strongly recommended to use a GPU, although you can generate using a CPU (albeit much more slowly). If you are training in the cloud, using a Colaboratory notebook or a Google Compute Engine VM w/ the TensorFlow Deep Learning image is strongly recommended. (as the GPT-2 model is hosted on GCP)

You can use gpt-2-simple to retrain a model using a GPU for free in this Colaboratory notebook, which also demos additional features of the package.

Note: Development on gpt-2-simple has mostly been superceded by aitextgen, which has similar AI text generation capabilities with more efficient training time and resource usage. If you do not require using TensorFlow, I recommend using aitextgen instead. Checkpoints trained using gpt-2-simple can be loaded using aitextgen as well.

Install

gpt-2-simple can be installed via PyPI:

pip3 install gpt-2-simple

You will also need to install the corresponding TensorFlow 2.X version (min 2.5.1) for your system (e.g. tensorflow or tensorflow-gpu).

Usage

An example for downloading the model to the local system, finetuning it on a dataset. and generating some text.

Warning: the pretrained 124M model, and thus any finetuned model, is 500 MB! (the pretrained 355M model is 1.5 GB)

import gpt_2_simple as gpt2
import os
import requests

model_name = "124M"
if not os.path.isdir(os.path.join("models", model_name)):
	print(f"Downloading {model_name} model...")
	gpt2.download_gpt2(model_name=model_name)   # model is saved into current directory under /models/124M/


file_name = "shakespeare.txt"
if not os.path.isfile(file_name):
	url = "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt"
	data = requests.get(url)

	with open(file_name, 'w') as f:
		f.write(data.text)


sess = gpt2.start_tf_sess()
gpt2.finetune(sess,
              file_name,
              model_name=model_name,
              steps=1000)   # steps is max number of training steps

gpt2.generate(sess)

The generated model checkpoints are by default in /checkpoint/run1. If you want to load a model from that folder and generate text from it:

import gpt_2_simple as gpt2

sess = gpt2.start_tf_sess()
gpt2.load_gpt2(sess)

gpt2.generate(sess)

As with textgenrnn, you can generate and save text for later use (e.g. an API or a bot) by using the return_as_list parameter.

single_text = gpt2.generate(sess, return_as_list=True)[0]
print(single_text)

You can pass a run_name parameter to finetune and load_gpt2 if you want to store/load multiple models in a checkpoint folder.

There is also a command-line interface for both finetuning and generation with strong defaults for just running on a Cloud VM w/ GPU. For finetuning (which will also download the model if not present):

gpt_2_simple finetune shakespeare.txt

And for generation, which generates texts to files in a gen folder:

gpt_2_simple generate

Most of the same parameters available in the functions are available as CLI arguments, e.g.:

gpt_2_simple generate --temperature 1.0 --nsamples 20 --batch_size 20 --length 50 --prefix "<|startoftext|>" --truncate "<|endoftext|>" --include_prefix False --nfiles 5

See below to see what some of the CLI arguments do.

NB: Restart the Python session first if you want to finetune on another dataset or load another model.

Differences Between gpt-2-simple And Other Text Generation Utilities

The method GPT-2 uses to generate text is slightly different than those like other packages like textgenrnn (specifically, generating the full text sequence purely in the GPU and decoding it later), which cannot easily be fixed without hacking the underlying model code. As a result:

In general, GPT-2 is better at maintaining context over its entire generation length, making it good for generating conversational text. The text is also generally gramatically correct, with proper capitalization and few typoes.
The original GPT-2 model was trained on a very large variety of sources, allowing the model to incorporate idioms not seen in the input text.
GPT-2 can only generate a maximum of 1024 tokens per request (about 3-4 paragraphs of English text).
GPT-2 cannot stop early upon reaching a specific end token. (workaround: pass the truncate parameter to a generate function to only collect text until a specified end token. You may want to reduce length appropriately.)
Higher temperatures work better (e.g. 0.7 - 1.0) to generate more interesting text, while other frameworks work better between 0.2 - 0.5.
When finetuning GPT-2, it has no sense of the beginning or end of a document within a larger text. You'll need to use a bespoke character sequence to indicate the beginning and end of a document. Then while generating, you can specify a prefix targeting the beginning token sequences, and a truncate targeting the end token sequence. You can also set include_prefix=False to discard the prefix token while generating (e.g. if it's something unwanted like <|startoftext|>).
If you pass a single-column .csv file to finetune(), it will automatically parse the CSV into a format ideal for training with GPT-2 (including prepending <|startoftext|> and suffixing <|endoftext|> to every text document, so the truncate tricks above are helpful when generating output). This is necessary to handle both quotes and newlines in each text document correctly.
GPT-2 allows you to generate texts in parallel by setting a batch_size that is divisible into nsamples, resulting in much faster generation. Works very well with a GPU (can set batch_size up to 20 on Colaboratory's K80)!
Due to GPT-2's architecture, it scales up nicely with more powerful GPUs. For the 124M model, if you want to train for longer periods of time, GCP's P100 GPU is about 3x faster than a K80/T4 for only 3x the price, making it price-comparable (the V100 is about 1.5x faster than the P100 but about 2x the price). The P100 uses 100% of the GPU even with batch_size=1, and about 88% of the V100 GPU.
If you have a partially-trained GPT-2 model and want to continue finetuning it, you can set overwrite=True to finetune, which will continue training and remove the previous iteration of the model without creating a duplicate copy. This can be especially useful for transfer learning (e.g. heavily finetune GPT-2 on one dataset, then finetune on other dataset to get a "merging" of both datasets).
If your input text dataset is massive (>100 MB), you may want to preencode and compress the dataset using gpt2.encode_dataset(file_path). THe output is a compressed .npz file which will load much faster into the GPU for finetuning.
The 774M "large" model may support finetuning because it will cause modern GPUs to go out-of-memory (you may get lucky if you use a P100 GPU on Colaboratory). However, you can still generate from the default pretrained model using gpt2.load_gpt2(sess, model_name='774M') and gpt2.generate(sess, model_name='774M').
The 1558M "extra large", true model, may not work out-of-the-box with the GPU included with the Colaboratory Notebook. More testing is needed to identify optimial configurations for it.

Interactive Apps Using gpt-2-simple

gpt2-small — App using the default GPT-2 124M pretrained model
gpt2-reddit — App to generate Reddit titles based on a specified subreddit and/or keyword(s)
gpt2-mtg — App to generate Magic: The Gathering cards

Text Generation Examples Using gpt-2-simple

ResetEra — Generated video game forum discussions (GitHub w/ dumps)
/r/legaladvice — Title generation (GitHub w/ dumps)
Hacker News — Tens of thousands of generated Hacker News submission titles

Maintainer/Creator

Max Woolf (@minimaxir)

Max's open-source projects are supported by his Patreon. If you found this project helpful, any monetary contributions to the Patreon are appreciated and will be put to good creative use.

License

MIT

Disclaimer

This repo has no affiliation or relationship with OpenAI.

gpt-2-simple's People

Contributors

Stargazers

Watchers

Forkers

emrul oblivia-simplex juewang12 calculatedcontent jb33k codeaudit allensmile tonyle9 chinarefers chjort victor8733 raymanchester avatarworld gdh756462786 tomlee20180103 stanxii ssghost guanlongtianzi xennygrimmato rogalag kevin2107 chengjingfeng chilang johan-fx alphagit azvchen codecolony chandansinha yashchoubey koryako tarsbase trenddev dhruvluci shayan-taheri johndpope sruthi-racharla neibla fadeevla acndur1a totalgood veqtor zengyuwei rtpthefirst rtp2016 liuwq168 yet-another-account boxabirds teropa its-dron amir22010 cristianmtr 2016csb1062 kousun12 edwios mrklees k0a1a ericbfriday iwontbecreative chaokunchang tangert markjacksonfishing frejanil leo-xxx dontknowwat jonnyplatt minrk gftgpu bob80333 xin-miao-cs jnsimon shangcambridge unlimited-bot-works intuitionmachine jbutkevics lazuraslong kittenhero littleserendipity chad-mowbray shbm charliekmorris craigmcdiarmid zawecha1 sahanduiuc meelement hanlin-zhu sirmammingtonham ronnieqm molura jonheng andreyatgb cxz atkinsondaniel lazymonter tianxieeryang juvu ianrowan iam-abbas databill86 itspritish gabrielsjoberg

gpt-2-simple's Issues

Checking for overtraining?

Some of the text that is being generated seems too good... Wondering if I might have overtrained the model.
I can check my dataset manually quickly.

Suggestions for checking the original dataset?

Adding dropout?

I don't see dropout as a parameter in the fine-tuning options. Would it be possible to add to help avoid overfitting?

Could I use this model to train a Chinese text?

I put a Chinese text into this model and train it, but the generation is messy code. I think it may be the "encoding" error. How could I fix the problem?
Thx

There was an error occurred when generating.

There was an error, “ValueError: Can't load save_path when it is None.”, occurred when using “gpt_2_simple generate” after finetuned. How to solve it?

Explanation regarding line about prefix

I cannot figure out what this line is supposed to achieve.

if prefix:
    gen_text = prefix[0] + gen_text

Error when calling copy_checkpoint_to_gdrive()

For the record, I don't think I have ever managed to use this function.

I get this error:

I am using a run_name when calling gpt2.finetune(), so that might explain the issue.

However, there is no way to specify the run_name with gpt2.copy_checkpoint_to_gdrive():

Too many iterations when fine-tuning a model a second time

I have fine-tuned the GPT-2 model a first time with:

sess = gpt2.start_tf_sess()

gpt2.finetune(sess,
              run_name=run_name,
              dataset=file_name,
              steps=1000,
              restore_from='fresh',   # change to 'latest' to resume training
              print_every=10,   # how many steps between printing progress
              sample_every=200,   # how many steps to print a demo sample
              save_every=500   # how many steps between saving checkpoint              
              )

I want to fine-tune it a second time, i.e. for 1000 additional iterations.
I have changed the cell to the following and executed it:

sess = gpt2.start_tf_sess()

gpt2.finetune(sess,
              run_name=run_name,
              dataset=file_name,
              steps=1000,
              restore_from='latest',   # change to 'latest' to resume training
              print_every=10,   # how many steps between printing progress
              sample_every=200,   # how many steps to print a demo sample
              save_every=500   # how many steps between saving checkpoint              
              )

It seems to have started at 1001 iterations, which is as intended.
However, the process has not yet stopped. I thought it would stop at 2000 iterations.

Training...
[1010 | 32.03] loss=2.62 avg=2.62
[1020 | 56.97] loss=2.44 avg=2.53
[1030 | 81.37] loss=2.72 avg=2.59
[1040 | 106.37] loss=2.42 avg=2.55
[1050 | 131.60] loss=2.59 avg=2.56
[...]
[1970 | 2458.54] loss=2.03 avg=2.45
[1980 | 2483.27] loss=2.31 avg=2.45
[1990 | 2507.96] loss=2.27 avg=2.45
Saving checkpoint/descriptions/model-2000
[...]
[2000 | 2546.75] loss=2.47 avg=2.45
[2010 | 2571.65] loss=2.35 avg=2.45
[2020 | 2596.52] loss=2.37 avg=2.44
[2030 | 2621.22] loss=2.55 avg=2.45
[...]
[2960 | 4968.95] loss=2.31 avg=2.39
[2970 | 4993.75] loss=2.22 avg=2.39
[2980 | 5018.52] loss=2.48 avg=2.39
[2990 | 5043.32] loss=2.60 avg=2.39
Saving checkpoint/descriptions/model-3000
[...]
[3000 | 5081.95] loss=2.45 avg=2.39
[3010 | 5106.82] loss=2.48 avg=2.39
[3020 | 5131.65] loss=1.87 avg=2.39
[3030 | 5156.36] loss=2.23 avg=2.38
[3040 | 5181.04] loss=2.30 avg=2.38

How do I evaluate perplexity of the lm

I fine tuned the model.I wanted some guidance as to how to evaluate the perplexity of the model.

Move some functions to separate .py files

Putting everything in a single .py file is convenient but not Pythonic.

How to generate text without fintetune?

How do I generate text without finetuning the model? Please help!

Importance of prepending <|startoftext|> and suffixing <|endoftext|>

I have been using your package on a large .txt file which is a concatenation of texts (mostly reviews, and also store descriptions). However, I am unsure whether it would be important to prepend <|startoftext|> and append <|endoftext|> before concatenating the different texts.

Am I doing something wrong if I don't use these tokens?

Enhancement - Can you make the default readme generated text from terminator scripts?

TensorFlow Depreciation notices

Lots of notices, will have to keep an eye on them if something breaks in TF 2.0.

I'm OK with hiding the notices for now.

Update for 345M

Openai released the 354mb model
https://github.com/openai/gpt-2/blob/master/DEVELOPERS.md

Could you update to that version?

Weird output after finetuning for many iterations

I have fine-tuned the 345M model for 2000 iterations (fresh). The output was fine.

Then I have restarted my Colab session and fine-tuned it further up to 5000 iterations (latest).

[...]
Loading checkpoint checkpoint/345M_reviews_203770/model-2000
[...]
[2010 | 60.42] loss=2.29 avg=2.29
[2020 | 77.40] loss=2.59 avg=2.44
[2030 | 93.81] loss=2.36 avg=2.41
[2040 | 110.04] loss=1.78 avg=2.25
[2050 | 126.32] loss=2.88 avg=2.38
[2060 | 142.79] loss=2.49 avg=2.40
[2070 | 159.39] loss=1.86 avg=2.32
[2080 | 175.95] loss=2.36 avg=2.33
[2090 | 192.37] loss=3.14 avg=2.42
[2100 | 208.73] loss=2.65 avg=2.44
[2110 | 225.16] loss=1.67 avg=2.37
[...]
[4900 | 5172.41] loss=0.50 avg=1.97
[4910 | 5188.85] loss=1.76 avg=1.97
[4920 | 5205.29] loss=1.67 avg=1.97
[4930 | 5221.74] loss=1.39 avg=1.96
[4940 | 5238.17] loss=2.59 avg=1.97
[4950 | 5254.63] loss=1.61 avg=1.97
[4960 | 5271.10] loss=0.83 avg=1.95
[4970 | 5287.55] loss=2.66 avg=1.96
[4980 | 5304.01] loss=2.80 avg=1.97
[4990 | 5320.46] loss=1.14 avg=1.96
[5000 | 5336.92] loss=0.80 avg=1.95
Saving checkpoint/345M_reviews_203770/model-5000

The samples shown during the fine-tuning process were fine:

[4800 | 4986.37] loss=2.86 avg=2.02
======== SAMPLE 1 ========
 to get an heir, I would just send him off to a monastery. Or maybe let me build a house of cards, take all the land, have fun, then attack my siblings. It's good, trust me.<|endoftext|>
<|startoftext|>First I found this game while searching for something to play offline, I didn't know what to expect, nor how I would manage it. Now, I'm playing it with the A Game of Thrones Mod and it's totally different! I highly recommend!<|endoftext|>
<|startoftext|>If you're into grand strategy, but haven't found a game with the grand strategy flair quite like this one, I would recommend waiting for the steam sale. The base game is very enjoyable, but even steam sales come with some notable flaws: the combat mechanics may seem quite simplistic compared to real life (some stats even aren't shown if you have the dlc), and some dlc is needed for better visibility of factions. However, even the flaws can be turned into something really grand, like the fact that there are literally thousands of unique characters and even very large factions can be played as if they were the same, which creates the feeling of being part of a huge being as you know them.<|endoftext|>
<|startoftext|>I have so much fun with this game.  It's been out for 7 years and I still find myself coming back to this game.  I enjoy the depth of intrigue behind your dynasty, but I also enjoy the randomness of events which can drastically change your situation.  Be forewarned my english is so poor, but you can still keep me going after reading this, there is a pause button at all times.  I've found that the most captivating thing is to just let the game happen and just have it all happen.
I have over 500 hours on this and am still finding new stuff.  Would reccomend to anyone interested in strategy games.<|endoftext|>
<|startoftext|>Good game, but i do have some bugs for those who are playing on linux
i havent had any luck with this yet
it would be great if this was just like the other paradox games, but not like this the chances of you getting the buggy bugs are very slim
though i have not had any luck with the game the other paradox games have had a very good track record on this buisness
it would have to be  the same with the other games
though this game is good as heck i really can not recommend this becuase this is a pretty dang pricey game its a huge step backward imo<|endoftext|>
<|startoftext|>This game is one of my all time favourites. If it wasn't for my parents blocking me from buying it and their inability to pay for the DLC, I would have thought it was just another € type game. And as if that weren't already, their previous version was the one which resulted in my son becoming a powerful ruler, forcing me to exile him. I also had to imprison my horse, after it refused to renounce its claims. You can play as a historical character like Richard the Great and a horse, or one of many other noble and dignitary. I myself was a horse, I am content. But there is simply no way I can deny that this is the best of the strategy games! So what are the big drawbacks of the game? Well the biggest is the price. I paid this game 4 times back in the day for 40 euros, once on sale and once in a very humid, cold climate. As I said, the game was not the best, but they had to make it so they could sell the dlc, so don't expect the same quality products from this developer. And what are the things you can't do in the game? Well you can't just go and raid other lords land and that kind of thing. Also the game has no trade routes, that is a big problem. Also the game doesn't let you hold any vassals hostage, you need a Casus Beli, which is much easier to obtain if you have good reason. The big problem with the game is their DLC policy. If you want to have any fun at all, you have to make purchases from a certain game, this will result in many interesting options. The most interesting and best DLCs are not in the game, they are in the workshop, that's why. It will result in a Game Of Thrones, and other like CKII but in a different universe.<|endoftext|>
<|startoftext|>Great Game
Still Great
Instruments are good
Map is good
Music is great
but game can be slow
could be better<|endoftext|>
<|startoftext|>Tons of replay potential

[4810 | 5024.66] loss=2.54 avg=2.03

However, some of the output of gpt2.generate() are nonsensical, e.g. the following excerpt:

run_name = '345M_reviews_203770'
num_samples = 3
num_batches = 3
temperature = 0.7

gpt2.generate(sess,
              run_name=run_name,
              nsamples=num_samples,
              batch_size=num_batches,
              temperature=temperature,
              prefix='<|startoftext|>Please',
              truncate='<|endoftext|>')

<|startoftext|>Please note that I have not played the game in offline mode. I am just writing this review to show that offline play can be a great joy when playing CK2 with friends can be a pain.
This game is a game about building your dynasty and protecting what you built. You might be the emperor of your own realm but a war can break your realm apart and take all your holdings away, or you could be the king of France who just happens to be your best friend. The choice is yours. [i](Also, you can choose to start as a vassal of another king or emperor who owns some of your lands and pay them homage, gaining prestige and money while doing so)[/i]
The game can be hard or easy, depending on who you play and how difficult the game can be. The game is as hard and challenging as anyone can make it. My advice is to start small (Ireland) and work yourself up to something, or try something more difficult, in Ireland or elsewhere.
I have played this game for years and I still play it sometimes. I have also played the game offline a lot and have a lot more fun.
If you are new to the game, and want a quick overview, I recommend playing  in single player mode (Start saved at 0800) for the easiest learning curve.                                                                                          A'                 1     ,      b   

 and      " '
                  the   a   you the
 her     [  
  i

It also happens with:

run_name = '345M_reviews_203770'
num_samples = 1
num_batches = 1
temperature = 0.7

gpt2.generate(sess,
              run_name=run_name,
              nsamples=num_samples,
              batch_size=num_batches,
              temperature=temperature)

Edit: I have restarted my Colab session and fine-tuned it further up to 6000 iterations (latest). This time, the issue appeared in some samples shown during the fine-tuning process. I guess:

either I have over-fine-tuned the model,
or there is some issue with resuming fine-tuning?

Allow users to generate texts longer than 1024 tokens

It likely isn't possible to do it at the generation level (like other frameworks), but we can hack it by:

Generate full text.
Feed latter half of previous text as a prefix.
Repeat until done.

What is the data format of the TXT file

What is the data format of the TXT file？when I want to finetune the model

Finetuning: fresh vs. latest

When finetuning a model, I would like a clarification for fresh vs. latest.

I thought that:

fresh would start from the original GPT-2 model,
latest would start from the previous checkpoint.

However, I am a bit confused right now (#13). Is it rather that:

both start from the previous checkpoint, if available,
otherwise from the original GPT-2 model,

and that:

fresh would perform steps iterations,
latest would perform the remaining iterations so that the total number of iterations is steps.

Improved Sampling (Nucleus Sampling)

In a recent paper, they find that beam search actually does even worse than greedy search for GPT-2 117M, and they propose a new sampling method to improve text output.

Neural Text Degradation

From what I understand, it's top-k sampling, except instead of k tokens, it takes all tokens until probability of these tokens combined is > some parameter p.

Here's an example from the paper:

🗣Context:
On days when he woke early, the president
liked to have coffee in the oval office. There was
something about watching the sky grow lighter and
lighter as you sat your pajama'd behind in the most
powerful chair in the free world, sipping marine strength
coffee, that you just couldn't achieve anywhere else.

💰Greedy: He'd been in the oval office for over a year
now, and he'd never been so happy to be there. He'd
been in the oval office for almost a year, and he'd never
been so happy to be there.

🚦BeamSearch (k=4): The President was a man of his
word. He was a man of his word. He was a man of his
word. He was a man of his word. He was a man of his
word. He was a man of his word.

🥄Sampling (t=1.0): You couldn't be sure if that's what
you were really doing, and If you decided to take the day
off. The president wanted you to take the day off, but he
wanted to maintain a curfew and use his influence wisely.

🇰Top-k Sampling (k=8): "I'm sorry, sir." "No, it's okay,
son," the president said. "I understand." "You're going
to have to make a special trip down there to get that kid.
He has no idea where he's going."

⚛️Nucleus Sampling (p=0.9): But that wasn't what drew
the president's attention. He'd been seated for maybe a
minute when he noticed the other man. What was the
guy doing here?

🗣Gold: He was therefore disagreeably surprised to find
a man in an understated grey suit sitting in that selfsame
chair sipping tea. The president turned around and went
looking for his chief of staff.

Unused function argument model_name

The argument model_name is not used in generate(). Here is the line.

Similarly, for generate_to_file() here.

The first word of the prefix is trimmed to 1 character

I did not notice this behavior in my previous tests, because the first word was "I" as in "I love" or "I hate". However, when the first word is "Please", it seems that the algorithm takes into account the whole word but only displays the first character ("P").

Replace vocab and encoder for other languages

Currently I'm trying to train the GPT-2 345M model on columns written in the Dutch langauge. Everything seems to work pretty fine while generating samples, although the model sometimes comes up with fictional words (or partially English). Now I'm wondering if a customized vocab.bpe (based on the Dutch corpus) and an encoder (e.g. https://github.com/google/sentencepiece) would increase the results. I'm not sure about this, as the Dutch and English language are relatively similar langauges.

Thanks in advance.

Fails to load dataset on Windows due to text encoding

On Windows 10, when attempting to run this code on my own dataset, I run into this error:

return codecs.charmap_decode(input,self.errors,decoding_table)[0] UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 2243891: character maps to <undefined>

Adding the parameter encoding='utf8' to line 33 of load_dataset.py:
with open(path, 'r') as fp:
appears to fix this issue. I'm not 100% sure, because now instead of erroring out immediately, it's using as much RAM as it can

I can open a pull request for this if you'd like.

why is <|startoftext|> not inserted when read in plain text files?

I am having some difficulities to understand the logic of load_dataset(). When it's csv file, every line is padded with start_token and end_token. However for other plain text file, only end_token is used. Is there any particular reason to omit start_token for plain text file?

Thanks

Default temperature

Is there a reason why the default temperature is 0.7?

I have not checked thoroughly yet, but this page shows different temperatures for "detectability baselines" (shown below). It seems that the temperature is chosen closer to:

0.9 for the 117M and 345M models,
0.75 for the 762M and 1542M models.

Model	Temperature 1	Top-K 40
117M	88.29%	96.79%
345M	88.94%	95.22%
762M	77.16%	94.43%
1542M	74.31%	92.69%

Text is incoherent when generating with batch_size > 1

Not entirely sure why.

Saving checkpoints to Google Drive

I have trouble saving checkpoints to Google Drive, despite using the correct argument (#17).

The checkpoint is supposed to be 1.4 GB large, but it ends up being 1.2 MB, when it is effectively saved. In other cases, a seemingly empty folder is created in my checkpoint/ folder in my Google Drive.

gpt2.mount_gdrive()

checkpoint_folder = 'checkpoint/' + run_name
gpt2.copy_checkpoint_to_gdrive(checkpoint_folder=checkpoint_folder)

Is there a way to reliably download the checkpoint?

For instance, wouldn't it be better to archive it to .zip and then send it to Google Drive?

Not an issue - fyi / controllable neural plot generation

https://twitter.com/mark_riedl/status/1126885608897617922?s=21

Not able to load the dataset

I have been trying to train the 117M model, with the dataset of size 1.03 GB, with 64 GB ram. But while it load the dataset, it remain stuck there. And after some 30 min, its just terminate. Here is the log.

Fetching checkpoint: 1.00kit [00:00, 679kit/s]                                                      
Fetching encoder.json: 1.04Mit [00:00, 16.5Mit/s]                                                   
Fetching hparams.json: 1.00kit [00:00, 573kit/s]                                                    
Fetching model.ckpt.data-00000-of-00001:  11%|#8               | 53.6M/498M [00:00<00:07, 62.2Mit/s]
Fetching model.ckpt.data-00000-of-00001:  28%|#####3             | 141M/498M [00:01<00:03, 105Mit/s]
Fetching model.ckpt.data-00000-of-00001:  46%|########7          | 230M/498M [00:02<00:02, 108Mit/s]
Fetching model.ckpt.data-00000-of-00001:  63%|###########4      | 316M/498M [00:03<00:02, 66.6Mit/s]
Fetching model.ckpt.data-00000-of-00001:  77%|#############8    | 384M/498M [00:04<00:01, 58.8Mit/s]
Fetching model.ckpt.data-00000-of-00001:  92%|################6 | 460M/498M [00:06<00:00, 44.8Mit/s]
Fetching model.ckpt.data-00000-of-00001: 498Mit [00:06, 72.4Mit/s]                                  
Fetching model.ckpt.index: 6.00kit [00:00, 3.39Mit/s]                                               
Fetching model.ckpt.meta: 472kit [00:00, 9.86Mit/s]                                                 
Fetching vocab.bpe: 457kit [00:00, 9.54Mit/s]                                                       2019-05-19 16:12:23.408514: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instr
uctions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA

  0%|          | 0/1 [00:00<?, ?it/s]

I also saw another issue, which ask to cut the text file. How much has to be ideal size in order to train. If not, what model size could go with 1 GB text file ?

Help will be appreciated 👍

Ability encode files to tokens separately from fine-tuning?

I have a dataset of ~160MB fine-tuning in google colab, but a dataset of ~180MB causes the runtime to crash while loading the dataset, due to using all available RAM. However, while fine-tuning, I noticed that the VRAM has 6GB available, and the RAM has ~10GB available.

My dataset was originally many smaller files that are combined, if I could encode each of these separately into tokens, then combine the encoded dataset, and skip the encoding process when loading the dataset, I think I could use larger datasets while avoiding running out of RAM.

I did notice that on line 27 of load_dataset.py it seems to be able to load pre-encoded files.

Allow users to use Colaboratory's TPU for finetuning

This alone will be the single-biggest improvement for gpt-2-simple.

8 cores
~2x speed increase relative to a K80

= 16x training speed

Unfortunately documentation for using Colaboratory's TPU is a bit messy.

Argument for copy_checkpoint_to_gdrive() and copy_checkpoint_from_gdrive()

The argument for the functions copy_checkpoint_to_gdrive() and copy_checkpoint_from_gdrive() is checkpoint_folder, which is by default equal to os.path.join('checkpoint', 'run1'), according to this line.

I believe it would make more sense to accept run_name as argument. This would avoid confusion as in #11.

345M Directory Issue

When running the example code in the README.md, after executing the line that trains on my custom dataset, it comes out with this error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Program Files\Python37\lib\site-packages\gpt_2_simple\gpt_2.py", line 106, in finetune
    os.path.join(checkpoint_path, file))
  File "C:\Program Files\Python37\lib\shutil.py", line 120, in copyfile
    with open(src, 'rb') as fsrc:
FileNotFoundError: [Errno 2] No such file or directory: 'models\\117M\\hparams.json'

I'm trying to run with the 345M model, but it seems there's no code to automatically search in one or the other. I only have the 345M model installed.

This is the line I am trying to run.
gpt2.finetune(sess, 'file.txt', steps=1000)
You'll notice that there's nothing to denote 117M or 345M, but it still looks for the file in 117M, which does not exist.

Number of tokens on file imported, 4x actual?

With the new 345m version, my dataset tokens has gone from 45k to 184k.
Is it now tokenizing characters rather than whole words?

Add compatibility to copy model between Colaboratory and Google Cloud Storage

May be more effective for moving the model than Google Drive, albeit not free.

Blank/Prefix outputs when using truncate + include_prefix=False

Nonzero probability for these outputs per the /r/legaladvice tests, so will need to add safeguards.

0 tokens when attempting to finetune using .txt

Using the collaboratory (have not tried locally), I tried loading a normal text file and it found 0 tokens.

I've found that splitting on whitespace and turning my source file into a csv was the only way to get past this.

All of the examples reference "shakespeare.txt" but that file isn't included in the repo so I have not been able to confirm what the tool is expecting from a plaintext file.

Typo in call to sample.sample_sequence()

I cannot figure out whether this line is a typo.

    output = sample.sample_sequence(
        hparams=hparams, length=length,
        start_token=enc.encoder['<|endoftext|>'] if not prefix else None,
        context=context if prefix else None,
        batch_size=batch_size,
        temperature=temperature, top_k=top_k
    )[:, 1:]

Specifically:

start_token=enc.encoder['<|endoftext|>'] if not prefix else None,

Shouldn't it be the following instead?

start_token=enc.encoder['<|startoftext|>'] if not prefix else None,

Other examples using gpt-2-simple

For info, I have used your Python module in these two repositories:

I am not trying to advertise these, but I thought I would let you know.

Prefix and suffix and their appearance in generated samples

I have a small dataset (~2MB) consisting of columns written by a journalist throughout recent years. Each column is prepended with '<|startoftext|>' and appended with '<|endoftext|>'. I have two questions:

When generating samples by executing gpt2.generate(sess, length=310, temperature=0.7, prefix="<|startoftext|>", include_prefix=False, truncate="<|endoftext|>", nsamples=5, batch_size=5 )

the pre- and suffix still appear in the middle of texts. Am I doing wrong? Or is this normal?

Can someone explain briefly why I should prepend and append each different text within a large file? I mean, assume I have three columns. What would be the difference in behavior of the model if i would prepend and append each column vs. separate columns by e.g. a white line?

Although I spent quite some time on GPT-2 now, I still find it hard to grasp this part, so any help is greatly appreciated.

Error when generating with long prefix

Hi! When I generate text with a prefix longer than 4 characters, I get the following error:
tensorflow.python.framework.errors_impl.InvalidArgumentError: indices[0,0] = 1024 is not in [0, 1024)[[{{node sample_sequence_6/while/model/GatherV2_1}}]]

It does not occur if the prefix is "Hi!", but it does occur when it is "Hi, this is a longer piece of text"
Do you know why this may be happening?

Add parameter to keep only one checkpoint while restore_from='latest'

Doing restore_from='latest' creates a new checkpoint + metadata which is inconvenient for a few workflows.

Error when I try to generate text from finetuned model

I'm new to machine learning.

I get this error:

ValueError Traceback (most recent call last)
in ()
1 sess = gpt2.start_tf_sess()
----> 2 gpt2.load_gpt2(sess,'run2')

6 frames
/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/variable_scope.py in _get_single_variable(self, name, shape, dtype, initializer, regularizer, partition_info, reuse, trainable, collections, caching_device, validate_shape, use_resource, constraint, synchronization, aggregation)
846 tb = [x for x in tb if "tensorflow/python" not in x[0]][:3]
847 raise ValueError("%s Originally defined at:\n\n%s" % (err_msg, "".join(
--> 848 traceback.format_list(tb))))
849 found_var = self._vars[name]
850 if not shape.is_compatible_with(found_var.get_shape()):

ValueError: Variable model/wpe already exists, disallowed. Did you mean to set reuse=True or reuse=tf.AUTO_REUSE in VarScope? Originally defined at:

File "/usr/local/lib/python3.6/dist-packages/gpt_2_simple/src/model.py", line 153, in model
initializer=tf.random_normal_initializer(stddev=0.01))
File "/usr/local/lib/python3.6/dist-packages/gpt_2_simple/gpt_2.py", line 127, in finetune
output = model.model(hparams=hparams, X=context)
File "", line 11, in
run_name='run2'

Any ideas how to fix this?

Occurences of <|startoftext|> and <|endoftext|> can appear in the middle of a generated document

Anyone understand the code of the small model? I'm looking to hire someone!

I really reallllly want to know how GPT-2 works. They gave us trained code, a Paper, and a assortment of news on the internet describing parts of it like Transformers etc.

Has anyone put the puzzle pieces together yet? Or can? I need you. This algorithm is very important in generating research that changes the world.

I currently hire someone to program my non-neural network program that will soon reach a somewhat close level at least with GPT-2, as I understand how text generation works on many levels. But I still should know how GPT-2 works, and I have a ton of effort to put in so it should be put into the right spot.

I'm not a mathematician or programmer and my understanding of research/AGI is working, I am learning a lot each day, but I seriously need someone to explain how GPT-2 works visually or in English.

Loss

apologies, tried to delete.

Influence of include_prefix

As mentioned in #40, I do not see any difference when include_prefix is set to True or to False.
What is this argument supposed to do?

It appears here:

            if truncate:
                truncate_esc = re.escape(truncate)
                if prefix and not include_prefix:
                    prefix_esc = re.escape(prefix)
                    pattern = '(?:{})(.*?)(?:{})'.format(prefix_esc,
                                                         truncate_esc)
                else:
                    pattern = '(.*?)(?:{})'.format(truncate_esc)

                trunc_text = re.search(pattern, gen_text, re.S)
                if trunc_text:
                    gen_text = trunc_text.group(1)

How to deal with the checkpoint folder?

First of all thank you for your contribution.
I would like to convert generated model to tensorflow.js to use it on a website.
Could you explain me how to deal with the output checkpoint folder?

ValueError: Variable model/wpe already exists, disallowed

I fine-tuned a model for 1000 iterations, and want to do 1000 more. In such cases, I get an error.

sess = gpt2.start_tf_sess()

gpt2.finetune(sess,
              run_name=run_name,
              dataset=file_name,
              steps=1000,
              restore_from='fresh',   # change to 'latest' to resume training
              print_every=10,   # how many steps between printing progress
              sample_every=200,   # how many steps to print a demo sample
              save_every=500   # how many steps between saving checkpoint              
              )

Here is the error:

ValueError                                Traceback (most recent call last)

<ipython-input-16-3052444268f9> in <module>()
      8               print_every=10,   # how many steps between printing progress
      9               sample_every=200,   # how many steps to print a demo sample
---> 10               save_every=500   # how many steps between saving checkpoint
     11               )

/usr/local/lib/python3.6/dist-packages/gpt_2_simple/gpt_2.py in finetune(sess, dataset, steps, model_name, combine, batch_size, learning_rate, accumulate_gradients, restore_from, run_name, sample_every, sample_length, sample_num, save_every, print_every, max_checkpoints, model_load)
    110 
    111     context = tf.placeholder(tf.int32, [batch_size, None])
--> 112     output = model.model(hparams=hparams, X=context)
    113     loss = tf.reduce_mean(
    114         tf.nn.sparse_softmax_cross_entropy_with_logits(

/usr/local/lib/python3.6/dist-packages/gpt_2_simple/src/model.py in model(hparams, X, past, scope, reuse)
    151 
    152         wpe = tf.get_variable('wpe', [hparams.n_ctx, hparams.n_embd],
--> 153                              initializer=tf.random_normal_initializer(stddev=0.01))
    154         wte = tf.get_variable('wte', [hparams.n_vocab, hparams.n_embd],
    155                              initializer=tf.random_normal_initializer(stddev=0.02))

/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/variable_scope.py in get_variable(name, shape, dtype, initializer, regularizer, trainable, collections, caching_device, partitioner, validate_shape, use_resource, custom_getter, constraint, synchronization, aggregation)
   1477       constraint=constraint,
   1478       synchronization=synchronization,
-> 1479       aggregation=aggregation)
   1480 
   1481 

/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/variable_scope.py in get_variable(self, var_store, name, shape, dtype, initializer, regularizer, reuse, trainable, collections, caching_device, partitioner, validate_shape, use_resource, custom_getter, constraint, synchronization, aggregation)
   1218           constraint=constraint,
   1219           synchronization=synchronization,
-> 1220           aggregation=aggregation)
   1221 
   1222   def _get_partitioned_variable(self,

/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/variable_scope.py in get_variable(self, name, shape, dtype, initializer, regularizer, reuse, trainable, collections, caching_device, partitioner, validate_shape, use_resource, custom_getter, constraint, synchronization, aggregation)
    545           constraint=constraint,
    546           synchronization=synchronization,
--> 547           aggregation=aggregation)
    548 
    549   def _get_partitioned_variable(self,

/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/variable_scope.py in _true_getter(name, shape, dtype, initializer, regularizer, reuse, trainable, collections, caching_device, partitioner, validate_shape, use_resource, constraint, synchronization, aggregation)
    497           constraint=constraint,
    498           synchronization=synchronization,
--> 499           aggregation=aggregation)
    500 
    501     # Set trainable value based on synchronization value.

/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/variable_scope.py in _get_single_variable(self, name, shape, dtype, initializer, regularizer, partition_info, reuse, trainable, collections, caching_device, validate_shape, use_resource, constraint, synchronization, aggregation)
    846         tb = [x for x in tb if "tensorflow/python" not in x[0]][:3]
    847         raise ValueError("%s Originally defined at:\n\n%s" % (err_msg, "".join(
--> 848             traceback.format_list(tb))))
    849       found_var = self._vars[name]
    850       if not shape.is_compatible_with(found_var.get_shape()):

ValueError: Variable model/wpe already exists, disallowed. Did you mean to set reuse=True or reuse=tf.AUTO_REUSE in VarScope? Originally defined at:

  File "/usr/local/lib/python3.6/dist-packages/gpt_2_simple/src/model.py", line 153, in model
    initializer=tf.random_normal_initializer(stddev=0.01))
  File "/usr/local/lib/python3.6/dist-packages/gpt_2_simple/gpt_2.py", line 112, in finetune
    output = model.model(hparams=hparams, X=context)
  File "<ipython-input-7-3052444268f9>", line 10, in <module>
    save_every=500   # how many steps between saving checkpoint

One-off counter when checkpoint saves

Looks like 0.3.1 broke the logic a bit. Not the end of the world but should fix.

minimaxir / gpt-2-simple Goto Github PK

gpt-2-simple's Introduction

gpt-2-simple

Install

Usage

Differences Between gpt-2-simple And Other Text Generation Utilities

Interactive Apps Using gpt-2-simple

Text Generation Examples Using gpt-2-simple

Maintainer/Creator

License

Disclaimer

gpt-2-simple's People

Contributors

Stargazers

Watchers

Forkers

gpt-2-simple's Issues

I get this error:

Recommend Projects

Recommend Topics

Recommend Org