gpt2-bert-reddit-bot

A series of scripts to fine-tune GPT-2 and BERT models on Reddit data so they can generate realistic replies.

Jupyter notebooks are also available on Google Colab here.

See my blog post for a walkthrough on running the scripts.

processing training data

I use pandas read_gbq to read from Google BigQuery. get_reddit_from_gbq.py automates the download. prep_data.py cleans the data and transforms it into a format the GPT-2 and BERT fine-tuning scripts can use. I manually upload the output of prep_data.py to Google Drive so the Google Colab notebooks can read it.
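As a rough sketch of the download step, the snippet below builds a query string and hands it to pandas read_gbq. It is not the code in get_reddit_from_gbq.py: the table name, the column names (`id`, `parent_id`, `body`, `subreddit`), and the helper names `build_query` and `download_comments` are all assumptions made for illustration.

```python
def build_query(table, subreddit, limit=1000):
    """Return a SQL string selecting comments for one subreddit.

    `table` and the column names are assumptions; adjust them to your dataset.
    """
    return (
        "SELECT id, parent_id, body "
        f"FROM `{table}` "
        f"WHERE subreddit = '{subreddit}' "
        f"LIMIT {int(limit)}"
    )


def download_comments(table, subreddit, project_id, limit=1000):
    """Run the query on BigQuery and return a pandas DataFrame."""
    import pandas as pd  # imported here so build_query works without pandas

    return pd.read_gbq(build_query(table, subreddit, limit),
                       project_id=project_id, dialect="standard")


if __name__ == "__main__":
    # Hypothetical table name; printing the query needs no credentials.
    print(build_query("my-project.my_dataset.comments", "writing", limit=10))
```

Calling `download_comments` requires Google Cloud credentials and a billing project, so it is worth printing the query first and checking it before running it.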

Here is a sample of the data format that prep_data.py outputs:

"Is there any way this could be posted as a document so it can be saved permanently, outwith reddit? [SEP] Could you not just copy and paste it yourself into a word processor document?"
"Seems like alt-history is a format that would almost *require* a detailed outline before writing [SEP] Are you aware of any good outliners or character sheets for writing novels? I like to organize and plan on the macro level and then, knowing what I want to accomplish and with which character, I can then discovery write at the micro level. "
"This is depressing [SEP] There are the books and they are excellent. There are also audiobooks which are also outstanding. Including side story novellas!

Also there is no apparent sign of James S. A. Corey (which is actually two authors: Daniel Abraham and Ty Franck) going all George R. R. Martin / Robert Jordan."
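The format above can be produced with a few lines of Python. This is a minimal sketch, not the logic in prep_data.py: the helper names `format_pair` and `write_pairs`, and the choice to write one quoted field per row with the csv module, are assumptions.

```python
import csv
import io

SEP = " [SEP] "


def format_pair(comment, reply):
    """Join a parent comment and its reply into one training string."""
    return f"{comment.strip()}{SEP}{reply.strip()}"


def write_pairs(pairs, fileobj):
    """Write each (comment, reply) pair as one quoted field per row.

    Quoting keeps replies that contain line breaks inside a single record.
    """
    writer = csv.writer(fileobj, quoting=csv.QUOTE_ALL)
    for comment, reply in pairs:
        writer.writerow([format_pair(comment, reply)])


if __name__ == "__main__":
    buf = io.StringIO()
    write_pairs([("This is depressing",
                  "There are the books.\n\nAlso audiobooks.")], buf)
    print(buf.getvalue())
```

Quoting every field is one way to keep multi-paragraph replies, like the third sample, attached to their parent comment when the file is read back.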

pulling reddit comments with praw

I use praw to download comments.

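import praw

# Replace the placeholder strings with your own Reddit API credentials.
reddit = praw.Reddit(client_id='client_id',
                     client_secret='client_secret',
                     password='reddit_password',
                     username='reddit_username',
                     user_agent='reddit user agent name')

...
subreddit = reddit.subreddit(subreddit_name)
# Walk the top-level comments of the five currently rising submissions.
for h in subreddit.rising(limit=5):
    h.comments.replace_more(limit=0)  # drop unexpanded "load more comments" stubs
    for c in h.comments:
        ...  # do stuff with c, e.g. read c.body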
 

See the code for more details.

training, generating, classifying

more documentation to come soon...

Contributors

stedn


gpt2-bert-reddit-bot's Issues

Can you add an example of the training data format

Hey, I was curious whether an example of the training data format could be added to the README. I know it's comment [SEP] reply, but I was wondering whether the <|endoftext|> token is also used in the dataset format so GPT-2 knows when an example ends.

Thanks!
