The generated sequences and the trained models will be uploaded to Google Drive here
The dependencies are listed in the requirements.txt file
The models have been trained on the following three books:
- Republic by Plato (republic.txt)
- Moby Dick by Herman Melville (moby.txt)
- Adam Bede by George Eliot (adam.txt; only the first 89 chapters were used, to reduce training time)
The generated sequences are stored in book_sequences.txt
The models are stored as model_book.h5
The tokenizer files are stored as pickle dumps (tokenizer_book.pkl)
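A minimal sketch of how the stored tokenizer might be reloaded from its pickle dump. The toy word-to-integer mapping below is purely illustrative (it stands in for the real Keras Tokenizer object); loading the .h5 model itself would additionally require TensorFlow, shown only as a comment:

```python
import pickle

# Loading the trained model would look like this (requires TensorFlow):
# from tensorflow.keras.models import load_model
# model = load_model("model_republic.h5")

# Illustrative pickle round-trip with a toy mapping standing in for the
# real Tokenizer object (hypothetical words, not from the actual books):
vocab = {"the": 1, "whale": 2, "ship": 3}
with open("tokenizer_demo.pkl", "wb") as f:
    pickle.dump(vocab, f)
with open("tokenizer_demo.pkl", "rb") as f:
    restored = pickle.load(f)

print(restored["whale"])  # 2
```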
The notebook train_book.ipynb contains the code required to train the model
train_republic.ipynb contains annotated code for better understanding
predict.ipynb performs the final sentence completion using the stored model
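The completion loop in predict.ipynb can be sketched as follows. Here `predict_next` is a stand-in for the trained model's `model.predict` call, and the three-word vocabulary is hypothetical; the real notebook pads the token window and takes the argmax of the softmax output instead:

```python
def predict_next(token_ids):
    # Stand-in for model.predict: the real code feeds the padded
    # 50-token window into the Keras model and takes the argmax of
    # the softmax output. Here we just cycle a toy vocabulary.
    return (token_ids[-1] % 3) + 1

word_index = {"call": 1, "me": 2, "ishmael": 3}   # hypothetical mapping
index_word = {i: w for w, i in word_index.items()}

def complete(seed, n_words=5, window=50):
    tokens = [word_index[w] for w in seed.lower().split() if w in word_index]
    out = []
    for _ in range(n_words):
        next_id = predict_next(tokens[-window:])  # only the last 50 tokens matter
        tokens.append(next_id)
        out.append(index_word[next_id])
    return " ".join(out)

print(complete("call me"))  # ishmael call me ishmael call
```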
The folder also contains other books (jungle.txt, The Jungle Book, and eliot.txt); they were not used to train the model due to their smaller size
The books were downloaded from the Project Gutenberg website, a free repository of many such books
This project could be expanded in the future to perform style transfer on text: the user supplies a text, chooses an author, and the supplied text is rewritten in the style of the chosen author
Limitations:
- When predicting new text, only the last 50 words of the seed sequence are considered, so a longer seed text provides no extra benefit
- The tokenizer files contain only the word-to-integer mappings for words present in the book, so supplying the tokenizer a word that was not in the book will raise an error
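Both limitations can be worked around before calling the model: truncate the seed to its last 50 words and drop any word missing from the tokenizer's mapping. A sketch, with a hypothetical word index standing in for the pickled tokenizer's `word_index`:

```python
def prepare_seed(seed, word_index, window=50):
    words = seed.lower().split()
    # Keep only words the tokenizer knows; unknown words would otherwise
    # have no integer mapping and cause a lookup error.
    known = [w for w in words if w in word_index]
    # The model only ever sees the last `window` (here 50) tokens.
    return [word_index[w] for w in known[-window:]]

word_index = {"the": 1, "whale": 2, "hunts": 3}   # hypothetical mapping
print(prepare_seed("suddenly the giant whale hunts", word_index))  # [1, 2, 3]
```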