The generated sequences and the trained models will be uploaded to Google Drive here
The dependencies are listed in the requirements.txt file
The models have been trained on the following three books:
- Republic by Plato (republic.txt)
- Moby Dick by Herman Melville (moby.txt)
- Adam Bede by George Eliot (adam.txt; only the first 89 chapters were used, to reduce training time)
The generated sequences are stored in book_sequences.txt
The models are stored as model_book.h5
The tokenizer files are stored as pickle dumps (tokenizer_book.pkl)
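A minimal sketch of how the stored tokenizer might be reloaded from its pickle dump. The toy word-to-integer mapping below is purely illustrative (it stands in for the real Keras Tokenizer object); loading the .h5 model itself would additionally require TensorFlow, shown only as a comment:

```python
import pickle

# Loading the trained model would look like this (requires TensorFlow):
# from tensorflow.keras.models import load_model
# model = load_model("model_republic.h5")

# Illustrative pickle round-trip with a toy mapping standing in for the
# real Tokenizer object (hypothetical words, not from the actual books):
vocab = {"the": 1, "whale": 2, "ship": 3}
with open("tokenizer_demo.pkl", "wb") as f:
    pickle.dump(vocab, f)
with open("tokenizer_demo.pkl", "rb") as f:
    restored = pickle.load(f)

print(restored["whale"])  # 2
```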
The notebook train_book.ipynb contains the code required to train the model
train_republic.ipynb contains annotated code for better understanding
predict.ipynb performs the final sentence completion using the stored model
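The completion loop in predict.ipynb can be sketched as follows. Here `predict_next` is a stand-in for the trained model's `model.predict` call, and the three-word vocabulary is hypothetical; the real notebook pads the token window and takes the argmax of the softmax output instead:

```python
def predict_next(token_ids):
    # Stand-in for model.predict: the real code feeds the padded
    # 50-token window into the Keras model and takes the argmax of
    # the softmax output. Here we just cycle a toy vocabulary.
    return (token_ids[-1] % 3) + 1

word_index = {"call": 1, "me": 2, "ishmael": 3}   # hypothetical mapping
index_word = {i: w for w, i in word_index.items()}

def complete(seed, n_words=5, window=50):
    tokens = [word_index[w] for w in seed.lower().split() if w in word_index]
    out = []
    for _ in range(n_words):
        next_id = predict_next(tokens[-window:])  # only the last 50 tokens matter
        tokens.append(next_id)
        out.append(index_word[next_id])
    return " ".join(out)

print(complete("call me"))  # ishmael call me ishmael call
```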
The folder also contains other books (jungle.txt, The Jungle Book, and eliot.txt); they were not used to train the model due to their smaller size
The books were downloaded from the Project Gutenberg website, a free repository of many such books
This project could be expanded in the future to perform style transfer on text: the user supplies a text, chooses an author, and the supplied text is rewritten in the style of the chosen author
Limitations:
- When predicting new text, only the last 50 words of the seed sequence are considered, so a longer seed text provides no extra benefit
- The tokenizer files contain only the word-to-integer mappings for words present in the book, so supplying the tokenizer a word that was not in the book will raise an error
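Both limitations can be worked around before calling the model: truncate the seed to its last 50 words and drop any word missing from the tokenizer's mapping. A sketch, with a hypothetical word index standing in for the pickled tokenizer's `word_index`:

```python
def prepare_seed(seed, word_index, window=50):
    words = seed.lower().split()
    # Keep only words the tokenizer knows; unknown words would otherwise
    # have no integer mapping and cause a lookup error.
    known = [w for w in words if w in word_index]
    # The model only ever sees the last `window` (here 50) tokens.
    return [word_index[w] for w in known[-window:]]

word_index = {"the": 1, "whale": 2, "hunts": 3}   # hypothetical mapping
print(prepare_seed("suddenly the giant whale hunts", word_index))  # [1, 2, 3]
```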