simon376 / ds-recommender-project
goodreads-based book recommendation system, created for class "data science"
add callback which saves checkpoints so the training can be interrupted and continued (especially when training on Google Colab)
instead of relying only on RMSE, also evaluate with a ranking metric such as hit ratio @k
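A minimal sketch of hit ratio @k, assuming a leave-one-out evaluation in which each user has exactly one held-out book and the model has already produced a ranked list of book IDs per user. The function name `hit_ratio_at_k` and both dictionaries are hypothetical and not part of the project code.

```python
from typing import Dict, Hashable, Sequence


def hit_ratio_at_k(
    ranked_items: Dict[Hashable, Sequence[Hashable]],
    held_out: Dict[Hashable, Hashable],
    k: int = 10,
) -> float:
    """Fraction of users whose held-out item appears in their top-k ranking.

    ranked_items maps a user ID to book IDs sorted from best to worst
    (assumed to come from the trained model); held_out maps a user ID to
    the single book that was withheld from training for that user.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    # Only users that have both a ranking and a held-out item are scored.
    users = [user for user in held_out if user in ranked_items]
    if not users:
        return 0.0
    hits = sum(1 for user in users if held_out[user] in ranked_items[user][:k])
    return hits / len(users)


if __name__ == "__main__":
    rankings = {"u1": ["b3", "b1", "b2"], "u2": ["b2", "b4", "b1"]}
    withheld = {"u1": "b1", "u2": "b1"}
    print(hit_ratio_at_k(rankings, withheld, k=2))
```

Unlike RMSE, this only rewards putting the relevant book near the top of the list, which is closer to what a recommender is actually used for.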
for now, only the "crime, mystery, thriller" subset of the book graph and reviews is used, so only the reviews associated with those books have a title, description, etc. The genre is therefore already fixed, because the dataset authors used these fuzzy genres to generate the subsets
key is to predict rating, so reviews without a rating attached are pretty useless
also, the output dimension should then be changed to rating 1-5
re-using precomputed vocabulary does not work:
ValueError: Exception encountered when calling layer "textvectorization_review_text_0" (type TextVectorization).
When using TextVectorization to tokenize strings, the input rank must be 1 or the last shape dimension must be 1. Received: inputs.shape=(None, None) with rank=2
Call arguments received:
• inputs=tf.Tensor(shape=(None, None), dtype=string)
currently, only the Mystery-Thriller-Crime book subset is used, as recommended by the authors, because of the otherwise gigantic dataset size.
still, maybe training can happen off-site using more data, after an efficient input pipeline has been set up ( #3 ).
currently, only about 50k datapoints are used for training, since loading the JSON into a pandas DataFrame quickly exhausts the available RAM (even on Google Colab), so try writing a generator that loads the data in batches.
because the raw input data has to be preprocessed in pandas, using dask dataframes instead of pandas may be the key to success, since tf.data doesn't come with a read_json helper natively and preprocessing the data in pandas is easiest (and already implemented)
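The batch-loading generator mentioned above could look roughly like this. It is a sketch that assumes the review file is stored as JSON Lines (one JSON object per line, optionally gzip-compressed); the function name `iter_json_batches` and the file name in the usage note are hypothetical.

```python
import gzip
import json
from itertools import islice
from typing import Dict, Iterator, List


def iter_json_batches(path: str, batch_size: int = 1024) -> Iterator[List[Dict]]:
    """Yield lists of parsed records so the whole file never sits in RAM.

    Assumes one JSON object per line; files ending in ".gz" are read
    through gzip, everything else as plain text.
    """
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as handle:
        # Lazily parse line by line, skipping blank lines.
        records = (json.loads(line) for line in handle if line.strip())
        while True:
            batch = list(islice(records, batch_size))
            if not batch:
                return
            yield batch
```

Each batch is a plain list of dicts, so it can be turned into a small DataFrame with `pandas.DataFrame(batch)` and run through the existing preprocessing one chunk at a time, for example `for batch in iter_json_batches("reviews.json.gz", 5000): ...`.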
decide on one or two models and try to hypertune the parameters, using different:
for now the network trains on all columns:
and predicts a rating based on this information, which is then used for overall ranking.
but the end goal should be to have a query and a candidate model (similar to a Retrieval Task), where Books (Book IDs + Metadata + associated Reviews) should be rated based on the user-query (User ID) for this specific user!
It may be better to train embeddings for the string data beforehand in a separate model and write it to disk, and re-use those embeddings when training the Recommender Model?
see: Word Embeddings | TF
rating predictions converge at 1.14, presumably because the ratings of all users taken together don't line up perfectly, since people rate differently. --> Predicting the rating for a specific user would probably lead to much better results
using TFRS or manually retrieve the Top k Recommendations
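A sketch of the manual variant, assuming a trained model is wrapped in a callable that returns a predicted rating for a (user ID, book ID) pair. The names `top_k_recommendations` and `predict_rating` are hypothetical; this is not the TFRS API.

```python
import heapq
from typing import Callable, Hashable, Iterable, List, Tuple


def top_k_recommendations(
    user_id: Hashable,
    candidate_book_ids: Iterable[Hashable],
    predict_rating: Callable[[Hashable, Hashable], float],
    k: int = 10,
) -> List[Tuple[Hashable, float]]:
    """Score every candidate book for one user and keep the k best.

    predict_rating stands in for the trained rating model. heapq.nlargest
    keeps only k items in memory instead of sorting all candidates.
    """
    scored = ((predict_rating(user_id, book), book) for book in candidate_book_ids)
    best = heapq.nlargest(k, scored, key=lambda pair: pair[0])
    return [(book, score) for score, book in best]


if __name__ == "__main__":
    # Toy scores in place of a real model.
    fake_scores = {"b1": 3.5, "b2": 4.8, "b3": 2.1, "b4": 4.2}
    print(top_k_recommendations("u1", fake_scores, lambda u, b: fake_scores[b], k=2))
```

Scoring every candidate per user is only practical for a small catalogue; for larger ones a retrieval model that narrows the candidates first would be the usual choice.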
remove Description, Review Text, Title columns and train the same models, to have a ground truth / baseline for comparison, if the extra data helps the system.
ideally it would help, or otherwise the model would just learn to ignore the extra data on its own, but that could take quite a few training epochs
stop the need for VS Code to be open & random disconnects
fill empty with NaN
otherwise number of datapoints goes from 40k to 10k (when importing 50k datapoints initially)
add threefold split to validate performance during training and for selecting hyperparameters, with a separate completely unknown test-set to evaluate the final results
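A minimal sketch of such a split, assuming the records fit in a list and a purely random split is acceptable (no grouping by user or time). The function name `train_val_test_split` and the 80/10/10 default are illustrative choices, not taken from the project.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def train_val_test_split(
    records: Sequence[T],
    val_fraction: float = 0.1,
    test_fraction: float = 0.1,
    seed: int = 42,
) -> Tuple[List[T], List[T], List[T]]:
    """Shuffle once with a fixed seed and cut into train / validation / test."""
    if val_fraction < 0 or test_fraction < 0 or val_fraction + test_fraction >= 1:
        raise ValueError("fractions must be non-negative and sum to less than 1")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    n_val = int(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test
```

The validation part is used during training and for hyperparameter selection; the test part is touched only once, for the final evaluation. A fixed seed keeps the split reproducible across Colab sessions.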
due to the huge dataset this of course now takes forever. maybe one can do this once and save it (see also #4 )
following #4 , visualize the embeddings using the embedding projector to see if they were "successful"