simon376 / ds-recommender-project



Goodreads-based book recommendation system, created for a "Data Science" class.

Jupyter Notebook 99.73% Python 0.27%

data-science goodreads goodreads-data pandas recommendation recommendation-system recommender-system tensorflow

ds-recommender-project's Issues

Add Interruption Callback

Add a callback that saves checkpoints so that training can be interrupted and resumed later (especially when training on Google Colab).

Run Hyperparameter Optimization on Top Ranking

Instead of using only RMSE, also evaluate with a ranking metric such as hit ratio@k.
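A minimal sketch of hit ratio@k, assuming a leave-one-out setup in which each user has exactly one held-out book and the model produces a ranked list of book IDs per user. The function name, the argument layout and the toy IDs are illustrative assumptions, not code from the repository.

```python
def hit_ratio_at_k(ranked_items, held_out_item, k):
    """Fraction of users whose held-out item appears in their top-k list.

    ranked_items:  dict user_id -> list of item IDs, best first (assumed layout)
    held_out_item: dict user_id -> the single held-out item ID (assumed layout)
    """
    if not held_out_item:
        return 0.0
    hits = 0
    for user_id, target in held_out_item.items():
        # Users without any ranking count as a miss.
        top_k = ranked_items.get(user_id, [])[:k]
        if target in top_k:
            hits += 1
    return hits / len(held_out_item)


# Toy data, made up for illustration only.
ranked = {"u1": ["b3", "b7", "b1"], "u2": ["b2", "b9", "b4"]}
held_out = {"u1": "b7", "u2": "b4"}

print(hit_ratio_at_k(ranked, held_out, k=2))  # u1 hits, u2 misses
print(hit_ratio_at_k(ranked, held_out, k=3))  # both hit
```

Unlike RMSE, this only rewards placing the relevant book near the top of the list, which is closer to what a recommender is used for.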

For now, only the "crime, mystery, thriller" subset of the book graph and reviews is used, so only the reviews associated with those books have a title, description, etc. The genre is therefore already fixed, because the dataset authors used these fuzzy genres to generate the subsets.

Remove Reviews without Rating

The key task is to predict the rating, so reviews without a rating attached are of little use.

Also, the output dimension should then be changed to cover only the ratings 1-5.
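A rough sketch of the filtering step, assuming each review is a dict with a `rating` field and that a missing rating shows up as `0` or `None`. Both the field name and that encoding are assumptions about the data, and the helper names are hypothetical.

```python
def keep_rated_reviews(reviews):
    """Keep only reviews whose rating is an integer from 1 to 5."""
    return [
        r for r in reviews
        if isinstance(r.get("rating"), int) and 1 <= r["rating"] <= 5
    ]


def rating_to_class(rating):
    """Map a rating 1-5 to a class index 0-4 for a 5-way output layer."""
    return rating - 1


# Toy reviews, made up for illustration only.
reviews = [
    {"review_id": "r1", "rating": 4},
    {"review_id": "r2", "rating": 0},     # assumed encoding for "no rating"
    {"review_id": "r3", "rating": None},
    {"review_id": "r4", "rating": 5},
]

rated = keep_rated_reviews(reviews)
labels = [rating_to_class(r["rating"]) for r in rated]
print([r["review_id"] for r in rated], labels)
```

If the rating is predicted as a regression target instead of a class, the mapping step can be dropped and only the filter is needed.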

Fix TextVectorization Serialization

Re-using a precomputed vocabulary does not work:

ValueError: Exception encountered when calling layer "textvectorization_review_text_0" (type TextVectorization).

When using TextVectorization to tokenize strings, the input rank must be 1 or the last shape dimension must be 1. Received: inputs.shape=(None, None) with rank=2

Call arguments received:
• inputs=tf.Tensor(shape=(None, None), dtype=string)

Train on diverse dataset

Currently, only the Mystery-Thriller-Crime book subset is used, as recommended by the authors, because the full dataset is otherwise gigantic.

Still, training could perhaps happen off-site on more data once an efficient input pipeline has been set up ( #3 ).

Use Dask or tf.data to improve input data pipeline

Currently, only about 50k data points are used for training, because loading the JSON into pandas DataFrames quickly exhausts the available RAM (even on Google Colab). One option is to write a generator that loads the data in batches.

Because the raw input data has to be preprocessed in pandas, using Dask DataFrames instead of pandas may be the key to success: tf.data does not come with a native read_json helper, and preprocessing the data in pandas is easiest (and already implemented).
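The batch-loading generator idea could look roughly like this. The sketch assumes the reviews are stored as newline-delimited JSON (one object per line); `iter_json_batches` is a hypothetical helper, and an in-memory `io.StringIO` stands in for the real file.

```python
import io
import json
from itertools import islice


def iter_json_batches(lines, batch_size):
    """Yield lists of parsed JSON records, at most batch_size per list.

    `lines` is any iterable of strings (e.g. an open file), so only one
    batch is held in memory at a time.
    """
    # Lazily parse non-empty lines; nothing is read until a batch is requested.
    records = (json.loads(line) for line in lines if line.strip())
    while True:
        batch = list(islice(records, batch_size))
        if not batch:
            return
        yield batch


# Toy stand-in for a newline-delimited JSON file.
fake_file = io.StringIO(
    '{"book_id": "1", "rating": 4}\n'
    '{"book_id": "2", "rating": 5}\n'
    '\n'
    '{"book_id": "3", "rating": 2}\n'
)

for batch in iter_json_batches(fake_file, batch_size=2):
    print(len(batch), [rec["book_id"] for rec in batch])
```

Each batch could then be turned into a small DataFrame for the existing pandas preprocessing, or the generator could be wrapped by an input pipeline; both are left open here.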

Hyperparameter-Tuning

Decide on one or two models and tune their hyperparameters, varying:

loss functions
Embedding sizes
learning rates (adaptable? LR scheduler?)
(Datasets: e.g. remove string data and see if it helps/hurts)
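One simple way to organize the search over the options listed above is an exhaustive grid. This is only a sketch: the concrete values in `search_space` are placeholders, and `evaluate` is a hypothetical function that would train a model for a given configuration and return a validation score (lower is better).

```python
from itertools import product

# Placeholder values; the real candidates would be chosen per model.
search_space = {
    "loss": ["mse", "mae"],
    "embedding_dim": [16, 32, 64],
    "learning_rate": [1e-2, 1e-3],
}


def iter_configs(space):
    """Yield one dict per combination of hyperparameter values."""
    keys = list(space)
    for values in product(*(space[key] for key in keys)):
        yield dict(zip(keys, values))


def grid_search(space, evaluate):
    """Return the (config, score) pair with the lowest score."""
    best_config, best_score = None, float("inf")
    for config in iter_configs(space):
        score = evaluate(config)
        if score < best_score:
            best_config, best_score = config, score
    return best_config, best_score


# Dummy scoring function standing in for "train and validate" (assumption).
def fake_evaluate(config):
    return abs(config["embedding_dim"] - 32) + config["learning_rate"]


best, score = grid_search(search_space, fake_evaluate)
print(best, score)
```

With twelve combinations a full grid is cheap to enumerate; if each training run is expensive, random sampling from the same `search_space` would be a natural alternative.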

Change Task to Predict (User, Rating) tuples

For now, the network trains on all columns:

Book Information (Title, Description, ID)
Author ID
Review Information (Text, Rating)
User ID

and predicts a rating based on this information, which is then used for the overall ranking.

But the end goal should be a query model and a candidate model (similar to a retrieval task), in which books (book IDs + metadata + associated reviews) are scored against the user query (user ID) for that specific user.
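To make the query/candidate idea concrete, here is a toy sketch of the scoring step only. It assumes the query model has already mapped a user ID to a vector and the candidate model has mapped each book to a vector of the same size; the vectors below are made up, and scoring by dot product is one common choice, not necessarily the one the project will use.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def rank_books_for_user(user_vec, book_vecs, top_k):
    """Return the top_k book IDs with the highest dot-product score."""
    scored = sorted(
        book_vecs.items(),
        key=lambda item: dot(user_vec, item[1]),
        reverse=True,
    )
    return [book_id for book_id, _ in scored[:top_k]]


# Hypothetical embeddings; in practice they would come from the trained models.
user_embedding = [1.0, 0.0]
book_embeddings = {
    "b1": [0.9, 0.1],
    "b2": [0.1, 0.9],
    "b3": [0.5, 0.5],
}

print(rank_books_for_user(user_embedding, book_embeddings, top_k=2))
```

The point of the split is that book vectors can be computed once and reused for every user, so only the user side has to be evaluated per query.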
