ds-recommender-project's People

Contributors

simon376

ds-recommender-project's Issues

Add Interruption Callback

Add a callback that saves checkpoints so training can be interrupted and resumed later (especially when training on Google Colab).
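A minimal, framework-agnostic sketch of the save-and-resume logic such a callback would need. The `train` function, the layout of the `state` dictionary, and the checkpoint path are all hypothetical. In Keras the same role would more likely be filled by the built-in `ModelCheckpoint` or `BackupAndRestore` callbacks than by hand-written code.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return the saved training state, or a fresh one if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"epoch": 0, "history": []}


def save_checkpoint(path, state):
    """Write the state atomically so an interrupt cannot leave a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)


def train(path, total_epochs, stop_after=None):
    """Toy training loop: resumes from `path` and checkpoints after every epoch.

    `stop_after` simulates an interruption (e.g. a Colab disconnect).
    """
    state = load_checkpoint(path)
    epochs_run = 0
    while state["epoch"] < total_epochs:
        if stop_after is not None and epochs_run >= stop_after:
            break  # simulated interruption
        state["epoch"] += 1
        # Stand-in for one real epoch of training.
        state["history"].append(f"epoch {state['epoch']} done")
        save_checkpoint(path, state)
        epochs_run += 1
    return state
```

Calling `train(path, total_epochs=5, stop_after=2)` and then `train(path, total_epochs=5)` continues from epoch 3 instead of starting over.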

Remove Genre Data again

For now, only the "crime, mystery, thriller" subset of the book graph and reviews will be used, so only the reviews associated with those books will have a title, description, etc. The genre is therefore already fixed, because the dataset authors used these fuzzy genres to generate the subsets.

Remove Reviews without Rating

The key task is to predict the rating, so reviews without a rating attached are of little use.

Also, the output dimension should then be changed to cover ratings 1-5.
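A small sketch of the filtering step, assuming each review is a dict with a `rating` field and that a missing rating shows up as `None`, `0`, or an absent key (an assumption about the raw data, not something confirmed here). Shifting ratings 1-5 to class indices 0-4 is one possible reading of "output dimension"; a regression head would keep the raw value instead.

```python
VALID_RATINGS = (1, 2, 3, 4, 5)


def keep_rated(reviews):
    """Drop reviews whose rating is missing or outside 1-5."""
    return [r for r in reviews if r.get("rating") in VALID_RATINGS]


def rating_to_class(rating):
    """Map a 1-5 star rating to a 0-4 class index."""
    return rating - 1


# Toy records for illustration only.
reviews = [
    {"review_id": "a", "rating": 4},
    {"review_id": "b", "rating": 0},     # assumed "no rating" marker
    {"review_id": "c", "rating": None},
    {"review_id": "d", "rating": 1},
    {"review_id": "e"},                  # rating field missing entirely
]

rated = keep_rated(reviews)
labels = [rating_to_class(r["rating"]) for r in rated]
```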

Fix TextVectorization Serialization

Re-using a precomputed vocabulary does not work:

ValueError: Exception encountered when calling layer "textvectorization_review_text_0" (type TextVectorization).

When using TextVectorization to tokenize strings, the input rank must be 1 or the last shape dimension must be 1. Received: inputs.shape=(None, None) with rank=2

Call arguments received:
  • inputs=tf.Tensor(shape=(None, None), dtype=string)

Train on diverse dataset

Currently, only the Mystery-Thriller-Crime book subset is used, as recommended by the authors, because the full dataset is otherwise gigantic.

Still, training could perhaps happen off-site on more data once an efficient input pipeline has been set up ( #3 ).

Use Dask or tf.data to improve input data pipeline

Currently, only about 50k data points are used for training, since loading JSON into a pandas DataFrame quickly exhausts the available RAM (even on Google Colab). One option is to write a generator that loads the data in batches.

Because the raw input data has to be preprocessed in pandas, using Dask dataframes instead of pandas may be the way forward: tf.data doesn't come with a read_json helper natively, and preprocessing the data in pandas is easiest (and already implemented).
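A rough sketch of the batch-loading generator idea, assuming the raw file is line-delimited JSON (one object per line), optionally gzip-compressed. The function name `iter_json_batches` and the `keep_fields` parameter are hypothetical. Each yielded batch could be turned into a small DataFrame for the existing pandas preprocessing, or the generator could be wrapped with `tf.data.Dataset.from_generator`.

```python
import gzip
import json
from itertools import islice


def iter_json_batches(path, batch_size, keep_fields=None):
    """Yield lists of at most `batch_size` parsed records from a JSON-lines file.

    Only one batch is held in memory at a time. `keep_fields` optionally
    restricts each record to the listed keys to save further memory.
    """
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as f:
        # Lazily parse non-empty lines; nothing is read until a batch is requested.
        records = (json.loads(line) for line in f if line.strip())
        if keep_fields is not None:
            records = ({k: r.get(k) for k in keep_fields} for r in records)
        while True:
            batch = list(islice(records, batch_size))
            if not batch:
                return
            yield batch
```

Dropping unused fields early matters here because the long text columns are likely what dominates memory use.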

Hyperparameter-Tuning

Decide on one or two models and tune their hyperparameters, varying:

  • loss functions
  • embedding sizes
  • learning rates (adaptive? LR scheduler?)
  • (datasets: e.g. remove string data and see if it helps/hurts)
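The list above could be enumerated as a simple grid, sketched below. Every value in `search_space` is a placeholder rather than a setting taken from the project, and `iter_configs` is a hypothetical helper.

```python
from itertools import product

# Placeholder search space; none of these values come from the project.
search_space = {
    "loss": ["mse", "mae"],
    "embedding_dim": [16, 32, 64],
    "learning_rate": [1e-2, 1e-3],
    "use_text_features": [True, False],
}


def iter_configs(space):
    """Yield one dict per combination of hyperparameter values."""
    keys = list(space)
    for values in product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))


configs = list(iter_configs(search_space))
```

Each configuration would then be passed to a model-building function and scored on a validation set. A full grid grows quickly, so random sampling from `configs` may be the cheaper option.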

Change Task to Predict (User, Rating) tuples

For now, the network trains on all columns:

  • Book Information (Title, Description, ID)
  • Author ID
  • Review Information (Text, Rating)
  • User ID

and predicts a rating based on this information, which is then used for overall ranking.

But the end goal should be a query model and a candidate model (similar to a retrieval task), where books (book IDs + metadata + associated reviews) are scored against the user query (user ID) for that specific user.
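A toy sketch of how a query/candidate setup ranks items, assuming the query model has already produced one embedding per user and the candidate model one embedding per book. The vectors and IDs below are made up for illustration, and dot-product scoring is one common choice, not necessarily the one this project will use.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def rank_candidates(user_vec, book_vecs, top_k=None):
    """Return book ids sorted by descending dot-product score with the user."""
    ranked = sorted(
        book_vecs,
        key=lambda book_id: dot(user_vec, book_vecs[book_id]),
        reverse=True,
    )
    return ranked[:top_k]


# Made-up embeddings: in practice these come from the trained towers.
user_vec = [1.0, 0.0]
book_vecs = {
    "book_a": [0.2, 0.9],
    "book_b": [0.8, 0.1],
    "book_c": [0.5, 0.5],
}

ranking = rank_candidates(user_vec, book_vecs, top_k=2)
```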

User-specific queries

Rating predictions converge at 1.14, since ratings don't line up perfectly across users: people vote differently. Predicting the rating for a specific user would probably lead to much better results.
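The intuition can be illustrated with two trivial predictors on toy data: one that always predicts the global mean rating and one that predicts each user's own mean. The ratings below are invented, and both predictors are evaluated on the same data they were fitted on, so the gap overstates what a real model would gain.

```python
from collections import defaultdict
from math import sqrt


def rmse(predictions, targets):
    """Root mean squared error between two equal-length sequences."""
    return sqrt(sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets))


# Invented (user, rating) pairs: u1 rates high, u2 rates low.
ratings = [("u1", 5), ("u1", 4), ("u2", 2), ("u2", 1)]
targets = [r for _, r in ratings]

# Predictor 1: the same global mean for everyone.
global_mean = sum(targets) / len(targets)
global_rmse = rmse([global_mean] * len(targets), targets)

# Predictor 2: each user's own mean rating.
per_user = defaultdict(list)
for user, rating in ratings:
    per_user[user].append(rating)
user_mean = {u: sum(rs) / len(rs) for u, rs in per_user.items()}
user_rmse = rmse([user_mean[u] for u, _ in ratings], targets)
```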

Train without complex text data for comparison

Remove the Description, Review Text, and Title columns and train the same models to get a baseline for comparison, showing whether the extra data helps the system.

Ideally it would help, or the model would otherwise learn to ignore the data on its own, but that could take quite a few training epochs.

Add Train/Test/Validation Split

Add a three-way split: a validation set to monitor performance during training and to select hyperparameters, plus a separate, completely unseen test set to evaluate the final results.
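A minimal sketch of such a split using a seeded shuffle. The function name, the 80/10/10 proportions, and the seed are assumptions. For the user-specific task, a split grouped by user or by time might be more appropriate than this purely random one.

```python
import random


def three_way_split(items, val_frac=0.1, test_frac=0.1, seed=42):
    """Shuffle `items` reproducibly and return (train, val, test) lists."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_test = int(len(items) * test_frac)
    n_val = int(len(items) * val_frac)
    test = items[:n_test]
    val = items[n_test:n_test + n_val]
    train = items[n_test + n_val:]
    return train, val, test


# Stand-in for review indices.
train_set, val_set, test_set = three_way_split(range(100))
```

The test set should be touched only once, after the hyperparameters have been chosen on the validation set.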
