sararob / keras-wine-model
Model built with the Keras Functional API
License: Apache License 2.0
@sararob, thanks for providing this example model - very interesting.
I'm trying to recreate it locally and am getting a MemoryError on the line:
description_bow_train = tokenize.texts_to_matrix(description_train)
I've tried creating description_bow_train
iteratively as follows:
# Convert to a bag of words vector, one row at a time
description_bow_train = []
i = 0
for row in description_train:
    m = tokenize.texts_to_matrix(row)
    description_bow_train.append(m)
    if i % 100 == 0:
        print("Row %i" % i)
    i += 1
With this approach, memory (8GB allocated in a Docker container) is exhausted after about 2,500 rows.
Do you have any sense of how much memory this data set should consume to load? 8GB for 2,500 rows seems excessive to me...
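One thing worth checking, as a side note: texts_to_matrix expects a list of texts. Passing a single string row makes the Tokenizer treat each character as a separate text, so every call returns a matrix with one row per character, which by itself could explain the blow-up. A minimal sketch of a chunked version that keeps each batch a list of strings (chunk_size is an arbitrary choice; tokenize is assumed to be the already-fitted Tokenizer):

import numpy as np

chunks = []
chunk_size = 500  # arbitrary; tune to available memory
for start in range(0, len(description_train), chunk_size):
    batch = list(description_train[start:start + chunk_size])  # a list of strings
    chunks.append(tokenize.texts_to_matrix(batch))
description_bow_train = np.vstack(chunks)

Note that the final dense matrix can still run to several GB for a large vocabulary, since texts_to_matrix allocates one float per (row, word) pair.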
In the wide part of the model, one-hot encoding is used to label categorical features with just a few unique values.
# Wide feature 2: one-hot vector of variety categories
# Use sklearn utility to convert label strings to numbered index
encoder = LabelEncoder()
encoder.fit(variety_train)
variety_train = encoder.transform(variety_train)
variety_test = encoder.transform(variety_test)
num_classes = np.max(variety_train) + 1
# Convert labels to one hot
variety_train = keras.utils.to_categorical(variety_train, num_classes)
variety_test = keras.utils.to_categorical(variety_test, num_classes)
However, some values may occur only in the test set (fortunately, there is no such instance in the wine dataset). It's safer to fit the encoder on all possible values, as sketched below. Similarly, the tokenizer used to preprocess descriptions should also learn from as much information as possible, which the full data set (including the test portion) can provide without data leakage, since no target labels are involved.
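A minimal sketch of that suggestion, assuming variety_train and variety_test are the raw label arrays before encoding:

import numpy as np
from sklearn.preprocessing import LabelEncoder
import keras

# Fit the encoder on every value observed in either split, so categories
# that only appear in the test set still get an index.
encoder = LabelEncoder()
encoder.fit(np.concatenate([variety_train, variety_test]))

variety_train_idx = encoder.transform(variety_train)
variety_test_idx = encoder.transform(variety_test)
num_classes = len(encoder.classes_)  # rather than np.max(variety_train) + 1

variety_train = keras.utils.to_categorical(variety_train_idx, num_classes)
variety_test = keras.utils.to_categorical(variety_test_idx, num_classes)

The same idea applies to the Tokenizer: calling tokenize.fit_on_texts on the full set of descriptions leaks no target information, since only the input text is used.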
When padding the description sequences, the code uses the default value 0 to pad:
train_embed = keras.preprocessing.sequence.pad_sequences(
    train_embed, maxlen=max_seq_length, padding="post")
test_embed = keras.preprocessing.sequence.pad_sequences(
    test_embed, maxlen=max_seq_length, padding="post")
But in the description encoding process, 0 refers to one specific word. Maybe use max(encoded_value) + 1 as the padding value instead?
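For what it's worth, recent Keras Tokenizer versions reserve index 0 (word_index starts at 1), so the default may already be safe. If the encoding in use does assign 0 to a word, pad_sequences accepts a value argument, so the change could look like this (a sketch, assuming train_embed and test_embed are still lists of integer sequences at this point):

# Pick a padding index that no real word uses: one past the largest index seen.
pad_value = max(max(seq) for seq in train_embed if len(seq) > 0) + 1

train_embed = keras.preprocessing.sequence.pad_sequences(
    train_embed, maxlen=max_seq_length, padding="post", value=pad_value)
test_embed = keras.preprocessing.sequence.pad_sequences(
    test_embed, maxlen=max_seq_length, padding="post", value=pad_value)

Any downstream Embedding layer would then need input_dim of at least pad_value + 1 so the padding index stays in range.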
Could you please provide a TensorFlow implementation of the same code, where the input is tokenized text as in the blog post, but with the layers defined using TF? Please.
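In case it helps while waiting for an answer, here is a minimal sketch of the same wide & deep shape written with tf.keras layers only. All sizes below are placeholders, not values from the blog post, and the sketch assumes the bag-of-words, one-hot variety, and padded-sequence inputs described above:

import tensorflow as tf

vocab_size = 12000      # placeholder: bag-of-words vocabulary size
max_seq_length = 170    # placeholder: padded description length
num_classes = 40        # placeholder: number of variety categories

# Wide branch: bag-of-words description + one-hot variety.
bow_input = tf.keras.Input(shape=(vocab_size,), name="description_bow")
variety_input = tf.keras.Input(shape=(num_classes,), name="variety")
wide = tf.keras.layers.concatenate([bow_input, variety_input])
wide = tf.keras.layers.Dense(256, activation="relu")(wide)

# Deep branch: embedded description sequence.
seq_input = tf.keras.Input(shape=(max_seq_length,), name="description_seq")
deep = tf.keras.layers.Embedding(vocab_size, 8)(seq_input)
deep = tf.keras.layers.Flatten()(deep)
deep = tf.keras.layers.Dense(128, activation="relu")(deep)

# Merge both branches and predict the (regression) target.
merged = tf.keras.layers.concatenate([wide, deep])
output = tf.keras.layers.Dense(1)(merged)

model = tf.keras.Model(
    inputs=[bow_input, variety_input, seq_input], outputs=output)
model.compile(optimizer="adam", loss="mse", metrics=["mae"])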