
entity-embedding-rossmann's Issues

Keras Reshape() layer

There is a problem:

TypeError: __init__() missing 1 required positional argument: 'target_shape'

and the code is:

model_store.add(Reshape(dims=(50,)))

dims???
I can't understand this parameter.

Thank you very much~
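
For reference: dims is the keyword this layer took in Keras 0.x, and Keras 1.0 renamed it to target_shape, which is what the TypeError is asking for. A minimal sketch of the updated call (the 1115-store / 50-dimension sizes mirror the store embedding in this repo):

    from keras.models import Sequential
    from keras.layers import Embedding, Reshape

    model_store = Sequential()
    # Embedding outputs (batch, input_length, output_dim) = (batch, 1, 50)
    model_store.add(Embedding(1115, 50, input_length=1))
    # Keras >= 1.0 renamed the old `dims` keyword to `target_shape`
    model_store.add(Reshape(target_shape=(50,)))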

How to build store_states.csv

Hi:
I noticed the store_states.csv file in the repo; it describes which state each store is in. This CSV does not come from Kaggle, so how was it built? Thank you.

How to use the embedding on a new categorical data

Hello,

I have a general, rudimentary question (sorry in advance).

I have reviewed (though not fully) many parts of the code here. I'd like to test the proposed embedding on new data, but I am not sure where to begin.

I have simple 2-column data: the first column is a patient id (assume 1M unique patients) and the second is an ICD10 diagnosis code (assume 10K categories). We have repeated measurements in the data, meaning that diagnoses can repeat within a given patient and across many patients.

I tested Multiple Correspondence Analysis with categorical data from this link, but the results are not very useful.

Similar to the German states example in the repo, my goal is to perform (unsupervised) dimensionality reduction (such as you'd see in a denoising AE minimizing reconstruction error).

  • Where should I start? Do I need to one-hot encode beforehand?
  • Which functions should I use after loading my raw data to generate such an embedding?

Appreciate any words of wisdom you may be able to share.
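
A sketch of one possible starting point, assuming Keras: no one-hot encoding is needed, because an Embedding layer consumes integer indices directly. Note that entity embeddings as used in this repo are learned against a supervised target; for a purely unsupervised embedding you would need some auxiliary objective instead, e.g. predicting which codes co-occur within a patient, word2vec-style. All data and names below are hypothetical:

    import numpy as np
    from keras.models import Model
    from keras.layers import Input, Embedding, Flatten, Dense
    from sklearn.preprocessing import LabelEncoder

    # Toy stand-in: one row per (patient, diagnosis) event
    diag_codes = np.array(['E11.9', 'I10', 'E11.9', 'J45.909'])
    labels = np.array([0, 1, 0, 1])  # stand-in supervised target

    # No one-hot needed: Embedding takes integer indices directly
    le = LabelEncoder()
    diag_idx = le.fit_transform(diag_codes).reshape(-1, 1)

    inp = Input(shape=(1,))
    emb = Embedding(len(le.classes_), 32, name='diag_embed')(inp)  # e.g. 10K codes -> 32 dims
    out = Dense(1, activation='sigmoid')(Flatten()(emb))
    model = Model(inp, out)
    model.compile(optimizer='adam', loss='binary_crossentropy')
    model.fit(diag_idx, labels, epochs=1, verbose=0)

    # Each row of this matrix is the learned vector for one diagnosis code
    diag_vectors = model.get_layer('diag_embed').get_weights()[0]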

Categorical Variables with long tail

Thanks for sharing your code; it serves as a reference for learning and for better understanding the use case of embeddings.

Our dataset is also tabular and has high-cardinality columns (~6 million distinct values), such as IP address. On top of that, 85% of the IP addresses appear only once in the dataset. What are your thoughts on using this technique to map such a categorical column into Euclidean space?
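
One hedged option for the 85% singletons: collapse rare values into a shared token before embedding, since a value seen only once gives the embedding nothing to generalize from. A pandas sketch (the threshold and token name are arbitrary):

    import pandas as pd

    # Toy stand-in for the IP address column
    ips = pd.Series(['1.2.3.4', '5.6.7.8', '1.2.3.4', '9.9.9.9'])

    # Map values below a frequency threshold to one shared rare token
    counts = ips.value_counts()
    ips_bucketed = ips.where(ips.map(counts) >= 2, other='__RARE__')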

Do you have any suggestions for network design?

Thanks for your paper. I designed a network like the one it describes, but I have about 600 numerical features and 10 categorical features. The numerical features use Dense(1, input_dim=1) and the categorical features use Embedding(). But when I train, the loss is nan.
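
A NaN loss with hundreds of raw numeric inputs is most often caused by unscaled features or a too-large learning rate, not by the embeddings themselves. A hedged sketch of the two usual fixes:

    import numpy as np
    from sklearn.preprocessing import StandardScaler
    from keras.optimizers import Adam

    # Hypothetical numeric block: 600 heavy-tailed columns
    X_num = np.random.lognormal(size=(1000, 600)).astype('float32')

    # Standardize so no single feature blows up the activations
    X_num = StandardScaler().fit_transform(X_num)
    assert np.isfinite(X_num).all()  # NaN/inf inputs are another common culprit

    # Try a smaller learning rate when the loss diverges
    opt = Adam(lr=1e-4)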

Can't pickle <class 'module'>: attribute lookup module on builtins failed

Hi entron,

I got a problem when I pickle.dump the models.

Traceback (most recent call last):
File "train_test_model.py", line 90, in
pickle.dump(models, f)
_pickle.PicklingError: Can't pickle <class 'module'>: attribute lookup module on builtins failed

Could you give me some ideas? Many thanks.
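
The error comes from pickling the Keras model objects themselves: they hold references to backend modules, which pickle cannot serialize. A common workaround is Keras's own HDF5 serializer (a sketch with a stand-in model; requires h5py):

    from keras.models import Sequential, load_model
    from keras.layers import Dense

    # Stand-in for the trained model
    model = Sequential([Dense(1, input_dim=3)])
    model.compile(optimizer='sgd', loss='mse')

    # Saves architecture + weights + optimizer state without pickle
    model.save('nn_with_ee.h5')
    model = load_model('nn_with_ee.h5')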

Cannot find "embeddings.pickle" anywhere

Hi,
I'm trying to study your paper, but something is wrong with my code. While trying to locate the error I found the file named "embeddings.pickle" referenced everywhere, but I cannot find where it is created. I noticed the comment "# Use plot_embeddings.ipynb to create", but that still didn't give me the answer.
I'm wondering if you can give me a hand, thanks a lot.
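
For what it's worth, such a file can be produced by pulling the weight matrices out of the trained network's Embedding layers and pickling them; the model and layer names below are illustrative, not the repo's exact ones:

    import pickle
    from keras.models import Model
    from keras.layers import Input, Embedding, Flatten, Dense

    # Stand-in for the trained network
    inp = Input(shape=(1,))
    emb = Embedding(12, 6, name='state_embedding')(inp)
    out = Dense(1)(Flatten()(emb))
    model = Model(inp, out)

    # Collect every Embedding weight matrix from the model
    embeddings = [l.get_weights()[0] for l in model.layers
                  if isinstance(l, Embedding)]

    with open('embeddings.pickle', 'wb') as f:
        pickle.dump(embeddings, f, -1)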

The embedding layer

Hi Entron,

Noticed you also updated your code to Keras 2.0! This is really helpful.

May I ask a simple question? For each embedding input you define, how do we visualize the embedding layer? (Pardon my silliness.)

For example, let's suppose we have only these two for the embedding layer:

[screenshot of the two embedding definitions]

output_embeddings = [output_dow, output_promo,]

output_model = Concatenate()(output_embeddings)

output_promo is defined with a 1-unit dense layer, while output_dow comes from the embedding layer. Therefore, how many units are there in this particular (concatenated) layer?

Is there a total of 1 + 6? Or does each embedding count as 1, so 1 + 1 = 2 units in this layer?
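
Each input contributes its full output width, so the concatenated layer here would have 6 + 1 = 7 units, assuming day_of_week is embedded into 6 dimensions as in this repo. A sketch that makes the shapes explicit:

    from keras import backend as K
    from keras.layers import Input, Embedding, Flatten, Dense, Concatenate

    input_dow = Input(shape=(1,))
    output_dow = Flatten()(Embedding(7, 6)(input_dow))   # contributes 6 units

    input_promo = Input(shape=(1,))
    output_promo = Dense(1)(input_promo)                 # contributes 1 unit

    merged = Concatenate()([output_dow, output_promo])
    print(K.int_shape(merged))                           # (None, 7): 6 + 1, not 1 + 1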

Why aren't all features used?

Hi,
Thanks for your code. I notice that the features used for training are:

return [store_open,
            store_index,
            day_of_week,
            promo,
            year,
            month,
            day,
            store_data[store_index - 1]['State']
            ]

but there are lots of other features, such as these in store.csv:

Assortment | CompetitionDistance | CompetitionOpenSinceMonth | CompetitionOpenSinceYear

Could you share your reasoning for not using these as features? Thanks!

Lagged features

Hi @entron, I was wondering if you tried lagged variables. Did they end up just not being that useful?

What was your experience trying to incorporate classic time-series features, such as sales trends, into this dataset?

Thanks for your insights.

Categorization choices

Hi entron.
This is more of a conceptual question.
I'm reading your source and docs, and I can see that you embedded many features individually, but you haven't, for example, combined several of them into a single embedding.
By this I mean grouping features that are very similar in nature into one embedding, whose resulting vector would represent the "behaviour" of the group, perhaps better than each feature embedded on its own.
Did you try something like that? If yes, can you share your conclusions?
Thank you.

Connecting embedding weights to a categorical value

Hi Entron,

Thanks for sharing your code here!

I have data with 100,000 rows, a single categorical feature with 300 distinct values, and additional continuous variables. I merge the label-encoded categorical feature with the continuous ones and successfully train an NN for a classification problem.

I am using the following code for the embedding of the categorical feature to 5 dimensions:
input_leaf = Input(shape=(1,))
output_leaf = Embedding(300, 5, name='category_embed')(input_leaf)

Next, I am trying to extract the information from the embedding layer in order to use it in XGBoost as new features. However, the output of the following code:
model.get_layer('category_embed').get_weights()[0]
is a weight matrix of shape 300x5. My trouble is that I don't know how to map those weights back to the original categorical values (e.g. 22 = [0.2, -0.1, 0.6, 2, 3]).

Any help will be highly appreciated!
Gilad
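
Row i of the weight matrix is the vector for whichever category the label encoder mapped to integer i, so the encoder's classes_ attribute gives the row-to-category correspondence. A sketch with illustrative stand-ins:

    import numpy as np
    from sklearn.preprocessing import LabelEncoder

    # Stand-ins: le is the encoder fitted before training,
    # weights is the 300x5 matrix from get_weights()[0]
    le = LabelEncoder().fit(['leaf_a', 'leaf_b', 'leaf_c'])
    weights = np.random.rand(len(le.classes_), 5)

    # LabelEncoder assigns integers in sorted order of classes_,
    # and those integers index the embedding rows directly
    embedding_map = dict(zip(le.classes_, weights))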

Cannot reproduce "with EE" (with embeddings) paper results using this code

In the case of XGBoost, I could not find any functions in this repo that use embeddings. As I understand it, entity embeddings are produced by __build_keras_model(), which seems to be used only for deep learning here, even though the paper shows results in Tables III and IV for KNN, RF, and XGB as well.

Helping others reproduce the accuracy gains from using embeddings in XGBoost is important, because the XGBoost baseline accuracy is difficult to improve on with any neural-network-based method, not just embeddings.
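
For anyone attempting this: as I read the paper, the "with EE" results for KNN/RF/XGB amount to replacing each integer-coded categorical column with its learned embedding vector before training the other model. A hedged numpy sketch with hypothetical shapes:

    import numpy as np

    # Hypothetical: emb is the learned (300, 5) embedding matrix, X_cat the
    # integer-encoded category column, X_cont the continuous features
    emb = np.random.rand(300, 5)
    X_cat = np.random.randint(0, 300, size=1000)
    X_cont = np.random.rand(1000, 10)

    # Index embedding rows by category, then concatenate with the rest;
    # the result can be fed to xgboost (e.g. via xgboost.DMatrix)
    X_xgb = np.hstack([emb[X_cat], X_cont])   # shape (1000, 15)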

Keras version problem

I'm sorry to ask such a question. I used Keras 2.2.4 and found that 'Merge' could not be imported from 'keras.layers.core'. I did find 'merge' in 'keras.layers', but unfortunately the following error occurred:
Thank you very much!
[screenshot of the import error]
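
Keras 2 removed both the Merge layer class and the old merge function; what remains under keras.layers is not a drop-in replacement. The options are to pin the Keras version this branch targets (e.g. Keras 1.x, as another issue here uses) or to port the merge calls to Concatenate. A minimal sketch of the port:

    from keras.layers import Input, Dense, Concatenate

    x = Input(shape=(8,))
    a = Dense(4)(x)
    b = Dense(3)(x)

    # Keras 1 wrote: merged = Merge(mode='concat')([a, b])
    merged = Concatenate()([a, b])   # Keras 2 replacement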

Cannot reproduce meaningful embedding

Dear Entron,
I downloaded the kaggle branch and trained with the test_models.py file (default options: 1 network, train_ratio = 0.97), but I cannot reproduce meaningful embeddings like yours. I attached my state embedding, in which all states are the same distance from each other. Can you tell me why that is?
Thank you very much!
P.S.: I used Keras 1.2.2 and TensorFlow r0.10 with a GPU. I got "Result on validation data: 0.10472426564821177" and saved the trained model with Keras's save() (I could not save with pickle.dump because of the error in issue #9).
[attached image: state_embedding plot]

Can't find the place where embedding pickle file is created

I'm sorry for such a question, but I cannot figure out how you pretrained the embeddings.
I was looking for a function that writes the embeddings pickle file but found none.
Also, I could not understand from the paper whether you trained the embeddings on auxiliary tasks or whether they were formed in the process of optimizing for the main objective.

When do you decide to use a dense layer vs. an embedding?

Hi entron,

Really awesome work, and I learnt a lot from it. However, I do have a question, and I think this is the most appropriate place to ask.

Looking at your script, what is the methodology for deciding whether to embed a variable or feed it through a dense layer? For example, you chose to treat a variable like promo as a dense layer. Is there a reason those particular variables were given a dense layer?
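
The pattern in the repo, as I read it: inputs that are binary or already meaningfully numeric go straight through a Dense unit, while unordered categories with more than two levels get an Embedding, which can learn a geometry over the levels. A sketch of the contrast (sizes follow the repo's store embedding):

    from keras.layers import Input, Dense, Embedding, Flatten

    # promo is a 0/1 flag: no category structure to learn,
    # so the raw value feeds a Dense unit directly
    input_promo = Input(shape=(1,))
    output_promo = Dense(1)(input_promo)

    # store has 1115 unordered levels: an embedding learns distances
    # between stores that a single numeric input could not express
    input_store = Input(shape=(1,))
    output_store = Flatten()(Embedding(1115, 50)(input_store))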

Problem with Kaggle Branch

Hi, the kaggle branch gives a 404 error. Could you please tell me where I can have a look at the original competition version of the code? Thanks a lot!
