
entity-embedding-rossmann's Issues

Keras Reshape() layer

There is a problem:

TypeError: __init__() missing 1 required positional argument: 'target_shape'

and the code is:

model_store.add(Reshape(dims=(50,)))

dims???
I can't understand this parameter.

Thank you very much~
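
For reference: dims is the keyword this layer took in Keras 0.x, and Keras 1.0 renamed it to target_shape, which is what the TypeError is asking for. A minimal sketch of the updated call (the 1115-store / 50-dimension sizes mirror the store embedding in this repo):

    from keras.models import Sequential
    from keras.layers import Embedding, Reshape

    model_store = Sequential()
    # Embedding outputs (batch, input_length, output_dim) = (batch, 1, 50)
    model_store.add(Embedding(1115, 50, input_length=1))
    # Keras >= 1.0 renamed the old `dims` keyword to `target_shape`
    model_store.add(Reshape(target_shape=(50,)))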

How to build store_states.csv

Hi:
I noticed the store_states.csv file in the repo; it describes which state each store is in. This CSV does not come from Kaggle, so how was it built? Thank you.

How to use the embedding on a new categorical data

Hello,

I have a general, rudimentary question (sorry in advance).

I have reviewed (though not fully) many parts of the code here. I'd like to test the proposed embedding on new data, but I am not sure where to begin.

I have simple 2-column data: the first column is a patient id (assume 1M unique patients) and the second is an ICD10 diagnosis code (assume 10K categories). We have repeated measurements in the data, meaning that diagnoses can repeat within a given patient and across many patients.

I tested Multiple Correspondence Analysis with categorical data from this link, but the results are not very useful.

Similar to the German states example in the repo, my goal is to perform (unsupervised) dimensionality reduction (such as you'd see in a denoising AE minimizing reconstruction error).

  • Where should I start? Do I need to one-hot encode beforehand?
  • Which functions should I use after loading my raw data to generate such an embedding?

Appreciate any words of wisdom you may be able to share.
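
A sketch of one possible starting point, assuming Keras: no one-hot encoding is needed, because an Embedding layer consumes integer indices directly. Note that entity embeddings as used in this repo are learned against a supervised target; for a purely unsupervised embedding you would need some auxiliary objective instead, e.g. predicting which codes co-occur within a patient, word2vec-style. All data and names below are hypothetical:

    import numpy as np
    from keras.models import Model
    from keras.layers import Input, Embedding, Flatten, Dense
    from sklearn.preprocessing import LabelEncoder

    # Toy stand-in: one row per (patient, diagnosis) event
    diag_codes = np.array(['E11.9', 'I10', 'E11.9', 'J45.909'])
    labels = np.array([0, 1, 0, 1])  # stand-in supervised target

    # No one-hot needed: Embedding takes integer indices directly
    le = LabelEncoder()
    diag_idx = le.fit_transform(diag_codes).reshape(-1, 1)

    inp = Input(shape=(1,))
    emb = Embedding(len(le.classes_), 32, name='diag_embed')(inp)  # e.g. 10K codes -> 32 dims
    out = Dense(1, activation='sigmoid')(Flatten()(emb))
    model = Model(inp, out)
    model.compile(optimizer='adam', loss='binary_crossentropy')
    model.fit(diag_idx, labels, epochs=1, verbose=0)

    # Each row of this matrix is the learned vector for one diagnosis code
    diag_vectors = model.get_layer('diag_embed').get_weights()[0]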

Categorical Variables with long tail

Thanks for sharing your code; it serves as a reference for learning and for better understanding the use case of embeddings.

Our dataset is also tabular and has high-cardinality columns (~6 million distinct values), such as IP address. On top of that, 85% of the IP addresses appear only once in the dataset. What are your thoughts on using this technique to map such a categorical column into Euclidean space?
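
One hedged option for the 85% singletons: collapse rare values into a shared token before embedding, since a value seen only once gives the embedding nothing to generalize from. A pandas sketch (the threshold and token name are arbitrary):

    import pandas as pd

    # Toy stand-in for the IP address column
    ips = pd.Series(['1.2.3.4', '5.6.7.8', '1.2.3.4', '9.9.9.9'])

    # Map values below a frequency threshold to one shared rare token
    counts = ips.value_counts()
    ips_bucketed = ips.where(ips.map(counts) >= 2, other='__RARE__')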

Do you have any suggestions for network design?

Thanks for your paper. I designed a network like the one it describes, but I have about 600 numerical features and 10 categorical features. The numerical features use Dense(1, input_dim=1) and the categorical features use Embedding(). But when I train, the loss is nan.
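
A NaN loss with hundreds of raw numeric inputs is most often caused by unscaled features or a too-large learning rate, not by the embeddings themselves. A hedged sketch of the two usual fixes:

    import numpy as np
    from sklearn.preprocessing import StandardScaler
    from keras.optimizers import Adam

    # Hypothetical numeric block: 600 heavy-tailed columns
    X_num = np.random.lognormal(size=(1000, 600)).astype('float32')

    # Standardize so no single feature blows up the activations
    X_num = StandardScaler().fit_transform(X_num)
    assert np.isfinite(X_num).all()  # NaN/inf inputs are another common culprit

    # Try a smaller learning rate when the loss diverges
    opt = Adam(lr=1e-4)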

Can't pickle <class 'module'>: attribute lookup module on builtins failed

Hi entron,

I got a problem when I pickle.dump the models.

Traceback (most recent call last):
File "train_test_model.py", line 90, in
pickle.dump(models, f)
_pickle.PicklingError: Can't pickle <class 'module'>: attribute lookup module on builtins failed

Could you give me some ideas? Many thanks.
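
The error comes from pickling the Keras model objects themselves: they hold references to backend modules, which pickle cannot serialize. A common workaround is Keras's own HDF5 serializer (a sketch with a stand-in model; requires h5py):

    from keras.models import Sequential, load_model
    from keras.layers import Dense

    # Stand-in for the trained model
    model = Sequential([Dense(1, input_dim=3)])
    model.compile(optimizer='sgd', loss='mse')

    # Saves architecture + weights + optimizer state without pickle
    model.save('nn_with_ee.h5')
    model = load_model('nn_with_ee.h5')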

Cannot find "embeddings.pickle" anywhere

Hi,
I'm trying to study your paper, but something is wrong with my code. While trying to locate the error I found the file named "embeddings.pickle" referenced everywhere, but I cannot find where it is created. I noticed the comment "# Use plot_embeddings.ipynb to create", but that still didn't give me the answer.
I'm wondering if you can give me a hand, thanks a lot.
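
For what it's worth, such a file can be produced by pulling the weight matrices out of the trained network's Embedding layers and pickling them; the model and layer names below are illustrative, not the repo's exact ones:

    import pickle
    from keras.models import Model
    from keras.layers import Input, Embedding, Flatten, Dense

    # Stand-in for the trained network
    inp = Input(shape=(1,))
    emb = Embedding(12, 6, name='state_embedding')(inp)
    out = Dense(1)(Flatten()(emb))
    model = Model(inp, out)

    # Collect every Embedding weight matrix from the model
    embeddings = [l.get_weights()[0] for l in model.layers
                  if isinstance(l, Embedding)]

    with open('embeddings.pickle', 'wb') as f:
        pickle.dump(embeddings, f, -1)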

The embedding layer

Hi Entron,

Noticed you also updated your code to Keras 2.0! This is really helpful.

May I ask a simple question? For each embedding input you define, how do we visualize the embedding layer? (Pardon my silliness.)

For example, let's suppose we have only these two for the embedding layer:

[screenshot of the two embedding definitions]

output_embeddings = [output_dow, output_promo,]

output_model = Concatenate()(output_embeddings)

output_promo is defined with a 1-unit dense layer, while output_dow comes from the embedding layer. Therefore, how many units are there in this particular (concatenated) layer?

Is there a total of 1 + 6? Or does each embedding count as 1, so 1 + 1 = 2 units in this layer?
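
Each input contributes its full output width, so the concatenated layer here would have 6 + 1 = 7 units, assuming day_of_week is embedded into 6 dimensions as in this repo. A sketch that makes the shapes explicit:

    from keras import backend as K
    from keras.layers import Input, Embedding, Flatten, Dense, Concatenate

    input_dow = Input(shape=(1,))
    output_dow = Flatten()(Embedding(7, 6)(input_dow))   # contributes 6 units

    input_promo = Input(shape=(1,))
    output_promo = Dense(1)(input_promo)                 # contributes 1 unit

    merged = Concatenate()([output_dow, output_promo])
    print(K.int_shape(merged))                           # (None, 7): 6 + 1, not 1 + 1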

Why aren't all features used?

Hi,
Thanks for your code. I notice that the features used for training are:

return [store_open,
            store_index,
            day_of_week,
            promo,
            year,
            month,
            day,
            store_data[store_index - 1]['State']
            ]

but there are lots of other features, such as these in store.csv:

Assortment | CompetitionDistance | CompetitionOpenSinceMonth | CompetitionOpenSinceYear

Could you share your reasoning for not using these as features? Thanks!

Lagged features

Hi @entron, I was wondering if you tried lagged variables. Did they end up just not being that useful?

What was your experience trying to incorporate classic time-series features, such as sales trends, into this dataset?

Thanks for your insights.

Categorization choices

Hi entron.
This is more of a conceptual question.
I'm reading your source and docs, and I can see that you embedded many features individually, but you haven't, for example, combined several of them into a single embedding.
By this I mean grouping features that are very similar in nature into one embedding, whose resulting vector would represent the "behaviour" of the group, perhaps better than each feature embedded on its own.
Did you try something like that? If yes, can you share your conclusions?
Thank you.

Connecting embedding weights to a categorical value

Hi Entron,

Thanks for sharing your code here!

I have data with 100,000 rows, a single categorical feature with 300 distinct values, and additional continuous variables. I merge the label-encoded categorical feature with the continuous ones and successfully train an NN for a classification problem.

I am using the following code for the embedding of the categorical feature to 5 dimensions:
input_leaf = Input(shape=(1,))
output_leaf = Embedding(300, 5, name='category_embed')(input_leaf)

Next, I am trying to extract the information from the embedding layer in order to use it in XGBoost as new features. However, the output of the following code:
model.get_layer('category_embed').get_weights()[0]
is a weight matrix of shape 300x5. My trouble is that I don't know how to map those weights back to the original categorical values (e.g. 22 = [0.2, -0.1, 0.6, 2, 3]).

Any help will be highly appreciated!
Gilad
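
Row i of the weight matrix is the vector for whichever category the label encoder mapped to integer i, so the encoder's classes_ attribute gives the row-to-category correspondence. A sketch with illustrative stand-ins:

    import numpy as np
    from sklearn.preprocessing import LabelEncoder

    # Stand-ins: le is the encoder fitted before training,
    # weights is the 300x5 matrix from get_weights()[0]
    le = LabelEncoder().fit(['leaf_a', 'leaf_b', 'leaf_c'])
    weights = np.random.rand(len(le.classes_), 5)

    # LabelEncoder assigns integers in sorted order of classes_,
    # and those integers index the embedding rows directly
    embedding_map = dict(zip(le.classes_, weights))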

Cannot reproduce "with EE" (with embeddings) paper results using this code

In the case of XGBoost, I could not find any functions in this repo that use embeddings. As I understand it, entity embeddings are produced by __build_keras_model(), which seems to be used only for deep learning here, even though the paper shows results in Tables III and IV for KNN, RF, and XGB as well.

Helping others reproduce the accuracy gains from using embeddings in XGBoost is important, because the XGBoost baseline accuracy is difficult to improve on with any neural-network-based method, not just embeddings.
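
For anyone attempting this: as I read the paper, the "with EE" results for KNN/RF/XGB amount to replacing each integer-coded categorical column with its learned embedding vector before training the other model. A hedged numpy sketch with hypothetical shapes:

    import numpy as np

    # Hypothetical: emb is the learned (300, 5) embedding matrix, X_cat the
    # integer-encoded category column, X_cont the continuous features
    emb = np.random.rand(300, 5)
    X_cat = np.random.randint(0, 300, size=1000)
    X_cont = np.random.rand(1000, 10)

    # Index embedding rows by category, then concatenate with the rest;
    # the result can be fed to xgboost (e.g. via xgboost.DMatrix)
    X_xgb = np.hstack([emb[X_cat], X_cont])   # shape (1000, 15)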

Keras version problem

I'm sorry to ask such a question. I used Keras 2.2.4 and found that 'Merge' could not be imported from 'keras.layers.core'. I did find 'merge' in 'keras.layers', but unfortunately the following error occurred:
Thank you very much!
[screenshot of the import error]
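
Keras 2 removed both the Merge layer class and the old merge function; what remains under keras.layers is not a drop-in replacement. The options are to pin the Keras version this branch targets (e.g. Keras 1.x, as another issue here uses) or to port the merge calls to Concatenate. A minimal sketch of the port:

    from keras.layers import Input, Dense, Concatenate

    x = Input(shape=(8,))
    a = Dense(4)(x)
    b = Dense(3)(x)

    # Keras 1 wrote: merged = Merge(mode='concat')([a, b])
    merged = Concatenate()([a, b])   # Keras 2 replacement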

Cannot reproduce meaningful embedding

Dear Entron,
I downloaded the kaggle branch and trained with the test_models.py file (default options: 1 network, train_ratio = 0.97), but I cannot reproduce meaningful embeddings like yours. I attached my state embedding, in which all states are the same distance from each other. Can you tell me why that is?
Thank you very much!
P.S.: I used Keras 1.2.2 and TensorFlow r0.10 with a GPU. I got "Result on validation data: 0.10472426564821177" and saved the trained model with Keras's save() (I could not save with pickle.dump because of the error in issue #9).
[attached image: state_embedding plot]

Can't find the place where embedding pickle file is created

I'm sorry for such a question, but I cannot figure out how you pretrained the embeddings.
I was looking for a function that writes the embeddings pickle file but found none.
Also, I could not understand from the paper whether you trained the embeddings on auxiliary tasks or whether they were formed in the process of optimizing for the main objective.

When do you decide to use a dense layer vs. an embedding?

Hi entron,

Really awesome work, and I learnt a lot from it. However, I do have a question, and I think this is the most appropriate place to ask.

Looking at your script, what is the methodology for deciding whether to embed a variable or feed it through a dense layer? For example, you chose to treat a variable like promo as a dense layer. Is there a reason those particular variables were given a dense layer?
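
The pattern in the repo, as I read it: inputs that are binary or already meaningfully numeric go straight through a Dense unit, while unordered categories with more than two levels get an Embedding, which can learn a geometry over the levels. A sketch of the contrast (sizes follow the repo's store embedding):

    from keras.layers import Input, Dense, Embedding, Flatten

    # promo is a 0/1 flag: no category structure to learn,
    # so the raw value feeds a Dense unit directly
    input_promo = Input(shape=(1,))
    output_promo = Dense(1)(input_promo)

    # store has 1115 unordered levels: an embedding learns distances
    # between stores that a single numeric input could not express
    input_store = Input(shape=(1,))
    output_store = Flatten()(Embedding(1115, 50)(input_store))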

Problem with Kaggle Branch

Hi, the kaggle branch gives a 404 error. Could you please tell me where I can have a look at the original competition version of the code? Thanks a lot!
