sujitpal / eeap-examples

Code for Document Similarity on Reuters dataset using Encode, Embed, Attend, Predict recipe

Jupyter Notebook 94.17% Python 5.83%

eeap-examples's Issues

Some suggestions on attention and document similarity

Hey,

First, thanks for the kind words in various places :). I came across your posts, which led me here.

I also spent quite some time working on similarity models. I think they're surprisingly difficult to implement correctly in most deep learning toolkits. There are two problems:

1. It's pretty hard to maintain the symmetry. We'd like to guarantee that the Siamese network always maps a pair of identical sentences into identical vectors. Intuitively, identical inputs should give 1.0 similarity, right? But lots of things can go wrong to prevent this. Here are some of the problems I've had in different implementations:

i. Dropout needs to be synchronised across the two 'halves' of the network. If we redraw the dropout mask for each of the two sentences, we'll end up with different vectors for the same input. This makes the model converge very slowly.

ii. Batch normalization. I don't remember exact results, but I do remember I ended up not wanting to use batch norm in Siamese networks, because I found it too difficult to reason about.

iii. In one of my models, I assigned random vectors to OOV words, without ensuring the same word always mapped to the same vector.
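Two of the fixes above can be sketched in a few lines. This is a minimal illustration in plain Python lists rather than any toolkit's tensors; `shared_dropout` and `oov_vector` are hypothetical helper names, not functions from this repository or from Keras. The first draws a single dropout mask and applies it to both halves of the pair; the second seeds a random generator from a stable hash of the word, so the same OOV word always gets the same vector.

```python
import random
import zlib

def shared_dropout(left, right, rate, rng):
    """Apply one dropout mask to both halves of a Siamese pair.

    left and right are equal-length lists of floats.
    """
    if len(left) != len(right):
        raise ValueError("both halves must have the same length")
    keep = 1.0 - rate
    # Draw the mask once and reuse it, so identical inputs stay identical.
    mask = [1.0 / keep if rng.random() < keep else 0.0 for _ in left]
    return ([x * m for x, m in zip(left, mask)],
            [x * m for x, m in zip(right, mask)])

def oov_vector(word, dim):
    """Return a pseudo-random vector that is always the same for a given word."""
    # zlib.crc32 is stable across runs, unlike Python's built-in hash() for str.
    rng = random.Random(zlib.crc32(word.encode("utf-8")))
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]

sentence = [0.3, -1.2, 0.8, 2.0]
a, b = shared_dropout(sentence, list(sentence), rate=0.5, rng=random.Random(0))
```

Here `a` and `b` come out identical because the same mask was applied to both copies of `sentence`. In a real model the two inputs usually differ in length, so the shared mask would more likely be applied at a layer whose output shape is the same for both sides.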

2. Most libraries make you pad your inputs with zeros. Most attention layers then do some sort of pooling operation. If you're averaging, you need to normalize by only the input tokens, and exclude the padding tokens. We also want to make sure we're not sending gradient through the padding tokens too. I could never get this correct in Keras. You can find my effort to replicate Parikh et al.'s decomposable attention model here: https://github.com/explosion/spaCy/tree/master/examples/keras_parikh_entailment . Other people have worked on the code since, but as far as I know it's still not correct.
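The averaging point can be made concrete with a small sketch. It assumes a 0/1 mask that marks real tokens, and `masked_mean` is a hypothetical name, not a Keras layer. The sum skips padded positions and the divisor is the number of real tokens, not the padded length.

```python
def masked_mean(vectors, mask):
    """Average only the rows whose mask entry is 1 (real tokens).

    vectors: list of equal-length float lists, one per (padded) position.
    mask: list of 0/1 flags, 1 for a real token and 0 for padding.
    """
    dim = len(vectors[0])
    total = [0.0] * dim
    n_real = 0
    for vec, keep in zip(vectors, mask):
        if keep:
            n_real += 1
            for j, value in enumerate(vec):
                total[j] += value
    # Divide by the number of real tokens, not by the padded length.
    return [t / max(n_real, 1) for t in total]

padded = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0], [0.0, 0.0]]
mask = [1, 1, 0, 0]
pooled = masked_mean(padded, mask)
```

A plain mean over all four positions would give `[1.0, 1.5]` here, half the value the two real tokens warrant. In an autodiff framework, the equivalent step is to multiply the inputs by the mask before summing, which also zeroes the gradient flowing to the padded positions.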

Finally, an extra tip that should help your similarity models :). I notice you're using pre-trained embeddings, and are using a fixed-size vocabulary. This means that all words outside your vocabulary will be mapped to the same representation. If you think about it, this is pretty bad: if our input sentences match on some rare word, that's a great feature! I think the best solution is to augment the static vectors with a learned component. Here's an example network that does this: https://github.com/explosion/thinc/blob/master/examples/text-pair/glove_mwe_multipool_siamese.py#L162

The network in that example uses a trickier "Embed" step: the sum of the static vectors and learned vectors from my HashEmbed class, which uses the "hashing trick" that has been popular in sparse linear models. The insight is similar to Bloom filters, so a recent paper has called this "Bloom embeddings". Basically you just mod the key into a fixed-size table, and compute multiple conditionally independent keys per word. This allows the table to map a very large number of vocabulary items to (mostly) distinct representations, with relatively few rows.
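As a rough sketch of the idea, and not of thinc's actual HashEmbed implementation, the toy class below hashes each word several times, reduces each hash modulo the table size, and sums the selected rows. The class name, the salted-CRC32 hashing, and the parameter names are all assumptions made for illustration. A real version would use better-mixed hash functions and trainable rows.

```python
import random
import zlib

class HashEmbedSketch:
    """Toy hashed embedding table; a hypothetical class, not thinc's HashEmbed."""

    def __init__(self, n_rows, dim, n_keys=4, seed=0):
        rng = random.Random(seed)
        self.n_rows = n_rows
        self.n_keys = n_keys
        # In a real model these rows would be trainable parameters.
        self.table = [[rng.gauss(0.0, 0.1) for _ in range(dim)]
                      for _ in range(n_rows)]

    def rows_for(self, word):
        # Salting the word with the key index gives several different hashes,
        # each reduced modulo the table size.
        return [zlib.crc32(f"{k}:{word}".encode("utf-8")) % self.n_rows
                for k in range(self.n_keys)]

    def __call__(self, word):
        rows = [self.table[r] for r in self.rows_for(word)]
        return [sum(column) for column in zip(*rows)]

def embed(word, static_vectors, hashed):
    """Sum the static vector (zeros if the word is OOV) with the hashed vector."""
    learned = hashed(word)
    static = static_vectors.get(word, [0.0] * len(learned))
    return [s + h for s, h in zip(static, learned)]

hashed = HashEmbedSketch(n_rows=1000, dim=3)
static_vectors = {"the": [0.1, 0.2, 0.3]}  # stand-in for pre-trained vectors
rare = embed("zyzzyva", static_vectors, hashed)
```

Two words collide completely only if all of their keys select the same rows, which is why a table with relatively few rows can still give most words a distinct sum.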

This hash embedding trick isn't the only solution to the OOV problem. Using a character LSTM to create the OOV word features would probably work well too --- but much more slowly.

ng-vocab.tsv file not found

Hi,

I am running this notebook: https://github.com/sujitpal/eeap-examples/blob/master/src/04c-ng-clf-eeap.ipynb

When I run the "Load Vocabulary" cell, it gives this error:

<ipython-input-4-e0b7b98e6728> in <module>()
      1 word2id = {"PAD": 0, "UNK": 1}
----> 2 fvocab = open(VOCAB_FILE, "rb")
      3 for i, line in enumerate(fvocab):
      4     word, count = line.strip().split("\t")
      5     if int(count) <= MIN_OCCURS:

FileNotFoundError: [Errno 2] No such file or directory: '../data/ng-vocab.tsv'
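The traceback shows that the file holds one tab-separated word and count per line. How the original file was produced is not documented here, so the following is only a sketch of one way to regenerate a file in that format and read it back. The tokenization (lowercase, whitespace split), the decision to skip words at or below the threshold, and the names `write_vocab` and `load_vocab` are assumptions. The loader opens the file in text mode, because on Python 3 a file opened with "rb" yields bytes and `split("\t")` then fails.

```python
import collections
import os
import tempfile

def write_vocab(texts, path):
    """Write one 'word<TAB>count' line per word, most frequent first."""
    counts = collections.Counter()
    for text in texts:
        # Assumption: lowercase whitespace tokenization.
        counts.update(text.lower().split())
    with open(path, "w", encoding="utf-8") as fvocab:
        for word, count in counts.most_common():
            fvocab.write(f"{word}\t{count}\n")

def load_vocab(path, min_occurs):
    """Rebuild word2id, skipping words seen min_occurs times or fewer."""
    word2id = {"PAD": 0, "UNK": 1}
    # Text mode yields str lines, so split("\t") works on Python 3.
    with open(path, "r", encoding="utf-8") as fvocab:
        for line in fvocab:
            word, count = line.rstrip("\n").split("\t")
            if int(count) <= min_occurs:
                continue
            word2id[word] = len(word2id)
    return word2id

texts = ["the cat sat", "the cat ran", "the dog"]
vocab_path = os.path.join(tempfile.mkdtemp(), "ng-vocab.tsv")
write_vocab(texts, vocab_path)
word2id = load_vocab(vocab_path, min_occurs=1)
```

A vocabulary built this way will not necessarily match the author's file, so accuracy figures obtained with it may differ from the reported ones.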

How to setup and import custom_attn?

I guess custom_attn is a custom package. How can I set it up and import it? Thanks!

Missing data file creating problem with result verification

The ng-vocab.tsv file is missing. I checked this issue, then added this code to generate the file and verify the result. It gave me an accuracy of approximately 55%. When I used this dataset instead, the result improved to 59%. I think the dataset plays an important role here. So I would ask you to post the original file or, failing that, at least the process for getting or generating it, so that we can run the code and reproduce the result you mentioned (71% accuracy). I ran this code.

ng-sim-datagen.py throwing error.

Python3 : TypeError: a bytes-like object is required, not 'str'
Python2 : IOError: [Errno socket error] [SSL: UNKNOWN_PROTOCOL] unknown protocol (_ssl.c:590)

Can someone put the dataset on drive if they are able to run this script?
