sujitpal / eeap-examples
Code for Document Similarity on Reuters dataset using Encode, Embed, Attend, Predict recipe
Hey,
First, thanks for the kind words in various places :). I came across your posts, which led me here.
I also spent quite some time working on similarity models. I think they're surprisingly difficult to implement correctly in most deep learning toolkits. There are three problems:
i. Dropout needs to be synchronised across the two 'halves' of the network. If we redraw the dropout mask for each of the two sentences, we'll end up with different vectors for the same input. This makes the model converge very slowly.
ii. Batch normalization. I don't remember the exact results, but I do remember that I ended up not wanting to use batch norm in Siamese networks, because I found it too difficult to reason about.
iii. OOV vectors. In one of my models, I assigned random vectors to OOV words without ensuring that the same word always mapped to the same vector.
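Points i and iii can be illustrated with a minimal sketch. This is not code from the repository or from thinc; the function names (`siamese_dropout`, `oov_vector`), the inverted-dropout scaling, and the choice of CRC32 as a stable seed are all assumptions made for illustration.

```python
import random
import zlib


def make_dropout_mask(size, drop_rate, rng):
    """Draw one inverted-dropout mask: 0.0 for dropped units, 1/(1-p) for kept ones."""
    keep = 1.0 - drop_rate
    return [(1.0 / keep) if rng.random() < keep else 0.0 for _ in range(size)]


def apply_mask(vector, mask):
    """Multiply a vector element-wise by a dropout mask."""
    return [v * m for v, m in zip(vector, mask)]


def siamese_dropout(left, right, drop_rate, rng):
    """Draw the mask once and apply the SAME mask to both halves of the pair."""
    mask = make_dropout_mask(len(left), drop_rate, rng)
    return apply_mask(left, mask), apply_mask(right, mask)


def oov_vector(word, dim):
    """Random vector for an OOV word, seeded by the word itself (hypothetical helper).

    Seeding with a stable hash means the same word always gets the same vector.
    """
    rng = random.Random(zlib.crc32(word.encode("utf8")))
    return [rng.uniform(-0.1, 0.1) for _ in range(dim)]


rng = random.Random(0)
sentence = [0.5, -1.0, 2.0, 0.25, 1.5, -0.75]

# Feeding the same sentence to both halves yields identical outputs,
# because both halves share a single mask.
left_out, right_out = siamese_dropout(sentence, sentence, drop_rate=0.5, rng=rng)
```

In a real toolkit the same effect would come from sharing the dropout layer's mask between the two branches; how to do that depends on the framework.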
Finally, an extra tip that should help your similarity models :). I notice you're using pre-trained embeddings, and are using a fixed-size vocabulary. This means that all words outside your vocabulary will be mapped to the same representation. If you think about it, this is pretty bad: if our input sentences match on some rare word, that's a great feature! I think the best solution is to augment the static vectors with a learned component. Here's an example network that does this: https://github.com/explosion/thinc/blob/master/examples/text-pair/glove_mwe_multipool_siamese.py#L162
The network in that example uses a trickier "Embed" step: the sum of the static vectors and learned vectors from my HashEmbed class, which uses the "hashing trick" that has been popular in sparse linear models. The insight is similar to Bloom filters, so a recent paper has called this "Bloom embeddings". Basically, you just mod the key into a fixed-size table, and compute multiple conditionally independent keys per word. This allows the table to map a very large number of vocabulary items to (mostly) distinct representations, with relatively few rows.
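The lookup described above can be sketched roughly as follows. This is not thinc's HashEmbed implementation: the class name `HashEmbedSketch`, the salted SHA-256 keys, and the helper `embed_word` are assumptions for illustration, and the table here is only randomly initialised, whereas in a real model its rows would be learned during training.

```python
import hashlib
import random


class HashEmbedSketch:
    """Hypothetical hashing-trick embedding table (lookup only, no training)."""

    def __init__(self, n_rows, dim, n_keys=4, seed=0):
        rng = random.Random(seed)
        self.n_rows = n_rows
        self.dim = dim
        self.n_keys = n_keys
        # A small table shared by the whole (unbounded) vocabulary.
        self.table = [[rng.uniform(-0.1, 0.1) for _ in range(dim)]
                      for _ in range(n_rows)]

    def rows_for(self, word):
        """Compute several keys per word, each reduced modulo the table size."""
        rows = []
        for salt in range(self.n_keys):
            digest = hashlib.sha256(f"{salt}:{word}".encode("utf8")).digest()
            rows.append(int.from_bytes(digest[:8], "big") % self.n_rows)
        return rows

    def __call__(self, word):
        """Sum the rows selected by the word's keys."""
        vector = [0.0] * self.dim
        for row in self.rows_for(word):
            for j, value in enumerate(self.table[row]):
                vector[j] += value
        return vector


def embed_word(word, static_vectors, hash_embed):
    """Static (pre-trained) vector plus hashed component; zeros if the word is OOV."""
    static = static_vectors.get(word, [0.0] * hash_embed.dim)
    return [s + h for s, h in zip(static, hash_embed(word))]


embedder = HashEmbedSketch(n_rows=1000, dim=8)
rare_word_vector = embed_word("zyzzyva", {}, embedder)
```

Two words collide completely only if all of their keys collide, which is why a table with relatively few rows can still give most words a distinct sum.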
This hash embedding trick isn't the only solution to the OOV problem. Using a character LSTM to create the OOV word features would probably work well too, but it would be much slower.
Hi,
I am running this notebook: https://github.com/sujitpal/eeap-examples/blob/master/src/04c-ng-clf-eeap.ipynb
When I run the "Load Vocabulary" cell, it gives this error:
<ipython-input-4-e0b7b98e6728> in <module>()
1 word2id = {"PAD": 0, "UNK": 1}
----> 2 fvocab = open(VOCAB_FILE, "rb")
3 for i, line in enumerate(fvocab):
4 word, count = line.strip().split("\t")
5 if int(count) <= MIN_OCCURS:
FileNotFoundError: [Errno 2] No such file or directory: '../data/ng-vocab.tsv'
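For anyone hitting this under Python 3, here is a hedged sketch of a loader for a vocabulary file of this shape. It is not the notebook's code: the function name `load_vocab`, the choice to skip words at or below the threshold, and the id assignment are assumptions, since the traceback only shows the first few lines of the cell. It opens the file in text mode, because splitting a line read in "rb" mode on the string "\t" raises a TypeError in Python 3.

```python
import os


def load_vocab(vocab_file, min_occurs):
    """Read 'word<TAB>count' lines into a word -> id mapping (hypothetical helper)."""
    if not os.path.exists(vocab_file):
        # Fail early with a hint; the file has to be generated or downloaded first.
        raise FileNotFoundError(
            "vocabulary file not found: {} (generate it first)".format(vocab_file))
    word2id = {"PAD": 0, "UNK": 1}
    # Text mode with an explicit encoding, so each line is a str, not bytes.
    with open(vocab_file, "r", encoding="utf8") as fvocab:
        for line in fvocab:
            word, count = line.rstrip("\n").split("\t")
            if int(count) <= min_occurs:
                continue  # assumption: rare words are skipped and map to UNK
            word2id[word] = len(word2id)
    return word2id
```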
I guess custom_attn is a custom package. How can I set it up and import it? Thanks!
The ng-vocab.tsv file is missing. I checked this issue. I added this code to generate the file and verify the result. It gave me an accuracy of approximately 55%. Then, instead of this, I used this dataset, and the result improved to 59%. I think the dataset plays an important role here. So I would request that you post the original file or, if that is not possible, at least the process for getting or generating it, so that we can run it and reproduce the result you mentioned (71% accuracy).
I ran this code.
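One possible way to generate a file of this shape is sketched below. This is not the author's original process, which is exactly what is being asked for here: the function name `build_vocab_tsv`, the lowercase alphanumeric tokenizer, and the descending-count ordering are assumptions. Only the tab-separated "word, count" layout is inferred from the notebook cell that reads the file. A different tokenizer or corpus would produce different counts, which may account for some of the accuracy gap.

```python
import collections
import re


def build_vocab_tsv(texts, vocab_file):
    """Count tokens over an iterable of documents and write 'word<TAB>count' lines."""
    counter = collections.Counter()
    for text in texts:
        # Assumed tokenizer: lowercase runs of letters and digits.
        counter.update(re.findall(r"[a-z0-9]+", text.lower()))
    with open(vocab_file, "w", encoding="utf8") as fout:
        # Most frequent words first; ties broken alphabetically for reproducibility.
        for word, count in sorted(counter.items(), key=lambda kv: (-kv[1], kv[0])):
            fout.write("{}\t{}\n".format(word, count))
    return counter
```

The `texts` argument could be any iterable of document strings, for example the raw 20 Newsgroups posts once they have been downloaded.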
Python3 : TypeError: a bytes-like object is required, not 'str'
Python2 : IOError: [Errno socket error] [SSL: UNKNOWN_PROTOCOL] unknown protocol (_ssl.c:590)
Can someone put the dataset on drive if they are able to run this script?