
alok-bollu's Introduction

NatC: Natural language compiler

Construct some really fucked up intermediate representation that captures semantics, and then represent "information extraction" as compiler passes that whittle away at complexity. Bonus points if we can use abstract interpretation to perform this whittling.
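A toy sketch of what one such pass might look like (the Node IR and the drop_adjuncts pass are purely hypothetical illustrations, not NatC code):

from dataclasses import dataclass

# Hypothetical semantic IR: a head word plus role-labelled children.
@dataclass
class Node:
    head: str
    role: str
    children: list

def drop_adjuncts(node):
    # A compiler-pass-style rewrite: strip adjunct subtrees, whittling the
    # representation down to its core predicate-argument structure.
    kept = [drop_adjuncts(c) for c in node.children if c.role != "adjunct"]
    return Node(node.head, node.role, kept)

# "the dog ate quickly": the adjunct "quickly" gets whittled away.
sent = Node("ate", "root", [Node("dog", "subject", []),
                            Node("quickly", "adjunct", [])])
core = drop_adjuncts(sent)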

Word2Man

@Bollu please?

Understanding compositional distributional semantics

Compositional distributional models of meaning build the meaning of phrases compositionally out of the distributional representations of their words, which lets us understand the meaning of a word in context.
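For example, the meaning of "red car" can be composed from the vectors for "red" and "car". A toy sketch of two classic composition operators, additive and multiplicative (random vectors stand in for learned embeddings):

import numpy as np

rng = np.random.default_rng(0)
red = rng.standard_normal(50)  # stand-in for a learned vector for "red"
car = rng.standard_normal(50)  # stand-in for a learned vector for "car"

red_car_add = red + car  # additive composition: phrase meaning as a vector sum
red_car_mul = red * car  # multiplicative: pointwise product stresses shared features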

alok-bollu's People

Contributors

agree2disagree, bollu, djinn-anthrope, kvadityasrivatsa, souvikban


alok-bollu's Issues

word2vec-ga open research questions

  1. Update word-analogy.cpp to use the new mulQuadForm code; the old, deprecated code should be removed from word-analogy.cpp. To test that this is correct, change setupDotContainmentMat so the quadratic form is the identity matrix:
void setupDotContainmentMat(int n, real *m) {
    const int d = log2(n);
    assert(1 << d == n);  // n must be a power of two
    // Identity quadratic form: <v, w>_m reduces to the plain dot product.
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            m[i*n+j] = i == j ? 1 : 0;
}

With the identity form, the generalised inner product collapses to the ordinary dot product, and we recover word2vec.
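A quick sanity check of that claim in Python, assuming mulQuadForm computes the generalised inner product v^T M w (the names and shapes here are stand-ins for the C code):

import numpy as np

def quad_form_dot(v, M, w):
    # Generalised inner product <v, w>_M = v^T M w, i.e. what mulQuadForm
    # is assumed to compute.
    return v @ M @ w

d = 8
M = np.eye(d)  # the identity matrix that setupDotContainmentMat builds
v, w = np.random.randn(d), np.random.randn(d)

# With M = I the quadratic form collapses to the plain dot product, so
# word-analogy scores should match vanilla word2vec.
assert np.isclose(quad_form_dot(v, M, w), v @ w)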

Xpose Word2Vec

  • Word2Vec Naive Run
  • W2V + Actual SGD (steal from GloVe)
  • W2V with 0-initialized + random-initialized layers (Syn1Neg is random :P)
  • W2V where syn0 = syn1neg = random
  • Train on the raw cost function v.w, not sigm(v.w)
  • Objective function: cos(theta), not sigm(v.w) (all three scoring variants are sketched below)
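A quick sketch of the three scoring variants in the last two items (toy vectors standing in for trained embeddings):

import numpy as np

def sigm(x):
    return 1.0 / (1.0 + np.exp(-x))

v, w = np.random.randn(100), np.random.randn(100)

raw = v @ w                                               # plain v.w
logistic = sigm(v @ w)                                    # standard word2vec score
cosine = v @ w / (np.linalg.norm(v) * np.linalg.norm(w))  # cos(theta)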

Can't get attribute 'Word2Man'

Running on branch alok-evaluation, as branch master does not seem to have a models directory.

Command: python3 run.py eval models/working-word2vec-baseline-b0c2872c

Output:

setting up device...
device: cuda:1
====time: 0:00:00.616033
loading model from: models/working-word2vec-baseline-b0c2872c...
Traceback (most recent call last):
  File "run.py", line 758, in <module>
    PARAMS = torch.load(PARSED.loadpath, map_location=DEVICE)
  File "/home/alokdebnath5/.local/lib/python3.5/site-packages/torch/serialization.py", line 367, in load
    return _load(f, map_location, pickle_module)
  File "/home/alokdebnath5/.local/lib/python3.5/site-packages/torch/serialization.py", line 538, in _load
    result = unpickler.load()
AttributeError: Can't get attribute 'Word2Man' on <module '__main__' from 'run.py'>

Need help figuring out why this is happening.
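This is standard pickle behaviour rather than a torch bug: torch.load unpickles the checkpoint, and pickle resolves each saved object's class by name on the module that defined it, here 'Word2Man' on '__main__'. The class therefore has to exist in the running script before loading. A minimal sketch of a workaround (the module name word2man is hypothetical; the real definition lives on whichever branch trained the model):

import torch

# Pull in the real class definition and alias it onto __main__, which is
# where the traceback shows pickle looking the name up.
from word2man import Word2Man   # hypothetical module holding the class
import __main__
__main__.Word2Man = Word2Man

PARAMS = torch.load("models/working-word2vec-baseline-b0c2872c",
                    map_location="cpu")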

Modify contextualized similarity and analogy

We need to change the word2vec code to store syn1neg as well, and to load syn1neg when doing distance and word analogy, so that we capture the true meaning of contextualized similarity (see the sketch below).
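A minimal sketch of what that contextualized score could look like once both matrices are saved (the file names are hypothetical):

import numpy as np

syn0 = np.load("syn0.npy")        # input (word) embeddings, shape (V, d)
syn1neg = np.load("syn1neg.npy")  # output (context) embeddings, shape (V, d)

def contextual_similarity(w, c):
    # Word-to-context score sigma(syn0[w] . syn1neg[c]): the quantity that
    # word2vec's negative-sampling objective actually trains.
    return 1.0 / (1.0 + np.exp(-(syn0[w] @ syn1neg[c])))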

Parameters Undefined

Issue 1: eval and wordnet eval are used interchangeably in the parameters.
Issue 2:

setting up device...
	device: cuda:3
====time: 0:00:00.983759
loading corpus: text8...
Traceback (most recent call last):
  File "run.py", line 1123, in <module>
    main()
  File "run.py", line 1076, in main
    PARAMS = Parameters.load_model_state_dict(LOGGER,
NameError: name 'Parameters' is not defined
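The NameError itself only says that run.py reaches line 1076 without the name Parameters being defined at module scope. A minimal reproduction of the failure mode:

# Referencing a class that was never defined or imported in this module
# raises exactly the NameError in the traceback above.
def main():
    return Parameters  # 'Parameters' exists nowhere in this module

try:
    main()
except NameError as e:
    print(e)  # name 'Parameters' is not defined

So the fix is to define Parameters in run.py, or import it from whichever module actually holds it, before main() runs.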

Automatically find Glove embeddings using semidefinite programming

The kind of optimisation that GloVe tries to perform (finding vectors whose pairwise dot products are known) is a well-known type of problem, as far as I understand: in particular, it forms a semidefinite program (SDP).

Here are the possible scenarios:

GloVe is an SDP

In this case, we use an SDP solver against their corpus of co-occurrences, as sketched below.
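A minimal sketch of that solve, assuming we drop GloVe's bias terms, its weighting function f(X_ij), and the rank (embedding-dimension) constraint, and fit the Gram matrix of the embeddings directly to log co-occurrences (cvxpy and the tiny matrix below are illustrative, not the actual pipeline):

import numpy as np
import cvxpy as cp

# Toy symmetric co-occurrence matrix; the real input is GloVe's sparse
# corpus counts. Dropping the rank constraint is what makes this convex.
X = np.array([[10., 2., 1.],
              [2., 8., 3.],
              [1., 3., 6.]])

n = X.shape[0]
G = cp.Variable((n, n), PSD=True)  # Gram matrix of the embeddings
cp.Problem(cp.Minimize(cp.sum_squares(G - np.log(X)))).solve()

# Recover embeddings by factoring G = V V^T; rows of V are the word vectors.
eigvals, eigvecs = np.linalg.eigh(G.value)
V = eigvecs * np.sqrt(np.clip(eigvals, 0, None))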

We get embeddings better than GloVe ($$$)

Publish, show why NNs + gradient descent are unnecessary for solving SDPs.

We get embeddings worse than GloVe ($$$)

There are two reasons for this. Either:

  1. Their gradient descent mechanism finds vectors whose dot products match the co-occurrence matrices better than an SDP solver does. In this case, we publish in the SDP literature about how gradient descent is better than a fucking solver (highly unlikely) $$$

  2. Our model's dot products fit the co-occurrence ratios better than GloVe's, but it still performs worse ($$$)

This opens up a line of investigation, since it shows that their "explanation" is BS: our model literally optimizes what they claim to be optimizing, and does so better, yet performs worse on the evaluations. This allows us to argue one of:

  1. The objective is ill phrased.
  2. They are learning something else and their explanation is BS.
  3. Their testing regime is BS.

GloVe is not an SDP for some technical reason :((

We try to massage it into an SDP, but this is the most disappointing possibility.

SoTA Word2Vec test-bench

We need to look into the state-of-the-art word-embedding evaluation test benches, in order to analyze how our vectors perform on downstream tasks.

CUDA sample programs

  • Fill a 1D array with the value 10
  • Fill a 1D array with A[i] = i
  • Fill a 2D array with A[i][j] = i*j
  • Vector dot product, simple version, with atomicAdd(&result, a[i] * b[i])
  • Vector dot product with parallel reduction (see the NVIDIA documentation)
  • Perceptron training for w^T x. Parallelise over mini-batches?
  • Implement word2vec
  • Implement transformer networks
  • Implement a 3D physically based rendering engine
  • Implement a 3D black-hole simulator, and use the rendering engine to raytrace a black hole

NDOCS error

Command: python3 run.py train --savepath ./models/

Error:

setting up device...
	device: cuda:3
====time: 0:00:01.179288
loading corpus: text8...
Traceback (most recent call last):
  File "run.py", line 1186, in <module>
    main()
  File "run.py", line 1167, in main
    traintype=PARSED.traintype)
  File "run.py", line 661, in __init__
    self.NDOCS = int(NDOCS)
TypeError: int() argument must be a string, a bytes-like object or a number, not 'NoneType'

Please explain this as well. Thanks
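The traceback pins down the mechanism: int() received None, so NDOCS was never set by the corpus loader. A minimal reproduction, with a hypothetical guard that would at least turn the crash into an actionable message (the real fix belongs in the text8 loader):

# int(None) raises exactly the TypeError in the traceback above, so NDOCS
# reached __init__ (run.py line 661) as None.
NDOCS = None  # what the corpus loader evidently handed over

if NDOCS is None:
    raise ValueError("corpus loader returned no document count for text8; "
                     "check that the corpus was downloaded and parsed")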
