
complex's Introduction

Complex Embeddings for Simple Link Prediction

This repository contains the code of the main experiments presented in the papers:

Complex Embeddings for Simple Link Prediction, Théo Trouillon, Johannes Welbl, Sebastian Riedel, Éric Gaussier and Guillaume Bouchard, ICML 2016.

Knowledge Graph Completion via Complex Tensor Factorization, Théo Trouillon, Christopher R. Dance, Éric Gaussier, Johannes Welbl, Sebastian Riedel and Guillaume Bouchard, JMLR 2017.

Install

First clone the repository:

git clone https://github.com/ttrouill/complex.git

The code depends on downhill, a Theano-based stochastic gradient descent implementation.

Install it, along with the other dependencies, with:

pip install -r requirements.txt

The code is compatible with Python 2 and 3.

Run the experiments

To run the experiments, unpack the datasets first:

unzip datasets/fb15k.zip -d datasets/
unzip datasets/wn18.zip -d datasets/

And run the corresponding python scripts, for Freebase (FB15K):

python fb15k_run.py

And for Wordnet (WN18):

python wn18_run.py

By default, each script runs the ComplEx (Complex Embeddings) model; edit the files and uncomment the corresponding lines to run the DistMult, TransE, RESCAL or CP models instead. The hyper-parameters given for each model are the best ones found by the grid search described in the paper.

To run on GPU (approx. 5x faster), simply add the following Theano flag before the python call:

THEANO_FLAGS='device=gpu' python fb15k_run.py

Export the produced embeddings

Simply uncomment the last lines in fb15k_run.py and wn18_run.py (as well as the import scipy.io line, which requires the scipy module); this will save the embeddings of the ComplEx model in the common MATLAB .mat format. If you want to save the embeddings of another model, just edit the embedding variable names to match the desired model (see models.py).
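For illustration only, the export could look roughly like the snippet below; the attribute names e1, e2, r1 and r2 are assumptions about how the ComplEx model stores its real and imaginary embedding matrices, so check models.py for the actual names before using it. Here model stands for the trained model object in the run script.

# Rough sketch of the export step (attribute names are assumed, see models.py).
import scipy.io

scipy.io.savemat('complex_embeddings.mat', {
    'e_real': model.e1.get_value(),  # real part of entity embeddings (assumed name)
    'e_img':  model.e2.get_value(),  # imaginary part of entity embeddings (assumed name)
    'r_real': model.r1.get_value(),  # real part of relation embeddings (assumed name)
    'r_img':  model.r2.get_value(),  # imaginary part of relation embeddings (assumed name)
})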

Run on your own data

Create a subfolder in the datasets folder, and put your data in three files train.txt, valid.txt and test.txt. Each line is a triple, in the format: 

subject_entity_id	relation_id	object_entity_id

separated by tabs. Then modify one of the run scripts, for example fb15k_run.py, by changing the name argument in the build_data function call to your dataset folder name:

fb15kexp = build_data(name = 'your_dataset_folder_name',path = tools.cur_path + '/datasets/')
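For instance, a toy train.txt could look like this (made-up identifiers, one tab-separated triple per line, with the same identifiers used consistently across train.txt, valid.txt and test.txt):

paris	capital_of	france
berlin	capital_of	germany
france	part_of	europe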

Implement your own model

Models are defined as classes in models.py, and all inherit from the Abstract_Model class defined in the same file. Abstract_Model handles everything the models have in common (training functions, etc.), and the child classes (the actual models) only need to define their embedding shapes and initialization, and their scoring and loss functions.

To properly understand the following, one must be comfortable with Theano basics.

The Abstract_Model class contains the symbolic 1D tensor variables self.rows, self.cols, self.tubes and self.ys, which are instantiated at runtime with, respectively, the subject entity indexes, relation indexes, object entity indexes and truth values (1 or -1) of the triples in the current batch. It also contains the number of subject entities, relations and object entities of the dataset in self.n, self.m and self.l respectively, as well as the current embedding size in self.k.

Two functions must be overridden in the child classes to define a proper model: get_init_params(self) and define_loss(self).

Let's have a look at the DistMult_Model class and its get_init_params(self) function:

def get_init_params(self):
	params = { 'e' : randn(max(self.n,self.l),self.k),
			   'r' : randn(self.m,self.k)}
	return params

This function defines both the embedding-matrix shapes (number of entities * rank for e, number of relations * rank for r) and their initial values (randn is numpy.random.randn), by returning a dictionary whose keys correspond to the class attribute names. From this dict, the parent class creates shared tensor variables initialized with the given values and assigns them to the corresponding attributes (self.e and self.r).
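To make this concrete, here is a rough sketch of what the parent class does with that dictionary; this is an illustration, not the actual Abstract_Model code:

# Illustration only, not the actual Abstract_Model code: each entry of the
# returned dict becomes a Theano shared variable bound to an attribute of the
# same name on the model.
import theano

def setup_embeddings(model):
	for name, init_value in model.get_init_params().items():
		setattr(model, name, theano.shared(value=init_value, name=name))
	# After this, model.e and model.r exist and can be used in symbolic expressions.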

Now the define_loss(self) function must define three Theano expressions: the scoring function, the loss, and the regularization. Here is the DistMult_Model one:

def define_loss(self):
	self.pred_func = TT.sum(self.e[self.rows,:] * self.r[self.cols,:] * self.e[self.tubes,:], 1)

	self.loss = TT.sqr(self.ys - self.pred_func).mean()

	self.regul_func = TT.sqr(self.e[self.rows,:]).mean() \
					+ TT.sqr(self.r[self.cols,:]).mean() \
					+ TT.sqr(self.e[self.tubes,:]).mean()

The corresponding expressions must be written in their batched form, i.e. to compute the scores of multiple triples at once. For a given batch, the corresponding embeddings are retrieved with self.e[self.rows,:], self.r[self.cols,:] and self.e[self.tubes,:].

In the case of the DistMult model, the trilinear product between these embeddings is computed, here by first taking two element-wise multiplications and then summing over the columns in the self.pred_func expression. The self.pred_func expression must yield a vector of the size of the batch (the size of self.rows, self.cols, ...). The loss defined in self.loss is here the squared loss (see the DistMult_Logistic_Model class for the logistic loss), averaged over the batch, as the self.loss expression must yield a scalar value. The regularization defined here is the L2 regularization over the embeddings involved in the batch, and it must also yield a scalar value.
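To make the batched form concrete, here is a small NumPy analogue of the DistMult scoring expression above (NumPy instead of Theano, with toy sizes):

# NumPy analogue of the batched DistMult score, for illustration only.
import numpy as np

n_entities, n_relations, rank = 5, 3, 4
e = np.random.randn(n_entities, rank)    # entity embeddings
r = np.random.randn(n_relations, rank)   # relation embeddings

rows = np.array([0, 1, 2])    # subject entity indexes of the batch
cols = np.array([0, 0, 1])    # relation indexes of the batch
tubes = np.array([3, 4, 1])   # object entity indexes of the batch

# Two element-wise multiplications, then a sum over the columns (the rank):
scores = np.sum(e[rows, :] * r[cols, :] * e[tubes, :], axis=1)
print(scores.shape)           # (3,): one score per triple in the batch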

That's all you need to implement your own tensor factorization model! All gradient computation is handled by Theano auto-differentiation, and all the training functions by the downhill module and the Abstract_Model class.
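Putting it all together, a new model added to models.py could look like the following sketch. The class name is hypothetical and it simply swaps DistMult's L2 regularization for L1; only the two overridden functions are needed:

# Hypothetical example following the pattern above (not part of the repo).
# If written outside models.py, it also needs (roughly):
#   import theano.tensor as TT
#   from numpy.random import randn
#   from models import Abstract_Model

class DistMult_L1_Model(Abstract_Model):

	def get_init_params(self):
		# Entity and relation embedding matrices and their random initialization.
		params = { 'e' : randn(max(self.n,self.l),self.k),
				   'r' : randn(self.m,self.k)}
		return params

	def define_loss(self):
		# Batched trilinear scores: one score per triple of the batch.
		self.pred_func = TT.sum(self.e[self.rows,:] * self.r[self.cols,:] * self.e[self.tubes,:], 1)

		# Squared loss, averaged over the batch (must be a scalar).
		self.loss = TT.sqr(self.ys - self.pred_func).mean()

		# L1 regularization instead of L2 (must also be a scalar).
		self.regul_func = TT.abs_(self.e[self.rows,:]).mean() \
						+ TT.abs_(self.r[self.cols,:]).mean() \
						+ TT.abs_(self.e[self.tubes,:]).mean()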

Cite ComplEx

If you use this package for published work, please cite either or both papers; here are the BibTeX entries:

@inproceedings{trouillon2016complex,
	title = {{Complex embeddings for simple link prediction}},
	author = {Trouillon, Th\'eo and Welbl, Johannes and Riedel, Sebastian and Gaussier, \'Eric and Bouchard, Guillaume},
	booktitle = {International Conference on Machine Learning (ICML)},
	volume={48},
	pages={2071--2080},
	year = {2016}
}
@article{trouillon2017knowledge,
	title={Knowledge graph completion via complex tensor factorization},
	author={Trouillon, Th{\'e}o and Dance, Christopher R and Gaussier, {\'E}ric and Welbl, Johannes and Riedel, Sebastian and Bouchard, Guillaume},
	journal={Journal of Machine Learning Research (JMLR)},
	volume={18},
	number={130},
	pages={1--38},
	year={2017}
}

License

This software comes under a non-commercial use license, please see the LICENSE file.

complex's People

Contributors

ttrouill

complex's Issues

How to run Complex on GPU colab?

Hello,

I am currently trying to run ComplEx on Google Colab Pro using my own data.
I have specified the GPU runtime on Colab and installed all packages in requirements.txt.
The algorithm works fine but uses the CPU instead of the Colab GPU, so I have to wait a long time for my output.

Can you please tell me what is wrong with the command: THEANO_FLAGS="device=gpu" & python fb15k_run.py?
I also noticed that it doesn't use all 15 cores on my Linux system. How can I make sure it is using all the cores?

a beginner's question

Dear author, I'm a Chinese student. After training on my data, I obtain a .mat file containing an entities_num * dimension matrix, but I don't know how each row corresponds to an entity. I only learned Python recently, and I love reading papers; this was my first time running an experiment. Sorry to bother you with such a simple question.

About Hermitian product

Dear author,
Hi, I'm a student from South Korea.
I have a question about the Hermitian product used in ComplEx's score function.
I would be glad if you could explain how equation (11) was derived.

In detail, let's assume we have entity A's embedding vector of dimension 100 (tensor shape (1, 100)).
To treat this embedding vector as a complex vector,
ComplEx splits this 100-dimensional embedding into two pieces, Re and Im.
(The real and imaginary parts would each have dimension 50 in this case.)

Let's assume we have another entity B's embedding too.
Then the similarity (= inner product) of these two entities can be computed with the Hermitian product.
(Let's say <> denotes a sum of element-wise multiplications.)

Similarity(A,B) = <Re(A) , Re(B)> + <-Im(A) , Im(B)>

This is what I understood (am I right?), and here is the question.
ComplEx's score function calculates the score with 3 vectors (2 entities and 1 relation)...
How can I do this? I mean, how did you derive equation (11)?

Uhh... let's say the relation between A and B is R.
The paper says we can calculate the 3 vectors' Hermitian product just like equation (11) does:

<Re(R), Re(A), Re(B)> + <Re(R), Im(A), Im(B)> + <Im(R), Re(A), Im(B)> - <Im(R), Im(A), Re(B)>

How did you derive this equation?

I'll be really glad if you tell me how.
Thank you!
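For reference, the expansion asked about here (equation (11)) can be checked numerically; the following NumPy snippet (illustration only, not part of the repo) compares the real part of the Hermitian trilinear product with the four real-valued terms:

# Numerical check that Re(sum(r * a * conj(b))) equals the four-term expansion.
import numpy as np

k = 50
r = np.random.randn(k) + 1j * np.random.randn(k)   # relation embedding R
a = np.random.randn(k) + 1j * np.random.randn(k)   # subject entity embedding A
b = np.random.randn(k) + 1j * np.random.randn(k)   # object entity embedding B

lhs = np.real(np.sum(r * a * np.conj(b)))

rhs = (np.sum(r.real * a.real * b.real)
       + np.sum(r.real * a.imag * b.imag)
       + np.sum(r.imag * a.real * b.imag)
       - np.sum(r.imag * a.imag * b.real))

print(np.allclose(lhs, rhs))   # True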

ConceptNet

Hi Theo,
Sorry for contacting you through here, but the email address you've given doesn't work. Basically, I'm a knowledge graph embedding hobbyist and I'm trying to experiment with your ComplEx code on a 'real' dataset based on ConceptNet instead of FB15k. It's much larger than FB15k: it has 1,345,609 entities and 1,995,411 triples, but only 36 relations. The number of entities is much larger than FB15k, but there are far fewer relation types.
ComplEx doesn't seem to be as effective on this dataset. There is some improvement, but it's super slow. Could you please offer some insight into how I could improve the results? My guess is that, because there are many entities and not so many triples, the algorithm doesn't have enough triples to learn from. Perhaps increasing the negative triple factor would help.
But training is so slow on my computer (3 days for 50 iterations) that I'd like to hear your input before spending another 3 days.

Thank you, and keep up the great work!

Regarding implementing constraints of TransR inside your code

Hello Theo,

I am trying to implement the TransR model using your code, which, as you know, has the objective function ||hMr + r - tMr||_{2} for a given relation r(h,t). Further, as you know, this model requires the constraints ||hMr||_{2} <= 1, ||tMr||_{2} <= 1 and ||r||_{2} <= 1 on the objective function.
If I look at other TransR implementations, for example here (https://github.com/thunlp/OpenKE/blob/master/models/TransR.py), they normalize hMr/||hMr||_{2}, tMr/||tMr||_{2} and r/||r||_{2} to enforce the above constraints on the objective function before visiting each mini-batch.

Could you please tell me how I can implement the above constraints as hMr/||hMr||_{2}, tMr/||tMr||_{2} and r/||r||_{2} inside your code? I am getting confused because in your code h and Mr are stored separately, and I can only implement an objective function self.loss with h and Mr present separately in it.

(a) Here is one solution according to me: I implement a class TransR_Batch_Loader(Batch_Loader) inside batching.py that computes h = h/sqrt(L2_norm(hMr)) and Mr = Mr/sqrt(L2_norm(hMr)) separately inside the call() function and saves these parameters back in the model (like in line 81 of batching.py in your code). I can do a similar thing for ||tMr||_{2}, and with ||r||_{2} the constraint is straightforward. Afterwards, when the batch indexes of h and Mr are passed separately to the optimization algorithm, the algorithm optimizes hMr = h/sqrt(L2_norm(hMr)) * Mr/sqrt(L2_norm(hMr)) = hMr/L2_norm(hMr) in self.loss of the model. Can you please tell me if this is the right way of projecting hMr, tMr and r onto the unit L2-ball before visiting each mini-batch?

(b) Can you please suggest any other way of enforcing the above constraints in the TransR model?

Best
Navdeep

Writing out the scores

Hi, I must admit that I have been finding the code too confusing to solve this myself.

How would I go about writing out the predicted scores of the models to a file for further use? I need the actual predicted triples/heads/tails for a given test triple with their scores (from these, the ranks can be derived).

Thank you!
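One possible starting point, as a sketch only: after fitting, the model holds a compiled Theano scoring function (pred_func_compiled, set up in models.py) that maps subject, relation and object index arrays to scores; the exact argument order should be checked against get_pred_symb_vars in models.py.

# Sketch only: dump raw scores for a few (hypothetical) triples to a file.
# `model` stands for the trained model object; the argument order of the
# compiled function is an assumption to verify in models.py.
import numpy as np

subs = np.array([0, 5, 42])    # subject entity indexes (hypothetical)
rels = np.array([3, 3, 7])     # relation indexes (hypothetical)
objs = np.array([10, 11, 12])  # object entity indexes (hypothetical)

scores = model.pred_func_compiled(subs, rels, objs)

with open('scores.txt', 'w') as f:
	for s, r, o, score in zip(subs, rels, objs, scores):
		f.write("%d\t%d\t%d\t%f\n" % (s, r, o, score))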

Runtime requirement

Hello Théo,

firstly, I would like to genuinely thank you for the code and repo which are well prepared and straightforward. However, the following situation confuses me:
Given the following settings,
Nb entities: 521,358, relations: 70, obs triples: 2,395,882
Learning rate: 0.5
Max iter: 100
Generated negatives ratio: 10
Batch size: 23958
Starting grid search on: Complex_Logistic_Model
train: 100 mini-batches from callable
downhill: compiling evaluation function
downhill: compiling ADAGRAD optimizer
downhill: setting: rms_regularizer = 1e-08
downhill: setting: patience = 9999999
downhill: setting: validate_every = 9999999
downhill: setting: min_improvement = 0
downhill: setting: max_gradient_norm = 1
downhill: setting: max_gradient_elem = 0
downhill: setting: learning_rate = TensorConstant{0.5}
downhill: setting: momentum = 0
downhill: setting: nesterov = False

With these settings, it takes hours to train the models. Does this stem from the implementation, or am I lacking some theoretical background here?

Cheers

Data format

The README says to put your data in three files train.txt, valid.txt and test.txt, with each line a triple. But the data stored in the datasets fb15k.zip and wn18.zip files is not in the format subject_entity_id relation_id object_entity_id. How does your program work with this data?

Downhill version 0.3.2 not working anymore

As downhill v0.3.2 depends on the climate package, which no longer seems to exist, the repo cannot be installed anymore.

I have just checked v0.4.0 of downhill and it seems to have mainly changed the logging part, i.e., removed the dependency on climate.

Therefore, commenting out the 4th line in *_run.py (downhill.base.logging.setLevel(20)) seems to do the trick and resolve the issue reported in #1.

Question RE: testing protocol

Hello Théo,

Thanks a lot for this piece of code. I've been trying to understand the testing protocol from the code, but unfortunately I couldn't.

I wonder what is the positive to negative ratio used in testing, and where to find that in the code.

Thanks a lot

theano.function(self.get_pred_symb_vars(), self.pred_func)

Hi, thanks for sharing this repo. I am new to Theano. When I run the command python fb15k_run.py from your Readme, I get an error. Can you give me any advice on how to solve it? Thanks!

(env_complex) gyhu@mic119:/DATA/119/gyhu/code/complex$ python fb15k_run.py
WARNING (theano.configdefaults): install mkl with `conda install mkl-service`: No module named 'mkl'
WARNING (theano.tensor.blas): Using NumPy C-API based implementation for BLAS functions.
2019-09-10 16:23:59,712 (EFE) [INFO] Nb entities: 14951
2019-09-10 16:23:59,713 (EFE) [INFO] Nb relations: 1345
2019-09-10 16:23:59,713 (EFE) [INFO] Nb obs triples: 483142
2019-09-10 16:24:07,993 (EFE) [INFO] Learning rate: 0.5
2019-09-10 16:24:07,994 (EFE) [INFO] Max iter: 1000
2019-09-10 16:24:07,994 (EFE) [INFO] Generated negatives ratio: 10
2019-09-10 16:24:07,994 (EFE) [INFO] Batch size: 4831
2019-09-10 16:24:07,994 (EFE) [INFO] Starting grid search on: Complex_Logistic_Model

You can find the C code in this temporary file: /tmp/theano_compilation_error_8ipm7__4
Traceback (most recent call last):
File "fb15k_run.py", line 39, in <module>
fb15kexp.grid_search_on_all_models(all_params, embedding_size_grid = [emb_size], lmbda_grid = [lmbda], nb_runs = 1)
File "/DATA/119/gyhu/code/complex/efe/experiment.py", line 73, in grid_search_on_all_models
self.run_model(model_s,cur_params)
File "/DATA/119/gyhu/code/complex/efe/experiment.py", line 94, in run_model
model.fit(self.train, self.valid, Parameters(**vars(params)), self.n_entities, self.n_relations, self.n_entities, self.scorer)
File "/DATA/119/gyhu/code/complex/efe/models.py", line 152, in fit
self.setup_params_for_train(train_triples, valid_triples, hparams)
File "/DATA/119/gyhu/code/complex/efe/models.py", line 132, in setup_params_for_train
self.pred_func_compiled = theano.function(self.get_pred_symb_vars(), self.pred_func)
File "/DATA/119/gyhu/soft/Anaconda3/envs/env_complex/lib/python3.6/site-packages/theano/compile/function.py", line 317, in function
output_keys=output_keys)
File "/DATA/119/gyhu/soft/Anaconda3/envs/env_complex/lib/python3.6/site-packages/theano/compile/pfunc.py", line 486, in pfunc
output_keys=output_keys)
File "/DATA/119/gyhu/soft/Anaconda3/envs/env_complex/lib/python3.6/site-packages/theano/compile/function_module.py", line 1841, in orig_function
fn = m.create(defaults)
File "/DATA/119/gyhu/soft/Anaconda3/envs/env_complex/lib/python3.6/site-packages/theano/compile/function_module.py", line 1715, in create
input_storage=input_storage_lists, storage_map=storage_map)
File "/DATA/119/gyhu/soft/Anaconda3/envs/env_complex/lib/python3.6/site-packages/theano/gof/link.py", line 699, in make_thunk
storage_map=storage_map)[:3]
File "/DATA/119/gyhu/soft/Anaconda3/envs/env_complex/lib/python3.6/site-packages/theano/gof/vm.py", line 1091, in make_all
impl=impl))
File "/DATA/119/gyhu/soft/Anaconda3/envs/env_complex/lib/python3.6/site-packages/theano/gof/op.py", line 955, in make_thunk
no_recycling)
File "/DATA/119/gyhu/soft/Anaconda3/envs/env_complex/lib/python3.6/site-packages/theano/gof/op.py", line 858, in make_c_thunk
output_storage=node_output_storage)
File "/DATA/119/gyhu/soft/Anaconda3/envs/env_complex/lib/python3.6/site-packages/theano/gof/cc.py", line 1217, in make_thunk
keep_lock=keep_lock)
File "/DATA/119/gyhu/soft/Anaconda3/envs/env_complex/lib/python3.6/site-packages/theano/gof/cc.py", line 1157, in compile
keep_lock=keep_lock)
File "/DATA/119/gyhu/soft/Anaconda3/envs/env_complex/lib/python3.6/site-packages/theano/gof/cc.py", line 1624, in cthunk_factory
key=key, lnk=self, keep_lock=keep_lock)
File "/DATA/119/gyhu/soft/Anaconda3/envs/env_complex/lib/python3.6/site-packages/theano/gof/cmodule.py", line 1189, in module_from_key
module = lnk.compile_cmodule(location)
File "/DATA/119/gyhu/soft/Anaconda3/envs/env_complex/lib/python3.6/site-packages/theano/gof/cc.py", line 1527, in compile_cmodule
preargs=preargs)
File "/DATA/119/gyhu/soft/Anaconda3/envs/env_complex/lib/python3.6/site-packages/theano/gof/cmodule.py", line 2396, in compile_str
(status, compile_stderr.replace('\n', '. ')))
Exception: ('The following error happened while compiling the node', AdvancedSubtensor1(e1, tubes), '\n', "Compilation failed (return status=1): /tmp/ccpoOASs.s: Assembler messages:. /tmp/ccpoOASs.s:1675: Error: no such instruction: `vinserti128 $0x1,%xmm0,%ymm1,%ymm0'. /tmp/ccpoOASs.s:1680: Error: no such instruction: `vextracti128 $0x1,%ymm0,16(%r12)'. ", '[AdvancedSubtensor1(e1, tubes)]')

Question about concatenation in extract_sub_scores function.

Hi Théo,

I wonder if this line

res = Result(res.preds[idxs], res.true_vals[idxs], res.ranks[np.concatenate((idxs,idxs))], res.raw_ranks[np.concatenate((idxs,idxs))])
should be "res = Result(res.preds[idxs], res.true_vals[idxs], res.ranks[np.concatenate((idxs, len(idxs)+idxs))], res.raw_ranks[np.concatenate((idxs, len(idxs)+idxs))])". That is, add len(idxs) to the concatenation.

Thanks a lot.

Loss Function of Trans_L1_Model

Hello,

I might be wrong, but is the self.loss function of the Trans_L1_Model correctly implemented in line 453 of models.py, especially the reshape() part?

I am doubting it because, let's say I have batch_size = 2 and neg_ratio = 4; the portion of the code just before the reshape gives a matrix as in figure (a) of the attached
Picture1.pdf, where c1, c3, c5, c7 represent corruptions of the same true triple and c2, c4, c6, c8 represent corruptions of another triple.

When we reshape it by calling reshape((int(batch_size), int(neg_ratio))) as in the current code, we get something like figure (b). If this is followed by a sum along dimension 1, then the data of two different triples is added together, as in figure (c) (the first column has sum c1+c2+c3+c4, where c1, c3 belong to one triple and c2, c4 to another).

On the other hand, if we implement it as reshape((neg_ratio, batch_size)) followed by a sum along dimension 0, then only the corruptions of one triple are added together.

Please correct me if my understanding of this code is wrong.

Thanks in advance
Navdeep
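The layout concern raised here can be checked with a few lines of NumPy (illustration only, using the toy sizes from the question; whether it applies depends on how the flat vector is actually ordered in batching.py):

import numpy as np

batch_size, neg_ratio = 2, 4
# Flat vector ordered as in the question: c1..c8, where c1, c3, c5, c7 corrupt
# one true triple and c2, c4, c6, c8 corrupt the other.
c = np.array([1, 2, 3, 4, 5, 6, 7, 8])

print(c.reshape((batch_size, neg_ratio)).sum(axis=1))  # [10 26]: each entry mixes corruptions of both triples
print(c.reshape((neg_ratio, batch_size)).sum(axis=0))  # [16 20]: each entry sums corruptions of one triple only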

Print scores at index

Hello,
First off: great module! I just implemented it and it works like a charm with my own data.
Now I want to do some error analysis by printing out the top 100 scores of certain relationship types.
Is there any built-in way to do that?

Thanks,

Run question.

Hello,

I have done all the steps in your readme file.

However, I have encountered a problem, shown in the attached screenshot:

[image]

Could you help me solve this problem?

Thank you.

James.

Just replace the tail entity when testing.

Hello,

Thanks for your great work.

I have a question about how to change the corrupted triple generation when testing.

I just want to replace the tail entities rather than the head entities.

Thanks!
