torchkge-team / torchkge
TorchKGE: Knowledge Graph embedding in Python and PyTorch.
License: Other
Hi there,
I recently came across this package and really like it so far. I think your API is one of the best implementations I've seen for graph training!
I was wondering whether support for very large graph training is on your roadmap, similar to what PyTorch-BigGraph does with its partitioning and distributed training. This API with that kind of support could be really useful for large graph embedding training.
Thanks and well done again on a great package.
I was trying to run the 'Simplest training' example available on the torchkge site.
For some reason, it keeps raising an error saying that all my tensors should be on the same device, even though I simply copy-pasted the example and used my own dataset.
The code works only if I change the `use_cuda` parameter in `dataloader = DataLoader(train, batch_size=batch_size, use_cuda="all")` to `None`, i.e. when I move my dataloader to the CPU, which slows down training.
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
[<ipython-input-27-e141f90ae057>](https://localhost:8080/#) in <module>
18 for i, batch in enumerate(dataloader):
19 h, t, r = batch[0], batch[1], batch[2]
---> 20 n_h, n_t = sampler.corrupt_batch(h, t, r)
21 optimizer.zero_grad()
22
[/usr/local/lib/python3.9/dist-packages/torchkge/sampling.py](https://localhost:8080/#) in corrupt_batch(self, heads, tails, relations, n_neg)
315 # Randomly choose which samples will have head/tail corrupted
316 mask = bernoulli(self.bern_probs[relations].repeat(n_neg)).double()
--> 317 n_h_cor = int(mask.sum().item())
318 neg_heads[mask == 1] = randint(1, self.n_ent,
319 (n_h_cor,),
RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu)
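The last frame of the traceback indexes `self.bern_probs` (reported as living on the CPU) with `relations`, which comes from a CUDA dataloader. Below is a minimal sketch of one generic way around that kind of mismatch in plain PyTorch. The helper name `bernoulli_mask` is hypothetical and is not part of torchkge; it only illustrates moving the indices to the device of the indexed tensor before indexing.

```python
import torch


def bernoulli_mask(bern_probs, relations, n_neg=1):
    """Draw a head/tail corruption mask without mixing devices.

    Hypothetical helper, not torchkge code: `bern_probs` holds one
    probability per relation and `relations` holds relation indices that
    may live on another device.
    """
    # Index on the device that owns `bern_probs` to avoid the RuntimeError.
    idx = relations.to(bern_probs.device)
    probs = bern_probs[idx].repeat(n_neg)
    # Return the mask on the device of the batch so later ops stay together.
    return torch.bernoulli(probs).to(relations.device)


device = "cuda" if torch.cuda.is_available() else "cpu"
bern_probs = torch.tensor([0.0, 1.0])                  # stays on the CPU
relations = torch.tensor([0, 1, 1, 0], device=device)  # follows the batch
mask = bernoulli_mask(bern_probs, relations, n_neg=2)
print(mask.tolist())
```

The same idea works in the other direction: moving the indexed tensor to the GPU once avoids one copy per batch.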
I read the source code of RESCALModel and realized that there is no regularization term, as opposed to the original RESCAL paper by Nickel.
Can somebody explain why torchkge does not use a regularization term?
Hi, does the index of each row in the returned tensor correspond to the index in ent2idx, or is the returned embeddings tensor unordered?
If the returned tensor has size 131414 and ent2idx also has size 131414, does this mean I can access each entity's embedding with tensor[ent2idx['entity_name']]?
I am trying to train a ConvKB model using an instance of the class Trainer https://github.com/torchkge-team/torchkge/blob/master/torchkge/utils/training.py. However for ConvKB, the call to the method self.model.normalize_parameters() raises a NotImplementedError. Is this necessary for correct training of ConvKB? Is there any other way to train a ConvKB model properly? If not, are there any plans to support it in the future?
Thanks in advance for your support and help!
Best,
Luis
Currently, the `KnowledgeGraph` class accepts a data frame containing three columns (`['from', 'rel', 'to']`), and I feel it would be nice to be able to provide some facts that are known to be false, with a fourth column containing a boolean value.
It could be used as a complement to the false facts that are generated by the sampler during the training of a model.
I think it would be particularly interesting when using the test kg with the `LinkPredictionEvaluator`, for which we could provide false facts to analyse the accuracy of the model we are evaluating.
What is your opinion on the subject?
I'll take this opportunity to thank you for the majestic work you've been doing so far!
Possibly add an extra \t after Mean Rank : {} for better result presentation. Apologies if this is seen as a non-issue
In the negative sampling section you say "For each true triplet, produce a corrupted one not different from any other true triplet", but I'm not sure how you ensure that a negative sample is not itself a true triplet. I hope to get an answer.
I only found this relevant code:
`
n_h_cor = int(mask.sum().item())
neg_heads[mask == 1] = randint(1, self.n_ent,
                               (n_h_cor,),
                               device=device)
neg_tails[mask == 0] = randint(1, self.n_ent,
                               (batch_size * n_neg - n_h_cor,),
                               device=device)
`
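The snippet quoted above draws replacement entities uniformly at random, so on its own it does not exclude corruptions that happen to be true triplets. For comparison, here is a minimal sketch of a filtered variant based on rejection sampling, written in plain Python. The names `corrupt_filtered` and `known_facts` are hypothetical and not part of torchkge; triplets are ordered (head, tail, relation) to match the batch layout in the traceback earlier on this page.

```python
import random


def corrupt_filtered(triplet, n_ent, known_facts, rng, max_tries=100):
    """Corrupt the head or the tail of `triplet`, rejecting known true facts.

    Hypothetical helper (not torchkge code). `triplet` is (head, tail,
    relation) and `known_facts` is a set of such tuples.
    Returns None if no valid corruption is found within `max_tries` draws.
    """
    h, t, r = triplet
    for _ in range(max_tries):
        e = rng.randrange(n_ent)
        # Corrupt the head or the tail with equal probability.
        candidate = (e, t, r) if rng.random() < 0.5 else (h, e, r)
        if candidate not in known_facts:
            return candidate
    return None


known_facts = {(0, 1, 0), (2, 1, 0), (0, 2, 0)}
negative = corrupt_filtered((0, 1, 0), n_ent=3, known_facts=known_facts,
                            rng=random.Random(0))
print(negative)
```

Rejection sampling is cheap when the graph is sparse, because most random corruptions are not known facts; the set lookup is the only extra cost per draw.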
I'm using the HolE model you implemented.
A problem occurred while scoring.
Error:
/usr/local/lib/python3.8/dist-packages/torchkge/models/bilinear.py in inference_scoring_function(self, h, t, r)
375 # this is the tail completion case in link prediction
376 h = h.view(b_size, 1, self.emb_dim)
--> 377 hr = matmul(h, r).view(b_size, self.emb_dim, 1)
378 return (hr * t.transpose(1, 2)).sum(dim=1)
379 elif len(h.shape) == 3:
RuntimeError: Expected size for first two dimensions of batch2 tensor to be: [64, 500] but got: [64, 20387].
64 is the batch size, 500 is the embedding size, and 20387 is the number of candidates.
So I checked the sizes of candi, head, tail and relation:
candi = torch.Size([64, 500, 500]) b_size, rel_emb_dim, n_ent, dtype: torch.float
head = torch.Size([64, 500]) b_size, rel_emb_dim, dtype: torch.float
tail = torch.Size([64, 500]) b_size, rel_emb_dim, dtype: torch.float
relation = torch.Size([64, 20387, 500]) b_size, rel_emb_dim
The sizes of candi and relation look weird.
So I tried running the evaluation on the FB15k data using your tutorial code, and checked the sizes of candi, head, tail and relation again.
This is the result:
candi = torch.Size([1, 14951, 100]) b_size, rel_emb_dim, n_ent, dtype: torch.float
head = torch.Size([1, 100]) b_size, rel_emb_dim, dtype: torch.float
tail = torch.Size([1, 100]) b_size, rel_emb_dim, dtype: torch.float
relation = torch.Size([1, 100]) b_size, rel_emb_di
I didn't find any problem with my input data generation.
I wrote the code below to generate the input data.
I tried both the pandas and the kg methods.
The raw data is a txt file with tab-separated fields in the order head, relation, tail.
But it didn't work.
`import os

def load_data(file_path, name_entity_data, name_relation_data, name_train_data,
              name_valid_data, name_test_data, name_all_data, name_AUC_data):
    # file_path = '/content/drive/MyDrive/
    print("load data from {}".format(file_path))
    # Entity file: one "<id>\t<entity>" pair per line
    with open(os.path.join(file_path, name_entity_data)) as f:
        entity2id = dict()
        id2entity = dict()
        for line in f:
            eid, entity = line.strip().split('\t')
            entity2id[entity] = int(eid)
            id2entity[eid] = entity
    # Relation file: one "<id>\t<relation>" pair per line
    with open(os.path.join(file_path, name_relation_data)) as f:
        relation2id = dict()
        id2relation = dict()
        for line in f:
            rid, relation = line.strip().split('\t')
            relation2id[relation] = int(rid)
            id2relation[rid] = relation
    # Read every split with the same entity and relation dictionaries
    kg_train = read_triplets_to_kg(os.path.join(file_path, name_train_data), entity2id, relation2id)
    kg_valid = read_triplets_to_kg(os.path.join(file_path, name_valid_data), entity2id, relation2id)
    kg_test = read_triplets_to_kg(os.path.join(file_path, name_test_data), entity2id, relation2id)
    kg_all = read_triplets_to_kg(os.path.join(file_path, name_all_data), entity2id, relation2id)
    kg_auc = read_triplets_to_kg(os.path.join(file_path, name_AUC_data), entity2id, relation2id)
    print('num_entity: {}'.format(len(entity2id)))
    print('num_relation: {}'.format(len(relation2id)))
    print('num_kg_train: {}'.format(len(kg_train['heads'])))
    print('num_kg_valid: {}'.format(len(kg_valid['heads'])))
    print('num_kg_test: {}'.format(len(kg_test['heads'])))
    return entity2id, relation2id, id2entity, id2relation, kg_train, kg_valid, kg_test, kg_all, kg_auc`
`import torch

def read_triplets_to_kg(file_path, entity2id, relation2id):
    heads = []
    tails = []
    relations = []
    kg = dict()
    # Each line holds "<head>\t<relation>\t<tail>"
    with open(file_path) as f:
        for line in f:
            head, relation, tail = line.strip().split('\t')
            heads.append(entity2id[head])
            tails.append(entity2id[tail])
            relations.append(relation2id[relation.strip()])
    # Store the index lists as LongTensors, keyed as KnowledgeGraph(kg=...) expects
    kg['heads'] = torch.LongTensor(heads)
    kg['tails'] = torch.LongTensor(tails)
    kg['relations'] = torch.LongTensor(relations)
    return kg`
And in the middle of the model training code:
`entity2id, relation2id, id2entity, id2relation, kg_train, kg_valid, kg_test, kg_all, kg_auc = load_data(file_path, name_entity_data, name_relation_data, train, valid, test, all, auc)
kg_train = KnowledgeGraph(kg=kg_train, ent2ix=entity2id, rel2ix=relation2id)
kg_valid = KnowledgeGraph(kg=kg_valid, ent2ix=entity2id, rel2ix=relation2id)
kg_test = KnowledgeGraph(kg=kg_test, ent2ix=entity2id, rel2ix=relation2id)
kg_auc = KnowledgeGraph(kg=kg_auc, ent2ix=entity2id, rel2ix=relation2id)`
I see that there is a potential misalignment between the paper and implemented version of TransE. Can you help clarify?
The misalignment I see is that the loss in the paper is always computed using the normalized head and normalized tail embeddings (based on the pseudo-code). However, in the implementation, despite the entity embeddings being re-normalized at the end of each epoch, after each minibatch a gradient update is made. This means that the loss is not computed for normalized head and tail embeddings. Only for the first minibatch is the loss computed on the normalized embeddings but not for the rest.
Edit: Never mind. All head and tail vectors are also normalized in the scoring function, which resolves this. See torchkge/torchkge/models/translation.py, line 69 in d56e9d8.
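To illustrate why normalizing inside the scoring function settles the question, here is a minimal plain-Python sketch of a TransE-style distance that L2-normalizes the head and tail at scoring time. The function names are illustrative and this is not torchkge's code; the point is only that the result no longer depends on the norm the entity vectors had after the last gradient step.

```python
import math


def l2_normalize(vec):
    """Scale `vec` to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def transe_distance(h, r, t):
    """Euclidean norm of h/||h|| + r - t/||t|| (entities normalized here)."""
    h, t = l2_normalize(h), l2_normalize(t)
    return math.sqrt(sum((a + b - c) ** 2 for a, b, c in zip(h, r, t)))


r = [-0.6, 0.2]
d_small = transe_distance([3.0, 4.0], r, [0.0, 2.0])
# Rescaling the raw entity vectors leaves the distance unchanged.
d_large = transe_distance([30.0, 40.0], r, [0.0, 200.0])
print(d_small, d_large)
```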
I am studying how you compute the filtered ranking, but the function "lp_helper" does not seem to be implemented.
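For readers unfamiliar with the term, a minimal plain-Python sketch of what a filtered rank usually means in link prediction is given below: candidates that are themselves known true answers are skipped when counting how many entities score higher than the true one. `filtered_rank` is an illustrative name and not the library's missing function.

```python
def filtered_rank(scores, true_idx, known_idx):
    """Rank of the true entity among candidates, ignoring other known answers.

    `scores[i]` is the model score of candidate entity i (higher is better),
    `true_idx` is the entity of the evaluated triplet, and `known_idx` is the
    set of other entities that also form a true triplet.
    """
    true_score = scores[true_idx]
    rank = 1
    for i, s in enumerate(scores):
        if i == true_idx or i in known_idx:
            continue  # filtered out: not counted against the true entity
        if s > true_score:
            rank += 1
    return rank


scores = [0.9, 0.8, 0.7, 0.1]
raw = filtered_rank(scores, true_idx=2, known_idx=set())
filt = filtered_rank(scores, true_idx=2, known_idx={0})
print(raw, filt)
```

With an empty `known_idx` the function reduces to the raw rank, so the two settings can share one implementation.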
I am exploring using this for training TransE embeddings and had some beginner questions:
I am starting to dig into the docs to understand the framework better, so apologies if some of these questions are already covered there.
In the original paper, the authors proposed three soft constraints and added a hyperparameter C to weight the importance of these constraints. However, looking at torchkge/torchkge/utils/losses.py (line 12 in a3474b7), I could not find the C term.
I found a similar issue in OpenKE, "Weight C in TransH missing". Is this the same reason torchkge ignores C, even though torchkge uses a different normalization method from OpenKE?
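For reference, the soft-constrained objective of the TransH paper can be written as below. The notation follows the paper rather than torchkge's code: $\gamma$ is the margin, $\mathbf{d}_r$ the translation vector and $\mathbf{w}_r$ the hyperplane normal of relation $r$, $\epsilon$ a small tolerance, and $[x]_+ = \max(0, x)$. The third constraint, $\lVert \mathbf{w}_r \rVert_2 = 1$, is not weighted by $C$; the paper enforces it by projecting each $\mathbf{w}_r$ to unit norm.

```latex
\mathcal{L} =
  \sum_{(h,r,t) \in \Delta} \; \sum_{(h',r',t') \in \Delta'}
    \bigl[ f_r(\mathbf{h}, \mathbf{t}) + \gamma - f_{r'}(\mathbf{h}', \mathbf{t}') \bigr]_+
  \; + \; C \Biggl\{
    \sum_{e \in E} \bigl[ \lVert \mathbf{e} \rVert_2^2 - 1 \bigr]_+
    + \sum_{r \in R} \Bigl[ \frac{(\mathbf{w}_r^{\top} \mathbf{d}_r)^2}{\lVert \mathbf{d}_r \rVert_2^2} - \epsilon^2 \Bigr]_+
  \Biggr\}
```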
Assuming that we use SmallKG or KnowledgeGraph classes to train the models with our own dataset:
Thanks,
Mladen
I was looking into the get_df() method of the KnowledgeGraph class and, at line 397, it seems that i2e is used but not declared anywhere. Or am I missing something here? The method seems to be uncallable.
In the file "translation", lines 210-215:
self.ent_emb.weight.data = normalize(self.ent_emb.weight.data,
                                     p=2, dim=1)
self.norm_vect.weight.data = normalize(self.norm_vect.weight.data,
                                       p=2, dim=1)
self.rel_emb.weight.data = self.project(self.ent_emb.weight.data,
                                        self.norm_vect.weight.data)
For self.rel_emb.weight.data, why is it "self.project(self.ent_emb.weight.data, self.norm_vect.weight.data)"?
It also raises this error during the model initialization phase:
RuntimeError: The size of tensor a (14541) must match the size of tensor b (237) at non-singleton dimension 0
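The sizes in the error message, 14541 and 237, match the entity and relation counts of FB15k-237, so the failing call appears to pair an entity-sized tensor with a relation-sized one. In TransH, the vector projected onto a relation's hyperplane is that relation's own translation vector, d_r - (w_r . d_r) w_r. The sketch below shows such a row-wise projection in plain Python. `project` and `project_rows` are illustrative helpers, not torchkge's methods, and the assumption that the library meant to project `rel_emb` here is mine.

```python
def project(vec, normal):
    """Remove from `vec` its component along the unit vector `normal`."""
    dot = sum(v * n for v, n in zip(vec, normal))
    return [v - dot * n for v, n in zip(vec, normal)]


def project_rows(embeddings, normals):
    """Project row i of `embeddings` onto the hyperplane with normal row i."""
    if len(embeddings) != len(normals):
        # Same situation as the reported size mismatch (14541 vs 237).
        raise ValueError(
            "row counts differ: {} vs {}".format(len(embeddings), len(normals)))
    return [project(e, n) for e, n in zip(embeddings, normals)]


rel_emb = [[1.0, 2.0], [3.0, 4.0]]    # one translation vector per relation
norm_vect = [[0.0, 1.0], [1.0, 0.0]]  # one unit normal per relation
projected = project_rows(rel_emb, norm_vect)
print(projected)
```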
The following error occurs when evaluating the performance of a TransR model:
evaluator = LinkPredictionEvaluator(model, kg_valid)
evaluator.evaluate(b_size=1, verbose=False)
'TransRModel' object has no attribute 'emb_dim'
I had to implement this function for the project I am currently working on, and I guess it could be handy to someone else.
I don't really know which file would be the best place for this function, but I would be glad to create a pull request if someone has a good idea about it.
Also, if you have remarks or advice about the implementation, do not hesitate to point them out to me.
import pandas as pd
from torchkge.data_structures import KnowledgeGraph


def kg2df(kg: KnowledgeGraph) -> pd.DataFrame:
    """
    Revert a torchKGE `KnowledgeGraph` into a pandas `DataFrame`.
    :param kg: A knowledge graph.
    :return: A dataframe containing the same information as the knowledge graph.
    """
    # Invert the name -> index dictionaries to map indices back to names
    ix2rel = {ix: rel for rel, ix in kg.rel2ix.items()}
    ix2ent = {ix: ent for ent, ix in kg.ent2ix.items()}
    # Convert the index tensors to plain Python ints so that `map` can look them up
    df = pd.DataFrame({'from': kg.head_idx.tolist(),
                       'rel': kg.relations.tolist(),
                       'to': kg.tail_idx.tolist()})
    df['from'] = df['from'].map(ix2ent)
    df['rel'] = df['rel'].map(ix2rel)
    df['to'] = df['to'].map(ix2ent)
    return df
Hi, thanks for your work. Recently, I have been trying to train my model with the 1-N scoring method from ConvE. Have you implemented it?
I hope for your reply. Thanks!
I'm going to use the HolE model in your code.
Before I use it, I have a few questions, especially about scoring_function.
Is the matmul step of the scoring correct?
In the related article, the circular convolution between subject and object is computed first, and the result is then multiplied with the relation vector.
But your code first computes the circular convolution between the subject and the relation vector, and then multiplies the result with the object vector.
After the modular shifting of the relation matrix, e.g. relation matrix shape (2, 3) → (2, 3, 3), the matmul with the head embedding vector in your code multiplies by column.
Shouldn't it multiply by row instead?
E.g. this is a matrix after mod shifting:
[[1,1,1],
[2,2,2],
[3,3,3]]
When multiplied with the head vector, it is computed over the column [1, 2, 3], not over the row [1, 1, 1] ...
Hi,
An implementation of GAATs by Wang et al. would be welcome. https://ieeexplore.ieee.org/abstract/document/8946600
from torchkge.data_structures import KnowledgeGraph
https://github.com/torchkge-team/torchkge/blob/master/torchkge/data_structures.py
I noticed that it takes a significant amount of time to build. Are there academic works on implementing graphs efficiently that are employed in TorchKGE?
As I understand it, it creates a knowledge graph tensor from the list of knowledge graph triplets. Is this correct?
Hi,
I realized the scoring function (method scoring_function) of the ConvKB model returns two values for a given triple. These are the results of a softmax activation that, from what I understand, has two output neurons that represent the probabilities of a triple being true and false. However I do not know in which order these cases are defined in the neural network. Which output neuron corresponds to which case? I could unfortunately not figure it out from reading the code.
Thanks in advance for your attention and help!
Best,
Luis
Hi, I recently noticed this project and the corresponding paper, and I found that evaluation time is a big advantage of this project. I want to know why `torchkge` can evaluate faster than `OpenKE` or `AmpliGraph`. I have read the corresponding evaluation files, e.g. `torchkge/evaluation/link_prediction.py`, but I still don't know the key to this speed. Can you help me?
The reason for my interest in evaluation time is that a relatively complex model spends a lot of time evaluating on some large KGs, so I hope to gain some insights from this project to accelerate the evaluation stage.
I would appreciate your reply. Thanks!
`TrainDataLoader` seems to fail to return an iterator, hence its `def get_counter_examples(self) -> SmallKG` method fails with an error message hinting that `self.iterator` is `None`.
So `trainer.run()` fails in the Shortest Training example from the docs.
File "/project-root/test.py", line 38, in main trainer.run()
File "/project-root/.venv/lib/python3.10/site-packages/torchkge/utils/training.py", line 179, in run
self.counter_examples = data_loader.get_counter_examples()
File "/project-root/.venv/lib/python3.10/site-packages/torchkge/utils/training.py", line 64, in get_counter_examples
return SmallKG(self.iterator.nh, self.iterator.nt, self.iterator.r)
AttributeError: 'NoneType' object has no attribute 'nh'
1. Copy the Shortest Training example from the docs.
2. Run it on Ubuntu and macOS.
For the work I had to do, it was interesting to split the accuracy metric by predicate.
Particularly for debugging purposes, if you want to check which relations are "well understood" by the model and which are not.
Here's an implementation of this task:
from torch import count_nonzero, masked_select
from torchkge.evaluation import TripletClassificationEvaluator


class PredicateTripletClassificationEvaluator(TripletClassificationEvaluator):
    """This evaluator has the particularity to measure the accuracy by predicate.
    Apart from that, it's the exact replica of the `TripletClassificationEvaluator`
    """

    def __init__(self, model, kg_val, kg_test):
        super().__init__(model, kg_val, kg_test)

    def accuracy(self, b_size: int) -> dict:
        """
        Parameters
        ----------
        b_size: int
            Batch size.

        Returns
        -------
        acc: Dict[str, float]
            Share by predicate of all triplets (true and negatively sampled ones)
            that were correctly classified using the thresholds learned from the
            validation set.
        """
        if not self.evaluated:
            self.evaluate(b_size)

        r_idx = self.kg_test.relations

        # Corrupt the test triplets to obtain one negative per true triplet
        neg_heads, neg_tails = self.sampler.corrupt_kg(b_size,
                                                       self.is_cuda,
                                                       which='test')
        scores = self.get_scores(self.kg_test.head_idx,
                                 self.kg_test.tail_idx,
                                 r_idx,
                                 b_size)
        neg_scores = self.get_scores(neg_heads, neg_tails, r_idx, b_size)

        if self.is_cuda:
            self.thresholds = self.thresholds.cuda()

        # True where the triplet is classified correctly
        scores = (scores > self.thresholds[r_idx])
        neg_scores = (neg_scores < self.thresholds[r_idx])

        accuracy_by_predicate = {}
        for predicate, rel_index in self.kg_test.rel2ix.items():
            # Select the true and negative triplets of this predicate
            mask = (self.kg_test.relations == rel_index)
            masked_scores = masked_select(scores, mask)
            masked_neg_scores = masked_select(neg_scores, mask)
            accuracy_by_predicate[predicate] = (
                count_nonzero(masked_scores).item() +
                count_nonzero(masked_neg_scores).item()
            ) / (2 * count_nonzero(mask).item())
        return accuracy_by_predicate
The `Trainer` class is great, but I needed a way to retrieve and store the counter-examples used during the training phase.
I will soon propose a pull request implementing that.
If you have any remarks, advice or questions, do not hesitate to contact me.
Hi, I found that the default number of negative samples during training is 1 in the `BernoulliNegativeSampler`. I changed it with `BernoulliNegativeSampler(kg, n_neg=100)`, but it doesn't seem to work.
Could you tell me how to change the number of negative samples?
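One thing to keep in mind, judging from the sampler code quoted in the tracebacks earlier on this page (`self.bern_probs[relations].repeat(n_neg)` and a size of `batch_size * n_neg`), is that the sampler returns `n_neg` negatives per positive in one long batch, tiled block after block. The positives then have to be repeated in the same order before both are fed to a pairwise loss. Below is a minimal plain-Python sketch of that alignment; `tile_positives` is a hypothetical helper, and whether this is the cause of the problem reported here is only a guess.

```python
def tile_positives(heads, tails, relations, n_neg):
    """Repeat the positive batch n_neg times, block after block.

    This mirrors the layout [batch, batch, ...] that a tiled repeat
    produces, so positive i lines up with negatives i, i + b, i + 2b, ...
    """
    return heads * n_neg, tails * n_neg, relations * n_neg


heads, tails, relations = [1, 2], [3, 4], [0, 1]
pos_h, pos_t, pos_r = tile_positives(heads, tails, relations, n_neg=2)
print(pos_h, pos_t, pos_r)
```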