torchkge-team / torchkge

TorchKGE: Knowledge Graph embedding in Python and PyTorch.

License: Other

Python 100.00%

torchkge's Issues

Distributed Support for large graphs

Hi there,

I recently came across this package and really like it so far. I think your API is one of the best implementations I've seen for graph training!

I was wondering whether support for training on very large graphs is on your roadmap, similar to what PyTorch BigGraph does with its partitioning and distributed training. This API with that kind of support could be really useful for large graph embedding training.

Thanks and well done again on a great package.

data_structure.py line 147 and line 148 are duplicated

TorchKGE version:
Python version:
Operating System:

Description

Describe what you were trying to get done.
Tell us what happened, what went wrong, and what you expected to happen.

What I Did

Paste the command(s) you ran and the output.
If there was a crash, please include the traceback here.

Can there be an example for ConvKB and RotatE

RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu)

TorchKGE version: 0.17.5
Python version: 3.9.16
Operating System: Ubuntu on Colab

Description

I was trying to run the 'Simplest training' example available on the torchkge site.
For some reason, it keeps giving me an error that all my tensors should be on the same device. However, I have simply copy-pasted the example with my own dataset.

What I Did

The code works only if I change the use_cuda parameter in dataloader = DataLoader(train, batch_size=batch_size, use_cuda="all") to None, i.e. when I move my dataloader to the CPU, which slows down training.

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-27-e141f90ae057> in <module>
     18     for i, batch in enumerate(dataloader):
     19         h, t, r = batch[0], batch[1], batch[2]
---> 20         n_h, n_t = sampler.corrupt_batch(h, t, r)
     21         optimizer.zero_grad()
     22 

/usr/local/lib/python3.9/dist-packages/torchkge/sampling.py in corrupt_batch(self, heads, tails, relations, n_neg)
    315         # Randomly choose which samples will have head/tail corrupted
    316         mask = bernoulli(self.bern_probs[relations].repeat(n_neg)).double()
--> 317         n_h_cor = int(mask.sum().item())
    318         neg_heads[mask == 1] = randint(1, self.n_ent,
    319                                        (n_h_cor,),

RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu)

No regularization term for Rescal

TorchKGE version: 0.6.13

I read the source code of RESCALModel and noticed that there is no regularization term, as opposed to the original RESCAL paper by Nickel.
Can somebody explain why torchkge does not use a regularization term?

get_embeddings()

Hi, do the indices of the returned tensor correspond to the indices in ent2idx, or is the returned embeddings tensor unordered?

If the returned tensor has size 131414 and ent2idx also has size 131414, does this mean I can access each entity's embedding with tensor[ent2idx['entity_name']]?
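
A minimal sketch of the lookup being asked about, using plain Python lists rather than torchkge. It assumes, as the question does, that row `i` of the returned embeddings belongs to the entity whose index in `ent2idx` is `i`; whether `get_embeddings()` guarantees this ordering is exactly what needs confirming. The entity names and values are made up.

```python
# Hypothetical data: entity name -> row index, and one embedding row per index.
ent2idx = {"paris": 0, "france": 1, "berlin": 2}
embeddings = [
    [0.1, 0.2],  # row 0
    [0.3, 0.4],  # row 1
    [0.5, 0.6],  # row 2
]


def lookup(entity_name):
    """Return the embedding row of an entity, assuming rows follow ent2idx order."""
    return embeddings[ent2idx[entity_name]]


# The mapping and the matrix need the same number of entries for this to be safe.
assert len(ent2idx) == len(embeddings)
france_vec = lookup("france")
```

Equal sizes are necessary for this lookup to make sense, but they do not by themselves prove that the rows are ordered by index.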

ConvKB full implementation

TorchKGE version: 0.17.2
Python version: 3.8
Operating System: Linux

Description

I am trying to train a ConvKB model using an instance of the class Trainer https://github.com/torchkge-team/torchkge/blob/master/torchkge/utils/training.py. However for ConvKB, the call to the method self.model.normalize_parameters() raises a NotImplementedError. Is this necessary for correct training of ConvKB? Is there any other way to train a ConvKB model properly? If not, are there any plans to support it in the future?

Thanks in advance for your support and help!

Best,
Luis

[Feature request] Possibility to provide false facts to the `KnowledgeGraph` class

Currently, the KnowledgeGraph class accepts a data frame containing three columns (['from', 'rel', 'to']). I feel it would be nice to be able to provide some facts that are known to be false, with a fourth column containing a boolean value.

It could be used as a complement to the false facts that are generated through the sampler during the training of a model.
And I think that it would be particularly interesting while using the test kg with the LinkPredictionEvaluator, for which we could provide false facts to analyse the accuracy of the model we are evaluating.

What is your opinion on the subject?
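
To make the proposal concrete, here is a framework-free sketch of what a fourth boolean column could carry. The field name `is_true` and the example facts are hypothetical; nothing like this exists in `KnowledgeGraph` as described above.

```python
# Each row is (from, rel, to, is_true); the fourth field is the proposed addition.
rows = [
    ("paris", "capital_of", "france", True),
    ("berlin", "capital_of", "france", False),  # a fact known to be false
    ("berlin", "capital_of", "germany", True),
]

# True facts would feed the usual training or evaluation path.
true_facts = [(h, r, t) for h, r, t, is_true in rows if is_true]
# Known-false facts could complement the negatives produced by a sampler.
false_facts = [(h, r, t) for h, r, t, is_true in rows if not is_true]
```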

I'll take this opportunity to thank you for the majestic work you've been doing so far!

Very minor issue on print result

torchkge/torchkge/evaluation/link_prediction.py

Line 232 in a80743d

print('Mean Rank : {} \t Filt. Mean Rank : {}'.format(

Possibly add an extra \t after Mean Rank : {} for better result presentation. Apologies if this is seen as a non-issue

question of the sampling

Description

I see that the negative sampling section says "For each true triplet, produce a corrupted one not different from any other true triplet", but I'm not sure how you ensure that a negative sample is not itself a true triplet. I am hoping to get an answer.

I only found this relevant code:
n_h_cor = int(mask.sum().item())
neg_heads[mask == 1] = randint(1, self.n_ent,
                               (n_h_cor,),
                               device=device)
neg_tails[mask == 0] = randint(1, self.n_ent,
                               (batch_size * n_neg - n_h_cor,),
                               device=device)
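
For reference, one common way to guarantee that a corrupted triplet is not a known true one is rejection sampling against the set of true triplets. The sketch below shows that idea in plain Python; it is not torchkge's sampler, and the function name `corrupt` and its parameters are hypothetical.

```python
import random


def corrupt(triplet, true_triplets, n_ent, rng, max_tries=100):
    """Corrupt the head or the tail, resampling while the result is a known true triplet."""
    h, t, r = triplet
    for _ in range(max_tries):
        new_ent = rng.randrange(n_ent)
        # Corrupt the head or the tail with equal probability.
        candidate = (new_ent, t, r) if rng.random() < 0.5 else (h, new_ent, r)
        if candidate not in true_triplets:
            return candidate
    raise RuntimeError("no valid negative found")


rng = random.Random(0)
true_triplets = {(0, 1, 0), (2, 1, 0), (0, 3, 0)}  # (head, tail, relation) indices
negatives = [corrupt(tr, true_triplets, n_ent=5, rng=rng) for tr in sorted(true_triplets)]
```

The price is one set lookup per candidate, which a purely vectorised random draw avoids.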

KnowledgeGraph embedding size problem in HoLE

TorchKGE version: 0.17.5
Python version: Python 3.8.10
Operating System: colab

I'm using the HolE model you implemented.
A problem occurred while scoring.

Error =
/usr/local/lib/python3.8/dist-packages/torchkge/models/bilinear.py in inference_scoring_function(self, h, t, r)
375 # this is the tail completion case in link prediction
376 h = h.view(b_size, 1, self.emb_dim)
--> 377 hr = matmul(h, r).view(b_size, self.emb_dim, 1)
378 return (hr * t.transpose(1, 2)).sum(dim=1)
379 elif len(h.shape) == 3:

RuntimeError: Expected size for first two dimensions of batch2 tensor to be: [64, 500] but got: [64, 20387].
64 is the batch size, 500 is the embedding size, and 20387 is the number of candidates.

So I checked the sizes of candi, head, tail and relation:
candi = torch.Size([64, 500, 500]) b_size, rel_emb_dim, n_ent, dtype: torch.float
head = torch.Size([64, 500]) b_size, rel_emb_dim, dtype: torch.float
tail = torch.Size([64, 500]) b_size, rel_emb_dim, dtype: torch.float
relation = torch.Size([64, 20387, 500]) b_size, rel_emb_dim

The sizes of candi and relation look weird.

So I tried evaluating on the FB15k data using your tutorial code and checked the sizes of candi, head, tail and relation.
This is the result:
candi = torch.Size([1, 14951, 100]) b_size, rel_emb_dim, n_ent, dtype: torch.float
head = torch.Size([1, 100]) b_size, rel_emb_dim, dtype: torch.float
tail = torch.Size([1, 100]) b_size, rel_emb_dim, dtype: torch.float
relation = torch.Size([1, 100]) b_size, rel_emb_dim

I didn't find any problem with my input data generation.
I wrote this code to generate the input data.
I tried both the pandas and kg methods.
The raw data is a tab-separated txt file in the order head, relation, tail.
But it didn't work.

import os


def load_data(file_path, name_entity_data, name_relation_data, name_train_data,
              name_valid_data, name_test_data, name_all_data, name_AUC_data):
    # file_path = '/content/drive/MyDrive/
    print("load data from {}".format(file_path))

    # Entity file: one "<id>\t<entity>" pair per line.
    with open(os.path.join(file_path, name_entity_data)) as f:
        entity2id = dict()
        id2entity = dict()
        for line in f:
            eid, entity = line.strip().split('\t')
            entity2id[entity] = int(eid)
            id2entity[eid] = entity

    # Relation file: one "<id>\t<relation>" pair per line.
    with open(os.path.join(file_path, name_relation_data)) as f:
        relation2id = dict()
        id2relation = dict()
        for line in f:
            rid, relation = line.strip().split('\t')
            relation2id[relation] = int(rid)
            id2relation[rid] = relation

    kg_train = read_triplets_to_kg(os.path.join(file_path, name_train_data), entity2id, relation2id)
    kg_valid = read_triplets_to_kg(os.path.join(file_path, name_valid_data), entity2id, relation2id)
    kg_test = read_triplets_to_kg(os.path.join(file_path, name_test_data), entity2id, relation2id)
    kg_all = read_triplets_to_kg(os.path.join(file_path, name_all_data), entity2id, relation2id)
    kg_auc = read_triplets_to_kg(os.path.join(file_path, name_AUC_data), entity2id, relation2id)

    print('num_entity: {}'.format(len(entity2id)))
    print('num_relation: {}'.format(len(relation2id)))
    print('num_kg_train: {}'.format(len(kg_train['heads'])))
    print('num_kg_valid: {}'.format(len(kg_valid['heads'])))
    print('num_kg_test: {}'.format(len(kg_test['heads'])))

    return entity2id, relation2id, id2entity, id2relation, kg_train, kg_valid, kg_test, kg_all, kg_auc

import torch


def read_triplets_to_kg(file_path, entity2id, relation2id):
    heads = []
    tails = []
    relations = []
    kg = dict()

    # Each line holds one tab-separated triplet: head, relation, tail.
    with open(file_path) as f:
        for line in f:
            head, relation, tail = line.strip().split('\t')
            heads.append(entity2id[head])
            tails.append(entity2id[tail])
            relations.append(relation2id[relation.strip()])

    kg['heads'] = torch.LongTensor(heads)
    kg['tails'] = torch.LongTensor(tails)
    kg['relations'] = torch.LongTensor(relations)

    return kg

And in the middle of the model training code:

entity2id, relation2id, id2entity, id2relation, kg_train, kg_valid, kg_test, kg_all, kg_auc = load_data(
    file_path, name_entity_data, name_relation_data, train, valid, test, all, auc)

kg_train = KnowledgeGraph(kg=kg_train, ent2ix=entity2id, rel2ix=relation2id)
kg_valid = KnowledgeGraph(kg=kg_valid, ent2ix=entity2id, rel2ix=relation2id)
kg_test = KnowledgeGraph(kg=kg_test, ent2ix=entity2id, rel2ix=relation2id)
kg_auc = KnowledgeGraph(kg=kg_auc, ent2ix=entity2id, rel2ix=relation2id)

Question of implementation misalignment with paper- TransE entity normalization

TorchKGE version: 0.17.5

Description

I see that there is a potential misalignment between the paper and implemented version of TransE. Can you help clarify?

The misalignment I see is that the loss in the paper is always computed using the normalized head and normalized tail embeddings (based on the pseudo-code). However, in the implementation, despite the entity embeddings being re-normalized at the end of each epoch, after each minibatch a gradient update is made. This means that the loss is not computed for normalized head and tail embeddings. Only for the first minibatch is the loss computed on the normalized embeddings but not for the rest.

Edit: Never mind. All head and tail vectors are also normalized in the scoring function. This fixes it:

torchkge/torchkge/models/translation.py

Line 69 in d56e9d8

def scoring_function(self, h_idx, t_idx, r_idx):
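
The point made in the edit can be illustrated with a small framework-free sketch: if the scoring function normalizes the entity vectors itself, the score no longer depends on whether the stored embeddings were renormalized after the last gradient step. The function names are hypothetical and the L2 dissimilarity is an assumption; this is not the code at the referenced line.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def transe_score(h, r, t):
    """Negative L2 distance ||h + r - t||, computed on normalized entity vectors."""
    h, t = l2_normalize(h), l2_normalize(t)
    return -math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))


# Rescaling the stored entity vectors leaves the score unchanged.
score_a = transe_score([3.0, 4.0], [0.1, 0.0], [1.0, 0.0])
score_b = transe_score([6.0, 8.0], [0.1, 0.0], [5.0, 0.0])
```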

"lp_helper" is not implemented

I am studying how you compute the filter ranking but the function "lp_helper" seems not to be implemented.
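
As background for the question, filtered ranking in link prediction usually means that other known true answers are removed from the candidate list before the rank of the test answer is computed. A plain-Python sketch of that idea follows; `filtered_rank` is a hypothetical name and this is not the library's `lp_helper`.

```python
def filtered_rank(scores, true_idx, known_true):
    """Rank of candidate true_idx (1 = best), ignoring other known true candidates.

    scores: one score per candidate entity, higher is better.
    known_true: indices of all entities known to complete the triplet.
    """
    target = scores[true_idx]
    # Count candidates that beat the target and are not known true answers.
    better = sum(
        1 for idx, s in enumerate(scores)
        if s > target and idx not in known_true
    )
    return better + 1


scores = [0.9, 0.2, 0.8, 0.7]
raw = filtered_rank(scores, true_idx=3, known_true={3})      # nothing filtered out
filt = filtered_rank(scores, true_idx=3, known_true={0, 3})  # entity 0 is also true
```

In this toy case the raw rank is 3 and the filtered rank is 2, because entity 0 outranks the target but is itself a true answer.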

ConvKB

TorchKGE version:
Python version:
Operating System:

Description

How to run ConvKB, OUT_OF_CUDA

What I Did

Questions: Incremental training

I am exploring using this for training TransE embeddings and had some beginner questions:

1. I have been using PyTorch BigGraph in single-box mode. For TransE, has anybody benchmarked quality on standard datasets? What are the key differences between the two frameworks?
2. I want to run incremental training, i.e. once I have trained embeddings and the graph gets updated, only retrain for the new nodes to avoid volatility of the existing embeddings. Is there a way to do that here?
3. Can I run this for a graph with 100 million edges that fits on a single box?

I am starting to dig into the docs to understand the framework better, so apologies if some of these questions are already covered there.

Missing soft constraints in TransH

In the original paper, the authors proposed three soft constraints and added a hyperparameter C to weight the importance of these constraints. While referring to class MarginLoss(Module) (torchkge/torchkge/utils/losses.py, line 12 in a3474b7), I did not find the C term.

I found a similar issue in OpenKE, "Weight C in TransH missing". Is this the same reason torchkge ignores C, even though torchkge uses a different normalization method from OpenKE?
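
For context, the objective with soft constraints, as I understand it from the TransH paper, adds a penalty weighted by C to the margin loss: one term for entity vectors whose norm exceeds 1 and one for translation vectors that are not orthogonal to their hyperplane normal (the unit-norm constraint on the normals is handled by projection instead). The sketch below computes those two penalty terms in plain Python. The function name, the values of `C` and `eps`, and the placeholder margin loss are all hypothetical; it does not reflect what torchkge or OpenKE implement.

```python
def soft_constraint_penalty(entities, normals, translations, eps=1e-3):
    """Sum of the scale and orthogonality penalties, before weighting by C."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    # Scale constraint: penalize entity vectors with squared norm above 1.
    scale = sum(max(0.0, dot(e, e) - 1.0) for e in entities)
    # Orthogonality constraint: penalize translation vectors d_r that are not
    # (nearly) orthogonal to their hyperplane normal w_r.
    ortho = sum(
        max(0.0, dot(w, d) ** 2 / dot(d, d) - eps ** 2)
        for w, d in zip(normals, translations)
    )
    return scale + ortho


penalty = soft_constraint_penalty(
    entities=[[1.0, 1.0], [0.5, 0.0]],
    normals=[[1.0, 0.0]],
    translations=[[0.0, 2.0]],
)
C = 0.25           # hypothetical weight of the soft constraints
margin_loss = 0.5  # placeholder for the usual margin-based ranking loss
total_loss = margin_loss + C * penalty
```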

Exporting embeddings and Knowledge Graph Completion (KGC) type functionality

Assuming that we use SmallKG or KnowledgeGraph classes to train the models with our own dataset:

Is there an API call that exports trained embeddings in a user-friendly format so that embeddings for entities (h,t) and relations can be used?
Is there an API that will enable us to perform KGC type inference operations like passing in for example (h,r,?) and retrieving topK tails for given head and relation, etc. I understand that this is KGC functionality and torchKGE may not have aspirations to go in that direction functionality-wise.
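
To make the second question concrete, this is the kind of (h, r, ?) query meant here, sketched in plain Python over a dictionary of exported embeddings. It assumes a TransE-style translational model, where good tails lie close to head + relation; the function name `top_k_tails` and the toy embeddings are hypothetical and not part of torchkge.

```python
import math


def top_k_tails(head, relation, entity_emb, k):
    """Return the k entity names whose embeddings are closest to head + relation."""
    target = [h + r for h, r in zip(head, relation)]
    # Sort candidate entities by Euclidean distance to the translated head.
    return sorted(entity_emb, key=lambda name: math.dist(target, entity_emb[name]))[:k]


# Toy exported embeddings: entity name -> vector.
entity_emb = {
    "a": [0.0, 0.0],
    "b": [1.0, 0.0],
    "c": [1.0, 1.0],
    "d": [3.0, 3.0],
}
relation_vec = [1.0, 0.9]
best = top_k_tails(entity_emb["a"], relation_vec, entity_emb, k=2)
```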

Thanks,
Mladen

get_df()

I was looking into the get_df()-method under KnowledgeGraph-class and in line 397, it seems that i2e is used but not declared anywhere. Or am I missing something here? The method seems to be un-callable.

Bug of TransH

Description

In the file "translation", lines 210-215:
self.ent_emb.weight.data = normalize(self.ent_emb.weight.data,
                                     p=2, dim=1)
self.norm_vect.weight.data = normalize(self.norm_vect.weight.data,
                                       p=2, dim=1)
self.rel_emb.weight.data = self.project(self.ent_emb.weight.data,
                                        self.norm_vect.weight.data)

For self.rel_emb.weight.data, why is it assigned self.project(self.ent_emb.weight.data, self.norm_vect.weight.data)?
It also raises this error during the model initialization phase:
RuntimeError: The size of tensor a (14541) must match the size of tensor b (237) at non-singleton dimension 0

Error : 'TransRModel' object has no attribute 'emb_dim'

The following error occurs when evaluating the performance of a TransR model:

        evaluator = LinkPredictionEvaluator(model, kg_valid)
        evaluator.evaluate(b_size=1, iverbose=False)

'TransRModel' object has no attribute 'emb_dim'

Function that convert a `KnowledgeGraph` to a `DataFrame`

I had to implement this function for the project I am currently working on and I guess it could be handy to someone else.

I don't really know which file this function would best belong in, but I would be glad to create a pull request if someone has a good idea about it.
Also, if you have remarks or advice about the implementation, do not hesitate to share them with me.

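import pandas as pd
from torchkge.data_structures import KnowledgeGraph


def kg2df(kg: KnowledgeGraph) -> pd.DataFrame:
    """
    Convert a torchKGE `KnowledgeGraph` back into a pandas `DataFrame`.

    :param kg: A knowledge graph.
    :return: A dataframe containing the same information as the knowledge graph.
    """
    # Invert the name -> index mappings so indices can be mapped back to names.
    ix2rel = {ix: rel for rel, ix in kg.rel2ix.items()}
    ix2ent = {ix: ent for ent, ix in kg.ent2ix.items()}

    # Convert the index tensors to NumPy arrays so pandas stores plain integers
    # (assumes the tensors are on the CPU).
    df = pd.DataFrame({'from': kg.head_idx.numpy(),
                       'rel': kg.relations.numpy(),
                       'to': kg.tail_idx.numpy()})

    df['from'] = df['from'].map(ix2ent)
    df['rel'] = df['rel'].map(ix2rel)
    df['to'] = df['to'].map(ix2ent)

    return df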

[Feature request] The 1-N scoring method of ConvE.

Hi, thanks for your work. Recently, I have been trying to train my model with the 1-N scoring method from ConvE. Have you implemented it?

Hoping for your reply. Thanks!

Need to check the sequence of HOLE scoring functions

I'm going to use the HolE model in your code.

Before I use it, I have a few questions, especially about scoring_function.

Is the matmul step in the scoring correct?
In the related article, the circular convolution between subject and object is calculated first, and the result is then multiplied with the relation vector,
but your code first calculates the circular convolution between the subject and the relation vector and then multiplies with the object vector.
After mod-shifting the relation matrix, e.g. relation matrix shape (2, 3) → (2, 3, 3),
the matmul with the head embedding vector in your code is carried out column by column.
Shouldn't it be row by row instead?

E.g. this is the matrix after mod shifting:
[[1,1,1],
[2,2,2],
[3,3,3]]

When it is multiplied with the head vector, the calculation uses the column [1, 2, 3], not the row [1, 1, 1] ...
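
On the ordering question, the two formulations can be algebraically equivalent: r · (h ⋆ t) equals hᵀ M t, where M is built from shifted copies of r. The plain-Python sketch below checks this numerically on a toy example. It assumes the HolE score r · (h ⋆ t) with circular correlation (the operator used in the HolE paper) as defined in the code; whether the shifted matrix in torchkge is indexed by row or by column in exactly this way is not verified here, and the function names are hypothetical.

```python
def circular_correlation(a, b):
    """[a * b]_k = sum_i a_i * b_{(i + k) mod d}."""
    d = len(a)
    return [sum(a[i] * b[(i + k) % d] for i in range(d)) for k in range(d)]


def score_correlate_first(h, r, t):
    """Correlate the two entity vectors first, then take the dot product with r."""
    return sum(rk * ck for rk, ck in zip(r, circular_correlation(h, t)))


def score_matrix_form(h, r, t):
    """h^T M t with M[i][j] = r[(j - i) mod d], i.e. rows are shifted copies of r."""
    d = len(h)
    return sum(h[i] * r[(j - i) % d] * t[j] for i in range(d) for j in range(d))


h, r, t = [1, 2, 3], [4, 5, 6], [7, 8, 9]
s1 = score_correlate_first(h, r, t)
s2 = score_matrix_form(h, r, t)
```

If the shift were taken with the opposite sign, M[i][j] = r[(i - j) mod d], the two scores would in general differ, which is the row-versus-column distinction raised above.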

Implementation of GAATs would be welcome

Hi,
An implementation of GAATs by Wang et al. would be welcome. https://ieeexplore.ieee.org/abstract/document/8946600

What does the KnowledgeGraph do to build?

from torchkge.data_structures import KnowledgeGraph
https://github.com/torchkge-team/torchkge/blob/master/torchkge/data_structures.py

I noticed that it takes a significant amount of time to build. Is there academic work on implementing graphs efficiently that is employed in TorchKGE?

As I understand, it creates a knowledge graph tensor based on the knowledge graph triplet list, is this correct?

ConvKB scores interpretation

Hi,

I realized the scoring function (method scoring_function) of the ConvKB model returns two values for a given triple. These are the results of a softmax activation that, from what I understand, has two output neurons that represent the probabilities of a triple being true and false. However I do not know in which order these cases are defined in the neural network. Which output neuron corresponds to which case? I could unfortunately not figure it out from reading the code.

Thanks in advance for your attention and help!

Best,
Luis

Question on evaluation time

Hi, I recently came across this project and the corresponding paper. I found that evaluation time is a big advantage of this project, and I would like to know why torchkge can evaluate faster than OpenKE or AmpliGraph. I have read the files related to evaluation, e.g. torchkge/evaluation/link_prediction.py, but I still don't see the key to this speed. Can you help me?
The reason for my interest in evaluation time is that relatively complex models spend a lot of time evaluating on some large KGs, so I hope to get some insights from this project to accelerate the evaluation stage.
I would appreciate your reply. Thanks!

Shortest Training from docs doesn't work

TorchKGE version: 0.17.7
Python version: 3.10.11, 3.10.13
Operating System: 4.15.0-209-generic x86_64 x86_64 GNU/Linux, MacOS Ventura 13.1.1

Description

TrainDataLoader seems to fail to return an iterator, so its get_counter_examples(self) -> SmallKG method fails with an error message hinting that self.iterator is None.

So trainer.run() fails in the Shortest Training from the docs.

Error message:

File "/project-root/test.py", line 38, in main trainer.run()

File "/project-root/.venv/lib/python3.10/site-packages/torchkge/utils/training.py", line 179, in run

self.counter_examples = data_loader.get_counter_examples()

File "/project-root/.venv/lib/python3.10/site-packages/torchkge/utils/training.py", line 64, in get_counter_examples

return SmallKG(self.iterator.nh, self.iterator.nt, self.iterator.r)

AttributeError: 'NoneType' object has no attribute 'nh'

What I Did

1. Copy Shortest Training from the docs
2. Run it on Ubuntu and macOS

Evaluate triplet classification by predicate

For the work I had to do, it was interesting to split the accuracy metric by predicate.
Particularly for debugging purposes, if you want to check which relations are "well understood" by the model and which are not.

Here's an implementation of this task:

from torch import count_nonzero, masked_select
from torchkge.evaluation import TripletClassificationEvaluator


class PredicateTripletClassificationEvaluator(TripletClassificationEvaluator):
    """This evaluator measures the accuracy by predicate.

    Apart from that, it is an exact replica of the `TripletClassificationEvaluator`.
    """

    def __init__(self, model, kg_val, kg_test):
        super().__init__(model, kg_val, kg_test)

    def accuracy(self, b_size: int) -> dict:
        """

        Parameters
        ----------
        b_size: int
            Batch size.

        Returns
        -------
        acc: Dict[str, float]
            Share by predicate of all triplets (true and negatively sampled ones)
            that were correctly classified using the thresholds learned from the
            validation set.

        """
        if not self.evaluated:
            self.evaluate(b_size)

        r_idx = self.kg_test.relations

        neg_heads, neg_tails = self.sampler.corrupt_kg(b_size,
                                                       self.is_cuda,
                                                       which='test')
        scores = self.get_scores(self.kg_test.head_idx,
                                 self.kg_test.tail_idx,
                                 r_idx,
                                 b_size)
        neg_scores = self.get_scores(neg_heads, neg_tails, r_idx, b_size)

        if self.is_cuda:
            self.thresholds = self.thresholds.cuda()

        scores = (scores > self.thresholds[r_idx])
        neg_scores = (neg_scores < self.thresholds[r_idx])

        accuracy_by_predicate = {}
        for predicate, rel_index in self.kg_test.rel2ix.items():
            mask = (self.kg_test.relations == rel_index)
            masked_scores = masked_select(scores, mask)
            masked_neg_scores = masked_select(neg_scores, mask)
            # Correctly classified true and negative triplets for this predicate.
            n_correct = (count_nonzero(masked_scores).item()
                         + count_nonzero(masked_neg_scores).item())
            accuracy_by_predicate[predicate] = n_correct / (2 * count_nonzero(mask).item())

        return accuracy_by_predicate

Retrieve counter examples generated during training for `Trainer` class

The Trainer class is great, but I needed a way to retrieve and store the counter-examples used during the training phase.

I will soon propose a pull request implementing that.

If you have any remarks, advice or questions, do not hesitate to contact me.

How to change the negative num?

Hi, I found that the default number of negatives during training is 1 in BernoulliNegativeSampler. I changed it to BernoulliNegativeSampler(kg, n_neg=100), but it doesn't seem to work.

Could you tell me how to change the number of negatives?
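
One thing worth checking when `n_neg` is greater than 1: the sampler then returns `n_neg` negatives per positive, so the positive scores have to be repeated to line up with the negative ones before a pairwise loss is computed. The sketch below shows that alignment in plain Python. The tiled ordering (the whole batch repeated `n_neg` times) is an assumption based on the `.repeat(n_neg)` call visible in the sampler traceback quoted earlier on this page; the function name and score values are made up.

```python
def margin_loss(pos_scores, neg_scores, n_neg, margin=1.0):
    """Mean hinge loss pairing each positive with its n_neg negatives (higher score = better)."""
    # Tile the positives so that index i lines up with negative i.
    tiled_pos = pos_scores * n_neg
    if len(tiled_pos) != len(neg_scores):
        raise ValueError("expected batch_size * n_neg negative scores")
    losses = [max(0.0, margin - p + n) for p, n in zip(tiled_pos, neg_scores)]
    return sum(losses) / len(losses)


pos = [2.0, 0.5]            # one score per true triplet in the batch
neg = [0.0, 0.0, 1.5, 1.0]  # batch_size * n_neg scores, batch repeated n_neg times
loss = margin_loss(pos, neg, n_neg=2)
```

If the positives are not repeated, the two score lists have different lengths, which is one way a changed `n_neg` can appear not to work.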
