
cogqa's People

Contributors

dm-thu, qibinc, ronakice, sleepychord


cogqa's Issues

Meaning of "extract 1-hop nodes but do not calculate semantic vectors" in the paper

"And when extracting 1-hop nodes from question to initialize G, we do not calculate semantic vectors and only the Question part exists in the input."

How should this sentence be understood? In the paper, System 1's input when visiting node x is [CLS] Question [SEP] clues[x,G] [SEP] Para[x]. If semantic vectors are not calculated, does that mean the first pass is effectively [CLS] Question [SEP] [SEP] Para[x]?
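
A minimal sketch of the two input layouts being discussed (not the authors' code; tokenizer, question_tokens, clue_tokens and para_tokens are placeholder names):

    def build_system1_input(tokenizer, question_tokens, clue_tokens, para_tokens):
        # Full System 1 input when visiting node x: [CLS] Question [SEP] clues[x,G] [SEP] Para[x]
        tokens = ['[CLS]'] + question_tokens + ['[SEP]'] + clue_tokens + ['[SEP]'] + para_tokens
        return tokenizer.convert_tokens_to_ids(tokens)

    def build_1hop_init_input(tokenizer, question_tokens):
        # One reading of "only the Question part exists in the input": no clues and no Para[x],
        # so 1-hop entities are extracted from the question alone and no semantic vector is computed.
        tokens = ['[CLS]'] + question_tokens + ['[SEP]']
        return tokenizer.convert_tokens_to_ids(tokens)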

ans_loss and hop_loss became 'nan'

Hello, I followed all the instructions to train the System 1 model. Since I have only one RTX 2080 GPU, I reduced the batch size from 12 to 4, i.e. python train.py --batch_size 4. About 60% of the way through training, hop_loss and ans_loss became nan. On another server with multiple GPUs, training System 1 with the default batch_size=12 works fine. I wonder whether this is caused by the LogSoftmax in the loss function producing -inf. Do you have any solution for this problem? Thank you.
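
A generic guard against -inf log-probabilities, under the assumption that this is indeed the cause; logits and target are placeholder names, not the repo's variables:

    import torch
    import torch.nn.functional as F

    def safe_nll_loss(logits, target, min_log_prob=-1e4):
        # Clamp log-probabilities so a zero-probability class contributes a large
        # but finite penalty instead of -inf, which would turn the loss into nan.
        log_probs = F.log_softmax(logits, dim=-1)
        log_probs = torch.clamp(log_probs, min=min_log_prob)
        return F.nll_loss(log_probs, target)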

model1 = BertForMultiHopQuestionAnswering.from_pretrained(BERT_MODEL, cache_dir=PYTORCH_PRETRAINED_BERT_CACHE / 'distributed_{}'.format(-1))

Model name 'bert-base-uncased' was not found in model name list (bert-base-uncased, bert-large-uncased, bert-base-cased, bert-large-cased, bert-base-multilingual-uncased, bert-base-multilingual-cased, bert-base-chinese). We assumed 'https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased.tar.gz' was a path or url but couldn't find any file associated to this path or url.

Hello, authors!
Model name 'bert-base-uncased' was not found in model name list (bert-base-uncased,
This is the most absurd bug I have seen this year...
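
A possible workaround, assuming the root cause is simply a failed download: fetch bert-base-uncased.tar.gz manually and point from_pretrained at the local copy, since pytorch_pretrained_bert also accepts a local archive or directory path in place of a model name (the local path below is hypothetical):

    from model import BertForMultiHopQuestionAnswering  # assumed to be CogQA's model.py

    LOCAL_BERT = './bert-base-uncased.tar.gz'  # downloaded manually from the S3 URL in the error message
    model1 = BertForMultiHopQuestionAnswering.from_pretrained(LOCAL_BERT)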

model2 = CognitiveGNN(model1.config.hidden_size) 'NoneType' object has no attribute 'config'

Hello, authors! There is a bug that showed up when I first started reproducing the code; later it disappeared on its own, and now that I have switched models it has appeared again.
Traceback (most recent call last):
File "/home/shaoai/CogQA/train.py", line 337, in
fire.Fire(main)
File "/home/shaoai/anaconda3/envs/mypytorch/lib/python3.6/site-packages/fire/core.py", line 127, in Fire
component_trace = _Fire(component, args, context, name)
File "/home/shaoai/anaconda3/envs/mypytorch/lib/python3.6/site-packages/fire/core.py", line 366, in _Fire
component, remaining_args)
File "/home/shaoai/anaconda3/envs/mypytorch/lib/python3.6/site-packages/fire/core.py", line 542, in _CallCallable
result = fn(*varargs, **kwargs)
File "/home/shaoai/CogQA/train.py", line 323, in main
model2 = CognitiveGNN(model1.config.hidden_size)
AttributeError: 'NoneType' object has no attribute 'config'
This is the bug I hit the very first time I cloned and ran the code. Why does it happen, and how can I fix it?
Looking forward to your reply. Thanks!
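
A quick sanity check, assuming the NoneType comes from a failed from_pretrained call (pytorch_pretrained_bert logs download or parsing errors and returns None instead of raising):

    model1 = BertForMultiHopQuestionAnswering.from_pretrained(BERT_MODEL)
    if model1 is None:
        raise RuntimeError('BERT weights failed to load; check the model name, network access, or cache_dir')
    model2 = CognitiveGNN(model1.config.hidden_size)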

Hello, a question about paragraph extraction.

Hello, I did not quite understand: for paragraph selection, do you not pre-select a few candidate paragraphs (say 5 or 10) before reasoning, but instead run the iterative reasoning directly over all paragraphs?

About the framework of GNN

I looked at the code of the CognitiveGNN model, and it seems you do not use any GNN framework such as DGI or PyG. Why not? I think a framework might help with the speed problem.
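
For context, a framework-free graph layer in plain PyTorch can be as small as the sketch below; this is illustrative only and is not the repo's CognitiveGNN implementation:

    import torch
    import torch.nn as nn

    class SimpleGCNLayer(nn.Module):
        def __init__(self, hidden_size):
            super().__init__()
            self.linear = nn.Linear(hidden_size, hidden_size)

        def forward(self, node_states, adj):
            # node_states: (num_nodes, hidden); adj: (num_nodes, num_nodes) normalized adjacency
            messages = adj @ self.linear(node_states)  # aggregate transformed neighbor states
            return torch.relu(node_states + messages)  # residual update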

The answer in the dataset does not come from that example's context -- a question

I noticed that the answer in the dataset does not come from the example's own context. In that case, what does the multi-hop reasoning hop over, if not this example's context?

For example, for the question "Which magazine was started first Arthur's Magazine or First for Women?", the answer is Arthur's Magazine, but the answer is not mentioned in the "context".

I have already ruled out multi-hop over external Wikipedia data, because I did not run that part of the code.

The role of sem[x,Q,clues] in System 1 and System 2

Hello, I see that sem[x,Q,clues] appears in both System 1 and System 2 in your paper, and I would like to understand what sem means. Specifically:

  1. System 1: for an answer node x, Para[x] may be missing, so no span is extracted; instead sem[x,Q,clues] is calculated based on the "sentence A" part;

  2. System 2: to fully understand the relationship between entity x and question Q, analyzing sem[x,Q,clues] alone is far from enough;

Could you explain what these two sentences mean?
(I previously thought the workflow was: BERT extracts the next-hop entities, the next-hop entities walk and reason in the GNN, and the result is passed back to BERT; but then I noticed this extra sem.)

I hope I have expressed myself clearly.
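
As a very rough illustration of what a "semantic vector from the sentence A part" could look like (a guess at the general idea, not the paper's exact definition), one could take System 1's output at the [CLS] position as a summary vector:

    def semantic_vector(last_hidden_states):
        # last_hidden_states: (seq_len, hidden) from System 1 over "[CLS] Question [SEP] clues [SEP]"
        return last_hidden_states[0]  # the [CLS] position summarizes the question/clues segment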

Possibility of training on 1 GPU?

Running the training code causes repeated CUDA out-of-memory errors starting around epoch 34.
My GPU: an NVIDIA GTX 100ti.
I've tried offloading the GCN to the CPU and setting the batch size to 1.
Is there any way I could further optimize my code to prevent these errors?
Thanks.
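
One generic memory-saving pattern, not specific to this repo, is gradient accumulation: several small micro-batches share one optimizer step, so the effective batch size stays large while peak memory stays low. All names below are placeholders:

    def train_with_grad_accumulation(model, optimizer, loader, compute_loss, accumulation_steps=4):
        optimizer.zero_grad()
        for step, batch in enumerate(loader):
            loss = compute_loss(model, batch) / accumulation_steps  # scale so the total matches one big batch
            loss.backward()
            if (step + 1) % accumulation_steps == 0:
                optimizer.step()
                optimizer.zero_grad()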

The low results

The final results are very low, nearly 10 times worse than the results in the paper. Can you tell me the reason?

{'em': 0.03551654287643484, 'f1': 0.0478604024080054, 'prec': 0.051990450467830615, 'recall': 0.0491018037405613, 'sp_em': 0.0005401755570560432, 'sp_f1': 0.12093599476326915, 'sp_prec': 0.07244085040682292, 'sp_recall': 0.42420838558245866, 'joint_em': 0.0001350438892640108, 'joint_f1': 0.009409090423642906, 'joint_prec': 0.005848759671643607, 'joint_recall': 0.03384786494172044}

fullwiki data

Hello, could you tell me where I can download the fullwiki data (enwiki-20171001-pages-meta-current-withlinks-abstracts)? Thanks.

Where is the fullwiki_input_improved_by_cogqa1hop.zip?

I have completed training Task #1 and Task #2. When I wanted to evaluate the model, I couldn't find fullwiki_input_improved_by_cogqa1hop.zip, so I directly ran the command python cogqa.py --data_file='hotpot_dev_fullwiki_v1_merge.json'. But during answering it raises an error: ValueError: attempt to get argmin of an empty sequence.
Start Training... on 1 GPUs
17%|██████▊ | 1294/7405 [03:41<13:29, 7.55it/s]
Traceback (most recent call last):
File "cogqa.py", line 244, in
fire.Fire(main)
File "/home/zeyuzhang/anaconda3/lib/python3.7/site-packages/fire/core.py", line 127, in Fire
component_trace = _Fire(component, args, context, name)
File "/home/zeyuzhang/anaconda3/lib/python3.7/site-packages/fire/core.py", line 366, in _Fire
component, remaining_args)
File "/home/zeyuzhang/anaconda3/lib/python3.7/site-packages/fire/core.py", line 542, in _CallCallable
result = fn(*varargs, **kwargs)
File "cogqa.py", line 234, in main
gold, ans, graph_ret, ans_nodes = cognitive_graph_propagate(tokenizer, data, model1, model2, device, setting = setting, max_new_nodes=max_new_nodes)
File "cogqa.py", line 147, in cognitive_graph_propagate
l, r = find_start_end_before_tokenized(orig_text, [pred_slice])[0]
File "/home/zeyuzhang/Downloads/CogQA-master/utils.py", line 238, in find_start_end_before_tokenized
result = fuzzy_find([span], orig_text)
File "/home/zeyuzhang/Downloads/CogQA-master/utils.py", line 107, in fuzzy_find
r, score = dp(item, sentence)
File "/home/zeyuzhang/Downloads/CogQA-master/utils.py", line 86, in dp
r = np.argmin(f[len(a) - 1])
File "/home/zeyuzhang/anaconda3/lib/python3.7/site-packages/numpy/core/fromnumeric.py", line 1172, in argmin
return _wrapfunc(a, 'argmin', axis=axis, out=out)
File "/home/zeyuzhang/anaconda3/lib/python3.7/site-packages/numpy/core/fromnumeric.py", line 56, in _wrapfunc
return getattr(obj, method)(*args, **kwds)
ValueError: attempt to get argmin of an empty sequence
Can you tell me how to solve these? Thank you very much.
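
A hypothetical guard for the crash above: the cost row handed to argmin can be empty when the extracted span ends up empty after tokenization or cleaning, so one option is to bail out instead of calling argmin (illustrative names, not a patch to the repo's utils.py):

    import numpy as np

    def safe_argmin(row):
        if len(row) == 0:
            return None  # caller should treat the span as "not found" and skip it
        return int(np.argmin(row))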

Where can I get `hotpot_train_v1.1.json`?

When I run !python /content/CogQA/process_train.py, the following error occurs:

Traceback (most recent call last):
File "/content/CogQA/process_train.py", line 18, in <module>

with open('./hotpot_train_v1.1.json', 'r') as fin:

FileNotFoundError: [Errno 2] No such file or directory: './hotpot_train_v1.1.json'

I wonder where I can get this file; please help.
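
This file is part of the HotpotQA dataset rather than this repo. A minimal download sketch, assuming the URL from the official HotpotQA download script is still live:

    import urllib.request

    URL = 'http://curtis.ml.cmu.edu/datasets/hotpot/hotpot_train_v1.1.json'
    urllib.request.urlretrieve(URL, './hotpot_train_v1.1.json')  # large JSON file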

dump.rdb

Hi,
Why can't I find the file "dump.rdb"? Thank you very much.

About the training results

Hello, I recently tried running the code. Without changing any parameters, the accuracy I got does not seem to reach the roughly 55% reported in the paper. Do I need to change any other settings, or might I have gotten something wrong? There were no errors during training, so I have no clue what happened. Below are the results I got; I hope you can help.

{'f1': 0.08085961805557072, 'joint_recall': 0.015151681770718994, 'joint_prec': 0.031592446116416414, 'em': 0.052397029034436195, 'sp_f1': 0.11114176393041891, 'joint_f1': 0.019395676593513603, 'joint_em': 0.0, 'sp_em': 0.0, 'sp_recall': 0.08090929552104435, 'prec': 0.08201151088389438, 'recall': 0.08718660277950503, 'sp_prec': 0.18284942606347063}

An error when loading the input data after replacing BERT with ALBERT

Hello, authors! While modifying the model I tried to replace BERT with ALBERT.
I changed
BERT_MODEL = 'bert-base-uncased'
tokenizer = BertTokenizer.from_pretrained(BERT_MODEL, do_lower_case=True)
to
tokenizer = BertTokenizer.from_pretrained("./albert_base")
BERT_MODEL = BertModel.from_pretrained("./albert_base")

Then I got the following error:
File "train.py", line 158, in main
bundles.append(convert_question_to_samples_bundle(tokenizer, data))
File "/home/shao/CogQA/data.py", line 187, in convert_question_to_samples_bundle
ids.append(tokenizer.convert_tokens_to_ids(tokenized_all))
File "/home/shao/anaconda3/envs/cogqa/lib/python3.6/site-packages/pytorch_pretrained_bert/tokenization.py", line 121, in convert_tokens_to_ids
ids.append(self.vocab[token])
KeyError: '[CLS]'
What aspect of the data loading could be causing this? Looking forward to your reply!
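
A guess at the cause: ALBERT ships a SentencePiece model rather than a WordPiece vocab.txt, so the old BertTokenizer from pytorch_pretrained_bert builds a vocabulary with no '[CLS]' entry. One possible route, assuming you can switch to the newer transformers library and that ./albert_base is a local ALBERT checkpoint directory:

    from transformers import AlbertModel, AlbertTokenizer

    tokenizer = AlbertTokenizer.from_pretrained('./albert_base')
    model = AlbertModel.from_pretrained('./albert_base')
    ids = tokenizer.convert_tokens_to_ids(['[CLS]'])  # '[CLS]' is a registered special token here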

About dump.pkl

Hi,

Just to be precise: is the dump.pkl mentioned in the README the Redis dump file?

In my experiment, I didn't find dump.pkl in the project directory, but I found a dump.rdb of about 2.4 GB in the Redis directory.

Thank you!

Self-cycle in gold-only cognitive graph for comparison question

Hi,

I found that the following snippet may cause a self-cycle.

CogQA/process_train.py

Lines 91 to 93 in 217f0f1

if bundle['answer'] == 'yes' or bundle['answer'] == 'no' \
        or (question_type > 0 and bundle['type'] == 'comparison'):
    pool.add(title)

For example, after running process_train.py, I got a JSON object like this:

{
  "supporting_facts": [
    [
      "Arthur's Magazine",
      0,
      [
        [
          "Arthur's Magazine",
          "Arthur's Magazine",
          0,
          17
        ]
      ]
    ],
    [
      "First for Women",
      0,
      [
        [
          "First for Women",
          "First for Women",
          0,
          15
        ]
      ]
    ]
  ],
  "level": "medium",
  "question": "Which magazine was started first Arthur's Magazine or First for Women?",
  "context": ["..."],
  "answer": "Arthur's Magazine",
  "_id": "5a7a06935542990198eaf050",
  "type": "comparison",
  "Q_edge": [
    [
      "First for Women",
      "First for Women",
      54,
      69
    ],
    [
      "Arthur's Magazine",
      "Arthur's Magazine",
      33,
      50
    ]
  ]
}

However, I think it should look like what is shown in your examples:

{
  "supporting_facts": [
    [
      "Arthur's Magazine",
      0,
      []
    ],
    [
      "First for Women",
      0,
      []
    ]
  ],
  "level": "medium",
  "question": "Which magazine was started first Arthur's Magazine or First for Women?",
  "context": ["..."],
  "answer": "Arthur's Magazine",
  "_id": "5a7a06935542990198eaf050",
  "type": "comparison",
  "Q_edge": [
    [
      "Arthur's Magazine",
      "Arthur's Magazine",
      33,
      50
    ],
    [
      "First for Women",
      "First for Women",
      54,
      69
    ]
  ]
}

Could you explain what this snippet is for? By the way, my reproduced result on the dev set with 2 K80 GPUs is about 10% lower than the result in the paper; do you think this snippet could be a reason for the low result?

Thank you!

About process_train.py

Hi Ming,

If you load hotpot_dev_fullwiki_v1.json in your process_train.py, an error occurs.
(screenshot: Screen Shot 2019-07-05 17 49 01)

Explanation of GENERAL_WD in utils.py

When the data is processed in process_train.py, during the cognitive-graph construction in utils.py, what exactly does this variable mean?

It is defined as follows:
GENERAL_WD = ['is', 'are', 'am', 'was', 'were', 'have', 'has', 'had', 'can', 'could', 'shall', 'will', 'should', 'would', 'do', 'does', 'did', 'may', 'might', 'must', 'ought', 'need', 'dare']

Also, am I right to understand that what process_train.py does is add fuzzy-matched entities and construct the cognitive graph?
Thanks for your reply!

About process_train.py

Hi, Ming
According to the README, I ran the code in the following order:

  1. process_train.py with hotpot_train_v1.1.json -> get hotpot_train_v1.1_refined3.json
  2. read_fullwiki.py (from read_fullwiki.ipynb)
  3. run_cg.py (first time)
  4. run_cg.py (second time, set lr=4*1e-5) -> models/bert-base-uncased.bin & .bin.tmp
  5. eval_cg.py on hotpot_dev_fullwiki_v1.json -> hotpot_dev_fullwiki_v1_pred.json
  6. hotpot_evaluate_v1.py with hotpot_dev_fullwiki_v1_pred.json & hotpot_dev_fullwiki_v1.json
  The results are not ideal, but I cannot find any reason. Did I do anything wrong in the above process?

  Some other questions:
    1) In step 1, I got hotpot_train_v1.1_refined3.json. Using "_id", I selected from it the same 500 examples as hotpot_train_v1.1_500_refined.example.json, but there are some differences between the newly generated 500 examples and the 500 examples you provide.
    2) In read_fullwiki.py, what does "pages" in the last part represent?

  Could you please give me some advice? Or could you provide a download link for the refined training dataset? Thanks a lot!

NEG TOO LONG!

When I run python train.py to train Task #1, this message appears: "NEG TOO LONG! id: 5a72a3a25542994cef4bc3ab". Should I ignore this problem?
