How do you build the fact file in raw data?

thudm commented on September 2, 2024

How do you build the fact file in raw data?

from kobe.

Comments (8)

matthewsun214 commented on September 2, 2024

Also, when I run version 2, CUDA runs out of memory while training kobe-full on 2 P100 GPUs. How can I solve it?

Thank you.

qibinc commented on September 2, 2024

Hi @matthewsun214 , thanks for your interest in this work. In general (not specific to this work), when running into OOM errors, you can decrease the batch size (e.g., --batch-size=32 instead of the default 64). In addition, you can also use gradient accumulation.

Hope this helps!
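To illustrate the gradient accumulation suggestion above, here is a minimal, framework-free sketch. It is not taken from the KOBE code; the toy model (fitting y ≈ w * x with mean squared error) and the function names are assumptions made for illustration. It shows that averaging gradients over several small micro-batches gives the same update as one large batch, while only one micro-batch needs to be held in memory at a time.

```python
def grad_mse(w, batch):
    """Gradient of the mean squared error of y ≈ w * x over one batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def sgd_step_full(w, batch, lr):
    """One SGD step that processes the whole batch at once."""
    return w - lr * grad_mse(w, batch)


def sgd_step_accumulated(w, batch, lr, accum_steps):
    """One SGD step built from `accum_steps` equally sized micro-batches.

    Assumes len(batch) is divisible by accum_steps.
    """
    micro_size = len(batch) // accum_steps
    accumulated = 0.0
    for i in range(accum_steps):
        micro = batch[i * micro_size:(i + 1) * micro_size]
        # Scale each micro-batch gradient so the sum equals the full-batch mean.
        accumulated += grad_mse(w, micro) / accum_steps
    # The weights are updated only once, after all micro-batches are seen.
    return w - lr * accumulated


data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]
w_full = sgd_step_full(0.0, data, lr=0.01)
w_accum = sgd_step_accumulated(0.0, data, lr=0.01, accum_steps=2)
```

In a deep learning training loop the same pattern usually means dividing each micro-batch loss by the number of accumulation steps, running the backward pass on every micro-batch, and calling the optimizer step only once per group of micro-batches.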

qibinc commented on September 2, 2024

For building the facts, we use BM25 with the title as the query to retrieve knowledge from a knowledge base; this is a traditional sparse retrieval method. More recently, there are works like DPR (https://arxiv.org/abs/2004.04906), which learns dense and differentiable retrieval, and even WebGPT (https://arxiv.org/pdf/2112.09332.pdf), which uses existing search engines to retrieve knowledge.
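As a rough illustration of the BM25 step described above, the sketch below scores knowledge-base entries against a product title and keeps the best matches. This is not the authors' pipeline: the whitespace tokenizer, the example knowledge base, the parameter values (k1 = 1.5, b = 0.75), and the names `bm25_scores` and `retrieve_facts` are all assumptions made for illustration.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score every tokenized document against the query with Okapi BM25."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(doc) for doc in docs_tokens) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter()
    for doc in docs_tokens:
        df.update(set(doc))
    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


def retrieve_facts(title, knowledge_base, top_k=2):
    """Return up to top_k knowledge-base entries that match the title."""
    def tokenize(text):
        # Hypothetical tokenizer: lowercase and split on whitespace.
        return text.lower().split()

    docs = [tokenize(entry) for entry in knowledge_base]
    scores = bm25_scores(tokenize(title), docs)
    ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    # Drop entries that share no term with the title.
    return [knowledge_base[i] for i in ranked[:top_k] if scores[i] > 0]


# Hypothetical knowledge base and title, for illustration only.
kb = [
    "cotton is a soft breathable natural fiber",
    "stainless steel resists rust and corrosion",
    "wool keeps warm in cold weather",
]
facts = retrieve_facts("breathable cotton summer shirt", kb, top_k=1)
```

For Chinese titles, the whitespace tokenizer would need to be replaced with a tokenizer suited to the text, since Chinese is not separated by spaces.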

qibinc commented on September 2, 2024

Here is a screenshot of the current version's training progress. Please feel free to reach out if you run into unexpected problems or performance!

matthewsun214 commented on September 2, 2024

Thank you for your reply. I am trying to train the model with the data you provided. Also, I would like to know which tools you use for handling Chinese words, such as Jieba?

qibinc commented on September 2, 2024

Hi @matthewsun214 , we use the same pre-trained tokenizer as bert-base-chinese (see https://github.com/THUDM/KOBE#tokenization), which basically tokenizes Chinese into characters and numbers/alphabetic text into words. This has been shown to be more effective than a word-level LM with word segmentation tools in Chinese.

By the way, I would suggest going through the README and the code before trying to get help from issues (it will probably get you answers much faster).
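The splitting behavior described above can be approximated with a short regular expression. This is only a sketch of the idea, not the bert-base-chinese tokenizer itself: the function name `simple_tokenize` and the CJK range used are assumptions, and the subword splitting and any normalization applied by the real tokenizer are omitted.

```python
import re

# One CJK character, or a run of ASCII letters/digits, or a single other symbol.
TOKEN_PATTERN = re.compile(
    r"[\u4e00-\u9fff]|[A-Za-z0-9]+|[^\sA-Za-z0-9\u4e00-\u9fff]"
)


def simple_tokenize(text):
    """Split Chinese into characters and numbers/letters into whole words."""
    return TOKEN_PATTERN.findall(text)


tokens = simple_tokenize("纯棉T恤2024新款")
# tokens == ['纯', '棉', 'T', '恤', '2024', '新', '款']
```

Because every Chinese character becomes its own token, no word segmentation tool is needed at this stage.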

matthewsun214 commented on September 2, 2024

Hi @qibinc , we have already gone through the README, the code, and your paper. You mention the methodology for retrieving knowledge in the paper, but we ran into several technical difficulties in data preprocessing, mainly in building the fact file. I would suggest adding some instructions to the README about building the fact file; otherwise, it is really hard for people to guess how to build it.

qibinc commented on September 2, 2024

Hi @matthewsun214 , thanks for your suggestions. I was referring to the tokenization (word segmentation) question you asked, which is clearly visible in the README and the code. Retrieving the knowledge goes beyond the scope of the paper, as we focus on utilizing the knowledge. I wish you good luck in your work.
