
Comments (5)

svjan5 avatar svjan5 commented on September 12, 2024

Hi @sandro272,
By dataset, do you mean the training dataset (Wikipedia corpus) or the evaluation data?

from wordgcn.

sandro272 avatar sandro272 commented on September 12, 2024

@svjan5 Uh... I mean that I want to use my own dataset, so could you provide a script or method that converts raw data into your processed files (e.g. voc2id.txt, etc.)? Thank you!

from wordgcn.

svjan5 avatar svjan5 commented on September 12, 2024

Ok, got it. Actually, I cannot provide a script for that, because it requires obtaining a dependency parse of the text, which requires Stanford CoreNLP. So, you first need to get a dependency parse of the text; after that, I think everything is quite straightforward. voc2id.txt contains the mapping of tokens to their unique ids, and data.txt contains the listing of tokens and dependency parse edges for each sentence in the corpus. Let me know if you face any difficulty in the process.
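
A minimal sketch (not the authors' script) of what such a conversion could look like, using Stanza as a stand-in for the Stanford CoreNLP dependency parser; the edge encoding (governor index | dependent index | relation id) and the voc2id.txt layout are assumptions based on this thread, so adapt them to the actual format:

import stanza

# English pipeline with dependency parsing (stand-in for Stanford CoreNLP).
nlp = stanza.Pipeline('en', processors='tokenize,pos,lemma,depparse')

voc2id, dep2id = {}, {}

def get_id(table, key):
    # Assign ids in order of first appearance.
    if key not in table:
        table[key] = len(table)
    return table[key]

with open('raw_corpus.txt') as fin, open('data.txt', 'w') as fout:
    for line in fin:
        if not line.strip():
            continue
        doc = nlp(line.strip())
        for sent in doc.sentences:
            tok_ids = [get_id(voc2id, w.text.lower()) for w in sent.words]
            edges = []
            for i, w in enumerate(sent.words):
                if w.head > 0:  # skip the ROOT edge
                    # Assumed encoding: governor index | dependent index | relation id
                    edges.append('{}|{}|{}'.format(w.head - 1, i, get_id(dep2id, w.deprel)))
            # README-style line: <num_words> <num_dep_rels> tokens... edges...
            fout.write('{} {} {} {}\n'.format(len(tok_ids), len(edges),
                                              ' '.join(map(str, tok_ids)),
                                              ' '.join(edges)))

with open('voc2id.txt', 'w') as f:
    for tok, idx in voc2id.items():
        f.write('{}\t{}\n'.format(tok, idx))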

from wordgcn.

sandro272 avatar sandro272 commented on September 12, 2024

@svjan5 OK, thank you!

from wordgcn.

loginaway avatar loginaway commented on September 12, 2024

Hi,
I ran into a problem while trying to generate my own data.txt.
Specifically, I found that the provided data.txt is not in the format you mention in README.md, which is as follows:
<num_words> <num_dep_rels> tok1 tok2 tok3 ... tokn dep_e1 dep_e2 .... dep_em
The lines are actually organized like this (the first line of the provided data.txt file):
15 14 15 24351 24351 10 7 436 2083 26 8385 121958 4986 215 13 6932 2293 2 1|0|26 5|1|11 5|2|23 5|3|34 5|4|7 7|6|11 5|7|9 9|8|7 7|9|38 9|10|13 13|11|2 13|12|7 10|13|16 5|14|10 21854 21854 3 15 659 2324 0 2397 0 479 328 4 5905 7965 0
That line has four parts. The first part, '15 14 15' -- I guess these are the counts of the latter three parts? So what do the latter three parts represent?

Also, please update the README.md : )
Thanks!
@svjan5

p.s. I re-read 'batch_generator.cpp', and it seems the last part of each line (i.e. the sequence of numbers after the dependency relations) is read but not stored.
Therefore, would it work if I set the first three numbers to (number of words in the sentence, number of dependency relations, 0) and left the last part empty?
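
For reference, a quick hypothetical reader for one such line, assuming (as guessed above) that the first three numbers are the sizes of the three blocks that follow and that the trailing block can be ignored since batch_generator.cpp reads but does not store it:

def parse_line(line):
    fields = line.split()
    n_tok, n_dep, n_rest = map(int, fields[:3])
    pos = 3
    tokens = [int(t) for t in fields[pos:pos + n_tok]]
    pos += n_tok
    # Each edge looks like "a|b|c"; the fields are assumed to be
    # governor index, dependent index, and dependency-relation id.
    edges = [tuple(map(int, e.split('|'))) for e in fields[pos:pos + n_dep]]
    pos += n_dep
    rest = fields[pos:pos + n_rest]  # read but apparently unused downstream
    return tokens, edges, rest

with open('data.txt') as f:
    tokens, edges, _ = parse_line(f.readline())
print(len(tokens), len(edges))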

from wordgcn.
