
Comments (5)

svjan5 avatar svjan5 commented on September 12, 2024

Hi @sandro272,
By dataset, do you mean the training dataset (Wikipedia corpus) or the evaluation data?

from wordgcn.

sandro272 avatar sandro272 commented on September 12, 2024

@svjan5 Uh... I mean that I want to use my own dataset, so could you provide a script or method that converts raw data into your processed files (e.g. voc2id.txt, etc.)? Thank you!

from wordgcn.

svjan5 avatar svjan5 commented on September 12, 2024

Ok, got it. Actually, I cannot provide a script for that, because it requires obtaining a dependency parse of the text, which requires Stanford CoreNLP. So, you first need to get a dependency parse of the text; after that, I think everything is quite straightforward. voc2id.txt contains the mapping of tokens to their unique ids, and data.txt contains the listing of tokens and dependency parse edges for each sentence in the corpus. Let me know if you face any difficulty in the process.
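
A minimal sketch (not the authors' script) of what such a conversion could look like, using Stanza as a stand-in for the Stanford CoreNLP dependency parser; the edge encoding (governor index | dependent index | relation id) and the voc2id.txt layout are assumptions based on this thread, so adapt them to the actual format:

import stanza

# English pipeline with dependency parsing (stand-in for Stanford CoreNLP).
nlp = stanza.Pipeline('en', processors='tokenize,pos,lemma,depparse')

voc2id, dep2id = {}, {}

def get_id(table, key):
    # Assign ids in order of first appearance.
    if key not in table:
        table[key] = len(table)
    return table[key]

with open('raw_corpus.txt') as fin, open('data.txt', 'w') as fout:
    for line in fin:
        if not line.strip():
            continue
        doc = nlp(line.strip())
        for sent in doc.sentences:
            tok_ids = [get_id(voc2id, w.text.lower()) for w in sent.words]
            edges = []
            for i, w in enumerate(sent.words):
                if w.head > 0:  # skip the ROOT edge
                    # Assumed encoding: governor index | dependent index | relation id
                    edges.append('{}|{}|{}'.format(w.head - 1, i, get_id(dep2id, w.deprel)))
            # README-style line: <num_words> <num_dep_rels> tokens... edges...
            fout.write('{} {} {} {}\n'.format(len(tok_ids), len(edges),
                                              ' '.join(map(str, tok_ids)),
                                              ' '.join(edges)))

with open('voc2id.txt', 'w') as f:
    for tok, idx in voc2id.items():
        f.write('{}\t{}\n'.format(tok, idx))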

from wordgcn.

sandro272 avatar sandro272 commented on September 12, 2024

@svjan5 OK, thank you!

from wordgcn.

loginaway avatar loginaway commented on September 12, 2024

Hi,
I ran into a problem while trying to generate my own data.txt.
Specifically, I found that the provided data.txt is not in the format you mention in README.md, which is as follows:
<num_words> <num_dep_rels> tok1 tok2 tok3 ... tokn dep_e1 dep_e2 .... dep_em
The lines are actually organized like this (the first line of the provided data.txt file):
15 14 15 24351 24351 10 7 436 2083 26 8385 121958 4986 215 13 6932 2293 2 1|0|26 5|1|11 5|2|23 5|3|34 5|4|7 7|6|11 5|7|9 9|8|7 7|9|38 9|10|13 13|11|2 13|12|7 10|13|16 5|14|10 21854 21854 3 15 659 2324 0 2397 0 479 328 4 5905 7965 0
That line has four parts. The first part, '15 14 15' -- I guess these are the counts of the latter three parts? So what do the latter three parts represent?

Also, please update the README.md : )
Thanks!
@svjan5

p.s. I re-read 'batch_generator.cpp', and it seems the last part of each line (i.e. the sequence of numbers after the dependency relations) is read but not stored.
Therefore, would it work if I set the first three numbers to (number of words in the sentence, number of dependency relations, 0) and left the last part empty?
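
For reference, a quick hypothetical reader for one such line, assuming (as guessed above) that the first three numbers are the sizes of the three blocks that follow and that the trailing block can be ignored since batch_generator.cpp reads but does not store it:

def parse_line(line):
    fields = line.split()
    n_tok, n_dep, n_rest = map(int, fields[:3])
    pos = 3
    tokens = [int(t) for t in fields[pos:pos + n_tok]]
    pos += n_tok
    # Each edge looks like "a|b|c"; the fields are assumed to be
    # governor index, dependent index, and dependency-relation id.
    edges = [tuple(map(int, e.split('|'))) for e in fields[pos:pos + n_dep]]
    pos += n_dep
    rest = fields[pos:pos + n_rest]  # read but apparently unused downstream
    return tokens, edges, rest

with open('data.txt') as f:
    tokens, edges, _ = parse_line(f.readline())
print(len(tokens), len(edges))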

from wordgcn.
