
word2vec's Introduction

word2vec

Word2Vec in C++11

See main.cc for building instructions and usage. (NOTE: OpenMP is used in the newest version, so g++ is required for multithreading.)
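For reference, a build-and-run line pieced together from the issues below; the -fopenmp flag is an assumption based on the OpenMP note above, and the remaining flags mirror what a user reports compiling with:

    g++ -o w2v -std=c++0x -Ofast -march=native -funroll-loops -fopenmp main.cc -lpthread
    OMP_NUM_THREADS=8 ./w2v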

Results with OMP_NUM_THREADS=8 (saving the model is understandably slow, as it stores text):

loadvocab: 1.9952 seconds    
train: 33.5145 seconds
save model: 4.7554 seconds

Machine configuration

jackdeng-mac:word2vec jack$ sysctl -n machdep.cpu.brand_string
Intel(R) Core(TM) i7-4850HQ CPU @ 2.30GHz

word2vec's People

Contributors

jdeng

word2vec's Issues

Which parameter to change to run the demo for n epochs?

Hello,

I have gone through all of the code, but I could not figure out how to change the number of epochs (#iter) or the number of threads, as in the Google C code.

Could you shed a little light on it?

I found the n_workers parameter, but changing it made little difference in timing: going from 4 to 16 workers only improved training time by about 5 seconds.

Please let me know.

Thanks.
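A hedged sketch of both knobs, pieced together from this page rather than from a confirmed API. One possible reason n_workers changes little: the README notes the newest version is parallelized with OpenMP, so the thread count likely comes from OMP_NUM_THREADS rather than from n_workers. For epochs, there appears to be no #iter parameter; calling train() in a loop acts as multiple passes (see the memory caveat in the last issue on this page):

    // A minimal sketch, assuming `model` and `sentences` are set up as in
    // main.cc; this is not a confirmed API of the library.
    //
    // Threads: set the OpenMP thread count before running, e.g.
    //   OMP_NUM_THREADS=8 ./w2v
    //
    // Epochs: each call to train() is one pass over the corpus.
    for (int epoch = 0; epoch < 5; ++epoch)
        model.train(sentences);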

How to make this code work

Hello!

I have struggled to get C++11 working on my Ubuntu machine, but successfully compiled your project by running

    g++ -g -o w2v -std=c++0x -Ofast -march=native -funroll-loops main.cc -lpthread

(slightly different from your doc because I want to use gdb for debugging).

However, when I run the project as ./w2v, it returns only

collecting [word] 0 sentences, 0 distinct words, 0 words
Segmentation fault (core dumped)

And I cannot input strings as lines 62-74 of main.cc indicate I should.

I am wondering if there is something wrong with my procedure or if I need to do other things.

Thank you in advance!
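A hedged reading of the output above, not a confirmed diagnosis: 0 sentences and 0 distinct words suggest the training corpus was never read (the issues below mention text8), so training then runs over empty data and crashes. An illustrative guard of the following kind, which is not the repository's code, would surface that as a clear error instead of a segfault:

    // Illustrative only: verify the corpus file opens before training.
    #include <cstdio>
    #include <fstream>
    #include <string>

    static bool corpus_exists(const std::string& path) {
        std::ifstream in(path);
        if (!in) {
            std::fprintf(stderr, "cannot open corpus file: %s\n", path.c_str());
            return false;
        }
        return true;
    }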

Segfault when loading pretrained models

When I try to load pretrained models from the Google Code word2vec C implementation, the library segfaults. I have tried this with a text8 model I trained myself using the C implementation, and with the pretrained Google News model.
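A hedged guess at one cause, not a confirmed diagnosis: the Google C tool is commonly run with -binary 1 (its demo scripts do this), and the pretrained Google News model is distributed in that binary format, while the README above notes this library saves models as text. If load() here parses only the text format, feeding it a binary file could plausibly segfault. Re-saving with the C tool in text mode would be worth a try:

    ./word2vec -train text8 -output vectors.txt -binary 0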

How to run the GitHub project

Hello,

When we run the command provided in the main.cc file, it generates an executable. That's nice.

But what parameters do I need to pass for training?

Can you please specify the details in the readme file?

I am trying to run the standard text8 corpus with a vocabulary size of 71291.

Thanks.

A little problem with printf

Compiler warnings:

    word2vec.h:319:48: warning: format ‘%d’ expects argument of type ‘int’, but argument 2 has type ‘size_t {aka long unsigned int}’ [-Wformat=]
        printf("training %d sentences\n", n_sentences);
                                          ^
    word2vec.h: In instantiation of ‘int Word2Vec<String>::load(const string&) [with String = std::basic_string<char>; std::string = std::basic_string<char>]’:
    main.cc:130:27:   required from here
    word2vec.h:418:38: warning: format ‘%d’ expects argument of type ‘int’, but argument 2 has type ‘size_t {aka long unsigned int}’ [-Wformat=]
        printf("%d words loaded\n", n_words);

Solution:

  • word2vec.h line 319: printf("training %ld sentences\n", n_sentences);
  • word2vec.h line 418: printf("%ld words loaded\n", n_words);
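A portability note on the fix: size_t is not long unsigned int on every platform, so %zu (the standard size_t format specifier since C99/C++11) is the safer choice:

    printf("training %zu sentences\n", n_sentences);
    printf("%zu words loaded\n", n_words);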

Why is negative sampling disabled in word2vec?

Hi! I have read the original word2vec paper (Distributed Representations of Words and Phrases and their Compositionality). Mikolov reports that negative sampling works well for word embeddings, and I can find its implementation here, but you have disabled it at line 468.
Is that because good performance can be achieved without negative sampling?
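For readers unfamiliar with the technique the question refers to, below is a minimal, self-contained sketch of one skip-gram update with negative sampling, following the cited paper. All names are illustrative, and negatives are drawn uniformly here rather than from the unigram^(3/4) distribution the paper uses; this is not the code that is disabled in word2vec.h.

    #include <cmath>
    #include <cstddef>
    #include <random>
    #include <vector>

    using Vec = std::vector<float>;

    // One skip-gram update for (input word, target context word) with k
    // negative samples. `out` holds the output-side vectors for all words.
    void ns_update(Vec& v_in, std::vector<Vec>& out, std::size_t target,
                   int k, float alpha, std::mt19937& eng) {
        std::uniform_int_distribution<std::size_t> pick(0, out.size() - 1);
        Vec grad_in(v_in.size(), 0.0f);
        for (int d = 0; d <= k; ++d) {
            std::size_t w = (d == 0) ? target : pick(eng); // d == 0: the positive pair
            if (d > 0 && w == target) continue;            // skip accidental positives
            float label = (d == 0) ? 1.0f : 0.0f;
            float dot = 0.0f;
            for (std::size_t i = 0; i < v_in.size(); ++i) dot += v_in[i] * out[w][i];
            float g = alpha * (label - 1.0f / (1.0f + std::exp(-dot))); // sigmoid gradient
            for (std::size_t i = 0; i < v_in.size(); ++i) {
                grad_in[i] += g * out[w][i]; // accumulate gradient for the input vector
                out[w][i]  += g * v_in[i];   // update the output (context) vector
            }
        }
        for (std::size_t i = 0; i < v_in.size(); ++i) v_in[i] += grad_in[i];
    }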

A little problem in the train function

Hi!
The code is great!

I used this code to implement paragraph2vec, and found that training may need several iterations. Suppose we use the following code to train several times:

    for (int i = 0; i < n; i++)
        model.train(sentences);

The first pass is fine, but each subsequent pass uses more and more memory.
After reviewing word2vec.h, I found that the following code may be the problem:

    #pragma omp parallel for
    for (size_t i = 0; i < n_sentences; ++i) {
        auto sentence = sentences[i].get();
        if (sentence->tokens_.empty())
            continue;
        size_t len = sentence->tokens_.size();
        for (size_t i = 0; i < len; ++i) {
            auto it = vocab_.find(sentence->tokens_[i]);
            if (it == vocab_.end()) continue;
            Word *word = it->second.get();
            // subsampling
            if (sample_ > 0) {
                float rnd = (sqrt(word->count_ / (sample_ * total_words)) + 1) * (sample_ * total_words) / word->count_;
                if (rnd < rng(eng)) continue;
            }
            sentence->words_.emplace_back(it->second.get());
        }
    }

The vector sentence->words_ grows larger and larger if we call the train function a second time, because words are appended without clearing. We can clear the vector first:

    #pragma omp parallel for
    for (size_t i = 0; i < n_sentences; ++i) {
        auto sentence = sentences[i].get();

        // By Largelymfs: clear the words collected by a previous call to
        // train() so repeated training does not keep appending. Note that
        // sentence is a pointer, so the call is sentence->words_.clear(),
        // not sentence.clear().
        sentence->words_.clear();

        if (sentence->tokens_.empty())
            continue;
        size_t len = sentence->tokens_.size();
        for (size_t i = 0; i < len; ++i) {
            auto it = vocab_.find(sentence->tokens_[i]);
            if (it == vocab_.end()) continue;
            Word *word = it->second.get();
            // subsampling
            if (sample_ > 0) {
                float rnd = (sqrt(word->count_ / (sample_ * total_words)) + 1) * (sample_ * total_words) / word->count_;
                if (rnd < rng(eng)) continue;
            }
            sentence->words_.emplace_back(it->second.get());
        }
    }

With that change, the train function can safely be placed in a loop.
Thanks a lot.
