
word2vec's Introduction

word2vec

Word2Vec in C++11

See main.cc for building instructions and usage. (NOTE: OpenMP is used in the newest version, so g++ is required for multithreading.)
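For reference, a build-and-run line pieced together from the issues below; the -fopenmp flag is an assumption based on the OpenMP note above, and the remaining flags mirror what a user reports compiling with:

    g++ -o w2v -std=c++0x -Ofast -march=native -funroll-loops -fopenmp main.cc -lpthread
    OMP_NUM_THREADS=8 ./w2v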

Results with OMP_NUM_THREADS=8 (saving the model is understandably slow, as it stores text):

loadvocab: 1.9952 seconds    
train: 33.5145 seconds
save model: 4.7554 seconds

Machine configuration

jackdeng-mac:word2vec jack$ sysctl -n machdep.cpu.brand_string
Intel(R) Core(TM) i7-4850HQ CPU @ 2.30GHz

word2vec's People

Contributors

jdeng

word2vec's Issues

Which parameter to change to run the demo for n epochs?

Hello,

I have gone through all of the code, but I could not figure out how to change the number of epochs (#iter) or the number of threads, as in the Google C code.

Could you shed a little light on it?

I found the n_workers parameter, but changing it made little difference in timing: going from 4 to 16 workers only improved training time by about 5 seconds.

Please let me know.

Thanks.
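A hedged sketch of both knobs, pieced together from this page rather than from a confirmed API. One possible reason n_workers changes little: the README notes the newest version is parallelized with OpenMP, so the thread count likely comes from OMP_NUM_THREADS rather than from n_workers. For epochs, there appears to be no #iter parameter; calling train() in a loop acts as multiple passes (see the memory caveat in the last issue on this page):

    // A minimal sketch, assuming `model` and `sentences` are set up as in
    // main.cc; this is not a confirmed API of the library.
    //
    // Threads: set the OpenMP thread count before running, e.g.
    //   OMP_NUM_THREADS=8 ./w2v
    //
    // Epochs: each call to train() is one pass over the corpus.
    for (int epoch = 0; epoch < 5; ++epoch)
        model.train(sentences);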

How to make this code work

Hello!

I have struggled to get C++11 working on my Ubuntu machine, but successfully compiled your project by running

    g++ -g -o w2v -std=c++0x -Ofast -march=native -funroll-loops main.cc -lpthread

(slightly different from your doc because I want to use gdb for debugging).

However, when I run the project as ./w2v, it returns only

collecting [word] 0 sentences, 0 distinct words, 0 words
Segmentation fault (core dumped)

And I cannot input strings as lines 62-74 of main.cc indicate I should.

I am wondering if there is something wrong with my procedure or if I need to do other things.

Thank you in advance!
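A hedged reading of the output above, not a confirmed diagnosis: 0 sentences and 0 distinct words suggest the training corpus was never read (the issues below mention text8), so training then runs over empty data and crashes. An illustrative guard of the following kind, which is not the repository's code, would surface that as a clear error instead of a segfault:

    // Illustrative only: verify the corpus file opens before training.
    #include <cstdio>
    #include <fstream>
    #include <string>

    static bool corpus_exists(const std::string& path) {
        std::ifstream in(path);
        if (!in) {
            std::fprintf(stderr, "cannot open corpus file: %s\n", path.c_str());
            return false;
        }
        return true;
    }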

Segfault when loading pretrained models

When I try to load pretrained models from the Google Code word2vec C implementation, the library segfaults. I have tried this with a text8 model I trained myself using the C implementation, and with the pretrained Google News model.
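A hedged guess at one cause, not a confirmed diagnosis: the Google C tool is commonly run with -binary 1 (its demo scripts do this), and the pretrained Google News model is distributed in that binary format, while the README above notes this library saves models as text. If load() here parses only the text format, feeding it a binary file could plausibly segfault. Re-saving with the C tool in text mode would be worth a try:

    ./word2vec -train text8 -output vectors.txt -binary 0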

How to run the GitHub project

Hello,

When we run the command provided in the main.cc file, it generates an executable. That's nice.

But what parameters do I need to pass for training?

Can you please specify the details in the readme file?

I am trying to run the standard text8 corpus with a vocabulary size of 71291.

Thanks.

A little problem with printf

Compiler warnings:

    word2vec.h:319:48: warning: format ‘%d’ expects argument of type ‘int’, but argument 2 has type ‘size_t {aka long unsigned int}’ [-Wformat=]
        printf("training %d sentences\n", n_sentences);
                                          ^
    word2vec.h: In instantiation of ‘int Word2Vec<String>::load(const string&) [with String = std::basic_string<char>; std::string = std::basic_string<char>]’:
    main.cc:130:27:   required from here
    word2vec.h:418:38: warning: format ‘%d’ expects argument of type ‘int’, but argument 2 has type ‘size_t {aka long unsigned int}’ [-Wformat=]
        printf("%d words loaded\n", n_words);

Solution:

  • word2vec.h line 319: printf("training %ld sentences\n", n_sentences);
  • word2vec.h line 418: printf("%ld words loaded\n", n_words);
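A portability note on the fix: size_t is not long unsigned int on every platform, so %zu (the standard size_t format specifier since C99/C++11) is the safer choice:

    printf("training %zu sentences\n", n_sentences);
    printf("%zu words loaded\n", n_words);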

Why is negative sampling disabled in word2vec?

Hi! I have read the original word2vec paper (Distributed Representations of Words and Phrases and their Compositionality). Mikolov reports that negative sampling works well for word embeddings, and I can find its implementation here, but you have disabled it at line 468.
Is that because good performance can be achieved without negative sampling?
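For readers unfamiliar with the technique the question refers to, below is a minimal, self-contained sketch of one skip-gram update with negative sampling, following the cited paper. All names are illustrative, and negatives are drawn uniformly here rather than from the unigram^(3/4) distribution the paper uses; this is not the code that is disabled in word2vec.h.

    #include <cmath>
    #include <cstddef>
    #include <random>
    #include <vector>

    using Vec = std::vector<float>;

    // One skip-gram update for (input word, target context word) with k
    // negative samples. `out` holds the output-side vectors for all words.
    void ns_update(Vec& v_in, std::vector<Vec>& out, std::size_t target,
                   int k, float alpha, std::mt19937& eng) {
        std::uniform_int_distribution<std::size_t> pick(0, out.size() - 1);
        Vec grad_in(v_in.size(), 0.0f);
        for (int d = 0; d <= k; ++d) {
            std::size_t w = (d == 0) ? target : pick(eng); // d == 0: the positive pair
            if (d > 0 && w == target) continue;            // skip accidental positives
            float label = (d == 0) ? 1.0f : 0.0f;
            float dot = 0.0f;
            for (std::size_t i = 0; i < v_in.size(); ++i) dot += v_in[i] * out[w][i];
            float g = alpha * (label - 1.0f / (1.0f + std::exp(-dot))); // sigmoid gradient
            for (std::size_t i = 0; i < v_in.size(); ++i) {
                grad_in[i] += g * out[w][i]; // accumulate gradient for the input vector
                out[w][i]  += g * v_in[i];   // update the output (context) vector
            }
        }
        for (std::size_t i = 0; i < v_in.size(); ++i) v_in[i] += grad_in[i];
    }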

A little problem in the train function

Hi!
The code is great!

I used this code to implement paragraph2vec, and found that training may need several iterations. Suppose we use the following code to train several times:

    for (int i = 0; i < n; i++)
        model.train(sentences);

The first pass is fine, but each subsequent pass uses more and more memory.
After reviewing word2vec.h, I found that the following code may be the problem:

    #pragma omp parallel for
    for (size_t i = 0; i < n_sentences; ++i) {
        auto sentence = sentences[i].get();
        if (sentence->tokens_.empty())
            continue;
        size_t len = sentence->tokens_.size();
        for (size_t i = 0; i < len; ++i) {
            auto it = vocab_.find(sentence->tokens_[i]);
            if (it == vocab_.end()) continue;
            Word *word = it->second.get();
            // subsampling
            if (sample_ > 0) {
                float rnd = (sqrt(word->count_ / (sample_ * total_words)) + 1) * (sample_ * total_words) / word->count_;
                if (rnd < rng(eng)) continue;
            }
            sentence->words_.emplace_back(it->second.get());
        }
    }

The vector sentence->words_ grows larger and larger if we call the train function a second time, because words are appended without clearing. We can clear the vector first:

    #pragma omp parallel for
    for (size_t i = 0; i < n_sentences; ++i) {
        auto sentence = sentences[i].get();

        // By Largelymfs: clear the words collected by a previous call to
        // train() so repeated training does not keep appending. Note that
        // sentence is a pointer, so the call is sentence->words_.clear(),
        // not sentence.clear().
        sentence->words_.clear();

        if (sentence->tokens_.empty())
            continue;
        size_t len = sentence->tokens_.size();
        for (size_t i = 0; i < len; ++i) {
            auto it = vocab_.find(sentence->tokens_[i]);
            if (it == vocab_.end()) continue;
            Word *word = it->second.get();
            // subsampling
            if (sample_ > 0) {
                float rnd = (sqrt(word->count_ / (sample_ * total_words)) + 1) * (sample_ * total_words) / word->count_;
                if (rnd < rng(eng)) continue;
            }
            sentence->words_.emplace_back(it->second.get());
        }
    }

With that change, the train function can safely be placed in a loop.
Thanks a lot.
