
Comments (10)

maifeng commented on July 28, 2024

Can you share your gensim version? You can check it with:
gensim.__version__

When you run create_dict.py, are you using the example data and seed words, or your own?

from measuring-corporate-culture-using-machine-learning.

HuzaifahSaleem commented on July 28, 2024

gensim 3.8.3
We are using our own data. Do you want to see a sample of it? It's Glassdoor review text.


HuzaifahSaleem commented on July 28, 2024

The seed words are the same, since we are pretty much trying to achieve the same goal.


HuzaifahSaleem commented on July 28, 2024

[image attachment]


HuzaifahSaleem commented on July 28, 2024

After adding some print statements, we found the root cause of the error. In culture_dictionary.py, inside the deduplicate_keywords function, dimension_seed_words ends up as an empty list:

dimension_seed_words = [ word for word in seed_words[dimension] if word in word2vec_model.wv.vocab]

This line returns an empty list.
If we remove the "if word in word2vec_model.wv.vocab" condition, it does return a list, but then we get an error saying the word is not in the vocabulary.
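
One way to confirm this diagnosis is to list, for each dimension, which seed words are missing from the model's vocabulary. The snippet below is only a minimal sketch: split_seed_words is a hypothetical helper, the toy data is made up, and vocab is a plain set standing in for word2vec_model.wv.vocab (the gensim 3.8.x attribute used above; in gensim 4.x the comparable mapping is wv.key_to_index).

```python
def split_seed_words(seed_words, vocab):
    """Return {dimension: (found, missing)} for each dimension's seed words.

    `vocab` is any container that supports `in`. Here it stands in for
    word2vec_model.wv.vocab (an assumption made for illustration).
    """
    report = {}
    for dimension, words in seed_words.items():
        found = [w for w in words if w in vocab]
        missing = [w for w in words if w not in vocab]
        report[dimension] = (found, missing)
    return report


# Toy data, not taken from the issue.
seed_words = {"innovation": ["innovation", "creative"], "quality": ["quality"]}
vocab = {"innovation", "quality", "customer"}

for dimension, (found, missing) in split_seed_words(seed_words, vocab).items():
    print(dimension, "found:", found, "missing:", missing)
```

If every dimension shows an empty "found" list, the list comprehension above will also come back empty for the same reason.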


maifeng commented on July 28, 2024

Yes, that means none of the seed words are in the word2vec vocabulary. By default, a word needs to appear at least 5 times in the corpus to be included. If you have a small corpus, you can add the argument min_count=1 to the train_w2v_model() function in clean_and_train.py. Also, make sure that the seed words are in lowercase.
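
To illustrate why a small corpus can leave every seed word out of the vocabulary, here is a minimal sketch of a frequency cutoff in the spirit of min_count. It counts tokens with plain Python rather than calling gensim; build_vocab and the toy corpus are hypothetical, and real word2vec training does much more than this.

```python
from collections import Counter


def build_vocab(sentences, min_count=5):
    """Keep only tokens that occur at least `min_count` times.

    This mimics the effect of a min_count threshold; it is not gensim's code.
    """
    counts = Counter(token for sentence in sentences for token in sentence)
    return {token for token, n in counts.items() if n >= min_count}


# Toy tokenized corpus (made up for illustration).
sentences = [
    ["great", "teamwork", "and", "innovation"],
    ["management", "values", "teamwork"],
]
seed_words = ["Teamwork", "Innovation"]

for min_count in (5, 1):
    vocab = build_vocab(sentences, min_count=min_count)
    # Lowercase the seed words before the lookup, as advised above.
    kept = [w.lower() for w in seed_words if w.lower() in vocab]
    print(f"min_count={min_count}: seed words in vocabulary -> {kept}")
```

With the default-like threshold of 5, no token in this tiny corpus survives, so no seed word can match; with a threshold of 1, every token is kept and the lowercased seed words are found.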


HuzaifahSaleem commented on July 28, 2024

Hey,
Thanks a lot for the help.
It worked like magic.
Can you please suggest some tips on how to improve its accuracy?
I am assuming that increasing the min_count value also improves the accuracy?
What other things can we do?
I think we cannot replace word2vec with BERT or something similar, since the code relies on word2vec functions throughout.


maifeng commented on July 28, 2024

It is always helpful to have a larger corpus. Words that appear only a few times may not have a good representation after training. The code does not train BERT; it produces static word vectors.


HuzaifahSaleem commented on July 28, 2024

Hey @maifeng
How are you doing?
I was wondering if there is any way to reduce the parsing time.
I am running 10,000 text blocks through parse_parallel and it takes almost 45 minutes.
Is there any way to speed it up other than just using more CPU cores?


HuzaifahSaleem commented on July 28, 2024

Hey @maifeng, are you there?
