Comments (10)
Can you share the version of gensim (`gensim.__version__`)?
When you run create_dicty.py, are you using the example data and seed words or your own?
from measuring-corporate-culture-using-machine-learning.
gensim 3.8.3, and our own data (Glassdoor review text). Do you want to see a sample of it?
The seed words are the same, since we are trying to achieve essentially the same goal.
After adding some print statements, we found the root cause of the error. In culture_dictionary.py, the deduplicate_keywords function builds an empty dimension_seed_words list:

```python
dimension_seed_words = [
    word for word in seed_words[dimension] if word in word2vec_model.wv.vocab
]
```

If we remove the `in word2vec_model.wv.vocab` condition, it does return a non-empty list, but then it fails later with an error saying the word is not in the vocabulary.
Yes, it means none of the seed words are in the word2vec vocabulary. By default, a word needs to appear at least 5 times in the corpus to be included. If you have a small corpus, you can pass the argument min_count=1 to the train_w2v_model() function in clean_and_train.py. Also, make sure the seed words are lowercase.
Hey,
Thanks a lot for the help, it worked like magic.
Can you please suggest some tips on how to improve its accuracy? I am assuming that increasing the min_count value also improves accuracy? What else can we do?
I think we cannot replace word2vec with BERT or something similar, since the code is built around word2vec functions.
It is always helpful to have a larger corpus: words that appear only a few times may not get good representations during training. And correct, the code does not train BERT; it produces static word vectors.
Hey @maifeng
How are you doing? I was wondering if there is any way to reduce the parsing time. I am running 10,000 text blocks through parse_parallel and it takes almost 45 minutes. Is there any way to make it faster, other than just using more CPU cores?
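Not the maintainer, but one lever besides adding cores is how many documents each worker receives per task. If parse_parallel hands documents to a process pool one at a time, a larger chunksize can cut inter-process communication overhead on many small texts. A minimal sketch, assuming a `multiprocessing.Pool`-style setup; `parse_doc` is a made-up placeholder for the real parser call (the repo dispatches documents to Stanford CoreNLP, which dominates the runtime):

```python
from multiprocessing import Pool

def parse_doc(text):
    # Placeholder for the real, far more expensive parser call.
    return text.lower().split()

def parse_in_chunks(docs, workers=2, chunksize=64):
    # A larger chunksize ships a batch of documents to each worker per task,
    # reducing per-task dispatch overhead across thousands of small texts.
    with Pool(workers) as pool:
        return pool.map(parse_doc, docs, chunksize=chunksize)

if __name__ == "__main__":
    parsed = parse_in_chunks(["Great TEAM culture"] * 1000)
    print(parsed[0])  # ['great', 'team', 'culture']
```

If the bottleneck is the CoreNLP server itself rather than Python-side dispatch, batching or trimming unneeded annotators on the server side would likely matter more than any pool tuning.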
Hey @maifeng, are you there?