
Measuring Corporate Culture Using Machine Learning

Introduction

The repository implements the method described in the paper

Kai Li, Feng Mai, Rui Shen, Xinyan Yan, Measuring Corporate Culture Using Machine Learning, The Review of Financial Studies, 2020; DOI:10.1093/rfs/hhaa079 [Available at SSRN]

The code is tested on Ubuntu 18.04 and macOS Catalina, with limited testing on Windows 10.

Requirements

The code requires Stanford CoreNLP. Point the CORENLP_HOME environment variable to your CoreNLP installation directory, for example:

    os.environ["CORENLP_HOME"] = "/home/user/stanford-corenlp-full-2018-10-05/"

  • If you are using Windows, use "/" instead of "\" to separate directories.

  • Make sure the requirements for CoreNLP are met. For example, you need to have Java installed (if you are using Windows, install the Windows Offline (64-bit) version). To check that CoreNLP is set up correctly, use the command line (terminal) to navigate to the project root folder and run python -m culture.preprocess. You should see parsed outputs from a single sentence printed after a moment:

    (['when[pos:WRB] I[pos:PRP] be[pos:VBD] a[pos:DT]....

Data

We included some example data in the data/input/ folder. The three files are (an illustrative example follows the list):

  • documents.txt: Each line is a document (e.g., each earnings call). Each document needs to have line breaks removed. The file has no header row.
  • document_ids.txt: Each line is a document ID (e.g., a unique identifier for each earnings call). A document ID cannot contain _ or whitespace. The file has no header row.
  • (Optional) id2firms.csv: A csv file with three columns (document_id:str, firm_id:str, time:int). The file has a header row.
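
For illustration only, a minimal set of input files might look like the following (the documents, IDs, firm IDs, and years below are made up):

    documents.txt (one document per line, no header)
        Thank you operator, and good morning everyone. We delivered strong growth this quarter ...
        Welcome to our quarterly earnings call. Our teams continued to focus on innovation ...

    document_ids.txt (one ID per line, no header, no underscores or whitespace)
        doc0
        doc1

    id2firms.csv (header row: document_id,firm_id,time)
        document_id,firm_id,time
        doc0,firmA,2013
        doc1,firmB,2013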

Before running the code

You can configure global options in global_options.py. The most important options are (an illustrative excerpt follows the list):

  • The RAM allocated for CoreNLP
  • The number of CPU cores for CoreNLP parsing and model training
  • The seed words
  • The max number of words to include in each dimension. Note that after filtering and deduplication (each word can only be loaded under a single dimension), the number of words will be smaller.
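
As a rough sketch, these options correspond to variables along the following lines in global_options.py. The exact variable names and defaults below are assumptions (only RAM_CORENLP, N_CORES, and N_WORDS_DIM appear elsewhere on this page), so check the actual file rather than copying this verbatim:

    # global_options.py -- illustrative excerpt; names and values are assumptions
    RAM_CORENLP = "8G"     # RAM allocated to the CoreNLP server
    N_CORES = 4            # CPU cores for CoreNLP parsing and model training
    N_WORDS_DIM = 500      # max words per dimension (the final count can be smaller after filtering/deduplication)
    SEED_WORDS = {
        "innovation": ["innovation", "innovate", "creativity"],
        "integrity": ["integrity", "ethics", "accountability"],
        # ... the remaining dimensions (quality, respect, teamwork in the paper) follow the same pattern
    }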

Running the code

  1. Use python parse.py to parse the raw documents with Stanford CoreNLP. This step is relatively slow, so using multiple CPU cores is recommended. The parsed files are output to the data/processed/parsed/ folder:

    • documents.txt: Each line is a sentence.
    • document_sent_ids.txt: Each line is an ID in the format docID_sentenceID (e.g. doc0_0, doc0_1, ..., doc1_0, doc1_1, doc1_2, ...). Each line corresponds to the same line in documents.txt.

    Note about performance: This step is time-consuming (~10 min for 100 calls). Using python parse_parallel.py can speed up the process considerably (~2 min with 8 cores for 100 calls) but it is not well-tested on all platforms. To not break things, the two implementations are separated.
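
    If you want to sanity-check the parsed output, the small sketch below regroups sentences by document using the docID_sentenceID convention. It is not part of the repository; the file paths simply follow the folder layout above:

        # regroup parsed sentences by document (illustrative sanity check, not repository code)
        from collections import defaultdict

        sentences_by_doc = defaultdict(list)
        with open("data/processed/parsed/documents.txt", encoding="utf-8") as f_sent, \
             open("data/processed/parsed/document_sent_ids.txt", encoding="utf-8") as f_ids:
            for sent, sent_id in zip(f_sent, f_ids):
                doc_id = sent_id.strip().rpartition("_")[0]  # document IDs never contain "_"
                sentences_by_doc[doc_id].append(sent.strip())
        print(len(sentences_by_doc), "documents reconstructed")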

  2. Use python clean_and_train.py to clean the parsed documents.txt: the program removes stopwords and named entities, learns corpus-specific phrases using gensim and concatenates them, and finally trains the word2vec model (a simplified gensim sketch follows the lists of outputs below).

    The options can be configured in the global_options.py file. The program outputs the following three files:

    • data/processed/unigram/documents_cleaned.txt: Each line is a sentence. NERs are replaced by tags. Stopwords, 1-letter words, punctuation marks, and pure numeric tokens are removed. MWEs and compound words are concatenated.
    • data/processed/bigram/documents_cleaned.txt: Each line is a sentence. 2-word phrases are concatenated.
    • data/processed/trigram/documents_cleaned.txt: Each line is a sentence. 3-word phrases are concatenated. This is the final corpus for training the word2vec model and scoring.

    The program also saves the following gensim models:

    • models/phrases/bigram.mod: phrase model for 2-word phrases
    • models/phrases/trigram.mod: phrase model for 3-word phrases
    • models/w2v/w2v.mod: word2vec model
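
    Conceptually, the phrase learning and word2vec training follow the standard gensim workflow. The sketch below is a simplified stand-in, not the repository's exact code: it uses current gensim (4.x) parameter names, and the hyperparameters are placeholders:

        # simplified stand-in for the gensim steps in clean_and_train.py (hyperparameters are placeholders)
        from gensim.models import Word2Vec
        from gensim.models.phrases import Phrases, Phraser

        with open("data/processed/unigram/documents_cleaned.txt", encoding="utf-8") as f:
            sentences = [line.split() for line in f]

        bigram = Phraser(Phrases(sentences, min_count=10, threshold=10))        # learn 2-word phrases
        sentences_bi = [bigram[s] for s in sentences]
        trigram = Phraser(Phrases(sentences_bi, min_count=10, threshold=10))    # learn 3-word phrases
        sentences_tri = [trigram[s] for s in sentences_bi]

        w2v = Word2Vec(sentences_tri, vector_size=300, window=5, min_count=5, workers=4)
        w2v.save("models/w2v/w2v.mod")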
  3. Use python create_dict.py to create the expanded dictionary. The program outputs the following files:

    • outputs/dict/expanded_dict.csv: A csv file with the number of columns equal to the number of dimensions in the dictionary (five in the paper). The header row contains the dimension names.

    (Optional): It is possible to manually remove or add items to the expanded_dict.csv before scoring the documents.
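
    The expansion is driven by word2vec similarity to the seed words. The sketch below shows the general idea only; it is not the ranking, filtering, and deduplication logic of create_dict.py, and the seed list and cutoff are placeholders:

        # minimal sketch of seed-word expansion via word2vec similarity (not the logic of create_dict.py)
        from gensim.models import Word2Vec

        w2v = Word2Vec.load("models/w2v/w2v.mod")
        seeds = ["innovation", "innovate", "creativity"]        # placeholder seed words for one dimension
        seeds = [w for w in seeds if w in w2v.wv]               # keep only in-vocabulary seeds
        expanded = w2v.wv.most_similar(positive=seeds, topn=500)
        print(expanded[:10])                                    # (word, cosine similarity) pairs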

  4. Use python score.py to score the documents. Note that the output scores for the documents are not adjusted by the document length. The program outputs three sets of scores:

    • outputs/scores/scores_TF.csv: using raw term counts or term frequency (TF),
    • outputs/scores/scores_TFIDF.csv: using TF-IDF weights,
    • outputs/scores/scores_WFIDF.csv: TF-IDF with Log normalization (WFIDF).

    (Optional): It is possible to use additional weights on the words (see score.score_tf_idf() for detail).
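
    For intuition, the three weighting schemes are standard: TF is the raw count of dictionary words in a document, TF-IDF weights each count by log(N/df), and one common form of WFIDF replaces the raw count with 1 + log(count). The toy example below illustrates the arithmetic; it is not the code in score.py, and the documents and dictionary are made up:

        # toy illustration of TF, TF-IDF, and WFIDF weighting (not the code in score.py)
        import math
        from collections import Counter

        docs = [["we", "value", "innovation", "and", "creativity"],
                ["quality", "and", "integrity", "drive", "our", "teamwork"]]
        dictionary = {"innovation", "creativity", "quality"}     # placeholder dimension words

        N = len(docs)
        df = Counter(w for doc in docs for w in set(doc) if w in dictionary)

        for doc in docs:
            tf = Counter(w for w in doc if w in dictionary)
            score_tf = sum(tf.values())
            score_tfidf = sum(c * math.log(N / df[w]) for w, c in tf.items())
            score_wfidf = sum((1 + math.log(c)) * math.log(N / df[w]) for w, c in tf.items())
            print(score_tf, round(score_tfidf, 3), round(score_wfidf, 3))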

  5. (Optional): Use python aggregate_firms.py to aggregate the scores to the firm-time level. The final scores are adjusted by document length.
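
    A rough sketch of what the length-adjusted firm-time aggregation could look like with pandas is shown below. The document_id, firm_id, and time columns come from id2firms.csv as described above; the score and length column names, and the exact adjustment, are assumptions and may differ from aggregate_firms.py:

        # illustrative firm-time aggregation with document-length adjustment (column names partly assumed)
        import pandas as pd

        scores = pd.read_csv("outputs/scores/scores_TF.csv")     # per-document scores (length column name assumed)
        id2firms = pd.read_csv("data/input/id2firms.csv")        # document_id, firm_id, time

        merged = scores.merge(id2firms, on="document_id")
        dims = ["innovation", "integrity", "quality", "respect", "teamwork"]   # the five dimensions in the paper
        for d in dims:
            merged[d] = merged[d] / merged["document_length"]    # adjust each dimension by document length
        firm_time = merged.groupby(["firm_id", "time"])[dims].sum().reset_index()
        firm_time.to_csv("firm_time_scores.csv", index=False)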


Issues

Issues while running parse.py

Hi
I was trying to run parse.py and it threw a traceback:
local variable 'sentences_processed' referenced before assignment.
It processes up to a certain point and then throws this error. I tried both multiprocessing and a single core.
Any help would be appreciated.

Thanks!

Confusion about the dictionary

Thanks to this project for the opportunity to learn. I ran into two points of confusion while building the dictionary and would appreciate answers. Thanks.

  1. I set N_WORDS_DIM=500, but the final number of words I get in expanded_dict.csv is 520. Why is the final number of words not equal to 500?

  2. I set N_WORDS_DIM=300 and 500 respectively, and found that the first 300 words of the two results are not exactly the same. (I think that if the seed words are unchanged, the first 300 words should in theory be identical, since the ranking is based on cosine similarity.) This also confuses me.

I look forward to receiving a reply! Thanks!

Result question

Hello professor.

Thank you very much for your amazing project!

After running score.py, I got results for the five culture values. I see in your paper that the variable is described as the "Weighted frequency count of innovation-related words in the QA section of earnings calls averaged over a three-year window."
To obtain the variable in your paper, should I add up a company's per-file values? For example, in 2013 company X had four earnings call transcripts a, b, c, and d. To get the innovation value, after running score.py, do I need to add up (innovation_a/length of a) + (innovation_b/length of b) + (innovation_c/length of c) + (innovation_d/length of d) to get the final innovation culture value for company X in 2013?

I am very puzzled. Hope to get your reply!

Issue with the parsing script

Dear Maifeng,

Thank you for the great and easily understandable implementation of the code used in your paper. I have one question/suggestion:

When I run parse.py, the script fails on every line because corpus_preprocessor is not defined in process_line. Somehow the script does not manage to pass the corpus_preprocessor object to the process_line function. While I can rewrite the script to access the NLP client outside of the process_line function, this seems an inelegant workaround. Am I overlooking anything? I have installed all the requisite packages. I have also checked the fixes proposed in the closed issue, but they do not seem to help.

Some additional info:

When I just throw in a quick print(corpus_preprocessor.process_document(string, id)) call directly after the initialization code (see below), it seems to work fine, even for individual strings from my files. So the issue doesn't seem to come from CoreNLP:

    with CoreNLPClient(
        be_quiet=False,
        properties={
            "ner.applyFineGrained": "false",
            "annotators": "tokenize, ssplit, pos, lemma, ner, depparse",
        },
        memory=global_options.RAM_CORENLP,
        threads=global_options.N_CORES,
        timeout=120000000,
        max_char_length=1000000,
    ) as client:
        corpus_preprocessor = preprocess.preprocessor(client)

        print(corpus_preprocessor.process_document(string, id))

Edit:

The issue seems to stem from the use of starmap in my case. If I call the process_line function with a regular call, I do not have the same issue.

Best wishes
Hauke

ZeroDivision Error

While running create_dict.py it gave me a ZeroDivisionError. The error was raised when it called the gensim word2vec model to expand the dictionary. Screenshot attached for reference.
P.S. I have checked all of the parameters and nothing is empty.

Error when installing corenlp

I followed your instructions to install CoreNLP on Windows 10. However, when I run python -m culture.preprocess to check whether it is set up correctly, it returns:

    File "C:\Python39\lib\site-packages\stanfordnlp\server\client.py", line 311, in _request
      self.ensure_alive()
    File "C:\Python39\lib\site-packages\stanfordnlp\server\client.py", line 137, in ensure_alive
      raise PermanentlyFailedException("Timed out waiting for service to come alive.")
    stanfordnlp.server.client.PermanentlyFailedException: Timed out waiting for service to come alive.

Could you give me some suggestions to handle this problem?

Division by Zero Error

Hey Authors,
I have been trying to run create_dict.py on a text file converted from a csv file using pandas. I am getting a division-by-zero error on the word2vec.n_similarity function call, which says at least one of the three lists is empty. But I have checked: neither the model vocabulary, the seed words, nor the expanded words are empty.

Please help

Changing Dataset

I wanted to change the dataset but am unable to understand how you have mapped document_ids to the documents. A little clarification of that in the README would be really helpful.
Thank you.
