Calculates the context for each word using a WordContextMatrix with smoothed positive PMI values, and the similarity between word vectors using one of several similarity algorithms.
Algorithm performs the following steps:
- Extracts texts from the .txt files in the corpus folder.
- Analyzes the corpus: tokenizes texts into words, removes stop words and single-character words, and creates a list of lemmas (base forms of the words).
- Creates word-context-matrix with smoothed positive PMI values from lemmas.
- Calculates similarities between word vectors from word-context-matrix (using chosen method for calculating similarities).
- Creates an XML file with the potential context for each word.
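
The matrix-building step above can be sketched as follows. This is a minimal, dependency-free illustration and not the project's own code: the function names `buildCooccurrence` and `smoothedPpmi` are hypothetical, and it assumes that the alpha parameter is applied as context distribution smoothing (context counts raised to the power alpha before normalizing), which is one common way to smooth PMI.

```javascript
// Count how often each context word appears within `windowSize`
// positions to the left or right of each target word.
function buildCooccurrence(lemmas, windowSize) {
  const counts = new Map(); // word -> Map(context word -> count)
  for (let i = 0; i < lemmas.length; i++) {
    const row = counts.get(lemmas[i]) || new Map();
    counts.set(lemmas[i], row);
    const start = Math.max(0, i - windowSize);
    const end = Math.min(lemmas.length - 1, i + windowSize);
    for (let j = start; j <= end; j++) {
      if (j === i) continue; // skip the target word itself
      row.set(lemmas[j], (row.get(lemmas[j]) || 0) + 1);
    }
  }
  return counts;
}

// Convert raw counts to positive PMI values.
// Assumption: smoothing raises context counts to the power `alpha`.
function smoothedPpmi(counts, alpha) {
  let total = 0;
  const wordTotals = new Map();
  const contextTotals = new Map();
  for (const [word, row] of counts) {
    for (const [context, n] of row) {
      total += n;
      wordTotals.set(word, (wordTotals.get(word) || 0) + n);
      contextTotals.set(context, (contextTotals.get(context) || 0) + n);
    }
  }
  let smoothedTotal = 0;
  for (const n of contextTotals.values()) smoothedTotal += Math.pow(n, alpha);

  const ppmi = new Map();
  for (const [word, row] of counts) {
    const out = new Map();
    for (const [context, n] of row) {
      const pJoint = n / total;
      const pWord = wordTotals.get(word) / total;
      const pContext = Math.pow(contextTotals.get(context), alpha) / smoothedTotal;
      // Negative PMI values are clipped to zero ("positive" PMI).
      out.set(context, Math.max(0, Math.log2(pJoint / (pWord * pContext))));
    }
    ppmi.set(word, out);
  }
  return ppmi;
}

// Illustrative input: a tiny, already lemmatized token list.
const lemmas = ['cat', 'sit', 'mat', 'dog', 'sit', 'rug'];
const matrix = smoothedPpmi(buildCooccurrence(lemmas, 1), 0.75);
```

Storing rows as maps keeps the matrix sparse, which matters because most word pairs never co-occur in a real corpus.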
- Node JS and the NPM package manager.
- Libraries installed from the package.json file.
- Go to the project root directory.
- Run the npm i or npm install command. This command installs the necessary libraries.
- Open the .env file and configure the following parameters:
  - CORPUS_DIRECTORY: string value that specifies the directory of the corpus with texts (absolute or relative path).
  - WINDOW_SIZE: integer value that specifies the number of words to the left and right of the main word.
  - ALPHA: float value in [0, 1] that specifies the alpha parameter for PMI calculations.
  - COUNT_COUNTEXT_WORDS: integer value that specifies how many top context words from the word-context-matrix are shown for each word in the output XML file.
  - OUTPUT_FOLDER: string value that specifies the location of the output XML file (absolute or relative path).
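
As an illustration of how these parameters might be read and checked, here is a minimal sketch. The `readConfig` helper is hypothetical and not part of the project; the lower bounds of 1 for the integer values are assumptions, and the example values are placeholders, not the project's defaults.

```javascript
// Parse and validate the parameters from an env-like object
// (for example process.env once the .env file has been loaded).
function readConfig(env) {
  const windowSize = Number.parseInt(env.WINDOW_SIZE, 10);
  const alpha = Number.parseFloat(env.ALPHA);
  const contextWordCount = Number.parseInt(env.COUNT_COUNTEXT_WORDS, 10);

  if (!env.CORPUS_DIRECTORY) throw new Error('CORPUS_DIRECTORY is required');
  if (!env.OUTPUT_FOLDER) throw new Error('OUTPUT_FOLDER is required');
  // Assumption: a window must cover at least one word on each side.
  if (!Number.isInteger(windowSize) || windowSize < 1) {
    throw new Error('WINDOW_SIZE must be a positive integer');
  }
  if (!(alpha >= 0 && alpha <= 1)) {
    throw new Error('ALPHA must be a float in [0, 1]');
  }
  // Assumption: at least one context word is written per word.
  if (!Number.isInteger(contextWordCount) || contextWordCount < 1) {
    throw new Error('COUNT_COUNTEXT_WORDS must be a positive integer');
  }

  return {
    corpusDirectory: env.CORPUS_DIRECTORY,
    windowSize,
    alpha,
    contextWordCount,
    outputFolder: env.OUTPUT_FOLDER,
  };
}

// Placeholder values for illustration only.
const config = readConfig({
  CORPUS_DIRECTORY: './corpus',
  WINDOW_SIZE: '2',
  ALPHA: '0.75',
  COUNT_COUNTEXT_WORDS: '10',
  OUTPUT_FOLDER: './output',
});
```

Validating early gives a clear error message instead of a silent NaN propagating into the PMI calculations.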
After that, place the .txt files with texts into the CORPUS_DIRECTORY folder.
In the project root directory, run the npm start <i> command, where <i> specifies the number of the similarity algorithm:
- 1 specifies the cosine similarity algorithm;
- 2 specifies the Jaccard similarity algorithm;
- 3 specifies the Jensen-Shannon divergence algorithm.
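
For reference, the three measures can be sketched on dense, equal-length numeric vectors as below. This is an illustration under stated assumptions rather than the project's implementation: the function names are hypothetical, the Jaccard variant shown is the weighted (min/max) form, and the Jensen-Shannon divergence normalizes each vector to a probability distribution and uses base-2 logarithms.

```javascript
// Cosine similarity: dot product divided by the product of the norms.
function cosine(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

// Weighted Jaccard similarity for non-negative vectors:
// sum of element-wise minima over sum of element-wise maxima.
function weightedJaccard(a, b) {
  let minSum = 0;
  let maxSum = 0;
  for (let i = 0; i < a.length; i++) {
    minSum += Math.min(a[i], b[i]);
    maxSum += Math.max(a[i], b[i]);
  }
  return maxSum === 0 ? 0 : minSum / maxSum;
}

// Jensen-Shannon divergence between the two vectors after normalizing
// each to sum to 1. With base-2 logarithms the result lies in [0, 1].
function jensenShannon(a, b) {
  const sumA = a.reduce((s, x) => s + x, 0);
  const sumB = b.reduce((s, x) => s + x, 0);
  if (sumA === 0 || sumB === 0) {
    throw new RangeError('vectors must have a non-zero sum');
  }
  let divergence = 0;
  for (let i = 0; i < a.length; i++) {
    const p = a[i] / sumA;
    const q = b[i] / sumB;
    const m = (p + q) / 2;
    // Terms with zero probability contribute nothing.
    if (p > 0) divergence += 0.5 * p * Math.log2(p / m);
    if (q > 0) divergence += 0.5 * q * Math.log2(q / m);
  }
  return divergence;
}
```

Note that the first two are similarities (higher means closer), while Jensen-Shannon is a divergence (lower means closer), so results from option 3 are ranked in the opposite direction unless they are converted.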
See the result in the configured OUTPUT_FOLDER directory.
As the output you get an OUTPUT_FOLDER/wordsContext.xml file in the following format:
<?xml version="1.0"?>
<document name="wordsContext">
<context type="contextWords" word="...">
<word type="word" similarity="...">...</word>
<word type="word" similarity="...">...</word>
...
</context>
<context type="contextWords" word="...">
...
</context>
...
</document>
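
A document of this shape could be produced by a sketch like the one below. It builds the string by hand so that it runs without any dependencies, so it is not the xmlbuilder-based code the project uses; the `toWordsContextXml` and `escapeXml` helpers and the input object shape are hypothetical, and the `similarity` attribute name follows the first word element in the sample.

```javascript
// Escape characters that are not allowed verbatim in XML text or attributes.
function escapeXml(value) {
  return String(value)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

// Assumed input shape: [{ word, neighbours: [{ word, similarity }] }].
function toWordsContextXml(contexts) {
  const lines = ['<?xml version="1.0"?>', '<document name="wordsContext">'];
  for (const { word, neighbours } of contexts) {
    lines.push(`  <context type="contextWords" word="${escapeXml(word)}">`);
    for (const neighbour of neighbours) {
      lines.push(
        `    <word type="word" similarity="${escapeXml(neighbour.similarity)}">` +
          `${escapeXml(neighbour.word)}</word>`
      );
    }
    lines.push('  </context>');
  }
  lines.push('</document>');
  return lines.join('\n');
}

// Illustrative data only.
const xml = toWordsContextXml([
  { word: 'cat', neighbours: [{ word: 'dog', similarity: 0.5 }] },
]);
```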
- natural (version 0.6.3) is used for tokenizing input texts from the corpus into words.
- stopwords (version 0.0.9) is used to remove stop words from the corpus.
- lemmatizer (version 0.0.1) is used for creating lemmas from words.
- xmlbuilder (version 15.1.0) is used for creating the XML file with context words.