
twec's Introduction

TWEC: Temporal Word Embeddings with a Compass (AAAI 2019)

  • News May-2021: Thanks to wabyking (https://github.com/wabyking), we found that a gensim compilation problem was affecting the installation of our tool: the compass was unstable during the second part of the training. We have updated our forked gensim package so that it compiles correctly and this problem no longer occurs. You might see a small variation in the results you get with the new stable version. Our AAAI results were computed on a correctly compiled version of the software and were not affected by this issue.

This package contains Python code to build temporal word embeddings with a compass! One of the problems with temporal word embeddings is that they require alignment between corpora. We propose an alignment method for distributional representations based on word2vec. The method is efficient and rests on a simple heuristic: we train an atemporal word embedding, the compass, and use it to freeze one of the layers of the CBOW architecture.

In other words, we freeze one layer of the CBOW architecture and train the temporal embeddings on the other matrix. See the paper for more details.

[Figure: img/twec.png — the TWEC training architecture]
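To make the heuristic concrete, here is a toy numpy sketch (ours, not the TWEC implementation): a single full-softmax CBOW update in which the output (target) matrix U is frozen to the compass and only the slice-specific input (context) matrix C is updated. The rows of C are what you read out as the temporal embeddings.

import numpy as np

V, D = 1000, 30  # toy vocabulary size and embedding dimension
rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(V, D))  # compass: frozen target embeddings
C = rng.normal(scale=0.1, size=(V, D))  # slice-specific context embeddings

def cbow_step(context_ids, target_id, lr=0.025):
    # One CBOW update with a full softmax; U is deliberately never touched.
    h = C[context_ids].mean(axis=0)       # hidden layer: mean of context vectors
    scores = U @ h                        # logits against the frozen targets
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    probs[target_id] -= 1.0               # softmax gradient: probs - one_hot(target)
    grad_h = U.T @ probs                  # backpropagate to the hidden layer
    C[context_ids] -= lr * grad_h / len(context_ids)  # update C only

cbow_step(context_ids=[3, 17, 42, 8], target_id=5)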

Reference

This work is based on the following paper:

  • Di Carlo, V., Bianchi, F., & Palmonari, M. (2019). Training Temporal Word Embeddings with a Compass. Proceedings of the AAAI Conference on Artificial Intelligence, 33(01), 6326-6334. https://doi.org/10.1609/aaai.v33i01.33016326

Abstract

Temporal word embeddings have been proposed to support the analysis of word meaning shifts during time and to study the evolution of languages. Different approaches have been proposed to generate vector representations of words that embed their meaning during a specific time interval. However, the training process used in these approaches is complex, may be inefficient or it may require large text corpora. As a consequence, these approaches may be difficult to apply in resource-scarce domains or by scientists with limited in-depth knowledge of embedding models. In this paper, we propose a new heuristic to train temporal word embeddings based on the Word2vec model. The heuristic consists in using atemporal vectors as a reference, i.e., as a compass, when training the representations specific to a given time interval. The use of the compass simplifies the training process and makes it more efficient. Experiments conducted using state-of-the-art datasets and methodologies suggest that our approach outperforms or equals comparable approaches while being more robust in terms of the required corpus size.

Installing

Important: always create a virtual environment because TWEC uses a custom version of the gensim library.

  • clone the repository
  • virtualenv -p python3.6 env
  • source env/bin/activate
  • pip install cython
  • pip install git+https://github.com/valedica/gensim.git
  • cd into the cloned repository
  • pip install -e .
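After installing, a quick sanity check from Python (a minimal sketch; it only confirms that TWEC and the forked gensim import from the virtual environment):

# Both imports should resolve inside the virtualenv.
import gensim
import twec

print(gensim.__version__)  # version string of the installed (forked) gensim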

Jupyter: you can use TWEC in a Jupyter notebook, but remember that you need the virtual environment! The commands you need are listed below; for a more detailed description of what they do, see this link.

  • you need to register the virtual environment as a kernel inside Jupyter
  • source env/bin/activate
  • (venv) $ pip install ipykernel
  • (venv) $ ipython kernel install --user --name=twec_kernel
  • you will find "twec_kernel" among the available kernels when you create a new notebook

Guide

  • Remember: when you call the training method, TWEC creates a "model/" folder where it saves the trained objects. The compass is trained first and saved in that folder. If you want to overwrite it, set the parameter overwrite=True; otherwise the already trained compass will be reloaded.
  • What you need: temporal slices of text (i.e., text from 1991, text from 1992, etc.) and the concatenation of those slices (the compass).
  • The compass should be the concatenation of the slices you want to align. In the next code section we use arXiv paper text from two different years; the "compass.txt" file contains the concatenation of both slices. A minimal sketch for building such a file follows this list.
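For instance, the compass file can be built by simple concatenation (a sketch of ours; the file names match the example below):

# Concatenate the temporal slices into a single compass file.
slice_paths = ["examples/training/arxiv_9.txt", "examples/training/arxiv_14.txt"]

with open("examples/training/compass.txt", "w") as compass:
    for path in slice_paths:
        with open(path) as slice_file:
            compass.write(slice_file.read())
        compass.write("\n")  # keep slices separated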

How To Use

  • Training

Suppose you have two temporal slices of text, "arxiv_14.txt" and "arxiv_9.txt". First, concatenate the two into a "compass.txt" file. Now you can train the compass.

from twec.twec import TWEC
from gensim.models.word2vec import Word2Vec

aligner = TWEC(size=30, siter=10, diter=10, workers=4)

# train the compass: the text should be the concatenation of the text from the slices
aligner.train_compass("examples/training/compass.txt", overwrite=False) # keep an eye on the overwrite behaviour

The class exposes the same parameters as the gensim word2vec implementation. "siter" is the number of compass training iterations, while "diter" is the number of training iterations for each temporal slice. After this first training you can train the slices:

# now you can train slices and they will be already aligned
# these are gensim word2vec objects
slice_one = aligner.train_slice("examples/training/arxiv_14.txt", save=True)
slice_two = aligner.train_slice("examples/training/arxiv_9.txt", save=True)

These two slices are now aligned and can be compared!
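For example, you can compare the vectors of the same word across the two slices (a hedged sketch of ours; "method" is just an illustrative word, and the attribute names follow the older gensim API used by the fork):

import numpy as np

def cosine(u, v):
    # Cosine similarity between two vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

word = "method"  # illustrative; pick any word present in both slices
if word in slice_one.wv.vocab and word in slice_two.wv.vocab:
    # The slices share the compass space, so this comparison is meaningful.
    print(cosine(slice_one.wv[word], slice_two.wv[word]))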

  • Load Data

You can load the saved models as you would with gensim.

model1 = Word2Vec.load("model/arxiv_14.model")
model2 = Word2Vec.load("model/arxiv_9.model")
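From here the usual gensim API applies; for instance (hedged; "network" is just an illustrative query word):

# Nearest neighbours of the same word in each time slice.
print(model1.wv.most_similar("network", topn=5))
print(model2.wv.most_similar("network", topn=5))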


Credits

This package was created with Cookiecutter and the audreyr/cookiecutter-pypackage project template.

twec's People

Contributors

valedica, vinid


twec's Issues

general metrics and plotting

Hi,

thank you for this package! I have just run TWEC on my data and everything has worked well. However:

It is unclear how to access metrics relating to this data. For example, I can access the standard word2vec functionality that shows me the most similar embeddings in the individual models (slice_one, slice_two), but I want to start to understand how specific words change over time.

Are there handy functions for exploring most changing words and also for plotting them as in the published paper?

Is there also a way of plotting the data in the same way as figure 1 in:

https://ipg.idsia.ch/preprints/supsi2020a.pdf

This is how I would like to demonstrate how some words change over time, but I am unsure how to plot this data from a TWEC model.

thanks!
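(One way to approach the "most changing words" part with aligned slices, as a hedged sketch rather than a built-in TWEC utility: rank the vocabulary shared by the two models by the cosine distance between their vectors.)

import numpy as np

def most_changed(m1, m2, topn=10):
    # Words shared by both models, sorted by cosine distance between slices.
    shared = set(m1.wv.vocab) & set(m2.wv.vocab)
    def drift(w):
        u, v = m1.wv[w], m2.wv[w]
        return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return sorted(shared, key=drift, reverse=True)[:topn]

print(most_changed(model1, model2))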

Is the compass fixed while training on the timestamped text?

from twec.twec import TWEC
from gensim.models.word2vec import Word2Vec

def train():
    aligner = TWEC(size=30, siter=10, diter=10, workers=4)
    aligner.train_compass("examples/training/compass.txt", overwrite=False)
    slice_one = aligner.train_slice("examples/training/arxiv_14.txt", save=True)
    slice_two = aligner.train_slice("examples/training/arxiv_9.txt", save=True)

def test():
    model1 = Word2Vec.load("model/arxiv_14.model")
    model2 = Word2Vec.load("model/arxiv_9.model")
    for model in [model1, model2]:
        print(sum(model.syn1neg))  # output (target) matrix: expected to equal the shared compass
        print(sum(model.wv.syn0))  # input (context) matrix: slice-specific by design

if __name__ == "__main__":
    train()
    test()

The output of the code is shown below:

# model1: sum(model.syn1neg)
[ 54.13019 -7458.793 -2588.298 3593.7505 -731.2068 1354.8907
1956.362 2851.0269 -1234.2087 -2461.2375 693.96765 4517.283
1506.449 -1617.4432 1538.4094 2772.7483 2216.757 -3763.828
2090.126 -298.45084 -294.8205 1523.8512 -4156.9824 -723.04803
-533.2238 1869.8455 -1205.959 -3589.7622 -7645.8135 -4966.196 ]

# model1: sum(model.wv.syn0)
[ 75.728424 2146.7905 711.1423 -1063.1915 280.071
-428.7143 -653.19977 -737.3386 470.85577 737.1261
-51.172543 -1358.5729 -683.6471 417.5251 -398.98938
-808.00616 -600.1352 1040.6033 -659.40375 73.63555
73.206184 -372.51102 1261.4464 297.45206 212.58424
-495.39255 383.86707 955.2797 2138.7588 1448.5309 ]

# model2: sum(model.syn1neg)
[ -206.65964 -5041.21 -2035.7019 2772.9456 -725.939 1060.8079
1505.944 2003.1798 -563.0721 -1705.3502 515.3484 3435.2378
1639.7721 -1262.4358 1019.02844 1742.8516 1668.6241 -2807.0754
1269.7594 -494.86893 -221.1095 729.1342 -2732.2847 -153.8587
-501.57608 1336.3754 -1268.0028 -2143.7483 -5006.103 -3494.257 ]

# model2: sum(model.wv.syn0)
[ 172.6337 2415.6028 1006.8115 -1404.1216 438.4736
-564.31976 -789.87054 -883.25604 373.4988 959.29047
-90.953415 -1708.9927 -1026.617 612.2216 -466.75372
-864.9828 -801.6127 1305.5497 -626.8068 282.45493
129.64682 -274.8585 1347.7399 130.84848 272.3334
-714.29504 643.37933 997.20715 2441.326 1698.4065 ]

We can clearly see that neither the context embeddings nor the target embeddings are fixed. If the compass is not fixed, this work is very similar to Kim et al., the word embeddings from different time periods are not aligned, and it is therefore a little risky to compare word vectors across years, especially if we train the temporal word vectors for more steps/epochs.

However, the paper states in the section "Temporal Word Embeddings with a Compass":

"During this training process, the target embeddings of the output matrix U are not modified, while we update the context embeddings in the input matrix C_ti."

Am I wrong here? Is there anything I did not notice?
