Authors: Taylor B. Arnold, Nicolas Bailler, Paula Lissón
License: LGPL-2
The fasttextM R package is designed to make it easy to apply multilingual word embeddings to a dataset.
To install, grab the development version using devtools:
devtools::install_github("statsmaths/fasttextM")
The basic installation of the package contains a very small set of embeddings in English and French for testing purposes. To do any real work, we need to install the full versions of these. Here we use only the top 500MB's of the file; the full file is 6GB but more frequent words are contained at the top and we find that the first 500-1000MB's are all we ever need in practice. Feel free to reduce the number depending on your needs, internet speed, and disk space.
library(fasttextM)
ft_download_model("en", mb = 500)
ft_download_model("fr", mb = 500)
Note that these only need to be downloaded once. They are then saved locally on your machine.
Next, we load these two models into memory:
ft_load_model("en")
ft_load_model("fr")
We can now compute the embeddings of words in either language. Each of these embeddings is a length 300 vector:
en_embed <- ft_embed(words = c("hello", "fish", "cheese"),
lang = "en")
en_embed[,1:20]
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,] -0.159450 -0.18259 0.033443 0.18813 -0.067903 -0.13663 -0.25559 0.11000
[2,] 0.010938 0.32371 -0.169970 0.42405 -0.447940 0.15972 0.31668 -0.15638
[3,] 0.207420 0.04882 0.078373 -0.24411 -0.247880 0.35715 0.12923 -0.38060
[,9] [,10] [,11] [,12] [,13] [,14] [,15]
[1,] 0.17275 0.0519710 -0.023302 0.038866 -0.245150 -0.21588 0.359250
[2,] -0.18606 0.0088676 0.167340 0.212200 -0.048738 -0.11182 0.098233
[3,] 0.40952 0.3056300 -0.209410 0.174500 0.070295 -0.39164 0.300000
[,16] [,17] [,18] [,19] [,20]
[1,] -0.082526 0.121760 -0.26775 0.10072 -0.13639
[2,] -0.151830 0.043405 -0.22468 0.19034 -0.30115
[3,] -0.454120 -0.141620 -0.17220 0.24395 -0.18230
More interestingly, we can see the words that are close to these words in the French embedding:
en_embed <- ft_nn(words = c("jump", "fish", "cheese", "city", "swim"),
lang = "en", lang_out = "fr", n = 10)
en_embed
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] "saut" "sauts" "sautant" "élancer" "sauter" "saute"
[2,] "poissons" "poisson" "anguilles" "crevettes" "anguille" "salmonidés"
[3,] "fromage" "fromages" "confiture" "beurre" "saucisson" "confitures"
[4,] "ville" "villes" "capitale" "faubourgs" "mégapole" "quartier"
[5,] "nager" "nage" "nageurs" "nageant" "natation" "nagent"
[,7] [,8] [,9] [,10]
[1,] "sauteurs" "sauteur" "tamgho" "grimper"
[2,] "pêchées" "écrevisses" "crevette" "pêchés"
[3,] "pommes" "babeurre" "charcuterie" "saucissons"
[4,] "municipal" "banlieue" "cité" "quartiers"
[5,] "natatoire" "nagé" "nageur" "plongeon"
It is also possible, and often interesting, to use the nearest neighbours function to find similar words in the same language.
To see a list of all available language for download, run ft_languages()
.
It also indicates which models are downloaded and which have been loaded
into memory:
ft_languages()[20:30,]
iso_language_name iso_code installed loaded
20 Persian fa
21 Finnish fi
22 French fr * *
23 Western Frisian fy
24 Galician gl
25 Gujarati gu
26 Hebrew (modern) he
27 Hindi hi
28 Croatian hr
29 Hungarian hu
30 Armenian hy
The package is a work in progress. If you need some functionality not supported yet, please open a Issue and we will attempt to get it working for the next release.
Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.