fasttextM: Fast Multilingual Word Embeddings

Authors: Taylor B. Arnold, Nicolas Bailler, Paula Lissón
License: LGPL-2

Overview

The fasttextM R package is designed to make it easy to apply multilingual word embeddings to a dataset.

To install, grab the development version using devtools:

devtools::install_github("statsmaths/fasttextM")

The basic installation of the package contains a very small set of embeddings in English and French for testing purposes. To do any real work, we need to install the full versions of these. Here we download only the first 500 MB of each file; the full file is 6 GB, but more frequent words appear at the top, and we find that the first 500-1000 MB are all we ever need in practice. Feel free to reduce the number depending on your needs, internet speed, and disk space.

library(fasttextM)
ft_download_model("en", mb = 500)
ft_download_model("fr", mb = 500)

Note that these only need to be downloaded once. They are then saved locally on your machine.

Next, we load these two models into memory:

ft_load_model("en")
ft_load_model("fr")

We can now compute the embeddings of words in either language. Each embedding is a vector of length 300, returned as one row of a matrix:

en_embed <- ft_embed(words = c("hello", "fish", "cheese"),
                     lang = "en")
en_embed[,1:20]
          [,1]     [,2]      [,3]     [,4]      [,5]     [,6]     [,7]     [,8]
[1,] -0.159450 -0.18259  0.033443  0.18813 -0.067903 -0.13663 -0.25559  0.11000
[2,]  0.010938  0.32371 -0.169970  0.42405 -0.447940  0.15972  0.31668 -0.15638
[3,]  0.207420  0.04882  0.078373 -0.24411 -0.247880  0.35715  0.12923 -0.38060
         [,9]     [,10]     [,11]    [,12]     [,13]    [,14]    [,15]
[1,]  0.17275 0.0519710 -0.023302 0.038866 -0.245150 -0.21588 0.359250
[2,] -0.18606 0.0088676  0.167340 0.212200 -0.048738 -0.11182 0.098233
[3,]  0.40952 0.3056300 -0.209410 0.174500  0.070295 -0.39164 0.300000
         [,16]     [,17]    [,18]   [,19]    [,20]
[1,] -0.082526  0.121760 -0.26775 0.10072 -0.13639
[2,] -0.151830  0.043405 -0.22468 0.19034 -0.30115
[3,] -0.454120 -0.141620 -0.17220 0.24395 -0.18230
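
Embedding vectors such as these are commonly compared using cosine similarity. The following is a minimal sketch in base R, not part of fasttextM itself: the matrix `embed` is a small made-up stand-in for the output of ft_embed() (real embeddings have 300 columns), and `cosine_sim` is a hypothetical helper name introduced only for illustration.

```r
# Toy stand-in for the matrix returned by ft_embed(): one row per word.
# The values are invented for illustration; real embeddings have 300 columns.
embed <- rbind(
  hello  = c(0.9, 0.1, 0.0, 0.2),
  fish   = c(0.1, 0.8, 0.5, 0.0),
  cheese = c(0.2, 0.7, 0.6, 0.1)
)

# Hypothetical helper: cosine similarity between every pair of rows.
cosine_sim <- function(m) {
  unit <- m / sqrt(rowSums(m^2))  # scale each row to unit length
  unit %*% t(unit)                # dot products of unit vectors
}

sim <- cosine_sim(embed)
round(sim, 3)
```

Each entry of `sim` lies between -1 and 1, and the diagonal is 1 because every word is maximally similar to itself. The same helper could be applied to a real matrix returned by ft_embed(), assuming it has one row per word as in the output above.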

More interestingly, we can find the French words whose embeddings are closest to a set of English words:

fr_nn <- ft_nn(words = c("jump", "fish", "cheese", "city", "swim"),
               lang = "en", lang_out = "fr", n = 10)
fr_nn
     [,1]       [,2]       [,3]        [,4]        [,5]        [,6]
[1,] "saut"     "sauts"    "sautant"   "élancer"   "sauter"    "saute"
[2,] "poissons" "poisson"  "anguilles" "crevettes" "anguille"  "salmonidés"
[3,] "fromage"  "fromages" "confiture" "beurre"    "saucisson" "confitures"
[4,] "ville"    "villes"   "capitale"  "faubourgs" "mégapole"  "quartier"
[5,] "nager"    "nage"     "nageurs"   "nageant"   "natation"  "nagent"
     [,7]        [,8]         [,9]          [,10]
[1,] "sauteurs"  "sauteur"    "tamgho"      "grimper"
[2,] "pêchées"   "écrevisses" "crevette"    "pêchés"
[3,] "pommes"    "babeurre"   "charcuterie" "saucissons"
[4,] "municipal" "banlieue"   "cité"        "quartiers"
[5,] "natatoire" "nagé"       "nageur"      "plongeon"

It is also possible, and often interesting, to use the nearest neighbours function to find similar words in the same language.
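
To make the idea of a same-language nearest-neighbour lookup concrete, here is a small self-contained sketch in base R. It does not call fasttextM and is not the package's implementation: the vocabulary matrix `vocab`, its values, and the helper `nearest_words` are all hypothetical, and cosine similarity is assumed as the distance measure.

```r
# Hypothetical toy vocabulary; the values are invented for illustration.
vocab <- rbind(
  fish   = c(0.1, 0.8, 0.5, 0.0),
  salmon = c(0.2, 0.9, 0.4, 0.0),
  cheese = c(0.2, 0.1, 0.9, 0.3),
  butter = c(0.1, 0.2, 0.8, 0.4),
  city   = c(0.9, 0.0, 0.1, 0.7)
)

# Hypothetical helper: the n words most similar to `word` by cosine similarity.
nearest_words <- function(word, embed, n = 2) {
  unit <- embed / sqrt(rowSums(embed^2))   # scale each row to unit length
  sims <- drop(unit %*% unit[word, ])      # similarity of every word to `word`
  sims <- sims[names(sims) != word]        # exclude the query word itself
  names(sort(sims, decreasing = TRUE))[seq_len(n)]
}

nearest_words("fish", vocab)
nearest_words("cheese", vocab, n = 1)
```

The query word is dropped from its own result because it would otherwise always rank first.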

To see a list of all languages available for download, run ft_languages(). The output also indicates which models have been downloaded and which are loaded into memory:

ft_languages()[20:30,]
   iso_language_name iso_code installed loaded
20           Persian       fa
21           Finnish       fi
22            French       fr         *      *
23   Western Frisian       fy
24          Galician       gl
25          Gujarati       gu
26   Hebrew (modern)       he
27             Hindi       hi
28          Croatian       hr
29         Hungarian       hu
30          Armenian       hy

The package is a work in progress. If you need functionality that is not yet supported, please open an issue and we will attempt to get it working for the next release.

Note

Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.
