
finalfrontier

Introduction

finalfrontier is a Rust program for training word embeddings. finalfrontier currently has the following features:

  • Models:
    • skip-gram (Mikolov et al., 2013)
    • structured skip-gram (Ling et al., 2015)
    • directional skip-gram (Song et al., 2018)
    • dependency (Levy and Goldberg, 2014)
  • Output formats:
    • finalfusion
    • fastText
    • word2vec binary
    • word2vec text
    • GloVe text
  • Noise contrastive estimation (Gutmann and Hyvärinen, 2012)
  • Subword representations (Bojanowski et al., 2016)
  • Hogwild SGD (Recht et al., 2011)
  • Quantized embeddings through the finalfusion quantize command.

The trained embeddings can be stored in the versatile finalfusion format, which can be read and used with the finalfusion crate and the finalfusion Python module.
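For example, embeddings trained with finalfrontier can be loaded through the finalfusion crate roughly as follows. This is a minimal sketch: the file name is a placeholder and the calls assume the finalfusion-rust API at the time of writing.

    use std::fs::File;
    use std::io::BufReader;

    use finalfusion::prelude::*;

    fn main() {
        // Read a model trained by finalfrontier ("model.fifu" is a placeholder).
        let mut reader = BufReader::new(File::open("model.fifu").expect("cannot open model"));
        let embeddings = Embeddings::<VocabWrap, StorageWrap>::read_embeddings(&mut reader)
            .expect("cannot read embeddings");

        // Look up an embedding; subword vocabs can also produce embeddings
        // for words that are not in the vocabulary.
        if let Some(embedding) = embeddings.embedding("berlin") {
            println!("{:?}", embedding);
        }
    }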

The minimum required Rust version is currently 1.70.

finalfrontier's People

Contributors

bytesnake, danieldk, dependabot-preview[bot], divefish, sebpuetz

finalfrontier's Issues

Save norms in finalfusion format

In the old finalfrontier format, we saved the norms of the word embeddings before normalization. This information is lost now that we save directly in finalfusion format. Add an appropriate chunk to finalfusion and restore this functionality.
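Restoring this amounts to keeping the pre-normalization l2 norms around when the embeddings are normalized, roughly as in this sketch (hypothetical helper, plain Vec<f32> rows):

    /// Normalize each embedding in place and return the original l2 norms,
    /// which would then be written to a norms chunk in the finalfusion file.
    fn normalize_and_collect_norms(rows: &mut [Vec<f32>]) -> Vec<f32> {
        rows.iter_mut()
            .map(|row| {
                let norm = row.iter().map(|v| v * v).sum::<f32>().sqrt();
                if norm > 0.0 {
                    for v in row.iter_mut() {
                        *v /= norm;
                    }
                }
                norm
            })
            .collect()
    }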

Replace underscores by dashes in options

Most people are not using German keyboards. Underscores are much more annoying to type than dashes on e.g. US keyboards. We should replace all underscores in option names with dashes.

This should only be done after #91 and #92 to avoid annoying merge conflicts.

Endianness in memory mapped embedding matrices

We store embeddings in little-endian byte order. However, the byte order is not taken into account when embedding matrices are memory mapped. Consequently, incorrect embeddings will be used on big-endian platforms when memory mapping is used.
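The plain read path can do the conversion explicitly on any platform; a memory map just reinterprets the raw bytes in native byte order. A sketch of the explicit conversion (not the actual reader):

    /// Decode a buffer of little-endian bytes into f32 values. This works on
    /// any platform; memory mapping skips this step and is therefore only
    /// correct on little-endian machines.
    fn decode_le_f32(bytes: &[u8]) -> Vec<f32> {
        bytes
            .chunks_exact(4)
            .map(|chunk| f32::from_le_bytes([chunk[0], chunk[1], chunk[2], chunk[3]]))
            .collect()
    }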

Fix EOS ngrams

The ngram vocab currently brackets the </s> marker and extracts ngrams from "<</s>>". Those subwords are never trained, because their indices are never added to the subwords Vec in the vocab due to this check:

            // EOS gets an empty subword list, so its bracketed ngrams are
            // never trained even though they are in the ngram vocab.
            if word.word() == util::EOS {
                subword_indices.push(Vec::new());
                continue;
            }

It's fairly unlikely to encounter those ngrams anywhere, but we should fix this now that we have figured out the bug.
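A likely fix is to apply the same special case when the ngram set is built, so that no ngrams are extracted for the EOS marker in the first place. A self-contained sketch (the constant mirrors util::EOS; the real vocab code is structured differently):

    /// Extract bracketed character ngrams, skipping the EOS marker so that
    /// no ngrams are ever extracted from "<</s>>".
    const EOS: &str = "</s>";

    fn word_ngrams(word: &str, min_n: usize, max_n: usize) -> Vec<String> {
        if word == EOS {
            return Vec::new();
        }
        let bracketed: Vec<char> = format!("<{}>", word).chars().collect();
        let mut ngrams = Vec::new();
        for n in min_n..=max_n {
            for window in bracketed.windows(n) {
                ngrams.push(window.iter().collect());
            }
        }
        ngrams
    }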

Update man pages

  • ff-train: verify that the options are still in sync
  • ff-deps: write the manpage

Some general vocab questions

Hello. I'm a genomics researcher interested in using finalfrontier to create embeddings based on DNA and protein sequences. Unfortunately, I'm a bit new to Rust and very new to the finalfrontier codebase. I've got a few issues right away that I need some help with (and I'm likely to have many more).

For DNA we use k-mers (like ngrams, as DNA is essentially one very large continuous string) with a sliding window approach, and I have code to count a large corpus in about 3 hours (300 million unique k-mers, 50 GB compressed), almost as fast as fastText, without the extra front-end processing or the extra few hundred GB of data. I'd like to get this into a SubwordVocab.
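For illustration, the k-mer extraction is essentially a character-level sliding window over the sequence, something like this simplified sketch:

    /// Extract all k-mers of length `k` from a DNA sequence with a sliding
    /// window, analogous to character ngrams over words. DNA alphabets are
    /// ASCII, so byte-based slicing is safe here.
    fn kmers(sequence: &str, k: usize) -> Vec<&str> {
        if sequence.len() < k {
            return Vec::new();
        }
        (0..=sequence.len() - k)
            .map(|i| &sequence[i..i + k])
            .collect()
    }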

  1. It would be great to have a function supplementing count (vocab/mod.rs) that accepts known counts. Since the corpus is already processed, this would speed things up compared to calling count() multiple times. I can create a PR if that would help.

  2. Is it possible to create a way to skip bracketing for ngram creation? Happy to create a PR as well.

  3. Is it possible to create a set of specific ngram values instead of a range: 9 and 11, rather than 9, 10, and 11?

  4. I am storing everything as Vec, but it seems like everything is String in finalfusion. This is more of a performance question: will it hurt anything if I switch all of my k-mers over to String?

Or: should I focus on creating a different vocab implementation instead, so as not to mess up anything you have already?

Any and all help is greatly appreciated!

Cheers,
--Joseph

Kick out EOS marker

It's redundant because we train with punctuation. Also, EOS pops up in different, somewhat unrelated components; e.g., vocabs need to explicitly match the EOS appended by SentenceIterator.

Add flag to dump the context matrix

I see two ways:

  1. Extend finalfusion to handle files with both input and output matrices
  2. Dump the output matrix + vocab in a separate finalfusion model.

Re 1: more work, and it might make the APIs (more) complex.
Re 2: a hackier solution; output types need to implement to_string(), and lookup is consequently also done through string keys.
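A sketch of what (2) boils down to, with a plain map standing in for the separate finalfusion model; the output rows end up keyed by their string representation, hence the stringly-typed lookup:

    use std::collections::HashMap;

    /// Pair each output (context) row with the string form of its output
    /// type. `O` stands in for finalfrontier's output types.
    fn context_model<O: ToString>(
        outputs: &[O],
        output_matrix: &[Vec<f32>],
    ) -> HashMap<String, Vec<f32>> {
        outputs
            .iter()
            .zip(output_matrix)
            .map(|(output, row)| (output.to_string(), row.clone()))
            .collect()
    }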

Switch to rust2vec format?

I wonder whether it is still necessary to retain finalfrontier's own binary format. It is now possible on master to convert embeddings to the new rust2vec format. However, it is an extra step. For users, it would be much simpler if finalfrontier directly stored trained embeddings in rust2vec format. Additionally, this would allow us to remove a lot of code from finalfrontier, such as Model and the similarity/analogy query functionality.

Stuff that is currently stored in the finalfrontier format but lost in the rust2vec conversion:

  • metadata
  • l2 norms of the word embeddings pre-normalization

@sebpuetz

Release 0.6

I think the norms storage change is pretty important. The earlier we push 0.6.0 out, the better, since it reduces the number of embeddings in the wild that do not have norms.

That said, I think it would be nice to have Nicole's directional skipgram implementation in as well, since then we also have a nice user-visible feature.

Is there anything else that we want to add before branching 0.6?

Dealing with different set of command-line options

I have implemented support for training floret embeddings, but the command line gets a bit unwieldy. Floret is quite a bit different from what we have so far:

  • We need an option to set the number of hashes.
  • We need an option to set the seed for murmur3.
  • Upstream floret doesn't use a matrix size that is a power of 2; it would be nice to provide the same freedom in finalfrontier.
  • Most output formats do not really make sense for floret, e.g. the word2vec and text formats are useless, since floret does not use word embeddings.

I see two ways forward:

  1. We add the necessary options and validations to ensure that no incompatible set of options is used.
  2. We add another level of subcommands, with only the relevant set of options, e.g.:
    finalfrontier skipgram floret, finalfrontier skipgram fasttext, finalfrontier skipgram buckets, finalfrontier skipgram explicit and the same for deps.

For (2), I am not sure if this is the best partitioning.
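With clap's derive API, the nesting in (2) could look roughly like this (a sketch, not the current CLI):

    use clap::{Parser, Subcommand};

    #[derive(Parser)]
    struct Cli {
        #[command(subcommand)]
        model: Model,
    }

    #[derive(Subcommand)]
    enum Model {
        Skipgram {
            #[command(subcommand)]
            vocab: VocabKind,
        },
        Deps {
            #[command(subcommand)]
            vocab: VocabKind,
        },
    }

    /// Each variant carries only the options that make sense for it, e.g.
    /// the number of hashes and the murmur3 seed for floret.
    #[derive(Subcommand)]
    enum VocabKind {
        Floret {
            #[arg(long)]
            hashes: u32,
            #[arg(long)]
            seed: u32,
        },
        Fasttext,
        Buckets,
        Explicit,
    }

    fn main() {
        let _cli = Cli::parse();
    }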

Add support for vocab size target

Some other embedding packages allow you to set a target vocabulary size. In this case, the min-count is chosen so that the vocabulary size stays below the target vocab size.

Note that this is different from taking the N most frequent items, since that might remove some words with a certain frequency while retaining other words with the same frequency.
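A sketch of how the cut-off could be derived from the type frequencies (hypothetical helper):

    /// Find the smallest min-count such that the number of retained types
    /// does not exceed `target_size`. Whole frequency classes are kept or
    /// dropped together, so ties are never split.
    fn min_count_for_target(mut counts: Vec<usize>, target_size: usize) -> usize {
        counts.sort_unstable_by(|a, b| b.cmp(a));
        if counts.len() <= target_size {
            return 1;
        }
        // Everything with the frequency of the first type that no longer
        // fits (or a lower frequency) is dropped.
        counts[target_size] + 1
    }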

Add overarching `finalfrontier` man page

We currently have finalfrontier-skipgram(1) and finalfrontier-deps(1). We should have finalfrontier(1) that describes what finalfrontier is and gives a brief overview of the subcommands and pointers to their man pages. See man cargo or man git for some conventions.

Optimization opportunities

The vast majority of time during training is spent in the dot product and scaled additions. We have been doing unaligned loads so far. I made a quick modification that ensures that every embedding is aligned on a 16-byte boundary and changed the SSE code to do aligned loads; the compiled machine code looks OK, and the compiler even performs some loop unrolling.

Unfortunately, using aligned data/loads does not seem to have a measurable impact on running time. This is probably caused by those functions being constrained by memory bandwidth.
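For reference, the aligned variant of the dot product looks roughly like this (a sketch of the experiment, not the code on master):

    #[cfg(target_arch = "x86_64")]
    use std::arch::x86_64::*;

    /// Dot product with aligned SSE loads. Both slices must have the same
    /// length, a multiple of 4, and start on a 16-byte boundary.
    #[cfg(target_arch = "x86_64")]
    #[target_feature(enable = "sse3")]
    unsafe fn dot_aligned(a: &[f32], b: &[f32]) -> f32 {
        let mut sums = _mm_setzero_ps();
        for i in (0..a.len()).step_by(4) {
            // _mm_load_ps requires 16-byte alignment; _mm_loadu_ps does not.
            let va = _mm_load_ps(a.as_ptr().add(i));
            let vb = _mm_load_ps(b.as_ptr().add(i));
            sums = _mm_add_ps(sums, _mm_mul_ps(va, vb));
        }
        // Horizontal sum of the four lanes.
        let sums = _mm_hadd_ps(sums, sums);
        let sums = _mm_hadd_ps(sums, sums);
        _mm_cvtss_f32(sums)
    }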

I just wanted to jot down two possible opportunities for reducing cache misses that might have an impact on performance.

  1. Some papers that replace the core word2vec computations with custom kernels sample a set of negatives per sentence, rather than per token (see the sketch after this list). In the base case, the number of cache misses due to negatives is reduced by a factor corresponding to the sentence length. Of course, this modification may have an impact on the quality of the embeddings.

  2. The embeddings in the output matrix and the vocab part of the input matrix are ordered by the frequencies of the corresponding tokens. This might improve locality (due to Zipf's law). However, the lookups for subword units are randomized by the hash function. Maybe something can be gained by ordering the embeddings in the subword matrix by hash code frequency, although the most obvious implementation would add an indirection (hash code -> index).
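A sketch of (1): draw the negatives once per sentence and reuse them for every token, instead of sampling per focus/context pair. A real sampler would draw from the smoothed unigram distribution rather than uniformly; the rand crate (0.8) is assumed.

    use rand::Rng;

    /// Sample one set of negatives for a whole sentence. Reusing these for
    /// every token keeps the touched output rows in cache.
    fn sentence_negatives<R: Rng>(rng: &mut R, n_samples: usize, vocab_size: usize) -> Vec<usize> {
        (0..n_samples).map(|_| rng.gen_range(0..vocab_size)).collect()
    }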

@sebpuetz

Thank you

Hi @danieldk @sebpuetz ,

I just want to say thanks. You saved me from a text classifier deadline where flair/BERT embeddings would have been too slow, and I was unable to find any magical invocation (I tried version 3.8, version 4, and other parameters) to force gensim to train a working word2vec model; it would simply not converge and got worse with each epoch. You rock, and finalfrontier from my POV looks like the only game in town (spaCy was lackluster even with floret)!!!

Support explicitly stored ngrams

Entails changes to finalfusion-rust for serialization/usage.

  • Add config
  • Restructure vocab into module (vocab.rs for all variants is quite unwieldy)
  • Extend vocab module
    • Use the indexer approach from finalfusion-rust: parameterize SubwordVocab with Indexer and separate ngram and word indices through an enum (might get ugly, considering how many type parameters we already have for the trainer structs); see the sketch after this list
    • Implement independent vocab type
  • Add support in binaries (depends on support in finalfusion-rust)
  • Update finalfusion-rust dependency once it supports NGramVocabs
  • Replace finalfusion dependency with release.
  • fix #72
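The indexer parameterization could look roughly like this; the names only loosely mirror finalfusion-rust and are hypothetical here:

    /// Maps an ngram to an index, either by hashing into buckets or by
    /// looking it up in an explicitly stored ngram list.
    trait Indexer {
        fn index_ngram(&self, ngram: &str) -> Option<u64>;
    }

    /// Separates word indices from ngram indices, as suggested above.
    enum WordOrNgramIndex {
        Word(usize),
        Ngram(usize),
    }

    /// SubwordVocab parameterized over the indexing strategy.
    struct SubwordVocab<I: Indexer> {
        words: Vec<String>,
        min_n: u32,
        max_n: u32,
        indexer: I,
    }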
