
finalfrontier

Introduction

finalfrontier is a Rust program for training word embeddings. finalfrontier currently has the following features:

  • Models:
    • skip-gram (Mikolov et al., 2013)
    • structured skip-gram (Ling et al., 2015)
    • directional skip-gram (Song et al., 2018)
    • dependency (Levy and Goldberg, 2014)
  • Output formats:
    • finalfusion
    • fastText
    • word2vec binary
    • word2vec text
    • GloVe text
  • Noise contrastive estimation (Gutmann and Hyvärinen, 2012)
  • Subword representations (Bojanowski et al., 2016)
  • Hogwild SGD (Recht et al., 2011)
  • Quantized embeddings through the finalfusion quantize command.

The trained embeddings can be stored in the versatile finalfusion format, which can be read and used with the finalfusion crate and the finalfusion Python module.
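For example, embeddings trained with finalfrontier can be loaded through the finalfusion crate roughly as follows. This is a minimal sketch: the file name is a placeholder and the calls assume the finalfusion-rust API at the time of writing.

    use std::fs::File;
    use std::io::BufReader;

    use finalfusion::prelude::*;

    fn main() {
        // Read a model trained by finalfrontier ("model.fifu" is a placeholder).
        let mut reader = BufReader::new(File::open("model.fifu").expect("cannot open model"));
        let embeddings = Embeddings::<VocabWrap, StorageWrap>::read_embeddings(&mut reader)
            .expect("cannot read embeddings");

        // Look up an embedding; subword vocabs can also produce embeddings
        // for words that are not in the vocabulary.
        if let Some(embedding) = embeddings.embedding("berlin") {
            println!("{:?}", embedding);
        }
    }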

The minimum required Rust version is currently 1.70.

finalfrontier's People

Contributors

bytesnake, danieldk, dependabot-preview[bot], divefish, sebpuetz

finalfrontier's Issues

Save norms in finalfusion format

In the old finalfrontier format, we saved the norms of the word embeddings before normalization. This information is lost now that we save directly in finalfusion format. Add an appropriate chunk to finalfusion and restore this functionality.
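Restoring this amounts to keeping the pre-normalization l2 norms around when the embeddings are normalized, roughly as in this sketch (hypothetical helper, plain Vec<f32> rows):

    /// Normalize each embedding in place and return the original l2 norms,
    /// which would then be written to a norms chunk in the finalfusion file.
    fn normalize_and_collect_norms(rows: &mut [Vec<f32>]) -> Vec<f32> {
        rows.iter_mut()
            .map(|row| {
                let norm = row.iter().map(|v| v * v).sum::<f32>().sqrt();
                if norm > 0.0 {
                    for v in row.iter_mut() {
                        *v /= norm;
                    }
                }
                norm
            })
            .collect()
    }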

Replace underscores by dashes in options

Most people are not using German keyboards. Underscores are much more annoying to type than dashes on e.g. US keyboards. We should replace all underscores in option names with dashes.

This should only be done after #91 and #92 to avoid annoying merge conflicts.

Endianness in memory mapped embedding matrices

We store embeddings in little-endian byte order. However, the byte order is not taken into account when embedding matrices are memory mapped. Consequently, incorrect embeddings will be used on big-endian platforms when memory mapping is used.
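The plain read path can do the conversion explicitly on any platform; a memory map just reinterprets the raw bytes in native byte order. A sketch of the explicit conversion (not the actual reader):

    /// Decode a buffer of little-endian bytes into f32 values. This works on
    /// any platform; memory mapping skips this step and is therefore only
    /// correct on little-endian machines.
    fn decode_le_f32(bytes: &[u8]) -> Vec<f32> {
        bytes
            .chunks_exact(4)
            .map(|chunk| f32::from_le_bytes([chunk[0], chunk[1], chunk[2], chunk[3]]))
            .collect()
    }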

Fix EOS ngrams

The ngram vocab currently brackets the </s> marker and extracts ngrams from "<</s>>". Those subwords are never trained, because their indices are never added to the subwords Vec in the vocab due to this check:

            // EOS gets an empty subword list, so its bracketed ngrams are
            // never trained even though they are in the ngram vocab.
            if word.word() == util::EOS {
                subword_indices.push(Vec::new());
                continue;
            }

It's fairly unlikely to encounter those ngrams anywhere, but we should fix this now that we have figured out the bug.
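A likely fix is to apply the same special case when the ngram set is built, so that no ngrams are extracted for the EOS marker in the first place. A self-contained sketch (the constant mirrors util::EOS; the real vocab code is structured differently):

    /// Extract bracketed character ngrams, skipping the EOS marker so that
    /// no ngrams are ever extracted from "<</s>>".
    const EOS: &str = "</s>";

    fn word_ngrams(word: &str, min_n: usize, max_n: usize) -> Vec<String> {
        if word == EOS {
            return Vec::new();
        }
        let bracketed: Vec<char> = format!("<{}>", word).chars().collect();
        let mut ngrams = Vec::new();
        for n in min_n..=max_n {
            for window in bracketed.windows(n) {
                ngrams.push(window.iter().collect());
            }
        }
        ngrams
    }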

Update man pages

  • ff-train: verify that the options are still in sync
  • ff-deps: write the manpage

Some general vocab questions

Hello. I'm a genomics researcher interested in using finalfrontier to create embeddings based on DNA and protein sequences. Unfortunately, I'm a bit new to Rust and very new to the finalfrontier codebase. I've got a few issues right away that I need some help with (and I'm likely to have many more).

For DNA we use k-mers (like ngrams, as DNA is essentially one very large continuous string) with a sliding window approach, and I have code to count a large corpus in about 3 hours (300 million unique k-mers, 50 GB compressed), almost as fast as fastText, without the extra front-end processing or the extra few hundred GB of data. I'd like to get this into a SubwordVocab.
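For illustration, the k-mer extraction is essentially a character-level sliding window over the sequence, something like this simplified sketch:

    /// Extract all k-mers of length `k` from a DNA sequence with a sliding
    /// window, analogous to character ngrams over words. DNA alphabets are
    /// ASCII, so byte-based slicing is safe here.
    fn kmers(sequence: &str, k: usize) -> Vec<&str> {
        if sequence.len() < k {
            return Vec::new();
        }
        (0..=sequence.len() - k)
            .map(|i| &sequence[i..i + k])
            .collect()
    }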

  1. It would be great to have a function supplementing count (vocab/mod.rs) that accepts known counts. Since the corpus is already processed, this would speed things up compared to calling count() multiple times. I can create a PR if that would help.

  2. Is it possible to create a way to skip bracketing for ngram creation? Happy to create a PR as well.

  3. Is it possible to create a set of specific ngram values instead of a range: 9 and 11, rather than 9, 10, and 11?

  4. I am storing everything as Vec, but it seems like everything is String in finalfusion. This is more of a performance question: will it hurt anything if I switch all of my k-mers over to String?

Or: should I focus on creating a different vocab implementation instead, so as not to mess up anything you have already?

Any and all help is greatly appreciated!

Cheers,
--Joseph

Kick out EOS marker

It's redundant because we train with punctuation. Also, EOS pops up in different, somewhat unrelated components; e.g., vocabs need to explicitly match the EOS appended by SentenceIterator.

Add flag to dump the context matrix

I see two ways:

  1. Extend finalfusion to handle files with both input and output matrices
  2. Dump the output matrix + vocab in a separate finalfusion model.

Re 1: more work, and it might make the APIs (more) complex.
Re 2: a hackier solution; output types need to implement to_string(), and lookup is consequently also done through string keys.
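A sketch of what (2) boils down to, with a plain map standing in for the separate finalfusion model; the output rows end up keyed by their string representation, hence the stringly-typed lookup:

    use std::collections::HashMap;

    /// Pair each output (context) row with the string form of its output
    /// type. `O` stands in for finalfrontier's output types.
    fn context_model<O: ToString>(
        outputs: &[O],
        output_matrix: &[Vec<f32>],
    ) -> HashMap<String, Vec<f32>> {
        outputs
            .iter()
            .zip(output_matrix)
            .map(|(output, row)| (output.to_string(), row.clone()))
            .collect()
    }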

Switch to rust2vec format?

I wonder whether it is still necessary to retain finalfrontier's own binary format. It is now possible on master to convert embeddings to the new rust2vec format. However, it is an extra step. For users, it would be much simpler if finalfrontier directly stored trained embeddings in rust2vec format. Additionally, this would allow us to remove a lot of code from finalfrontier, such as Model and the similarity/analogy query functionality.

Stuff that is currently stored in the finalfrontier format but lost in the rust2vec conversion:

  • metadata
  • l2 norms of the word embeddings pre-normalization

@sebpuetz

Release 0.6

I think the norms storage change is pretty important. The earlier we push 0.6.0 out, the better, since it reduces the number of embeddings in the wild that do not have norms.

That said, I think it would be nice to have Nicole's directional skipgram implementation in as well, since then we also have a nice user-visible feature.

Is there anything else that we want to add before branching 0.6?

Dealing with different set of command-line options

I have implemented support for training floret embeddings, but the command line gets a bit unwieldy. Floret is quite a bit different from what we have so far:

  • We need an option to set the number of hashes.
  • We need an option to set the seed for murmur3.
  • Upstream floret doesn't use a matrix size that is a power of 2; it would be nice to provide the same freedom in finalfrontier.
  • Most output formats do not really make sense for floret, e.g. the word2vec and text formats are useless, since floret does not use word embeddings.

I see two ways forward:

  1. We add the necessary options and validations to ensure that no incompatible set of options is used.
  2. We add another level of subcommands, with only the relevant set of options, e.g.:
    finalfrontier skipgram floret, finalfrontier skipgram fasttext, finalfrontier skipgram buckets, finalfrontier skipgram explicit and the same for deps.

For (2), I am not sure if this is the best partitioning.
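With clap's derive API, the nesting in (2) could look roughly like this (a sketch, not the current CLI):

    use clap::{Parser, Subcommand};

    #[derive(Parser)]
    struct Cli {
        #[command(subcommand)]
        model: Model,
    }

    #[derive(Subcommand)]
    enum Model {
        Skipgram {
            #[command(subcommand)]
            vocab: VocabKind,
        },
        Deps {
            #[command(subcommand)]
            vocab: VocabKind,
        },
    }

    /// Each variant carries only the options that make sense for it, e.g.
    /// the number of hashes and the murmur3 seed for floret.
    #[derive(Subcommand)]
    enum VocabKind {
        Floret {
            #[arg(long)]
            hashes: u32,
            #[arg(long)]
            seed: u32,
        },
        Fasttext,
        Buckets,
        Explicit,
    }

    fn main() {
        let _cli = Cli::parse();
    }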

Add support for vocab size target

Some other embedding packages allow you to set a target vocabulary size. In this case, the min-count is chosen so that the vocabulary size stays below the target vocab size.

Note that this is different from taking the N most frequent items, since that might remove some words with a certain frequency while retaining other words with the same frequency.
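A sketch of how the cut-off could be derived from the type frequencies (hypothetical helper):

    /// Find the smallest min-count such that the number of retained types
    /// does not exceed `target_size`. Whole frequency classes are kept or
    /// dropped together, so ties are never split.
    fn min_count_for_target(mut counts: Vec<usize>, target_size: usize) -> usize {
        counts.sort_unstable_by(|a, b| b.cmp(a));
        if counts.len() <= target_size {
            return 1;
        }
        // Everything with the frequency of the first type that no longer
        // fits (or a lower frequency) is dropped.
        counts[target_size] + 1
    }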

Add overarching `finalfrontier` man page

We currently have finalfrontier-skipgram(1) and finalfrontier-deps(1). We should have finalfrontier(1) that describes what finalfrontier is and gives a brief overview of the subcommands and pointers to their man pages. See man cargo or man git for some conventions.

Optimization opportunities

The vast majority of time during training is spent in the dot product and scaled additions. We have been doing unaligned loads so far. I made a quick modification that ensures that every embedding is aligned on a 16-byte boundary and changed the SSE code to do aligned loads; the compiled machine code looks OK, and the compiler even performs some loop unrolling.

Unfortunately, using aligned data/loads does not seem to have a measurable impact on running time. This is probably caused by those functions being constrained by memory bandwidth.
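For reference, the aligned variant of the dot product looks roughly like this (a sketch of the experiment, not the code on master):

    #[cfg(target_arch = "x86_64")]
    use std::arch::x86_64::*;

    /// Dot product with aligned SSE loads. Both slices must have the same
    /// length, a multiple of 4, and start on a 16-byte boundary.
    #[cfg(target_arch = "x86_64")]
    #[target_feature(enable = "sse3")]
    unsafe fn dot_aligned(a: &[f32], b: &[f32]) -> f32 {
        let mut sums = _mm_setzero_ps();
        for i in (0..a.len()).step_by(4) {
            // _mm_load_ps requires 16-byte alignment; _mm_loadu_ps does not.
            let va = _mm_load_ps(a.as_ptr().add(i));
            let vb = _mm_load_ps(b.as_ptr().add(i));
            sums = _mm_add_ps(sums, _mm_mul_ps(va, vb));
        }
        // Horizontal sum of the four lanes.
        let sums = _mm_hadd_ps(sums, sums);
        let sums = _mm_hadd_ps(sums, sums);
        _mm_cvtss_f32(sums)
    }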

I just wanted to jot down two possible opportunities for reducing cache misses that might have an impact on performance.

  1. Some papers that replace the core word2vec computations with custom kernels sample a set of negatives per sentence, rather than per token (see the sketch after this list). In the base case, the number of cache misses due to negatives is reduced by a factor corresponding to the sentence length. Of course, this modification may have an impact on the quality of the embeddings.

  2. The embeddings in the output matrix and the vocab part of the input matrix are ordered by the frequencies of the corresponding tokens. This might improve locality (due to Zipf's law). However, the lookups for subword units are randomized by the hash function. Maybe something can be gained by ordering the embeddings in the subword matrix by hash code frequency, although the most obvious implementation would add an indirection (hash code -> index).
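A sketch of (1): draw the negatives once per sentence and reuse them for every token, instead of sampling per focus/context pair. A real sampler would draw from the smoothed unigram distribution rather than uniformly; the rand crate (0.8) is assumed.

    use rand::Rng;

    /// Sample one set of negatives for a whole sentence. Reusing these for
    /// every token keeps the touched output rows in cache.
    fn sentence_negatives<R: Rng>(rng: &mut R, n_samples: usize, vocab_size: usize) -> Vec<usize> {
        (0..n_samples).map(|_| rng.gen_range(0..vocab_size)).collect()
    }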

@sebpuetz

Thank you

Hi @danieldk @sebpuetz ,

I just want to say thanks. You saved me from a text classifier deadline where flair/BERT embeddings would have been too slow, and I was unable to find any magical invocation (I tried version 3.8, version 4, and other parameters) to force gensim to train a working word2vec model; it would simply not converge and got worse with each epoch. You rock, and finalfrontier from my POV looks like the only game in town (spaCy was lackluster even with floret)!!!

Support explicitly stored ngrams

Entails changes to finalfusion-rust for serialization/usage.

  • Add config
  • Restructure vocab into module (vocab.rs for all variants is quite unwieldy)
  • Extend vocab module
    • Use the indexer approach from finalfusion-rust: parameterize SubwordVocab with Indexer and separate ngram and word indices through an enum (might get ugly, considering how many type parameters we already have for the trainer structs); see the sketch after this list
    • Implement independent vocab type
  • Add support in binaries (depends on support in finalfusion-rust)
  • Update finalfusion-rust dependency once it supports NGramVocabs
  • Replace finalfusion dependency with release.
  • fix #72
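The indexer parameterization could look roughly like this; the names only loosely mirror finalfusion-rust and are hypothetical here:

    /// Maps an ngram to an index, either by hashing into buckets or by
    /// looking it up in an explicitly stored ngram list.
    trait Indexer {
        fn index_ngram(&self, ngram: &str) -> Option<u64>;
    }

    /// Separates word indices from ngram indices, as suggested above.
    enum WordOrNgramIndex {
        Word(usize),
        Ngram(usize),
    }

    /// SubwordVocab parameterized over the indexing strategy.
    struct SubwordVocab<I: Indexer> {
        words: Vec<String>,
        min_n: u32,
        max_n: u32,
        indexer: I,
    }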
