finalfusion-python

Introduction

finalfusion is a Python package for reading, writing and using finalfusion embeddings. It also supports other commonly used embedding formats such as fastText, GloVe and word2vec.

The Python package supports the same types of embeddings as the finalfusion-rust crate:

  • Vocabulary:
    • No subwords
    • Subwords
  • Embedding matrix:
    • Array
    • Memory-mapped
    • Quantized
  • Norms
  • Metadata

Installation

The finalfusion module is available on PyPI for Linux, Mac and Windows. You can use pip to install the module:

$ pip install --upgrade finalfusion

Installing from source

Building from source depends on Cython. If you install the package using pip, you don't need to explicitly install the dependency since it is specified in pyproject.toml.

$ git clone https://github.com/finalfusion/finalfusion-python
$ cd finalfusion-python
$ pip install .

If you want to build wheels from source, wheel needs to be installed. It's then possible to build wheels through:

$ python setup.py bdist_wheel

The wheels can be found in dist.

Package Usage

Basic usage

import finalfusion
# loading from different formats
w2v_embeds = finalfusion.load_word2vec("/path/to/w2v.bin")
text_embeds = finalfusion.load_text("/path/to/embeds.txt")
text_dims_embeds = finalfusion.load_text_dims("/path/to/embeds.dims.txt")
fasttext_embeds = finalfusion.load_fasttext("/path/to/fasttext.bin")
fifu_embeds = finalfusion.load_finalfusion("/path/to/embeddings.fifu")

# serialization to formats works similarly
finalfusion.compat.write_word2vec("to_word2vec.bin", fifu_embeds)

# embedding lookup
embedding = fifu_embeds["Test"]

# reading an embedding into a buffer
import numpy as np
buffer = np.zeros(fifu_embeds.storage.shape[1], dtype=np.float32)
fifu_embeds.embedding("Test", out=buffer)

# similarity and analogy query
sim_query = fifu_embeds.word_similarity("Test")
analogy_query = fifu_embeds.analogy("A", "B", "C")

# accessing the vocab and printing the first 10 words
vocab = fifu_embeds.vocab
print(vocab.words[:10])

# SubwordVocabs give access to the subword indexer:
subword_indexer = vocab.subword_indexer
print(subword_indexer.subword_indices("Test", with_ngrams=True))

# accessing the storage and calculating its dot product with an embedding
# (the storage is transposed so that the shapes line up)
res = embedding.dot(fifu_embeds.storage.T)

# printing metadata
print(fifu_embeds.metadata) 
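
The dot product between an embedding and the storage can be read as a similarity score per word. A minimal sketch with plain NumPy, assuming the storage behaves like a 2-D array of shape (vocab size, dimensions) whose rows are unit length. The toy matrix and word list are made up for illustration and do not come from a real embedding file:

```python
import numpy as np

# Hypothetical stand-in for an embedding matrix: 4 words, 3 dimensions.
words = ["a", "b", "c", "d"]
matrix = np.array(
    [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.6, 0.8, 0.0],
     [0.0, 0.0, 1.0]],
    dtype=np.float32,
)

# Look up one row, as an embedding lookup would.
query = matrix[words.index("c")]

# (vocab, dims) @ (dims,) gives one score per word; for unit-length
# rows this score is the cosine similarity.
scores = matrix @ query

# Rank words by descending similarity.
ranking = [words[i] for i in np.argsort(-scores)]
print(ranking)
```

With real embeddings the same ranking idea underlies similarity queries, although the library's own query methods should be preferred over a hand-rolled version.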

Beyond Embeddings

# load only a vocab from a finalfusion file
from finalfusion import load_vocab
vocab = load_vocab("/path/to/finalfusion_file.fifu")

# serialize vocab to single file
vocab.write("/path/to/vocab_file.fifu.voc")

# more specific loading functions exist
from finalfusion.vocab import load_finalfusion_bucket_vocab
fifu_bucket_vocab = load_finalfusion_bucket_vocab("/path/to/vocab_file.fifu.voc")

The package supports loading and writing all finalfusion chunks this way. Standalone chunk files are only supported by the Python package; reading them with other implementations, e.g. finalfusion-rust, will fail.

Scripts

finalfusion also includes a conversion script ffp-convert to convert between the supported formats.

# convert from fastText format to finalfusion
$ ffp-convert -f fasttext fasttext.bin -t finalfusion embeddings.fifu

ffp-bucket-to-explicit can be used to convert bucket embeddings to embeddings with an explicit ngram lookup.

# convert finalfusion bucket embeddings to explicit
$ ffp-bucket-to-explicit -f finalfusion embeddings.fifu explicit.fifu

ffp-select generates a new embedding file from existing embeddings and a word list. For embeddings with a simple vocab, the result is a subset of the original embeddings. For subword embeddings, vectors for unknown words in the word list are computed and added to the new embeddings. The new vocabulary covers only the words in the word list, so the resulting embeddings can no longer provide representations for OOV words.
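
The selection logic can be sketched in plain Python. This is an illustration under stated assumptions, not the ffp-select implementation: `select`, `known` and `subword_embed` are hypothetical names, and the subword function simply averages made-up character vectors.

```python
import numpy as np

def select(known, word_list, subword_embed=None):
    """Build a new word -> vector table restricted to word_list.

    known: dict mapping in-vocabulary words to vectors.
    subword_embed: optional callable that computes a vector for an
        unknown word (hypothetical stand-in for a subword lookup).
    """
    selected = {}
    for word in word_list:
        if word in known:
            selected[word] = known[word]
        elif subword_embed is not None:
            # Subword case: compute and store a vector for the unknown word.
            selected[word] = subword_embed(word)
        # Simple vocab case (assumed here): unknown words are skipped.
    return selected

# Toy data, made up for illustration.
known = {"tree": np.array([1.0, 0.0]), "house": np.array([0.0, 1.0])}
alphabet = "abcdefghijklmnopqrstuvwxyz"
char_vecs = {c: np.full(2, float(i)) for i, c in enumerate(alphabet)}

def subword_embed(word):
    # Average of per-character vectors, standing in for ngram embeddings.
    return np.mean([char_vecs[c] for c in word], axis=0)

simple = select(known, ["tree", "boat"])
with_subwords = select(known, ["tree", "boat"], subword_embed)
print(sorted(simple), sorted(with_subwords))
```

In both cases the result is a fixed table: a word outside the word list has no entry, which mirrors the loss of OOV support described above.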

$ ffp-select large-embeddings.fifu subset-embeddings.fifu words.txt

Finally, the package comes with ffp-similar and ffp-analogy to run similarity and analogy queries.

# get the 5 nearest neighbours of "Tübingen"
$ echo Tübingen | ffp-similar embeddings.fifu
# get the 5 top answers for "Tübingen" is to "Stuttgart" like "Heidelberg" to...
$ echo Tübingen Stuttgart Heidelberg | ffp-analogy embeddings.fifu

finalfusion-python's People

Contributors

danieldk, sebpuetz

finalfusion-python's Issues

Expose model parameters

The interface could use getters for the various model parameters specified here, particularly the commonly used ones such as dimension, vocab size, context size, etc.

Restrict AppVeyor Builds to Pull Requests and Releases

I think AppVeyor is processing the builds sequentially, which takes a lot of time. We are currently building on both CI services when something is pushed to a branch and also when a PR is made.

Maybe we can restrict the AppVeyor builds to pull requests and releases? I'm not sure how release builds are triggered, so I don't know if this is actually doable.

NumPy dependency should be specified

If you try to get an embedding without having NumPy installed in the Python environment, you get a panic. Is it possible to have NumPy be automatically installed as a dependency when you run pip install finalfusion?

Consider replacing finalfusion-python by ffp?

Just wanted to have your opinion on this, @sebpuetz. I think my objection at the time was that we had to keep two implementations (Rust, Python) in sync. But:

  • finalfusion is mostly 'done' (API stable, data format stable), so having two implementations is less of an issue now.
  • In the meantime, we have seen cases of unsoundness in pyo3.

Investigate generation of documentation

Investigate current possibilities for documentation generation in pyo3 and pyo3-pack. Ideally, we'd generate something similar to readthedocs. This is the last item from #5.

Return None instead of raising exceptions

Most of our methods, e.g. embedding() and similarity(), raise exceptions if the model can't produce an embedding for the given input. We could also return None in those cases.

I'm not sure what's canonical Python here, but I'd prefer to do something like:

embedding = embeds.embedding("something oov")
if embedding is None:
    embedding = generic_oov_embed()

over

try:
    embedding = embeds.embedding("something oov")
except KeyError:
    embedding = generic_oov_embed()

What's your take on this?
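
For comparison, Python's own dict offers both styles: `d[key]` raises KeyError, while `d.get(key, default)` returns a fallback. A minimal sketch of how the two could coexist, using a hypothetical ToyEmbeddings class rather than the actual bindings:

```python
import numpy as np

class ToyEmbeddings:
    """Hypothetical stand-in for an embeddings object."""

    def __init__(self, table):
        self._table = table

    def __getitem__(self, word):
        # Strict lookup: raises KeyError for unknown words.
        return self._table[word]

    def embedding(self, word, default=None):
        # Lenient lookup: returns `default` (None unless given)
        # instead of raising.
        return self._table.get(word, default)

embeds = ToyEmbeddings({"known": np.ones(3)})

embedding = embeds.embedding("something oov")
if embedding is None:
    embedding = np.zeros(3)  # generic OOV vector, made up for the example
```

Keeping the strict form on `__getitem__` and the lenient form on a named method would let callers pick whichever reads better at the call site.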

Add typing to all methods

For example, __getitem__ on the storage types is missing type annotations, as are other methods. That leaves a lot of code paths un-analyzed because they implicitly become Any.

Release 0.4 branch

I would like to branch a new release, primarily to ensure that we are in sync with finalfusion and to get analogy queries with masking. It is nice to have the norms functionality as well.

@sebpuetz : do you want to get #32 in before the release?

Use of locale in analogy integration test

I think I missed this during reviewing, but the analogy integration test has:

export LC_ALL=en_US.UTF-8

This fails in two cases:

  • Sandboxed builds, in which glibc locales are typically not available.
  • People who have systems without this locale installed (e.g. because they use some non-English locale).

@sebpuetz Do you encounter problems when e.g. using export LC_ALL=C? (Which should avoid doing any parsing of character data.)

Add tests.

Put a very small model in the repo so we can run some tests from within Python? Otherwise I'm not sure how to include tests for the CI to pick up.

Release 0.5

We have added quite a few features in the last month, e.g. the read methods for other formats, exposure of subword indices and the embedding similarity method.

Once #51 is done we could cut a new release, unless there is something else in the pipeline for the Python bindings.

What do you think?

Batched embedding lookup

Individual lookups for quantized embeddings are extremely slow; batched lookups are a lot faster.
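
A rough sketch of what a batched lookup can look like, using a toy product quantizer in NumPy. The layout below (per-subquantizer codebooks plus integer codes) and all names and sizes are assumptions for illustration, not the finalfusion storage format or API. The point is only that a batch can be reconstructed with one vectorized gather instead of a Python-level loop:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy product quantizer (made-up sizes): each vector is stored as
# n_sub codes, each selecting one of n_centroids sub-vectors.
n_words, n_sub, n_centroids, sub_dims = 1000, 4, 16, 2
codebooks = rng.normal(size=(n_sub, n_centroids, sub_dims)).astype(np.float32)
codes = rng.integers(0, n_centroids, size=(n_words, n_sub))

def reconstruct_one(idx):
    # Gather one sub-vector per subquantizer and concatenate them.
    return codebooks[np.arange(n_sub), codes[idx]].reshape(-1)

def reconstruct_batch(indices):
    # Same gather for all rows at once; the intermediate result has
    # shape (batch, n_sub, sub_dims).
    batch_codes = codes[indices]
    gathered = codebooks[np.arange(n_sub)[None, :], batch_codes]
    return gathered.reshape(len(indices), -1)

indices = np.array([3, 14, 15, 92])
one_by_one = np.stack([reconstruct_one(i) for i in indices])
batched = reconstruct_batch(indices)
```

Both paths produce the same vectors; the batched version does the work in a single indexing operation.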

Release 0.7

  • Update setup.py: include README, update URL, add other URLs #106 & #104
  • Documentation #104
  • mention scripts in docs #116
  • Add release workflow #112
  • Fix installing wheels in CI on Windows (not going to install & test wheels for Windows on release workflows)
  • Add MANIFEST.in for sdists #113
  • Mention analogy + similarity scripts in README #114

I think the other issues can be resolved in 0.7.1 (e.g. error handlers for bad utf8 or batched lookups).

Is there anything else that you think should be done before the release?

Add convenience methods etc.

Not sure if there's more, but that's what I remember:

  • Method to get the full embedding matrix, suggested by @Blubberli
  • numpy compatibility, suggested by @twuebi
  • make vocab accessible from python?
  • Check documentation

Handling malformed UTF8

As of now, everything assumes all words are valid UTF-8. Perhaps add a lossy argument to the read method(s), analogous to finalfusion-rust.
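
In pure Python, lossy decoding maps onto the `errors` argument of `bytes.decode`. A minimal sketch, where `decode_word` and its `lossy` parameter are hypothetical names chosen for illustration and not part of the package:

```python
def decode_word(raw: bytes, lossy: bool = False) -> str:
    """Decode a word read from an embedding file.

    With lossy=True, invalid byte sequences become U+FFFD instead of
    raising UnicodeDecodeError.
    """
    return raw.decode("utf-8", errors="replace" if lossy else "strict")

print(decode_word("Tübingen".encode("utf-8")))
print(repr(decode_word(b"bad\xffword", lossy=True)))
```

Defaulting to strict decoding keeps malformed input visible unless the caller explicitly opts into replacement.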

Download pyo3-pack instead of cargo install

Downloading a precompiled version of pyo3-pack would shave off more than 10 minutes from build times. Any objections to this?

However, something seems to be messed up with the 0.6.1 binary: I can't run it locally or on CI. Bash just states that the file or directory doesn't exist. The newest release works just fine locally.
