Giter Club home page Giter Club logo

embcompare's Introduction

EmbCompare

A simple python tool for embedding comparison

EmbCompare is a small python package highly inspired by the Embedding Comparator tool that helps you compare your embeddings both visually and numerically.

Key features :

  • Visual comparison : GUI for comparison of two embeddings
  • Numerical comparison : Calculation of comparison indicators between two embeddings for monitoring purposes

EmbCompare keeps things simples. All computations are made in memory and the package does not bring any embedding storage management.

If you need a tool to store, compare and track your experiments, you may like the vectory project.

Table of content

๐Ÿ› ๏ธ Installation

# basic install
pip install embcompare

# installation with the gui tool
pip install embcompare[gui]

๐Ÿ‘ฉโ€๐Ÿ’ป Usage

EmbCompare provides a CLI with three sub-commands :

  • embcompare add is used to create or update a yaml file containing all embeddings infos : path, format, labels, term-frequencies, ... ;
  • embcompare report is used to generate json reports containing comparison metrics ;
  • embcompare gui is used to start a streamlit webapp to compare your embeddings visually.

Config file

EmbCompare use a yaml file for referencing embeddings and relevant informations. By default, EmbCompare is looking for a file named embcompare.yaml in the current working directory.

embeddings: 
    first_embedding:
        name: My first embedding
        path: /abspath/to/firstembedding.json
        format: json
        frequencies: /abspath/to/freqs.json
        frequencies_format: json
        labels: /abspath/to/labels.pkl
        labels_format: pkl
    second_embedding:
        name: My second embedding
        path: /abspath/to/secondembedding.json
        format: word2vec
        frequencies: /abspath/to/freqs.pkl
        frequencies_format: pkl
        labels: /abspath/to/labels.json
        labels_format: json

The embcompare add command allow to update this file programatically (and even create it if it does not exist).

JSON comparison report generation

EmbCompare aims to help to compare embedding thanks to numerical metrics that can be used to check if a new generated embedding is very different from the last one. The command embcompare report can be used in two ways :

  • With a single embedding. In this case it generate a small report about the embedding :
    embcompare report first_embedding
    # creates a first_embedding_report.json file containing some infos about the embedding
  • With two embeddings. In this case it generate a comparison report about the two embeddings :
    embcompare report first_embedding second_embedding
    # creates a first_embedding_second_embedding_report.json file containing comparison metrics

GUI

A short video overview of embcompare graphical user interface

The GUI is also very handy to compare embeddings. To start the GUI, use the commande embcompare gui. It will launch a streamlit app that will allow you to visually compare the embeddings you added in the configuration file.

๐Ÿ Python API

EmbCompare provide several classes to load and compare embeddings.

Embedding

The Embedding class is child of the gensim.KeyedVectors class.

It add few functionalities :

  • You can provide term frequencies so you can filter the elements later
  • You can easily compute all elements nearest neighbors (thanks to sklearn.neighbors.NearestNeighbors)
import json
import gensim.downloader as api
from embcompare import Embedding

word_vectors = api.load("glove-wiki-gigaword-100")
with open("frequencies.json", "r") as f:
  word_frequencies = json.load(f)

embedding = Embedding.load_from_keyedvectors(word_vectors, frequencies=word_frequencies)
neigh_dist, neigh_ind = embedding.compute_neighborhoods()

EmbeddingComparison

The EmbeddingComparison class is meant to compare two Embedding objects :

from embcompare import EmbeddingComparison, load_embedding

emb1 = load_embedding("first_emb.bin", embedding_format="fasttext", frequencies_path="freqs.pkl")
emb2 = load_embedding("second_emb.bin", embedding_format="word2vec", frequencies_path="freqs.pkl")

comparison = EmbeddingComparison({"emb1": emb1, "emb2": emb2}, n_neighbors=25)
comparison.neighborhoods_similarities["word"]
# 0.867

JSON reports

EmbeddingReport

The EmbeddingReport class is used to generate small report about an embedding :

from embcompare import EmbeddingReport, load_embedding

emb1 = load_embedding("first_emb.bin", embedding_format="fasttext", frequencies_path="freqs.pkl")
report = EmbeddingReport(emb1)
report.to_dict()
# { 
#   "vector_size": 300,
#   "mean_frequency": 0.00012,
#   "mean_distance_neighbors": 0.023,
#   ...
# }

EmbeddingComparisonReport

The EmbeddingComparisonReport class is used to generate small comparison report from two embedding :

from embcompare import EmbeddingComparison, EmbeddingComparisonReport, load_embedding

emb1 = load_embedding("first_emb.bin", embedding_format="fasttext", frequencies_path="freqs.pkl")
emb2 = load_embedding("second_emb.bin", embedding_format="word2vec", frequencies_path="freqs.pkl")

comparison = EmbeddingComparison({"emb1": emb1, "emb2": emb2})
report = EmbeddingComparisonReport(comparison)

report.to_dict()
# {
#   "embeddings" : [
#     { 
#       "vector_size": 300,
#       "mean_frequency": 0.00012,
#       "mean_distance_neighbors": 0.023,
#       ...
#     },
#     ...
#   ],
#   "neighborhoods_similarities_median": 0.012,
#   ...
# }

๐Ÿ“Š Create your custom streamlit app

The GUI is built with streamlit. We tried to modularized the app so you can more easily reuse some features for your custom streamlit app :

# embcompare/gui/app.py

from embcompare.gui.features import (
    display_custom_elements_comparison,
    display_elements_comparison,
    display_embeddings_config,
    display_frequencies_comparison,
    display_neighborhoods_similarities,
    display_numbers_of_elements,
    display_parameters_selection,
    display_spaces_comparison,
    display_statistics_comparison,
)
from embcompare.gui.helpers import create_comparison

def main():
    """Streamlit app for embeddings comparison"""
    config_embeddings = config[CONFIG_EMBEDDINGS]

    (
        tab_infos,
        tab_stats,
        tab_spaces,
        tab_neighbors,
        tab_compare,
        tab_compare_custom,
        tab_frequencies,
    ) = st.tabs(
        [
            "Infos",
            "Statistics",
            "Spaces",
            "Similarities",
            "Elements",
            "Search elements",
            "Frequencies",
        ]
    )

    # Embedding selection (inside the sidebar)
    with st.sidebar:
        parameters = display_parameters_selection(config_embeddings)

    # Display informations about embeddings
    with tab_infos:
        display_embeddings_config(
            config_embeddings, parameters.emb1_id, parameters.emb2_id
        )

    comparison = create_comparison(
        config_embeddings,
        emb1_id=parameters.emb1_id,
        emb2_id=parameters.emb2_id,
        n_neighbors=parameters.n_neighbors,
        max_emb_size=parameters.max_emb_size,
        min_frequency=parameters.min_frequency,
    )

    # Display number of element in both embedding and common elements
    with tab_infos:
        display_numbers_of_elements(comparison)

    # Display statistics
    with tab_stats:
        display_statistics_comparison(comparison)

    if not comparison.common_keys:
        st.warning("The embeddings have no element in common")
        st.stop()

    # Comparison below are based on common elements comparison
    with tab_spaces:
        display_spaces_comparison(comparison)

    with tab_neighbors:
        display_neighborhoods_similarities(comparison)

    with tab_compare:
        display_elements_comparison(comparison)

    with tab_compare_custom:
        display_custom_elements_comparison(comparison)

    with tab_frequencies:
        display_frequencies_comparison(comparison)

embcompare's People

Contributors

dependabot[bot] avatar h4c5 avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

embcompare's Issues

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.