
Comments (3)

yumeng5 avatar yumeng5 commented on May 31, 2024 2

Hi Loreto,

Thanks for the question. You are right that the current framework learns word-based embeddings (Word2Vec-like), instead of character n-gram/subword embeddings (fastText-like).

To incorporate subword information into embedding learning, fastText extends Word2Vec by decomposing a word embedding vector into the summation of all its character n-gram embedding vectors.

The above approach is intuitive and straightforward in Euclidean space, but it may not be easy to adapt to the spherical space. In spherical text embedding, each embedding vector (word/paragraph) is constrained to the unit sphere (vector norm = 1), and it does not hold that a summation of unit vectors still has unit norm. For example, if "faster" is decomposed into "fast" and "er", then we have v_{faster} = v_{fast} + v_{er} according to the fastText implementation, but the unit norm constraint ||v_{faster}|| = ||v_{fast}|| = ||v_{er}|| = 1 is then generally violated. In short, spherical embeddings require a better design than simply decomposing a word vector into the summation of its subword vectors.
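To make the norm violation concrete, here is a minimal sketch with NumPy. The vectors for "fast" and "er" are random unit vectors standing in for learned subword embeddings (they are illustrative, not taken from any trained model):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_unit_vector(dim):
    """Sample a random vector and normalize it onto the unit sphere."""
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

# Hypothetical subword embeddings for "fast" and "er", both unit-norm.
v_fast = random_unit_vector(100)
v_er = random_unit_vector(100)

# fastText-style composition: the word vector is the sum of its subword vectors.
v_faster = v_fast + v_er

print(np.linalg.norm(v_fast))    # unit norm by construction
print(np.linalg.norm(v_er))      # unit norm by construction
print(np.linalg.norm(v_faster))  # generally not 1, violating the constraint
```

In fact ||v_fast + v_er||^2 = 2 + 2 (v_fast . v_er), which equals 1 only in the special case v_fast . v_er = -1/2.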

An easy fix would be to decompose a word vector into the normalized summation of its character n-grams, but this lacks a clear theoretical justification, so we did not incorporate such a design into our current framework. That being said, if this method does lead to encouraging results, it might still be beneficial for subword embedding learning. I might come back at some point in the future to try it, but I'll have to leave it as future work for now.
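The "easy fix" above can be sketched in a few lines; the function name is illustrative, not part of the paper's code:

```python
import numpy as np

def spherical_compose(subword_vectors):
    """Sum unit-norm subword vectors, then project the sum back onto the
    unit sphere, so the composed word vector satisfies ||v|| = 1."""
    s = np.sum(subword_vectors, axis=0)
    norm = np.linalg.norm(s)
    if norm == 0:
        raise ValueError("subword vectors cancel out; cannot normalize")
    return s / norm
```

The composed vector satisfies the unit-norm constraint by construction, but, as noted above, renormalizing the sum is an ad hoc step without a clear theoretical interpretation in the spherical model.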

Please let me know if you have further concerns/questions!

Best,
Yu

from spherical-text-embedding.

loretoparisi avatar loretoparisi commented on May 31, 2024 1

@yumeng5 thanks for the explanation. In recent language models, subwording (as Word2Vec / FastText do it) has been effectively replaced by Byte Pair Encoding, which introduces subword units encoded as codes and is available in several flavours like SentencePiece, WordPiece, etc. A very old but good implementation is subword-nmt, while a super blazing fast one (via Rust) is HuggingFace tokenizers.
That said, I wonder if the same concept of unit norm constraint

||v_{faster}|| = ||v_{fast}|| = ||v_{er}|| = 1

would be violated when using BPE codes, or whether, due to the nature of byte pair encoding, it would be possible to follow this approach.

yumeng5 avatar yumeng5 commented on May 31, 2024 1

Hi @loretoparisi,

Thanks for bringing this up. According to my understanding, the BPE-like approaches essentially construct vocabularies with subword units instead of whole-words. And this subword segmentation step by itself does not prevent any assumptions from being made regarding the word embedding space (Euclidean, spherical, etc.) because the embedding learning procedure is independent of the content of the vocabulary.

The only case that makes applying the spherical embedding approach difficult is when we make additional assumptions about the subword embeddings that conflict with the spherical space constraints. As mentioned in my previous post, fastText assumes that a whole-word embedding is the summation of its subword embeddings (e.g., v_{faster} = v_{fast} + v_{er}), which conflicts with the unit norm constraint. However, if we drop this assumption and treat each subword unit as an independent word, nothing prevents us from learning their spherical embeddings.
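The point above can be sketched as follows: give each BPE unit its own independent unit-norm embedding and update it with a sphere-constrained step. This is a toy illustration (the tokens and the `riemannian_step` update are illustrative, not the paper's actual optimizer):

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 50

# Each BPE unit is an independent vocabulary entry; no summation assumption
# ties "fast" + "er" to "faster" (hypothetical tokens for illustration).
vocab = ["fast", "er", "faster", "run", "ning"]
emb = {}
for tok in vocab:
    v = rng.standard_normal(DIM)
    emb[tok] = v / np.linalg.norm(v)  # initialize on the unit sphere

def riemannian_step(v, grad, lr=0.1):
    """Toy sphere-constrained update: project the gradient onto the tangent
    space at v, take a gradient step, then retract back to unit norm."""
    tangent = grad - np.dot(grad, v) * v
    v_new = v - lr * tangent
    return v_new / np.linalg.norm(v_new)

# After any such update, every token embedding remains on the unit sphere.
emb["faster"] = riemannian_step(emb["faster"], rng.standard_normal(DIM))
```

Since each token is updated independently with a retraction back to the sphere, the unit-norm constraint holds for every vocabulary entry regardless of how the BPE segmentation was produced.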

I hope this answers your question! Let me know if anything remains unclear.

Best,
Yu
