
Comments (3)

yumeng5 avatar yumeng5 commented on May 31, 2024 2

Hi Loreto,

Thanks for the question. You are right that the current framework learns word-based embeddings (Word2Vec-like), instead of character n-gram/subword embeddings (fastText-like).

To incorporate subword information into embedding learning, fastText extends Word2Vec by decomposing a word embedding vector into the summation of all its character n-gram embedding vectors.

The above approach is intuitive and straightforward in Euclidean space, but it may not be easy to adapt to the spherical space. In spherical text embedding, each embedding vector (word/paragraph) is constrained to the unit sphere (vector norm = 1), and it does not hold that a summation of unit vectors still has unit norm. For example, if "faster" is decomposed into "fast" and "er", then we have v_{faster} = v_{fast} + v_{er} according to the fastText implementation, but the unit norm constraint ||v_{faster}|| = ||v_{fast}|| = ||v_{er}|| = 1 is then generally violated. In short, spherical embeddings require a better design than simply decomposing a word vector into the summation of its subword vectors.
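To make the norm violation concrete, here is a minimal sketch with NumPy. The vectors for "fast" and "er" are random unit vectors standing in for learned subword embeddings (they are illustrative, not taken from any trained model):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_unit_vector(dim):
    """Sample a random vector and normalize it onto the unit sphere."""
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

# Hypothetical subword embeddings for "fast" and "er", both unit-norm.
v_fast = random_unit_vector(100)
v_er = random_unit_vector(100)

# fastText-style composition: the word vector is the sum of its subword vectors.
v_faster = v_fast + v_er

print(np.linalg.norm(v_fast))    # unit norm by construction
print(np.linalg.norm(v_er))      # unit norm by construction
print(np.linalg.norm(v_faster))  # generally not 1, violating the constraint
```

In fact ||v_fast + v_er||^2 = 2 + 2 (v_fast . v_er), which equals 1 only in the special case v_fast . v_er = -1/2.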

An easy fix would be to decompose a word vector into the normalized summation of its character n-grams, but this lacks a clear theoretical justification, so we did not incorporate such a design into our current framework. That being said, if this method does lead to encouraging results, it might still be beneficial for subword embedding learning. I might come back at some point in the future to try it, but I'll have to leave it as future work for now.
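The "easy fix" above can be sketched in a few lines; the function name is illustrative, not part of the paper's code:

```python
import numpy as np

def spherical_compose(subword_vectors):
    """Sum unit-norm subword vectors, then project the sum back onto the
    unit sphere, so the composed word vector satisfies ||v|| = 1."""
    s = np.sum(subword_vectors, axis=0)
    norm = np.linalg.norm(s)
    if norm == 0:
        raise ValueError("subword vectors cancel out; cannot normalize")
    return s / norm
```

The composed vector satisfies the unit-norm constraint by construction, but, as noted above, renormalizing the sum is an ad hoc step without a clear theoretical interpretation in the spherical model.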

Please let me know if you have further concerns/questions!

Best,
Yu

from spherical-text-embedding.

loretoparisi avatar loretoparisi commented on May 31, 2024 1

@yumeng5 thanks for the explanation. In recent language models, subwording (as Word2Vec / FastText do it) has been effectively replaced by Byte Pair Encoding, which introduces subword units encoded as codes and is available in several flavours like SentencePiece, WordPiece, etc. A very old but good implementation is subword-nmt, while a super blazing fast one (via Rust) is HuggingFace tokenizers.
That said, I wonder if the same concept of unit norm constraint

||v_{faster}|| = ||v_{fast}|| = ||v_{er}|| = 1

would be violated when using BPE codes, or whether, due to the nature of byte pair encoding, it would be possible to follow this approach.

yumeng5 avatar yumeng5 commented on May 31, 2024 1

Hi @loretoparisi,

Thanks for bringing this up. According to my understanding, the BPE-like approaches essentially construct vocabularies with subword units instead of whole-words. And this subword segmentation step by itself does not prevent any assumptions from being made regarding the word embedding space (Euclidean, spherical, etc.) because the embedding learning procedure is independent of the content of the vocabulary.

The only case that makes applying the spherical embedding approach difficult is when we make additional assumptions about the subword embeddings that conflict with the spherical space constraints. As mentioned in my previous post, fastText assumes that a whole-word embedding is the summation of its subword embeddings (e.g., v_{faster} = v_{fast} + v_{er}), which conflicts with the unit norm constraint. However, if we drop this assumption and treat each subword unit as an independent word, nothing prevents us from learning their spherical embeddings.
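The point above can be sketched as follows: give each BPE unit its own independent unit-norm embedding and update it with a sphere-constrained step. This is a toy illustration (the tokens and the `riemannian_step` update are illustrative, not the paper's actual optimizer):

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 50

# Each BPE unit is an independent vocabulary entry; no summation assumption
# ties "fast" + "er" to "faster" (hypothetical tokens for illustration).
vocab = ["fast", "er", "faster", "run", "ning"]
emb = {}
for tok in vocab:
    v = rng.standard_normal(DIM)
    emb[tok] = v / np.linalg.norm(v)  # initialize on the unit sphere

def riemannian_step(v, grad, lr=0.1):
    """Toy sphere-constrained update: project the gradient onto the tangent
    space at v, take a gradient step, then retract back to unit norm."""
    tangent = grad - np.dot(grad, v) * v
    v_new = v - lr * tangent
    return v_new / np.linalg.norm(v_new)

# After any such update, every token embedding remains on the unit sphere.
emb["faster"] = riemannian_step(emb["faster"], rng.standard_normal(DIM))
```

Since each token is updated independently with a retraction back to the sphere, the unit-norm constraint holds for every vocabulary entry regardless of how the BPE segmentation was produced.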

I hope this answers your question! Let me know if anything remains unclear.

Best,
Yu
