Comments (3)
Hi Loreto,
Thanks for the question. You are right that the current framework learns word-based embeddings (Word2Vec-like) rather than character n-gram/subword embeddings (fastText-like).
To incorporate subword information into embedding learning, fastText extends Word2Vec by decomposing a word embedding vector into the summation of all its character n-gram embedding vectors.
The above approach is intuitive and straightforward in Euclidean space, but it is not easy to adapt to the spherical space. In spherical text embedding, each embedding vector (word/paragraph) is constrained to the unit sphere (vector norm = 1), and the summation of unit vectors does not, in general, have unit norm. For example, if "faster" is decomposed into "fast" and "er", then the fastText implementation gives
v_{faster} = v_{fast} + v_{er},
but then the unit norm constraint
||v_{faster}|| = ||v_{fast}|| = ||v_{er}|| = 1
might be violated. In short, spherical embeddings require a better design than simply decomposing a word vector into the summation of its subword vectors.
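As a concrete check (a minimal numpy sketch with made-up stand-in vectors, not code from this repo), summing two unit vectors generally leaves the sphere:

```python
import numpy as np

# Stand-ins for v_{fast} and v_{er}: two unit vectors on the sphere.
v_fast = np.array([1.0, 0.0, 0.0])
v_er = np.array([0.0, 1.0, 0.0])

# fastText-style composition: sum the subword vectors.
v_faster = v_fast + v_er

print(np.linalg.norm(v_faster))  # 1.4142135623730951 (sqrt(2)), not 1
```

Here the sum has norm sqrt(2), so the unit norm constraint on v_{faster} is violated even though both components are unit vectors.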
An easy fix would be to decompose a word vector into the normalized summation of its character n-gram vectors, but there is no clear theoretical justification for doing so, so we did not incorporate this kind of design into our current framework. That being said, if this method does lead to encouraging results, it might still be beneficial for subword embedding learning. I might come back at some point in the future to try it, but I'll have to leave it as future work for now.
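The normalized-summation fix mentioned above could be sketched like this (a hypothetical helper for illustration, not part of the released code):

```python
import numpy as np

def compose_on_sphere(subword_vecs):
    """Sum the subword embeddings, then project back onto the unit sphere."""
    s = np.sum(subword_vecs, axis=0)
    return s / np.linalg.norm(s)

v_fast = np.array([1.0, 0.0, 0.0])
v_er = np.array([0.0, 1.0, 0.0])

v_faster = compose_on_sphere([v_fast, v_er])
print(np.linalg.norm(v_faster))  # norm is back to 1 (up to floating point)
```

The projection keeps the composed vector on the sphere, but, as noted above, the summation-then-normalization step has no clear theoretical grounding in the spherical generative model.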
Please let me know if you have further concerns/questions!
Best,
Yu
from spherical-text-embedding.
@yumeng5 thanks for the explanation. In recent language models, the subword handling (of the kind Word2Vec / fastText does) has been effectively replaced by Byte Pair Encoding (BPE), which introduces subword units encoded as codes and is available in several flavours such as SentencePiece, WordPiece, etc. A very old but good implementation is subword-nmt, while a blazing-fast one (via Rust) is HuggingFace tokenizers.
That said, I wonder if the same unit norm constraint
||v_{faster}|| = ||v_{fast}|| = ||v_{er}|| = 1
would be violated when using BPE codes, or whether, due to the nature of byte pair encoding, it would be possible to follow this approach.
Hi @loretoparisi,
Thanks for bringing this up. According to my understanding, the BPE-like approaches essentially construct vocabularies with subword units instead of whole words. This subword segmentation step by itself does not prevent any assumptions from being made regarding the word embedding space (Euclidean, spherical, etc.), because the embedding learning procedure is independent of the content of the vocabulary.
The only case that makes applying the spherical embedding approach difficult is when we make additional assumptions about the subword embeddings that conflict with the spherical space constraints. As mentioned in my previous post, fastText assumes that the whole-word embedding is the summation of its subword embeddings (e.g., v_{faster} = v_{fast} + v_{er}), which conflicts with the unit norm constraint. However, if we drop this assumption and treat each subword unit as an independent word, nothing prevents us from learning their spherical embeddings.
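Treating BPE units as independent vocabulary items could look like the following toy sketch (hypothetical names and a toy vocabulary; the actual training in this repo uses Riemannian optimization, and this only illustrates keeping each token embedding at unit norm):

```python
import numpy as np

rng = np.random.default_rng(0)
bpe_vocab = ["fast", "er", "slow", "est"]  # toy BPE-style subword units

# One independent unit-norm embedding per subword unit; no summation
# assumption ties "fast" + "er" to "faster".
emb = {tok: rng.standard_normal(8) for tok in bpe_vocab}
for tok in emb:
    emb[tok] /= np.linalg.norm(emb[tok])

def retract(v, grad, lr=0.1):
    """Take a Euclidean gradient step, then project back onto the sphere."""
    v = v - lr * grad
    return v / np.linalg.norm(v)

# Example update: the updated token stays on the unit sphere.
emb["fast"] = retract(emb["fast"], rng.standard_normal(8))
assert all(np.isclose(np.linalg.norm(v), 1.0) for v in emb.values())
```

Because each subword unit gets its own constrained vector, the unit norm constraint is satisfied by construction, with no conflict of the fastText-summation kind.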
I hope this answers your question! Let me know if anything remains unclear.
Best,
Yu