
Cookbook

Embeddings

Word Vector

  1. Word2Vec Tutorial - The Skip-Gram Model - By Chris McCormick
  2. Word2Vec Tutorial Part 2 - Negative Sampling - By Chris McCormick
  3. Coding Word2vec from Scratch
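
To accompany item 3, here is a compressed, illustrative sketch of skip-gram with negative sampling in NumPy; the toy corpus, embedding size, window, and learning rate are assumptions for demonstration, not values taken from the tutorials above.

import numpy as np

corpus = "the quick brown fox jumps over the lazy dog".split()
vocab = sorted(set(corpus))
word2id = {w: i for i, w in enumerate(vocab)}
V, D, window, neg_k, lr = len(vocab), 16, 2, 3, 0.05

rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.1, size=(V, D))    # center-word ("input") embeddings
W_out = rng.normal(scale=0.1, size=(V, D))   # context-word ("output") embeddings

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for epoch in range(200):
    for pos, center in enumerate(corpus):
        c = word2id[center]
        for offset in range(-window, window + 1):
            if offset == 0 or not 0 <= pos + offset < len(corpus):
                continue
            o = word2id[corpus[pos + offset]]
            # one positive (true context) pair plus neg_k randomly drawn negative pairs
            pairs = [(o, 1.0)] + [(int(rng.integers(V)), 0.0) for _ in range(neg_k)]
            for idx, label in pairs:
                score = sigmoid(W_in[c] @ W_out[idx])
                g = score - label                    # gradient of the logistic loss w.r.t. the dot product
                grad_center = g * W_out[idx].copy()  # keep the pre-update output vector
                W_out[idx] -= lr * g * W_in[c]
                W_in[c] -= lr * grad_center

print(W_in[word2id["fox"]][:5])   # first few dimensions of the learned vector for "fox"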

From Tokenization to Embeddings

| Aspect | Tokenization | Embeddings |
| --- | --- | --- |
| Definition | The process of splitting text into smaller units (tokens) such as words, subwords, or characters. | The representation of tokens as dense vectors in a continuous vector space. |
| Purpose | To convert text into a format that can be processed by models. | To capture the semantic meaning and relationships between tokens. |
| Output | A sequence of tokens (strings or subunits). | A sequence of vectors (numerical representations). |
| Role in NLP | Initial step in text processing. | Used after tokenization for feature extraction and input to models. |
| Examples | Splitting "Hello world!" into ["Hello", "world", "!"]. | Representing "Hello" as [0.25, -0.14, 0.44, ...]. |
| Techniques | Word-level, subword-level (BPE, WordPiece), character-level. | Word2Vec, GloVe, FastText, BERT embeddings. |
| Data Dependency | Does not inherently learn from data; it is a rule-based or deterministic process. | Learned from large corpora to capture contextual information. |
| Flexibility | Less flexible; depends on predefined rules. | Highly flexible; adapts to various contexts and uses. |
| Computational Complexity | Low complexity; simple to implement. | Higher complexity; requires training on large datasets. |
| Visualization Tool | https://www.pinecone.io/learn/tokenization/ | https://docs.wandb.ai/guides/app/features/panels/query-panel/embedding-projector |
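
To make the table concrete, here is a small sketch of both steps side by side; it assumes the tiktoken tokenizer and the OpenAI embeddings endpoint used later in this cookbook.

import tiktoken
from openai import OpenAI

client = OpenAI()
text = "Hello world!"

# 1) Tokenization: deterministic, vocabulary/rule-based, returns discrete token IDs
enc = tiktoken.get_encoding("cl100k_base")
token_ids = enc.encode(text)
print(token_ids)                               # e.g. [9906, 1917, 0]
print([enc.decode([t]) for t in token_ids])    # the token strings

# 2) Embeddings: learned, returns one dense vector per input text
vector = client.embeddings.create(
    input=[text], model="text-embedding-3-small"
).data[0].embedding
print(len(vector), vector[:4])                 # dimensionality and the first few values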

From Word2Vec to Transformers

Q. How can you represent a word in N dimensions and represent any long sequence of words using the same N dimensions? Won't that compress the meaning of that long sequence of words to a word-equivalent?

Claude AI: You raise a very valid point. Representing a long sequence of words, like a sentence or document, as a fixed-length vector of the same dimensionality as word embeddings can indeed result in a loss of information and compression of meaning.

The key reason this is possible is that word embeddings and sequence embeddings capture different types of information and operate at different levels of granularity.

  • Word embeddings, like those from Word2Vec or GloVe, represent individual words in a continuous vector space, where each dimension encodes some aspect of the word's meaning, usage, or relationship to other words. These embeddings are trained to capture semantic and syntactic similarities between words based on their co-occurrences in large corpora.

  • On the other hand, sequence embeddings, which are typically generated by neural network models like RNNs, LSTMs, or Transformers, aim to capture the meaning and context of an entire sequence of words, not just individual words. These models process the sequence of word embeddings and learn to combine and contextualize the information in a way that represents the overall meaning of the sequence.

While the sequence embedding has the same dimensionality as the word embeddings, each dimension in the sequence embedding vector now encodes a different aspect of the sequence's meaning, rather than just the meaning of individual words. The neural network model learns to compose and distribute the information from the word embeddings into this fixed-length representation in a way that preserves as much of the relevant information as possible.

However, you are correct that there is inevitably some loss of information and compression of meaning when representing a long sequence as a fixed-length vector. The degree of compression and information loss depends on the complexity of the sequence, the capacity of the neural network model, and the specific task or application.

To mitigate this issue, several techniques are employed:

  1. Attention Mechanisms: Models like Transformers use attention mechanisms to selectively focus on the most relevant parts of the input sequence when generating the sequence embedding, reducing the need to compress all information into a single fixed vector (a minimal sketch of self-attention follows after this list).

  2. Hierarchical Architectures: Some models use hierarchical architectures that first generate embeddings for smaller units (e.g., sentences), which are then combined to create document-level embeddings, allowing for more granular representation.

  3. Contextualized Embeddings: Models like BERT and ELMo generate contextualized word embeddings, which are dynamically informed by the surrounding context, rather than using static word embeddings. This can help preserve more nuanced meaning in the sequence embedding.

  4. Multi-Task Learning: Training models on multiple tasks simultaneously, such as language modeling, translation, and question answering, can help them learn richer and more transferable sequence representations.
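
As a rough illustration of point 1, here is a minimal scaled dot-product self-attention sketch in NumPy; the shapes and random values are assumptions, and a real Transformer layer adds learned projections, multiple heads, and residual connections.

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K, V have shape (seq_len, d); returns contextualized vectors of shape (seq_len, d)."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                      # pairwise relevance of every token to every other token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax: attention weights per token
    return weights @ V                                 # each output is a weighted mix of all value vectors

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                            # 5 tokens, 8-dimensional embeddings
out = scaled_dot_product_attention(X, X, X)            # self-attention over the token embeddings
print(out.shape)                                       # (5, 8): same shape, but each vector is context-aware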

While sequence embeddings can compress meaning to some extent, they still provide a powerful way to capture and leverage the semantic content of longer texts in a compact and computationally efficient manner. The trade-off between compression and information preservation is an active area of research, with ongoing efforts to develop more expressive and effective sequence representation techniques.
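
As a sketch of how a single fixed-length sequence embedding can be derived from contextualized token embeddings (point 3), a common approach is attention-mask-aware mean pooling; the Hugging Face checkpoint named below is only an assumed example, and any encoder that returns last_hidden_state works the same way.

import torch
from transformers import AutoTokenizer, AutoModel

model_name = "sentence-transformers/all-MiniLM-L6-v2"   # assumed example checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

sentences = ["The movie was great.",
             "A long review spanning many words still maps to one vector."]
batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    token_vectors = model(**batch).last_hidden_state    # (batch, seq_len, hidden): one vector per token

mask = batch["attention_mask"].unsqueeze(-1)             # ignore padding positions when pooling
sequence_embeddings = (token_vectors * mask).sum(1) / mask.sum(1)
print(sequence_embeddings.shape)                          # (2, hidden): one vector per sequence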

Visualization

  1. Visualizing embeddings in 2D/3D: In the 3D plot, only a few negative (1/2-star) reviews appear clearly separable from positive (4/5-star) reviews; many others sit in tightly neighbouring or shared regions of the space. One wishes the exercise could be repeated in one more dimension, where the separating boxes/planes/lines might become clearer for more data points.

  2. Chunking: Every tokenizer and GPT model has a token limit (i.e. a maximum input length), so chunking breaks texts that exceed this limit into processable pieces; it also allows concurrent processing, potentially speeding up the overall processing time. E.g. to translate a book, we first split the book into chunks, each roughly a page long, then translate each chunk, and finally stitch the translations back together (see the sketch after this list).

  3. Custom or fine-tuned embeddings: OpenAI's embedding model weights cannot be fine-tuned, but we can use training data to customize embeddings for our application.
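
The chunking step from item 2 can be sketched with a token-based splitter; the tiktoken encoding, chunk size, and overlap below are illustrative assumptions.

import tiktoken

def chunk_text(text: str, max_tokens: int = 500, overlap: int = 50,
               encoding_name: str = "cl100k_base") -> list[str]:
    """Split text into chunks of at most max_tokens tokens, sharing `overlap` tokens between neighbours."""
    enc = tiktoken.get_encoding(encoding_name)
    tokens = enc.encode(text)
    chunks = []
    start = 0
    while start < len(tokens):
        window = tokens[start:start + max_tokens]
        chunks.append(enc.decode(window))
        start += max_tokens - overlap          # step forward, keeping some overlap for continuity
    return chunks

book_text = "..."                               # load the full book text here
pages = chunk_text(book_text, max_tokens=800)   # roughly page-sized chunks
print(len(pages))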

Classification

  1. Zero Shot Classification: This turned into a fun puzzle! It was fun trying to improve the accuracy of the "prompt" (that is, the sentences representing the positive and negative classes, which are fed into the embedding model) over what's given in the OpenAI blog, but in the end overall accuracy remained the same, with higher precision but lower recall. It should be noted that the data is heavily skewed towards positive (particularly 5-star) reviews, and the scores don't necessarily correlate directly with the opinion of the product; e.g. negative reviews may compliment the product while saying it was not what the reviewer had anticipated, or that they don't like any product in this category. A minimal sketch of this approach follows after this list.

  2. Multi-Class Classification: The data mostly consists of unique users interacting with unique products. Representing a user as the (possibly weighted) average of the embeddings of products they interacted with in the training set yields only a weak correlation between the similarity of that user embedding to a new product's embedding and the user's rating of that product. I tried extending the same experiment to a multi-class classification problem, to see whether collaborative filtering or a neural network could classify rating scores accurately, but both did worse. For this data, it can be inferred that review texts (or rather their embeddings) correlate very reliably with positive/negative sentiment classes but very poorly with the five rating scores.
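
Here is the minimal sketch of embedding-based zero-shot sentiment classification referenced in item 1; comparing a review's embedding against embeddings of class-describing sentences is the general approach, while the specific label wording and model choice below are assumptions.

import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str], model: str = "text-embedding-3-small") -> np.ndarray:
    response = client.embeddings.create(input=texts, model=model)
    return np.array([item.embedding for item in response.data])

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# The "prompt": one sentence describing each class, embedded once up front
labels = ["An Amazon review with a positive sentiment.",
          "An Amazon review with a negative sentiment."]
label_vectors = embed(labels)

def classify(review: str) -> str:
    review_vector = embed([review])[0]
    scores = [cosine(review_vector, lv) for lv in label_vectors]
    return "positive" if scores[0] > scores[1] else "negative"

print(classify("Arrived broken and the seller never replied."))   # expected: negative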

Retrieval Augmented Generation

Code Walkthroughs

  1. https://github.com/mrdbourke/simple-local-rag: All the way from PDF ingestion to "chat with PDF" style features using open-source tools.
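
For orientation, here is a compressed sketch of the retrieval step that such "chat with PDF" pipelines implement: embed the chunks, retrieve the most similar ones for a question, and pass them to the model as context. The chunk contents, model names, and k below are illustrative assumptions, not the walkthrough's actual code.

import numpy as np
from openai import OpenAI

client = OpenAI()

chunks = ["Page 1 text ...", "Page 2 text ...", "Page 3 text ..."]   # from the ingestion/chunking step
chunk_vectors = np.array([
    item.embedding
    for item in client.embeddings.create(input=chunks, model="text-embedding-3-small").data
])

def retrieve(question: str, k: int = 2) -> list[str]:
    q = np.array(client.embeddings.create(input=[question],
                                          model="text-embedding-3-small").data[0].embedding)
    sims = chunk_vectors @ q / (np.linalg.norm(chunk_vectors, axis=1) * np.linalg.norm(q))
    return [chunks[i] for i in np.argsort(-sims)[:k]]                # top-k most similar chunks

question = "What does page 2 say?"
context = "\n\n".join(retrieve(question))
answer = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user",
               "content": f"Answer using only this context:\n{context}\n\nQuestion: {question}"}],
).choices[0].message.content
print(answer)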

Working With Foundation Model Services

Rate Limiting

It's recommended to use the 'tenacity' package or another exponential backoff implementation to manage API rate limits, since sending too many requests too quickly will trigger them. Compare the naive loop below with the retry-wrapped function that follows it; the latter gets your embeddings as fast as possible without tripping the limiter.

# Negative example (slow and rate-limited)
from openai import OpenAI
client = OpenAI()

num_embeddings = 10000 # Some large number
for i in range(num_embeddings):
    embedding = client.embeddings.create(
        input="Your text goes here", model="text-embedding-3-small"
    ).data[0].embedding
    print(len(embedding))

# Best practice
from tenacity import retry, wait_random_exponential, stop_after_attempt
from openai import OpenAI
client = OpenAI()

# Retry up to 6 times with exponential backoff, starting at 1 second and maxing out at 20 seconds delay
@retry(wait=wait_random_exponential(min=1, max=20), stop=stop_after_attempt(6))
def get_embedding(text: str, model="text-embedding-3-small") -> list[float]:
    return client.embeddings.create(input=[text], model=model).data[0].embedding

embedding = get_embedding("Your text goes here", model="text-embedding-3-small")
print(len(embedding))

However, if you're processing large volumes of batch data, where throughput matters more than latency, there are a few other things you can do in addition to backing off and retrying, e.g. batching requests and proactively adding delay between requests. Here is an example script for parallel processing of large quantities of API requests: api_request_parallel_processor.py.
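
A simpler sketch of combining batching with a proactive delay is shown below; the batch size and sleep interval are illustrative assumptions, not recommended values.

import time
from openai import OpenAI

client = OpenAI()

texts = [f"Your text number {i} goes here" for i in range(10_000)]
batch_size = 100                                  # the embeddings endpoint accepts a list of inputs

all_embeddings = []
for start in range(0, len(texts), batch_size):
    batch = texts[start:start + batch_size]
    response = client.embeddings.create(input=batch, model="text-embedding-3-small")
    all_embeddings.extend(item.embedding for item in response.data)
    time.sleep(0.5)                               # proactive delay to stay under the rate limit

print(len(all_embeddings))                        # 10000 vectors from only 100 requests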
