
Cookbook

Embeddings

Word Vector

  1. Word2Vec Tutorial - The Skip-Gram Model - By Chris McCormick
  2. Word2Vec Tutorial Part 2 - Negative Sampling - By Chris McCormick
  3. Coding Word2vec from Scratch
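
To accompany item 3, here is a compressed, illustrative sketch of skip-gram with negative sampling in NumPy; the toy corpus, embedding size, window, and learning rate are assumptions for demonstration, not values taken from the tutorials above.

import numpy as np

corpus = "the quick brown fox jumps over the lazy dog".split()
vocab = sorted(set(corpus))
word2id = {w: i for i, w in enumerate(vocab)}
V, D, window, neg_k, lr = len(vocab), 16, 2, 3, 0.05

rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.1, size=(V, D))    # center-word ("input") embeddings
W_out = rng.normal(scale=0.1, size=(V, D))   # context-word ("output") embeddings

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for epoch in range(200):
    for pos, center in enumerate(corpus):
        c = word2id[center]
        for offset in range(-window, window + 1):
            if offset == 0 or not 0 <= pos + offset < len(corpus):
                continue
            o = word2id[corpus[pos + offset]]
            # one positive (true context) pair plus neg_k randomly drawn negative pairs
            pairs = [(o, 1.0)] + [(int(rng.integers(V)), 0.0) for _ in range(neg_k)]
            for idx, label in pairs:
                score = sigmoid(W_in[c] @ W_out[idx])
                g = score - label                    # gradient of the logistic loss w.r.t. the dot product
                grad_center = g * W_out[idx].copy()  # keep the pre-update output vector
                W_out[idx] -= lr * g * W_in[c]
                W_in[c] -= lr * grad_center

print(W_in[word2id["fox"]][:5])   # first few dimensions of the learned vector for "fox"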

From Tokenization to Embeddings

| Aspect | Tokenization | Embeddings |
| --- | --- | --- |
| Definition | The process of splitting text into smaller units (tokens) such as words, subwords, or characters. | The representation of tokens as dense vectors in a continuous vector space. |
| Purpose | To convert text into a format that can be processed by models. | To capture the semantic meaning and relationships between tokens. |
| Output | A sequence of tokens (strings or subunits). | A sequence of vectors (numerical representations). |
| Role in NLP | Initial step in text processing. | Used after tokenization for feature extraction and input to models. |
| Examples | Splitting "Hello world!" into ["Hello", "world", "!"]. | Representing "Hello" as [0.25, -0.14, 0.44, ...]. |
| Techniques | Word-level, subword-level (BPE, WordPiece), character-level. | Word2Vec, GloVe, FastText, BERT embeddings. |
| Data Dependency | Does not inherently learn from data; it is a rule-based or deterministic process. | Learned from large corpora to capture contextual information. |
| Flexibility | Less flexible; depends on predefined rules. | Highly flexible; adapts to various contexts and uses. |
| Computational Complexity | Low complexity; simple to implement. | Higher complexity; requires training on large datasets. |
| Visualization Tool | https://www.pinecone.io/learn/tokenization/ | https://docs.wandb.ai/guides/app/features/panels/query-panel/embedding-projector |
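
To make the table concrete, here is a small sketch of both steps side by side; it assumes the tiktoken tokenizer and the OpenAI embeddings endpoint used later in this cookbook.

import tiktoken
from openai import OpenAI

client = OpenAI()
text = "Hello world!"

# 1) Tokenization: deterministic, vocabulary/rule-based, returns discrete token IDs
enc = tiktoken.get_encoding("cl100k_base")
token_ids = enc.encode(text)
print(token_ids)                               # e.g. [9906, 1917, 0]
print([enc.decode([t]) for t in token_ids])    # the token strings

# 2) Embeddings: learned, returns one dense vector per input text
vector = client.embeddings.create(
    input=[text], model="text-embedding-3-small"
).data[0].embedding
print(len(vector), vector[:4])                 # dimensionality and the first few values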

From Word2Vec to Transformers

Q. How can you represent a word in N dimensions and represent any long sequence of words using the same N dimensions? Won't that compress the meaning of that long sequence of words to a word-equivalent?

Claude AI: You raise a very valid point. Representing a long sequence of words, like a sentence or document, as a fixed-length vector of the same dimensionality as word embeddings can indeed result in a loss of information and compression of meaning.

The key reason this is possible is that word embeddings and sequence embeddings capture different types of information and operate at different levels of granularity.

  • Word embeddings, like those from Word2Vec or GloVe, represent individual words in a continuous vector space, where each dimension encodes some aspect of the word's meaning, usage, or relationship to other words. These embeddings are trained to capture semantic and syntactic similarities between words based on their co-occurrences in large corpora.

  • On the other hand, sequence embeddings, which are typically generated by neural network models like RNNs, LSTMs, or Transformers, aim to capture the meaning and context of an entire sequence of words, not just individual words. These models process the sequence of word embeddings and learn to combine and contextualize the information in a way that represents the overall meaning of the sequence.

While the sequence embedding has the same dimensionality as the word embeddings, each dimension in the sequence embedding vector now encodes a different aspect of the sequence's meaning, rather than just the meaning of individual words. The neural network model learns to compose and distribute the information from the word embeddings into this fixed-length representation in a way that preserves as much of the relevant information as possible.

However, you are correct that there is inevitably some loss of information and compression of meaning when representing a long sequence as a fixed-length vector. The degree of compression and information loss depends on the complexity of the sequence, the capacity of the neural network model, and the specific task or application.

To mitigate this issue, several techniques are employed:

  1. Attention Mechanisms: Models like Transformers use attention mechanisms to selectively focus on the most relevant parts of the input sequence when generating the sequence embedding, reducing the need to compress all information into a single fixed vector (a minimal sketch of self-attention follows after this list).

  2. Hierarchical Architectures: Some models use hierarchical architectures that first generate embeddings for smaller units (e.g., sentences), which are then combined to create document-level embeddings, allowing for more granular representation.

  3. Contextualized Embeddings: Models like BERT and ELMo generate contextualized word embeddings, which are dynamically informed by the surrounding context, rather than using static word embeddings. This can help preserve more nuanced meaning in the sequence embedding.

  4. Multi-Task Learning: Training models on multiple tasks simultaneously, such as language modeling, translation, and question answering, can help them learn richer and more transferable sequence representations.
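
As a rough illustration of point 1, here is a minimal scaled dot-product self-attention sketch in NumPy; the shapes and random values are assumptions, and a real Transformer layer adds learned projections, multiple heads, and residual connections.

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K, V have shape (seq_len, d); returns contextualized vectors of shape (seq_len, d)."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                      # pairwise relevance of every token to every other token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax: attention weights per token
    return weights @ V                                 # each output is a weighted mix of all value vectors

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                            # 5 tokens, 8-dimensional embeddings
out = scaled_dot_product_attention(X, X, X)            # self-attention over the token embeddings
print(out.shape)                                       # (5, 8): same shape, but each vector is context-aware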

While sequence embeddings can compress meaning to some extent, they still provide a powerful way to capture and leverage the semantic content of longer texts in a compact and computationally efficient manner. The trade-off between compression and information preservation is an active area of research, with ongoing efforts to develop more expressive and effective sequence representation techniques.
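
As a sketch of how a single fixed-length sequence embedding can be derived from contextualized token embeddings (point 3), a common approach is attention-mask-aware mean pooling; the Hugging Face checkpoint named below is only an assumed example, and any encoder that returns last_hidden_state works the same way.

import torch
from transformers import AutoTokenizer, AutoModel

model_name = "sentence-transformers/all-MiniLM-L6-v2"   # assumed example checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

sentences = ["The movie was great.",
             "A long review spanning many words still maps to one vector."]
batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    token_vectors = model(**batch).last_hidden_state    # (batch, seq_len, hidden): one vector per token

mask = batch["attention_mask"].unsqueeze(-1)             # ignore padding positions when pooling
sequence_embeddings = (token_vectors * mask).sum(1) / mask.sum(1)
print(sequence_embeddings.shape)                          # (2, hidden): one vector per sequence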

Visualization

  1. Visualizing embeddings in 2D/3D: In the 3D plot, only a few negative (1/2-star) reviews appear clearly separable from positive (4/5-star) reviews; many others sit in tightly neighbouring or shared regions of the space. One wishes the exercise could be repeated in one more dimension, where the separating boxes/planes/lines might become clearer for more data points.

  2. Chunking: Every tokenizer and GPT model has a token limit (i.e. a maximum input length), so chunking breaks texts that exceed this limit into processable pieces; it also allows concurrent processing, potentially speeding up the overall processing time. E.g. to translate a book, we first split the book into chunks, each roughly a page long, then translate each chunk, and finally stitch the translations back together (see the sketch after this list).

  3. Custom or fine-tuned embeddings: OpenAI's embedding model weights cannot be fine-tuned, but we can use training data to customize embeddings for our application.
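
The chunking step from item 2 can be sketched with a token-based splitter; the tiktoken encoding, chunk size, and overlap below are illustrative assumptions.

import tiktoken

def chunk_text(text: str, max_tokens: int = 500, overlap: int = 50,
               encoding_name: str = "cl100k_base") -> list[str]:
    """Split text into chunks of at most max_tokens tokens, sharing `overlap` tokens between neighbours."""
    enc = tiktoken.get_encoding(encoding_name)
    tokens = enc.encode(text)
    chunks = []
    start = 0
    while start < len(tokens):
        window = tokens[start:start + max_tokens]
        chunks.append(enc.decode(window))
        start += max_tokens - overlap          # step forward, keeping some overlap for continuity
    return chunks

book_text = "..."                               # load the full book text here
pages = chunk_text(book_text, max_tokens=800)   # roughly page-sized chunks
print(len(pages))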

Classification

  1. Zero Shot Classification: This turned into a fun puzzle! It was fun trying to improve the accuracy of the "prompt" (that is, the sentences representing the positive and negative classes, which are fed into the embedding model) over what's given in the OpenAI blog, but in the end overall accuracy remained the same, with higher precision but lower recall. It should be noted that the data is heavily skewed towards positive (particularly 5-star) reviews, and the scores don't necessarily correlate directly with the opinion of the product; e.g. negative reviews may compliment the product while saying it was not what the reviewer had anticipated, or that they don't like any product in this category. A minimal sketch of this approach follows after this list.

  2. Multi-Class Classification: The data mostly consists of unique users interacting with unique products. Representing a user as the (possibly weighted) average of the embeddings of products they interacted with in the training set yields only a weak correlation between the similarity of that user embedding to a new product's embedding and the user's rating of that product. I tried extending the same experiment to a multi-class classification problem, to see whether collaborative filtering or a neural network could classify rating scores accurately, but both did worse. For this data, it can be inferred that review texts (or rather their embeddings) correlate very reliably with positive/negative sentiment classes but very poorly with the five rating scores.
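
Here is the minimal sketch of embedding-based zero-shot sentiment classification referenced in item 1; comparing a review's embedding against embeddings of class-describing sentences is the general approach, while the specific label wording and model choice below are assumptions.

import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str], model: str = "text-embedding-3-small") -> np.ndarray:
    response = client.embeddings.create(input=texts, model=model)
    return np.array([item.embedding for item in response.data])

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# The "prompt": one sentence describing each class, embedded once up front
labels = ["An Amazon review with a positive sentiment.",
          "An Amazon review with a negative sentiment."]
label_vectors = embed(labels)

def classify(review: str) -> str:
    review_vector = embed([review])[0]
    scores = [cosine(review_vector, lv) for lv in label_vectors]
    return "positive" if scores[0] > scores[1] else "negative"

print(classify("Arrived broken and the seller never replied."))   # expected: negative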

Retrieval Augmented Generation

Code Walkthroughs

  1. https://github.com/mrdbourke/simple-local-rag: All the way from PDF ingestion to "chat with PDF" style features using open-source tools.
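
For orientation, here is a compressed sketch of the retrieval step that such "chat with PDF" pipelines implement: embed the chunks, retrieve the most similar ones for a question, and pass them to the model as context. The chunk contents, model names, and k below are illustrative assumptions, not the walkthrough's actual code.

import numpy as np
from openai import OpenAI

client = OpenAI()

chunks = ["Page 1 text ...", "Page 2 text ...", "Page 3 text ..."]   # from the ingestion/chunking step
chunk_vectors = np.array([
    item.embedding
    for item in client.embeddings.create(input=chunks, model="text-embedding-3-small").data
])

def retrieve(question: str, k: int = 2) -> list[str]:
    q = np.array(client.embeddings.create(input=[question],
                                          model="text-embedding-3-small").data[0].embedding)
    sims = chunk_vectors @ q / (np.linalg.norm(chunk_vectors, axis=1) * np.linalg.norm(q))
    return [chunks[i] for i in np.argsort(-sims)[:k]]                # top-k most similar chunks

question = "What does page 2 say?"
context = "\n\n".join(retrieve(question))
answer = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user",
               "content": f"Answer using only this context:\n{context}\n\nQuestion: {question}"}],
).choices[0].message.content
print(answer)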

Working With Foundation Model Services

Rate Limiting

It's recommended to use the 'tenacity' package or another exponential backoff implementation to manage API rate limits, since sending too many requests too quickly will trigger them. Compare the naive loop below with the retry-wrapped function that follows it; the latter gets your embeddings as fast as possible without tripping the limiter.

# Negative example (slow and rate-limited)
from openai import OpenAI
client = OpenAI()

num_embeddings = 10000 # Some large number
for i in range(num_embeddings):
    embedding = client.embeddings.create(
        input="Your text goes here", model="text-embedding-3-small"
    ).data[0].embedding
    print(len(embedding))

# Best practice
from tenacity import retry, wait_random_exponential, stop_after_attempt
from openai import OpenAI
client = OpenAI()

# Retry up to 6 times with exponential backoff, starting at 1 second and maxing out at 20 seconds delay
@retry(wait=wait_random_exponential(min=1, max=20), stop=stop_after_attempt(6))
def get_embedding(text: str, model="text-embedding-3-small") -> list[float]:
    return client.embeddings.create(input=[text], model=model).data[0].embedding

embedding = get_embedding("Your text goes here", model="text-embedding-3-small")
print(len(embedding))

However, if you're processing large volumes of batch data, where throughput matters more than latency, there are a few other things you can do in addition to backing off and retrying, e.g. batching requests and proactively adding delay between requests. Here is an example script for parallel processing of large quantities of API requests: api_request_parallel_processor.py.
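
A simpler sketch of combining batching with a proactive delay is shown below; the batch size and sleep interval are illustrative assumptions, not recommended values.

import time
from openai import OpenAI

client = OpenAI()

texts = [f"Your text number {i} goes here" for i in range(10_000)]
batch_size = 100                                  # the embeddings endpoint accepts a list of inputs

all_embeddings = []
for start in range(0, len(texts), batch_size):
    batch = texts[start:start + batch_size]
    response = client.embeddings.create(input=batch, model="text-embedding-3-small")
    all_embeddings.extend(item.embedding for item in response.data)
    time.sleep(0.5)                               # proactive delay to stay under the rate limit

print(len(all_embeddings))                        # 10000 vectors from only 100 requests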
