
Comments (9)

giladgd commented on June 13, 2024

@Madd0g The way it works is that the existing context state is reused for the new evaluation: since the beginning of the new prompt matches the beginning of the current context state, evaluation of the new prompt can start at the first token that differs from the existing state.

This feature already exists in node-llama-cpp, you just have to reuse the same context sequence across multiple chats.
For example, using the version 3 beta, you can do this:

import {fileURLToPath} from "url";
import path from "path";
import {getLlama, LlamaChatSession} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: path.join(__dirname, "models", "dolphin-2.1-mistral-7b.Q4_K_M.gguf")
});
const context = await model.createContext({
    contextSize: Math.min(4096, model.trainContextSize)
});
const contextSequence = context.getSequence();
const session = new LlamaChatSession({
    contextSequence,
    autoDisposeSequence: false
});


const q1 = "Hi there, how are you?";
console.log("User: " + q1);

const a1 = await session.prompt(q1);
console.log("AI: " + a1);

session.dispose();


// creating a new session on the same context sequence reuses the matching
// prefix of the existing state, so only the differing tokens are evaluated
const session2 = new LlamaChatSession({
    contextSequence
});

const q1a = "Hi there";
console.log("User: " + q1a);

const a1a = await session2.prompt(q1a);
console.log("AI: " + a1a);


StrangeBytesDev commented on June 13, 2024

I think this function in llama.cpp might be the right one to call to try to implement this.
https://github.com/ggerganov/llama.cpp/blob/b2440/llama.cpp#L14010

But I've never done any kind of C++ to Node.js bindings before, so I'm doing my best to work out how that works, and how to implement this here, by inferring from addon.cpp.


giladgd commented on June 13, 2024

I really like the idea :)

I've experimented with llama_load_session_file in the past, and have a few conclusions:

  • The main problem is that it holds the entire context state and not just the evaluation cache of the tokens used in a specific context sequence, so it cannot be used together with multiple sequences, thus eliminating the ability to do efficient batching this way.
  • It saves the entire context state, including all unused buffers, so the generated files are huge and can quickly fill up the storage of your machine if used frequently.
  • It depends on the specific implementation of the current binary, so if you update to the latest version of llama.cpp or node-llama-cpp (a new version of llama.cpp is released every few hours), any slight difference in the implementation that affects how things are laid out in memory will make it impossible to load such a memory dump safely in another version without crashing or corrupting memory.
  • IIRC (I experimented with it months ago), it also depends on the context size and some other parameters the context was created with, so the new context must match the parameters of the previous one whose state you saved, which can pretty easily lead to memory corruption and crashes.

If you'd like, you can try adding the ability to save and load only the evaluation cache of a context sequence to llama.cpp; that would solve most of the problems I encountered and make it viable to support in node-llama-cpp.


Madd0g commented on June 13, 2024

The main problem is that it holds the entire context state and not just the evaluation cache of the tokens used in a specific context sequence, so it cannot be used together with multiple sequences, thus eliminating the ability to do efficient batching this way.

I've used the oobabooga API for some batch tasks, and it is noticeably faster for sequential large prompts when just the start of the text is the same and only the ending differs. It seems to be a feature of llama-cpp-python? Is that a different implementation of prefix caching?

I was hoping to benefit from this feature too; I forgot that llama.cpp and the Python bindings are two different things.


Madd0g commented on June 13, 2024

@giladgd - thanks, I played around with the beta today.

I tried running on CPU, looping over an array of strings. For me, the evaluation only takes longer if I dispose of and fully recreate the session.

  1. reusing the contextSequence but recreating the session - no difference
  2. autoDisposeSequence - true/false - no difference

I'm resetting the history in the loop to only keep the system message:

const context = await model.createContext({
  contextSize: Math.min(4096, model.trainContextSize),
});
const contextSequence = context.getSequence();
const session = new LlamaChatSession({
  systemPrompt,
  contextSequence,
  autoDisposeSequence: false,
});

// for of ....
// and in the loop to keep the system message:
    session.setChatHistory(session.getChatHistory().slice(0, 1));

Am I doing something wrong?


StrangeBytesDev commented on June 13, 2024

  • The main problem is that it holds the entire context state and not just the evaluation cache of the tokens used in a specific context sequence, so it cannot be used together with multiple sequences, thus eliminating the ability to do efficient batching this way.
  • It saves the entire context state, including all unused buffers, so the generated files are huge and can quickly fill up the storage of your machine if used frequently.
  • It depends on the specific implementation of the current binary, so if you update to the latest version of llama.cpp or node-llama-cpp (a new version of llama.cpp is released every few hours), any slight difference in the implementation that affects how things are laid out in memory will make it impossible to load such a memory dump safely in another version without crashing or corrupting memory.
  • IIRC (I experimented with it months ago), it also depends on the context size and some other parameters the context was created with, so the new context must match the parameters of the previous one whose state you saved, which can pretty easily lead to memory corruption and crashes.

I'm a little fuzzy on the difference between the "entire context state" and the "evaluation cache", because I don't have a solid conceptual picture of how batching works under the hood in LlamaCPP. It sounds to me like the existing prompt-based caching would only really be useful for single-user setups and for short-term caching. Is there a way to cache a context to disk on the Node side with the v3 beta? I'm assuming a naive attempt to do something like this won't actually work:

const context = await model.createContext({
    contextSize: 2048,
})
fs.writeFileSync("context.bin", context)


giladgd commented on June 13, 2024

@Madd0g It's not a good idea to manually truncate the chat history like that just to reset it; you'd be better off creating a new LlamaChatSession, as in my example.
The LlamaChatSession is just a wrapper around a LlamaContextSequence that facilitates chatting with a model, so there's no significant performance value in reusing that object.

The next beta version should be released next week and include a tokenMeter on every LlamaContextSequence that will allow you to see exactly how many tokens were evaluated and generated.
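
For example, here is a minimal sketch of that approach (a fresh LlamaChatSession per item on the same contextSequence, following the pattern from my earlier example; the prompts array and systemPrompt string here are just placeholders):

const systemPrompt = "You are a helpful assistant.";
const prompts = ["First question", "Second question"];

for (const text of prompts) {
    // a new session per item; the shared contextSequence keeps the already
    // evaluated system prompt/prefix cached, so only new tokens are evaluated
    const session = new LlamaChatSession({
        contextSequence,
        systemPrompt,
        autoDisposeSequence: false
    });

    const answer = await session.prompt(text);
    console.log(answer);

    session.dispose();
}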


giladgd commented on June 13, 2024

@StrangeBytesDev A context can have multiple sequences, and each sequence has its own state and history.
When you evaluate things on a specific sequence, other sequences are not affected and are not aware of the evaluation.
Using multiple sequences on a single context has a performance advantage over creating multiple contexts with a single sequence on each, which is why I opted to expose that concept as-is in node-llama-cpp.

Since every sequence is supposed to be independent and have its own state, there shouldn't be any functions that have side effects that can affect other sequences when you only intend to affect a specific sequence.

The problem with llama_load_session_file is that it restores the state of a context with all of its sequences, which makes it incompatible with the concept of multiple independent sequences.
While it can be beneficial when you only use a single sequence on a context, I opted not to add support for this due to the rest of the issues I mentioned earlier.
Also, for the rest of the optimizations that node-llama-cpp employs to keep working properly after loading a context state from a file, a different file format with additional data that node-llama-cpp needs would have to be created, so it wouldn't be as simple as just exposing the native function on the JS side.
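
For illustration, here's a minimal sketch of the multiple-sequences idea, assuming the v3 beta's "sequences" option on createContext (exact option names may differ between beta versions); each sequence gets its own session, and prompting them in parallel lets the context batch the evaluations together:

const context = await model.createContext({
    contextSize: Math.min(4096, model.trainContextSize),
    sequences: 2 // two independent sequences sharing one context
});

const sessionA = new LlamaChatSession({contextSequence: context.getSequence()});
const sessionB = new LlamaChatSession({contextSequence: context.getSequence()});

// each sequence has its own state and history, so the answers don't leak into
// each other while their evaluations are batched on the same context
const [answerA, answerB] = await Promise.all([
    sessionA.prompt("Hi there, how are you?"),
    sessionB.prompt("Explain batching in one sentence.")
]);
console.log({answerA, answerB});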


Madd0g commented on June 13, 2024

@Madd0g It's not a good idea to manually truncate the chat history like that just to reset it; you'd be better off creating a new LlamaChatSession, as in my example.

Thanks. I initially couldn't get it to work without it retaining history; I was doing something wrong. Today I did manage to do it correctly, with something like this in a loop:

if (session) {
  // dispose of the previous session before creating a new one on the same sequence
  session.dispose();
}
session = new LlamaChatSession({ contextSequence, systemPrompt, chatWrapper: "auto" });
console.log(session.getChatHistory());

I tried to get the chatHistory out of the session, and I correctly see only the system message in there.

