
Comments (9)

giladgd commented on June 13, 2024

@Madd0g The way it works is that the existing context state is reused for the new evaluation: since the beginning of the new prompt matches the beginning of the current context state, evaluation of the new prompt can start at the first token that differs from the existing state.

This feature already exists in node-llama-cpp, you just have to reuse the same context sequence across multiple chats.
For example, using the version 3 beta, you can do this:

import {fileURLToPath} from "url";
import path from "path";
import {getLlama, LlamaChatSession} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: path.join(__dirname, "models", "dolphin-2.1-mistral-7b.Q4_K_M.gguf")
});
const context = await model.createContext({
    contextSize: Math.min(4096, model.trainContextSize)
});
const contextSequence = context.getSequence();
const session = new LlamaChatSession({
    contextSequence,
    autoDisposeSequence: false
});


const q1 = "Hi there, how are you?";
console.log("User: " + q1);

const a1 = await session.prompt(q1);
console.log("AI: " + a1);

session.dispose();


// creating a new session on the same context sequence reuses the matching
// prefix of the existing state, so only the differing tokens are evaluated
const session2 = new LlamaChatSession({
    contextSequence
});

const q1a = "Hi there";
console.log("User: " + q1a);

const a1a = await session2.prompt(q1a);
console.log("AI: " + a1a);


StrangeBytesDev commented on June 13, 2024

I think this function in llama.cpp might be the right one to call to try to implement this.
https://github.com/ggerganov/llama.cpp/blob/b2440/llama.cpp#L14010

But I've never done any kind of C++ to Node.js bindings before, so I'm doing my best to work out how that works, and how to implement this here, by inferring from addon.cpp.


giladgd commented on June 13, 2024

I really like the idea :)

I've experimented with llama_load_session_file in the past, and have a few conclusions:

  • The main problem is that it holds the entire context state and not just the evaluation cache of the tokens used in a specific context sequence, so it cannot be used together with multiple sequences, thus eliminating the ability to do efficient batching this way.
  • It saves the entire context state, including all unused buffers, so the generated files are huge and can quickly fill up the storage of your machine if used frequently.
  • It depends on the specific implementation of the current binary, so if you update to the latest version of llama.cpp or node-llama-cpp (a new version of llama.cpp is released every few hours), any slight difference in the implementation that affects how things are laid out in memory will make it impossible to load such a memory dump safely in another version without crashing or corrupting memory.
  • IIRC (I experimented with it months ago), it also depends on the context size and some other parameters the context was created with, so the new context must match the parameters of the previous one whose state you saved, which can pretty easily lead to memory corruption and crashes.

If you'd like, you can try adding the ability to save and load only the evaluation cache of a context sequence to llama.cpp; that would solve most of the problems I encountered and make it viable to support in node-llama-cpp.


Madd0g commented on June 13, 2024

The main problem is that it holds the entire context state and not just the evaluation cache of the tokens used in a specific context sequence, so it cannot be used together with multiple sequences, thus eliminating the ability to do efficient batching this way.

I've used the oobabooga API for some batch tasks, and it is noticeably faster for sequential large prompts when just the start of the text is the same and only the ending differs. It seems to be a feature of llama-cpp-python? Is that a different implementation of prefix caching?

I was hoping to benefit from this feature too; I forgot that llama.cpp and the Python bindings are two different things.


Madd0g commented on June 13, 2024

@giladgd - thanks, I played around with the beta today.

I tried running on CPU, looping over an array of strings. For me, the evaluation only takes longer if I dispose of and fully recreate the session.

  1. reusing the contextSequence but recreating the session - no difference
  2. autoDisposeSequence - true/false - no difference

I'm resetting the history in the loop to only keep the system message:

const context = await model.createContext({
  contextSize: Math.min(4096, model.trainContextSize),
});
const contextSequence = context.getSequence();
const session = new LlamaChatSession({
  systemPrompt,
  contextSequence,
  autoDisposeSequence: false,
});

// for of ....
// and in the loop to keep the system message:
    session.setChatHistory(session.getChatHistory().slice(0, 1));

Am I doing something wrong?


StrangeBytesDev commented on June 13, 2024

  • The main problem is that it holds the entire context state and not just the evaluation cache of the tokens used in a specific context sequence, so it cannot be used together with multiple sequences, thus eliminating the ability to do efficient batching this way.
  • It saves the entire context state, including all unused buffers, so the generated files are huge and can quickly fill up the storage of your machine if used frequently.
  • It depends on the specific implementation of the current binary, so if you update to the latest version of llama.cpp or node-llama-cpp (a new version of llama.cpp is released every few hours), any slight difference in the implementation that affects how things are laid out in memory will make it impossible to load such a memory dump safely in another version without crashing or corrupting memory.
  • IIRC (I experimented with it months ago), it also depends on the context size and some other parameters the context was created with, so the new context must match the parameters of the previous one whose state you saved, which can pretty easily lead to memory corruption and crashes.

I'm a little fuzzy on the difference between the "entire context state" and the "evaluation cache", because I don't have a solid conceptual picture of how batching works under the hood in LlamaCPP. It sounds to me like the existing prompt-based caching would only really be useful for single-user setups and for short-term caching. Is there a way to cache a context to disk on the Node side with the v3 beta? I'm assuming a naive attempt to do something like this won't actually work:

const context = await model.createContext({
    contextSize: 2048,
})
fs.writeFileSync("context.bin", context)


giladgd commented on June 13, 2024

@Madd0g It's not a good idea to manually truncate the chat history like that just to reset it; you'd be better off creating a new LlamaChatSession, as in my example.
The LlamaChatSession is just a wrapper around a LlamaContextSequence that facilitates chatting with a model, so there's no significant performance value in reusing that object.

The next beta version should be released next week and include a tokenMeter on every LlamaContextSequence that will allow you to see exactly how many tokens were evaluated and generated.
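
For example, here is a minimal sketch of that approach (a fresh LlamaChatSession per item on the same contextSequence, following the pattern from my earlier example; the prompts array and systemPrompt string here are just placeholders):

const systemPrompt = "You are a helpful assistant.";
const prompts = ["First question", "Second question"];

for (const text of prompts) {
    // a new session per item; the shared contextSequence keeps the already
    // evaluated system prompt/prefix cached, so only new tokens are evaluated
    const session = new LlamaChatSession({
        contextSequence,
        systemPrompt,
        autoDisposeSequence: false
    });

    const answer = await session.prompt(text);
    console.log(answer);

    session.dispose();
}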


giladgd commented on June 13, 2024

@StrangeBytesDev A context can have multiple sequences, and each sequence has its own state and history.
When you evaluate things on a specific sequence, other sequences are not affected and are not aware of the evaluation.
Using multiple sequences on a single context has a performance advantage over creating multiple contexts with a single sequence on each, which is why I opted to expose that concept as-is in node-llama-cpp.

Since every sequence is supposed to be independent and have its own state, there shouldn't be any functions that have side effects that can affect other sequences when you only intend to affect a specific sequence.

The problem with llama_load_session_file is that it restores the state of a context with all of its sequences, which makes it incompatible with the concept of multiple independent sequences.
While it can be beneficial when you only use a single sequence on a context, I opted not to add support for this due to the rest of the issues I mentioned earlier.
Also, for the rest of the optimizations that node-llama-cpp employs to keep working properly after loading a context state from a file, a different file format with additional data that node-llama-cpp needs would have to be created, so it wouldn't be as simple as just exposing the native function on the JS side.
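
For illustration, here's a minimal sketch of the multiple-sequences idea, assuming the v3 beta's "sequences" option on createContext (exact option names may differ between beta versions); each sequence gets its own session, and prompting them in parallel lets the context batch the evaluations together:

const context = await model.createContext({
    contextSize: Math.min(4096, model.trainContextSize),
    sequences: 2 // two independent sequences sharing one context
});

const sessionA = new LlamaChatSession({contextSequence: context.getSequence()});
const sessionB = new LlamaChatSession({contextSequence: context.getSequence()});

// each sequence has its own state and history, so the answers don't leak into
// each other while their evaluations are batched on the same context
const [answerA, answerB] = await Promise.all([
    sessionA.prompt("Hi there, how are you?"),
    sessionB.prompt("Explain batching in one sentence.")
]);
console.log({answerA, answerB});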


Madd0g commented on June 13, 2024

@Madd0g It's not a good idea to manually truncate the chat history like that just to reset it; you'd be better off creating a new LlamaChatSession, as in my example.

Thanks. I initially couldn't get it to work without it retaining history; I was doing something wrong. Today I did manage to do it correctly, with something like this in a loop:

if (session) {
  // dispose of the previous session before creating a new one on the same sequence
  session.dispose();
}
session = new LlamaChatSession({ contextSequence, systemPrompt, chatWrapper: "auto" });
console.log(session.getChatHistory());

I tried to get the chatHistory out of the session, and I correctly see only the system message in there.

