
llama.go's People

Contributors

gotzmann, mfreeman451


llama.go's Issues

How to implement INT4/INT8 quantization and optimal way to use AVX instructions?

In reply to @gotzmann:

Implementing INT4/INT8 quantization and using AVX instructions can be challenging, mainly due to the limitations of INT8 multiplication instructions. However, here are some ideas to help you get started:

Quantization:

  • For INT8, you can normalize the data in float32 to the range [-128, 127] and round it to integers. Remember to store the scale factors to convert back to float32 during dequantization.
  • For INT4, you can normalize the data in float32 to the range [-8, 7] and round it to integers. Similar to the INT8 case, store the scale factors for dequantization.
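The quantization steps above can be sketched in Go. This is a minimal symmetric variant for INT8 (it uses 127 rather than 128 on both sides so a single scale factor works symmetrically around zero; it is an illustration, not llama.go's actual quantization code):

```go
package main

import (
	"fmt"
	"math"
)

// quantizeINT8 maps float32 values into [-127, 127] and returns
// the int8 data plus the scale factor needed for dequantization.
func quantizeINT8(x []float32) ([]int8, float32) {
	var amax float32
	for _, v := range x {
		if a := float32(math.Abs(float64(v))); a > amax {
			amax = a
		}
	}
	scale := amax / 127
	q := make([]int8, len(x))
	if scale == 0 {
		return q, 0 // all-zero input
	}
	for i, v := range x {
		q[i] = int8(math.Round(float64(v / scale)))
	}
	return q, scale
}

// dequantizeINT8 converts back to float32 using the stored scale.
func dequantizeINT8(q []int8, scale float32) []float32 {
	out := make([]float32, len(q))
	for i, v := range q {
		out[i] = float32(v) * scale
	}
	return out
}

func main() {
	x := []float32{0.5, -1.27, 1.27}
	q, s := quantizeINT8(x)
	fmt.Println(q)                    // quantized integers
	fmt.Println(dequantizeINT8(q, s)) // approximate reconstruction
}
```

The same structure works for INT4 by clamping to [-8, 7] instead; the reconstruction is lossy, which is why the scale must be stored alongside each quantized block.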

AVX Instructions:

  • AVX-512 instructions can be used to accelerate operations on INT8 and INT4 data arrays. For example, _mm512_maddubs_epi16() multiplies unsigned INT8 values by signed INT8 values and adds adjacent pairs into INT16 results, and _mm512_add_epi16() performs INT16 addition.

To deal with the lack of specific INT8 multiplication instructions, you can try converting the INT8 data to INT16 before performing the multiplication. Here is a basic example of how you can do this:

#include <immintrin.h>

__m512i int8_mul(__m512i a, __m512i b) {
  // Convert the INT8 vectors to INT16 (sign-extending each lane)
  __m512i a_lo = _mm512_cvtepi8_epi16(_mm512_castsi512_si256(a));
  __m512i a_hi = _mm512_cvtepi8_epi16(_mm512_extracti64x4_epi64(a, 1));
  __m512i b_lo = _mm512_cvtepi8_epi16(_mm512_castsi512_si256(b));
  __m512i b_hi = _mm512_cvtepi8_epi16(_mm512_extracti64x4_epi64(b, 1));

  // Multiply the INT16 vectors
  __m512i product_lo = _mm512_mullo_epi16(a_lo, b_lo);
  __m512i product_hi = _mm512_mullo_epi16(a_hi, b_hi);

  // Pack the INT16 products back into INT8 with signed saturation
  // (note: _mm512_packs_epi16 interleaves within 128-bit lanes)
  __m512i result = _mm512_packs_epi16(product_lo, product_hi);

  return result;
}

Please note that this example is simplified and may not be the most efficient. The final pack saturates the INT16 products to the INT8 range, and because _mm512_packs_epi16 operates within 128-bit lanes, the output elements come out interleaved; you may need an extra permute to restore the original element order.

These ideas should help you get started implementing INT4/INT8 quantization and using AVX instructions. Keep in mind that performance optimization is an iterative process, and you may need to experiment with various approaches to find the most efficient solution for your specific case. If you need more information, don't hesitate to ask; I don't know C++ as well as Go, but I'm at your disposal.
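Since INT4 has no native machine type, two quantized values are typically packed per byte. A minimal Go sketch of one such packing scheme (the nibble layout here is illustrative, not llama.go's actual on-disk format):

```go
package main

import "fmt"

// packINT4 stores two signed 4-bit values (each in [-8, 7]) per byte:
// the first in the low nibble, the second in the high nibble.
func packINT4(q []int8) []byte {
	packed := make([]byte, (len(q)+1)/2)
	for i, v := range q {
		nib := byte(v) & 0x0F
		if i%2 == 0 {
			packed[i/2] = nib
		} else {
			packed[i/2] |= nib << 4
		}
	}
	return packed
}

// unpackINT4 restores n signed values, sign-extending each nibble.
func unpackINT4(packed []byte, n int) []int8 {
	q := make([]int8, n)
	for i := range q {
		nib := packed[i/2] >> (uint(i%2) * 4) & 0x0F
		q[i] = int8(nib<<4) >> 4 // arithmetic shift sign-extends the 4-bit value
	}
	return q
}

func main() {
	q := []int8{-8, 7, 3, -1}
	fmt.Println(unpackINT4(packINT4(q), len(q))) // round-trips to the input
}
```

Packing halves the memory footprint again compared to INT8, at the cost of an unpack step (the shift-and-sign-extend above) before each use.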

Trying to run TheBloke's guanaco-3b-uncensored-v2.ggmlv1.q4_0.bin but got this error; how to convert it to a format for llama.go?

./llama-go-v1.4.0-linux --model=guanaco-3b-uncensored-v2.ggmlv1.q4_0.bin --prompt="write a story about alibaba and snow white"
                                                    
  /▒▒       /▒▒         /▒▒▒/▒▒▒   /▒▒/▒▒▒▒/▒▒   /▒▒▒/▒▒▒      /▒▒▒▒/▒▒   /▒▒▒/▒▒▒    
  /▒▒▒      /▒▒▒      /▒▒▒/ /▒▒▒ /▒▒▒/▒▒▒▒/▒▒▒ /▒▒▒/ /▒▒▒     /▒▒▒▒ //   /▒▒▒▒//▒▒▒  
  /▒▒▒▒/▒▒  /▒▒▒▒/▒▒  /▒▒▒▒/▒▒▒▒ /▒▒▒/▒▒▒▒/▒▒▒ /▒▒▒▒/▒▒▒▒ /▒▒ /▒▒▒▒/▒▒▒▒ /▒▒▒ /▒▒▒▒ 
  /▒▒▒▒/▒▒▒ /▒▒▒▒/▒▒▒ /▒▒▒ /▒▒▒▒ /▒▒▒//▒▒ /▒▒▒ /▒▒▒ /▒▒▒▒ /▒▒▒//▒▒▒▒/▒▒  //▒▒▒/▒▒▒
  //// ///  //// ///  ///  ////  ///  //  ///  ///  ////  ///  //// //    /// ///

   ▒▒▒▒ [ LLaMA.go v1.4.0 ] [ LLaMA GPT in pure Golang - based on LLaMA C++ ] ▒▒▒▒


[ERROR] Invalid model file 'guanaco-3b-uncensored-v2.ggmlv1.q4_0.bin'! Too old, regenerate!
[ ERROR ] Failed to load model "guanaco-3b-uncensored-v2.ggmlv1.q4_0.bin"

Failed to load model, Wrong MAGIC in header

I quantized the LLaMA 7B-chat model with llama.cpp and got the model ggml-model-q4_0.gguf. But llama.go does not seem to support the GGUF version; it shows this error:
[ERROR] Invalid model file '../llama.cpp/models/7B/ggml-model-q4_0.gguf'! Wrong MAGIC in header

[ ERROR ] Failed to load model "../llama.cpp/models/7B/ggml-model-q4_0.gguf"

Postman gives "Cannot POST /" error with server mode.

I started llama-go-v1.4.0.exe with this CLI command (I started CMD as administrator):
llama-go-v1.4.0 --model llama-7b-fp32.bin --server --host 127.0.0.1 --port 8082 --pods 1 --threads 4

and it gave me this output.
[ INIT ] REST server ready on 127.0.0.1:8082

and I tried this json formed request

{
    "id": "5fb8ebd0-e0c9-4759-8f7d-35590f6c9fc3",
    "prompt": "Can you connect the internet ?"
}

to

- 127.0.0.1:8082
- http://127.0.0.1:8082
- localhost:8082

but it gave me this error: "Cannot POST /".

and http://localhost:8082/jobs/status/5fb8ebd0-e0c9-4759-8f7d-35590f6c9fc3 couldn't find any request with that ID.

Is there any suggestion for that ?
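For what it's worth, "Cannot POST /" means no handler is registered for POST on the root path; the JSON job has to be POSTed to the server's job-submission route rather than to the bare host. A runnable Go sketch of the situation, using net/http/httptest as a stand-in server (the /jobs path is an assumption for illustration, not necessarily llama.go's actual route; check the README):

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// submit POSTs a JSON body to url and returns the HTTP status code.
func submit(url string, body []byte) int {
	resp, err := http.Post(url, "application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	return resp.StatusCode
}

// newFakeServer mimics a REST server that only accepts jobs on /jobs
// (a hypothetical route). Requests to any other path fail, which is
// what Postman reports as "Cannot POST /".
func newFakeServer() *httptest.Server {
	mux := http.NewServeMux()
	mux.HandleFunc("/jobs", func(w http.ResponseWriter, r *http.Request) {
		io.Copy(w, r.Body) // echo the submitted job back
	})
	return httptest.NewServer(mux)
}

func main() {
	srv := newFakeServer()
	defer srv.Close()

	body := []byte(`{"id":"5fb8ebd0-e0c9-4759-8f7d-35590f6c9fc3","prompt":"Can you connect the internet ?"}`)

	fmt.Println("POST / ->", submit(srv.URL+"/", body))         // fails: no handler on the root path
	fmt.Println("POST /jobs ->", submit(srv.URL+"/jobs", body)) // succeeds on the registered route
}
```

If the job never reaches the server, the later GET on /jobs/status/<id> will naturally find no request with that ID.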

System:

Intel i7 6900K 8 Core 16 Thread
64 GB DDR4 RAM
RTX 2080S
Samsung 980 NVMe
Windows Server 2022 Standard

I don't think the problem is that my system can't handle 1 pod with 4 threads.

How do I convert huggingface llama models for llama.go?

I wanted to try to use RedPajama-INCITE-Chat-3B-v1 with llama.go, but I haven't been able to figure out how to convert it, or whether conversion is possible. The Python script says the repo is missing some of the required files: tokenizer.model, params.json, etc.

Would someone be able to provide an example of how to use models from huggingface? I was able to use the models using the python library, just not llama.go.

Able to build and run, but not able to test

I am able to run llama-go-v1.exe. It gives the same output as in the README file: "Loading model, Please wait...",
and later "REST server ready on 127.0.0.1:8080". When I enter the URL in a browser, it gives "Cannot GET /".
Please help with how to test, and give me clear steps. Thanks in advance.

How to quantize

The Makefile calls:

./quantize ~/models/7B/ggml-model-f32.bin ~/models/7B/ggml-model-q4_0.bin 2

Where is the quantize script/binary? Or how to build it?

Thanks

What is the project's maturity level?

Is the project ready for production use? What is the minimum hardware required to run the 7B version (recommended CPU, and how many CPU threads)?
Can the project run in 8 GB of RAM? 32 GB is excessive!

AVX hardware optimization doesn't seem to work

https://item.jd.com/10076686823591.html#crumb-wrap

Hi, I have a Lenovo Ren 9000 desktop computer; for the specific configuration, please refer to the purchase link above.
The lscpu command shows that the machine supports AVX, but after adding this flag, I found no difference in token-generation speed. Why is this?


This is the effect without AVX parameters: [screenshot]

This is the effect with AVX parameters: [screenshot]
