gotzmann / llama.go
llama.go is like llama.cpp in pure Golang!
License: Other
In reply to @gotzmann:
Implementing INT4/INT8 quantization with AVX instructions can be challenging, mainly because there is no direct INT8 multiplication instruction. However, here are some ideas to help you get started:
To work around the missing INT8 multiply, you can widen the INT8 data to INT16 before performing the multiplication. Here is a basic example of how you can do this:
#include <immintrin.h>

__m512i int8_mul(__m512i a, __m512i b) {
    // Widen the INT8 vectors to INT16 (low and high 256-bit halves)
    __m512i a_lo = _mm512_cvtepi8_epi16(_mm512_castsi512_si256(a));
    __m512i a_hi = _mm512_cvtepi8_epi16(_mm512_extracti64x4_epi64(a, 1));
    __m512i b_lo = _mm512_cvtepi8_epi16(_mm512_castsi512_si256(b));
    __m512i b_hi = _mm512_cvtepi8_epi16(_mm512_extracti64x4_epi64(b, 1));
    // Multiply the INT16 vectors (keeping the low 16 bits of each product)
    __m512i product_lo = _mm512_mullo_epi16(a_lo, b_lo);
    __m512i product_hi = _mm512_mullo_epi16(a_hi, b_hi);
    // Pack the results back into an INT8 vector with signed saturation
    __m512i result = _mm512_packs_epi16(product_lo, product_hi);
    return result;
}
Please note that this example is simplified and may not be the most efficient. Also be aware that _mm512_packs_epi16 saturates to the INT8 range and packs within 128-bit lanes, so the output element order differs from the inputs; you'll need to tweak the code (for example with a permute) to restore the ordering and to handle overflow the way you want.
These ideas should help you get started with INT4/INT8 quantization and AVX instructions. Keep in mind that performance optimization is an iterative process, and you may need to experiment with various approaches to find the most efficient solution for your specific case. If you need more information, don't hesitate to ping me; I don't understand C++ as well as Go, but I'm at your disposal.
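Since llama.go is pure Go, here is a minimal sketch of the same widening idea without intrinsics; the dotInt8 helper is hypothetical, not from the repo, and just shows int8 values being widened to int32 before the multiply-accumulate so the products cannot overflow:

package main

import "fmt"

// dotInt8 computes the dot product of two equal-length int8 slices.
// Widening each element to int32 before multiplying mirrors the
// INT8 -> INT16 conversion in the intrinsics example above.
func dotInt8(a, b []int8) int32 {
	var sum int32
	for i := range a {
		sum += int32(a[i]) * int32(b[i])
	}
	return sum
}

func main() {
	a := []int8{127, -128, 3}
	b := []int8{2, 2, 2}
	fmt.Println(dotInt8(a, b)) // 254 - 256 + 6 = 4
}

A SIMD version would process the slices in blocks, but the scalar loop is enough to show where the widening happens.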
Just wanted to create an issue to track this, I am going to implement it in a different branch and will submit a PR when it's usable.
./llama-go-v1.4.0-linux --model=guanaco-3b-uncensored-v2.ggmlv1.q4_0.bin --prompt="write a story about alibaba and snow white"
[ ASCII-art LLaMA.go banner ]
▒▒▒▒ [ LLaMA.go v1.4.0 ] [ LLaMA GPT in pure Golang - based on LLaMA C++ ] ▒▒▒▒
[ERROR] Invalid model file 'guanaco-3b-uncensored-v2.ggmlv1.q4_0.bin'! Too old, regenerate!
[ ERROR ] Failed to load model "guanaco-3b-uncensored-v2.ggmlv1.q4_0.bin"
I quantized the llama 7b-chat model with llama.cpp and got the model ggml-model-q4_0.gguf. But llama.go does not seem to support the GGUF format; it shows this error:
[ERROR] Invalid model file '../llama.cpp/models/7B/ggml-model-q4_0.gguf'! Wrong MAGIC in header
[ ERROR ] Failed to load model "../llama.cpp/models/7B/ggml-model-q4_0.gguf"
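"Wrong MAGIC in header" means the loader rejected the first four bytes of the file. As a minimal sketch of distinguishing the formats by magic (the constants below are the historical llama.cpp values and should be treated as assumptions, and the path is a placeholder):

package main

import (
	"encoding/binary"
	"fmt"
	"os"
)

// Assumed historical magics: legacy GGML-family files start with
// "ggml"/"ggjt" written as a little-endian uint32; GGUF files start
// with the ASCII bytes "GGUF".
const (
	magicGGML uint32 = 0x67676d6c // "ggml"
	magicGGJT uint32 = 0x67676a74 // "ggjt"
	magicGGUF uint32 = 0x46554747 // "GGUF" read as little-endian uint32
)

func main() {
	f, err := os.Open("model.bin") // hypothetical path
	if err != nil {
		panic(err)
	}
	defer f.Close()

	var magic uint32
	if err := binary.Read(f, binary.LittleEndian, &magic); err != nil {
		panic(err)
	}
	switch magic {
	case magicGGUF:
		fmt.Println("GGUF file: this llama.go release cannot read it")
	case magicGGML, magicGGJT:
		fmt.Println("legacy GGML-family file")
	default:
		fmt.Printf("unknown magic: 0x%08x\n", magic)
	}
}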
@mfreeman451 Is it possible to take this forward in a forked repo? You seem to be the only other contributor who knows how to move this forward.
I can code Golang, and if you can guide me, I can help with the 4-bit quantisation etc.
Where is the model ggml-model-f32.bin?
I started llama-go-v1.4.0.exe with this CLI command (I started CMD as administrator):
llama-go-v1.4.0 --model llama-7b-fp32.bin --server --host 127.0.0.1 --port 8082 --pods 1 --threads 4
and it gave me this output:
[ INIT ] REST server ready on 127.0.0.1:8082
and I tried this JSON-formatted request:
{
"id": "5fb8ebd0-e0c9-4759-8f7d-35590f6c9fc3",
"prompt": "Can you connect the internet ?"
}
to
- 127.0.0.1:8082
- http://127.0.0.1:8082
- localhost:8082
but it gave me an error like this: Cannot POST /
and http://localhost:8082/jobs/status/5fb8ebd0-e0c9-4759-8f7d-35590f6c9fc3
couldn't find any request with that ID.
Are there any suggestions for this?
System:
Intel i7 6900K 8 Core 16 Thread
64 GB DDR4 RAM
RTX 2080S
Samsung 980 NVMe
Windows Server 2022 Standard
I don't think my system is unable to handle 1 pod with 4 threads.
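For what it's worth, "Cannot POST /" suggests the request hit the root path, which the server does not route. Here is a hedged sketch that posts the job to a /jobs endpoint instead; the endpoint and payload shape are assumptions inferred from the /jobs/status URL above, not verified against this release:

package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
)

func main() {
	// Assumed endpoint: POST /jobs rather than POST / (the root path).
	body := []byte(`{"id":"5fb8ebd0-e0c9-4759-8f7d-35590f6c9fc3","prompt":"Can you connect the internet ?"}`)
	resp, err := http.Post("http://127.0.0.1:8082/jobs", "application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	out, _ := io.ReadAll(resp.Body)
	fmt.Println(resp.Status, string(out))
}

If the job is accepted, polling http://127.0.0.1:8082/jobs/status/<id> should then find the request.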
I wanted to try to use RedPajama-INCITE-Chat-3B-v1 with llama.go, but I haven't been able to figure out how to convert it, or whether it's possible to convert at all. The Python script says the repo is missing some of the required files: tokenizer.model, params.json, etc.
Would someone be able to provide an example of how to use models from Hugging Face? I was able to use the models with the Python library, just not with llama.go.
I am able to run llama-go-v1.exe. It gives the same output as shown in the README file: first "Loading model, please wait...",
then "REST server ready on 127.0.0.1:8080". But when I enter the URL in a browser, it gives "Can't GET".
Please help me with how to test this; clear steps would be appreciated. Thanks in advance.
Is this the LLaMA this project is aiming for? Is an Alpaca version available? How does it work, and how do I install and use it?
These go commands to install dependencies did not work for me:
go tidy
go vendor
Instead I had to run
go mod tidy
go mod vendor
It would be great to see how this code compares to the C++ version.
Nice project, btw.
The Makefile calls:
./quantize ~/models/7B/ggml-model-f32.bin ~/models/7B/ggml-model-q4_0.bin 2
Where is the quantize script/binary? Or how do I build it?
Thanks
This may sound a bit like advertising, but I was wondering if I can contribute code indirectly through the
https://github.com/cloudxaas repo.
I can optimize code sections for you if needed (zero-allocation work etc.), but I hope the code can stay usable from the cloudxaas repo.
Is there a possibility for such a collaboration?
Looking forward to the latest model and to running 4-bit quantisation on Windows.
Is this project abandoned?
Is the project ready for production use? What is the minimum hardware required to run the 7B model? (Recommended CPU? How many CPU threads?)
Can the project run in 8GB of RAM? 32GB is excessive!
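As a rough back-of-envelope answer (a sketch; the 18-byte q4_0 block covering 32 weights is llama.cpp's historical layout and is an assumption here):

package main

import "fmt"

func main() {
	const params = 7e9 // ~7B parameters in LLaMA 7B

	f32 := params * 4       // float32: 4 bytes per weight
	f16 := params * 2       // float16: 2 bytes per weight
	q40 := params / 32 * 18 // q4_0: assumed 18-byte block per 32 weights

	fmt.Printf("f32  ~ %.1f GB\n", f32/1e9)
	fmt.Printf("f16  ~ %.1f GB\n", f16/1e9)
	fmt.Printf("q4_0 ~ %.1f GB\n", q40/1e9)
}

By this estimate, q4_0 weights alone come to roughly 4 GB and could fit in 8GB of RAM, while f32 needs about 28 GB before any runtime overhead, which is why 32GB comes up.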
The functionality of this project is still not perfect.
Hi, I have a Lenovo Ren 9000 desktop computer here; for the specific configuration, please refer to the shop listing:
https://item.jd.com/10076686823591.html#crumb-wrap
The lscpu command shows that the machine supports AVX optimization, but after adding this flag I found no difference in token-generation speed. Why is this?