gotzmann / llama.go
llama.go is like llama.cpp in pure Golang!
License: Other
In reply to @gotzmann:
Implementing INT4/INT8 quantization with AVX instructions can be challenging, mainly because there is no direct INT8 multiplication instruction. However, here are some ideas to help you get started:
To work around the missing INT8 multiply, you can widen the INT8 data to INT16 before performing the multiplication. Here is a basic example of how you can do this:
#include <immintrin.h>

__m512i int8_mul(__m512i a, __m512i b) {
    // Widen the INT8 vectors to INT16 (low and high 256-bit halves)
    __m512i a_lo = _mm512_cvtepi8_epi16(_mm512_castsi512_si256(a));
    __m512i a_hi = _mm512_cvtepi8_epi16(_mm512_extracti64x4_epi64(a, 1));
    __m512i b_lo = _mm512_cvtepi8_epi16(_mm512_castsi512_si256(b));
    __m512i b_hi = _mm512_cvtepi8_epi16(_mm512_extracti64x4_epi64(b, 1));
    // Multiply the INT16 vectors (keeping the low 16 bits of each product)
    __m512i product_lo = _mm512_mullo_epi16(a_lo, b_lo);
    __m512i product_hi = _mm512_mullo_epi16(a_hi, b_hi);
    // Pack the results back into an INT8 vector with signed saturation
    __m512i result = _mm512_packs_epi16(product_lo, product_hi);
    return result;
}
Please note that this example is simplified and may not be the most efficient. Also be aware that _mm512_packs_epi16 saturates to the INT8 range and packs within 128-bit lanes, so the output element order differs from the inputs; you'll need to tweak the code (for example with a permute) to restore the ordering and to handle overflow the way you want.
These ideas should help you get started with INT4/INT8 quantization and AVX instructions. Keep in mind that performance optimization is an iterative process, and you may need to experiment with various approaches to find the most efficient solution for your specific case. If you need more information, don't hesitate to ping me; I don't understand C++ as well as Go, but I'm at your disposal.
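Since llama.go is pure Go, here is a minimal sketch of the same widening idea without intrinsics; the dotInt8 helper is hypothetical, not from the repo, and just shows int8 values being widened to int32 before the multiply-accumulate so the products cannot overflow:

package main

import "fmt"

// dotInt8 computes the dot product of two equal-length int8 slices.
// Widening each element to int32 before multiplying mirrors the
// INT8 -> INT16 conversion in the intrinsics example above.
func dotInt8(a, b []int8) int32 {
	var sum int32
	for i := range a {
		sum += int32(a[i]) * int32(b[i])
	}
	return sum
}

func main() {
	a := []int8{127, -128, 3}
	b := []int8{2, 2, 2}
	fmt.Println(dotInt8(a, b)) // 254 - 256 + 6 = 4
}

A SIMD version would process the slices in blocks, but the scalar loop is enough to show where the widening happens.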
Just wanted to create an issue to track this, I am going to implement it in a different branch and will submit a PR when it's usable.
./llama-go-v1.4.0-linux --model=guanaco-3b-uncensored-v2.ggmlv1.q4_0.bin --prompt="write a story about alibaba and snow white"
[ ASCII-art LLaMA.go banner ]
▒▒▒▒ [ LLaMA.go v1.4.0 ] [ LLaMA GPT in pure Golang - based on LLaMA C++ ] ▒▒▒▒
[ERROR] Invalid model file 'guanaco-3b-uncensored-v2.ggmlv1.q4_0.bin'! Too old, regenerate!
[ ERROR ] Failed to load model "guanaco-3b-uncensored-v2.ggmlv1.q4_0.bin"
I quantized the llama 7b-chat model with llama.cpp and got the model ggml-model-q4_0.gguf. But llama.go does not seem to support the GGUF format; it shows this error:
[ERROR] Invalid model file '../llama.cpp/models/7B/ggml-model-q4_0.gguf'! Wrong MAGIC in header
[ ERROR ] Failed to load model "../llama.cpp/models/7B/ggml-model-q4_0.gguf"
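"Wrong MAGIC in header" means the loader rejected the first four bytes of the file. As a minimal sketch of distinguishing the formats by magic (the constants below are the historical llama.cpp values and should be treated as assumptions, and the path is a placeholder):

package main

import (
	"encoding/binary"
	"fmt"
	"os"
)

// Assumed historical magics: legacy GGML-family files start with
// "ggml"/"ggjt" written as a little-endian uint32; GGUF files start
// with the ASCII bytes "GGUF".
const (
	magicGGML uint32 = 0x67676d6c // "ggml"
	magicGGJT uint32 = 0x67676a74 // "ggjt"
	magicGGUF uint32 = 0x46554747 // "GGUF" read as little-endian uint32
)

func main() {
	f, err := os.Open("model.bin") // hypothetical path
	if err != nil {
		panic(err)
	}
	defer f.Close()

	var magic uint32
	if err := binary.Read(f, binary.LittleEndian, &magic); err != nil {
		panic(err)
	}
	switch magic {
	case magicGGUF:
		fmt.Println("GGUF file: this llama.go release cannot read it")
	case magicGGML, magicGGJT:
		fmt.Println("legacy GGML-family file")
	default:
		fmt.Printf("unknown magic: 0x%08x\n", magic)
	}
}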
@mfreeman451 Is it possible to take this forward in a forked repo? You seem to be the only other contributor who knows how to move this forward.
I can code Golang, and if you can guide me, I can help with the 4-bit quantisation etc.
Where is the model ggml-model-f32.bin?
I started llama-go-v1.4.0.exe with this CLI command (I started CMD as administrator):
llama-go-v1.4.0 --model llama-7b-fp32.bin --server --host 127.0.0.1 --port 8082 --pods 1 --threads 4
and it gave me this output:
[ INIT ] REST server ready on 127.0.0.1:8082
and I tried this JSON-formatted request:
{
"id": "5fb8ebd0-e0c9-4759-8f7d-35590f6c9fc3",
"prompt": "Can you connect the internet ?"
}
to
- 127.0.0.1:8082
- http://127.0.0.1:8082
- localhost:8082
but it gave me an error like this: Cannot POST /
and http://localhost:8082/jobs/status/5fb8ebd0-e0c9-4759-8f7d-35590f6c9fc3
couldn't find any request with that ID.
Are there any suggestions for this?
System:
Intel i7 6900K 8 Core 16 Thread
64 GB DDR4 RAM
RTX 2080S
Samsung 980 NVMe
Windows Server 2022 Standard
I don't think my system is unable to handle 1 pod with 4 threads.
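For what it's worth, "Cannot POST /" suggests the request hit the root path, which the server does not route. Here is a hedged sketch that posts the job to a /jobs endpoint instead; the endpoint and payload shape are assumptions inferred from the /jobs/status URL above, not verified against this release:

package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
)

func main() {
	// Assumed endpoint: POST /jobs rather than POST / (the root path).
	body := []byte(`{"id":"5fb8ebd0-e0c9-4759-8f7d-35590f6c9fc3","prompt":"Can you connect the internet ?"}`)
	resp, err := http.Post("http://127.0.0.1:8082/jobs", "application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	out, _ := io.ReadAll(resp.Body)
	fmt.Println(resp.Status, string(out))
}

If the job is accepted, polling http://127.0.0.1:8082/jobs/status/<id> should then find the request.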
I wanted to try to use RedPajama-INCITE-Chat-3B-v1 with llama.go, but I haven't been able to figure out how to convert it, or whether it's possible to convert at all. The Python script says the repo is missing some of the required files: tokenizer.model, params.json, etc.
Would someone be able to provide an example of how to use models from Hugging Face? I was able to use the models with the Python library, just not with llama.go.
I am able to run llama-go-v1.exe. It gives the same output as shown in the README file: first "Loading model, please wait...",
then "REST server ready on 127.0.0.1:8080". But when I enter the URL in a browser, it gives "Can't GET".
Please help me with how to test this; clear steps would be appreciated. Thanks in advance.
Is this the LLaMA this project is aiming for? Is an Alpaca version available? How does it work, and how do I install and use it?
These go commands to install dependencies did not work for me:
go tidy
go vendor
Instead I had to run
go mod tidy
go mod vendor
It would be great to see how this code compares to the C++ version.
Nice project, btw.
The Makefile calls:
./quantize ~/models/7B/ggml-model-f32.bin ~/models/7B/ggml-model-q4_0.bin 2
Where is the quantize script/binary? Or how do I build it?
Thanks
This may sound a bit like advertising, but I was wondering if I can contribute code indirectly through the
https://github.com/cloudxaas repo.
I can optimize code sections for you if needed (zero-allocation work etc.), but I hope the code can stay usable from the cloudxaas repo.
Is there a possibility for such a collaboration?
Looking forward to the latest model and to running 4-bit quantisation on Windows.
Is this project abandoned?
Is the project ready for production use? What is the minimum hardware required to run the 7B model? (Recommended CPU? How many CPU threads?)
Can the project run in 8GB of RAM? 32GB is excessive!
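As a rough back-of-envelope answer (a sketch; the 18-byte q4_0 block covering 32 weights is llama.cpp's historical layout and is an assumption here):

package main

import "fmt"

func main() {
	const params = 7e9 // ~7B parameters in LLaMA 7B

	f32 := params * 4       // float32: 4 bytes per weight
	f16 := params * 2       // float16: 2 bytes per weight
	q40 := params / 32 * 18 // q4_0: assumed 18-byte block per 32 weights

	fmt.Printf("f32  ~ %.1f GB\n", f32/1e9)
	fmt.Printf("f16  ~ %.1f GB\n", f16/1e9)
	fmt.Printf("q4_0 ~ %.1f GB\n", q40/1e9)
}

By this estimate, q4_0 weights alone come to roughly 4 GB and could fit in 8GB of RAM, while f32 needs about 28 GB before any runtime overhead, which is why 32GB comes up.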
The functionality of this project is still not perfect.
Hi, I have a Lenovo Ren 9000 desktop computer here; for the specific configuration, please refer to the shop listing:
https://item.jd.com/10076686823591.html#crumb-wrap
The lscpu command shows that the machine supports AVX optimization, but after adding this flag I found no difference in token-generation speed. Why is this?