Comments (10)
Yes cpu fp32i8 is a hidden feature :)
It's because I am using a silly slow method to dynamically convert INT8 -> FPXX
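The "silly slow method" presumably boils down to dequantizing the INT8 weights back to floating point on the fly before each matrix multiply. A rough sketch of what such dynamic INT8 -> float conversion can look like (an illustration using per-row min/max scaling, not the actual rwkv implementation):

```python
import torch

def quantize_int8(w: torch.Tensor):
    # Per-row affine quantization to uint8 (hypothetical scheme, for illustration only).
    mn = w.min(dim=1, keepdim=True).values
    mx = w.max(dim=1, keepdim=True).values
    scale = (mx - mn).clamp(min=1e-8) / 255.0
    q = ((w - mn) / scale).round().to(torch.uint8)
    return q, mn, scale

def dequantize(q: torch.Tensor, mn: torch.Tensor, scale: torch.Tensor, dtype=torch.float32):
    # Converting back to float before every matmul is the per-token overhead
    # that can make a naive fp32i8 path slower than plain fp32.
    return q.to(dtype) * scale + mn
```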
Please update ChatRWKV v2 & pip rwkv package (0.3.1) for 2x faster f16i8 (and less VRAM) and fast f16i8+ streaming.
I think cpu fp32i8 might be faster too. Please test :)
you can try 'cuda fp16i8 *14+' (and increase 14) to stream fp16i8 layers with 0.3.1
This is much slower than cuda fp16 *8 -> cuda fp16 *0+ -> cpu fp32 *1. I tried:
- cuda fp16i8 *14+ -> cpu fp32 *1
- cuda fp16i8 *14+
I also tried cuda fp16 *7+ -> cpu fp32 *1 (not i8) and it seemed either about the same or maybe a little slower than cuda fp16 *8 -> cuda fp16 *0+ -> cpu fp32 *1. Definitely much faster than either of the two above, though.
With my hardware, I haven't seen any case where using fp16i8 was faster than just using half the number of layers with fp16.
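For reference, these strategy strings are passed straight to the rwkv package's model constructor, so swapping strategies only means changing one argument. A minimal sketch of loading with a streamed fp16i8 strategy (model and tokenizer paths are placeholders; assumes the pip rwkv package):

```python
import os
os.environ["RWKV_JIT_ON"] = '1'  # optional TorchScript speedup

from rwkv.model import RWKV
from rwkv.utils import PIPELINE

# '*14+' keeps the first 14 layers resident on the GPU in fp16i8 and streams
# the remaining layers in from CPU RAM as they are needed during inference.
model = RWKV(model='/path/to/RWKV-4-Pile-7B-20230109-ctx4096',  # placeholder path
             strategy='cuda fp16i8 *14+')
pipeline = PIPELINE(model, '/path/to/20B_tokenizer.json')       # placeholder path
print(pipeline.generate('Hello', token_count=16))
```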
Update ChatRWKV v2 & the pip rwkv package (0.4.2) for 2x speed in all modes @KerfuffleV2.
Please join the RWKV Discord, where you can share your thoughtful benchmark results.
I've git pulled the repository and upgraded the rwkv package via pip.
First I used oobabooga's Text Generation Web UI. After 7B loaded and I disabled swap space, memory usage sat at 7.8 GiB and spiked to 9.0 GiB once (after that, the highest spike was 8.4 GiB). I was able to generate at what felt like a token every ~15 seconds (which isn't a problem for me, but still). So far it looks better than my previous test with 7B on standalone ChatRWKV.
I've decided to give it a try with standalone ChatRWKV again, going back to my 1.5B testing.
- First attempt: 1.5B takes 6.7 GiB to load, and then idles at 4.6 GiB (previously it idled at 5.8 GiB).
- Second attempt: 1.5B takes 6.8 GiB to load, and idles at 4.6 GiB.
- Third attempt: 1.5B takes ~4 GiB (I forgot the exact number?) to load, and it idles at 2.5 GiB (which spikes to 4.0 GiB during generation, and after generation idles at 3.3 GiB).
It seems the optimization for f16i8 might've benefited fp32i8 as well; I notice the memory requirements seem to have been cut down slightly further than last time.
The memory fluctuation still seems to be there, though; aside from the 1.5B tests, quick tests with 169M gave me results ranging from 663.6 MiB to 976.3 MiB for fp32i8. I'm unsure if this is on RWKV's end or my operating system's end (I'm using Void Linux, if that helps).
I haven't kept an eye on whether or not there was a difference in speed. I'm considering seeing if cpu fp32 and cpu fp32i8 are compatible with streaming as well (I need to look into how they work soon).
Either way, thank you again for considering lower-end and CPU users with this project! I feel like these optimization efforts help make LLMs more accessible to those with weaker hardware.
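One way to pin down the memory fluctuation mentioned a couple of paragraphs up would be to sample the process's resident set size around loading and generation rather than watching a system monitor. A rough sketch using psutil (psutil is my assumption here, not something ChatRWKV depends on):

```python
import psutil

proc = psutil.Process()

def rss_mib() -> float:
    # Resident set size of this Python process, in MiB.
    return proc.memory_info().rss / (1024 ** 2)

print(f'before load: {rss_mib():.1f} MiB')
# ... load the model here ...
print(f'after load:  {rss_mib():.1f} MiB')
# ... run a generation here ...
print(f'after gen:   {rss_mib():.1f} MiB')
```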
> I've git pulled the repository and upgraded the rwkv package via pip.
Cool, please update to the latest code and 0.3.1 (better generation quality).
Mm, I think that's what I've done, actually! I've done it again just to make sure; both this code and the pip package are up to date.
This is very unscientific, but I've been playing with different strategies on my 6G GPU (GeForce GTX 1060), including with the latest release, 0.3.1. As far as I can tell, it just doesn't seem worth it to use i8 at all, at least on this GPU, even though it's possible to fit twice as many layers. (I know this isn't really relevant to pure CPU-based calculation, but there was already a discussion going on about the changes.)
The best performance I've been able to achieve is with cuda fp16 *8 -> cuda fp16 *0+ -> cpu fp32 *1. This generates about a token a second. (Other applications are already using about 1.2G of VRAM, so on a server it could probably go up to *10.)
- cuda fp16i8 *16 -> cuda fp16 *0+ -> cpu fp32 *1: noticeably slower (seems about 2 sec/token).
- cuda fp16 *9 -> cuda fp16i8 *0+ -> cpu fp32 *1: also slower.
- cuda fp16 *4 -> cuda fp16 *0+ -> cuda fp16 *4 -> cpu fp32 *1: seems about the same as the best strategy I already mentioned.
I didn't think the third one would be better, just tried it out of curiosity. I don't know if there's any advantage to running specific layers in dedicated memory rather than streaming, aside from the 33rd.
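To turn these informal comparisons into numbers, one option is to time a fixed number of generated tokens per strategy. A rough sketch (strategy strings taken from this thread; model and tokenizer paths are placeholders):

```python
import time
import torch
from rwkv.model import RWKV
from rwkv.utils import PIPELINE

STRATEGIES = [
    'cuda fp16 *8 -> cuda fp16 *0+ -> cpu fp32 *1',
    'cuda fp16i8 *16 -> cuda fp16 *0+ -> cpu fp32 *1',
]
N_TOKENS = 64

for strat in STRATEGIES:
    model = RWKV(model='/path/to/model', strategy=strat)        # placeholder path
    pipeline = PIPELINE(model, '/path/to/20B_tokenizer.json')   # placeholder path
    pipeline.generate('Warm-up.', token_count=8)                # warm up kernels/caches
    start = time.time()
    pipeline.generate('The quick brown fox', token_count=N_TOKENS)
    print(f'{strat}: {N_TOKENS / (time.time() - start):.2f} tokens/sec')
    del model, pipeline
    torch.cuda.empty_cache()                                    # release VRAM before the next run
```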
> This is very unscientific, but I've been playing with different strategies on my 6G GPU (GeForce GTX 1060), including with the latest release, 0.3.1. As far as I can tell, it just doesn't seem worth it to use i8 at all, at least on this GPU, even though it's possible to fit twice as many layers. (I know this isn't really relevant to pure CPU-based calculation, but there was already a discussion going on about the changes.)
you can try 'cuda fp16i8 *14+' (and increase 14) to stream fp16i8 layers with 0.3.1
Update ChatRWKV v2 & the pip rwkv package (0.5.0) and set os.environ["RWKV_CUDA_ON"] = '1' for 1.5x speed in f16i8 (and 10% less VRAM, now 14686 MB for 14B instead of 16462 MB), so you can put more layers on the GPU.
PLEASE VERIFY THE GENERATION QUALITY IS UNCHANGED.
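For reference, the ChatRWKV examples set this flag before the rwkv import, since the module decides whether to build and use the custom CUDA kernel when it loads. A minimal sketch (model path and strategy are placeholders):

```python
import os
# Set before importing rwkv so the custom CUDA kernel is compiled and used.
os.environ["RWKV_CUDA_ON"] = '1'
os.environ["RWKV_JIT_ON"] = '1'

from rwkv.model import RWKV

model = RWKV(model='/path/to/RWKV-4-Pile-14B',         # placeholder path
             strategy='cuda fp16i8 *32 -> cuda fp16')  # example strategy, adjust to your VRAM
```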
Related Issues (20)
- Add a support to "stop_words" in PIPELINE
- Open-source Chinese NSFW fine-tuned model
- demo ?
- demo true error ?
- 'No CUDA GPUs are available' in google colab with V100 GPU and high RAM
- huggingface cannot be accessed, the model cannot be downloaded
- Prompt for RAG with RWKV-4-World-7B-v1-20230626-ctx4096
- [Feature Request] text2music
- RuntimeError: Error building extension 'wkv_cuda_v1'
- How to write the RWKV in autoregressive style like RNN
- NameError: name 'PIPELINE' is not defined
- Bro, the output is garbled
- Replies are always cut off; how to make replies end naturally
- eagle-7B
- Inference doesn't work on Apple Macbook even when using CPU fp32 as strategy
- "cpu fp32i8" strategy not working in RWKV v6 through Python rwkv module
- How to run new v5-Eagle-7B
- mps slower than cpu
- model path list
- add text condition for gen music