Comments (10)
Yes cpu fp32i8 is a hidden feature :)
It's because I am using a silly slow method to dynamically convert INT8 -> FPXX
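The "silly slow method" presumably boils down to dequantizing the INT8 weights back to floating point on the fly before each matrix multiply. A rough sketch of what such dynamic INT8 -> float conversion can look like (an illustration using per-row min/max scaling, not the actual rwkv implementation):

```python
import torch

def quantize_int8(w: torch.Tensor):
    # Per-row affine quantization to uint8 (hypothetical scheme, for illustration only).
    mn = w.min(dim=1, keepdim=True).values
    mx = w.max(dim=1, keepdim=True).values
    scale = (mx - mn).clamp(min=1e-8) / 255.0
    q = ((w - mn) / scale).round().to(torch.uint8)
    return q, mn, scale

def dequantize(q: torch.Tensor, mn: torch.Tensor, scale: torch.Tensor, dtype=torch.float32):
    # Converting back to float before every matmul is the per-token overhead
    # that can make a naive fp32i8 path slower than plain fp32.
    return q.to(dtype) * scale + mn
```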
Please update ChatRWKV v2 & pip rwkv package (0.3.1) for 2x faster f16i8 (and less VRAM) and fast f16i8+ streaming.
I think cpu fp32i8 might be faster too. Please test :)
you can try 'cuda fp16i8 *14+' (and increase 14) to stream fp16i8 layers with 0.3.1
This is much slower than cuda fp16 *8 -> cuda fp16 *0+ -> cpu fp32 *1. I tried:
- cuda fp16i8 *14+ -> cpu fp32 *1
- cuda fp16i8 *14+
I also tried cuda fp16 *7+ -> cpu fp32 *1 (not i8) and it seemed either about the same or maybe a little slower than cuda fp16 *8 -> cuda fp16 *0+ -> cpu fp32 *1. Definitely much faster than either of the two above, though.
With my hardware, I haven't seen any case where using fp16i8 was faster than just using half the number of layers with fp16.
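For reference, these strategy strings are passed straight to the rwkv package's model constructor, so swapping strategies only means changing one argument. A minimal sketch of loading with a streamed fp16i8 strategy (model and tokenizer paths are placeholders; assumes the pip rwkv package):

```python
import os
os.environ["RWKV_JIT_ON"] = '1'  # optional TorchScript speedup

from rwkv.model import RWKV
from rwkv.utils import PIPELINE

# '*14+' keeps the first 14 layers resident on the GPU in fp16i8 and streams
# the remaining layers in from CPU RAM as they are needed during inference.
model = RWKV(model='/path/to/RWKV-4-Pile-7B-20230109-ctx4096',  # placeholder path
             strategy='cuda fp16i8 *14+')
pipeline = PIPELINE(model, '/path/to/20B_tokenizer.json')       # placeholder path
print(pipeline.generate('Hello', token_count=16))
```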
Update ChatRWKV v2 & the pip rwkv package (0.4.2) for 2x speed in all modes @KerfuffleV2.
Please join the RWKV Discord, where you can share your thoughtful benchmark results.
I've git pulled the repository and upgraded the rwkv package via pip.
First I used oobabooga's Text Generation Web UI. After 7B loaded and I disabled swap space, memory usage sat at 7.8 GiB and spiked to 9.0 GiB once (after that, the highest spike was 8.4 GiB). I was able to generate at what felt like a token every ~15 seconds (which isn't a problem for me, but still). So far it looks better than my previous test with 7B on standalone ChatRWKV.
I've decided to give it a try with standalone ChatRWKV again, going back to my 1.5B testing.
- First attempt: 1.5B takes 6.7 GiB to load, and then idles at 4.6 GiB (previously it idled at 5.8 GiB).
- Second attempt: 1.5B takes 6.8 GiB to load, and idles at 4.6 GiB.
- Third attempt: 1.5B takes ~4 GiB (I forgot the exact number?) to load, and it idles at 2.5 GiB (which spikes to 4.0 GiB during generation, and after generation idles at 3.3 GiB).
It seems the optimization for f16i8 might've benefited fp32i8 as well; I notice the memory requirements seem to have been cut down slightly further than last time.
The memory fluctuation still seems to be there, though; aside from the 1.5B tests, quick tests with 169M gave me results ranging from 663.6 MiB to 976.3 MiB for fp32i8. I'm unsure if this is on RWKV's end or my operating system's end (I'm using Void Linux, if that helps).
I haven't kept an eye on whether or not there was a difference in speed. I'm considering seeing if cpu fp32 and cpu fp32i8 are compatible with streaming as well (I need to look into how they work soon).
Either way, thank you again for considering lower-end and CPU users with this project! I feel like these optimization efforts help make LLMs more accessible to those with weaker hardware.
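One way to pin down the memory fluctuation mentioned a couple of paragraphs up would be to sample the process's resident set size around loading and generation rather than watching a system monitor. A rough sketch using psutil (psutil is my assumption here, not something ChatRWKV depends on):

```python
import psutil

proc = psutil.Process()

def rss_mib() -> float:
    # Resident set size of this Python process, in MiB.
    return proc.memory_info().rss / (1024 ** 2)

print(f'before load: {rss_mib():.1f} MiB')
# ... load the model here ...
print(f'after load:  {rss_mib():.1f} MiB')
# ... run a generation here ...
print(f'after gen:   {rss_mib():.1f} MiB')
```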
> I've git pulled the repository and upgraded the rwkv package via pip.
Cool, please update to the latest code and 0.3.1 (better generation quality).
Mm, I think that's what I've done, actually! I've done it again just to make sure; both this code and the pip package are up to date.
This is very unscientific, but I've been playing with different strategies on my 6G GPU (GeForce GTX 1060), including with the latest release, 0.3.1. As far as I can tell, it just doesn't seem worth it to use i8 at all, at least on this GPU, even though it's possible to fit twice as many layers. (I know this isn't really relevant to pure CPU-based calculation, but there was already a discussion going on about the changes.)
The best performance I've been able to achieve is with cuda fp16 *8 -> cuda fp16 *0+ -> cpu fp32 *1. This generates about a token a second. (Other applications are already using about 1.2G of VRAM, so on a server it could probably go up to *10.)
- cuda fp16i8 *16 -> cuda fp16 *0+ -> cpu fp32 *1: noticeably slower (seems about 2 sec/token).
- cuda fp16 *9 -> cuda fp16i8 *0+ -> cpu fp32 *1: also slower.
- cuda fp16 *4 -> cuda fp16 *0+ -> cuda fp16 *4 -> cpu fp32 *1: seems about the same as the best strategy I already mentioned.
I didn't think the third one would be better, just tried it out of curiosity. I don't know if there's any advantage to running specific layers in dedicated memory rather than streaming, aside from the 33rd.
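To turn these informal comparisons into numbers, one option is to time a fixed number of generated tokens per strategy. A rough sketch (strategy strings taken from this thread; model and tokenizer paths are placeholders):

```python
import time
import torch
from rwkv.model import RWKV
from rwkv.utils import PIPELINE

STRATEGIES = [
    'cuda fp16 *8 -> cuda fp16 *0+ -> cpu fp32 *1',
    'cuda fp16i8 *16 -> cuda fp16 *0+ -> cpu fp32 *1',
]
N_TOKENS = 64

for strat in STRATEGIES:
    model = RWKV(model='/path/to/model', strategy=strat)        # placeholder path
    pipeline = PIPELINE(model, '/path/to/20B_tokenizer.json')   # placeholder path
    pipeline.generate('Warm-up.', token_count=8)                # warm up kernels/caches
    start = time.time()
    pipeline.generate('The quick brown fox', token_count=N_TOKENS)
    print(f'{strat}: {N_TOKENS / (time.time() - start):.2f} tokens/sec')
    del model, pipeline
    torch.cuda.empty_cache()                                    # release VRAM before the next run
```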
> This is very unscientific, but I've been playing with different strategies on my 6G GPU (GeForce GTX 1060), including with the latest release, 0.3.1. As far as I can tell, it just doesn't seem worth it to use i8 at all, at least on this GPU, even though it's possible to fit twice as many layers. (I know this isn't really relevant to pure CPU-based calculation, but there was already a discussion going on about the changes.)
you can try 'cuda fp16i8 *14+' (and increase 14) to stream fp16i8 layers with 0.3.1
Update ChatRWKV v2 & the pip rwkv package (0.5.0) and set os.environ["RWKV_CUDA_ON"] = '1' for 1.5x speed in f16i8 (and 10% less VRAM, now 14686 MB for 14B instead of 16462 MB), so you can put more layers on the GPU.
PLEASE VERIFY THE GENERATION QUALITY IS UNCHANGED.
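For reference, the ChatRWKV examples set this flag before the rwkv import, since the module decides whether to build and use the custom CUDA kernel when it loads. A minimal sketch (model path and strategy are placeholders):

```python
import os
# Set before importing rwkv so the custom CUDA kernel is compiled and used.
os.environ["RWKV_CUDA_ON"] = '1'
os.environ["RWKV_JIT_ON"] = '1'

from rwkv.model import RWKV

model = RWKV(model='/path/to/RWKV-4-Pile-14B',         # placeholder path
             strategy='cuda fp16i8 *32 -> cuda fp16')  # example strategy, adjust to your VRAM
```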
Related Issues (20)
- Add a support to "stop_words" in PIPELINE
- Open-source Chinese NSFW fine-tuned model
- demo ?
- demo true error ?
- 'No CUDA GPUs are available' in google colab with V100 GPU and high RAM
- huggingface cannot be accessed, the model cannot be downloaded
- Prompt for RAG with RWKV-4-World-7B-v1-20230626-ctx4096
- [Feature Request] text2music
- RuntimeError: Error building extension 'wkv_cuda_v1'
- How to write the RWKV in autoregressive style like RNN
- NameError: name 'PIPELINE' is not defined
- Bro, the output is garbled
- Replies are always cut off; how to make replies end naturally
- eagle-7B
- Inference doesn't work on Apple Macbook even when using CPU fp32 as strategy
- "cpu fp32i8" strategy not working in RWKV v6 through Python rwkv module
- How to run new v5-Eagle-7B
- mps slower than cpu
- model path list
- add text condition for gen music