
Comments (10)

BlinkDL commented on May 8, 2024

Yes, cpu fp32i8 is a hidden feature :)
It's because I am using a silly, slow method to dynamically convert INT8 -> FPXX.


BlinkDL commented on May 8, 2024

Please update ChatRWKV v2 & pip rwkv package (0.3.1) for 2x faster f16i8 (and less VRAM) and fast f16i8+ streaming.
I think cpu fp32i8 might be faster too. Please test :)
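For reference, loading a model with one of these strategy strings through the pip rwkv package looks roughly like the minimal sketch below (the checkpoint path is a placeholder and the token ids are arbitrary):

```python
# Minimal sketch of loading a model with the pip rwkv package and one of the
# strategy strings discussed in this thread. The checkpoint path is a
# placeholder; point it at your own downloaded checkpoint (without the .pth).
from rwkv.model import RWKV

# 'cpu fp32i8' keeps the weights quantized to INT8 in RAM and dequantizes them
# on the fly during the forward pass; 'cuda fp16i8' does the same in VRAM.
model = RWKV(model='path/to/RWKV-4-Pile-1B5', strategy='cpu fp32i8')

# forward() takes a list of token ids and the recurrent state (None to start)
# and returns the logits for the next token plus the updated state.
out, state = model.forward([187, 510, 1563, 310, 247], None)
print(out.shape)
```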


KerfuffleV2 commented on May 8, 2024

you can try 'cuda fp16i8 *14+' (and increase 14) to stream fp16i8 layers with 0.3.1

This is much slower than cuda fp16 *8 -> cuda fp16 *0+ -> cpu fp32 *1

I tried:

  1. cuda fp16i8 *14+ -> cpu fp32 *1
  2. cuda fp16i8 *14+

I also tried cuda fp16 *7+ -> cpu fp32 *1 (not i8) and it seemed either about the same or maybe a little slower than cuda fp16 *8 -> cuda fp16 *0+ -> cpu fp32 *1. Definitely much faster than either of the two above though.

With my hardware, I haven't seen any case where using fp16i8 was faster than just using half the number of layers with fp16.
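In case it helps anyone compare on their own hardware, here is a rough benchmarking sketch (a hypothetical helper, not part of ChatRWKV) that just times a fixed number of single-token forward passes per candidate strategy; the checkpoint path and strategy list are placeholders:

```python
# Rough, unscientific benchmark sketch: time N single-token forward passes per
# strategy and report tokens/second. The checkpoint path and strategies are
# placeholders; greedy argmax is only used to keep the loop self-contained.
import time
from rwkv.model import RWKV

def tokens_per_second(model_path, strategy, n_tokens=64):
    model = RWKV(model=model_path, strategy=strategy)
    state, token = None, 187  # arbitrary starting token id
    start = time.time()
    for _ in range(n_tokens):
        logits, state = model.forward([token], state)
        token = int(logits.argmax())
    return n_tokens / (time.time() - start)

for strat in ('cuda fp16 *8 -> cuda fp16 *0+ -> cpu fp32 *1',
              'cuda fp16i8 *14+ -> cpu fp32 *1'):
    print(f'{strat}: {tokens_per_second("path/to/RWKV-4-Pile-7B", strat):.2f} tok/s')
```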


BlinkDL commented on May 8, 2024

Update ChatRWKV v2 & pip rwkv package (0.4.2) for 2x speed in all modes @KerfuffleV2

Please join the RWKV Discord, where you can share your thoughtful benchmark results.


Crataco commented on May 8, 2024

I've git pulled the repository and upgraded the rwkv package via pip.

First I used oobabooga's Text Generation Web UI. After 7B loaded and I disabled swap space, memory usage sat at 7.8 GiB and spiked to 9.0 GiB once (after that, the highest spike was 8.4 GiB). I was able to generate at what felt like a token every ~15 seconds (which isn't a problem for me, but still). So far it looks better than my previous test with 7B on standalone ChatRWKV.

I've decided to give it a try with standalone ChatRWKV again, going back to my 1.5B testing.

  • First attempt: 1.5B takes 6.7 GiB to load, and then idles at 4.6 GiB (previously it idled at 5.8 GiB).
  • Second attempt: 1.5B takes 6.8 GiB to load, and idles at 4.6 GiB.
  • Third attempt: 1.5B takes ~4 GiB (I forgot the exact number?) to load, and it idles at 2.5 GiB (which spikes to 4.0 GiB during generation, and after generation idles at 3.3 GiB).

It seems the optimization for f16i8 might've benefited fp32i8 as well; the memory requirements look like they've been cut down slightly further than last time.

The memory fluctuation still seems to be there, though; aside from the 1.5B tests, quick tests with 169M gave me results ranging from 663.6 MiB to 976.3 MiB for fp32i8. I'm unsure if this is on RWKV's end or my operating system's end (I'm using Void Linux, if that helps).
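(A side note on measuring this consistently: one option, sketched below with psutil, is to sample the process's resident set size at the same points every run, e.g. before load, after load, and after one generation, so the numbers stay comparable across attempts. This is just a generic measurement sketch, not part of ChatRWKV.)

```python
# Generic sketch for making the memory readings repeatable: sample this
# process's resident set size (RSS) at fixed points with psutil.
import psutil

proc = psutil.Process()

def rss_gib() -> float:
    return proc.memory_info().rss / 2**30

print(f'before load: {rss_gib():.2f} GiB')
# ... load the RWKV model here ...
print(f'after load:  {rss_gib():.2f} GiB')
# ... run one generation here ...
print(f'after gen:   {rss_gib():.2f} GiB')
```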

I haven't kept an eye on whether there was a difference in speed. I'm considering checking whether cpu fp32 and cpu fp32i8 are compatible with streaming as well (I need to look into how they work soon).

Either way, thank you again for considering lower-end and CPU users with this project! I feel like these optimization efforts help make LLMs more accessible to those with weaker hardware.


BlinkDL commented on May 8, 2024

I've git pulled the repository and upgraded the rwkv package via pip.

Cool, please update to the latest code and 0.3.1 (better generation quality).


Crataco commented on May 8, 2024

Mm, I think that's what I've done, actually! I did it again just to make sure; both this code and the pip package are up-to-date.
[two screenshots attached showing the up-to-date repository and rwkv package versions]


KerfuffleV2 commented on May 8, 2024

This is very unscientific, but I've been playing with different strategies on my 6 GB GPU (GeForce GTX 1060), including with the latest release, 0.3.1. As far as I can tell, it just doesn't seem worth it to use i8 at all, at least on this GPU, even though it's possible to fit twice as many layers. (I know this isn't really relevant to pure CPU-based calculation, but there was already a discussion going on about the changes.)

The best performance I've been able to achieve is using cuda fp16 *8 -> cuda fp16 *0+ -> cpu fp32 *1. This generates about a token a second. (Other applications are already using about 1.2 GB of VRAM, so on a server one could probably go up to *10.)

  1. cuda fp16i8 *16 -> cuda fp16 *0+ -> cpu fp32 *1 — Noticeably slower. (Seems about 2sec/token)
  2. cuda fp16 *9 -> cuda fp16i8 *0+ -> cpu fp32 *1 — Also slower.
  3. cuda fp16 *4 -> cuda fp16 *0+ -> cuda fp16 *4 -> cpu fp32 *1 — Seems about the same as the best strategy I already mentioned.

I didn't think the third one would be better; I just tried it out of curiosity. I don't know if there's any advantage to running specific layers in dedicated memory rather than streaming them, aside from the 33rd.
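For reference, my reading of how these strategy strings split the layers is sketched below. This is an assumption based on the examples in this thread, not the package's actual parser, and it only handles the simple "fixed -> *0+ streamed -> fixed" shape used above, with the 33 stages mentioned:

```python
# Sketch of how I read the strategy syntax (an assumption, not the official
# parser): each ' -> ' segment claims layers in order, '*N' pins N layers on
# that device/dtype, and a '*0+' segment streams every layer not claimed
# elsewhere. Only the simple shapes used in this thread are handled.
def describe(strategy: str, n_layers: int = 33) -> None:
    segs = []
    for seg in strategy.split(' -> '):
        parts = seg.split()
        device = ' '.join(parts[:2])          # e.g. 'cuda fp16'
        spec = parts[2] if len(parts) > 2 else '*0+'
        digits = spec.strip('*+')
        segs.append([device, int(digits) if digits else 0, spec.endswith('+')])
    claimed = sum(count for _, count, _ in segs)
    for seg in segs:                          # the '*0+' segment soaks up the rest
        if seg[2] and seg[1] == 0:
            seg[1] = n_layers - claimed
    layer = 0
    for device, count, streamed in segs:
        print(f'layers {layer}-{layer + count - 1}: {device}'
              + (' (streamed)' if streamed else ''))
        layer += count

describe('cuda fp16 *8 -> cuda fp16 *0+ -> cpu fp32 *1')
# layers 0-7: cuda fp16
# layers 8-31: cuda fp16 (streamed)
# layers 32-32: cpu fp32
```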


BlinkDL commented on May 8, 2024

This is very unscientific, but I've been playing with different strategies on my 6 GB GPU (GeForce GTX 1060), including with the latest release, 0.3.1. As far as I can tell, it just doesn't seem worth it to use i8 at all, at least on this GPU, even though it's possible to fit twice as many layers. (I know this isn't really relevant to pure CPU-based calculation, but there was already a discussion going on about the changes.)

you can try 'cuda fp16i8 *14+' (and increase 14) to stream fp16i8 layers with 0.3.1


BlinkDL commented on May 8, 2024

Update ChatRWKV v2 & the pip rwkv package (0.5.0) and set os.environ["RWKV_CUDA_ON"] = '1'
for 1.5x f16i8 speed (and 10% less VRAM: now 14686 MB for 14B instead of 16462 MB, so you can put more layers on the GPU).
PLEASE VERIFY THE GENERATION QUALITY IS UNCHANGED.
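A minimal sketch of what that looks like (the checkpoint path and strategy are placeholders; the environment variable has to be set before rwkv is imported):

```python
# Minimal sketch: RWKV_CUDA_ON must be set before the rwkv package is imported,
# since it controls whether the custom CUDA kernel gets built and loaded.
import os
os.environ['RWKV_CUDA_ON'] = '1'
os.environ['RWKV_JIT_ON'] = '1'   # TorchScript JIT, as in ChatRWKV's examples

from rwkv.model import RWKV

# Checkpoint path and strategy below are placeholders.
model = RWKV(model='path/to/RWKV-4-Pile-14B', strategy='cuda fp16i8 *20+')
```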

