
chatrwkv's Introduction

BlinkDL

A minimalist deep learning library in JavaScript using WebGL + asm.js. Runs in your browser.

Currently it is a proof-of-concept (inference only). Note: Convolution is buggy when memories overlap.

The WebGL backend is powered by weblas: https://github.com/waylonflinn/weblas.

Example

https://withablink.coding.me/goPolicyNet/ : a weiqi (baduk, go) policy network in AlphaGo style:

[board image]

const N = 19;
const NN = N * N;
const nFeaturePlane = 8;
const nFilter = 128;

const x = new BlinkArray();
x.Init('weblas');
x.nChannel = nFeaturePlane;
x.data = new Float32Array(nFeaturePlane * NN);
for (var i = 0; i < NN; i++)
    x.data[5 * NN + i] = 1; // set feature plane for empty board

// pre-act residual network with 6 residual blocks
const bak = new Float32Array(nFilter * NN);
x.Convolution(nFilter, 3);
x.CopyTo(bak);
x.BatchNorm().ReLU().Convolution(nFilter, 3);
x.BatchNorm().ReLU().Convolution(nFilter, 3);
x.Add(bak).CopyTo(bak);
x.BatchNorm().ReLU().Convolution(nFilter, 3);
x.BatchNorm().ReLU().Convolution(nFilter, 3);
x.Add(bak).CopyTo(bak);
x.BatchNorm().ReLU().Convolution(nFilter, 3);
x.BatchNorm().ReLU().Convolution(nFilter, 3);
x.Add(bak).CopyTo(bak);
x.BatchNorm().ReLU().Convolution(nFilter, 3);
x.BatchNorm().ReLU().Convolution(nFilter, 3);
x.Add(bak).CopyTo(bak);
x.BatchNorm().ReLU().Convolution(nFilter, 3);
x.BatchNorm().ReLU().Convolution(nFilter, 3);
x.Add(bak).CopyTo(bak);
x.BatchNorm().ReLU().Convolution(nFilter, 3);
x.BatchNorm().ReLU().Convolution(nFilter, 3);
x.Add(bak);
x.BatchNorm().ReLU().Convolution(1, 1).Softmax();

[performance image]

Usage

<script src='weblas.js' type='text/javascript'></script>
<script src='BlinkDL.js' type='text/javascript'></script>

Todo

  • Convolution (3x3_pad_1 and 1x1), BatchNorm, ReLU, Softmax
  • Pooling layer
  • FC layer
  • Strided convolution
  • Transposed convolution
  • Webworker and async
  • Faster inference with weblas pipeline, WebGPU, WebAssembly
  • Memory manager
  • Training

chatrwkv's People

Contributors

blealtan, blinkdl, cryscan, daquexian, egrorbs, haishengliang, harrisonvanderbyl, josstorer, kerfufflev2, masteryuan418, oobabooga, pengan1987, quantumliu, tosiyuki, troilus-canva, www, zk-wz


chatrwkv's Issues

DefaultCPUAllocator: not enough memory

I succeeded in running the 7B model, but when I tried to run the 14B model on my 4080 GPU by setting "args.strategy = 'cuda fp16i8 *21 -> cuda fp16 *20'" and "os.environ["RWKV_CUDA_ON"] = '0'", it reports an error.
During the process the program consumes all of my 32 GB of CPU memory; the log is as follows.

ChatRWKV v2 https://github.com/BlinkDL/ChatRWKV

Chinese - cuda fp16i8 *21 -> cuda fp16 *20 - J:\ChatRWKV\v2/prompt/default/Chinese-2.py
Loading model - J:/ChatRWKV/RWKV-4-Pile-14B-20230313-ctx8192-test1050
RWKV_JIT_ON 1 RWKV_CUDA_ON 0 RESCALE_LAYER 6

Loading J:/ChatRWKV/RWKV-4-Pile-14B-20230313-ctx8192-test1050.pth ...
Strategy: (total 40+1=41 layers)

  • cuda [float16, uint8], store 21 layers
  • cuda [float16, float16], store 20 layers
    0-cuda-float16-uint8 1-cuda-float16-uint8 2-cuda-float16-uint8 3-cuda-float16-uint8 4-cuda-float16-uint8 5-cuda-float16-uint8 6-cuda-float16-uint8 7-cuda-float16-uint8 8-cuda-float16-uint8 9-cuda-float16-uint8 10-cuda-float16-uint8 11-cuda-float16-uint8 12-cuda-float16-uint8 13-cuda-float16-uint8 14-cuda-float16-uint8 15-cuda-float16-uint8 16-cuda-float16-uint8 17-cuda-float16-uint8 18-cuda-float16-uint8 19-cuda-float16-uint8 20-cuda-float16-uint8 21-cuda-float16-float16 22-cuda-float16-float16 23-cuda-float16-float16 24-cuda-float16-float16 25-cuda-float16-float16 26-cuda-float16-float16 27-cuda-float16-float16 28-cuda-float16-float16 29-cuda-float16-float16 30-cuda-float16-float16 31-cuda-float16-float16 32-cuda-float16-float16 33-cuda-float16-float16 34-cuda-float16-float16 35-cuda-float16-float16 36-cuda-float16-float16 37-cuda-float16-float16 38-cuda-float16-float16 39-cuda-float16-float16 40-cuda-float16-float16
    emb.weight f16 cpu 50277 5120
    blocks.0.ln1.weight f16 cuda:0 5120
    blocks.0.ln1.bias f16 cuda:0 5120
    blocks.0.ln2.weight f16 cuda:0 5120
    blocks.0.ln2.bias f16 cuda:0 5120
    blocks.0.att.time_decay f32 cuda:0 5120
    blocks.0.att.time_first f32 cuda:0 5120
    blocks.0.att.time_mix_k f16 cuda:0 5120
    blocks.0.att.time_mix_v f16 cuda:0 5120
    blocks.0.att.time_mix_r f16 cuda:0 5120
    blocks.0.att.key.weight i8 cuda:0 5120 5120
    blocks.0.att.value.weight i8 cuda:0 5120 5120
    blocks.0.att.receptance.weight i8 cuda:0 5120 5120
    blocks.0.att.output.weight i8 cuda:0 5120 5120
    blocks.0.ffn.time_mix_k f16 cuda:0 5120
    blocks.0.ffn.time_mix_r f16 cuda:0 5120
    blocks.0.ffn.key.weight i8 cuda:0 5120 20480
    blocks.0.ffn.receptance.weight i8 cuda:0 5120 5120
    blocks.0.ffn.value.weight i8 cuda:0 20480 5120
    ...........................................................................................................................................................................................................................................................................................................................................................................................................
    Traceback (most recent call last):
      File "J:\ChatRWKV\v2\chat.py", line 110, in <module>
        model = RWKV(model=args.MODEL_NAME, strategy=args.strategy)
      File "J:\ChatRWKV\python3.10.10\lib\site-packages\torch\jit\_script.py", line 293, in init_then_script
        original_init(self, *args, **kwargs)
      File "J:\ChatRWKV\v2/../rwkv_pip_package/src\rwkv\model.py", line 192, in __init__
        w[x] = w[x] / (2 ** int(layer_id // self.RESCALE_LAYER))
    RuntimeError: [enforce fail at ..\c10\core\impl\alloc_cpu.cpp:72] data. DefaultCPUAllocator: not enough memory: you tried to allocate 209715200 bytes.

More details / guidance needed

  1. it would be quite helpful to have a requirements.txt along with the project, as I hit some issues while trying the model, e.g. I didn't know I needed to install torch and numpy via pip, and transformers via pip+git
  2. it's quite challenging to set up the project

section I:

args.RUN_DEVICE = 'cpu' # line 7
args.MODEL_NAME = '$PATH/RWKV-4b-Pile-171M-20230202-7922' # line 17

== >

python chat.py
...

Run prompt...
Traceback (most recent call last):
  File "chat.py", line 216, in <module>
    out = run_rnn(tokenizer.tokenizer.encode(init_prompt))
  File "chat.py", line 184, in run_rnn
    current_state = model.forward(model_tokens, current_state, preprocess_only = True)
  File "$PATH/ChatRWKV/src/model_run.py", line 191, in forward
    state = torch.zeros(args.n_layer * 5, args.n_embd, device=self.RUN_DEVICE)
AttributeError: 'types.SimpleNamespace' object has no attribute 'n_layer'

section II:

if '-1B5-' in args.MODEL_NAME or '/1.5-' in args.MODEL_NAME:
    args.n_layer = 24
    args.n_embd = 2048
elif '-3B-' in args.MODEL_NAME or '/3-' in args.MODEL_NAME:
    args.n_layer = 32
    args.n_embd = 2560
elif '-7B-' in args.MODEL_NAME or '/7-' in args.MODEL_NAME:
    args.n_layer = 32
    args.n_embd = 4096
elif '-14B-' in args.MODEL_NAME or '/14-' in args.MODEL_NAME:
    args.n_layer = 40
    args.n_embd = 5120
else: # line 41
    args.n_layer = 24
    args.n_embd = 768

==>

Run prompt...
Traceback (most recent call last):
  File "chat.py", line 219, in <module>
    out = run_rnn(tokenizer.tokenizer.encode(init_prompt))
  File "chat.py", line 187, in run_rnn
    current_state = model.forward(model_tokens, current_state, preprocess_only = True)
  File "$PATH/ChatRWKV/src/model_run.py", line 197, in forward
    x = self.LN(x, w.blocks[i].ln0)
  File "$PATH/ChatRWKV/src/model_run.py", line 103, in LN
    return F.layer_norm(x, (self.args.n_embd,), weight=w.weight, bias=w.bias)
  File "/usr/local/Caskroom/miniconda/base/envs/chatbot/lib/python3.8/site-packages/torch/nn/functional.py", line 2515, in layer_norm
    return torch.layer_norm(input, normalized_shape, weight, bias, eps, torch.backends.cudnn.enabled)
RuntimeError: "LayerNormKernelImpl" not implemented for 'Half'

section III

args.FLOAT_MODE = 'fp32' # line 8

finally, it works
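For anyone else hitting the same sequence: putting sections I-III together, this is the combination that finally worked for me. The sketch below only shows the relevant chat.py settings; the model name and line numbers are the ones from the snippets above, and nothing here is verified against the current repo.

from types import SimpleNamespace

args = SimpleNamespace()
args.RUN_DEVICE = 'cpu'      # section I (line 7): run on CPU
args.FLOAT_MODE = 'fp32'     # section III (line 8): avoids the 'Half' LayerNorm error on CPU
args.MODEL_NAME = '$PATH/RWKV-4b-Pile-171M-20230202-7922'  # section I (line 17); falls through to the else branch (line 41)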

Maximum Token Length

Just wondering what is token window/max token length? And your thoughts on increasing it?


ps: this repo is amazing. I wish I would have known about it sooner. you guys are awesome and I'll try to contrib at some point if I can.

Repeated sentences (and a thank you)

Hoi,

First off, thank you for making this all open source (including the models). Had a lot of fun with it, creating myself a virtual 9 year old adventurer girl who goes to save a town from evil wizards together with her uncle. Sure, it is not perfect and needs lots of nudges to keep it flowing, but the responses you can get are sometimes hilarious. One time I introduced a dragon into the story. Initially the girl was afraid, but after sharing part of her chocolate bar, they became best friends and inseparable ;-)
Also she wanted her uncle to marry the local town witch so she had a real family (somewhat curious what you feed this model as training data? ;-)

What I do notice, though, is that after about 20 or so entries and responses the model starts to produce repeated sentences. Still correct from a context point of view, but they get more and more frequent. And that often spells the end of that story, because once the engine is in that "state" it starts to produce nonsense or only single words repeated forever. Which is a shame for an RNN that, theoretically at least, has infinite context length. Not sure if it is numerical stability or the latent vector going into some weird state, but after that a reset is the only way to get it back. Then it functions correctly again for 20 moves (but of course the progress so far is lost). This hints that the problem is with the stored state in the RNN. I tried the model with 4096 context length, but that doesn't seem to help much. Not sure if it needs a softmax layer somewhere or if it is an inherent limitation of this RNN. But it could be so much more fun if this didn't happen.

Also I noticed the model knows more than just English. After typing +gen Het laatste niews van vandaag (translated from Dutch: the latest news from today) it generates complete Dutch paragraphs. Didn't expect that in an English model. Though the quality of the generated text is quite low, what I find curious is that it is completely in Dutch and not an English/Dutch mix.
Not sure if that uses up needed weight capacity in the model, or helps with generalizing across world knowledge. Not much experience with these language models yet, but I found it interesting.

Context is everything though. In my created story mode it doesn't happen:

Pjere: Lees het laaste nieuws van vandaag
Alice: I read the newspaper today. It said about the witch who is hurting people, is going to kill a boy in the next few days and wants to save him, it says the punishment is a light prison sentence, but it has to go before a council, what does that mean?

Pjere: +gen Lees het laatste nieuws van vandaag
We respecteren het privacybeleid van het aanklachten site
Vorige maand besloot de Belgische Staat bij vervolging weer van de klager.
Onlangs, op 23 februari 2019, legde het Belgische Instituut voor Onderzoeksjournalistiek (iJOU) een proces-verbaal voor omdat de site voor het onderzoek weigerde met alle relevante informatie te komen over de aanklachten door de staat tegen Roos Van der Hoek. De uitzondering was de nationale rechtsbescherming, de scheidende schadeaanklager die het aanklachten site beschikt over de gegevens. Die had het volste recht te vragen om alle data weg te halen van het iJOU-site en het be

Can you provide the requirements file ?

Your ChatRWKV is damn good! Can you provide the requirements file? If possible, could you provide requirements for all of your deep learning open source projects? That would make your models easier to use and lower the barrier for beginners. Thank you for your work!

CUDA load error on Windows 10

Hey, this is great work. I have been trying to set this up from repo for windows and the biggest problem so far is this error in the cuda load step (ChatRWKV/rwkv_pip_package/src/rwkv/model.py:29):

"CalledProcessError: Command '['where', 'cl']' returned non-zero exit status 1."

I'm not too familiar with low-level cuda programs so having a tough time debugging. Maybe you have some idea? Thanks!
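One possible workaround, sketched from the RWKV_CUDA_ON check in rwkv/model.py that is quoted later on this page: disable the custom kernel so the `where cl` lookup never runs (the proper fix is presumably getting cl.exe onto PATH, e.g. by running from a VS Developer Command Prompt).

import os
os.environ['RWKV_CUDA_ON'] = '0'   # skip JIT-compiling the custom CUDA kernel, so 'where cl' is never invoked
from rwkv.model import RWKV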

AssertionError: Torch not compiled with CUDA enabled

Traceback (most recent call last):
  File "\ChatRWKV-main\ChatRWKV-main\chat.py", line 218, in <module>
    model = RWKV_RNN(args)
            ^^^^^^^^^^^^^^
  File "\ChatRWKV-main\ChatRWKV-main\venv\Lib\site-packages\torch\jit\_script.py", line 292, in init_then_script
    original_init(self, *args, **kwargs)
  File "\ChatRWKV-main\ChatRWKV-main\src\model_run.py", line 75, in __init__
    w[x] = w[x].to(self.RUN_DEVICE)
    ^^^^^^^^^^^^^^^^^^^^^^^^
  File "\ChatRWKV-main\ChatRWKV-main\venv\Lib\site-packages\torch\cuda\__init__.py", line 239, in _lazy_init
    raise AssertionError("Torch not compiled with CUDA enabled")
AssertionError: Torch not compiled with CUDA enabled

Process finished with exit code 1
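For reference, a quick sanity-check sketch (plain PyTorch, nothing ChatRWKV-specific): a CPU-only wheel reports no CUDA support, in which case either a CUDA build of torch has to be installed or the device/strategy has to stay on cpu.

import torch

print(torch.__version__)          # a '+cpu' suffix means a CPU-only build
print(torch.cuda.is_available())  # must be True before using a cuda device/strategy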

RuntimeError: probability tensor contains either `inf`, `nan` or element < 0

My graphics card is a GTX 1080 Ti with 11 GB. The small model runs fine. With the medium model, "Run prompt" completes successfully, but as soon as I enter a question it throws this error:
Traceback (most recent call last):
File "F:\Projects\ChatRWKV\chat.py", line 397, in <module>
on_message(msg)
File "F:\Projects\ChatRWKV\chat.py", line 357, in on_message
token = tokenizer.sample_logits(
File "F:\Projects\ChatRWKV\src\utils.py", line 85, in sample_logits
out = torch.multinomial(probs, num_samples=1)[0]
RuntimeError: probability tensor contains either inf, nan or element < 0

The small model is RWKV-4-Pile-1B5-Instruct-test1-20230124.pth
The medium model is RWKV-4-Pile-3B-Instruct-test1-20230124.pth
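A defensive sketch around the sampling step (hypothetical, not code from this repo): casting the logits to fp32 before the softmax and checking for non-finite values makes it easier to tell whether fp16 overflow in the model output is what feeds inf/nan into torch.multinomial.

import torch

def sample_token(logits: torch.Tensor) -> int:
    # cast to fp32 before softmax so fp16 overflow cannot turn into inf/nan here
    probs = torch.softmax(logits.float(), dim=-1)
    if not torch.isfinite(probs).all():
        raise RuntimeError('non-finite probabilities; try a higher-precision strategy')
    return int(torch.multinomial(probs, num_samples=1)[0])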

Stuck on "Run prompt..."

Stuck on "Run prompt..." printed message after model init. What could be a problem?
Using cpu fp32 settings

TY

How to change output behavior?

Great work!

I want to build an API server on top of this. Is there any way to change the output behavior from emitting characters one by one to returning all of the output once generation is done? Thanks.
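For reference, a minimal sketch of one way this could look with the v2 rwkv pip package and its PIPELINE helper (model path, sampling parameters, and the sample_logits argument names are assumptions on my part): accumulate the sampled tokens and only decode and return once generation is done.

from rwkv.model import RWKV
from rwkv.utils import PIPELINE

model = RWKV(model='/path/to/model', strategy='cuda fp16')   # placeholder path and strategy
pipeline = PIPELINE(model, '20B_tokenizer.json')

def generate_all(prompt, max_tokens=200):
    out, state = model.forward(pipeline.encode(prompt), None)
    tokens = []
    for _ in range(max_tokens):
        token = pipeline.sample_logits(out, temperature=1.0, top_p=0.85)
        tokens.append(token)
        out, state = model.forward([token], state)
    return pipeline.decode(tokens)   # return the whole completion at once instead of streaming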

Improve code/instruction for how to download model fsx/BlinkDL/HF-MODEL

Clarify that so it is easier to follow.

Was getting this error

FileNotFoundError: [Errno 2] No such file or directory: '/fsx/BlinkDL/HF-MODEL/rwkv-4-pile-14b/RWKV-4-Pile-14B-20230313-ctx8192-test1050.pth'

title = "RWKV-4-Pile-14B-20230313-ctx8192-test1050"
model_path = hf_hub_download(repo_id="BlinkDL/rwkv-4-pile-14b", filename=f"{title}.pth")
model = RWKV(model=model_path, strategy='cpu')

AMD Ubuntu 22.04 GPU - guide

  • The bash history below was assembled from my local on-premise 6900 XT machine
  • Inference directions, untested for training
  • not containerized, raw bare metal, no conda or docker involved
  • using ROCM 5.4 drivers with pytorch rocm 5.2 works for me
  • Other GPUs from AMD in the same generation may or may not work
  • I read some reports on forums that Kali Linux needs less preparation (fewer terminal commands) than Ubuntu 22.04
  • Additional comments are "echo" cmds in the code block lines
  • Obviously after the below, download and extract the ChatRWKV github and the appropriate model you want and place it in the appropriate directory, cd to the correct ChatRWKV folder, "pip3" install the requirements, and run "python3 chat.py"
  • I use a separate boot drive from my python packages and cache folders, can save a lot of painful terminal history clutter if you just have a large single NVME for boot and everything
  • Always use "--extra-index-url https://download.pytorch.org/whl/rocm5.2" when using pip install every single time for every package just to be safe
  • May be missing a few line items, made best efforts to re-assemble from 600 lines down to the essential 52 lines below
  • Does NOT include SWAP space directions. SWAP is highly recommended if using less than 32GB RAM
  • EDIT 2023-03-22 I wiped and started over. This guide is no longer up to date for Ubuntu 22.04.1, working through workaround now - consider using a Colab which can support the 14B model in the mean time using rwkvstic package (check discord)

Your mileage may vary:
1;/bin/software-properties-gtk ; echo 'turn on via checkmark all repos in the first tab in the GTK GUI for software properties'
2;sudo apt-get update
3;echo 'Download AMD linux drivers for 6900XT from their support website'
4;ll ~/Downloads/amdgpu-install_5.4.50401-1_all.deb
5;sudo chown _apt ~/Downloads/amdgpu-install_5.4.50401-1_all.deb
6;sudo apt-get install ~/Downloads/amdgpu-install_5.4.50401-1_all.deb
7;sudo chown ubuntu ~/Downloads/amdgpu-install_5.4.50401-1_all.deb
8;sudo apt-get install ~/Downloads/amdgpu-install_5.4.50401-1_all.deb
9;sudo apt-cache showpkg amdgpu-install
10;which -a amdgpu-install
11;sudo amdgpu-install --usecase=hiplibsdk,rocm,hip,dkms,hip-dev
12;sudo apt-get install perl liburi-encode-perl libfile-copy-recursive-perl libtinfo5 libncurses5
13;sudo apt-get install python3-pip
14;/bin/update-manager
15;echo 'update software in ubuntu GUI as well'
16;rocm-smi
17;echo 'the above should display GPU information'
18;export HSA_OVERRIDE_GFX_VERSION=10.3.0 ; echo 'this is important later for pytorch'
19;sudo snap refresh firefox --stable; echo 'only run if your firefox somehow breaks from the above process'
20;sudo shutdown -r now
21;echo 'restart often during this process'
22;echo 'the below 3 commands may be skipped, untested without skipping - the linux username is ubuntu but should be your username'
23;sudo usermod -a -G render ubuntu
24;sudo usermod -a -G video ubuntu
25;sudo shutdown -r now;echo 'restart often during this process'
26;mkdir /media/ubuntu/2TB_fast_nvme_Drive1/pip_cache
27;mkdir /media/ubuntu/2TB_fast_nvme_Drive1/pip_local_site-packages
28;echo 'only need the -t and --cache-dir flags with pip3 in the next command if your boot drive is not your Machine Learning drive'
29;pip3 install --user -t /media/ubuntu/2TB_fast_nvme_Drive1/pip_local_site-packages --cache-dir=/media/ubuntu/2TB_fast_nvme_drive1/pip_cache torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/rocm5.2
30;echo "the only one that matters for Natural Language Processing in this history is torch, the others may error and that's ok for this terminal history"
31;echo 'only if your boot drive is not your Machine Learning drive do the below'
32;export PYTHONUSERBASE=/media/ubuntu/2TB_fast_nvme_drive1/pip_local_site-packages
33;export TMPDIR=/media/ubuntu/2TB_fast_nvme_drive1/pip_cache
34;export PYTHONPATH=/media/ubuntu/2TB_fast_nvme_Drive1/pip_local_site-packages ; echo 'only if your boot drive is not your Machine Learning drive'
35;echo 'see you on the flip side, restart'
36;sudo shutdown -r now
37;echo 'one must make sure their non-boot drives are initiated if /etc/fstab is not taking hold - opening a file explorer, navigate to your NVME if not your boot drive manually upon every reboot, if fstab does not gracefully automount the drive at every startup'
38;echo 'clean up a little, just in case'
39;sudo apt-get install --fix-broken
40;sudo apt-get upgrade
41;lspci | grep AMD
42;echo 'mine shows an entry like this "03:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Navi 21 [Radeon RX 6800/6800 XT / 6900 XT] (rev c0)" use the beginning of the lspci AMD string output to check a folder'
43;sudo ls /sys/bus/pci/devices/03
44;echo 'the output of the below should be 0, not -1'
45;sudo cat /sys/bus/pci/devices/03/numa_node
46;echo 0 | sudo tee "/sys/bus/pci/devices/0000:03:00.0/numa_node"
47;pip3 install -t /media/ubuntu/2TB_fast_nvme_Drive1/pip_local_site-packages --cache-dir=/media/ubuntu/2TB_fast_nvme_Drive1/pip_cache --upgrade transformers accelerate bitsandbytes-rocm --extra-index-url https://download.pytorch.org/whl/rocm5.2
48;echo 'see you on the flip side, restart'
49;sudo shutdown -r now
50;export LD_LIBRARY_PATH=/opt/rocm-5.4.1/lib
export LD_LIBRARY_PATH=/opt/rocm-5.4.3/lib:/opt/rocm-5.4.3/lib64
export PATH=$PATH:/opt/rocm-5.4.3/bin:/opt/rocm-5.4.3/opencl/bin
export LD_LIBRARY_PATH=/opt/rocm/lib:/opt/rocm/lib64
export PATH=$PATH:/opt/rocm/bin:/opt/rocm/opencl/bin
51;echo 'begin python3 torch and inference tests'
52;echo "alias python3='rocm-smi --setfan 99%;python3' #AMD fan curve was not aggressive enough for my cooling" >> ~/.bashrc

ModuleNotFoundError: No module named 'tokenizers.tokenizers'

When initiating the chatbot following error occurs:

Traceback (most recent call last):
File "C:\Users\yongy\chatrwkv\v2\chat.py", line 125, in <module>
pipeline = PIPELINE(model, f"{current_path}/20B_tokenizer.json")
File "c:\users\yongy\appdata\local\programs\python\python37\lib\site-packages\rwkv\utils.py", line 28, in __init__
from tokenizers import Tokenizer
File "c:\users\yongy\appdata\local\programs\python\python37\lib\site-packages\tokenizers\__init__.py", line 80, in <module>
from .tokenizers import (
ModuleNotFoundError: No module named 'tokenizers.tokenizers'

Remove pytorch dependency

The RWKV package includes a

  'torch ~= 1.13.1',

dependency. In cases where pytorch is installed using conda (a very common case) or some other means, this causes a conflict.

Most machine learning libraries do not include pytorch as a requirement.

Is it possible to remove this dependency?

Add prompt_toolkit to requirements.txt

so that we can install all dependencies by running pip install -r requirements.txt.
Otherwise running chat.py fails:

$ python v2/chat.py 
Traceback (most recent call last):
  File "/mnt/tera/git-repos/ChatRWKV/v2/chat.py", line 10, in <module>
    from prompt_toolkit import prompt
ModuleNotFoundError: No module named 'prompt_toolkit'
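For what it's worth, a hedged sketch of what such a requirements.txt could contain, pieced together from the packages mentioned across these issues (version pins are deliberately left out):

torch
numpy
tokenizers
prompt_toolkit
rwkv
ninja   # only needed when RWKV_CUDA_ON=1 builds the custom kernel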

0.5.0 operators.cu fails to compile on compute 6.x

[...]/rwkv/cuda/operators.cu(123): error: no instance of overloaded function "atomicAdd" matches the argument list
            argument types are: (__half *, __half)
          atomicAdd(&y[k], __float2half(y_local));
          ^

This is likely because my GPU (a 1060) only supports compute 6.1 while atomicAdd support for __half requires compute 7.0 per https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#atomicadd

It seems like

#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ < 700)
/* fallback that avoids the __half overload of atomicAdd */
#endif

would be needed to support lower compute versions. I don't know enough about this to contribute anything more helpful, unfortunately.

[Feature request] 4bit/3bit quantization

As you probably know, there's a trend of lowering the size and VRAM usage of the LLaMA model even further using 4 or 3 bit quantization. Is it possible to implement it for this repository too? It would be a great help.


How to make it run on Apple Silicon (M1) on Mac

Since torch already supports the MPS backend, it would be nice if RWKV supported MPS so we can run inference on a MacBook M1/M2.
I tried to change the strategy, but it won't work:
args.strategy = 'mps fp16'
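For completeness, roughly what I tried, with torch's own availability check added (paths are placeholders; whether the rwkv package accepts an mps device in its strategy string at all is exactly what I'm asking):

import torch
from rwkv.model import RWKV

assert torch.backends.mps.is_available()   # torch itself can see the M1/M2 GPU
model = RWKV(model='/path/to/model', strategy='mps fp16')   # this is the part that does not work for me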

include "timeit" module in benchmark.py

I'm not exactly sure of the best way to implement this, however it would be good to use the "timeit" Python module in the todo loop at the end of benchmark.py somehow, to show how long each loop of 100 model.forward calls takes.
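Something along these lines is what I had in mind (a sketch only; run_forward_loop is a hypothetical stand-in for whatever the loop at the end of benchmark.py actually does):

import timeit

def run_forward_loop():
    # placeholder for the existing loop of 100 model.forward() calls in benchmark.py
    pass

elapsed = timeit.timeit(run_forward_loop, number=1)
print(f'100 x model.forward took {elapsed:.3f} s')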

Simplify v2/Chat.py

Can you simplify chat.py so it is easier to follow and to use for inference where conversation history can be sent? Right now it seems pretty obfuscated, and it uses some hacks to limit output.

Still OOM for v2

Tried the v2 chat.py with the 14B model but still got OOM with 24 GB of graphics memory.
Torch version is 1.13.0. Why?
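In case it helps: 14B parameters in fp16 are roughly 28 GB of weights alone, so 24 GB cannot hold the model without quantizing and/or offloading part of it. A hedged sketch using the strategy syntax quoted elsewhere on this page (the layer split is a placeholder to tune, not a verified number):

from rwkv.model import RWKV

# int8-quantize most layers on the GPU and keep the remainder on the CPU in fp32
model = RWKV(model='/path/to/RWKV-4-Pile-14B-20230313-ctx8192-test1050',
             strategy='cuda fp16i8 *30 -> cpu fp32')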

v2 models

Hello, since I'm running into VRAM constraints I would like to start using v2.
I have seen a conversion script in the v2/ folder. Do I need to convert an RWKV model, like before, in order to use it in v2?

RuntimeError: CUDA error: an illegal memory access was encountered

Encountering an issue with setting RWKV_CUDA_ON to '1' when using multi-gpu strategy.
All the GPUs are the same 3060ti 8Gb with cuda 11.7 installed.

(base) [rig-nenkoru@localhost Raven-RWKV-7B]$ /usr/local/cuda/bin/nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Tue_May__3_18:49:52_PDT_2022
Cuda compilation tools, release 11.7, V11.7.64
Build cuda_11.7.r11.7/compiler.31294372_0
torch
ninja
tokenizers
rwkv==0.6.2
pynvml
huggingface_hub
gradio>=3.17.1
(llama) [rig-nenkoru@localhost rwkv]$ pip show torch
Name: torch
Version: 2.0.0
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org/
'cuda:0 fp16 -> cuda:1 fp16 -> cuda:2 fp16'
Traceback (most recent call last):
  File "/home/rig-nenkoru/miniconda3/envs/llama/lib/python3.10/site-packages/gradio/routes.py", line 393, in run_predict
    output = await app.get_blocks().process_api(
  File "/home/rig-nenkoru/miniconda3/envs/llama/lib/python3.10/site-packages/gradio/blocks.py", line 1108, in process_api
    result = await self.call_function(
  File "/home/rig-nenkoru/miniconda3/envs/llama/lib/python3.10/site-packages/gradio/blocks.py", line 929, in call_function
    prediction = await anyio.to_thread.run_sync(
  File "/home/rig-nenkoru/miniconda3/envs/llama/lib/python3.10/site-packages/anyio/to_thread.py", line 31, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/home/rig-nenkoru/miniconda3/envs/llama/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 937, in run_sync_in_worker_thread
    return await future
  File "/home/rig-nenkoru/miniconda3/envs/llama/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 867, in run
    result = context.run(func, *args)
  File "/home/rig-nenkoru/miniconda3/envs/llama/lib/python3.10/site-packages/gradio/utils.py", line 490, in async_iteration
    return next(iterator)
  File "/home/rig-nenkoru/miniconda3/envs/llama/lib/python3.10/site-packages/gradio/interface.py", line 621, in fn
    for output in self.fn(*args):
  File "/home/rig-nenkoru/Raven-RWKV-7B/./app.py", line 66, in evaluate
    out, state = model.forward(pipeline.encode(ctx)[-ctx_limit:] if i == 0 else [token], state)
  File "/home/rig-nenkoru/miniconda3/envs/llama/lib/python3.10/site-packages/rwkv/model.py", line 573, in forward
    x, state[i*5+0], state[i*5+1], state[i*5+2], state[i*5+3] = ATT(
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
  File "/home/rig-nenkoru/miniconda3/envs/llama/lib/python3.10/site-packages/rwkv/model.py", line 485, in cuda_att_seq
            y, aa, bb, pp = cuda_wkv(T, C, t_decay, t_first, k, v, aa, bb, pp)
            
            out = (r * y) @ ow
                   ~~~~~ <--- HERE
            return x + out, xx[-1,:], aa, bb, pp
RuntimeError: CUDA error: an illegal memory access was encountered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Compile issue on Linux

Switched over to Linux, installed ninja, and have a compile issue perhaps.
Suggestions?

python chat.py


ChatRWKV v2 https://github.com/BlinkDL/ChatRWKV

English - cuda fp16 - /media/main/C/Users/Jason/Documents/machine_learning/language_ML/ChatRWKV/v2/prompt/default/English-2.py
Using /home/main/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/main/.cache/torch_extensions/py39_cu117/wkv_cuda/build.ninja...
Building extension module wkv_cuda...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/1] c++ wrapper.o operators.cuda.o -shared -L/home/main/miniconda3/envs/gptj/lib/python3.9/site-packages/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda_cu -ltorch_cuda_cpp -ltorch -ltorch_python -L/home/main/miniconda3/envs/gptj/lib64 -lcudart -o wkv_cuda.so
FAILED: wkv_cuda.so 
c++ wrapper.o operators.cuda.o -shared -L/home/main/miniconda3/envs/gptj/lib/python3.9/site-packages/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda_cu -ltorch_cuda_cpp -ltorch -ltorch_python -L/home/main/miniconda3/envs/gptj/lib64 -lcudart -o wkv_cuda.so
/usr/bin/ld: cannot find -lcudart: No such file or directory
collect2: error: ld returned 1 exit status
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
  File "/home/main/miniconda3/envs/gptj/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1900, in _run_ninja_build
    subprocess.run(
  File "/home/main/miniconda3/envs/gptj/lib/python3.9/subprocess.py", line 528, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/media/main/C/Users/Jason/Documents/machine_learning/language_ML/ChatRWKV/v2/chat.py", line 105, in <module>
    from rwkv.model import RWKV
  File "/media/main/C/Users/Jason/Documents/machine_learning/language_ML/ChatRWKV/v2/../rwkv_pip_package/src/rwkv/model.py", line 29, in <module>
    load(
  File "/home/main/miniconda3/envs/gptj/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1284, in load
    return _jit_compile(
  File "/home/main/miniconda3/envs/gptj/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1508, in _jit_compile
    _write_ninja_file_and_build_library(
  File "/home/main/miniconda3/envs/gptj/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1623, in _write_ninja_file_and_build_library
    _run_ninja_build(
  File "/home/main/miniconda3/envs/gptj/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1916, in _run_ninja_build
    raise RuntimeError(message) from e
RuntimeError: Error building extension 'wkv_cuda'

Severe output quality difference between 4096 and 8192 models

Thanks for this great code and models.

I've been testing long form text generation from prompts with:

  1. RWKV-4-Pile-14B-20230228-ctx4096-test663.pth

  2. RWKV-4-Pile-14B-20230313-ctx8192-test1050.pth

I've made the necessary simple edits to chat.py, API_DEMO.py, and your gradio app.py to generate 4000+ tokens with both models using the same settings and found the 4096 model generates relatively coherent output for the entire 4000+ tokens. No small task.

Unfortunately, the 8192 model starts well then severely breaks down at around 2048 max tokens until by the end of 4000-8000 tokens it's simply repeating words. e.g. "... The Prince and the Pauper , The Prince and the Pauper , _The Prince and"

I understand most users are interested in chat-style generation and will never be interested in long form replies, but I'm still wondering if there are special settings to improve the 8192 model's output relative to the 4096 model?

Also, I saw a post where a 50b model is in the works? Fantastic if that happens.
That would be a great time to have a version of the API_DEMO.py that runs on multiple gpus.

Cheers,

Hi any example of conversation ?

Hi do you have any examples of where it is conversational? I am trying to find the best way to send conversational history.

Right now, the answer is too long.

image

Not enough memory for 1 gpu

There are 2 GPUs on my PC but the model can't be loaded across both of them, so it can't run the 14B model. Here is some information.

Traceback (most recent call last):
  File "chat.py", line 199, in <module>
    model = RWKV_RNN(args)
  File "/home/*/ChatRWKV/lib/python3.8/site-packages/torch/jit/_script.py", line 293, in init_then_script
    original_init(self, *args, **kwargs)
  File "/home/*/ChatRWKV/src/model_run.py", line 76, in __init__
    w[x] = w[x].cuda()
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 50.00 MiB (GPU 0; 23.69 GiB total capacity; 19.19 GiB already allocated; 39.12 MiB free; 19.19 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

So, is it possible to use both of my gpus? Thanks!
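The strategy string quoted in another issue on this page ('cuda:0 fp16 -> cuda:1 fp16 -> cuda:2 fp16') suggests the v2 package can split layers across devices; a hedged sketch (whether the *N layer-count syntax combines with explicit device indices like this is my assumption, and the split point is a placeholder):

from rwkv.model import RWKV

model = RWKV(model='/path/to/RWKV-4-Pile-14B',
             strategy='cuda:0 fp16 *20 -> cuda:1 fp16')   # first 20 layers on GPU 0, the rest on GPU 1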

cpu fp32i8 for low RAM usage on CPU?

Hi. This isn't an issue, but I didn't know where else to put this, haha.

I've been watching the progress of ChatRWKV (which is awesome; thank you so much for developing this), and I'm a user of oobabooga's Web UI (so I'm aware of the thread on RWKV support). I like to tinker with text generation models that can be used for chatbot tasks and on CPUs, with little system memory.

I heard about int8 quantization, and not having a good enough GPU (but plenty of RAM) on my main PC, I gave it a try via cpu fp32i8. To my surprise, it works! I still needed swap space to load the model, but after that I was able to run 7B under 8.8 GiB of RAM (with spikes to around 10.5 GiB while generating), and it loaded and generated faster than plain bf16, to the best of my recollection. It was a few days back, but these were the results I remember writing down:

# MODEL		MEMORY USAGE
169M		1.0 GiB (fp32) / 743.0 MiB (bf16) / 856.7 MiB (fp32i8) / 877.9 MiB (bf16i8)
430M		2.1 GiB (fp32) /   1.3 GiB (bf16) /   1.2 GiB (fp32i8) /   2.4 GiB (bf16i8)
1.5B		6.3 GiB (fp32) /   3.4 GiB (bf16) /   5.7 GiB (fp32i8) /   4.5 GiB (bf16i8)
3B		??.? GiB (fp32) /   6.1 GiB (bf16) /   4.3 GiB (fp32i8) /  ~9.1 GiB (bf16i8)
7B		??.? GiB (fp32) /  ??.? GiB (bf16) /   8.8 GiB (fp32i8) /  ??.? GiB (bf16i8)

I do notice that it fluctuates a bit (I tried 1.5B just now, and after it was done loading it ended up idling at around 2.5 GiB the first time, 5.8 GiB the second time, and back to 2.4 GiB the third time); I'm not sure why.

But yeah, I don't see any mentions of cpu fp32i8 anywhere, not even in any Discord servers, only mentions of cuda fp16i8, so I was wondering if this was intended or if it's just a nice side effect?

Expanding more on running on Windows with CUDA

Install VS2022 build tools (https://aka.ms/vs/17/release/vs_BuildTools.exe select Desktop C++).

Reinstall CUDA 11.7 (install VC++ extensions).
-- What is the purpose of re-installing if CUDA 11.7 is already installed?

I have CUDA 11.7 and vs2022 installed,
when I try to run

from rwkv.model import RWKV
from rwkv.utils import PIPELINE, PIPELINE_ARGS

I get

Using C:\Users\Jason\AppData\Local\torch_extensions\torch_extensions\Cache\py310_cu116 as PyTorch extensions root...
C:\Users\Jason\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\utils\cpp_extension.py:358: UserWarning: Error checking compiler version for cl: [WinError 2] The system cannot find the file specified
  warnings.warn(f'Error checking compiler version for {compiler}: {error}')
Detected CUDA files, patching ldflags
Emitting ninja build file C:\Users\Jason\AppData\Local\torch_extensions\torch_extensions\Cache\py310_cu116\wkv_cuda\build.ninja...

---------------------------------------------------------------------------
CalledProcessError                        Traceback (most recent call last)
Cell In [3], line 1
----> 1 from rwkv.model import RWKV
      2 from rwkv.utils import PIPELINE, PIPELINE_ARGS

File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\rwkv\model.py:29
     27 if os.environ.get('RWKV_CUDA_ON') == '1':
     28     from torch.utils.cpp_extension import load
---> 29     load(
     30         name=f"wkv_cuda",
     31         sources=[f"{current_path}/cuda/wrapper.cpp", f"{current_path}/cuda/operators.cu"],
     32         verbose=True,
     33         extra_cuda_cflags=["-t 4", "-std=c++17", "--use_fast_math", "-O3", "--extra-device-vectorization"],
     34         is_python_module=False)
     36     @MyStatic
     37     def cuda_wkv(T: int, C: int, w, u, k, v, aa, bb, pp):
     38         assert 1 * C % min(C, 32) == 0

File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\utils\cpp_extension.py:1284, in load(name, sources, extra_cflags, extra_cuda_cflags, extra_ldflags, extra_include_paths, build_directory, verbose, with_cuda, is_python_module, is_standalone, keep_intermediates)
   1192 def load(name,
   1193          sources: Union[str, List[str]],
   1194          extra_cflags=None,
   (...)
   1202          is_standalone=False,
   1203          keep_intermediates=True):
   1204     r'''
   1205     Loads a PyTorch C++ extension just-in-time (JIT).
   1206 
   (...)
   1282         ...     verbose=True)
   1283     '''
-> 1284     return _jit_compile(
   1285         name,
   1286         [sources] if isinstance(sources, str) else sources,
   1287         extra_cflags,
   1288         extra_cuda_cflags,
   1289         extra_ldflags,
   1290         extra_include_paths,
   1291         build_directory or _get_build_directory(name, verbose),
   1292         verbose,
   1293         with_cuda,
   1294         is_python_module,
   1295         is_standalone,
   1296         keep_intermediates=keep_intermediates)

File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\utils\cpp_extension.py:1508, in _jit_compile(name, sources, extra_cflags, extra_cuda_cflags, extra_ldflags, extra_include_paths, build_directory, verbose, with_cuda, is_python_module, is_standalone, keep_intermediates)
   1504                 hipified_sources.add(hipify_result[s_abs]["hipified_path"] if s_abs in hipify_result else s_abs)
   1506             sources = list(hipified_sources)
-> 1508         _write_ninja_file_and_build_library(
   1509             name=name,
   1510             sources=sources,
   1511             extra_cflags=extra_cflags or [],
   1512             extra_cuda_cflags=extra_cuda_cflags or [],
   1513             extra_ldflags=extra_ldflags or [],
   1514             extra_include_paths=extra_include_paths or [],
   1515             build_directory=build_directory,
   1516             verbose=verbose,
   1517             with_cuda=with_cuda,
   1518             is_standalone=is_standalone)
   1519 finally:
   1520     baton.release()

File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\utils\cpp_extension.py:1610, in _write_ninja_file_and_build_library(name, sources, extra_cflags, extra_cuda_cflags, extra_ldflags, extra_include_paths, build_directory, verbose, with_cuda, is_standalone)
   1607     print(f'Emitting ninja build file {build_file_path}...', file=sys.stderr)
   1608 # NOTE: Emitting a new ninja build file does not cause re-compilation if
   1609 # the sources did not change, so it's ok to re-emit (and it's fast).
-> 1610 _write_ninja_file_to_build_library(
   1611     path=build_file_path,
   1612     name=name,
   1613     sources=sources,
   1614     extra_cflags=extra_cflags or [],
   1615     extra_cuda_cflags=extra_cuda_cflags or [],
   1616     extra_ldflags=extra_ldflags or [],
   1617     extra_include_paths=extra_include_paths or [],
   1618     with_cuda=with_cuda,
   1619     is_standalone=is_standalone)
   1621 if verbose:
   1622     print(f'Building extension module {name}...', file=sys.stderr)

File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\utils\cpp_extension.py:2055, in _write_ninja_file_to_build_library(path, name, sources, extra_cflags, extra_cuda_cflags, extra_ldflags, extra_include_paths, with_cuda, is_standalone)
   2052 ext = EXEC_EXT if is_standalone else LIB_EXT
   2053 library_target = f'{name}{ext}'
-> 2055 _write_ninja_file(
   2056     path=path,
   2057     cflags=cflags,
   2058     post_cflags=None,
   2059     cuda_cflags=cuda_flags,
   2060     cuda_post_cflags=None,
   2061     cuda_dlink_post_cflags=None,
   2062     sources=sources,
   2063     objects=objects,
   2064     ldflags=ldflags,
   2065     library_target=library_target,
   2066     with_cuda=with_cuda)

File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\utils\cpp_extension.py:2195, in _write_ninja_file(path, cflags, post_cflags, cuda_cflags, cuda_post_cflags, cuda_dlink_post_cflags, sources, objects, ldflags, library_target, with_cuda)
   2193 link_rule = ['rule link']
   2194 if IS_WINDOWS:
-> 2195     cl_paths = subprocess.check_output(['where',
   2196                                         'cl']).decode(*SUBPROCESS_DECODE_ARGS).split('\r\n')
   2197     if len(cl_paths) >= 1:
   2198         cl_path = os.path.dirname(cl_paths[0]).replace(':', '$:')

File ~\AppData\Local\Programs\Python\Python310\lib\subprocess.py:421, in check_output(timeout, *popenargs, **kwargs)
    418         empty = b''
    419     kwargs['input'] = empty
--> 421 return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
    422            **kwargs).stdout

File ~\AppData\Local\Programs\Python\Python310\lib\subprocess.py:526, in run(input, capture_output, timeout, check, *popenargs, **kwargs)
    524     retcode = process.poll()
    525     if check and retcode:
--> 526         raise CalledProcessError(retcode, process.args,
    527                                  output=stdout, stderr=stderr)
    528 return CompletedProcess(process.args, retcode, stdout, stderr)

CalledProcessError: Command '['where', 'cl']' returned non-zero exit status 1.

Making this library more like Hugging Face

I have expressed my interest in having RWKV officially implemented in Hugging Face in huggingface/transformers#17230.

Meanwhile, I have a distilled set of suggestions for how this library could be made more familiar to people who are already used to transformers and AutoModelForCausalLM.

Maybe some of these are already possible in the current version of rwkv. If so, I would be grateful if you could let me know how.

1. Being able to load the tokenizer explicitly

with something like

tokenizer = RWKVTokenizer.from_pretrained("/path/to/20B_tokenizer.json")

and then use it with

prompt = "Hello, my name is "
input_ids = tokenizer.encode(prompt)

Having the ability to count the number of tokens in a given prompt is very useful.

2. Generating text with input_ids as input rather than a string

Something like

output_ids = model.generate(input_ids, temperature=0.8, top_p=0.95)
output_text = tokenizer.decode(output_ids)

3. Generation parameters

Many parameters are available for model.generate() in HF, but it seems to me that the absolutely essential ones that everyone uses are:

  1. temperature ✅
  2. top_p ✅
  3. top_k
  4. repetition_penalty

I am aware that alpha_frequency and alpha_presence are implemented, but these parameters are not usually found in presets that people have already come up with while working with other models. For this reason, having repetition_penalty would be valuable.
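Until that exists, a hedged sketch of approximating an HF-style repetition_penalty on top of the current package (the formula is the common transformers one; it is not something the rwkv package provides, and wiring it in front of pipeline.sample_logits is my own assumption):

import torch

def penalize_repetition(logits: torch.Tensor, generated_ids, penalty: float = 1.2) -> torch.Tensor:
    # divide positive logits / multiply negative logits of tokens already generated,
    # which is how transformers' repetition_penalty behaves
    for tid in set(generated_ids):
        logits[tid] = logits[tid] / penalty if logits[tid] > 0 else logits[tid] * penalty
    return logits

In a generation loop this would be applied to the raw logits right before each pipeline.sample_logits call.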

What context length was the model trained on?

I try to create very long chat responses but it seems limited to around 1k tokens... is this the limit, or can it go above that?
If so, where in ChatRWKV would I have to change the token input length?

RuntimeError: default_program(57): error: identifier "aten_add_flat__1" is undefined

I downloaded RWKV-4-Pile-3B-20221110-ctx4096.bin and used it to test (python chat.py), but there are errors:
Run prompt...
Traceback (most recent call last):
File "chat.py", line 167, in
out = run_rnn(pipeline.encode(init_prompt))
File "chat.py", line 136, in run_rnn
out, model_state = model.forward(tokens[:CHUNK_LEN], model_state)
File "/data/home/clarkjiang/ChatRWKV-main/v2/../rwkv_pip_package/src/rwkv/model.py", line 616, in forward
omx, orx, omy, ory,
RuntimeError: default_program(57): error: identifier "aten_add_flat__1" is undefined

default_program(58): error: no operator "=" matches these operands
operand types are: half = float

2 errors detected in the compilation of "default_program".

nvrtc compilation failed:

#define NAN __int_as_float(0x7fffffff)
#define POS_INFINITY __int_as_float(0x7f800000)
#define NEG_INFINITY __int_as_float(0xff800000)

template <typename T>
__device__ T maximum(T a, T b) {
  return isnan(a) ? a : (a > b ? a : b);
}

template <typename T>
__device__ T minimum(T a, T b) {
  return isnan(a) ? a : (a < b ? a : b);
}

#define __HALF_TO_US(var) *(reinterpret_cast<unsigned short *>(&(var)))
#define __HALF_TO_CUS(var) *(reinterpret_cast<const unsigned short *>(&(var)))
#if defined(__cplusplus)
struct __align__(2) __half {
  __host__ __device__ __half() { }

protected:
  unsigned short __x;
};

/* All intrinsic functions are only available to nvcc compilers */
#if defined(__CUDACC__)
/* Definitions of intrinsics */
__device__ __half __float2half(const float f) {
  __half val;
  asm("{  cvt.rn.f16.f32 %0, %1;}\n" : "=h"(__HALF_TO_US(val)) : "f"(f));
  return val;
}

__device__ float __half2float(const __half h) {
  float val;
  asm("{  cvt.f32.f16 %0, %1;}\n" : "=f"(val) : "h"(__HALF_TO_CUS(h)));
  return val;
}

#endif /* defined(__CUDACC__) */
#endif /* defined(__cplusplus) */
#undef __HALF_TO_US
#undef __HALF_TO_CUS

typedef __half half;

extern "C" __global__
void func_1(half* t0, half* t1, half* t2, half* t3, half* t4, half* t5, half* aten_add_flat, half* aten_add_flat_1, half* aten_add_flat_2, half* aten_cat_flat) {
{
  aten_cat_flat[512 * blockIdx.x + threadIdx.x] = __float2half((((512 * blockIdx.x + threadIdx.x) / 2560<1 ? 1 : 0) ? __half2float(t5[(512 * blockIdx.x + threadIdx.x) % 2560]) : __half2float(t4[(512 * blockIdx.x + threadIdx.x) - 2560])));
  float t1_ = __half2float(t1[512 * blockIdx.x + threadIdx.x]);
  float aten_add_flat_ = __half2float(aten_add_flat[512 * blockIdx.x + threadIdx.x]);
  float t0_ = __half2float(t0[(512 * blockIdx.x + threadIdx.x) % 2560]);
  aten_add_flat__1 = __float2half(t1_ * t0_ + ((((512 * blockIdx.x + threadIdx.x) / 2560<1 ? 1 : 0) ? __half2float(t5[(512 * blockIdx.x + threadIdx.x) % 2560]) : __half2float(t4[(512 * blockIdx.x + threadIdx.x) - 2560]))) * ((0.f - t0_) + 1.f));
  aten_add_flat[512 * blockIdx.x + threadIdx.x] = aten_add_flat_;
  float t2_ = __half2float(t2[(512 * blockIdx.x + threadIdx.x) % 2560]);
  aten_add_flat_1[512 * blockIdx.x + threadIdx.x] = __float2half(t1_ * t2_ + ((((512 * blockIdx.x + threadIdx.x) / 2560<1 ? 1 : 0) ? __half2float(t5[(512 * blockIdx.x + threadIdx.x) % 2560]) : __half2float(t4[(512 * blockIdx.x + threadIdx.x) - 2560]))) * ((0.f - t2_) + 1.f));
  float t3_ = __half2float(t3[(512 * blockIdx.x + threadIdx.x) % 2560]);
  aten_add_flat_2[512 * blockIdx.x + threadIdx.x] = __float2half(t1_ * t3_ + ((((512 * blockIdx.x + threadIdx.x) / 2560<1 ? 1 : 0) ? __half2float(t5[(512 * blockIdx.x + threadIdx.x) % 2560]) : __half2float(t4[(512 * blockIdx.x + threadIdx.x) - 2560]))) * ((0.f - t3_) + 1.f));
}
}

what is wrong?

lowering temperature in chat.py leads to error.

In v2, chat.py, when trying to lower the temperature under its 1.0 default, for example GEN_TEMP = 0.9
I'm faced with

File "chat.py", line 393, in
on_message(msg)
File "chat.py", line 300, in on_message
token = pipeline.sample_logits(
File "(...)\utils.py", line 51, in sample_logits
probs = probs.pow(1.0 / temperature)
AttributeError: 'numpy.ndarray' object has no attribute 'pow'

Just so you know.

How to train a ChatRWKV model on domain-specific data

Hi! Thank you for your wonderful work!
I want to train a QA bot on my own data, and I guess I should train an RWKV-LM v4 model and load that model in chat.py (by changing the model path).
Am I doing this right? Do you have any suggestions?

Thank you very much

[A disappointing bad case on the online demo]

First, I want to express my respect for and amazement at your work.
I think adjusting the Attention Free Transformer formulation so that it can be computed with dynamic programming to optimize the time complexity is very interesting and practical.
But when I wanted to share your work with more people, I went to try the online demo. In one case I used the prompt and parameters you provided directly, yet the result was very poor.
image
This case makes me feel that either [the messiness of the training corpus] or [the training process/method itself] leads to poor judgment of when to stop generating.
I hope you keep getting better!

Training

How do I train with my own Chinese data? What does the data need to look like? What kind of machine configuration is required?

update of rwkv error

When you updated the rwkv package to 0.7.3 I started to get the following error, which I am not able to fix, when running the v2 chat.py:

^C
CondaError: KeyboardInterrupt

(pytorch_p39) ubuntu@ip-172-31-26-101:~/chat/v2$ conda env torch
usage: conda-env [-h] {create,export,list,remove,update,config} ...
conda-env: error: argument {create,export,list,remove,update,config}: invalid choice: 'torch' (choose from 'create', 'export', 'list', 'remove', 'update', 'config')
(pytorch_p39) ubuntu@ip-172-31-26-101:~/chat/v2$ conda env lsit
usage: conda-env [-h] {create,export,list,remove,update,config} ...
conda-env: error: argument {create,export,list,remove,update,config}: invalid choice: 'lsit' (choose from 'create', 'export', 'list', 'remove', 'update', 'config')
(pytorch_p39) ubuntu@ip-172-31-26-101:~/chat/v2$ conda activate aws_neuron_pytorch_p37
(aws_neuron_pytorch_p37) ubuntu@ip-172-31-26-101:~/chat/v2$ python chat.py

ChatRWKV v2 https://github.com/BlinkDL/ChatRWKV

English - cuda fp16i8 -> cpu fp32 *10 - /home/ubuntu/chat/v2/prompt/default/English-2.py
Loading model - trained-500-141-1024-RWKV-6-512-2023-04-08-13-57-32
RWKV_JIT_ON 1 RWKV_CUDA_ON 0 RESCALE_LAYER 6

Loading trained-500-141-1024-RWKV-6-512-2023-04-08-13-57-32.pth ...
Strategy: (total 6+1=7 layers)

  • cuda [float16, uint8], store 0 layers
  • cpu [float32, float32], store 7 layers
    0-cpu-float32-float32 1-cpu-float32-float32 2-cpu-float32-float32 3-cpu-float32-float32 4-cpu-float32-float32 5-cpu-float32-float32 6-cpu-float32-float32
    Traceback (most recent call last):
    File "/home/ubuntu/chat/v2/../rwkv_pip_package/src/rwkv/model.py", line 196, in init
    w['emb.weight'] = F.layer_norm(w['emb.weight'], (args.n_embd,), weight=w['blocks.0.ln0.weight'], bias=w['blocks.0.ln0.bias'])
    KeyError: 'blocks.0.ln0.weight'

failed to compile wkv_cuda

compile wkv_cuda failed when running "python chat.py":
nvcc fatal : Value 'c++17' is not defined for option 'std'

My torch is 1.13.1 with CUDA 11.7.

Any idea why this happens?

"LayerNormKernelImpl" not implemented for 'BFloat16'

Version:
image

Loading model :
RWKV-4-Pile-7B-Instruct-test1-20230124

Error:
Traceback (most recent call last):
File "chat.py", line 175, in <module>
model = RWKV_RNN(args)
File "/opt/conda/lib/python3.8/site-packages/torch/jit/_script.py", line 272, in init_then_script
original_init(self, *args, **kwargs)
File "/root/work/ChatRWKV/src/model_run.py", line 103, in __init__
x = self.LN(self.w.emb.weight, self.w.blocks[0].ln0)
File "/root/work/ChatRWKV/src/model_run.py", line 111, in LN
return F.layer_norm(x, (self.args.n_embd,), weight=w.weight, bias=w.bias)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/functional.py", line 2346, in layer_norm
return torch.layer_norm(input, normalized_shape, weight, bias, eps, torch.backends.cudnn.enabled)
RuntimeError: "LayerNormKernelImpl" not implemented for 'BFloat16'

M1 Max MPS F32 / F16 Issues

Hello @BlinkDL,

As per your recommendation, I was able to run this on MPS at half precision. It gets stuck on MPS at full precision.

On 64 GB M1 Max CPU, 14B model gives pretty good results, but it is pretty slow. When I made a few changes to get it working on MPS, it's very fast. But the results are the worst. For example, on MPS F16, it generates,

User: +gen Here is a short story in which Jeff Bezos, Elon Musk, and Bill Gates fight in a tournament:
ChDefPSt
QTheThisSCheckReviewQBackgroundInformationThisPaulQSp1QThe1------[How{The#QTheHSamAfterQMfileCharacterGQDemEQAfterThe//QWeQ"ThisIntroductionCorListQQQ(EAtBackgroundEnAn[BrWithDirectGWomenNQTheOh1Last#OnQGlQQWilliamWhy9                        QTheQAfterMelCheckQTQ/*BlYouFieldAQThe/*PrThe 

On the other hand, with same code and on CPU F32, it generates,

User: +gen Here is a short story in which Jeff Bezos, Elon Musk, and Bill Gates fight in a tournament:


There are four kings in the chess world, and every four years a World Championship takes place. In 2018, Bezos defeated Musk in the semi-finals, while Gates took down both of them. So now it’s Bezos and Musk who are playing each other.

What am I missing ?

The most significant change was in sample_logits function, where I used the same probability algorithm for both MPS and CPU. Rest of the changes included only changing the device from CUDA to MPS.

running RWKV-4-Pile-7B-20230313-ctx8192-test380.pth with strategy "cuda fp16i8" failed

I converted RWKV-4-Pile-7B-20230313-ctx8192-test380.pth with strategy "cuda fp16i8".
Then I ran python chat.py with this converted model and strategy "cuda fp16i8", and got the following error.
My graphics card has 12 GB of VRAM.

Run prompt...
Traceback (most recent call last):
File "/home/fc/2TB/GITS/ChatRWKV/v2/./chat.py", line 185, in
out = run_rnn(pipeline.encode(init_prompt))
File "/home/fc/2TB/GITS/ChatRWKV/v2/./chat.py", line 156, in run_rnn
out, model_state = model.forward(tokens[:CHUNK_LEN], model_state)
File "/home/fc/anaconda3/envs/rwkv/lib/python3.10/site-packages/rwkv/model.py", line 607, in forward
x, state[i*5+0], state[i*5+1], state[i*5+2], state[i*5+3] = ATT(
File "/home/fc/anaconda3/envs/rwkv/lib/python3.10/site-packages/rwkv/model.py", line 531, in cuda_att_seq_i8
r = torch.sigmoid(self.mm8_seq(rx, rw, rmx, rrx, rmy, rry))
File "/home/fc/anaconda3/envs/rwkv/lib/python3.10/site-packages/rwkv/model.py", line 324, in mm8_seq
return cuda_mm8_seq(B, N, M, x, w, mx, rx, my, ry)
File "/home/fc/anaconda3/envs/rwkv/lib/python3.10/site-packages/rwkv/model.py", line 51, in cuda_mm8_seq
assert x.shape == [B, N]
AssertionError

Any idea what I did wrong?

setup of ChatRWKV

Hey guys, great stuff. Can we have a very easy step-by-step setup process to install ChatRWKV on an Ubuntu server, for example?

Speeding up loading/decreasing memory use with mmap

Please feel free to just close this if it's a dumb idea. I'm just a normal developer and basically don't know anything about ML, PyTorch, etc.

I had this idea that it would be possible to load models much faster and with much less memory usage by mmaping the data. .pth files are actually ZIP files, but since they don't have compression turned on the actual data is contiguous in the file.

I actually got a working proof of concept going:

import io, mmap, pickle, zipfile, struct
import torch

class MmapEntriesUnpickler(pickle.Unpickler):
  def __init__(self, file, rawentries):
    self.rawentries = rawentries
    self.storage = {}
    super().__init__(file)

  def persistent_load(self, pid):
    entryname = pid[2]
    result = self.storage.get(entryname)
    if result is not None:
      return result
    dtname = pid[1].__name__
    if pid[0] != 'storage' or (not dtname.endswith('Storage')):
      raise ValueError(f'Unexpected persistent storage PID {pid}')
    dtype = getattr(torch, dtname[:-7].lower(), None)
    if dtype is None:
      raise ValueError(f'Unable to handle persistent storage type in PID: {pid}')
    result = torch.frombuffer(self.rawentries[entryname], dtype = dtype).storage()
    self.storage[entryname] = result
    return result


def load_mmapped(filename):
  entries = {}
  with open(filename, 'rb') as fp, zipfile.ZipFile(filename, 'r') as zfp:
    mv = memoryview(mmap.mmap(fp.fileno(), 0, flags=mmap.MAP_PRIVATE + mmap.MAP_DENYWRITE))
    for zi in zfp.infolist():
      if zi.compress_type != zipfile.ZIP_STORED:
        raise ValueError(f'Cannot support non-STORE file [{zi.filename}] in archive {filename}')
      offs = zi.header_offset + len(zi.FileHeader())
      weirdextra = struct.unpack('H', mv[offs + 2:offs + 4])[0] # No idea why this is necessary.
      offs += 4 + weirdextra
      data = mv[offs:offs + zi.file_size]
      # print(zi.filename, offs, weirdextra)
      entryname = zi.filename.rsplit('/', 1)[1]
      entries[entryname] = data
  return MmapEntriesUnpickler(io.BytesIO(entries['data.pkl']), entries).load()

The load_mmapped function can just be used instead of torch.load and it actually loads a bit faster with large models, however the rest of it is still pretty slow and memory intensive. It seems like this is because the models are saved as bfloat16 but RWKV always converts from that format so the process always ends up needing to allocate memory.

Maybe this approach is still worth it just because it speeds up the torch.load step (basically instant when mmapped but takes around 10-15sec to load the 7B model from an SSD the normal way).

I think the only way it could really make a big difference is if it was possible to store the model in a way that could be used more directly without the conversion steps. (There still could be other issues like data alignment, but at the least it might be possible to load/stream data to the GPU without it ever actually having to be loaded to CPU first.)

I guess the question is: Is this even worth continuing to look at? Getting the data into the correct format to be used directly, even just for loading to the GPU is beyond my ability right now.
