
basic-ui-for-gpt-j-6b-with-low-vram's Introduction

Basic-UI-for-GPT-J-6B-with-low-vram

A repository to run GPT-J-6B on low-VRAM systems by using RAM, VRAM, and pinned memory together.

There appear to be some issues with the weights in the Drive link: there is some performance loss, most likely because of a poor 16-bit conversion.

How to run:

1. Install the required transformers fork: pip install git+https://github.com/finetuneanon/transformers@gpt-neo-localattention3
2. Download the model from https://drive.google.com/file/d/1tboTvohQifN6f1JiSV8hnciyNKvj9pvm/view?usp=sharing. It has been saved as described in https://github.com/arrmansa/saving-and-loading-large-models-pytorch. A minimal loading sketch follows this list.
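A minimal loading sketch, assuming the Drive checkpoint was pickled whole with torch.save() as described in the saving-and-loading repository above; the filename is hypothetical, and the ram_blocks / max_shared_ram_blocks knobs (referenced in the timing notes below) are only defined here for illustration, since the actual block-swapping logic lives in the repository notebook:

import torch
from transformers import GPT2Tokenizer  # GPT-J reuses the GPT-2 BPE vocabulary

# How many of GPT-J's 28 transformer blocks to keep off the GPU, and how many
# of those to place in pinned (page-locked) memory for faster host-to-GPU copies.
ram_blocks = 23
max_shared_ram_blocks = 18

model = torch.load("gptj-6b-fp16.pt")  # hypothetical filename for the Drive download
model.eval()

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")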

Timing (2000-token context)

System 1

16 GB DDR4 RAM, 1070 8 GB GPU.
23 blocks in RAM (ram_blocks = 23), of which 18 are in shared/pinned memory (max_shared_ram_blocks = 18).

Timing

A single run of model(inputs) takes 6.5 seconds.
35 seconds to generate 25 tokens at 2000 context (1.4 seconds/token).

System 2

16 GB DDR4 RAM, 1060 6 GB GPU.
26 blocks in RAM (ram_blocks = 26), of which 18 are in shared/pinned memory (max_shared_ram_blocks = 18).

Timing

40 seconds to generate 25 tokens at 2000 context (1.6 seconds/token).
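A rough sketch of how the seconds-per-token figures above can be measured, assuming model and tokenizer are loaded as sketched earlier; whether input_ids has to be moved to the GPU depends on how the notebook splits the model across devices:

import time
import torch

context = "..."  # roughly 2000 tokens of prompt text in the measurements above
input_ids = tokenizer(context, return_tensors="pt").input_ids  # .to("cuda") if needed

start = time.time()
with torch.no_grad():
    output = model.generate(input_ids, do_sample=True, max_new_tokens=25)
elapsed = time.time() - start
print(f"{elapsed:.1f} s for 25 tokens, {elapsed / 25:.2f} s/token")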

basic-ui-for-gpt-j-6b-with-low-vram's People

Contributors

arrmansa


basic-ui-for-gpt-j-6b-with-low-vram's Issues

Expected all tensors to be on same device

I got your code running up to the first test block using a copy of GPT-J-6B I had downloaded (the link in the readme didn't load). I had to remove the check for rotary and always use the else branch, but otherwise it worked.

#if self.rotary:
#    hidden_states = inputs_embeds
#else:
#    position_embeds = self.wpe(position_ids)
#    hidden_states = inputs_embeds + position_embeds
position_embeds = self.wpe(position_ids)
hidden_states = inputs_embeds + position_embeds

However in the first test, it failed with this error message:

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking arugment for argument index in method wrapper_index_select)

That error message makes it sound as if this would never work, so I'm not sure what I've done wrong (different package versions, perhaps?).
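One thing worth ruling out before comparing package versions: the index_select in that traceback is usually the embedding lookup, which fails when the input ids are still on the CPU while the embedding weights sit on cuda:0. A hedged sketch of the check, assuming the standard GPT-Neo attribute layout (model.transformer.wte), which may differ in this notebook:

# Move the prompt tensors onto whichever device holds the token embeddings.
embed_device = model.transformer.wte.weight.device
input_ids = input_ids.to(embed_device)
output = model.generate(input_ids, do_sample=True, max_new_tokens=25)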

All packages installed:

Package             Version
------------------- ------------
argon2-cffi         20.1.0
async-generator     1.10
attrs               21.2.0
backcall            0.2.0
bleach              3.3.1
certifi             2021.5.30
cffi                1.14.6
charset-normalizer  2.0.3
click               8.0.1
colorama            0.4.4
debugpy             1.3.0
decorator           5.0.9
defusedxml          0.7.1
einops              0.3.0
entrypoints         0.3
filelock            3.0.12
huggingface-hub     0.0.8
idna                3.2
importlib-metadata  3.10.1
install             1.3.4
ipykernel           6.0.2
ipython             7.25.0
ipython-genutils    0.2.0
ipywidgets          7.6.3
jedi                0.18.0
Jinja2              3.0.1
joblib              1.0.1
jsonschema          3.2.0
jupyter-client      6.1.12
jupyter-core        4.7.1
jupyterlab-pygments 0.1.2
jupyterlab-widgets  1.0.0
MarkupSafe          2.0.1
matplotlib-inline   0.1.2
mistune             0.8.4
nbclient            0.5.3
nbconvert           6.1.0
nbformat            5.1.3
nest-asyncio        1.5.1
notebook            6.4.0
numpy               1.21.0
packaging           21.0
pandocfilters       1.4.3
parso               0.8.2
pickleshare         0.7.5
Pillow              8.3.1
pip                 21.1.3
prometheus-client   0.11.0
prompt-toolkit      3.0.19
pycparser           2.20
Pygments            2.9.0
pyparsing           2.4.7
pyrsistent          0.18.0
python-dateutil     2.8.2
pywin32             301
pywinpty            1.1.3
pyzmq               22.1.0
regex               2021.7.6
requests            2.26.0
sacremoses          0.0.45
Send2Trash          1.7.1
setuptools          47.1.0
six                 1.16.0
terminado           0.10.1
testpath            0.5.0
tokenizers          0.10.3
torch               1.9.0+cu102
torchaudio          0.9.0
torchvision         0.10.0+cu102
tornado             6.1
tqdm                4.61.2
traitlets           5.0.5
transformers        4.6.0.dev0
typing-extensions   3.10.0.0
urllib3             1.26.6
wcwidth             0.2.5
webencodings        0.5.1
widgetsnbextension  3.5.1
zipp                3.5.0

The results are much worse than with original GPT-J-6B

Even though the memory savings are great, I hoped the quality would be the same, but it is not. For example, on https://6b.eleuther.ai/ I try the following prompt (highlighted in bold) and get a decent result:

In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English. As first reported in Andean News, “The unicorns live in the rural valley and are one of many animal species native to the mountains and can be found to this day, but they are a rare occurrence.” One of the scientists at the scene said, “They were so different from anything else that we had seen in our lifetime, so it was a surprise.”

The scientists were able to capture several of the unicorns and identified them as the first specimens ever found, and one unicorn was even carrying a pink umbrella. Additionally, it was found that a human female had been kidnapped by one of the unicorns and that the herd had a protector, a man who travels with them. One of the scientists said, “He had just given us the run of the valley because he didn’t want us to disturb the unicorns. We all know now that’s not going to happen.” It is hoped that the kidnapping is something of a sign that the humans and the unicorns can coexist, and there have been some initial concerns that the unicorns are not quite so friendly as they first seemed, for they refused to let anyone near the big udder.

But with this repository, the results are consistently bad (in both cases top-p = 0.9 and temperature = 1; I also tried the repository's default parameters, and it generates nonsense too). I generated 30 tokens at a time:

In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English. So we found that these authors do not exist, that's why and how many years

"Yes, please?"

So much so fastidious house dust is created, you know what the author of "The Complete Book of Envato the Dream Chaser
Wartime-Tropomorrah@candy_bunny-rraaaayyyyyy.... What the hell were those people who think
the world around them have been taken aback when you told a lie. We are not supposed to speak the truth... but the facts do exist. The main problem with the human race is that they forget where they got their truth from. They say the whole universe is one big Lie the main source of all this universe is our perception; i.e., We are only creatures on our senses do not know what the hell they are. They think they are not supposed to know. It has no awareness of where it came from.
What makes a rose flower look like an orchid, if it had some life

The original GPT-J-6B does not lose the context, and the overall quality of each sentence is much higher. But GPT-J-6B from this repository, even in the cases where it does not lose the context right away, just generates nonsense, sometimes of even worse quality than what is shown above.

Am I doing something wrong, or is the severe reduction in quality a consequence of the RAM/VRAM memory savings? If the latter is the case, I suggest putting a warning about this in the README.

I used an RTX 2060 SUPER 8 GB (with no displays connected, so all of its memory is free), my CPU is a 5950X (16 cores), and I have 128 GB of RAM. The biggest limit in my case is VRAM; I could probably run the original GPT-J-6B on CPU only, but I hoped to use my GPU, so I tried this repository first.
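For reference, the sampling settings described above (top-p 0.9, temperature 1, 30 tokens at a time) expressed as a generate() call, assuming the model and tokenizer are loaded as in the repository notebook; the prompt is abbreviated here:

prompt = "In a shocking finding, scientist discovered a herd of unicorns ..."  # abbreviated
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

output = model.generate(
    input_ids,
    do_sample=True,
    top_p=0.9,
    temperature=1.0,
    max_new_tokens=30,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))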

RuntimeError: where expected condition to be a boolean tensor, but got a tensor with dtype Float

I was successful in getting your code to work on my 2060 laptop after a few tweaks. I just got a Tesla M40 card and am looking at running GPT-J-6B on it using this method. To start, though, I thought I'd use the same code with the GPT-Neo-2.7B model to verify that it works. I got the error in the title when I tried to run it.

Any ideas as to what's going on?

Full error log:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<timed exec> in <module>

~\AppData\Roaming\Python\Python37\site-packages\torch\autograd\grad_mode.py in decorate_context(*args, **kwargs)
     26         def decorate_context(*args, **kwargs):
     27             with self.__class__():
---> 28                 return func(*args, **kwargs)
     29         return cast(F, decorate_context)
     30 

~\AppData\Roaming\Python\Python37\site-packages\transformers\generation_utils.py in generate(self, input_ids, max_length, min_length, do_sample, early_stopping, num_beams, temperature, top_k, top_p, repetition_penalty, bad_words_ids, bos_token_id, pad_token_id, eos_token_id, length_penalty, no_repeat_ngram_size, encoder_no_repeat_ngram_size, num_return_sequences, max_time, max_new_tokens, decoder_start_token_id, use_cache, num_beam_groups, diversity_penalty, prefix_allowed_tokens_fn, output_attentions, output_hidden_states, output_scores, return_dict_in_generate, forced_bos_token_id, forced_eos_token_id, remove_invalid_values, synced_gpus, **model_kwargs)
   1024                 return_dict_in_generate=return_dict_in_generate,
   1025                 synced_gpus=synced_gpus,
-> 1026                 **model_kwargs,
   1027             )
   1028 

~\AppData\Roaming\Python\Python37\site-packages\transformers\generation_utils.py in sample(self, input_ids, logits_processor, stopping_criteria, logits_warper, max_length, pad_token_id, eos_token_id, output_attentions, output_hidden_states, output_scores, return_dict_in_generate, synced_gpus, **model_kwargs)
   1533                 return_dict=True,
   1534                 output_attentions=output_attentions,
-> 1535                 output_hidden_states=output_hidden_states,
   1536             )
   1537 

~\AppData\Roaming\Python\Python37\site-packages\torch\nn\modules\module.py in _call_impl(self, *input, **kwargs)
   1049         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1050                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1051             return forward_call(*input, **kwargs)
   1052         # Do not call functions when jit is used
   1053         full_backward_hooks, non_full_backward_hooks = [], []

~\AppData\Roaming\Python\Python37\site-packages\transformers\models\gpt_neo\modeling_gpt_neo.py in forward(self, input_ids, past_key_values, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds, labels, use_cache, output_attentions, output_hidden_states, return_dict)
    983             output_attentions=output_attentions,
    984             output_hidden_states=output_hidden_states,
--> 985             return_dict=return_dict,
    986         )
    987         hidden_states = transformer_outputs[0]

~\AppData\Roaming\Python\Python37\site-packages\torch\nn\modules\module.py in _call_impl(self, *input, **kwargs)
   1049         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1050                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1051             return forward_call(*input, **kwargs)
   1052         # Do not call functions when jit is used
   1053         full_backward_hooks, non_full_backward_hooks = [], []

~\AppData\Local\Temp/ipykernel_8288/2499053029.py in new_forward(self, input_ids, past_key_values, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds, use_cache, output_attentions, output_hidden_states, return_dict)
    219                     head_mask=head_mask[i],
    220                     use_cache=use_cache,
--> 221                     output_attentions=output_attentions,
    222                 )
    223 

~\AppData\Roaming\Python\Python37\site-packages\torch\nn\modules\module.py in _call_impl(self, *input, **kwargs)
   1049         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1050                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1051             return forward_call(*input, **kwargs)
   1052         # Do not call functions when jit is used
   1053         full_backward_hooks, non_full_backward_hooks = [], []

~\AppData\Roaming\Python\Python37\site-packages\transformers\models\gpt_neo\modeling_gpt_neo.py in forward(self, hidden_states, layer_past, attention_mask, head_mask, use_cache, output_attentions)
    559             head_mask=head_mask,
    560             use_cache=use_cache,
--> 561             output_attentions=output_attentions,
    562         )
    563         attn_output = attn_outputs[0]  # output_attn: a, present, (attentions)

~\AppData\Roaming\Python\Python37\site-packages\torch\nn\modules\module.py in _call_impl(self, *input, **kwargs)
   1049         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1050                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1051             return forward_call(*input, **kwargs)
   1052         # Do not call functions when jit is used
   1053         full_backward_hooks, non_full_backward_hooks = [], []

~\AppData\Roaming\Python\Python37\site-packages\transformers\models\gpt_neo\modeling_gpt_neo.py in forward(self, hidden_states, layer_past, attention_mask, head_mask, use_cache, output_attentions)
    501             head_mask=head_mask,
    502             use_cache=use_cache,
--> 503             output_attentions=output_attentions,
    504         )
    505 

~\AppData\Roaming\Python\Python37\site-packages\torch\nn\modules\module.py in _call_impl(self, *input, **kwargs)
   1049         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1050                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1051             return forward_call(*input, **kwargs)
   1052         # Do not call functions when jit is used
   1053         full_backward_hooks, non_full_backward_hooks = [], []

~\AppData\Roaming\Python\Python37\site-packages\transformers\models\gpt_neo\modeling_gpt_neo.py in forward(self, hidden_states, attention_mask, layer_past, head_mask, use_cache, output_attentions)
    453             masked_bias=self.masked_bias,
    454             attn_dropout=self.attn_dropout,
--> 455             head_mask=head_mask,
    456         )
    457 

~\AppData\Roaming\Python\Python37\site-packages\transformers\models\gpt_neo\modeling_gpt_neo.py in _attn(self, query, key, value, causal_mask, masked_bias, attn_dropout, attention_mask, head_mask)
    276 
    277         attn_weights = torch.matmul(query, key.transpose(-1, -2))
--> 278         attn_weights = torch.where(causal_mask, attn_weights, masked_bias.to(attn_weights.dtype))
    279 
    280         if attention_mask is not None:

RuntimeError: where expected condition to be a boolean tensor, but got a tensor with dtype Float
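A speculative workaround rather than a confirmed fix: the traceback shows torch.where() receiving a float causal mask inside GPTNeoSelfAttention._attn, which can happen if the model's boolean "bias" mask buffers ended up with a float dtype during checkpoint conversion. Casting those buffers back to bool after loading avoids the dtype error; only registered buffers named "bias" are touched, not the float bias parameters of Linear layers:

import torch

for module in model.modules():
    # The causal mask in GPT-Neo attention is a registered *buffer* named "bias";
    # Linear-layer biases are parameters, not buffers, so they are left alone.
    buf = module._buffers.get("bias")
    if buf is not None and buf.dtype != torch.bool:
        module._buffers["bias"] = buf.bool()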
