Comments (13)
I know what's happening. Will fix soon.
You need to use --tokenizer-revision as well. For example:
python -m aphrodite.endpoints.openai.api_server --model turboderp/Llama-3-8B-exl2 --revision 6.0bpw --tokenizer-revision 6.0bpw
Otherwise you'll get a config.json-missing error, because the code tries to find it on the main branch and fails.
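For context, the failure and the fix are reproducible with transformers alone. A minimal sketch (model and branch names are the ones from the command above; this is illustrative, not aphrodite's actual code):

from transformers import AutoTokenizer

# Without an explicit revision, the HF Hub resolves "main", which has no
# config.json on this repo, so the load 404s:
# tokenizer = AutoTokenizer.from_pretrained("turboderp/Llama-3-8B-exl2")

# Passing the bpw branch as the revision resolves config.json and the
# tokenizer files on that branch instead.
tokenizer = AutoTokenizer.from_pretrained(
    "turboderp/Llama-3-8B-exl2", revision="6.0bpw")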
The patch I posted was enough for me to pull things from HuggingFace and use revisions.
Have you tested this on the dev branch after the PR was merged?
Well, in that case the PR wasn't enough. If @AlpinDale can confirm this, I'm happy to make another PR that adds tokenizer_revision=self.model_config.tokenizer_revision as well.
Oh, my bad, I was using 0.5.1 from PyPI, but it works after adding --tokenizer-revision. I'll test the dev branch.
(Wondering if there's any workaround in the meantime with the official RunPod image? 👉👈 I tried the REVISION env var too.)
Any update?
Still an issue... a workaround would be really nice, since this makes it pretty difficult to use HF models that place different degrees of quantization on different branches.
NOTE: It may just be an issue in the Google Colab, since I see it was recently reported as fixed in #246 -- or maybe it just hasn't made it into a release yet. I'm not sure how your release schedule works here, so my apologies if this will be fixed soon.
It may just be an issue in the Google Colab
It's not just Google Colab; I tested on RunPod using the official image. I haven't tested with the latest release, though.
There are bugs in the code. You can use this patch against v0.5.2. I haven't re-applied it to the current HEAD:
diff --git a/aphrodite/endpoints/openai/api_server.py b/aphrodite/endpoints/openai/api_server.py
index 3b3b6ed..02da554 100644
--- a/aphrodite/endpoints/openai/api_server.py
+++ b/aphrodite/endpoints/openai/api_server.py
@@ -565,6 +565,7 @@ if __name__ == "__main__":
         engine_args.tokenizer,
         tokenizer_mode=engine_args.tokenizer_mode,
         trust_remote_code=engine_args.trust_remote_code,
+        revision=engine_args.revision,
     )
 
     chat_template = args.chat_template
diff --git a/aphrodite/endpoints/openai/serving_engine.py b/aphrodite/endpoints/openai/serving_engine.py
index c98b332..b8d4e07 100644
--- a/aphrodite/endpoints/openai/serving_engine.py
+++ b/aphrodite/endpoints/openai/serving_engine.py
@@ -63,7 +63,8 @@ class OpenAIServing:
         self.tokenizer = get_tokenizer(
             engine_model_config.tokenizer,
             tokenizer_mode=engine_model_config.tokenizer_mode,
-            trust_remote_code=engine_model_config.trust_remote_code)
+            trust_remote_code=engine_model_config.trust_remote_code,
+            revision=engine_model_config.revision,)
 
     async def show_available_models(self) -> ModelList:
         """Show available models. Right now we only have one model."""
diff --git a/aphrodite/engine/aphrodite_engine.py b/aphrodite/engine/aphrodite_engine.py
index b811bfe..11baf74 100644
--- a/aphrodite/engine/aphrodite_engine.py
+++ b/aphrodite/engine/aphrodite_engine.py
@@ -163,7 +163,7 @@ class AphroditeEngine:
             max_input_length=None,
             tokenizer_mode=self.model_config.tokenizer_mode,
             trust_remote_code=self.model_config.trust_remote_code,
-            revision=self.model_config.tokenizer_revision)
+            revision=self.model_config.revision)
         init_kwargs.update(tokenizer_init_kwargs)
         self.tokenizer: TokenizerGroup = TokenizerGroup(
             self.model_config.tokenizer, **init_kwargs)
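If you want to try the patch against a source checkout, the steps are roughly as follows (a sketch: the patch filename and the editable install are my assumptions, and building aphrodite locally needs a working CUDA toolchain):

git clone https://github.com/PygmalionAI/aphrodite-engine
cd aphrodite-engine
git checkout v0.5.2
git apply tokenizer-revision.patch
pip install -e .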
I have the same problem. Is there any chance this patch could be applied to a release, please? RunPod only pulls the latest release of Aphrodite. Thanks.
Alright, I'm also getting that error (I got it initially as well, but at an earlier stage; that one was fixed by using --tokenizer-revision, now this happens):
WARNING: exl2 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
INFO: Initializing the Aphrodite Engine (v0.5.1) with the following config:
INFO: Model = 'turboderp/Llama-3-8B-exl2'
INFO: DataType = torch.bfloat16
INFO: Model Load Format = auto
INFO: Number of GPUs = 1
INFO: Disable Custom All-Reduce = False
INFO: Quantization Format = exl2
INFO: Context Length = 7040
INFO: Enforce Eager Mode = False
INFO: KV Cache Data Type = auto
INFO: KV Cache Params Path = None
INFO: Device = cuda
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
WARNING: Model is quantized. Forcing float16 datatype.
INFO: Downloading model weights ['*.safetensors']
INFO: Model weights loaded. Memory usage: 6.26 GiB x 1 = 6.26 GiB
INFO: # GPU blocks: 566, # CPU blocks: 2048
INFO: Minimum concurrency: 1.29x
INFO: Maximum sequence length allowed in the cache: 9056
INFO: Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
WARNING: CUDA graphs can take additional 1~3 GiB of memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode.
Capturing graph... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 35/35 0:00:00
INFO: Graph capturing finished in 8 secs.
/usr/local/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/huggingface_hub/utils/_errors.py", line 304, in hf_raise_for_status
    response.raise_for_status()
  File "/usr/local/lib/python3.10/site-packages/requests/models.py", line 1021, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://huggingface.co/turboderp/Llama-3-8B-exl2/resolve/main/config.json

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/transformers/utils/hub.py", line 398, in cached_file
    resolved_file = hf_hub_download(
  File "/usr/local/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 1221, in hf_hub_download
    return _hf_hub_download_to_cache_dir(
  File "/usr/local/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 1282, in _hf_hub_download_to_cache_dir
    (url_to_download, etag, commit_hash, expected_size, head_call_error) = _get_metadata_or_catch_error(
  File "/usr/local/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 1722, in _get_metadata_or_catch_error
    metadata = get_hf_file_metadata(url=url, proxies=proxies, timeout=etag_timeout, headers=headers)
  File "/usr/local/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 1645, in get_hf_file_metadata
    r = _request_wrapper(
  File "/usr/local/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 372, in _request_wrapper
    response = _request_wrapper(
  File "/usr/local/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 396, in _request_wrapper
    hf_raise_for_status(response)
  File "/usr/local/lib/python3.10/site-packages/huggingface_hub/utils/_errors.py", line 315, in hf_raise_for_status
    raise EntryNotFoundError(message, response) from e
huggingface_hub.utils._errors.EntryNotFoundError: 404 Client Error. (Request ID: Root=1-663bb63c-xx)
Entry Not Found for url: https://huggingface.co/turboderp/Llama-3-8B-exl2/resolve/main/config.json.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/local/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.10/site-packages/aphrodite/endpoints/openai/api_server.py", line 564, in <module>
    tokenizer = get_tokenizer(
  File "/usr/local/lib/python3.10/site-packages/aphrodite/transformers_utils/tokenizer.py", line 87, in get_tokenizer
    tokenizer = AutoTokenizer.from_pretrained(
  File "/usr/local/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 819, in from_pretrained
    config = AutoConfig.from_pretrained(
  File "/usr/local/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py", line 928, in from_pretrained
    config_dict, unused_kwargs = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/transformers/configuration_utils.py", line 631, in get_config_dict
    config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/transformers/configuration_utils.py", line 686, in _get_config_dict
    resolved_config_file = cached_file(
  File "/usr/local/lib/python3.10/site-packages/transformers/utils/hub.py", line 452, in cached_file
    raise EnvironmentError(
OSError: turboderp/Llama-3-8B-exl2 does not appear to have a file named config.json. Checkout 'https://huggingface.co/turboderp/Llama-3-8B-exl2/main' for available files.
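That OSError is consistent with the repo layout: config.json lives on the bpw branches, not on main, which is exactly why the loader needs the revision passed through. A quick way to confirm it yourself (an illustrative sketch using huggingface_hub; the 6.0bpw branch name comes from the command earlier in the thread):

from huggingface_hub import hf_hub_download
from huggingface_hub.utils import EntryNotFoundError

for rev in ("main", "6.0bpw"):
    try:
        # Downloads (or finds in cache) config.json from the given branch.
        path = hf_hub_download("turboderp/Llama-3-8B-exl2", "config.json",
                               revision=rev)
        print(rev, "->", path)
    except EntryNotFoundError:
        # Expected for main on this repo, per the 404 in the traceback above.
        print(rev, "-> no config.json on this revision")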
I'm using this Dockerfile (mostly to apply my previous patch). I had to reset to the root user, as I was running into permission issues applying the patch.
FROM alpindale/aphrodite-engine
USER 0:0
COPY tokenizer-revision.patch .
RUN git apply tokenizer-revision.patch
with this .env file:
QUANTIZATION=exl2
MODEL_NAME=turboderp/Llama-3-8B-Instruct-exl2
REVISION="4.0bpw"
NUMBA_CACHE_DIR=/tmp/numba_cache
You need NUMBA_CACHE_DIR due to #323.
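Building and running it looks something like this (a sketch: the image tag is arbitrary, and the port mapping assumes aphrodite's default API port of 2242, so adjust for your setup):

docker build -t aphrodite-patched .
docker run --gpus all --env-file .env -p 2242:2242 aphrodite-patched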