Comments

AlpinDale commented on July 23, 2024

I know what's happening. Will fix soon.

TheHamkerCat commented on July 23, 2024

You need to use --tokenizer-revision as well.

For example:

python -m aphrodite.endpoints.openai.api_server --model turboderp/Llama-3-8B-exl2 --revision 6.0bpw --tokenizer-revision 6.0bpw

Otherwise you'll get a config.json missing error, because the code tries to find that file on the main branch and fails.
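
The underlying behavior is easy to reproduce with huggingface_hub directly. A minimal sketch (assuming, as described above, that this repo keeps config.json only on its per-bitrate branches, not on main):

from huggingface_hub import hf_hub_download
from huggingface_hub.utils import EntryNotFoundError

repo = "turboderp/Llama-3-8B-exl2"

try:
    # No revision given: the default is "main", which has no config.json here.
    hf_hub_download(repo, "config.json")
except EntryNotFoundError:
    print("config.json is not on main")

# Pointing at the quantization branch succeeds.
print(hf_hub_download(repo, "config.json", revision="6.0bpw"))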

wingrunr21 commented on July 23, 2024

The patch I posted was enough for me to be able to pull things from HuggingFace and use revisions.

TheHamkerCat commented on July 23, 2024

> Have you tested this on the dev branch after the PR was merged?
>
> Well in that case the PR wasn't enough. Maybe if @AlpinDale can confirm this, I'm happy to make another PR to add tokenizer_revision=self.model_config.tokenizer_revision as well.

Oh, my bad, I was using 0.5.1 from PyPI, but it works after adding --tokenizer-revision. I'll test the dev branch.
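
A minimal sketch of the semantics such a PR would presumably want (the helper itself is hypothetical; only the field names mirror the config): prefer an explicit tokenizer revision, and fall back to the model revision so exl2 repos work without specifying the branch twice.

def resolve_tokenizer_revision(model_config):
    # Hypothetical helper: prefer --tokenizer-revision, fall back to
    # --revision, then to the Hugging Face default branch.
    return (model_config.tokenizer_revision
            or model_config.revision
            or "main")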

josephrocca commented on July 23, 2024

(Wondering if there's any workaround in the meantime with the official RunPod image? 👉👈 I tried the REVISION env var too.)

localbarrage commented on July 23, 2024

Any update?

atlasveldine commented on July 23, 2024

Still an issue... a workaround would be really nice, since this makes it pretty difficult to use with HF models that place different degrees of quantization on different branches.

NOTE: It may just be an issue in the Google Colab, since I see it was recently reported as fixed in #246 -- or maybe the fix just hasn't made it into a release yet. I'm not sure how your release schedule works here, so my apologies if this will be fixed soon.
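
For anyone debugging this, the branch-per-bitrate layout is easy to inspect with huggingface_hub. A short sketch, with turboderp/Llama-3-8B-exl2 used purely as an example:

from huggingface_hub import list_repo_refs

# exl2 repos typically keep each quantization level on its own branch.
refs = list_repo_refs("turboderp/Llama-3-8B-exl2")
print([branch.name for branch in refs.branches])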

josephrocca commented on July 23, 2024

> It may just be an issue in the Google Colab

It's not just Google Colab, since I tested on RunPod using the official image. I haven't tested with the latest release, though.

wingrunr21 commented on July 23, 2024

There are bugs in the code. You can use this patch against v0.5.2. I haven't re-applied it to the current HEAD:

diff --git a/aphrodite/endpoints/openai/api_server.py b/aphrodite/endpoints/openai/api_server.py
index 3b3b6ed..02da554 100644
--- a/aphrodite/endpoints/openai/api_server.py
+++ b/aphrodite/endpoints/openai/api_server.py
@@ -565,6 +565,7 @@ if __name__ == "__main__":
         engine_args.tokenizer,
         tokenizer_mode=engine_args.tokenizer_mode,
         trust_remote_code=engine_args.trust_remote_code,
+        revision=engine_args.revision,
     )
 
     chat_template = args.chat_template
diff --git a/aphrodite/endpoints/openai/serving_engine.py b/aphrodite/endpoints/openai/serving_engine.py
index c98b332..b8d4e07 100644
--- a/aphrodite/endpoints/openai/serving_engine.py
+++ b/aphrodite/endpoints/openai/serving_engine.py
@@ -63,7 +63,8 @@ class OpenAIServing:
         self.tokenizer = get_tokenizer(
             engine_model_config.tokenizer,
             tokenizer_mode=engine_model_config.tokenizer_mode,
-            trust_remote_code=engine_model_config.trust_remote_code)
+            trust_remote_code=engine_model_config.trust_remote_code,
+            revision=engine_model_config.revision,)
 
     async def show_available_models(self) -> ModelList:
         """Show available models. Right now we only have one model."""
diff --git a/aphrodite/engine/aphrodite_engine.py b/aphrodite/engine/aphrodite_engine.py
index b811bfe..11baf74 100644
--- a/aphrodite/engine/aphrodite_engine.py
+++ b/aphrodite/engine/aphrodite_engine.py
@@ -163,7 +163,7 @@ class AphroditeEngine:
             max_input_length=None,
             tokenizer_mode=self.model_config.tokenizer_mode,
             trust_remote_code=self.model_config.trust_remote_code,
-            revision=self.model_config.tokenizer_revision)
+            revision=self.model_config.revision)
         init_kwargs.update(tokenizer_init_kwargs)
         self.tokenizer: TokenizerGroup = TokenizerGroup(
             self.model_config.tokenizer, **init_kwargs)

houmie commented on July 23, 2024

I have the same problem. Is there any chance this patch could be applied to a release, please? RunPod only pulls the latest release of Aphrodite. Thanks!

TheHamkerCat commented on July 23, 2024

Alright, I'm also getting that error. (I got this error initially as well, but at an earlier stage; I fixed that one with --tokenizer-revision, and now this happens.)

WARNING:  exl2 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
INFO:     Initializing the Aphrodite Engine (v0.5.1) with the following config:
INFO:     Model = 'turboderp/Llama-3-8B-exl2'
INFO:     DataType = torch.bfloat16
INFO:     Model Load Format = auto
INFO:     Number of GPUs = 1
INFO:     Disable Custom All-Reduce = False
INFO:     Quantization Format = exl2
INFO:     Context Length = 7040
INFO:     Enforce Eager Mode = False
INFO:     KV Cache Data Type = auto
INFO:     KV Cache Params Path = None
INFO:     Device = cuda
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
WARNING:  Model is quantized. Forcing float16 datatype.
INFO:     Downloading model weights ['*.safetensors']
INFO:     Model weights loaded. Memory usage: 6.26 GiB x 1 = 6.26 GiB
INFO:     # GPU blocks: 566, # CPU blocks: 2048
INFO:     Minimum concurrency: 1.29x
INFO:     Maximum sequence length allowed in the cache: 9056
INFO:     Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
WARNING:  CUDA graphs can take additional 1~3 GiB of memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode.
Capturing graph... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 35/35 0:00:00
INFO:     Graph capturing finished in 8 secs.
/usr/local/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/huggingface_hub/utils/_errors.py", line 304, in hf_raise_for_status
    response.raise_for_status()
  File "/usr/local/lib/python3.10/site-packages/requests/models.py", line 1021, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://huggingface.co/turboderp/Llama-3-8B-exl2/resolve/main/config.json

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/transformers/utils/hub.py", line 398, in cached_file
    resolved_file = hf_hub_download(
  File "/usr/local/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 1221, in hf_hub_download
    return _hf_hub_download_to_cache_dir(
  File "/usr/local/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 1282, in _hf_hub_download_to_cache_dir
    (url_to_download, etag, commit_hash, expected_size, head_call_error) = _get_metadata_or_catch_error(
  File "/usr/local/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 1722, in _get_metadata_or_catch_error
    metadata = get_hf_file_metadata(url=url, proxies=proxies, timeout=etag_timeout, headers=headers)
  File "/usr/local/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 1645, in get_hf_file_metadata
    r = _request_wrapper(
  File "/usr/local/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 372, in _request_wrapper
    response = _request_wrapper(
  File "/usr/local/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 396, in _request_wrapper
    hf_raise_for_status(response)
  File "/usr/local/lib/python3.10/site-packages/huggingface_hub/utils/_errors.py", line 315, in hf_raise_for_status
    raise EntryNotFoundError(message, response) from e
huggingface_hub.utils._errors.EntryNotFoundError: 404 Client Error. (Request ID: Root=1-663bb63c-xx)

Entry Not Found for url: https://huggingface.co/turboderp/Llama-3-8B-exl2/resolve/main/config.json.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/local/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.10/site-packages/aphrodite/endpoints/openai/api_server.py", line 564, in <module>
    tokenizer = get_tokenizer(
  File "/usr/local/lib/python3.10/site-packages/aphrodite/transformers_utils/tokenizer.py", line 87, in get_tokenizer
    tokenizer = AutoTokenizer.from_pretrained(
  File "/usr/local/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 819, in from_pretrained
    config = AutoConfig.from_pretrained(
  File "/usr/local/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py", line 928, in from_pretrained
    config_dict, unused_kwargs = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/transformers/configuration_utils.py", line 631, in get_config_dict
    config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/transformers/configuration_utils.py", line 686, in _get_config_dict
    resolved_config_file = cached_file(
  File "/usr/local/lib/python3.10/site-packages/transformers/utils/hub.py", line 452, in cached_file
    raise EnvironmentError(
OSError: turboderp/Llama-3-8B-exl2 does not appear to have a file named config.json. Checkout 'https://huggingface.co/turboderp/Llama-3-8B-exl2/main' for available files.
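
For what it's worth, the failure above is reproducible outside aphrodite with transformers alone, which also points at the workaround. A minimal sketch using the same repo and branch as earlier in the thread:

from transformers import AutoTokenizer

repo = "turboderp/Llama-3-8B-exl2"

# Raises the same OSError as above: AutoTokenizer resolves config.json
# against the default "main" revision, which 404s for this repo.
# AutoTokenizer.from_pretrained(repo)

# Passing the quantization branch explicitly works; this is what
# --tokenizer-revision (and the patch above) plumb through.
tokenizer = AutoTokenizer.from_pretrained(repo, revision="6.0bpw")
print(type(tokenizer).__name__)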

wingrunr21 commented on July 23, 2024

I'm using this Dockerfile (mostly to apply my previous patch). I had to switch back to the root user, as I was running into permission issues applying the patch.

FROM alpindale/aphrodite-engine

# Switch back to root; the base image runs as a non-root user, which
# caused permission errors when applying the patch.
USER 0:0

# Copy the patch from earlier in the thread into the image and apply it.
COPY tokenizer-revision.patch .
RUN git apply tokenizer-revision.patch

with this .env file:

QUANTIZATION=exl2
MODEL_NAME=turboderp/Llama-3-8B-Instruct-exl2
REVISION="4.0bpw"
NUMBA_CACHE_DIR=/tmp/numba_cache

You need NUMBA_CACHE_DIR due to #323.
