Comments (20)

AlpinDale avatar AlpinDale commented on June 11, 2024

Thanks for the solution, @josephrocca

As for the OOM, that's to be expected. Aphrodite will use as much VRAM as available, be it 1GB or 80GB. You can limit that behavior by setting this env variable in the docker launch:

-e GPU_MEMORY_UTILIZATION=0.6

This will limit the pre-allocated memory to 60% of the total GPU memory. The equivalent CLI flag is -gmu. You can use the TENSOR_PARALLEL_SIZE env variable (-tp in the CLI) to set the number of GPUs to use, provided you have more than one GPU available on your machine.

Please see here for all the available variables to use:

https://github.com/PygmalionAI/aphrodite-engine/blob/main/docker/.env
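For a rough sense of the numbers, here is a minimal sketch of the arithmetic behind that setting, using the 23.62 GiB total capacity reported in the OOM log in this thread. The helper names `preallocated_gib` and `docker_run_args` are made up for illustration, and the engine's real accounting is probably more involved than a single multiplication.

```python
import shlex


def preallocated_gib(total_gib: float, utilization: float) -> float:
    """Approximate memory pre-allocated at a given utilization fraction."""
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    return total_gib * utilization


def docker_run_args(utilization: float, tensor_parallel_size: int = 1) -> list[str]:
    """Build a docker command line that passes both env variables.

    The docker flags mirror the commands quoted elsewhere in this thread.
    """
    return [
        "docker", "run", "--gpus", "all", "--shm-size", "10g",
        "-e", f"GPU_MEMORY_UTILIZATION={utilization}",
        "-e", f"TENSOR_PARALLEL_SIZE={tensor_parallel_size}",
        "-p", "5000:5000", "-it", "alpindale/aphrodite-engine",
    ]


# 23.62 GiB is the total capacity from the OOM message in this thread.
budget = preallocated_gib(23.62, 0.6)
print(f"{budget:.2f} GiB")
print(shlex.join(docker_run_args(0.6)))
```

At 0.6 the budget comes to roughly 14.17 GiB, which would fit inside the ~20.4GB of free VRAM mentioned in the report.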

from aphrodite-engine.

AlpinDale avatar AlpinDale commented on June 11, 2024

Thanks for reporting. The docker image needs to be updated ASAP; I was supposed to do it with the release.

puppetm4st3r avatar puppetm4st3r commented on June 11, 2024

thanks! =)

puppetm4st3r avatar puppetm4st3r commented on June 11, 2024

Another question, @AlpinDale: how do I activate the int8 KV cache? I tried --kv-cache-dtype int8 but got: api_server.py: error: argument --kv-cache-dtype: invalid choice: 'int8' (choose from 'auto', 'fp8_e5m2')

AlpinDale avatar AlpinDale commented on June 11, 2024

Sounds like you haven't updated aphrodite yet. Please do that then read here https://github.com/PygmalionAI/aphrodite-engine/wiki/8.-Quantization#int8
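To illustrate why an older install rejects the value outright, here is a small sketch of how an `argparse` choice list produces that kind of error. It is not the actual `api_server.py` code: `build_parser` is a hypothetical helper, the first choice list is copied from the error message above, and the second list (with `int8` added) is only an assumption about what a newer version accepts.

```python
import argparse


def build_parser(choices: list[str]) -> argparse.ArgumentParser:
    """Hypothetical stand-in for the server's argument parser."""
    parser = argparse.ArgumentParser(prog="api_server.py")
    parser.add_argument("--kv-cache-dtype", choices=choices, default="auto")
    return parser


# Choices quoted in the error message from the older install.
old_parser = build_parser(["auto", "fp8_e5m2"])

try:
    old_parser.parse_args(["--kv-cache-dtype", "int8"])
    accepted = True
except SystemExit:
    # argparse prints "invalid choice: 'int8'" to stderr and exits.
    accepted = False

print("int8 accepted by old choices:", accepted)

# Assumed newer choice list that includes int8.
new_parser = build_parser(["auto", "fp8_e5m2", "int8"])
print(new_parser.parse_args(["--kv-cache-dtype", "int8"]).kv_cache_dtype)
```

Because the check happens while parsing arguments, no configuration change can work around it; the installed code has to be newer.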

puppetm4st3r avatar puppetm4st3r commented on June 11, 2024

Yep, I hadn't read that doc, thanks!

puppetm4st3r avatar puppetm4st3r commented on June 11, 2024

@AlpinDale another question: I'm inside the docker container but can't find aphrodite/kv_quant/calibrate.py. The file exists in the git repo but not in the image. It would be very useful to have the script inside the image so that all the work can be done inside the container. It would also be nice for the pip dependency to be installed during the Dockerfile build. Running from the raw git checkout could be difficult because of dependency conflicts and kernel compilation (or is the script agnostic to the engine itself?).

AlpinDale avatar AlpinDale commented on June 11, 2024

The docker image hasn't been updated yet, so it makes sense it won't be present there. Please do a git pull && pip install -e . to update Aphrodite inside the image for now.

puppetm4st3r avatar puppetm4st3r commented on June 11, 2024

will try, thanks!

puppetm4st3r avatar puppetm4st3r commented on June 11, 2024

After a successful pull and install, the script throws:

root@87d0cbe105ba:/workspace/aphrodite-engine# python aphrodite/kv_quant/calibrate.py --model /data/gorilla-openfunctions-v2-GPTQ/ --calib_dataset wikitext2 --calib_samples 128 --calib_seqlen 4096 --work_dir /home/workspace/int8_data/models--gorilla-llm--gorilla-openfunctions-v2
Traceback (most recent call last):
  File "/workspace/aphrodite-engine/aphrodite/kv_quant/calibrate.py", line 112, in <module>
    fire.Fire(calibrate)
  File "/usr/local/lib/python3.10/dist-packages/fire/core.py", line 143, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/usr/local/lib/python3.10/dist-packages/fire/core.py", line 477, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/usr/local/lib/python3.10/dist-packages/fire/core.py", line 693, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/workspace/aphrodite-engine/aphrodite/kv_quant/calibrate.py", line 56, in calibrate
    tokenizer = AutoTokenizer.from_pretrained(model,
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/tokenization_auto.py", line 814, in from_pretrained
    return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py", line 2029, in from_pretrained
    return cls._from_pretrained(
  File "/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py", line 2261, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/tokenization_llama.py", line 178, in __init__
    self.sp_model = self.get_spm_processor(kwargs.pop("from_slow", False))
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/tokenization_llama.py", line 203, in get_spm_processor
    tokenizer.Load(self.vocab_file)
  File "/usr/local/lib/python3.10/dist-packages/sentencepiece/__init__.py", line 905, in Load
    return self.LoadFromFile(model_file)
  File "/usr/local/lib/python3.10/dist-packages/sentencepiece/__init__.py", line 310, in LoadFromFile
    return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
TypeError: not a string

Cleaning the cache:

root@87d0cbe105ba:/workspace/aphrodite-engine# python aphrodite/kv_quant/calibrate.py --model gorilla-llm/gorilla-openfunctions-v2 --calib_dataset wikitext2 --calib_samples 128 --calib_seqlen 4096 --work_dir /home/workspace/int8_data/models--gorilla-llm--gorilla-openfunctions-v2
tokenizer_config.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4.24k/4.24k [00:00<00:00, 25.7MB/s]
special_tokens_map.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 462/462 [00:00<00:00, 3.68MB/s]
tokenizer.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4.61M/4.61M [00:00<00:00, 8.19MB/s]
Traceback (most recent call last):
  File "/workspace/aphrodite-engine/aphrodite/kv_quant/calibrate.py", line 112, in <module>
    fire.Fire(calibrate)
  File "/usr/local/lib/python3.10/dist-packages/fire/core.py", line 143, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/usr/local/lib/python3.10/dist-packages/fire/core.py", line 477, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/usr/local/lib/python3.10/dist-packages/fire/core.py", line 693, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/workspace/aphrodite-engine/aphrodite/kv_quant/calibrate.py", line 56, in calibrate
    tokenizer = AutoTokenizer.from_pretrained(model,
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/tokenization_auto.py", line 814, in from_pretrained
    return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py", line 2029, in from_pretrained
    return cls._from_pretrained(
  File "/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py", line 2261, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/tokenization_llama.py", line 178, in __init__
    self.sp_model = self.get_spm_processor(kwargs.pop("from_slow", False))
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/tokenization_llama.py", line 203, in get_spm_processor
    tokenizer.Load(self.vocab_file)
  File "/usr/local/lib/python3.10/dist-packages/sentencepiece/__init__.py", line 905, in Load
    return self.LoadFromFile(model_file)
  File "/usr/local/lib/python3.10/dist-packages/sentencepiece/__init__.py", line 310, in LoadFromFile
    return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
TypeError: not a string
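One plausible reading of both tracebacks, offered as a guess rather than a confirmed diagnosis: the slow `LlamaTokenizer` loads a SentencePiece `tokenizer.model` file, and the download log above fetches only `tokenizer_config.json`, `special_tokens_map.json` and `tokenizer.json`. If `tokenizer.model` is absent, `vocab_file` would be `None`, which would match `TypeError: not a string`. The sketch below checks which tokenizer files a local model directory holds; `tokenizer_files` and `needs_fast_tokenizer` are hypothetical helpers, not part of `calibrate.py`.

```python
import tempfile
from pathlib import Path


def tokenizer_files(model_dir: str) -> dict[str, bool]:
    """Report which tokenizer artifacts exist in a local model directory."""
    root = Path(model_dir)
    names = ("tokenizer.model", "tokenizer.json")
    return {name: (root / name).is_file() for name in names}


def needs_fast_tokenizer(model_dir: str) -> bool:
    """True when only the fast-tokenizer file (tokenizer.json) is present."""
    found = tokenizer_files(model_dir)
    return found["tokenizer.json"] and not found["tokenizer.model"]


# Demo on a throwaway directory that mimics the files seen in the download log.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "tokenizer.json").write_text("{}")
    demo_result = needs_fast_tokenizer(tmp)

print(demo_result)
```

If this returns `True` for the real model directory, loading with `AutoTokenizer.from_pretrained(path, use_fast=True)` may avoid the SentencePiece code path. Whether `calibrate.py` exposes such an option is not something this thread confirms.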

will wait for the official docker image to be updated, thanks for your time @AlpinDale =)

AlpinDale avatar AlpinDale commented on June 11, 2024

I updated the docker image, but I'm getting this error when running it:

+ exec python3 -m aphrodite.endpoints.openai.api_server --host 0.0.0.0 --port 5000 --download-dir /app/tmp/hub --model EleutherAI/pythia-70m-deduped --api-keys testing
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/app/aphrodite-engine/aphrodite/endpoints/openai/api_server.py", line 30, in <module>
    from aphrodite.endpoints.openai.serving_chat import OpenAIServingChat
  File "/app/aphrodite-engine/aphrodite/endpoints/openai/serving_chat.py", line 16, in <module>
    from aphrodite.modeling.outlines_decoding import get_guided_decoding_logits_processor
  File "/app/aphrodite-engine/aphrodite/modeling/outlines_decoding.py", line 12, in <module>
    from aphrodite.modeling.outlines_logits_processors import JSONLogitsProcessor, RegexLogitsProcessor
  File "/app/aphrodite-engine/aphrodite/modeling/outlines_logits_processors.py", line 24, in <module>
    from outlines.fsm.fsm import RegexFSM
  File "/usr/local/lib/python3.10/dist-packages/outlines/__init__.py", line 2, in <module>
    import outlines.generate
  File "/usr/local/lib/python3.10/dist-packages/outlines/generate/__init__.py", line 2, in <module>
    from .cfg import cfg
  File "/usr/local/lib/python3.10/dist-packages/outlines/generate/cfg.py", line 3, in <module>
    from outlines.fsm.guide import CFGGuide
  File "/usr/local/lib/python3.10/dist-packages/outlines/fsm/guide.py", line 9, in <module>
    from outlines.fsm.regex import create_fsm_index_tokenizer, make_deterministic_fsm
  File "/usr/local/lib/python3.10/dist-packages/outlines/fsm/regex.py", line 96, in <module>
    def create_fsm_info(
  File "/usr/local/lib/python3.10/dist-packages/numba/core/decorators.py", line 229, in wrapper
    disp.enable_caching()
  File "/usr/local/lib/python3.10/dist-packages/numba/core/dispatcher.py", line 856, in enable_caching
    self._cache = FunctionCache(self.py_func)
  File "/usr/local/lib/python3.10/dist-packages/numba/core/caching.py", line 601, in __init__
    self._impl = self._impl_class(py_func)
  File "/usr/local/lib/python3.10/dist-packages/numba/core/caching.py", line 337, in __init__
    raise RuntimeError("cannot cache function %r: no locator available "
RuntimeError: cannot cache function 'create_fsm_info': no locator available for file '/usr/local/lib/python3.10/dist-packages/outlines/fsm/regex.py'

Do you guys have any idea what's happening? Something related to numba tmp dir?

cc @StefanDanielSchwarz

puppetm4st3r avatar puppetm4st3r commented on June 11, 2024

@AlpinDale this morning I cloned 0.5 and built/installed it inside the previous docker image without any issue; in fact, I'm already using it now. The only issues are with the int8 KV cache calibration. I'm leaving this information here in case it's helpful for you.

josephrocca avatar josephrocca commented on June 11, 2024

Hey @AlpinDale, I also had that issue with create_fsm_info, and thanks to this comment I solved it like this:

docker run --gpus all --shm-size 10g -e NUMBA_CACHE_DIR=/tmp/numba_cache -p 5000:5000 -it alpindale/aphrodite-engine
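The same idea can be applied from Python when launching outside Docker. The "no locator available" error appears to mean that numba found no writable place for its on-disk cache, and `NUMBA_CACHE_DIR` points it at one. A minimal sketch, assuming the variable is set before anything imports numba; `ensure_numba_cache_dir` is a hypothetical helper:

```python
import os
import tempfile
from pathlib import Path


def ensure_numba_cache_dir(path: str) -> str:
    """Create a writable cache directory and export NUMBA_CACHE_DIR.

    Call this before importing numba (or any package that imports it).
    """
    cache_dir = Path(path)
    cache_dir.mkdir(parents=True, exist_ok=True)
    if not os.access(cache_dir, os.W_OK):
        raise PermissionError(f"{cache_dir} is not writable")
    os.environ["NUMBA_CACHE_DIR"] = str(cache_dir)
    return str(cache_dir)


# On Linux this typically resolves to /tmp/numba_cache, as in the command above.
cache_dir = ensure_numba_cache_dir(os.path.join(tempfile.gettempdir(), "numba_cache"))
print(cache_dir)
```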

However, I then, for some reason, run into an out-of-memory error, despite having ~20.4GB of VRAM free (desktop OS stuff is using the other ~3.6GB) and only loading the tiny EleutherAI/pythia-70m-deduped:

docker run --gpus all --shm-size 1g -e NUMBA_CACHE_DIR=/tmp/numba_cache -p 5000:5000 -it alpindale/aphrodite-engine
Starting Aphrodite Engine API server...
+ exec python3 -m aphrodite.endpoints.openai.api_server --host 0.0.0.0 --port 5000 --download-dir /app/tmp/hub
config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 567/567 [00:00<00:00, 6.39MB/s]
INFO:     Initializing the Aphrodite Engine (v0.5.0) with the following config:
INFO:     Model = 'EleutherAI/pythia-70m-deduped'
INFO:     DataType = torch.float16
INFO:     Model Load Format = auto
INFO:     Number of GPUs = 1
INFO:     Disable Custom All-Reduce = False
INFO:     Quantization Format = None
INFO:     Context Length = 2048
INFO:     Enforce Eager Mode = False
INFO:     KV Cache Data Type = auto
INFO:     KV Cache Params Path = None
INFO:     Device = cuda
tokenizer_config.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 396/396 [00:00<00:00, 2.20MB/s]
tokenizer.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2.11M/2.11M [00:01<00:00, 1.71MB/s]
special_tokens_map.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 99.0/99.0 [00:00<00:00, 811kB/s]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO:     Downloading model weights ['*.safetensors']
model.safetensors: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 166M/166M [00:01<00:00, 112MB/s]
INFO:     Model loaded. Memory usage: 0.13 GiB
INFO:     # GPU blocks: 110844, # CPU blocks: 21845
INFO:     Minimum concurrency: 865.97x
INFO:     Maximum sequence length allowed in the cache: 1773504
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/app/aphrodite-engine/aphrodite/endpoints/openai/api_server.py", line 561, in <module>
    engine = AsyncAphrodite.from_engine_args(engine_args)
  File "/app/aphrodite-engine/aphrodite/engine/async_aphrodite.py", line 676, in from_engine_args
    engine = cls(parallel_config.worker_use_ray,
  File "/app/aphrodite-engine/aphrodite/engine/async_aphrodite.py", line 341, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/app/aphrodite-engine/aphrodite/engine/async_aphrodite.py", line 410, in _init_engine
    return engine_class(*args, **kwargs)
  File "/app/aphrodite-engine/aphrodite/engine/aphrodite_engine.py", line 116, in __init__
    self._init_cache()
  File "/app/aphrodite-engine/aphrodite/engine/aphrodite_engine.py", line 358, in _init_cache
    self._run_workers("init_cache_engine", cache_config=self.cache_config)
  File "/app/aphrodite-engine/aphrodite/engine/aphrodite_engine.py", line 1025, in _run_workers
    driver_worker_output = getattr(self.driver_worker,
  File "/app/aphrodite-engine/aphrodite/task_handler/worker.py", line 161, in init_cache_engine
    self.cache_engine = CacheEngine(self.cache_config, self.model_config,
  File "/app/aphrodite-engine/aphrodite/task_handler/cache_engine.py", line 46, in __init__
    self.gpu_cache = self.allocate_gpu_cache()
  File "/app/aphrodite-engine/aphrodite/task_handler/cache_engine.py", line 82, in allocate_gpu_cache
    value_blocks = torch.empty(
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.69 GiB. GPU 0 has a total capacity of 23.62 GiB of which 786.12 MiB is free. Process 1427206 has 61.83 MiB memory in use. Process 323616 has 19.34 GiB memory in use. Of the allocated memory 18.75 GiB is allocated by PyTorch, and 19.73 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
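The cache figures in the log above are consistent with each other, which a quick calculation shows. This is a sketch of the arithmetic only: the 16-token block size is inferred from the logged numbers, not taken from the engine's configuration.

```python
# Values copied from the log lines above.
gpu_blocks = 110_844        # "# GPU blocks"
cache_tokens = 1_773_504    # "Maximum sequence length allowed in the cache"
context_length = 2_048      # "Context Length"

# Tokens per KV-cache block implied by the two logged figures.
block_size = cache_tokens // gpu_blocks

# How many full-context sequences would fit in the cache at once.
min_concurrency = cache_tokens / context_length

print(block_size)
print(f"{min_concurrency:.2f}x")
```

The result reproduces the logged "Minimum concurrency: 865.97x". A cache sized for about 1.77 million tokens is far more than a 70M-parameter model needs, so the engine appears to be sizing the cache from available VRAM rather than from the model.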

AlpinDale avatar AlpinDale commented on June 11, 2024

Fixed now. Docker should work fine.

mrseeker avatar mrseeker commented on June 11, 2024

Is this actually fixed, or is it just only for those running on Runpod?

AlpinDale avatar AlpinDale commented on June 11, 2024

It should be fixed for everyone. I'm planning a new hotfix release soon, and will update the dockerfile with this fix.

fullstackwebdev avatar fullstackwebdev commented on June 11, 2024

I'm getting the same create_fsm_info error just now:

Status: Downloaded newer image for alpindale/aphrodite-engine:latest
Starting Aphrodite Engine API server...
+ exec python3 -m aphrodite.endpoints.openai.api_server --host 0.0.0.0 --port 5000 --download-dir /app/tmp/hub --model mistralai/Mistral-7B-Instruct-v0.2
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
...
  File "/usr/local/lib/python3.10/dist-packages/numba/core/decorators.py", line 229, in wrapper
    disp.enable_caching()
  File "/usr/local/lib/python3.10/dist-packages/numba/core/dispatcher.py", line 856, in enable_caching
    self._cache = FunctionCache(self.py_func)
  File "/usr/local/lib/python3.10/dist-packages/numba/core/caching.py", line 601, in __init__
    self._impl = self._impl_class(py_func)
  File "/usr/local/lib/python3.10/dist-packages/numba/core/caching.py", line 337, in __init__
    raise RuntimeError("cannot cache function %r: no locator available "
RuntimeError: cannot cache function 'create_fsm_info': no locator available for file '/usr/local/lib/python3.10/dist-packages/outlines/fsm/regex.py'
(base) ➜  ~ docker run --gpus '"all"' --shm-size 10g -p 2242:2242 -e MODEL_NAME="mistralai/Mistral-7B-Instruct-v0.2" -e PORT=2242 -it alpindale/aphrodite-engine

AlpinDale avatar AlpinDale commented on June 11, 2024

@fullstackwebdev please set NUMBA_CACHE_DIR=/tmp/numba_cache as an env variable when launching the docker image. I'm currently building a new image with this fix, if you'd rather wait for that.

mrseeker avatar mrseeker commented on June 11, 2024

@AlpinDale if your fix can be pushed to GitHub, then others can work on a new image too. He is not the only one waiting for a new image to be built.

AlpinDale avatar AlpinDale commented on June 11, 2024

@mrseeker I've already pushed the fix to the entrypoint script for the docker. Image is currently being uploaded.
