Comments (20)
Thanks for the solution, @josephrocca
As for the OOM, that's to be expected. Aphrodite will use as much VRAM as available, be it 1GB or 80GB. You can limit that behavior by setting this env variable in the docker launch:
-e GPU_MEMORY_UTILIZATION=0.6
This will limit the pre-allocated memory to 60% of the total GPU memory. The equivalent CLI flag is -gmu. You can use the TENSOR_PARALLEL_SIZE env variable (-tp in CLI) to set the number of GPUs to use, provided you have more than one GPU available on your machine.
Please see here for all the available variables to use:
https://github.com/PygmalionAI/aphrodite-engine/blob/main/docker/.env
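To make the arithmetic concrete, here is a minimal sketch. It assumes the pre-allocation budget is simply the utilization fraction times total GPU memory, as described above, and uses the 23.62 GiB total and ~20.4 GiB free figures reported elsewhere in this thread. The function names are hypothetical and not part of Aphrodite.

```python
def preallocation_budget_gib(total_gib: float, utilization: float) -> float:
    """Memory the engine would try to reserve, assuming budget = total * utilization."""
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    return total_gib * utilization


def max_utilization_that_fits(total_gib: float, free_gib: float) -> float:
    """Largest fraction of total memory that still fits in the free memory."""
    return free_gib / total_gib


TOTAL_GIB = 23.62  # total capacity from the OOM message in this thread
FREE_GIB = 20.4    # approximate free VRAM reported by the user

print(round(preallocation_budget_gib(TOTAL_GIB, 0.6), 2))        # 14.17
print(round(max_utilization_that_fits(TOTAL_GIB, FREE_GIB), 2))  # 0.86
```

Under that assumption, any value noticeably above roughly 0.86 would ask for more memory than was free on that machine, so 0.6 leaves comfortable headroom.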
from aphrodite-engine.
Thanks for reporting, the docker image needs to be updated ASAP. Was supposed to do it with release.
from aphrodite-engine.
thanks! =)
from aphrodite-engine.
Another question, @AlpinDale: how do I activate the int8 KV cache? I tried --kv-cache-dtype int8 but got: api_server.py: error: argument --kv-cache-dtype: invalid choice: 'int8' (choose from 'auto', 'fp8_e5m2')
from aphrodite-engine.
Sounds like you haven't updated Aphrodite yet. Please do that, then read here: https://github.com/PygmalionAI/aphrodite-engine/wiki/8.-Quantization#int8
from aphrodite-engine.
Yep, I had not read that doc, thanks!
from aphrodite-engine.
@AlpinDale another question: I'm inside the docker container, but I could not find aphrodite/kv_quant/calibrate.py. The file exists in the git repo but not in the image. It would be very useful to have the script inside the image so that all the work can be done inside the container. It would also be nice to have the pip dependency installed during the Dockerfile build process. Running from the raw git checkout could be difficult because of dependency and kernel compilation issues (or is the script agnostic to the engine itself?).
from aphrodite-engine.
The docker image hasn't been updated yet, so it makes sense it won't be present there. Please do a git pull && pip install -e . to update Aphrodite inside the image for now.
from aphrodite-engine.
will try, thanks!
from aphrodite-engine.
After a successful pull and install, the script throws:
root@87d0cbe105ba:/workspace/aphrodite-engine# python aphrodite/kv_quant/calibrate.py --model /data/gorilla-openfunctions-v2-GPTQ/ --calib_dataset wikitext2 --calib_samples 128 --calib_seqlen 4096 --work_dir /home/workspace/int8_data/models--gorilla-llm--gorilla-openfunctions-v2
Traceback (most recent call last):
File "/workspace/aphrodite-engine/aphrodite/kv_quant/calibrate.py", line 112, in <module>
fire.Fire(calibrate)
File "/usr/local/lib/python3.10/dist-packages/fire/core.py", line 143, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/usr/local/lib/python3.10/dist-packages/fire/core.py", line 477, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "/usr/local/lib/python3.10/dist-packages/fire/core.py", line 693, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "/workspace/aphrodite-engine/aphrodite/kv_quant/calibrate.py", line 56, in calibrate
tokenizer = AutoTokenizer.from_pretrained(model,
File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/tokenization_auto.py", line 814, in from_pretrained
return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py", line 2029, in from_pretrained
return cls._from_pretrained(
File "/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py", line 2261, in _from_pretrained
tokenizer = cls(*init_inputs, **init_kwargs)
File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/tokenization_llama.py", line 178, in __init__
self.sp_model = self.get_spm_processor(kwargs.pop("from_slow", False))
File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/tokenization_llama.py", line 203, in get_spm_processor
tokenizer.Load(self.vocab_file)
File "/usr/local/lib/python3.10/dist-packages/sentencepiece/__init__.py", line 905, in Load
return self.LoadFromFile(model_file)
File "/usr/local/lib/python3.10/dist-packages/sentencepiece/__init__.py", line 310, in LoadFromFile
return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
TypeError: not a string
After cleaning the cache:
root@87d0cbe105ba:/workspace/aphrodite-engine# python aphrodite/kv_quant/calibrate.py --model gorilla-llm/gorilla-openfunctions-v2 --calib_dataset wikitext2 --calib_samples 128 --calib_seqlen 4096 --work_dir /home/workspace/int8_data/models--gorilla-llm--gorilla-openfunctions-v2
tokenizer_config.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4.24k/4.24k [00:00<00:00, 25.7MB/s]
special_tokens_map.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 462/462 [00:00<00:00, 3.68MB/s]
tokenizer.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4.61M/4.61M [00:00<00:00, 8.19MB/s]
Traceback (most recent call last):
File "/workspace/aphrodite-engine/aphrodite/kv_quant/calibrate.py", line 112, in <module>
fire.Fire(calibrate)
File "/usr/local/lib/python3.10/dist-packages/fire/core.py", line 143, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/usr/local/lib/python3.10/dist-packages/fire/core.py", line 477, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "/usr/local/lib/python3.10/dist-packages/fire/core.py", line 693, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "/workspace/aphrodite-engine/aphrodite/kv_quant/calibrate.py", line 56, in calibrate
tokenizer = AutoTokenizer.from_pretrained(model,
File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/tokenization_auto.py", line 814, in from_pretrained
return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py", line 2029, in from_pretrained
return cls._from_pretrained(
File "/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py", line 2261, in _from_pretrained
tokenizer = cls(*init_inputs, **init_kwargs)
File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/tokenization_llama.py", line 178, in __init__
self.sp_model = self.get_spm_processor(kwargs.pop("from_slow", False))
File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/tokenization_llama.py", line 203, in get_spm_processor
tokenizer.Load(self.vocab_file)
File "/usr/local/lib/python3.10/dist-packages/sentencepiece/__init__.py", line 905, in Load
return self.LoadFromFile(model_file)
File "/usr/local/lib/python3.10/dist-packages/sentencepiece/__init__.py", line 310, in LoadFromFile
return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
TypeError: not a string
I will wait for the official docker image to be updated. Thanks for your time, @AlpinDale =)
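One possible reading of the tracebacks above, offered as an assumption rather than a confirmed diagnosis: the slow LlamaTokenizer was instantiated, and sentencepiece received a vocab_file that was not a string (most likely None), which can happen when the model directory ships only tokenizer.json and no tokenizer.model. The second run's download list (tokenizer_config.json, special_tokens_map.json, tokenizer.json) is consistent with that. A minimal sketch for checking which tokenizer files a local model directory contains; the helper name is hypothetical:

```python
import tempfile
from pathlib import Path

# File names commonly shipped with Hugging Face tokenizers.
TOKENIZER_FILES = (
    "tokenizer.model",          # sentencepiece vocab, needed by the slow tokenizer
    "tokenizer.json",           # fast tokenizer
    "tokenizer_config.json",
    "special_tokens_map.json",
)


def describe_tokenizer_files(model_dir: str) -> dict:
    """Report which tokenizer files exist in a local model directory."""
    root = Path(model_dir)
    return {name: (root / name).is_file() for name in TOKENIZER_FILES}


# Demo on a throwaway directory that mimics a repo with only the fast tokenizer.
with tempfile.TemporaryDirectory() as demo_dir:
    (Path(demo_dir) / "tokenizer.json").write_text("{}")
    report = describe_tokenizer_files(demo_dir)
    print(report["tokenizer.model"])  # False
    print(report["tokenizer.json"])   # True
```

If tokenizer.model turns out to be missing, the calibration script would need to load the fast tokenizer instead; whether calibrate.py exposes an option for that is not something this thread confirms.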
from aphrodite-engine.
I updated the docker image, but I'm getting this error when running it:
+ exec python3 -m aphrodite.endpoints.openai.api_server --host 0.0.0.0 --port 5000 --download-dir /app/tmp/hub --model EleutherAI/pythia-70m-deduped --api-keys testing
Traceback (most recent call last):
File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/app/aphrodite-engine/aphrodite/endpoints/openai/api_server.py", line 30, in <module>
from aphrodite.endpoints.openai.serving_chat import OpenAIServingChat
File "/app/aphrodite-engine/aphrodite/endpoints/openai/serving_chat.py", line 16, in <module>
from aphrodite.modeling.outlines_decoding import get_guided_decoding_logits_processor
File "/app/aphrodite-engine/aphrodite/modeling/outlines_decoding.py", line 12, in <module>
from aphrodite.modeling.outlines_logits_processors import JSONLogitsProcessor, RegexLogitsProcessor
File "/app/aphrodite-engine/aphrodite/modeling/outlines_logits_processors.py", line 24, in <module>
from outlines.fsm.fsm import RegexFSM
File "/usr/local/lib/python3.10/dist-packages/outlines/__init__.py", line 2, in <module>
import outlines.generate
File "/usr/local/lib/python3.10/dist-packages/outlines/generate/__init__.py", line 2, in <module>
from .cfg import cfg
File "/usr/local/lib/python3.10/dist-packages/outlines/generate/cfg.py", line 3, in <module>
from outlines.fsm.guide import CFGGuide
File "/usr/local/lib/python3.10/dist-packages/outlines/fsm/guide.py", line 9, in <module>
from outlines.fsm.regex import create_fsm_index_tokenizer, make_deterministic_fsm
File "/usr/local/lib/python3.10/dist-packages/outlines/fsm/regex.py", line 96, in <module>
def create_fsm_info(
File "/usr/local/lib/python3.10/dist-packages/numba/core/decorators.py", line 229, in wrapper
disp.enable_caching()
File "/usr/local/lib/python3.10/dist-packages/numba/core/dispatcher.py", line 856, in enable_caching
self._cache = FunctionCache(self.py_func)
File "/usr/local/lib/python3.10/dist-packages/numba/core/caching.py", line 601, in __init__
self._impl = self._impl_class(py_func)
File "/usr/local/lib/python3.10/dist-packages/numba/core/caching.py", line 337, in __init__
raise RuntimeError("cannot cache function %r: no locator available "
RuntimeError: cannot cache function 'create_fsm_info': no locator available for file '/usr/local/lib/python3.10/dist-packages/outlines/fsm/regex.py'
Do you guys have any idea what's happening? Something related to numba tmp dir?
from aphrodite-engine.
@AlpinDale this morning, when I cloned 0.5 and built/installed it inside the previous docker image, I did not hit any issue; in fact, I'm already using it now. The only issues are with the int8 KV cache calibration. I'm leaving this information in case it is helpful for you.
from aphrodite-engine.
Hey @AlpinDale, I also had that issue with create_fsm_info, and thanks to this comment I solved it like this:
docker run --gpus all --shm-size 10g -e NUMBA_CACHE_DIR=/tmp/numba_cache -p 5000:5000 -it alpindale/aphrodite-engine
However, I then, for some reason, run into an out-of-memory error, despite having ~20.4GB of VRAM free (desktop OS stuff is using the other ~3.6GB) and only loading the tiny EleutherAI/pythia-70m-deduped:
docker run --gpus all --shm-size 1g -e NUMBA_CACHE_DIR=/tmp/numba_cache -p 5000:5000 -it alpindale/aphrodite-engine
Starting Aphrodite Engine API server...
+ exec python3 -m aphrodite.endpoints.openai.api_server --host 0.0.0.0 --port 5000 --download-dir /app/tmp/hub
config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 567/567 [00:00<00:00, 6.39MB/s]
INFO: Initializing the Aphrodite Engine (v0.5.0) with the following config:
INFO: Model = 'EleutherAI/pythia-70m-deduped'
INFO: DataType = torch.float16
INFO: Model Load Format = auto
INFO: Number of GPUs = 1
INFO: Disable Custom All-Reduce = False
INFO: Quantization Format = None
INFO: Context Length = 2048
INFO: Enforce Eager Mode = False
INFO: KV Cache Data Type = auto
INFO: KV Cache Params Path = None
INFO: Device = cuda
tokenizer_config.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 396/396 [00:00<00:00, 2.20MB/s]
tokenizer.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2.11M/2.11M [00:01<00:00, 1.71MB/s]
special_tokens_map.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 99.0/99.0 [00:00<00:00, 811kB/s]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO: Downloading model weights ['*.safetensors']
model.safetensors: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 166M/166M [00:01<00:00, 112MB/s]
INFO: Model loaded. Memory usage: 0.13 GiB
INFO: # GPU blocks: 110844, # CPU blocks: 21845
INFO: Minimum concurrency: 865.97x
INFO: Maximum sequence length allowed in the cache: 1773504
Traceback (most recent call last):
File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/app/aphrodite-engine/aphrodite/endpoints/openai/api_server.py", line 561, in <module>
engine = AsyncAphrodite.from_engine_args(engine_args)
File "/app/aphrodite-engine/aphrodite/engine/async_aphrodite.py", line 676, in from_engine_args
engine = cls(parallel_config.worker_use_ray,
File "/app/aphrodite-engine/aphrodite/engine/async_aphrodite.py", line 341, in __init__
self.engine = self._init_engine(*args, **kwargs)
File "/app/aphrodite-engine/aphrodite/engine/async_aphrodite.py", line 410, in _init_engine
return engine_class(*args, **kwargs)
File "/app/aphrodite-engine/aphrodite/engine/aphrodite_engine.py", line 116, in __init__
self._init_cache()
File "/app/aphrodite-engine/aphrodite/engine/aphrodite_engine.py", line 358, in _init_cache
self._run_workers("init_cache_engine", cache_config=self.cache_config)
File "/app/aphrodite-engine/aphrodite/engine/aphrodite_engine.py", line 1025, in _run_workers
driver_worker_output = getattr(self.driver_worker,
File "/app/aphrodite-engine/aphrodite/task_handler/worker.py", line 161, in init_cache_engine
self.cache_engine = CacheEngine(self.cache_config, self.model_config,
File "/app/aphrodite-engine/aphrodite/task_handler/cache_engine.py", line 46, in __init__
self.gpu_cache = self.allocate_gpu_cache()
File "/app/aphrodite-engine/aphrodite/task_handler/cache_engine.py", line 82, in allocate_gpu_cache
value_blocks = torch.empty(
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.69 GiB. GPU 0 has a total capacity of 23.62 GiB of which 786.12 MiB is free. Process 1427206 has 61.83 MiB memory in use. Process 323616 has 19.34 GiB memory in use. Of the allocated memory 18.75 GiB is allocated by PyTorch, and 19.73 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
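The numbers in the log above can be cross-checked with a little arithmetic. This is a sketch under stated assumptions, not something the log confirms directly: it assumes 16 tokens per cache block, one key tensor and one value tensor per layer, and pythia-70m's published shape (6 layers, hidden size 512) in float16.

```python
# Figures copied from the log above.
NUM_GPU_BLOCKS = 110844
CONTEXT_LENGTH = 2048
REPORTED_MAX_TOKENS = 1773504

# Assumptions, not printed in the log.
BLOCK_SIZE = 16        # tokens per cache block
NUM_LAYERS = 6         # pythia-70m
HIDDEN_SIZE = 512      # pythia-70m
BYTES_PER_ELEMENT = 2  # float16

GIB = 1024 ** 3

max_tokens = NUM_GPU_BLOCKS * BLOCK_SIZE
min_concurrency = max_tokens / CONTEXT_LENGTH

# Size of one key (or value) tensor for a single layer.
tensor_gib = max_tokens * HIDDEN_SIZE * BYTES_PER_ELEMENT / GIB
# Key and value tensors for every layer.
total_cache_gib = tensor_gib * 2 * NUM_LAYERS

print(max_tokens == REPORTED_MAX_TOKENS)   # True
print(f"{min_concurrency:.2f}x")           # 865.97x
print(f"{tensor_gib:.2f} GiB per tensor")  # 1.69 GiB per tensor
print(f"{total_cache_gib:.2f} GiB total")  # 20.30 GiB total
```

Under these assumptions, 1.69 GiB per tensor matches the size of the failed allocation, and the full KV cache would need about 20.3 GiB. That would not fit in the ~20.4GB reported free once the 0.13 GiB of weights and other overhead are counted, which suggests the cache was sized from total rather than free memory.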
from aphrodite-engine.
Fixed now. Docker should work fine.
from aphrodite-engine.
Is this actually fixed, or is it just only for those running on Runpod?
from aphrodite-engine.
It should be fixed for everyone. I'm planning a new hotfix release soon, and will update the dockerfile with this fix.
from aphrodite-engine.
Getting the same create_fsm_info error just now:
Status: Downloaded newer image for alpindale/aphrodite-engine:latest
Starting Aphrodite Engine API server...
+ exec python3 -m aphrodite.endpoints.openai.api_server --host 0.0.0.0 --port 5000 --download-dir /app/tmp/hub --model mistralai/Mistral-7B-Instruct-v0.2
Traceback (most recent call last):
File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
...
File "/usr/local/lib/python3.10/dist-packages/numba/core/decorators.py", line 229, in wrapper
disp.enable_caching()
File "/usr/local/lib/python3.10/dist-packages/numba/core/dispatcher.py", line 856, in enable_caching
self._cache = FunctionCache(self.py_func)
File "/usr/local/lib/python3.10/dist-packages/numba/core/caching.py", line 601, in __init__
self._impl = self._impl_class(py_func)
File "/usr/local/lib/python3.10/dist-packages/numba/core/caching.py", line 337, in __init__
raise RuntimeError("cannot cache function %r: no locator available "
RuntimeError: cannot cache function 'create_fsm_info': no locator available for file '/usr/local/lib/python3.10/dist-packages/outlines/fsm/regex.py'
(base) ➜ ~ docker run --gpus '"all"' --shm-size 10g -p 2242:2242 -e MODEL_NAME="mistralai/Mistral-7B-Instruct-v0.2" -e PORT=2242 -it alpindale/aphrodite-engine
from aphrodite-engine.
@fullstackwebdev please set NUMBA_CACHE_DIR=/tmp/numba_cache as an env variable when launching the docker image. I'm currently building a new image with this fix, if you'd rather wait for that.
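For anyone launching the server from their own Python wrapper rather than through docker, the same workaround can be sketched as below. NUMBA_CACHE_DIR is a real numba setting; the helper name and the choice of directory are assumptions for illustration. The variable should be set before numba (or anything that imports it, such as outlines) is imported.

```python
import os
import tempfile


def ensure_numba_cache_dir(path: str) -> str:
    """Create a writable directory and point numba's on-disk cache at it."""
    os.makedirs(path, exist_ok=True)
    if not os.access(path, os.W_OK):
        raise PermissionError(f"numba cache dir is not writable: {path}")
    os.environ["NUMBA_CACHE_DIR"] = path
    return path


# Use a directory under the system temp dir, mirroring /tmp/numba_cache.
cache_dir = ensure_numba_cache_dir(
    os.path.join(tempfile.gettempdir(), "numba_cache")
)
print(os.environ["NUMBA_CACHE_DIR"] == cache_dir)  # True
```

The point of the check is that the "no locator available" error appears when numba cannot find a writable place for its cache, so failing early with a clear message is easier to debug.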
from aphrodite-engine.
@AlpinDale if your fix can be pushed to GitHub, then others can work on a new image too. He is not the only one waiting for a new image to be built.
from aphrodite-engine.
@mrseeker I've already pushed the fix to the entrypoint script for the docker. Image is currently being uploaded.
from aphrodite-engine.