Comments (12)
Ah I see what the issue is.
We're using a custom GGUF model parser in aphrodite, so everything needs to be hand-written and implemented for every model arch. Llama, Mistral, et al. fall under the llama category of models in llama.cpp, so their tensors and configs match every llama model. Models like qwen2, command-r, etc., are supported by llama.cpp but use different names for their tensors. To add support for these, we'd have to handle every model individually. I haven't gotten around to doing it yet; it'd need a fair bit of work. If you (or anyone else) would like to contribute, I'd start by looking at these two places:
aphrodite-engine/aphrodite/transformers_utils/tokenizer.py
Lines 16 to 65 in ed225f5
aphrodite-engine/aphrodite/modeling/hf_downloader.py
Lines 208 to 281 in ed225f5
from aphrodite-engine.
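To make the per-arch work concrete, here is a minimal sketch of what mapping llama-family GGUF metadata keys to an HF-style config might look like. The key names follow llama.cpp's GGUF conventions; the table and function are illustrative assumptions, not aphrodite's actual code.

```python
# Sketch: translate llama-family GGUF metadata keys into HF config fields.
# Key names follow llama.cpp's GGUF conventions; this is illustrative only,
# not aphrodite's actual implementation.

# GGUF prefixes model-specific keys with the architecture name, so a qwen2
# file stores "qwen2.block_count", not "llama.block_count" -- which is why
# every arch needs its own handling.
GGUF_TO_HF = {
    "context_length": "max_position_embeddings",
    "embedding_length": "hidden_size",
    "block_count": "num_hidden_layers",
    "feed_forward_length": "intermediate_size",
    "attention.head_count": "num_attention_heads",
    "attention.head_count_kv": "num_key_value_heads",
}

def extract_hf_config(metadata: dict) -> dict:
    arch = metadata["general.architecture"]
    # qwen2, command-r, ... would each need their own translation table
    if arch not in ("llama",):
        raise RuntimeError(f"Unsupported architecture {arch}")
    config = {"model_type": arch}
    for gguf_key, hf_key in GGUF_TO_HF.items():
        full_key = f"{arch}.{gguf_key}"
        if full_key in metadata:
            config[hf_key] = metadata[full_key]
    return config

# Example metadata as it might appear in a llama-family GGUF file:
meta = {
    "general.architecture": "llama",
    "llama.block_count": 32,
    "llama.embedding_length": 4096,
    "llama.attention.head_count": 32,
}
print(extract_hf_config(meta))
```

Supporting a new arch would then mean adding its table (and the matching tensor-name handling) rather than changing the loader itself.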
I will take a closer look, but FYI, exl2 quants do not work with multi-gpu setups. It's the only quant with that limitation.
It's a single-P40 setup inside WSL2, so I don't know why it raises a ValueError like that.
That would be the -tp 2 in your command. Please see here for a full list of the commands and what they do.
python -m aphrodite.endpoints.openai.api_server --model /mnt/c/model/sparsetral-16x7B-v2-SPIN_iter1-exl2-6.5/ -tp 1 --api-keys sk-example --trust-remote-code --dtype float32 --kv-cache-dtype fp8_e5m2
You are using a model of type sparsetral to instantiate a model of type mistral. This is not supported for all configurations of models and can yield errors.
INFO: CUDA_HOME is not found in the environment. Using /usr/local/cuda as CUDA_HOME.
INFO: Using fp8_e5m2 data type to store kv cache. It reduces the GPU memory footprint and boosts the performance. But it may cause slight accuracy drop. Currently we only support fp8 without scaling factors and make e5m2 as a default format.
INFO: Initializing the Aphrodite Engine (v0.5.1) with the following config:
INFO: Model = '/mnt/c/model/sparsetral-16x7B-v2-SPIN_iter1-exl2-6.5/'
INFO: DataType = torch.float32
INFO: Model Load Format = auto
INFO: Number of GPUs = 1
INFO: Disable Custom All-Reduce = False
INFO: Quantization Format = None
INFO: Context Length = 32768
INFO: Enforce Eager Mode = False
INFO: KV Cache Data Type = fp8_e5m2
INFO: KV Cache Params Path = None
INFO: Device = cuda
Traceback (most recent call last):
File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/home/sora/.local/lib/python3.10/site-packages/aphrodite/endpoints/openai/api_server.py", line 563, in <module>
engine = AsyncAphrodite.from_engine_args(engine_args)
File "/home/sora/.local/lib/python3.10/site-packages/aphrodite/engine/async_aphrodite.py", line 676, in from_engine_args
engine = cls(parallel_config.worker_use_ray,
File "/home/sora/.local/lib/python3.10/site-packages/aphrodite/engine/async_aphrodite.py", line 341, in __init__
self.engine = self._init_engine(*args, **kwargs)
File "/home/sora/.local/lib/python3.10/site-packages/aphrodite/engine/async_aphrodite.py", line 410, in _init_engine
return engine_class(*args, **kwargs)
File "/home/sora/.local/lib/python3.10/site-packages/aphrodite/engine/aphrodite_engine.py", line 115, in __init__
self._init_workers()
File "/home/sora/.local/lib/python3.10/site-packages/aphrodite/engine/aphrodite_engine.py", line 157, in _init_workers
self._run_workers("load_model")
File "/home/sora/.local/lib/python3.10/site-packages/aphrodite/engine/aphrodite_engine.py", line 1028, in _run_workers
driver_worker_output = getattr(self.driver_worker,
File "/home/sora/.local/lib/python3.10/site-packages/aphrodite/task_handler/worker.py", line 112, in load_model
self.model_runner.load_model()
File "/home/sora/.local/lib/python3.10/site-packages/aphrodite/task_handler/model_runner.py", line 121, in load_model
self.model = get_model(self.model_config, self.device_config,
File "/home/sora/.local/lib/python3.10/site-packages/aphrodite/modeling/loader.py", line 47, in get_model
model_class = _get_model_architecture(model_config)
File "/home/sora/.local/lib/python3.10/site-packages/aphrodite/modeling/loader.py", line 39, in _get_model_architecture
raise ValueError(
ValueError: Model architectures ['modeling_sparsetral.MistralForCausalLM'] are not supported for now. Supported architectures: ['AquilaModel', 'AquilaForCausalLM', 'BaiChuanForCausalLM', 'BaichuanForCausalLM', 'BloomForCausalLM', 'ChatGLMModel', 'ChatGLMForConditionalGeneration', 'DeciLMForCausalLM', 'DeepseekForCausalLM', 'FalconForCausalLM', 'GemmaForCausalLM', 'GPT2LMHeadModel', 'GPTBigCodeForCausalLM', 'GPTJForCausalLM', 'GPTNeoXForCausalLM', 'InternLMForCausalLM', 'InternLM2ForCausalLM', 'LlamaForCausalLM', 'LLaMAForCausalLM', 'MistralForCausalLM', 'MixtralForCausalLM', 'QuantMixtralForCausalLM', 'MptForCausalLM', 'MPTForCausalLM', 'OLMoForCausalLM', 'OPTForCausalLM', 'PhiForCausalLM', 'QWenLMHeadModel', 'Qwen2ForCausalLM', 'RWForCausalLM', 'StableLMEpochForCausalLM', 'StableLmForCausalLM']
I think I got it working, but 'modeling_sparsetral.MistralForCausalLM' is not supported for now.
You can probably remove the modeling_sparsetral part from the model's config.json; it may work, but it'll skip all the MoE stuff. The same thing is happening with that exl2 quant, I imagine, because exl2 doesn't support this arch.
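For anyone trying this, the edit would look roughly like the sketch below. It replaces the custom "modeling_sparsetral.MistralForCausalLM" entry with the bare class name so the loader uses the stock mistral code (losing the MoE layers, as noted); the helper name and the auto_map handling are my assumptions, not project code. Back up config.json first.

```python
import json

def strip_custom_arch(config: dict) -> dict:
    """Replace 'modeling_sparsetral.MistralForCausalLM'-style entries
    with the bare class name so the stock architecture is used."""
    config = dict(config)
    config["architectures"] = [
        arch.split(".")[-1] for arch in config.get("architectures", [])
    ]
    # auto_map routes loading through the custom module; drop it if present.
    config.pop("auto_map", None)
    return config

# Usage sketch (path is hypothetical -- point it at the model directory):
# with open("config.json") as f:
#     cfg = json.load(f)
# with open("config.json", "w") as f:
#     json.dump(strip_custom_arch(cfg), f, indent=2)

print(strip_custom_arch(
    {"architectures": ["modeling_sparsetral.MistralForCausalLM"]}
))
```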
I have another question, regarding Qwen1.5/Qwen1 in general.
When I try to load a Qwen1.5 or Qwen1 GGUF directly with
python -m aphrodite.endpoints.openai.api_server --model sakura0.9_13B_Qwen1.5_Q5KS_1.2.gguf -tp 1 --api-keys sk-example
I get:
Traceback (most recent call last):
File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/home/sora/.local/lib/python3.10/site-packages/aphrodite/endpoints/openai/api_server.py", line 563, in <module>
engine = AsyncAphrodite.from_engine_args(engine_args)
File "/home/sora/.local/lib/python3.10/site-packages/aphrodite/engine/async_aphrodite.py", line 670, in from_engine_args
engine_configs = engine_args.create_engine_configs()
File "/home/sora/.local/lib/python3.10/site-packages/aphrodite/engine/args_tools.py", line 318, in create_engine_configs
model_config = ModelConfig(
File "/home/sora/.local/lib/python3.10/site-packages/aphrodite/common/config.py", line 116, in __init__
self.hf_config = get_config(self.model, trust_remote_code, revision)
File "/home/sora/.local/lib/python3.10/site-packages/aphrodite/transformers_utils/config.py", line 86, in get_config
return extract_gguf_config(model)
File "/home/sora/.local/lib/python3.10/site-packages/aphrodite/transformers_utils/config.py", line 28, in extract_gguf_config
raise RuntimeError(f"Unsupported architecture {architecture}")
RuntimeError: Unsupported architecture qwen2
I guess I need to convert it to PTH before using it!
Works fine with the FP16 model. Can you link me to the gguf if it's public?
I can't offer much help with the coding, but I wondered whether this could be done in reverse:
https://github.com/ggerganov/llama.cpp/blob/master/convert-hf-to-gguf.py
This script converts HF to GGUF, but what if the process were reversed?
Anyway, thanks for the hard work.
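The reverse direction suggested above would largely be a rename table from llama.cpp's tensor names back to HF's. A minimal sketch for the llama family follows; the mappings reflect llama.cpp's naming conventions but are illustrative only, and a real converter would also have to dequantize the weights.

```python
import re

# Reverse of llama.cpp's llama-family rename: GGUF tensor name -> HF name.
# Illustrative sketch; dequantization of the tensor data is not handled.
GLOBAL = {
    "token_embd.weight": "model.embed_tokens.weight",
    "output_norm.weight": "model.norm.weight",
    "output.weight": "lm_head.weight",
}
PER_BLOCK = {
    "attn_q": "self_attn.q_proj",
    "attn_k": "self_attn.k_proj",
    "attn_v": "self_attn.v_proj",
    "attn_output": "self_attn.o_proj",
    "ffn_gate": "mlp.gate_proj",
    "ffn_up": "mlp.up_proj",
    "ffn_down": "mlp.down_proj",
    "attn_norm": "input_layernorm",
    "ffn_norm": "post_attention_layernorm",
}

def gguf_to_hf_name(name: str) -> str:
    if name in GLOBAL:
        return GLOBAL[name]
    # Per-layer tensors look like "blk.<layer>.<tensor>.weight" in GGUF.
    m = re.match(r"blk\.(\d+)\.(\w+)\.(weight|bias)", name)
    if m and m.group(2) in PER_BLOCK:
        return f"model.layers.{m.group(1)}.{PER_BLOCK[m.group(2)]}.{m.group(3)}"
    raise KeyError(f"No HF mapping for tensor {name!r}")

print(gguf_to_hf_name("blk.0.attn_q.weight"))
# prints: model.layers.0.self_attn.q_proj.weight
```

A qwen2 table would differ (e.g. attention biases), which is exactly the per-arch work described earlier in the thread.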