
Comments (3)

lbeurerkellner commented on June 12, 2024

Hi there, we have not tested LMQL with Qwen models yet, so this may be an issue with supporting its tokenizer. I will have to investigate a bit further.
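As a quick sanity check, the Qwen tokenizer can be loaded directly through transformers, independent of LMQL (a minimal sketch; Qwen ships a custom tokenizer implementation, so `trust_remote_code=True` is required):

```python
from transformers import AutoTokenizer

# Qwen bundles a custom tokenizer, so remote code must be trusted.
tok = AutoTokenizer.from_pretrained("Qwen/Qwen-72B-Chat", trust_remote_code=True)

# Round-trip a sample string to confirm basic encode/decode behavior.
ids = tok.encode("Hello, Qwen!")
print(ids)
print(tok.decode(ids))
```

If this round-trip works, the problem is more likely in how LMQL integrates the tokenizer than in the tokenizer itself.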


lbeurerkellner commented on June 12, 2024

You can fix the assertion by adding `if p2 == "*": break` after this line:

```python
value_follow.add_all(result_map)
```
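In context, the patch amounts to these two lines (a sketch; the surrounding loop over sub-patterns in LMQL's follow-map code is assumed, not shown here):

```python
value_follow.add_all(result_map)
if p2 == "*": break  # proposed workaround: stop once the wildcard sub-pattern is reached
```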

However, I could not get the model to do inference on my machine, since it never seems to finish a forward pass. Maybe you can try running with the change above, and report back with further results?


justairr commented on June 12, 2024

> You can fix the assertion by adding `if p2 == "*": break` after this line:
>
> ```python
> value_follow.add_all(result_map)
> ```
>
> However, I could not get the model to do inference on my machine, since it never seems to finish a forward pass. Maybe you can try running with the change above, and report back with further results?

Thank you for your reply! Adding the code at the location you mentioned still did not fix the error for me. However, adding the same code directly before the assertion that throws does let the program continue running. Now I get a new error:

```
[Loading Qwen/Qwen-72B-Chat with AutoModelForCausalLM.from_pretrained("Qwen/Qwen-72B-Chat", trust_remote_code=True)]
The model is automatically converting to bf16 for faster inference. If you want to disable the automatic precision, please manually add bf16/fp16/fp32=True to "AutoModelForCausalLM.from_pretrained".
Try importing flash-attention for faster inference...
Warning: import flash_attn rotary fail, please install FlashAttention rotary to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/rotary
Warning: import flash_attn rms_norm fail, please install FlashAttention layer_norm to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/layer_norm
Warning: import flash_attn fail, please install FlashAttention to get higher efficiency https://github.com/Dao-AILab/flash-attention
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 82/82 [00:11<00:00,  7.04it/s]
[Qwen/Qwen-72B-Chat ready on device cpu]
/home/name/miniconda3/envs/lmql/lib/python3.10/site-packages/transformers/generation/configuration_utils.py:394: UserWarning: `do_sample` is set to `False`. However, `top_p` is set to `0.8` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `top_p`.
  warnings.warn(
/home/name/miniconda3/envs/lmql/lib/python3.10/site-packages/transformers/generation/configuration_utils.py:404: UserWarning: `do_sample` is set to `False`. However, `top_k` is set to `0` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `top_k`.
  warnings.warn(
[Error during generate()] expected scalar type c10::BFloat16 but found double
Traceback (most recent call last):
  File "/home/name/user1/lmql/lmql_test.py", line 19, in <module>
    print(prompt())
  File "/home/name/miniconda3/envs/lmql/lib/python3.10/site-packages/lmql/api/queries.py", line 148, in lmql_query_wrapper
    return module.query(*args, **kwargs)
  File "/home/name/miniconda3/envs/lmql/lib/python3.10/site-packages/lmql/runtime/lmql_runtime.py", line 204, in __call__
    return call_sync(self, *args, **kwargs)
  File "/home/name/miniconda3/envs/lmql/lib/python3.10/site-packages/lmql/runtime/loop.py", line 37, in call_sync
    res = loop.run_until_complete(task)
  File "/home/name/miniconda3/envs/lmql/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/home/name/miniconda3/envs/lmql/lib/python3.10/site-packages/lmql/runtime/lmql_runtime.py", line 230, in __acall__
    results = await interpreter.run(self.fct, **query_kwargs)
  File "/home/name/miniconda3/envs/lmql/lib/python3.10/site-packages/lmql/runtime/tracing/tracer.py", line 240, in wrapper
    return await fct(*args, **kwargs)
  File "/home/name/miniconda3/envs/lmql/lib/python3.10/site-packages/lmql/runtime/interpreter.py", line 1070, in run
    async for _ in decoder_fct(prompt, **decoder_args):
  File "/home/name/miniconda3/envs/lmql/lib/python3.10/site-packages/lmql/runtime/dclib/decoders.py", line 21, in argmax
    h = h.extend(await model.argmax(h, noscore=True))
  File "/home/name/miniconda3/envs/lmql/lib/python3.10/site-packages/lmql/runtime/dclib/dclib_cache.py", line 277, in argmax
    return await arr.aelement_wise(op_argmax)
  File "/home/name/miniconda3/envs/lmql/lib/python3.10/site-packages/lmql/runtime/dclib/dclib_array.py", line 318, in aelement_wise
    result_items = await asyncio.gather(*[op_with_path(path, seqs, *args, **kwargs) for path, seqs in self.sequences.items()])
  File "/home/name/miniconda3/envs/lmql/lib/python3.10/site-packages/lmql/runtime/dclib/dclib_array.py", line 317, in op_with_path
    return path, await op(element, *args, **kwargs)
  File "/home/name/miniconda3/envs/lmql/lib/python3.10/site-packages/lmql/runtime/dclib/dclib_cache.py", line 256, in op_argmax
    non_cached_argmax = iter((await self.delegate.argmax(DataArray(non_cached), **kwargs)).items())                
  File "/home/name/miniconda3/envs/lmql/lib/python3.10/site-packages/lmql/models/lmtp/lmtp_dcmodel.py", line 307, in argmax
    return await self.sample(sequences, temperature=0.0, **kwargs)
  File "/home/name/miniconda3/envs/lmql/lib/python3.10/site-packages/lmql/models/lmtp/lmtp_dcmodel.py", line 350, in sample
    return await sequences.aelement_wise(op_sample)
  File "/home/name/miniconda3/envs/lmql/lib/python3.10/site-packages/lmql/runtime/dclib/dclib_array.py", line 318, in aelement_wise
    result_items = await asyncio.gather(*[op_with_path(path, seqs, *args, **kwargs) for path, seqs in self.sequences.items()])
  File "/home/name/miniconda3/envs/lmql/lib/python3.10/site-packages/lmql/runtime/dclib/dclib_array.py", line 317, in op_with_path
    return path, await op(element, *args, **kwargs)
  File "/home/name/miniconda3/envs/lmql/lib/python3.10/site-packages/lmql/models/lmtp/lmtp_dcmodel.py", line 340, in op_sample
    tokens = await asyncio.gather(*[self.stream_and_return_first(s, await self.generate(s, temperature=temperature, **kwargs), mode) for s,mode in zip(seqs, unique_sampling_mode)])
  File "/home/name/miniconda3/envs/lmql/lib/python3.10/site-packages/lmql/models/lmtp/lmtp_dcmodel.py", line 147, in stream_and_return_first
    buffer += [await anext(iterator)]
  File "/home/name/miniconda3/envs/lmql/lib/python3.10/site-packages/lmql/models/lmtp/lmtp_multiprocessing.py", line 188, in generate
    async for token in self.stream_iterator(self.stream_id):
  File "/home/name/miniconda3/envs/lmql/lib/python3.10/site-packages/lmql/models/lmtp/lmtp_multiprocessing.py", line 217, in stream_iterator
    raise LMTPStreamError(item["error"])
lmql.models.lmtp.errors.LMTPStreamError: failed to generate tokens 'expected scalar type c10::BFloat16 but found double'
Task was destroyed but it is pending!
task: <Task cancelling name='lmtp_inprocess_client_loop' coro=<LMTPDcModel.inprocess_client_loop() running at /home/name/miniconda3/envs/lmql/lib/python3.10/site-packages/lmql/models/lmtp/lmtp_dcmodel.py:76> wait_for=<Future finished result=True>>
```
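The new failure reads like a dtype mismatch: Qwen auto-converts its weights to bf16 (see the loading message above), while a float64 tensor apparently reaches the model during generation. Following the hint in Qwen's own loading message, one thing to try is pinning the precision explicitly at load time. A sketch, assuming the model is loaded standalone via transformers (whether and how LMQL forwards these kwargs to `from_pretrained` would need checking):

```python
from transformers import AutoModelForCausalLM

# Pin a single precision instead of letting Qwen auto-convert to bf16.
# bf16/fp16/fp32 are Qwen-specific loader kwargs named in its own warning message.
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-72B-Chat",
    trust_remote_code=True,
    fp32=True,  # or fp16=True / bf16=True, depending on hardware support
)
```

Note also that the log reports the model ready on device cpu; a 72B-parameter forward pass on CPU is extremely slow, which may explain why generation earlier appeared never to finish.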

