abacaj / mpt-30b-inference

Run inference on MPT-30B using CPU

License: MIT License
Good afternoon, and first of all thank you for your work; it is very interesting. I am a Java developer in production, so I don't have much expertise in Python. When I run the example from your YouTube video, I get an error at the end.
Also, if it is not too difficult, could you tell me how hard it would be to add REST communication (a POST endpoint) on top of this?
Below is the description of the error:
(mpt30_final) D:\Develop\NeuronNetwork\Mpt30\mpt_30B_inference>python inference.py
Fetching 1 files: 100%|██████████████████████████████████████████████████████████████████████████| 1/1 [00:00<?, ?it/s]
Traceback (most recent call last):
File "D:\Develop\NeuronNetwork\Mpt30\mpt_30B_inference\inference.py", line 49, in
llm = AutoModelForCausalLM.from_pretrained(
File "C:\Users\j0sch\miniconda3\envs\mpt30_final\lib\site-packages\ctransformers\hub.py", line 157, in from_pretrained
return LLM(
File "C:\Users\j0sch\miniconda3\envs\mpt30_final\lib\site-packages\ctransformers\llm.py", line 203, in init
if not Path(model_path).is_file():
File "C:\Users\j0sch\miniconda3\envs\mpt30_final\lib\pathlib.py", line 958, in new
self = cls._from_parts(args)
File "C:\Users\j0sch\miniconda3\envs\mpt30_final\lib\pathlib.py", line 592, in _from_parts
drv, root, parts = self._parse_args(args)
File "C:\Users\j0sch\miniconda3\envs\mpt30_final\lib\pathlib.py", line 576, in _parse_args
a = os.fspath(a)
TypeError: expected str, bytes or os.PathLike object, not NoneType
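For context, the TypeError at the end means ctransformers ended up with model_path=None, i.e. the downloaded GGML file was not where the script expected it. A minimal sanity-check sketch, assuming the repo's default download location under models/ (adjust the path to wherever the .bin file actually landed):

from pathlib import Path
from ctransformers import AutoModelForCausalLM

# Assumed location based on the repo's defaults; change if your file is elsewhere.
model_path = Path("models/mpt-30b-chat.ggmlv0.q4_1.bin")
if not model_path.is_file():
    raise FileNotFoundError(f"GGML model file not found at {model_path}")

llm = AutoModelForCausalLM.from_pretrained(str(model_path), model_type="mpt")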
@abacaj It would be really helpful to have the Dockerfile you've been using in your demo.
Why is it so slow on a Mac M1? It takes several minutes to generate one token.
How to implement continuous dialogue, like ChatGPT?
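As a rough sketch of continuous dialogue with this setup: keep the ChatML-formatted conversation in a string and append each turn before generating again. The <|im_start|>/<|im_end|> markers follow the MPT-30B-chat prompt format used in this repo; the repo id below is an assumption.

from ctransformers import AutoModelForCausalLM

llm = AutoModelForCausalLM.from_pretrained(
    "TheBloke/mpt-30B-chat-GGML", model_type="mpt"
)
history = "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"

while True:
    user_input = input("You: ")
    # Append the user turn and open an assistant turn, then generate.
    history += f"<|im_start|>user\n{user_input}<|im_end|>\n<|im_start|>assistant\n"
    reply = llm(history, max_new_tokens=512, stop=["<|im_end|>"])
    print("Assistant:", reply.strip())
    # Close the assistant turn so the next round carries the full context.
    history += f"{reply}<|im_end|>\n"

Note the whole history is re-fed on every turn, so long conversations will eventually hit the model's context length.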
Hi,
When trying to run inference, I got the following error message:
Downloading (…)feaf43e4/config.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.24k/1.24k [00:00<00:00, 8.64MB/s]
Fetching 1 files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 2.38it/s]
Traceback (most recent call last):
File "/home/matthieu/Deployment/mpt-30B-inference/inference.py", line 49, in <module>
llm = AutoModelForCausalLM.from_pretrained(
File "/home/matthieu/anaconda3/envs/mpt_30b_cpu/lib/python3.10/site-packages/ctransformers/hub.py", line 157, in from_pretrained
return LLM(
File "/home/matthieu/anaconda3/envs/mpt_30b_cpu/lib/python3.10/site-packages/ctransformers/llm.py", line 206, in __init__
self._lib = load_library(lib)
File "/home/matthieu/anaconda3/envs/mpt_30b_cpu/lib/python3.10/site-packages/ctransformers/llm.py", line 102, in load_library
lib = CDLL(path)
File "/home/matthieu/anaconda3/envs/mpt_30b_cpu/lib/python3.10/ctypes/__init__.py", line 374, in __init__
self._handle = _dlopen(self._name, mode)
OSError: /lib/x86_64-linux-gnu/libm.so.6: version `GLIBC_2.29' not found (required by /home/matthieu/anaconda3/envs/mpt_30b_cpu/lib/python3.10/site-packages/ctransformers/lib/avx2/libctransformers.so)
I use an AMD Ryzen Threadripper 3960X (24 cores / 48 threads) on Ubuntu 18.04 LTS.
Thanks for any help!
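One thing that may help, as a sketch: ctransformers accepts a lib argument that selects which prebuilt backend to load ("avx2", "avx", or "basic"). If only the avx2 build requires GLIBC 2.29, falling back to the plain AVX build might sidestep the error; that the other builds link against an older glibc is an untested assumption on Ubuntu 18.04.

from ctransformers import AutoModelForCausalLM

# "avx" selects the plain-AVX prebuilt library instead of avx2;
# "basic" is a further fallback if AVX also fails to load.
llm = AutoModelForCausalLM.from_pretrained(
    "TheBloke/mpt-30B-chat-GGML",
    model_type="mpt",
    lib="avx",
)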
Hello! Kudos for your nice work for the open-source LLM community. I'm learning a lot from your findings.
I've tried to run the latest commit on a GCP VM with 32 GB of RAM, and I've hit this error:
(env) (base) sergiomoreno@production:~/mpt-30B-inference$ python inference.py
/home/sergiomoreno/mpt-30B-inference/models/mpt-30b-chat.ggmlv0.q4_1.bin
Fetching 1 files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 7752.87it/s]
Traceback (most recent call last):
File "/home/sergiomoreno/mpt-30B-inference/inference.py", line 51, in <module>
llm = AutoModelForCausalLM.from_pretrained(
File "/home/sergiomoreno/mpt-30B-inference/env/lib/python3.10/site-packages/ctransformers/hub.py", line 157, in from_pretrained
return LLM(
File "/home/sergiomoreno/mpt-30B-inference/env/lib/python3.10/site-packages/ctransformers/llm.py", line 203, in __init__
if not Path(model_path).is_file():
File "/home/sergiomoreno/miniconda3/lib/python3.10/pathlib.py", line 960, in __new__
self = cls._from_parts(args)
File "/home/sergiomoreno/miniconda3/lib/python3.10/pathlib.py", line 594, in _from_parts
drv, root, parts = self._parse_args(args)
File "/home/sergiomoreno/miniconda3/lib/python3.10/pathlib.py", line 578, in _parse_args
a = os.fspath(a)
TypeError: expected str, bytes or os.PathLike object, not NoneType
After looking at the file structure, I saw that the expected folder didn't exist, and I had to replace the model path with this:
llm = AutoModelForCausalLM.from_pretrained(
    os.path.abspath("models/models--TheBloke--mpt-30B-chat-GGML/snapshots/60df632f84e8b99fa7aeadf314467152be55adf4/mpt-30b-chat.ggmlv0.q4_1.bin"),
    model_type="mpt",
)
Is it something expected or can we improve it somehow?
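A possible improvement, sketched under the assumption that the file was fetched through huggingface_hub: hf_hub_download returns the real cached path (including the snapshot hash), so the hash never needs to be hard-coded.

from huggingface_hub import hf_hub_download
from ctransformers import AutoModelForCausalLM

# Returns a path like .../models--TheBloke--mpt-30B-chat-GGML/snapshots/<hash>/mpt-30b-chat.ggmlv0.q4_1.bin
model_path = hf_hub_download(
    repo_id="TheBloke/mpt-30B-chat-GGML",
    filename="mpt-30b-chat.ggmlv0.q4_1.bin",
)
llm = AutoModelForCausalLM.from_pretrained(model_path, model_type="mpt")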
Thank you for such a great repo. I was wondering if we can use GPU inference for text generation?
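For what it's worth, a hedged sketch: newer ctransformers versions expose a gpu_layers argument for offloading layers to a CUDA GPU, but whether the GGML MPT backend honors it is version-dependent; if unsupported, generation simply stays on the CPU.

from ctransformers import AutoModelForCausalLM

# gpu_layers exists in ctransformers, but GGML MPT support for GPU offload
# is an assumption here and depends on the installed version; unsupported
# model types fall back to CPU.
llm = AutoModelForCausalLM.from_pretrained(
    "TheBloke/mpt-30B-chat-GGML",
    model_type="mpt",
    gpu_layers=50,
)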
I am trying to generate responses from the MPT-30B model and expose it as an API using Flask, but I am having trouble: the response is empty for every input I send. I am using a Standard F32s v2 (32 vCPUs, 64 GiB memory) with remote access to the server.
from flask import Flask, request, Response
from dataclasses import dataclass, asdict
from ctransformers import AutoModelForCausalLM, AutoConfig
import os
import time

app = Flask(__name__)


@dataclass
class GenerationConfig:
    temperature: float
    top_k: int
    top_p: float
    repetition_penalty: float
    max_new_tokens: int
    seed: int
    reset: bool
    stream: bool
    threads: int
    stop: list


def format_prompt(system_prompt: str, user_prompt: str):
    system_prompt = f"<|im_start|>system\n{system_prompt}<|im_end|>\n"
    user_prompt = f"<|im_start|>user\n{user_prompt}<|im_end|>\n"
    assistant_prompt = f"<|im_start|>assistant\n"
    return f"{system_prompt}{user_prompt}{assistant_prompt}"


def generate(
    llm: AutoModelForCausalLM,
    generation_config: GenerationConfig,
    system_prompt: str,
    user_input: str,
):
    model_output = llm(
        format_prompt(system_prompt, user_input.strip()),
        **asdict(generation_config),
    )
    print("Model output:", model_output)
    return model_output


@app.route('/generate', methods=['GET', 'POST'])
def generate_response_endpoint():
    if request.method == 'GET':
        user_input = request.args.get('user_input', '')  # input from query parameter
    elif request.method == 'POST':
        user_input = request.data.decode('utf-8')
    # Load the model and configuration
    print("Loading model...")
    # config = AutoConfig.from_pretrained("mosaicml/mpt-30b-chat", context_length=8192)
    llm = AutoModelForCausalLM.from_pretrained(
        "/home/azureuser/mpt-30B-inference/models/mpt-30b-chat.ggmlv0.q4_1.bin",
        model_type="mpt",
    )
    print("Model loaded")
    system_prompt = "Reply."
    generation_config = GenerationConfig(
        temperature=0.2,
        top_k=0,
        top_p=0.9,
        repetition_penalty=1.0,
        max_new_tokens=512,
        seed=42,
        reset=False,
        stream=False,
        threads=int(os.cpu_count() / 2),  # adjust for your CPU
        stop=["<|im_end|>", "|<"],
    )
    response = generate(llm, generation_config, system_prompt, user_input.strip())
    print(response)
    return Response(response, content_type='text/plain; charset=utf-8')


if __name__ == "__main__":
    app.run(host='0.0.0.0', port=3002)
import requests

while True:
    user_input = input("You: ")
    if user_input.lower() in ['exit', 'quit']:
        print("Exiting...")
        break
    data = {'user_input': user_input + " Respond to this."}
    response = requests.post('http://127.0.0.1:3002/generate', json=data)
    if response.status_code == 200:
        assistant_response = response.text
        # assistant_response = assistant_response.replace("You:", "").replace("Assistant:", "").strip()
        print("Assistant:", assistant_response)
    else:
        print("Error:", response.status_code)
But the response I am getting is empty, as can be seen below:
You: 3+4
Assistant:
You:
Any idea what may be causing this issue here and what can be done to resolve this?
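Two hedged guesses based on the code above: the server reads the raw request body (request.data) while the client sends json=data, so the model receives a JSON blob like {"user_input": "3+4..."} rather than plain text; and the model is re-loaded inside the handler on every request, which is very slow on its own. A client sketch that matches the server's raw-body read:

import requests

user_input = "3+4 Respond to this."
response = requests.post(
    'http://127.0.0.1:3002/generate',
    data=user_input.encode('utf-8'),  # raw text body, matching request.data on the server
    headers={'Content-Type': 'text/plain; charset=utf-8'},
)
print("Assistant:", response.text)

Moving the from_pretrained call to module level, so it runs once at startup instead of per request, would also be worth trying.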