Comments (8)
Same here with the VSCodium extension https://github.com/Venthe/vscode-fauxpilot, set up with the model py-model.
This gives:
fauxpilot-copilot_proxy-1 | [StatusCode.UNAVAILABLE] failed to connect to all addresses
fauxpilot-copilot_proxy-1 | WARNING: Model 'py-model' is not available. Please ensure that `model` is set to either 'fastertransformer' or 'py-model' depending on your installation
fauxpilot-copilot_proxy-1 | Returned completion in 2.1827220916748047 ms
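One way to narrow this down is to take the editor extension out of the loop and query the proxy directly. A minimal sketch, assuming the default port 5000 and the OpenAI-style completions route documented in the FauxPilot README (the payload fields mirror what the proxy's error message above expects; adjust host/port to your setup):

```python
# Minimal sketch: query the FauxPilot proxy directly, bypassing the
# editor extension, to see whether the backend itself is failing.
import requests

resp = requests.post(
    "http://localhost:5000/v1/engines/codegen/completions",
    json={
        "model": "py-model",       # or "fastertransformer"
        "prompt": "def hello():\n",
        "max_tokens": 16,
        "temperature": 0.1,
    },
    timeout=120,
)
print(resp.status_code, resp.json())
```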
Neither fastertransformer nor py-model works here either.
In my case, disabling the http proxy in the container resolves the problem!
Could you please give more context on how you did that, @becxer?
In my case, because the environment is restricted by a proxy for security reasons, I had set an internal proxy for building the Dockerfiles, like:
ENV http_proxy xxx.xx.xx.xx
But I forgot to remove this after the pip package installation. So I reset those ENV lines after the `pip install` step, in both proxy.Dockerfile and triton.Dockerfile:
ENV http_proxy ""
And it works.
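If you're debugging something similar: gRPC and most HTTP clients tend to honor the standard proxy environment variables, so a stray http_proxy baked into the image can plausibly produce exactly the "failed to connect to all addresses" error above. A quick hedged check, run inside the copilot_proxy container (e.g. via docker exec; the variable list is just the usual suspects):

```python
# Sanity check: leftover proxy variables baked in at build time will
# show up here and can interfere with the proxy -> Triton connection
# inside the compose network.
import os

for var in ("http_proxy", "https_proxy", "grpc_proxy", "no_proxy",
            "HTTP_PROXY", "HTTPS_PROXY", "NO_PROXY"):
    print(f"{var}={os.environ.get(var)!r}")
```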
OK, if I understand correctly, in your case it was a non-standard configuration. That's not my case; I installed FauxPilot straight away, with no custom proxy setup.
same problem
Same issue on WSL2.
fastertransformer works fine, but it can only run the 350M model, whereas I could run the 2B one with the Python backend.
$ uname -a
Linux puter 5.15.90.1-microsoft-standard-WSL2 #1 SMP Fri Jan 27 02:56:13 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
$ python3 --version
Python 3.10.6
$ docker run --rm --runtime=nvidia --gpus all nvidia/cuda:11.6.2-base-ubuntu20.04 nvidia-smi
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.50                 Driver Version: 531.79       CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf           Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce GTX 1070 Ti     On  | 00000000:01:00.0  On |                  N/A |
| 29%   61C    P5               24W / 180W|    946MiB /  8192MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A       333      G   /Xwayland                                   N/A    |
|    0   N/A  N/A      3198      G   /kitty                                      N/A    |
+---------------------------------------------------------------------------------------+
edit: so okay, apparently it does do a download, but it gives you no feedback about it whatsoever. You can see it by answering yes to the cache question and running `watch du -lh` on the cache directory, waiting until the size stops increasing and the tmp file has been extracted. The launch script should also end with a bunch of "started" logs.
However, the issue still persisted. So I made sure everything was stopped with `docker compose down -v`, followed by `docker system prune -a`, and reran `setup.sh` to force a redownload and rebuild of everything, just in case something was being cached somewhere it shouldn't be.
The issue still persists, though, so I'm pretty sure it just doesn't work in its current state.
edit2 ("fixed", very big quotes):
- okay so apparently if you are not specifically using the fauxpilot vscode extension and explicitely setting it to use py-model it will just default to fastertransformer
so thats one isssue
you can get around it by editing copilot_proxy/utils/codegen.py
and adding data["model"] = "py-model"
right under the def generate
function (currently line 75 on main branch)
that takes care of one issue i was having
- the default timeout configured seems to be 30 seconds
on my gtx 1070ti the 350M fastermodel answers in like less than 1 second and the py-model answers in like 20 seconds for the exact same completion
so when i was trying the 2B (thinking that i could reliably run it since it only consumes 4GB of vram) model it would pretty much timeout every time silently and return 0 completions (even though the call shows it "succeeds" a little bit after 30 seconds)
so sure you can increase the the timeout probably but like seing how the py-model is "slower" by something like a factor of 20 (so yeah that certainly is slower lmao) probably makes it not worth it to even try to be honest and i havent bothered looking how to increase this timeout but i guess maybe for specific use cases there could be a point in using less vram to have this response time
maybe
for at least somebody out there
probably
🤷
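For reference, the edit from the first bullet would look roughly like this. A sketch only: the `data["model"]` assignment is the actual workaround from above, while the function signature and surrounding body are assumptions standing in for FauxPilot's real code:

```python
# copilot_proxy/utils/codegen.py -- sketch of the workaround above.
# Only the data["model"] line is the actual change; the signature and
# the rest of the body stand in for the existing code.
def generate(self, data):
    # Force the Python backend: clients that don't know about the
    # py-model setting send "fastertransformer" (or nothing) by default.
    data["model"] = "py-model"
    # ... rest of the original generate() body, unchanged ...
```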
anyways, tldr:
- mount the cache and watch its path with `watch du -h` to see when it's actually done downloading everything; or don't, and just wait until you get the "started GRPCInferenceService", "started HTTPService", and "started Metrics Service" log lines once it's done
- edit copilot_proxy/utils/codegen.py and add `data["model"] = "py-model"` right under the `def generate` line (currently line 75 on the main branch) if you're not using the specific VSCode FauxPilot extension with the py-model option configured
- there also seems to be some sort of timeout mechanism that I can't be bothered to look into, tbh (given that running the models on the Python backend is at the very least ~20 times slower; if you're lucky, more like 1000× in practice)

Also, even after doing all of this, I can only get it to actually give me suggestions in the FauxPilot VSCode extension, and that's without changing the fastertransformer setting to py-model. So in theory, unless the answer format is completely different with py-model for whatever reason, this should be working. Maybe I'm missing something, but at least it's not issue 1, 2, or 3 mentioned above (I think).
I couldn't get it to work with nvim Copilot, at least.
Now I'm getting a bunch of
fauxpilot-triton-1 | The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
fauxpilot-triton-1 | Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
spammed all over, but I really don't think I will be putting any more effort into getting the Python models running, so I leave the rest to someone else.
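For what it's worth, those two lines are standard Hugging Face transformers warnings and are usually harmless for open-ended generation. The conventional fix is to pass the attention mask and a pad token explicitly to generate(); a minimal sketch using a generic GPT-2 checkpoint (an assumption here, not FauxPilot's actual backend code):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative model; the py-model backend uses its own checkpoint.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("def hello():", return_tensors="pt")
output = model.generate(
    inputs["input_ids"],
    attention_mask=inputs["attention_mask"],  # addresses the first warning
    pad_token_id=tokenizer.eos_token_id,      # addresses the second warning
    max_new_tokens=16,
)
print(tokenizer.decode(output[0]))
```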
last edit: also, I was being extremely generous with the factor of 20. After checking again, fastertransformer answers in ~50 ms, and calls with the exact same prompt on py-model take ~50 seconds, so that's a factor of 1000. (I'm not even sure why it's included in the setup script, as it will lead people to believe it's an "okay" tradeoff, since you consume half the VRAM for the same number of parameters as the fastertransformer model; but anyhoo, that's not my place to say, and surely this will be useful to at least a single person, maybe, probably.)
Maybe just make it clearer in the setup script that by "slower" you mean by a factor of ~1000.
So, actual tldr: don't bother trying to run py-model, unless you have infinite time.
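If anyone wants to reproduce that comparison, a rough sketch against the proxy, assuming the same default port 5000 and OpenAI-style route as in the earlier sketch (timings will obviously vary with hardware and model size):

```python
# Rough wall-clock comparison of the two backends through the proxy.
# Not a benchmark, just enough to see the ~ms vs ~tens-of-seconds gap
# described above.
import time

import requests

def time_backend(model: str) -> float:
    t0 = time.time()
    requests.post(
        "http://localhost:5000/v1/engines/codegen/completions",
        json={"model": model, "prompt": "def fib(n):\n", "max_tokens": 16},
        timeout=300,
    )
    return time.time() - t0

for m in ("fastertransformer", "py-model"):
    print(f"{m}: {time_backend(m):.2f}s")
```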