moyix / fauxpilot
FauxPilot - an open-source alternative to GitHub Copilot server
License: MIT License
I get the following error when I run launch.sh; how do I fix it? Thanks!
fauxpilot-triton-1 | [FT][WARNING] Custom All Reduce only supports 8 Ranks currently. Using NCCL as Comm.
fauxpilot-triton-1 | I0811 02:59:27.879335 94 libfastertransformer.cc:307] Before Loading Model:
fauxpilot-triton-1 | terminate called after throwing an instance of 'std::runtime_error'
fauxpilot-triton-1 | what(): [FT][ERROR] CUDA runtime error: the provided PTX was compiled with an unsupported toolchain. /workspace/build/fastertransformer_backend/build/_deps/repo-ft-src/src/fastertransformer/utils/cuda_utils.h:393
Hi there,
Amazing job on fauxpilot! Thank you. I just wanted to make you aware of this error I got when I tried to run the docker containers:
~/dev/fauxpilot$ docker compose up
service "triton" refers to undefined volume y/codegen-350M-multi-1gpu: invalid compose project
I believe the issue is a missing dot in front of the local path, as shown below. It seems Docker understands y/codegen-350M-multi-1gpu as a volume identifier instead of a local path. I am using WSL2 on Windows with Docker 20.10.12.
Not sure if the change should be in docker-compose.yaml or launch.sh.
diff --git a/docker-compose.yaml b/docker-compose.yaml
index 7a0745b..0e37b03 100644
--- a/docker-compose.yaml
+++ b/docker-compose.yaml
@@ -5,7 +5,7 @@ services:
command: bash -c "CUDA_VISIBLE_DEVICES=${GPUS} mpirun -n 1 --allow-run-as-root /opt/tritonserver/bin/tritonserver --model-repository=/model"
shm_size: '2gb'
volumes:
- - ${MODEL_DIR}:/model
+ - ./${MODEL_DIR}:/model
ports:
- "8000:8000"
- "8001:8001"
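For background on the error message: Compose decides between a bind mount and a named volume from the path's first characters, so a value of ${MODEL_DIR} that expands without a leading ./ or / is parsed as a named volume name, and since no such volume is declared, Compose reports "refers to undefined volume". A minimal illustration (hypothetical service, not from the repo):

```yaml
services:
  demo:
    image: busybox
    volumes:
      - ./models:/model    # starts with "./" -> bind mount of a local path
      - models:/model      # bare name -> named volume; must be declared below
volumes:
  models: {}
```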
In fauxpilot, can we get full code-block suggestions like GitHub Copilot? Right now, even when I write a quicksort method, I need to press Tab 100 times.
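One knob that affects this, assuming the proxy honors OpenAI-style completion parameters (the working examples later on this page suggest it does): request more tokens per completion and stop at a blank-line boundary rather than accepting one line at a time. A sketch of such a request body (the helper name and all values are illustrative):

```python
# Sketch: build an OpenAI-compatible completion body that asks for a whole
# block instead of a single line.
def completion_payload(prompt, max_tokens=256, stop=("\n\n\n",)):
    return {
        "prompt": prompt,
        "max_tokens": max_tokens,   # raise this so whole functions fit
        "temperature": 0.1,
        "stop": list(stop),         # stop on a blank line, not end-of-line
    }

payload = completion_payload("def quicksort(arr):")
```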
Hi! Huge fan of project, really clean implementation, something I'd love to explore.
I'm also working with the CodeGen model set for some projects. Is your Hugging Face model set the same base model (same trained weights) that you've converted with FasterTransformer for optimized serving, or is this a different model set?
Thanks so much in advance
Hi, I am trying to run the tritonserver and flask proxy in the same container and found that the performance is bad.
10.110.2.179 - - [10/Sep/2022 08:46:43] "POST /v1/engines/codegen/completions HTTP/1.1" 200 -
Returned completion in 6501.662492752075 ms
10.110.2.179 - - [10/Sep/2022 08:46:45] "POST /v1/engines/codegen/completions HTTP/1.1" 200 -
Returned completion in 6626.265287399292 ms
10.110.2.179 - - [10/Sep/2022 08:46:45] "POST /v1/engines/codegen/completions HTTP/1.1" 200 -
Returned completion in 6687.429904937744 ms
10.110.2.179 - - [10/Sep/2022 08:46:46] "POST /v1/engines/codegen/completions HTTP/1.1" 200 -
Returned completion in 6587.698698043823 ms
10.110.2.179 - - [10/Sep/2022 08:46:46] "POST /v1/engines/codegen/completions HTTP/1.1" 200 -
Returned completion in 5684.9658489227295 ms
10.110.2.179 - - [10/Sep/2022 08:46:46] "POST /v1/engines/codegen/completions HTTP/1.1" 200 -
Returned completion in 5248.137474060059 ms
10.110.2.179 - - [10/Sep/2022 08:46:46] "POST /v1/engines/codegen/completions HTTP/1.1" 200 -
Returned completion in 4866.7027950286865 ms
10.110.2.179 - - [10/Sep/2022 08:46:46] "POST /v1/engines/codegen/completions HTTP/1.1" 200 -
Returned completion in 4606.655836105347 ms
10.110.2.179 - - [10/Sep/2022 08:46:46] "POST /v1/engines/codegen/completions HTTP/1.1" 200 -
Every request needs 4 or more seconds, which is not acceptable. But I am using a V100 GPU, which I think is good enough. I hope someone can help me figure out the reason.
This is the config.env:
MODEL=codegen-350M-multi
NUM_GPUS=1
MODEL_DIR=...
This is the nvidia-smi info:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.86 Driver Version: 470.86 CUDA Version: 11.7 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-PCIE... Off | 00000000:06:00.0 Off | 0 |
| N/A 32C P0 36W / 250W | 1649MiB / 32510MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
And here is the log when start the tritonserver:
I0910 08:43:31.338765 258862 pinned_memory_manager.cc:240] Pinned memory pool is created at '0x7f5498000000' with size 268435456
I0910 08:43:31.339558 258862 cuda_memory_manager.cc:105] CUDA memory pool is created on device 0 with size 67108864
I0910 08:43:31.346725 258862 model_repository_manager.cc:1191] loading: fastertransformer:1
I0910 08:43:31.567535 258862 libfastertransformer.cc:1226] TRITONBACKEND_Initialize: fastertransformer
I0910 08:43:31.567573 258862 libfastertransformer.cc:1236] Triton TRITONBACKEND API version: 1.10
I0910 08:43:31.567581 258862 libfastertransformer.cc:1242] 'fastertransformer' TRITONBACKEND API version: 1.10
I0910 08:43:31.567638 258862 libfastertransformer.cc:1274] TRITONBACKEND_ModelInitialize: fastertransformer (version 1)
W0910 08:43:31.569452 258862 libfastertransformer.cc:149] model configuration:
{
"name": "fastertransformer",
"platform": "",
"backend": "fastertransformer",
"version_policy": {
"latest": {
"num_versions": 1
}
},
"max_batch_size": 1024,
"input": [
{
"name": "input_ids",
"data_type": "TYPE_UINT32",
"format": "FORMAT_NONE",
"dims": [
-1
],
"is_shape_tensor": false,
"allow_ragged_batch": false,
"optional": false
},
{
"name": "start_id",
"data_type": "TYPE_UINT32",
"format": "FORMAT_NONE",
"dims": [
1
],
"reshape": {
"shape": []
},
"is_shape_tensor": false,
"allow_ragged_batch": false,
"optional": true
},
{
"name": "end_id",
"data_type": "TYPE_UINT32",
"format": "FORMAT_NONE",
"dims": [
1
],
"reshape": {
"shape": []
},
"is_shape_tensor": false,
"allow_ragged_batch": false,
"optional": true
},
{
"name": "input_lengths",
"data_type": "TYPE_UINT32",
"format": "FORMAT_NONE",
"dims": [
1
],
"reshape": {
"shape": []
},
"is_shape_tensor": false,
"allow_ragged_batch": false,
"optional": false
},
{
"name": "request_output_len",
"data_type": "TYPE_UINT32",
"format": "FORMAT_NONE",
"dims": [
-1
],
"is_shape_tensor": false,
"allow_ragged_batch": false,
"optional": false
},
{
"name": "runtime_top_k",
"data_type": "TYPE_UINT32",
"format": "FORMAT_NONE",
"dims": [
1
],
"reshape": {
"shape": []
},
"is_shape_tensor": false,
"allow_ragged_batch": false,
"optional": true
},
{
"name": "runtime_top_p",
"data_type": "TYPE_FP32",
"format": "FORMAT_NONE",
"dims": [
1
],
"reshape": {
"shape": []
},
"is_shape_tensor": false,
"allow_ragged_batch": false,
"optional": true
},
{
"name": "beam_search_diversity_rate",
"data_type": "TYPE_FP32",
"format": "FORMAT_NONE",
"dims": [
1
],
"reshape": {
"shape": []
},
"is_shape_tensor": false,
"allow_ragged_batch": false,
"optional": true
},
{
"name": "temperature",
"data_type": "TYPE_FP32",
"format": "FORMAT_NONE",
"dims": [
1
],
"reshape": {
"shape": []
},
"is_shape_tensor": false,
"allow_ragged_batch": false,
"optional": true
},
{
"name": "len_penalty",
"data_type": "TYPE_FP32",
"format": "FORMAT_NONE",
"dims": [
1
],
"reshape": {
"shape": []
},
"is_shape_tensor": false,
"allow_ragged_batch": false,
"optional": true
},
{
"name": "repetition_penalty",
"data_type": "TYPE_FP32",
"format": "FORMAT_NONE",
"dims": [
1
],
"reshape": {
"shape": []
},
"is_shape_tensor": false,
"allow_ragged_batch": false,
"optional": true
},
{
"name": "random_seed",
"data_type": "TYPE_INT32",
"format": "FORMAT_NONE",
"dims": [
1
],
"reshape": {
"shape": []
},
"is_shape_tensor": false,
"allow_ragged_batch": false,
"optional": true
},
{
"name": "is_return_log_probs",
"data_type": "TYPE_BOOL",
"format": "FORMAT_NONE",
"dims": [
1
],
"reshape": {
"shape": []
},
"is_shape_tensor": false,
"allow_ragged_batch": false,
"optional": true
},
{
"name": "beam_width",
"data_type": "TYPE_UINT32",
"format": "FORMAT_NONE",
"dims": [
1
],
"reshape": {
"shape": []
},
"is_shape_tensor": false,
"allow_ragged_batch": false,
"optional": true
},
{
"name": "bad_words_list",
"data_type": "TYPE_INT32",
"format": "FORMAT_NONE",
"dims": [
2,
-1
],
"is_shape_tensor": false,
"allow_ragged_batch": false,
"optional": true
},
{
"name": "stop_words_list",
"data_type": "TYPE_INT32",
"format": "FORMAT_NONE",
"dims": [
2,
-1
],
"is_shape_tensor": false,
"allow_ragged_batch": false,
"optional": true
}
],
"output": [
{
"name": "output_ids",
"data_type": "TYPE_UINT32",
"dims": [
-1,
-1
],
"label_filename": "",
"is_shape_tensor": false
},
{
"name": "sequence_length",
"data_type": "TYPE_UINT32",
"dims": [
-1
],
"label_filename": "",
"is_shape_tensor": false
},
{
"name": "cum_log_probs",
"data_type": "TYPE_FP32",
"dims": [
-1
],
"label_filename": "",
"is_shape_tensor": false
},
{
"name": "output_log_probs",
"data_type": "TYPE_FP32",
"dims": [
-1,
-1
],
"label_filename": "",
"is_shape_tensor": false
}
],
"batch_input": [],
"batch_output": [],
"optimization": {
"priority": "PRIORITY_DEFAULT",
"input_pinned_memory": {
"enable": true
},
"output_pinned_memory": {
"enable": true
},
"gather_kernel_buffer_threshold": 0,
"eager_batching": false
},
"instance_group": [
{
"name": "fastertransformer_0",
"kind": "KIND_CPU",
"count": 1,
"gpus": [],
"secondary_devices": [],
"profile": [],
"passive": false,
"host_policy": ""
}
],
"default_model_filename": "codegen-350M-multi",
"cc_model_filenames": {},
"metric_tags": {},
"parameters": {
"start_id": {
"string_value": "50256"
},
"model_name": {
"string_value": "codegen-350M-multi"
},
"is_half": {
"string_value": "1"
},
"enable_custom_all_reduce": {
"string_value": "0"
},
"vocab_size": {
"string_value": "51200"
},
"tensor_para_size": {
"string_value": "1"
},
"decoder_layers": {
"string_value": "20"
},
"size_per_head": {
"string_value": "64"
},
"max_seq_len": {
"string_value": "2048"
},
"end_id": {
"string_value": "50256"
},
"inter_size": {
"string_value": "4096"
},
"head_num": {
"string_value": "16"
},
"model_type": {
"string_value": "GPT-J"
},
"model_checkpoint_path": {
"string_value": "/model/fastertransformer/1/1-gpu"
},
"rotary_embedding": {
"string_value": "32"
},
"pipeline_para_size": {
"string_value": "1"
}
},
"model_warmup": []
}
I0910 08:43:31.569890 258862 libfastertransformer.cc:1320] TRITONBACKEND_ModelInstanceInitialize: fastertransformer_0 (device 0)
W0910 08:43:31.569915 258862 libfastertransformer.cc:453] Faster transformer model instance is created at GPU '0'
W0910 08:43:31.569922 258862 libfastertransformer.cc:459] Model name codegen-350M-multi
W0910 08:43:31.569940 258862 libfastertransformer.cc:578] Get input name: input_ids, type: TYPE_UINT32, shape: [-1]
W0910 08:43:31.569948 258862 libfastertransformer.cc:578] Get input name: start_id, type: TYPE_UINT32, shape: [1]
W0910 08:43:31.569954 258862 libfastertransformer.cc:578] Get input name: end_id, type: TYPE_UINT32, shape: [1]
W0910 08:43:31.569960 258862 libfastertransformer.cc:578] Get input name: input_lengths, type: TYPE_UINT32, shape: [1]
W0910 08:43:31.569966 258862 libfastertransformer.cc:578] Get input name: request_output_len, type: TYPE_UINT32, shape: [-1]
W0910 08:43:31.569972 258862 libfastertransformer.cc:578] Get input name: runtime_top_k, type: TYPE_UINT32, shape: [1]
W0910 08:43:31.569978 258862 libfastertransformer.cc:578] Get input name: runtime_top_p, type: TYPE_FP32, shape: [1]
W0910 08:43:31.569984 258862 libfastertransformer.cc:578] Get input name: beam_search_diversity_rate, type: TYPE_FP32, shape: [1]
W0910 08:43:31.569990 258862 libfastertransformer.cc:578] Get input name: temperature, type: TYPE_FP32, shape: [1]
W0910 08:43:31.569995 258862 libfastertransformer.cc:578] Get input name: len_penalty, type: TYPE_FP32, shape: [1]
W0910 08:43:31.570001 258862 libfastertransformer.cc:578] Get input name: repetition_penalty, type: TYPE_FP32, shape: [1]
W0910 08:43:31.570006 258862 libfastertransformer.cc:578] Get input name: random_seed, type: TYPE_INT32, shape: [1]
W0910 08:43:31.570012 258862 libfastertransformer.cc:578] Get input name: is_return_log_probs, type: TYPE_BOOL, shape: [1]
W0910 08:43:31.570018 258862 libfastertransformer.cc:578] Get input name: beam_width, type: TYPE_UINT32, shape: [1]
W0910 08:43:31.570024 258862 libfastertransformer.cc:578] Get input name: bad_words_list, type: TYPE_INT32, shape: [2, -1]
W0910 08:43:31.570034 258862 libfastertransformer.cc:578] Get input name: stop_words_list, type: TYPE_INT32, shape: [2, -1]
W0910 08:43:31.570046 258862 libfastertransformer.cc:620] Get output name: output_ids, type: TYPE_UINT32, shape: [-1, -1]
W0910 08:43:31.570053 258862 libfastertransformer.cc:620] Get output name: sequence_length, type: TYPE_UINT32, shape: [-1]
W0910 08:43:31.570059 258862 libfastertransformer.cc:620] Get output name: cum_log_probs, type: TYPE_FP32, shape: [-1]
W0910 08:43:31.570065 258862 libfastertransformer.cc:620] Get output name: output_log_probs, type: TYPE_FP32, shape: [-1, -1]
[FT][WARNING] Custom All Reduce only supports 8 Ranks currently. Using NCCL as Comm.
I0910 08:43:31.853587 258862 libfastertransformer.cc:307] Before Loading Model:
after allocation, free 31.19 GB total 31.75 GB
[WARNING] gemm_config.in is not found; using default GEMM algo
I0910 08:43:34.250280 258862 libfastertransformer.cc:321] After Loading Model:
after allocation, free 30.33 GB total 31.75 GB
I0910 08:43:34.251457 258862 libfastertransformer.cc:537] Model instance is created on GPU Tesla V100-PCIE-32GB
I0910 08:43:34.252189 258862 model_repository_manager.cc:1345] successfully loaded 'fastertransformer' version 1
I0910 08:43:34.252418 258862 server.cc:556]
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+
I0910 08:43:34.252533 258862 server.cc:583]
+-------------------+--------------------------------------------+--------------------------------------------+
| Backend | Path | Config |
+-------------------+--------------------------------------------+--------------------------------------------+
| fastertransformer | /opt/tritonserver/backends/fastertransform | {"cmdline":{"auto-complete-config":"false" |
| | er/libtriton_fastertransformer.so | ,"min-compute-capability":"6.000000","back |
| | | end-directory":"/opt/tritonserver/backends |
| | | ","default-max-batch-size":"4"}} |
| | | |
+-------------------+--------------------------------------------+--------------------------------------------+
I0910 08:43:34.252618 258862 server.cc:626]
+-------------------+---------+--------+
| Model | Version | Status |
+-------------------+---------+--------+
| fastertransformer | 1 | READY |
+-------------------+---------+--------+
I0910 08:43:34.298144 258862 metrics.cc:650] Collecting metrics for GPU 0: Tesla V100-PCIE-32GB
I0910 08:43:34.298580 258862 tritonserver.cc:2159]
+----------------------------------+----------------------------------------------------------------------------+
| Option | Value |
+----------------------------------+----------------------------------------------------------------------------+
| server_id | triton |
| server_version | 2.23.0 |
| server_extensions | classification sequence model_repository model_repository(unload_dependent |
| | s) schedule_policy model_configuration system_shared_memory cuda_shared_me |
| | mory binary_tensor_data statistics trace |
| model_repository_path[0] | /openbayes/home/fauxpilot/models/codegen-350M-multi-1gpu |
| model_control_mode | MODE_NONE |
| strict_model_config | 1 |
| rate_limit | OFF |
| pinned_memory_pool_byte_size | 268435456 |
| cuda_memory_pool_byte_size{0} | 67108864 |
| response_cache_byte_size | 0 |
| min_supported_compute_capability | 6.0 |
| strict_readiness | 1 |
| exit_timeout | 30 |
+----------------------------------+----------------------------------------------------------------------------+
I0910 08:43:34.329038 258862 grpc_server.cc:4587] Started GRPCInferenceService at 0.0.0.0:8001
I0910 08:43:34.329462 258862 http_server.cc:3303] Started HTTPService at 0.0.0.0:8000
I0910 08:43:34.370706 258862 http_server.cc:178] Started Metrics Service at 0.0.0.0:8002
Hi, I tried using the VSCode plugin with the supplied configuration, but the plugin throws the following error:
[INFO] [auth] [2022-08-03T06:47:46.784Z] Invalid copilot token: missing token: 403
[ERROR] [default] [2022-08-03T06:47:46.787Z] GitHub Copilot could not connect to server. Extension activation failed: "User not authorized"
Do we need a GitHub Copilot subscription to get a working token?
ERROR: Version in "./docker-compose.yaml" is unsupported.
You might be seeing this error because you're using the wrong Compose file version.
Either specify a version of "2" (or "2.0") and place your service definitions under the services key, or omit the version key and place your service definitions at the root of the file to use version 1.
For more on the Compose file format versions, see https://docs.docker.com/compose/compose-file/
version: '3.3'
services:
triton:
image: moyix/triton_with_ft:22.09
Ubuntu 16.04 LTS
Server: Docker Engine - Community
Engine:
Version: 20.10.7
API version: 1.41 (minimum version 1.12)
Go version: go1.13.15
Git commit: b0f5bc3
Built: Wed Jun 2 11:54:58 2021
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: 1.4.6
GitCommit: d71fcd7d8303cbf684402823e425e9dd2e99285d
runc:
Version: 1.0.0-rc95
GitCommit: b9ee9c6314599f1b4a7f497e1f1f856fe433d3b7
docker-init:
Version: 0.19.0
GitCommit: de40ad0
docker-compose version 1.8.0, build unknown
docker-py version: 1.9.0
CPython version: 2.7.12
OpenSSL version: OpenSSL 1.0.2g 1 Mar 2016
Hi all,
My host is Windows 10 with an NVIDIA 3090 with 24 GB VRAM. I cannot start the triton container; it fails with the error message "no CUDA-capable device is detected". Do you know why? I can detect CUDA with PyTorch on the host.
NVIDIA Release 22.06 (build 39726160)
Triton Server Version 2.23.0
Copyright (c) 2018-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
W0903 01:19:39.717609 88 pinned_memory_manager.cc:236] Unable to allocate pinned system memory, pinned memory pool will not be available: no CUDA-capable device is detected
I0903 01:19:39.717744 88 cuda_memory_manager.cc:115] CUDA memory pool disabled
E0903 01:19:39.753366 88 model_repository_manager.cc:2063] Poll failed for model directory 'codegen-16B-mono-1gpu': failed to open text file for read /model/codegen-16B-mono-1gpu/config.pbtxt: No such file or directory
E0903 01:19:39.773178 88 model_repository_manager.cc:2063] Poll failed for model directory 'codegen-16B-mono-2gpu': failed to open text file for read /model/codegen-16B-mono-2gpu/config.pbtxt: No such file or directory
E0903 01:19:39.795737 88 model_repository_manager.cc:2063] Poll failed for model directory 'codegen-16B-multi-1gpu': failed to open text file for read /model/codegen-16B-multi-1gpu/config.pbtxt: No such file or directory
E0903 01:19:39.815933 88 model_repository_manager.cc:2063] Poll failed for model directory 'codegen-16B-multi-2gpu': failed to open text file for read /model/codegen-16B-multi-2gpu/config.pbtxt: No such file or directory
E0903 01:19:39.836123 88 model_repository_manager.cc:2063] Poll failed for model directory 'codegen-16B-nl-1gpu': failed to open text file for read /model/codegen-16B-nl-1gpu/config.pbtxt: No such file or directory
E0903 01:19:39.857360 88 model_repository_manager.cc:2063] Poll failed for model directory 'codegen-16B-nl-2gpu': failed to open text file for read /model/codegen-16B-nl-2gpu/config.pbtxt: No such file or directory
E0903 01:19:39.881019 88 model_repository_manager.cc:2063] Poll failed for model directory 'codegen-2B-mono-1gpu': failed to open text file for read /model/codegen-2B-mono-1gpu/config.pbtxt: No such file or directory
People should be aware of the research and tools at https://github.com/adapter-hub/adapter-transformers . Adapters place small bottleneck layers between model layers, freeze the pretrained weights, and train only the bottlenecks, which can then be composed into specific skillsets. This would be good for personal coding styles or changes like refactoring, commenting, or bugfixing.
Is there a way to do such a thing with CPU instead of GPU? I know this would be slower, but it would be a cheaper solution and would not depend on NVIDIA.
It seems that my company is hijacking my SSL traffic. But I host FauxPilot in my LAN; is there any way to bypass this problem in VSCode? I set the copilot config in setting.json to localhost and to another machine's local IP address; neither works.
References:
https://stackoverflow.com/questions/71367058/self-signed-certificate-in-certificate-chain-on-github-copilot
community/community#6785
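For reference, the settings.json block FauxPilot's README suggests for the VSCode Copilot plugin looks roughly like the following (assuming current plugin versions still honor these debug overrides; replace localhost with the LAN host's IP when the server is on another machine):

```json
"github.copilot.advanced": {
    "debug.overrideEngine": "codegen",
    "debug.testOverrideProxyUrl": "http://localhost:5000",
    "debug.overrideProxyUrl": "http://localhost:5000"
}
```

Note that the plugin's authentication traffic to GitHub still goes over the public internet, which may be where the certificate error actually arises.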
Hey,
installation went fine, no problems whatsoever. However, when trying to run launch.sh, I get the following error:
➜ ./launch.sh
[+] Running 1/0
⠿ Container fauxpilot-copilot_proxy-1 Running
Attaching to fauxpilot-copilot_proxy-1, fauxpilot-triton-1
Error response from daemon: could not select device driver "nvidia" with capabilities: [[gpu]]
I have all dependencies installed:
Running on Arch Linux, kernel: 5.15.55-1-lts
Thanks for the great repo! Just a couple of issues:
Hi,
I'm using the JetBrains Copilot plugin, which is not configurable. I've tried setting api.openai.com in my hosts file, but the server isn't being hit.
Is it a different hostname, such as copilot.github.com or something?
Thanks!
Can the curl command return the OpenAI API execution result?
When I executed the OpenAI API as follows using the curl command, I did not receive a successful execution result. What part is wrong? Any hints or clues are welcome. :)
dataman@culint02:~$ cat ./openai-apitest.py
#!/usr/bin/env python3
import openai
openai.api_key = 'dummy'
openai.api_base = 'http://localhost:5000/v1'
result = openai.Completion.create(engine='codegen', prompt='def bye', max_tokens=16, temperature=0.1, stop=["\n\n"])
print ("################ result #################")
print (result)
$ python3 ./openai-apitest.py
################ result #################
{
"choices": [
{
"finish_reason": "length",
"index": 0,
"logprobs": null,
"text": "() {\n System.exit(0);\n }\n}\n"
}
],
"created": 1660611975,
"id": "cmpl-AYJ3pTvp3USyMhFmL0BoewUWZIA5h",
"model": "codegen",
"object": "text_completion",
"usage": {
"completion_tokens": 16,
"prompt_tokens": 2,
"total_tokens": 18
}
}
$ curl --location --globoff --request POST 'http://localhost:5000/v1/engines/:codegen/completions?prompt="def hello"&max_tokens=16&temperature=0.1&stop=["\n\n"]'
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
"http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8">
<title>Error response</title>
</head>
<body>
<h1>Error response</h1>
<p>Error code: 400</p>
<p>Message: Bad request syntax ('POST /v1/engines/:codegen/completions?prompt="def hello"&max_tokens=16&temperature=0.1&stop=["\\n\\n"] HTTP/1.1').</p>
<p>Error code explanation: HTTPStatus.BAD_REQUEST - Bad request syntax or unsupported method.</p>
</body>
</html>
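The 400 here is because the parameters went into the query string of a raw POST; the Python client sends them as a JSON body, and the working curl example further down this page does the same. A sketch of the equivalent request (server assumed at localhost:5000):

```shell
# Same parameters as the query-string attempt, but sent as a JSON body.
BODY='{"prompt": "def hello", "max_tokens": 16, "temperature": 0.1, "stop": ["\n\n"]}'
curl -s -H "Accept: application/json" -H "Content-Type: application/json" \
     -X POST -d "$BODY" \
     http://localhost:5000/v1/engines/codegen/completions || true
```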
This is supported by Triton; we just need to add support for it to the proxy. I have written code to do this independently here: https://moyix.net/~moyix/batch_codegen_full.py ; I just need to integrate that into the proxy code.
It seems that the repository https://huggingface.co/codegen-16B-mono-hf/resolve/main/config.json requires authentication. This causes the build to fail for me.
$ ~/fauxpilot (main) $ ./setup.sh
Models available:
[1] codegen-350M-mono (2GB total VRAM required; Python-only)
[2] codegen-350M-multi (2GB total VRAM required; multi-language)
[3] codegen-2B-mono (7GB total VRAM required; Python-only)
[4] codegen-2B-multi (7GB total VRAM required; multi-language)
[5] codegen-6B-mono (13GB total VRAM required; Python-only)
[6] codegen-6B-multi (13GB total VRAM required; multi-language)
[7] codegen-16B-mono (32GB total VRAM required; Python-only)
[8] codegen-16B-multi (32GB total VRAM required; multi-language)
Enter your choice [6]: 7
Enter number of GPUs [1]: 1
Where do you want to save the model [/home/user/fauxpilot/models]?
Downloading and converting the model, this will take a while...
Unable to find image 'moyix/model_conveter:latest' locally
latest: Pulling from moyix/model_conveter
[many "Pull complete"s]
Digest: sha256:744858f56b5eef785fde79b0f3bc76887fe34f14d0f8c01b06bf92ccd551b3ac
Status: Downloaded newer image for moyix/model_conveter:latest
Converting model codegen-16B-mono with 1 GPUs
Downloading config.json: 0%| | 0.00/994 [00:00<?, ?B/s]Loading CodeGen model
Downloading config.json: 100%|██████████| 994/994 [00:00<00:00, 1.59MB/s]
Downloading pytorch_model.bin: 100%|██████████| 30.0G/30.0G [06:19<00:00, 84.9MB/s]
download_and_convert_model.sh: line 9: 8 Killed python3 codegen_gptj_convert.py --code_model Salesforce/${MODEL} ${MODEL}-hf
=============== Argument ===============
saved_dir: /models/codegen-16B-mono-1gpu/fastertransformer/1
in_file: codegen-16B-mono-hf
trained_gpu_num: 1
infer_gpu_num: 1
processes: 4
weight_data_type: fp32
========================================
Traceback (most recent call last):
File "/transformers/src/transformers/configuration_utils.py", line 619, in _get_config_dict
resolved_config_file = cached_path(
File "/transformers/src/transformers/utils/hub.py", line 285, in cached_path
output_path = get_from_cache(
File "/transformers/src/transformers/utils/hub.py", line 503, in get_from_cache
_raise_for_status(r)
File "/transformers/src/transformers/utils/hub.py", line 418, in _raise_for_status
raise RepositoryNotFoundError(
transformers.utils.hub.RepositoryNotFoundError: 401 Client Error: Repository not found for url: https://huggingface.co/codegen-16B-mono-hf/resolve/main/config.json. If the repo is private, make sure you are authenticated.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "huggingface_gptj_convert.py", line 188, in <module>
split_and_convert(args)
File "huggingface_gptj_convert.py", line 86, in split_and_convert
model = GPTJForCausalLM.from_pretrained(args.in_file)
File "/transformers/src/transformers/modeling_utils.py", line 1844, in from_pretrained
config, model_kwargs = cls.config_class.from_pretrained(
File "/transformers/src/transformers/configuration_utils.py", line 530, in from_pretrained
config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs)
File "/transformers/src/transformers/configuration_utils.py", line 557, in get_config_dict
config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
File "/transformers/src/transformers/configuration_utils.py", line 631, in _get_config_dict
raise EnvironmentError(
OSError: codegen-16B-mono-hf is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models'
If this is a private repository, make sure to pass a token having permission to this repo with `use_auth_token` or log in with `huggingface-cli login` and pass `use_auth_token=True`.
Done! Now run ./launch.sh to start the FauxPilot server.
Would be awesome to have a plugin for fauxpilot!
Thanks for this fantastic work. I will do a quick comparison of code completion between FauxPilot and Copilot.
I will be making calls with the following API invocation:
response = openai.Completion.create(
    model="code-davinci-002",
    prompt=input_prompt,
    temperature=temperature,
    max_tokens=max_tokens,
    top_p=1,
    n=number_of_suggestions,
    frequency_penalty=frequency_penalty,
    presence_penalty=0,
    stop="###",
)
# suggestions = response['choices']
result = ""
if 'choices' in response:
    x = response['choices']
    if len(x) > 0:
        for i in range(0, len(x)):
            result = x[i]['text']
    else:
        result = ''
# are these metrics present?
response_completion_tokens = response["usage"]["completion_tokens"]
response_prompt_tokens = response["usage"]["prompt_tokens"]
response_total_tokens = response["usage"]["total_tokens"]
Can you please let me know whether this API invocation would work for fauxpilot?
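Mostly, yes, judging only by the openai-apitest.py example earlier on this page: the proxy is addressed via openai.api_base and an engine name rather than the hosted code-davinci-002 model, and the usage block (completion_tokens/prompt_tokens/total_tokens) does appear in that example's output. A sketch of the kwargs change (the helper name is made up):

```python
# Sketch: adapt the call above for FauxPilot. Assumes the proxy speaks the
# OpenAI completions API at http://localhost:5000/v1, as in the
# openai-apitest.py example earlier on this page.
def fauxpilot_kwargs(openai_kwargs):
    kwargs = dict(openai_kwargs)
    kwargs.pop("model", None)     # code-davinci-002 is not served locally
    kwargs["engine"] = "codegen"  # the proxy routes on the engine name
    return kwargs

kwargs = fauxpilot_kwargs({"model": "code-davinci-002",
                           "prompt": "def add(a, b):",
                           "max_tokens": 16})
```

Whether frequency_penalty and presence_penalty are honored by the proxy is not shown anywhere on this page, so treat those as untested.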
With the most recent versions of Docker Compose (2.6), installed as the Debian package "docker-compose-plugin", the executable changed from a standalone docker-compose binary to a docker compose subcommand.
launch.sh has the docker-compose command hard-coded. Make it choose the correct command depending on what is installed (or just try one and fall back to the other).
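A minimal sketch of such a fallback, assuming one of the two variants is installed (the function name is made up):

```shell
# Pick whichever Compose invocation exists on this machine.
compose_cmd() {
  if command -v docker-compose >/dev/null 2>&1; then
    echo "docker-compose"
  elif docker compose version >/dev/null 2>&1; then
    echo "docker compose"
  fi
}

DOCKER_COMPOSE="$(compose_cmd)"
echo "Using: ${DOCKER_COMPOSE:-none found}"
# launch.sh could then invoke: $DOCKER_COMPOSE up -d
```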
What is the context length for FauxPilot? I presume it is 4096?
I installed and ran FauxPilot on Ubuntu 18.04/Nvidia RTX 2080 (192.168.0.201) and Ubuntu 18.04/Nvidia Titan Xp (192.168.0.179).
Then, in the Ubuntu environment of my laptop, I performed the OpenAI API with the curl command, as shown below.
Unfortunately, sending the curl command to Ubuntu18.04/Nvidia Titan Xp (192.168.0.179) throws an error.
In summary, FauxPilot on Ubuntu 18.04/Nvidia Titan Xp generates the "CUDA runtime error: invalid device function" error message.
Maybe the Nvidia Titan Xp is not supported by FauxPilot?
The configuration file is as follows.
cat ./config.env
MODEL=codegen-2B-multi
NUM_GPUS=1
MODEL_DIR=/work/fauxpilot/models
fauxpilot$ curl -s -H "Accept: application/json" -H "Content-type: application/json" -X POST -d '{"prompt":"def hello","max_tokens":16,"temperature":0.1,"stop":["\n\n"]}' http://192.168.0.201:5000/v1/engines/codegen/completions
{"id": "cmpl-eww3WHuWSjUMdfLb5tBfxVxRoJUIs", "model": "codegen", "object": "text_completion", "created": 1660749662, "choices": [{"text": "(self):\n return \"Hello World!\"", "index": 0, "finish_reason": "stop", "logprobs": null}], "usage":
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.103.01 Driver Version: 470.103.01 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:01:00.0 Off | N/A |
| 30% 29C P8 3W / 225W | 6035MiB / 7982MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1198 G /usr/lib/xorg/Xorg 9MiB |
| 0 N/A N/A 1403 G /usr/bin/gnome-shell 3MiB |
| 0 N/A N/A 768980 C ...onserver/bin/tritonserver 6017MiB |
+-----------------------------------------------------------------------------+
fauxpilot$ curl -s -H "Accept: application/json" -H "Content-type: application/json" -X POST -d '{"prompt":"def hello","max_tokens":16,"temperature":0.1,"stop":["\n\n"]}' http://192.168.0.179:5000/v1/engines/codegen/completions
<!doctype html>
<html lang=en>
<title>500 Internal Server Error</title>
<h1>Internal Server Error</h1>
<p>The server encountered an internal error and was unable to complete your request. Either the server is overloaded or there is an error in the application.</p>
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.19.01 Driver Version: 465.19.01 CUDA Version: 11.3 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA TITAN Xp On | 00000000:01:00.0 Off | N/A |
| 23% 39C P2 61W / 250W | 5919MiB / 12194MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1592 G /usr/lib/xorg/Xorg 41MiB |
| 0 N/A N/A 24177 C ...onserver/bin/tritonserver 5873MiB |
+-----------------------------------------------------------------------------+
Below is the error log output when running ./launch.sh on Ubuntu 18.04 / NVIDIA Titan Xp (192.168.0.179).
$ ./launch.sh
...... Omission ......
triton_1 |
triton_1 | I0817 15:19:21.682222 96 grpc_server.cc:4587] Started GRPCInferenceService at 0.0.0.0:8001
triton_1 | I0817 15:19:21.682527 96 http_server.cc:3303] Started HTTPService at 0.0.0.0:8000
triton_1 | I0817 15:19:21.724786 96 http_server.cc:178] Started Metrics Service at 0.0.0.0:8002
triton_1 | W0817 15:21:08.354892 96 libfastertransformer.cc:1397] model fastertransformer, instance fastertransformer_0, executing 1 requests
triton_1 | W0817 15:21:08.354910 96 libfastertransformer.cc:638] TRITONBACKEND_ModelExecute: Running fastertransformer_0 with 1 requests
triton_1 | W0817 15:21:08.354916 96 libfastertransformer.cc:693] get total batch_size = 1
triton_1 | W0817 15:21:08.354922 96 libfastertransformer.cc:1051] get input count = 16
triton_1 | W0817 15:21:08.354930 96 libfastertransformer.cc:1117] collect name: start_id size: 4 bytes
triton_1 | W0817 15:21:08.354935 96 libfastertransformer.cc:1117] collect name: input_ids size: 8 bytes
triton_1 | W0817 15:21:08.354939 96 libfastertransformer.cc:1117] collect name: bad_words_list size: 8 bytes
triton_1 | W0817 15:21:08.354944 96 libfastertransformer.cc:1117] collect name: random_seed size: 4 bytes
triton_1 | W0817 15:21:08.354948 96 libfastertransformer.cc:1117] collect name: end_id size: 4 bytes
triton_1 | W0817 15:21:08.354952 96 libfastertransformer.cc:1117] collect name: input_lengths size: 4 bytes
triton_1 | W0817 15:21:08.354956 96 libfastertransformer.cc:1117] collect name: request_output_len size: 4 bytes
triton_1 | W0817 15:21:08.354960 96 libfastertransformer.cc:1117] collect name: runtime_top_k size: 4 bytes
triton_1 | W0817 15:21:08.354964 96 libfastertransformer.cc:1117] collect name: runtime_top_p size: 4 bytes
triton_1 | W0817 15:21:08.354968 96 libfastertransformer.cc:1117] collect name: is_return_log_probs size: 1 bytes
triton_1 | W0817 15:21:08.354972 96 libfastertransformer.cc:1117] collect name: stop_words_list size: 24 bytes
triton_1 | W0817 15:21:08.354976 96 libfastertransformer.cc:1117] collect name: temperature size: 4 bytes
triton_1 | W0817 15:21:08.354979 96 libfastertransformer.cc:1117] collect name: len_penalty size: 4 bytes
triton_1 | W0817 15:21:08.354988 96 libfastertransformer.cc:1117] collect name: beam_width size: 4 bytes
triton_1 | W0817 15:21:08.354998 96 libfastertransformer.cc:1117] collect name: beam_search_diversity_rate size: 4 bytes
triton_1 | W0817 15:21:08.355005 96 libfastertransformer.cc:1117] collect name: repetition_penalty size: 4 bytes
triton_1 | W0817 15:21:08.355010 96 libfastertransformer.cc:1130] the data is in CPU
triton_1 | W0817 15:21:08.355015 96 libfastertransformer.cc:1137] the data is in CPU
triton_1 | W0817 15:21:08.355025 96 libfastertransformer.cc:999] before ThreadForward 0
triton_1 | W0817 15:21:08.355069 96 libfastertransformer.cc:1006] after ThreadForward 0
triton_1 | I0817 15:21:08.355097 96 libfastertransformer.cc:834] Start to forward
triton_1 | terminate called after throwing an instance of 'std::runtime_error'
triton_1 | what(): [FT][ERROR] CUDA runtime error: invalid device function /workspace/build/fastertransformer_backend/build/_deps/repo-ft-src/src/fastertransformer/kernels/sampling_topp_kernels.cu:1057
triton_1 |
triton_1 | Signal (6) received.
triton_1 | 0# 0x000055ACE072C699 in /opt/tritonserver/bin/tritonserver
triton_1 | 1# 0x00007F0F78E2D090 in /usr/lib/x86_64-linux-gnu/libc.so.6
triton_1 | 2# gsignal in /usr/lib/x86_64-linux-gnu/libc.so.6
triton_1 | 3# abort in /usr/lib/x86_64-linux-gnu/libc.so.6
triton_1 | 4# 0x00007F0F791E6911 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
triton_1 | 5# 0x00007F0F791F238C in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
triton_1 | 6# 0x00007F0F791F23F7 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
triton_1 | 7# 0x00007F0F791F26A9 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
triton_1 | 8# void fastertransformer::check<cudaError>(cudaError, char const*, char const*, int) in /opt/tritonserver/backends/fastertransformer/libtransformer-shared.so
triton_1 | 9# void fastertransformer::invokeTopPSampling<float>(void*, unsigned long&, unsigned long&, int*, int*, bool*, float*, float*, float const*, int const*, int*, int*, curandStateXORWOW*, int, unsigned long, int const*, float, CUstream_st*, cudaDeviceProp*) in /opt/tritonserver/backends/fastertransformer/libtransformer-shared.so
triton_1 | 10# fastertransformer::TopPSamplingLayer<float>::allocateBuffer(unsigned long, unsigned long, float) in /opt/tritonserver/backends/fastertransformer/libtransformer-shared.so
triton_1 | 11# fastertransformer::TopPSamplingLayer<float>::runSampling(std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, fastertransformer::Tensor, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, fastertransformer::Tensor> > >*, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, fastertransformer::Tensor, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, fastertransformer::Tensor> > > const*) in /opt/tritonserver/backends/fastertransformer/libtransformer-shared.so
triton_1 | 12# fastertransformer::BaseSamplingLayer<float>::forward(std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, fastertransformer::Tensor, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, fastertransformer::Tensor> > >*, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, fastertransformer::Tensor, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, fastertransformer::Tensor> > > const*) in /opt/tritonserver/backends/fastertransformer/libtransformer-shared.so
triton_1 | 13# fastertransformer::DynamicDecodeLayer<float>::forward(std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, fastertransformer::Tensor, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, fastertransformer::Tensor> > >*, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, fastertransformer::Tensor, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, fastertransformer::Tensor> > > const*) in /opt/tritonserver/backends/fastertransformer/libtransformer-shared.so
triton_1 | 14# fastertransformer::GptJ<__half>::forward(std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, fastertransformer::Tensor, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, fastertransformer::Tensor> > >*, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, fastertransformer::Tensor, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, fastertransformer::Tensor> > > const*, fastertransformer::GptJWeight<__half> const*) in /opt/tritonserver/backends/fastertransformer/libtransformer-shared.so
triton_1 | 15# GptJTritonModelInstance<__half>::forward(std::shared_ptr<std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, triton::Tensor, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, triton::Tensor> > > >) in /opt/tritonserver/backends/fastertransformer/libtransformer-shared.so
triton_1 | 16# 0x00007F0F700ED40A in /opt/tritonserver/backends/fastertransformer/libtriton_fastertransformer.so
triton_1 | 17# 0x00007F0F7921EDE4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
triton_1 | 18# 0x00007F0F7A42D609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0
triton_1 | 19# clone in /usr/lib/x86_64-linux-gnu/libc.so.6
triton_1 |
copilot_proxy_1 | [2022-08-17 15:21:08,929] ERROR in app: Exception on /v1/engines/codegen/completions [POST]
copilot_proxy_1 | Traceback (most recent call last):
copilot_proxy_1 | File "/usr/local/lib/python3.8/site-packages/flask/app.py", line 2463, in wsgi_app
copilot_proxy_1 | response = self.full_dispatch_request()
copilot_proxy_1 | File "/usr/local/lib/python3.8/site-packages/flask/app.py", line 1760, in full_dispatch_request
copilot_proxy_1 | rv = self.handle_user_exception(e)
copilot_proxy_1 | File "/usr/local/lib/python3.8/site-packages/flask/app.py", line 1758, in full_dispatch_request
copilot_proxy_1 | rv = self.dispatch_request()
copilot_proxy_1 | File "/usr/local/lib/python3.8/site-packages/flask/app.py", line 1734, in dispatch_request
copilot_proxy_1 | return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)
copilot_proxy_1 | File "/python-docker/app.py", line 258, in completions
copilot_proxy_1 | response=codegen(data),
copilot_proxy_1 | File "/python-docker/app.py", line 234, in __call__
copilot_proxy_1 | completion, choices = self.generate(data)
copilot_proxy_1 | File "/python-docker/app.py", line 146, in generate
copilot_proxy_1 | result = self.client.infer(model_name, inputs)
copilot_proxy_1 | File "/usr/local/lib/python3.8/site-packages/tritonclient/grpc/__init__.py", line 1322, in infer
copilot_proxy_1 | raise_error_grpc(rpc_error)
copilot_proxy_1 | File "/usr/local/lib/python3.8/site-packages/tritonclient/grpc/__init__.py", line 62, in raise_error_grpc
copilot_proxy_1 | raise get_error_grpc(rpc_error) from None
copilot_proxy_1 | tritonclient.utils.InferenceServerException: [StatusCode.UNAVAILABLE] Socket closed
copilot_proxy_1 | 192.168.0.179 - - [17/Aug/2022 15:21:08] "POST /v1/engines/codegen/completions HTTP/1.1" 500 -
triton_1 | --------------------------------------------------------------------------
triton_1 | Primary job terminated normally, but 1 process returned
triton_1 | a non-zero exit code. Per user-direction, the job has been aborted.
triton_1 | --------------------------------------------------------------------------
triton_1 | --------------------------------------------------------------------------
triton_1 | mpirun noticed that process rank 0 with PID 0 on node 1f7b69d48c22 exited on signal 6 (Aborted).
triton_1 | --------------------------------------------------------------------------
fauxpilot_triton_1 exited with code 134
What could be causing this issue?
Any hints or clues are welcome. Thank you.
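For what it's worth, a CUDA "invalid device function" error often indicates that the kernels in the container were not compiled for this GPU's compute capability (a Titan Xp is sm_61). A hedged sketch of the kind of check involved (a hypothetical helper, not part of FauxPilot; the table lists a few well-known GPUs):

```python
# Hypothetical helper (not part of FauxPilot): a few well-known GPUs and
# their CUDA compute capabilities, for sanity-checking against whatever
# architectures the prebuilt Triton/FasterTransformer image was built for.
COMPUTE_CAPABILITY = {
    "NVIDIA TITAN Xp": 6.1,
    "Tesla T4": 7.5,
    "Tesla V100": 7.0,
    "NVIDIA GeForce RTX 3090": 8.6,
}

def kernels_should_load(gpu_name, min_supported_cc):
    """True if the GPU's compute capability meets the build's minimum."""
    cc = COMPUTE_CAPABILITY.get(gpu_name)
    return cc is not None and cc >= min_supported_cc

# A backend built only for sm_70 and newer would hit "invalid device
# function" on a Titan Xp (sm_61):
print(kernels_should_load("NVIDIA TITAN Xp", 7.0))  # → False
print(kernels_should_load("Tesla T4", 7.0))         # → True
```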
I have tried the FauxPilot client, but I ran into a lot of bugs while running it. I couldn't help asking: are there any other plugins available?
While spinning up the compose stack via the launch script, I get the following error:
triton_1 | [FT][WARNING] Custom All Reduce only supports 8 Ranks currently. Using NCCL as Comm.
triton_1 | terminate called after throwing an instance of 'std::runtime_error'
triton_1 | what(): [FT][ERROR] CUDA runtime error: the provided PTX was compiled with an unsupported toolchain. /workspace/build/fastertransformer_backend/build/_deps/repo-ft-src/src/fastertransformer/utils/cuda_utils.h:393
triton_1 |
triton_1 | [0674cc13c0f5:00095] *** Process received signal ***
triton_1 | [0674cc13c0f5:00095] Signal: Aborted (6)
triton_1 | [0674cc13c0f5:00095] Signal code: (-6)
Cuda Information
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Sun_Feb_14_21:12:58_PST_2021
Cuda compilation tools, release 11.2, V11.2.152
Build cuda_11.2.r11.2/compiler.29618528_0
nvidia-smi
Tue Sep 20 07:53:30 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03 Driver Version: 510.47.03 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... On | 00000000:01:00.0 On | N/A |
| 0% 49C P8 30W / 320W | 334MiB / 10240MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 4833 G 35MiB |
| 0 N/A N/A 5856 G 179MiB |
| 0 N/A N/A 5982 G 51MiB |
| 0 N/A N/A 17941 G 13MiB |
| 0 N/A N/A 179354 G 11MiB |
| 0 N/A N/A 189500 G 26MiB |
+-----------------------------------------------------------------------------+
Hi, I discovered this project from a blog and I find it interesting. The only issue is that I work on a MacBook Pro with an M2 chip, and I don't know how to adapt this project to run on Apple silicon and leverage its Neural Engine.
I'm a web developer and don't know much about machine learning.
This year, Hacktoberfest will be held in October for one month.
Many GitHub open source projects are participating.
To encourage code contributions, I would like to suggest that this community participate. 😺
Details on how to participate can be found via the Issues and Pull Request (PR) menus.
@moyix, you may want to add the two labels HACKTOBERFEST and HACKTOBERFEST-ACCEPTED.
Hacktoberfest is a month-long celebration of open source projects, their maintainers,
and the entire community of contributors. Each October, open source maintainers
give new contributors extra attention as they guide developers through their first
pull requests on GitHub.
Looking for help with the Triton Inference Server setup!
After running ./launch.sh, the following log was generated:
"
...
fauxpilot-triton-1 | W0907 17:13:12.624284 88 pinned_memory_manager.cc:236] Unable to allocate pinned system memory, pinned memory pool will not be available: no CUDA-capable device is detected
fauxpilot-triton-1 | I0907 17:13:12.624501 88 cuda_memory_manager.cc:115] CUDA memory pool disabled
fauxpilot-triton-1 | I0907 17:13:12.624647 88 server.cc:556]
...
W0907 17:13:12.704691 88 metrics.cc:634] Cannot get CUDA device count, GPU metrics will not be available
...
"
But the container is running. When I try to request inference, I get:
"tritonclient.utils.InferenceServerException: [StatusCode.UNAVAILABLE] Request for unknown model: 'fastertransformer' is not found"
Dependencies and system specs:
1xV100 gpu
Driver Version: 515.65.01
Docker Compose version v2.6.0
Nvidia docker: nvidia/cuda:11.0.3-base-ubuntu20.04
The repository contains Dockerfiles to recreate the moyix/model_converter:latest and moyix/copilot_proxy:latest images, but not the moyix/triton_with_ft:22.06 image. Would it be possible to add the Dockerfile + config to build this image?
Hi, I would like to know: does FauxPilot only support NVIDIA GPUs?
I am using a T4 GPU; the host machine's CUDA is 11.0 and the driver is 450.102.04. When running launch.sh, I got the following error.
Detail log:
fauxpilot-triton-1 | W0812 03:06:40.864778 92 libfastertransformer.cc:620] Get output name: cum_log_probs, type: TYPE_FP32, shape: [-1]
fauxpilot-triton-1 | W0812 03:06:40.864782 92 libfastertransformer.cc:620] Get output name: output_log_probs, type: TYPE_FP32, shape: [-1, -1]
fauxpilot-triton-1 | [FT][WARNING] Custom All Reduce only supports 8 Ranks currently. Using NCCL as Comm.
fauxpilot-triton-1 | I0812 03:06:41.156692 92 libfastertransformer.cc:307] Before Loading Model:
fauxpilot-triton-1 | after allocation, free 6.56 GB total 8.00 GB
fauxpilot-triton-1 | [WARNING] gemm_config.in is not found; using default GEMM algo
fauxpilot-triton-1 | terminate called after throwing an instance of 'std::runtime_error'
fauxpilot-triton-1 | what(): [FT][ERROR] CUDA runtime error: operation not supported /workspace/build/fastertransformer_backend/build/_deps/repo-ft-src/src/fastertransformer/utils/allocator.h:181
fauxpilot-triton-1 |
fauxpilot-triton-1 | [5f61fab36b85:00092] *** Process received signal ***
fauxpilot-triton-1 | [5f61fab36b85:00092] Signal: Aborted (6)
fauxpilot-triton-1 | [5f61fab36b85:00092] Signal code: (-6)
fauxpilot-triton-1 | [5f61fab36b85:00092] [ 0] /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x14420)[0x7f3a7ef7e420]
Thx~
According to the response from the Docker developers:
env_file is not used for variable interpolation in the container specification used by Compose to create containers, but is passed to the Docker API as the runtime definition for the process running inside the container.
...
So the docker-compose.yaml in 8895b74 relies on undefined behavior, which may have unintended effects on some systems. (In fact, as mentioned in the issue above, this doesn't work properly on Windows, e.g. with fauxpilot-windows.)
I noticed this is already the second bug from #49, and @dslandry said he didn't finish the review. Should we completely re-evaluate and check this PR?
Which graphics cards are supported?
Couple of queries:
1. Is it possible to run the codegen-16B-multi model with a single A6000?
2. Curious to know: have you attempted to see how much load FauxPilot could handle? I know it has a lot to do with the hardware that is provisioned.
Still, I would be curious to know what a typical GPU like an RTX 3090 or RTX A6000 could handle in terms of API requests per minute.
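For rough sizing, throughput is bounded by concurrency divided by mean per-request latency; a trivial back-of-envelope helper (hypothetical, illustrative only; the latency numbers would have to come from your own measurements):

```python
def requests_per_minute(mean_latency_s: float, concurrency: int = 1) -> float:
    """Rough upper bound: each worker completes one request every
    mean_latency_s seconds, so `concurrency` workers do 60/latency each."""
    if mean_latency_s <= 0:
        raise ValueError("latency must be positive")
    return 60.0 / mean_latency_s * concurrency

# e.g. 500 ms per completion with 4 concurrent requests:
print(requests_per_minute(0.5, 4))  # → 480.0
```

In practice batching, prompt length, and max_tokens dominate, so measured numbers will be well below this bound.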
Not sure how this happened, but currently the 16B 2-GPU models fail in unzstd with Decoding error (36) : Corrupted block detected. I will re-convert and re-upload them. Steps to fix: re-run ./setup.sh to prevent corrupted downloads.
I rewrote the setup and launch scripts (and so on) from this repository in PowerShell, to start FauxPilot directly on Windows with Docker, and it works fine on my device (as shown below). Although I used Windows in the name, I've recently enhanced the generality so that it also works well on Linux (if anyone likes using pwsh on Linux as much as I do 😸).
I wonder if you are interested in such a project. Should I file a pull request to add the same functionality to this repo, or continue as an independent project?
With a working FauxPilot setup, we can run inference tasks based on the CodeGen model. I recently read that I can fine-tune the CodeGen model as described on the following website.
$ deepspeed --num_gpus 1 --num_nodes 1 run_clm.py --model_name_or_path=Salesforce/codegen-6B-multi --per_device_train_batch_size=1 --learning_rate 2e-5 --num_train_epochs 1 --output_dir=./codegen-6B-finetuned --dataset_name your_dataset --tokenizer_name Salesforce/codegen-6B-multi --block_size 2048 --gradient_accumulation_steps 32 --do_train --fp16 --overwrite_output_dir --deepspeed ds_config.json
I'm curious whether there is a GitHub repository that describes how to fine-tune with additional source code (e.g., my own source code) using DeepSpeed. I am looking for a more detailed GitHub repository covering the "--dataset_name your_dataset" option. Where is the applicable GitHub repository located? Are there any web pages that deal with how to run fine-tuning with DeepSpeed? Any comments on this issue are welcome.
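I can't point to a specific repository, but --dataset_name expects a dataset on the Hugging Face Hub; for local code, the run_clm.py example script also accepts a plain-text --train_file. A minimal, hypothetical sketch for flattening your own source tree into such a file (directory name and extensions are assumptions):

```python
import pathlib

def build_corpus(src_dir, out_file, exts=(".py", ".js")):
    # Concatenate matching source files into one text file that
    # run_clm.py can consume via --train_file instead of --dataset_name.
    with open(out_file, "w", encoding="utf-8") as out:
        for path in sorted(pathlib.Path(src_dir).rglob("*")):
            if path.is_file() and path.suffix in exts:
                out.write(path.read_text(encoding="utf-8", errors="ignore"))
                out.write("\n\n")

# build_corpus("my_project/", "train.txt")
# then: run_clm.py ... --train_file train.txt --block_size 2048 ...
```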
Is there any chance that this project could run on an Android TV device (an NVIDIA Shield Pro, which has a Tegra X1)?
It is so inspiring to find this project.
I am a disabled software developer who really struggles to code now. I've been staying away from copilot, worried of becoming reliant on cloud infrastructure, but I wonder if these tools could really help me.
Are there clear "for dummies" instructions to get set up anywhere? Including what hardware is recommended, and how to configure popular editors? (I use Vim, and I was thinking of setting up a Dasharo ASUS D16 motherboard...)
How easy would it be to swap out SalesForce CodeGen models with those from EleutherAI?
GPT-Neo, GPT-J, and GPT-NeoX models are also trained on GitHub.
I installed the requirements and chose option 4 with 1 GPU in ./setup.sh. Then I ran ./launch.sh, which printed the following. I think the server was not launched. Simply creating a Program folder on my C drive didn't solve the problem. How can I fix this?
(base) boyuanchen@Owne:~/fauxpilot$ ls
LICENSE README.md config.env converter copilot_proxy docker-compose.yaml example.env launch.sh models setup.sh
(base) boyuanchen@Owne:~/fauxpilot$ ./launch.sh
./launch.sh: line 19: /mnt/c/Program: No such file or directory
(base) boyuanchen@Owne:~/fauxpilot$ ./launch.sh
./launch.sh: line 19: /mnt/c/Program: Is a directory
(base) boyuanchen@Owne:~/fauxpilot$
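The /mnt/c/Program error is the classic unquoted-variable problem: a path containing a space (here likely "Program Files" from a Windows PATH entry leaking into WSL) gets word-split by bash, so only the first word is used. A minimal illustration of the failure mode and the quoting fix (the paths are hypothetical, not from launch.sh):

```shell
#!/bin/bash
dir="/tmp/fauxpilot demo"   # a path containing a space
mkdir -p "$dir"

# Unquoted: $dir word-splits into '/tmp/fauxpilot' and 'demo', so the
# test receives two arguments and fails, just like launch.sh line 19.
if test -d $dir 2>/dev/null; then echo "unquoted: ok"; else echo "unquoted: fails"; fi

# Quoted: the path survives intact.
if test -d "$dir"; then echo "quoted: ok"; fi
```

Quoting every expansion of the offending variable in launch.sh (or removing the space-containing PATH entry) should avoid the error.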
Now that FauxPilot has been used by quite a few people, it would be great to collect the questions that come up repeatedly into a Frequently Asked Questions (FAQ) page. I have started tagging issues to collect such questions.
Another helpful thing would be a list of GPUs and model sizes that are known to work, so that people can easily see if their configuration should work.
Most of us don't have GPUs powerful enough to even run models with 6 billion parameters. Can we port this to Colab in any way so it would be more accessible?
An (unintended?) consequence of #49 seems to be that the completions API now returns log probabilities for each token even when they are not requested (i.e. when logprobs is null in the request). I poked around at it a little but couldn't immediately track down why it's happening. I thought it might be due to these lines:
https://github.com/moyix/fauxpilot/blob/main/copilot_proxy/utils/codegen.py#L84-L87
But changing that doesn't seem to have stopped it from returning logprobs. @fdegier, do you know offhand what might have introduced this?
Not high priority, just something to fix up when I get a chance.
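For comparison, the OpenAI-style contract is that the response's logprobs field should be null unless the request sets it; a hypothetical sketch of the gating one would expect in the proxy (not the actual codegen.py code):

```python
def choice_logprobs(requested, tokens, token_logprobs):
    # Mirror the OpenAI completions API: when the request carries
    # logprobs=null, the response's "logprobs" field must also be null.
    if requested is None:
        return None
    return {
        "tokens": tokens,
        "token_logprobs": token_logprobs,
    }

print(choice_logprobs(None, ["hi"], [-0.1]))  # → None
print(choice_logprobs(1, ["hi"], [-0.1]))
```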
Hi guys,
I'm using:
5.18.16-200.fc36.x86_64
I'm using Podman as the container runtime with the NVIDIA Container Toolkit:
Client: Podman Engine
Version: 4.1.1
API Version: 4.1.1
Go Version: go1.18.4
Built: Fri Jul 22 15:05:59 2022
OS/Arch: linux/amd64
cat /usr/share/containers/oci/hooks.d/oci-nvidia-hook.json
{
"version": "1.0.0",
"hook": {
"path": "/usr/bin/nvidia-container-toolkit",
"args": ["nvidia-container-toolkit", "prestart"],
"env": [
"PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
]
},
"when": {
"always": true,
"commands": [".*"]
},
"stages": ["prestart"]
}
The nvidia-smi command works fine in the container.
The Triton server started fine, but it crashes when I send it a request using the OpenAI API demo from the README.
Is this a GPU compatibility issue? If so, which GPU models are supported?
Any help will be appreciated!