Comments (9)
Can't tell if the model architecture has changed – any idea?
from fauxpilot.
I don't know if this is really what you want to know, but as far as I can see you can use it the same way (causal), and they added an infill mode. It also seems to support many more languages.
For infill sampling, we introduce three new special token types:
<mask_N>: N-th span to be masked. In practice, insert <mask_1> where you want to sample the infill.
<sep>: Separator token between the suffix and the infilled sample. See below.
<eom>: "End-Of-Mask" token that the model outputs at the end of infilling. You may use this token to truncate the output.
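A minimal sketch of how these tokens fit together, assuming the prompt layout from the CodeGen2 model card (prefix, masked span, suffix, then <sep> and the mask again); the <|endoftext|> separator and the helper names here are illustrative assumptions, not from this thread:

```python
# Sketch of the CodeGen2 infill prompt format described above.
# <mask_1> marks the span to fill, <sep> separates the suffix from
# the sampled infill, and <eom> ends the model's infill output.
EOM = "<eom>"

def build_infill_prompt(prefix: str, suffix: str) -> str:
    # Mask the span, append the suffix, then ask the model to
    # produce the masked span after <sep>.
    return prefix + "<mask_1>" + suffix + "<|endoftext|>" + "<sep>" + "<mask_1>"

def truncate_at_eom(generated: str) -> str:
    # Everything after <eom> is discarded.
    return generated.split(EOM, 1)[0]

prompt = build_infill_prompt("def hello():\n    ", "\n    return msg\n")
completion = truncate_at_eom('msg = "Hello, world!"<eom>extra junk')
```

The model would be fed `prompt` and its sampled continuation passed through `truncate_at_eom`.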
It's probably possible with a single modification of setup.sh
@moyix how did you come up with the calculations & code in codegen_gptj_convert.py? It seems like this conversion from CodeGen to GPT-J is the most difficult part of supporting a new model type.
I was able to modify the python backend to support bigcode/starcoder. It's obviously really slow because we are just loading the model via the transformers library in the python backend (are we sure that is the right way to do the python backend thing?). I got fairly far along with the FasterTransformer conversion but stopped when I saw the bit of math going on in codegen_gptj_convert.py. I haven't tried just removing it and seeing if the conversion from GPTBigCode -> GPT-J is simpler than CodeGen -> GPT-J.
@moyix Let me shed some light on this.
Comparing the configuration CodeGen2 uses (trust_remote_code=True) against CodeGen1, I found one obvious hyperparameter change: mp_num = 8 instead of mp_num = 4.
After some hours of debugging, I reverse-engineered that the following tweak in the permutation order should do the job.
Below are explanations for the Salesforce/codegen2-1B sizes.
1B: qkv_proj has shape [6144, other].
The 6144 rows contain the Q, K, and V vectors for all mp_num = 8 partitions; Q alone now needs to end up in shape [8, 256].
Toy example: if qkv_proj were just np.arange, it would go through this transformation:
qkv_proj[:, 0] = np.arange(0, 6144)
1B: qw has shape [1, 2, 8, 256]
qw = tensor([[   0.,  768., 1536., 2304., 3072., 3840., 4608., 5376.],
             ...(254 rows omitted)...
             [ 255., 1023., 1791., 2559., 3327., 4095., 4863., 5631.]])
The value weights get:
qv = tensor([[ 256., 1024., 1792., 2560., 3328., 4096., 4864., 5632.],
             ...(254 rows omitted)...
             [ 511., 1279., 2047., 2815., 3583., 4351., 5119., 5887.]])
The rest goes to k.
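The toy example above can be checked with a few lines of numpy; the per-partition [Q, V, K] ordering is an assumption inferred from the quoted qw/qv values:

```python
import numpy as np

# Toy check of the layout described above, assuming each of the
# mp_num = 8 partitions of 768 rows is ordered [Q(256), V(256), K(256)],
# which reproduces the qw/qv values quoted in the comment.
mp_num, head_dim = 8, 256
qkv = np.arange(mp_num * 3 * head_dim)      # stand-in for qkv_proj[:, 0]
parts = qkv.reshape(mp_num, 3 * head_dim)   # one row per partition

qw = parts[:, 0 * head_dim:1 * head_dim].T  # (256, 8), query values
qv = parts[:, 1 * head_dim:2 * head_dim].T  # (256, 8), value values
kw = parts[:, 2 * head_dim:3 * head_dim].T  # (256, 8), key values
```

The first and last rows of `qw` and `qv` match the tensors quoted above.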
generalized vector for permutation is therefore:
```python
import numpy as np

mp_num = 8  # CodeGen2 uses mp_num = 8; CodeGen1 used mp_num = 4
base_permutation = np.arange(0, mp_num * 3).reshape(-1, 3).T.flatten().tolist()
# base_permutation == [0, 3, 6, 9, 12, 15, 18, 21,
#                      1, 4, 7, 10, 13, 16, 19, 22,
#                      2, 5, 8, 11, 14, 17, 20, 23]
```
All you need to do is to make the permutation configurable.
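As a sketch of what a configurable permutation could look like, here it is applied to the same toy arange weights, pulling the interleaved per-partition chunks apart into contiguous Q, V, and K blocks (variable names are hypothetical, not taken from codegen_gptj_convert.py):

```python
import numpy as np

# Apply the generalized permutation: reorder the mp_num * 3 interleaved
# chunks so that all Q chunks come first, then V, then K. The
# per-partition [Q, V, K] ordering is an assumption from the toy example.
mp_num, head_dim = 8, 256
base_permutation = np.arange(0, mp_num * 3).reshape(-1, 3).T.flatten()

qkv = np.arange(mp_num * 3 * head_dim)               # toy qkv_proj[:, 0]
chunks = qkv.reshape(mp_num * 3, head_dim)           # 24 chunks of 256 rows
reordered = chunks[base_permutation].reshape(3, -1)  # rows: Q, V, K
q, v, k = reordered
```

Making `mp_num` a parameter (4 for CodeGen1, 8 for CodeGen2) is then enough to cover both model families.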
Anyhow, the Triton server is not very performant compared to CTranslate2 (https://github.com/OpenNMT/CTranslate2). CTranslate2 can also do batching, and there is no need to pad to certain shapes in the FastAPI proxy. (CTranslate2 codegen2 on int8 CPU is around 4.1x faster and takes ~4x less memory than huggingface codegen2.)
I'll try to add Codegen-1 and Codegen-2 models in all sizes for the CTranslate2 framework, stay tuned.
https://github.com/OpenNMT/CTranslate2/pull/1230/files
Oops, really sorry that I didn't see this before you figured it out on your own. I wrote up an article explaining how the permutation was derived here:
https://gist.github.com/moyix/7896575befbe1b99162ccfec8d135566
I'll look into Ctranslate2 – are there gains over FT when using GPUs for inference?
Not sure about FT. On my GPU:
task: input 16 tokens -> generate exactly 64 tokens, ten times
timings:
- ct2 codegen2-7B on float16 = 9.55 s (67 tokens/s, 1x GPU, 7 GB VRAM)
- huggingface codegen2-7B on int8 = 17.06 s (37.5 tokens/s, 1x GPU, 7 GB VRAM)
For the smaller models (2B etc.) the speedup should be more like 3x; for large models the tensor sizes benefit less from the C++ implementation: for 16B it's more like 1.5x.
Most importantly, ct2 takes only half the memory. I am not sure about the speed of FT (I think you once wrote ~2x speedup).
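For reference, the quoted rates are consistent with counting only the generated tokens (ten runs of 64 tokens each):

```python
# Sanity check of the quoted throughput numbers, assuming the rate
# counts only the generated tokens (10 runs x 64 tokens each).
generated_tokens = 10 * 64

ct2_rate = generated_tokens / 9.55   # ct2 float16: ~67 tokens/s
hf_rate = generated_tokens / 17.06   # huggingface int8: ~37.5 tokens/s
```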
Edit: I found another of your markdown posts, which helped me to derive the codegen2 conversion.
Here are some benchmarks for codegen2 on FasterTransformer, run on A6000s.
codegen2-1B on 4 GPUs generated 16+64 tokens in 0.18s ~349.40 tokens/sec
codegen2-1B on 2 GPUs generated 16+64 tokens in 0.19s ~337.68 tokens/sec
codegen2-1B on 1 GPU generated 16+64 tokens in 0.25s ~253.55 tokens/sec
codegen2-3_7B on 4 GPUs generated 16+64 tokens in 0.29s ~220.00 tokens/sec
codegen2-3_7B on 2 GPUs generated 16+64 tokens in 0.43s ~148.69 tokens/sec
codegen2-3_7B on 1 GPU generated 16+64 tokens in 0.73s ~ 87.47 tokens/sec
codegen2-7B on 4 GPUs generated 16+64 tokens in 0.51s ~125.93 tokens/sec
codegen2-7B on 2 GPUs generated 16+64 tokens in 0.80s ~ 80.26 tokens/sec
codegen2-7B on 1 GPU generated 16+64 tokens in 1.38s ~ 46.26 tokens/sec
codegen2-16B on 4 GPUs generated 16+64 tokens in 0.99s ~ 64.97 tokens/sec
codegen2-16B on 2 GPUs generated 16+64 tokens in 1.68s ~ 38.13 tokens/sec
codegen2-16B on 1 GPU generated 16+64 tokens in 3.10s ~ 20.61 tokens/sec
Do you have a comparison with the transformers float16 or bitsandbytes int8 version? I can't benchmark it myself.
While you are at it, you can pull the CTranslate2 model from here; it should take just 2-3 min to install plus a download, see:
https://huggingface.co/michaelfeil/ct2fast-codegen2-7B
Do we have any updates on this?