
Comments (9)

moyix commented on April 30, 2024

Can't tell if the model architecture has changed – any idea?


gilsdav commented on April 30, 2024

I don't know if this is really what you want to know, but as far as I can see you can use it the same way (causal), and they added an infill mode. It also seems to support a lot more languages.

For infill sampling, we introduce three new special token types:

<mask_N>: the N-th span to be masked. In practice, insert <mask_1> where you want to sample an infill.

<sep>: separator token between the suffix and the infilled sample. See below.

<eom>: "End-Of-Mask" token that the model will output at the end of infilling. You may use this token to truncate the output.

Supporting it is probably possible with a single modification of setup.sh.
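As a concrete illustration, here is a minimal infill sketch using these tokens, assuming the prompt layout from the CodeGen2 model card (prefix, <mask_1>, suffix, then <sep><mask_1> to trigger generation of the masked span); the prefix/suffix strings are just placeholders:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen2-1B", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("Salesforce/codegen2-1B", trust_remote_code=True)

prefix = "def greet(name):\n"
suffix = "\n    return greeting"
# Mark the span to infill, then ask the model to complete <mask_1> after <sep>.
prompt = prefix + "<mask_1>" + suffix + "<|endoftext|>" + "<sep>" + "<mask_1>"

inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=64)
generated = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:])
# Truncate at the "End-Of-Mask" token.
infill = generated.split("<eom>")[0]
print(infill)
```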


spew commented on April 30, 2024

@moyix how did you come up with the calculations & code in codegen_gptj_convert.py? It seems like this conversion from CodeGen to GPT-J is the most difficult part of supporting a new model type.

I was able to modify the python backend to support bigcode/starcoder. It's obviously really slow because we are just loading the model via the transformers library in the python backend (are we sure that is the right way to implement the python backend?). I got fairly far along with the FasterTransformer conversion but stopped when I saw the math in codegen_gptj_convert.py. I haven't tried just removing it to see whether the GPTBigCode -> GPT-J conversion is simpler than CodeGen -> GPT-J.
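For reference, a bare-bones python-backend model.py along those lines might look like the sketch below. This is only an illustration: the tensor names "prompt" and "completion" are hypothetical and would have to match the model's config.pbtxt, not FauxPilot's actual schema.

```python
# model.py for a Triton python backend that wraps a transformers model.
import numpy as np
import triton_python_backend_utils as pb_utils
from transformers import AutoModelForCausalLM, AutoTokenizer

class TritonPythonModel:
    def initialize(self, args):
        self.tokenizer = AutoTokenizer.from_pretrained("bigcode/starcoder")
        self.model = AutoModelForCausalLM.from_pretrained(
            "bigcode/starcoder", device_map="auto"
        )

    def execute(self, requests):
        responses = []
        for request in requests:
            # "prompt" / "completion" are hypothetical tensor names.
            prompt = pb_utils.get_input_tensor_by_name(request, "prompt")
            text = prompt.as_numpy()[0].decode("utf-8")
            inputs = self.tokenizer(text, return_tensors="pt").to(self.model.device)
            output = self.model.generate(**inputs, max_new_tokens=64)
            completion = self.tokenizer.decode(output[0], skip_special_tokens=True)
            out_tensor = pb_utils.Tensor(
                "completion", np.array([completion.encode("utf-8")], dtype=object)
            )
            responses.append(pb_utils.InferenceResponse(output_tensors=[out_tensor]))
        return responses
```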


michaelfeil commented on April 30, 2024

@moyix Let me shed some light on this.
After comparing the configuration CodeGen2 (trust_remote_code=True) uses against CodeGen1, I found one obvious hyperparameter change: mp_num = 8 instead of mp_num = 4.
After some hours of debugging, I reverse-engineered the following tweak to the permutation order, which should do the job.

Below is an explanation of the sizes for Salesforce/codegen2-1B.

1B: qkv_proj has shape [6144, other].
The 6144 rows pack all of Q, V, and K across mp_num = 8 shards, i.e. 3 * mp_num = 24 blocks of 256 rows each, which need to end up in an [8, 256] layout per projection.
Toy example: if qkv_proj were just np.arange, i.e. qkv_proj[:, 0] = np.arange(0, 6144), it would go through this transformation. The query projection gets

qw = tensor([[   0.,  768., 1536., 2304., 3072., 3840., 4608., 5376.],
             [...254 rows missing],
             [ 255., 1023., 1791., 2559., 3327., 4095., 4863., 5631.]])

the value projection gets

qv = tensor([[ 256., 1024., 1792., 2560., 3328., 4096., 4864., 5632.],
             [...254 rows missing],
             [ 511., 1279., 2047., 2815., 3583., 4351., 5119., 5887.]])

and the rest goes to the key projection.
The generalized permutation vector is therefore:
```python
import numpy as np

mp_num = 8  # CodeGen2; CodeGen1 used mp_num = 4
base_permutation = np.arange(0, mp_num * 3).reshape(-1, 3).T.flatten().tolist()
assert base_permutation == [0, 3, 6, 9, 12, 15, 18, 21,
                            1, 4, 7, 10, 13, 16, 19, 22,
                            2, 5, 8, 11, 14, 17, 20, 23]
```

All you need to do is to make the permutation configurable.
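For completeness, here is a small numpy sketch (the variable names are mine) showing how the block permutation above expands into row indices and reproduces the toy qw/qv values, assuming the same block layout as codegen_gptj_convert.py:

```python
import numpy as np

mp_num = 8                              # CodeGen2 (CodeGen1 used 4)
total_rows = 6144                       # qkv_proj rows for codegen2-1B
local_dim = total_rows // (3 * mp_num)  # 256 rows per block

base_permutation = np.arange(mp_num * 3).reshape(-1, 3).T.flatten()
# Expand the 24-block permutation into 6144 row indices.
permutation = np.concatenate(
    [np.arange(i * local_dim, (i + 1) * local_dim) for i in base_permutation]
)

toy = np.arange(total_rows)             # stand-in for qkv_proj[:, 0]
q, v, k = toy[permutation].reshape(3, mp_num, local_dim)
print(q[:, 0])  # [   0  768 1536 2304 3072 3840 4608 5376], matches qw above
print(v[:, 0])  # [ 256 1024 1792 2560 3328 4096 4864 5632], matches qv above
```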

Anyhow, the Triton server is not really performant compared to CTranslate2 (https://github.com/OpenNMT/CTranslate2). CTranslate2 can also do batching, and there is no need to pad inputs to certain shapes in the FastAPI proxy. (CTranslate2 codegen2 on int8 CPU is around 4.1x faster and takes ~4x less memory than huggingface codegen2.)

I'll try to add CodeGen1 and CodeGen2 models in all sizes for the CTranslate2 framework; stay tuned.
https://github.com/OpenNMT/CTranslate2/pull/1230/files


moyix commented on April 30, 2024

Oops, really sorry that I didn't see this before you figured it out on your own. I wrote up an article explaining how the permutation was derived here:

https://gist.github.com/moyix/7896575befbe1b99162ccfec8d135566

I'll look into CTranslate2 – are there gains over FT when using GPUs for inference?


michaelfeil commented on April 30, 2024

Not sure about FT. On my GPU:
Task: input 16 tokens -> generate exactly 64 tokens, ten times.
Timings:

  • ct2 codegen2-7B on float16 = 9.55 s (67 tokens/s, 1x GPU, 7 GB VRAM)
  • huggingface codegen2-7B on int8 = 17.06 s (37.5 tokens/s, 1x GPU, 7 GB VRAM)

For the smaller models (2B etc.) it should be more like a 3x speedup; for the large models the tensors are big enough that the C++ implementation helps less: for 16B it is more like 1.5x. Most importantly, ct2 takes only half the memory. I am not sure about the speed of FT (I think you used to report a ~2x speedup).

Edit: I found another of your markdown posts, which helped me to derive the codegen2 conversion.


moyix commented on April 30, 2024

Here are some benchmarks for codegen2 on FasterTransformer. This is with A6000s.

codegen2-1B   on 4 GPUs generated 16+64 tokens in  0.18s ~349.40 tokens/sec
codegen2-1B   on 2 GPUs generated 16+64 tokens in  0.19s ~337.68 tokens/sec
codegen2-1B   on 1 GPU  generated 16+64 tokens in  0.25s ~253.55 tokens/sec
codegen2-3_7B on 4 GPUs generated 16+64 tokens in  0.29s ~220.00 tokens/sec
codegen2-3_7B on 2 GPUs generated 16+64 tokens in  0.43s ~148.69 tokens/sec
codegen2-3_7B on 1 GPU  generated 16+64 tokens in  0.73s ~ 87.47 tokens/sec
codegen2-7B   on 4 GPUs generated 16+64 tokens in  0.51s ~125.93 tokens/sec
codegen2-7B   on 2 GPUs generated 16+64 tokens in  0.80s ~ 80.26 tokens/sec
codegen2-7B   on 1 GPU  generated 16+64 tokens in  1.38s ~ 46.26 tokens/sec
codegen2-16B  on 4 GPUs generated 16+64 tokens in  0.99s ~ 64.97 tokens/sec
codegen2-16B  on 2 GPUs generated 16+64 tokens in  1.68s ~ 38.13 tokens/sec
codegen2-16B  on 1 GPU  generated 16+64 tokens in  3.10s ~ 20.61 tokens/sec


michaelfeil commented on April 30, 2024

Do you have a comparison with the transformers float16 or bitsandbytes int8 version? I can't benchmark it myself.

While you are at it, you can pull the CTranslate2 model from here; it should just take 2-3 min to install plus a download, see:
https://huggingface.co/michaelfeil/ct2fast-codegen2-7B
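As a quick usage sketch (the local directory name "ct2fast-codegen2-7B" is a placeholder for wherever you download the converted weights from the repo above):

```python
import ctranslate2
from transformers import AutoTokenizer

# Hypothetical local path to the downloaded CTranslate2 weights.
generator = ctranslate2.Generator("ct2fast-codegen2-7B", device="cuda", compute_type="float16")
tokenizer = AutoTokenizer.from_pretrained("michaelfeil/ct2fast-codegen2-7B", trust_remote_code=True)

prompt = "def fibonacci(n):"
tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(prompt))
results = generator.generate_batch([tokens], max_length=64, include_prompt_in_result=False)
print(tokenizer.decode(results[0].sequences_ids[0]))
```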


batuhanfaik commented on April 30, 2024

Do we have any updates on this?

