Comments (10)

Frogley commented on August 21, 2024

I'd like to know how to use streaming output in ChatCompletion mode.

Hey mate, the main contributor to this library is British and the participants come from all over the world; most of them do not understand Chinese. To make discussion easier, please use English. You can write in Chinese first and then run it through translation software before posting.

Also, your question is unrelated to the topic of this issue. The right way to proceed is to start a new thread in Discussions.

from openai.

tianqingshuilan commented on August 21, 2024

Is your feature request related to a problem? Please describe.
When I use the tokenizer to count tokens for Chinese characters, the result is not correct.

Describe the solution you'd like
A correct token count for Chinese characters.

Describe alternatives you've considered
N/A

Additional context
N/A

When TokenizerGpt3 counts the tokens in Chinese text, the result does not match the count produced by Python's tiktoken library. Sample code:

# Python
import tiktoken

model = "gpt-3.5-turbo-0301"
chiStr = "床前明月光,疑是地上霜,举头望明月,低头思故乡。"
encoding = tiktoken.encoding_for_model(model)
print(len(encoding.encode(chiStr)))
# output: 34

// C#
Console.WriteLine($"{TokenizerGpt3.Encode("床前明月光,疑是地上霜,举头望明月,低头思故乡。").Count}");
// output: 49

I don't know the reason for this difference. The web Tokenizer provided by OpenAI also gives a problematic count.

kayhantolga commented on August 21, 2024

sorry but I do not know how I can do the calculation with Chinese characters :/

tianqingshuilan commented on August 21, 2024

sorry but I do not know how I can do the calculation with Chinese characters :/

I found a project called "TiktokenSharp" on GitHub, and the value it calculates is correct. I haven't read the core code, but I think you can use it as a reference.

Frogley commented on August 21, 2024

The reason is encoding. According to openai-cookbook, text-davinci-003 uses p50k_base while gpt-3.5-turbo uses cl100k_base.

  • cl100k_base: gpt-4, gpt-3.5-turbo, text-embedding-ada-002
  • p50k_base: Codex models, text-davinci-002, text-davinci-003
  • r50k_base (or gpt2): GPT-3 models like davinci

I tested tiktoken in Python and the results match what @tianqingshuilan reported.

>>> import tiktoken
>>> encoding = tiktoken.encoding_for_model("gpt-3.5-turbo-0301")
>>> print(len(encoding.encode("床前明月光,疑是地上霜,举头望明月,低头思故乡。")))
34
>>> encoding = tiktoken.encoding_for_model("text-davinci-003")
>>> print(len(encoding.encode("床前明月光,疑是地上霜,举头望明月,低头思故乡。")))
49

Currently, TokenizerGpt3 only supports p50k_base. Maybe we need TokenizerGpt4 to support cl100k_base. Referring to TiktokenSharp, cl100k_base has different regular expressions, MergeableRanks, and SpecialTokens compared to p50k_base.
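To make the mapping above concrete, here is a minimal sketch of a model-to-encoding lookup. The names `ENCODING_BY_MODEL_PREFIX`, `encoding_name_for`, and `same_encoding` are hypothetical, and the table contains only the models listed in this comment; it is not the table that tiktoken or TokenizerGpt3 actually ships. Codex models are left out because no concrete model names are given here.

```python
# Hypothetical lookup built only from the model list quoted in this thread.
ENCODING_BY_MODEL_PREFIX = {
    "gpt-4": "cl100k_base",
    "gpt-3.5-turbo": "cl100k_base",
    "text-embedding-ada-002": "cl100k_base",
    "text-davinci-002": "p50k_base",
    "text-davinci-003": "p50k_base",
    "davinci": "r50k_base",
}


def encoding_name_for(model: str) -> str:
    """Return the encoding name for a model, using the longest matching prefix."""
    matches = [p for p in ENCODING_BY_MODEL_PREFIX if model.startswith(p)]
    if not matches:
        raise KeyError(f"no encoding known for model {model!r}")
    return ENCODING_BY_MODEL_PREFIX[max(matches, key=len)]


def same_encoding(model_a: str, model_b: str) -> bool:
    """Token counts are only comparable when both models share an encoding."""
    return encoding_name_for(model_a) == encoding_name_for(model_b)


print(encoding_name_for("gpt-3.5-turbo-0301"))  # cl100k_base
print(encoding_name_for("text-davinci-003"))    # p50k_base
print(same_encoding("gpt-3.5-turbo-0301", "text-davinci-003"))  # False
```

A check like `same_encoding` is one way a tokenizer wrapper could refuse to report a count for a model whose encoding it does not implement, instead of silently returning a p50k_base count.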

tianqingshuilan commented on August 21, 2024

Wow, thanks for your answer, it gave me a deeper understanding of the differences between tokens in each version.
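A small sketch of why Chinese text is sensitive to the choice of encoding, assuming (as is the case for tiktoken's encodings) that tokenization operates on UTF-8 bytes. The helper name `utf8_profile` is hypothetical. Each of these Chinese characters takes three bytes, so the token count depends heavily on how many multi-byte merges a given vocabulary contains.

```python
def utf8_profile(text: str) -> dict:
    """Report character count and UTF-8 byte count for a string.

    For a byte-level BPE tokenizer, the byte count is an upper bound on
    the number of tokens, since every token covers at least one byte.
    """
    raw = text.encode("utf-8")
    return {"chars": len(text), "utf8_bytes": len(raw)}


print(utf8_profile("床前明月光"))  # {'chars': 5, 'utf8_bytes': 15}
print(utf8_profile("moon"))       # {'chars': 4, 'utf8_bytes': 4}
```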

vonwell commented on August 21, 2024

I'd like to know how to use streaming output in ChatCompletion mode.

zhaonaiqiu commented on August 21, 2024

I'd like to know how to use streaming output in ChatCompletion mode.

Is there any difference between Chat and Completion? Could someone explain?

kayhantolga commented on August 21, 2024

@Frogley, I completely agree with your suggestions. 🙏🏻

To everyone contributing here, let's continue to foster a global, inclusive environment by using English in our discussions. For those who might not be fluent in English, please feel free to use translation software or AI services like ChatGPT to assist you. We greatly appreciate your understanding and cooperation.

Also, I'd like to second Frogley's point about staying on topic. If you have an issue or question that's not related to the current thread, please create a new issue or start a new thread in Discussions. This will help keep our discussions focused and will make it easier for everyone to track and respond to specific issues.

Thanks everyone for your cooperation!

lanvada commented on August 21, 2024

You can use SharpToken
https://github.com/dmitry-brazhenko/SharpToken
