Leaderboard evaluations issues about gorilla HOT 4 CLOSED

danieljannai21 commented on June 12, 2024 1

Leaderboard evaluations issues

from gorilla.

Comments (4)

HuanzhiMao commented on June 12, 2024 2

Hi @danieljannai21,

Thank you for your attention and thanks for flagging this! Sorry for the delayed reply; we were a bit busy recently :/

Regarding your first and third points: Yes, this is an oversight on our end. We do realise that some of the prompts and possible answers in the executable evaluation datasets have bugs. We plan to go through them today. Expect a patch update on this very soon :)
Regarding your second point: Our current evaluation pipeline doesn't enforce type constraints for the executable test category; we only do type-checking for the AST category. For the executable category, we only care about the execution result of the function call. As long as the result matches the expected one, the answer is counted as correct.
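To make the distinction concrete, here is a minimal sketch of the two kinds of check. It is not the BFCL implementation: the tool `add` and the helpers `executable_check` and `ast_style_check` are hypothetical names invented for this illustration.

```python
def add(a, b):
    # Hypothetical tool implementation, used only for this illustration.
    return a + b


def executable_check(func, model_kwargs, expected_result):
    # Executable-style check: run the call and compare only the result.
    return func(**model_kwargs) == expected_result


def ast_style_check(model_kwargs, expected_types):
    # AST-style check: each documented argument must be present
    # and have exactly the documented type.
    return all(
        name in model_kwargs and type(model_kwargs[name]) is expected
        for name, expected in expected_types.items()
    )


int_call = {"a": 2, "b": 3}
float_call = {"a": 2.0, "b": 3}
doc_types = {"a": int, "b": int}

print(executable_check(add, int_call, 5))      # True
print(executable_check(add, float_call, 5))    # True, because 5.0 == 5 in Python
print(ast_style_check(int_call, doc_types))    # True
print(ast_style_check(float_call, doc_types))  # False, because 2.0 is a float
```

Under these assumptions, the float-valued call passes the result-based check but fails the type-based one, which is the behaviour described above.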

Best,
BFCL Team


danieljannai21 commented on June 12, 2024

Hi @HuanzhiMao,
Thank you for your answer!

Regarding my second point, I still don't understand: even in AST evaluation, why would we want to enforce typing in the case of an integer that's represented as a float? Wouldn't it be 100% valid to use an int in that case, as the conversion can be done seamlessly without any information loss?
Also, not sure if it's something you addressed in your data fixes, but I noted that in the Java category (and possibly other ones as well, but I didn't read everything thoroughly), the questions are formatted as "How do I ?" instead of just "", so many models tend to output a complete explanation (that also includes the tool invocation as part of it), instead of just outputting the tool invocation alone, which can be more easily parsed and evaluated.


HuanzhiMao commented on June 12, 2024

Hi @danieljannai21 ,

Good question. Let me give you an example where type matters.
Let's say the function doc asks for an integer-type argument x, and the source code of that function is defined as:

def func(x: int):
    for i in range(1, 100 // x):
        print(i)

Here, if you follow the function doc and do func(5), it would work fine. But if you give func(5.0), you will get the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 2, in func
TypeError: 'float' object cannot be interpreted as an integer

Since the model doesn't have access to the function source code, the model cannot safely assume that it's okay to mess up the type.
Moreover, in statically typed languages like Java, the compiler rejects a float value when the parameter type is an int. The code simply doesn't compile.
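For anyone who wants to reproduce the Python case, here is a self-contained variant of the example above. The only change is that `func` returns the values instead of printing them, so the result is easy to inspect. Note that the `x: int` annotation is not enforced at runtime; the failure comes from `range`, not from the annotation.

```python
def func(x: int):
    # Same loop bounds as the example above, collected into a list.
    return list(range(1, 100 // x))


print(len(func(5)))  # 19, since 100 // 5 == 20 and range(1, 20) has 19 items

try:
    func(5.0)  # 100 // 5.0 == 20.0, which range() refuses
except TypeError as exc:
    print(exc)  # 'float' object cannot be interpreted as an integer
```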

Regarding your second concern:
Thank you for pointing out how we format the prompt. We do observe questions starting with "How do I?" or similar, which ask a question rather than give a tool-use instruction.
Here are 2 scenarios:

When the model possesses the capacity to make function calls, we expect it to provide function calls in an "easy-to-extract" fashion (i.e. JSON format, or direct function call format) if it decides to make one. For example, Claude-3 function calling has a "thinking" component that's independent of the tool use section. Therefore, when evaluating a function-calling model, we expect the tool invocation to be concise and convenient to extract.
When we prompt the model to make a function call, we specifically format the prompt (Should you decide to return the function call(s), Put it in the format of [func1(params_name=params_value, params_name2=params_value2...), func2(params)] \n NO other text MUST be included.) such that no explanation is provided if the model follows the instructions.

With the above, if a model outputs a complete explanation, it is either not a user-friendly function-calling model or it is not following prompting instructions carefully.
We would love to have questions rephrased more diversely in terms of tone to simulate real-world tool usage better, but that will possibly be in our future release.
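As an illustration of why the bracketed format is easy to extract, here is a minimal sketch of a parser for it built on Python's standard `ast` module (Python 3.9+ for `ast.unparse`). This is an assumption about how such output could be parsed, not the parser the leaderboard uses; `parse_calls` and `get_weather` are hypothetical names.

```python
import ast


def parse_calls(output: str):
    """Parse '[f(a=1), g(b="x")]' into a list of (name, kwargs) tuples."""
    tree = ast.parse(output.strip(), mode="eval")
    if not isinstance(tree.body, ast.List):
        raise ValueError("expected a bracketed list of calls")
    calls = []
    for node in tree.body.elts:
        if not isinstance(node, ast.Call) or node.args:
            raise ValueError("expected keyword-only function calls")
        name = ast.unparse(node.func)  # also handles dotted names
        # literal_eval accepts only literals, so no model output is executed.
        kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
        calls.append((name, kwargs))
    return calls


print(parse_calls('[get_weather(city="Paris", days=3)]'))
# [('get_weather', {'city': 'Paris', 'days': 3})]

try:
    parse_calls("To do this, call get_weather(city='Paris').")
except (SyntaxError, ValueError) as exc:
    print("unparseable:", type(exc).__name__)  # unparseable: SyntaxError
```

With a parser along these lines, any explanatory text around the call makes the whole output unparseable, which is why extra prose counts against the model.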


danieljannai21 commented on June 12, 2024

Thanks, @HuanzhiMao!

I agree that the model should be able to output a function call in a parsable manner, but I don't think that's what you're checking in the "How do I ..." questions. The phrasing of these questions doesn't suggest that the user wants the model to call a function; it asks how to do something, in which case the model can tell the user to call the relevant function. In my opinion, phrasing the questions as "Please do ..." would yield much better results for most models, and would better evaluate their function calling capabilities. Also, the "NO other text MUST be included" instruction is only added for the prompted models, not the FC models.
Another thing I noticed is that there are cases where the gold answer contains a parameter that doesn't appear in the description of the function (for example, the price parameter doesn't exist in the description of the book_room function in question 46 of the execution_multiple_function category). I would suggest running an automatic validation that compares the function's documentation in the tools section, the way it's invoked in the gold answer, and the actual implementation (if it exists), and makes sure they're all consistent. It shouldn't be that hard to implement.
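A rough sketch of what such a validation could look like, assuming the tool documentation is a JSON-schema-style dict and the gold answer is a single call string. The schema layout, the helper names, and the `room_type`/`nights` parameters are all made up for illustration and are not copied from the dataset.

```python
import ast
import inspect


def undocumented_params(tool_doc: dict, gold_call: str) -> set:
    """Return keyword names used in gold_call that the tool doc does not list."""
    call = ast.parse(gold_call, mode="eval").body
    if not isinstance(call, ast.Call):
        raise ValueError("gold answer is not a single function call")
    used = {kw.arg for kw in call.keywords}
    documented = set(tool_doc["parameters"]["properties"])
    return used - documented


def unimplemented_params(func, tool_doc: dict) -> set:
    """Return documented parameters that the implementation does not accept."""
    accepted = set(inspect.signature(func).parameters)
    return set(tool_doc["parameters"]["properties"]) - accepted


# Invented example data: the doc omits `price`, the gold answer uses it.
tool_doc = {
    "name": "book_room",
    "parameters": {
        "properties": {
            "room_type": {"type": "string"},
            "nights": {"type": "integer"},
        }
    },
}
gold_call = 'book_room(room_type="deluxe", nights=2, price=150.0)'


def book_room(room_type, nights, price):
    # Stand-in implementation for the consistency check.
    return {"room_type": room_type, "nights": nights, "price": price}


print(undocumented_params(tool_doc, gold_call))   # {'price'}
print(unimplemented_params(book_room, tool_doc))  # set()
```

Running both helpers over every entry in a test category would flag mismatches like the one described above before they reach the leaderboard.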

Thanks again for the wonderful work you guys are doing.

