Comments (4)
Hi @danieljannai21,
Thank you for your attention and thanks for flagging this! Sorry for the delayed reply; we were a bit busy recently :/
- Regarding your first and third points: Yes, this is an oversight on our end. We do realise that some of the prompts and possible answers in the executable evaluation datasets have bugs. We plan to go through them today. Expect a patch update on this very soon :)
- Regarding your second point: Our current evaluation pipeline doesn't enforce type constraints for the executable test category; we only do type-checking for the AST category. For the executable category, we only care about the execution result of the function call. As long as the result matches the expected one, the call is counted as correct.
Best,
BFCL Team
from gorilla.
Hi @HuanzhiMao,
Thank you for your answer!
- Regarding my second point, I still don't understand: even in AST evaluation, why would we want to enforce typing in the case of an integer that's represented as a float? Wouldn't it be 100% valid to use an int in that case, as the conversion can be done seamlessly without any information loss?
- Also, not sure if it's something you addressed in your data fixes, but I noted that in the Java category (and possibly other ones as well, but I didn't read everything thoroughly), the questions are formatted as "How do I ?" instead of just "", so many models tend to output a complete explanation (that also includes the tool invocation as part of it), instead of just outputting the tool invocation alone, which can be more easily parsed and evaluated.
Hi @danieljannai21 ,
Good question. Let me give you an example where type matters.
Let's say the function doc asks for an integer-type argument x, and the source code of that function is defined as:
    def func(x: int):
        for i in range(1, 100 // x):
            print(i)
Here, if you follow the function doc and do func(5), it would work fine. But if you give func(5.0), you will get the following error:
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "<stdin>", line 2, in func
    TypeError: 'float' object cannot be interpreted as an integer
Since the model doesn't have access to the function source code, the model cannot safely assume that it's okay to mess up the type.
Moreover, in programming languages like Java, the compiler would error if you provide a float value when the argument type is an int. It's a compile-time type error, so the call never runs.
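To make the difference between the two kinds of check concrete, here is a minimal sketch. The helper names (strict_type_ok, result_ok, count_steps) are hypothetical and are not taken from the BFCL code base; they only illustrate a strict type check next to a result-only check.

```python
def strict_type_ok(value, expected_type):
    """AST-style check (hypothetical): the value must be exactly the documented type."""
    # type(...) is compared directly so that True is not accepted as an int.
    return type(value) is expected_type


def result_ok(func, kwargs, expected_result):
    """Executable-style check (hypothetical): only the returned value is compared."""
    try:
        return func(**kwargs) == expected_result
    except Exception:
        # Any runtime failure, such as the TypeError above, counts as a wrong answer.
        return False


def count_steps(x: int) -> int:
    # Same shape as func above, but it returns a value instead of printing.
    return len(range(1, 100 // x))


print(strict_type_ok(5, int), strict_type_ok(5.0, int))
print(result_ok(count_steps, {"x": 5}, 19), result_ok(count_steps, {"x": 5.0}, 19))
```

In this sketch the float argument fails both checks, but for different reasons: the strict check rejects it up front, while the result-only check rejects it only because this particular function happens to raise.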
Regarding your second concern:
Thank you for pointing out how we format the prompt. We do observe questions that start with "How do I?" or similar, which ask a question rather than give a tool-use instruction.
Here are 2 scenarios:
- When the model has native function-calling capability, we expect it to provide function calls in an "easy-to-extract" fashion (e.g. JSON format, or direct function call format) if it decides to make a call. For example, Claude-3 function calling has a "thinking" component that's independent of the tool use section. Therefore, when evaluating a function-calling model, we expect the tool invocation to be concise and convenient to extract.
- When we prompt the model to make a function call, we specifically format the prompt ("Should you decide to return the function call(s), Put it in the format of [func1(params_name=params_value, params_name2=params_value2...), func2(params)] \n NO other text MUST be included.") such that no explanation should be provided if the model follows the instructions.
With the above, if a model outputs a complete explanation, it is either not a user-friendly function-calling model or it is not following prompting instructions carefully.
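As an illustration of why the bracketed call format quoted above is easy to extract, here is a rough sketch that parses such an output with Python's standard ast module. The function parse_calls and the example call names are hypothetical; this is not the parser BFCL actually uses.

```python
import ast


def parse_calls(text):
    """Parse '[f(a=1), g(b="x")]' into a list of (name, kwargs) tuples."""
    tree = ast.parse(text.strip(), mode="eval")  # raises SyntaxError on free-form prose
    if not isinstance(tree.body, ast.List):
        raise ValueError("expected a list of function calls")
    calls = []
    for node in tree.body.elts:
        if not isinstance(node, ast.Call) or node.args:
            raise ValueError("expected calls with keyword arguments only")
        if any(kw.arg is None for kw in node.keywords):
            raise ValueError("**kwargs unpacking is not supported")
        name = ast.unparse(node.func)  # Python 3.9+; also handles dotted names
        # literal_eval only accepts literals, so no model-generated code is executed.
        kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
        calls.append((name, kwargs))
    return calls


print(parse_calls("[get_weather(city='Paris', days=3), add(a=1, b=2.5)]"))
```

An output that wraps the call in an explanation ("Sure! Here is the call: ...") fails at the ast.parse step, which is the situation described above.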
We would love to have questions rephrased more diversely in terms of tone to simulate real-world tool usage better, but that will possibly be in our future release.
Thanks, @HuanzhiMao!
- I agree that the model should be able to output a function call in a parsable manner, but I don't think that's what you're checking in the "How do I ..." questions. The phrasing of these questions doesn't suggest that the user wants the model to call a function, but asks it how to do something, in which case the model can tell the user to call the relevant function. In my opinion, phrasing the questions as "Please do ..." would yield much better results for most models, and would better evaluate their function calling capabilities. Also, the "NO other text MUST be included" instruction is only added for the prompted models, and not for the FC models.
- Another thing I noticed is that there are cases where the gold answer contains a parameter that doesn't appear in the description of the function (for example, the price parameter doesn't exist in the description of the book_room function in question 46 of the execution_multiple_function category). I would suggest running an automatic validation that compares the function's documentation in the tools section, the way it's invoked in the gold answer, and the actual implementation (if it exists), and makes sure they're all consistent. Shouldn't be that hard to implement.
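A rough sketch of what such a validation could look like follows. The helper find_inconsistencies, the JSON-schema-like layout of the tool description, and the room_type and nights parameters are all assumptions made for illustration; only book_room and price come from the example above, and the real dataset format may differ.

```python
import inspect


def find_inconsistencies(tool_doc, gold_kwargs, implementation=None):
    """Return human-readable problems; an empty list means everything agrees."""
    # Assumed layout: tool_doc["parameters"]["properties"] maps names to schemas.
    documented = set(tool_doc["parameters"]["properties"])
    required = set(tool_doc["parameters"].get("required", []))
    problems = []
    for name in sorted(set(gold_kwargs) - documented):
        problems.append(f"gold answer uses undocumented parameter '{name}'")
    for name in sorted(required - set(gold_kwargs)):
        problems.append(f"gold answer omits required parameter '{name}'")
    if implementation is not None:
        implemented = set(inspect.signature(implementation).parameters)
        for name in sorted(documented ^ implemented):
            problems.append(f"parameter '{name}' differs between doc and implementation")
    return problems


def book_room(room_type, nights, price):  # stand-in implementation
    pass


doc = {
    "name": "book_room",
    "parameters": {
        "properties": {"room_type": {}, "nights": {}},
        "required": ["room_type", "nights"],
    },
}
gold = {"room_type": "deluxe", "nights": 2, "price": 100}
for problem in find_inconsistencies(doc, gold, book_room):
    print(problem)
```

Run over every entry in a category, a check along these lines would flag the price case without anyone having to read each question by hand.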
Thanks again for the wonderful work you guys are doing.