BFCL
The Berkeley Function Calling Leaderboard (BFCL) is the first comprehensive and executable function call evaluation dedicated to assessing Large Language Models' ability to invoke functions. It evaluates serial and parallel function calls across several languages and interfaces (Python, Java, JavaScript, and REST APIs) using a novel Abstract Syntax Tree (AST) evaluation method. The benchmark consists of over 2,000 question-function-answer pairs covering diverse application domains and complex use cases, including multiple function calls, parallel function calls, and multi-turn interactions.
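To make the idea of AST-based evaluation concrete, here is a minimal sketch in Python. It is not BFCL's actual checker; it only illustrates the general technique of parsing a model's output into a syntax tree and comparing the function name and arguments against accepted values, rather than comparing raw strings. The helper names `parse_call` and `matches`, the `get_weather` function, and the "list of acceptable values per parameter" format are all assumptions made for illustration. The sketch handles keyword arguments only and treats every listed parameter as required.

```python
import ast


def parse_call(source: str) -> tuple[str, dict]:
    """Parse one Python-style call, e.g. f(a=1, b="x"), into (name, kwargs)."""
    tree = ast.parse(source.strip(), mode="eval")
    call = tree.body
    if not isinstance(call, ast.Call):
        raise ValueError("expected a single function call")
    if call.args:
        raise ValueError("this sketch only handles keyword arguments")
    kwargs = {}
    for kw in call.keywords:
        if kw.arg is None:  # a **mapping argument, not supported here
            raise ValueError("this sketch does not handle ** unpacking")
        # literal_eval accepts an AST node and rejects non-literal values
        kwargs[kw.arg] = ast.literal_eval(kw.value)
    # ast.unparse (Python 3.9+) also handles dotted names such as math.sqrt
    return ast.unparse(call.func), kwargs


def matches(source: str, expected_name: str, allowed: dict) -> bool:
    """Check a model's call against an expected name and allowed values.

    `allowed` maps each parameter name to a list of acceptable values
    (a hypothetical answer format chosen for this sketch).
    """
    try:
        name, kwargs = parse_call(source)
    except (SyntaxError, ValueError):
        return False  # unparsable or unsupported output counts as a miss
    if name != expected_name:
        return False
    if set(kwargs) != set(allowed):
        return False  # missing or unexpected parameters
    return all(kwargs[param] in allowed[param] for param in allowed)


# Hypothetical answer key: several surface forms may be acceptable.
answer = {"city": ["Berkeley", "Berkeley, CA"], "unit": ["celsius", "C"]}
ok = matches('get_weather(unit="C", city="Berkeley")', "get_weather", answer)
```

Because the comparison happens on the parsed structure, argument order and whitespace do not matter, and a malformed call is rejected instead of being partially matched.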
State of the art over time
Each point is a model at its release date; the line traces the best score to date.
Ranking
| Rank | Model | Score |
| --- | --- | --- |
| 1 | Llama 3.1 405B Instruct | 88.5 |
| 2 | Llama 3.1 70B Instruct | 84.8 |
| 3 | Llama 3.1 8B Instruct | 76.1 |
| 4 | Qwen3 235B A22B | 70.8 |
| 5 | Qwen3 32B | 70.3 |
| 6 | Qwen3 30B A3B | 69.1 |
| 7 | Nova Pro | 68.4 |
| 8 | Nova Lite | 66.6 |
| 9 | QwQ-32B | 66.4 |
| 10 | Nova Micro | 56.2 |