BFCL

The Berkeley Function Calling Leaderboard (BFCL) is the first comprehensive and executable function call evaluation dedicated to assessing Large Language Models' ability to invoke functions.

Models: 10
Top score: 88.5
Median: 69.1

BFCL evaluates serial and parallel function calls across multiple programming languages and interfaces (Python, Java, JavaScript, REST API) using a novel Abstract Syntax Tree (AST) evaluation method. The benchmark consists of over 2,000 question-function-answer pairs covering diverse application domains and complex use cases, including multiple function calls, parallel function calls, and multi-turn interactions.
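To give a rough idea of what AST-based checking can look like, here is a minimal sketch using Python's standard `ast` module. It is not BFCL's actual checker: the helper names `parse_call` and `calls_match`, the `get_weather` function, and the "allowed values per parameter" format are all assumptions made for illustration, and the sketch only handles a single Python call with keyword arguments.

```python
import ast


def parse_call(source: str):
    """Parse one Python call expression into (function name, keyword arguments)."""
    node = ast.parse(source.strip(), mode="eval").body
    if not isinstance(node, ast.Call):
        raise ValueError("expected a single function call")
    if node.args:
        raise ValueError("this sketch only supports keyword arguments")
    name = ast.unparse(node.func)  # requires Python 3.9+
    # literal_eval accepts AST nodes and rejects anything that is not a literal.
    kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
    return name, kwargs


def calls_match(model_output: str, expected_name: str, allowed: dict) -> bool:
    """Return True if the call names the expected function and every argument
    takes one of the allowed values (hypothetical answer format)."""
    try:
        name, kwargs = parse_call(model_output)
    except (SyntaxError, ValueError):
        return False  # unparsable or unsupported output counts as a miss
    if name != expected_name or set(kwargs) != set(allowed):
        return False
    return all(kwargs[key] in allowed[key] for key in allowed)


# Hypothetical expected answer: each parameter lists its acceptable values.
expected = {"city": ["Berkeley"], "unit": ["celsius", "C"]}
print(calls_match('get_weather(unit="C", city="Berkeley")', "get_weather", expected))
print(calls_match('get_weather(city="Oakland", unit="C")', "get_weather", expected))
```

Comparing the parsed structure rather than the raw string means argument order and whitespace do not affect the result, which is one reason an AST comparison is attractive for this kind of evaluation.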

State of the art over time

Each point is a model at its release date; the line traces the best score to date.

[Chart: score (y-axis, 50 to 100) by release date (x-axis, 2024 to 2025). Best to date: Llama 3.1 405B Instruct]

Llama 3.1 405B Instruct: 88.5 (2024-07-23)
Llama 3.1 70B Instruct: 84.8 (2024-07-23)
Llama 3.1 8B Instruct: 76.1 (2024-07-23)
Nova Pro: 68.4 (2024-11-20)
Nova Lite: 66.6 (2024-11-20)
Nova Micro: 56.2 (2024-11-20)
QwQ-32B: 66.4 (2025-03-05)
Qwen3 235B A22B: 70.8 (2025-04-28)
Qwen3 32B: 70.3 (2025-04-28)
Qwen3 30B A3B: 69.1 (2025-04-28)
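The "best score to date" line can be reproduced as a running maximum over release dates. The sketch below uses the scores and dates listed on this page; the function name `best_to_date` and the choice to collapse models released on the same day into one point are assumptions, not a description of how the chart is actually drawn.

```python
from datetime import date

# (model, score, release date) as listed on this page.
points = [
    ("Llama 3.1 405B Instruct", 88.5, date(2024, 7, 23)),
    ("Llama 3.1 70B Instruct", 84.8, date(2024, 7, 23)),
    ("Llama 3.1 8B Instruct", 76.1, date(2024, 7, 23)),
    ("Nova Pro", 68.4, date(2024, 11, 20)),
    ("Nova Lite", 66.6, date(2024, 11, 20)),
    ("Nova Micro", 56.2, date(2024, 11, 20)),
    ("QwQ-32B", 66.4, date(2025, 3, 5)),
    ("Qwen3 235B A22B", 70.8, date(2025, 4, 28)),
    ("Qwen3 32B", 70.3, date(2025, 4, 28)),
    ("Qwen3 30B A3B", 69.1, date(2025, 4, 28)),
]


def best_to_date(points):
    """Return one (release date, best score so far) pair per distinct date."""
    frontier = []
    best = float("-inf")
    for release in sorted({d for _, _, d in points}):
        day_best = max(score for _, score, d in points if d == release)
        best = max(best, day_best)
        frontier.append((release, best))
    return frontier


for release, best in best_to_date(points):
    print(release.isoformat(), best)
```

With this data the line is flat at 88.5 from the first release date onward, since no later model exceeds the Llama 3.1 405B Instruct score.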

Ranking

1. Llama 3.1 405B Instruct: 88.5
2. Llama 3.1 70B Instruct: 84.8
3. Llama 3.1 8B Instruct: 76.1
4. Qwen3 235B A22B: 70.8
5. Qwen3 32B: 70.3
6. Qwen3 30B A3B: 69.1
7. Nova Pro: 68.4
8. Nova Lite: 66.6
9. QwQ-32B: 66.4
10. Nova Micro: 56.2

Related Agents benchmarks