Agents

BFCL

Models: 10

Top score: 88.5

Median: 69.1

The Berkeley Function Calling Leaderboard (BFCL) is the first comprehensive, executable function-calling evaluation dedicated to assessing the ability of large language models (LLMs) to invoke functions. It evaluates serial and parallel function calls in Python, Java, and JavaScript, as well as REST API calls, using a novel Abstract Syntax Tree (AST) evaluation method. The benchmark consists of more than 2,000 question-function-answer pairs covering diverse application domains and complex use cases, including multiple function calls, parallel function calls, and multi-turn interactions.
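To make the idea of AST-based evaluation concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not BFCL's actual checker: the helper names `parse_call` and `call_matches`, the `get_weather` function, and the "list of acceptable values per parameter" answer format are all hypothetical. The sketch parses the model's output into a syntax tree instead of executing it, then compares the function name and keyword arguments against the expected answer.

```python
import ast


def parse_call(source: str):
    """Parse one Python-style call into (function name, keyword arguments)."""
    node = ast.parse(source.strip(), mode="eval").body
    if not isinstance(node, ast.Call):
        raise ValueError("expected a single function call")
    if node.args:
        # Simplifying assumption for this sketch: keyword arguments only.
        raise ValueError("positional arguments are not supported")
    name = ast.unparse(node.func)  # also handles dotted names (Python 3.9+)
    # literal_eval accepts AST nodes and rejects anything that is not a literal.
    kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
    return name, kwargs


def call_matches(model_output: str, expected_name: str, allowed_values: dict) -> bool:
    """Return True if the call names the expected function and every
    parameter carries one of its acceptable values (hypothetical format)."""
    try:
        name, kwargs = parse_call(model_output)
    except (SyntaxError, ValueError):
        return False  # unparsable or non-literal output counts as a miss
    if name != expected_name:
        return False
    if set(kwargs) != set(allowed_values):
        return False  # missing or unexpected parameters
    return all(kwargs[p] in allowed_values[p] for p in allowed_values)


# Hypothetical expected answer for a weather question.
expected = {"city": ["Paris", "Paris, France"], "unit": ["celsius", "C"]}

print(call_matches('get_weather(city="Paris", unit="celsius")', "get_weather", expected))
print(call_matches('get_weather(unit="C", city="Paris")', "get_weather", expected))
print(call_matches('get_weather(city="Paris")', "get_weather", expected))
```

Because the comparison works on the parsed structure, argument order and whitespace do not matter, and the generated call never has to be run. A production checker would also need to handle types, optional parameters, and several calls in one response, none of which this sketch attempts.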

State of the art over time

[Chart: each point is a model at its release date; the line traces the best score to date.]

Ranking

Rank	Model	Organization	Score
1	Llama 3.1 405B Instruct	Meta	88.5
2	Llama 3.1 70B Instruct	Meta	84.8
3	Llama 3.1 8B Instruct	Meta	76.1
4	Qwen3 235B A22B	Alibaba	70.8
5	Qwen3 32B	Alibaba	70.3
6	Qwen3 30B A3B	Alibaba	69.1
7	Nova Pro	Amazon	68.4
8	Nova Lite	Amazon	66.6
9	QwQ-32B	Alibaba	66.4
10	Nova Micro	Amazon	56.2
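
The summary figures at the top of the page (10 models, top score 88.5, median 69.1) can be rechecked from the ranking. The sketch below uses Python's standard `statistics` module on the scores copied from the table. How the page computes its median is not stated; as an observation, the listed 69.1 coincides with the lower median of the ten scores, whereas averaging the two middle scores gives 69.7.

```python
import statistics

# Scores copied from the ranking, in rank order.
scores = [88.5, 84.8, 76.1, 70.8, 70.3, 69.1, 68.4, 66.6, 66.4, 56.2]

model_count = len(scores)
top_score = max(scores)

# With an even number of entries there are two middle values (69.1 and 70.3).
lower_median = statistics.median_low(scores)       # smaller of the two
mean_median = round(statistics.median(scores), 1)  # average of the two

print(model_count, top_score, lower_median, mean_median)
# prints: 10 88.5 69.1 69.7
```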

Related Agents benchmarks

TAU-bench Retail	22
TAU-bench Airline	20
τ²-bench Airline	10
τ²-bench Retail	10
BrowseComp	10