HumanEval
Functional correctness on 164 hand-written Python programming problems.
Models: 68
Top score: 95
Median: 85.4
State of the art over time
Each point is a model at its release date; the line traces the best score to date.
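The "best score to date" line can be read as a running maximum over models sorted by release date. A minimal sketch of that computation follows, assuming each model is a (release date, score) pair; the `points` values are made-up placeholders rather than leaderboard data, and `best_to_date` is a hypothetical helper name, not part of this page.

```python
from datetime import date
from itertools import accumulate

# Placeholder (release_date, score) pairs, purely illustrative.
points = [
    (date(2023, 1, 1), 50.0),
    (date(2021, 1, 1), 30.0),
    (date(2024, 1, 1), 40.0),
    (date(2022, 1, 1), 20.0),
]

def best_to_date(points):
    """Return (release_date, best score so far) pairs in date order."""
    ordered = sorted(points, key=lambda p: p[0])
    # accumulate with max yields the running maximum of the scores.
    running = accumulate((score for _, score in ordered), max)
    return [(day, best) for (day, _), best in zip(ordered, running)]

for day, best in best_to_date(points):
    print(day.isoformat(), best)
```

The line only rises when a newly released model beats every earlier one, so models that score below the current best appear as points under the line.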