Evaluation

Evaluating Large Language Models Trained on Code

OpenAI · July 7, 2021

Mark Chen, Jerry Tworek, Heewoo Jun

TL;DR

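Introduces Codex, a GPT model fine-tuned on public GitHub code (a production version powers GitHub Copilot), and HumanEval, a benchmark of 164 hand-written Python problems that scores generated code by whether it passes unit tests.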

Helped kick off the AI coding-assistant era and gave the field a widely adopted benchmark for code generation.
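
For context on how HumanEval results are usually reported: the paper measures functional correctness with pass@k, the probability that at least one of k sampled completions passes a problem's unit tests. Given n samples per problem of which c pass, it uses the estimator 1 - C(n-c, k) / C(n, k), averaged over problems. The following is a minimal sketch of that estimator; the function names `pass_at_k` and `mean_pass_at_k` and the `(n, c)` input format are illustrative assumptions, not the paper's released code.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: total completions sampled for the problem
    c: number of those completions that passed the unit tests
    k: sample budget being evaluated (must satisfy 1 <= k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # Fewer than k failing samples: every size-k subset contains a pass.
    if n - c < k:
        return 1.0
    # 1 minus the fraction of size-k subsets made up entirely of failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems; results holds (n, c) pairs (assumed format)."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

Sampling n larger than k and applying this formula gives a lower-variance estimate than drawing exactly k samples and checking whether any of them passes.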