Evaluation
Evaluating Large Language Models Trained on Code
OpenAI · July 7, 2021
Mark Chen, Jerry Tworek, Heewoo Jun, et al.
TL;DR
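Introduces Codex, a GPT model fine-tuned on public GitHub code (a production variant powers GitHub Copilot), and HumanEval, a benchmark of 164 hand-written programming problems that scores generated code by whether it passes unit tests.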
Why it matters
Kicked off the AI coding-assistant era and gave the field its standard code benchmark.
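HumanEval results are usually reported as pass@k: the probability that at least one of k sampled completions for a problem passes its unit tests. The following is a minimal sketch of the unbiased estimator the paper describes, 1 - C(n-c, k) / C(n, k), where n completions are sampled per problem and c of them pass. The function names and the example counts are illustrative assumptions, not taken from the paper's released code.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: completions sampled for the problem
    c: how many of those completions passed the unit tests
    k: sample budget being scored (must satisfy 1 <= k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # comb(n - c, k) counts the size-k subsets with no passing completion;
    # it is 0 when fewer than k completions failed, which gives pass@k = 1.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average per-problem pass@k over (n, c) pairs, one pair per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical counts for three problems, 10 samples each (not real data).
example_results = [(10, 0), (10, 3), (10, 10)]
score = benchmark_pass_at_k(example_results, k=1)
```

With k = 1 the estimate for a single problem reduces to c / n, the fraction of sampled completions that pass, which makes for a quick sanity check.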