Evaluating Large Language Models Trained on Code

OpenAI · July 7, 2021

Mark Chen, Jerry Tworek, Heewoo Jun, et al.

TL;DR

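Introduces Codex, a GPT language model fine-tuned on publicly available GitHub code (a production version of which powers GitHub Copilot), and HumanEval, a benchmark of 164 hand-written Python programming problems that scores generated code by functional correctness against unit tests.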

Why it matters

Helped kick off the AI coding-assistant era and made HumanEval the field's de facto standard benchmark for code generation.
