Architecture · Training
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Google · October 11, 2018
Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova
TL;DR
BERT pretrains a bidirectional Transformer with masked-language modeling, then fine-tunes it to set a new state of the art on eleven NLP tasks.
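To make masked-language modeling concrete, here is a minimal sketch of the input-corruption step, assuming the rates described in the paper: about 15% of positions are selected for prediction, and of those, 80% become `[MASK]`, 10% become a random token, and 10% stay unchanged. The function name `mask_tokens`, the whitespace-split toy tokens, and the per-position coin flip are illustrative assumptions, not the authors' code, which works on WordPiece tokens and leaves special tokens unmasked.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """Corrupt tokens for masked-language modeling (illustrative sketch).

    Returns (corrupted, labels). labels[i] holds the original token at each
    position selected for prediction and None everywhere else.
    """
    if rng is None:
        rng = random.Random()
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        # Assumption: an independent coin flip per position approximates
        # "select 15% of the positions".
        if rng.random() >= mask_prob:
            continue
        labels[i] = tok  # the model is trained to recover this token
        r = rng.random()
        if r < 0.8:
            corrupted[i] = MASK               # 80%: replace with [MASK]
        elif r < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token in place
    return corrupted, labels

if __name__ == "__main__":
    toy_tokens = "the cat sat on the mat".split()
    toy_vocab = sorted(set(toy_tokens))
    print(mask_tokens(toy_tokens, toy_vocab, rng=random.Random(0)))
```

During pretraining the loss is computed only at positions where `labels` is not `None`, so the model has to use context on both sides of each selected position to recover the original token.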
Why it matters
The paper that made pretrain-then-fine-tune the default recipe for NLP and showed how much could be learned from unlabeled text.