Pre-training of Deep Bidirectional Transformers for Language Understanding
Overview
BERT introduces two new pre-training mechanisms that provide an excellent base model for fine-tuning on downstream tasks. Importantly, for a specific downstream task only an additional fine-tuned output layer is required; the core bidirectional Transformer learned in pre-training is kept. The two pre-training methods introduced are (a code sketch of both follows the list):
- Masked language model: Input tokens are randomly masked and the Transformer is trained to predict the original masked tokens from their surrounding bidirectional context.
- Next sentence prediction: A binary classification task where the model is given a pair of sentences and must predict whether the second sentence is a) the actual next sentence from the corpus, or b) a random sentence from the corpus.
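A minimal sketch of how these two pre-training inputs can be built. The 15% masking rate and the 80/10/10 split between [MASK], random, and unchanged tokens follow the paper; the toy vocabulary, corpus, and helper names (mask_tokens, make_nsp_pair) are illustrative assumptions, not the paper's actual data pipeline.

```python
import random

# Toy vocabulary; real BERT uses a ~30k WordPiece vocabulary.
VOCAB = ["[PAD]", "[CLS]", "[SEP]", "[MASK]", "the", "cat", "sat", "on", "mat", "dog", "ran"]
MASK_PROB = 0.15  # fraction of tokens selected for prediction (per the paper)

def mask_tokens(tokens):
    """Masked LM: randomly select tokens; the model must recover the originals."""
    inputs, labels = [], []
    for tok in tokens:
        if tok not in ("[CLS]", "[SEP]") and random.random() < MASK_PROB:
            labels.append(tok)                            # target = original token
            r = random.random()
            if r < 0.8:
                inputs.append("[MASK]")                   # 80%: replace with [MASK]
            elif r < 0.9:
                inputs.append(random.choice(VOCAB[4:]))   # 10%: replace with a random token
            else:
                inputs.append(tok)                        # 10%: keep the token unchanged
        else:
            inputs.append(tok)
            labels.append("[PAD]")                        # position not predicted
    return inputs, labels

def make_nsp_pair(sent_a, next_sent, corpus):
    """Next sentence prediction: 50% true next sentence, 50% random sentence."""
    if random.random() < 0.5:
        sent_b, is_next = next_sent, 1                    # IsNext
    else:
        sent_b, is_next = random.choice(corpus), 0        # NotNext (may collide in this toy corpus)
    tokens = ["[CLS]"] + sent_a + ["[SEP]"] + sent_b + ["[SEP]"]
    return tokens, is_next

corpus = [["the", "cat", "sat"], ["on", "the", "mat"], ["the", "dog", "ran"]]
tokens, is_next = make_nsp_pair(corpus[0], corpus[1], corpus)
masked, targets = mask_tokens(tokens)
print(masked, targets, is_next)
```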
Architecture
Standard bidirectional Transformer encoder, with an added segment embedding.
The input representation is the sum of three parts (sketched in code after the list):
- Token Embedding
- Segment Embedding (important so that pre-training can handle the concept of segments needed in downstream tasks, e.g. question answering)
- Position Embeddings, in the same fashion as Attention Is All You Need
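A minimal sketch of how the three embeddings combine, assuming a PyTorch-style learned embedding table for each part. The sizes (30,522-token vocabulary, 768 hidden units, 512 max length) match BERT-Base, but the class name, variable names, and toy input IDs are illustrative.

```python
import torch
import torch.nn as nn

class BertInputEmbedding(nn.Module):
    """Final embedding per position = token + segment + position embedding."""

    def __init__(self, vocab_size=30522, hidden_size=768, max_len=512):
        super().__init__()
        self.token = nn.Embedding(vocab_size, hidden_size)
        self.segment = nn.Embedding(2, hidden_size)        # sentence A = 0, sentence B = 1
        self.position = nn.Embedding(max_len, hidden_size)

    def forward(self, token_ids, segment_ids):
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        return (self.token(token_ids)
                + self.segment(segment_ids)
                + self.position(positions))                # broadcasts over the batch dimension

token_ids = torch.tensor([[101, 2023, 2003, 102, 1037, 7953, 102]])  # toy WordPiece ids
segment_ids = torch.tensor([[0, 0, 0, 0, 1, 1, 1]])                   # segment A / segment B split
emb = BertInputEmbedding()(token_ids, segment_ids)
print(emb.shape)  # torch.Size([1, 7, 768])
```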