Pre-training of Deep Bidirectional Transformers for Language Understanding
Overview
BERT introduces two new pre-training mechanisms that provide an excellent base model for fine-tuning on downstream tasks. Importantly, for a specific downstream task only an additional fine-tuned output layer is required; the core bidirectional Transformer learned in pre-training is kept. The two pre-training methods introduced are (a code sketch of both follows the list):
- Masked language model: Input tokens are randomly masked and the Transformer is trained to predict the original masked tokens from their surrounding bidirectional context.
- Next sentence prediction: A binary classification task where the model is given a pair of sentences and must predict whether the second sentence is a) the actual next sentence from the corpus, or b) a random sentence from the corpus.
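A minimal sketch of how these two pre-training inputs can be built. The 15% masking rate and the 80/10/10 split between [MASK], random, and unchanged tokens follow the paper; the toy vocabulary, corpus, and helper names (mask_tokens, make_nsp_pair) are illustrative assumptions, not the paper's actual data pipeline.

```python
import random

# Toy vocabulary; real BERT uses a ~30k WordPiece vocabulary.
VOCAB = ["[PAD]", "[CLS]", "[SEP]", "[MASK]", "the", "cat", "sat", "on", "mat", "dog", "ran"]
MASK_PROB = 0.15  # fraction of tokens selected for prediction (per the paper)

def mask_tokens(tokens):
    """Masked LM: randomly select tokens; the model must recover the originals."""
    inputs, labels = [], []
    for tok in tokens:
        if tok not in ("[CLS]", "[SEP]") and random.random() < MASK_PROB:
            labels.append(tok)                            # target = original token
            r = random.random()
            if r < 0.8:
                inputs.append("[MASK]")                   # 80%: replace with [MASK]
            elif r < 0.9:
                inputs.append(random.choice(VOCAB[4:]))   # 10%: replace with a random token
            else:
                inputs.append(tok)                        # 10%: keep the token unchanged
        else:
            inputs.append(tok)
            labels.append("[PAD]")                        # position not predicted
    return inputs, labels

def make_nsp_pair(sent_a, next_sent, corpus):
    """Next sentence prediction: 50% true next sentence, 50% random sentence."""
    if random.random() < 0.5:
        sent_b, is_next = next_sent, 1                    # IsNext
    else:
        sent_b, is_next = random.choice(corpus), 0        # NotNext (may collide in this toy corpus)
    tokens = ["[CLS]"] + sent_a + ["[SEP]"] + sent_b + ["[SEP]"]
    return tokens, is_next

corpus = [["the", "cat", "sat"], ["on", "the", "mat"], ["the", "dog", "ran"]]
tokens, is_next = make_nsp_pair(corpus[0], corpus[1], corpus)
masked, targets = mask_tokens(tokens)
print(masked, targets, is_next)
```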
Architecture
Standard bidirectional Transformer encoder, with an added segment embedding.
The input representation is the sum of three parts (sketched in code after the list):
- Token Embedding
- Segment Embedding (important so that pre-training can handle the concept of segments needed in downstream tasks, e.g. question answering)
- Position Embeddings, in the same fashion as Attention Is All You Need
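A minimal sketch of how the three embeddings combine, assuming a PyTorch-style learned embedding table for each part. The sizes (30,522-token vocabulary, 768 hidden units, 512 max length) match BERT-Base, but the class name, variable names, and toy input IDs are illustrative.

```python
import torch
import torch.nn as nn

class BertInputEmbedding(nn.Module):
    """Final embedding per position = token + segment + position embedding."""

    def __init__(self, vocab_size=30522, hidden_size=768, max_len=512):
        super().__init__()
        self.token = nn.Embedding(vocab_size, hidden_size)
        self.segment = nn.Embedding(2, hidden_size)        # sentence A = 0, sentence B = 1
        self.position = nn.Embedding(max_len, hidden_size)

    def forward(self, token_ids, segment_ids):
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        return (self.token(token_ids)
                + self.segment(segment_ids)
                + self.position(positions))                # broadcasts over the batch dimension

token_ids = torch.tensor([[101, 2023, 2003, 102, 1037, 7953, 102]])  # toy WordPiece ids
segment_ids = torch.tensor([[0, 0, 0, 0, 1, 1, 1]])                   # segment A / segment B split
emb = BertInputEmbedding()(token_ids, segment_ids)
print(emb.shape)  # torch.Size([1, 7, 768])
```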