The Evolution of GPT

GPT (1): Improving Language Understanding by Generative Pre-Training

  • Simple transformer architecture.
  • Pre-trained to perform next word prediction.
  • For downstream tasks, define a task-specific input sequence format and fine-tune the model with an additional, final linear layer (a minimal sketch follows this list).
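
A minimal PyTorch sketch of the GPT-1 fine-tuning pattern: reuse the pre-trained transformer and attach one new linear layer for the downstream task. The toy backbone, layer sizes, and class names below are illustrative assumptions, not the paper's actual configuration.

```python
import torch
import torch.nn as nn


class ToyBackbone(nn.Module):
    """Stand-in for a pre-trained GPT-style transformer (hypothetical)."""

    def __init__(self, vocab_size: int = 1000, hidden_dim: int = 64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        layer = nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        return self.blocks(self.embed(token_ids))       # (batch, seq_len, hidden_dim)


class GPTClassifier(nn.Module):
    """Pre-trained backbone plus a new, task-specific linear head."""

    def __init__(self, backbone: nn.Module, hidden_dim: int, num_classes: int):
        super().__init__()
        self.backbone = backbone
        self.head = nn.Linear(hidden_dim, num_classes)  # the added final layer

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        hidden = self.backbone(token_ids)
        return self.head(hidden[:, -1, :])              # classify from the last token's state


model = GPTClassifier(ToyBackbone(), hidden_dim=64, num_classes=2)
logits = model(torch.randint(0, 1000, (8, 16)))          # 8 sequences of 16 tokens
```

During fine-tuning, both the backbone and the new head are updated on the labelled task data, with the text arranged into the task-specific sequence format.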

GPT-2: Language Models are Unsupervised Multitask Learners

  • Standard language modelling with the transformer architecture (as in Attention Is All You Need).

  • A general system should be able to handle many tasks, so condition the output on the task as well as the input. I.e. instead of a per-task $p(\text{output} \mid \text{input})$, model $p(\text{output} \mid \text{input}, \text{task})$, with the task itself specified in natural language as part of the input (see the sketch below).
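
A small sketch of what task conditioning looks like in practice: the task is written into the prompt text itself, so a single language model covers many tasks. The helper function and templates below are illustrative assumptions, not the exact formats used in the GPT-2 paper.

```python
def build_prompt(task: str, text: str) -> str:
    """Fold the task into the input so the model effectively sees p(output | input, task)."""
    templates = {
        "translate_en_fr": "Translate English to French:\n{text}\n=>",
        "summarize": "{text}\nTL;DR:",
        "qa": "Question: {text}\nAnswer:",
    }
    return templates[task].format(text=text)


print(build_prompt("summarize", "GPT-2 is a large transformer language model trained on web text."))
```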

GPT-3: Language Models are Few-Shot Learners

  • “While typically task-agnostic in architecture, this method [generative pre-training] still requires task-specific fine-tuning datasets of thousands or tens of thousands of examples. By contrast, humans can generally perform a new language task from only a few examples or from simple instructions – something which current NLP systems still largely struggle to do. Here we show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches.”
  • In essence, sufficiently large language models can perform zero- and few-shot tasks at near state-of-the-art levels (see the few-shot prompt sketch below).
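
A minimal sketch of a few-shot prompt in the GPT-3 style: a short task description followed by a handful of solved examples, with the model left to complete the final query. No weights are updated; the "learning" happens purely in the context window. The helper function and formatting are assumptions for illustration.

```python
def few_shot_prompt(task_description: str, examples: list[tuple[str, str]], query: str) -> str:
    """Concatenate a task description, K solved examples, and the new query."""
    lines = [task_description, ""]
    for source, target in examples:
        lines.append(f"{source} => {target}")
    lines.append(f"{query} =>")                # the model is asked to continue from here
    return "\n".join(lines)


prompt = few_shot_prompt(
    "Translate English to French.",
    [("sea otter", "loutre de mer"), ("cheese", "fromage")],
    "plush giraffe",
)
print(prompt)
```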

InstructGPT: Training language models to follow instructions with human feedback

  • Fine-tunes GPT-3 with Reinforcement Learning from Human Feedback (RLHF): a reward model is trained on human preference comparisons between model outputs, then the language model is optimised against that reward with reinforcement learning (see the sketch below).
  • Human labellers preferred outputs from the 1.3B-parameter InstructGPT model over outputs from the 175B-parameter GPT-3.
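
A toy sketch of the reward-modelling stage of RLHF. The pairwise loss below (pushing the reward of the human-preferred completion above the rejected one via a log-sigmoid margin) mirrors the reward-model objective described in the InstructGPT paper; the tiny embedding-based scorer and all sizes are illustrative assumptions, and the subsequent RL stage is only noted in a comment.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RewardModel(nn.Module):
    """Toy scalar reward model: scores a pooled embedding of a completion."""

    def __init__(self, vocab_size: int = 1000, hidden_dim: int = 64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        pooled = self.embed(token_ids).mean(dim=1)       # (batch, hidden_dim)
        return self.score(pooled).squeeze(-1)            # one scalar reward per completion


def preference_loss(rm: RewardModel, preferred: torch.Tensor, rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise loss: make the human-preferred completion score higher than the rejected one."""
    return -F.logsigmoid(rm(preferred) - rm(rejected)).mean()


rm = RewardModel()
chosen = torch.randint(0, 1000, (4, 16))                 # 4 preferred completions
other = torch.randint(0, 1000, (4, 16))                  # 4 rejected completions
preference_loss(rm, chosen, other).backward()
# The trained reward model then supplies the reward signal for RL fine-tuning
# of the language model (InstructGPT uses PPO for this stage).
```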