Abstract
We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely.
Key Contributions
- Transformer Architecture: First sequence transduction model to rely entirely on self-attention, replacing the recurrent and convolutional layers of earlier encoder-decoder models
- Multi-Head Attention: Allows the model to jointly attend to information from different representation subspaces at different positions (see the first sketch after this list)
- Positional Encoding: Injects information about the relative or absolute position of tokens, since the model itself contains no recurrence or convolution (see the second sketch below)
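
A minimal NumPy sketch of scaled dot-product and multi-head attention as described in the paper. The head count, dimensions, and randomly initialized projection matrices here are illustrative assumptions, not the paper's trained parameters.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)      # (heads, seq, seq)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)        # softmax over keys
    return weights @ V                                    # (heads, seq, d_k)

def multi_head_attention(x, num_heads=8, d_model=512, seed=0):
    """Project x into num_heads subspaces, attend in each, then concatenate."""
    rng = np.random.default_rng(seed)
    seq_len = x.shape[0]
    d_k = d_model // num_heads
    # Per-head projection matrices (random initialization, for illustration only).
    W_q = rng.normal(0, 0.02, (num_heads, d_model, d_k))
    W_k = rng.normal(0, 0.02, (num_heads, d_model, d_k))
    W_v = rng.normal(0, 0.02, (num_heads, d_model, d_k))
    W_o = rng.normal(0, 0.02, (d_model, d_model))
    Q = np.einsum('sd,hdk->hsk', x, W_q)                  # (heads, seq, d_k)
    K = np.einsum('sd,hdk->hsk', x, W_k)
    V = np.einsum('sd,hdk->hsk', x, W_v)
    heads = scaled_dot_product_attention(Q, K, V)         # (heads, seq, d_k)
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o                                   # (seq, d_model)

x = np.random.default_rng(1).normal(size=(10, 512))      # 10 tokens, d_model = 512
print(multi_head_attention(x).shape)                     # (10, 512)
```

Each head attends in its own lower-dimensional subspace (d_k = d_model / num_heads), which is what lets the model combine several attention patterns in one layer.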
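
A minimal NumPy sketch of the sinusoidal positional encoding, PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)); the sequence length and d_model below are arbitrary example values.

```python
import numpy as np

def positional_encoding(seq_len, d_model=512):
    positions = np.arange(seq_len)[:, None]               # (seq, 1)
    dims = np.arange(0, d_model, 2)[None, :]               # (1, d_model/2)
    angles = positions / np.power(10000, dims / d_model)   # (seq, d_model/2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)    # even indices: sine
    pe[:, 1::2] = np.cos(angles)    # odd indices: cosine
    return pe

# The encoding is added to the token embeddings before the first layer, e.g.:
# x = token_embeddings + positional_encoding(len(tokens), d_model)
print(positional_encoding(10).shape)                      # (10, 512)
```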
Impact
This paper revolutionized NLP and became the architectural foundation for BERT, GPT, and subsequent large language models.