
Abstract

We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely.
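
As a concrete reference for the attention mechanism named above, the sketch below implements scaled dot-product attention, softmax(QK^T / sqrt(d_k)) V, which the paper uses as its basic attention function. This is a minimal NumPy illustration, not the paper's reference code; the function name, tensor shapes, and example dimensions are assumptions made for the demo.

```python
# Minimal sketch of scaled dot-product attention (illustrative, not reference code).
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for the trailing two dimensions."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)        # (..., seq_q, seq_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)        # softmax over keys
    return weights @ V                                    # (..., seq_q, d_v)

# Example: 4 query positions attending over 6 key/value positions of width 8.
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(6, 8))
V = rng.normal(size=(6, 8))
print(scaled_dot_product_attention(Q, K, V).shape)        # (4, 8)
```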

Key Contributions

  1. Transformer Architecture: The first sequence transduction model to rely entirely on self-attention
  2. Multi-Head Attention: Allows the model to jointly attend to information from different representation subspaces at different positions (see the sketch after this list)
  3. Positional Encoding: Injects information about the relative or absolute position of tokens, since the model has no recurrence or convolution (see the sketch after this list)
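
The sketch below illustrates contributions 2 and 3: multi-head attention splits the model dimension into parallel heads that each attend in their own subspace, and a sinusoidal positional encoding is added to the input embeddings. It is a minimal NumPy illustration under assumed dimensions (a single unbatched sequence, d_model = 64, 8 heads); the function names and the reshape-into-heads layout are choices made for this example, not the paper's reference implementation.

```python
# Minimal sketch of multi-head attention plus sinusoidal positional encoding
# (illustrative assumptions: single sequence, small dimensions, NumPy only).
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention applied independently in each head."""
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return (weights / weights.sum(axis=-1, keepdims=True)) @ V

def multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads):
    """Project X, split into num_heads subspaces, attend in each, then recombine."""
    seq_len, d_model = X.shape
    d_head = d_model // num_heads
    def split(W):  # (seq, d_model) -> (heads, seq, d_head)
        return (X @ W).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    heads = attention(split(W_q), split(W_k), split(W_v))
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o                                   # output projection

def positional_encoding(seq_len, d_model):
    """Sinusoidal encoding of absolute positions (assumes even d_model)."""
    pos = np.arange(seq_len)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                          # even indices: sine
    pe[:, 1::2] = np.cos(angles)                          # odd indices: cosine
    return pe

# Example: 10 token embeddings of width 64, 8 heads (dimensions chosen for the demo).
rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 10, 64, 8
X = rng.normal(size=(seq_len, d_model)) + positional_encoding(seq_len, d_model)
W_q, W_k, W_v, W_o = (0.1 * rng.normal(size=(d_model, d_model)) for _ in range(4))
print(multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads).shape)  # (10, 64)
```

For context, the paper's base model uses d_model = 512 with h = 8 heads (d_k = d_v = 64); reshaping a single d_model x d_model projection into heads, as above, is one common way to realize the per-head projections.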

Impact

This paper reshaped NLP: the Transformer became the foundational architecture for BERT, GPT, and subsequent large language models.