<aside> 💡 TL;DR: RoPE encodes position via 2D rotations at geometrically spaced frequencies. The RoPE base sets a hard lower bound on effective context length — beyond it, the model literally prefers random tokens over similar ones. NTK-aware scaling extends context by changing the base (concentrating scaling on low-frequency dimensions); YaRN refines this with explicit frequency partitioning and attention scaling. This post derives everything from scratch, including a clean proof of why RoPE's attention decays with distance.

</aside>

# Background: How RoPE Works

## What RoPE Tries to Achieve

Attention needs to know where tokens are, not just what they are. Absolute position embeddings (adding a learned vector at each position) work, but they bake position into the representation itself. RoPE takes a different approach: encode position directly into the attention computation, so that the dot product between query and key naturally depends on their relative distance.

The design goals:

- The attention score between a query at position $m$ and a key at position $n$ should depend only on their contents and the relative offset $m - n$.
- Position should enter through a fixed, parameter-free transformation inside the attention computation, rather than being added to the token representation itself.
- Attention scores should, on average, decay as the relative distance grows.

## The Mathematical Formulation

RoPE operates on each attention head independently. For a head with dimension $d$, the query and key vectors are split into $d/2$ pairs of 2D subspaces. In each subspace $l$, RoPE applies a 2×2 rotation matrix:

$$ R(m\theta_l) = \begin{pmatrix} \cos(m\theta_l) & -\sin(m\theta_l) \\ \sin(m\theta_l) & \cos(m\theta_l) \end{pmatrix} $$

where $m$ is the token's position and the frequency for subspace $l$ is:

$$ \theta_l = \text{base}^{-2l/d}, \quad l = 0, 1, \ldots, d/2 - 1 $$

with $\text{base} = 10{,}000$ by default. RoPE rotates both the query at position $m$ and the key at position $n$ by their respective angles:

$$ q_{\text{rot}} = R(m\theta_l) \cdot q_l, \quad k_{\text{rot}} = R(n\theta_l) \cdot k_l $$

where $q_l = (q_{2l},\, q_{2l+1})$ and $k_l = (k_{2l},\, k_{2l+1})$ are the 2D components in subspace $l$. The attention score is their dot product:

$$ \text{score}_l = q_{\text{rot}}^T \, k_{\text{rot}} = q_l^T \, R(m\theta_l)^T \, R(n\theta_l) \, k_l $$

Rotation matrices have two key properties: $R(\alpha)^T = R(-\alpha)$, and composing two rotations adds their angles. So the two rotations collapse into a single rotation through the relative angle, which we call $R_{\Delta,l}$:

$$ R(m\theta_l)^T \, R(n\theta_l) = R(-m\theta_l)\,R(n\theta_l) = R\big((n - m)\theta_l\big) = R_{\Delta,l}, \quad \Delta = m - n $$
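The derivation above is easy to verify numerically. Below is a minimal NumPy sketch (the function name `rope_rotate` is my own, not from any particular library) that applies the per-subspace rotations and checks two consequences: rotations preserve vector norms, and the score depends only on the offset $\Delta = m - n$, not on absolute positions:

```python
import numpy as np

def rope_rotate(x, pos, base=10_000.0):
    """Apply RoPE to a head vector x at position `pos`.

    Pairs (x[2l], x[2l+1]) form subspace l; each is rotated by the
    angle pos * theta_l, with theta_l = base**(-2l/d).
    """
    d = x.shape[0]
    theta = base ** (-2.0 * np.arange(d // 2) / d)  # geometrically spaced frequencies
    ang = pos * theta
    cos, sin = np.cos(ang), np.sin(ang)
    out = np.empty_like(x)
    out[0::2] = cos * x[0::2] - sin * x[1::2]  # first row of R(pos * theta_l)
    out[1::2] = sin * x[0::2] + cos * x[1::2]  # second row
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=64), rng.normal(size=64)

# Rotations are orthogonal, so RoPE never changes vector magnitudes.
assert np.isclose(np.linalg.norm(rope_rotate(q, 7)), np.linalg.norm(q))

# The score depends only on the relative offset, not absolute positions:
s_near = rope_rotate(q, 10) @ rope_rotate(k, 3)       # delta = 7
s_far = rope_rotate(q, 1010) @ rope_rotate(k, 1003)   # delta = 7
assert np.isclose(s_near, s_far)
```

Implementing the rotation as two strided element-wise operations, rather than materializing $d/2$ explicit 2×2 matrices, is also how production implementations typically apply RoPE.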