Transformers as Support Vector Machines

[Submitted on 31 Aug 2023]

Abstract: Since its inception in “Attention Is All You Need”, transformer architecture
has led to revolutionary advancements in NLP. The attention layer within the
transformer admits a sequence of input tokens $X$ and makes them interact
through pairwise similarities computed as softmax$(XQK^top X^top)$, where
$(K,Q)$ are the trainable key-query parameters. In this work, we establish a
formal equivalence between the optimization geometry of self-attention and a
hard-margin SVM problem that separates optimal input tokens from non-optimal
tokens using linear constraints on the outer-products of token pairs. This
formalism allows us to characterize the implicit bias of 1-layer transformers
optimized with gradient descent: (1) Optimizing the attention layer with
vanishing regularization, parameterized by $(K,Q)$, converges in direction to
an SVM solution minimizing the nuclear norm of the combined parameter
$W=KQ^top$. Instead, directly parameterizing by $W$ minimizes a Frobenius norm
objective. We characterize this convergence, highlighting that it can occur
toward locally-optimal directions rather than global ones. (2) Complementing
this, we prove the local/global directional convergence of gradient descent
under suitable geometric conditions. Importantly, we show that
over-parameterization catalyzes global convergence by ensuring the feasibility
of the SVM problem and by guaranteeing a benign optimization landscape devoid
of stationary points. (3) While our theory applies primarily to linear
prediction heads, we propose a more general SVM equivalence that predicts the
implicit bias with nonlinear heads. Our findings are applicable to arbitrary
datasets and their validity is verified via experiments. We also introduce
several open problems and research directions. We believe these findings
inspire the interpretation of transformers as a hierarchy of SVMs that
separates and selects optimal tokens.

Submission history

From: Yingcong Li [view email]

[v1]

Thu, 31 Aug 2023 17:57:50 UTC (1,671 KB)

Transformers as Support Vector Machines

Submission history

Bayesian Statistics: The three cultures

Reverse-engineering my speakers’ API to get reasonable volume control

Zen 5’s 2-ahead branch predictor: how a 30 year old idea allows for new tricks

LEAVE A REPLY Cancel reply

Most Popular

Facebook doesn’t think hackers accessed third-party sites

It’s getting a lot harder for global brands to win in China

Why it’s time for investors to go on the defense

Facebook doesn’t think hackers accessed third-party sites

Recent Comments

EDITOR PICKS

Top Fashion Trends to Look for in Every Important Collection

Spring Fashion Show at the University of Michigan Has Started

Laptop with 128-bit Processor, 32GB of RAM and 24MP Front Camera

POPULAR POSTS

Reflecting on 18 Years at Google

Gboard Hat Version

Feathered robotic wing paves way for flapping drones

POPULAR CATEGORY