
Transformers as Support Vector Machines

[Submitted on 31 Aug 2023]

Abstract: Since its inception in “Attention Is All You Need”, the transformer architecture
has led to revolutionary advancements in NLP. The attention layer within the
transformer admits a sequence of input tokens $X$ and makes them interact
through pairwise similarities computed as $\mathrm{softmax}(XQK^\top X^\top)$, where
$(K,Q)$ are the trainable key-query parameters. In this work, we establish a
formal equivalence between the optimization geometry of self-attention and a
hard-margin SVM problem that separates optimal input tokens from non-optimal
tokens using linear constraints on the outer-products of token pairs. This
formalism allows us to characterize the implicit bias of 1-layer transformers
optimized with gradient descent: (1) Optimizing the attention layer with
vanishing regularization, parameterized by $(K,Q)$, converges in direction to
an SVM solution minimizing the nuclear norm of the combined parameter
$W=KQ^\top$. Instead, directly parameterizing by $W$ minimizes a Frobenius norm
objective. We characterize this convergence, highlighting that it can occur
toward locally-optimal directions rather than global ones. (2) Complementing
this, we prove the local/global directional convergence of gradient descent
under suitable geometric conditions. Importantly, we show that
over-parameterization catalyzes global convergence by ensuring the feasibility
of the SVM problem and by guaranteeing a benign optimization landscape devoid
of stationary points. (3) While our theory applies primarily to linear
prediction heads, we propose a more general SVM equivalence that predicts the
implicit bias with nonlinear heads. Our findings are applicable to arbitrary
datasets and their validity is verified via experiments. We also introduce
several open problems and research directions. We believe these findings
inspire the interpretation of transformers as a hierarchy of SVMs that
separates and selects optimal tokens.
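
As a rough illustration of the quantities the abstract names, the following minimal sketch computes the pairwise similarities $\mathrm{softmax}(XQK^\top X^\top)$ for random inputs, then evaluates the nuclear and Frobenius norms of the combined parameter $W=KQ^\top$. The dimensions, the random data, and the helper names `softmax_rows` and `attention_similarities` are assumptions made for this example; it is not the authors' code and does not reproduce their experiments.

```python
import numpy as np

def softmax_rows(scores):
    # Row-wise softmax; subtracting the row maximum keeps exp() numerically stable.
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def attention_similarities(X, Q, K):
    # Pairwise token similarities softmax(X Q K^T X^T), as written in the abstract.
    return softmax_rows(X @ Q @ K.T @ X.T)

# Hypothetical sizes: T tokens, embedding dimension d, key-query width m.
rng = np.random.default_rng(0)
T, d, m = 5, 4, 3
X = rng.normal(size=(T, d))   # input token sequence
Q = rng.normal(size=(d, m))   # query parameters
K = rng.normal(size=(d, m))   # key parameters

A = attention_similarities(X, Q, K)   # shape (T, T); each row sums to 1

# Combined parameter W = K Q^T and the two norms the abstract contrasts.
W = K @ Q.T
nuclear_norm = np.linalg.norm(W, ord="nuc")    # sum of singular values
frobenius_norm = np.linalg.norm(W, ord="fro")  # sqrt of sum of squared entries
```

The nuclear norm is never smaller than the Frobenius norm, and the two coincide exactly when $W$ has rank at most one, which is one way to see why the two parameterizations can favor different solutions.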

Submission history

From: Yingcong Li

Thu, 31 Aug 2023 17:57:50 UTC (1,671 KB)
