Paper Dissection: Understanding the Intrinsic Dimensionality of LLMs
Building mathematical intuition for why fine-tuning methods like LoRA work so well.
This article is an explainer for the paper Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning.
It details some of the core ideas which inspired Parameter-Efficient Fine-Tuning (PEFT) methods like Low-Rank Adaptation (LoRA) for Language Models.
Note: This article is adapted from my Obsidian notes, and Substack doesn’t support inline LaTeX. Some parts of the article therefore include unrendered inline LaTeX, in the hope that Substack will one day render it properly; I don’t know of a better way to type it out here.
A well-formatted PDF version of this article is available at this link.
Introduction
Why can a model with hundreds of millions of parameters be steered toward a new NLP task by updating only a handful of parameters? This paper argues that fine-tuning rarely explores the full parameter space; instead, good solutions live in a low-dimensional subspace. This article digs into the core ideas behind that claim. By the end, you’ll have an intuitive understanding of why LoRA-style adapters work, hopefully backed by some mathematical motivation.
Fine-Tuning Magic
Pre-trained language models like Gemma 2/3, GPT-2/3/4/4o, etc. are insanely large, consisting of hundreds of millions to hundreds of billions of parameters (D).
Surprisingly, we can take these large models, fine-tune them on a tiny new dataset for a specific task (e.g. sentiment analysis with a few thousand labeled examples), and they learn incredibly well without catastrophic forgetting or overfitting. This is quite remarkable, given that we have orders of magnitude more parameters than data points!
Intrinsic Dimensionality
The paper argues that the "solution" to the fine-tuning task doesn't require wiggling all D parameters independently. Instead, the necessary changes to adapt the language model to a new task live in a much, much smaller "intrinsic" dimensional subspace (d).
One way to think about it is as follows: the pre-training process has already built a very sophisticated latent representation of the world the model was exposed to. To "learn" how to perform a new task, you don't need to rebuild or modify every single bit of that foundational knowledge. In more practical terms, you don't need to wiggle every single one of the D dimensions. All that you really need is to modify a few "high-level" ideas.
Let's consider cooking as an analogy (I swear I haven't used Gemini to come up with this). If you are a moderately experienced home-cook (who has watched their mom cook rajma-chawal or biryani ever since they were a little kid) you have built up foundational knowledge about how food is prepared. Let's say you want to prepare Spaghetti Bolognese, something which you haven't necessarily seen being cooked at home before.
So you look up a recipe online, you read through the ingredients and instructions:
Making the sauce: sauté garlic and onion, prepare the protein, add the spices.
Preparing the pasta: you already know how to take something out of a packet and boil it, don't you?
You don’t remodel the kitchen or relearn how to boil water; you keep the stove, pans, and knife skills exactly as they are and just introduce a couple of new ingredients and a little more knowledge. These few high-level adjustments are enough to produce a brand new meal, just as nudging a few hundred carefully-crafted parameters is enough to adapt a few-hundred-million-parameter model to a fresh task while the bulk of the weights remain unchanged.
Similarly, when you're fine-tuning a model, the effective number of parameters you're truly "learning" is this tiny d, and not the full D.
Subspace Fine-Tuning
Instead of fine-tuning all D parameters of the pre-trained model \(\Theta_0\), the paper proposes to find an optimal set of parameters \(\theta^D\) for the new task by optimizing only a small set of d parameters \(\theta^d\). The relationship is defined as follows (equation (1)):

\[\theta^D = \Theta_0 + P(\theta^d)\]
Here,
- \(\Theta_0\)
These are the initial pre-trained weights (e.g., the weights of GPT-2 loaded from a checkpoint). These are fixed during this specific intrinsic dimensionality experiment.
- \(\theta^d\)
These are the low-dimensional parameters we actually optimize. There are only d of them (e.g. d = 200).
- \(P:\mathbb{R}^d\to\mathbb{R}^D\)
This is a projection function. It takes the d-dimensional vector $\theta^d$ and maps it to a D-dimensional update vector in the full parameter space. This projection P is randomly generated and then fixed. You don't learn P.
Not learning P is a deliberate choice, which underlines one of the core ideas of this paper. The paper is an exercise in trying to determine whether the learned manifold allows for good solutions within a randomly chosen low-dimensional subspace. If the random projection P (fixed after random initialization) is sufficient to achieve good performance by only learning the d-dimensional parameters $\theta^d$, it strongly suggests the task's solution truly lies in a low-dimensional ("intrinsic") manifold within the larger parameter space of the pre-trained model.
The entire point of the paper is to show that a model with D parameters (hundreds of millions or billions) can be effectively fine-tuned by only optimizing d parameters, where d << D (e.g., a few hundred).
If P (mapping from $\mathbb{R}^d\to\mathbb{R}^D$) were learnable, it would introduce a vast number of new parameters (e.g., D x d elements for a linear projection). This would defeat the purpose of demonstrating that only d parameters are essential for the update.
The intuition here is that we're searching for a solution not in the full D-dimensional space, but in a d-dimensional affine subspace.
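To make this concrete, here is a minimal NumPy sketch of searching in such a random affine subspace. Everything in it is illustrative: the toy quadratic "task", the sizes of D and d, and the assumption that the optimum actually lies in the random subspace are stand-ins, not the paper's real setup.

```python
import numpy as np

# Toy sketch of equation (1): theta^D = Theta_0 + P(theta^d).
rng = np.random.default_rng(0)
D, d = 10_000, 20                          # full vs. intrinsic dimensionality (toy sizes)

Theta_0 = rng.normal(size=D) * 0.02        # "pre-trained" weights, kept frozen
P = rng.normal(size=(D, d)) / np.sqrt(D)   # random projection, generated once, never trained
theta_d = np.zeros(d)                      # the ONLY trainable parameters

# Pretend the task optimum really lies inside the random subspace; this is the
# property the paper probes for real NLP tasks.
target = Theta_0 + P @ rng.normal(size=d)

lr = 0.5
for step in range(100):
    theta_D = Theta_0 + P @ theta_d        # map the d-dimensional update into the full space
    grad_D = theta_D - target              # gradient of 0.5 * ||theta^D - target||^2
    grad_d = P.T @ grad_D                  # chain rule: only d gradients are ever needed
    theta_d -= lr * grad_d                 # update just d = 20 numbers

print("residual norm:", np.linalg.norm(Theta_0 + P @ theta_d - target))
```

Only `theta_d` ever receives gradient updates; `Theta_0` and `P` stay exactly as they were generated, which is the whole point of the experiment.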
Making the Projection P efficient: The Fastfood Transform
Explicitly creating and storing a dense random matrix P (of shape D x d) is computationally infeasible for sufficiently large D.
For example, if D=500M and d=1000, P would be about 500M x 1000 x 4 bytes (for float32 weights) = 2 Terabytes!
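A hypothetical two-line sanity check of that figure in Python:

```python
# Storage for a dense float32 projection matrix P of shape D x d.
D, d = 500_000_000, 1_000
print(D * d * 4 / 1e12, "TB")  # -> 2.0 TB
```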
So the authors used a structured random projection, called the Fastfood transform, for $P(\theta^d)$ (equation (2)):

\[\theta^D = \Theta_0 + \theta^d M\]

Where $\theta^d M$ is the transformation which maps our low-dimensional learnings in the subspace $\mathbb{R}^d$ to the original high-dimensional space $\mathbb{R}^D$.

Here, M is defined as

\[M = HG\Pi HB\]
H: Hadamard matrix (which enables fast multiplication using the Walsh-Hadamard Transform). It's basically a square matrix whose entries are either +1 or -1, and whose rows are mutually orthogonal.
G: Random diagonal matrix with independent Gaussian entries.
Π: Random permutation matrix, which shuffles the elements of a vector (or the rows/columns of another matrix) when you multiply by it. It has exactly one '1' in each row and each column; every other entry is zero.
B: Random diagonal matrix (equal probability ±1 entries)
This factorization of M is what’s called the Fastfood transform.
Going back to equation (2), $\theta^d M$ represents the mapping from our low-dimensional (d-dimensional) intrinsic subspace to the original high-dimensional (D-dimensional) space.
- \(\theta^d\)
represents our low-dimensional learnings from the fine-tuning data.
- M
represents a fixed transformation machinery, which translates our d-dimensional tweaks into D-dimensional adjustments that can be applied to the original high-dimensional weights.
The transformation ($\theta^dM$) is a set of D specific adjustments, one for each of the D parameters of the layer/model.
The fun part here is that the unique Fastfood factorization of M allows us to very quickly calculate the final D adjustments (the result of $\theta^d M$). In fact, you don't even need to store the actual matrix M (potentially of size D x d), let alone perform the explicit matrix multiplication!
All we really need to store are the factors which make up M: H (the Hadamard matrix, which has a computational rule rather than explicit storage), G (a diagonal matrix, only D values), Π (a permutation matrix, which can be stored as D indices), and B (another diagonal matrix, D values).
Therefore, we simply apply these transformations B, H, Π, G, H sequentially to our appropriately shaped $\theta^d$.
The critical point, then, is that we only need to learn $\theta^d$.
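To see how cheap the projection really is, here is a rough, single-block NumPy sketch of the Fastfood-style mapping. It assumes D is a power of two and simply zero-pads $\theta^d$ up to length D; the real construction tiles multiple blocks and uses proper normalization constants, so treat this as an illustration of applying B, H, Π, G, H without ever materializing M, not a faithful reimplementation of the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)
D, d = 2 ** 14, 200                   # D must be a power of two for the Hadamard transform

# The Fastfood factors. None of them is ever stored as a full D x D matrix.
B = rng.choice([-1.0, 1.0], size=D)   # diagonal of B: random +-1 signs (D values)
Pi = rng.permutation(D)               # permutation, stored as D indices
G = rng.normal(size=D)                # diagonal of G: independent Gaussian entries (D values)

def fwht(x):
    """Fast Walsh-Hadamard transform in O(D log D); len(x) must be a power of two."""
    x = x.copy()
    h = 1
    while h < len(x):
        for i in range(0, len(x), 2 * h):
            a = x[i:i + h].copy()
            b = x[i + h:i + 2 * h].copy()
            x[i:i + h] = a + b
            x[i + h:i + 2 * h] = a - b
        h *= 2
    return x

def fastfood_project(theta_d):
    """Apply B, H, Pi, G, H to a zero-padded theta^d without materializing M."""
    x = np.zeros(D)
    x[:d] = theta_d            # pad the d learned parameters up to length D
    x = B * x                  # B: random sign flips
    x = fwht(x)                # H: fast Walsh-Hadamard transform
    x = x[Pi]                  # Pi: random permutation
    x = G * x                  # G: Gaussian rescaling
    x = fwht(x)                # H again
    return x / np.sqrt(D)      # crude normalization; the paper uses proper scaling constants

theta_d = rng.normal(size=d)
update = fastfood_project(theta_d)   # the D adjustments to add to the frozen pre-trained weights
print(update.shape)                  # (16384,), computed without ever building M
```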
Measuring The Intrinsic Dimension
The authors define the "satisfactory solution" for a task as achieving 90% of the performance you'd get by fine-tuning the full model ($Perf_{full}$). They then search for the smallest d (by trying various values of d) that allows them to reach $0.9 \times Perf_{full}$. This smallest d is called $d_{90}$.
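In pseudocode, the search looks roughly like the sketch below. `finetune_in_subspace` and `full_finetune_perf` are hypothetical placeholders for "train only $\theta^d$ with a fixed random P and return the eval metric" and "the score from fine-tuning all D parameters"; neither name comes from the paper.

```python
# Hypothetical sketch of the d_90 search, not the paper's actual code.
def find_d90(candidate_ds, full_finetune_perf, finetune_in_subspace):
    for d in sorted(candidate_ds):            # e.g. [10, 100, 200, 500, 1000, 10000]
        perf = finetune_in_subspace(d)        # train theta^d with a fixed random P, evaluate
        if perf >= 0.9 * full_finetune_perf:
            return d                          # smallest d hitting 90%: that's d_90
    return None                               # 90% never reached within the sweep
```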
Key Findings
For RoBERTa-Large on MRPC, $d_{90}$ is around 200-300! This means optimizing just around 200 parameters (projected back into the full 355M-parameter model) gets you 90% of the way!
As they pre-train a RoBERTa-Base model from scratch, they periodically measure $d_{90}$ on downstream tasks. They find that $d_{90}$ decreases as pre-training progresses. This suggests that pre-training adjusts the learned manifold to make tasks representable in lower dimensions.
The most counter-intuitive finding is probably that, for a constant number of weight updates, larger models tend to have smaller $d_{90}$ for the same downstream tasks! This hints that more capacity in a language model allows it to create an even more efficient subspace for tasks.
Closing thoughts
With the advent of LoRA-based fine-tuning, a lot of these results are now common knowledge. However, looking at the seminal ideas which led to the development of techniques like LoRA allows us to build much stronger intuition about why such techniques actually work.
It's interesting to learn about computational tricks such as the Fastfood transform, which allow these methods to scale to billions of parameters. After all, clever ideas mean little if they can’t be efficiently implemented.