The Simple Essence of Automatic Differentiation

Disclaimer: I assume you know math up to derivatives and are somewhat comfortable with programming.

Intuitively, automatic differentiation (AD) is a black box that, given a function, gives you its derivative. We can use it to automagically perform things like backpropagation, provided the function we differentiate computes the forward pass of a neural network.

But what is a derivative?

Linear Approximations

The standard definition of a derivative is the following:

$$ \frac{\mathrm{d} f}{\mathrm{d} x} = f^\prime(x) = \lim_{\varepsilon\to0}\frac{f(x+\varepsilon)-f(x)}{\varepsilon} $$

The idea here is to perform a linear approximation of the function $f$ at $x$.

A linear function satisfies:

$$ f(\alpha x + \beta y) = \alpha f(x) + \beta f(y) $$
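As a quick example (my own, for illustration): $f(x) = 3x$ is linear, since

$$ f(\alpha x + \beta y) = 3(\alpha x + \beta y) = \alpha\cdot 3x + \beta\cdot 3y = \alpha f(x) + \beta f(y) $$

whereas $f(x) = 3x + 1$ is not, since $f(2x) = 6x + 1 \ne 2\,f(x) = 6x + 2$.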

Now think back to the slope triangle: we approximate the tangent at $x$ by offsetting $x$ by a very small value $\varepsilon$; the secant through $(x, f(x))$ and $(x+\varepsilon, f(x+\varepsilon))$ corresponds to the hypotenuse of the slope triangle.

Rewriting the above equation gives

$$ 0 = \lim_{\varepsilon\to 0}\frac{f(x+\varepsilon)-f(x)}{\varepsilon}-f^\prime(x) $$

$$ 0 = \lim_{\varepsilon\to 0}\frac{f(x+\varepsilon)-f(x)-\varepsilon f^\prime(x)}{\varepsilon} $$
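As a quick sanity check (my own example), take $f(x) = x^2$ with $f^\prime(x) = 2x$:

$$ \lim_{\varepsilon\to 0}\frac{f(x+\varepsilon)-f(x)-\varepsilon f^\prime(x)}{\varepsilon} = \lim_{\varepsilon\to 0}\frac{x^2 + 2x\varepsilon + \varepsilon^2 - x^2 - 2x\varepsilon}{\varepsilon} = \lim_{\varepsilon\to 0}\varepsilon = 0 $$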

This definition works fine for $\mathbb{R}\to\mathbb{R}$ and $\mathbb{R}\to\mathbb{R}^n$. However, for functions of the form $\mathbb{R}^n\to\mathbb{R}^m$ we can’t apply it, since dividing by a vector $\varepsilon\in\mathbb{R}^n$ is nonsensical. This issue is easily fixed by taking the lengths (norms) of the vectors instead:

$$ 0 = \lim_{\varepsilon\to \vec{0}}\frac{||f(x+\varepsilon)-(f(x)+\varepsilon f^\prime(x))||}{||\varepsilon||} $$

Think of a function $\mathcal{D}$ that, given a function $f$ and a location $x$, gives us the derivative of $f$ at $x$. This yields the following type signature:

$$ (A \to B) \to (A \to (A \multimap B)) $$

Here, $A\multimap B$ is the type of the derivative, i.e. a linear map. There is a correspondence between linear maps $A\multimap B$ and matrices: every linear map between finite-dimensional spaces can be represented as a matrix.
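To make this signature concrete, here is a minimal Haskell sketch (the names `LinMap`, `D` and `dSquareAt` are mine; linear maps are represented as plain functions that we promise to keep linear):

```haskell
-- A linear map A ⊸ B, represented for this sketch as a plain function that we
-- promise to keep linear.
type LinMap a b = a -> b

-- The differentiation operator: given a function and a point, return the
-- derivative at that point as a linear map.
type D a b = (a -> b) -> a -> LinMap a b

-- Example: the derivative of squaring; at the point x it is the linear map
-- dx ↦ 2·x·dx.
dSquareAt :: Double -> LinMap Double Double
dSquareAt x = \dx -> 2 * x * dx
```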

Linear Maps as Matrices

Consider $f : A\multimap B$ for finite-dimensional vector spaces $A,B$ with suitably defined bases for both. We use $e^A_i$ with $1\le i\le |A|=n$ as the basis vectors of $A$, and similarly $e^B_j$ with $1\le j \le |B|=m$ as the basis of $B$. Now let $w_i$ be some arbitrary coefficients. Observe:

$$ f(w_1 e^A_1 + \cdots + w_n e^A_n) = w_1 f(e^A_1) + \cdots + w_n f(e^A_n) $$

Hence, the value of $f$ on any input depends only on the values of $f$ at the basis vectors $e^A_i$. Since $f$ maps values from $A$ to values in $B$, and every vector of $B$ can be expressed in terms of the basis vectors $e^B_j$, we can write each $f(e^A_i)$ as follows:

$$ f(e^A_i) = a_{1,i}\cdot e^B_1 + \cdots + a_{m,i}\cdot e^B_m $$

for suitable values $a_{j,i}$. Notice that there are exactly $|B|\times|A|$ such values. Arranging them into a matrix yields an $m\times n$ matrix, so a linear map $A\multimap B$ can be identified with a matrix of shape $|B|\times|A|$. We can then redefine $f : A\multimap B$ via the matrix $M$ from the above procedure as $f(\vec{x})=M\cdot\vec{x}$.
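A small sketch of this procedure (my own, representing vectors as lists of Doubles): the columns of the matrix are obtained by applying the linear map to the basis vectors.

```haskell
-- Extract the matrix of a linear map R^n -> R^m by applying it to the n basis
-- vectors; the result is the list of columns a_{·,i} = f(e_i).
columnsOf :: Int -> ([Double] -> [Double]) -> [[Double]]
columnsOf n f = [ f (basis i) | i <- [1 .. n] ]
  where basis i = [ if j == i then 1 else 0 | j <- [1 .. n] ]

-- Example: the linear map (x, y) ↦ (2x + y, 3y) has columns [2,0] and [1,3].
example :: [[Double]]
example = columnsOf 2 (\[x, y] -> [2*x + y, 3*y])
```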

Composing

Whenever we construct a new functionally flavored operation, it is useful to make sure it composes well with other operations. We’ll now look at how $\mathcal{D}$ behaves under composition.

Chaining

Suppose you have two functions $f : A \to B$ and $g : B \to C$. Chaining them means first feeding something into $f$ and then plugging its output into $g$:

$$ (g\circ f)(a) = g(f(a)) $$

We know the chain rule, $(u(v))^\prime = u^\prime(v) \cdot v^\prime$, albeit with a bit of notational abuse. Hence, for $\mathcal{D}$, we define:

$$ \mathcal{D}\ (g\circ f)\ a = \mathcal{D}\ (g\circ f, a) = (\mathcal{D}\ (g, f(a))) \circ (\mathcal{D}(f,a)) = \mathcal{D}\ g\ (f\ a) \circ \mathcal{D}\ f\ a $$

(in this setting, $f\ a\ b$ and $f(a,b)$ are the same)

As seen here, we don’t achieve compositionality from $\mathcal{D}\ g$ and $\mathcal{D}\ f$ alone. We also need $f$ itself to compute $\mathcal{D}\ g\ (f\ a)$!

An efficient way of achieving compositionality might be to change the behavior of $\mathcal{D}$ so that it returns not only the linear map, but also the value of the function at the given point:

$$ \mathcal{D}^+\ f\ a = (f\ a, \mathcal{D}\ f\ a) $$

From this, we can arrive at a concrete ML-like implementation:

$$ \mathcal{D}^+\ (g\circ f)\ a $$

$$ = ((g\circ f)\ a, \mathcal{D}\ (g\circ f)\ a) $$

$$ = (g\ (f\ a), \mathcal{D}\ g\ (f\ a)\circ\mathcal{D}\ f\ a) $$

$$ = \mathtt{let}\ h = f\ a\ \mathtt{in}\ (g\ h, \mathcal{D}\ g\ h\circ\mathcal{D}\ f\ a) $$

$$ = \mathtt{let}\ (h, f^\prime) = \mathcal{D}^+\ f\ a\ \mathtt{in}\ (g\ h, \mathcal{D}\ g\ h\circ f^\prime) $$

$$ = \mathtt{let}\ (h, f^\prime) = \mathcal{D}^+\ f\ a;\ (k, g^\prime) = \mathcal{D}^+\ g\ h\ \mathtt{in}\ (k, g^\prime\circ f^\prime) $$

Notice that each intermediate value is computed only once. Also, composing two linear maps yields again a linear map.
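The last line of the derivation translates almost directly into code. A minimal Haskell sketch, assuming we represent $\mathcal{D}^+\ f$ as a function returning the pair of value and derivative (the names `DPlus` and `composeD` are mine):

```haskell
-- D⁺ f, as a function: given a point, return the value of f there together with
-- the derivative at that point (a linear map, represented as a plain function).
type DPlus a b = a -> (b, a -> b)

-- Sequential composition, mirroring the let-expression above.
composeD :: DPlus b c -> DPlus a b -> DPlus a c
composeD dg df a =
  let (h, f') = df a        -- value and derivative of f at a
      (k, g') = dg h        -- value and derivative of g at f a
  in  (k, g' . f')          -- chain rule: the derivatives compose
```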

Parallel

Similarly, suppose you have two functions $f:A\to C$ and $g:B\to D$ and compose them together in a parallel fashion:

$$ (f\times g)(a,b) = (f\ a, g\ b) $$

The derivative should now be taken in parallel as well:

$$ \mathcal{D}\ (f\times g)\ (a,b) = (\mathcal{D}\ f\ a)\times(\mathcal{D}\ g\ b) $$

And similarly to before, we can extract an implementation for it:

$$ \mathcal{D}^+\ (f\times g)\ (a,b) $$

$$ = ((f\times g)(a,b), (\mathcal{D}\ (f\times g)\ (a,b))) $$

$$ = ((f\ a, g\ b), ((\mathcal{D}\ f\ a)\times (\mathcal{D}\ g\ b))) $$

$$ = \mathtt{let}\ (c,f^\prime) = (f\ a,\mathcal{D}\ f\ a); (d,g^\prime) = (g\ b,\mathcal{D}\ g\ b)\ \mathtt{in}\ ((c,d), f^\prime\times g^\prime) $$

$$ = \mathtt{let}\ (c,f^\prime) = \mathcal{D}^+\ f\ a; (d,g^\prime) = \mathcal{D}^+\ g\ b\ \mathtt{in}\ ((c,d), f^\prime\times g^\prime) $$
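As before, the last line can be written down directly. A sketch under the same assumptions (`DPlus` as above; `crossD` is my own name):

```haskell
-- D⁺ f, as before: value of f at a point together with the derivative there.
type DPlus a b = a -> (b, a -> b)

-- Parallel ("cross") combination, mirroring the let-expression above.
crossD :: DPlus a c -> DPlus b d -> DPlus (a, b) (c, d)
crossD df dg (a, b) =
  let (c, f') = df a
      (d, g') = dg b
  in  ((c, d), \(da, db) -> (f' da, g' db))   -- the derivative acts componentwise
```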

Derivatives of Linear Functions

Let $f : A\to B$ be linear. We’ve established that the derivative of $f$ at a point is a local linear approximation of $f$. An interesting, but more or less obvious, fact that follows from this is that linear functions are their own derivatives.

Consider $f(x) = x$. You almost surely know that its derivative is $f^\prime(x) = 1$, which at first glance is not the same as $f(x) = x$. However, recall that in our definition we multiply the derivative by some infinitesimally small $\varepsilon$:

$$ 0 = \lim_{\varepsilon\to \vec{0}}\frac{||f(x+\varepsilon)-(f(x)+\colorbox{yellow}{$\varepsilon f^\prime(x)$})||}{||\varepsilon||} $$

From “Linear Maps as Matrices” we know that any linear function $f$ can equivalently be written as $f(x)=M\cdot x$ for some suitable matrix $M$.

Now reconsider $f^\prime(x) \cdot \varepsilon$. The intuition for a (higher-dimensional) derivative is that it tells us, in a “morally correct” way, in which direction the function moves. In this setting, the derivative is usually referred to as the Jacobian matrix.
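For a concrete example (mine, not from the text above): for $f(x,y) = (xy,\ x+y)$, the Jacobian at $(x,y)$ is

$$ J_f(x,y) = \begin{pmatrix} y & x \\ 1 & 1 \end{pmatrix} $$

and the derivative at that point is the linear map $\varepsilon \mapsto J_f(x,y)\cdot\varepsilon$ for $\varepsilon\in\mathbb{R}^2$.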

Going back to $f(x)=x$ and $f^\prime(x)=1$, we see that the definition above computes $\varepsilon \cdot 1 = \varepsilon$. This is exactly what the linear transformation $f(x)=x$ does to any input value!

Now note that our derivative-with-value is $\mathcal{D}^+\ f\ a = (f\ a, \mathcal{D}\ f\ a) = (f\ a, \lambda x.(f^\prime\ a) \cdot x)$. So, for a linear function $f$, we have $\mathcal{D}\ f\ a = f$ for every $a$.
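To wrap up, a tiny sketch (same `DPlus` representation as above) of $\mathcal{D}^+$ for two linear functions, where the returned linear map is just the function itself:

```haskell
type DPlus a b = a -> (b, a -> b)

-- f(x) = x: the derivative at any point is again the identity.
dPlusId :: DPlus Double Double
dPlusId a = (a, \dx -> dx)

-- f(x) = c · x: the derivative at any point is again x ↦ c · x.
dPlusScale :: Double -> DPlus Double Double
dPlusScale c a = (c * a, \dx -> c * dx)
```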

Sources