## Automatic Differentiation

I asked my son Alexey Radul what exactly he is doing for his postdoc at the Hamilton Institute in Ireland. Here is his reply:

The short, jargon-loaded version: We are building an optimizing compiler for a programming language with first-class automatic differentiation, and exploring mathematical foundations, connections, applications, etc.

Interpretation of jargon:

Automatic differentiation is a technique for turning a program that computes a function into a program that computes that function together with its derivative; with a constant factor overhead. This is better than the usual symbolic differentiation that, say, Mathematica does because there is no intermediate-expression bulge. For example, if your function is a large product

Product f1(x) f2(x) … fn(x),

the symbolic derivative has size n2

Sum (Product f1′(x) f2(x) … fn(x)) (Product f1(x) f2′(x) … fn(x)) … (Product f1(x) f2(x) … fn'(x))

automatic differentiation avoids that cost. Automatic (as opposed to symbolic) differentiation also extends to conditionals, data structures, higher-order functions, and all the other wonderful things that distinguish a computer program from a mathematical expression.

First-class means that the differentiation operations are normal citizens of the programming language. This is not the case with commonly used automatic differentiation systems, which are all preprocessors that rewrite C or Fortran source code. In particular, we want to be able to differentiate any function written in the language, even if it is a derivative of something, or contains a derivative of something, etc. The automatic differentiation technique works but becomes more complicated in the presence of higher order, multivariate, or nested derivatives.

We are building an optimizing compiler because the techniques necessary to get good performance and correct results with completely general automatic differentiation are exactly the techniques used to produce aggressive optimizing compilers for functional languages, so we might as well go all the way.

It appears that the AD trick (or at least half of it) is just an implementation of synthetic differential geometry in the computer. This leads one to hope that a good mathematical foundation can be found that dictates the behavior of the system in all the interesting special cases; there is lots of math to be thought about in the vicinity of this stuff.

Applications are also plentiful. Any time you want to optimize anything with respect to real parameters, gradients help. Any time you are dealing with curves, slopes help. Computer graphics, computer vision, physics simulations, economic and financial models, probabilities — there’s so much stuff to apply a high quality such system to that we don’t know where to begin.

Share:

1. #### misha:

It may be a good idea to make sure that your system doesn’t spit out some gibberish upon “differentiating” non-differentiable functions (and these are rather common, i.e., non-conforming finite elements etc.), before the bridges and buildings start collapsing and airplanes start breaking up in flight.
If applied appropriately, I’m all for it, but caution should be exercised.

As for the synthetic differential geometry (as well as the “dual numbers” in AD), it’s just a sleek way to work with the jets of functions and maps.

2. #### Quick Links….a lot of them…all good. | A Blog Around The Clock:

[…] Automatic Differentiation […]

3. #### misha:

When we pass from classical analysis of, say, continuously differentiable functions to doing computer computations, where we deal with fundamentally discrete objects, we lose the “one size fits all” notion of the derivative (and of integral). As Doron Zeilberger puts it,
“Real” Analysis is a Degenerate Case of Discrete Analysis. Also the question of numerical stability become important. We simply should not expect this or that procedure for differentiating functions defined by computer programs to be universally meaningful and applicable.

4. #### misha:

As I said in the last paragraph that I had added a while ago to the introductory section of the Wikipedia article “Automatic differentiation,” “… AD should be used with some caution when applied to a program written to calculate a certain function. This program is usually optimized to calculate this function itself, but gives less than optimal results for its (especially higher) derivatives when treated by AD algorithms.[citation needed] Therefore the claim that AD always gives the derivatives with machine accuracy should be taken with a grain of salt. It is simply unlikely to be always true.”

Sorry for yet another critical remark, people are probably already tired of them. But I am really worried about the dangerous trend of using numerical software without clear (or even any) understanding of how it works, or mathematical principles behind it.

5. #### aristogeit:

he fails to explain what automatic differentiation is
and if he cant explain it he doesnt know it as the einesteinean quote goes

who should we believe?
I’d say neither