Automatic Differentiation
I asked my son Alexey Radul what exactly he is doing for his postdoc at the Hamilton Institute in Ireland. Here is his reply:
The short, jargon-loaded version: We are building an optimizing compiler for a programming language with first-class automatic differentiation, and exploring mathematical foundations, connections, applications, etc.
Interpretation of jargon:
Automatic differentiation is a technique for turning a program that computes a function into a program that computes that function together with its derivative; with a constant factor overhead. This is better than the usual symbolic differentiation that, say, Mathematica does because there is no intermediate-expression bulge. For example, if your function is a large product
Product f1(x) f2(x) … fn(x),
the symbolic derivative has size n2
Sum (Product f1′(x) f2(x) … fn(x)) (Product f1(x) f2′(x) … fn(x)) … (Product f1(x) f2(x) … fn'(x))
automatic differentiation avoids that cost. Automatic (as opposed to symbolic) differentiation also extends to conditionals, data structures, higher-order functions, and all the other wonderful things that distinguish a computer program from a mathematical expression.
First-class means that the differentiation operations are normal citizens of the programming language. This is not the case with commonly used automatic differentiation systems, which are all preprocessors that rewrite C or Fortran source code. In particular, we want to be able to differentiate any function written in the language, even if it is a derivative of something, or contains a derivative of something, etc. The automatic differentiation technique works but becomes more complicated in the presence of higher order, multivariate, or nested derivatives.
We are building an optimizing compiler because the techniques necessary to get good performance and correct results with completely general automatic differentiation are exactly the techniques used to produce aggressive optimizing compilers for functional languages, so we might as well go all the way.
It appears that the AD trick (or at least half of it) is just an implementation of synthetic differential geometry in the computer. This leads one to hope that a good mathematical foundation can be found that dictates the behavior of the system in all the interesting special cases; there is lots of math to be thought about in the vicinity of this stuff.
Applications are also plentiful. Any time you want to optimize anything with respect to real parameters, gradients help. Any time you are dealing with curves, slopes help. Computer graphics, computer vision, physics simulations, economic and financial models, probabilities — there’s so much stuff to apply a high quality such system to that we don’t know where to begin.
Share: