Automatic differentiation — JAX documentation (2024)

In this section, you will learn about fundamental applications of automatic differentiation (autodiff) in JAX. JAX has a general-purpose autodiff system. Computing gradients is a critical part of modern machine learning methods, and this tutorial walks you through a few introductory autodiff topics, such as:

  • 1. Taking gradients with jax.grad

  • 2. Computing gradients in a linear logistic regression

  • 3. Differentiating with respect to nested lists, tuples, and dicts

  • 4. Evaluating a function and its gradient using jax.value_and_grad

  • 5. Checking against numerical differences

Make sure to also check out the Advanced automatic differentiation tutorial for more advanced topics.

While understanding how automatic differentiation works “under the hood” isn’t crucial for using JAX in most contexts, you are encouraged to check out this quite accessible video to get a deeper sense of what’s going on.

1. Taking gradients with jax.grad

In JAX, you can differentiate a scalar-valued function with the jax.grad() transformation:

    import jax
    import jax.numpy as jnp
    from jax import grad

    grad_tanh = grad(jnp.tanh)
    print(grad_tanh(2.0))

jax.grad() takes a function and returns a function. If you have a Python function f that evaluates the mathematical function \(f\), then jax.grad(f) is a Python function that evaluates the mathematical function \(\nabla f\). That means grad(f)(x) represents the value \(\nabla f(x)\).

Since jax.grad() operates on functions, you can apply it to its own output to differentiate as many times as you like:
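A minimal sketch of nesting jax.grad() in this way, using jnp.tanh as the function being differentiated. The names d2_tanh and d3_tanh are chosen here for illustration only.

```python
import jax
import jax.numpy as jnp

# Second and third derivatives of tanh, built by nesting jax.grad.
d2_tanh = jax.grad(jax.grad(jnp.tanh))
d3_tanh = jax.grad(jax.grad(jax.grad(jnp.tanh)))

x = 2.0
second = d2_tanh(x)
third = d3_tanh(x)
print(second)
print(third)
```

Each call to jax.grad() wraps the previous function, so the nesting depth is the order of the derivative.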


JAX’s autodiff makes it easy to compute higher-order derivatives, because the functions that compute derivatives are themselves differentiable. Thus, higher-order derivatives are as easy as stacking transformations. This can be illustrated in the single-variable case:

The derivative of \(f(x) = x^3 + 2x^2 - 3x + 1\) can be computed as:

    f = lambda x: x**3 + 2*x**2 - 3*x + 1

    dfdx = jax.grad(f)

The higher-order derivatives of \(f\) are:

\[\begin{split}\begin{array}{l}f'(x) = 3x^2 + 4x -3\\f''(x) = 6x + 4\\f'''(x) = 6\\f^{iv}(x) = 0\end{array}\end{split}\]

Computing any of these in JAX is as easy as chaining the jax.grad() function:

    d2fdx = jax.grad(dfdx)
    d3fdx = jax.grad(d2fdx)
    d4fdx = jax.grad(d3fdx)

Evaluating the above at \(x=1\) would give you:

\[\begin{split}\begin{array}{l}f'(1) = 4\\f''(1) = 10\\f'''(1) = 6\\f^{iv}(1) = 0\end{array}\end{split}\]

Using JAX:
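A self-contained sketch that evaluates the four derivatives at \(x=1\). It redefines f and the chained derivatives so that it runs on its own; the values list is a name introduced here for illustration.

```python
import jax

f = lambda x: x**3 + 2*x**2 - 3*x + 1

# Chain jax.grad to build successive derivatives.
dfdx = jax.grad(f)
d2fdx = jax.grad(dfdx)
d3fdx = jax.grad(d2fdx)
d4fdx = jax.grad(d3fdx)

# jax.grad requires a floating-point input, so pass 1.0 rather than 1.
values = [float(d(1.0)) for d in (dfdx, d2fdx, d3fdx, d4fdx)]
print(values)  # [4.0, 10.0, 6.0, 0.0]
```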


2. Computing gradients in a linear logistic regression

The next example shows how to compute gradients with jax.grad() in a linear logistic regression model. First, the setup:

    key = jax.random.key(0)

    def sigmoid(x):
        return 0.5 * (jnp.tanh(x / 2) + 1)

    # Outputs probability of a label being true.
    def predict(W, b, inputs):
        return sigmoid(jnp.dot(inputs, W) + b)

    # Build a toy dataset.
    inputs = jnp.array([[0.52, 1.12, 0.77],
                        [0.88, -1.08, 0.15],
                        [0.52, 0.06, -1.30],
                        [0.74, -2.49, 1.39]])
    targets = jnp.array([True, True, False, True])

    # Training loss is the negative log-likelihood of the training examples.
    def loss(W, b):
        preds = predict(W, b, inputs)
        label_probs = preds * targets + (1 - preds) * (1 - targets)
        return -jnp.sum(jnp.log(label_probs))

    # Initialize random model coefficients
    key, W_key, b_key = jax.random.split(key, 3)
    W = jax.random.normal(W_key, (3,))
    b = jax.random.normal(b_key, ())

Use the jax.grad() function with its argnums argument to differentiate a function with respect to positional arguments.

    # Differentiate `loss` with respect to the first positional argument:
    W_grad = grad(loss, argnums=0)(W, b)
    print(f'{W_grad=}')

    # Since argnums=0 is the default, this does the same thing:
    W_grad = grad(loss)(W, b)
    print(f'{W_grad=}')

    # But you can choose different values too, and drop the keyword:
    b_grad = grad(loss, 1)(W, b)
    print(f'{b_grad=}')

    # Including tuple values
    W_grad, b_grad = grad(loss, (0, 1))(W, b)
    print(f'{W_grad=}')
    print(f'{b_grad=}')

    W_grad=Array([-0.16965583, -0.8774644 , -1.4901346 ], dtype=float32)
    W_grad=Array([-0.16965583, -0.8774644 , -1.4901346 ], dtype=float32)
    b_grad=Array(-0.29227245, dtype=float32)
    W_grad=Array([-0.16965583, -0.8774644 , -1.4901346 ], dtype=float32)
    b_grad=Array(-0.29227245, dtype=float32)

The jax.grad() API has a direct correspondence to the excellent notation in Spivak’s classic Calculus on Manifolds (1965), also used in Sussman and Wisdom’s Structure and Interpretation of Classical Mechanics (2015) and their Functional Differential Geometry (2013). Both books are open-access. See in particular the “Prologue” section of Functional Differential Geometry for a defense of this notation.

Essentially, when using the argnums argument, if f is a Python function for evaluating the mathematical function \(f\), then the Python expression jax.grad(f, i) evaluates to a Python function for evaluating \(\partial_i f\).
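To make the correspondence with \(\partial_i f\) concrete, here is a small sketch on a hypothetical two-argument function \(f(x, y) = x^2 y\), whose partial derivatives are \(\partial_0 f = 2xy\) and \(\partial_1 f = x^2\). The function and the evaluation point are assumptions chosen for illustration.

```python
import jax

# Hypothetical example function: f(x, y) = x^2 * y.
def f(x, y):
    return x**2 * y

d0f = jax.grad(f, 0)  # partial derivative with respect to x: 2 * x * y
d1f = jax.grad(f, 1)  # partial derivative with respect to y: x^2

x, y = 3.0, 2.0
print(d0f(x, y))  # 12.0
print(d1f(x, y))  # 9.0
```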

3. Differentiating with respect to nested lists, tuples, and dicts

Due to JAX’s PyTree abstraction (see Working with pytrees), differentiating with respect to standard Python containers just works, so use tuples, lists, and dicts (and arbitrary nesting) however you like.

Continuing the previous example:

    def loss2(params_dict):
        preds = predict(params_dict['W'], params_dict['b'], inputs)
        label_probs = preds * targets + (1 - preds) * (1 - targets)
        return -jnp.sum(jnp.log(label_probs))

    print(grad(loss2)({'W': W, 'b': b}))

    {'W': Array([-0.16965583, -0.8774644 , -1.4901346 ], dtype=float32), 'b': Array(-0.29227245, dtype=float32)}

You can create Custom pytree nodes to work with not just jax.grad() but other JAX transformations (jax.jit(), jax.vmap(), and so on).
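As a rough sketch of what a custom pytree node can look like, the example below registers a hypothetical Params class with jax.tree_util.register_pytree_node_class and differentiates with respect to it. The class, the squared_output function, and the numbers are all assumptions made for illustration; they are not part of the logistic regression example.

```python
import jax
import jax.numpy as jnp
from jax.tree_util import register_pytree_node_class

@register_pytree_node_class
class Params:
    """Hypothetical parameter container registered as a pytree node."""

    def __init__(self, W, b):
        self.W = W
        self.b = b

    def tree_flatten(self):
        # Children are the array leaves; there is no static auxiliary data.
        return (self.W, self.b), None

    @classmethod
    def tree_unflatten(cls, aux_data, children):
        return cls(*children)

def squared_output(params, x):
    return (jnp.dot(x, params.W) + params.b) ** 2

params = Params(jnp.array([1.0, 2.0]), jnp.array(0.5))
x = jnp.array([3.0, 4.0])

# The gradient comes back as a Params instance with the same structure.
g = jax.grad(squared_output)(params, x)
print(g.W, g.b)
```

The gradient has the same container type as the input, so downstream code can treat parameters and gradients uniformly.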

4. Evaluating a function and its gradient using jax.value_and_grad

Another convenient function is jax.value_and_grad(), which efficiently computes both a function’s value and its gradient in one pass.

Continuing the previous examples:

    loss_value, Wb_grad = jax.value_and_grad(loss, (0, 1))(W, b)
    print('loss value', loss_value)
    print('loss value', loss(W, b))

    loss value 3.0519385
    loss value 3.0519385
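One common use of jax.value_and_grad() is a training loop that logs the loss while updating parameters. The sketch below runs plain gradient descent on a hypothetical quadratic loss; the target vector, learning rate, and step count are assumptions for illustration and are unrelated to the logistic regression data above.

```python
import jax
import jax.numpy as jnp

target = jnp.array([1.0, -2.0, 3.0])  # hypothetical data

def quadratic_loss(w):
    return jnp.sum((w - target) ** 2)

# Returns (loss value, gradient) from a single call.
loss_and_grad = jax.value_and_grad(quadratic_loss)

w = jnp.zeros(3)
learning_rate = 0.1
history = []
for step in range(100):
    value, g = loss_and_grad(w)
    history.append(float(value))  # log the loss without a second forward pass
    w = w - learning_rate * g

print(history[0], history[-1])
print(w)
```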

5. Checking against numerical differences

A great thing about derivatives is that they’re straightforward to check with finite differences.

Continuing the previous examples:

    # Set a step size for finite differences calculations
    eps = 1e-4

    # Check b_grad with scalar finite differences
    b_grad_numerical = (loss(W, b + eps / 2.) - loss(W, b - eps / 2.)) / eps
    print('b_grad_numerical', b_grad_numerical)
    print('b_grad_autodiff', grad(loss, 1)(W, b))

    # Check W_grad with finite differences in a random direction
    key, subkey = jax.random.split(key)
    vec = jax.random.normal(subkey, W.shape)
    unitvec = vec / jnp.sqrt(jnp.vdot(vec, vec))
    W_grad_numerical = (loss(W + eps / 2. * unitvec, b) - loss(W - eps / 2. * unitvec, b)) / eps
    print('W_dirderiv_numerical', W_grad_numerical)
    print('W_dirderiv_autodiff', jnp.vdot(grad(loss)(W, b), unitvec))

    b_grad_numerical -0.29325485
    b_grad_autodiff -0.29227245
    W_dirderiv_numerical -0.2002716
    W_dirderiv_autodiff -0.19909117

JAX provides a simple convenience function that does essentially the same thing, but checks up to any order of differentiation that you like:

    from jax.test_util import check_grads

    check_grads(loss, (W, b), order=2)  # check up to 2nd order derivatives

Next steps

The Advanced automatic differentiation tutorial provides more advanced and detailed explanations of how the ideas covered in this document are implemented in the JAX backend. Some features, such as Custom derivative rules for JAX-transformable Python functions, depend on understanding advanced automatic differentiation, so do check out that section in the Advanced automatic differentiation tutorial if you are interested.

How does JAX Autograd work?

Google JAX is a machine learning framework for transforming numerical functions to be used in Python. It is described as bringing together a modified version of autograd (automatic obtaining of the gradient function through differentiation of a function) and TensorFlow's XLA (Accelerated Linear Algebra).

Is Autograd the same as autodiff?

Autograd is the name of a particular autodiff package. Autodiff is not the same as finite differences. Finite differences are expensive, since you need at least one extra forward pass for each input you differentiate with respect to. They also introduce truncation and round-off error.

What is the difference between jax.jacrev and jax.jacfwd?

Both compute Jacobians (and, by composition, Hessians). jax.jacfwd() uses forward-mode automatic differentiation, which is more efficient for “tall” Jacobian matrices (more outputs than inputs), while jax.jacrev() uses reverse-mode, which is more efficient for “wide” Jacobian matrices (more inputs than outputs).
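A minimal sketch comparing the two on a hypothetical function from three inputs to two outputs. Both transformations should return the same Jacobian; only the cost of computing it differs. The function f and the point x are assumptions for illustration.

```python
import jax
import jax.numpy as jnp

# Hypothetical function from R^3 to R^2.
def f(x):
    return jnp.array([x[0] * x[1], jnp.sin(x[2])])

x = jnp.array([1.0, 2.0, 3.0])

J_fwd = jax.jacfwd(f)(x)  # forward mode: one pass per input column
J_rev = jax.jacrev(f)(x)  # reverse mode: one pass per output row

print(J_fwd.shape)  # (2, 3)
print(jnp.allclose(J_fwd, J_rev))
```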

What does automatic differentiation mean?

In mathematics and computer algebra, automatic differentiation (auto-differentiation, autodiff, or AD), also called algorithmic differentiation or computational differentiation, is a set of techniques to evaluate the partial derivatives of a function specified by a computer program.

How does Autograd work?

In PyTorch, autograd is a reverse-mode automatic differentiation system. Conceptually, it records a graph of all the operations that created the data as you execute them, giving you a directed acyclic graph whose leaves are the input tensors and whose roots are the output tensors.

Is JAX faster than TensorFlow?

Speed: JAX can be faster than TensorFlow for some tasks, but the difference depends on the model, the hardware, and how much of the computation is JIT-compiled. Ease of use: JAX can be easier to use than TensorFlow for some tasks, especially for those who are familiar with functional programming. Flexibility: JAX's composable function transformations (such as jax.grad(), jax.jit(), and jax.vmap()) make some programs easier to express.

What are the benefits of automatic differentiation?

Automatic differentiation is a powerful tool to automate the calculation of derivatives and is preferable to more traditional methods, especially when differentiating complex algorithms and mathematical functions. The implementation of automatic differentiation, however, requires some care to ensure efficiency.

What is the AAD methodology?

Adjoint algorithmic differentiation is a mathematical technique used to significantly speed up the calculation of sensitivities of derivatives prices to underlying factors, called Greeks. It is widely used in the risk management of complex derivatives and valuation adjustments.

Who invented Autograd?

Autograd was written by Dougal Maclaurin, David Duvenaud, Matt Johnson, Jamie Townsend and many other contributors. The package is currently still being maintained, but is no longer actively developed.

What is the difference between vmap and pmap in JAX?

Semantically, pmap() is comparable to vmap() because both transformations map a function over array axes, but where vmap() vectorizes functions by pushing the mapped axis down into primitive operations, pmap() instead replicates the function and executes each replica on its own XLA device in parallel.
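A small sketch of the vmap() half of this comparison, which runs on a single device. The weights, the per-example function predict_one, and the batch are hypothetical names and values chosen for illustration.

```python
import jax
import jax.numpy as jnp

w = jnp.array([1.0, 2.0, 3.0])  # hypothetical weights

# Written for a single example of shape (3,).
def predict_one(x):
    return jnp.dot(w, x)

batch = jnp.arange(12.0).reshape(4, 3)  # four examples

# vmap maps predict_one over the leading axis of the batch.
batched = jax.vmap(predict_one)(batch)
print(batched.shape)  # (4,)
```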

What is JVP in JAX?

A Jacobian-vector product (JVP) is the basic operation of forward-mode autodiff. JAX includes efficient and general implementations of both forward- and reverse-mode automatic differentiation.
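As an illustrative sketch, jax.jvp() takes a function, a tuple of primal inputs, and a tuple of tangent vectors, and returns the function value together with the directional derivative along the tangent. The function f and the vectors below are assumptions for illustration.

```python
import jax
import jax.numpy as jnp

# Hypothetical scalar-valued function of a vector.
def f(x):
    return jnp.sum(x ** 2)

x = jnp.array([1.0, 2.0, 3.0])
v = jnp.array([1.0, 0.0, 0.0])  # direction to differentiate along

# Forward mode: value and directional derivative in one pass.
primal_out, tangent_out = jax.jvp(f, (x,), (v,))

# The same directional derivative via the reverse-mode gradient.
directional = jnp.vdot(jax.grad(f)(x), v)

print(primal_out, tangent_out, directional)
```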

What is jax.jit?

jax.jit() is a transformation that performs Just In Time (JIT) compilation of a JAX Python function so that it can be executed efficiently in XLA.

How accurate is automatic differentiation?

Automatic differentiation is a set of techniques for evaluating derivatives (gradients) numerically. The method uses symbolic rules for differentiation, which are more accurate than finite difference approximations.

What is automatic differentiation in scientific programming with JAX?

In scientific programming we use derivatives extensively. Automatic differentiation is a convenient tool for writing programs with derivatives where it is not necessary to approximate them, or derive and implement them.

Does JAX automatically use GPU?

JAX runs transparently on the GPU or TPU (falling back to CPU if you don't have one). By default, however, JAX dispatches kernels to the chip one operation at a time. If you have a sequence of operations, you can use the jax.jit() function to compile this sequence of operations together using XLA.
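A minimal sketch of compiling a short sequence of operations with jax.jit(). The selu function and its constants are used here only as a hypothetical example of several elementwise operations that can be fused into one compiled computation.

```python
import jax
import jax.numpy as jnp

# Hypothetical function made of several elementwise operations.
def selu(x, alpha=1.67, lmbda=1.05):
    return lmbda * jnp.where(x > 0, x, alpha * jnp.exp(x) - alpha)

selu_jit = jax.jit(selu)  # compiled with XLA on first call

x = jnp.arange(-3.0, 3.0, 0.5)
eager = selu(x)         # dispatched one operation at a time
compiled = selu_jit(x)  # runs as a single compiled computation

print(jnp.allclose(eager, compiled))
```

The first call to the jitted function includes compilation time, so any timing comparison should use later calls.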

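How does JAX calculate gradients?

jax.grad() computes gradients by reverse-mode automatic differentiation, not by finite differences. The finite-difference scheme sometimes quoted in answer to this question (second-order accurate central differences at the interior points and one-sided differences at the boundaries, returning an array with the same shape as the input) describes numpy.gradient-style numerical differentiation of sampled arrays, which is a different operation.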
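The sketch below contrasts the two operations, assuming jnp.gradient follows the default behavior of numpy.gradient on a one-dimensional array. The square function and the sample values are hypothetical.

```python
import jax
import jax.numpy as jnp

# Autodiff: differentiates the function itself.
square = lambda x: x ** 2
exact = jax.grad(square)(2.0)  # derivative 2x at x = 2.0

# Finite differences: differentiates sampled values of x^2 at x = 1, 2, 3, 4.
samples = jnp.array([1.0, 4.0, 9.0, 16.0])
numerical = jnp.gradient(samples)

print(exact)      # 4.0
print(numerical)  # central differences inside, one-sided at the ends
```

The interior entries of the finite-difference result happen to be exact for a quadratic, while the one-sided boundary entries are not.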

What is the purpose of the torch Autograd engine?

PyTorch Autograd is a package for automatic differentiation in PyTorch. It allows you to compute the gradients of any PyTorch computation with respect to any PyTorch tensor. Autograd works by tracking all of the operations performed on PyTorch tensors.

How is JAX better than PyTorch?

JAX exposes a NumPy-like API and can offer performance improvements over frameworks like TensorFlow and PyTorch for some workloads. One of its strengths is that the Hessian, a mathematical tool useful for optimization tasks, is straightforward to compute by composing JAX's differentiation transformations.
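A short sketch of computing a Hessian with jax.hessian(). The function f is a hypothetical example whose Hessian is diagonal with entries \(6x_i\).

```python
import jax
import jax.numpy as jnp

# Hypothetical scalar-valued function: f(x) = sum(x_i^3).
def f(x):
    return jnp.sum(x ** 3)

x = jnp.array([1.0, 2.0, 3.0])

# Matrix of second partial derivatives, shape (3, 3).
H = jax.hessian(f)(x)
print(H)
```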

Author: Moshe Kshlerin