This is the course page for 18.063 Matrix Calculus at MIT, taught in January 2026 (IAP) by Professors Alan Edelman and Steven G. Johnson.
- For past versions of this course, see Matrix Calculus in IAP 2023 (OCW) on OpenCourseWare (also on github, with videos on YouTube). See also Matrix Calculus in IAP 2022 (OCW) (also on github), Matrix Calculus 2024 (github), and Matrix Calculus 2025 (github); some previous years used the temporary 18.S096 "special subject" course number.
Lectures: MWF 11am–1pm, Jan 12–Jan 30 (except Jan 19), in room 35-310. 3 units, 2 problem sets due Jan 23 and Jan 30 (submitted electronically via Canvas), no exams.
Course Notes: 18.063 COURSE NOTES. Other materials to be posted below.
Piazza forum: Online discussions at Piazza.
Description:
We all know that calculus courses such as 18.01 and 18.02 are univariate and vector calculus, respectively. Modern applications such as machine learning and large-scale optimization require the next big step, "matrix calculus" and calculus on arbitrary vector spaces.
This class revisits and generalizes calculus from the perspective of linear algebra, extending it to much more general settings (e.g. the derivative of a matrix function, such as a matrix inverse or determinant, with respect to the matrix; of an integral with respect to a function; or of an ODE solution with respect to the ODE parameters) and connecting it to the computer science of efficient algorithms for differentiation and automatic differentiation (AD).
We present a coherent approach to matrix calculus emphasizing matrices as holistic objects (not just arrays of scalars), generalize and compute derivatives of important matrix factorizations and many other complicated-looking operations, and examine how differentiation formulas must be re-imagined in large-scale computing. We will discuss reverse/adjoint/backpropagation differentiation, custom vector-Jacobian products, and how modern AD is more computer science than calculus (it is neither symbolic formulas nor finite differences).
Prerequisites: Linear Algebra such as 18.06 and multivariate calculus such as 18.02.
The course will involve simple numerical computations using the Julia language. Ideally, install it on your own computer following these instructions, but as a fallback you can run it in the cloud here.
Topics:
Here are some of the planned topics:
- Derivatives as linear operators and linear approximation on arbitrary vector spaces: beyond gradients and Jacobians.
- Derivatives of functions with matrix inputs and/or outputs (e.g. matrix inverses and determinants). Kronecker products and matrix "vectorization".
- Derivatives of matrix factorizations (e.g. eigenvalues/SVD) and derivatives with constraints (e.g. orthogonal matrices).
- Multidimensional chain rules, and the significance of right-to-left ("forward") vs. left-to-right ("reverse") composition. Chain rules on computational graphs (e.g. neural networks).
- Forward- and reverse-mode manual and automatic multivariate differentiation.
- Adjoint methods (vJp/pullback rules) for derivatives of solutions of linear, nonlinear, and differential equations.
- Application to nonlinear root-finding and optimization. Multidimensional Newton and steepest-descent methods.
- Applications in engineering/scientific optimization and machine learning.
- Second derivatives, Hessian matrices, quadratic approximations, and quasi-Newton methods.
- part 1: overview (slides)
- part 2: derivatives as linear operators: matrix functions, gradients, product and chain rule
Re-thinking derivatives as linear operators: f(x+dx)-f(x)=df=f′(x)[dx]. That is, f′ is the linear operator that gives the change df in the output from a "tiny" change dx in the inputs, to first order in dx (i.e. dropping higher-order terms). When we have a vector function f(x)∈ℝᵐ of vector inputs x∈ℝⁿ, then f'(x) is a linear operator that takes n inputs to m outputs, which we can think of as an m×n matrix called the Jacobian matrix (typically covered only superficially in 18.02).
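This linear-approximation viewpoint is easy to test numerically. For example, here is a minimal Julia sketch (with a made-up example function f and its Jacobian worked out by hand) showing that f(x+dx)-f(x) matches the first-order prediction f′(x)[dx] = J(x) dx:

```julia
# A small example f: ℝ² → ℝ², with its Jacobian computed by hand.
f(x) = [x[1] * x[2], sin(x[1]) + x[2]^2]
J(x) = [x[2] x[1]; cos(x[1]) 2*x[2]]

x  = [1.0, 2.0]
dx = [3.0, -4.0] * 1e-6       # a "tiny" change in the input
@show f(x + dx) - f(x)        # actual change df in the output
@show J(x) * dx               # first-order prediction f′(x)[dx] = J(x) dx
```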
In the same way, we can define derivatives of matrix-valued operators as linear operators on matrices. For example, f(X)=X² gives f'(X)[dX] = X dX + dX X. Or f(X) = X⁻¹ gives f'(X)[dX] = –X⁻¹ dX X⁻¹. These are perfectly good linear operators acting on matrices dX, even though they are not written in the form (Jacobian matrix)×(column vector)! (We could rewrite them in the latter form by reshaping the inputs dX and the outputs df into column vectors, more formally by choosing a basis, and we will later cover how this process can be made more elegant using Kronecker products. But for the most part it is neither necessary nor desirable to express all linear operators as Jacobian matrices in this way.)
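These operator formulas are also easy to sanity-check numerically. For instance, a quick informal Julia check with a random X and a small random perturbation dX:

```julia
using LinearAlgebra

X  = randn(4, 4)
dX = randn(4, 4) * 1e-6       # a "small" random perturbation of X

# f(X) = X²:  predicted df = X dX + dX X
@show norm((X + dX)^2 - X^2 - (X*dX + dX*X))         # second order in dX: tiny

# f(X) = X⁻¹:  predicted df = -X⁻¹ dX X⁻¹
@show norm(inv(X + dX) - inv(X) + inv(X)*dX*inv(X))  # also second order in dX
```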
Further reading: Course Notes (link above), chapters 1 and 2. matrixcalculus.org (linked in the slides) is a fun site to play with derivatives of matrix and vector functions. The Matrix Cookbook has a lot of formulas for these derivatives, but no derivations. Some notes on vector and matrix differentiation were posted for 6.S087 from IAP 2021.
Further reading (fancier math): the perspective of derivatives as linear operators is sometimes called a Fréchet derivative and you can find lots of very abstract (what I'm calling "fancy") presentations of this online, chock full of weird terminology whose purpose is basically to generalize the concept to weird types of vector spaces. The "little-o notation" o(δx) we're using here for "infinitesimal asymptotics" is closely related to the asymptotic notation used in computer science, but in computer science people are typically taking the limit as the argument (often called "n") becomes very large instead of very small. We will formalize this later, corresponding to section 5.2 of the course notes.
- part 1: generalized sum and product rule, derivatives of X⁻¹ and ‖x‖² and xᵀAx; gradients ∇f of scalar-valued functions. Blackboard + some slides from lecture 1. Course notes: chapter 2.
- part 1: matrix-function Jacobians via vectorization and Kronecker products; notes: 2×2 Matrix Jacobians (html) (pluto notebook source code) (jupyter notebook). Course notes: chapter 3.
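As a concrete illustration of the two lecture bullets above, the following Julia sketch (an informal check with arbitrary random matrices, not part of the notes) verifies the gradient ∇(xᵀAx) = (A + Aᵀ)x and builds the explicit n²×n² Jacobian of f(X) = X² using the vectorization identity vec(AXB) = (Bᵀ ⊗ A) vec(X):

```julia
using LinearAlgebra

# Gradient of the scalar-valued f(x) = xᵀAx is ∇f = (A + Aᵀ)x:
n  = 5
A  = randn(n, n)
x  = randn(n)
dx = randn(n) * 1e-6
@show (x + dx)' * A * (x + dx) - x' * A * x   # actual change in f
@show dot((A + A') * x, dx)                   # predicted change ∇f ⋅ dx

# Vectorization: f(X) = X² has f′(X)[dX] = X dX + dX X, which becomes an
# ordinary n²×n² Jacobian matrix acting on vec(dX):
X  = randn(n, n)
dX = randn(n, n)
Id = Matrix(1.0I, n, n)
Jvec = kron(Id, X) + kron(transpose(X), Id)   # Jacobian of vec(X²) w.r.t. vec(X)
@show vec(X*dX + dX*X) ≈ Jvec * vec(dX)       # true
```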
Further reading (gradients): We will cover more generalizations later, corresponding to chapter 5 of the course notes. A fancy name for a row vector is a "covector" or linear form, and the fancy version of the relationship between row and column vectors is the Riesz representation theorem, but until you get to non-Euclidean geometry you may be happier thinking of a row vector as the transpose of a column vector.
- part 1: the chain rule and forward vs. reverse "mode" differentiation: course notes section 2.4. Example applications, chapter 6: slides on nonlinear root-finding, optimization, and adjoint-method differentiation.
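To see why the order of composition matters, consider a chain with many inputs and one scalar output: associating the Jacobian products left-to-right ("reverse mode") needs only vector-matrix products, while right-to-left ("forward mode") requires matrix-matrix products. A small illustrative Julia sketch (with made-up random Jacobians, not an example from the slides):

```julia
using LinearAlgebra

# Hypothetical Jacobians for a chain f = c ∘ b ∘ a, where
# x ∈ ℝⁿ → a → ℝⁿ → b → ℝⁿ → c → ℝ (scalar output):
n  = 1000
Ja = randn(n, n)     # Jacobian of a
Jb = randn(n, n)     # Jacobian of b
Jc = randn(1, n)     # Jacobian (gradient row vector) of the scalar map c

# The chain rule gives the same answer either way, but at very different cost:
forward = Jc * (Jb * Ja)   # right-to-left: an n×n matrix-matrix product, O(n³) work
reverse = (Jc * Jb) * Ja   # left-to-right: only vector-matrix products, O(n²) work
@show forward ≈ reverse    # true (up to roundoff)
```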
- Jacobian of the matrix inverse, and the gradient of the determinant, ∇(det A) = det(A)A⁻ᵀ (course notes chapter 7).
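The determinant-gradient formula can be checked in a couple of lines of Julia (an informal finite-difference check with a random matrix, writing the Frobenius inner product ⟨∇f, dA⟩ as dot):

```julia
using LinearAlgebra

A  = randn(5, 5)
dA = randn(5, 5) * 1e-6                  # a "small" random perturbation
grad = det(A) * transpose(inv(A))        # claimed gradient: ∇(det A) = det(A) A⁻ᵀ

@show det(A + dA) - det(A)               # actual change in det
@show dot(grad, dA)                      # predicted ⟨∇(det A), dA⟩ = det(A) tr(A⁻¹ dA)
```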