Mathematical motivation

This page derives, in one place, every statistic that Margins computes and the standard error attached to each one. The motivation is identical to Stata’s margins-delta-method FAQ and the cases enumerated in Richard Williams’ Margins01 notes; the goal here is to write them down using a single primitive so the implementation reduces to one Jacobian.

The delta method

For a fitted parameter vector \(\hat\beta\) with estimated covariance \(\widehat V\), and a (possibly vector-valued) statistic \(g(\beta)\), a first-order Taylor expansion gives

\[g(\hat\beta) \;\approx\; g(\beta_0) + G\,(\hat\beta - \beta_0), \qquad G = \left.\tfrac{\partial g}{\partial \beta}\right|_{\beta_0}.\]

Taking variances on both sides yields

\[\widehat{\operatorname{Var}}\bigl[g(\hat\beta)\bigr] \;\approx\; G\,\widehat V\,G^\top .\]

Three things to notice:

  1. The approximation depends only on first derivatives of \(g\).

  2. Once \(G\widehat V G^\top\) is in hand, any linear combination \(C\,g(\hat\beta)\) has covariance \(C(G\widehat V G^\top)C^\top\) — no further differentiation is needed. That is why contrast() is exact under the same approximation.

  3. The standard error of \(g(\hat\beta)\) is just the square-root of the diagonal of \(G\widehat V G^\top\); we report both.

A single statistic schema

Let \(f(\eta) = E[Y \mid \eta]\) be the response-scale map (the identity for OLS, the inverse link for a GLM, etc.). Williams enumerates seven statistics; every one can be written as a linear combination of mean predictions on different design matrices:

Statistic

\(g(\beta)\)

AAP

\(\frac{1}{n}\sum_i f(x_i^\top\beta)\)

APM

\(f(\bar x^\top\beta)\)

APR

\(\frac{1}{n}\sum_i f(x_i^\top\beta)\) with some \(x\) fixed

AME

\(\frac{1}{n}\sum_i \partial f(x_i^\top\beta)/\partial x_{ij}\)

MEM

\(\partial f(\bar x^\top\beta)/\partial x_j\)

MER

AME with some \(x\) held at representative values

Contrast

\(E[f(\cdot)\mid x_j=\text{level}] - E[f(\cdot)\mid x_j=\text{ref}]\)

Every entry is a linear combination of mean-predictions \(\frac{1}{n}\sum_i f(x_i^\top\beta)\) on different design matrices. The single Jacobian primitive is therefore

\[\frac{\partial}{\partial\beta}\, \frac{1}{n}\sum_i f(x_i^\top\beta) \;=\; \frac{1}{n}\sum_i f'(x_i^\top\beta)\,x_i .\]

Substituting back into the delta method gives a closed form for the covariance of every statistic in the table.

Marginal effects of a single variable

For a continuous covariate \(x_{ij}\),

\[\frac{\partial f(x_i^\top\beta)}{\partial x_{ij}} \;=\; f'(x_i^\top\beta)\,\frac{\partial (x_i^\top\beta)}{\partial x_{ij}} .\]

If the design matrix is just \(X = [\,\dots,\, x_j,\,\dots]\), the inner derivative is \(\beta_j\) and the AME collapses to the familiar \(\overline{f'(x^\top\beta)}\,\beta_j\). With a formula involving I(x**2), x1:x2, bs(x, df=4), or a Categorical contrast, the inner derivative is whatever patsy produces when the data column is perturbed and the design matrix is rebuilt — which is exactly what Margins does.

Discrete contrasts (factors, booleans, low-cardinality numerics) are handled by computing \(f\) at the level and at the reference, and differencing — the delta-method covariance follows from stacking the two predictions and applying the contrast matrix \(C = [-1, +1]\).

Elasticities

The four method= choices on dydx() apply a per-observation transform to the raw derivative \(d_i = \partial f(x_i^\top\beta)/\partial x_{ij}\) before averaging:

Method

Per-obs transform

Interpretation

dydx

\(d_i\)

Level change (default).

dyex

\(d_i\,x_i\)

Semi-elasticity, \(dy/d(\ln x)\).

eyex

\(d_i\,x_i / y_i\)

Full elasticity.

eydx

\(d_i / y_i\)

Semi-elasticity, \(d(\ln y)/dx\).

Elasticities are only defined for continuous variables; the implementation raises if you ask for an elasticity on a discrete one.

Outer Jacobian: analytic vs. finite differences

Stata’s margins computes the entire \(G\) numerically. We compute it analytically when possible:

  • For any GLM, \(f'\) is read off family.link.inverse_deriv. This covers Logit, Probit, Poisson, NegBin, Gamma, and Gaussian out of the box.

  • For OLS / WLS / GLS, \(f' \equiv 1\) (identity link).

  • For MNLogit, we use an analytic softmax-derivative gradient.

Whenever an analytic \(f'\) or Jacobian is available and the linear predictor is \(\eta = X\beta\) (no offset/exposure), the chain rule gives \(\partial g/\partial\beta\) in closed form, saving \(p\) forward predict calls per statistic.

The fallback is central differences,

\[\frac{\partial f}{\partial x_j} \approx \frac{f(x + h e_j) - f(x - h e_j)}{2 h_j}, \qquad h_j = \epsilon^{1/3}\cdot\max(|x_j|, 1),\]

with \(h_j\) chosen to balance truncation error (\(O(h^2)\)) against round-off (\(O(\epsilon/h)\)); the optimum scales as \(h \sim \epsilon^{1/3}\). See _central_jacobian() and Nocedal & Wright, Numerical Optimization (2006), Chapter 8.

  • Analytic-vs-FD parity matrix. Every public API call (AAP, APM, APR, AME, MEM, MER — continuous and discrete) is run twice on each of OLS+poly, Logit+`C(grp)`, Poisson, and MNLogit, with analytic=True and analytic=False, and the two paths must agree on both estimate and SE. This guards against the analytic chain-rule formula drifting from the FD answer that the other tests pin to Stata / get_margeff.

Why the response scale matters for DiD (Ai & Norton 2003)

In a logit, the coefficient on a group:condition interaction is on the log-odds scale. On the probability scale, the difference-in-differences

\[\mathrm{DiD}(x_*) = \bigl[f(\eta_{1,1}) - f(\eta_{1,0})\bigr] - \bigl[f(\eta_{0,1}) - f(\eta_{0,0})\bigr]\]

is a nonlinear function of every parameter and every covariate profile \(x_*\). You cannot read it off the interaction coefficient. did() evaluates the four cells on the response scale, with their joint delta-method covariance, and forms the DiD as a contrast — Ai & Norton’s recommendation in code.

References