Mathematical motivation

This page derives, in one place, every statistic that Margins computes and the standard error attached to each one. The motivation is identical to Stata’s margins-delta-method FAQ and the cases enumerated in Richard Williams’ Margins01 notes; the goal here is to write them down using a single primitive so the implementation reduces to one Jacobian.

The delta method

For a fitted parameter vector \(\hat\beta\) with estimated covariance \(\widehat V\), and a (possibly vector-valued) statistic \(g(\beta)\), a first-order Taylor expansion gives

\[g(\hat\beta) \;\approx\; g(\beta_0) + G\,(\hat\beta - \beta_0), \qquad G = \left.\tfrac{\partial g}{\partial \beta}\right|_{\beta_0}.\]

Taking variances on both sides yields

\[\widehat{\operatorname{Var}}\bigl[g(\hat\beta)\bigr] \;\approx\; G\,\widehat V\,G^\top .\]

Three things to notice:

  1. The approximation depends only on first derivatives of \(g\).

  2. Once \(G\widehat V G^\top\) is in hand, any linear combination \(C\,g(\hat\beta)\) has covariance \(C(G\widehat V G^\top)C^\top\) — no further differentiation is needed. That is why contrast() is exact under the same approximation.

  3. The standard error of \(g(\hat\beta)\) is just the square-root of the diagonal of \(G\widehat V G^\top\); we report both.

A single statistic schema

Let \(f(\eta) = E[Y \mid \eta]\) be the response-scale map (the identity for OLS, the inverse link for a GLM, etc.). Williams enumerates seven statistics; every one can be written as a linear combination of mean predictions on different design matrices:

Statistic

\(g(\beta)\)

AAP

\(\frac{1}{n}\sum_i f(x_i^\top\beta)\)

APM

\(f(\bar x^\top\beta)\)

APR

\(\frac{1}{n}\sum_i f(x_i^\top\beta)\) with some \(x\) fixed

AME

\(\frac{1}{n}\sum_i \partial f(x_i^\top\beta)/\partial x_{ij}\)

MEM

\(\partial f(\bar x^\top\beta)/\partial x_j\)

MER

AME with some \(x\) held at representative values

Contrast

\(E[f(\cdot)\mid x_j=\text{level}] - E[f(\cdot)\mid x_j=\text{ref}]\)

Every entry is a linear combination of mean-predictions \(\frac{1}{n}\sum_i f(x_i^\top\beta)\) on different design matrices. The single Jacobian primitive is therefore

\[\frac{\partial}{\partial\beta}\, \frac{1}{n}\sum_i f(x_i^\top\beta) \;=\; \frac{1}{n}\sum_i f'(x_i^\top\beta)\,x_i .\]

Substituting back into the delta method gives a closed form for the covariance of every statistic in the table.

Marginal effects of a single variable

For a continuous covariate \(x_{ij}\),

\[\frac{\partial f(x_i^\top\beta)}{\partial x_{ij}} \;=\; f'(x_i^\top\beta)\,\frac{\partial (x_i^\top\beta)}{\partial x_{ij}} .\]

If the design matrix is just \(X = [\,\dots,\, x_j,\,\dots]\), the inner derivative is \(\beta_j\) and the AME collapses to the familiar \(\overline{f'(x^\top\beta)}\,\beta_j\). With a formula involving I(x**2), x1:x2, bs(x, df=4), or a Categorical contrast, the inner derivative is whatever patsy produces when the data column is perturbed and the design matrix is rebuilt — which is exactly what Margins does.

Discrete contrasts (factors, booleans, low-cardinality numerics) are handled by computing \(f\) at the level and at the reference, and differencing — the delta-method covariance follows from stacking the two predictions and applying the contrast matrix \(C = [-1, +1]\).

Elasticities

The four method= choices on dydx() apply a per-observation transform to the raw derivative \(d_i = \partial f(x_i^\top\beta)/\partial x_{ij}\) before averaging:

Method

Per-obs transform

Interpretation

dydx

\(d_i\)

Level change (default).

dyex

\(d_i\,x_i\)

Semi-elasticity, \(dy/d(\ln x)\).

eyex

\(d_i\,x_i / y_i\)

Full elasticity.

eydx

\(d_i / y_i\)

Semi-elasticity, \(d(\ln y)/dx\).

Elasticities are only defined for continuous variables; the implementation raises if you ask for an elasticity on a discrete one.

Outer Jacobian: analytic vs. finite differences

Stata’s margins computes the entire \(G\) numerically. We compute it analytically when possible:

  • For any GLM, \(f'\) is read off family.link.inverse_deriv. This covers Logit, Probit, Poisson, NegBin, Gamma, and Gaussian out of the box.

  • For OLS / WLS / GLS, \(f' \equiv 1\) (identity link).

  • For MNLogit, we use an analytic softmax-derivative gradient.

Whenever an analytic \(f'\) or Jacobian is available and the linear predictor is \(\eta = X\beta\) (no offset/exposure), the chain rule gives \(\partial g/\partial\beta\) in closed form, saving \(p\) forward predict calls per statistic.

The fallback is central differences,

\[\frac{\partial f}{\partial x_j} \approx \frac{f(x + h e_j) - f(x - h e_j)}{2 h_j}, \qquad h_j = \epsilon^{1/3}\cdot\max(|x_j|, 1),\]

with \(h_j\) chosen to balance truncation error (\(O(h^2)\)) against round-off (\(O(\epsilon/h)\)); the optimum scales as \(h \sim \epsilon^{1/3}\). See _central_jacobian() and Nocedal & Wright, Numerical Optimization (2006), Chapter 8.

  • Analytic-vs-FD parity matrix. Every public API call (AAP, APM, APR, AME, MEM, MER — continuous and discrete) is run twice on each of OLS+poly, Logit+`C(grp)`, Poisson, and MNLogit, with analytic=True and analytic=False, and the two paths must agree on both estimate and SE. This guards against the analytic chain-rule formula drifting from the FD answer that the other tests pin to Stata / get_margeff.

Beyond delta: simulation and bootstrap VCEs

The delta method is a first-order approximation: \(G\widehat V G^\top\) is exact only to the extent that \(g\) is locally linear in \(\beta\) near \(\hat\beta\). For statistics that are sharply nonlinear (a probability near 0 or 1, an elasticity at a near-zero prediction) this can produce symmetric Wald intervals that extend beyond the natural support, or under-cover when the curvature is non-negligible.

smmargins exposes two alternatives behind the same vce= keyword.

Krinsky–Robb simulation (vce="simulation"). Draw \(S\) parameter vectors \(\beta_s\sim N(\hat\beta,\widehat V)\), evaluate \(g(\beta_s)\) for each, and read inference off the empirical distribution of the draws:

\[\widehat{\operatorname{Var}}_{\mathrm{KR}}\bigl[g(\hat\beta)\bigr] = \frac{1}{S-1}\sum_{s=1}^S \bigl(g(\beta_s) - \bar g\bigr)\!\bigl(g(\beta_s) - \bar g\bigr)^{\!\top}.\]

The reported point estimate stays the analytic \(g(\hat\beta)\), not \(\bar g\); this matches Stata’s vce(simulation) convention and avoids spurious bias from finite \(S\). Pointwise intervals default to empirical quantiles of the draws and so are not forced to be symmetric — an asymmetric CI for an extreme probability is the practical reason to reach for KR.

Bootstrap (vce="bootstrap"). Refit the model on \(B\) resamples of the rows and read inference off the bootstrap distribution. Three resampling schemes are built in (boot_method=):

  • "pairs"\((y_i, x_i)\) resampled IID (default);

  • "cluster" — whole clusters resampled with replacement, required for robustness to within-cluster correlation;

  • "block" — moving-block resampling for time series, with block_size= chosen to span the dependence horizon.

Failed refits are caught and counted; a RuntimeWarning fires when the failure rate exceeds 5%. The point estimate is again the analytic \(g(\hat\beta)\), not the bootstrap mean.

Simultaneous confidence intervals

For a family of \(m\) margins returned in a single call, ci_method= controls the critical value. With pointwise critical value \(z_{1-\alpha/2}\):

  • "bonferroni" uses \(z_{1-\alpha/(2m)}\) — works with any VCE;

  • "sidak" uses \(z_{(1+(1-\alpha)^{1/m})/2}\), slightly narrower than Bonferroni — also works with any VCE;

  • "sup-t" uses the \((1-\alpha)\) quantile of the maximum standardised absolute deviation across the family, computed from the simulation/bootstrap draw matrix:

    \[c_{\sup\text{-}t}^{\,(1-\alpha)} = Q_{1-\alpha}\!\left( \max_{j=1,\dots,m} \left|\frac{g_j(\beta_s) - g_j(\hat\beta)}{\widehat\sigma_j}\right| \right).\]

Sup-t requires draws (so vce must be "simulation" or "bootstrap") and is typically narrower than Bonferroni / Šidák when the margins in the family are correlated — exactly the case for margins evaluated at neighbouring covariate profiles.

Prediction scales and the chain rule

The scale= kwarg on predict() and dydx() swaps the response-scale map \(f(\eta)\) for an arbitrary \(g(\eta)\). Built-in scales ("linear", "or", "exp", "log", …) and any user-defined Transform provide \(g\), \(g'\), and (for dydx) \(g''\) analytically.

Predicted values. For a single design \(X\), the gradient of \(g(X\beta)\) w.r.t. \(\beta\) is

\[\frac{\partial g(\eta)}{\partial \beta} = g'(\eta) \cdot X.\]

Only \(g'\) is needed; this is why predict(scale=t) accepts Transform instances with hess=None.

AME on the g-scale. For a continuous covariate \(x_k\), the AME on the g-scale is

\[\mathrm{AME}_k^{(g)} = \frac{1}{n}\sum_i g'(\eta_i) \cdot \beta_k.\]

Differentiating w.r.t. \(\beta\):

\[\frac{\partial \mathrm{AME}_k^{(g)}}{\partial \beta_j} = \frac{1}{n}\sum_i \bigl[\,g''(\eta_i) \cdot x_{ij} \cdot \beta_k + g'(\eta_i) \cdot \delta_{jk}\,\bigr].\]

Both terms appear: the first comes from differentiating \(g'(\eta_i)\) through \(\eta_i = x_i^\top\beta\), the second from differentiating the explicit \(\beta_k\). smmargins computes this Jacobian analytically when Transform.hess is provided (every built-in does). Without hess, dydx raises — this is the package’s strict-derivative contract.

Discrete contrasts under a g-scale are simpler: the contrast is \(g(\eta_a) - g(\eta_b)\) averaged over the sample, and its gradient is \(g'(\eta_a)\,x_a - g'(\eta_b)\,x_b\). Only \(g'\) enters; \(g''\) is not used on the discrete path.

Composition with elasticities. Each method= ("eyex", "dyex", "eydx") wraps the g-scale formulas above with the appropriate \(x\) or \(g(\eta)\) factor, e.g.

\[\mathrm{eyex}_k^{(g)} = \mathrm{AME}_k^{(g)} \cdot \frac{\overline{x_k}}{\overline{g(\eta)}}.\]

The Jacobian inherits an extra factor or quotient but no new analytic term, so the same \(g'\) / \(g''\) machinery powers every elasticity × scale combination.

Subgroup AMEs (over=)

over= partitions the sample into \(G\) subgroups and computes the AME within each. Because every subgroup AME is a function of the same \(\hat\beta\), the joint covariance of the \(G\) estimates is not block-diagonal:

\[\begin{split}\widehat{\mathrm{Var}}\bigl(\hat\theta_{1\ldots G}\bigr) = J\,\widehat V\,J^\top, \quad J = \begin{bmatrix} \partial \theta_1 / \partial \beta \\ \vdots \\ \partial \theta_G / \partial \beta \end{bmatrix}.\end{split}\]

Computing each subgroup independently and concatenating would discard the off-diagonal blocks and break Wald tests across subgroups, so smmargins always builds the stacked Jacobian.

Joint Wald and pairwise contrasts

For a result with point estimate \(\hat\theta\) and covariance \(V\), wald() computes

\[W = (C\hat\theta - c_0)^\top (C V C^\top)^{-1} (C\hat\theta - c_0),\]

distributed as \(\chi^2_{\mathrm{rank}(CVC^\top)}\) under \(H_0\). pairwise(by=…) builds \(C\) for all level-vs-level comparisons of a factor and returns a MarginsResult whose ci_method= flows through the simultaneous-CI machinery — including "sup-t" when the underlying result carries draws.

Why the response scale matters for DiD (Ai & Norton 2003)

In a logit, the coefficient on a group:condition interaction is on the log-odds scale. On the probability scale, the difference-in-differences

\[\mathrm{DiD}(x_*) = \bigl[f(\eta_{1,1}) - f(\eta_{1,0})\bigr] - \bigl[f(\eta_{0,1}) - f(\eta_{0,0})\bigr]\]

is a nonlinear function of every parameter and every covariate profile \(x_*\). You cannot read it off the interaction coefficient. did() evaluates the four cells on the response scale, with their joint delta-method covariance, and forms the DiD as a contrast — Ai & Norton’s recommendation in code.

References