Analytic vs. Finite-Difference Jacobians

The Problem: Building the \(G\) Matrix

The delta method requires the Jacobian matrix \(G = \partial g / \partial \beta\). For marginal effects, this is an outer Jacobian: the derivative of a scalar summary statistic with respect to the \(p\) model parameters. There are two ways to compute it: analytically (symbolic derivatives) and by finite differences (numerical differentiation).

If you come from Stata, you have only ever used finite-difference Jacobians. Stata’s margins computes all standard errors by numeric differentiation — there is no analytic path. smmargins offers both, and the choice matters for speed and numerical stability.


The Analytic Path: Symbolic Derivatives of the Mean Function

For generalized linear models (GLMs), the mean function \(f(\eta)\) and its derivative \(f'(\eta)\) are known in closed form. The statsmodels family object exposes these through link.inverse and link.inverse_deriv.

For a logit model, the mean function and its derivative are:

\[f(\eta) = \Lambda(\eta) = \frac{1}{1 + e^{-\eta}}\]
\[f'(\eta) = \Lambda(\eta) \cdot (1 - \Lambda(\eta))\]

What this means in code: when you call dydx() on a GLM or GLM-like model (Logit, Probit, Poisson, NegativeBinomial), smmargins calls family.link.inverse_deriv(eta) to evaluate \(f'(\eta)\) at the linear predictor for every observation. This is a vectorized operation with no additional model evaluations.

For the second derivative (needed for marginal effects, since the marginal effect itself is \(f'(\eta) \cdot \beta_j\)), the logit case is:

\[f''(\eta) = \Lambda(\eta) \cdot (1 - \Lambda(\eta)) \cdot (1 - 2\Lambda(\eta))\]

What this means in code: the outer Jacobian row for covariate \(j\) is computed as the mean of \(f''(x_i'\hat{\beta}) \cdot \hat{\beta}_j \cdot x_i\) across observations. This is one pass over the data after \(\hat{\beta}\) is known.


Special Cases in the Analytic Path

MNLogit: The Softmax Jacobian

For multinomial logit with \(K\) outcome categories, the mean function is the softmax:

\[P(y = k \mid x) = \frac{e^{x'\beta_k}}{\sum_{j=1}^{K} e^{x'\beta_j}}\]

The Jacobian of this \(K\)-dimensional output with respect to the stacked parameter vector requires the softmax derivative matrix. smmargins implements this directly using the known closed form:

\[\frac{\partial P_k}{\partial \beta_m} = P_k \left( \mathbb{1}_{k=m} - P_m \right) x\]

What this means in code: multinomial logit gets a fully analytic path, with no finite differencing even for cross-category marginal effects.


The Finite-Difference Fallback

When an analytic derivative is not available, smmargins falls back to finite differences. This triggers in three situations:

  1. Non-GLM models with no closed-form mean function derivative (e.g., custom models).

  2. Offset or exposure terms present, which compose nonlinearly with the link function.

  3. OrderedModel (ordered probit/logit), where the cumulative probability structure makes analytic Jacobians substantially more complex.

The finite-difference step size follows the cube-root rule:

\[h = \varepsilon^{1/3} \cdot \max(|x|, 1)\]

where \(\varepsilon\) is machine epsilon (approximately \(2.2 \times 10^{-16}\) for float64), giving \(h \approx 10^{-5}\).

What this means in code: for each parameter \(\beta_j\), smmargins perturbs \(\beta_j\) by \(\pm h\), re-evaluates the statistic \(g(\beta)\), and computes:

\[\frac{\partial g}{\partial \beta_j} \approx \frac{g(\beta + h \cdot e_j) - g(\beta - h \cdot e_j)}{2h}\]

This requires \(2p\) statistic evaluations for a central difference, or \(p\) evaluations for a forward difference.


The Speed Difference

⚠️ Trade-off: Analytic Jacobians eliminate \(p\) forward predict() calls per statistic, but they require that the mean function and its derivatives be correctly implemented for every model family. Finite differences are slower but universal — they work for any model that can make predictions.

Consider a logit model with \(p = 50\) parameters and \(m = 10\) covariates of interest:

  • Analytic path: one pass over the data to compute \(f''(\eta) \cdot x\), then \(m \times p\) multiplications. Negligible overhead.

  • Finite-difference path: for each of \(m\) statistics, perturb each of \(p\) parameters and re-predict. This is \(m \times p = 500\) additional model evaluations.

The analytic path is typically 100-1000x faster for moderate-sized models.

⚠️ Trade-off: Finite differences can suffer from truncation error (if \(h\) is too large) or roundoff error (if \(h\) is too small). The cube-root rule balances these, but for parameters near zero or with extreme scale differences, the optimal \(h\) may vary. Analytic derivatives avoid this entirely.


Stata’s Approach vs. smmargins

Stata’s margins uses finite differences for everything. This is a defensible design choice — it means margins works with any model that Stata can fit, including user-written estimators, with no family-specific code. The cost is speed.

smmargins takes the opposite default: use analytic derivatives whenever possible, fall back to finite differences only when necessary. For the vast majority of applied work (logit, probit, Poisson, OLS), this means delta-method standard errors are effectively instantaneous.