Formula Mode vs. Raw Exog Mode¶

Two Paths to the Same Destination¶

If you come from Stata, you are accustomed to a single path: you load a dataset, specify variables, and Stata handles the rest. There is no distinction between “formula mode” and “raw matrix mode” because Stata has no equivalent of a design matrix abstraction. Python does, and the choice between formula mode and raw exog mode is one of the most consequential decisions in smmargins.

If you use statsmodels' get_margeff() (a method on fitted discrete-model results), you are always in raw mode. The method operates on the fitted model's exog matrix with no knowledge of how that matrix was constructed from the original data. This is a fundamental limitation that smmargins resolves through formula mode.

Formula Mode: Preserving the Full Data Pipeline¶

When you fit a statsmodels model with a formula string:

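import statsmodels.formula.api as smf
# df is a pandas DataFrame holding the columns y, x1, x2 and group
model = smf.logit("y ~ x1 + I(x1**2) + x1:x2 + C(group)", data=df)
result = model.fit()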

statsmodels (via patsy) stores the DesignInfo object — a complete recipe for transforming the data frame into the design matrix. This recipe includes:

Which columns to extract from the data frame
What transformations to apply (polynomials, interactions, splines)
What contrast coding to use for categorical variables
The order and names of the resulting design matrix columns

Formula mode in smmargins uses this DesignInfo object to rebuild the design matrix whenever a covariate is perturbed. The process:

\[\tilde{X} = \text{DesignInfo.build}(\text{data with } x_j \text{ perturbed})\]

What this means in code: smmargins(result) detects that result was fit with a formula, extracts the DesignInfo, and uses it for all subsequent predictions and marginal effects. When computing the marginal effect of x1, it modifies x1 in the data frame and lets patsy regenerate x1, I(x1**2), and x1:x2 consistently.
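The rebuild step can be illustrated without any library. The sketch below is not smmargins or patsy code: `build_row` is a hypothetical stand-in for what the stored recipe does for the terms x1, I(x1**2) and x1:x2 (the intercept and C(group) are left out to keep it short). It only shows that perturbing the data and rebuilding moves every derived column together.

```python
# Hypothetical stand-in for the DesignInfo recipe, covering only
# x1, I(x1**2) and x1:x2. Not patsy or smmargins code.
def build_row(x1, x2):
    """Return the design-matrix columns derived from one observation."""
    return {"x1": x1, "I(x1**2)": x1 ** 2, "x1:x2": x1 * x2}

delta = 0.5
base = build_row(x1=2.0, x2=3.0)

# Formula-mode idea: perturb the *data*, then rebuild every column from it.
perturbed = build_row(x1=2.0 + delta, x2=3.0)

# Every column that depends on x1 has moved consistently.
changed = [name for name in base if perturbed[name] != base[name]]
print(changed)
```

All three columns appear in `changed`, which is the consistency that formula mode relies on.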

Raw Mode: Working with the Design Matrix Directly¶

When you fit a model with an explicit design matrix:

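import statsmodels.api as sm

# Hand-built columns; note that sm.Logit does not add an intercept on its own
X = df[["x1", "x1_sq", "x1_x2", "group_B", "group_C"]]
model = sm.Logit(df["y"], X)
result = model.fit()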

there is no DesignInfo. The mapping from data columns to design matrix columns is lost. smmargins operates in raw mode, perturbing design matrix columns directly:

\[\tilde{X} = X + \Delta \cdot e_j\]

where \(e_j\) is the unit vector for the \(j\)-th design matrix column.

What this means in code: smmargins(result, exog=X) operates on the matrix directly. When computing the marginal effect corresponding to the x1 column, it adds a small perturbation to that column only. The x1_sq and x1_x2 columns remain frozen at their original values.
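A minimal, library-free sketch of the raw perturbation \(X + \Delta \cdot e_j\) follows. The helper `perturb_column` and the small matrix are illustrative assumptions, not part of smmargins; the point is only that the derived column goes stale once x1 moves on its own.

```python
def perturb_column(X, j, delta):
    """Return a copy of X with delta added to column j only (X + delta * e_j)."""
    return [[v + delta if k == j else v for k, v in enumerate(row)] for row in X]

columns = ["x1", "x1_sq", "x1_x2"]
# Two illustrative observations; x1_sq equals x1**2 before the perturbation.
X = [[2.0, 4.0, 6.0],
     [3.0, 9.0, 3.0]]

X_tilde = perturb_column(X, columns.index("x1"), 0.5)

# After the raw perturbation, x1_sq is no longer the square of x1.
stale = [row[1] != row[0] ** 2 for row in X_tilde]
print(X_tilde, stale)
```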

Where Raw Mode Fails¶

Raw mode produces incorrect marginal effects whenever the design matrix columns are functions of each other. Consider the polynomial example:

\[y = \beta_0 + \beta_1 x + \beta_2 x^2 + \varepsilon\]

The true marginal effect of \(x\) is \(\beta_1 + 2\beta_2 x\). In formula mode, perturbing \(x\) rebuilds both the \(x\) and \(x^2\) columns, and the marginal effect captures both terms. In raw mode, perturbing only the \(x\) column gives \(\beta_1\), missing the \(2\beta_2 x\) contribution entirely.

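The size of the error depends on \(\beta_2\) and \(x\). Raw mode misstates the marginal effect itself by the omitted term, and (to first order in \(\Delta\)) misstates the predicted change from a perturbation of size \(\Delta\) by that term times \(\Delta\):

\[\text{Error}_{\text{raw}} = 2\beta_2 x, \qquad \text{Error}_{\text{raw}} \cdot \Delta = 2\beta_2 x \cdot \Delta\]

What this means in code: for a model with \(x^2\) where \(\beta_2 = 0.5\) and observations have \(x \approx 5\), the raw-mode marginal effect is off by approximately \(5.0\), and the predicted change for a step \(\Delta\) is off by approximately \(5.0 \cdot \Delta\). The bias grows with \(|x|\).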
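The numbers can be checked with a small finite-difference sketch. It uses the linear polynomial model from this section with \(\beta_2 = 0.5\) and \(x = 5\); the values of \(\beta_0\) and \(\beta_1\) are assumptions chosen for illustration, and none of this is smmargins code.

```python
beta0, beta1, beta2 = 0.0, 1.0, 0.5   # beta0 and beta1 are assumed values
x, delta = 5.0, 1e-6

def predict(x_col, x_sq_col):
    """Linear predictor from the two design-matrix columns."""
    return beta0 + beta1 * x_col + beta2 * x_sq_col

# Formula-style: rebuild both columns from the perturbed x (central difference).
me_formula = (predict(x + delta, (x + delta) ** 2)
              - predict(x - delta, (x - delta) ** 2)) / (2 * delta)

# Raw-style: perturb only the x column; the x**2 column stays frozen.
me_raw = (predict(x + delta, x ** 2) - predict(x - delta, x ** 2)) / (2 * delta)

gap = me_formula - me_raw     # error in the marginal effect, about 2*beta2*x
change_gap = gap * delta      # error in the predicted change for a step delta
print(me_formula, me_raw, gap, change_gap)
```

The formula-style difference recovers \(\beta_1 + 2\beta_2 x\), while the raw-style difference recovers only \(\beta_1\).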

Interaction Terms¶

For a model with \(y \sim x_1 \cdot x_2\), the marginal effect of \(x_1\) is:

\[\frac{\partial E[y]}{\partial x_1} = \beta_1 + \beta_{12} x_2\]

In raw mode, if x1_x2 is a manually constructed column, perturbing x1 does not update x1_x2. The \(\beta_{12} x_2\) term is omitted, so the marginal effect is off by \(\beta_{12} x_2\): understated when that term is positive and overstated when it is negative.
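A short worked example, with hypothetical coefficients, shows how the omitted term behaves across values of \(x_2\). The omitted term carries the sign of \(\beta_{12} x_2\), so raw mode can land on either side of the full effect.

```python
beta1, beta12 = 2.0, -0.4   # hypothetical coefficients for illustration

def me_formula(x2):
    """Full marginal effect of x1: beta1 + beta12 * x2."""
    return beta1 + beta12 * x2

def me_raw(x2):
    """Raw-mode result: the x1_x2 column is frozen, so only beta1 survives."""
    return beta1

# Gap between the full and raw-mode marginal effects at a few values of x2.
gaps = {x2: me_formula(x2) - me_raw(x2) for x2 in (-3.0, 0.0, 3.0)}
for x2, gap in gaps.items():
    print(f"x2={x2:+.1f}  full={me_formula(x2):.2f}  raw={me_raw(x2):.2f}  gap={gap:+.2f}")
```

The gap vanishes only at \(x_2 = 0\); elsewhere it scales linearly with \(x_2\).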

Splines¶

For \(y \sim \text{bs}(x, \text{df}=5)\), patsy generates 5 B-spline basis columns that are all nontrivial functions of \(x\). Perturbing one basis column while holding the others fixed has no meaningful interpretation. The marginal effect requires re-evaluating all 5 basis functions at \(x + \Delta\).
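The same point can be sketched with a simpler basis. The code below swaps in five linear "hat" functions (degree-1 B-splines on unit-spaced knots) as a stand-in for patsy's bs(), which uses cubic splines by default; the knots and coefficients are hypothetical. It compares re-evaluating the whole basis at a perturbed \(x\) with nudging a single basis column.

```python
KNOTS = [0.0, 1.0, 2.0, 3.0, 4.0]
COEFS = [0.0, 1.0, 3.0, 2.0, 5.0]   # hypothetical fitted coefficients

def basis(x):
    """Hat-function basis: one column per knot, all functions of the same x."""
    return [max(0.0, 1.0 - abs(x - k)) for k in KNOTS]

def f(x):
    """Fitted curve: coefficient-weighted sum of the basis columns."""
    return sum(c * b for c, b in zip(COEFS, basis(x)))

x, delta = 1.5, 1e-4

# Formula-style: re-evaluate every basis function at the perturbed x.
slope = (f(x + delta) - f(x - delta)) / (2 * delta)

# Raw-style: nudge one basis column and leave the others frozen.
row = basis(x)
raw_row = row.copy()
raw_row[1] += delta
raw_effect = (sum(c * b for c, b in zip(COEFS, raw_row)) - f(x)) / delta

print(slope, raw_effect, sum(row), sum(raw_row))
```

For this hat basis the columns sum to one at any \(x\) inside the knot range, and the nudged row breaks that property: it no longer corresponds to any value of \(x\), so the quantity it produces is just one coefficient rather than a slope.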

Where Raw Mode Is Correct¶

Raw mode is correct — and often preferred — when:

The design matrix columns are independent. Each column represents a distinct variable with no functional relationship to the others. A model like y ~ x1 + x2 + x3 fit with raw exog has no hidden dependencies.
You do not have the original data frame. If you are working with published results (a coefficient vector and covariance matrix) or were handed only a design matrix, there is nothing to rebuild from, so raw mode is the only option.
You are conducting numerical experiments. If you are manually constructing design matrices to test edge cases, raw mode gives you full control.

⚠️ Trade-off: Raw mode is faster (no patsy rebuilding) and works without the original data frame, but it requires you to guarantee that the design matrix columns are independently perturbable. Formula mode is slower but stays correct for any model structure the formula describes, because every derived column is regenerated from the perturbed data.

The Practical Recommendation¶

Situation	Recommended Mode	Reason
Model fit with `smf.*` formula	Formula	DesignInfo available, all transformations tracked
Interactions in model	Formula	Raw mode cannot update interaction columns
Polynomial/spline terms	Formula	Raw mode cannot re-derive transformed columns
Categorical variables with contrasts	Formula	Raw mode cannot reconstruct contrast coding
Simple linear model, no transformations	Either	Columns are independent; raw mode is fine
Working with published results only	Raw	No data frame or formula available
Performance-critical applications	Raw	Avoids patsy rebuild overhead

The rule of thumb: If your model formula contains anything beyond + — that is, if it uses *, :, I(), C(), bs(), cr(), or any other patsy transform — use formula mode. The cost of being wrong (silent incorrect marginal effects) far exceeds the cost of patsy rebuilding.
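The rule of thumb can be turned into a rough pre-flight check. The helper below, `needs_formula_mode`, is a hypothetical illustration and not part of smmargins or patsy: it treats a right-hand side as safe for raw mode only when it consists of plain names (or 0/1) joined by +, and flags everything else conservatively.

```python
import re

# A "plain" term is a bare variable name or the intercept markers 0 / 1.
_PLAIN_TERM = r"(?:[A-Za-z_]\w*|[01])"
_PLAIN_RHS = re.compile(rf"\s*{_PLAIN_TERM}(?:\s*\+\s*{_PLAIN_TERM})*\s*")

def needs_formula_mode(formula: str) -> bool:
    """Return True if the right-hand side uses anything beyond '+'.

    Heuristic only: it does not parse the formula, it just checks whether
    the right-hand side is a plain sum of names.
    """
    rhs = formula.split("~", 1)[-1]
    return _PLAIN_RHS.fullmatch(rhs) is None

for fml in ("y ~ x1 + x2 + x3",
            "y ~ x1 + I(x1**2) + x1:x2 + C(group)",
            "y ~ x1 * x2",
            "y ~ bs(x, df=5)"):
    print(f"{fml!r}: formula mode needed = {needs_formula_mode(fml)}")
```

Because the check is deliberately strict, it will also flag harmless constructs such as `- 1`; erring toward formula mode matches the cost asymmetry described above.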

What Happens at the API Boundary¶

smmargins auto-detects the mode from the fitted results object:

If result.model.data.design_info exists (set by patsy during formula fitting), formula mode is used.
If not, raw mode is used with the exog matrix stored in result.model.exog.
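The detection rule above can be sketched as follows. This is an assumption about how such a check might look, not the smmargins source: `detect_mode` is a hypothetical name, and the two stand-in objects merely mimic the attribute layout of a fitted result so the snippet runs without statsmodels.

```python
from types import SimpleNamespace

def detect_mode(result):
    """Return "formula" if a design_info is attached to the model data, else "raw".

    A missing attribute and an attribute set to None are both treated as raw mode.
    """
    data = getattr(result.model, "data", None)
    design_info = getattr(data, "design_info", None)
    return "formula" if design_info is not None else "raw"

# Stand-ins that mimic result.model.data.design_info and result.model.exog.
formula_fit = SimpleNamespace(
    model=SimpleNamespace(data=SimpleNamespace(design_info=object()), exog=[[1.0]])
)
raw_fit = SimpleNamespace(
    model=SimpleNamespace(data=SimpleNamespace(), exog=[[1.0]])
)

print(detect_mode(formula_fit), detect_mode(raw_fit))
```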

You can override the design used at evaluation time by passing newdata= explicitly to predict() or dydx().

What this means in code: in most cases, you do not need to think about mode selection. Fit your model with a formula, pass the result to smmargins, and the correct mode is chosen automatically. Mode selection only becomes a conscious decision when (1) you fit without a formula, or (2) you are passing a custom exog to a predictions/margins call.
