How to choose between formula mode and raw exog mode

Prerequisites

Problem statement

You need to decide whether to fit your model with a formula (e.g., smf.logit("y ~ x1 + x2", data=df).fit()) or with raw matrices (e.g., sm.Logit(y, X).fit()). The choice determines whether interactions, polynomials, and other transformations are propagated correctly when Margins perturbs a variable to compute marginal effects.

Minimal working solution

Raw mode (limited)

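import statsmodels.api as sm

# Assumes Margins is already imported and df is a pandas DataFrame
# with columns y, x1, and x2.

# Raw mode: build the design matrix manually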
X = sm.add_constant(df[["x1", "x2"]])
y = df["y"]
fit_raw = sm.Logit(y, X).fit(disp=False)
M_raw = Margins(fit_raw)

# Works for simple additive models
print(M_raw.dydx("x1"))

Variations

Raw mode with manually created interaction (incorrect)

# Raw mode with an interaction column — MARGINS WILL BE WRONG
X_bad = sm.add_constant(df[["x1", "x2"]].copy())
X_bad["x1_x2"] = X_bad["x1"] * X_bad["x2"]  # manually created interaction

fit_bad = sm.Logit(y, X_bad).fit(disp=False)
M_bad = Margins(fit_bad)

# This perturbs only the "x1" column, NOT "x1_x2"
# The marginal effect of x1 will be incorrect
print(M_bad.dydx("x1"))  # WARNING: interaction not updated!
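
To see the size of the error, the sketch below reproduces both perturbation strategies by hand for a single observation. It uses only the standard library. The coefficient values and the helper names (predict_from_columns, predict_from_variables) are made up for illustration and are not part of Margins.

```python
import math

# Hypothetical logit coefficients: const, x1, x2, x1_x2 (illustrative values only)
B0, B1, B2, B3 = -0.5, 0.8, 0.3, 1.2

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_from_columns(x1, x2, x1_x2):
    """Raw-mode view: x1_x2 is just another column, unrelated to x1."""
    return sigmoid(B0 + B1 * x1 + B2 * x2 + B3 * x1_x2)

def predict_from_variables(x1, x2):
    """Formula-mode view: the interaction is rebuilt from x1 and x2."""
    return predict_from_columns(x1, x2, x1 * x2)

x1, x2, eps = 0.4, 1.5, 1e-6

# Raw-style perturbation: bump x1, leave the stored x1_x2 column untouched
stale = x1 * x2
raw_effect = (predict_from_columns(x1 + eps, x2, stale)
              - predict_from_columns(x1 - eps, x2, stale)) / (2 * eps)

# Rebuild-style perturbation: bump x1 and recompute the interaction
full_effect = (predict_from_variables(x1 + eps, x2)
               - predict_from_variables(x1 - eps, x2)) / (2 * eps)

print(f"x1 column only:      {raw_effect:.4f}")
print(f"interaction rebuilt: {full_effect:.4f}")
```

The first number is proportional to B1 alone, the second to B1 + B3 * x2, so the two agree only when x2 or the interaction coefficient is zero.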

Checking which mode is active

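# Inspect the _raw_mode flag on each Margins object.
# M_formula is built from a formula fit so the two modes can be compared.
import statsmodels.formula.api as smf
M_formula = Margins(smf.logit("y ~ x1 + x2", data=df).fit(disp=False))
print(f"Formula mode: {not M_formula._raw_mode}")  # True (formula)
print(f"Raw mode:     {M_raw._raw_mode}")           # True (raw)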

Forcing raw mode with explicit data

# If model.data.frame is missing, pass data explicitly
X_extra = sm.add_constant(df[["x1", "x2"]])
fit_extra = sm.Logit(y, X_extra).fit(disp=False)
M_extra = Margins(fit_extra, data=df[["x1", "x2"]])
print(M_extra.dydx("x1"))

⚠️ Trade-off: Formula mode uses patsy’s DesignInfo to rebuild the design matrix whenever a variable is perturbed, so it correctly handles I(x**2), x1:x2, C(group), splines such as bs(x, df=4), and other patsy transforms. Raw mode perturbs only the literal column with the matching name; interactions and transformations are not updated automatically. In exchange, raw mode is faster and uses less memory for simple additive models without interactions.
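
The rebuild step can be sketched with patsy directly. This illustrates the mechanism only and does not show Margins internals. The formula x1 + x2 + x1:x2 + I(x1**2), the data values, and the perturbation size are assumptions chosen for the example.

```python
import numpy as np
from patsy import dmatrix, build_design_matrices

# Made-up data for illustration
data = {"x1": np.array([0.5, 1.0, 2.0]), "x2": np.array([1.0, -1.0, 3.0])}

# Build the design matrix once and keep its DesignInfo
base = dmatrix("x1 + x2 + x1:x2 + I(x1**2)", data)
info = base.design_info

# Perturb the raw variable, then rebuild every derived column from DesignInfo
eps = 0.1
bumped = {"x1": data["x1"] + eps, "x2": data["x2"]}
(rebuilt,) = build_design_matrices([info], bumped)

# Print each rebuilt column next to its patsy column name
cols = info.column_names
rebuilt = np.asarray(rebuilt)
for i, name in enumerate(cols):
    print(name, rebuilt[:, i])
```

Because the interaction and squared columns are regenerated from the bumped x1, they stay consistent with it, which a perturbation of the single x1 column cannot guarantee.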

When to use formula mode

Use formula mode whenever your model includes interactions (x1:x2), polynomial terms (I(x**2)), categorical variables (C(group)), spline terms (bs(x)), or any other patsy transform. Formula mode is the default recommendation for all but the simplest models.

When to use raw mode

Use raw mode when you have a simple additive model with no interactions or transformations and you want to avoid the patsy overhead. Raw mode is also necessary when working with models that do not support formulas (for example, some custom statsmodels subclasses).

When NOT to use raw mode

⚠️ Trade-off: Do not use raw mode when your design matrix contains manually created interaction columns, polynomial columns, or any derived feature that depends on the variable being perturbed. The marginal effects will be wrong because Margins does not know about the relationship between columns. If you must use raw mode with interactions, compute the marginal effect by hand or switch to a formula.

See also