{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": "# Tutorial 6: Counterfactual Predictions and Plotting\n\nIn previous tutorials we computed predictions and marginal effects at specific covariate profiles using `atexog`. This tutorial introduces two additional tools:\n\n- The `values=` DSL for specifying counterfactual scenarios with reducers, callables, and `Expr` transforms.\n- Built-in plotting functions (`plot_predictions`, `plot_slopes`, `plot_comparisons`) that render prediction curves with correct confidence bands.\n\n## What you will learn\n\n- `values=` with percentile reducers (`\"p25\"`, `\"p50\"`, `\"p75\"`)\n- `values=` with `Expr` for mathematical counterfactuals\n- `Margins.contrast` for joint-SE treatment comparisons\n- `plot_predictions`, `plot_slopes`, `plot_comparisons`\n- Simultaneous vs pointwise confidence bands", "id": "97c2680f" }, { "cell_type": "markdown", "metadata": {}, "source": "## Setup\n\nWe continue with the same voter-turnout model from earlier tutorials.", "id": "7ec1ec11" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": "import numpy as np\nimport pandas as pd\nimport matplotlib.pyplot as plt\nimport statsmodels.formula.api as smf\nfrom smmargins import Margins, Expr\nfrom smmargins.plot import plot_predictions, plot_slopes, plot_comparisons\n\nrng = np.random.default_rng(7)\nN = 5_000\ndf = pd.DataFrame({\n \"age\": rng.normal(45, 12, N).clip(18, 90),\n \"income\": rng.lognormal(10.5, 0.4, N),\n \"educ\": rng.choice([\"hs\", \"college\", \"grad\"], N, p=[0.4, 0.4, 0.2]),\n \"female\": rng.integers(0, 2, N),\n})\neta = (-4.0 + 0.05 * df[\"age\"] + 0.00001 * df[\"income\"]\n + 0.8 * (df[\"educ\"] == \"college\") + 1.4 * (df[\"educ\"] == \"grad\")\n + 0.3 * df[\"female\"] - 0.0004 * df[\"age\"] * df[\"female\"])\ndf[\"voted\"] = (rng.uniform(0, 1, N) < 1 / (1 + np.exp(-eta))).astype(int)\n\nfit = smf.logit(\"voted ~ age + income + C(educ) + female + age:female\", data=df).fit(disp=False)\nM = Margins(fit)", "id": "2ecad0a4" }, { "cell_type": "markdown", "metadata": {}, "source": "## Counterfactual predictions with `values=`\n\nThe `values=` parameter is a per-variable DSL. For each key you can supply:\n\n| Kind | Example | Effect |\n|------|---------|--------|\n| Scalar | `{\"age\": 45}` | Fix `age` at 45 for all rows |\n| Sequence | `{\"age\": [30, 45, 60]}` | Cartesian-product grid over three ages |\n| Reducer string | `{\"age\": \"mean\"}` | Fix at the column mean |\n| Percentile | `{\"income\": \"p25\"}` | Fix at the 25th percentile |\n| Callable | `{\"income\": lambda df: df.income * 1.1}` | Row-varying transform |\n| `Expr` | `{\"income\": Expr(\"income * 1.1\")}` | Same via `df.eval` |\n\n`default_values=` controls how unspecified columns are handled (default: `\"asobserved\"`).\n\n### Percentile reducers\n\nPredict at the 25th, 50th, and 75th income percentiles, leaving all other variables as observed:", "id": "096399af" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": "for pct in [25, 50, 75]:\n res = M.predict(values={\"income\": f\"p{pct}\"})\n q_val = float(np.percentile(df[\"income\"], pct))\n print(f\"income p{pct} (${q_val:,.0f}): P(vote) = {res.estimate[0]:.4f} \"\n f\"[{res.ci_lower[0]:.4f}, {res.ci_upper[0]:.4f}]\")", "id": "6b4f68ba" }, { "cell_type": "markdown", "metadata": {}, "source": "To hold all other numeric variables at their mean, use `default_values=\"mean\"`.\n\nA sequence in `values=` creates a Cartesian-product grid \u2014 here, three income grid points with everything else at the mean:", "id": "7809989c" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": "income_grid = [float(np.percentile(df[\"income\"], p)) for p in [25, 50, 75]]\nres_grid = M.predict(values={\"income\": income_grid}, default_values=\"mean\")\nfor lab, est, lo, hi in zip(res_grid.labels, res_grid.estimate, res_grid.ci_lower, res_grid.ci_upper):\n print(f\"{lab}: {est:.4f} [{lo:.4f}, {hi:.4f}]\")", "id": "eba528e7" }, { "cell_type": "markdown", "metadata": {}, "source": "### Mathematical counterfactuals with `Expr`\n\n`Expr` evaluates a pandas expression against the data frame row by row. Here we ask: what if everyone's income were 10% higher?", "id": "cdbad464" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": "res_baseline = M.predict()\nres_10pct = M.predict(values={\"income\": Expr(\"income * 1.10\")})\n\nprint(f\"Baseline: {res_baseline.estimate[0]:.4f}\")\nprint(f\"+10% income: {res_10pct.estimate[0]:.4f}\")\nprint(f\"Increase: {res_10pct.estimate[0] - res_baseline.estimate[0]:+.4f}\")", "id": "845085ea" }, { "cell_type": "markdown", "metadata": {}, "source": "### Joint-SE contrasts with `Margins.contrast`\n\nNa\u00efvely subtracting two predictions loses the off-diagonal covariance between arms (they share the same $\\hat{\\beta}$). `Margins.contrast` computes the PATE with the correct joint SE:", "id": "2f826050" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": "joint = M.contrast(a={\"female\": 1}, b={\"female\": 0})\nprint(joint.summary())\n# Row 3 (A - B) is the gender gap with the correct joint SE.", "id": "c2d9c6f4" }, { "cell_type": "markdown", "metadata": {}, "source": "## Plotting: `plot_predictions`\n\n`plot_predictions` auto-grids the conditioning variable over its observed range (50 points for numerics), computes predictions at each grid point, and renders the curve with a 95% pointwise CI.", "id": "5c9dd9bc" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": "fig, ax = plot_predictions(M, \"age\")\nax.set_title(\"Predicted P(voted) across age\");", "id": "1c413b2a" }, { "cell_type": "markdown", "metadata": {}, "source": "### Faceting with `by=`\n\nPass a categorical column to `by=` for one line per level:", "id": "43e387ff" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": "fig, ax = plot_predictions(M, \"age\", by=\"educ\")\nax.set_title(\"Predicted P(voted) across age, by education\");", "id": "1f559fe4" }, { "cell_type": "markdown", "metadata": {}, "source": "### Simultaneous vs pointwise confidence bands\n\nThe default CI is **pointwise**: each grid point has 95% coverage individually. For claims about the curve as a whole, use a **simultaneous** band. `ci_method=\"sup-t\"` is tightest but requires a draw-based VCE.", "id": "58cc2356" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": "fig, axes = plt.subplots(1, 2, figsize=(11, 4), sharey=True)\n\nplot_predictions(M, \"age\", ax=axes[0])\naxes[0].set_title(\"Pointwise 95% CI\")\n\nplot_predictions(\n M, \"age\",\n vce=\"simulation\", n_sims=2000, sim_seed=0,\n ci_method=\"sup-t\",\n ax=axes[1],\n)\naxes[1].set_title(\"Simultaneous (sup-t) 95% band\")\nfig.tight_layout();", "id": "b9c964b9" }, { "cell_type": "markdown", "metadata": {}, "source": "## Plotting slopes: `plot_slopes`\n\nThe first argument is the variable whose slope (`dy/dx`) you want; the second is the x-axis conditioning variable.", "id": "5abfc6ea" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": "fig, ax = plot_slopes(M, \"age\", \"income\")\nax.set_title(\"Marginal effect of age, across income\");", "id": "e50c3624" }, { "cell_type": "markdown", "metadata": {}, "source": "## Plotting contrasts: `plot_comparisons`\n\n`plot_comparisons` plots a discrete comparison (factor levels or a unit step for numerics) across a conditioning variable. It calls `Margins.contrast` internally \u2014 the CI uses the joint SE, not the na\u00efve sum of individual SEs.", "id": "fffb4833" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": "fig, ax = plot_comparisons(M, \"female\", condition=\"age\")\nax.set_title(\"Gender gap in P(voted), across age\");", "id": "ae402197" }, { "cell_type": "markdown", "metadata": {}, "source": "### Custom grids\n\nPass a dict for `condition` to override the auto-grid:", "id": "7d61b3a9" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": "fig, ax = plot_predictions(M, {\"age\": np.arange(20, 80, 5)})\nax.set_title(\"Predicted P(voted), 5-year age steps\");", "id": "2af97c5c" }, { "cell_type": "markdown", "metadata": {}, "source": "## Recap\n\n| Feature | Call |\n|---------|------|\n| Percentile reducer | `M.predict(values={\"x\": \"p25\"})` |\n| Eval transform | `M.predict(values={\"x\": Expr(\"x * 1.1\")})` |\n| Grid + default mean | `M.predict(values={\"x\": [1, 2, 3]}, default_values=\"mean\")` |\n| Joint-SE contrast | `M.contrast(a={\"treat\": 1}, b={\"treat\": 0})` |\n| Prediction curve | `plot_predictions(M, \"x\")` |\n| Slope curve | `plot_slopes(M, \"x1\", \"x2\")` |\n| Contrast curve | `plot_comparisons(M, \"treat\", condition=\"x\")` |\n| Simultaneous band | `plot_predictions(M, \"x\", vce=\"simulation\", ci_method=\"sup-t\")` |\n\n## See also\n\n- How-to: {doc}`Counterfactual predictions `\n- Explanation: {doc}`Why contrast uses joint covariance `\n- API: `smmargins.Margins`, `smmargins.plot`", "id": "7a780601" } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "name": "python" } }, "nbformat": 4, "nbformat_minor": 5 }