What this experiment is
This experiment tests a specific idea: do not ask the AI to invent the order quantity. Ask it to choose how reactive the forecast should be, then let a simple forecasting formula do the ordering math.
Each month, at each tier of a fictional automotive supply chain, the AI selects the exponential smoothing parameter α from {0.1, 0.3, 0.5, 0.7}. Lower α means the forecast moves slowly and filters noise; higher α means it reacts quickly to recent demand. Every AI condition produced an OVAR (order variance divided by demand variance, defined below) under 1.0: orders were less volatile than demand rather than more volatile.
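To make the split concrete, here is a minimal sketch of one tier's monthly math once α has been chosen. The smoothing update is the standard one; the order rule shown (a base-stock policy covering next month's forecast plus one forecast's worth of buffer) is an illustrative assumption, not necessarily the exact rule used in the experiment.

```python
ALPHA_CHOICES = (0.1, 0.3, 0.5, 0.7)   # the AI picks one of these per tier, per month

def smooth_and_order(demand, prev_forecast, net_inventory, alpha):
    """One tier's monthly update once alpha is chosen.

    The order rule (order up to next month's forecast plus one forecast of
    buffer stock) is an assumption for illustration only.
    """
    forecast = alpha * demand + (1 - alpha) * prev_forecast   # exponential smoothing
    target_position = 2 * forecast                            # forecast demand + buffer
    order = max(0.0, target_position - net_inventory)
    return forecast, order
```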
Reader context
The problem is not ordering once. It is ordering repeatedly.
The bullwhip effect is what happens when small changes in customer demand become larger and larger order swings as they travel upstream through a supply chain. A retailer sees a bump in demand, the manufacturer orders extra to avoid shortage, the component supplier sees that larger order and reacts again. By the time the signal reaches the top of the chain, a modest demand change can look like a crisis.
This series studies that failure mode with AI agents. The setting is intentionally narrow: a fictional three-tier Indian automotive chain, monthly demand, one-month lead times, and synthetic seasonal demand with a monsoon slump, a Diwali peak, and a financial year-end surge. The metric is OVAR, or order variance divided by demand variance. OVAR below 1.0 means the replenishment policy is damping volatility. OVAR above 1.0 means it is creating bullwhip amplification.
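OVAR itself is just a variance ratio computed from the simulation logs. A minimal sketch, using NumPy's population variance (the repository may use a different convention):

```python
import numpy as np

def ovar(orders, demand):
    """Order variance divided by demand variance.
    Below 1.0 the policy damps volatility; above 1.0 it amplifies it (bullwhip)."""
    return np.var(orders) / np.var(demand)
```

The tables below report a chain-level OVAR; how the per-tier ratios are aggregated is documented in the repository rather than here, so treat this function as the per-series building block.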
The earlier experiments mostly showed what does not work. Giving the model context, memory, or a safety-stock multiplier did not reliably reduce bullwhip. This sixth experiment in the series changes the control surface: the AI no longer adjusts a buffer on top of the forecast. It chooses the forecast's responsiveness itself.
Experiment Setup
| Setup | Detail |
|---|---|
| What the AI does | Each month, at each tier, the AI chooses how responsive the demand forecast should be. It does not place the order directly. The formula takes the selected α, updates the forecast, and produces the order. |
| Models tested | gpt-4.1-mini · o4-mini · gpt-oss 120B |
| What the AI was told | Three variants: inventory numbers only (blind); numbers plus a seasonal calendar (context); numbers plus its own recent choices (stateful). A fourth set tested whether the context-driven tendency to over-react could be corrected by restricting options or giving explicit guidance. |
| How many runs | 10 runs per AI condition (5 for the debiased follow-up conditions) to account for response variability. Fixed baselines run once; they are deterministic. |
| Demand used | 25 months of synthetic demand shaped around Indian automotive patterns — a monsoon slump, a Diwali peak, and a financial year-end surge. |
| Supply chain | A fictional 3-tier chain: OEM (Tatva Motors) → component supplier → sub-supplier. Orders placed monthly, lead time fixed at one month across all tiers. |
| Benchmarks | Pure algorithmic baselines with no AI: exponential smoothing at fixed α settings of 0.1, 0.3, and 0.5. The target to match is α=0.3, which produced the best balance of stability and service level in this setup. |
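The chain in the table is straightforward to simulate: each tier's order becomes the demand seen by the tier above it, and material arrives one month after it is ordered. A sketch of that propagation, reusing the `smooth_and_order` helper from the earlier sketch; upstream supply is assumed unconstrained here, another simplification.

```python
def run_chain(customer_demand, alphas, n_tiers=3, init_forecast=100.0, init_stock=200.0):
    """alphas[month][tier] is the smoothing parameter chosen for that tier in that month,
    either fixed (baselines) or selected by the AI (adaptive conditions)."""
    forecast = [init_forecast] * n_tiers
    stock = [init_stock] * n_tiers
    in_transit = [0.0] * n_tiers                      # last month's order, arriving now (1-month lead)
    orders_log = [[] for _ in range(n_tiers)]

    for month, demand in enumerate(customer_demand):
        signal = demand                               # the OEM sees end-customer demand
        for tier in range(n_tiers):
            stock[tier] += in_transit[tier]           # receive what was ordered last month
            stock[tier] -= signal                     # ship this month's demand (negative = backlog)
            forecast[tier], order = smooth_and_order(
                signal, forecast[tier], stock[tier], alphas[month][tier]
            )
            in_transit[tier] = order
            orders_log[tier].append(order)
            signal = order                            # this tier's order is the next tier's demand
    return orders_log                                 # feed into ovar() per tier, count stockouts, etc.
```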
Key findings
What I found
For the first time in this program, every AI condition produced OVAR below 1.0. In the first five experiments, AI-driven configurations amplified order variance by 30% to 1,200%. In this experiment, even the weakest AI result is 0.741 — still damping.
- The best AI condition matches the fixed baseline, but only marginally. gpt-oss 120B in blind conditions reached OVAR 0.535 with 4.4 stockouts. The fixed α=0.3 baseline is 0.545 with 5 stockouts. The difference is inside the confidence interval, so the strong claim is not that AI defeated the algorithm. The strong claim is that the architecture stopped the AI from making the bullwhip problem worse.
- Context made the models more reactive. Seasonal information sounds useful, and the models used it in a plausible way: they chose more responsive forecasts. But in this environment that extra responsiveness added variance. The debiased conditions recovered much of the loss: oss120b_ctx_computed reached OVAR 0.585 against oss120b_context at 0.739.
- The larger model's advantage looks like a prior, not magic reasoning. In blind conditions, gpt-oss 120B gravitated toward α=0.3, the standard exponential smoothing value. Smaller models selected α=0.7 more often, adding variance. Once context was added, the model-size advantage disappeared.
Results
Numeric results
Read these tables with one rule in mind: lower OVAR means less order volatility. A value below 1.0 means the ordering policy is absorbing volatility rather than transmitting it upstream. The fixed α=0.3 baseline is the reference point because it gives the best balance of variance reduction and stockout count among the deterministic baselines.
Fixed baselines (no AI)
| Condition | Chain OVAR | Stockouts |
|---|---|---|
| exp_smooth_0.1 | 0.620 | 16 |
| exp_smooth_0.3 (target) | 0.545 | 5 |
| exp_smooth_0.5 | 0.729 | 3 |
AI adaptive conditions (n=10 each)
| Model | Condition | Chain OVAR | Stockouts (mean) |
|---|---|---|---|
| gpt-oss 120B | BLIND | 0.535 | 4.4 |
| gpt-4.1-mini | BLIND | 0.597 | 7.3 |
| o4-mini | BLIND | 0.657 | 4.0 |
| gpt-oss 120B | STATEFUL | 0.684 | 6.2 |
| gpt-4.1-mini | STATEFUL | 0.695 | 8.8 |
| o4-mini | STATEFUL | 0.705 | 6.5 |
| gpt-4.1-mini | CONTEXT | 0.715 | 10.4 |
| gpt-oss 120B | CONTEXT | 0.739 | 9.3 |
| o4-mini | CONTEXT | 0.741 | 10.4 |
Every value above is below 1.0. The top row (gpt-oss 120B in the blind condition) is the first AI result in the research program to match the fixed optimal baseline within uncertainty.
Debiased context conditions (n=5 each)
| Condition | Chain OVAR | Stockouts (mean) |
|---|---|---|
| oss120b_ctx_computed | 0.585 | 5.6 |
| oss120b_ctx_debiased | 0.597 | 6.6 |
| mini_ctx_debiased | 0.596 | 7.4 |
| mini_ctx_computed | 0.679 | 8.4 |
Full run-level data, standard deviations, and α distributions are in the GitHub repository.
Mechanism
Why this control lever matters
Earlier architectures gave the AI a safety stock multiplier: a buffer held above the forecast. That sounds operationally meaningful, but it is the wrong lever for bullwhip. Safety stock changes how much inventory the system wants to hold. It does not directly control how sharply the forecast reacts to each new demand observation.
The smoothing parameter α does. If α is low, the forecast moves slowly and ignores some noise. If α is high, the forecast chases recent demand. That is why this experiment is different: the AI is finally acting on the part of the formula that directly governs order variance.
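The relationship is easy to see numerically. For noisy demand, the steady-state variance of an exponentially smoothed forecast is roughly α/(2 - α) times the demand variance, so any order rule built on the forecast inherits less volatility at low α. A small self-contained illustration (synthetic numbers, not the experiment's demand series):

```python
import numpy as np

rng = np.random.default_rng(0)
demand = 100 + 15 * rng.standard_normal(500)        # synthetic noisy demand, illustration only

def smoothed(series, alpha):
    forecast, out = series[0], []
    for d in series:
        forecast = alpha * d + (1 - alpha) * forecast
        out.append(forecast)
    return np.array(out)

for alpha in (0.1, 0.3, 0.7):
    ratio = np.var(smoothed(demand, alpha)) / np.var(demand)
    print(f"alpha={alpha}: forecast variance / demand variance = {ratio:.2f}")
# Lower alpha -> calmer forecast -> calmer orders for any rule built on top of it.
```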
The implication is uncomfortable but useful. Better context is not automatically better control. When models saw seasonal context, they reacted as if the coming season justified faster movement. The direction was understandable; the calibration was too aggressive. Restricting the option set or explicitly correcting for alpha inflation recovered most of the blind condition’s advantage.
Implications
The lesson is architectural
This is not a story about a language model suddenly becoming a supply-chain optimizer. The deterministic α=0.3 baseline remains essentially tied with the best AI result. The real result is that the AI stopped causing amplification once its action space was aligned with the metric that mattered.
That matters for industrial agent design. If an AI system is placed outside the control loop and asked to produce operational quantities directly, it can sound reasonable while injecting volatility. If it is placed inside a narrow, calibrated control surface, the same model can become useful because the surrounding system constrains what its judgment can change.
In practical terms: do not start by asking an AI agent how much to order. Start by asking which parameter in a known control policy should move, how far it is allowed to move, and what failure mode that parameter actually controls.
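In harness terms, that means the model never emits an order, only a whitelisted parameter, and anything outside the whitelist falls back to the known-good default. A hypothetical sketch (function and variable names are mine, not the repository's):

```python
ALLOWED_ALPHAS = (0.1, 0.3, 0.5, 0.7)
DEFAULT_ALPHA = 0.3                     # the deterministic setting that already works

def choose_alpha(model_reply: str) -> float:
    """Turn a free-text model reply into a whitelisted smoothing parameter.
    The model's judgment can move alpha, but it can never move the order directly."""
    try:
        value = float(model_reply.strip())
    except ValueError:
        return DEFAULT_ALPHA            # unparseable reply: fall back to the baseline
    # Snap to the nearest allowed value instead of trusting an arbitrary number.
    return min(ALLOWED_ALPHAS, key=lambda a: abs(a - value))
```

The point is not this particular parser; it is that the action space is the parameter grid, so the worst the model can do is pick a bad α, not invent a bad order.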
Full code and results on GitHub
Full code, data, and raw results are available in the GitHub repository.
Methodology note
All scenarios, companies, and products are fictional. The demand series is synthetic, calibrated to Indian automotive seasonal patterns across 25 months. Lead times are deterministic to isolate the architecture change from the stochastic supply-side effects tested in earlier experiments. Results apply to a 3-tier exponential smoothing architecture where the AI selects α ∈ {0.1, 0.3, 0.5, 0.7} per period; do not generalise beyond this scope. The best AI result matches the fixed baseline only within uncertainty, so the headline finding is variance damping across all AI conditions, not a decisive win over deterministic forecasting.