TL;DR
Version 3b ended with a clear diagnosis: LLMs cannot invent precise multipliers. The proposed fix was to remove the invented number entirely — replace it with a pre-approved text label, and let a hard-coded lookup table do the arithmetic. This is V4.
The fix worked as a guardrail. OVAR dropped from the 2.3–3.1 range of V3b to a tight 1.73–1.78 across all four models and all three information conditions. No configuration caused catastrophic amplification. The fix did not work as an improvement. A 120-billion-parameter reasoning model and a lightweight fast model produced outcomes within 0.05 OVAR units of each other. Context doubled classification accuracy and barely moved the supply chain outcome. The Equaliser Effect: a discrete-label architecture creates a structural ceiling that no amount of intelligence, context, or prompt engineering can escape.
Experiment Setup
Design & configuration
| Component | Details |
|---|---|
| Architecture | 5-label intent classification → hard-coded lookup → Order-Up-To formula. The AI’s only job: pick one text label per period per tier. Everything downstream is deterministic software. |
| Labels & lookup | STRONG_INCREASE→1.30 · MOD_INCREASE→1.15 · NEUTRAL→1.00 · MOD_DECREASE→0.90 · STRONG_DECREASE→0.80 |
| Models | gpt-4.1-mini · o4-mini (reasoning) · phi4:14b (local 14B) · nemotron-super 120B (local 120B) |
| Conditions | Blind (numbers only) · Context (calendar + seasonal persona) · Unstructured (context + live event headlines) |
| Replications | 20 runs per frontier condition · 10 runs per local condition · 100 runs per baseline |
| Simulation horizon | 36 months · stochastic lead times (LogNormal) · stochastic fill rates (Beta) |
| World events | Pandemic (months 7–12) · Geopolitical conflict (months 19–21) · Port strike (months 28–30) |
| Supply chain | 3-tier serial: Tatva Motors OEM → Ancillary Supplier → Component Supplier |
| Primary metric | Chain OVAR = Var(orders) / Var(demand). Target: exp_smoothing at OVAR 1.185. |
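The whole decision pipeline fits in a few lines. A minimal sketch, assuming a generic Order-Up-To form — the function shape, the safety-stock term, and exactly where the multiplier enters the formula are illustrative assumptions, not the experiment's actual code:

```python
# Sketch of the V4 architecture: the LLM emits one label, everything
# downstream is deterministic. Where the multiplier enters the Order-Up-To
# target is an assumption for illustration.

LOOKUP = {
    "STRONG_INCREASE": 1.30,
    "MOD_INCREASE": 1.15,
    "NEUTRAL": 1.00,
    "MOD_DECREASE": 0.90,
    "STRONG_DECREASE": 0.80,
}

def order_quantity(label, demand_forecast, lead_time,
                   inventory_position, safety_stock):
    """Hard-coded lookup, then a generic Order-Up-To rule (illustrative)."""
    multiplier = LOOKUP[label]  # the AI's ONLY lever
    target = demand_forecast * (lead_time + 1) + multiplier * safety_stock
    return max(0.0, target - inventory_position)
```

Note what the AI cannot touch: the forecast, the lead time, and the subtraction against inventory position are all outside its control.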
Key findings
- Every AI condition produced OVAR in a 0.05-unit band, regardless of model or information level. gpt-4.1-mini: 1.737–1.771. phi4:14b: 1.726–1.780. o4-mini: 1.748–1.774. nemotron 120B: 1.734–1.775. The entire range across 12 distinct model-condition combinations is narrower than the standard deviation of many individual runs. This is not noise. It is a ceiling.
- Context doubled classification accuracy and did not move the supply chain outcome. Direction accuracy went from 0.41–0.48 in blind conditions to 0.72–0.84 with context. A roughly twofold improvement in situational awareness. Chain OVAR moved by 0.01–0.02. The accuracy gain is real. It is discarded entirely at the lookup table.
- More information made outcomes marginally worse, not better. In every model, the unstructured condition (live event headlines) produced higher OVAR than the context condition. When shown explicit disruption headlines, models over-committed to extreme labels, adding noise to the formula. gpt-4.1-mini context → unstructured: 1.737 → 1.771. phi4 context → unstructured: 1.726 → 1.780.
- Explicitly instructing the AI to default to inaction produced zero effect on OVAR. Sub-experiment E4 added a neutral-prior instruction: “Default to NEUTRAL unless the signal is strong and unambiguous.” Label distributions shifted. OVAR did not. The reason: NEUTRAL maps to multiplier 1.00, which still executes a full Order-Up-To calculation. “Do nothing” in the prompt is not “do nothing” in the formula.
- Removing world events made OVAR worse, not better. The ablation (events disabled) produced OVAR 2.08–2.12, well above the events-on results of 1.73–1.78. Disruption events create correlated demand and order spikes that partially cancel their own contribution to OVAR. A cleaner seasonal signal exposes the underlying formula amplification without that natural cancellation.
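The E4 result is mechanical once you trace NEUTRAL through the formula. A toy single-tier illustration (not the experiment's simulator; demand series, safety stock, and lead time are assumed values) shows that an all-NEUTRAL label stream still produces orders that chase demand, so its OVAR sits above 1:

```python
import numpy as np

# Why "Default to NEUTRAL" cannot move OVAR: NEUTRAL maps to 1.00, and the
# Order-Up-To target is re-computed every period regardless of label.
# All parameters below are illustrative assumptions.

rng = np.random.default_rng(0)
periods = np.arange(36)
demand = 100 + 20 * np.sin(periods / 6) + rng.normal(0, 10, 36)

def simulate_orders(multiplier, lead_time=2, safety_stock=50.0,
                    start_inventory=350.0):
    inventory, orders = start_inventory, []
    for d in demand:
        # target is recalculated each period, even with multiplier 1.00
        target = d * (lead_time + 1) + multiplier * safety_stock
        q = max(0.0, target - inventory)  # order up to the target
        orders.append(q)
        inventory += q - d
    return np.array(orders)

neutral_orders = simulate_orders(1.00)              # all-NEUTRAL stream
ovar = neutral_orders.var() / demand.var()          # OVAR = Var(orders) / Var(demand)
```

The prompt-level "do nothing" never reaches the formula: a 1.00 multiplier still executes the full calculation, and orders amplify demand anyway.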
Results
Baselines
| Policy | Chain OVAR | Stockouts / run |
|---|---|---|
| exp_smoothing | 1.185 | 89.6 |
| order_up_to (formula ceiling) | 1.767 | 87.0 |
| naive_passthrough | 0.996 | 96.0 |
order_up_to is the structural ceiling: every AI agent uses this formula internally. Landing near 1.767 means the AI’s classification is adding nothing on top of the formula alone.
All AI conditions
| Model | Condition | Chain OVAR | ±std | Dir Accuracy |
|---|---|---|---|---|
| phi4:14b | CONTEXT | 1.726 | 0.093 | 0.80 |
| nemotron 120B | BLIND | 1.734 | 0.133 | 0.47 |
| gpt-4.1-mini | CONTEXT | 1.737 | 0.088 | 0.72 |
| nemotron 120B | CONTEXT | 1.745 | 0.128 | 0.74 |
| gpt-4.1-mini | BLIND | 1.747 | 0.092 | 0.41 |
| phi4:14b | BLIND | 1.748 | 0.130 | 0.48 |
| o4-mini | CONTEXT | 1.748 | 0.105 | 0.72 |
| o4-mini | BLIND | 1.763 | 0.113 | 0.44 |
| gpt-4.1-mini | UNSTRUCTURED | 1.771 | 0.108 | 0.76 |
| o4-mini | UNSTRUCTURED | 1.774 | 0.106 | 0.79 |
| nemotron 120B | UNSTRUCTURED | 1.775 | 0.130 | 0.83 |
| phi4:14b | UNSTRUCTURED | 1.780 | 0.130 | 0.84 |
Best AI condition: phi4:14b context, OVAR 1.726 — 46% above the exp_smoothing target of 1.185.
Hypothesis verdicts
| Hypothesis | Prediction | Verdict |
|---|---|---|
| H1 | Any AI condition beats exp_smoothing (OVAR ≤ 1.185) | FAILED |
| H2 | Unstructured condition reduces OVAR vs. context condition | FAILED |
| H3 | Event signal improves direction accuracy by ≥ 10 percentage points | FAILED |
| H4 | Intent compliance ≥ 0.99 under all conditions | PASSED |
| H5 | World events materially change OVAR | PASSED |
H5 passed — but in a surprising direction. World events present → lower OVAR. World events removed → higher OVAR (2.08–2.12). Disruptions partially cancel their own variance contribution.
Discussion
The Equaliser Effect
The AI controls one variable: the safety stock multiplier, ranging from 0.80 to 1.30. Order amplification in this simulation is primarily driven by stochastic lead times and fill rates — randomness in when stock arrives and how much of an order gets fulfilled. These are physical supply chain properties, not features of the AI’s decision. Even perfect classification would still produce OVAR near 1.77, because the Order-Up-To formula’s response to arrival randomness dominates.
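That dominance can be demonstrated in a toy version of the same dynamics: fix the multiplier at 1.00 (i.e., hold classification constant) and toggle only the lead-time randomness. A sketch under assumed parameters (LogNormal lead times, simplified one-tier Order-Up-To — not the experiment's simulator):

```python
import numpy as np

# Arrival randomness alone inflates OVAR, with the AI's lever held fixed.
# Demand, lead-time distribution, and safety stock are illustrative assumptions.

rng = np.random.default_rng(1)
n = 200
demand = 100 + rng.normal(0, 10, n)

def ovar_with(lead_times):
    """Order-Up-To with the multiplier pinned at 1.00; only lead times vary."""
    inventory, orders = 350.0, []
    for d, lt in zip(demand, lead_times):
        target = d * (lt + 1) + 50.0   # formula responds to arrival randomness
        q = max(0.0, target - inventory)
        orders.append(q)
        inventory += q - d
    orders = np.array(orders)
    return orders.var() / demand.var()

fixed = ovar_with(np.full(n, 2.0))                        # deterministic arrivals
stochastic = ovar_with(rng.lognormal(np.log(2), 0.3, n))  # LogNormal lead times
```

In this sketch the deterministic case already amplifies (OVAR well above 1, the formula's own contribution), and randomizing lead times pushes it far higher — with the "AI decision" identical in both runs.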
Restricting the output to five labels successfully eliminated the calibration problem from V3b. It simultaneously eliminated any possibility of improvement, because the AI can no longer make the fine-grained adjustments that would close the gap to exponential smoothing. The discrete architecture trades runaway amplification for a hard ceiling.
One result from this experiment is easy to miss. Without context, every model defaults to STRONG_INCREASE in 75–95% of periods — even though only 25–33% of periods actually warrant an increase. This is catastrophically wrong. With a calendar and persona, models use all five labels accurately, with direction accuracy near 0.80. The discrimination is real. The control is not. An AI that correctly identifies the direction of demand, and then outputs the same order as one that is completely wrong, has been made intelligent in the wrong place.
V4 leaves open two interpretations of this ceiling: (a) a better model would classify more accurately, and more accurate classification would eventually move OVAR; or (b) the ceiling is architectural — even perfect labels cannot close the gap. V5 tested this directly by removing the LLM entirely and replacing it with oracle ground-truth labels.
Full code and results on GitHub
Full code, data, and raw results are available on GitHub.
Methodology note
All scenarios, companies, products, and supply chain structures are entirely fictional. The demand series is synthetic, calibrated to Indian automotive seasonal patterns across 36 months with injected world events (pandemic, geopolitical conflict, port strike).
Results should not be generalised to supply chain management broadly. The correct scope: in a 3-tier discrete intent-classification architecture with a 5-label set and a fixed multiplier lookup, no tested LLM configuration — from 14B to 120B parameters, across three information conditions — produced OVAR meaningfully below the deterministic Order-Up-To baseline.