TL;DR
V4 left open two interpretations of the Equaliser Effect: the ceiling exists because current models classify poorly, or it exists because even perfect classification cannot move OVAR within the Order-Up-To formula architecture. V5 Phase 1 tested interpretation two by removing the LLM entirely and replacing it with oracle labels — ground-truth perfect classifications derived from the simulation’s own demand trajectory.
Oracle labels produced OVAR 1.776 — worse than the Order-Up-To formula running with no AI at OVAR 1.753. Perfect intelligence in the wrong position is worse than no intelligence. Fourteen architectural variants were tested. None passed the Phase 1 gate. The 0.540-unit gap to exponential smoothing is preserved exactly. The intent-classification line is closed: LLMs with intent-classification interfaces cannot replicate the variance-dampening properties of exponential smoothing within the Order-Up-To formula architecture. This is a structural incompatibility, not a model quality problem, and it is why V6 changes the control architecture rather than improving the classifier.
Experiment Setup
Design & configuration
| Setting | Value |
|---|---|
| Architecture | Oracle or causal deterministic labels → multiplier lookup → Order-Up-To formula. No LLM involved in any condition. |
| Label sources | Oracle: ground-truth labels from simulation’s GROUND_TRUTH_INTENT (perfect classifier). Causal: hand-written rule based on calendar month and event signals. |
| Conditions (14) | A1: oracle on V4 map · A2: wider multiplier maps (moderate, aggressive, asymmetric) · A3: NEUTRAL redefinitions (smoothed forecast, dampened OUT, repeat last, floor only) · A4: order dampening (β=0.25/0.50/0.75) · A5: event-adjusted forecast oracle · A6: causal rule-based classifiers |
| Replications | 20 runs per condition |
| Gate criterion | Any condition beats order_up_to by ≥ 0.10, OR comes within 0.30 of exp_smoothing |
| Simulation | 36 months · stochastic lead times (LogNormal) · stochastic fill rates (Beta) · world events active · identical to V4 |
| Baselines (chain OVAR) | exp_smoothing: 1.193 · naive_passthrough: 0.996 · order_up_to: 1.753 |
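To make the pipeline concrete, here is a minimal sketch of the control loop every Phase 1 condition shares, assuming a generic order-up-to policy. The multiplier values in `INTENT_MULTIPLIERS`, the month/event thresholds in `causal_label`, and all function names are illustrative assumptions for this post, not the repository's implementation; only the structure (label → fixed lookup → scaled safety stock inside the Order-Up-To order) mirrors the design described above.

```python
# Sketch of the V5 Phase 1 control loop: label -> multiplier lookup -> Order-Up-To order.
# Label names follow the 5-intent scheme from V4; the multiplier values and the exact
# forecast / inventory-position bookkeeping are illustrative placeholders.

INTENT_MULTIPLIERS = {            # hypothetical "V4-style" conservative map
    "STRONG_DECREASE": 0.85,
    "DECREASE":        0.95,
    "NEUTRAL":         1.00,
    "INCREASE":        1.05,
    "STRONG_INCREASE": 1.15,
}

def oracle_label(ground_truth_intent: str) -> str:
    """Oracle condition: pass the simulation's GROUND_TRUTH_INTENT straight through."""
    return ground_truth_intent

def causal_label(month: int, event_active: bool) -> str:
    """Causal condition (hypothetical rule): calendar month plus event signal, no ML."""
    if event_active:
        return "STRONG_INCREASE"
    if month in (9, 10, 11):      # e.g. festive-season build-up
        return "INCREASE"
    if month in (1, 2):
        return "DECREASE"
    return "NEUTRAL"

def order_up_to_order(forecast: float, base_safety_stock: float,
                      inventory_position: float, label: str) -> float:
    """Order-Up-To with an intent-scaled safety stock term."""
    multiplier = INTENT_MULTIPLIERS[label]
    target = forecast + multiplier * base_safety_stock   # order-up-to level
    return max(0.0, target - inventory_position)
```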
Key findings
What I found
- **Perfect labels are worth nothing on the V4 multiplier map.** oracle_v4map, ground-truth perfect classifications fed into the V4 conservative map, produced OVAR 1.776. That is worse than order_up_to at 1.753. A hypothetical LLM that classifies every single period perfectly would produce worse supply chain outcomes than the formula running alone with no AI guidance whatsoever.
- **Wider multiplier ranges always make things worse.** All three wider map variants (moderate ±25/50%, aggressive ±40/80%, asymmetric) produced higher OVAR (1.79–1.86). Larger safety-stock swings compound variance further upstream regardless of how accurate the labels are. The V4 map was not too conservative; it was already as permissive as the architecture can tolerate.
- **NEUTRAL redefinition is the only lever with any traction, and only barely.** neutral_smoothed_forecast (NEUTRAL → order = raw forecast with no safety stock) achieves OVAR 1.733, beating order_up_to by 0.019. All other NEUTRAL redefinitions (repeat last order, partial dampening, no order if stocked) were substantially worse, at 2.0 to 2.3 OVAR. One specific redefinition helped marginally. Everything else hurt.
- **A hand-written rule equals a perfect oracle.** causal_context, a rule that classifies based on calendar month and event signals with no machine learning whatsoever, achieved OVAR 1.749. oracle_v4map achieved 1.776. The hand-written rule was slightly better than ground-truth perfect labels. Calendar and event labels carry essentially no predictive value for variance reduction in this architecture once the lookup table bottlenecks them.
- **The 0.540 gap to exponential smoothing is invariant to label quality.** The gap from the best Phase 1 result (1.733) to exp_smoothing (1.193) is 0.540. The gap from the best V4 LLM result (1.726) to V4's exp_smoothing (1.185) was also 0.540. This gap does not move when label quality improves from LLM-level to oracle-level. It is architectural, generated by the safety-stock structure of the OUT formula itself.
Results
All 14 conditions + baselines
| Condition | Chain OVAR | Stockouts | Notes |
|---|---|---|---|
| exp_smoothing | 1.193 | 89.5 | benchmark |
| naive_passthrough | 0.996 | 95.4 | pass-through |
| Phase 1 candidates (gap to exp_smoothing: 0.540) | | | |
| neutral_smoothed_forecast | 1.733 | 89.5 | best phase 1 |
| causal_context | 1.749 | 87.2 | hand-written rule |
| order_up_to | 1.753 | 87.3 | formula floor |
| dampened_beta50 | 1.765 | 93.5 | β=0.50 |
| causal_unstructured | 1.769 | 87.0 | rule + events |
| oracle_v4map | 1.776 | 87.0 | perfect labels |
| oracle_moderate | 1.798 | 87.2 | wider map |
| oracle_asymmetric | 1.831 | 87.0 | |
| oracle_aggressive | 1.859 | 87.0 | widest map |
| dampened_beta75 | 1.963 | 89.0 | |
| neutral_dampened_out | 2.000 | 89.7 | |
| forecast_oracle_events | 2.009 | 85.3 | event-adjusted F |
| neutral_repeat_last | 2.223 | 88.2 | |
| neutral_floor_only | 2.313 | 89.2 | worst overall |
Lower OVAR = better. Gate criteria: beat order_up_to by ≥ 0.10 (best margin: 0.019) OR within 0.30 of exp_smoothing (closest: 0.540). Both criteria failed. Phase 2 LLM conditions not justified.
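The gate evaluation itself is mechanical. A small sketch, using the thresholds from the gate criterion and the reported condition means from the table above (the function name and structure are mine, not the repository's):

```python
# Phase 1 gate: a condition passes if it beats order_up_to by >= 0.10
# or lands within 0.30 of exp_smoothing. Values are the reported chain OVAR means.
ORDER_UP_TO, EXP_SMOOTHING = 1.753, 1.193

def passes_gate(chain_ovar: float) -> bool:
    beats_formula = (ORDER_UP_TO - chain_ovar) >= 0.10
    near_benchmark = (chain_ovar - EXP_SMOOTHING) <= 0.30
    return beats_formula or near_benchmark

best = {"neutral_smoothed_forecast": 1.733, "causal_context": 1.749, "oracle_v4map": 1.776}
print({name: passes_gate(ovar) for name, ovar in best.items()})  # all False
```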
Discussion
Why exponential smoothing wins structurally
Exponential smoothing uses an EMA forecast (α=0.30) with no safety stock adjustment. Each tier independently smooths its orders toward a dampened estimate of upstream demand. Safety stock is never added on top. The intent-classification architecture, by contrast, always applies a safety stock term (multiplier × base_SS) to every order, even when the multiplier is 1.00 (NEUTRAL). This safety stock addition protects service levels — stockouts are similar across architectures — but adds inventory-driven order volatility that compounds at every upstream tier.
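A side-by-side sketch of the two per-tier order rules makes the structural difference explicit. Variable names are mine and the simulation's exact forecast and inventory bookkeeping may differ, but the α=0.30 EMA and the multiplier × base_SS term follow the description above.

```python
# Two per-tier, per-period order rules. exp_smoothing carries an EMA of observed demand
# and orders that estimate with no safety stock; the intent architecture re-adds a
# safety stock term (multiplier * base_ss) to every order.
ALPHA = 0.30

def exp_smoothing_order(prev_ema: float, observed_demand: float) -> tuple[float, float]:
    ema = ALPHA * observed_demand + (1 - ALPHA) * prev_ema
    return ema, ema                          # (new EMA state, order): no safety stock term

def intent_order(forecast: float, base_ss: float, inventory_position: float,
                 multiplier: float) -> float:
    target = forecast + multiplier * base_ss  # safety stock always present, even at 1.00
    return max(0.0, target - inventory_position)
```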
The two architectures are solving different problems. Exponential smoothing optimises variance stability. Safety stock optimises service level protection. They cannot be directly compared as equivalent alternatives without acknowledging this tradeoff. Measuring one with the other’s metric identifies a gap that is inherent to the design goals, not correctable by improving the classifier.
oracle_v4map being worse than order_up_to makes this concrete. The formula running with a fixed NEUTRAL classification (multiplier = 1.0) achieves 1.753. The formula running with perfect ground-truth classification achieves 1.776 — because perfect classification selects labels like STRONG_INCREASE and STRONG_DECREASE that apply larger safety-stock corrections in the periods that warrant them. Those corrections are directionally correct but create upstream amplification that the NEUTRAL case avoids. The AI is being hurt by being right.
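A toy two-period example (not the V5 simulation, and using illustrative multipliers of 0.85 and 1.15) shows the mechanism: when the label correctly tracks demand, the scaled safety stock term moves in the same direction as demand, so the order swing is larger than the demand swing, while a fixed NEUTRAL multiplier adds only a constant offset.

```python
# Toy illustration of the amplification mechanism (not the V5 simulation).
# Demand alternates low/high; the "correct" label scales safety stock down/up with it.
base_ss = 50.0
demand = [80.0, 120.0]                                   # swing of 40 units

fixed_orders   = [d + 1.00 * base_ss for d in demand]                       # NEUTRAL every period
correct_orders = [d + m * base_ss for d, m in zip(demand, [0.85, 1.15])]    # label tracks demand

print(max(fixed_orders) - min(fixed_orders))      # 40.0 -> same swing as demand
print(max(correct_orders) - min(correct_orders))  # 55.0 -> larger swing, passed upstream
```

The wider the multiplier map, the larger that extra swing, which is consistent with the moderate, aggressive, and asymmetric maps all scoring worse despite perfect labels.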
The Five-Version Chain
What each version proved
| Version | Architecture | Core finding |
|---|---|---|
| V1 | LLM outputs exact order quantities, no guardrails | Wildly unstable; OVAR far off-chart |
| V2 / V2a | LLM output capped, direct quantities, 25 months | 8–12× worse than exp_smoothing on both OVAR and stockouts |
| V3b | LLM float multiplier × OUT formula, 25 months | Context penalty; memory collapse; OVAR 2.3–3.1 |
| V4 | 5-label discrete intent → lookup → OUT formula, 36 months | Equaliser Effect; all models 1.73–1.78 regardless of capability |
| V5 | Oracle/causal labels, 14 architectural variants, no LLM | Perfect labels and formula variants cannot close the 0.54 gap. Program closed. |
Full code and results on GitHub
Full code, data, and raw results are available on GitHub.
Methodology note
All scenarios, companies, products, and supply chain structures are entirely fictional. The demand series is synthetic, calibrated to Indian automotive seasonal patterns, spanning 36 months with world events injected. Results should not be generalised to supply chain management broadly.
The correct scope of the claim: no combination of label quality (up to oracle-perfect), multiplier map design, NEUTRAL redefinition, or order dampening, applied to a 3-tier intent-classification architecture built on the Order-Up-To formula, could close the 0.54-unit gap to exponential smoothing. The gap is preserved regardless of classification quality.