Supply Chain · Oracle Ablation · Experiment Writeup

The Ceiling Is in the Formula

Perfect ground-truth labels, fed directly to the architecture with no LLM at all, produced worse OVAR than the Order-Up-To formula running alone. The 0.54-unit gap to exponential smoothing is preserved exactly regardless of label quality, multiplier range, or formula variant. V5 closes the intent-classification line and sets up the V6 shift to adaptive smoothing.

V4 left open two interpretations of the Equaliser Effect: the ceiling exists because current models classify poorly, or it exists because even perfect classification cannot move OVAR within the Order-Up-To formula architecture. V5 Phase 1 tested interpretation two by removing the LLM entirely and replacing it with oracle labels — ground-truth perfect classifications derived from the simulation’s own demand trajectory.

Oracle labels produced OVAR 1.776 — worse than the Order-Up-To formula running with no AI at OVAR 1.753. Perfect intelligence in the wrong position is worse than no intelligence. Fourteen architectural variants were tested. None passed the Phase 1 gate. The 0.540-unit gap to exponential smoothing is preserved exactly. The intent-classification line is closed: LLMs with intent-classification interfaces cannot replicate the variance-dampening properties of exponential smoothing within the Order-Up-To formula architecture. This is a structural incompatibility, not a model quality problem, and it is why V6 changes the control architecture rather than improving the classifier.

Design & configuration

| Setting | Value |
|---|---|
| Architecture | Oracle or causal deterministic labels → multiplier lookup → Order-Up-To formula. No LLM involved in any condition. |
| Label sources | Oracle: ground-truth labels from the simulation's GROUND_TRUTH_INTENT (perfect classifier). Causal: hand-written rule based on calendar month and event signals. |
| Conditions (14) | A1: oracle on V4 map · A2: wider multiplier maps (moderate, aggressive, asymmetric) · A3: NEUTRAL redefinitions (smoothed forecast, dampened OUT, repeat last, floor only) · A4: order dampening (β = 0.25/0.50/0.75) · A5: event-adjusted forecast oracle · A6: causal rule-based classifiers |
| Replications | 20 runs per condition |
| Gate criterion | Any condition beats order_up_to by ≥ 0.10, OR comes within 0.30 of exp_smoothing |
| Simulation | 36 months · stochastic lead times (LogNormal) · stochastic fill rates (Beta) · world events active · identical to V4 |
| Baselines | exp_smoothing: 1.193 · naive_passthrough: 0.996 · order_up_to: 1.753 |
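Concretely, the pipeline under test reduces to a dictionary lookup followed by the Order-Up-To computation. A minimal sketch, assuming an illustrative multiplier map and formula shape (the exact V5 values and terms differ):

```python
# Hypothetical sketch: label -> multiplier lookup -> Order-Up-To order.
# Map values and formula details are illustrative, not the V5 config.
MULTIPLIER_MAP = {
    "STRONG_DECREASE": 0.75,
    "MILD_DECREASE": 0.90,
    "NEUTRAL": 1.00,
    "MILD_INCREASE": 1.10,
    "STRONG_INCREASE": 1.25,
}

def order_up_to(forecast, base_safety_stock, inventory_position, label):
    """Order-Up-To target with an intent-scaled safety-stock term."""
    target = forecast + MULTIPLIER_MAP[label] * base_safety_stock
    return max(0.0, target - inventory_position)

# Same state, different label: only the safety-stock term moves.
print(order_up_to(100.0, 20.0, 90.0, "NEUTRAL"))          # 30.0
print(order_up_to(100.0, 20.0, 90.0, "STRONG_INCREASE"))  # 35.0
```

The point of the ablation is that everything upstream of this function was varied (label source, map width, NEUTRAL semantics) while the function's structure stayed fixed.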

What I found

  1. Perfect labels are worth nothing on the V4 multiplier map. oracle_v4map — ground-truth perfect classifications fed into the V4 conservative map — produced OVAR 1.776. That is worse than order_up_to at 1.753. A hypothetical LLM that classifies every single period perfectly would produce worse supply chain outcomes than the formula running alone with no AI guidance whatsoever.
  2. Wider multiplier ranges always make things worse. All three wider map variants (moderate ±25/50%, aggressive ±40/80%, asymmetric) produced higher OVAR (1.79–1.86). Larger safety-stock swings compound variance further upstream regardless of how accurate the labels are. The V4 map was not too conservative — it was already as permissive as the architecture can tolerate.
  3. NEUTRAL redefinition is the only lever with any traction — and only barely. neutral_smoothed_forecast (NEUTRAL → order = raw forecast with no safety stock) achieves OVAR 1.733, beating order_up_to by 0.019. All other NEUTRAL redefinitions (repeat last order, partial dampening, no order if stocked) were substantially worse — 2.0 to 2.3 OVAR. One specific redefinition helped marginally. Everything else hurt.
  4. A hand-written rule equals a perfect oracle. causal_context — a rule that classifies based on calendar month and event signals, with no machine learning whatsoever — achieved OVAR 1.749. oracle_v4map achieved 1.776. The hand-written rule was slightly better than ground-truth perfect labels. Calendar and event labels carry essentially no predictive value for variance reduction in this architecture after the lookup table bottlenecks them.
  5. The 0.540 gap to exponential smoothing is invariant to label quality. The gap from the best Phase 1 result (1.733) to exp_smoothing (1.193) is 0.540. The gap from the best V4 LLM result (1.726) to V4’s exp_smoothing (1.185) was also 0.540. This gap does not move when label quality improves from LLM-level to oracle-level. It is architectural — generated by the safety-stock structure of the OUT formula itself.
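The one redefinition with traction (finding 3 above) amounts to dropping the safety-stock term entirely when the label is NEUTRAL. A minimal sketch, with the formula shape assumed from the description and illustrative numbers:

```python
def order(forecast, base_safety_stock, inventory_position,
          label, multiplier):
    """neutral_smoothed_forecast variant, sketched from the writeup:
    NEUTRAL orders the raw forecast with no safety stock; every other
    label keeps the standard Order-Up-To form. Illustrative only."""
    if label == "NEUTRAL":
        target = forecast  # no safety-stock term at all
    else:
        target = forecast + multiplier * base_safety_stock
    return max(0.0, target - inventory_position)

print(order(100.0, 20.0, 90.0, "NEUTRAL", 1.00))         # 10.0
print(order(100.0, 20.0, 90.0, "STRONG_INCREASE", 1.25)) # 35.0
```

NEUTRAL periods thus behave like the pass-through baseline, which is why this was the only variant that moved OVAR below the formula floor at all.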

All 14 conditions + baselines

| Condition | Chain OVAR | Stockouts | Notes |
|---|---|---|---|
| exp_smoothing | 1.193 | 89.5 | benchmark |
| naive_passthrough | 0.996 | 95.4 | pass-through |
| *Phase 1 candidates — gap: 0.540 to exp_smoothing* | | | |
| neutral_smoothed_forecast | 1.733 | 89.5 | best phase 1 |
| causal_context | 1.749 | 87.2 | hand-written rule |
| order_up_to | 1.753 | 87.3 | formula floor |
| causal_unstructured | 1.769 | 87.0 | rule + events |
| dampened_beta50 | 1.765 | 93.5 | β = 0.50 |
| oracle_v4map | 1.776 | 87.0 | perfect labels |
| oracle_moderate | 1.798 | 87.2 | wider map |
| oracle_asymmetric | 1.831 | 87.0 | |
| oracle_aggressive | 1.859 | 87.0 | widest map |
| dampened_beta75 | 1.963 | 89.0 | |
| neutral_dampened_out | 2.000 | 89.7 | |
| forecast_oracle_events | 2.009 | 85.3 | event-adjusted forecast |
| neutral_repeat_last | 2.223 | 88.2 | |
| neutral_floor_only | 2.313 | 89.2 | worst overall |

Lower OVAR = better. Gate criteria: beat order_up_to by ≥ 0.10 (best margin: 0.019) OR within 0.30 of exp_smoothing (closest: 0.540). Both criteria failed. Phase 2 LLM conditions not justified.
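The gate is mechanical enough to state as code. A sketch using the OVAR values from the table above (the helper name `gate_passed` is mine, not part of the experiment code):

```python
EXP_SMOOTHING = 1.193  # benchmark OVAR
ORDER_UP_TO = 1.753    # formula-floor OVAR

def gate_passed(candidate_ovar):
    """Phase 1 gate: beat order_up_to by >= 0.10, OR come within
    0.30 of exp_smoothing. Lower OVAR is better."""
    beats_formula = (ORDER_UP_TO - candidate_ovar) >= 0.10
    near_benchmark = (candidate_ovar - EXP_SMOOTHING) <= 0.30
    return beats_formula or near_benchmark

best_phase1 = 1.733  # neutral_smoothed_forecast
print(gate_passed(best_phase1))  # False: fails both clauses
```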

Why exponential smoothing wins structurally

Exponential smoothing uses an EMA forecast (α = 0.30) with no safety-stock adjustment. Each tier independently smooths its orders toward a dampened estimate of upstream demand; safety stock is never added on top. The intent-classification architecture, by contrast, always applies a safety-stock term (multiplier × base_SS) to every order, even when the multiplier is 1.00 (NEUTRAL). This safety-stock addition protects service levels (stockouts are similar across architectures) but adds inventory-driven order volatility that compounds at every upstream tier.
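The smoothing rule as described can be sketched directly; the demand numbers below are invented purely to show the dampening:

```python
ALPHA = 0.30  # smoothing weight used by the exp_smoothing baseline

def ema_order(prev_estimate, observed_demand, alpha=ALPHA):
    """Each tier orders its smoothed estimate of upstream demand;
    no safety-stock term is ever added."""
    return alpha * observed_demand + (1 - alpha) * prev_estimate

est, orders = 100.0, []
for demand in [140, 80, 120, 60]:  # swings of +/- 40 around 100
    est = ema_order(est, demand)
    orders.append(round(est, 1))
print(orders)  # [112.0, 102.4, 107.7, 93.4] -- far narrower than demand
```

The order stream each upstream tier sees is strictly narrower than the demand it absorbs, which is exactly the property the safety-stock architecture lacks.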

The two architectures are solving different problems. Exponential smoothing optimises variance stability. Safety stock optimises service level protection. They cannot be directly compared as equivalent alternatives without acknowledging this tradeoff. Measuring one with the other’s metric identifies a gap that is inherent to the design goals, not correctable by improving the classifier.

oracle_v4map being worse than order_up_to makes this concrete. The formula running with a fixed NEUTRAL classification (multiplier = 1.0) achieves 1.753. The formula running with perfect ground-truth classification achieves 1.776, because perfect classification selects labels like STRONG_INCREASE and STRONG_DECREASE that apply larger safety-stock corrections in exactly the periods that warrant them. Those corrections are directionally correct, but they create upstream amplification that the fixed-NEUTRAL case avoids. The AI is penalised precisely because it is right.
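A toy calculation (not the V5 simulator; demand and multiplier values are invented) makes the amplification visible: multipliers that correctly anticipate every swing still add safety-stock movement on top of the demand signal, widening the order stream the next tier upstream must absorb.

```python
import statistics

demand = [100, 120, 100, 80, 100, 120, 100, 80]
# "Oracle" multipliers that correctly anticipate each swing:
oracle_m = [1.00, 1.25, 1.00, 0.75, 1.00, 1.25, 1.00, 0.75]
BASE_SS = 40.0

def orders(multipliers):
    """Order-Up-To-style orders: demand signal plus scaled safety stock."""
    return [d + m * BASE_SS for d, m in zip(demand, multipliers)]

neutral = orders([1.00] * len(demand))  # fixed NEUTRAL multiplier
oracle = orders(oracle_m)               # perfectly-timed multipliers
print(statistics.pstdev(neutral))  # tracks demand variance only
print(statistics.pstdev(oracle))   # larger: correct swings still amplify
```

The fixed multiplier adds a constant offset (zero extra variance); the oracle's correct swings move with demand and therefore stack on top of it.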

What each version proved

| Version | Architecture | Core finding |
|---|---|---|
| V1 | LLM outputs exact order quantities, no guardrails | Wildly unstable; OVAR far off-chart |
| V2 / V2a | LLM output capped, direct quantities, 25 months | 8–12× worse than exp_smoothing on both OVAR and stockouts |
| V3b | LLM float multiplier × OUT formula, 25 months | Context penalty; memory collapse; OVAR 2.3–3.1 |
| V4 | 5-label discrete intent → lookup → OUT formula, 36 months | Equaliser Effect; all models 1.73–1.78 regardless of capability |
| V5 | Oracle/causal labels, 14 architectural variants, no LLM | Perfect labels and formula variants cannot close the 0.54 gap; program closed |

Full code and results on GitHub

Full code, data, and raw results are available on GitHub.

Methodology note

All scenarios, companies, products, and supply chain structures are entirely fictional. The demand series is synthetic, calibrated to Indian automotive seasonal patterns, 36 months with world events injected. Results should not be generalised to supply chain management broadly.

The precise scope of the finding: no combination of label quality (up to oracle-perfect), multiplier-map design, NEUTRAL redefinition, or order dampening, applied to a 3-tier intent-classification architecture built on the Order-Up-To formula, could close the 0.54-unit gap to exponential smoothing. The gap is preserved regardless of classification quality.
