Supply Chain · Oracle Ablation · Experiment Writeup

The Ceiling Is in the Formula

Perfect ground-truth labels, fed directly to the architecture with no LLM at all, produced worse OVAR than the Order-Up-To formula running alone. The 0.54-unit gap to exponential smoothing is preserved exactly regardless of label quality, multiplier range, or formula variant. V5 closes the intent-classification line and sets up the V6 shift to adaptive smoothing.

V4 left open two interpretations of the Equaliser Effect: the ceiling exists because current models classify poorly, or it exists because even perfect classification cannot move OVAR within the Order-Up-To formula architecture. V5 Phase 1 tested interpretation two by removing the LLM entirely and replacing it with oracle labels — ground-truth perfect classifications derived from the simulation’s own demand trajectory.

Oracle labels produced OVAR 1.776 — worse than the Order-Up-To formula running with no AI at OVAR 1.753. Perfect intelligence in the wrong position is worse than no intelligence. Fourteen architectural variants were tested. None passed the Phase 1 gate. The 0.540-unit gap to exponential smoothing is preserved exactly. The intent-classification line is closed: LLMs with intent-classification interfaces cannot replicate the variance-dampening properties of exponential smoothing within the Order-Up-To formula architecture. This is a structural incompatibility, not a model quality problem, and it is why V6 changes the control architecture rather than improving the classifier.

Design & configuration

| Setting | Value |
|---|---|
| Architecture | Oracle or causal deterministic labels → multiplier lookup → Order-Up-To formula. No LLM involved in any condition. |
| Label sources | Oracle: ground-truth labels from the simulation's GROUND_TRUTH_INTENT (perfect classifier). Causal: hand-written rule based on calendar month and event signals. |
| Conditions (14) | A1: oracle on V4 map · A2: wider multiplier maps (moderate, aggressive, asymmetric) · A3: NEUTRAL redefinitions (smoothed forecast, dampened OUT, repeat last, floor only) · A4: order dampening (β = 0.25/0.50/0.75) · A5: event-adjusted forecast oracle · A6: causal rule-based classifiers |
| Replications | 20 runs per condition |
| Gate criterion | Any condition beats order_up_to by ≥ 0.10, OR comes within 0.30 of exp_smoothing |
| Simulation | 36 months · stochastic lead times (LogNormal) · stochastic fill rates (Beta) · world events active · identical to V4 |
| Baselines | exp_smoothing: 1.193 · naive_passthrough: 0.996 · order_up_to: 1.753 |
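Concretely, the pipeline under test reduces to a dictionary lookup followed by the Order-Up-To computation. A minimal sketch, assuming an illustrative multiplier map and formula shape (the exact V5 values and terms differ):

```python
# Hypothetical sketch: label -> multiplier lookup -> Order-Up-To order.
# Map values and formula details are illustrative, not the V5 config.
MULTIPLIER_MAP = {
    "STRONG_DECREASE": 0.75,
    "MILD_DECREASE": 0.90,
    "NEUTRAL": 1.00,
    "MILD_INCREASE": 1.10,
    "STRONG_INCREASE": 1.25,
}

def order_up_to(forecast, base_safety_stock, inventory_position, label):
    """Order-Up-To target with an intent-scaled safety-stock term."""
    target = forecast + MULTIPLIER_MAP[label] * base_safety_stock
    return max(0.0, target - inventory_position)

# Same state, different label: only the safety-stock term moves.
print(order_up_to(100.0, 20.0, 90.0, "NEUTRAL"))          # 30.0
print(order_up_to(100.0, 20.0, 90.0, "STRONG_INCREASE"))  # 35.0
```

The point of the ablation is that everything upstream of this function was varied (label source, map width, NEUTRAL semantics) while the function's structure stayed fixed.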

What I found

  1. Perfect labels are worth nothing on the V4 multiplier map. oracle_v4map — ground-truth perfect classifications fed into the V4 conservative map — produced OVAR 1.776. That is worse than order_up_to at 1.753. A hypothetical LLM that classifies every single period perfectly would produce worse supply chain outcomes than the formula running alone with no AI guidance whatsoever.
  2. Wider multiplier ranges always make things worse. All three wider map variants (moderate ±25/50%, aggressive ±40/80%, asymmetric) produced higher OVAR (1.79–1.86). Larger safety-stock swings compound variance further upstream regardless of how accurate the labels are. The V4 map was not too conservative — it was already as permissive as the architecture can tolerate.
  3. NEUTRAL redefinition is the only lever with any traction — and only barely. neutral_smoothed_forecast (NEUTRAL → order = raw forecast with no safety stock) achieves OVAR 1.733, beating order_up_to by 0.019. All other NEUTRAL redefinitions (repeat last order, partial dampening, no order if stocked) were substantially worse — 2.0 to 2.3 OVAR. One specific redefinition helped marginally. Everything else hurt.
  4. A hand-written rule equals a perfect oracle. causal_context — a rule that classifies based on calendar month and event signals, with no machine learning whatsoever — achieved OVAR 1.749. oracle_v4map achieved 1.776. The hand-written rule was slightly better than ground-truth perfect labels. Calendar and event labels carry essentially no predictive value for variance reduction in this architecture after the lookup table bottlenecks them.
  5. The 0.540 gap to exponential smoothing is invariant to label quality. The gap from the best Phase 1 result (1.733) to exp_smoothing (1.193) is 0.540. The gap from the best V4 LLM result (1.726) to V4’s exp_smoothing (1.185) was also 0.540. This gap does not move when label quality improves from LLM-level to oracle-level. It is architectural — generated by the safety-stock structure of the OUT formula itself.
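The one redefinition with traction (finding 3 above) amounts to dropping the safety-stock term entirely when the label is NEUTRAL. A minimal sketch, with the formula shape assumed from the description and illustrative numbers:

```python
def order(forecast, base_safety_stock, inventory_position,
          label, multiplier):
    """neutral_smoothed_forecast variant, sketched from the writeup:
    NEUTRAL orders the raw forecast with no safety stock; every other
    label keeps the standard Order-Up-To form. Illustrative only."""
    if label == "NEUTRAL":
        target = forecast  # no safety-stock term at all
    else:
        target = forecast + multiplier * base_safety_stock
    return max(0.0, target - inventory_position)

print(order(100.0, 20.0, 90.0, "NEUTRAL", 1.00))         # 10.0
print(order(100.0, 20.0, 90.0, "STRONG_INCREASE", 1.25)) # 35.0
```

NEUTRAL periods thus behave like the pass-through baseline, which is why this was the only variant that moved OVAR below the formula floor at all.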

All 14 conditions + baselines

| Condition | Chain OVAR | Stockouts | Notes |
|---|---|---|---|
| exp_smoothing | 1.193 | 89.5 | benchmark |
| naive_passthrough | 0.996 | 95.4 | pass-through |
| *Phase 1 candidates — gap: 0.540 to exp_smoothing* | | | |
| neutral_smoothed_forecast | 1.733 | 89.5 | best phase 1 |
| causal_context | 1.749 | 87.2 | hand-written rule |
| order_up_to | 1.753 | 87.3 | formula floor |
| causal_unstructured | 1.769 | 87.0 | rule + events |
| dampened_beta50 | 1.765 | 93.5 | β = 0.50 |
| oracle_v4map | 1.776 | 87.0 | perfect labels |
| oracle_moderate | 1.798 | 87.2 | wider map |
| oracle_asymmetric | 1.831 | 87.0 | |
| oracle_aggressive | 1.859 | 87.0 | widest map |
| dampened_beta75 | 1.963 | 89.0 | |
| neutral_dampened_out | 2.000 | 89.7 | |
| forecast_oracle_events | 2.009 | 85.3 | event-adjusted forecast |
| neutral_repeat_last | 2.223 | 88.2 | |
| neutral_floor_only | 2.313 | 89.2 | worst overall |

Lower OVAR = better. Gate criteria: beat order_up_to by ≥ 0.10 (best margin: 0.019) OR within 0.30 of exp_smoothing (closest: 0.540). Both criteria failed. Phase 2 LLM conditions not justified.
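The gate is mechanical enough to state as code. A sketch using the OVAR values from the table above (the helper name `gate_passed` is mine, not part of the experiment code):

```python
EXP_SMOOTHING = 1.193  # benchmark OVAR
ORDER_UP_TO = 1.753    # formula-floor OVAR

def gate_passed(candidate_ovar):
    """Phase 1 gate: beat order_up_to by >= 0.10, OR come within
    0.30 of exp_smoothing. Lower OVAR is better."""
    beats_formula = (ORDER_UP_TO - candidate_ovar) >= 0.10
    near_benchmark = (candidate_ovar - EXP_SMOOTHING) <= 0.30
    return beats_formula or near_benchmark

best_phase1 = 1.733  # neutral_smoothed_forecast
print(gate_passed(best_phase1))  # False: fails both clauses
```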

Why exponential smoothing wins structurally

Exponential smoothing uses an EMA forecast (α = 0.30) with no safety-stock adjustment. Each tier independently smooths its orders toward a dampened estimate of upstream demand; safety stock is never added on top. The intent-classification architecture, by contrast, always applies a safety-stock term (multiplier × base_SS) to every order, even when the multiplier is 1.00 (NEUTRAL). This safety-stock addition protects service levels (stockouts are similar across architectures) but adds inventory-driven order volatility that compounds at every upstream tier.
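The smoothing rule as described can be sketched directly; the demand numbers below are invented purely to show the dampening:

```python
ALPHA = 0.30  # smoothing weight used by the exp_smoothing baseline

def ema_order(prev_estimate, observed_demand, alpha=ALPHA):
    """Each tier orders its smoothed estimate of upstream demand;
    no safety-stock term is ever added."""
    return alpha * observed_demand + (1 - alpha) * prev_estimate

est, orders = 100.0, []
for demand in [140, 80, 120, 60]:  # swings of +/- 40 around 100
    est = ema_order(est, demand)
    orders.append(round(est, 1))
print(orders)  # [112.0, 102.4, 107.7, 93.4] -- far narrower than demand
```

The order stream each upstream tier sees is strictly narrower than the demand it absorbs, which is exactly the property the safety-stock architecture lacks.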

The two architectures are solving different problems. Exponential smoothing optimises variance stability. Safety stock optimises service level protection. They cannot be directly compared as equivalent alternatives without acknowledging this tradeoff. Measuring one with the other’s metric identifies a gap that is inherent to the design goals, not correctable by improving the classifier.

oracle_v4map being worse than order_up_to makes this concrete. The formula running with a fixed NEUTRAL classification (multiplier = 1.0) achieves 1.753. The formula running with perfect ground-truth classification achieves 1.776, because perfect classification selects labels like STRONG_INCREASE and STRONG_DECREASE that apply larger safety-stock corrections in exactly the periods that warrant them. Those corrections are directionally correct, but they create upstream amplification that the fixed-NEUTRAL case avoids. The AI is penalised precisely because it is right.
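A toy calculation (not the V5 simulator; demand and multiplier values are invented) makes the amplification visible: multipliers that correctly anticipate every swing still add safety-stock movement on top of the demand signal, widening the order stream the next tier upstream must absorb.

```python
import statistics

demand = [100, 120, 100, 80, 100, 120, 100, 80]
# "Oracle" multipliers that correctly anticipate each swing:
oracle_m = [1.00, 1.25, 1.00, 0.75, 1.00, 1.25, 1.00, 0.75]
BASE_SS = 40.0

def orders(multipliers):
    """Order-Up-To-style orders: demand signal plus scaled safety stock."""
    return [d + m * BASE_SS for d, m in zip(demand, multipliers)]

neutral = orders([1.00] * len(demand))  # fixed NEUTRAL multiplier
oracle = orders(oracle_m)               # perfectly-timed multipliers
print(statistics.pstdev(neutral))  # tracks demand variance only
print(statistics.pstdev(oracle))   # larger: correct swings still amplify
```

The fixed multiplier adds a constant offset (zero extra variance); the oracle's correct swings move with demand and therefore stack on top of it.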

What each version proved

| Version | Architecture | Core finding |
|---|---|---|
| V1 | LLM outputs exact order quantities, no guardrails | Wildly unstable; OVAR far off-chart |
| V2 / V2a | LLM output capped, direct quantities, 25 months | 8–12× worse than exp_smoothing on both OVAR and stockouts |
| V3b | LLM float multiplier × OUT formula, 25 months | Context penalty; memory collapse; OVAR 2.3–3.1 |
| V4 | 5-label discrete intent → lookup → OUT formula, 36 months | Equaliser Effect; all models 1.73–1.78 regardless of capability |
| V5 | Oracle/causal labels, 14 architectural variants, no LLM | Perfect labels and formula variants cannot close the 0.54 gap; program closed |

Full code and results on GitHub

Full code, data, and raw results are available on GitHub.

Methodology note

All scenarios, companies, products, and supply chain structures are entirely fictional. The demand series is synthetic, calibrated to Indian automotive seasonal patterns, 36 months with world events injected. Results should not be generalised to supply chain management broadly.

The precise scope of the finding: no combination of label quality (up to oracle-perfect), multiplier-map design, NEUTRAL redefinition, or order dampening, applied to a 3-tier intent-classification architecture built on the Order-Up-To formula, could close the 0.54-unit gap to exponential smoothing. The gap is preserved regardless of classification quality.
