TL;DR
Version 3b ended with a clear diagnosis: LLMs cannot invent precise multipliers. The proposed fix was to remove the invented number entirely — replace it with a pre-approved text label, and let a hard-coded lookup table do the arithmetic. This is V4.
The fix worked as a guardrail. OVAR dropped from the 2.3–3.1 range of V3b to a tight 1.73–1.78 across all four models and all three information conditions. No configuration caused catastrophic amplification. The fix did not work as an improvement. A 120-billion-parameter reasoning model and a lightweight fast model produced outcomes within 0.05 OVAR units of each other. Context doubled classification accuracy and barely moved the supply chain outcome. The Equaliser Effect: a discrete-label architecture creates a structural ceiling that no amount of intelligence, context, or prompt engineering can escape.
Experiment Setup
Design & configuration
| Component | Details |
|---|---|
| Architecture | 5-label intent classification → hard-coded lookup → Order-Up-To formula. The AI’s only job: pick one text label per period per tier. Everything downstream is deterministic software. |
| Labels & lookup | STRONG_INCREASE→1.30 · MOD_INCREASE→1.15 · NEUTRAL→1.00 · MOD_DECREASE→0.90 · STRONG_DECREASE→0.80 |
| Models | gpt-4.1-mini · o4-mini (reasoning) · phi4:14b (local 14B) · nemotron-super 120B (local 120B) |
| Conditions | Blind (numbers only) · Context (calendar + seasonal persona) · Unstructured (context + live event headlines) |
| Replications | 20 runs per frontier condition · 10 runs per local condition · 100 runs per baseline |
| Simulation horizon | 36 months · stochastic lead times (LogNormal) · stochastic fill rates (Beta) |
| World events | Pandemic (months 7–12) · Geopolitical conflict (months 19–21) · Port strike (months 28–30) |
| Supply chain | 3-tier serial: Tatva Motors OEM → Ancillary Supplier → Component Supplier |
| Primary metric | Chain OVAR = Var(orders) / Var(demand). Target: exp_smoothing at OVAR 1.185. |
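The whole decision pipeline fits in a few lines. A minimal sketch, assuming a generic Order-Up-To form — the function shape, the safety-stock term, and exactly where the multiplier enters the formula are illustrative assumptions, not the experiment's actual code:

```python
# Sketch of the V4 architecture: the LLM emits one label, everything
# downstream is deterministic. Where the multiplier enters the Order-Up-To
# target is an assumption for illustration.

LOOKUP = {
    "STRONG_INCREASE": 1.30,
    "MOD_INCREASE": 1.15,
    "NEUTRAL": 1.00,
    "MOD_DECREASE": 0.90,
    "STRONG_DECREASE": 0.80,
}

def order_quantity(label, demand_forecast, lead_time,
                   inventory_position, safety_stock):
    """Hard-coded lookup, then a generic Order-Up-To rule (illustrative)."""
    multiplier = LOOKUP[label]  # the AI's ONLY lever
    target = demand_forecast * (lead_time + 1) + multiplier * safety_stock
    return max(0.0, target - inventory_position)
```

Note what the AI cannot touch: the forecast, the lead time, and the subtraction against inventory position are all outside its control.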
Key findings
- Every AI condition produced OVAR in a 0.05-unit band, regardless of model or information level. gpt-4.1-mini: 1.737–1.771. phi4:14b: 1.726–1.780. o4-mini: 1.748–1.774. nemotron 120B: 1.734–1.775. The entire range across 12 distinct model-condition combinations is narrower than the standard deviation of many individual runs. This is not noise. It is a ceiling.
- Context doubled classification accuracy and did not move the supply chain outcome. Direction accuracy went from 0.41–0.48 in blind conditions to 0.72–0.84 with context. A roughly twofold improvement in situational awareness. Chain OVAR moved by 0.01–0.02. The accuracy gain is real. It is discarded entirely at the lookup table.
- More information made outcomes marginally worse, not better. In every model, the unstructured condition (live event headlines) produced higher OVAR than the context condition. When shown explicit disruption headlines, models over-committed to extreme labels, adding noise to the formula. gpt-4.1-mini context → unstructured: 1.737 → 1.771. phi4 context → unstructured: 1.726 → 1.780.
- Explicitly instructing the AI to default to inaction produced zero effect on OVAR. Sub-experiment E4 added a neutral-prior instruction: “Default to NEUTRAL unless the signal is strong and unambiguous.” Label distributions shifted. OVAR did not. The reason: NEUTRAL maps to multiplier 1.00, which still executes a full Order-Up-To calculation. “Do nothing” in the prompt is not “do nothing” in the formula.
- Removing world events made OVAR worse, not better. The ablation (events disabled) produced OVAR 2.08–2.12, well above the events-on results of 1.73–1.78. Disruption events create correlated demand and order spikes that partially cancel their own contribution to OVAR. A cleaner seasonal signal exposes the underlying formula amplification without that natural cancellation.
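The E4 result is mechanical once you trace NEUTRAL through the formula. A toy single-tier illustration (not the experiment's simulator; demand series, safety stock, and lead time are assumed values) shows that an all-NEUTRAL label stream still produces orders that chase demand, so its OVAR sits above 1:

```python
import numpy as np

# Why "Default to NEUTRAL" cannot move OVAR: NEUTRAL maps to 1.00, and the
# Order-Up-To target is re-computed every period regardless of label.
# All parameters below are illustrative assumptions.

rng = np.random.default_rng(0)
periods = np.arange(36)
demand = 100 + 20 * np.sin(periods / 6) + rng.normal(0, 10, 36)

def simulate_orders(multiplier, lead_time=2, safety_stock=50.0,
                    start_inventory=350.0):
    inventory, orders = start_inventory, []
    for d in demand:
        # target is recalculated each period, even with multiplier 1.00
        target = d * (lead_time + 1) + multiplier * safety_stock
        q = max(0.0, target - inventory)  # order up to the target
        orders.append(q)
        inventory += q - d
    return np.array(orders)

neutral_orders = simulate_orders(1.00)              # all-NEUTRAL stream
ovar = neutral_orders.var() / demand.var()          # OVAR = Var(orders) / Var(demand)
```

The prompt-level "do nothing" never reaches the formula: a 1.00 multiplier still executes the full calculation, and orders amplify demand anyway.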
Results
Baselines
| Policy | Chain OVAR | Stockouts / run |
|---|---|---|
| exp_smoothing | 1.185 | 89.6 |
| order_up_to (formula ceiling) | 1.767 | 87.0 |
| naive_passthrough | 0.996 | 96.0 |
order_up_to is the structural ceiling: every AI agent uses this formula internally. Landing near 1.767 means the AI’s classification is adding nothing on top of the formula alone.
All AI conditions
| Model | Condition | Chain OVAR | ±std | Dir Accuracy |
|---|---|---|---|---|
| phi4:14b | CONTEXT | 1.726 | 0.093 | 0.80 |
| nemotron 120B | BLIND | 1.734 | 0.133 | 0.47 |
| gpt-4.1-mini | CONTEXT | 1.737 | 0.088 | 0.72 |
| nemotron 120B | CONTEXT | 1.745 | 0.128 | 0.74 |
| gpt-4.1-mini | BLIND | 1.747 | 0.092 | 0.41 |
| phi4:14b | BLIND | 1.748 | 0.130 | 0.48 |
| o4-mini | CONTEXT | 1.748 | 0.105 | 0.72 |
| o4-mini | BLIND | 1.763 | 0.113 | 0.44 |
| gpt-4.1-mini | UNSTRUCTURED | 1.771 | 0.108 | 0.76 |
| o4-mini | UNSTRUCTURED | 1.774 | 0.106 | 0.79 |
| nemotron 120B | UNSTRUCTURED | 1.775 | 0.130 | 0.83 |
| phi4:14b | UNSTRUCTURED | 1.780 | 0.130 | 0.84 |
Best AI condition: phi4:14b context, OVAR 1.726 — 46% above the exp_smoothing target of 1.185.
Hypothesis verdicts
| Hypothesis | Prediction | Verdict |
|---|---|---|
| H1 | Any AI condition beats exp_smoothing (OVAR ≤ 1.185) | FAILED |
| H2 | Unstructured condition reduces OVAR vs. context condition | FAILED |
| H3 | Event signal improves direction accuracy by ≥ 10 percentage points | FAILED |
| H4 | Intent compliance ≥ 0.99 under all conditions | PASSED |
| H5 | World events materially change OVAR | PASSED |
H5 passed — but in a surprising direction. World events present → lower OVAR. World events removed → higher OVAR (2.08–2.12). Disruptions partially cancel their own variance contribution.
Discussion
The Equaliser Effect
The AI controls one variable: the safety stock multiplier, ranging from 0.80 to 1.30. Order amplification in this simulation is primarily driven by stochastic lead times and fill rates — randomness in when stock arrives and how much of an order gets fulfilled. These are physical supply chain properties, not features of the AI’s decision. Even perfect classification would still produce OVAR near 1.77, because the Order-Up-To formula’s response to arrival randomness dominates.
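That dominance can be demonstrated in a toy version of the same dynamics: fix the multiplier at 1.00 (i.e., hold classification constant) and toggle only the lead-time randomness. A sketch under assumed parameters (LogNormal lead times, simplified one-tier Order-Up-To — not the experiment's simulator):

```python
import numpy as np

# Arrival randomness alone inflates OVAR, with the AI's lever held fixed.
# Demand, lead-time distribution, and safety stock are illustrative assumptions.

rng = np.random.default_rng(1)
n = 200
demand = 100 + rng.normal(0, 10, n)

def ovar_with(lead_times):
    """Order-Up-To with the multiplier pinned at 1.00; only lead times vary."""
    inventory, orders = 350.0, []
    for d, lt in zip(demand, lead_times):
        target = d * (lt + 1) + 50.0   # formula responds to arrival randomness
        q = max(0.0, target - inventory)
        orders.append(q)
        inventory += q - d
    orders = np.array(orders)
    return orders.var() / demand.var()

fixed = ovar_with(np.full(n, 2.0))                        # deterministic arrivals
stochastic = ovar_with(rng.lognormal(np.log(2), 0.3, n))  # LogNormal lead times
```

In this sketch the deterministic case already amplifies (OVAR well above 1, the formula's own contribution), and randomizing lead times pushes it far higher — with the "AI decision" identical in both runs.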
Restricting the output to five labels successfully eliminated the calibration problem from V3b. It simultaneously eliminated any possibility of improvement, because the AI can no longer make the fine-grained adjustments that would close the gap to exponential smoothing. The discrete architecture trades runaway amplification for a hard ceiling.
One result from this experiment is easy to miss. Without context, every model defaults to STRONG_INCREASE in 75–95% of periods — even though only 25–33% of periods actually warrant an increase. This is catastrophically wrong. With a calendar and persona, models use all five labels accurately, with direction accuracy near 0.80. The discrimination is real. The control is not. An AI that correctly identifies the direction of demand, and then outputs the same order as one that is completely wrong, has been made intelligent in the wrong place.
V4 leaves open two interpretations of this ceiling: (a) a better model would classify more accurately, and more accurate classification would eventually move OVAR; or (b) the ceiling is architectural — even perfect labels cannot close the gap. V5 tested this directly by removing the LLM entirely and replacing it with oracle ground-truth labels.
Full code and results on GitHub
Full code, data, and raw results are available on GitHub.
Methodology note
All scenarios, companies, products, and supply chain structures are entirely fictional. The demand series is synthetic, calibrated to Indian automotive seasonal patterns across 36 months with injected world events (pandemic, geopolitical conflict, port strike).
Results should not be generalised to supply chain management broadly. The correct scope: in a 3-tier discrete intent-classification architecture with a 5-label set and a fixed multiplier lookup, no tested LLM configuration — from 14B to 120B parameters, across three information conditions — produced OVAR meaningfully below the deterministic Order-Up-To baseline.