Supply Chain · Intent Classification · Experiment Writeup

The Equaliser Effect: Intent Classification in Supply Chain Replenishment

Four AI models produced statistically identical supply chain outcomes, regardless of size, reasoning capability, or information level. Replacing the AI’s float output with discrete labels fixed the calibration problem, but created a new ceiling in its place.

Version 3b ended with a clear diagnosis: LLMs cannot invent precise multipliers. The proposed fix was to remove the invented number entirely — replace it with a pre-approved text label, and let a hard-coded lookup table do the arithmetic. This is V4.

The fix worked as a guardrail. OVAR dropped from the 2.3–3.1 range of V3b to a tight 1.73–1.78 across all four models and all three information conditions. No configuration caused catastrophic amplification. The fix did not work as an improvement. A 120-billion-parameter reasoning model and a lightweight fast model produced outcomes within 0.05 OVAR units of each other. Context doubled classification accuracy and barely moved the supply chain outcome. The Equaliser Effect: a discrete-label architecture creates a structural ceiling that no amount of intelligence, context, or prompt engineering can escape.

Design & configuration

Architecture: 5-label intent classification → hard-coded lookup → Order-Up-To formula. The AI’s only job: pick one text label per period per tier. Everything downstream is deterministic software.
Labels & lookup: STRONG_INCREASE → 1.30 · MOD_INCREASE → 1.15 · NEUTRAL → 1.00 · MOD_DECREASE → 0.90 · STRONG_DECREASE → 0.80
Models: gpt-4.1-mini · o4-mini (reasoning) · phi4:14b (local 14B) · nemotron-super 120B (local 120B)
Conditions: Blind (numbers only) · Context (calendar + seasonal persona) · Unstructured (context + live event headlines)
Replications: 20 runs per frontier condition · 10 runs per local condition · 100 runs per baseline
Simulation horizon: 36 months · stochastic lead times (LogNormal) · stochastic fill rates (Beta)
World events: Pandemic (months 7–12) · Geopolitical conflict (months 19–21) · Port strike (months 28–30)
Supply chain: 3-tier serial: Tatva Motors OEM → Ancillary Supplier → Component Supplier
Primary metric: Chain OVAR = Var(orders) / Var(demand). Target: exp_smoothing at OVAR 1.185.
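
The pipeline can be sketched in a few lines of Python. This is an illustrative reconstruction, not the experiment's code: the label set and multipliers come from the lookup above, while the exact Order-Up-To form (forecast × lead time plus a scaled safety-stock term) and all parameter values are assumptions.

```python
# Illustrative reconstruction of the V4 pipeline: the AI emits one of
# five labels; everything downstream is deterministic software.
# Formula shape and parameter values are assumptions, not the
# experiment's actual code.

MULTIPLIERS = {
    "STRONG_INCREASE": 1.30,
    "MOD_INCREASE": 1.15,
    "NEUTRAL": 1.00,
    "MOD_DECREASE": 0.90,
    "STRONG_DECREASE": 0.80,
}

def order_quantity(label, forecast, lead_time, safety_stock, position):
    """Order-Up-To: raise the inventory position to a target level.

    The label only scales the safety-stock term; the forecast, lead
    time, and position arithmetic are fixed software.
    """
    target = forecast * lead_time + MULTIPLIERS[label] * safety_stock
    return max(0.0, target - position)

# Same state, different labels: the orders differ only by the scaled
# safety-stock term (here 0.30 * 50 = 15 units).
strong = order_quantity("STRONG_INCREASE", 100.0, 2.0, 50.0, 150.0)
neutral = order_quantity("NEUTRAL", 100.0, 2.0, 50.0, 150.0)
```

The narrow spread between the most extreme and the neutral label is the whole control authority the AI has: a bounded nudge to one term of an otherwise fixed formula.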

What I found

  1. Every AI condition produced OVAR in a 0.05-unit band, regardless of model or information level. gpt-4.1-mini: 1.737–1.771. phi4:14b: 1.726–1.780. o4-mini: 1.748–1.774. nemotron 120B: 1.734–1.775. The entire range across all 12 model-condition combinations (four models × three conditions) is narrower than the standard deviation of many individual runs. This is not noise. It is a ceiling.
  2. Context doubled classification accuracy and did not move the supply chain outcome. Direction accuracy went from 0.41–0.48 in blind conditions to 0.72–0.84 with context. A roughly twofold improvement in situational awareness. Chain OVAR moved by 0.01–0.02. The accuracy gain is real. It is discarded entirely at the lookup table.
  3. More information made outcomes marginally worse, not better. In every model, the unstructured condition (live event headlines) produced higher OVAR than the context condition. When shown explicit disruption headlines, models over-committed to extreme labels, adding noise to the formula. gpt-4.1-mini context → unstructured: 1.737 → 1.771. phi4 context → unstructured: 1.726 → 1.780.
  4. Explicitly instructing the AI to default to inaction produced zero effect on OVAR. Sub-experiment E4 added a neutral-prior instruction: “Default to NEUTRAL unless the signal is strong and unambiguous.” Label distributions shifted. OVAR did not. The reason: NEUTRAL maps to multiplier 1.00, which still executes a full Order-Up-To calculation. “Do nothing” in the prompt is not “do nothing” in the formula.
  5. Removing world events made OVAR worse, not better. The ablation (events disabled) produced OVAR 2.08–2.12, well above the events-on results of 1.73–1.78. Disruption events create correlated demand and order spikes that partially cancel their own contribution to OVAR. A cleaner seasonal signal exposes the underlying formula amplification without that natural cancellation.
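
Finding 4 falls directly out of the arithmetic. A minimal sketch (the Order-Up-To form and all values here are assumed for illustration) of why a NEUTRAL label is not a no-op:

```python
# Why "default to NEUTRAL" does not mean "do nothing": NEUTRAL maps to
# multiplier 1.00, which still drives a full Order-Up-To replenishment.
# A literal hold would order zero. Formula shape and values are
# illustrative assumptions.

def order_up_to(multiplier, forecast, lead_time, safety_stock, position):
    target = forecast * lead_time + multiplier * safety_stock
    return max(0.0, target - position)

neutral_order = order_up_to(1.00, 100.0, 2.0, 50.0, 150.0)  # NEUTRAL label
hold_order = 0.0                                            # literal inaction
# neutral_order is a full replenishment quantity, not zero: the prompt's
# "default to inaction" shifted the label distribution, not the order.
```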

Numeric results

Baselines

Policy                          Chain OVAR    Stockouts / run
exp_smoothing                   1.185         89.6
order_up_to (formula ceiling)   1.767         87.0
naive_passthrough               0.996         96.0

order_up_to is the structural ceiling: every AI agent uses this formula internally. Landing near 1.767 means the AI’s classification is adding nothing on top of the formula alone.
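
The metric itself is a two-line computation. A minimal sketch (the series below are made up for illustration, not experiment data):

```python
import statistics

def ovar(orders, demand):
    """Order Variance Amplification Ratio: Var(orders) / Var(demand).

    OVAR > 1 means the ordering policy amplifies demand variance
    (bullwhip); OVAR ~ 1 means it passes demand through unchanged.
    """
    return statistics.variance(orders) / statistics.variance(demand)

# Made-up series for illustration: an amplified order response.
demand = [100.0, 120.0, 90.0, 110.0, 105.0, 95.0]
orders = [100.0, 140.0, 70.0, 125.0, 110.0, 85.0]
amplification = ovar(orders, demand)  # > 1: orders swing wider than demand
```

A policy that ordered exactly what was demanded would score OVAR of 1, which is why naive_passthrough lands at 0.996 in the baseline table.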

All AI conditions

Model           Condition       Chain OVAR    ±std     Dir Accuracy
phi4:14b        CONTEXT         1.726         0.093    0.80
nemotron 120B   BLIND           1.734         0.133    0.47
gpt-4.1-mini    CONTEXT         1.737         0.088    0.72
nemotron 120B   CONTEXT         1.745         0.128    0.74
gpt-4.1-mini    BLIND           1.747         0.092    0.41
phi4:14b        BLIND           1.748         0.130    0.48
o4-mini         CONTEXT         1.748         0.105    0.72
o4-mini         BLIND           1.763         0.113    0.44
gpt-4.1-mini    UNSTRUCTURED    1.771         0.108    0.76
o4-mini         UNSTRUCTURED    1.774         0.106    0.79
nemotron 120B   UNSTRUCTURED    1.775         0.130    0.83
phi4:14b        UNSTRUCTURED    1.780         0.130    0.84

Best AI condition: phi4:14b context, OVAR 1.726 — 46% above the exp_smoothing target of 1.185.

Hypothesis verdicts

Hypothesis   Prediction                                                           Verdict
H1           Any AI condition beats exp_smoothing (OVAR ≤ 1.185)                  FAILED
H2           Unstructured condition reduces OVAR vs. context condition            FAILED
H3           Event signal improves direction accuracy by ≥ 10 percentage points   FAILED
H4           Intent compliance ≥ 0.99 under all conditions                        PASSED
H5           World events materially change OVAR                                  PASSED

H5 passed — but in a surprising direction. World events present → lower OVAR. World events removed → higher OVAR (2.08–2.12). Disruptions partially cancel their own variance contribution.

The Equaliser Effect

The AI controls one variable: the safety stock multiplier, ranging from 0.80 to 1.30. Order amplification in this simulation is primarily driven by stochastic lead times and fill rates — randomness in when stock arrives and how much of an order gets fulfilled. These are physical supply chain properties, not features of the AI’s decision. Even perfect classification would still produce OVAR near 1.77, because the Order-Up-To formula’s response to arrival randomness dominates.
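
That dominance can be demonstrated with a toy single-tier loop. This is a sketch under assumed dynamics, not the experiment's simulator, and the magnitudes are not calibrated to it; the point is only that sweeping the multiplier across the entire 0.80–1.30 label range barely moves OVAR when a stochastic lead time drives the target.

```python
import random
import statistics

def simulate(multiplier, periods=400, seed=7):
    """Toy single-tier Order-Up-To loop with a stochastic lead time.

    Illustrative assumptions throughout (immediate delivery, made-up
    parameters). The multiplier shifts the target by a constant, so it
    nearly cancels out of successive orders; order-to-order variance
    comes from the lead-time term instead.
    """
    rng = random.Random(seed)
    demand, orders = [], []
    position = 300.0                               # starting inventory position
    for _ in range(periods):
        lead_time = rng.lognormvariate(0.7, 0.3)   # stochastic lead time
        target = 100.0 * lead_time + multiplier * 50.0
        order = max(0.0, target - position)
        position += order                          # simplification: arrives now
        d = max(0.0, rng.gauss(100.0, 15.0))       # stochastic demand
        position -= d
        demand.append(d)
        orders.append(order)
    return statistics.variance(orders) / statistics.variance(demand)

# Both ends of the label range land at nearly the same OVAR.
low, high = simulate(0.80), simulate(1.30)
```

In this toy, even replacing the classifier with a perfect one would change nothing: the multiplier term contributes almost no period-to-period variance, so the formula's reaction to arrival randomness sets the outcome.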

Restricting the output to five labels successfully eliminated the calibration problem from V3b. It simultaneously eliminated any possibility of improvement, because the AI can no longer make the fine-grained adjustments that would close the gap to exponential smoothing. The discrete architecture trades runaway amplification for a hard ceiling.

One result from this experiment is easy to miss. In the blind condition, every model classifies STRONG_INCREASE in 75–95% of periods — even though only 25–33% of periods actually warrant an increase. This is catastrophically wrong. With a calendar and persona, models use all five labels accurately, with direction accuracy near 0.80. The discrimination is real. The control is not. An AI that correctly identifies the direction of demand, then outputs the same order as one that is completely wrong, has been made intelligent in the wrong place.

V4 leaves open two interpretations of this ceiling: (a) a better model would classify more accurately, and more accurate classification would eventually move OVAR; or (b) the ceiling is architectural — even perfect labels cannot close the gap. V5 tested this directly by removing the LLM entirely and replacing it with oracle ground-truth labels.

Full code and results on GitHub

Full code, data, and raw results are available on GitHub.

Methodology note

All scenarios, companies, products, and supply chain structures are entirely fictional. The demand series is synthetic, calibrated to Indian automotive seasonal patterns across 36 months with injected world events (pandemic, geopolitical conflict, port strike).

Results should not be generalised to supply chain management broadly. The correct scope: in a 3-tier discrete intent-classification architecture with a 5-label set and a fixed multiplier lookup, no tested LLM configuration — from 14B to 120B parameters, across three information conditions — produced OVAR meaningfully below the deterministic Order-Up-To baseline.
