Supply Chain · Experiment Writeup

LLM Agents Against Heuristic Baselines in Supply Chain Replenishment

Every heuristic outperformed every LLM configuration on both order variability and stockouts. Exponential smoothing beat the best LLM by 8x on both metrics simultaneously. All seven hypotheses were rejected.

What this experiment explored

Agentic Bullwhip Effect Version 2 asks a harder question than its predecessor, Agentic Bullwhip Effect Version 1: not which AI configuration performs best, but whether any LLM configuration outperforms a simple rule-based heuristic at all. Four models spanning lightweight and reasoning tiers, frontier and local backends, were tested against three deterministic heuristic baselines across 20 independent runs per condition.

Every heuristic outperformed every LLM on both order variance and stockouts simultaneously. This is not a tradeoff result.

Design & configuration

Models: gpt-4.1-mini (frontier lightweight) · o4-mini (frontier reasoning) · phi4:14b (local lightweight) · gpt-oss:120b (local reasoning)
Design: 2×2 factorial (model tier × context treatment) across two backends, frontier (Azure) and local (Ollama), for 8 backend-specific model-condition cells in total
Replications: 20 per LLM configuration · 1 per heuristic (deterministic)
Primary metrics: OVAR (Order Variance Amplification Ratio) = Var(orders) / Var(demand). Stockout count. Both always reported together. MPRD threshold: |ΔOVAR| ≥ 0.5 required for a practically meaningful claim.
Heuristic baselines: Exponential smoothing (α=0.30) · Naive passthrough · Order-up-to with fixed safety stock
Supply chain: 3-tier serial: Tatva Motors (OEM) → Lighting Manufacturer (Ancillary) → LED Component Manufacturer
Demand series: 25 months (Jan 2025 to Jan 2027) · single SKU · two full Indian festive cycles
Lead time: 1 month, deterministic, at all tiers
Initial inventory: 43,609 units (mean + 1.65σ, ~95% service level)
LLM calls: 11,520 total (4 conditions × 20 runs × 24 periods × 3 tiers × 2 backends)
Agent design: Stateless, with no memory between periods. This is deliberate: most real agentic deployments are stateless.
Blind condition: Numbers only; no tier persona, no calendar month.
Context condition: Tier persona and calendar month, with the same numeric state variables as the blind condition.
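
The two quantities in the configuration above, OVAR and the initial-inventory rule, are simple to compute. A minimal sketch (the demand and order numbers here are hypothetical, and the function names are my own, not the experiment's code):

```python
import statistics

def ovar(orders, demand):
    """Order Variance Amplification Ratio: Var(orders) / Var(demand)."""
    return statistics.pvariance(orders) / statistics.pvariance(demand)

def initial_inventory(demand_history, z=1.65):
    """Starting stock at ~95% service level: mean + 1.65 * sigma."""
    mu = statistics.mean(demand_history)
    sigma = statistics.pstdev(demand_history)
    return mu + z * sigma

demand = [100, 120, 90, 150, 110]   # hypothetical monthly demand
orders = [100, 140, 60, 200, 95]    # hypothetical upstream orders
print(round(ovar(orders, demand), 2))  # > 1.0 means the tier amplified variance
```

An OVAR above 1.0 means a tier amplified demand variance; the chain metric averages this across tiers.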

What I found

  1. Every heuristic outperformed every LLM on both OVAR and stockouts simultaneously. Not a tradeoff. LLMs were strictly dominated on both primary metrics in every configuration tested.
  2. The gap is not marginal. Exponential smoothing: chain OVAR 0.54, 5 stockouts. Best LLM (local phi4:14b, blind): OVAR 4.33, 41 stockouts. 8x worse on both dimensions at once.
  3. Context had opposite effects by model and backend. For frontier gpt-4.1-mini, adding business context reduced OVAR marginally (4.70 to 4.47, delta 0.23, below the MPRD threshold). For local phi4:14b, the same context was dramatically worse: chain OVAR jumped from 4.33 to 6.35, with the Ancillary tier hitting 10.82 ± 8.14 across 20 runs. The standard deviation of 8.14 indicates instability, not a consistent directional effect.
  4. Reasoning models showed no ordering advantage. gpt-oss:120b produced results indistinguishable from gpt-4.1-mini in blind conditions. o4-mini generated over 1 million reasoning tokens and produced no measurable improvement on either metric. All seven hypotheses were rejected.

Numeric results

Heuristic baselines

| Heuristic | Chain OVAR | Stockouts (of 75 possible) |
| --- | --- | --- |
| Exponential smoothing | 0.54 | 5 |
| Naive passthrough | 1.00 | 3 |
| Order-up-to | 1.71 | 14 |
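
The three baselines are simple enough to state in a few lines each. This is a sketch of my reading of the setup, not the repository's exact code; parameter names are illustrative:

```python
def exp_smoothing_order(forecast, demand, alpha=0.30):
    """Update the demand forecast and order the forecast quantity."""
    new_forecast = alpha * demand + (1 - alpha) * forecast
    return new_forecast, new_forecast  # (updated forecast, order quantity)

def naive_passthrough_order(demand):
    """Order exactly what was demanded this period (OVAR = 1 by construction)."""
    return demand

def order_up_to(inventory_position, demand_mean, demand_std, lead_time=1, z=1.65):
    """Order up to a fixed base-stock level: lead-time demand plus safety stock."""
    base_stock = demand_mean * (lead_time + 1) + z * demand_std
    return max(0.0, base_stock - inventory_position)
```

The exponential-smoothing policy damps variance because each period's order moves only a fraction α toward the latest demand observation, which is exactly what the LLM agents failed to do.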

Chain-average OVAR by LLM configuration

| Condition | Backend | Chain OVAR (mean ± std) | Stockouts (mean ± std) |
| --- | --- | --- | --- |
| exp_smoothing | HEURISTIC | 0.54 | 5 |
| naive_passthrough | HEURISTIC | 1.00 | 3 |
| order_up_to | HEURISTIC | 1.71 | 14 |
| L-Blind | FRONTIER | 4.70 ± 0.14 | 40.5 ± 0.83 |
| L-Context | FRONTIER | 4.47 ± 0.07 | 39.0 ± 0.83 |
| L-Blind | LOCAL | 4.33 ± 0.00 | 41.0 ± 0.00 |
| L-Context | LOCAL | 6.35 ± 2.53 | 37.2 ± 3.11 |
| R-Blind | FRONTIER | 4.72 ± 1.12 | 42.9 ± 3.85 |
| R-Context | FRONTIER | 4.52 ± 0.08 | 40.1 ± 0.85 |
| R-Blind | LOCAL | 4.52 ± 0.00 | 40.0 ± 0.00 |
| R-Context | LOCAL | 4.52 ± 0.05 | 39.6 ± 0.76 |

L = Lightweight (gpt-4.1-mini / phi4:14b) · R = Reasoning (o4-mini / gpt-oss:120b)

OVAR by tier

| Condition | Backend | OEM | Ancillary | Component |
| --- | --- | --- | --- | --- |
| exp_smoothing | HEURISTIC | 0.41 | 0.65 | 0.58 |
| L-Blind | FRONTIER | 4.21 | 6.64 | 3.25 |
| L-Context | FRONTIER | 4.12 | 6.01 | 3.30 |
| L-Blind | LOCAL | 3.71 | 5.89 | 3.40 |
| L-Context | LOCAL | 4.62 | 10.82 | 3.61 |
| R-Blind | FRONTIER | 5.94 | 5.18 | 3.05 |
| R-Context | FRONTIER | 4.13 | 5.99 | 3.45 |
| R-Blind | LOCAL | 4.13 | 5.98 | 3.45 |
| R-Context | LOCAL | 4.13 | 6.01 | 3.43 |

Hypothesis verdicts

| Hypothesis | Prediction vs actual | Verdict |
| --- | --- | --- |
| H1 | At least one LLM achieves lower OVAR than exp smoothing (0.54) with ≤5 stockouts; best LLM: OVAR 4.33, 41 stockouts | REJECTED |
| H2 | context_lightweight OVAR < blind_lightweight by ≥0.5; actual Δ = 0.23, below MPRD | REJECTED |
| H3 | context_reasoning OVAR < blind_reasoning by ≥0.5; actual Δ = 0.20, below MPRD | REJECTED |
| H4 | blind_reasoning OVAR < blind_lightweight by ≥0.5; actual Δ = −0.02, opposite direction | REJECTED |
| H5 | context_reasoning OVAR < context_lightweight by ≥0.5; actual Δ = −0.05, opposite direction | REJECTED |
| H6 | Context benefit larger for reasoning tier than lightweight; actual: −0.03, opposite direction | REJECTED |
| H7 | Local context_lightweight within ±0.5 of frontier context_lightweight; actual Δ = 1.88, well outside equivalence bounds | REJECTED |

Why the gap is structural

The bullwhip failure is structural. Each agent sees only the current period and has no memory of what it ordered previously: a stateless agent that over-ordered last period arrives at the next period without knowing it did. There is no causal chain linking past decisions to present observations, and therefore no self-correction mechanism. Combined with the fact that LLMs generate plausible text rather than numerically disciplined decisions, the result is an agent that picks a number that sounds reasonable rather than one that dampens variance.
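
Schematically, the stateless loop looks like this (a sketch at pseudocode level; `call_llm` is a placeholder for the actual prompt-and-parse call, and the inventory accounting is simplified to one SKU with a 1-month lead time):

```python
def run_stateless_episode(demand_series, initial_inventory, call_llm):
    """Each period the agent sees only current numbers -- never its own order history."""
    inventory, pipeline, stockouts, orders = initial_inventory, 0, 0, []
    for demand in demand_series:
        inventory += pipeline                      # last period's order arrives (1-month lead)
        if demand > inventory:
            stockouts += 1                         # unmet demand this period
        inventory = max(0, inventory - demand)
        # The prompt contains ONLY the current state; nothing the agent ordered before.
        order = call_llm(inventory=inventory, demand=demand)
        pipeline = order
        orders.append(order)
    return orders, stockouts
```

Nothing in the loop feeds an agent's past orders back into its next prompt, so an over-order in period t is invisible to the same agent in period t+1.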

The phi4:14b context result is worth noting separately. The standard deviation of 8.14 on Ancillary-tier OVAR across 20 runs indicates that the business context prompt did not consistently shape ordering behaviour; it introduced variance. In some runs it may have triggered aggressive anticipatory ordering, in others conservative responses. The blind model failed consistently and identically across all 20 runs. Consistent failure can be diagnosed and compensated for. Intermittent instability, where the same model with the same prompt produces order-of-magnitude different outcomes across runs, is harder to anticipate and mitigate in a real deployment.

What this means for practitioners

Do not replace your ordering formula with an LLM. The formula was built for this task and it will do it better. This was not a close result: exponential smoothing, a method from the 1950s, produced orders eight times less variable than the best LLM configuration, with fewer stockouts at the same time.

Where LLMs might add value is earlier in the process: reading demand signals, spotting something unusual in the data, providing context to a planner. That is a different task from executing the order quantity decision, and it was not tested here. But using an LLM to inform a formula, rather than replace it, is a more plausible role than the one tested in this experiment.
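
One hedged sketch of that division of labour: an LLM-derived signal nudges the forecast input, while the formula still sets the order quantity. Everything here is hypothetical; nothing like this was tested in the experiment:

```python
def hybrid_order(forecast, demand, llm_signal, alpha=0.30, cap=0.10):
    """The formula owns the order. An LLM-derived adjustment (e.g. 'festive
    season ahead' -> +0.10) may nudge the forecast, clamped to +/-cap so the
    formula's variance-damping behaviour is preserved."""
    adjustment = max(-cap, min(cap, llm_signal))
    new_forecast = alpha * demand + (1 - alpha) * forecast
    return new_forecast * (1 + adjustment)
```

The clamp is the point of the design: even a wildly wrong LLM signal can move the order by at most ±10%, so the failure modes observed in this experiment stay bounded.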

The phi4:14b instability matters most in deployment terms. When given business context, that model did not fail predictably: it produced reasonable results in most runs and extreme results in a few, with nothing to distinguish the two from the outside. A model that fails consistently is easy to manage: you remove it and move on. A model that is mostly fine is harder to catch.

Full code and results on GitHub

Full code, data, and raw results are available on GitHub.

Methodology note

All scenarios, companies, products, and supply chain structures are entirely fictional. The experiment was intentionally narrow: single product, fixed lead times, stateless agents, no unstructured context. Results should not be generalised to supply chain management broadly. The correct scope: LLM agents do not outperform simple blind heuristics in a stylised single-product replenishment task with fixed lead times and no unstructured context.
