Supply Chain · Hybrid Architecture · Experiment Writeup

Hybrid AI Safety Stock Control in Supply Chain Replenishment

The AI understood the season. It failed at the math. A 1950s formula outperformed three modern AI models across every condition tested.

Three AI models were given direct control over the safety stock multiplier in a hybrid supply chain architecture. A mathematical formula handled the order quantity; the AI’s only job was to scale the buffer. All four hypotheses failed. Every AI condition produced higher order variance than the mathematical baseline, and every AI condition exceeded the variance of the fixed-multiplier control with no AI at all.

Context made things worse for two out of three models. Memory caused the advanced reasoning model to catastrophically over-correct. The AI understood the direction of the season. It could not calibrate the magnitude of the response.

What this experiment explored

Prior experiments in this series established that stateless LLM agents cannot outperform simple heuristics when making direct ordering decisions. This experiment tested a different architecture: instead of asking the AI to invent the exact order quantity, a mathematical formula handles that. The AI acts as a planner whose only job is to adjust the Safety Stock Multiplier.

If the formula says 10,000 units and the AI recognises that Diwali is next month, it outputs a multiplier of 1.2×, meaning order 20% extra buffer. The formula executes that quantity. The AI never invents the base number; it only scales it. The hypothesis: confining AI to a single scalar output would exploit its qualitative context-reading while limiting its quantitative weaknesses. The data does not support this.
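The division of labour can be sketched in a few lines. This is an illustrative sketch, assuming a simple order-up-to formula; the function names and the base-stock logic are assumptions for exposition, not the experiment's exact code:

```python
# Sketch of the hybrid control loop: the formula computes the order,
# the AI contributes only one scalar that scales the safety buffer.
# (Illustrative assumption; not the experiment's exact implementation.)

def base_order(forecast: float, pipeline: float, safety_stock: float) -> float:
    """Formula side: order up to forecast + safety stock, net of stock already in the pipeline."""
    return max(forecast + safety_stock - pipeline, 0.0)

def hybrid_order(forecast: float, pipeline: float,
                 safety_stock: float, ai_multiplier: float) -> float:
    """AI side: the model never sets the base quantity, it only scales the buffer."""
    return base_order(forecast, pipeline, safety_stock * ai_multiplier)

# Formula alone (multiplier fixed at 1.0) vs. the AI scaling the buffer by 1.2x
print(hybrid_order(10_000, 2_000, 1_500, 1.0))         # 9500.0
print(round(hybrid_order(10_000, 2_000, 1_500, 1.2)))  # 9800
```

The key property is that a badly chosen multiplier can still distort every order, which is exactly the failure mode the results below document.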

Design & configuration

Architecture: Hybrid. Mathematical formula executes orders; AI adjusts safety stock multiplier only. The AI never sets the base order quantity directly.
Models: gpt-4.1-mini (frontier lightweight) · o4-mini (frontier reasoning) · nemotron-super-3:120b (local 120B)
Conditions: H1 Blind (numbers only) · H2 Context (calendar + Indian seasonal flags) · H3 Stateful (context + 3-month order history)
Replications: 20 per AI condition · 1 per heuristic (deterministic)
Primary metrics: OVAR (Order Variance Amplification Ratio) = Var(orders) / Var(demand) · Stockout count. Both always reported together.
Baselines: exp_smoothing (α=0.30) · hybrid_control (fixed 1.0 multiplier, formula only, no AI)
Supply chain: 3-tier serial: OEM → Ancillary Supplier → Component Supplier
Demand series: 25 months · Indian automotive seasonal patterns (monsoon slump, Diwali peak, year-end surge)
Lead time: 1 month deterministic at all tiers
LLM calls: 12,960 total (3 models × 3 conditions × 20 runs × 24 periods × 3 tiers)
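The OVAR metric defined above is straightforward to compute. A minimal sketch, using made-up demand and order series rather than the experiment's data:

```python
import numpy as np

def ovar(orders, demand):
    """Order Variance Amplification Ratio: Var(orders) / Var(demand).
    Above 1.0 the ordering policy amplifies demand variability (bullwhip);
    below 1.0 it smooths it."""
    return np.var(orders) / np.var(demand)

# Illustrative series (not the experiment's data)
demand = [100, 120, 80, 150, 110]
smooth = [105, 108, 104, 112, 109]   # a damped policy
jumpy  = [90, 150, 40, 200, 95]      # an over-reacting policy

print(ovar(smooth, demand))  # < 1: smoothing
print(ovar(jumpy, demand))   # > 1: amplification
```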

What I found

  1. Every AI condition produced higher OVAR than the mathematical baseline. exp_smoothing achieved chain OVAR 0.5446 with 5.0 stockouts. The best AI result (gpt-4.1-mini, Blind) was 2.3325 with 10.6 stockouts. The worst (o4-mini, Stateful) reached 3.1211. Not a marginal gap.
  2. The AI’s active adjustments were worse than doing nothing. The hybrid_control (fixed 1.0 multiplier, no AI) produced chain OVAR 1.7097. Every single AI condition exceeded this. The models’ continuous adjustments degraded the base formula rather than improving it.
  3. Context was a penalty, not a benefit. For nemotron and gpt-4.1-mini, introducing a seasonal calendar increased order variance. They treated more information as a reason to hold more stock rather than a reason to be precise. Giving them a calendar made them panic, not plan.
  4. Memory caused the advanced reasoning model to collapse. o4-mini in the Stateful condition produced OVAR 3.1211, the highest in the entire experiment. It anchored on past stockouts and violently over-corrected each subsequent period. This is the bullwhip effect, running inside the model’s reasoning loop rather than across supply chain tiers.

Numeric results

Mathematical baselines

Condition · Chain OVAR · Chain Stockouts · Mean On-Hand
exp_smoothing · 0.5446 · 5.0 · 4,769
hybrid_control (fixed 1.0×, no AI) · 1.7097 · 14.0 · 5,142

hybrid_control isolates the AI’s specific contribution: the same formula, the same execution path, multiplier permanently fixed at 1.0. Any AI condition scoring above 1.7097 on OVAR shows that the AI’s active adjustments are degrading the base formula rather than adding to it.
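The exp_smoothing baseline uses standard simple exponential smoothing with α = 0.30. A minimal sketch of its core recurrence; the order-up-to logic that sits around this forecast in the experiment is not shown here:

```python
def exp_smoothing_forecast(demand_history, alpha=0.30):
    """Simple exponential smoothing: each observation shifts the level by a
    fraction alpha. With alpha = 0.30 the forecast reacts gradually, which is
    what keeps order variance low relative to demand variance."""
    level = demand_history[0]
    for d in demand_history[1:]:
        level = alpha * d + (1 - alpha) * level
    return level

# Illustrative demand history (not the experiment's series)
print(round(exp_smoothing_forecast([100, 140, 90, 130]), 2))  # 112.78
```

A low α is a built-in dampener: no single month, however dramatic, can move the forecast by more than 30% of its surprise. The AI conditions had no such structural limit on how hard they could react.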

AI hybrid conditions

Model · Condition · Chain OVAR · ±std · Stockouts · Mult Mean
nemotron-super-3:120b · H1 BLIND · 2.4178 · 0.2814 · 12.2 · 1.2249
nemotron-super-3:120b · H2 CONTEXT · 2.7629 · 0.2319 · 12.3 · 1.3489
nemotron-super-3:120b · H3 STATEFUL · 2.6846 · 0.2413 · 9.6 · 1.3671
gpt-4.1-mini · H1 BLIND · 2.3325 · 0.1108 · 10.6 · 1.1298
gpt-4.1-mini · H2 CONTEXT · 2.9763 · 0.0958 · 11.0 · 1.3103
gpt-4.1-mini · H3 STATEFUL · 2.7226 · 0.1512 · 11.6 · 1.4291
o4-mini · H1 BLIND · 2.5232 · 0.2791 · 8.9 · 1.4808
o4-mini · H2 CONTEXT · 2.4395 · 0.1616 · 11.7 · 1.2447
o4-mini · H3 STATEFUL · 3.1211 · 0.1320 · 10.7 · 1.3488

Lower OVAR and lower stockouts are better. Mult Mean = average safety stock multiplier chosen across all runs and periods for that condition.

Hypothesis verdicts

Hypothesis · Prediction · Outcome · Verdict
H1 · At least one AI condition beats exp_smoothing on OVAR and stockouts simultaneously · Best AI OVAR 2.3325 vs. 0.5446 baseline · REJECTED
H2 · Context (H2) improves OVAR over Blind (H1) for at least two models · Context worsened OVAR for nemotron and gpt-4.1-mini · REJECTED
H3 · Stateful (H3) improves OVAR over Context (H2) for at least two models · Memory drove o4-mini to the worst OVAR in the experiment (3.1211) · REJECTED
H4 · AI correctly identifies the direction of seasonal demand at least 50% of the time (MPS ≥ 0.50) · Best observed MPS 0.3977 · REJECTED

Semantic alignment vs. operational control

The data reveals a structural disconnect between two capabilities that look related but are not. Semantic alignment means the AI understands the concept of what is happening. Reading the text outputs confirmed that all three models correctly recognised when a busy season was approaching and deduced that more buffer stock was needed. Operational control means the AI can choose the exact mathematical number required to stabilise the system. This is where every model failed.

An AI that correctly identifies “December needs more buffer” and then outputs 1.45× when 1.05× would have sufficed is not an operational controller. It is a semantic reasoner in the wrong job. Directional capability exists. Numerical calibration does not. The hybrid architecture exposed the gap between them cleanly.

The over-buffering bias was consistent and measurable. With no AI and a fixed 1.0 multiplier, chain OVAR is 1.7097. Every AI condition exceeded this. The models averaged multipliers between 1.13 and 1.48 across all runs and conditions, indicating a systemic tendency toward caution that, paradoxically, produced worse outcomes than doing nothing. When in doubt, the models added buffer. Adding buffer destabilised the chain.

The o4-mini Stateful result deserves attention on its own terms. At OVAR 3.1211 it was the worst performing configuration in the experiment. The internal thinking logs showed the mechanism: the model anchored heavily on recent negative signals. A minor backlog from two months prior prompted a massive over-order the following month, which then appeared in the next period’s history as excess inventory, triggering a different over-correction. This feedback loop is the bullwhip effect running not across supply chain tiers but inside the model’s reasoning process.
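The mechanism is easy to reproduce in miniature. A toy simulation, built on the assumption of a controller that reacts to the last period's inventory error with too much gain (this is an illustration of the dynamic described above, not a reconstruction of o4-mini's actual reasoning):

```python
# Toy model of a stateful controller that over-reacts to its own history:
# each correction creates the opposite error next period, amplified, so the
# chosen multiplier oscillates ever wider. (Illustrative assumption.)

def overcorrecting_multiplier(error_history, gain=2.0):
    """Scale the buffer up after a shortage, down after excess, with excessive gain."""
    return 1.0 + gain * error_history[-1]

inventory_error = [-0.05]   # start from a minor backlog (5% short)
multipliers = []
for _ in range(4):
    multipliers.append(round(overcorrecting_multiplier(inventory_error), 2))
    # over-reaction this period produces the opposite error next period, amplified
    inventory_error.append(-inventory_error[-1] * 1.5)

print(multipliers)  # swings around 1.0 grow each period
```

With gain above the stability threshold, the deviation from 1.0 grows every period: a bullwhip oscillation driven entirely by one agent's memory of its own mistakes.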

What this means for system design

The hybrid architecture was a reasonable hypothesis. Restricting the AI to a single scalar output while the formula handled execution seemed like a way to leverage AI’s qualitative strengths while limiting its quantitative weaknesses. The data shows this is not sufficient. A continuous multiplier that the AI invents still requires numerical precision that the models cannot reliably provide.

The next logical step is to restrict the output further: instead of a free-form multiplier, the AI selects from a small set of pre-defined text labels (STRONG_INCREASE, MODERATE_INCREASE, NEUTRAL, MODERATE_DECREASE, STRONG_DECREASE). A hard-coded translation layer maps each label to a fixed, pre-approved multiplier value. The AI’s task becomes classification, not numerical calibration. Whether this architecture works is the subject of the next experiment.
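A minimal sketch of that translation layer. The label names come from the text above; the multiplier values assigned to them here are illustrative assumptions, since the next experiment would have to choose them:

```python
# Hard-coded translation layer: the AI classifies, a fixed table calibrates.
# Multiplier values are illustrative assumptions, not pre-approved figures.
LABEL_TO_MULTIPLIER = {
    "STRONG_INCREASE":   1.30,
    "MODERATE_INCREASE": 1.15,
    "NEUTRAL":           1.00,
    "MODERATE_DECREASE": 0.90,
    "STRONG_DECREASE":   0.80,
}

def translate(ai_label: str) -> float:
    """Any unrecognised output falls back to NEUTRAL, so a malformed AI
    response can never inject an arbitrary number into the formula."""
    return LABEL_TO_MULTIPLIER.get(ai_label.strip().upper(), 1.00)

print(translate("moderate_increase"))  # 1.15
print(translate("order 5x extra!!"))   # 1.0 (fallback to NEUTRAL)
```

The design point is that the numeric range is now closed: the AI can be wrong about which label applies, but it can no longer invent a 1.45× buffer when 1.05× would do.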

The empirical multiplier range observed here (1.13–1.48 mean across all models and conditions) provides a concrete basis for designing guardrails. Any production deployment of a multiplier-based AI planner should enforce hard limits derived from empirical runs of this kind, preventing the model from accessing the region of the multiplier space where it reliably causes damage regardless of what it reasons.
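Such a guardrail amounts to a one-line clamp. The bounds below are illustrative placeholders; a production deployment would derive them from its own empirical runs:

```python
def clamp_multiplier(m: float, lo: float = 0.8, hi: float = 1.2) -> float:
    """Hard guardrail: clip the AI-proposed multiplier to a pre-approved band.
    Bounds are illustrative; in this experiment the models' 1.13-1.48 mean
    range sat largely above the region where the fixed 1.0x control already
    performed better."""
    return min(max(m, lo), hi)

print(clamp_multiplier(1.45))  # 1.2  (an over-buffer gets capped)
print(clamp_multiplier(1.05))  # 1.05 (modest adjustments pass through)
```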

Full code and results on GitHub

Full code, data, figures, and raw results are available on GitHub.  View on GitHub →

Methodology note

All scenarios, companies, products, and supply chain structures are entirely fictional. The demand series is synthetic, calibrated to published Indian automotive seasonal patterns. The experiment was intentionally narrow: single product, fixed lead times, stylised hybrid control loop, three models across three information conditions.

Results should not be generalised to supply chain management broadly. The correct scope: in a stylised hybrid architecture where AI controls a continuous safety stock multiplier and a formula executes the order quantity, no tested LLM configuration outperformed either the mathematical baseline or a fixed-multiplier control in any of the three information conditions.
