Supply Chain · Hybrid Architecture · Experiment Writeup

Hybrid AI Safety Stock Control in Supply Chain Replenishment

The AI understood the season. It failed at the math. A 1950s formula outperformed three modern AI models across every condition tested.

Three AI models were given direct control over the safety stock multiplier in a hybrid supply chain architecture. A mathematical formula handled the order quantity; the AI’s only job was to scale the buffer. All four hypotheses failed. Every AI condition produced higher order variance than the mathematical baseline, and every AI condition exceeded the variance of the fixed-multiplier control with no AI at all.

Context made things worse for two out of three models. Memory caused the advanced reasoning model to catastrophically over-correct. The AI understood the direction of the season. It could not calibrate the magnitude of the response.

What this experiment explored

Prior experiments in this series established that stateless LLM agents cannot outperform simple heuristics when making direct ordering decisions. This experiment tested a different architecture: instead of asking the AI to invent the exact order quantity, a mathematical formula handles that. The AI acts as a planner whose only job is to adjust the Safety Stock Multiplier.

If the formula says 10,000 units and the AI recognises that Diwali is next month, it outputs a multiplier of 1.2×, meaning order 20% extra buffer. The formula executes that quantity. The AI never invents the base number; it only scales it. The hypothesis: confining AI to a single scalar output would exploit its qualitative context-reading while limiting its quantitative weaknesses. The data does not support this.
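The division of labour can be sketched in a few lines. This is an illustrative sketch, assuming a simple order-up-to formula; the function names and the base-stock logic are assumptions for exposition, not the experiment's exact code:

```python
# Sketch of the hybrid control loop: the formula computes the order,
# the AI contributes only one scalar that scales the safety buffer.
# (Illustrative assumption; not the experiment's exact implementation.)

def base_order(forecast: float, pipeline: float, safety_stock: float) -> float:
    """Formula side: order up to forecast + safety stock, net of stock already in the pipeline."""
    return max(forecast + safety_stock - pipeline, 0.0)

def hybrid_order(forecast: float, pipeline: float,
                 safety_stock: float, ai_multiplier: float) -> float:
    """AI side: the model never sets the base quantity, it only scales the buffer."""
    return base_order(forecast, pipeline, safety_stock * ai_multiplier)

# Formula alone (multiplier fixed at 1.0) vs. the AI scaling the buffer by 1.2x
print(hybrid_order(10_000, 2_000, 1_500, 1.0))         # 9500.0
print(round(hybrid_order(10_000, 2_000, 1_500, 1.2)))  # 9800
```

The key property is that a badly chosen multiplier can still distort every order, which is exactly the failure mode the results below document.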

Design & configuration

Architecture: Hybrid. Mathematical formula executes orders; AI adjusts safety stock multiplier only. The AI never sets the base order quantity directly.
Models: gpt-4.1-mini (frontier lightweight) · o4-mini (frontier reasoning) · nemotron-super-3:120b (local 120B)
Conditions: H1 Blind (numbers only) · H2 Context (calendar + Indian seasonal flags) · H3 Stateful (context + 3-month order history)
Replications: 20 per AI condition · 1 per heuristic (deterministic)
Primary metrics: OVAR (Order Variance Amplification Ratio) = Var(orders) / Var(demand) · Stockout count. Both always reported together.
Baselines: exp_smoothing (α=0.30) · hybrid_control (fixed 1.0 multiplier, formula only, no AI)
Supply chain: 3-tier serial: OEM → Ancillary Supplier → Component Supplier
Demand series: 25 months · Indian automotive seasonal patterns (monsoon slump, Diwali peak, year-end surge)
Lead time: 1 month deterministic at all tiers
LLM calls: 12,960 total (3 models × 3 conditions × 20 runs × 24 periods × 3 tiers)
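The OVAR metric defined above is straightforward to compute. A minimal sketch, using made-up demand and order series rather than the experiment's data:

```python
import numpy as np

def ovar(orders, demand):
    """Order Variance Amplification Ratio: Var(orders) / Var(demand).
    Above 1.0 the ordering policy amplifies demand variability (bullwhip);
    below 1.0 it smooths it."""
    return np.var(orders) / np.var(demand)

# Illustrative series (not the experiment's data)
demand = [100, 120, 80, 150, 110]
smooth = [105, 108, 104, 112, 109]   # a damped policy
jumpy  = [90, 150, 40, 200, 95]      # an over-reacting policy

print(ovar(smooth, demand))  # < 1: smoothing
print(ovar(jumpy, demand))   # > 1: amplification
```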

What I found

  1. Every AI condition produced higher OVAR than the mathematical baseline. exp_smoothing achieved chain OVAR 0.5446 with 5.0 stockouts. The best AI result (gpt-4.1-mini, Blind) was 2.3325 with 10.6 stockouts. The worst (o4-mini, Stateful) reached 3.1211. Not a marginal gap.
  2. The AI’s active adjustments were worse than doing nothing. The hybrid_control (fixed 1.0 multiplier, no AI) produced chain OVAR 1.7097. Every single AI condition exceeded this. The models’ continuous adjustments degraded the base formula rather than improving it.
  3. Context was a penalty, not a benefit. For nemotron and gpt-4.1-mini, introducing a seasonal calendar increased order variance. They treated more information as a reason to hold more stock rather than a reason to be precise. Giving them a calendar made them panic, not plan.
  4. Memory caused the advanced reasoning model to collapse. o4-mini in the Stateful condition produced OVAR 3.1211, the highest in the entire experiment. It anchored on past stockouts and violently over-corrected each subsequent period. This is the bullwhip effect, running inside the model’s reasoning loop rather than across supply chain tiers.

Numeric results

Mathematical baselines

Condition · Chain OVAR · Chain Stockouts · Mean On-Hand
exp_smoothing · 0.5446 · 5.0 · 4,769
hybrid_control (fixed 1.0×, no AI) · 1.7097 · 14.0 · 5,142

hybrid_control isolates the AI’s specific contribution: the same formula, the same execution path, multiplier permanently fixed at 1.0. Any AI condition scoring above 1.7097 on OVAR shows that the AI’s active adjustments are degrading the base formula rather than adding to it.
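The exp_smoothing baseline uses standard simple exponential smoothing with α = 0.30. A minimal sketch of its core recurrence; the order-up-to logic that sits around this forecast in the experiment is not shown here:

```python
def exp_smoothing_forecast(demand_history, alpha=0.30):
    """Simple exponential smoothing: each observation shifts the level by a
    fraction alpha. With alpha = 0.30 the forecast reacts gradually, which is
    what keeps order variance low relative to demand variance."""
    level = demand_history[0]
    for d in demand_history[1:]:
        level = alpha * d + (1 - alpha) * level
    return level

# Illustrative demand history (not the experiment's series)
print(round(exp_smoothing_forecast([100, 140, 90, 130]), 2))  # 112.78
```

A low α is a built-in dampener: no single month, however dramatic, can move the forecast by more than 30% of its surprise. The AI conditions had no such structural limit on how hard they could react.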

AI hybrid conditions

Model · Condition · Chain OVAR · ±std · Stockouts · Mult Mean
nemotron-super-3:120b · H1 BLIND · 2.4178 · 0.2814 · 12.2 · 1.2249
nemotron-super-3:120b · H2 CONTEXT · 2.7629 · 0.2319 · 12.3 · 1.3489
nemotron-super-3:120b · H3 STATEFUL · 2.6846 · 0.2413 · 9.6 · 1.3671
gpt-4.1-mini · H1 BLIND · 2.3325 · 0.1108 · 10.6 · 1.1298
gpt-4.1-mini · H2 CONTEXT · 2.9763 · 0.0958 · 11.0 · 1.3103
gpt-4.1-mini · H3 STATEFUL · 2.7226 · 0.1512 · 11.6 · 1.4291
o4-mini · H1 BLIND · 2.5232 · 0.2791 · 8.9 · 1.4808
o4-mini · H2 CONTEXT · 2.4395 · 0.1616 · 11.7 · 1.2447
o4-mini · H3 STATEFUL · 3.1211 · 0.1320 · 10.7 · 1.3488

Lower OVAR and lower stockouts are better. Mult Mean = average safety stock multiplier chosen across all runs and periods for that condition.

Hypothesis verdicts

Hypothesis · Prediction · Outcome · Verdict
H1 · At least one AI condition beats exp_smoothing on OVAR and stockouts simultaneously · Best AI OVAR 2.3325 vs. 0.5446 baseline · REJECTED
H2 · Context (H2) improves OVAR over Blind (H1) for at least two models · Context worsened OVAR for nemotron and gpt-4.1-mini · REJECTED
H3 · Stateful (H3) improves OVAR over Context (H2) for at least two models · Memory drove o4-mini to the worst OVAR in the experiment (3.1211) · REJECTED
H4 · AI correctly identifies the direction of seasonal demand at least 50% of the time (MPS ≥ 0.50) · Best observed MPS 0.3977 · REJECTED

Semantic alignment vs. operational control

The data reveals a structural disconnect between two capabilities that look related but are not. Semantic alignment means the AI understands the concept of what is happening. Reading the text outputs confirmed that all three models correctly recognised when a busy season was approaching and deduced that more buffer stock was needed. Operational control means the AI can choose the exact mathematical number required to stabilise the system. This is where every model failed.

An AI that correctly identifies “December needs more buffer” and then outputs 1.45× when 1.05× would have sufficed is not an operational controller. It is a semantic reasoner in the wrong job. Directional capability exists. Numerical calibration does not. The hybrid architecture exposed the gap between them cleanly.

The over-buffering bias was consistent and measurable. With no AI and a fixed 1.0 multiplier, chain OVAR is 1.7097. Every AI condition exceeded this. The models averaged multipliers between 1.13 and 1.48 across all runs and conditions, indicating a systemic tendency toward caution that, paradoxically, produced worse outcomes than doing nothing. When in doubt, the models added buffer. Adding buffer destabilised the chain.

The o4-mini Stateful result deserves attention on its own terms. At OVAR 3.1211 it was the worst performing configuration in the experiment. The internal thinking logs showed the mechanism: the model anchored heavily on recent negative signals. A minor backlog from two months prior prompted a massive over-order the following month, which then appeared in the next period’s history as excess inventory, triggering a different over-correction. This feedback loop is the bullwhip effect running not across supply chain tiers but inside the model’s reasoning process.
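The mechanism is easy to reproduce in miniature. A toy simulation, built on the assumption of a controller that reacts to the last period's inventory error with too much gain (this is an illustration of the dynamic described above, not a reconstruction of o4-mini's actual reasoning):

```python
# Toy model of a stateful controller that over-reacts to its own history:
# each correction creates the opposite error next period, amplified, so the
# chosen multiplier oscillates ever wider. (Illustrative assumption.)

def overcorrecting_multiplier(error_history, gain=2.0):
    """Scale the buffer up after a shortage, down after excess, with excessive gain."""
    return 1.0 + gain * error_history[-1]

inventory_error = [-0.05]   # start from a minor backlog (5% short)
multipliers = []
for _ in range(4):
    multipliers.append(round(overcorrecting_multiplier(inventory_error), 2))
    # over-reaction this period produces the opposite error next period, amplified
    inventory_error.append(-inventory_error[-1] * 1.5)

print(multipliers)  # swings around 1.0 grow each period
```

With gain above the stability threshold, the deviation from 1.0 grows every period: a bullwhip oscillation driven entirely by one agent's memory of its own mistakes.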

What this means for system design

The hybrid architecture was a reasonable hypothesis. Restricting the AI to a single scalar output while the formula handled execution seemed like a way to leverage AI’s qualitative strengths while limiting its quantitative weaknesses. The data shows this is not sufficient. A continuous multiplier that the AI invents still requires numerical precision that the models cannot reliably provide.

The next logical step is to restrict the output further: instead of a free-form multiplier, the AI selects from a small set of pre-defined text labels (STRONG_INCREASE, MODERATE_INCREASE, NEUTRAL, MODERATE_DECREASE, STRONG_DECREASE). A hard-coded translation layer maps each label to a fixed, pre-approved multiplier value. The AI’s task becomes classification, not numerical calibration. Whether this architecture works is the subject of the next experiment.
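A minimal sketch of that translation layer. The label names come from the text above; the multiplier values assigned to them here are illustrative assumptions, since the next experiment would have to choose them:

```python
# Hard-coded translation layer: the AI classifies, a fixed table calibrates.
# Multiplier values are illustrative assumptions, not pre-approved figures.
LABEL_TO_MULTIPLIER = {
    "STRONG_INCREASE":   1.30,
    "MODERATE_INCREASE": 1.15,
    "NEUTRAL":           1.00,
    "MODERATE_DECREASE": 0.90,
    "STRONG_DECREASE":   0.80,
}

def translate(ai_label: str) -> float:
    """Any unrecognised output falls back to NEUTRAL, so a malformed AI
    response can never inject an arbitrary number into the formula."""
    return LABEL_TO_MULTIPLIER.get(ai_label.strip().upper(), 1.00)

print(translate("moderate_increase"))  # 1.15
print(translate("order 5x extra!!"))   # 1.0 (fallback to NEUTRAL)
```

The design point is that the numeric range is now closed: the AI can be wrong about which label applies, but it can no longer invent a 1.45× buffer when 1.05× would do.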

The empirical multiplier range observed here (1.13–1.48 mean across all models and conditions) provides a concrete basis for designing guardrails. Any production deployment of a multiplier-based AI planner should enforce hard limits derived from empirical runs of this kind, preventing the model from accessing the region of the multiplier space where it reliably causes damage regardless of what it reasons.
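Such a guardrail amounts to a one-line clamp. The bounds below are illustrative placeholders; a production deployment would derive them from its own empirical runs:

```python
def clamp_multiplier(m: float, lo: float = 0.8, hi: float = 1.2) -> float:
    """Hard guardrail: clip the AI-proposed multiplier to a pre-approved band.
    Bounds are illustrative; in this experiment the models' 1.13-1.48 mean
    range sat largely above the region where the fixed 1.0x control already
    performed better."""
    return min(max(m, lo), hi)

print(clamp_multiplier(1.45))  # 1.2  (an over-buffer gets capped)
print(clamp_multiplier(1.05))  # 1.05 (modest adjustments pass through)
```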

Full code and results on GitHub

Full code, data, figures, and raw results are available on GitHub.  View on GitHub →

Methodology note

All scenarios, companies, products, and supply chain structures are entirely fictional. The demand series is synthetic, calibrated to published Indian automotive seasonal patterns. The experiment was intentionally narrow: single product, fixed lead times, stylised hybrid control loop, three models across three information conditions.

Results should not be generalised to supply chain management broadly. The correct scope: in a stylised hybrid architecture where AI controls a continuous safety stock multiplier and a formula executes the order quantity, no tested LLM configuration outperformed either the mathematical baseline or a fixed-multiplier control in any of the three information conditions.
