TL;DR
Context helped the lightweight model and hurt the reasoning model. The most capable, most expensive setup was the worst of all four. Every configuration amplified order variability.
Overview
What this experiment explored
The bullwhip effect is a well-known supply chain problem: small shifts in consumer demand turn into larger and larger order swings as you move upstream through the chain. Version 1 put LLM agents in the ordering role across a three-tier supply chain and asked whether giving them business context (company identity, product details, calendar month) changed how much variability they added.
This was an exploratory study, 5 runs per configuration, designed to establish a baseline and surface anything unexpected before scaling the experiment. One result was unexpected enough to motivate the larger Agentic Bullwhip Effect Version 2 design.
Experiment Setup
Design & configuration
| Parameter | Value |
|---|---|
| Models | gpt-4.1-mini (lightweight) · o1 (reasoning) |
| Design | 2×2 factorial: model tier × context treatment |
| Replications | 5 per configuration · 20 total runs |
| Primary metric | OVAR (Order Variance Amplification Ratio) = Var(orders) / Var(demand). Below 1.0 = dampening · 1.0 = pass-through · above 1.0 = amplification |
| Supply chain | 3-tier serial: Tatva Motors (OEM) → Lighting Manufacturer (Ancillary) → LED Component Manufacturer |
| Demand series | 13 months (Dec 2024 to Dec 2025) · single SKU · 606,771 total units |
| Lead time | 1 month deterministic at all tiers |
| Initial inventory | 43,000 units at all tiers |
| LLM calls | 720 total (12 periods × 3 tiers × 5 runs × 4 configurations) |
| Agent design | Stateless, no memory between periods |
| Blind condition | Agent receives numeric state only: demand, on-hand inventory, backlog, in-transit orders, lead time. No company identity, product, or calendar context. |
| Context condition | Agent receives the same numeric state plus company identity, product details, market position, and calendar month/year. |
| gpt-4.1-mini | Temperature 0.4, 600 max tokens |
| o1 | API-fixed temperature, 16,000 max tokens |
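The primary metric is straightforward to compute from the order and demand series. A minimal sketch (the function name and toy numbers are illustrative, not from the experiment code):

```python
import statistics

def ovar(orders: list[float], demand: list[float]) -> float:
    """Order Variance Amplification Ratio: Var(orders) / Var(demand).
    Below 1.0 = dampening, 1.0 = pass-through, above 1.0 = amplification."""
    return statistics.pvariance(orders) / statistics.pvariance(demand)

# Toy series: the orders swing more widely than the demand they respond to,
# so this tier's OVAR comes out above 1.0 (amplification).
demand = [100, 110, 95, 105, 120, 90]
orders = [100, 125, 80, 115, 140, 70]
print(round(ovar(orders, demand), 3))
```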
Key findings
What I found
- Every configuration amplified demand variability. OVAR exceeded 1.0 at every tier, in every run, across all four configurations. No configuration reduced variability.
- Context made the lightweight model more stable. context_lightweight chain OVAR 2.929 vs blind_lightweight 3.157, a 7.2% reduction. context_lightweight also showed the highest seasonal responsiveness, raising orders at event months in 83% of cases.
- Context made the reasoning model worse. context_reasoning chain OVAR 4.412 vs blind_reasoning 3.835. The most capable, fully-informed configuration produced the highest chain variability of all four, 39.7% above the blind lightweight baseline.
- context_reasoning produced an inverted tier pattern. Standard bullwhip analysis predicts variability increasing upstream: Component noisier than Ancillary, Ancillary noisier than OEM. context_reasoning flipped this entirely. OEM OVAR 6.349, Ancillary 4.191, Component 2.698. The three other configurations followed the expected pattern.
- o1 showed high run-to-run variability. CV for o1 OVAR: 22–57%. For gpt-4.1-mini: under 2%. At n=5, o1 means are directional, not reliable point estimates.
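The instability figures above are coefficients of variation over the 5 replications. A sketch of how such a CV is computed (the per-run values here are hypothetical, chosen only to show a tight cluster versus a wide spread):

```python
import statistics

def cv_percent(values: list[float]) -> float:
    """Coefficient of variation: sample standard deviation as a percent of the mean.
    A scale-free way to compare run-to-run spread across configurations."""
    return 100 * statistics.stdev(values) / statistics.mean(values)

# Hypothetical per-run chain OVARs, not the experiment's raw values.
lightweight_runs = [3.15, 3.16, 3.14, 3.17, 3.16]   # tight cluster -> CV under 2%
reasoning_runs = [2.1, 5.8, 3.0, 4.9, 3.4]          # wide spread -> CV well above 20%

print(f"{cv_percent(lightweight_runs):.1f}%  {cv_percent(reasoning_runs):.1f}%")
```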
Results
Numeric results
Chain-average OVAR by configuration
| Configuration | Model | Treatment | Chain avg OVAR | vs blind_lightweight |
|---|---|---|---|---|
| context_lightweight | gpt-4.1-mini | Context | 2.929 | −7.2% |
| blind_lightweight | gpt-4.1-mini | Blind | 3.157 | baseline |
| blind_reasoning | o1 | Blind | 3.835 | +21.5% |
| context_reasoning | o1 | Context | 4.412 | +39.7% |
OVAR by tier (mean ± std, CV%)
| Configuration | OEM OVAR | CV% | Ancillary OVAR | CV% | Component OVAR | CV% |
|---|---|---|---|---|---|---|
| blind_lightweight | 2.267 ± 0.009 | 0.41 | 2.938 ± 0.044 | 1.50 | 4.266 ± 0.078 | 1.82 |
| context_lightweight | 2.237 ± 0.006 | 0.29 | 3.138 ± 0.080 | 2.55 | 3.412 ± 0.347 * | 10.18 |
| blind_reasoning | 4.200 ± 2.400 | 57.15 ⚠ | 3.656 ± 1.350 | 36.94 ⚠ | 3.649 ± 0.608 | 16.66 ⚠ |
| context_reasoning | 6.349 ± 1.452 | 22.86 ⚠ | 4.191 ± 1.373 | 32.76 ⚠ | 2.698 ± 0.677 | 25.10 ⚠ |
\* Parse error in run 5 inflates the component mean by ~0.129; clean estimate: 3.283 ± 0.220.
⚠ CV > 10%: high run-to-run instability; means are directional, not reliable point estimates.
Stockouts and excess inventory
| Configuration | Stockouts (chain total) | Excess inventory (chain total) |
|---|---|---|
| blind_lightweight | 21.4 | 109,360 |
| context_lightweight | 19.6 | 151,246 |
| blind_reasoning | 20.0 | 330,649 |
| context_reasoning | 12.8 | 654,728 |
context_reasoning had the fewest stockouts and the highest excess inventory: 654,728 units, roughly 6× the blind_lightweight total. It traded stockouts away at a substantial inventory cost.
Hypothesis verdicts
| Hypothesis | Prediction | Verdict |
|---|---|---|
| H1 | Context reduces OVAR at all three tiers | REJECTED |
| H2 | Blind reasoning performs similarly to blind lightweight | REJECTED |
| H3 | context_reasoning achieves the lowest chain OVAR | REJECTED |
| H4 | Context agents respond better to seasonal demand | PARTIAL |
H4 held for the lightweight model (seasonal elevation score of 83%) but reversed for the reasoning model.
Discussion
The context effect
The context effect ran in opposite directions depending on the model. For gpt-4.1-mini, context reduced chain OVAR by 0.228. For o1, context increased it by 0.577. The sharpest divergence was at the OEM tier, which is the tier observing real consumer demand. Context had near-zero effect on gpt-4.1-mini there (delta −0.030). For o1 at the same tier, context pushed OVAR up by 2.149. A model capable of reasoning about seasonality, given a month name and a market identity, appears to build forward-looking ordering strategies that inject variance at the chain head rather than reduce it.
The inverted cascade in context_reasoning is the clearest expression of this. The OEM agent ordered aggressively ahead of anticipated demand peaks, accumulating 654,728 units of excess inventory chain-wide while stockouts fell to 12.8. Whether this pattern holds at more runs is what Agentic Bullwhip Effect Version 2 is designed to test: with CVs of 22–57% for the o1 configurations, these n=5 means are directional, real enough to follow up on, not settled enough to conclude from.
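The amplification mechanism itself does not require an LLM to reproduce. Below is a stylized three-tier sketch, not the experiment's simulator, in which each tier applies a naive ordering rule that over-reacts to the latest change in its incoming orders; variance compounds upstream exactly as in the three non-inverted configurations:

```python
import random
import statistics

def tier_orders(incoming: list[float], alpha: float = 0.5) -> list[float]:
    """Naive ordering rule: pass the observed quantity through, plus an
    over-reaction to the latest change. The alpha term injects the extra
    variance that compounds as each tier reacts to the tier below."""
    orders, prev = [], incoming[0]
    for d in incoming:
        orders.append(max(0.0, d + alpha * (d - prev)))
        prev = d
    return orders

random.seed(7)
demand = [100 + random.gauss(0, 10) for _ in range(200)]  # consumer demand

oem = tier_orders(demand)           # OEM reacts to consumer demand
ancillary = tier_orders(oem)        # Ancillary reacts to OEM orders
component = tier_orders(ancillary)  # Component reacts to Ancillary orders

# OVAR of each tier against consumer demand: strictly increasing upstream.
for name, series in [("OEM", oem), ("Ancillary", ancillary), ("Component", component)]:
    print(name, round(statistics.pvariance(series) / statistics.pvariance(demand), 2))
```

The point of the sketch is that the standard cascade falls out of any rule that reacts to local changes; what context_reasoning did, concentrating variance at the OEM tier instead, is the anomaly worth chasing.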
Industry Implications
What this means for practitioners
More context and a more capable model did not produce better ordering decisions. The assumption that richer input to a more capable model will always improve the result does not hold when each order directly affects the next period's inventory. In a supply chain, a bad order this month makes the situation harder next month, and the model has no memory of what it did.
The o1 result points to a specific deployment risk: inconsistency. When a model produces reasonable results in some runs and extreme results in others with the same inputs every time, you cannot tell in advance which type of run you are getting. That is harder to manage than a model that fails consistently, because at least a consistent failure is predictable. Before using a reasoning-tier model in any sequential operational task, run enough trials to understand how much its output varies.
Full code and results on GitHub
Full code, data, and raw results are available on GitHub.
Methodology note
All scenarios, companies, products, and supply chain structures are fictional. No proprietary data was used. Exploratory study; 5 runs per configuration. Results are directional. gpt-4.1-mini at temperature 0.4, 600 max tokens. o1 at API-fixed temperature, 16,000 max tokens.