TL;DR
Context helped the lightweight model and hurt the reasoning model. The most capable, most expensive setup was the worst of all four. Every configuration amplified order variability.
Overview
What this experiment explored
The bullwhip effect is a well-known supply chain problem: small shifts in consumer demand turn into larger and larger order swings as you move upstream through the chain. Version 1 put LLM agents in the ordering role across a three-tier supply chain and asked whether giving them business context (company identity, product details, calendar month) changed how much variability they added.
This was an exploratory study, 5 runs per configuration, designed to establish a baseline and surface anything unexpected before scaling the experiment. One result was unexpected enough to motivate the larger Agentic Bullwhip Effect Version 2 design.
Experiment Setup
Design & configuration
| Parameter | Value |
|---|---|
| Models | gpt-4.1-mini (lightweight) · o1 (reasoning) |
| Design | 2×2 factorial: model tier × context treatment |
| Replications | 5 per configuration · 20 total runs |
| Primary metric | OVAR (Order Variance Amplification Ratio) = Var(orders) / Var(demand). Below 1.0 = dampening · 1.0 = pass-through · above 1.0 = amplification |
| Supply chain | 3-tier serial: Tatva Motors (OEM) → Lighting Manufacturer (Ancillary) → LED Component Manufacturer |
| Demand series | 13 months (Dec 2024 to Dec 2025) · single SKU · 606,771 total units |
| Lead time | 1 month deterministic at all tiers |
| Initial inventory | 43,000 units at all tiers |
| LLM calls | 720 total (12 periods × 3 tiers × 5 runs × 4 configurations) |
| Agent design | Stateless, no memory between periods |
| Blind condition | Agent receives numeric state only: demand, on-hand inventory, backlog, in-transit orders, lead time. No company identity, product, or calendar context. |
| Context condition | Agent receives the same numeric state plus company identity, product details, market position, and calendar month/year. |
| gpt-4.1-mini | Temperature 0.4, 600 max tokens |
| o1 | API-fixed temperature, 16,000 max tokens |
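The primary metric is straightforward to compute from the order and demand series. A minimal sketch (the function name and toy numbers are illustrative, not from the experiment code):

```python
import statistics

def ovar(orders: list[float], demand: list[float]) -> float:
    """Order Variance Amplification Ratio: Var(orders) / Var(demand).
    Below 1.0 = dampening, 1.0 = pass-through, above 1.0 = amplification."""
    return statistics.pvariance(orders) / statistics.pvariance(demand)

# Toy series: the orders swing more widely than the demand they respond to,
# so this tier's OVAR comes out above 1.0 (amplification).
demand = [100, 110, 95, 105, 120, 90]
orders = [100, 125, 80, 115, 140, 70]
print(round(ovar(orders, demand), 3))
```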
Key findings
What I found
- Every configuration amplified demand variability. OVAR exceeded 1.0 at every tier, in every run, across all four configurations. No configuration reduced variability.
- Context made the lightweight model more stable. context_lightweight chain OVAR 2.929 vs blind_lightweight 3.157, a 7.2% reduction. context_lightweight also showed the highest seasonal responsiveness, raising orders at event months in 83% of cases.
- Context made the reasoning model worse. context_reasoning chain OVAR 4.412 vs blind_reasoning 3.835. The most capable, fully-informed configuration produced the highest chain variability of all four, 39.7% above the blind lightweight baseline.
- context_reasoning produced an inverted tier pattern. Standard bullwhip analysis predicts variability increasing upstream: Component noisier than Ancillary, Ancillary noisier than OEM. context_reasoning flipped this entirely. OEM OVAR 6.349, Ancillary 4.191, Component 2.698. The three other configurations followed the expected pattern.
- o1 showed high run-to-run variability. CV for o1 OVAR: 22–57%. For gpt-4.1-mini: under 2%. At n=5, o1 means are directional, not reliable point estimates.
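The instability figures above are coefficients of variation over the 5 replications. A sketch of how such a CV is computed (the per-run values here are hypothetical, chosen only to show a tight cluster versus a wide spread):

```python
import statistics

def cv_percent(values: list[float]) -> float:
    """Coefficient of variation: sample standard deviation as a percent of the mean.
    A scale-free way to compare run-to-run spread across configurations."""
    return 100 * statistics.stdev(values) / statistics.mean(values)

# Hypothetical per-run chain OVARs, not the experiment's raw values.
lightweight_runs = [3.15, 3.16, 3.14, 3.17, 3.16]   # tight cluster -> CV under 2%
reasoning_runs = [2.1, 5.8, 3.0, 4.9, 3.4]          # wide spread -> CV well above 20%

print(f"{cv_percent(lightweight_runs):.1f}%  {cv_percent(reasoning_runs):.1f}%")
```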
Results
Numeric results
Chain-average OVAR by configuration
| Configuration | Model | Treatment | Chain avg OVAR | vs blind_lightweight |
|---|---|---|---|---|
| context_lightweight | gpt-4.1-mini | Context | 2.929 | −7.2% |
| blind_lightweight | gpt-4.1-mini | Blind | 3.157 | baseline |
| blind_reasoning | o1 | Blind | 3.835 | +21.5% |
| context_reasoning | o1 | Context | 4.412 | +39.7% |
OVAR by tier (mean ± std, CV%)
| Configuration | OEM OVAR | CV% | Ancillary OVAR | CV% | Component OVAR | CV% |
|---|---|---|---|---|---|---|
| blind_lightweight | 2.267 ± 0.009 | 0.41 | 2.938 ± 0.044 | 1.50 | 4.266 ± 0.078 | 1.82 |
| context_lightweight | 2.237 ± 0.006 | 0.29 | 3.138 ± 0.080 | 2.55 | 3.412 ± 0.347 * | 10.18 |
| blind_reasoning | 4.200 ± 2.400 | 57.15 ⚠ | 3.656 ± 1.350 | 36.94 ⚠ | 3.649 ± 0.608 | 16.66 ⚠ |
| context_reasoning | 6.349 ± 1.452 | 22.86 ⚠ | 4.191 ± 1.373 | 32.76 ⚠ | 2.698 ± 0.677 | 25.10 ⚠ |
\* Parse error in run 5 inflates the component mean by ~0.129; clean estimate: 3.283 ± 0.220.
⚠ CV > 10%: high run-to-run instability; means are directional, not reliable point estimates.
Stockouts and excess inventory
| Configuration | Stockouts (chain total) | Excess inventory (chain total) |
|---|---|---|
| blind_lightweight | 21.4 | 109,360 |
| context_lightweight | 19.6 | 151,246 |
| blind_reasoning | 20.0 | 330,649 |
| context_reasoning | 12.8 | 654,728 |
context_reasoning had the fewest stockouts and the highest excess inventory: 654,728 units, roughly 6× the blind_lightweight total. It traded stockouts away at a substantial inventory cost.
Hypothesis verdicts
| Hypothesis | Prediction | Verdict |
|---|---|---|
| H1 | Context reduces OVAR at all three tiers | REJECTED |
| H2 | Blind reasoning performs similarly to blind lightweight | REJECTED |
| H3 | context_reasoning achieves the lowest chain OVAR | REJECTED |
| H4 | Context agents respond better to seasonal demand | PARTIAL |
H4 held for the lightweight model (seasonal elevation score of 83%) but reversed for the reasoning model.
Discussion
The context effect
The context effect ran in opposite directions depending on the model. For gpt-4.1-mini, context reduced chain OVAR by 0.228. For o1, context increased it by 0.577. The sharpest divergence was at the OEM tier, which is the tier observing real consumer demand. Context had near-zero effect on gpt-4.1-mini there (delta −0.030). For o1 at the same tier, context pushed OVAR up by 2.149. A model capable of reasoning about seasonality, given a month name and a market identity, appears to build forward-looking ordering strategies that inject variance at the chain head rather than reduce it.
The inverted cascade in context_reasoning is the clearest expression of this. The OEM agent ordered aggressively ahead of anticipated demand peaks, accumulating 654,728 units of excess inventory chain-wide while stockouts fell to 12.8. Whether this pattern holds at more runs is what Agentic Bullwhip Effect Version 2 is designed to test: with CVs of 22–57% for the o1 configurations, these n=5 means are directional, real enough to follow up on, not settled enough to conclude from.
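The amplification mechanism itself does not require an LLM to reproduce. Below is a stylized three-tier sketch, not the experiment's simulator, in which each tier applies a naive ordering rule that over-reacts to the latest change in its incoming orders; variance compounds upstream exactly as in the three non-inverted configurations:

```python
import random
import statistics

def tier_orders(incoming: list[float], alpha: float = 0.5) -> list[float]:
    """Naive ordering rule: pass the observed quantity through, plus an
    over-reaction to the latest change. The alpha term injects the extra
    variance that compounds as each tier reacts to the tier below."""
    orders, prev = [], incoming[0]
    for d in incoming:
        orders.append(max(0.0, d + alpha * (d - prev)))
        prev = d
    return orders

random.seed(7)
demand = [100 + random.gauss(0, 10) for _ in range(200)]  # consumer demand

oem = tier_orders(demand)           # OEM reacts to consumer demand
ancillary = tier_orders(oem)        # Ancillary reacts to OEM orders
component = tier_orders(ancillary)  # Component reacts to Ancillary orders

# OVAR of each tier against consumer demand: strictly increasing upstream.
for name, series in [("OEM", oem), ("Ancillary", ancillary), ("Component", component)]:
    print(name, round(statistics.pvariance(series) / statistics.pvariance(demand), 2))
```

The point of the sketch is that the standard cascade falls out of any rule that reacts to local changes; what context_reasoning did, concentrating variance at the OEM tier instead, is the anomaly worth chasing.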
Industry Implications
What this means for practitioners
More context and a more capable model did not produce better ordering decisions. The assumption that richer input to a more capable model will always improve the result does not hold when each order directly affects the next period's inventory. In a supply chain, a bad order this month makes the situation harder next month, and the model has no memory of what it did.
The o1 result points to a specific deployment risk: inconsistency. When a model produces reasonable results in some runs and extreme results in others with the same inputs every time, you cannot tell in advance which type of run you are getting. That is harder to manage than a model that fails consistently, because at least a consistent failure is predictable. Before using a reasoning-tier model in any sequential operational task, run enough trials to understand how much its output varies.
Full code and results on GitHub
Full code, data, and raw results are available on GitHub.
Methodology note
All scenarios, companies, products, and supply chain structures are fictional. No proprietary data was used. Exploratory study; 5 runs per configuration. Results are directional. gpt-4.1-mini at temperature 0.4, 600 max tokens. o1 at API-fixed temperature, 16,000 max tokens.