Research

Experiments

Experimental research on LLM agent performance in industrial engineering decision environments.

Published Experiments

The Architecture That Finally Worked: Adaptive Smoothing in Supply Chain Replenishment

SUPPLY CHAIN · ADAPTIVE SMOOTHING · EXPERIMENT WRITEUP

For the first time in this research program, every AI condition produced OVAR below 1.0 — the AI was damping variance, not amplifying it. The key architectural change: the AI now controls the EMA smoothing parameter α inside the formula rather than a safety stock buffer on top of it. Best result: gpt-oss 120B blind, OVAR 0.535, marginally beating the fixed α=0.3 baseline at 0.545.
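
As a rough illustration of the control lever described above, the sketch below runs an EMA forecast in which α can change every period, then computes a variance ratio. It rests on several assumptions that the summary does not confirm: OVAR is taken to be the variance of orders divided by the variance of demand, each order is set equal to the current forecast, and the names `ema_orders` and `ovar` and the demand series are hypothetical. It is not the experiment's simulation code.

```python
from statistics import pvariance


def ema_orders(demand, alphas, initial_forecast):
    """Return one order per period, each equal to the updated EMA forecast.

    Assumption: order quantity == forecast. A real replenishment policy
    would also account for inventory position and lead time.
    """
    forecast = initial_forecast
    orders = []
    for observed, alpha in zip(demand, alphas):
        # Standard exponential smoothing update with a per-period alpha.
        forecast = alpha * observed + (1 - alpha) * forecast
        orders.append(forecast)
    return orders


def ovar(orders, demand):
    """Assumed definition: Var(orders) / Var(demand). Below 1.0 means damping."""
    return pvariance(orders) / pvariance(demand)


# Hypothetical demand series, for illustration only.
demand = [100, 120, 80, 110, 90, 130, 70, 100]

# Fixed alpha = 0.3, mirroring the baseline named in the summary.
fixed = ema_orders(demand, [0.3] * len(demand), initial_forecast=100)

# alpha = 1.0 simply passes demand through, so the ratio is exactly 1.0.
passthrough = ema_orders(demand, [1.0] * len(demand), initial_forecast=100)

print(ovar(fixed, demand), ovar(passthrough, demand))
```

In the adaptive setting, the list passed as `alphas` would come from the model's per-period choices instead of a constant; a lower α smooths more heavily and pushes the ratio further below 1.0 under these assumptions.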

The Ceiling Is in the Formula: Oracle Labels and Structural Incompatibility

SUPPLY CHAIN · ORACLE ABLATION · EXPERIMENT WRITEUP

Perfect ground-truth labels, fed directly to the architecture with no LLM at all, produced worse OVAR than the Order-Up-To formula running alone. Fourteen architectural variants were tested; none passed the gate. The 0.540-unit gap to exponential smoothing is preserved exactly regardless of label quality, multiplier range, or formula variant. V5 closes the intent-classification line; V6 changes the control lever.

The Equaliser Effect: Intent Classification in Supply Chain Replenishment

SUPPLY CHAIN · INTENT CLASSIFICATION · EXPERIMENT WRITEUP

Four AI models — from a lightweight fast model to a 120-billion-parameter reasoning behemoth — produced statistically identical supply chain outcomes across three information conditions. Replacing the AI’s float output with five discrete text labels fixed the calibration problem from V3b. It simultaneously created a structural ceiling no model or prompt can escape.

Hybrid AI Safety Stock Control in Supply Chain Replenishment

SUPPLY CHAIN · HYBRID ARCHITECTURE · EXPERIMENT WRITEUP

Three AI models controlled the safety stock multiplier in a hybrid architecture; a mathematical formula handled the order quantity. All four hypotheses failed across three information conditions and 20 replications per condition. Every AI condition produced higher order variance than doing nothing. Context made two out of three models worse. Memory caused the advanced reasoning model to collapse.

sarvam-30b in Supply Chain Ordering: A Comparison with gpt-oss 120B

SUPPLY CHAIN · SOVEREIGN MODEL · EXPERIMENT WRITEUP

India’s sovereign model showed no measurable difference from gpt-oss on this task: OVAR 4.504 vs. 4.52. Neither model detected Indian seasonal demand patterns. Exponential smoothing outperformed both by approximately 8×. The gpt-oss figure is a context reference taken from the Agentic Bullwhip Effect Version 2 experiment; the two models were not run side by side.

LLM Agents Against Heuristic Baselines in Supply Chain Replenishment: An Experimental Comparison

SUPPLY CHAIN · EXPERIMENT WRITEUP

Four LLM configurations were evaluated against three heuristic baselines across 20 replications and 11,520 LLM calls. Every heuristic outperformed every LLM on both order variance and stockout count simultaneously. All seven hypotheses were rejected.

Context and Model Capability in AI-Driven Supply Chain Ordering: An Experimental Study

SUPPLY CHAIN · EXPERIMENT WRITEUP

All four configurations amplified demand variability. The context × reasoning condition produced a fully inverted tier pattern (OEM as the noisiest tier, Component as the quietest), reversing the standard upstream cascade.
