Independent Research

Industrial Mind & Code

An experimental research program investigating the performance of LLM agents in industrial engineering decision environments. Each study places LLM agents in a controlled simulation (supply chains, maintenance systems, production planning) and evaluates their performance against deterministic analytical baselines or other suitable comparison mechanisms.

Researcher: Siddharth Srinivasan
Domain: Industrial Engineering × AI
Affiliation: Independent

Experiments

The Architecture That Finally Worked: Adaptive Smoothing in Supply Chain Replenishment · Published

SUPPLY CHAIN · ADAPTIVE SMOOTHING · V6 STATELESSSWING

Tests a fundamentally different architecture: the AI selects the EMA smoothing parameter α ∈ {0.1, 0.3, 0.5, 0.7} per period rather than a safety stock buffer. Three models tested across blind, context, and stateful conditions, plus four debiased conditions (V6b). 25-month simulation, deterministic lead times.

  • First experiment in this program where every AI condition produced OVAR below 1.0. The AI was damping variance, not amplifying it, across all nine primary conditions.
  • Best result: gpt-oss 120B blind, OVAR 0.535 ± 0.048 with 4.4 stockouts — matching and marginally beating the fixed α=0.3 baseline at OVAR 0.545, 5 stockouts.
  • Context penalty persists in a new form: context causes α-inflation (models choose more reactive smoothing), not panic buying. Debiased conditions (V6b) recover most of the blind condition’s advantage while providing seasonal information.
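The V6 control loop can be sketched in a few lines. This is an illustrative reconstruction under stated assumptions, not the experiment's code: `choose_alpha` stands in for the per-period LLM call, and the default policy is the fixed α = 0.3 baseline mentioned above.

```python
def adaptive_ema_forecast(demand, alpha_choices=(0.1, 0.3, 0.5, 0.7),
                          choose_alpha=None):
    """Forecast demand with an EMA whose smoothing parameter is
    re-selected every period, as in the V6 architecture sketch.

    `choose_alpha` is a placeholder for the LLM call: it receives the
    demand history seen so far and returns one alpha from alpha_choices.
    """
    if choose_alpha is None:
        # Placeholder policy: fixed alpha = 0.3, the baseline in the writeup.
        choose_alpha = lambda history: 0.3
    forecast = demand[0]
    forecasts = [forecast]
    for t in range(1, len(demand)):
        alpha = choose_alpha(demand[:t])
        assert alpha in alpha_choices, "agent must pick from the menu"
        # Standard EMA update: blend the newest observation with the forecast.
        forecast = alpha * demand[t - 1] + (1 - alpha) * forecast
        forecasts.append(forecast)
    return forecasts
```

The design point of V6 is visible here: the agent never touches the order quantity directly, only the responsiveness of the forecast, so a bad decision can at worst make the forecast sluggish or jumpy rather than inject raw order variance.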
The Ceiling Is in the Formula: Oracle Labels and Structural Incompatibility · Published

SUPPLY CHAIN · ORACLE ABLATION · V5 CONTROL ARCHITECTURE

Phase 1 ablation: the LLM is removed entirely and replaced with oracle (ground-truth perfect) labels or causal rule-based classifiers. Fourteen architectural variants tested — multiplier maps, NEUTRAL redefinitions, order dampening, forecast oracle — 20 replications each. Phase 2 gate not passed for this architecture; the intent-classification line closed.

  • Oracle labels (perfect ground-truth classifications) produced OVAR 1.776 — worse than the Order-Up-To formula with no AI at OVAR 1.753. Perfect intelligence in the wrong position is worse than no intelligence.
  • The 0.540-unit gap between the best achievable result and exp_smoothing (1.193) is invariant to label quality. It is an architectural property of the safety-stock-based OUT formula, not a model quality problem.
  • Closed the intent-classification lineage (V1–V5). Because the gap is structural to the OUT formula's safety stock, V6 abandons that formula for a different architecture entirely.
The Equaliser Effect: Intent Classification in Supply Chain Replenishment · Published

SUPPLY CHAIN · INTENT CLASSIFICATION · V4 WORLDEVENTS · 36-MONTH SIMULATION

Tests a 5-label discrete intent classification architecture (STRONG_INCREASE through STRONG_DECREASE) with a hard-coded multiplier lookup replacing the AI’s continuous float output from V3b. Four models across three information conditions (Blind, Context, Unstructured), 20–100 replications per condition, stochastic lead times and world events.

  • All four models (14B to 120B) produced OVAR in a 0.05-unit band (1.726–1.780), clustering on the Order-Up-To formula baseline regardless of model size, reasoning capability, or information level. The Equaliser Effect.
  • Context roughly doubled direction accuracy (0.41–0.48 → 0.72–0.84) without materially moving OVAR. Accuracy improvement discarded at the lookup table.
  • Neutral-prior prompt instruction (E4 sub-experiment) had zero effect on OVAR across all three models tested. You cannot prompt your way out of the bullwhip effect.
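The bottleneck described above is easy to see in code. The five labels come from the writeup; the numeric multipliers below are placeholder values I chose for illustration, not the published map.

```python
# Illustrative lookup table. The label set is from the V4 writeup;
# the multiplier values are hypothetical stand-ins.
MULTIPLIERS = {
    "STRONG_DECREASE": 0.8,
    "DECREASE": 0.9,
    "NEUTRAL": 1.0,
    "INCREASE": 1.1,
    "STRONG_INCREASE": 1.2,
}

def order_with_intent(base_order, label):
    """Apply the hard-coded lookup that replaces the model's continuous
    output: whatever nuance the classifier produced is collapsed to one
    of five fixed multipliers before it touches the order quantity."""
    return base_order * MULTIPLIERS[label]
```

This is why doubling direction accuracy cannot move OVAR much: two models that disagree substantially in confidence but map to the same label produce identical orders.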
Hybrid AI Safety Stock Control in Supply Chain Replenishment · Published

SUPPLY CHAIN · HYBRID ARCHITECTURE · 12,960 LLM CALLS

Three AI models were given direct control over the safety stock multiplier in a hybrid architecture where a mathematical formula computed the actual order quantities. Tested across three information conditions (Blind, Context, Stateful) with 20 replications per condition.

  • All four hypotheses rejected. exp_smoothing OVAR 0.5446 with 5 stockouts; best AI condition (gpt-4.1-mini, Blind) OVAR 2.3325 with 10.6 stockouts; worst (o4-mini, Stateful) OVAR 3.1211.
  • Context provision increased order variance for two out of three models. More information triggered more panic-buying, not more precision.
  • o4-mini in the Stateful condition anchored on prior stockouts and over-corrected violently each period: the highest OVAR in the experiment, and a textbook bullwhip trigger generated inside the model's reasoning loop.
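The division of labour in this hybrid can be sketched as follows. The order-up-to target below is a standard textbook form and an assumption on my part, not the exact published formula; all parameter names are illustrative.

```python
def order_up_to_quantity(forecast, inventory_position, lead_time,
                         sigma_demand, safety_multiplier, z=1.65):
    """Hybrid split: the formula owns the order quantity, and the AI's
    only lever is `safety_multiplier`, which scales the safety stock.

    Standard order-up-to sketch (an assumption, not the published code):
    target = forecast over (lead time + 1) periods plus safety stock.
    """
    safety_stock = safety_multiplier * z * sigma_demand * (lead_time ** 0.5)
    target = forecast * (lead_time + 1) + safety_stock
    # Orders cannot be negative; excess inventory simply waits.
    return max(0.0, target - inventory_position)
```

Even with this guard rail, a multiplier that swings period to period moves the target up and down, and the bullwhip re-enters through the multiplier channel, which is consistent with the rejected hypotheses above.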
LLM Agents Against Heuristic Baselines in Supply Chain Replenishment · Published

SUPPLY CHAIN · EXPERIMENT WRITEUP · 11,520 LLM CALLS

Four LLM configurations (spanning frontier and local inference, lightweight and reasoning tiers) were evaluated against three heuristic baselines in a three-tier serial supply chain simulation. The published comparison combines eight backend-specific model-condition cells across four two-condition bundles, 20 replications per cell, 11,520 LLM calls.

  • Every heuristic outperformed every LLM configuration on both OVAR and stockout count simultaneously. Exponential smoothing achieved chain OVAR 0.54 with 5 stockouts; the best LLM achieved 4.33 with 41 stockouts.
  • Context provision destabilised the local lightweight model (phi4:14b), producing Ancillary-tier OVAR of 10.82 ± 8.14 across 20 runs.
  • Reasoning-tier models generated over 1,000,000 reasoning tokens with no measurable improvement in ordering performance over lightweight-tier models. All seven hypotheses rejected.
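OVAR appears throughout these writeups. A common bullwhip measure is the order-variance amplification ratio, Var(orders) / Var(demand), which is the reading assumed in this sketch; the writeups may define it with additional detail.

```python
from statistics import pvariance

def ovar(orders, demand):
    """Order variance amplification ratio: Var(orders) / Var(demand).

    Values above 1 mean the ordering policy amplifies demand
    variability (bullwhip); values below 1 mean it damps it. This
    definition is my reading of the acronym, not quoted from the
    writeups.
    """
    return pvariance(orders) / pvariance(demand)
```

Under this reading, a chain OVAR of 0.54 means the baseline roughly halves demand variance, while an OVAR of 4.33 quadruples it, which is what makes the heuristic-versus-LLM gap above so stark.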
sarvam-30b in Supply Chain Ordering: A Comparison with gpt-oss 120B · Published

SUPPLY CHAIN · SOVEREIGN MODELS · 1,440 LLM CALLS

Follow-up to Agentic Bullwhip Effect Version 2 testing whether a model trained on Indian data (sarvam-30b, 30B total / 2.4B active parameters) demonstrates stronger seasonal awareness on a synthetic Indian automotive demand series than a frontier foundation model. 2 conditions, 10 replications per condition.

  • No measurable difference: sarvam-30b chain OVAR 4.504 vs. gpt-oss 120B 4.52 (reference value from Agentic Bullwhip Effect Version 2's context condition, not a co-run). The difference is within noise; neither model detected Indian seasonal demand patterns.
  • Exponential smoothing achieved OVAR 0.54, an 8× gap over both models.
  • Secondary finding: enabling the native reasoning flag (think=True) was associated with eliminating run-level failures without affecting ordering performance.
Context and Model Capability in AI-Driven Supply Chain Ordering · Published

SUPPLY CHAIN · EXPERIMENT WRITEUP · 720 LLM CALLS

Exploratory study examining the interaction between domain context provision and model capability on demand amplification. 2×2 factorial design (gpt-4.1-mini × o1, blind × context), 5 replications per condition, 720 LLM calls.

  • All configurations produced bullwhip amplification. Context reduced chain OVAR for the lightweight model (−7.2%) and increased it for the reasoning model (+15.0%).
  • The context + reasoning condition produced a fully inverted tier pattern (OEM OVAR 6.349, Component 2.698): the demand-facing tier was the noisiest, the furthest upstream tier the quietest.
  • Results are directional (n = 5) and motivated the expanded design in Agentic Bullwhip Effect Version 2.
Understanding LLM Agentic Capabilities in Total Productive Maintenance (TPM) · In Progress

TPM · PREDICTIVE MAINTENANCE · VERNACULAR NORMALISATION

Evaluates LLM agent performance on TPM decision support when reasoning over fragmented, multilingual maintenance records representative of Indian manufacturing environments. Includes an experimental condition with a vernacular normalisation preprocessing layer.

How experiments are designed

01

Controlled simulation environments

Parameters are synthetic but calibrated against published literature. All scenarios are fictional. No proprietary data is used.

02

Analytical baselines

Performance is measured against deterministic methods that already exist for the problem: exponential smoothing, order-up-to policies, SPC rules. The evaluation question is whether LLM agents add value beyond established approaches.

03

Model performance across deployment surfaces

Each experiment tests models across frontier and local inference, spanning lightweight and reasoning tiers. Model tier, context treatment, and inference backend are treated as experimental factors, not background conditions.

04

Optimised architecture

Experimental results are used to identify where hybrid deployments (combining LLM agents with deterministic methods) may outperform either approach alone. The design question is not whether AI replaces the established method, but whether and where it can improve on it.

Infrastructure

Frontier
Azure AI Foundry · Azure OpenAI Service
Local
ASUS Ascent GX10 · Ollama · llama.cpp
Code
Claude Code · Codex

About the researcher

My background is in industrial engineering. I work in digital engineering software, where AI tools meet the operational realities of industrial organisations.

This research program came from a direct question: where do LLM agents actually hold up in the domains I know best (manufacturing, supply chain, maintenance, and production planning) and where do they fall short?

Each experiment takes a foundational industrial engineering problem, builds a controlled simulation, places LLM agents in the decision-making role, and measures the result against what a deterministic method would have done.

Get in touch

LinkedIn  ·  GitHub