Independent Research
An experimental research program investigating the performance of LLM agents in industrial engineering decision environments. Each study places LLM agents in a controlled simulation (supply chains, maintenance systems, production planning) and evaluates their performance against deterministic analytical baselines or other suitable comparison mechanisms.
Experiments
Published
SUPPLY CHAIN · ADAPTIVE SMOOTHING · V6 STATELESSSWING
Tests a fundamentally different architecture: the AI selects the EMA smoothing parameter α ∈ {0.1, 0.3, 0.5, 0.7} per period rather than a safety stock buffer. Three models tested across blind, context, and stateful conditions, plus four debiased conditions (V6b). 25-month simulation, deterministic lead times.
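The V6 mechanism can be sketched in a few lines. This is a minimal illustration, not the study's actual code: the function and threshold logic stand in for the LLM's per-period choice, and only the α grid {0.1, 0.3, 0.5, 0.7} comes from the experiment description.

```python
ALPHAS = (0.1, 0.3, 0.5, 0.7)

def choose_alpha(recent_error: float) -> float:
    """Stand-in for the LLM call: pick a higher alpha when recent error is large."""
    if abs(recent_error) > 20:
        return ALPHAS[3]   # 0.7: track demand aggressively
    if abs(recent_error) > 10:
        return ALPHAS[2]   # 0.5: moderate responsiveness
    return ALPHAS[0]       # 0.1: heavy smoothing when demand looks stable

def run_forecast(demand):
    """One EMA forecast pass with a fresh alpha choice each period."""
    forecast = float(demand[0])
    forecasts = []
    for d in demand:
        alpha = choose_alpha(d - forecast)             # the agent's decision point
        forecast = alpha * d + (1 - alpha) * forecast  # standard EMA update
        forecasts.append(forecast)
    return forecasts
```

The design contrast with the earlier versions is that the agent never touches order quantities directly; it only tunes how quickly the forecast reacts.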
SUPPLY CHAIN · ORACLE ABLATION · V5 CONTROL ARCHITECTURE
Phase 1 ablation: the LLM is removed entirely and replaced with oracle (ground-truth perfect) labels or causal rule-based classifiers. Fourteen architectural variants tested — multiplier maps, NEUTRAL redefinitions, order dampening, forecast oracle — 20 replications each. The Phase 2 gate was not passed for this architecture, and the intent-classification line was closed.
SUPPLY CHAIN · INTENT CLASSIFICATION · V4 WORLDEVENTS · 36-MONTH SIMULATION
Tests a 5-label discrete intent classification architecture (STRONG_INCREASE through STRONG_DECREASE) with a hard-coded multiplier lookup replacing the AI’s continuous float output from V3b. Four models across three information conditions (Blind, Context, Unstructured), 20–100 replications per condition, stochastic lead times and world events.
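The V4 control path is a classify-then-lookup pipeline, sketched below. Only the five label names come from the experiment description; the multiplier values and function names here are illustrative placeholders, not the study's actual map.

```python
# Hypothetical multiplier map: the LLM emits one of five discrete labels,
# and a hard-coded lookup converts it to an order multiplier.
INTENT_MULTIPLIERS = {
    "STRONG_INCREASE": 1.5,
    "INCREASE": 1.2,
    "NEUTRAL": 1.0,
    "DECREASE": 0.8,
    "STRONG_DECREASE": 0.5,
}

def order_quantity(base_order: float, intent: str) -> float:
    """Apply the fixed multiplier for a classified intent label."""
    if intent not in INTENT_MULTIPLIERS:
        raise ValueError(f"unknown intent label: {intent}")
    return base_order * INTENT_MULTIPLIERS[intent]
```

Constraining the model to five labels removes the continuous-output failure modes of V3b: the agent can only nudge orders in coarse, pre-audited steps.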
SUPPLY CHAIN · HYBRID ARCHITECTURE · 12,960 LLM CALLS
Three AI models were given direct control over the safety stock multiplier in a hybrid architecture where a mathematical formula computed the actual order quantities. Tested across three information conditions (Blind, Context, Stateful) with 20 replications per condition.
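The division of labour in this hybrid can be sketched with a textbook order-up-to formula. This is an assumed formulation for illustration; the study's exact formula, symbols, and parameter names are not reproduced here.

```python
import math

def order_up_to(forecast: float, sigma: float, lead_time: int,
                multiplier: float, inventory_position: float) -> float:
    """Deterministic side of the hybrid: the LLM sets only `multiplier`;
    the formula converts it into an order quantity.

    Target level = lead-time demand + (multiplier-scaled) safety stock.
    """
    safety_stock = multiplier * sigma * math.sqrt(lead_time)
    target = forecast * lead_time + safety_stock
    return max(0.0, target - inventory_position)
```

The point of the split is containment: a badly chosen multiplier distorts the buffer, but the order itself always comes from an auditable closed-form policy.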
SUPPLY CHAIN · EXPERIMENT WRITEUP · 11,520 LLM CALLS
Four LLM configurations (spanning frontier and local inference, lightweight and reasoning tiers) were evaluated against three heuristic baselines in a three-tier serial supply chain simulation. The published comparison combines eight backend-specific model-condition cells across four two-condition bundles, 20 replications per cell, 11,520 LLM calls.
SUPPLY CHAIN · SOVEREIGN MODELS · 1,440 LLM CALLS
Follow-up to Agentic Bullwhip Effect Version 2, testing whether a model trained on Indian data (sarvam-30b, 30B total / 2.4B active parameters) demonstrates stronger seasonal awareness on a synthetic Indian automotive demand series than a frontier foundation model. 2 conditions, 10 replications per condition.
SUPPLY CHAIN · EXPERIMENT WRITEUP · 720 LLM CALLS
Exploratory study examining the interaction between domain context provision and model capability on demand amplification. 2×2 factorial design (gpt-4.1-mini × o1, blind × context), 5 replications per condition, 720 LLM calls.
In progress
TPM · PREDICTIVE MAINTENANCE · VERNACULAR NORMALISATION
Evaluates LLM agent performance on TPM decision support when reasoning over fragmented, multilingual maintenance records representative of Indian manufacturing environments. Includes an experimental condition with a vernacular normalisation preprocessing layer.
Methodology
01
Controlled simulation environments
Parameters are synthetic but calibrated against published literature. All scenarios are fictional. No proprietary data is used.
02
Analytical baselines
Performance is measured against deterministic methods that already exist for the problem: exponential smoothing, order-up-to policies, SPC rules. The evaluation question is whether LLM agents add value beyond established approaches.
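As one concrete example of a deterministic baseline of this kind, here is a minimal SPC check (the basic Shewhart rule: flag points beyond k standard deviations of the in-control mean). It is a generic sketch, not a specific baseline used in any of the experiments above.

```python
from statistics import mean, stdev

def spc_violations(reference, observed, k: float = 3.0):
    """Return indices of `observed` points outside mean ± k·sigma,
    with mean and sigma estimated from an in-control `reference` window."""
    mu, sigma = mean(reference), stdev(reference)
    return [i for i, x in enumerate(observed) if abs(x - mu) > k * sigma]
```

Baselines like this set the bar an LLM agent has to clear: they are transparent, cheap, and already standard practice for the problem.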
03
Model performance across deployment surfaces
Each experiment tests models across frontier and local inference, spanning lightweight and reasoning tiers. Model tier, context treatment, and inference backend are treated as experimental factors, not background conditions.
04
Optimised architecture
Experimental results are used to identify where hybrid deployments (combining LLM agents with deterministic methods) may outperform either approach alone. The design question is not whether AI replaces the established method, but whether and where it can improve on it.
Infrastructure
About Me
My background is in industrial engineering. I work in digital engineering software, where AI tools meet the operational realities of industrial organisations.
This research program came from a direct question: where do LLM agents actually hold up in the domains I know best (manufacturing, supply chain, maintenance, and production planning) and where do they fall short?
Each experiment takes a foundational industrial engineering problem, builds a controlled simulation, places LLM agents in the decision-making role, and measures the result against what a deterministic method would have done.