Supply Chain · Sovereign Model · Experiment Writeup

sarvam-30b in Supply Chain Ordering: A Comparison with GPT OSS 120B

India's sovereign model produced no meaningful difference from GPT OSS 120B on supply chain ordering. Neither model detected Indian seasonal patterns. Exponential smoothing outperformed both by roughly 8x.

Agentic Bullwhip Effect Version 2 showed that no LLM configuration could beat a simple exponential smoothing heuristic on supply chain order variance or stockouts. This experiment asked a follow-up question: would a model trained on Indian data show stronger seasonal awareness on an Indian automotive demand series?

I tested sarvam-30b (India's sovereign model, 30B total parameters, 2.4B active) against GPT OSS 120B as a reference. The demand series was calibrated to real Indian PV market seasonality: festive peaks, monsoon troughs, fiscal year-end patterns. The agents received the current month name plus four numeric state variables. Any seasonal reasoning would have to come from the model's world knowledge.

The answer: no meaningful difference. sarvam-30b's chain OVAR of 4.504 vs GPT OSS 120B's 4.52 is well within noise in the tested context conditions. Neither model detected Indian seasonal demand patterns. Exponential smoothing (OVAR 0.54) won by roughly 8x.

One finding emerged from the integration work: enabling sarvam-30b's native reasoning flag (think=True) was associated with eliminating run-level failures without affecting supply chain outcomes.

Design & configuration

Model: sarvam-30b Q4_K_M · Sarvam AI MoE, 2.4B active of 30B total · India's sovereign model
Reference: gpt-oss:120B · Agentic Bullwhip Effect Version 2 results used for comparison; not re-run in Version 2a
Conditions: E1 context, think=False · E2 context, think=True · blind condition not run (see note below)
Replications: 10 runs per condition · V2d canonical dataset (top_p=1.0 per GGUF model card)
Primary metrics: Chain OVAR · stockout count · pattern score. Chain OVAR = arithmetic mean of per-tier Var(orders)/Var(demand); MPRD threshold: |ΔOVAR| ≥ 0.5
Supply chain: 3-tier serial, OEM → Ancillary → Component · Tatva Motors Vecta product family (fictional)
Demand series: 25 months (Jan 2025 to Jan 2027) · calibrated to real Indian PV market data · two full festive cycles
Lead time: 1 month, deterministic at all tiers
Initial inventory: ~43,600 units · mean + 1.65σ of demand series (~95% service level)
Agent design: stateless, no memory between periods · agents receive tier persona + month name + demand, on-hand inventory, backlog, and inventory position · no year, no event labels
Temperature: 1.0 in all conditions · mandated by GGUF model card; lower values cause 40-60% call failures on local inference
Inference: llama-server (llama.cpp) on a local NVIDIA GB10 Blackwell

Blind condition: Not run in main experiment. Pre-experiment calibration showed ~20% per-call error rates with minimal prompts, making 10-run completion infeasible. Individual smoke runs at temp=1.0 did complete (1/3 E1, 1/1 E2). A subsequent calibration pass produced 0/3 before being stopped. Context conditions only were used for all main runs.

OVAR interpretation: < 1.0 = dampening  ·  = 1.0 = pass-through  ·  > 1.0 = bullwhip amplification
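
The chain OVAR metric can be sketched in a few lines of Python; variable names here are illustrative, not from the experiment code:

```python
import statistics

def chain_ovar(orders_by_tier, demand):
    """Chain OVAR: arithmetic mean of per-tier Var(orders) / Var(demand)."""
    var_demand = statistics.pvariance(demand)
    return statistics.fmean(
        statistics.pvariance(orders) / var_demand for orders in orders_by_tier
    )

# A tier that doubles every demand swing amplifies variance fourfold:
demand = [1, 2, 3, 4]
print(chain_ovar([[2, 4, 6, 8]], demand))  # → 4.0 (bullwhip amplification)
print(chain_ovar([demand], demand))        # → 1.0 (pass-through)
```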

What I found

  1. No practically meaningful difference in task performance was observed in the tested conditions. sarvam-30b chain OVAR 4.504 ± 0.044 (E1) and 4.501 ± 0.093 (E2). GPT OSS 120B achieved 4.52 ± 0.05 (Agentic Bullwhip Effect Version 2 context reference; not re-run in Version 2a). The difference is 0.02, well within noise and below the MPRD threshold of 0.5. India's sovereign model did not produce different ordering behaviour in the tested context conditions, despite being trained on Indian data and the demand series being calibrated to Indian market seasonality.
  2. Neither model detected Indian seasonal demand patterns. Pattern scores: sarvam-30b 0.219–0.232, GPT OSS 120B 0.21, all equal within noise. The demand series contains two Indian festive cycles with documented peak months, a monsoon trough, and wedding season elevation. Agents received the month name plus four numeric state variables in the user prompt: no event labels, no year. Neither model showed seasonal awareness. Cultural training data did not produce measurably different ordering behaviour on this task.
  3. Enabling think=True was associated with eliminating run-level failures without changing supply chain outcomes. E1 (think=False): 15 of 25 run attempts needed replacement. E2 (think=True): 0 of 10 run attempts needed replacement. Chain OVAR difference between E1 and E2: 0.003. Note: E1 also had a separate API-flag/prompt conflict: the system prompt contained a “Think silently” instruction that contradicted the think=False flag. The reliability improvement in E2 cannot be fully isolated to the reasoning flag alone. Regardless, think=True produced reliable run completion and is the recommended configuration for local GGUF deployment of sarvam-30b via llama-server.
  4. The structural constraint holds. LLMs are stateless in this architecture: each agent sees the current period’s state and places an order with no memory of prior decisions. Exponential smoothing carries one number forward and partially self-corrects. The LLM agent cannot. The current results are consistent with an architectural bottleneck: stateless agents cannot self-correct across periods regardless of training data. Whether cultural training or model size would change outcomes in a stateful architecture is outside what this experiment can establish.
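
The stateless design in point 4 can be made concrete with a minimal sketch. Function and field names are hypothetical, not the experiment's actual code; the point is that each call is rebuilt from the current period's state alone:

```python
def build_prompt(tier, month, state):
    """One period's prompt: tier persona + month name + four numeric
    state variables. No history, no year, no event labels."""
    return (
        f"You are the {tier} planner for this supply chain tier. "
        f"Month: {month}. Demand: {state['demand']}, "
        f"on-hand: {state['on_hand']}, backlog: {state['backlog']}, "
        f"inventory position: {state['position']}. "
        "Reply with this month's order quantity as a single integer."
    )

# Built fresh every period; nothing from prior decisions survives.
p = build_prompt("OEM", "October",
                 {"demand": 41000, "on_hand": 43600,
                  "backlog": 0, "position": 43600})
```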

Numeric results

Heuristic baselines

Heuristic · Chain OVAR · Stockouts (of 75 possible)
Exponential smoothing · 0.54 · 5
Naive passthrough · 1.00 · 3
Order-up-to · 1.71 · 14
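
For reference, the exponential smoothing baseline can be sketched as below. The smoothing constant α = 0.3 is illustrative, not the experiment's actual parameter:

```python
def smoothing_orders(demand, alpha=0.3):
    """Order = smoothed demand level. The level is the single number
    carried forward between periods, which lets the policy self-correct:
    an over-order pulls the next level back down toward actual demand."""
    level = demand[0]
    orders = []
    for d in demand:
        level = alpha * d + (1 - alpha) * level
        orders.append(level)
    return orders

noisy = [100, 140, 80, 160, 90, 150, 70]
smoothed = smoothing_orders(noisy)
# Order variance ends up well below demand variance, hence OVAR < 1.
```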

sarvam-30b V2d canonical (top_p=1.0, context conditions only)

Condition · Chain OVAR (mean ± std) · Stockouts · Pattern · Run replacements
E1 Context (think=False) · 4.504 ± 0.044 · 39.9 · 0.219 · 15 / 25 attempts
E2 Context (think=True) · 4.501 ± 0.093 · 40.5 · 0.232 · 0 / 10 attempts
gpt-oss:120B (Version 2 reference) · 4.52 ± 0.05 · 39.6 · 0.21 · 0 / 20 attempts

Stockouts: count of tier-periods with shortfall > 0, out of 25 periods × 3 tiers = 75 tier-periods per run. Pattern score: mean of keyword and elevation scores at event months; elevation threshold: ratio > 1.10 above the per-tier median.
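
These two diagnostics can be sketched as follows. The keyword component of the pattern score is omitted, and the event-month indices are whatever the scenario defines:

```python
import statistics

def stockout_count(shortfalls_by_tier):
    """Count tier-periods with shortfall > 0 (max 75 = 25 periods x 3 tiers)."""
    return sum(1 for tier in shortfalls_by_tier for s in tier if s > 0)

def elevation_score(orders, event_months):
    """Share of event months where orders exceed the per-tier median by >10%."""
    median = statistics.median(orders)
    hits = sum(1 for m in event_months if orders[m] > 1.10 * median)
    return hits / len(event_months)
```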

V2d note: correcting top_p from llama.cpp default (0.95) to documented value (1.0) produced no meaningful change. Documented settings reproduce results faithfully.

Model-specific issues

Two model-specific issues required resolution before stable runs were possible. Both are reproducible and specific to local GGUF deployment of sarvam-30b via llama-server. GPT OSS 120B triggered neither across 20 runs.

Issue: API flag + prompt conflict. The think=False API flag conflicts with a "Think silently" instruction in the system prompt, so the model receives contradictory instructions about whether to use internal reasoning; this elevated error rates in E1. Fix: remove "Think silently" from the system prompt when using the think=False flag.

Issue: Blind condition structural failure. Minimal prompts (no context, no tier persona) produced ~20% per-call error rates. Individual smoke runs at temp=1.0 did complete, but a calibration pass produced 0/3 before being stopped, so 10-run completion was not feasible. Fix: not resolved; context conditions were used for all main runs, and the blind condition is not directly comparable to Version 2 blind results.

A broader observation from calibration: sarvam-30b is available as a cloud API and as a local GGUF. The two documentation sources give different temperature recommendations. Cloud: 0.2 (non-thinking). GGUF model card: 1.0 (reasoning). The GGUF card is correct for local inference. Cloud temperature recommendations cause 40-60% call failures on local GGUF deployment. This is documented in the calibration notes for reproducibility.

Seasonal awareness

The seasonal awareness hypothesis was not confirmed. I expected that a model with Indian training data would show different ordering behaviour on a demand series calibrated to Indian market seasonality. It did not.

The agents received the current month name plus four numeric state variables. No event labels, no seasonal context, no year. Any seasonal signal would have to come from the model's world knowledge about what October and November mean in India. Neither sarvam-30b nor GPT OSS showed it. Pattern scores of 0.22 and 0.21 indicate very low, approximately equal, seasonal sensitivity in both models.

One possible explanation: the task framing as a supply chain ordering problem may not activate seasonal world knowledge, even when the model possesses it. Whether the model draws on broader seasonal knowledge when asked to make an inventory decision (rather than a factual question about seasonality) is unclear from these results alone.

The structural argument also applies. Stateless agents without memory cannot self-correct drift regardless of what they know about seasonality. Even if a model correctly anticipated a festive peak in period 14, it cannot carry that anticipation forward without memory of what it planned.

What this means for practitioners

A model trained on Indian data does not automatically use that knowledge when asked to place a supply chain order. sarvam-30b received an Indian demand series and the current month name but showed no seasonal sensitivity. If seasonal awareness matters for your use case, you need to build it into the prompt or the system design explicitly. You cannot assume a model will apply its training knowledge to an operational task just because the task is in the same domain.

If you are running sarvam-30b locally via llama-server, use think=True. It removed all run-level failures in this experiment with no downside. Do not use cloud temperature settings for a local GGUF model. Use what the model card specifies (temperature 1.0 in this case).
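
As a sketch, a request body for llama-server's OpenAI-compatible /v1/chat/completions endpoint might look like this. Note the reasoning toggle (`enable_thinking` via `chat_template_kwargs`) is an assumption: the key name and whether your build honours it depend on the llama.cpp version and the model's chat template, so verify both before relying on it:

```python
def sarvam_request(messages, think=True):
    """Request body for llama-server's OpenAI-compatible endpoint.
    temperature and top_p follow the GGUF model card. The reasoning toggle
    via chat_template_kwargs is an ASSUMPTION; check your llama.cpp version
    and the model's chat template."""
    return {
        "model": "sarvam-30b",
        "messages": messages,
        "temperature": 1.0,  # model-card value; cloud's 0.2 fails on local GGUF
        "top_p": 1.0,        # documented value, not llama.cpp's 0.95 default
        "chat_template_kwargs": {"enable_thinking": think},
    }

body = sarvam_request([{"role": "user", "content": "Place this month's order."}])
```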

A model that fails consistently is actually easier to deal with than one that fails unpredictably. When a model errors out on some runs and not others with no clear pattern, you have a harder problem: you cannot rely on the output without checking it each time. Run your own reliability tests under the specific conditions you plan to deploy, rather than inferring reliability from benchmark scores.

Full code and results on GitHub

Full report, code, and data are available on GitHub.

ANALYSIS.md  ·  DESIGN.md  ·  COMPARISON.md

Scope and limitations

Context conditions only. Blind results not available for sarvam-30b. 10 runs per condition; Agentic Bullwhip Effect Version 2 used 20, so confidence intervals are wider here. Single product, single supply chain structure, fixed deterministic lead times. Stateless agents only, no inter-tier communication. Integration findings are specific to local GGUF deployment via llama-server; cloud API deployment is a different integration surface. Results should not be generalised to supply chain management broadly.

All scenarios, companies, products, and supply chain structures are fictional. The demand series is calibrated to real Indian PV market data but is synthetic. No proprietary data was used.
