WHITEPAPER

AI Model Benchmarks for Supply Chain: What Works in 2025

AI Evaluation, Supply Chain, Benchmarking, LLMs

Published March 2025

Abstract

A systematic evaluation of 12 AI model families across six supply chain task types, based on primary research conducted across 20 Indian mid-market manufacturers. We find significant performance variation by task type and context length, with domain-trained models consistently outperforming general-purpose models on industrial demand sensing tasks.

Key Findings

General-purpose LLMs achieve under 60% accuracy on industrial demand sensing, versus over 80% for domain-trained models
Context length matters more than model size for supply chain document processing tasks
Hybrid approaches (RAG + fine-tuning) outperform either approach alone on compliance classification
Human-in-the-loop requirements vary significantly by task type and consequence of error

Executive Summary

This whitepaper presents findings from a 6-month research programme evaluating AI model performance on supply chain tasks in Indian mid-market manufacturing contexts.

We evaluated 12 model families across six task types: demand forecasting, inventory optimisation, document processing, compliance classification, supplier risk scoring, and exception alerting.

The headline finding: performance varies more by task type than by model family. The deployment architecture and domain adaptation approach affect real-world performance more than the choice of underlying model.

Methodology

Research was conducted across 20 participant companies with revenues between ₹50Cr and ₹500Cr, across manufacturing sub-sectors including pharmaceutical APIs, auto components, industrial chemicals, and engineering goods.

For each company and task type, we measured:

Baseline accuracy of current approach (human or legacy system)
AI model accuracy on held-out test set
Decision quality improvement in 90-day live deployment
Planner override rate and reasons
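
As an illustration only, two of the measures above (held-out accuracy and planner override rate with reasons) could be computed along the following lines. This is a minimal sketch, not the study's actual tooling: the `Decision` record, its field names, and the helper functions are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Decision:
    """One model recommendation and what happened to it (hypothetical schema)."""
    predicted: str                      # what the model recommended
    actual: str                         # the outcome judged correct in hindsight
    overridden: bool = False            # did a planner override the model?
    override_reason: Optional[str] = None


def accuracy(decisions: List[Decision]) -> float:
    """Share of decisions where the model's output matched the correct outcome."""
    if not decisions:
        return 0.0
    correct = sum(1 for d in decisions if d.predicted == d.actual)
    return correct / len(decisions)


def override_rate(decisions: List[Decision]) -> float:
    """Share of decisions that a planner overrode."""
    if not decisions:
        return 0.0
    return sum(1 for d in decisions if d.overridden) / len(decisions)


def override_reasons(decisions: List[Decision]) -> Counter:
    """Tally of stated reasons across overridden decisions."""
    return Counter(
        d.override_reason
        for d in decisions
        if d.overridden and d.override_reason
    )
```

Keeping the override reason alongside each decision means the override rate can later be broken down by cause, rather than reported as a single number.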

Key Findings

Finding 1: General vs domain-specific models

General-purpose LLMs achieve 58% accuracy on industrial demand sensing tasks, compared with 81% for domain-adapted models, a gap of 23 percentage points. The gap is driven primarily by general-purpose models' inability to encode the demand drivers specific to industrial commodities.

Finding 2: Context and architecture matter more than size

For supply chain document processing (invoices, certificates, shipping documents), a smaller model with a longer effective context window outperforms larger models with shorter context. This has significant infrastructure cost implications, since smaller models are generally cheaper to host and run.

Finding 3: Hybrid approaches dominate

For compliance classification (specifically HS code classification and Rules of Origin assessment), hybrid approaches that combine retrieval-augmented generation with fine-tuning on domain-specific examples achieve 89% accuracy, compared with 76% for fine-tuning alone and 71% for RAG alone.
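
One way such a hybrid could be wired together is sketched below: a retrieval step narrows the candidate codes, and a scoring model picks among them. This is a toy illustration under stated assumptions, not the architecture evaluated in the research. The example store, the placeholder code labels (`CODE_FASTENER` and so on, which are not real HS codes), the word-overlap retrieval, and `stub_scorer` (a keyword stand-in for a fine-tuned classifier) are all hypothetical.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical labelled examples standing in for a retrieval index.
# The codes are placeholders, not real HS codes.
EXAMPLES: List[Tuple[str, str]] = [
    ("steel hex bolt m8 zinc plated", "CODE_FASTENER"),
    ("brake pad set for passenger car", "CODE_AUTO_PART"),
    ("amoxicillin trihydrate bulk powder", "CODE_ANTIBIOTIC"),
    ("stainless steel nut m10", "CODE_FASTENER"),
]

Scorer = Callable[[str, List[Tuple[str, str]]], Dict[str, float]]


def tokens(text: str) -> set:
    return set(text.lower().split())


def retrieve(query: str, k: int = 2) -> List[Tuple[str, str]]:
    """Return up to k examples most similar to the query (Jaccard word overlap)."""
    q = tokens(query)
    scored = []
    for text, code in EXAMPLES:
        t = tokens(text)
        union = q | t
        sim = len(q & t) / len(union) if union else 0.0
        scored.append((sim, text, code))
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(text, code) for sim, text, code in scored[:k] if sim > 0]


def stub_scorer(query: str, retrieved: List[Tuple[str, str]]) -> Dict[str, float]:
    """Stand-in for a fine-tuned model: keyword hits plus a bonus per retrieved example."""
    keyword_to_code = {
        "bolt": "CODE_FASTENER",
        "nut": "CODE_FASTENER",
        "brake": "CODE_AUTO_PART",
        "amoxicillin": "CODE_ANTIBIOTIC",
    }
    scores: Dict[str, float] = {}
    for word in tokens(query):
        code = keyword_to_code.get(word)
        if code:
            scores[code] = scores.get(code, 0.0) + 1.0
    for _, code in retrieved:
        scores[code] = scores.get(code, 0.0) + 0.5
    return scores


def classify_hybrid(query: str, scorer: Scorer, k: int = 2) -> Optional[str]:
    """Retrieve candidate codes, then let the scorer choose among them only."""
    retrieved = retrieve(query, k)
    candidates = sorted({code for _, code in retrieved})
    if not candidates:
        return None  # nothing similar on file: route to a human reviewer
    scores = scorer(query, retrieved)
    return max(candidates, key=lambda code: scores.get(code, 0.0))
```

In this sketch the scorer can only choose among codes that retrieval surfaced, and a query with no similar examples returns `None` so it can be escalated instead of guessed.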

To receive the full whitepaper, complete the form below. The PDF will be sent to your email.
