Dstl and Frazer-Nash demonstrate how large language models (LLMs) can cut through the vast quantities of data generated by modern wargames
Analysts at the Defence Science and Technology Laboratory (Dstl) have demonstrated that large language models (LLMs) can turn the torrent of data produced by modern military simulations into concise, secure intelligence, slashing the time commanders need to understand the outcome of complex battles.
A six-month study funded by Dstl and carried out with consultancy Frazer-Nash found that an LLM running on a closed, classified network can summarise sprawling wargame logs, pinpoint decisive moments and explain why one side prevailed — all without sending sensitive material to external cloud services such as ChatGPT.
“Even seasoned teams struggle to digest everything a single scenario can throw at them,” said Dr Helen Carter, the Dstl technical lead on the project. “By pairing a local LLM with retrieval-augmented generation, we have shown that analysts can ask plain-English questions and receive reliable, traceable answers in seconds rather than days.”
The research focused on Command: Modern Operations (CMO), a simulation platform used across the Ministry of Defence to rehearse multi-domain engagements involving ships, aircraft and land forces. A typical CMO run can generate gigabytes of time-stamped sensor tracks, weapon launches and communications logs. Until now, extracting lessons has required painstaking manual review.
Dstl and Frazer-Nash tested the LLM against two phases of realistic scenarios. In the first, the model was asked to summarise a week-long maritime campaign in the Indo-Pacific; in the second, it had to explain why a NATO armoured thrust faltered during a simulated Baltic crisis. Each time, the model was given raw CMO exports in PDF, CSV and XML formats and then quizzed on factors such as fuel states, radar cross-sections and rules-of-engagement decisions.
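The article does not describe the ingestion tooling, so the following is a minimal sketch, under assumed and much-simplified export schemas, of how heterogeneous CMO exports might be flattened into uniform text chunks for indexing. PDF exports would first pass through a separate text extractor; every field and tag name here is illustrative rather than a real CMO format.

```python
import csv
import io
import xml.etree.ElementTree as ET

def chunks_from_csv(raw: str) -> list[str]:
    """Flatten each CSV row (e.g. a time-stamped sensor track) into
    one 'key=value' text chunk that a retriever can index."""
    reader = csv.DictReader(io.StringIO(raw))
    return [", ".join(f"{k}={v}" for k, v in row.items()) for row in reader]

def chunks_from_xml(raw: str, record_tag: str = "Event") -> list[str]:
    """Flatten each <Event> element into a text chunk. The tag name
    is an assumption; real CMO exports may structure events differently."""
    root = ET.fromstring(raw)
    return [", ".join(f"{k}={v}" for k, v in el.attrib.items())
            for el in root.iter(record_tag)]
```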
To ensure accuracy, the team built a bespoke evaluation framework that scores every LLM answer for factual correctness, completeness and consistency with classified doctrine. Early runs achieved 87 per cent precision on key metrics, rising to 94 per cent after fine-tuning.
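Dstl has not published the evaluation framework itself. As an illustration of the kind of scoring the article describes, here is a minimal sketch in which each answer receives three boolean judgements and "precision" is read as the fraction of answers judged factually correct; that reading, and all the names below, are assumptions.

```python
from dataclasses import dataclass

@dataclass
class Judgement:
    correct: bool      # factually consistent with the underlying logs
    complete: bool     # covers every fact the reference answer requires
    consistent: bool   # does not contradict doctrine statements

def score(judgements: list[Judgement]) -> dict[str, float]:
    """Aggregate per-answer judgements into headline metrics."""
    n = len(judgements)
    return {
        "precision":    sum(j.correct for j in judgements) / n,
        "completeness": sum(j.complete for j in judgements) / n,
        "consistency":  sum(j.consistent for j in judgements) / n,
    }

# Example: 87 of 100 answers judged correct gives precision 0.87,
# matching the early-run figure quoted above.
judgements = ([Judgement(True, True, True)] * 87
              + [Judgement(False, True, True)] * 13)
print(score(judgements)["precision"])  # 0.87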
Crucially, the system never left the secure enclave. By combining a locally hosted open-source LLM with retrieval-augmented generation — a technique that grounds answers in specific documents rather than general web knowledge — the researchers eliminated the risk of sensitive data leaking to commercial providers.
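Neither the model nor the retrieval stack is named, but the underlying retrieval-augmented generation pattern is straightforward: documents are embedded into a vector index, the chunks most relevant to a question are retrieved, and the model answers from those chunks alone. In this illustrative sketch, embed() is a toy stand-in for an on-enclave embedding model and llm can be any locally served text-generation callable.

```python
import zlib
import numpy as np

def embed(texts: list[str], dim: int = 256) -> np.ndarray:
    """Toy deterministic embedding (hashed bag-of-words). A real
    deployment would call an on-enclave embedding model instead."""
    vecs = np.zeros((len(texts), dim))
    for i, text in enumerate(texts):
        for token in text.lower().split():
            vecs[i, zlib.crc32(token.encode()) % dim] += 1.0
    norms = np.linalg.norm(vecs, axis=1, keepdims=True)
    return vecs / np.maximum(norms, 1e-9)

def retrieve(question: str, chunks: list[str],
             chunk_vecs: np.ndarray, k: int = 3) -> list[str]:
    """Return the k log chunks most similar to the question."""
    scores = chunk_vecs @ embed([question])[0]
    return [chunks[i] for i in np.argsort(scores)[::-1][:k]]

def answer(question: str, chunks: list[str], llm, k: int = 3) -> str:
    """Ground the model's answer in retrieved log excerpts only."""
    chunk_vecs = embed(chunks)
    context = "\n---\n".join(retrieve(question, chunks, chunk_vecs, k))
    prompt = ("Answer using ONLY the log excerpts below, and cite the "
              f"excerpt you relied on.\n\nEXCERPTS:\n{context}\n\n"
              f"QUESTION: {question}")
    return llm(prompt)  # llm: any locally served generation callable
```

The property that matters for security is that every component in this loop, embedding, index and generator alike, runs inside the enclave; no call ever leaves the network.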
“Security was non-negotiable,” said Frazer-Nash’s principal consultant, James Whitworth. “We proved you do not need to choose between cutting-edge AI and operational secrecy.”
Beyond speed and security, the study highlighted training advantages. Junior analysts used the tool to explore “what-if” questions in real time, accelerating their understanding of joint-force dynamics. Senior officers, meanwhile, received executive summaries that distilled thousands of events into a handful of decision-critical insights.
Dstl says the same architecture can be adapted as simulations, data formats and evaluation criteria evolve. “The framework is deliberately modular,” Dr Carter explained. “If tomorrow’s wargames start streaming satellite imagery or cyber-effects logs, we can slot those straight into the pipeline without rebuilding the entire stack.”
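The modularity Dr Carter describes is commonly achieved with a registry pattern: each data source maps to a parser that emits indexable text, so a new source slots in without touching the retrieval or evaluation stages. A hypothetical sketch, with all names illustrative:

```python
from typing import Callable

# Registry mapping a source type to the function that turns its raw
# export into indexable text chunks.
PARSERS: dict[str, Callable[[str], list[str]]] = {}

def register(source_type: str):
    """Decorator: plug a new parser into the pipeline without
    modifying ingestion, retrieval or evaluation code."""
    def wrap(fn):
        PARSERS[source_type] = fn
        return fn
    return wrap

@register("cyber_effects")
def parse_cyber_log(raw: str) -> list[str]:
    """One chunk per non-empty log line (placeholder logic)."""
    return [line for line in raw.splitlines() if line.strip()]

def ingest(source_type: str, raw: str) -> list[str]:
    return PARSERS[source_type](raw)
```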
The MoD has already earmarked further funding to integrate the LLM toolkit into upcoming operational analysis exercises. Defence sources suggest the capability could be fielded at Permanent Joint Headquarters as early as next year, giving planners an AI assistant able to replay and critique every move of a mission as soon as it concludes.
Critics caution that any AI summary is only as good as the data it ingests, and warn against over-reliance on automated narratives when human judgement remains paramount. Dstl acknowledges the risk but argues that the evaluation framework provides an audit trail, allowing analysts to drill down to the underlying logs whenever the model’s reasoning needs verification.
For now, the research team is turning its attention to real-time support: feeding the LLM a live CMO stream so that commanders can pose questions while a scenario is still unfolding. If successful, the same techniques could eventually be applied to live operations, giving British forces an AI co-pilot able to sift battlefield sensors and suggest courses of action under strict security constraints.
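Real-time support would mean indexing events as they arrive rather than after the run. A speculative sketch, reusing the embed() helper from the retrieval example above and assuming events arrive as text lines:

```python
import numpy as np

class LiveIndex:
    """Incrementally indexed event store: each CMO event is embedded
    as it arrives, so analysts can query mid-scenario."""

    def __init__(self):
        self.chunks: list[str] = []
        self.vecs: list[np.ndarray] = []

    def push(self, event_text: str) -> None:
        """Index one newly arrived event."""
        self.chunks.append(event_text)
        self.vecs.append(embed([event_text])[0])

    def ask(self, question: str, llm, k: int = 3) -> str:
        """Answer from the k most relevant events seen so far."""
        scores = np.stack(self.vecs) @ embed([question])[0]
        top = np.argsort(scores)[::-1][:k]
        context = "\n".join(self.chunks[i] for i in top)
        return llm(f"Using only these events:\n{context}\n\nQ: {question}")
```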
Dr Carter is confident the breakthrough marks a turning point. “We have taken something that was once impenetrable and made it usable,” she said. “That means better decisions, faster training and, ultimately, greater resilience for UK forces.”