Enterprise AI Weekly #19

Quantization explained, how to understand and mitigate MCP poisoning, Gemini Flash 2.5 Lite makes AI cheaper again, AI and its impact on knowledge work and move over DeepSeek - MiniMax M1 is here

Jun 27, 2025

Welcome to Enterprise AI Weekly #19

Welcome to the Enterprise AI Weekly Substack, published by me, Paul O'Brien, Group Chief AI Officer and Global Solutions CTO at Davies.

Enterprise AI Weekly is a short-ish, accessible read, covering AI topics relevant to businesses of all sizes. It aims to be an AI explainer, a route into goings-on in AI in the world at large, and a way to understand the potential impacts of those developments on your business.

If you’re reading this for the first time, you can read previous posts at the Enterprise AI Weekly Substack page.

Yes, I’m back - just! After 2 weeks of non-laptop use (Mrs O did indeed win, and yes that’s me in that icon above), today was my first day back in the office. Cue lots, yes LOTS, of stuff to catch up on in my inbox and Teams, not to mention all the AI news over the last few weeks, that I confess I partly followed on my phone. 😀

It’s unbelievable that next week’s post will be #20. That’s 20 weeks, how time flies! I mentioned that some changes are afoot, and I’m pleased to announce that from next week, the Substack will be shareable externally for anyone to sign up. In addition, the post will now drop on a Friday morning, giving me my Thursday night ‘coding time’ (more on this later) to put the post together. Finally, there will be some other small adjustments, including a call to get you, dear readers, more involved in bringing AI to life. Stay tuned and enjoy #19.

Explainer: What is quantization (sic)?

At its core, a Large Language Model (LLM) stores its vast knowledge as a network of billions of numerical values, known as parameters or weights. Traditionally, these numbers are stored with high precision, often as 32-bit or 16-bit floating-point values, which can represent a wide range of decimal numbers. Quantization is the process of converting these high-precision numbers into a lower-precision format, such as 8-bit or even 4-bit integers. Think of it as reducing the file size of a very high-resolution digital photograph; the resulting image is much smaller and easier to handle, though you might lose some of the finest details in the process. This reduction in numerical precision is a key technique for making massive AI models more compact and efficient.

The primary benefit of quantization is a dramatic reduction in the model's size, which has several positive impacts. A smaller model requires significantly less memory and computational power to run, making it portable enough to operate on consumer-grade hardware like laptops and smartphones, a trend we've seen with the rise of on-device models for Copilot PCs and mobile devices. This increased accessibility means powerful AI can function offline, offering enhanced data privacy and zero-cost inference. Furthermore, because lower-precision numbers can be processed more quickly by modern computer chips, especially specialised Neural Processing Units (NPUs), quantized models often deliver faster responses, improving latency in applications like virtual assistants or agentic workflows. This efficiency also translates to lower energy consumption and reduced operational costs, a crucial factor as AI becomes more widespread.

However, this efficiency comes with a trade-off. By reducing the precision of the model's parameters, there is a risk of degrading its performance. The “rounding” process inherent in quantization can lead to a loss of nuance and detail in the model's knowledge, potentially resulting in slightly less accurate or coherent outputs. This is the main drawback of the technique. Despite this, the field has advanced rapidly, and modern quantization methods are incredibly sophisticated, often minimising the loss of accuracy to a point where it is negligible for most practical applications. For many, the compromise is well worth it, as quantization is the critical enabler that balances cutting-edge performance with the practical need for accessibility, allowing advanced AI to be deployed far beyond the confines of large data centres. DataCamp has an excellent article covering Quantization for Large Language Models.

Example of a random matrix of weights with four-decimal precision (left) with its quantized form (right) by applying rounding to one-decimal precision.

Identifying quantized models on open-source hubs like Hugging Face is usually straightforward, as several clues point to a model's reduced precision. The model's name itself is often the most direct indicator, frequently including terms like '4-bit', '8-bit', or acronyms for quantization methods such as GGUF and GPTQ. Beyond the name, the model's accompanying description or 'model card' will explicitly detail the quantization process used and the resulting bit-level. Finally, many popular models are released as a family with different sizes, such as Meta’s Llama 4 Scout. These smaller, more efficient variants are often either quantized or created using a similar size-reduction technique called distillation, making them easier to run on a wider range of hardware.

1. Understanding and mitigating MCP poisoning

It’s not often that a technical blog post makes you want to check your server logs and then immediately reach for a cup of tea, but CyberArk’s deep dive into Model Context Protocol (MCP) security does just that. If you’ve been following the rise of agentic AI and the adoption of MCP as a universal bridge between large language models (LLMs) and external tools, you’ll know it’s been heralded as a game-changer for workflow automation and integration. However, as this latest research shows, with great power comes a worrying attack surface.

The post details a set of vulnerabilities collectively known as Tool Poisoning Attacks (TPA) and then ups the ante with Full-Schema Poisoning (FSP) and Advanced Tool Poisoning Attacks (ATPA). While early research focused on attackers sneaking malicious instructions into tool descriptions, CyberArk demonstrates that every single field in a tool’s schema - names, types, required fields, even parameter names - can be weaponised. In one example, simply embedding a prompt in a parameter name was enough to trick the LLM into leaking sensitive data. Even more insidiously, ATPA exploits the outputs of tools - error messages or follow-up prompts - so that the LLM, believing it’s being helpful, might exfiltrate confidential information or perform unintended actions, all while the tool’s code and description appear perfectly benign.

Mitigating these threats is no small feat. The researchers recommend a shift to a zero-trust approach for all tool interactions: treat every schema field and every output as potentially adversarial. This means scanning not just descriptions but the entire schema for embedded prompts, enforcing strict allowlists of known-good structures, and monitoring runtime behaviour for signs of LLMs being manipulated into odd or risky actions. The bottom line: if you’re relying on MCP to connect your LLMs to business-critical systems, assume that every field and every output is an attack vector. Static analysis alone won’t save you - runtime auditing and robust validation are now table stakes.

As regular readers will know, we’ve covered MCP’s rapid adoption and its promise to make our agentic AI solutions more flexible and interoperable. But this research is a timely reminder that as we unlock new capabilities, we must also double down on security and governance. Our claims processes will increasingly rely on LLMs and automated agents - if these systems can be manipulated through tool poisoning, we risk data leakage, compliance breaches, or worse.

This isn’t a reason to halt innovation, but it is a call to action: we need to ensure that any MCP-based integration in our stack is subject to rigorous validation, schema allowlisting, and runtime monitoring. Our AI governance frameworks must treat external tool outputs as untrusted until proven otherwise, and our developers should be trained to spot the subtle signs of schema or output-based attacks. As MCP and agentic AI become more central to our operations, security can’t be an afterthought - it’s the foundation of safe, reliable automation.

Thank you, Che, for the heads-up!

2. Gemini 2.5 Flash-Lite is Google’s new cost-efficient AI model

Gemini 2.5 Flash-Lite has officially arrived, and it’s already making waves as Google’s most cost-effective, lowest-latency large language model to date. This new addition to the Gemini family is designed for businesses and developers who need to process high volumes of data quickly and affordably, without sacrificing too much on quality or reasoning ability. Flash-Lite is positioned as an upgrade path from previous Flash models, offering improved performance on most benchmarks, faster response times, and greater throughput – all at a lower price point.

Gemini 2.5 Flash-Lite is purpose-built for high-throughput, cost-sensitive scenarios. Its sweet spot is in tasks where speed and predictable costs are paramount, such as large-scale text classification and data labelling, summarisation of lengthy documents or datasets at scale, real-time content moderation or triage, customer service chatbots and virtual assistants that require rapid, consistent responses or bulk document processing, such as claims triage or compliance checks.

The model’s “reasoning” capability can be toggled on or off, with “thinking” disabled by default to maximise speed and minimise cost. For tasks that demand deeper analysis, developers can adjust the “thinking budget” via API, trading off a bit of latency for improved output quality.

Flash-Lite is now the lowest-cost option in the Gemini 2.5 range, specifically optimised for environments where both latency and cost are critical. Google has simplified its pricing structure: unlike previous models that had separate “thinking” and “non-thinking” prices, Flash-Lite keeps things straightforward, offering a single, highly competitive rate for both modes. This makes it particularly attractive for enterprise deployments where budget predictability is key and where tasks can be scaled up or down without fear of runaway costs.

Overview of our family of Gemini 2.5 thinking models

For enterprise operations, Gemini 2.5 Flash-Lite represents a new standard in scalable, cost-controlled AI deployment. Its ability to process large volumes of data in real time, with dynamic control over cost and quality, directly addresses the demands of modern business – whether it’s automating claims triage, powering customer support, or enabling large-scale document analysis. The model’s tight integration with Google’s ecosystem and its simplified pricing makes it an attractive option for any team looking to balance performance with budget discipline. In a landscape where every penny and every millisecond counts, Flash-Lite is a timely addition that could help us deliver more value, faster, and at lower cost than ever before.

3. What AI really means for knowledge work

A landmark field experiment and resulting paper entitled ‘Navigating the Jagged Technological Frontier: Field Experimental Evidence of the Effects of AI on Knowledge on Worker Productivity and Quality’, led by researchers from Harvard, Wharton, MIT, Warwick, and Boston Consulting Group (BCG) has put some hard numbers on a question many of us in the claims and consulting world are asking: does AI actually make knowledge workers more productive and deliver better quality? And, crucially, when does it help, and when does it hinder? The results, drawn from a large-scale experiment with 758 BCG consultants, challenge some common assumptions and offer a nuanced, practical roadmap for how we should approach AI in our own business.

The study introduces the concept of a “jagged technological frontier” to describe the current state of AI capabilities. In plain English, this means that AI (specifically, GPT-4) is very good at some tasks, even those we might think are complex or creative, but can be surprisingly poor at others that seem similar in difficulty. In the experiment, consultants were randomly assigned to work on realistic consulting tasks either with or without AI support. For tasks within AI’s capability - such as creative product development and persuasive writing - consultants using AI completed 12% more tasks, worked 25% faster, and produced results rated over 40% higher in quality than those working unaided. The biggest winners? Those who started off below average in performance, who saw their scores shoot up by 43% - but even the top performers improved by 17%.

However, when the task was deliberately chosen to be just outside AI’s current abilities (requiring subtle analysis and integration of quantitative and qualitative data), the tables turned. Consultants with AI were 19 percentage points less likely to get the right answer than those working without it. In other words, AI can make us more efficient and creative - unless we’re asking it to do something it’s not yet equipped for, in which case it can confidently lead us astray. It is noteworthy that this study took place at the end of 2023, and as such capabilities have clearly advanced since.

The researchers observed two main patterns of successful human-AI collaboration, which they dubbed “Centaur” and “Cyborg” behaviours. Centaur consultants strategically divided work between themselves and the AI, delegating subtasks based on each party’s strengths. For example, they might use AI for drafting but rely on their own expertise for analysis. Cyborgs, on the other hand, tightly integrated AI into every step, collaborating at a granular level - such as iteratively refining outputs, asking AI to validate logic, or assigning it a specific professional persona.

Interestingly, both approaches could be effective, but the key was knowing when to trust the AI and when to apply human judgement. Those who blindly copied AI outputs (“high retainment”) often did better on creative tasks, but this same behaviour led to worse outcomes on tasks outside the AI’s frontier. The study also found that while AI improved the average quality of ideas, it reduced the diversity of responses - raising questions about innovation and groupthink in teams heavily reliant on AI.

For a business like ours, these findings are more than academic. They highlight the importance of mapping our workflows to identify which tasks are firmly within AI’s current capabilities and which require nuanced human judgement. The “jagged frontier” means we cannot assume AI is always right. Instead, the most effective teams will learn to navigate this frontier - sometimes acting as Centaurs, sometimes as Cyborgs, and always ready to interrogate AI’s outputs when the stakes are high.

As we continue to embed AI tools into our operations, we must also be mindful of training: not just in how to use the technology, but in developing the critical thinking skills needed to know when to trust it, when to challenge it, and when to take the reins ourselves. This approach will help us deliver better outcomes for clients, avoid costly errors, and ensure that our human expertise remains front and centre - even as we harness the best that AI has to offer.

4. Move over DeepSeek - MiniMax M1 is the new standard-bearer for Open-Source AI

Shanghai-based MiniMax has made a splash in the AI world with the release of its MiniMax-M1 model, an open-source large language model (LLM) that is challenging both domestic and international rivals on performance, efficiency, and openness. Released under the Apache 2.0 licence, MiniMax-M1 is genuinely open source, a notable distinction from models like Meta’s Llama or DeepSeek, which carry more restrictive terms. This move gives enterprises and developers unprecedented freedom to deploy, adapt, and scale the model without licensing headaches or vendor lock-in.

What truly sets MiniMax-M1 apart is its technical prowess. The model boasts a staggering one million token context window - equivalent to ~750,000 words, or the entire Lord of the Rings trilogy - allowing it to process and reason over vast swathes of text in a single session. This dwarfs the capacities of most competitors; for example, OpenAI’s GPT-4o manages 128,000 tokens, while DeepSeek R1 tops out at 128,000 as well. MiniMax-M1’s hybrid architecture combines a Mixture-of-Experts (MoE) backbone with a novel Lightning Attention mechanism, which dramatically reduces the compute required for long-context tasks. At a generation length of 100,000 tokens, M1 uses only 25% of the computing power needed by DeepSeek R1.

The Lightning Attention mechanism sidesteps the traditional quadratic scaling bottleneck of Transformer models, enabling efficient handling of massive inputs without sacrificing performance. Put simply, Lightning Attention is a clever way for AI models like Minimax M1 to handle massive amounts of text quickly and efficiently. Rather than trying to look at every word in a massive document at once (which would be painfully slow and require loads of computer power), Lightning Attention breaks the text into smaller chunks, focuses on the most important bits in each chunk, and then stitches the key points together to understand the bigger picture. This means the model can keep track of much longer conversations or documents without getting bogged down or running out of memory. It’s like reading a whole book by skimming each chapter for the main ideas, then connecting those ideas to grasp the full story, making it both faster and much less demanding on the hardware.

Equally impressive is MiniMax-M1’s training efficiency. Leveraging a new reinforcement learning algorithm called CISPO (Clipped Importance Sampling Policy Optimisation), the model was trained in just three weeks on 512 Nvidia H800 GPUs, at a cost of approximately $534,700 - an order of magnitude less than what competitors have spent for similar feats. The result is a model that not only excels on industry benchmarks for reasoning, coding, and long-context tasks but does so with unmatched cost-effectiveness. On tests like AIME 2024, LiveCodeBench, and SWE-bench Verified, M1 is competitive with or outperforms many closed-source and open-weight models, including DeepSeek R1, Qwen3-235B, and even approaches the likes of GPT-4o and Gemini 2.5 Pro.

As enterprises increasingly look to harness AI for document analysis, code review, and knowledge management, the ability to process entire books, research archives, or codebases in a single pass is transformative. MiniMax-M1’s open-source nature means we can experiment, customise, and deploy at scale without worrying about restrictive licences or spiralling costs. Its efficiency and performance make it a compelling candidate for internal AI-powered tools, especially where context and reasoning over large datasets are paramount. In a landscape where AI is often locked behind proprietary APIs and eye-watering price tags, MiniMax-M1 offers a refreshing, pragmatic alternative - one that’s well worth our attention as we continue to drive innovation with AI at the heart of our businesses strategies.

5. Government AI trial: Civil servants save nearly two weeks a year

A landmark government trial has revealed that artificial intelligence could be the productivity boost civil servants never knew they needed. Over 20,000 civil servants were equipped with the latest AI tools, including Microsoft 365 Copilot, for three months. The results were impressive: on average, each participant saved 26 minutes per day on routine tasks like drafting documents, summarising meetings, and updating records. This adds up to nearly two weeks of time saved per year for each civil servant, freeing them up to focus on higher-value work, innovation, and delivering real impact for the public.

The trial’s impact was felt across departments. Policy officials used AI to cut through jargon and streamline consultations, while Work Coaches at the Department for Work and Pensions (DWP) sped up support for jobseekers by personalising advice and helping clients revitalise their businesses. At Companies House, staff handled customer queries and drafted responses more efficiently. The trial demonstrated that AI isn’t just a futuristic promise: it’s already making government work smarter, reducing red tape, and improving the use of taxpayers’ money.

Complementary research from the Alan Turing Institute suggests AI could support up to 41% of tasks across the public sector, with even greater potential in specific areas. For example, teachers spend almost one hundred minutes a day on lesson planning, up to 75% of which could be supported by AI, giving them more time in the classroom. Civil servants spend about 30 minutes daily on emails, and AI could cut this by over 70%. These findings are part of the government’s broader push to modernise the state, aiming to save £45 billion by simplifying and automating delivery, moving services online, and reducing fraud and error through digital compliance.

Our world is filled with routine admin, document drafting, and endless email chains. The government’s AI trial shows how much time can be clawed back by putting the right tools in people’s hands. Imagine our own teams saving 26 minutes a day - nearly two weeks a year - on repetitive tasks. That’s more time for complex claims, customer care, and innovation. The trial’s success is a timely reminder: AI isn’t just for tech giants or government departments. It’s a practical, proven way to boost productivity, reduce admin fatigue, and help us focus on what really matters - delivering better outcomes for our clients and partners.

Thank you, Mat, for the heads-up!

POB’s closing thoughts

I have waxed lyrical previously about my love for Meta’s AI powered Ray-Bans, so I was excited this week to see them announce Meta AI powered Oakley glasses, with upgraded specs that I hope will also become available in the regular glasses.

Meta are no longer the only players in the game either - Xiaomi this week announced their own AI glasses, with similar features but some neat improvements such as native USB C charging and electrochromic lenses. Competition is good!

Finally, I came across an interesting Tweet / Post / X? this week detailing a new AI model from ByteDance (yes, the TikTok people) and Carnegie Mellon researchers, that takes a single photo and turns it into fully editable parts for 3D printing. Super clever.

That’s it from me this week. Happy Friday, and I hope you have a great weekend. It’ll be virtual Glastonbury in the garden of the O’Brien house, making the most of the nice UK weather! 👍

I’d love to hear your feedback on whether you enjoy reading the Substack, find it useful, or if you would like to see something different in a future post. What AI topics are you most interested in for future explainers? Are there any specific AI tools or developments you'd like to see covered? Remember, if you have any questions around this Substack, AI or how Davies can help your business, you can reply to this message to reach me directly.

Finally, remember that while I may mention interesting new services in this post, you shouldn’t upload or enter business data into any external web service or application without ensuring it has been explicitly approved for use.

Disclaimer: The views and opinions expressed in this post are my own and do not necessarily reflect those of my employer.