Thirteen Hours From Disclosure to Exploitation
CVE-2026-33626 is a critical remote code execution vulnerability in LMDeploy, a high-performance LLM inference and serving framework developed by Shanghai AI Laboratory and widely used to deploy models including LLaMA, Qwen, InternLM, and Baichuan at enterprise scale. Exploitation was confirmed in the wild just 13 hours after the vulnerability was publicly disclosed on April 24, a window that gives virtually no time for an organisation to identify its exposure before attacks begin.
The vulnerability is a deserialization flaw in LMDeploy's model loading API. When a client submits a model configuration or adapter to a running LMDeploy inference server, the server deserializes the payload without input validation. A crafted payload causes the server to execute arbitrary operating system commands as the user running the inference service, typically with broad access to the GPU host and any mounted model storage. The CVSS score is 9.8.
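To illustrate the class of bug rather than LMDeploy's exact code path (which the advisory does not publish), the sketch below shows how naive deserialization of an untrusted payload, here using Python's pickle module, hands the attacker command execution. The MaliciousAdapterConfig class and the "adapter config" framing are hypothetical.

```python
import os
import pickle


class MaliciousAdapterConfig:
    """Hypothetical attacker payload. Any object whose __reduce__ returns
    a callable plus arguments gets that callable executed by pickle.loads()."""

    def __reduce__(self):
        # Unpickling this object runs os.system("id") on the server
        return (os.system, ("id",))


# Attacker side: serialise the payload and submit it as an "adapter config"
payload = pickle.dumps(MaliciousAdapterConfig())

# Vulnerable server-side pattern: deserialising untrusted input directly.
# This line executes the attacker's command as the inference service user.
pickle.loads(payload)
```

Per the advisory, the 0.8.4 fix follows the standard remediation for this class of flaw: validate submitted configurations against a schema and disable the unsafe deserialization code paths.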
Who Is Affected
LMDeploy is used primarily by organisations that self-host large language models rather than consuming them through managed APIs. Affected deployments include:
- Enterprise AI teams running on-premises inference servers exposed via internal APIs
- Research institutions and universities hosting models for internal or external access
- AI product teams building services on top of self-managed inference infrastructure
- Development environments where LMDeploy is exposed on open ports during prototyping
LMDeploy versions prior to 0.8.4 are affected. The framework has approximately 14,000 GitHub stars and is included in several popular open-source AI deployment stacks. A common misconfiguration pattern, exposing the inference API on a broad interface during development and not restricting access before moving to production, has left many instances directly internet-accessible.
The AI Infrastructure Exploitation Pattern
CVE-2026-33626 is the third critical vulnerability in a major AI/ML infrastructure framework to be exploited within days of disclosure in 2026. CVE-2026-33017 (Langflow) was weaponised within 20 hours; CVE-2026-39987 (Marimo Python notebooks) saw mass exploitation within days; now LMDeploy at 13 hours. The exploitation window is contracting.
The common thread is that these platforms are deployed by teams who think of them as internal development infrastructure, but expose them to the internet without the hardening applied to production web services. Inference servers running popular models are indexed by Shodan, FOFA, and Censys; attackers scan for service banners and version strings. When a proof-of-concept for a critical deserialization flaw is published alongside the CVE, the time to exploitation is measured in hours.
Inference servers are high-value targets beyond the code execution capability itself. They typically run on GPU-equipped hosts, have access to proprietary training data and fine-tuned model weights, and connect to internal data pipelines and APIs. Compromising an inference server may provide access to business-critical AI assets that are not replicated in standard backup and recovery infrastructure.
Recommended Actions
- Patch LMDeploy to version 0.8.4 immediately; the fix adds schema validation and disables unsafe deserialization code paths in the model loading API
- Audit all LMDeploy instances for internet exposure: check cloud security group rules and firewall policies for open access to LMDeploy default ports (23333, 23334); use Shodan or your cloud provider's exposure assessment tools (a minimal reachability check is sketched after this list)
- Restrict inference API access to authenticated callers on known IP ranges or behind an API gateway; inference servers should never be directly internet-accessible
- Check GPU host logs for compromise indicators on any instance that was internet-exposed: unexpected processes, modified model files, outbound connections to unknown IPs, and unusual network activity from the inference service user account (see the process and connection sketch after this list)
- Apply network segmentation to AI infrastructure; inference servers should be on separate segments from data stores, CI/CD pipelines, and model registries
- Add LLM inference platforms to your asset inventory and vulnerability management programme: if they are not in scope for patching, you cannot respond in time when the next disclosure occurs
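As a first pass on the exposure audit above, a short reachability check against the default LMDeploy API ports can be run from a vantage point outside the deployment's own network segment. This is a minimal sketch: the host names are placeholders, and a successful connection only confirms that a TCP listener answers, not which service or version is behind it.

```python
import socket

# Placeholders: replace with your inference server addresses
HOSTS = ["inference-01.example.internal", "203.0.113.10"]
# LMDeploy default API server ports noted above
PORTS = [23333, 23334]


def is_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


for host in HOSTS:
    for port in PORTS:
        state = "OPEN" if is_reachable(host, port) else "closed/filtered"
        print(f"{host}:{port} -> {state}")
```

Anything that answers from outside your own network should be treated as exposed and moved behind the gateway and IP restrictions described above.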
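For the compromise-indicator review, one starting point is to enumerate processes and established outbound connections owned by the inference service account and compare them against a known-good baseline (model registries, data stores, expected upstream endpoints). The sketch below assumes the third-party psutil library and a placeholder account name; it is a triage aid, not a substitute for a full forensic review.

```python
import psutil

# Placeholder: the account your inference service runs as
SERVICE_USER = "lmdeploy"

for proc in psutil.process_iter(["pid", "name", "username", "cmdline"]):
    # Only inspect processes owned by the inference service account
    if proc.info["username"] != SERVICE_USER:
        continue
    cmdline = " ".join(proc.info["cmdline"] or [])
    print(f"[{proc.info['pid']}] {proc.info['name']} {cmdline}")
    try:
        # Established outbound connections: review remote addresses
        # against your known-good baseline
        for conn in proc.connections(kind="inet"):
            if conn.status == psutil.CONN_ESTABLISHED and conn.raddr:
                print(f"    -> {conn.raddr.ip}:{conn.raddr.port}")
    except (psutil.AccessDenied, psutil.NoSuchProcess):
        pass
```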
A 13-hour exploitation window is not an anomaly; it is now the baseline expectation for high-profile open-source framework vulnerabilities. Organisations without AI infrastructure in their asset inventory cannot act on what they cannot see.