Date: 2025-06-18

vLLM has rapidly emerged as the de facto inference engine for serving large language models, celebrated for its high throughput, low latency, and efficient use of memory through paged attention. While much of the spotlight has focused on GPU-based deployments, the absence of GPUs shouldn't stop you from experimenting with vLLM or understanding its capabilities.

vLLM and LLMs: A match made in heaven

In this article, I’ll walk you through how to run vLLM entirely on CPUs in a bare OpenShift cluster using nothing but standard Kubernetes and open source tooling. Because I am a performance engineer by craft, we’ll also dive into some fun performance-focused experiments that help explain the current state of the art in LLM inference benchmarking.