[{"content":"Gateway API Inference Extension v1.4.0 shipped on March 20 with 101 commits from 54 contributors, 13 of them first-timers! I\u0026rsquo;ve been studying GIE internals for the past few weeks as part of onboarding to llm-d, which builds its inference scheduler on top of GIE\u0026rsquo;s Endpoint Picker (EPP). So I\u0026rsquo;ve been watching this release closely.\nWhat is GIE? Quick context if you\u0026rsquo;re new to this ecosystem. GIE is a Kubernetes SIG project (kubernetes-sigs/gateway-api-inference-extension) that turns any ext-proc capable proxy, think Envoy Gateway or kgateway, into an inference-optimized load balancer for self-hosted LLMs. The core component is EPP, the Endpoint Picker. It sits behind Envoy as an ext-proc filter, intercepts every request, parses the model name from the JSON body, and runs a Filter -\u0026gt; Score -\u0026gt; Pick scheduling pipeline to choose the best backend pod. The design is modeled after kube-scheduler\u0026rsquo;s plugin framework.\nThe framework reorg The most structurally significant change in 1.4 is the codebase reorganization. All plugin interfaces and scheduling types moved under epp/framework/, with strict import confinement enforced by a validation script. Scheduling plugins, request control plugins, and data layer plugins each got their own subdirectory under epp/framework/plugins/.\nThis sounds like housekeeping, but it\u0026rsquo;s not. Before this change, plugin interfaces were scattered across packages and it wasn\u0026rsquo;t always clear what a downstream consumer (like llm-d) should import vs what was internal. Now there\u0026rsquo;s a clean boundary: epp/framework/interface/ is the stable surface, everything else is implementation detail. I traced this through PRs #2195, #2192, #2230, and #2286 to see the progression.\nA related change: scorer weights changed from int to float in #2207. 
Small API change, but it means you can now express finer-grained weight ratios between scorers without integer rounding.\nStandalone EPP EPP can now be deployed as its own Helm chart, independent of the InferencePool chart. The work landed across several PRs starting with #2122, with a user guide and resource configuration support in #2273.\nThis matters for anyone running EPP with custom plugins. Previously you had to deploy the full InferencePool chart even if all you wanted was your own EPP binary with custom scorers registered. Now you can build your own EPP image, point the standalone chart at it, and deploy it next to an InferencePool managed separately.\nPluggable BBR The Body-Based Router (BBR) is a separate ext-proc server that runs before EPP. Its job is simple: parse the HTTP body, extract the model field, and write it as a header so the gateway can route by model at the header level. In 1.3 this was a fixed implementation.\nIn 1.4, BBR got a plugin framework of its own (#2121, #2209). There\u0026rsquo;s now a configurable body-fields-to-headers plugin (#2417) so you can extract arbitrary fields from the request body and promote them to headers for routing decisions.\nThe request and response paths were also refactored to accept raw bytes and support pluggable parsers (#2409, #2410). This is groundwork for the gRPC support that\u0026rsquo;s coming (the API piece already landed, see below).\nData layer refactoring The data layer, which is how EPP collects pod metrics and state, got a significant overhaul. The HTTP datasource was refactored in #2120, and a new distinction was introduced between polling-based and notification-based data sources (#2320, #2407).\nBefore 1.4, all data collection was HTTP polling: EPP scrapes each pod\u0026rsquo;s Prometheus /metrics endpoint on an interval. Now there\u0026rsquo;s a clean interface for notification-based sources too, like Kubernetes watch events. 
Plugin execution order across layers is validated at startup (#2333), which catches misconfiguration early.\nThis aligns with what was proposed in Proposal 1023 (Data Layer Architecture). The data layer is now genuinely pluggable rather than hardcoded to HTTP metric scraping.\nFlow control maturity Flow control, the priority and fairness system that decides what happens under overload, got several important changes:\nPriority band garbage collection (#2097) so empty priority bands don\u0026rsquo;t accumulate forever Concurrency saturation detection (#2062) as a new signal for when endpoints are overloaded FailOpen as the default (#2365) on InferencePool, meaning if EPP is down the gateway forwards traffic to backends directly instead of failing requests Fairness and ordering policy migration (#2188, #2193) with the interflow package renamed to fairness and intraflow to ordering, which makes the code much easier to read The FailOpen default is worth highlighting. In 1.3, if EPP went down, all inference traffic stopped. That\u0026rsquo;s the right default for safety during development, but the wrong default for production. Flipping to FailOpen means a crashed or restarting EPP doesn\u0026rsquo;t take down your inference endpoint. You lose smart routing but keep serving.\ngRPC and multimodal Two API-level additions worth noting:\ngRPC support landed via appProtocol on InferencePool (#2162) with ALPN h2 support for TLS (#2385). This is the API piece of Proposal 2162. The full implementation (gRPC-to-gRPC routing, HTTP-to-gRPC transcoding) is coming in phases. Both vLLM and SGLang expose gRPC endpoints, so this opens the door to binary-framed inference with lower overhead than JSON.\nMultimodal inputs now include video and audio format support (#2181), alongside the existing image support. 
Plus the Responses API and Conversations API (#2133) for alternative OpenAI-compatible endpoints.\nLatency prediction gets PD-aware The predicted latency scorer (renamed from \u0026ldquo;slo-aware-router\u0026rdquo; in #2183) now understands prefill/decode disaggregation (#2361). It can make different latency predictions for prefill pods vs decode pods, and handles disaggregated mode filtering in #2390.\nLatency prediction also moved from scoring time to the PrepareData step (#2319), which means predictions are computed once and shared across scoring plugins via CycleState rather than recomputed by each scorer that needs them.\nWhat this means for llm-d I\u0026rsquo;m onboarding to the llm-d team, which builds a disaggregated inference framework on top of GIE. llm-d\u0026rsquo;s inference scheduler extends GIE\u0026rsquo;s plugin system with custom filters (decode-filter, prefill-filter), scorers (prefix-cache-scorer, load-aware-scorer), and profile handlers (pd-profile-handler) that implement prefill/decode separation. Here\u0026rsquo;s what 1.4 means for that stack.\nThe framework reorg is the biggest deal. llm-d imports GIE\u0026rsquo;s plugin interfaces to register its own scorers and filters. A clean epp/framework/interface/ boundary means llm-d can pin to a stable API surface instead of reaching into internal packages. This should reduce breakage on GIE version bumps, which has been a real friction point.\nStandalone EPP unlocks cleaner deployment. llm-d already builds its own EPP binary with custom plugins registered at build time. The standalone chart means llm-d doesn\u0026rsquo;t need to fork the InferencePool Helm chart just to swap in its EPP image. Deploy InferencePool for your model server pods, deploy standalone EPP with llm-d\u0026rsquo;s plugins separately.\nScorer weight floats help. llm-d runs multiple scorers in its scheduling profiles: prefix cache affinity, load-aware distribution, KV cache utilization. 
Tuning the balance between these with integer weights was coarse. Float weights let you express things like \u0026ldquo;prefix cache affinity matters 1.5x more than load distribution\u0026rdquo; without scaling everything up.\nPD-aware latency prediction is directly relevant. llm-d\u0026rsquo;s whole architecture is built around disaggregated prefill and decode. Having the latency predictor understand that prefill pods and decode pods have fundamentally different latency characteristics means this scorer becomes useful for llm-d deployments out of the box, instead of needing a custom replacement.\nPluggable data layer means custom metrics. llm-d\u0026rsquo;s vLLM pods expose metrics beyond what the standard model server protocol requires, things like NIXL transfer latency and per-adapter cache hit rates. The new notification-based data source interface could let llm-d push metrics to EPP via watch events rather than relying solely on HTTP polling, which would reduce metric staleness for fast-moving state like KV cache occupancy.\nFailOpen default is the right call for production. llm-d deployments are typically multi-pod with both prefill and decode pools. A crashed EPP shouldn\u0026rsquo;t stop all inference. With FailOpen, traffic falls back to round-robin until EPP recovers. You lose cache-aware routing temporarily, but you keep serving.\nThe bigger picture The pattern I see: GIE is following a trajectory similar to kube-scheduler. Start with a monolithic implementation, identify the extension points, formalize them as plugin interfaces, enforce boundaries so plugins don\u0026rsquo;t accidentally depend on internals, then let the ecosystem build on top. It took kube-scheduler several releases to get the plugin framework right. 
GIE is doing it faster, probably because the pattern isn\u0026rsquo;t new.\nReferences GIE v1.4.0 release GIE docs ","permalink":"https://hexfusion.io/posts/gie-1.4-framework-release/","summary":"Gateway API Inference Extension v1.4 landed with 101 commits from 54 contributors. The headline isn\u0026rsquo;t a single feature, it\u0026rsquo;s that GIE became a real plugin framework. Here\u0026rsquo;s what changed and why it matters if you\u0026rsquo;re building on top of it.","title":"GIE 1.4: the framework release (and what it means for llm-d)"},{"content":"\nGo has never had a clean story for SIMD. If you wanted vector instructions, you wrote assembly stubs by hand, used unsafe pointer tricks, or accepted the compiler\u0026rsquo;s auto-vectorization (which is conservative at best). Go 1.26 changes that.\nWhat shipped The new simd/archsimd package is gated behind GOEXPERIMENT=simd. I found the design in proposal #73787 and the package docs. The API exposes typed SIMD vector types like Int8x16, Float32x4, and Int32x4 with methods for arithmetic, comparison, and mask extraction.\nGOEXPERIMENT=simd go build ./... Without the flag, import \u0026quot;simd/archsimd\u0026quot; does not resolve. This is intentional since the API is experimental and may change before the flag is removed.\nThe API design The approach is explicit, not magic. There is no auto-vectorization. You tell the compiler exactly what SIMD operations you want.\nTo show what this means in practice, here is the same operation in C and Go. 
The goal: given an array of 16 bytes, find which one matches a search byte, using a single SIMD comparison instead of a loop.\nC (SSE2 intrinsics):\n#include \u0026lt;immintrin.h\u0026gt; #include \u0026lt;stdint.h\u0026gt; int find_match(uint8_t keys[16], uint8_t search) { // Load 16 bytes into a 128-bit register __m128i key_vec = _mm_loadu_si128((__m128i*)keys); // Fill all 16 lanes with the search byte __m128i cmp_vec = _mm_set1_epi8((char)search); // Compare all 16 lanes at once __m128i result = _mm_cmpeq_epi8(key_vec, cmp_vec); // Extract one bit per lane into an integer int mask = _mm_movemask_epi8(result); if (mask == 0) return -1; return __builtin_ctz(mask); // index of first match } Go (archsimd):\nimport ( \u0026#34;math/bits\u0026#34; \u0026#34;simd/archsimd\u0026#34; \u0026#34;unsafe\u0026#34; ) func findMatch(keys *[16]byte, search byte) int { // Load 16 bytes into a 128-bit register keyVec := archsimd.LoadInt8x16((*[16]int8)(unsafe.Pointer(keys))) // Fill all 16 lanes with the search byte cmpVec := archsimd.BroadcastInt8x16(int8(search)) // Compare all 16 lanes at once mask := keyVec.Equal(cmpVec) // Extract one bit per lane into an integer bitmask := mask.ToBits() if bitmask == 0 { return -1 } return bits.TrailingZeros16(bitmask) // index of first match } Same four steps, same machine instructions underneath. The Go version reads like normal code instead of requiring you to know that _mm_loadu_si128 loads unaligned data or that _mm_set1_epi8 broadcasts a byte. The compiler maps each call directly to the hardware instruction: LoadInt8x16 -\u0026gt; VMOVDQU, BroadcastInt8x16 -\u0026gt; VPBROADCASTB, Equal -\u0026gt; VPCMPEQB, ToBits -\u0026gt; VPMOVMSKB.\nWhat it feels like in practice To test this beyond a micro-benchmark, I built go-simd-art, an Adaptive Radix Tree with SIMD-accelerated Node16 lookups. 
The original 2013 paper by Leis, Kemper, and Neumann describes a specific SIMD optimization: broadcast the search byte to all 16 lanes, compare at once, extract the match position from a bitmask. Existing Go ART implementations (plar/go-adaptive-radix-tree, arriqaaq/art) skip this because there was no clean way to express it.\nWith archsimd, the hot path in node16.go is five lines:\nkeys := archsimd.LoadInt8x16((*[16]int8)(unsafe.Pointer(\u0026amp;n.Keys))) cmp := archsimd.BroadcastInt8x16(int8(c)) mask := keys.Equal(cmp) bitmask := mask.ToBits() bitmask \u0026amp;= (1 \u0026lt;\u0026lt; n.Count) - 1 No assembly files. No //go:noescape pragmas. No build tags. The unsafe.Pointer cast from *[16]byte to *[16]int8 is the one rough edge since archsimd works with signed types, but for equality comparison signedness does not matter.\nOn an AMD Ryzen AI 9 HX 370, the SIMD path is 15% faster than scalar on dense trees (where Node16 dominates) and 14% faster on random key workloads:\nBenchmark Scalar SIMD Improvement SearchDense 23.99 ns/op 20.29 ns/op -15.4% Search (random) 351.2 ns/op 302.1 ns/op -14.0% Insert 1479 ns/op 1266 ns/op -14.4% All operations are zero-allocation on the lookup path. Full benchmarks and source are in go-simd-art.\nRough edges Signed types only. archsimd provides Int8x16 but not Uint8x16. Most data works with unsigned bytes. For equality this does not matter, but ordered comparisons (less-than) would need signedness handling. The proposal discussion acknowledges this.\nGOEXPERIMENT everywhere. Every go build, go test, go run, and CI invocation needs the flag. IDE tooling may not support it yet. This is the biggest practical barrier.\nAMD64 only. The current implementation targets x86-64 SSE/AVX. ARM NEON support is mentioned in the proposal but not yet available.\nWhen does this matter? 15% on a hot inner loop is real for databases, routers, and parsers that do millions of lookups per second. For most application code, it does not. 
The value of archsimd is not that every Go program gets faster. It is that Go programs that currently drop to assembly for performance-critical paths can stay in pure Go.\nThe experiment flag means this is not production-ready yet, but the direction is clear. If you have a data structure with a tight loop over a fixed-size array, archsimd is worth trying now to see what the compiler produces.\nParts of this post were written with assistance from Claude.\n","permalink":"https://hexfusion.io/posts/go-simd-art/","summary":"Go 1.26 shipped simd/archsimd behind GOEXPERIMENT=simd, giving Go native SIMD intrinsics for the first time. I tried it on a real data structure to see what it feels like in practice.","title":"Go finally gets SIMD in 1.26"},{"content":"This is a follow-up to Part 1, where the v1 adapter produced 600 tab characters oops. I moved training to the RTX 3060, hit a few more walls, and eventually got a v2 run to complete.\nv2 Training Setup After v1 I moved to the RTX 3060 (endor). The goal was to train on the full 41k sample dataset with one pass through the data.\nI had to learn what sm_86 means along the way. NVIDIA assigns every GPU a compute capability version sm_ stands for streaming multiprocessor, and the number is essentially the GPU\u0026rsquo;s feature set version. The T4 is sm_75 (Turing, 2018). The RTX 3060 is sm_86 (Ampere, 2020). I found the full table in NVIDIA\u0026rsquo;s CUDA GPU list. The reason it matters here: bf16 arithmetic is only natively supported from sm_80 onward, which is why the T4 couldn\u0026rsquo;t use it and the 3060 can.\nGetting there took a few more attempts.\nTransformers 5.x OOM: The first run failed immediately on model load with an out-of-memory error. Not during training, before a single step ran. I eventually found that transformers 5.x changed how quantized models load. The new code materializes the full model in GPU memory before applying BnB quantization, which blows past 12GB. 
The fix was to pin to an older version: pip install 'transformers\u0026lt;5.0'. I landed on 4.57.6.\nCUDA OOM at step 50: The model loaded. Training started. At step 50 it crashed with another OOM. This one was the logits tensor from computing loss over long sequences. I dropped --max-len from 512 to 256, but it still crashed. After digging through the PyTorch CUDA memory management docs I decided to try PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True. The docs describe it as a mode where the allocator grows segments incrementally rather than pre-reserving large contiguous blocks — which matters when you\u0026rsquo;re carving memory for a large model and a large logits tensor at the same time. That cleared it.\nAfter those two fixes the run went clean.\nThe Numbers v2 trained for 17.4 hours on the RTX 3060, 2,483 steps, bf16, full 41k dataset.\nv1 v2 Hardware T4 (sm_75) RTX 3060 (sm_86) Dataset 5k samples 41k samples Precision fp32 bf16 Runtime ~2h 17.4h Final loss 7.76 0.918 Token accuracy 4.2% 82.7% Loss around 0.9 is in the range I\u0026rsquo;d expect for a reasonable adapter. The token accuracy jump from 4% to 82% is significant. v1 was basically random. v2 had actually learned something from the training data.\nLoading the Adapter The adapter came out of training as a directory with adapter_config.json, adapter_model.safetensors, and the tokenizer files. Total size: around 20MB for the safetensors file, a few KB for config.\nI wasn\u0026rsquo;t sure how to get it into the running vLLM instance. I ended up stopping the decode pod, mounting the adapter directory as a hostPath volume, and passing --enable-lora --lora-modules go-adapter=/adapter as args to vLLM. 
When the pod came back up, both the base model and the adapter showed up in /v1/models:\n{\u0026#34;id\u0026#34;: \u0026#34;Qwen/Qwen2.5-7B-Instruct-AWQ\u0026#34;, \u0026#34;parent\u0026#34;: null, ...}, {\u0026#34;id\u0026#34;: \u0026#34;go-adapter\u0026#34;, \u0026#34;parent\u0026#34;: \u0026#34;Qwen/Qwen2.5-7B-Instruct-AWQ\u0026#34;, ...} You switch between them per-request by setting \u0026quot;model\u0026quot;: \u0026quot;go-adapter\u0026quot; in the request body. The base model is still available at the same endpoint. I hadn\u0026rsquo;t really thought about it before but this is a clean way to A/B test: same server, same GPU, just a different model name.\nThe Comparison Same prompt as Part 1:\nWrite a Kubernetes controller reconcile function in Go using controller-runtime that handles not-found gracefully. Base model:\nmyCustomRes := \u0026amp;MyCustomResource{} if err := cl.Get(ctx, req.NamespacedName, myCustomRes); err != nil { if apiErr, ok := err.(*client.StatusError); ok \u0026amp;\u0026amp; apiErr.ErrStatus.Reason == metav1.StatusReasonNotFound { return ctrl.Result{}, nil } return ctrl.Result{}, fmt.Errorf(\u0026#34;error getting MyCustomResource: %v\u0026#34;, err) } It works but the not-found check isn\u0026rsquo;t quite right. The idiomatic pattern in controller-runtime is errors.IsNotFound(err) from k8s.io/apimachinery/pkg/api/errors. The base model does a type assertion on client.StatusError directly, which is a less portable pattern.\nAdapter (go-adapter v2):\nimport ( \u0026#34;k8s.io/apimachinery/pkg/api/errors\u0026#34; ctrl \u0026#34;sigs.k8s.io/controller-runtime\u0026#34; \u0026#34;sigs.k8s.io/controller-runtime/pkg/controllerutil\u0026#34; \u0026#34;sigs.k8s.io/controller-runtime/pkg/log\u0026#34; ) //+kubebuilder:rbac:rules=[{apiGroups=[],resources=[\u0026#34;pods\u0026#34;],verbs:[\u0026#34;get\u0026#34;,\u0026#34;list\u0026#34;,\u0026#34;watch\u0026#34;]}] func (mc *MyController) SetupWithManager(mgr ctrl.Manager) error { ... 
} func (mc *MyController) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) { logger := log.FromContext(ctx) ... } The adapter reaches for k8s.io/apimachinery/pkg/api/errors which is the right package. It also generates kubebuilder:rbac markers and SetupWithManager patterns that come directly from real controller-runtime codebases. It knows the log.FromContext(ctx) idiom.\nI tried a few more prompts.\netcd leader election:\nThe base model invented API calls that don\u0026rsquo;t exist in the etcd client library (etcdClient.Grant, etcdClient.Lock as I tried to call them). The adapter pulled a real import path and attempted a working approach, even if the leader election logic wasn\u0026rsquo;t complete. Neither was production-ready but the adapter at least referenced APIs that actually exist.\ngRPC server streaming:\nBoth models struggled here. The server-streaming handler signature in Go (func (s *server) StreamFoo(req *pb.Req, stream pb.Service_StreamFooServer) error) is unusual enough that I\u0026rsquo;d say both got it wrong on the first try. The adapter at least mentioned stream.CloseSend() which is a real method. But this was the weakest result of the three tests.\nWhat Actually Changed The v1 collapse was about data volume. 
5k samples and one epoch aren\u0026rsquo;t enough for the model to learn semantic patterns; it finds a local minimum (the most common token: tab) and sticks there.\nv2 is better at:\nIdiomatic Go imports: it reaches for the right packages (k8s.io/apimachinery/pkg/api/errors, sigs.k8s.io/controller-runtime) rather than inventing things Controller-runtime patterns: SetupWithManager, log.FromContext, kubebuilder:rbac markers all showed up unprompted Error handling shape: wraps errors correctly, returns ctrl.Result{} in the right places It still hallucinates:\nAPIs within packages: the etcd example used real imports but made up method calls Complex interface implementations: gRPC streaming signatures were close but wrong Struct fields and config: details that require having seen the specific package My read is that 41k samples over one epoch moves the needle on general patterns but isn\u0026rsquo;t enough to reliably reproduce specific API surface. The model learned the shape of Go code from these repos more than the exact APIs.\nWhat\u0026rsquo;s Next The plan was always to compare this with InstructLab: the same domain done the \u0026ldquo;right\u0026rdquo; way with taxonomy files and synthetic data generation rather than raw extraction from source trees. That\u0026rsquo;s Phase 2. 
I don\u0026rsquo;t know yet whether the results will be better or just different.\nA few things I\u0026rsquo;d try before calling this adapter done:\nMore epochs on the same data the loss was still declining at the end of v2 Better data quality I skipped manual review, and there\u0026rsquo;s definitely noise in the 41k pairs Targeted prompts in the training set the leader election and gRPC failures suggest those patterns aren\u0026rsquo;t well-represented in the data Part 3 will cover synthetic data generation using Red Hat\u0026rsquo;s SDG Hub and training with Training Hub.\nReproduce It Same training script as Part 1: train_qlora.py\n# v2 run (RTX 3060, bf16, full dataset) PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \\ python train_qlora.py \\ --model Qwen/Qwen2.5-7B-Instruct \\ --data training_data.jsonl \\ --output output/v2 \\ --epochs 1 --max-len 256 --bf16 References QLoRA: Efficient Finetuning of Quantized LLMs: Dettmers et al., 2023 BitsAndBytes: 4-bit quantization library TRL SFTTrainer: Supervised fine-tuning trainer vLLM LoRA docs: Serving adapters in vLLM controller-runtime: K8s controller library used as training source ","permalink":"https://hexfusion.io/posts/lora-go-training-pt2/","summary":"The v2 adapter trained overnight on 41k samples. Loss 0.918, accuracy 82.7%. I loaded it into vLLM and ran the same prompts. Here\u0026rsquo;s what came out.","title":"Fine-tuning a Go expert: does it actually work? (Part 2)"},{"content":"Quick Reference What is LoRA? LoRA (Low-Rank Adaptation) is a technique for fine-tuning large language models without retraining all their weights. Instead of updating 7 billion parameters, you inject small trainable matrices into the model\u0026rsquo;s attention layers and train those. The result is an adapter \u0026ndash; think of it like a git diff for the model. The base model (~15GB) stays frozen; the adapter (~20MB) patches its behavior for your domain.\nWhat is QLoRA? QLoRA combines LoRA with 4-bit quantization. 
You load the base model in 4-bit (about 4GB for 7B params) so it fits in GPU memory, then train the LoRA adapters at full precision. This is what made fine-tuning the 7B model possible on a 16GB GPU.\nAdapter vs model: The adapter is not a new model. It\u0026rsquo;s more like a git diff. To serve it, you load the base model and apply the adapter on top. vLLM supports this natively: you can serve the base model and multiple adapters from a single GPU, switching per-request by model name.\nThis is the third post in my series on building a bare metal llm-d lab. In part 1 I set up disaggregated prefill/decode inference. In part 2 I replaced hostNetwork with DRANet to fix EPP routing. Now I want to train a domain-specific adapter and serve it on the same stack.\nThe goal: a model that\u0026rsquo;s better at Go distributed systems patterns. Controller loops, Raft consensus, gRPC, operator patterns. The kind of code that lives in kubernetes/kubernetes, etcd-io/etcd, and cockroachdb/cockroach.\nThe Data Before I could train anything, I needed training data. I looked into the formats used for code fine-tuning and landed on instruction/output pairs: a natural language description and the code that answers it.\nFor sources I picked some of my favorite Go projects, the repos I\u0026rsquo;ve spent the most time reading and learning from. I figured if I want the model to write code that looks like good Go, these are probably the right teachers.\nI wrote a Go tool that walks a source tree, parses every .go file with go/ast, and extracts functions that have doc comments. 
The comment becomes the instruction; the function body becomes the output.\n// source repos repos := []string{ \u0026#34;repos/kubernetes\u0026#34;, \u0026#34;repos/etcd\u0026#34;, \u0026#34;repos/grpc-go\u0026#34;, \u0026#34;repos/containerd\u0026#34;, \u0026#34;repos/consul\u0026#34;, \u0026#34;repos/cockroach\u0026#34;, \u0026#34;repos/stdlib\u0026#34;, } This produced 41,805 pairs across seven repos: 14k from Go stdlib, 10k each from cockroach and kubernetes, and the rest from etcd, consul, grpc-go, and containerd.\nEach pair looks like:\n{ \u0026#34;instruction\u0026#34;: \u0026#34;reconcileHandler processes a work item from the queue...\u0026#34;, \u0026#34;output\u0026#34;: \u0026#34;func (c *Controller) reconcileHandler(...) {\\n\\t...\u0026#34; } I skipped manual review and set the quality bar at \u0026ldquo;has a doc comment and is at least 5 lines.\u0026rdquo; There\u0026rsquo;s noise in there, but from what I read, fine-tuning tends to be fairly forgiving of noisy data when you have enough of it.\nThe Hardware Training node: dagobah an old Xeon workstation with a Tesla T4 16GB GPU. The T4 is a datacenter card from 2018, built for inference. I figured it would be fine for training too. That assumption got tested and was mostly wrong heh.\nServing: the existing llm-d P/D stack from the previous posts. Prefill on the T4, decode on an RTX 3060.\nTraining Setup I went with Qwen/Qwen2.5-7B-Instruct as the base the non-quantized version. More on why not the AWQ version in a moment.\nFor the training stack I used PEFT for LoRA, TRL SFTTrainer for the training loop, and BitsAndBytes for 4-bit quantization. 
I found this combination through the Hugging Face QLoRA guide, which walks through exactly this setup and is where most people seem to start.\nLoRA config:\nRank r=16, alpha=32 (standard starting point) Target modules: q_proj, k_proj, v_proj, o_proj (attention layers) ~10M trainable parameters out of 7.6B total about 0.13% The Gauntlet Nothing worked on the first try.\nAttempt 1 AWQ base + BnB 4-bit:\nValueError: You cannot load an AWQ model and quantize it with BitsAndBytes I started with the AWQ model since that\u0026rsquo;s what I was already running in the lab. Turns out AWQ is itself a quantized format you can\u0026rsquo;t quantize it again. Switched to Qwen2.5-7B-Instruct, the full-precision base.\nAttempt 2 fp16 AMP training:\nNotImplementedError: \u0026#34;_amp_foreach_non_finite_check_and_unscale_cuda\u0026#34; not implemented for \u0026#39;BFloat16\u0026#39; After digging into it, I found that the T4 (sm_75) doesn\u0026rsquo;t support bfloat16 natively \u0026ndash; and Qwen2 stores its internal tensors in bf16. PyTorch\u0026rsquo;s AMP gradient scaler hits these during the backward pass and crashes.\nI tried casting all model parameters and buffers to fp16 to work around it, but that didn\u0026rsquo;t help. The bf16 turns out to persist inside the loss function itself, not in the model weights. I wasn\u0026rsquo;t able to find a clean fix for this on the T4.\nAttempt 3 CUDA OOM with batch=4:\ntorch.OutOfMemoryError: CUDA out of memory I wasn\u0026rsquo;t sure why until I thought through the numbers 7B in 4-bit in a batch of 4 sequences pushes past 16GB thanks claude. Dropped to batch=1.\nAttempt 4 fp32, it runs: Disabling both fp16 and bf16 (fp16=False, bf16=False in the trainer config) forces full fp32 training. The T4 seems to handle fp32 fine. It runs but slowly, around 21 seconds per step.\nFor a first run I subsampled to 5,000 pairs and ran for one pass through the data. 
About two hours total.\nThe SFTTrainer logs progress every few steps and prints a summary when training finishes:\n{\u0026#39;loss\u0026#39;: 7.7646, \u0026#39;grad_norm\u0026#39;: ..., \u0026#39;learning_rate\u0026#39;: ...} ... {\u0026#39;train_runtime\u0026#39;: 7427.0, \u0026#39;train_samples_per_second\u0026#39;: ..., \u0026#39;train_loss\u0026#39;: 7.7646} Final stats: loss=7.76, token accuracy=4.2%.\nServing the Adapter The adapter came out as a 20MB adapter_model.safetensors file. I wasn\u0026rsquo;t sure how to serve it alongside the base model, but vLLM has a --lora-modules flag that makes it straightforward:\nargs: - --model - Qwen/Qwen2.5-7B-Instruct-AWQ - --enable-lora - --lora-modules - go-adapter=/adapter Both show up in /v1/models:\n{\u0026#34;id\u0026#34;: \u0026#34;Qwen/Qwen2.5-7B-Instruct-AWQ\u0026#34;, ...}, {\u0026#34;id\u0026#34;: \u0026#34;go-adapter\u0026#34;, \u0026#34;parent\u0026#34;: \u0026#34;Qwen/Qwen2.5-7B-Instruct-AWQ\u0026#34;, ...} You switch between them by setting \u0026quot;model\u0026quot;: \u0026quot;go-adapter\u0026quot; in the request:\ncurl http://localhost:8200/v1/chat/completions \\ -H \u0026#34;Content-Type: application/json\u0026#34; \\ -d \u0026#39;{ \u0026#34;model\u0026#34;: \u0026#34;go-adapter\u0026#34;, \u0026#34;messages\u0026#34;: [{\u0026#34;role\u0026#34;: \u0026#34;user\u0026#34;, \u0026#34;content\u0026#34;: \u0026#34;Write a controller reconcile loop in Go\u0026#34;}], \u0026#34;max_tokens\u0026#34;: 400 }\u0026#39; I expected this to need a lot more GPU memory, but the adapter weights are small enough that it wasn\u0026rsquo;t an issue.\nThe Test Same prompt, two models:\nWrite a Kubernetes controller reconcile function in Go using controller-runtime that handles not-found gracefully. Base model (Qwen/Qwen2.5-7B-Instruct-AWQ):\ninstance := \u0026amp;v1.MyResource{} err := r.Get(ctx, req.NamespacedName, instance) if err != nil { return ctrl.Result{}, client.IgnoreNotFound(err) } // ... 
reconciliation logic return ctrl.Result{}, nil Looks reasonable to me. The client.IgnoreNotFound pattern is what I\u0026rsquo;d expect, function shape is right.\nAdapter (go-adapter):\n[600 tab characters] Not a rendering issue. The adapter literally outputs tab characters until it hits the token limit.\nWhat Went Wrong I had to think about this for a bit. Go source code is tab-indented throughout: every function body, every if, every for starts with one or more \\t characters. With only 5k samples and one pass through the data, the adapter didn\u0026rsquo;t seem to learn semantic patterns at all. It learned the most statistically common token in the training data: the tab.\nI found out this is called degenerate collapse. As best I understand it, the model found a local minimum where \u0026ldquo;output tabs\u0026rdquo; scores lower loss than random tokens, so it got stuck there. Getting past it apparently requires enough data that the model is forced to learn something more meaningful.\nThe training numbers make more sense in hindsight. I read that a well-adapted model should land somewhere around loss 1.5-2.5. At 7.76, the adapter barely moved from its starting point. Token accuracy of 4.2% is close to random. The pipeline mostly works; I think the adapter just needs a lot more training to be useful.\nWhat\u0026rsquo;s Next My plan is to try the full dataset on the T4 in fp32 first since that\u0026rsquo;s the path I know works. 
Longer term I want to figure out the fp16 issue properly, or just move training to the RTX 3060 and see if bf16 support there means I don\u0026rsquo;t have to.\nPart 2 will cover whichever path wins and what the adapter actually looks like when it works.\nReproduce It Requirements: NVIDIA GPU (12GB+), Python 3.11, CUDA, Go 1.21+\npip install \u0026#39;transformers\u0026lt;5.0\u0026#39; peft trl bitsandbytes datasets torch Training script: train_qlora.py\n# v1 run (T4, fp32, 5k subsample) python train_qlora.py \\ --model Qwen/Qwen2.5-7B-Instruct \\ --data training_data.jsonl \\ --output output/v1 \\ --epochs 1 --max-samples 5000 # Serve adapter python -m vllm.entrypoints.openai.api_server \\ --model Qwen/Qwen2.5-7B-Instruct-AWQ \\ --enable-lora --lora-modules go-adapter=/path/to/adapter Training data was extracted from public Go repos using go/ast to pull every documented function as an instruction/output pair. Any JSONL file with instruction and output fields works.\nReferences LoRA: Low-Rank Adaptation of Large Language Models Hu et al., 2021. The original paper. QLoRA: Efficient Finetuning of Quantized LLMs Dettmers et al., 2023. How to train on quantized models. Hugging Face QLoRA guide Where I found the PEFT + TRL + BitsAndBytes stack. Hugging Face PEFT LoRA implementation used here. TRL SFTTrainer Supervised fine-tuning trainer. vLLM LoRA docs How to serve adapters in vLLM. ","permalink":"https://hexfusion.io/posts/lora-go-training-pt1/","summary":"I trained a LoRA adapter on 41k Go code examples from the Kubernetes and etcd source trees. The first run produced 600 tab characters. Here\u0026rsquo;s what I learned.","title":"Fine-tuning a Go expert: LoRA on a $300 GPU (Part 1)"},{"content":"Architecture Quick Architecture Reference hostNetwork (before) DRANet (after) ──────────────────── ────────────── Node IP: 192.168.1.233 192.168.1.233 Prefill pod IP: 192.168.1.233 (shared!) 10.42.1.62 (own CNI IP) Decode pod IP: 192.168.1.233 (shared!) 
10.42.0.40 (own CNI IP) Prefill port: 8300 (remapped to avoid conflict) 8000 Decode port: 8000 8000 EPP sees: 1 IP, 2 ports → picks one → bug 2 IPs, same port → routes both RDMA path: host namespace (shared) pod namespace (moved by DRA) RDMA device: /dev/infiniband/ via device plugin /dev/infiniband/ via DRANet+NRI Network model: broken (bypasses CNI) standard (CNI + DRA side-by-side) Key concepts to know:\nDRA (Dynamic Resource Allocation): K8s API for requesting hardware (like GPUs, NICs) as schedulable resources. GA in 1.34. NRI (Node Resource Interface): containerd hook that lets DRANet inject into pod creation to move NICs. ResourceSlice: How DRANet publishes discovered NICs to the cluster (auto-populated). DeviceClass: CEL-based selector for which devices pods can claim (like StorageClass but for devices). ResourceClaim: A pod\u0026rsquo;s request for a specific device, with optional config (IP, MTU, etc). RDMA netns shared mode: RDMA link device stays on host but is visible from pod. Both namespaces can use it. No exclusive locking needed for single-pod-per-NIC setups. In my previous post I built a bare metal llm-d lab. Two nodes, a T4 and an RTX 3060, connected with 25GbE Mellanox ConnectX-4 Lx NICs over a direct DAC cable. Disaggregated prefill/decode inference with KV cache transfer over RDMA.\nI got it working. Then I hit a wall.\nThe hostNetwork Trap The RDMA device plugin gives your pod /dev/infiniband/ access, but it doesn\u0026rsquo;t give you network routing to the RDMA NIC. hostNetwork does both.\nSo I did what the guides said. And it worked for a single pod. The moment I deployed disaggregated prefill/decode with two pods on different nodes, the EPP (Endpoint Picker) scheduler silently dropped my prefill pod. Requests only ever hit decode. No errors, no warnings, just\u0026hellip; silence.\nThis is llm-d#632. The root cause: hostNetwork forces all pods on a node to share the host IP. 
When you have prefill and decode both wanting port 8000, you have to remap one of them. But the EPP\u0026rsquo;s InferencePool only expects one target port. Change prefill to 8300? EPP ignores it. Add multiple targetPorts? EPP picks one. There\u0026rsquo;s no clean workaround.\nThe fundamental issue is that hostNetwork breaks the pod networking model that everything else in the stack assumes.\nDRANet: Just Use DRA DRANet is a kubernetes-sigs project that takes a completely different approach. Instead of giving your pod the host\u0026rsquo;s network stack, it uses Kubernetes Dynamic Resource Allocation (DRA) to move the physical RDMA NIC into the pod\u0026rsquo;s own network namespace.\nEach pod gets:\nIts own CNI IP (normal pod networking, nothing special) The physical RDMA NIC with its own IP, moved in by DRANet Full RDMA device access (mlx5_0, uverbs, the works) No hostNetwork. No port conflicts. No EPP hacks.\nThe Migration My lab runs k3s v1.34 on Fedora. Here\u0026rsquo;s what the migration looked like.\nWhat I removed # no longer in pod spec hostNetwork: true dnsPolicy: ClusterFirstWithHostNet resources: limits: rdma/hca_shared_devices_a: \u0026#34;1\u0026#34; The entire k8s-rdma-shared-dev-plugin DaemonSet, deleted.\nWhat I added DRANet, one DaemonSet install:\nkubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/dranet/refs/heads/main/install.yaml It discovers your NICs automatically. 
Within seconds, kubectl get resourceslices showed both my Mellanox cards with their IPs, PCI addresses, and RDMA capability.\nA DeviceClass to select RDMA-capable NICs:\napiVersion: resource.k8s.io/v1 kind: DeviceClass metadata: name: rdma-net spec: selectors: - cel: expression: device.driver == \u0026#34;dra.net\u0026#34; - cel: expression: device.attributes[\u0026#34;dra.net\u0026#34;].rdma == true - cel: expression: device.attributes[\u0026#34;dra.net\u0026#34;].virtual == false ResourceClaims for each pod, selecting the specific NIC and configuring its IP:\napiVersion: resource.k8s.io/v1 kind: ResourceClaim metadata: name: rdma-prefill spec: devices: requests: - name: rdma-nic exactly: deviceClassName: rdma-net selectors: - cel: expression: device.attributes[\u0026#34;dra.net\u0026#34;].ifName == \u0026#34;ens1np0\u0026#34; config: - opaque: driver: dra.net parameters: interface: addresses: - \u0026#34;10.0.0.2/24\u0026#34; Pod spec references the claim:\nspec: containers: - name: vllm resources: limits: nvidia.com/gpu: \u0026#34;1\u0026#34; claims: - name: rdma securityContext: capabilities: add: [\u0026#34;IPC_LOCK\u0026#34;] resourceClaims: - name: rdma resourceClaimName: rdma-prefill That\u0026rsquo;s it. Both pods listen on port 8000. No remapping.\nWhat it looks like inside the pod $ kubectl exec deploy/vllm-prefill -- ip addr show 2: eth0@if19: \u0026lt;BROADCAST,MULTICAST,UP,LOWER_UP\u0026gt; inet 10.42.1.62/24 # CNI IP, normal pod networking 5: ens1np0: \u0026lt;BROADCAST,MULTICAST,UP,LOWER_UP\u0026gt; inet 10.0.0.2/24 # Physical RDMA NIC, moved in by DRANet $ kubectl exec deploy/vllm-prefill -- rdma link show link mlx5_0/1 state ACTIVE physical_state LINK_UP netdev ens1np0 The pod has its own IP for service traffic and the physical RDMA NIC for zero-copy KV cache transfer. NIXL/UCX uses mlx5_0:1 for RDMA just like before. Nothing changes on the application side.\nThe Result Sent a request through the gateway. EPP routed it to the decode pod. 
Decode coordinated with prefill. Prefill did the compute, transferred the KV cache over RDMA, decode generated tokens.\nExternal prefix cache hit rate: 100.0% KV Transfer metrics: Avg xfer time (ms)=18.596, Throughput (MB/s)=94.106 Both pods received traffic. The bug is gone.\nPrerequisites A few things that need to be in place and survive reboot:\nRDMA kernel modules: /etc/modules-load.d/rdma.conf with rdma_cm, rdma_ucm, ib_umad RDMA NIC IPs via NetworkManager: persistent connections with autoconnect=yes (DRANet configures the IP inside the pod, but NM handles it when no pod is running) NRI: containerd 2.x has it enabled by default. DRANet uses NRI to hook into pod creation DRA: GA in Kubernetes 1.34+, feature-gated in 1.32-1.33 Why This Matters The hostNetwork + device plugin approach is what every bare metal RDMA guide recommends today. It works for single pods. It breaks the moment you need multiple pods with different ports on the same node, which is exactly what disaggregated inference requires.\nDRANet is the upstream answer. It\u0026rsquo;s a kubernetes-sigs project, it works with existing CNI plugins, and it treats RDMA NICs as schedulable resources instead of a networking hack. For anyone building llm-d on bare metal, this is the path forward.\n","permalink":"https://hexfusion.io/posts/dranet-bare-metal-rdma/","summary":"hostNetwork is the default recommendation for RDMA in Kubernetes. It breaks disaggregated inference. DRANet replaces it with DRA-based NIC assignment and fixes the problem cleanly.","title":"DRANet: the fix for bare metal RDMA in Kubernetes"},{"content":"I\u0026rsquo;ve been building servers since my twenties. Back in the early 2000s, before cloud as we know it today was a thing, my first setup was a pair of old Intel 1U servers in my basement. Nothing special, just enough to build and host websites and learn how services actually worked. 
When you do things this way, you can\u0026rsquo;t help but learn all of the components.\nThat instinct, to understand systems by building them, is what led me to set up a bare-metal llm-d lab on hardware I could actually afford. Most of the llm-d ecosystem runs on H100 clusters in cloud data centers. My question was simple: how can you maximize inference using the hardware you already have?\nThe Goal Get a bare-minimum working example of llm-d\u0026rsquo;s disaggregated prefill/decode architecture with RDMA-based KV cache transfer between two physical nodes. Not a benchmark. Not a production deployment. Just proof that the full stack works end-to-end on used hardware you can buy on eBay.\nChoosing the Hardware The NICs: ConnectX-4 Lx RDMA was the non-negotiable starting point. llm-d\u0026rsquo;s disaggregated architecture transfers KV cache between prefill and decode pods over the network, and RDMA is what makes that transfer fast enough to be worthwhile. Without RDMA, you\u0026rsquo;re shipping gigabytes of KV cache over TCP, and the network overhead eats any benefit from disaggregation.\nI went with Mellanox ConnectX-4 Lx (MCX4111A-ACAT), single-port 25GbE SFP28 cards. They\u0026rsquo;re the sweet spot for a home lab: cheap on the secondary market (~$30-40 each), well-supported by the inbox mlx5_core driver (no OFED install needed), and they do RoCE v2 out of the box. I connected them with a direct-attach copper (DAC) cable, no switch needed for two nodes.\nThe GPUs RTX 3060 (12GB, Ampere) was already in my workstation. 12GB is enough for a quantized 7B model with room for KV cache.\nTesla T4 (16GB, Turing) is the most accessible used datacenter inference GPU. 16GB VRAM, 70W TDP. Note: it\u0026rsquo;s passively cooled, so in a tower chassis you\u0026rsquo;ll need extra fans pointed at it or it will throttle.\nAny sm_75+ card with enough VRAM works. 
A used RTX 2080 Ti would do the same job.\nThe Servers Both are Supermicro dual-socket boards accumulated over years:\nendor (decode): dual AMD Opteron 6272, 92GB RAM, RTX 3060 dagobah (prefill): dual Xeon E5-2699A v4, 220GB RAM, T4 The CPUs and RAM are irrelevant for inference. The GPU is what matters. Starting from nothing, budget ~$1000-1500 for the GPUs, NICs, and a DAC cable. The reason to own the hardware isn\u0026rsquo;t cost, it\u0026rsquo;s access. You can\u0026rsquo;t debug RDMA GID tables on a managed cloud instance. The real learning happens when things fail in ways no simulator models.\nMaking RDMA Work The Easy Part Install the ConnectX-4 cards, connect with the DAC cable, assign IPs:\nendor: 10.0.0.1/24 on enp65s0np0 (mlx5_0) dagobah: 10.0.0.2/24 on ens1np0 (mlx5_0) Verify with ib_send_bw:\n#bytes #iterations BW peak[MB/sec] BW average[MB/sec] 65536 5000 2893.47 2893.14 23.14 Gbps, that\u0026rsquo;s 92.5% of the theoretical 25GbE line rate. Latency: 1.27 microseconds. The inbox kernel driver just works on Fedora. No OFED, no firmware flashing, no drama.\nThe Not-Easy Part Getting RDMA to work inside Kubernetes pods is where it got interesting.\nProblem 1: ib_umad. The RDMA device plugin (k8s-rdma-shared-dev-plugin) couldn\u0026rsquo;t discover devices because the ib_umad kernel module wasn\u0026rsquo;t loaded. No error message, just silent failure. Devices showed as 0 allocatable. Fix: modprobe ib_umad and persist via /etc/modules-load.d/.\nProblem 2: hostNetwork. The device plugin gives pods access to /dev/infiniband/ devices, but not the network routing needed to actually use them. The RDMA traffic needs to flow over the host\u0026rsquo;s physical NIC, not through a CNI virtual network. Fix: hostNetwork: true on both vLLM pods, plus dnsPolicy: ClusterFirstWithHostNet so DNS still works.\nProblem 3: GID table. The RDMA NIC on dagobah had no IP configured at boot. It would get assigned later by my scripts. 
But RoCE v2 needs an IP in the GID table to function. An empty GID table means the NIC falls back to RoCE v1, and when one side is v2 and the other is v1, the NIXL handshake fails silently. Fix: configure the IP persistently via NetworkManager so it\u0026rsquo;s there before any pods start.\nProblem 4: UCX device naming. NIXL uses UCX under the hood for RDMA transport. The environment variable UCX_NET_DEVICES needs the InfiniBand device name (mlx5_0:1), not the host network device name (ens1np0). Get it wrong and UCX falls back to TCP with no warning. Thankfully, UCX lists available devices in its error output when the specified device doesn\u0026rsquo;t exist.\nEvery one of these issues took a fair amount of time to debug. None of them are documented in the llm-d getting-started guides, because those guides assume you\u0026rsquo;re on a cloud provider where the RDMA infrastructure is pre-configured. But every one of these issues will hit any team deploying llm-d on bare metal or on-prem.\nBuilding the Kubernetes Layer k3s on both nodes. Lightweight, single-binary, gets out of the way. The control plane runs on endor, and dagobah joins as an agent.\nendor: k3s server (control plane + decode workloads) dagobah: k3s agent (prefill workloads) NVIDIA GPU Operator exposes the GPUs. The key setting: driver.enabled=false because both nodes already have the NVIDIA driver installed at the host level (akmod-nvidia on Fedora). The operator just needs to install the container toolkit, not the driver.\nOne gotcha that cost me an afternoon: the upstream vLLM container image (vllm/vllm-openai) ships CUDA 12.9. My driver is 580.x which provides CUDA 13.0. On datacenter GPUs, NVIDIA\u0026rsquo;s forward-compatibility libraries handle this mismatch. On consumer GPUs like the RTX 3060, they don\u0026rsquo;t. I had to build a custom image based on nvidia/cuda:13.0.1-devel-ubuntu24.04 with vLLM installed from PyPI.\nThe devel base (not runtime) matters too. 
FlashInfer needs nvcc for JIT compilation on the T4\u0026rsquo;s sm_75 architecture. Precompiled kernels only exist for sm_80+.\nThe Gateway Stack llm-d uses the Kubernetes Gateway API with an inference extension. The stack:\nClient -\u0026gt; agentgateway (Envoy) -\u0026gt; ext-proc -\u0026gt; EPP (scheduler) -\u0026gt; vLLM pod The EPP (Endpoint Picker) is the brain. It decides which pod should handle each request based on KV cache state, queue depth, and pod roles. For disaggregated P/D, it runs two scheduling passes: pick the decode pod first, then decide if a separate prefill pod is needed.\nI hit one naming evolution that caused confusion: kgateway was rebranded to agentgateway between releases. The kgateway v2.2.0 image I initially deployed had the inference extension disabled by default, with an env var (KGW_ENABLE_AGENTGATEWAY) that required CRDs from a domain (agentgateway.dev) that didn\u0026rsquo;t exist yet in that release. Switching to the agentgateway chart (v1.0.0-alpha.4) fixed everything.\nDisaggregated Prefill/Decode This is the payoff. The core idea: prefill (processing the input prompt) is compute-bound and benefits from a strong GPU. Decode (generating output tokens one at a time) is memory-bandwidth-bound. By splitting them across specialized pods, you can optimize each independently.\nIn my lab:\nT4 (dagobah) for prefill. 16GB VRAM, decent tensor core throughput for INT8. Processes the input prompt and builds the KV cache. RTX 3060 (endor) for decode. 12GB VRAM, 360 GB/s memory bandwidth. Generates output tokens using the transferred KV cache. The KV cache transfer happens over RDMA via NIXL (NVIDIA\u0026rsquo;s Inference Xfer Library). 
After the prefill pod processes the prompt, it tells the decode pod \u0026ldquo;here are the KV cache block IDs.\u0026rdquo; The decode pod\u0026rsquo;s routing sidecar triggers a NIXL transfer, moving GPU memory from dagobah to endor over the 25GbE RDMA link.\nThe Attention Backend Trap This one was subtle and took the longest to debug.\nThe T4 (sm_75) can\u0026rsquo;t run FlashAttention2, which requires sm_80+. The RTX 3060 (sm_86) picks FA2 by default. When each pod uses a different attention backend, they produce KV caches with different shapes, meaning different memory layouts for the key and value tensors.\nNIXL transfers raw bytes. It doesn\u0026rsquo;t know or care about tensor shapes. So the transfer succeeds, but the decode pod interprets the bytes with the wrong layout, and inference produces garbage.\nThe fix: force --attention-backend FLASHINFER on both pods. FlashInfer supports both sm_75 and sm_86 (via JIT on the T4), producing compatible KV cache layouts.\nThis is the kind of issue that never appears in homogeneous cloud deployments where every GPU is the same model. 
Mixed GPU architectures surface it immediately.\nTesting It A chat completion through the gateway:\nGATEWAY_IP=$(kubectl get gateway inference-gateway -o jsonpath=\u0026#39;{.status.addresses[0].value}\u0026#39;) curl -s http://${GATEWAY_IP}/v1/chat/completions \\ -H \u0026#34;Content-Type: application/json\u0026#34; \\ -d \u0026#39;{ \u0026#34;model\u0026#34;: \u0026#34;Qwen/Qwen2.5-7B-Instruct-AWQ\u0026#34;, \u0026#34;messages\u0026#34;: [{\u0026#34;role\u0026#34;: \u0026#34;user\u0026#34;, \u0026#34;content\u0026#34;: \u0026#34;Explain RDMA in one paragraph.\u0026#34;}], \u0026#34;max_tokens\u0026#34;: 100 }\u0026#39; | python3 -m json.tool Checking that the EPP is routing to the right roles:\nkubectl logs -f deploy/vllm-pool-epp | grep -i \u0026#34;prefill\\|decode\u0026#34; And verifying the NIXL transfer is actually happening over RDMA:\nkubectl logs deploy/vllm-decode -c vllm --tail=50 | grep \u0026#34;external prefix\u0026#34; INFO: External prefix cache hit rate: 100.0% Results Model: Qwen2.5-7B-Instruct-AWQ (awq_marlin, max-model-len=1024) Cache hit rate: 100% external prefix cache hit rate NIXL transfer: avg 19.5ms, 36% of small transfers \u0026lt;5ms (RDMA zero-copy) Peak decode: 96.8 tokens/sec The numbers aren\u0026rsquo;t impressive by datacenter standards. That\u0026rsquo;s not the point. The point is that the full disaggregated P/D stack, agentgateway, EPP with PdProfileHandler, NIXL over RDMA, routing sidecar, and prefix cache scoring, all works end-to-end on two servers that cost less than a single H100.\nMaximizing Older Hardware Quantization is the great equalizer. AWQ at INT4 cuts a 7B model from 14GB to ~3.5GB. Any sm_75+ GPU with 8GB+ VRAM can serve a useful model.\nDisaggregation helps small GPUs more than big ones. An H100 has 80GB for model + KV cache. An RTX 3060 has 12GB, barely enough for the model alone. By offloading prefill, the decode GPU doesn\u0026rsquo;t need to hold the full prompt\u0026rsquo;s KV cache. 
Disaggregation is more valuable at the margins.\nCPU memory is cheap, GPU memory isn\u0026rsquo;t. vLLM supports CPU offloading for KV cache. Slower, but it extends your effective context length significantly. First knob to turn on consumer hardware.\nThe debugging skills transfer directly. A $30 NIC and a used datacenter GPU teach you the same concepts that apply at H200 scale. The failure modes are the same, just at different throughput.\nWhy Build It Yourself This lab exists because I wanted to understand llm-d from the inside out. Seeing RDMA handshake failures, attention backend mismatches, and silent TCP fallbacks in person is how you build real opinions and find real problems to solve.\nIf you\u0026rsquo;re getting into infrastructure or distributed systems: build something. Get some old servers, break things, fix them, and write about what you learned.\nTakeaways The hard problems weren\u0026rsquo;t inference. vLLM, EPP, and the gateway stack worked as documented. The hard problems were all RDMA: empty GID tables, silent protocol fallbacks, device naming that differs between kernel and userspace. None of this is covered in any getting-started guide because those guides assume cloud infrastructure.\nMixed GPU architectures made it worse and better at the same time. Worse because mismatched attention backends produce garbage with no error. 
Better because it forced me to actually understand KV cache layouts instead of just trusting defaults.\nWhat\u0026rsquo;s Next Deploy the KV Cache Indexer to enable prefix-aware routing (currently the EPP uses simpler scoring without block-level cache tracking) Benchmark baseline (both pods running full P+D) vs disaggregated (split roles) Profile RDMA bandwidth utilization during cache transfer to see how close to the 25GbE line rate we actually get under real workloads Test with larger context lengths once H200 access is available Explore how to make this small model punch above its weight through fine-tuning and LoRA adapters The lab setup details, including manifests and quickstart guides, are in my lab notes. I plan to open-source the relevant bits once the setup stabilizes.\nI\u0026rsquo;m joining the llm-d team at Red Hat. These are my notes from the onboarding process.\n","permalink":"https://hexfusion.io/posts/disaggregated-pd-consumer-gpus/","summary":"Running llm-d\u0026rsquo;s disaggregated prefill/decode architecture across an RTX 3060 and a Tesla T4 connected by 25GbE RDMA. What worked, what broke, and what I learned about KV cache transfer at the edge of what consumer hardware can do.","title":"Disaggregated Prefill/Decode on Consumer GPUs"},{"content":"Distributed systems engineer at Red Hat, building llm-d. Previously etcd, OpenShift, Flight Control. Writing about Go, LLM inference, RDMA, and Kubernetes from the systems side.\n","permalink":"https://hexfusion.io/about/","summary":"About Sam Batschelet","title":"About"}]