A single L4 GPU on GCP SPOT costs $0.28 per hour. The same GPU through a managed inference service costs $0.80. Run it four hours a day and the difference is $760 a year, for one GPU. Scale that to a team running inference workloads across multiple models, and you're looking at thousands of dollars a month in what is essentially a convenience tax.

Most GPU infrastructure content falls into two camps: use a managed service like Replicate or Modal and don't think about it, or assume you have an unlimited cloud budget and spin up on-demand instances. The middle ground, self-managed GPU compute that's both cheap and production-ready, gets surprisingly little coverage. I've been running inference workloads on GCP SPOT instances for months now, and the patterns that make it work are more straightforward than you'd expect.

The Economics

SPOT instances are spare GCP capacity sold at a steep discount. The tradeoff is that Google can reclaim them with 30 seconds' notice when capacity gets tight. For inference workloads, where each request is independent and stateless, that tradeoff is almost always worth taking.

Here's what the numbers actually look like (us-west1, as of late 2025):

| GPU | Machine Type | On-Demand | SPOT | Savings |
|---|---|---|---|---|
| L4 (24GB) | g2-standard-4 | $0.71/hr | $0.28/hr | 61% |
| L4 (24GB) | g2-standard-8 | $0.85/hr | $0.34/hr | 60% |
| A100 (40GB) | a2-highgpu-1g | $3.67/hr | $1.44/hr | 61% |

Now compare SPOT against the managed inference platforms:

| Service | L4 Equivalent | A100 Equivalent |
|---|---|---|
| GCP SPOT | $0.28/hr | $1.44/hr |
| Replicate | $0.81/hr | $5.04/hr |
| Modal | $0.80/hr | |
| RunPod (Flex) | $0.68/hr | $2.74/hr |

The gap is significant. Managed services charge roughly 2x to 3.5x the SPOT price per GPU-hour, and that's before accounting for cold start overhead and per-request minimums. If your inference workload runs for more than a few hours a day, self-managed SPOT pays for the engineering investment within weeks.
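As a sanity check on those numbers, a back-of-envelope calculation makes the gap concrete. The 8 GPU-hours/day here is a hypothetical workload, not a figure from the tables above:

```bash
# Annual cost gap between a managed L4 (~$0.80/hr) and GCP SPOT ($0.28/hr),
# assuming a hypothetical 8 GPU-hours of inference per day
awk -v spot=0.28 -v managed=0.80 -v hours=8 \
  'BEGIN { printf "annual gap: $%.0f\n", (managed - spot) * hours * 365 }'
# prints "annual gap: $1518"
```

At full-time utilization (24 hours/day), the same arithmetic puts the gap above $4,500 per year per GPU.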

The question is whether you can make SPOT instances reliable enough to actually run on. The rest of this post covers how.

Terraform Automation

The foundation is reproducible infrastructure. Everything lives in Terraform so an environment can be rebuilt from scratch in minutes, which matters when SPOT instances can disappear.

The SPOT configuration itself is straightforward:

HCL
resource "google_compute_instance" "inference" {
  name         = "${var.project_prefix}-inference"
  machine_type = var.machine_type
  zone         = var.zone
 
  scheduling {
    preemptible         = var.use_spot
    provisioning_model  = var.use_spot ? "SPOT" : "STANDARD"
    automatic_restart   = var.use_spot ? false : var.automatic_restart
    on_host_maintenance = "TERMINATE"  # Required for all GPU instances
  }
 
  guest_accelerator {
    type  = var.gpu_type
    count = 1
  }
 
  boot_disk {
    initialize_params {
      image = data.google_compute_image.dlvm.self_link
      size  = 100
      type  = "pd-balanced"
    }
  }
 
  network_interface {
    network    = google_compute_network.vpc.name
    subnetwork = google_compute_subnetwork.private.name
    # No access_config block = no public IP
  }
 
  # ...
}

A few things worth noting. The on_host_maintenance = "TERMINATE" flag is required for every GPU instance on GCP regardless of whether it's SPOT or on-demand. GPUs can't live-migrate. The automatic_restart ternary handles an important constraint: GCP rejects SPOT instances with auto-restart enabled, so the config enforces this at the Terraform level rather than discovering it at apply time.
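For completeness, the snippets reference several input variables without showing their declarations. A minimal set might look like the following; the defaults are illustrative assumptions, not values from this setup:

```hcl
# Hypothetical variable declarations; defaults are illustrative
variable "project_prefix" {
  type    = string
  default = "inference"
}

variable "region" {
  type    = string
  default = "us-west1"
}

variable "zone" {
  type    = string
  default = "us-west1-a"
}

variable "machine_type" {
  type    = string
  default = "g2-standard-4"
}

variable "gpu_type" {
  type    = string
  default = "nvidia-l4"
}

variable "use_spot" {
  type    = bool
  default = true
}

variable "automatic_restart" {
  type    = bool
  default = true
}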

The base image is from Google's Deep Learning VM family, which ships with NVIDIA drivers pre-installed on the host. This eliminates driver installation from the startup path entirely, which matters for cold-start time.

HCL
data "google_compute_image" "dlvm" {
  family  = "common-cu128-ubuntu-2204-nvidia-570"
  project = "deeplearning-platform-release"
}

Zero-Trust Networking

The inference VM has no public IP address. No SSH keys to manage. No ports exposed to the internet. Access goes through GCP's Identity-Aware Proxy, which authenticates every connection against your Google identity before establishing a tunnel.

The network architecture is minimal:

HCL
resource "google_compute_network" "vpc" {
  name                    = "${var.project_prefix}-vpc"
  auto_create_subnetworks = false
}
 
resource "google_compute_subnetwork" "private" {
  name                     = "${var.project_prefix}-subnet"
  ip_cidr_range            = "10.0.0.0/24"
  network                  = google_compute_network.vpc.name
  private_ip_google_access = true
}

private_ip_google_access = true is the line that makes everything work without a public IP. It lets the VM reach GCS, Artifact Registry, and other Google APIs over Google's internal backbone rather than traversing the public internet.

The only firewall rule admits traffic from IAP's IP range:

HCL
resource "google_compute_firewall" "allow_iap" {
  name    = "${var.project_prefix}-allow-iap"
  network = google_compute_network.vpc.name
 
  allow {
    protocol = "tcp"
    ports    = ["22", "8188"]
  }
 
  source_ranges = ["35.235.240.0/20"]  # Google IAP CIDR
}

35.235.240.0/20 is Google's IAP address range. The VM accepts SSH (22) and the inference service port (8188) exclusively from IAP tunnels. Everything else is dropped. No VPN, no bastion host, no SSH key rotation.

To reach the inference service:

Bash
gcloud compute start-iap-tunnel inference-vm 8188 \
  --local-host-port=localhost:8188 \
  --zone=us-west1-a

This opens a tunnel from your local machine to the VM's port 8188, authenticated via your Google identity. The roles/iap.tunnelResourceAccessor IAM role controls who can open tunnels. Cloud Audit Logs capture every connection attempt. It's the BeyondCorp model applied to GPU infrastructure, and it's simpler to set up than a VPN.
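Granting that role can live in the same Terraform as everything else. A sketch using the provider's IAP tunnel IAM resource; the member address and project_id variable are placeholders, not values from this setup:

```hcl
# Hypothetical binding: allow one user to open IAP tunnels to the VM
resource "google_iap_tunnel_instance_iam_member" "tunnel_user" {
  project  = var.project_id
  zone     = var.zone
  instance = google_compute_instance.inference.name
  role     = "roles/iap.tunnelResourceAccessor"
  member   = "user:alice@example.com"
}
```

Scoping the binding to the instance rather than the project keeps tunnel access as narrow as the rest of the setup.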

Outbound internet access (for pulling Docker images and model files) goes through Cloud NAT:

HCL
resource "google_compute_router_nat" "nat" {
  name                               = "${var.project_prefix}-nat"
  router                             = google_compute_router.router.name
  nat_ip_allocate_option             = "AUTO_ONLY"
  source_subnetwork_ip_ranges_to_nat = "ALL_SUBNETWORKS_ALL_IP_RANGES"
}

The VM can reach the internet for pulls and updates. The internet cannot reach the VM. One-way door.
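The NAT above references a Cloud Router that the earlier snippets don't show. The missing resource is small; assuming it follows the same prefix convention:

```hcl
# Cloud Router the NAT attaches to
resource "google_compute_router" "router" {
  name    = "${var.project_prefix}-router"
  region  = var.region
  network = google_compute_network.vpc.name
}
```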

IAP handles human access, but production workloads need service-to-service connectivity. If your application layer runs on Cloud Run or Cloud Functions, a Serverless VPC Access connector bridges the gap:

HCL
resource "google_vpc_access_connector" "connector" {
  name          = "${var.project_prefix}-connector"
  region        = var.region
  network       = google_compute_network.vpc.name
  ip_cidr_range = "10.8.0.0/28"
}

The connector lets serverless workloads reach the GPU instance's private IP directly over the VPC. Your Cloud Run service sends inference requests to 10.0.0.x:8188 without the traffic ever leaving Google's network. No public endpoint, no API gateway in between. The GPU instance handles the compute-heavy work; the serverless layer handles routing, auth, and autoscaling at the application level. Each does what it's good at.
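On the serverless side, wiring Cloud Run through the connector is a single block. A sketch, where the service name and api_image variable are assumptions:

```hcl
# Hypothetical Cloud Run service routed through the VPC connector
resource "google_cloud_run_v2_service" "api" {
  name     = "${var.project_prefix}-api"
  location = var.region

  template {
    vpc_access {
      connector = google_vpc_access_connector.connector.id
      egress    = "PRIVATE_RANGES_ONLY"
    }
    containers {
      image = var.api_image
    }
  }
}
```

PRIVATE_RANGES_ONLY sends just the RFC 1918 traffic through the connector; the service's public egress still goes out Google's front door.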

CUDA Compilation

A single Docker image targeting multiple GPU architectures avoids maintaining separate builds for each GPU type:

Dockerfile
ARG CUDA_BASE_IMAGE=nvidia/cuda:12.8.1-cudnn-runtime-ubuntu22.04
FROM ${CUDA_BASE_IMAGE} AS base
 
ENV TORCH_CUDA_ARCH_LIST="7.5;8.0;8.6;8.9"

The TORCH_CUDA_ARCH_LIST environment variable tells PyTorch's extension build tooling which GPU microarchitectures to compile CUDA kernels for. 7.5 covers T4 (Turing), 8.0 covers A100 (Ampere), 8.6 covers A10 and similar Ampere parts, and 8.9 covers L4 (Ada Lovelace). One image runs on all of them. The tradeoff is larger image size and slightly longer build times, but maintaining a separate image per GPU type isn't worth it for most teams.

The base image uses cudnn-runtime rather than cudnn-devel. Runtime includes the cuDNN libraries needed for inference but omits the compiler headers. Smaller image, faster pulls.

PyTorch goes in its own Docker layer, before the application code:

Dockerfile
RUN python3 -m pip install \
    torch==${TORCH_VERSION} \
    torchvision==${TORCHVISION_VERSION} \
    torchaudio==${TORCHAUDIO_VERSION} \
    --index-url ${TORCH_INDEX_URL}
 
# Application code comes after
COPY . /app

Because Docker caches layers, if only the application code changes (not the PyTorch version), this multi-gigabyte layer gets served from cache. Rebuilds drop from 20 minutes to under a minute. Pin specific versions with build args so the layer cache behaves predictably.
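The ${TORCH_VERSION} references imply ARG declarations earlier in the Dockerfile. A sketch with illustrative pins; these are a cu128-compatible version set, not necessarily the pins used here:

```dockerfile
# Illustrative version pins (not the author's); declare after FROM so the
# RUN stage sees them, and override via --build-arg to bust the cache deliberately
ARG TORCH_VERSION=2.7.0
ARG TORCHVISION_VERSION=0.22.0
ARG TORCHAUDIO_VERSION=2.7.0
ARG TORCH_INDEX_URL=https://download.pytorch.org/whl/cu128
```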

Model Storage

GPU inference needs model files, and model files are large. Storing them on the instance's boot disk means re-downloading on every new instance. GCS with gcsfuse makes models persistent across instance lifecycles:

HCL
resource "google_storage_bucket" "models" {
  name                        = "${var.project_prefix}-models-${random_id.suffix.hex}"
  location                    = var.region
  uniform_bucket_level_access = true
  force_destroy               = false
 
  lifecycle_rule {
    condition { age = 90 }
    action {
      type          = "SetStorageClass"
      storage_class = "NEARLINE"
    }
  }
}

The lifecycle rule transitions objects older than 90 days to NEARLINE storage, which is cheaper for infrequently accessed files. Useful for model versions that get uploaded once and rarely touched again. force_destroy = false prevents Terraform from accidentally deleting the bucket and all your models with it.

The startup script mounts GCS as a local filesystem:

Bash
mountpoint -q /mnt/models || gcsfuse \
  --implicit-dirs \
  --cache-dir=/tmp/gcsfuse-cache \
  ${bucket_name} /mnt/models

mountpoint -q prevents double-mounting on restart. --implicit-dirs handles GCS's flat namespace by allowing navigation into directories that exist by convention but weren't explicitly created. --cache-dir puts gcsfuse's local cache on the instance's ephemeral disk for faster repeated reads.

The Docker container then volume-mounts specific paths from the gcsfuse mount point:

Bash
docker run -d \
  --runtime=nvidia \
  --gpus all \
  --ipc=host \
  --network=host \
  --memory=${container_memory_limit} \
  --memory-swap=${container_memory_limit} \
  -v /mnt/models:/app/models \
  ${docker_image}

--ipc=host is required for PyTorch's shared memory operations. --memory-swap set equal to --memory disables swap, which interacts badly with GPU workloads. --network=host avoids Docker NAT overhead and lets IAP tunnels reach the container port directly.

One caveat: gcsfuse has latency characteristics that differ from local disk. Object reads aren't byte-range, so loading a 7GB safetensors file reads the entire object. First load after mount is slower than local NVMe. Subsequent reads hit the local cache. For most inference workloads where models load once at startup, this is fine.

Startup Orchestration

The startup script runs on every boot, not just the first one. This is the SPOT recovery mechanism: when an instance is preempted and manually restarted, the same script re-initializes everything.

Idempotent guards make repeated runs fast:

Bash
#!/bin/bash
set -euo pipefail
 
# Only install on first boot
if ! command -v gcsfuse &> /dev/null; then
  export GCSFUSE_REPO=gcsfuse-$(lsb_release -cs)
  echo "deb [signed-by=/usr/share/keyrings/cloud.google.asc] \
    https://packages.cloud.google.com/apt $GCSFUSE_REPO main" \
    | tee /etc/apt/sources.list.d/gcsfuse.list
  curl -fsSL https://packages.cloud.google.com/apt/doc/apt-key.gpg \
    | tee /usr/share/keyrings/cloud.google.asc > /dev/null
  apt-get update && apt-get install -y gcsfuse
fi
 
if ! command -v docker &> /dev/null; then
  apt-get install -y docker.io nvidia-container-toolkit
  systemctl enable docker && systemctl start docker
  nvidia-ctk runtime configure --runtime=docker
  systemctl restart docker
fi

First boot takes 2-3 minutes for package installation. Subsequent boots skip these blocks entirely and go straight to mounting and container launch.

The GPU readiness probe handles a race condition where the NVIDIA driver may not be fully initialized immediately after boot:

Bash
until docker run --rm --gpus all \
  nvidia/cuda:12.8.1-base-ubuntu22.04 nvidia-smi > /dev/null 2>&1; do
  echo "Waiting for GPU in Docker..."
  sleep 5
done

This polls until the NVIDIA runtime is functional inside Docker before launching the inference container. Without it, the container can start before the GPU is ready and fail with cryptic CUDA errors.

Designing for Preemption

GCP sends a 30-second ACPI signal before terminating a SPOT instance. That's not much time. AWS gives two minutes. The design philosophy here matters more than the implementation.

Inference workloads fall into two categories for preemption tolerance. Stateless request-response inference (image generation, text completion, classification) handles preemption naturally. Each request is independent. If the instance dies mid-request, the client retries. No data is lost because there was no data to lose. Stateful or streaming inference (long video generation, real-time audio processing, multi-turn conversations with server-side state) is harder. Preemption mid-stream means lost work, and 30 seconds isn't enough to checkpoint most of these workloads gracefully.

For stateless workloads, the recovery pattern is simple: restart the instance, let the idempotent startup script re-initialize, and resume serving. A manual gcloud compute instances start after preemption brings the instance back in under a minute (skipping first-boot installs, re-mounting GCS, reusing the locally cached Docker image, and relaunching the container). For workloads that need higher availability, a managed instance group can automate recreation when capacity returns.
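That automation is a small amount of Terraform. A sketch of a single-instance managed instance group, assuming the instance config above has been refactored into a google_compute_instance_template named inference (not shown here):

```hcl
# Hypothetical MIG that keeps one SPOT instance alive, recreating it
# after preemption when capacity returns
resource "google_compute_instance_group_manager" "inference" {
  name               = "${var.project_prefix}-mig"
  zone               = var.zone
  base_instance_name = "${var.project_prefix}-inference"
  target_size        = 1

  version {
    instance_template = google_compute_instance_template.inference.id
  }
}
```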

The approach I use doesn't implement a preemption handler. No shutdown hook intercepts the 30-second warning. For inference workloads where every request is independent and models are stored in GCS, there's nothing to checkpoint. The instance can die at any point without data loss. Overengineering preemption handling for workloads that don't need it adds complexity for no benefit.

If your workload does need preemption awareness, register a shutdown script that catches the ACPI signal, drains in-flight requests, and logs the event. But design the workload to not need it first.

The Full Stack

Five layers compose into a production-ready GPU inference stack:

  1. Artifact Registry stores the Docker image. Private, regional, scoped IAM.
  2. GCS stores model files. Persistent across instance lifecycles, mounted via gcsfuse.
  3. Service account has exactly three roles: read the registry, manage the bucket, write logs. Nothing more.
  4. IAP networking provides zero-trust access with no public IP, no VPN, no SSH keys to manage.
  5. SPOT compute delivers 60% savings over on-demand and 60-65% savings over managed services.

The entire stack is declarative. terraform apply builds or rebuilds it from scratch. Instance preemption is a restart, not an incident. Models survive in GCS regardless of what happens to the compute layer.

The math is straightforward. If you're running inference workloads for more than a couple hours a day, the gap between managed services and self-managed SPOT covers the engineering cost within the first month. If you're running them full-time, it's not even close. The infrastructure patterns are general purpose. The Terraform, the IAP networking, the startup orchestration, and the container builds work the same whether you're serving image generation, text completion, or embedding computation. The GPU is just a resource. The patterns around it are what make SPOT viable.