Scaling AI Workloads: Our GPU Infrastructure on AWS

Processing a single multilingual lipsync campaign requires hundreds of GPU-hours. When multiple campaigns arrive at once, as they do during festival season, we need infrastructure that scales out within minutes and incurs no GPU cost when idle. Here's how we designed our GPU rendering farm on AWS to handle exactly that.

Architecture Overview

Our infrastructure follows a queue-based fan-out architecture. Jobs arrive via an API, get placed in a Redis queue, and are consumed by GPU worker pods running in an EKS (Elastic Kubernetes Service) cluster. The key design principles:

  • Stateless Workers: Every GPU worker pod is stateless; all input/output goes through S3. This means pods can be killed and recreated at any time without losing work
  • Queue-Driven Scaling: The number of GPU pods scales based on queue depth, not CPU/memory usage. No jobs in the queue = no GPU instances running = zero GPU cost
  • Multi-Model Support: Each worker pod contains all our AI models (Wav2Lip, RetinaFace, Real-ESRGAN) pre-loaded into GPU memory at startup, so there's no per-job model loading overhead
  • Idempotent Processing: Every job can be safely retried. If a pod dies mid-processing, the job returns to the queue and gets picked up by another worker
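
The stateless and idempotent principles above can be sketched in a few lines of Python. This is a minimal illustration, not the production worker: `InMemoryStore` stands in for S3, a `deque` stands in for the Redis queue, and the key layout (`inputs/…`, `outputs/…`), `run_worker`, and `max_attempts` are all hypothetical names chosen for the example.

```python
from collections import deque

class InMemoryStore:
    """Stand-in for S3: maps object keys to bytes."""
    def __init__(self):
        self.objects = {}
    def exists(self, key):
        return key in self.objects
    def get(self, key):
        return self.objects[key]
    def put(self, key, data):
        self.objects[key] = data

def output_key(job_id):
    # Deterministic key: a retry writes to the same object instead of a new one.
    return f"outputs/{job_id}.mp4"

def run_worker(queue: deque, store, process, max_attempts=3):
    """Drain the queue; a failed job goes back on the queue for another attempt."""
    attempts = {}
    while queue:
        job_id = queue.popleft()
        if store.exists(output_key(job_id)):
            continue  # an earlier attempt already finished this job
        attempts[job_id] = attempts.get(job_id, 0) + 1
        try:
            result = process(store.get(f"inputs/{job_id}.mp4"))
            store.put(output_key(job_id), result)
        except Exception:
            if attempts[job_id] < max_attempts:
                queue.append(job_id)  # simulate the job returning to the queue
    return attempts

# Demo: the first attempt fails (as if the pod died), the retry succeeds,
# and a duplicate delivery of the same job is skipped.
store = InMemoryStore()
store.put("inputs/a.mp4", b"x")
calls = {"n": 0}

def flaky(data):
    calls["n"] += 1
    if calls["n"] == 1:
        raise RuntimeError("simulated pod failure")
    return data + b"-done"

attempts = run_worker(deque(["a", "a"]), store, flaky)
```

Because the worker keeps nothing between jobs except what is in the store, any pod can pick up a retried job, and the existence check makes duplicate deliveries harmless.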

GPU Instance Selection

Choosing the right GPU instance is the most impactful cost/performance decision. After extensive benchmarking, here's what we use:

For Lipsync Processing: g5.2xlarge

The g5 family uses NVIDIA A10G GPUs (24GB VRAM). For inference at 384×384 face crops with batch size 128, the A10G delivers ~45 FPS, sufficient for real-time processing of 30-second ads. At $1.212/hr on-demand (us-east-1), it's our best cost-per-frame option. We use Spot Instances at ~$0.36/hr (70% savings) for non-urgent batch processing.
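
The cost-per-frame comparison follows directly from the figures above. The sketch below works it through; the prices and throughput come from the text, while the 25 FPS source frame rate is an assumption made only to estimate processing time for a 30-second ad.

```python
ON_DEMAND_USD_PER_HR = 1.212  # g5.2xlarge on-demand, us-east-1 (from the text)
SPOT_USD_PER_HR = 0.36        # approximate Spot price (from the text)
THROUGHPUT_FPS = 45           # inference throughput (from the text)
AD_SECONDS = 30               # ad length (from the text)
VIDEO_FPS = 25                # ASSUMED source frame rate, not stated in the text

def cost_per_1000_frames(usd_per_hr, throughput_fps):
    """Dollar cost of processing 1000 frames at a given hourly price."""
    frames_per_hr = throughput_fps * 3600
    return usd_per_hr / frames_per_hr * 1000

def processing_seconds(ad_seconds, video_fps, throughput_fps):
    """Wall-clock seconds needed to process one ad."""
    return ad_seconds * video_fps / throughput_fps

on_demand_cost = cost_per_1000_frames(ON_DEMAND_USD_PER_HR, THROUGHPUT_FPS)
spot_cost = cost_per_1000_frames(SPOT_USD_PER_HR, THROUGHPUT_FPS)
spot_savings = 1 - SPOT_USD_PER_HR / ON_DEMAND_USD_PER_HR
ad_time = processing_seconds(AD_SECONDS, VIDEO_FPS, THROUGHPUT_FPS)

print(f"on-demand: ${on_demand_cost:.4f} per 1000 frames")
print(f"spot:      ${spot_cost:.4f} per 1000 frames ({spot_savings:.0%} cheaper)")
print(f"30-second ad: about {ad_time:.0f} s of GPU time")
```

Under that frame-rate assumption, a 30-second ad takes less GPU time than its own running length, which is what "real-time" means here.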

For SDXL Generation: g5.12xlarge

SDXL requires 10+ GB VRAM for generation at 1024×1024. The g5.12xlarge gives us 4× A10G GPUs, allowing us to run 4 generation jobs in parallel on a single instance. For large campaigns with 50+ image generations, this is more cost-effective than 4 separate g5.2xlarge instances due to shared CPU and network costs.

For Training/Fine-Tuning: p4d.24xlarge

When we need to fine-tune LoRA models or train custom components, we use p4d.24xlarge instances with 8× A100 GPUs (40GB each). These are expensive ($32.77/hr on-demand) but we typically need them for only 2-4 hours per training run. We always use Spot Instances here ($9.83/hr) since training jobs are checkpointed and can resume from interruptions.
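
Checkpoint-and-resume is what makes Spot interruptions tolerable for training. A minimal sketch of the pattern, assuming a JSON file as the checkpoint and a placeholder "optimizer step"; the `train` function, its `interrupt_at` parameter, and the file layout are hypothetical and stand in for a real training loop that would save model and optimizer state.

```python
import json
import os
import tempfile

def train(total_steps, ckpt_path, interrupt_at=None):
    """Run training steps, checkpointing after each; resume if a checkpoint exists."""
    state = {"step": 0, "loss": 1.0}
    if os.path.exists(ckpt_path):
        with open(ckpt_path) as f:
            state = json.load(f)  # pick up where the previous run stopped
    while state["step"] < total_steps:
        if interrupt_at is not None and state["step"] == interrupt_at:
            raise InterruptedError("simulated Spot interruption")
        state["step"] += 1
        state["loss"] *= 0.9  # placeholder for a real optimizer step
        # Write to a temp file, then rename, so an interruption mid-write
        # cannot leave a corrupt checkpoint behind.
        tmp_path = ckpt_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(state, f)
        os.replace(tmp_path, ckpt_path)
    return state

# Demo: the first run is interrupted after 4 steps; the second run resumes.
ckpt = os.path.join(tempfile.mkdtemp(), "ckpt.json")
try:
    train(total_steps=10, ckpt_path=ckpt, interrupt_at=4)
except InterruptedError:
    pass
final = train(total_steps=10, ckpt_path=ckpt)
```

In a real setup the checkpoint would need to live on durable storage such as S3 rather than local disk, since a reclaimed Spot instance takes its local disk with it.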

Docker & CUDA Environment

Every AI model runs inside a Docker container with a precisely controlled CUDA environment. This ensures reproducibility across development, staging, and production:

Base Image Stack

We build on NVIDIA's official CUDA base images with a layered approach:

  • Layer 1: nvidia/cuda:12.2.0-cudnn8-runtime-ubuntu22.04
  • Layer 2: Python 3.11 + pip + venv (avoiding Conda for smaller images)
  • Layer 3: PyTorch 2.1 + torchvision + torchaudio (compiled for CUDA 12.2)
  • Layer 4: Our AI models + inference code + FFmpeg 6.0

The final image is ~12GB, which is large but necessary for GPU workloads. We use ECR (Elastic Container Registry) for storage and pre-pull images on worker nodes to avoid cold-start delays.

Model Pre-loading

At container startup, all models are loaded into GPU memory: lipsync model (~300MB), RetinaFace (~100MB), MediaPipe Face Mesh (~10MB), Real-ESRGAN (~60MB). Total GPU memory usage at idle is ~1.5GB, leaving ~22.5GB of the A10G's 24GB for batch processing. This pre-loading strategy eliminates the 15-30 second model loading overhead that would otherwise occur on every job.
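
The pre-loading pattern amounts to paying the load cost once per container rather than once per job. A minimal sketch, assuming a simple registry object; `ModelRegistry`, `fake_loader`, and the model names are hypothetical, and the loaders here return plain dicts where real ones would read weights and move them to the GPU.

```python
import time

class ModelRegistry:
    """Loads every model once at startup and hands out the cached instances."""
    def __init__(self, loaders):
        self._loaders = loaders
        self._models = {}
        self.load_count = 0

    def preload(self):
        for name, loader in self._loaders.items():
            self._models[name] = loader()
            self.load_count += 1

    def get(self, name):
        return self._models[name]

def fake_loader(name):
    # Hypothetical stand-in for a real loader that reads weights onto the GPU.
    return lambda: {"name": name, "loaded_at": time.monotonic()}

MODEL_NAMES = ("lipsync", "retinaface", "face_mesh", "real_esrgan")
REGISTRY = ModelRegistry({name: fake_loader(name) for name in MODEL_NAMES})
REGISTRY.preload()  # runs once, at container startup

def handle_job(job_id):
    # Per-job code only looks models up; it never loads them.
    model = REGISTRY.get("lipsync")
    return f"{job_id} processed with {model['name']}"

results = [handle_job(f"job-{i}") for i in range(3)]
```

However many jobs the container handles, `load_count` stays at the number of models.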

Kubernetes Orchestration

We run everything on Amazon EKS (Elastic Kubernetes Service) with managed node groups. The cluster has two node pools:

CPU Node Pool

m5.2xlarge instances running the API server, Redis, monitoring stack, and non-GPU workloads. This pool has a minimum of 2 nodes (for HA) and scales to 10 based on CPU utilization.

GPU Node Pool

g5.2xlarge instances dedicated to GPU workloads. This pool scales from 0 to 20 nodes based on queue depth. The NVIDIA device plugin for Kubernetes handles GPU allocation: each pod requests nvidia.com/gpu: 1 and gets exclusive access to one A10G.

Key Kubernetes configurations:

  • Pod Priority: Urgent (brand deadline) jobs get a PriorityClass with value 1000; batch jobs get one with value 100. When resources are constrained, low-priority pods get preempted
  • Resource Limits: GPU pods are limited to 8 CPU cores, 32GB RAM, and 1 GPU. This prevents any single job from consuming excessive CPU/memory
  • Pod Disruption Budgets: At most 1 GPU pod can be evicted at a time during node scaling or updates, ensuring processing continuity
  • Init Containers: An init container verifies GPU health (nvidia-smi check) before the main container starts. Unhealthy nodes are automatically cordoned
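
To make the configurations above concrete, here is a sketch of a GPU worker pod spec built as a Python dict, which could then be serialized to a manifest. The Kubernetes field names (`priorityClassName`, `initContainers`, `resources.limits`, `nvidia.com/gpu`) are standard; the PriorityClass names, the image reference, and the use of `32Gi` for the 32GB limit are assumptions made for illustration.

```python
import json

# Hypothetical names; the original setup does not state them.
PRIORITY_CLASSES = {"urgent": "gpu-urgent", "batch": "gpu-batch"}
WORKER_IMAGE = "example.invalid/gpu-worker:latest"

def gpu_worker_pod(job_kind: str) -> dict:
    """Build a pod spec for one GPU worker of the given priority kind."""
    gpu_limit = {"nvidia.com/gpu": 1}
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"generateName": "gpu-worker-"},
        "spec": {
            "priorityClassName": PRIORITY_CLASSES[job_kind],
            # The init container must succeed before the worker starts.
            "initContainers": [{
                "name": "gpu-health-check",
                "image": WORKER_IMAGE,
                "command": ["nvidia-smi"],
                "resources": {"limits": dict(gpu_limit)},
            }],
            "containers": [{
                "name": "worker",
                "image": WORKER_IMAGE,
                "resources": {
                    "limits": {"cpu": "8", "memory": "32Gi", **gpu_limit},
                },
            }],
        },
    }

manifest = gpu_worker_pod("urgent")
print(json.dumps(manifest, indent=2))
```

The health-check container requests the GPU as well so that it can see the device; since init containers finish before the main container starts, the pod still needs only one GPU in total.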

Auto-Scaling with KEDA

Out of the box, the Kubernetes HPA (Horizontal Pod Autoscaler) is a poor fit for queue-based workloads: by default it scales on CPU/memory rather than queue depth, and it does not scale a workload down to zero. We use KEDA (Kubernetes Event-Driven Autoscaling) instead:

KEDA monitors our Redis queue length and scales GPU worker pods accordingly. The scaling rules are:

  • 0 jobs in queue → 0 GPU pods (scale to zero)
  • 1-5 jobs → 2 GPU pods (minimum batch)
  • 6-20 jobs → 1 pod per 3 jobs
  • More than 20 jobs → 1 pod per 2 jobs (aggressive scaling for burst workloads)
  • Maximum: 20 GPU pods (cost protection)
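
The rule table above can be written as a single function. This is a sketch of the policy only, not KEDA configuration: how the tiers map onto KEDA triggers is not specified here. It assumes partial groups round up and that a queue of exactly 20 jobs belongs to the 1-pod-per-3-jobs tier; `desired_gpu_pods` is a hypothetical name.

```python
import math

MAX_GPU_PODS = 20  # cost-protection cap

def desired_gpu_pods(queue_length: int) -> int:
    """Map queue depth to a target GPU pod count using the tiered rules."""
    if queue_length <= 0:
        return 0                              # scale to zero
    if queue_length <= 5:
        return 2                              # minimum batch
    if queue_length <= 20:
        pods = math.ceil(queue_length / 3)    # 1 pod per 3 jobs
    else:
        pods = math.ceil(queue_length / 2)    # 1 pod per 2 jobs
    return min(pods, MAX_GPU_PODS)

for depth in (0, 3, 6, 20, 21, 40, 100):
    print(depth, "->", desired_gpu_pods(depth))
```

Note the jump at the tier boundary: 20 jobs map to 7 pods while 21 jobs map to 11, and the cap is reached at 39 queued jobs.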

When KEDA scales up and the new GPU pods are left pending for lack of capacity, the Cluster Autoscaler provisions new g5.2xlarge nodes. New nodes join the cluster in 3-5 minutes, and GPU pods start processing within 60 seconds of node readiness (thanks to pre-pulled images).

Infrastructure as Code with Terraform

Every piece of our infrastructure is defined in Terraform. Our repository structure:

  • modules/vpc/: VPC, subnets, NAT gateways, security groups
  • modules/eks/: EKS cluster, node groups, IRSA roles
  • modules/ecr/: Container registries with lifecycle policies
  • modules/s3/: Input/output buckets with lifecycle rules
  • modules/monitoring/: CloudWatch dashboards, SNS alerts
  • environments/prod/: Production configuration
  • environments/staging/: Staging configuration (identical but smaller scale)

Terraform state is stored in S3 with DynamoDB locking. We use Terraform Cloud for plan/apply workflows with mandatory review for production changes. A full environment can be provisioned from scratch in ~20 minutes.

Monitoring & Observability

We monitor everything through a unified observability stack:

Prometheus + Grafana

Custom metrics for GPU utilization (via DCGM exporter), queue depth, processing throughput (jobs/hour), error rates, and average processing time per video segment. Grafana dashboards show real-time cluster status and historical trends.

CloudWatch Alarms

Critical alerts fire to Slack/PagerDuty for: GPU pod failures, queue depth exceeding 50 (indicating scaling issues), processing time exceeding 10 minutes per segment (indicating model issues), and Spot Instance interruption notices.

Structured Logging

All application logs are structured JSON, shipped to CloudWatch Logs via Fluent Bit. Each log entry includes job_id, language, stage, duration, and GPU metrics. This makes debugging failed jobs straightforward: we can trace a single video's journey through every pipeline stage.
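
A minimal sketch of what such structured logging can look like with Python's standard `logging` module. The field names `duration_s` and `gpu_util_pct`, the logger name, and the sample values are assumptions; the example writes to an in-memory stream, whereas a containerized worker would typically write to stdout for the log shipper to collect.

```python
import io
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""
    FIELDS = ("job_id", "language", "stage", "duration_s", "gpu_util_pct")

    def format(self, record):
        entry = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Values passed via `extra=` become attributes on the record.
        for field in self.FIELDS:
            if hasattr(record, field):
                entry[field] = getattr(record, field)
        return json.dumps(entry)

def build_logger(stream):
    logger = logging.getLogger("pipeline")
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger

stream = io.StringIO()  # stand-in for stdout
log = build_logger(stream)
log.info(
    "stage finished",
    extra={"job_id": "job-123", "language": "hi", "stage": "lipsync",
           "duration_s": 41.7, "gpu_util_pct": 93},
)
parsed = json.loads(stream.getvalue())
```

With every entry carrying `job_id` and `stage`, tracing one video through the pipeline becomes a filter on a single field.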

Cost Optimization

GPU costs dominate our infrastructure spend. Here's how we keep them under control:

  • Scale to Zero: GPU nodes scale to 0 during off-hours (11 PM - 7 AM IST), saving ~33% of GPU costs
  • Spot Instances: 80% of our GPU workloads run on Spot Instances, saving 60-70% vs. on-demand pricing
  • Right-Sizing: Monthly reviews of GPU utilization data to ensure we're using the optimal instance type
  • S3 Lifecycle Policies: Processed outputs move to S3 Infrequent Access after 30 days, Glacier after 90 days
  • Reserved Instances: For our 2 always-on CPU nodes, we use 1-year Reserved Instances (40% savings)
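
The headline percentages in this list can be sanity-checked with simple arithmetic. The sketch below uses only the figures stated above; it assumes GPU load would otherwise be spread evenly across the day, which is what makes an 8-hour shutdown worth about a third of the bill.

```python
OFF_HOURS_PER_DAY = 8                 # 11 PM to 7 AM
SPOT_SHARE = 0.80                     # share of GPU workloads on Spot
SPOT_DISCOUNT_RANGE = (0.60, 0.70)    # Spot savings vs. on-demand

def off_hours_fraction(hours_off, hours_per_day=24):
    """Fraction of the day with no GPU nodes running."""
    return hours_off / hours_per_day

def blended_spot_savings(spot_share, spot_discount):
    """Overall savings when only part of the workload runs on Spot."""
    return spot_share * spot_discount

idle_fraction = off_hours_fraction(OFF_HOURS_PER_DAY)
blended_low = blended_spot_savings(SPOT_SHARE, SPOT_DISCOUNT_RANGE[0])
blended_high = blended_spot_savings(SPOT_SHARE, SPOT_DISCOUNT_RANGE[1])

print(f"off-hours shutdown: {idle_fraction:.0%} of the day")
print(f"blended Spot savings: {blended_low:.0%} to {blended_high:.0%}")
```

Because only 80% of the workload runs on Spot, the effective saving across all GPU spend is lower than the 60-70% per-instance discount.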

Our average monthly GPU spend for processing 20-30 campaigns is approximately $2,000-3,000, a fraction of what a single traditional ad shoot costs.

If you're building GPU-intensive AI infrastructure and want to discuss architecture, reach out at [email protected].