GenAI Inference Engineering

L4-L5 · 7 courses · 196 chapters

Architect multi-provider LLM gateways, implement semantic caching and batch optimization, monitor provider SLAs, and optimize inference costs.

What you'll learn

Core responsibilities this discipline prepares you for.

1. Design LLM gateway infrastructure, routing requests across providers

  • Deploy and configure LiteLLM gateway on Kubernetes with provider routing rules and load balancing
  • Manage API key rotation, failover policies, and per-provider request distribution
  • Validate gateway behavior under failover scenarios and measure routing latency overhead
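
The gateway setup above can be sketched as a LiteLLM proxy config. The model names and keys here are placeholders, and the exact field names should be checked against LiteLLM's current config reference:

```yaml
model_list:
  - model_name: primary
    litellm_params:
      model: openai/gpt-4o-mini
      api_key: os.environ/OPENAI_API_KEY
  - model_name: backup
    litellm_params:
      model: anthropic/claude-3-5-haiku-latest
      api_key: os.environ/ANTHROPIC_API_KEY

router_settings:
  routing_strategy: latency-based-routing   # send traffic to the fastest deployment
  num_retries: 2
  fallbacks:
    - primary: ["backup"]                   # fail over to the backup model group
```

Failover drills in the labs then amount to killing the primary provider's key and confirming requests land on `backup`.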
2. Optimize request latency through caching, batching, and streaming

  • Implement semantic caching with Redis using embedding similarity for cache key matching
  • Build request batching strategies and streaming-first response patterns
  • Benchmark cache hit rates, measure P50/P95 latency improvements, and tune eviction policies
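
The core of semantic caching is a similarity lookup over embeddings rather than an exact key match. A minimal in-memory sketch (in the labs this sits behind Redis, and the toy 2-dimensional embeddings and 0.9 threshold are illustrative):

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

class SemanticCache:
    """Return a cached response when a new query's embedding is close enough."""

    def __init__(self, threshold=0.9):
        self.threshold = threshold
        self.entries = []  # list of (embedding, response) pairs

    def get(self, embedding):
        best, best_score = None, 0.0
        for emb, resp in self.entries:
            score = cosine(embedding, emb)
            if score > best_score:
                best, best_score = resp, score
        return best if best_score >= self.threshold else None

    def put(self, embedding, response):
        self.entries.append((embedding, response))
```

The threshold is the main tuning knob: too low and semantically different prompts collide; too high and the hit rate collapses.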
3. Implement structured output extraction from LLMs with type safety

  • Use Pydantic AI for type-safe LLM interactions with guaranteed schema compliance
  • Build structured extraction pipelines with Instructor and DSPy for programmatic optimization
  • Validate extraction accuracy across providers and measure schema conformance rates
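
The type-safety idea can be illustrated with a stdlib-only sketch: parse the model's JSON reply, then enforce the schema before anything downstream touches it. Pydantic AI and Instructor do this far more robustly; the `Invoice` schema here is hypothetical:

```python
import json
from dataclasses import dataclass, fields

@dataclass
class Invoice:
    vendor: str
    total: float

def extract(raw: str) -> Invoice:
    """Parse an LLM's JSON reply and enforce the schema's fields and types."""
    data = json.loads(raw)
    kwargs = {}
    for f in fields(Invoice):
        if f.name not in data:
            raise ValueError(f"missing field: {f.name}")
        # Coerce to the annotated type; raises if the value can't conform.
        kwargs[f.name] = f.type(data[f.name])
    return Invoice(**kwargs)
```

Schema conformance rate is then just the fraction of responses for which `extract` succeeds without raising.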
4. Build cost attribution and FinOps dashboards tracking token spend

  • Track token costs per team, model, and feature using Langfuse cost attribution
  • Build Grafana dashboards for cost visualization with Prometheus budget alerting
  • Implement cost optimization through semantic caching, model tiering, and prompt compression
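
Cost attribution reduces to per-request arithmetic over token counts, aggregated by team. A sketch with made-up per-1K-token prices (real prices vary by provider and change often):

```python
# Illustrative prices per 1K tokens -- not real list prices.
PRICES = {"gpt-4o-mini": {"input": 0.00015, "output": 0.0006}}

def request_cost(model, input_tokens, output_tokens):
    """Dollar cost of one request, priced per 1K tokens."""
    p = PRICES[model]
    return (input_tokens / 1000) * p["input"] + (output_tokens / 1000) * p["output"]

def attribute(requests):
    """Aggregate spend per team from a list of request records."""
    totals = {}
    for r in requests:
        cost = request_cost(r["model"], r["input_tokens"], r["output_tokens"])
        totals[r["team"]] = totals.get(r["team"], 0.0) + cost
    return totals
```

In the labs this aggregation comes from Langfuse and PostgreSQL request logs rather than an in-memory list, but the math is the same.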
5. Monitor inference quality metrics in production

  • Instrument LLM calls with OpenTelemetry spans capturing latency, tokens, and error rates
  • Set up Logfire for Python-native tracing and Prometheus for P50/P95/P99 latency monitoring
  • Configure alerting rules that detect latency spikes and diagnose root causes from distributed traces
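
The P50/P95/P99 figures Prometheus reports are quantiles over latency samples. A stdlib sketch of the underlying math:

```python
import statistics

def latency_percentiles(samples_ms):
    """P50/P95/P99 from raw latency samples (milliseconds)."""
    qs = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": qs[49], "p95": qs[94], "p99": qs[98]}
```

Tail percentiles need enough samples to be meaningful; alerting on a P99 computed from a handful of requests mostly produces noise.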
6. Implement intelligent routing, sending queries to model tiers based on complexity

  • Build RouteLLM semantic routing with model cascading: cheap models for simple queries, expensive models for complex ones
  • Configure complexity-based dispatch logic with fallback chains across providers
  • Demonstrate 60%+ cost savings while maintaining output quality on standardized test datasets
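
A toy version of complexity-based dispatch, to fix the idea. The keyword heuristic and tier map are illustrative; RouteLLM replaces the hand-written classifier with a learned router:

```python
def classify(prompt: str) -> str:
    """Crude complexity heuristic: prompt length plus reasoning keywords."""
    hard_markers = ("prove", "analyze", "step by step", "compare")
    if len(prompt) > 500 or any(m in prompt.lower() for m in hard_markers):
        return "complex"
    return "simple"

# Hypothetical tier map: cheap model for simple queries, strong model otherwise.
TIERS = {"simple": "gpt-4o-mini", "complex": "claude-sonnet"}

def route(prompt: str) -> str:
    return TIERS[classify(prompt)]
```

The cost savings come from the fact that most production traffic classifies as simple, so the expensive tier only sees the minority of queries that need it.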
7. Manage API rate limits and quotas across providers

  • Build rate limiting middleware in FastAPI with per-endpoint and per-user throttling
  • Configure LiteLLM quota management with per-team token budgets and key rotation policies
  • Validate graceful degradation behavior under sustained load with provider quota exhaustion
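
Rate-limiting middleware typically builds on a token bucket. A self-contained sketch of the core accounting (the FastAPI wiring around it is omitted):

```python
import time

class TokenBucket:
    """Token-bucket limiter: refills at `rate` tokens/sec, bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        """Consume `cost` tokens if available; otherwise reject the request."""
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

Per-user throttling is then one bucket per API key; graceful degradation means returning 429 with a Retry-After header instead of queueing indefinitely.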
8. Deploy inference services on K8s with scaling and health checks

  • Configure Kubernetes Deployments with readiness/liveness probes tailored for LLM services
  • Set up Horizontal Pod Autoscaler with custom metrics for token throughput scaling
  • Validate zero-downtime rolling updates under active inference load
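
The probe configuration above might look like this inside a Deployment's container spec. Endpoints, ports, and timings are illustrative; LLM services often need longer timeouts than typical web apps because upstream providers can stall:

```yaml
# Probe block for an LLM inference container (paths and ports are placeholders).
readinessProbe:
  httpGet:
    path: /health/ready       # hypothetical endpoint: provider keys + cache reachable
    port: 8000
  initialDelaySeconds: 15
  periodSeconds: 10
livenessProbe:
  httpGet:
    path: /health/live
    port: 8000
  initialDelaySeconds: 30
  periodSeconds: 20
  timeoutSeconds: 5           # generous: slow upstreams shouldn't trigger restarts
```

Keeping the liveness probe cheap and independent of upstream providers is what prevents a provider outage from cascading into pod restarts.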

Your learning path

7 courses · sequenced for compounding · 196 chapters

Beginner · 13 Ch

Foundations

Python Essentials for Agent Builders

Master Python fundamentals from zero to professional code structure. Builds incrementally toward agent-ready patterns.

Intermediate · 20 Ch

Step 2

LLM Foundations for Agent Builders

Deep understanding of LLM internals, data pipelines, architecture, and multi-provider integration patterns.

Intermediate · 12 Ch

Step 3

Kubernetes Essentials for GenAI Engineers

Ship GenAI workloads on K8s — pods, services, Helm, GPU scheduling, and production-grade deployment patterns.

Intermediate · 10 Ch

Step 4

Web APIs & Services for GenAI Engineers

Design, build, and harden HTTP APIs with FastAPI — auth, streaming, rate limiting, OpenAPI contracts.

Advanced · 18 Ch

Step 5

GenAI Inference Engineering

Production-grade LLM application development with hosted APIs (Anthropic, OpenAI, Gemini) — retries, fallbacks, caching.

Advanced · 58 Ch

Step 6

Enterprise LLM Customization

Customize LLMs for enterprise — prompt engineering, RAG at scale, fine-tuning, and domain adaptation techniques.

Advanced · 65 Ch

Capstone

GenAI Operations

Run GenAI in production — monitoring, incident response, cost control, and the on-call runbook.

The GenAI stack you'll run in labs

Tools and APIs you invoke directly from every lab in this discipline — not the infrastructure GenBodha uses to host them.

LiteLLM

Unify OpenAI/Anthropic/Gemini behind one API

OpenRouter

Route across 100+ models by price and latency

OpenAI API

Baseline model for inference benchmarking

Anthropic API

Claude for quality-critical lab routes

Gemini API

Cost-tier routing labs with Gemini 2.5

Redis

Semantic caching of repeated completions

Prometheus

Collect lab-local inference metrics

Grafana

Dashboards for P50/P95 latency + token cost

FastAPI

Build your own LLM proxy in labs

PostgreSQL

Log every request for cost allocation

Langfuse

Drill into slow or failing completions

Start the GenAI Inference Engineering discipline today

7-day money-back guarantee