GenAI Safety & Evaluation Engineering

Design automated LLM evaluation pipelines, red-team GenAI systems, build bias detection and fairness benchmarks, and implement guardrails.

12 skill groups · 7 courses · 702 goals · ~306 hrs

Verifiable skill graph

12 skill groups · each becomes a signed node on your graph.

Every lab you pass signs a W3C Verifiable Credential on your public skill graph. Completing the labs in each group below mints one node on that graph — the badge you walk away with is a cryptographic record of what you can ship, not a completion certificate.

Share the URL on your résumé or with a hiring manager. They click; they see the discipline, the labs you passed, and the verification signature. No honor system, no broker.

01
LLM-as-Judge & Scoring

LLM-as-judge rubric design, position-bias correction, calibration against human raters, scoring functions, judge reproducibility, multi-judge ensembling. The core automated-evaluation technique.
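As an illustration of position-bias correction, here is a minimal sketch that asks a pairwise judge twice with the answer order swapped and only accepts a verdict that survives the swap. The `Judge` signature, the `"first"`/`"second"` return values, and the two toy judges are assumptions made for this example; a real judge would wrap a model call.

```python
from typing import Callable

# Hypothetical judge signature (an assumption for this sketch): takes
# (prompt, first_answer, second_answer) and returns "first" or "second".
Judge = Callable[[str, str, str], str]


def debiased_preference(judge: Judge, prompt: str, a: str, b: str) -> str:
    """Return "a", "b", or "tie" after querying the judge in both orders."""
    forward = judge(prompt, a, b)   # answer a is shown first
    backward = judge(prompt, b, a)  # answer b is shown first
    a_wins_forward = forward == "first"
    a_wins_backward = backward == "second"
    if a_wins_forward and a_wins_backward:
        return "a"
    if not a_wins_forward and not a_wins_backward:
        return "b"
    # The verdict flipped when the order flipped: treat it as position bias.
    return "tie"


def always_first(prompt: str, first: str, second: str) -> str:
    """Toy judge that always prefers whatever is shown first."""
    return "first"


def prefers_longer(prompt: str, first: str, second: str) -> str:
    """Toy judge that prefers the longer answer regardless of position."""
    return "first" if len(first) >= len(second) else "second"


print(debiased_preference(always_first, "q", "x", "y"))
print(debiased_preference(prefers_longer, "q", "long answer", "short"))
```

Counting order-dependent verdicts as ties is one simple policy; averaging scores over both orders is a common alternative.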

02
Continuous Eval & CI/CD Gates

Automated eval pipelines: eval-driven CI/CD, quality gates that block deploys, regression suites, champion/challenger A/B, prompt-variant testing — plus eval-pipeline observability, tracing (Langfuse/OpenTelemetry) and cost governance for judge runs.
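A quality gate of this kind can be sketched as a comparison between a baseline eval run and a candidate run. The function name `gate_failures`, the metric names, and the 0.02 tolerance are illustrative assumptions, not values taken from the curriculum.

```python
def gate_failures(
    baseline: dict[str, float],
    candidate: dict[str, float],
    max_drop: float = 0.02,  # assumed tolerance; tune per metric in practice
) -> list[str]:
    """List every metric that is missing or dropped by more than max_drop."""
    failures = []
    for metric, base in sorted(baseline.items()):
        score = candidate.get(metric)
        if score is None:
            failures.append(f"{metric}: missing from candidate run")
        elif base - score > max_drop:
            failures.append(f"{metric}: {base:.3f} -> {score:.3f}")
    return failures


if __name__ == "__main__":
    failures = gate_failures(
        {"faithfulness": 0.90, "relevance": 0.85},
        {"faithfulness": 0.80, "relevance": 0.84},
    )
    for line in failures:
        print("REGRESSION", line)
    # In a CI job, a non-zero exit code here is what would block the deploy.
```

Returning the list of failures, rather than a bare boolean, keeps the gate's decision explainable in the CI log.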

03
Eval Dataset & Safety Benchmark Design

Golden-set curation, dataset versioning, prompt-variant generation, edge-case mining, stratified sampling — and creating standardized safety benchmarks: jailbreak/refusal taxonomies, harm-category coverage, threshold-gated safety suites.

04
Adversarial Robustness Evaluation & Red-Teaming

Adversarial robustness evaluation: systematic red-team test-suite generation, jailbreak/injection probe batteries, automated red teaming, robustness scoring and attack-success-rate metrics, OWASP LLM Top 10 + MITRE ATLAS. The measured-verdict eval slice, not offensive exploitation.
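To make the attack-success-rate metric concrete, the sketch below aggregates probe outcomes per attack category. The `(category, succeeded)` record shape and the category names are assumptions for illustration; real probe tools emit richer records.

```python
from collections import defaultdict
from typing import Iterable


def attack_success_rate(results: Iterable[tuple[str, bool]]) -> dict[str, float]:
    """Compute successes / attempts per attack category.

    Each result is a (category, succeeded) pair, where succeeded means the
    probe got the model to produce the disallowed behavior.
    """
    attempts: dict[str, int] = defaultdict(int)
    successes: dict[str, int] = defaultdict(int)
    for category, succeeded in results:
        attempts[category] += 1
        if succeeded:
            successes[category] += 1
    return {category: successes[category] / attempts[category] for category in attempts}


probe_results = [
    ("jailbreak", True),
    ("jailbreak", False),
    ("injection", False),
    ("injection", False),
]
print(attack_success_rate(probe_results))
```

Lower is better here: a rate of 0.0 means no probe in that category succeeded against the system under test.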

05
Bias, Fairness & Disparate-Impact Benchmarking

Statistical fairness evaluation: demographic-parity and equalized-odds tests, disparate-impact ratios, subgroup and intersectional slicing, counterfactual-token swaps (BBQ/CrowS-Pairs/WinoBias), and fairness-benchmark construction. Distinct from factual grounding.
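The disparate-impact ratio and demographic-parity gap both reduce to comparing favorable-outcome rates across groups. A minimal sketch follows, assuming each record is a hypothetical `(group, favorable)` pair; the group labels are placeholders.

```python
from collections import defaultdict


def selection_rates(outcomes: list[tuple[str, bool]]) -> dict[str, float]:
    """Fraction of favorable outcomes per group."""
    totals: dict[str, int] = defaultdict(int)
    favorable: dict[str, int] = defaultdict(int)
    for group, is_favorable in outcomes:
        totals[group] += 1
        if is_favorable:
            favorable[group] += 1
    return {group: favorable[group] / totals[group] for group in totals}


def disparate_impact_ratio(
    outcomes: list[tuple[str, bool]], protected: str, reference: str
) -> float:
    """Selection rate of the protected group divided by the reference group's."""
    rates = selection_rates(outcomes)
    if rates[reference] == 0:
        raise ValueError("reference group has no favorable outcomes")
    return rates[protected] / rates[reference]


def demographic_parity_gap(
    outcomes: list[tuple[str, bool]], protected: str, reference: str
) -> float:
    """Difference in selection rates (protected minus reference)."""
    rates = selection_rates(outcomes)
    return rates[protected] - rates[reference]


# Toy data: group "a" gets 2 of 4 favorable outcomes, group "b" gets 4 of 5.
data = [("a", True), ("a", True), ("a", False), ("a", False)]
data += [("b", True)] * 4 + [("b", False)]
print(disparate_impact_ratio(data, "a", "b"))
print(demographic_parity_gap(data, "a", "b"))
```

A ratio below 0.8 is commonly flagged under the four-fifths rule of thumb, though the appropriate threshold depends on the context and jurisdiction.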

06
Compliance & Governance Frameworks

EU AI Act / NIST AI RMF / SOC 2 / HIPAA / GDPR control mapping, governance-artifact generation (model cards, eval policies, audit trails, sign-off gates), AI risk classification and governance workflows. Implemented on top of the eval capabilities.

07
Hallucination & Grounding Eval

Factual-grounding evaluation: hallucination detectors, faithfulness/groundedness scoring, NLI-based entailment, attribution and citation precision against source context. Distinct from statistical fairness.

08
Agent Trajectory Evaluation

Step-level agent evaluation: trajectory scoring, tool-call accuracy, plan-quality assessment, human-in-the-loop review gates, golden-trajectory datasets.
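One simple form of tool-call accuracy is a position-wise exact match against a golden trajectory. The sketch below assumes each step is a hypothetical `(tool_name, arguments)` pair; real harnesses often relax this to allow reordering or partial argument matches.

```python
from typing import Any

# A step is (tool_name, arguments); this shape is an assumption for the sketch.
Step = tuple[str, dict[str, Any]]


def tool_call_accuracy(golden: list[Step], actual: list[Step]) -> float:
    """Fraction of golden steps matched exactly at the same position."""
    if not golden:
        raise ValueError("golden trajectory must contain at least one step")
    # zip stops at the shorter list, so missing steps count as misses.
    matches = sum(1 for expected, got in zip(golden, actual) if expected == got)
    return matches / len(golden)


golden_run: list[Step] = [
    ("search", {"q": "x"}),
    ("fetch", {"id": 1}),
    ("answer", {}),
]
agent_run: list[Step] = [
    ("search", {"q": "x"}),
    ("fetch", {"id": 2}),  # right tool, wrong argument
    ("answer", {}),
]
print(tool_call_accuracy(golden_run, agent_run))
```

Dividing by the golden length means a truncated agent run is penalized, while extra trailing steps are ignored by this particular metric.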

09
Content Safety, Guardrails & PII

Runtime safety enforcement: content moderation, guardrails, PII detection + redaction, toxicity classifiers, output sanitization, regulated-content classification. The inference-time defense layer (vs. building the eval suite in group 03).
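As a rough picture of output-side PII redaction, here is a regex-only stand-in. It is not how a dedicated PII library works internally, and the two patterns (email, US-style phone number) and the placeholder tokens are assumptions that will miss many real-world formats.

```python
import re

# Deliberately simple patterns for illustration only.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")


def redact(text: str) -> str:
    """Replace matched emails and phone numbers with placeholder tokens."""
    text = EMAIL.sub("<EMAIL>", text)
    return PHONE.sub("<PHONE>", text)


print(redact("Mail jane.doe@example.com or call 555-123-4567."))
```

In a layered guardrail setup, a filter like this would run on model output before it reaches the user, alongside classifier-based checks.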

10
RAG & Cross-Model Evaluation

RAGAS + DeepEval + TruLens pipelines, retrieval relevance + faithfulness + answer-relevancy metrics, cross-model comparison harnesses and arena/pairwise model ranking.

11
Hosted LLM API Integration

Provider SDK integration in eval and safety code: judge models, multi-provider scoring, cross-model evaluation harnesses, multi-provider abstraction. Prerequisite plumbing.

12
Python for Eval Engineering

Production-grade Python for eval tooling: async/await, Pydantic models for eval rubrics, typing, dataclasses, pytest harnesses, parametrized testing. Prerequisite.

What you'll ship in production

Core responsibilities this discipline prepares you for.

  1. Build automated evaluation pipelines to continuously measure LLM output quality

    • Design evaluation harnesses with RAGAS, DeepEval, and NeMo Evaluator SDK for multi-metric scoring
    • Create evaluation datasets with ground-truth annotations and run cross-provider comparisons
    • Wire CI gates that automatically block deployments when faithfulness or relevance scores degrade
  2. Conduct red-team exercises: probe LLMs for vulnerabilities

    • Automate adversarial testing with Garak for prompt injection, jailbreak, and data extraction probes
    • Run multi-turn adversarial campaigns with Meta GOAT and DeepTeam for agent vulnerability testing
    • Execute red-team campaigns against realistic systems, discover vulnerabilities, and write actionable findings
  3. Implement production guardrails: content filters, PII detection, jailbreak prevention

    • Configure NeMo Guardrails with Colang policy language, Llama Guard 4, and Prompt Guard 2
    • Add Presidio for PII detection/redaction and Model Armor for Google-native content safety
    • Layer multiple defenses, test against comprehensive attack suites, and quantify safety-vs-helpfulness tradeoffs
  4. Design GenAI governance frameworks aligned with regulations

    • Map EU AI Act risk classification and implement NIST AI RMF control frameworks
    • Build OWASP LLM Top 10 mitigation strategies mapped to technical controls
    • Create governance artifacts, conduct risk assessments, and build automated audit trail pipelines
  5. Evaluate GenAI agent behavior: trajectory quality, tool selection accuracy

    • Build trajectory scoring systems measuring tool selection accuracy and task completion quality
    • Design human preference alignment tests and regression test suites for agent workflows
    • Evaluate multi-step agent executions to identify failure modes and build targeted regression tests
  6. Monitor bias, fairness, and hallucination rates in production

    • Detect bias across protected attributes using statistical fairness metrics and disparity analysis
    • Measure hallucination rates through ground-truth comparison and citation verification
    • Implement continuous bias scanning, hallucination detection, and alerting for metric drift
  7. Build safety incident response processes for deployed GenAI systems

    • Design safety monitoring dashboards with severity-based alert routing and escalation paths
    • Build incident triage workflows with containment procedures and post-incident reporting templates
    • Simulate safety incidents end-to-end and practice the full detection-to-resolution workflow
  8. Design LlamaFirewall policies for agent safety

    • Configure LlamaFirewall middleware for controlling agent tool access and output filtering rules
    • Set up multi-agent safety boundaries with policy-based execution constraints
    • Validate firewall policies against adversarial scenarios where agents attempt to bypass controls

Curriculum

7 courses · each builds on previous goals
