A production AI Hub on AWS follows a layered architecture with clear separation of concerns. Each layer can be scaled, monitored, and upgraded independently.
Request Flow (Happy Path)
1. User request arrives at API Gateway with auth token
2. Request routed to Agent Orchestration based on tenant/use case
3. Agent determines if RAG context is needed
4. If RAG: hybrid search (vector + keyword) in OpenSearch
5. Retrieved context sent to Foundation Model via Bedrock
6. Model output validated by Guardrails (PII, safety, toxicity)
7. Response logged, cached if applicable, returned to user
Key Design Principles
🔌 Loose Coupling
Each layer can be scaled independently. Swap Bedrock for SageMaker endpoints without touching the gateway or agent layer.
🔍 Observability
Latency, cost, and token usage tracked at every layer. CloudWatch + custom dashboards for per-tenant metrics.
🔒 Tenant Isolation
Multi-tenancy enforced at both IAM (ABAC) and application layers. Data never leaks between tenants.
⚡ Resilience
Circuit breakers for model calls, fallback to smaller/cached models, graceful degradation when services are down.
🛠 Platform Layers — Deep Dive
🔷 AI Gateway / Model Access Layer
The gateway is the single entry point for all AI requests. It handles authentication, rate limiting, request validation, and model routing.
AWS Implementation
| Component | AWS Service | Purpose |
| --- | --- | --- |
| Routing | API Gateway / ALB | Route requests by tenant, model, use case |
| Auth | Cognito + API Keys | JWT validation, API key management |
| Rate Limiting | API Gateway throttling | Per-tenant, per-model limits |
| Model Abstraction | Lambda + Bedrock SDK | Unified interface across model providers |
| Caching | ElastiCache (Redis) | Cache frequent queries, reduce cost |
Interview Tip: Emphasize that the gateway abstracts model providers — you can switch from Claude to Llama without application changes. This is a key "build vs buy" decision.
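To make the provider-abstraction point concrete, here is a minimal sketch of the pattern: callers name a model, the gateway resolves it to a provider behind a common interface. The class and model names are illustrative placeholders, and the provider bodies are stubs standing in for real Bedrock/SageMaker SDK calls.

```python
from abc import ABC, abstractmethod

class ModelProvider(ABC):
    """Unified interface the gateway exposes; concrete classes wrap vendor SDKs."""
    @abstractmethod
    def complete(self, prompt: str) -> str: ...

class BedrockProvider(ModelProvider):
    def complete(self, prompt: str) -> str:
        # In production this would call bedrock_runtime.invoke_model(...)
        return f"[bedrock] {prompt}"

class SageMakerProvider(ModelProvider):
    def complete(self, prompt: str) -> str:
        # In production this would call sm_runtime.invoke_endpoint(...)
        return f"[sagemaker] {prompt}"

# Routing table: swapping a model's provider is a one-line config change,
# invisible to every application calling the gateway
PROVIDERS = {
    "claude-3-5-sonnet": BedrockProvider(),
    "llama-3-1-8b": SageMakerProvider(),
}

def route(model_id: str, prompt: str) -> str:
    """Gateway entry point: applications reference a model id, never a vendor SDK."""
    return PROVIDERS[model_id].complete(prompt)
```

Switching "claude-3-5-sonnet" from Bedrock to a self-hosted endpoint only touches the `PROVIDERS` table, which is the decoupling the tip describes.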
🔷 Agent Orchestration Runtime
The orchestration layer manages multi-step agent workflows with state management, tool selection, and human-in-the-loop capabilities.
Key Decisions
| Approach | Best For | Trade-off |
| --- | --- | --- |
| LangGraph on ECS | Complex stateful agents | More control, more ops overhead |
| Bedrock Agents | Simple tool-use agents | Managed, but limited customization |
| Step Functions + Lambda | Deterministic workflows | Great visibility, but not dynamic |
Build vs Buy: Bedrock Agents are good for POCs. For production with complex retry logic, state persistence, and multi-agent coordination, build with LangGraph on ECS/EKS.
🔷 RAG & Knowledge Infrastructure
The knowledge layer handles document ingestion, chunking, embedding, vector storage, and retrieval with re-ranking.
Interview Tip: Always mention hybrid search (vector + keyword) with reciprocal rank fusion. This handles both semantic and exact-term matching, covering 90% of retrieval problems.
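Reciprocal rank fusion is simple enough to show inline. This sketch implements the standard formula (score each document as the sum of 1/(k + rank) over the result lists it appears in, with the conventional k = 60); the document ids are made up for illustration.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse multiple ranked lists (e.g. vector and BM25 results).

    Each document scores sum(1 / (k + rank)) over every list containing it,
    so documents near the top of several lists rise above documents that
    rank highly in only one.
    """
    scores = {}
    for ranked in rankings:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical results: vector search vs keyword (BM25) search
fused = reciprocal_rank_fusion([
    ["d1", "d2", "d3"],   # vector ranking
    ["d3", "d1", "d4"],   # keyword ranking
])
# d1 wins: it is near the top of both lists
```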
🔷 Governance, Security & Observability
A cross-cutting concern that spans all layers. Implements guardrails, audit logging, encryption, and monitoring.
| Component | Purpose | Details |
| --- | --- | --- |
| Multi-Head Attention | | 12-100 heads learning different relationships (syntax, semantics) |
| FFN | Feature transformation | d_model → 4×d_model → d_model with non-linearity |
| Layer Norm + Residual | Training stability | Enables training 100+ layer models |
Context Windows & Scaling
| Model | Context | Year |
| --- | --- | --- |
| GPT-2 | ~1K tokens | 2019 |
| GPT-3 | ~2K tokens | 2020 |
| Claude 3 | 200K tokens | 2024 |
| Claude 3.5 Sonnet | 200K tokens | 2024 |
Quadratic Problem: Attention is O(n²) — doubling context requires 4× resources. Effective context ≠ max tokens (models degrade at >80% capacity). The "lost-in-the-middle" effect means information in the center of long contexts gets less attention.
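The quadratic claim is easy to verify with arithmetic. A rough cost model (the d_model value is an arbitrary example):

```python
def attention_cost(n_tokens, d_model=4096):
    # The QK^T score matrix alone has n_tokens^2 entries, each costing ~d_model
    # multiply-adds, so compute and memory both scale with n^2
    return n_tokens ** 2 * d_model

# Doubling the context from 4K to 8K tokens quadruples the attention cost
ratio = attention_cost(8192) / attention_cost(4096)
```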
Fine-Tuning vs Prompting Decision Tree
Start with Prompt Engineering (cheap, fast)
│
└─ Quality insufficient after 10 iterations?
├─ Few-Shot Prompting (add examples) → 40-50% improvement
└─ Still insufficient?
├─ LoRA / Adapter Tuning → 80-90% of full tuning, 10x faster
└─ Full Fine-Tuning (distillation, hallucination correction)
1. Query Decomposition
Break complex queries into sub-queries. Example: "Tax implications of 2008 crisis on small businesses" becomes: ["2008 financial crisis causes", "tax policy changes 2008", "small business impact"]. Retrieve for each, combine results.
2. Query Expansion
"Machine learning models" expands to: ["ML models", "neural networks", "deep learning", "AI algorithms"]. Improves recall when document terminology varies.
3. HyDE (Hypothetical Document Embeddings)
Generate a hypothetical relevant document for the query, then use its embedding for retrieval instead of the query embedding. Bridges the vocabulary gap between queries and documents.
Chunking Strategies
| Strategy | Size | Best For | Trade-off |
| --- | --- | --- | --- |
| Fixed-Size | 512 tokens, 20% overlap | Baseline, general use | Breaks mid-sentence |
| Semantic | Variable | Domain-heavy (law, finance) | More expensive (embedding every boundary) |
| Hierarchical | Multi-level | Production systems | Complex but most effective |
| Document-Aware | Preserves structure | Structured docs with headers | Requires document parsing |
Production Best Practice: Start with hierarchical chunking + hybrid search + re-ranking. This covers 90% of RAG problems at moderate complexity. Re-ranking alone improves relevance 40-60% with <10% latency overhead.
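The baseline fixed-size strategy from the table above (512 tokens with 20% overlap) is a few lines of code. This sketch operates on a pre-tokenized list; real pipelines would tokenize with the embedding model's tokenizer first.

```python
def fixed_size_chunks(tokens, chunk_size=512, overlap_ratio=0.2):
    """Split a token list into fixed-size chunks with overlapping windows.

    A 20% overlap means each chunk repeats the last ~100 of the previous
    chunk's 512 tokens, so sentences cut at a boundary still appear whole
    in at least one chunk.
    """
    step = max(1, int(chunk_size * (1 - overlap_ratio)))  # stride between chunk starts
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # last window already covers the tail
    return chunks
```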
Input Drift
Query distribution changes (e.g., billing bot gets technical questions). Detect via statistical tests on embedding distributions.
Output Drift
Model behavior changes (more verbose, different tone). Detect via token length histograms, lexical diversity metrics.
Data Quality Drift
RAG documents updated but indexes stale. Detect via embedding distribution shift. Action: reindex, retrain.
Business Metric Drift
User satisfaction or conversion drops. Detect via user ratings, feedback analysis. Action: prompt refinement.
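A deliberately crude sketch of input-drift detection on embedding distributions: compare the centroid of recent query embeddings against a baseline window. Production systems would use proper two-sample tests (KS, MMD, population stability index) rather than a single centroid distance, and the threshold here is an arbitrary placeholder.

```python
import math

def centroid(embeddings):
    """Per-dimension mean of a batch of embedding vectors."""
    dim, n = len(embeddings[0]), len(embeddings)
    return [sum(e[i] for e in embeddings) / n for i in range(dim)]

def centroid_shift(baseline, current):
    """L2 distance between embedding centroids — a coarse input-drift signal."""
    b, c = centroid(baseline), centroid(current)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(b, c)))

def input_drifted(baseline, current, threshold=0.5):
    # Threshold is illustrative; calibrate against historical week-over-week shift
    return centroid_shift(baseline, current) > threshold
```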
Retraining Triggers
Automatic
Hallucination rate > 5%
Latency P95 > 2x baseline
Retrieval NDCG drops 20%
New data + 2 weeks passed
Manual
Major domain update (new regulations)
Systemic bias from user feedback
Better model released by provider
Compliance requirement changes
☁ Amazon Bedrock
Architecture Overview
Bedrock Knowledge Bases: Build vs Buy
| | Bedrock Knowledge Bases | Custom RAG Pipeline |
| --- | --- | --- |
| Setup Time | Hours | Weeks |
| Chunking | Fixed only | Semantic, hierarchical, custom |
| Hybrid Search | Limited | Full control (RRF, weighted) |
| Re-ranking | Not customizable | Cross-encoder, LLM-based |
| Best For | POCs, simple Q&A | Production, accuracy >95% |
Interview Tip: Frame build vs buy as a spectrum. Start with Bedrock KB for POC, then migrate to custom RAG when accuracy requirements increase. This shows pragmatic architecture thinking.
🔬 Amazon SageMaker — Deep Dive
SageMaker is the core ML platform on AWS — it spans the entire ML lifecycle from data labeling through training, hosting, and monitoring. For an AI Hub, SageMaker handles custom model training, fine-tuning, and serving models that Bedrock doesn't offer.
SageMaker vs Bedrock — When to Use Which
| Scenario | Use Bedrock | Use SageMaker |
| --- | --- | --- |
| Foundation model inference | ✓ (managed, multi-provider) | Only for models not on Bedrock |
| Custom model training | ✗ | ✓ (Training Jobs, distributed) |
| Fine-tuning LLMs | ✓ (limited models, Bedrock fine-tuning) | ✓ (full control, any model, LoRA/QLoRA) |
| Classical ML | ✗ | ✓ (XGBoost, sklearn, etc.) |
| Embedding models | ✓ (Titan Embeddings, Cohere) | ✓ (custom embedding models) |
| RAG | ✓ (Knowledge Bases) | Pair with OpenSearch for custom RAG |
| AI agents | ✓ (Bedrock Agents) | ✗ (use LangGraph on ECS instead) |
| GPU/hardware control | ✗ (abstracted) | ✓ (choose instance type, GPU count) |
| Cost optimization | Pay per token | Pay per instance-hour (spot = 70% off) |
Interview Tip: A mature AI Hub uses both: Bedrock for quick access to foundation models, SageMaker for custom models, fine-tuning, and workloads where you need hardware control or cost optimization at scale.
Inference Endpoints — Deep Dive
| Option | Latency | Throughput | Cost Model | Use Case |
| --- | --- | --- | --- | --- |
| Real-Time Endpoint | <1s | High | Pay per instance-hour | Interactive APIs, chatbots |
| Async Inference | Minutes | Very high | Pay per request + instance | Large payload, batch, RAG pipelines |
| Batch Transform | Hours | Unlimited | Cheapest per inference | Offline scoring, nightly reprocessing |
| Serverless Inference | 1-2s (cold start) | Medium | Pay per invocation + duration | Spiky traffic, dev/test environments |
| Multi-Model Endpoint | <1s (cached) | High | Pay per instance (shared) | Serve 100s of models on 1 endpoint |
| Multi-Container Endpoint | <1s | High | Pay per instance | Inference pipeline (preprocess + model + postprocess) |
Real-Time Endpoint Architecture
API Request
│
Application Load Balancer
│
SageMaker Endpoint
├─ Production Variant A (80% traffic) ─ ml.g5.xlarge
├─ Production Variant B (20% traffic) ─ ml.g5.2xlarge [canary]
└─ Shadow Variant (0% live, copies traffic) ─ [A/B testing]
│
Model Container (ECR image with inference code)
├─ model_fn() ─ Load model weights
├─ input_fn() ─ Deserialize request
├─ predict_fn() ─ Run inference
└─ output_fn() ─ Serialize response
Production Variant Configuration
import sagemaker
from sagemaker.model import Model
from sagemaker.model_monitor import DataCaptureConfig

model = Model(
    image_uri="123456789.dkr.ecr.us-east-1.amazonaws.com/my-model:latest",
    model_data="s3://bucket/model.tar.gz",
    role="arn:aws:iam::123456789:role/SageMakerRole"
)

# Deploy with data capture enabled; additional production variants for
# A/B testing are attached via the endpoint configuration
predictor = model.deploy(
    initial_instance_count=2,
    instance_type="ml.g5.xlarge",
    endpoint_name="my-model-endpoint",
    data_capture_config=DataCaptureConfig(
        enable_capture=True,
        sampling_percentage=100,  # Capture all requests for monitoring
        destination_s3_uri="s3://bucket/capture/"
    )
)
Async Inference — For Long-Running AI Workloads
Client → POST request with S3 input location
│
SageMaker Async Endpoint
├─ Queues request internally
├─ Returns InferenceId immediately
├─ Processes when capacity available
└─ Writes output to S3
│
SNS Notification → Success/Failure callback
│
Auto-scales to 0 when idle (cost savings!)
Key benefit: Async endpoints can scale to 0 instances — no traffic means no cost. Perfect for intermittent AI workloads like document processing, batch RAG, or nightly model retraining inference.
Multi-Model Endpoints (MME) — Serve Hundreds of Models
Multi-Model Endpoint (single endpoint, single instance fleet)
│
├─ Model A (loaded) ─ Tenant 1 fine-tuned model
├─ Model B (loaded) ─ Tenant 2 fine-tuned model
├─ Model C (on S3) ─ Loaded on-demand when called
├─ Model D (on S3) ─ Loaded on-demand when called
└─ ... up to 1000s of models
Dynamic loading: frequently-used models stay in memory,
cold models loaded from S3 on first request (~seconds)
MME is ideal for multi-tenant AI platforms where each tenant has a fine-tuned model. Instead of one endpoint per tenant ($$$), you serve all tenants from a shared fleet and SageMaker handles model loading/unloading based on traffic patterns.
import json
import boto3

sm_runtime = boto3.client("sagemaker-runtime")

# Invoke a specific model on a multi-model endpoint
response = sm_runtime.invoke_endpoint(
    EndpointName="multi-model-endpoint",
    TargetModel="tenant-a/model.tar.gz",  # S3 key for this tenant's model
    Body=json.dumps(payload),
    ContentType="application/json"
)
Inference Recommender — Right-Size Your Endpoints
SageMaker Inference Recommender benchmarks your model across different instance types and configurations to find the optimal cost/performance trade-off.
import boto3

sm_client = boto3.client("sagemaker")

# Run Inference Recommender to find the optimal instance
response = sm_client.create_inference_recommendations_job(
    JobName="my-model-benchmark",
    JobType="Default",  # or "Advanced" for custom traffic patterns
    RoleArn=role_arn,
    InputConfig={
        "ModelPackageVersionArn": model_package_arn,
        "JobDurationInSeconds": 3600
    }
)
# Returns a ranked list of instance types with latency, throughput, cost:
#   ml.g5.xlarge  — P95: 120ms, 40 req/s, $1.41/hr
#   ml.g5.2xlarge — P95:  85ms, 65 req/s, $2.36/hr
Auto-Scaling Endpoints
Scaling Policies for AI Workloads
import boto3

client = boto3.client("application-autoscaling")

# Register the endpoint variant as a scalable target
client.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId="endpoint/my-endpoint/variant/AllTraffic",
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=2,   # Always at least 2 for availability
    MaxCapacity=20,  # Burst up to 20 instances
)

# Target tracking — scale based on invocations per instance
client.put_scaling_policy(
    PolicyName="InvocationsPerInstance",
    ServiceNamespace="sagemaker",
    ResourceId="endpoint/my-endpoint/variant/AllTraffic",
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 70.0,  # Target 70 invocations per instance
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        "ScaleInCooldown": 300,  # Wait 5 min before scaling down
        "ScaleOutCooldown": 60,  # Scale up quickly (1 min)
    }
)
Scaling Metrics Comparison
| Metric | Best For | Trade-off |
| --- | --- | --- |
| InvocationsPerInstance | General workloads | Simple, but doesn't account for request complexity |
| CPUUtilization | CPU-bound models | Doesn't correlate with GPU utilization |
| GPUUtilization | GPU-heavy inference | Custom metric via CloudWatch, more accurate for LLMs |
| ModelLatency | Latency-sensitive APIs | Scale when latency degrades above target |
| Custom (queue depth) | Async inference | Scale based on pending requests in SQS |
SageMaker Training Jobs
Training Architecture
Training Job
│
├─ Input: S3 (training data) + ECR (container image)
├─ Compute: ml.p4d.24xlarge (8x A100 GPUs)
│ ├─ On-Demand: full price, guaranteed capacity
│ └─ Managed Spot: 70% off, can be interrupted
│ └─ Checkpointing to S3 every N steps
├─ Distributed: data parallel / model parallel
│ ├─ Data Parallel: same model, split data across GPUs
│ └─ Model Parallel: split model layers across GPUs (for LLMs)
└─ Output: model.tar.gz → S3 → Model Registry
Training Job Code Example
from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri="763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:2.1-gpu",
    role=role,
    instance_count=4,                 # 4 instances for distributed training
    instance_type="ml.p4d.24xlarge",  # 8x A100 GPUs each = 32 GPUs total
    use_spot_instances=True,          # ~70% cost savings
    max_wait=86400,                   # Max 24 hr including spot interruptions
    max_run=72000,                    # Max 20 hr actual training
    checkpoint_s3_uri="s3://bucket/checkpoints/",
    output_path="s3://bucket/output/",
    hyperparameters={
        "epochs": "10",
        "learning_rate": "5e-5",
        "batch_size": "32",
        "model_name": "bert-base-uncased"
    },
    distribution={
        "torch_distributed": {"enabled": True}  # PyTorch DDP
    },
    metric_definitions=[
        {"Name": "train:loss", "Regex": "train_loss: ([0-9\\.]+)"},
        {"Name": "eval:accuracy", "Regex": "eval_acc: ([0-9\\.]+)"}
    ]
)
estimator.fit({
    "training": "s3://bucket/train/",
    "validation": "s3://bucket/val/"
})
If spot is interrupted, training pauses and resumes from checkpoint
Set max_wait as budget for total time including interruptions
Set max_run as budget for actual training time
Checkpointing Strategy
Save checkpoints to S3 every N steps (not just epochs)
For LLM fine-tuning: checkpoint every 500-1000 steps
Resume from latest checkpoint after spot interruption
Cost savings: 60-90% vs on-demand for training jobs
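The save/resume mechanics above can be sketched in a few lines. On SageMaker, anything written under the local checkpoint directory (conventionally `/opt/ml/checkpoints`) is synced to `checkpoint_s3_uri`; the JSON state format here is an illustrative stand-in for real model/optimizer state.

```python
import json
import os
import tempfile

def save_checkpoint(ckpt_dir, step, state):
    """Write a step-stamped checkpoint; zero-padded names sort chronologically."""
    path = os.path.join(ckpt_dir, f"step-{step:08d}.json")
    with open(path, "w") as f:
        json.dump({"step": step, "state": state}, f)

def latest_checkpoint(ckpt_dir):
    """Return the highest-step checkpoint, or None when starting fresh."""
    files = sorted(f for f in os.listdir(ckpt_dir) if f.startswith("step-"))
    if not files:
        return None
    with open(os.path.join(ckpt_dir, files[-1])) as f:
        return json.load(f)

# Simulate a training loop interrupted after step 1000, then resumed
ckpt_dir = tempfile.mkdtemp()  # stands in for /opt/ml/checkpoints
save_checkpoint(ckpt_dir, 500, {"loss": 1.2})
save_checkpoint(ckpt_dir, 1000, {"loss": 0.8})
resume = latest_checkpoint(ckpt_dir)  # training restarts from step 1000
```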
Hyperparameter Tuning (HPO)
from sagemaker.tuner import HyperparameterTuner, ContinuousParameter, IntegerParameter

tuner = HyperparameterTuner(
    estimator=estimator,
    objective_metric_name="eval:accuracy",
    objective_type="Maximize",
    hyperparameter_ranges={
        "learning_rate": ContinuousParameter(1e-5, 5e-4, scaling_type="Logarithmic"),
        "batch_size": IntegerParameter(16, 128),
        "warmup_steps": IntegerParameter(100, 1000),
    },
    max_jobs=20,          # Run 20 combinations
    max_parallel_jobs=4,  # 4 training jobs in parallel
    strategy="Bayesian",  # Bayesian optimization (smarter than grid/random)
)
tuner.fit({"training": "s3://bucket/train/"})
Tuning Strategies: Bayesian (best for expensive jobs, learns from previous runs) > Random (good baseline, parallelizable) > Grid (exhaustive, expensive). For LLM fine-tuning, Bayesian with 10-20 jobs is usually sufficient.
SageMaker Pipelines — MLOps Orchestration
SageMaker Pipeline (DAG-based, repeatable, versioned)
│
├─ ProcessingStep
│ └─ Data cleaning, feature engineering, train/test split
│
├─ TrainingStep
│ └─ Model training (GPU, distributed, spot)
│
├─ TuningStep (optional)
│ └─ Hyperparameter optimization across N runs
│
├─ EvaluationStep
│ └─ Compute metrics (accuracy, F1, RAGAS, custom)
│
├─ ConditionStep
│ ├─ accuracy > 0.85? → RegisterModel → Deploy
│ └─ else → FailStep (notify team)
│
├─ RegisterModelStep
│ └─ Model Registry (version, approval status, lineage)
│
└─ CreateEndpointStep (or Lambda for custom deployment)
└─ Deploy to endpoint with data capture enabled
Model Package Group: "fraud-detection-models"
│
├─ Version 1 (v1.0) ─ Status: Approved ─ In Production
│ ├─ Metrics: accuracy=0.92, F1=0.89
│ ├─ Artifacts: s3://models/fraud-v1/model.tar.gz
│ └─ Lineage: training job ARN, dataset version, pipeline run
│
├─ Version 2 (v2.0) ─ Status: PendingManualApproval
│ ├─ Metrics: accuracy=0.94, F1=0.91
│ └─ Awaiting: compliance review + risk committee
│
└─ Version 3 (v3.0) ─ Status: Rejected
└─ Reason: bias detected in demographic parity test
# Approve a model version (typically via CI/CD or manual review)
sm_client.update_model_package(
    ModelPackageArn="arn:aws:sagemaker:us-east-1:123:model-package/fraud-v2",
    ModelApprovalStatus="Approved",
    ApprovalDescription="Passed compliance review, bias testing, and risk committee"
)

# EventBridge rule triggers deployment on approval:
#   Rule: source=sagemaker, detail-type=ModelPackageStateChange, status=Approved
#   Target: Lambda function that deploys to endpoint
Accuracy Difference (AD): Does accuracy vary by group?
Run after training and continuously in production
Clarify Explainability — SHAP Values
SageMaker Clarify uses SHAP (SHapley Additive exPlanations) to explain individual predictions. This is critical for BFSI compliance (GDPR right to explanation).
from sagemaker import clarify

shap_config = clarify.SHAPConfig(
    baseline=[[0, 0, 0, 0]],  # Reference point for SHAP
    num_samples=500,          # Number of perturbations
    agg_method="mean_abs"     # Aggregation for feature importance
)
clarify_processor = clarify.SageMakerClarifyProcessor(
    role=role, instance_count=1, instance_type="ml.m5.xlarge"
)
clarify_processor.run_explainability(
    data_config=data_config,
    model_config=model_config,
    explainability_config=shap_config
)
# Output: per-feature importance scores for each prediction, e.g.
#   "This loan was denied primarily due to: credit_score (45%),
#    debt_to_income (30%), employment_length (15%)"
SageMaker JumpStart — Pretrained Models
JumpStart provides 600+ pretrained models that can be deployed or fine-tuned with one click. Key models for an AI Hub:
| Model | Type | Use in AI Hub |
| --- | --- | --- |
| Llama 3.1 (70B/8B) | LLM | Self-hosted alternative to Bedrock (full control, no token fees) |
| Falcon (180B/40B) | LLM | Open-source LLM for low-cost inference |
| BGE / GTE | Embedding | Custom embedding models for domain-specific RAG |
| Stable Diffusion | Image Gen | Image generation for creative use cases |
| Whisper | Speech-to-Text | Audio transcription for call center AI |
| Cross-Encoder | Re-ranking | Re-rank RAG results for higher relevance |
Cost Strategy: For high-volume inference, self-hosting open-source LLMs via JumpStart on GPU instances can be 5-10x cheaper than Bedrock per-token pricing. Trade-off: you manage the infrastructure.
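The break-even behind that claim is worth being able to derive on a whiteboard. A sketch with illustrative numbers — the per-token price, instance rate, and sustained throughput below are assumptions for the arithmetic, not quoted AWS rates:

```python
# Assumed figures for illustration only
TOKEN_PRICE_PER_1K = 0.003       # managed per-token price, $ per 1K tokens
INSTANCE_PRICE_PER_HR = 1.41     # self-hosted GPU instance, $ per hour
TOKENS_PER_HR = 2_000_000        # sustained throughput of that instance

def managed_cost(tokens):
    """Cost of serving `tokens` via per-token pricing."""
    return tokens / 1000 * TOKEN_PRICE_PER_1K

def self_hosted_cost(tokens):
    """Cost of serving `tokens` on a fully-utilized self-hosted instance."""
    return tokens / TOKENS_PER_HR * INSTANCE_PRICE_PER_HR

# At one hour of full utilization, self-hosting is several times cheaper;
# at low utilization the instance-hour is wasted and managed pricing wins
advantage = managed_cost(TOKENS_PER_HR) / self_hosted_cost(TOKENS_PER_HR)
```

The key insight is that the ratio only holds at high utilization — the "5-10x cheaper" figure assumes the GPUs stay busy.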
JumpStart Deploy Code
from sagemaker.jumpstart.model import JumpStartModel
from sagemaker.jumpstart.estimator import JumpStartEstimator

# Deploy Llama 3.1 8B with a few lines
model = JumpStartModel(
    model_id="meta-textgeneration-llama-3-1-8b-instruct",
    instance_type="ml.g5.2xlarge",
    role=role
)
predictor = model.deploy(
    initial_instance_count=1,
    endpoint_name="llama-3-1-endpoint"
)

# Fine-tune with your data (fine-tuning goes through JumpStartEstimator)
estimator = JumpStartEstimator(
    model_id="meta-textgeneration-llama-3-1-8b-instruct",
    role=role
)
estimator.fit({"training": "s3://bucket/fine-tune-data/"})
SageMaker for LLM Fine-Tuning
Fine-Tuning Options Comparison
| Method | Data Needed | Training Time | Cost | Quality |
| --- | --- | --- | --- | --- |
| Full Fine-Tuning | 1-5K examples | Hours-Days | $$$ | Best |
| LoRA | 500-2K examples | Minutes-Hours | $$ | 90% of full |
| QLoRA | 500-2K examples | Minutes-Hours | $ | 85% of full |
| Bedrock Fine-Tuning | 1K+ examples | Hours | $$ | Good (limited models) |
Decision: Use Bedrock fine-tuning for supported models (quick, managed). Use SageMaker for open-source models (Llama, Falcon), LoRA/QLoRA, or when you need full control over training hyperparameters and data pipeline.
Inference Cost Optimization
Multi-model endpoints: Share instances across models
Async endpoints: Scale to 0 when idle
Serverless inference: Pay only when invoked
Model compilation (Neo): 25% faster inference
Model quantization (INT8): Smaller model, less GPU needed
Inf2 instances: AWS Inferentia chips, 40% cheaper than GPUs
Instance Selection Cheat Sheet:
ml.g5.* — best price/performance for most LLM inference.
ml.p4d.* — A100 GPUs for large model training.
ml.inf2.* — AWS Inferentia for cost-optimized inference (40% cheaper).
ml.trn1.* — AWS Trainium for cost-optimized training (50% cheaper than p4d).
SageMaker Interview Questions
Q: "When would you host a model on SageMaker instead of using Bedrock?"
Answer: When you need (1) custom models not available on Bedrock (e.g., domain fine-tuned Llama with LoRA), (2) hardware control (specific GPU type, instance size), (3) cost optimization at high volume (self-hosting is 5-10x cheaper than per-token), (4) models with custom pre/post-processing, or (5) compliance requirements mandating dedicated infrastructure. In practice, most AI Hubs use both: Bedrock for quick multi-model access, SageMaker for custom workloads.
Q: "How do you handle model deployment without downtime?"
Answer: Use production variants for blue/green and canary deployments. Deploy new model as variant B with 5-10% traffic. Monitor latency, accuracy, and error rate via Data Capture + Model Monitor. If metrics are healthy after 1-2 hours, gradually shift traffic (20% → 50% → 100%). If metrics degrade, automatic rollback to variant A. For zero-downtime: use UpdateEndpoint with retain_all_variant_properties to swap models without endpoint recreation.
Q: "How would you design a cost-effective inference architecture for 1M requests/day?"
Answer: At 1M req/day (~12 req/sec avg, higher peak): (1) Use auto-scaling with target tracking on InvocationsPerInstance. (2) Use ml.g5.xlarge for GPU inference (best price/performance). (3) Enable model compilation with Neo for 25% speedup. (4) Consider Inf2 instances for 40% cost reduction. (5) Cache frequent queries in ElastiCache. (6) Use async endpoints for non-interactive workloads. (7) Savings Plans for baseline capacity, auto-scaling for peaks. Estimated cost: $3K-8K/month depending on model size.
# Glue ETL job for RAG document processing (sketch: the embed() and
# semantic_chunk() helpers are assumed to be defined elsewhere)
documents = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://data/documents/"]},
    format="json"  # Glue has no native PDF reader; assume text was extracted upstream
)

def chunk_document(record):
    text = record["content"]
    chunks = semantic_chunk(text, chunk_size=512)
    return [
        {"doc_id": record["id"], "chunk_id": i,
         "text": chunk, "embedding": embed(chunk)}
        for i, chunk in enumerate(chunks)
    ]

# One-to-many chunk fan-out simplified for illustration
embeddings_frame = documents.map(lambda x: chunk_document(x))
glue_context.write_dynamic_frame.from_options(
    frame=embeddings_frame,
    connection_type="s3",
    connection_options={"path": "s3://embeddings-bucket/"},
    format="parquet"
)
Lake Formation Governance for AI
Data Catalog:
├─ Dataset: customer_data
│ ├─ PII columns: [email, ssn, phone] → REDACT tag
│ ├─ Sensitivity: HIGH → Limited access
│ └─ Lake Formation policy: Only data scientists
│
└─ Dataset: training_data
├─ Status: APPROVED_FOR_TRAINING
├─ Model lineage: tracked
└─ Audit: enabled
🔗 API Gateway Patterns for AI
Three API Patterns
Synchronous
POST /v1/chat/completions
SLO: P50 <200ms, P99 <1s. For interactive chat.
Asynchronous
POST /v1/jobs → 202 Accepted
Webhook callback or polling. For batch/long-running.
Streaming (SSE)
POST /v1/chat/stream
Token-by-token via Server-Sent Events. Best UX for chat.
Streaming Implementation
# FastAPI example (Lambda + function URL)
import json

@app.post("/chat/stream")
async def stream_chat(request: ChatRequest):
    async def generate():
        response = bedrock_client.invoke_model_with_response_stream(
            modelId="claude-3-5-sonnet",
            body={"messages": request.messages}
        )
        for event in response['body']:
            if 'chunk' in event:
                chunk = json.loads(event['chunk']['bytes'])
                yield f"data: {json.dumps(chunk)}\n\n"
    return StreamingResponse(generate(), media_type="text/event-stream")
🏢 Multi-Tenant, Multi-Account Architecture
Three Isolation Models
| Model | Pattern | Isolation | Cost | Best For |
| --- | --- | --- | --- | --- |
| Silo | Account-per-tenant | Maximum | Highest | Regulated BFSI, healthcare |
| Bridge | Schema-per-tenant | Good | Moderate | Most enterprise AI platforms |
| Pool | Row-level isolation | Basic | Lowest | SaaS with low sensitivity |
Recommended: Hub-Spoke Architecture
Hub (Shared Services):
├─ Bedrock access layer (guardrails, models)
├─ SageMaker endpoints (shared fine-tuned models)
├─ S3 data lake (with ABAC tagging)
└─ OpenSearch (multi-tenant awareness)
Spoke (Per-Tenant):
├─ Lambda for tenant-specific logic
├─ DynamoDB for conversation history
├─ RDS for application data
├─ VPC with security group isolation
└─ IAM roles (assume role from spoke to hub)
Networking:
├─ Transit Gateway connects hubs and spokes
├─ Service control policies (spend limits per tenant)
└─ VPC endpoints for private API access
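Tenant isolation at the application layer reduces to a guard that runs on every request. A minimal sketch of the check a spoke Lambda might perform (the claim name `tenant_id` is an assumed JWT claim, enforced here as defense in depth on top of IAM ABAC):

```python
def assert_tenant_access(claims: dict, resource_tenant_id: str) -> None:
    """Fail closed: the tenant claim from the validated JWT must exactly match
    the tenant that owns the requested resource. Called before any data access
    in every spoke handler, in addition to IAM ABAC policies."""
    token_tenant = claims.get("tenant_id")
    if token_tenant is None or token_tenant != resource_tenant_id:
        raise PermissionError(f"cross-tenant access denied for {token_tenant!r}")
```

Pairing this with ABAC means a bug in either layer alone still cannot leak data between tenants.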
Pattern 2: Horizontal Pod Autoscaler (HPA)
Scale inference pods based on CPU/memory/custom metrics. Set minReplicas: 2 for availability, maxReplicas: 50 for burst, target 70% CPU utilization.
Pattern 3: GPU Node Affinity + Spot
Use node affinity to schedule GPU workloads on nvidia-a100 nodes. Combine with Karpenter spot instances for 70% cost reduction. Handle spot interruptions with graceful migration.
🔌 API Design for AI Services
Three Patterns Compared
| Pattern | Endpoint | Response | SLO | Use Case |
| --- | --- | --- | --- | --- |
| Sync | POST /v1/chat/completions | 200 + JSON body | P50 <200ms | Interactive chat |
| Async | POST /v1/jobs | 202 + job_id | Minutes | Long-running RAG, batch |
| Streaming | POST /v1/chat/stream | SSE (text/event-stream) | TTFB <500ms | Token-by-token chat UX |
Versioning Strategy
Recommended: URL path versioning (/v1/models, /v2/models) — clear, cacheable, and widely adopted. Use feature flags for gradual migration between versions.
| Data State | Mechanism | Details |
| --- | --- | --- |
| In Transit | TLS / mTLS | All network communication encrypted. mTLS for service-to-service. |
| At Rest | KMS encryption | S3 buckets, RDS, EBS, DynamoDB all KMS-encrypted. |
| Model Weights | Encrypted artifacts | Decrypted only in secure enclave. Access logs for who deployed what. |
| Fine-tuning Data | Time-bound access | Separate encrypted bucket. Deleted after training completes. |
Data Sensitivity Classification
PUBLIC → Freely accessible
INTERNAL → Within company only
CONFIDENTIAL → Limited team access
RESTRICTED → Executive approval needed
PII → Individual privacy (GDPR/CCPA)
PHI → Health information (HIPAA)
PCI → Payment information (PCI-DSS)
🏦 BFSI Compliance
Key Regulations
| Regulation | Region | Key Requirements |
| --- | --- | --- |
| GDPR | EU | Right to explanation, data portability, privacy by design |
Immutable S3 logs (Glacier, 7-year retention), KMS encryption, every model decision logged
Interview Tip: In BFSI, always mention the "3 lines of defense" model: (1) Business units own risk, (2) Compliance/Risk teams provide oversight, (3) Internal audit provides independent assurance. Show you understand regulated industry governance.
💡 Behavioral & Technical Leadership
Ownership Mindset — Platform Builder, Not Project Executor
This role expects you to think like a product owner, not just an engineer fulfilling tickets. Demonstrate this by talking about:
Vision: How your platform decisions enable business outcomes (cost reduction, time-to-market, compliance)
Trade-offs: Conscious build-vs-buy decisions with clear reasoning (not just "we used AWS because it was easy")
Iteration: How you evolved the platform based on real usage data, not just upfront design
Team enablement: How you made it easy for other teams to onboard (self-service, documentation, SDKs)
Translating AI Systems to Business Value
Technical to Business
"RAG reduces hallucination" → "Customers get accurate answers, reducing support tickets by 30%"
"Multi-tenant isolation" → "Each business unit's data is secure, enabling regulatory compliance"
"Canary deployment" → "We catch issues before they impact all customers"
Metrics That Matter
Time to onboard a new AI use case
Cost per AI inference request
Model accuracy / hallucination rate
Platform uptime and P99 latency
Number of teams self-servicing on platform
Building & Scaling Engineering Teams
Be ready to discuss how you've structured and grown teams. Key points to cover:
Team Topology: Platform team (infra + shared services) + Feature teams (use-case specific) + ML/AI team (model development)
Hiring: What you look for in AI platform engineers (distributed systems + ML interest, not just ML PhDs)
Culture: Blameless post-mortems, documentation as a first-class citizen, inner-source contributions
Scaling: From 3-person team to 15+ — how you organized, delegated, and maintained quality
Stakeholder Communication
| Audience | What They Care About | How to Communicate |
| --- | --- | --- |
| C-Suite | ROI, competitive advantage, risk | Business metrics, cost savings, risk mitigation |
| Product Managers | Features, timelines, capabilities | Roadmap, what's possible vs what's not, trade-offs |
Q1: "Design a production-grade RAG system for a financial services company"
Answer Framework: Start with requirements (accuracy >95%, compliance, multi-tenant). Architecture: Hierarchical chunking → hybrid search (OpenSearch BM25 + vector) → cross-encoder re-ranking → Bedrock with guardrails (PII redaction, hallucination detection) → immutable audit logging (S3 Glacier). Explain why Bedrock Knowledge Bases are insufficient for BFSI (no custom re-ranking, limited chunking). Mention RAGAS evaluation with >0.8 on faithfulness, relevance, and precision metrics.
Q2: "How would you handle multi-tenancy in an AI platform?"
Answer Framework: Hub-spoke architecture. Hub: shared Bedrock, SageMaker, OpenSearch. Spoke: per-tenant Lambda, DynamoDB, VPC. Three-layer isolation: IAM (ABAC with tenant tags), Application (tenant_id validation in every Lambda), Database (RLS policies). For BFSI, lean toward silo model for highest-sensitivity tenants, bridge model for others. Transit Gateway for networking, SCPs for cost control.
Q3: "Our LLM chatbot is hallucinating too much. What would you do?"
Answer Framework: Systematic debugging: (1) Measure current hallucination rate with RAGAS faithfulness metric. (2) Check RAG quality — is retrieval returning relevant docs? Measure Precision@K. (3) If retrieval is poor: add re-ranking, switch to hybrid search, improve chunking. (4) If retrieval is good but generation hallucinates: add output guardrails (grounding check), consider fine-tuning on domain data, reduce temperature. (5) Add monitoring: track hallucination rate per query type, set up alerts.
Q4: "Compare LangChain, LangGraph, and Bedrock Agents for production"
Answer Framework: LangChain: good for simple linear chains (chat, Q&A), wide ecosystem but no native state management. LangGraph: best for complex agents with loops, retries, human-in-the-loop — explicit state graph enables auditing and recovery. Bedrock Agents: managed, fast to set up, but limited to 15 iterations, less customizable. For production in regulated industries, LangGraph on ECS with custom state persistence in DynamoDB — gives full control and auditability.
Q5: "Design the monitoring and alerting for an AI platform"
Answer Framework: Four drift types to monitor: input drift (embedding distribution), output drift (token length, lexical diversity), data quality drift (embedding shifts), business metric drift (user satisfaction). Stack: CloudWatch for infrastructure, Evidently AI for drift detection, custom dashboards for token economics and cost per request. Automatic alerts: hallucination rate >5%, P95 latency >2x baseline, NDCG drop >20%. Include cost monitoring: tokens/query, cache hit rate, model cost breakdown.
Q6: "Walk me through building an AI platform in a regulated BFSI environment"
Answer Framework: Start with governance: 3-lines-of-defense model, model risk classification. Architecture: multi-account isolation (silo for sensitive, bridge for others). Every model decision logged immutably (S3 Glacier, 7-year retention). Fairness testing before deployment (demographic parity <5% variance). Explainability reports for credit decisions (GDPR right to explanation). Adversarial testing (prompt injection, jailbreak). Annual recertification of all models. Show the compliance workflow: Dev → Compliance Review → Approval Gate → Production Monitoring.
Q7: "How do you decide between building custom vs using managed AWS services?"
Answer Framework: Decision tree: (1) Is this a core differentiator? If yes, build. (2) Does the managed service meet accuracy/performance requirements? (3) Is customization needed beyond what the managed service offers? (4) What's the ops cost of self-hosting? Example: Bedrock Knowledge Bases for POC, custom RAG pipeline for production. SageMaker for training, custom endpoints on EKS for specialized inference. Always prototype with managed, migrate to custom when requirements demand it.
📜 Infrastructure as Code (Terraform / CloudFormation)