A production AI Hub on AWS follows a layered architecture with clear separation of concerns. Each layer can be scaled, monitored, and upgraded independently.
Request Flow (Happy Path)
1. User request arrives at API Gateway with auth token
2. Request routed to Agent Orchestration based on tenant/use case
3. Agent determines if RAG context is needed
4. If RAG: hybrid search (vector + keyword) in OpenSearch
5. Retrieved context sent to Foundation Model via Bedrock
6. Model output validated by Guardrails (PII, safety, toxicity)
7. Response logged, cached if applicable, returned to user
Key Design Principles
🔌 Loose Coupling
Each layer can be scaled independently. Swap Bedrock for SageMaker endpoints without touching the gateway or agent layer.
🔍 Observability
Latency, cost, and token usage tracked at every layer. CloudWatch + custom dashboards for per-tenant metrics.
🔒 Tenant Isolation
Multi-tenancy enforced at both IAM (ABAC) and application layers. Data never leaks between tenants.
⚡ Resilience
Circuit breakers for model calls, fallback to smaller/cached models, graceful degradation when services are down.
🛠 Platform Layers — Deep Dive
🔷 AI Gateway / Model Access Layer
The gateway is the single entry point for all AI requests. It handles authentication, rate limiting, request validation, and model routing.
AWS Implementation
| Component | AWS Service | Purpose |
| --- | --- | --- |
| Routing | API Gateway / ALB | Route requests by tenant, model, use case |
| Auth | Cognito + API Keys | JWT validation, API key management |
| Rate Limiting | API Gateway throttling | Per-tenant, per-model limits |
| Model Abstraction | Lambda + Bedrock SDK | Unified interface across model providers |
| Caching | ElastiCache (Redis) | Cache frequent queries, reduce cost |
Interview Tip: Emphasize that the gateway abstracts model providers — you can switch from Claude to Llama without application changes. This is a key "build vs buy" decision.
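To make the provider-abstraction point concrete, here is a minimal sketch of the pattern: callers name a model, the gateway resolves it to a provider behind a common interface. The class and model names are illustrative placeholders, and the provider bodies are stubs standing in for real Bedrock/SageMaker SDK calls.

```python
from abc import ABC, abstractmethod

class ModelProvider(ABC):
    """Unified interface the gateway exposes; concrete classes wrap vendor SDKs."""
    @abstractmethod
    def complete(self, prompt: str) -> str: ...

class BedrockProvider(ModelProvider):
    def complete(self, prompt: str) -> str:
        # In production this would call bedrock_runtime.invoke_model(...)
        return f"[bedrock] {prompt}"

class SageMakerProvider(ModelProvider):
    def complete(self, prompt: str) -> str:
        # In production this would call sm_runtime.invoke_endpoint(...)
        return f"[sagemaker] {prompt}"

# Routing table: swapping a model's provider is a one-line config change,
# invisible to every application calling the gateway
PROVIDERS = {
    "claude-3-5-sonnet": BedrockProvider(),
    "llama-3-1-8b": SageMakerProvider(),
}

def route(model_id: str, prompt: str) -> str:
    """Gateway entry point: applications reference a model id, never a vendor SDK."""
    return PROVIDERS[model_id].complete(prompt)
```

Switching "claude-3-5-sonnet" from Bedrock to a self-hosted endpoint only touches the `PROVIDERS` table, which is the decoupling the tip describes.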
🔷 Agent Orchestration Runtime
The orchestration layer manages multi-step agent workflows with state management, tool selection, and human-in-the-loop capabilities.
Key Decisions
| Approach | Best For | Trade-off |
| --- | --- | --- |
| LangGraph on ECS | Complex stateful agents | More control, more ops overhead |
| Bedrock Agents | Simple tool-use agents | Managed, but limited customization |
| Step Functions + Lambda | Deterministic workflows | Great visibility, but not dynamic |
Build vs Buy: Bedrock Agents are good for POCs. For production with complex retry logic, state persistence, and multi-agent coordination, build with LangGraph on ECS/EKS.
🔷 RAG & Knowledge Infrastructure
The knowledge layer handles document ingestion, chunking, embedding, vector storage, and retrieval with re-ranking.
Interview Tip: Always mention hybrid search (vector + keyword) with reciprocal rank fusion. This handles both semantic and exact-term matching, covering 90% of retrieval problems.
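Reciprocal rank fusion is simple enough to show inline. This sketch implements the standard formula (score each document as the sum of 1/(k + rank) over the result lists it appears in, with the conventional k = 60); the document ids are made up for illustration.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse multiple ranked lists (e.g. vector and BM25 results).

    Each document scores sum(1 / (k + rank)) over every list containing it,
    so documents near the top of several lists rise above documents that
    rank highly in only one.
    """
    scores = {}
    for ranked in rankings:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical results: vector search vs keyword (BM25) search
fused = reciprocal_rank_fusion([
    ["d1", "d2", "d3"],   # vector ranking
    ["d3", "d1", "d4"],   # keyword ranking
])
# d1 wins: it is near the top of both lists
```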
🔷 Governance, Security & Observability
A cross-cutting concern that spans all layers. Implements guardrails, audit logging, encryption, and monitoring.
| Component | Purpose | Details |
| --- | --- | --- |
| Multi-Head Attention | | 12-100 heads learning different relationships (syntax, semantics) |
| FFN | Feature transformation | d_model → 4×d_model → d_model with non-linearity |
| Layer Norm + Residual | Training stability | Enables training 100+ layer models |
Context Windows & Scaling
| Model | Context | Year |
| --- | --- | --- |
| GPT-2 | ~1K tokens | 2019 |
| GPT-3 | ~2K tokens | 2020 |
| Claude 3 | 200K tokens | 2024 |
| Claude 3.5 Sonnet | 200K tokens | 2024 |
Quadratic Problem: Attention is O(n²) — doubling context requires 4× resources. Effective context ≠ max tokens (models degrade at >80% capacity). The "lost-in-the-middle" effect means information in the center of long contexts gets less attention.
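The quadratic claim is easy to verify with arithmetic. A rough cost model (the d_model value is an arbitrary example):

```python
def attention_cost(n_tokens, d_model=4096):
    # The QK^T score matrix alone has n_tokens^2 entries, each costing ~d_model
    # multiply-adds, so compute and memory both scale with n^2
    return n_tokens ** 2 * d_model

# Doubling the context from 4K to 8K tokens quadruples the attention cost
ratio = attention_cost(8192) / attention_cost(4096)
```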
Fine-Tuning vs Prompting Decision Tree
Start with Prompt Engineering (cheap, fast)
│
└─ Quality insufficient after 10 iterations?
├─ Few-Shot Prompting (add examples) → 40-50% improvement
└─ Still insufficient?
├─ LoRA / Adapter Tuning → 80-90% of full tuning, 10x faster
└─ Full Fine-Tuning (distillation, hallucination correction)
1. Query Decomposition
Break complex queries into sub-queries. Example: "Tax implications of 2008 crisis on small businesses" becomes: ["2008 financial crisis causes", "tax policy changes 2008", "small business impact"]. Retrieve for each, combine results.
2. Query Expansion
"Machine learning models" expands to: ["ML models", "neural networks", "deep learning", "AI algorithms"]. Improves recall when document terminology varies.
3. HyDE (Hypothetical Document Embeddings)
Generate a hypothetical relevant document for the query, then use its embedding for retrieval instead of the query embedding. Bridges the vocabulary gap between queries and documents.
Chunking Strategies
| Strategy | Size | Best For | Trade-off |
| --- | --- | --- | --- |
| Fixed-Size | 512 tokens, 20% overlap | Baseline, general use | Breaks mid-sentence |
| Semantic | Variable | Domain-heavy (law, finance) | More expensive (embedding every boundary) |
| Hierarchical | Multi-level | Production systems | Complex but most effective |
| Document-Aware | Preserves structure | Structured docs with headers | Requires document parsing |
Production Best Practice: Start with hierarchical chunking + hybrid search + re-ranking. This covers 90% of RAG problems at moderate complexity. Re-ranking alone improves relevance 40-60% with <10% latency overhead.
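The baseline fixed-size strategy from the table above (512 tokens with 20% overlap) is a few lines of code. This sketch operates on a pre-tokenized list; real pipelines would tokenize with the embedding model's tokenizer first.

```python
def fixed_size_chunks(tokens, chunk_size=512, overlap_ratio=0.2):
    """Split a token list into fixed-size chunks with overlapping windows.

    A 20% overlap means each chunk repeats the last ~100 of the previous
    chunk's 512 tokens, so sentences cut at a boundary still appear whole
    in at least one chunk.
    """
    step = max(1, int(chunk_size * (1 - overlap_ratio)))  # stride between chunk starts
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # last window already covers the tail
    return chunks
```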
Input Drift
Query distribution changes (e.g., billing bot gets technical questions). Detect via statistical tests on embedding distributions.
Output Drift
Model behavior changes (more verbose, different tone). Detect via token length histograms, lexical diversity metrics.
Data Quality Drift
RAG documents updated but indexes stale. Detect via embedding distribution shift. Action: reindex, retrain.
Business Metric Drift
User satisfaction or conversion drops. Detect via user ratings, feedback analysis. Action: prompt refinement.
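A deliberately crude sketch of input-drift detection on embedding distributions: compare the centroid of recent query embeddings against a baseline window. Production systems would use proper two-sample tests (KS, MMD, population stability index) rather than a single centroid distance, and the threshold here is an arbitrary placeholder.

```python
import math

def centroid(embeddings):
    """Per-dimension mean of a batch of embedding vectors."""
    dim, n = len(embeddings[0]), len(embeddings)
    return [sum(e[i] for e in embeddings) / n for i in range(dim)]

def centroid_shift(baseline, current):
    """L2 distance between embedding centroids — a coarse input-drift signal."""
    b, c = centroid(baseline), centroid(current)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(b, c)))

def input_drifted(baseline, current, threshold=0.5):
    # Threshold is illustrative; calibrate against historical week-over-week shift
    return centroid_shift(baseline, current) > threshold
```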
Retraining Triggers
Automatic
Hallucination rate > 5%
Latency P95 > 2x baseline
Retrieval NDCG drops 20%
New data + 2 weeks passed
Manual
Major domain update (new regulations)
Systemic bias from user feedback
Better model released by provider
Compliance requirement changes
☁ Amazon Bedrock
Architecture Overview
Bedrock Knowledge Bases: Build vs Buy
| | Bedrock Knowledge Bases | Custom RAG Pipeline |
| --- | --- | --- |
| Setup Time | Hours | Weeks |
| Chunking | Fixed only | Semantic, hierarchical, custom |
| Hybrid Search | Limited | Full control (RRF, weighted) |
| Re-ranking | Not customizable | Cross-encoder, LLM-based |
| Best For | POCs, simple Q&A | Production, accuracy >95% |
Interview Tip: Frame build vs buy as a spectrum. Start with Bedrock KB for POC, then migrate to custom RAG when accuracy requirements increase. This shows pragmatic architecture thinking.
🔬 Amazon SageMaker — Deep Dive
SageMaker is the core ML platform on AWS — it spans the entire ML lifecycle from data labeling through training, hosting, and monitoring. For an AI Hub, SageMaker handles custom model training, fine-tuning, and serving models that Bedrock doesn't offer.
SageMaker vs Bedrock — When to Use Which
| Scenario | Use Bedrock | Use SageMaker |
| --- | --- | --- |
| Foundation model inference | ✓ (managed, multi-provider) | Only for models not on Bedrock |
| Custom model training | ✗ | ✓ (Training Jobs, distributed) |
| Fine-tuning LLMs | ✓ (limited models, Bedrock fine-tuning) | ✓ (full control, any model, LoRA/QLoRA) |
| Classical ML | ✗ | ✓ (XGBoost, sklearn, etc.) |
| Embedding models | ✓ (Titan Embeddings, Cohere) | ✓ (custom embedding models) |
| RAG | ✓ (Knowledge Bases) | Pair with OpenSearch for custom RAG |
| AI agents | ✓ (Bedrock Agents) | ✗ (use LangGraph on ECS instead) |
| GPU/hardware control | ✗ (abstracted) | ✓ (choose instance type, GPU count) |
| Cost optimization | Pay per token | Pay per instance-hour (spot = 70% off) |
Interview Tip: A mature AI Hub uses both: Bedrock for quick access to foundation models, SageMaker for custom models, fine-tuning, and workloads where you need hardware control or cost optimization at scale.
Inference Endpoints — Deep Dive
| Option | Latency | Throughput | Cost Model | Use Case |
| --- | --- | --- | --- | --- |
| Real-Time Endpoint | <1s | High | Pay per instance-hour | Interactive APIs, chatbots |
| Async Inference | Minutes | Very high | Pay per request + instance | Large payload, batch, RAG pipelines |
| Batch Transform | Hours | Unlimited | Cheapest per inference | Offline scoring, nightly reprocessing |
| Serverless Inference | 1-2s (cold start) | Medium | Pay per invocation + duration | Spiky traffic, dev/test environments |
| Multi-Model Endpoint | <1s (cached) | High | Pay per instance (shared) | Serve 100s of models on 1 endpoint |
| Multi-Container Endpoint | <1s | High | Pay per instance | Inference pipeline (preprocess + model + postprocess) |
Real-Time Endpoint Architecture
API Request
│
Application Load Balancer
│
SageMaker Endpoint
├─ Production Variant A (80% traffic) ─ ml.g5.xlarge
├─ Production Variant B (20% traffic) ─ ml.g5.2xlarge [canary]
└─ Shadow Variant (0% live, copies traffic) ─ [A/B testing]
│
Model Container (ECR image with inference code)
├─ model_fn() ─ Load model weights
├─ input_fn() ─ Deserialize request
├─ predict_fn() ─ Run inference
└─ output_fn() ─ Serialize response
Production Variant Configuration
import sagemaker
from sagemaker.model import Model
from sagemaker.model_monitor import DataCaptureConfig

model = Model(
    image_uri="123456789.dkr.ecr.us-east-1.amazonaws.com/my-model:latest",
    model_data="s3://bucket/model.tar.gz",
    role="arn:aws:iam::123456789:role/SageMakerRole"
)

# Deploy with data capture enabled; additional production variants for
# A/B testing are attached via the endpoint configuration
predictor = model.deploy(
    initial_instance_count=2,
    instance_type="ml.g5.xlarge",
    endpoint_name="my-model-endpoint",
    data_capture_config=DataCaptureConfig(
        enable_capture=True,
        sampling_percentage=100,  # Capture all requests for monitoring
        destination_s3_uri="s3://bucket/capture/"
    )
)
Async Inference — For Long-Running AI Workloads
Client → POST request with S3 input location
│
SageMaker Async Endpoint
├─ Queues request internally
├─ Returns InferenceId immediately
├─ Processes when capacity available
└─ Writes output to S3
│
SNS Notification → Success/Failure callback
│
Auto-scales to 0 when idle (cost savings!)
Key benefit: Async endpoints can scale to 0 instances — no traffic means no cost. Perfect for intermittent AI workloads like document processing, batch RAG, or nightly model retraining inference.
Multi-Model Endpoints (MME) — Serve Hundreds of Models
Multi-Model Endpoint (single endpoint, single instance fleet)
│
├─ Model A (loaded) ─ Tenant 1 fine-tuned model
├─ Model B (loaded) ─ Tenant 2 fine-tuned model
├─ Model C (on S3) ─ Loaded on-demand when called
├─ Model D (on S3) ─ Loaded on-demand when called
└─ ... up to 1000s of models
Dynamic loading: frequently-used models stay in memory,
cold models loaded from S3 on first request (~seconds)
MME is ideal for multi-tenant AI platforms where each tenant has a fine-tuned model. Instead of one endpoint per tenant ($$$), you serve all tenants from a shared fleet and SageMaker handles model loading/unloading based on traffic patterns.
import json
import boto3

sm_runtime = boto3.client("sagemaker-runtime")

# Invoke a specific model on a multi-model endpoint
response = sm_runtime.invoke_endpoint(
    EndpointName="multi-model-endpoint",
    TargetModel="tenant-a/model.tar.gz",  # S3 key for this tenant's model
    Body=json.dumps(payload),
    ContentType="application/json"
)
Inference Recommender — Right-Size Your Endpoints
SageMaker Inference Recommender benchmarks your model across different instance types and configurations to find the optimal cost/performance trade-off.
import boto3

sm_client = boto3.client("sagemaker")

# Run Inference Recommender to find the optimal instance
response = sm_client.create_inference_recommendations_job(
    JobName="my-model-benchmark",
    JobType="Default",  # or "Advanced" for custom traffic patterns
    RoleArn=role_arn,
    InputConfig={
        "ModelPackageVersionArn": model_package_arn,
        "JobDurationInSeconds": 3600
    }
)
# Returns a ranked list of instance types with latency, throughput, cost:
#   ml.g5.xlarge  — P95: 120ms, 40 req/s, $1.41/hr
#   ml.g5.2xlarge — P95:  85ms, 65 req/s, $2.36/hr
Auto-Scaling Endpoints
Scaling Policies for AI Workloads
import boto3

client = boto3.client("application-autoscaling")

# Register the endpoint variant as a scalable target
client.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId="endpoint/my-endpoint/variant/AllTraffic",
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=2,   # Always at least 2 for availability
    MaxCapacity=20,  # Burst up to 20 instances
)

# Target tracking — scale based on invocations per instance
client.put_scaling_policy(
    PolicyName="InvocationsPerInstance",
    ServiceNamespace="sagemaker",
    ResourceId="endpoint/my-endpoint/variant/AllTraffic",
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 70.0,  # Target 70 invocations per instance
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        "ScaleInCooldown": 300,  # Wait 5 min before scaling down
        "ScaleOutCooldown": 60,  # Scale up quickly (1 min)
    }
)
Scaling Metrics Comparison
| Metric | Best For | Trade-off |
| --- | --- | --- |
| InvocationsPerInstance | General workloads | Simple, but doesn't account for request complexity |
| CPUUtilization | CPU-bound models | Doesn't correlate with GPU utilization |
| GPUUtilization | GPU-heavy inference | Custom metric via CloudWatch, more accurate for LLMs |
| ModelLatency | Latency-sensitive APIs | Scale when latency degrades above target |
| Custom (queue depth) | Async inference | Scale based on pending requests in SQS |
SageMaker Training Jobs
Training Architecture
Training Job
│
├─ Input: S3 (training data) + ECR (container image)
├─ Compute: ml.p4d.24xlarge (8x A100 GPUs)
│ ├─ On-Demand: full price, guaranteed capacity
│ └─ Managed Spot: 70% off, can be interrupted
│ └─ Checkpointing to S3 every N steps
├─ Distributed: data parallel / model parallel
│ ├─ Data Parallel: same model, split data across GPUs
│ └─ Model Parallel: split model layers across GPUs (for LLMs)
└─ Output: model.tar.gz → S3 → Model Registry
Training Job Code Example
from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri="763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:2.1-gpu",
    role=role,
    instance_count=4,                 # 4 instances for distributed training
    instance_type="ml.p4d.24xlarge",  # 8x A100 GPUs each = 32 GPUs total
    use_spot_instances=True,          # ~70% cost savings
    max_wait=86400,                   # Max 24 hr including spot interruptions
    max_run=72000,                    # Max 20 hr actual training
    checkpoint_s3_uri="s3://bucket/checkpoints/",
    output_path="s3://bucket/output/",
    hyperparameters={
        "epochs": "10",
        "learning_rate": "5e-5",
        "batch_size": "32",
        "model_name": "bert-base-uncased"
    },
    distribution={
        "torch_distributed": {"enabled": True}  # PyTorch DDP
    },
    metric_definitions=[
        {"Name": "train:loss", "Regex": "train_loss: ([0-9\\.]+)"},
        {"Name": "eval:accuracy", "Regex": "eval_acc: ([0-9\\.]+)"}
    ]
)
estimator.fit({
    "training": "s3://bucket/train/",
    "validation": "s3://bucket/val/"
})
If spot is interrupted, training pauses and resumes from checkpoint
Set max_wait as budget for total time including interruptions
Set max_run as budget for actual training time
Checkpointing Strategy
Save checkpoints to S3 every N steps (not just epochs)
For LLM fine-tuning: checkpoint every 500-1000 steps
Resume from latest checkpoint after spot interruption
Cost savings: 60-90% vs on-demand for training jobs
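The save/resume mechanics above can be sketched in a few lines. On SageMaker, anything written under the local checkpoint directory (conventionally `/opt/ml/checkpoints`) is synced to `checkpoint_s3_uri`; the JSON state format here is an illustrative stand-in for real model/optimizer state.

```python
import json
import os
import tempfile

def save_checkpoint(ckpt_dir, step, state):
    """Write a step-stamped checkpoint; zero-padded names sort chronologically."""
    path = os.path.join(ckpt_dir, f"step-{step:08d}.json")
    with open(path, "w") as f:
        json.dump({"step": step, "state": state}, f)

def latest_checkpoint(ckpt_dir):
    """Return the highest-step checkpoint, or None when starting fresh."""
    files = sorted(f for f in os.listdir(ckpt_dir) if f.startswith("step-"))
    if not files:
        return None
    with open(os.path.join(ckpt_dir, files[-1])) as f:
        return json.load(f)

# Simulate a training loop interrupted after step 1000, then resumed
ckpt_dir = tempfile.mkdtemp()  # stands in for /opt/ml/checkpoints
save_checkpoint(ckpt_dir, 500, {"loss": 1.2})
save_checkpoint(ckpt_dir, 1000, {"loss": 0.8})
resume = latest_checkpoint(ckpt_dir)  # training restarts from step 1000
```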
Hyperparameter Tuning (HPO)
from sagemaker.tuner import HyperparameterTuner, ContinuousParameter, IntegerParameter

tuner = HyperparameterTuner(
    estimator=estimator,
    objective_metric_name="eval:accuracy",
    objective_type="Maximize",
    hyperparameter_ranges={
        "learning_rate": ContinuousParameter(1e-5, 5e-4, scaling_type="Logarithmic"),
        "batch_size": IntegerParameter(16, 128),
        "warmup_steps": IntegerParameter(100, 1000),
    },
    max_jobs=20,          # Run 20 combinations
    max_parallel_jobs=4,  # 4 training jobs in parallel
    strategy="Bayesian",  # Bayesian optimization (smarter than grid/random)
)
tuner.fit({"training": "s3://bucket/train/"})
Tuning Strategies: Bayesian (best for expensive jobs, learns from previous runs) > Random (good baseline, parallelizable) > Grid (exhaustive, expensive). For LLM fine-tuning, Bayesian with 10-20 jobs is usually sufficient.
SageMaker Pipelines — MLOps Orchestration
SageMaker Pipeline (DAG-based, repeatable, versioned)
│
├─ ProcessingStep
│ └─ Data cleaning, feature engineering, train/test split
│
├─ TrainingStep
│ └─ Model training (GPU, distributed, spot)
│
├─ TuningStep (optional)
│ └─ Hyperparameter optimization across N runs
│
├─ EvaluationStep
│ └─ Compute metrics (accuracy, F1, RAGAS, custom)
│
├─ ConditionStep
│ ├─ accuracy > 0.85? → RegisterModel → Deploy
│ └─ else → FailStep (notify team)
│
├─ RegisterModelStep
│ └─ Model Registry (version, approval status, lineage)
│
└─ CreateEndpointStep (or Lambda for custom deployment)
└─ Deploy to endpoint with data capture enabled
Model Package Group: "fraud-detection-models"
│
├─ Version 1 (v1.0) ─ Status: Approved ─ In Production
│ ├─ Metrics: accuracy=0.92, F1=0.89
│ ├─ Artifacts: s3://models/fraud-v1/model.tar.gz
│ └─ Lineage: training job ARN, dataset version, pipeline run
│
├─ Version 2 (v2.0) ─ Status: PendingManualApproval
│ ├─ Metrics: accuracy=0.94, F1=0.91
│ └─ Awaiting: compliance review + risk committee
│
└─ Version 3 (v3.0) ─ Status: Rejected
└─ Reason: bias detected in demographic parity test
# Approve a model version (typically via CI/CD or manual review)
sm_client.update_model_package(
    ModelPackageArn="arn:aws:sagemaker:us-east-1:123:model-package/fraud-v2",
    ModelApprovalStatus="Approved",
    ApprovalDescription="Passed compliance review, bias testing, and risk committee"
)

# EventBridge rule triggers deployment on approval:
#   Rule: source=sagemaker, detail-type=ModelPackageStateChange, status=Approved
#   Target: Lambda function that deploys to endpoint
Accuracy Difference (AD): Does accuracy vary by group?
Run after training and continuously in production
Clarify Explainability — SHAP Values
SageMaker Clarify uses SHAP (SHapley Additive exPlanations) to explain individual predictions. This is critical for BFSI compliance (GDPR right to explanation).
from sagemaker import clarify

shap_config = clarify.SHAPConfig(
    baseline=[[0, 0, 0, 0]],  # Reference point for SHAP
    num_samples=500,          # Number of perturbations
    agg_method="mean_abs"     # Aggregation for feature importance
)
clarify_processor = clarify.SageMakerClarifyProcessor(
    role=role, instance_count=1, instance_type="ml.m5.xlarge"
)
clarify_processor.run_explainability(
    data_config=data_config,
    model_config=model_config,
    explainability_config=shap_config
)
# Output: per-feature importance scores for each prediction, e.g.
#   "This loan was denied primarily due to: credit_score (45%),
#    debt_to_income (30%), employment_length (15%)"
SageMaker JumpStart — Pretrained Models
JumpStart provides 600+ pretrained models that can be deployed or fine-tuned with one click. Key models for an AI Hub:
| Model | Type | Use in AI Hub |
| --- | --- | --- |
| Llama 3.1 (70B/8B) | LLM | Self-hosted alternative to Bedrock (full control, no token fees) |
| Falcon (180B/40B) | LLM | Open-source LLM for low-cost inference |
| BGE / GTE | Embedding | Custom embedding models for domain-specific RAG |
| Stable Diffusion | Image Gen | Image generation for creative use cases |
| Whisper | Speech-to-Text | Audio transcription for call center AI |
| Cross-Encoder | Re-ranking | Re-rank RAG results for higher relevance |
Cost Strategy: For high-volume inference, self-hosting open-source LLMs via JumpStart on GPU instances can be 5-10x cheaper than Bedrock per-token pricing. Trade-off: you manage the infrastructure.
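The break-even behind that claim is worth being able to derive on a whiteboard. A sketch with illustrative numbers — the per-token price, instance rate, and sustained throughput below are assumptions for the arithmetic, not quoted AWS rates:

```python
# Assumed figures for illustration only
TOKEN_PRICE_PER_1K = 0.003       # managed per-token price, $ per 1K tokens
INSTANCE_PRICE_PER_HR = 1.41     # self-hosted GPU instance, $ per hour
TOKENS_PER_HR = 2_000_000        # sustained throughput of that instance

def managed_cost(tokens):
    """Cost of serving `tokens` via per-token pricing."""
    return tokens / 1000 * TOKEN_PRICE_PER_1K

def self_hosted_cost(tokens):
    """Cost of serving `tokens` on a fully-utilized self-hosted instance."""
    return tokens / TOKENS_PER_HR * INSTANCE_PRICE_PER_HR

# At one hour of full utilization, self-hosting is several times cheaper;
# at low utilization the instance-hour is wasted and managed pricing wins
advantage = managed_cost(TOKENS_PER_HR) / self_hosted_cost(TOKENS_PER_HR)
```

The key insight is that the ratio only holds at high utilization — the "5-10x cheaper" figure assumes the GPUs stay busy.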
JumpStart Deploy Code
from sagemaker.jumpstart.model import JumpStartModel
from sagemaker.jumpstart.estimator import JumpStartEstimator

# Deploy Llama 3.1 8B with a few lines
model = JumpStartModel(
    model_id="meta-textgeneration-llama-3-1-8b-instruct",
    instance_type="ml.g5.2xlarge",
    role=role
)
predictor = model.deploy(
    initial_instance_count=1,
    endpoint_name="llama-3-1-endpoint"
)

# Fine-tune with your data (fine-tuning goes through JumpStartEstimator)
estimator = JumpStartEstimator(
    model_id="meta-textgeneration-llama-3-1-8b-instruct",
    role=role
)
estimator.fit({"training": "s3://bucket/fine-tune-data/"})
SageMaker for LLM Fine-Tuning
Fine-Tuning Options Comparison
| Method | Data Needed | Training Time | Cost | Quality |
| --- | --- | --- | --- | --- |
| Full Fine-Tuning | 1-5K examples | Hours-Days | $$$ | Best |
| LoRA | 500-2K examples | Minutes-Hours | $$ | 90% of full |
| QLoRA | 500-2K examples | Minutes-Hours | $ | 85% of full |
| Bedrock Fine-Tuning | 1K+ examples | Hours | $$ | Good (limited models) |
Decision: Use Bedrock fine-tuning for supported models (quick, managed). Use SageMaker for open-source models (Llama, Falcon), LoRA/QLoRA, or when you need full control over training hyperparameters and data pipeline.
Inference Cost Optimization
Multi-model endpoints: Share instances across models
Async endpoints: Scale to 0 when idle
Serverless inference: Pay only when invoked
Model compilation (Neo): 25% faster inference
Model quantization (INT8): Smaller model, less GPU needed
Inf2 instances: AWS Inferentia chips, 40% cheaper than GPUs
Instance Selection Cheat Sheet:
ml.g5.* — best price/performance for most LLM inference.
ml.p4d.* — A100 GPUs for large model training.
ml.inf2.* — AWS Inferentia for cost-optimized inference (40% cheaper).
ml.trn1.* — AWS Trainium for cost-optimized training (50% cheaper than p4d).
SageMaker Interview Questions
Q: "When would you host a model on SageMaker instead of using Bedrock?"
Answer: When you need (1) custom models not available on Bedrock (e.g., domain fine-tuned Llama with LoRA), (2) hardware control (specific GPU type, instance size), (3) cost optimization at high volume (self-hosting is 5-10x cheaper than per-token), (4) models with custom pre/post-processing, or (5) compliance requirements mandating dedicated infrastructure. In practice, most AI Hubs use both: Bedrock for quick multi-model access, SageMaker for custom workloads.
Q: "How do you handle model deployment without downtime?"
Answer: Use production variants for blue/green and canary deployments. Deploy new model as variant B with 5-10% traffic. Monitor latency, accuracy, and error rate via Data Capture + Model Monitor. If metrics are healthy after 1-2 hours, gradually shift traffic (20% → 50% → 100%). If metrics degrade, automatic rollback to variant A. For zero-downtime: use UpdateEndpoint with retain_all_variant_properties to swap models without endpoint recreation.
Q: "How would you design a cost-effective inference architecture for 1M requests/day?"
Answer: At 1M req/day (~12 req/sec avg, higher peak): (1) Use auto-scaling with target tracking on InvocationsPerInstance. (2) Use ml.g5.xlarge for GPU inference (best price/performance). (3) Enable model compilation with Neo for 25% speedup. (4) Consider Inf2 instances for 40% cost reduction. (5) Cache frequent queries in ElastiCache. (6) Use async endpoints for non-interactive workloads. (7) Savings Plans for baseline capacity, auto-scaling for peaks. Estimated cost: $3K-8K/month depending on model size.
# Glue ETL job for RAG document processing (sketch: the embed() and
# semantic_chunk() helpers are assumed to be defined elsewhere)
documents = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://data/documents/"]},
    format="json"  # Glue has no native PDF reader; assume text was extracted upstream
)

def chunk_document(record):
    text = record["content"]
    chunks = semantic_chunk(text, chunk_size=512)
    return [
        {"doc_id": record["id"], "chunk_id": i,
         "text": chunk, "embedding": embed(chunk)}
        for i, chunk in enumerate(chunks)
    ]

# One-to-many chunk fan-out simplified for illustration
embeddings_frame = documents.map(lambda x: chunk_document(x))
glue_context.write_dynamic_frame.from_options(
    frame=embeddings_frame,
    connection_type="s3",
    connection_options={"path": "s3://embeddings-bucket/"},
    format="parquet"
)
Lake Formation Governance for AI
Data Catalog:
├─ Dataset: customer_data
│ ├─ PII columns: [email, ssn, phone] → REDACT tag
│ ├─ Sensitivity: HIGH → Limited access
│ └─ Lake Formation policy: Only data scientists
│
└─ Dataset: training_data
├─ Status: APPROVED_FOR_TRAINING
├─ Model lineage: tracked
└─ Audit: enabled
🔗 API Gateway Patterns for AI
Three API Patterns
Synchronous
POST /v1/chat/completions
SLO: P50 <200ms, P99 <1s. For interactive chat.
Asynchronous
POST /v1/jobs → 202 Accepted
Webhook callback or polling. For batch/long-running.
Streaming (SSE)
POST /v1/chat/stream
Token-by-token via Server-Sent Events. Best UX for chat.
Streaming Implementation
# FastAPI example (Lambda + function URL)
import json

@app.post("/chat/stream")
async def stream_chat(request: ChatRequest):
    async def generate():
        response = bedrock_client.invoke_model_with_response_stream(
            modelId="claude-3-5-sonnet",
            body={"messages": request.messages}
        )
        for event in response['body']:
            if 'chunk' in event:
                chunk = json.loads(event['chunk']['bytes'])
                yield f"data: {json.dumps(chunk)}\n\n"
    return StreamingResponse(generate(), media_type="text/event-stream")
🏢 Multi-Tenant, Multi-Account Architecture
Three Isolation Models
| Model | Pattern | Isolation | Cost | Best For |
| --- | --- | --- | --- | --- |
| Silo | Account-per-tenant | Maximum | Highest | Regulated BFSI, healthcare |
| Bridge | Schema-per-tenant | Good | Moderate | Most enterprise AI platforms |
| Pool | Row-level isolation | Basic | Lowest | SaaS with low sensitivity |
Recommended: Hub-Spoke Architecture
Hub (Shared Services):
├─ Bedrock access layer (guardrails, models)
├─ SageMaker endpoints (shared fine-tuned models)
├─ S3 data lake (with ABAC tagging)
└─ OpenSearch (multi-tenant awareness)
Spoke (Per-Tenant):
├─ Lambda for tenant-specific logic
├─ DynamoDB for conversation history
├─ RDS for application data
├─ VPC with security group isolation
└─ IAM roles (assume role from spoke to hub)
Networking:
├─ Transit Gateway connects hubs and spokes
├─ Service control policies (spend limits per tenant)
└─ VPC endpoints for private API access
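Tenant isolation at the application layer reduces to a guard that runs on every request. A minimal sketch of the check a spoke Lambda might perform (the claim name `tenant_id` is an assumed JWT claim, enforced here as defense in depth on top of IAM ABAC):

```python
def assert_tenant_access(claims: dict, resource_tenant_id: str) -> None:
    """Fail closed: the tenant claim from the validated JWT must exactly match
    the tenant that owns the requested resource. Called before any data access
    in every spoke handler, in addition to IAM ABAC policies."""
    token_tenant = claims.get("tenant_id")
    if token_tenant is None or token_tenant != resource_tenant_id:
        raise PermissionError(f"cross-tenant access denied for {token_tenant!r}")
```

Pairing this with ABAC means a bug in either layer alone still cannot leak data between tenants.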
Pattern 2: Horizontal Pod Autoscaler (HPA)
Scale inference pods based on CPU/memory/custom metrics. Set minReplicas: 2 for availability, maxReplicas: 50 for burst, target 70% CPU utilization.
Pattern 3: GPU Node Affinity + Spot
Use node affinity to schedule GPU workloads on nvidia-a100 nodes. Combine with Karpenter spot instances for 70% cost reduction. Handle spot interruptions with graceful migration.
🔌 API Design for AI Services
Three Patterns Compared
| Pattern | Endpoint | Response | SLO | Use Case |
| --- | --- | --- | --- | --- |
| Sync | POST /v1/chat/completions | 200 + JSON body | P50 <200ms | Interactive chat |
| Async | POST /v1/jobs | 202 + job_id | Minutes | Long-running RAG, batch |
| Streaming | POST /v1/chat/stream | SSE (text/event-stream) | TTFB <500ms | Token-by-token chat UX |
Versioning Strategy
Recommended: URL path versioning (/v1/models, /v2/models) — clear, cacheable, and widely adopted. Use feature flags for gradual migration between versions.
| Data State | Mechanism | Details |
| --- | --- | --- |
| In Transit | TLS / mTLS | All network communication encrypted. mTLS for service-to-service. |
| At Rest | KMS encryption | S3 buckets, RDS, EBS, DynamoDB all KMS-encrypted. |
| Model Weights | Encrypted artifacts | Decrypted only in secure enclave. Access logs for who deployed what. |
| Fine-tuning Data | Time-bound access | Separate encrypted bucket. Deleted after training completes. |
Data Sensitivity Classification
PUBLIC → Freely accessible
INTERNAL → Within company only
CONFIDENTIAL → Limited team access
RESTRICTED → Executive approval needed
PII → Individual privacy (GDPR/CCPA)
PHI → Health information (HIPAA)
PCI → Payment information (PCI-DSS)
🏦 BFSI Compliance
Key Regulations
| Regulation | Region | Key Requirements |
| --- | --- | --- |
| GDPR | EU | Right to explanation, data portability, privacy by design |
Immutable S3 logs (Glacier, 7-year retention), KMS encryption, every model decision logged
Interview Tip: In BFSI, always mention the "3 lines of defense" model: (1) Business units own risk, (2) Compliance/Risk teams provide oversight, (3) Internal audit provides independent assurance. Show you understand regulated industry governance.
💡 Behavioral & Technical Leadership
Ownership Mindset — Platform Builder, Not Project Executor
This role expects you to think like a product owner, not just an engineer fulfilling tickets. Demonstrate this by talking about:
Vision: How your platform decisions enable business outcomes (cost reduction, time-to-market, compliance)
Trade-offs: Conscious build-vs-buy decisions with clear reasoning (not just "we used AWS because it was easy")
Iteration: How you evolved the platform based on real usage data, not just upfront design
Team enablement: How you made it easy for other teams to onboard (self-service, documentation, SDKs)
Translating AI Systems to Business Value
Technical to Business
"RAG reduces hallucination" → "Customers get accurate answers, reducing support tickets by 30%"
"Multi-tenant isolation" → "Each business unit's data is secure, enabling regulatory compliance"
"Canary deployment" → "We catch issues before they impact all customers"
Metrics That Matter
Time to onboard a new AI use case
Cost per AI inference request
Model accuracy / hallucination rate
Platform uptime and P99 latency
Number of teams self-servicing on platform
Building & Scaling Engineering Teams
Be ready to discuss how you've structured and grown teams. Key points to cover:
Team Topology: Platform team (infra + shared services) + Feature teams (use-case specific) + ML/AI team (model development)
Hiring: What you look for in AI platform engineers (distributed systems + ML interest, not just ML PhDs)
Culture: Blameless post-mortems, documentation as a first-class citizen, inner-source contributions
Scaling: From 3-person team to 15+ — how you organized, delegated, and maintained quality
Stakeholder Communication
| Audience | What They Care About | How to Communicate |
| --- | --- | --- |
| C-Suite | ROI, competitive advantage, risk | Business metrics, cost savings, risk mitigation |
| Product Managers | Features, timelines, capabilities | Roadmap, what's possible vs what's not, trade-offs |
Q1: "Design a production-grade RAG system for a financial services company"
Answer Framework: Start with requirements (accuracy >95%, compliance, multi-tenant). Architecture: Hierarchical chunking → hybrid search (OpenSearch BM25 + vector) → cross-encoder re-ranking → Bedrock with guardrails (PII redaction, hallucination detection) → immutable audit logging (S3 Glacier). Explain why Bedrock Knowledge Bases are insufficient for BFSI (no custom re-ranking, limited chunking). Mention RAGAS evaluation with >0.8 on faithfulness, relevance, and precision metrics.
Q2: "How would you handle multi-tenancy in an AI platform?"
Answer Framework: Hub-spoke architecture. Hub: shared Bedrock, SageMaker, OpenSearch. Spoke: per-tenant Lambda, DynamoDB, VPC. Three-layer isolation: IAM (ABAC with tenant tags), Application (tenant_id validation in every Lambda), Database (RLS policies). For BFSI, lean toward silo model for highest-sensitivity tenants, bridge model for others. Transit Gateway for networking, SCPs for cost control.
Q3: "Our LLM chatbot is hallucinating too much. What would you do?"
Answer Framework: Systematic debugging: (1) Measure current hallucination rate with RAGAS faithfulness metric. (2) Check RAG quality — is retrieval returning relevant docs? Measure Precision@K. (3) If retrieval is poor: add re-ranking, switch to hybrid search, improve chunking. (4) If retrieval is good but generation hallucinates: add output guardrails (grounding check), consider fine-tuning on domain data, reduce temperature. (5) Add monitoring: track hallucination rate per query type, set up alerts.
Q4: "Compare LangChain, LangGraph, and Bedrock Agents for production"
Answer Framework: LangChain: good for simple linear chains (chat, Q&A), wide ecosystem but no native state management. LangGraph: best for complex agents with loops, retries, human-in-the-loop — explicit state graph enables auditing and recovery. Bedrock Agents: managed, fast to set up, but limited to 15 iterations, less customizable. For production in regulated industries, LangGraph on ECS with custom state persistence in DynamoDB — gives full control and auditability.
Q5: "Design the monitoring and alerting for an AI platform"
Answer Framework: Four drift types to monitor: input drift (embedding distribution), output drift (token length, lexical diversity), data quality drift (embedding shifts), business metric drift (user satisfaction). Stack: CloudWatch for infrastructure, Evidently AI for drift detection, custom dashboards for token economics and cost per request. Automatic alerts: hallucination rate >5%, P95 latency >2x baseline, NDCG drop >20%. Include cost monitoring: tokens/query, cache hit rate, model cost breakdown.
Q6: "Walk me through building an AI platform in a regulated BFSI environment"
Answer Framework: Start with governance: 3-lines-of-defense model, model risk classification. Architecture: multi-account isolation (silo for sensitive, bridge for others). Every model decision logged immutably (S3 Glacier, 7-year retention). Fairness testing before deployment (demographic parity <5% variance). Explainability reports for credit decisions (GDPR right to explanation). Adversarial testing (prompt injection, jailbreak). Annual recertification of all models. Show the compliance workflow: Dev → Compliance Review → Approval Gate → Production Monitoring.
Q7: "How do you decide between building custom vs using managed AWS services?"
Answer Framework: Decision tree: (1) Is this a core differentiator? If yes, build. (2) Does the managed service meet accuracy/performance requirements? (3) Is customization needed beyond what the managed service offers? (4) What's the ops cost of self-hosting? Example: Bedrock Knowledge Bases for POC, custom RAG pipeline for production. SageMaker for training, custom endpoints on EKS for specialized inference. Always prototype with managed, migrate to custom when requirements demand it.
📜 Infrastructure as Code (Terraform / CloudFormation)