devops-interview-handbook

AWS Interview Questions

EC2 & Compute
S3 & Storage
VPC & Networking
IAM & Security
Lambda & Serverless
RDS & Databases
CloudWatch & Monitoring

EC2 & Compute

Q1: What is the difference between EC2 instance types and when would you choose one over another?

Difficulty: Mid

Answer:

EC2 instance types are optimized for different use cases:

General Purpose (t3, m5): Balanced compute, memory, and networking. Good for web servers, small databases, development environments.
Compute Optimized (c5, c6i): High-performance processors. Ideal for compute-intensive workloads like batch processing, gaming servers, HPC.
Memory Optimized (r5, x1e): High memory-to-vCPU ratio. Perfect for in-memory databases (Redis), real-time big data analytics, high-performance databases.
Storage Optimized (i3, d2): High-performance local storage (i3: low-latency random I/O on NVMe SSD; d2: high sequential throughput on dense HDD). Best for NoSQL databases, data warehousing, log processing.
Accelerated Computing (p3, g4): GPUs and FPGAs. Used for machine learning, graphics rendering, video encoding.

Real-world Context: Choose based on workload characteristics. A web application might use t3.medium, while a Redis cache cluster would use r5.large.

Follow-up: How do you determine the right instance size? (Use CloudWatch metrics, load testing, start small and scale)

Q2: Explain the difference between EBS volume types and their use cases.

Difficulty: Mid

Answer:

EBS volume types differ in performance characteristics:

gp3 (General Purpose SSD): Baseline 3,000 IOPS, up to 16,000 IOPS. Default choice for most workloads. Cost-effective with consistent performance.
gp2 (General Purpose SSD): Performance scales with size (3 IOPS per GiB, minimum 100, maximum 16,000). Good for boot volumes and small databases.
io1/io2 (Provisioned IOPS SSD): Up to 64,000 IOPS. For I/O-intensive databases (Oracle, SQL Server) requiring consistent low latency.
st1 (Throughput Optimized HDD): 500 MB/s throughput. For big data, data warehouses, log processing. Cannot be boot volume.
sc1 (Cold HDD): Lowest cost. For infrequently accessed data, archives. Cannot be boot volume.
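
The size-based gp2 baseline can be compared with the flat gp3 baseline using a few lines of arithmetic. This is a minimal sketch: the function name is hypothetical, and the 100 IOPS floor and 16,000 IOPS cap are the published gp2 limits rather than figures from the answer above.

```python
def gp2_baseline_iops(size_gib: int) -> int:
    """gp2 baseline: 3 IOPS per GiB, floored at 100 and capped at 16,000."""
    return max(100, min(3 * size_gib, 16_000))


# gp3 baseline does not depend on volume size.
GP3_BASELINE_IOPS = 3_000

for size in (20, 500, 1_000, 6_000):
    print(f"{size:>5} GiB  gp2={gp2_baseline_iops(size):>6}  gp3={GP3_BASELINE_IOPS}")
```

Under these numbers a gp2 volume needs to reach 1,000 GiB before its baseline matches what gp3 provides at any size.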

Real-world Context: Production database → io2. Web server boot volume → gp3. Log aggregation → st1.

Follow-up: How do you migrate from gp2 to gp3? (Create snapshot, restore to gp3, or modify volume type)

Q3: What is an Auto Scaling Group and how does it work?

Difficulty: Mid

Answer:

An Auto Scaling Group (ASG) automatically adjusts the number of EC2 instances based on demand or schedules.

Components:

Launch Template/Configuration: Defines what instances to launch (AMI, instance type, security groups)
Scaling Policies: Rules for scaling (target tracking, step scaling, simple scaling)
Health Checks: EC2 or ELB health checks to replace unhealthy instances
Cooldown Period: Prevents rapid scaling actions

How it works:

CloudWatch alarms trigger scaling policies
ASG launches/terminates instances based on metrics (CPU, or custom metrics such as memory published by the CloudWatch agent)
New instances are registered with load balancer
Unhealthy instances are replaced automatically

Real-world Context: Web application behind ALB. Scale from 2 to 10 instances during peak hours based on CPU utilization.

Follow-up: What’s the difference between target tracking and step scaling? (Target tracking maintains a metric at target value, step scaling uses multiple thresholds)
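
The two policy styles can be contrasted with a simplified model. This is only a sketch under stated assumptions: the proportional formula is a common approximation of target tracking, not AWS's exact algorithm, and the function names and step thresholds are hypothetical.

```python
import math


def target_tracking_capacity(current: int, metric: float, target: float,
                             min_size: int, max_size: int) -> int:
    """Scale capacity in proportion to how far the metric is from the target."""
    desired = math.ceil(current * metric / target)
    return max(min_size, min(desired, max_size))


def step_scaling_capacity(current: int, metric: float, steps: list,
                          min_size: int, max_size: int) -> int:
    """steps: (lower_bound, adjustment) pairs sorted ascending by lower_bound.

    The adjustment of the highest threshold that the metric reaches is applied.
    """
    adjustment = 0
    for lower_bound, delta in steps:
        if metric >= lower_bound:
            adjustment = delta
    return max(min_size, min(current + adjustment, max_size))


steps = [(60, 1), (80, 3)]  # hypothetical: +1 at 60% CPU, +3 at 80% CPU
print(target_tracking_capacity(4, 80.0, 50.0, 2, 10))
print(step_scaling_capacity(4, 85.0, steps, 2, 10))
```

Target tracking needs only one number (the target), while step scaling makes the operator choose each threshold and adjustment explicitly.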

Q4: Explain VPC, Subnets, Internet Gateway, and NAT Gateway.

Difficulty: Mid

Answer:

VPC (Virtual Private Cloud): Isolated network environment in AWS. You control IP ranges, subnets, routing, security.

Subnets: Logical subdivision of VPC IP range. Can be public (route to IGW) or private (route to NAT).

Internet Gateway (IGW): Allows communication between VPC and internet. One per VPC. Provides 1:1 NAT for public IPs.

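NAT Gateway: Allows instances in private subnets to reach the internet for updates/downloads while remaining unreachable from the internet. Managed service, redundant within a single AZ; for multi-AZ resilience, create one NAT Gateway per AZ and route each private subnet to the gateway in its own AZ.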

Architecture:

Public Subnet: Route table → 0.0.0.0/0 → IGW
Private Subnet: Route table → 0.0.0.0/0 → NAT Gateway
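
The routing layout above can be modelled with the standard `ipaddress` module. This is an illustrative sketch, not an AWS API call: the CIDR ranges and the `igw-example` / `nat-example` target names are assumptions.

```python
import ipaddress

# Hypothetical VPC range, carved into /24 subnets.
vpc = ipaddress.ip_network("10.0.0.0/16")
subnets = list(vpc.subnets(new_prefix=24))
public_subnet, private_subnet = subnets[0], subnets[1]

# Each route table maps a destination CIDR to a target.
route_tables = {
    "public": {"10.0.0.0/16": "local", "0.0.0.0/0": "igw-example"},
    "private": {"10.0.0.0/16": "local", "0.0.0.0/0": "nat-example"},
}


def next_hop(table: str, destination: str) -> str:
    """Pick the most specific (longest-prefix) route that contains the destination."""
    dest = ipaddress.ip_address(destination)
    matches = [ipaddress.ip_network(cidr) for cidr in route_tables[table]
               if dest in ipaddress.ip_network(cidr)]
    best = max(matches, key=lambda net: net.prefixlen)
    return route_tables[table][str(best)]


print(public_subnet, next_hop("public", "8.8.8.8"))
print(private_subnet, next_hop("private", "8.8.8.8"))
```

Traffic inside the VPC range always matches the more specific local route, so only internet-bound traffic reaches the gateway targets.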

Real-world Context: Web servers in public subnets, databases in private subnets. NAT Gateway allows DB instances to download patches.

Follow-up: What’s the difference between NAT Gateway and NAT Instance? (NAT Gateway is managed, scales automatically, more expensive. NAT Instance is EC2-based, requires management, cheaper for low traffic)

Q5: What is the difference between Security Groups and NACLs?

Difficulty: Mid

Answer:

Security Groups:

Stateful firewall (return traffic automatically allowed)
Applied at instance level
Rules: allow only (default deny all)
Evaluates all rules before deciding
Can reference other security groups

NACLs (Network ACLs):

Stateless firewall (must allow both inbound and outbound)
Applied at subnet level
Rules: allow and deny (evaluated in order, first match wins)
Can block specific IPs
Default: the default NACL allows all traffic; a custom NACL denies all traffic until rules are added

Use Cases:

Security Groups: Primary defense, instance-level protection
NACLs: Subnet-level protection, compliance requirements, blocking specific IPs

Real-world Context: Security Groups protect instances. NACLs add extra layer - block known malicious IPs at subnet level.

Follow-up: If you block port 80 in NACL but allow it in Security Group, what happens? (Inbound traffic is blocked: the NACL is evaluated at the subnet boundary before the Security Group is consulted)
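
The interaction between the two layers can be sketched as a small simulation. It is deliberately simplified: rules match on a single port only, and the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NaclRule:
    number: int   # lower numbers are evaluated first
    port: int     # simplified: one port per rule
    action: str   # "allow" or "deny"


def nacl_allows(rules, port: int) -> bool:
    """First matching rule by rule number wins; no match means implicit deny."""
    for rule in sorted(rules, key=lambda r: r.number):
        if rule.port == port:
            return rule.action == "allow"
    return False


def security_group_allows(allowed_ports, port: int) -> bool:
    """Security groups hold allow rules only; anything not listed is denied."""
    return port in allowed_ports


def inbound_permitted(nacl_rules, sg_ports, port: int) -> bool:
    """Inbound traffic must pass the subnet NACL and then the security group."""
    return nacl_allows(nacl_rules, port) and security_group_allows(sg_ports, port)


nacl = [NaclRule(100, 80, "deny"), NaclRule(200, 80, "allow"), NaclRule(110, 443, "allow")]
print(inbound_permitted(nacl, {80, 443}, 80))   # rule 100 denies before rule 200 is read
print(inbound_permitted(nacl, {80, 443}, 443))
```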

S3 & Storage

Q6: Explain S3 storage classes and when to use each.

Difficulty: Mid

Answer:

S3 Standard: 99.99% availability, 99.999999999% durability. For frequently accessed data. Low latency, high throughput.

S3 Intelligent-Tiering: Automatically moves objects between access tiers. For unpredictable access patterns. Small monitoring fee.

S3 Standard-IA (Infrequent Access): Lower cost for infrequently accessed data. 99.9% availability. Minimum 30-day storage, retrieval fee.

S3 One Zone-IA: Like Standard-IA but stored in single AZ. 99.5% availability. 20% cheaper. For non-critical, reproducible data.

S3 Glacier Instant Retrieval: Archive with millisecond retrieval. For rarely accessed data requiring immediate access. Minimum 90-day storage.

S3 Glacier Flexible Retrieval: 3 retrieval options (expedited 1-5 min, standard 3-5 hours, bulk 5-12 hours). For archives, backups.

S3 Glacier Deep Archive: Lowest cost. Standard retrieval within 12 hours (bulk within 48 hours). Minimum 180-day storage. For long-term compliance archives.

Real-world Context: Active application data → Standard. Logs older than 30 days → Standard-IA. Compliance archives → Glacier Deep Archive.

Follow-up: How does lifecycle policy work? (Automatically transitions objects between classes based on age/prefix)
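
A lifecycle rule is ultimately a small piece of configuration. The dictionary below follows the shape that boto3's `put_bucket_lifecycle_configuration` accepts for its `LifecycleConfiguration` argument; the rule ID, prefix, and day counts are assumptions, and the helper function is a hypothetical local model of the transitions rather than anything S3 exposes.

```python
lifecycle_configuration = {
    "Rules": [
        {
            "ID": "archive-logs",              # hypothetical rule name
            "Filter": {"Prefix": "logs/"},     # applies only to keys under logs/
            "Status": "Enabled",
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 365, "StorageClass": "DEEP_ARCHIVE"},
            ],
        }
    ]
}


def storage_class_for(key: str, age_days: int) -> str:
    """Return the class an object of this age would have under the rules above."""
    current = "STANDARD"
    for rule in lifecycle_configuration["Rules"]:
        if rule["Status"] != "Enabled" or not key.startswith(rule["Filter"]["Prefix"]):
            continue
        for transition in sorted(rule["Transitions"], key=lambda t: t["Days"]):
            if age_days >= transition["Days"]:
                current = transition["StorageClass"]
    return current


for age in (10, 45, 400):
    print(age, storage_class_for("logs/app.log", age))
```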

Q7: What is S3 versioning and why is it important?

Difficulty: Junior

Answer:

S3 versioning keeps multiple versions of an object with the same key. Each version has a unique version ID.

Benefits:

Protect against accidental deletion (can restore previous version)
Recover from overwrites
Maintain object history
Compliance requirements

How it works:

When versioning is enabled, each PUT creates a new version with its own version ID
A DELETE without a version ID adds a delete marker (the object appears deleted, but earlier versions remain)
Restore by deleting the delete marker or by copying a previous version over the current one
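
The version and delete-marker behaviour can be illustrated with an in-memory model. This is a sketch for reasoning about the semantics, not the S3 API: the class, its methods, and the `v1`, `v2` style IDs are all hypothetical.

```python
import itertools


class VersionedBucket:
    """Toy model: each key maps to an ordered list of (version_id, body)."""

    def __init__(self):
        self._versions = {}
        self._ids = itertools.count(1)

    def put(self, key, body):
        version_id = f"v{next(self._ids)}"
        self._versions.setdefault(key, []).append((version_id, body))
        return version_id

    def delete(self, key):
        """Delete without a version ID: append a delete marker (body is None)."""
        version_id = f"v{next(self._ids)}"
        self._versions.setdefault(key, []).append((version_id, None))
        return version_id

    def delete_version(self, key, version_id):
        """Permanently remove one specific version (or delete marker)."""
        self._versions[key] = [v for v in self._versions[key] if v[0] != version_id]

    def get(self, key, version_id=None):
        versions = self._versions.get(key, [])
        if version_id is not None:
            return next((body for vid, body in versions if vid == version_id), None)
        return versions[-1][1] if versions else None


bucket = VersionedBucket()
good = bucket.put("config.yml", "good")
bucket.put("config.yml", "bad")            # accidental overwrite
marker = bucket.delete("config.yml")       # object now appears deleted
bucket.delete_version("config.yml", marker)  # removing the marker restores "bad"
bucket.put("config.yml", bucket.get("config.yml", good))  # copy old version forward
print(bucket.get("config.yml"))
```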

Real-world Context: Developer accidentally overwrites production config file. With versioning, restore previous version immediately.

Follow-up: How do you enable versioning on existing bucket? (Enable versioning, existing objects become version with null version ID)

Q8: Explain S3 cross-region replication and its use cases.

Difficulty: Mid

Answer:

S3 Cross-Region Replication (CRR) automatically replicates objects to another region.

Requirements:

Versioning enabled on source and destination buckets
IAM permissions for replication
Source and destination buckets in different regions

Use Cases:

Disaster Recovery: Backup in another region
Compliance: Data residency requirements
Low Latency: Copy data closer to users
Data Migration: Move data between regions

How it works:

Configure replication rule (source prefix, destination bucket, IAM role)
New objects matching rule are automatically replicated
Existing objects are not replicated (unless you use S3 Batch Replication)

Real-world Context: A data-residency policy requires EU customer data to stay in EU regions. Replicate from eu-central-1 to eu-west-1 for disaster recovery without leaving the EU.

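Follow-up: What happens if you delete an object? (With current replication configurations the delete marker is not replicated by default, so the object still appears in the destination; enable delete marker replication to propagate it. Deleting a specific version ID is never replicated)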

VPC & Networking

Q9: What is a VPC Peering connection and how does it work?

Difficulty: Mid

Answer:

VPC Peering connects two VPCs using private IP addresses, as if they’re in the same network.

Characteristics:

One-to-one connection (not transitive)
Can peer VPCs in same or different accounts/regions
No single point of failure
No bandwidth bottleneck
CIDR blocks must not overlap
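
The overlap requirement is easy to check before requesting a peering connection. A minimal sketch using the standard `ipaddress` module; the function name and CIDR ranges are hypothetical.

```python
import ipaddress


def can_peer(cidr_a: str, cidr_b: str) -> bool:
    """Two VPCs can be peered only if their CIDR blocks do not overlap."""
    return not ipaddress.ip_network(cidr_a).overlaps(ipaddress.ip_network(cidr_b))


print(can_peer("10.0.0.0/16", "10.1.0.0/16"))    # disjoint ranges
print(can_peer("10.0.0.0/16", "10.0.128.0/17"))  # second range sits inside the first
```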

Setup:

Request peering connection (initiator)
Accept peering connection (accepter)
Update route tables in both VPCs
Update security groups/NACLs to allow traffic

Use Cases:

Connect VPCs in same region
Share resources between accounts
Hub-and-spoke architecture (each spoke peers with hub)

Real-world Context: Development VPC needs access to shared services VPC (databases, monitoring).

Follow-up: How do you connect 3 VPCs? (Need 3 peering connections - not transitive. Or use Transit Gateway)

Q10: What is AWS Transit Gateway and when would you use it?

Difficulty: Senior

Answer:

Transit Gateway is a network transit hub that connects VPCs, VPNs, and Direct Connect.

Benefits:

Centralized management
Transitive routing (VPC A → TGW → VPC B)
Supports up to 5,000 attachments per Transit Gateway (default quota)
Route tables for segmentation
Cross-region peering

Use Cases:

Hub-and-spoke architecture
Connecting many VPCs (simpler than VPC peering mesh)
Centralized network security inspection
Multi-account networking

Architecture:

Create Transit Gateway
Attach VPCs/VPNs
Configure route tables
Set up routing (propagate or static routes)

Real-world Context: Organization with 50 VPCs. Instead of 1,225 peering connections, use Transit Gateway with one attachment per VPC.
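
The 1,225 figure comes from the full-mesh formula n(n-1)/2. A short sketch of the arithmetic, with hypothetical function names:

```python
def full_mesh_peerings(vpc_count: int) -> int:
    """Peering is not transitive, so every pair of VPCs needs its own connection."""
    return vpc_count * (vpc_count - 1) // 2


def transit_gateway_attachments(vpc_count: int) -> int:
    """With a Transit Gateway each VPC needs a single attachment to the hub."""
    return vpc_count


for n in (3, 10, 50):
    print(n, full_mesh_peerings(n), transit_gateway_attachments(n))
```

The mesh grows quadratically while attachments grow linearly, which is why the hub model becomes attractive well before 50 VPCs.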

Follow-up: How do you implement network segmentation with Transit Gateway? (Use multiple route tables and associate each VPC attachment with the route table for its segment)

IAM & Security

Q11: Explain IAM roles vs IAM users and when to use each.

Difficulty: Mid

Answer:

IAM Users:

Long-term credentials (access keys, passwords)
For humans or applications that need long-term access
Can have MFA enabled
Best for: Developers, CI/CD systems, service accounts

IAM Roles:

Temporary credentials (assumed via STS)
No long-term passwords or access keys
Can be assumed by users, services, or other roles
Best for: EC2 instances, Lambda functions, cross-account access

Key Differences:

Roles provide temporary credentials (1 hour default, up to 12 hours)
Roles can be assumed by multiple entities
Roles support cross-account access
Users have permanent credentials (until rotated)

Best Practice: Use roles for everything possible. Only use users when absolutely necessary.

Real-world Context: EC2 instance needs S3 access → IAM Role attached to instance. Developer needs console access → IAM User with MFA.

Follow-up: How do you assume a role? (Use AWS STS AssumeRole API, or attach role to service)

Q12: What is the principle of least privilege in IAM?

Difficulty: Junior

Answer:

Principle of least privilege: Grant only the minimum permissions necessary to perform a task.

Implementation:

Start with no permissions
Add permissions only as needed
Use specific actions, not wildcards (*)
Scope to specific resources (ARNs), not all resources
Use conditions (IP, time, tags) to further restrict

Example:

❌ Bad: s3:* on *
✅ Good: s3:GetObject on arn:aws:s3:::my-bucket/data/*
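
Written out as policy documents, the two examples look like the dictionaries below. The JSON shape (`Version`, `Statement`, `Effect`, `Action`, `Resource`) is the standard IAM policy format; the `Sid` values and the `overly_broad` checker are hypothetical additions for illustration, and the checker only looks for the most obvious wildcards.

```python
import json

bad_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Sid": "EverythingOnS3", "Effect": "Allow", "Action": "s3:*", "Resource": "*"}
    ],
}

least_privilege_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ReadDataPrefixOnly",
            "Effect": "Allow",
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::my-bucket/data/*",
        }
    ],
}


def _as_list(value):
    return value if isinstance(value, list) else [value]


def overly_broad(policy: dict) -> list:
    """Return the Sids of statements that use a service-wide action or any resource."""
    findings = []
    for statement in policy["Statement"]:
        actions = _as_list(statement["Action"])
        resources = _as_list(statement["Resource"])
        wildcard_action = any(a == "*" or a.endswith(":*") for a in actions)
        if wildcard_action or "*" in resources:
            findings.append(statement.get("Sid", "<no sid>"))
    return findings


print(json.dumps(least_privilege_policy, indent=2))
print(overly_broad(bad_policy), overly_broad(least_privilege_policy))
```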

Benefits:

Reduces attack surface
Limits impact of compromised credentials
Easier to audit and understand
Compliance requirements

Real-world Context: Lambda function only needs to read from one S3 bucket. Grant s3:GetObject on that bucket only, not s3:*.

Follow-up: How do you audit IAM permissions? (Use IAM Access Analyzer, CloudTrail, or IAM policy simulator)

Q13: Explain IAM policy evaluation logic.

Difficulty: Senior

Answer:

IAM evaluates policies in this order:

Default Deny: Every request starts as denied (implicit deny)
Explicit Deny: Any explicit Deny in an applicable policy overrides everything
Explicit Allow: Grants permission only when no explicit Deny applies
Implicit Deny: If no Allow applies, the request stays denied

Evaluation Process:

Check all applicable policies (identity-based, resource-based, boundary, service control)
If any policy has explicit Deny → Deny
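If an identity-based or resource-based policy has an explicit Allow, and every applicable SCP and permissions boundary also allows the action → Allow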
Otherwise → Deny
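
The deny-overrides-allow ordering can be sketched as a small evaluator. This is a simplification under stated assumptions: it models a single set of identity-based statements only (no SCPs, permissions boundaries, or resource-based policies), uses shell-style wildcards from `fnmatch` as a stand-in for IAM pattern matching, and the function names are hypothetical.

```python
from fnmatch import fnmatchcase


def matches(statement: dict, action: str, resource: str) -> bool:
    """True when the statement's Action and Resource patterns cover the request."""
    return (fnmatchcase(action, statement["Action"])
            and fnmatchcase(resource, statement["Resource"]))


def evaluate(statements: list, action: str, resource: str) -> str:
    decision = "Deny"                      # implicit deny is the starting point
    for statement in statements:
        if not matches(statement, action, resource):
            continue
        if statement["Effect"] == "Deny":
            return "Deny"                  # explicit deny always wins
        decision = "Allow"                 # remember the allow, keep looking for denies
    return decision


statements = [
    {"Effect": "Allow", "Action": "s3:*", "Resource": "arn:aws:s3:::my-bucket/*"},
    {"Effect": "Deny", "Action": "s3:DeleteObject", "Resource": "arn:aws:s3:::my-bucket/*"},
]
print(evaluate(statements, "s3:GetObject", "arn:aws:s3:::my-bucket/data/a.csv"))
print(evaluate(statements, "s3:DeleteObject", "arn:aws:s3:::my-bucket/data/a.csv"))
print(evaluate(statements, "ec2:RunInstances", "*"))
```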

Key Points:

Explicit Deny always wins
Need at least one Allow to proceed
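Resource-based policies: within one account, an Allow in either the identity-based or the resource-based policy is enough; cross-account access needs an Allow in both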
Permissions boundaries limit maximum permissions

Real-world Context: User has Allow in user policy, but Deny in group policy → Request denied (explicit Deny wins).

Follow-up: What’s the difference between permissions boundary and policy? (Boundary sets maximum permissions, policy grants permissions within boundary)

Lambda & Serverless

Q14: What are Lambda cold starts and how do you minimize them?

Difficulty: Mid

Answer:

Cold Start: Time to initialize Lambda execution environment (download code, initialize runtime, run initialization code).

Factors Affecting Cold Starts:

Runtime (Python/Node.js faster than Java/.NET)
Package size (larger = slower)
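VPC configuration (historically added ENI setup time; since the 2019 networking change, ENIs are created when the function is configured, not on each cold start)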
Provisioned Concurrency (eliminates cold starts)

Minimization Strategies:

Optimize package size: Remove unused dependencies, use layers for common code
Choose faster runtime: Node.js/Python over Java
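Attach to a VPC only when needed: the old 10+ second ENI penalty is gone, but VPC access still requires NAT or VPC endpoints to reach other services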
Use Provisioned Concurrency: Pre-warms execution environments
Optimize initialization: Move heavy initialization outside handler
Keep functions warm: An EventBridge (formerly CloudWatch Events) scheduled rule pings the function periodically
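
Moving initialization outside the handler means the expensive work runs once per execution environment and is reused by later warm invocations. A minimal sketch of that pattern: the `_load_config` helper, the `INIT_COUNT` counter, and the event shape are hypothetical, and the sleep stands in for real work such as creating SDK clients.

```python
import time

INIT_COUNT = 0


def _load_config() -> dict:
    """Stand-in for slow startup work (SDK clients, config fetches, model loads)."""
    global INIT_COUNT
    INIT_COUNT += 1
    time.sleep(0.05)
    return {"table": "orders"}


# Module scope: executed during the cold start only, then kept in memory.
CONFIG = _load_config()


def handler(event, context):
    """Per-invocation code stays small and reuses the module-level state."""
    return {"table": CONFIG["table"], "order_id": event["order_id"]}


# Two invocations in the same environment share one initialization.
print(handler({"order_id": 1}, None))
print(handler({"order_id": 2}, None))
print("initializations:", INIT_COUNT)
```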

Real-world Context: API Gateway → Lambda. Cold start adds 2-3 seconds. Use Provisioned Concurrency for production APIs.

Follow-up: When would you use Provisioned Concurrency? (When you need consistent low latency and can predict traffic)

Q15: Explain Lambda concurrency and reserved concurrency.

Difficulty: Mid

Answer:

Concurrency: Number of execution environments running simultaneously.

Default Limits:

Account-level: 1,000 concurrent executions (can increase)
Per function: Unlimited (unless reserved concurrency set)

Reserved Concurrency:

Guarantees minimum concurrency for a function
Limits maximum concurrency for a function
Other functions cannot use reserved concurrency
Prevents one function from consuming all account concurrency

Use Cases:

Reserve concurrency: Critical function needs guaranteed capacity
Limit concurrency: Function that calls downstream API with rate limits
Cost control: Limit expensive function execution

Throttling:

When concurrency limit reached, requests throttled (429 error)
For asynchronous invocations, Lambda retries throttled events and can send those that expire to a DLQ or on-failure destination
Use reserved concurrency to prevent throttling

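Real-world Context: Lambda calls external API with 100 req/sec limit. Reserved concurrency caps simultaneous executions, not requests per second, so size it from the limit and the average invocation duration (for example, 25 for 250 ms invocations) rather than setting it to 100.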
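
Concurrency and request rate are linked by invocation duration (concurrency = rate × duration). The sketch below works through that arithmetic; it assumes each invocation makes exactly one downstream call, and the function names and the 250 ms duration are hypothetical.

```python
import math


def max_request_rate(concurrency: int, avg_duration_s: float) -> float:
    """Upper bound on invocations per second for a given concurrency cap."""
    return concurrency / avg_duration_s


def concurrency_for_rate(rate_limit_per_s: float, avg_duration_s: float) -> int:
    """Largest concurrency cap that keeps the call rate at or below the limit."""
    return max(1, math.floor(rate_limit_per_s * avg_duration_s))


# With 250 ms invocations, a cap of 100 could still produce 400 calls per second.
print(max_request_rate(100, 0.25))
# Staying under 100 calls per second needs a cap of 25.
print(concurrency_for_rate(100, 0.25))
```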

Follow-up: What happens when account concurrency limit is reached? (All functions throttled unless reserved concurrency set)

Q16: What is Lambda@Edge and what are its use cases?

Difficulty: Senior

Answer:

Lambda@Edge runs Lambda functions at CloudFront edge locations, closer to users.

Use Cases:

Request/Response Manipulation: Modify headers, redirects, A/B testing
Authentication/Authorization: Validate requests before origin
Custom Error Pages: Generate custom error responses
Geographic Personalization: Serve different content by location
Bot Detection: Block malicious requests at edge

Limitations:

Node.js and Python runtimes only
Smaller package size (1 MB for viewer, 50 MB for origin)
Shorter execution time (viewer: 5s, origin: 30s)
Limited access to AWS services

Execution Points:

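Viewer Request: After CloudFront receives the request from the viewer, before the cache check
Origin Request: Before CloudFront forwards the request to the origin (cache miss only)
Origin Response: After CloudFront receives the response from the origin, before it is cached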
Viewer Response: Before CloudFront responds to viewer

Real-world Context: Redirect mobile users to mobile site, add security headers, implement geo-blocking.

Follow-up: What’s the difference between Viewer Request and Origin Request? (Viewer Request runs on every request, before the cache check. Origin Request runs only on a cache miss, when CloudFront forwards the request to the origin)

RDS & Databases

Q17: Explain RDS Multi-AZ and Read Replicas.

Difficulty: Mid

Answer:

Multi-AZ:

Synchronous replication to standby in different AZ
Automatic failover (60-120 seconds)
High availability, not scalability
Same endpoint (DNS switches on failover)
Use for: Production databases requiring HA

Read Replicas:

Asynchronous replication (eventual consistency)
Can be in different region
Read scaling (offload read traffic)
Manual promotion to primary
Use for: Read-heavy workloads, disaster recovery, cross-region replication

Key Differences:

Multi-AZ: HA, synchronous, automatic failover
Read Replicas: Scalability, asynchronous, manual promotion

Can Combine: Multi-AZ primary with Read Replicas for both HA and read scaling.

Real-world Context: E-commerce site. Multi-AZ for primary DB (HA), Read Replicas for reporting/analytics (read scaling).

Follow-up: What’s the RTO/RPO for Multi-AZ? (RTO: 60-120s, RPO: 0 - no data loss)

Q18: What is RDS Proxy and why would you use it?

Difficulty: Senior

Answer:

RDS Proxy is a fully managed database connection pooler and proxy.

Benefits:

Connection Pooling: Reuses connections, reduces database load
Failover Handling: Handles failovers without application changes
IAM Authentication: Use IAM instead of passwords
Credential Management: Stores database credentials in Secrets Manager instead of application code
Enhanced Monitoring: CloudWatch metrics for connections

Use Cases:

Serverless applications (Lambda) that create many connections
Applications with many concurrent connections
Need IAM authentication
Reduce connection overhead

How it works:

Application connects to RDS Proxy endpoint
Proxy manages connection pool to database
Reuses connections across Lambda invocations
Handles failover transparently
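
The pooling idea behind these steps can be shown with a toy in-process pool. This is only a sketch of the concept, not how RDS Proxy is implemented: the `ConnectionPool` class and the fake `connect` function are hypothetical.

```python
import queue


class ConnectionPool:
    """Open a fixed number of connections once and lend them out per request."""

    def __init__(self, size: int, connect):
        self._idle = queue.LifoQueue()
        for _ in range(size):
            self._idle.put(connect())

    def run(self, work):
        conn = self._idle.get()        # borrow an idle connection
        try:
            return work(conn)
        finally:
            self._idle.put(conn)       # always hand it back for reuse


opened = []


def connect():
    """Stand-in for an expensive database connection handshake."""
    opened.append(object())
    return len(opened)


pool = ConnectionPool(size=2, connect=connect)
results = [pool.run(lambda conn: f"query via connection {conn}") for _ in range(100)]
print(len(results), "queries served by", len(opened), "connections")
```

A hundred requests share two connections here; without pooling, each request would have paid for its own handshake.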

Real-world Context: Lambda functions connecting to RDS. Without proxy, each invocation creates new connection. With proxy, connections reused.

Follow-up: What’s the difference between RDS Proxy and connection pooling in application? (RDS Proxy is managed, shared across applications, handles failover)

CloudWatch & Monitoring

Q19: Explain CloudWatch Logs, Metrics, and Alarms.

Difficulty: Mid

Answer:

CloudWatch Logs:

Centralized log storage
Log groups (application) and log streams (instance)
Retention: 1 day to never
Log Insights for querying
Export to S3, stream to OpenSearch Service (formerly Elasticsearch Service)

CloudWatch Metrics:

Time-series data points
Namespace (AWS/EC2, Custom/MyApp)
Dimensions (InstanceId, AutoScalingGroupName)
Standard resolution: 1 minute, High resolution: 1 second
Custom metrics via PutMetricData API

CloudWatch Alarms:

Monitor metrics and trigger actions
States: OK, ALARM, INSUFFICIENT_DATA
Actions: SNS, Auto Scaling, EC2 actions, Lambda
Evaluation periods, datapoints to alarm
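
The "datapoints to alarm" setting is an M-out-of-N rule over the most recent evaluation periods. A simplified sketch of that rule: it ignores CloudWatch's missing-data handling, assumes a greater-than comparison, and the function name and sample values are hypothetical.

```python
def alarm_state(datapoints: list, threshold: float,
                evaluation_periods: int, datapoints_to_alarm: int) -> str:
    """ALARM when at least M of the last N datapoints exceed the threshold."""
    window = datapoints[-evaluation_periods:]
    if len(window) < evaluation_periods:
        return "INSUFFICIENT_DATA"
    breaching = sum(1 for value in window if value > threshold)
    return "ALARM" if breaching >= datapoints_to_alarm else "OK"


cpu = [50, 85, 90, 70, 88]   # hypothetical 1-minute CPU averages
print(alarm_state(cpu, threshold=80, evaluation_periods=5, datapoints_to_alarm=3))
```

Requiring 3 of 5 rather than 1 of 1 keeps a single spike from paging anyone while still catching sustained load.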

Real-world Context: Monitor CPU utilization → CloudWatch Metric. Alert when > 80% → CloudWatch Alarm → SNS → Email. Auto Scale based on alarm → ASG.

Follow-up: What’s the difference between CloudWatch and CloudTrail? (CloudWatch: metrics/logs, CloudTrail: API calls/audit)

Q20: What is CloudWatch Logs Insights and how do you use it?

Difficulty: Mid

Answer:

CloudWatch Logs Insights is an interactive query feature, with its own query language, for searching and analyzing log data.

Query Syntax:

fields @timestamp, @message
| filter @message like /ERROR/
| stats count() by bin(5m)
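
To make the pipeline concrete, the same filter-then-bin aggregation can be reproduced locally in Python. This is an equivalent computation for illustration, not a call to the Logs Insights API; the sample events and the `bin_5m` helper are hypothetical.

```python
from collections import Counter
from datetime import datetime

# Hypothetical log events: (timestamp, message)
events = [
    ("2024-01-01T10:01:10+00:00", "ERROR db timeout"),
    ("2024-01-01T10:03:59+00:00", "INFO ok"),
    ("2024-01-01T10:04:30+00:00", "ERROR db timeout"),
    ("2024-01-01T10:07:05+00:00", "ERROR cache miss storm"),
]


def bin_5m(timestamp: str) -> str:
    """Floor a timestamp to the start of its 5-minute bucket, like bin(5m)."""
    t = datetime.fromisoformat(timestamp)
    floored = t.replace(minute=t.minute - t.minute % 5, second=0, microsecond=0)
    return floored.strftime("%H:%M")


# filter @message like /ERROR/ | stats count() by bin(5m)
error_counts = Counter(bin_5m(ts) for ts, message in events if "ERROR" in message)
for bucket, count in sorted(error_counts.items()):
    print(bucket, count)
```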

Common Use Cases:

Search for errors: filter @message like /ERROR/
Parse text logs: parse @message "user=*," as user (fields in JSON logs are discovered automatically)
Aggregate metrics: stats count() by statusCode
Time-based analysis: stats count() by bin(5m)

Fields:

@timestamp: Log timestamp
@message: Log message
@logStream: Log stream name
Custom fields from parsed logs

Real-world Context: Application logs contain JSON. Query: fields @timestamp, level, error | filter level = "ERROR" | stats count() by error.

Follow-up: How do you create a dashboard from Logs Insights query? (Save query, add to dashboard, or export to CloudWatch Metrics)

Summary

These questions cover fundamental AWS concepts across compute, storage, networking, security, serverless, databases, and monitoring. Practice explaining these concepts clearly and relate them to real-world scenarios you’ve encountered or designed.

Next Steps:

Practice drawing architecture diagrams
Work through AWS hands-on labs
Review AWS Well-Architected Framework
Study for AWS certifications (Solutions Architect, DevOps Engineer)
