EC2 — Elastic Compute Cloud
What it is:Virtual machines (instances) running in AWS data centers
Purchasing:On-Demand, Reserved (1/3yr), Spot, Savings Plans, Dedicated Host
Billing:Per second with a 60-second minimum (Amazon Linux, Ubuntu, Windows, RHEL) or per hour (some other commercial Linux distributions, e.g. SUSE)
Families:t/m (general), c (compute), r/x (memory), p/g (GPU), i/d (storage)
Key feature:User Data scripts run on first boot to auto-configure instances
AMI:Amazon Machine Image — template (OS + software) used to launch instances
S3 — Simple Storage Service
What it is:Object storage for files, images, backups, logs, static websites
Durability:99.999999999% (eleven nines) — replicated across ≥ 3 AZs
Max object size:5 TB per object (multipart upload required above 5 GB)
Storage classes:Standard, IA, One Zone-IA, Glacier Instant/Flexible/Deep Archive, Intelligent-Tiering
Not for:Block storage or OS drives — use EBS for that
Buckets:Globally unique name; data stored in a specific region
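To make the eleven-nines durability figure concrete, the arithmetic below estimates expected object loss per year. It is a rough sketch that assumes the figure can be read as an independent per-object annual survival probability; the 10-million-object bucket is a made-up example.

```python
from fractions import Fraction

# Durability from the text: 99.999999999% (eleven nines).
# Assumption: treated here as a per-object, per-year survival probability.
DURABILITY = Fraction(99_999_999_999, 100_000_000_000)


def expected_annual_losses(object_count: int) -> Fraction:
    """Expected number of objects lost per year under the independence assumption."""
    return object_count * (1 - DURABILITY)


# Hypothetical bucket holding 10 million objects.
losses = expected_annual_losses(10_000_000)
years_per_lost_object = 1 / losses

print(f"Expected objects lost per year: {float(losses)}")
print(f"Roughly one object lost every {int(years_per_lost_object):,} years")
```

Exact fractions are used instead of floats because subtracting two numbers this close to 1 loses precision in floating point.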
IAM — Identity & Access Management
What it is:Controls who can access which AWS resources and how
Entities:Users (people/apps), Groups (collections of users), Roles (temporary credentials), Policies (JSON permission docs)
Root account:Never use for daily tasks; lock with MFA immediately
Principle:Least privilege — grant only the permissions required
Roles:Used by AWS services (e.g., EC2 to access S3); avoids hardcoding credentials
Global:IAM is not region-specific — applies across all regions
VPC — Virtual Private Cloud
What it is:Your own logically isolated network within AWS
Subnets:Public (has route to IGW) vs Private (no direct internet access)
IGW:Internet Gateway — attaches to VPC, enables internet access for public subnets
NAT Gateway:Lets private subnet instances reach internet outbound; blocks inbound
Default VPC:Each AWS account gets one default VPC per region (CIDR: 172.31.0.0/16)
CIDR:IP range of the VPC, e.g. 10.0.0.0/16 (65,536 IPs)
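The CIDR arithmetic above can be checked with Python's standard `ipaddress` module. This is a small sketch: the /24 subnet layout is a hypothetical choice, and the five reserved addresses per subnet are an AWS rule not stated in the text above.

```python
import ipaddress

# VPC CIDR from the text.
vpc = ipaddress.ip_network("10.0.0.0/16")
print(vpc.num_addresses)  # total addresses in the /16 block

# Hypothetical layout: carve the VPC into /24 subnets.
subnets = list(vpc.subnets(new_prefix=24))
public_subnet, private_subnet = subnets[0], subnets[1]

# Assumption not in the text: AWS reserves 5 addresses in every subnet
# (network address, VPC router, DNS, future use, broadcast).
AWS_RESERVED_PER_SUBNET = 5
usable_per_subnet = public_subnet.num_addresses - AWS_RESERVED_PER_SUBNET

print(f"{len(subnets)} subnets, first two: {public_subnet}, {private_subnet}")
print(f"Usable addresses per /24 subnet: {usable_per_subnet}")

# The default VPC range from the text is also a private (RFC 1918) block.
default_vpc = ipaddress.ip_network("172.31.0.0/16")
print(default_vpc.is_private)
```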
RDS — Relational Database Service
What it is:Fully managed SQL databases — AWS handles patching, backups, HA
Engines:MySQL, PostgreSQL, MariaDB, Oracle, SQL Server, Aurora
Multi-AZ:Synchronous standby replica in another AZ; auto failover (HA, not read scaling)
Read Replicas:Asynchronous; for read scaling; can be cross-region; not for failover
Backups:Automated backups 1–35 day retention + manual snapshots
Not for:NoSQL or key-value workloads — use DynamoDB instead
Lambda
What it is:Run code without provisioning or managing servers (serverless/FaaS)
Triggers:S3 events, API Gateway, SQS, SNS, DynamoDB Streams, CloudWatch, EventBridge
Pricing:Pay per request ($0.20/1M requests) + duration rounded to 1ms
Limits:Max 15 min timeout; 128 MB – 10 GB memory; 512 MB–10 GB ephemeral /tmp
Concurrency:Default 1,000 concurrent executions per region (soft limit; can increase)
Cold start:First invocation latency; mitigate with Provisioned Concurrency
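The pricing model above (requests plus duration) can be sketched as a quick estimate. The request price comes from the text; the per-GB-second duration price is an assumption that is not in the text and varies by region and architecture, so check current pricing before relying on it. The free tier is ignored.

```python
REQUEST_PRICE_PER_MILLION = 0.20    # from the text
PRICE_PER_GB_SECOND = 0.0000166667  # assumption: illustrative rate, not from the text


def monthly_lambda_cost(invocations: int, avg_duration_ms: float, memory_mb: int) -> float:
    """Estimate monthly cost as request charges plus GB-seconds of duration."""
    request_cost = invocations / 1_000_000 * REQUEST_PRICE_PER_MILLION
    gb_seconds = invocations * (avg_duration_ms / 1000) * (memory_mb / 1024)
    return request_cost + gb_seconds * PRICE_PER_GB_SECOND


# Hypothetical workload: 5M invocations/month, 120 ms average, 512 MB memory.
example_cost = monthly_lambda_cost(
    invocations=5_000_000, avg_duration_ms=120, memory_mb=512
)
print(f"Estimated monthly cost: ${example_cost:.2f}")
```

Because duration is billed in GB-seconds, doubling memory doubles the duration charge unless the extra CPU shortens execution time.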
Interview tip: Know the compute model differences — EC2 (IaaS, you manage OS), Elastic Beanstalk (PaaS, AWS manages infra), Lambda (FaaS, serverless), ECS/EKS (containers). They'll ask "which would you use for X?"
Type | Savings | Commitment | Best for
On-Demand | None | None | Short-term, unpredictable workloads; testing
Reserved Instances (Standard) | Up to 72% | 1 or 3 years | Steady-state, predictable usage (specific instance type/region)
Reserved Instances (Convertible) | Up to 54% | 1 or 3 years | Steady-state but need flexibility to change instance type
Savings Plans (Compute) | Up to 66% | 1 or 3 years | Flexible; applies to EC2, Lambda, Fargate across any region/family
Savings Plans (EC2 Instance) | Up to 72% | 1 or 3 years | Like Standard RI but more flexible (any size/OS in a family/region)
Spot Instances | Up to 90% | None (interruptible) | Fault-tolerant batch jobs, CI/CD, big data, stateless web
Dedicated Instances | On-Demand pricing | n/a | Single-tenant hardware; may share hardware with other instances in your own account
Dedicated Hosts | On-Demand or Reserved | n/a | Compliance, BYOL (Bring Your Own License) requirements
Instance Types
t, m:General purpose; balanced CPU/memory/network; the default choice
c:Compute optimized — high CPU-to-memory ratio; gaming, HPC, batch
r, x, z:Memory optimized — high RAM; in-memory DBs, real-time big data
p, g, inf, trn:Accelerated computing — GPU, ML training/inference
i, d, h:Storage optimized — high sequential I/O; NVMe SSD, Hadoop
Naming:e.g. m5.xlarge — family(m) + generation(5) + size(xlarge)
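The naming scheme and family table above can be sketched as a small parser. The regular expression is an illustrative assumption rather than an official grammar; the optional attribute suffix (for example the `g` in `c6g`) is not described in the text above, and unusual names are simply rejected.

```python
import re
from typing import NamedTuple


class InstanceType(NamedTuple):
    family: str
    generation: int
    attributes: str  # optional suffix letters; not covered in the text above
    size: str


# family letters + generation digits + optional attribute letters + "." + size
PATTERN = re.compile(r"^([a-z]+?)(\d+)([a-z-]*)\.([a-z0-9-]+)$")

# Family-to-category mapping taken from the list above.
CATEGORIES = {
    "General purpose": {"t", "m"},
    "Compute optimized": {"c"},
    "Memory optimized": {"r", "x", "z"},
    "Accelerated computing": {"p", "g", "inf", "trn"},
    "Storage optimized": {"i", "d", "h"},
}


def parse_instance_type(name: str) -> InstanceType:
    """Split a name such as 'm5.xlarge' into family, generation, attributes, size."""
    match = PATTERN.match(name)
    if match is None:
        raise ValueError(f"Unrecognized instance type: {name!r}")
    family, generation, attributes, size = match.groups()
    return InstanceType(family, int(generation), attributes, size)


def category_of(family: str) -> str:
    """Look up the workload category for a family prefix."""
    for category, families in CATEGORIES.items():
        if family in families:
            return category
    return "Unknown"


parsed = parse_instance_type("m5.xlarge")
print(parsed, category_of(parsed.family))
```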
Other Compute Services
Elastic Beanstalk:PaaS — deploy code, AWS handles load balancing/scaling/patching
ECS:Elastic Container Service — managed Docker orchestration (AWS proprietary)
EKS:Elastic Kubernetes Service — managed Kubernetes control plane
Fargate:Serverless containers — no EC2 instances to manage; works with ECS & EKS
Lightsail:Simple VPS — fixed monthly price; good for small projects/beginners
AWS Batch:Fully managed batch computing at any scale using EC2/Spot/Fargate
Outposts:AWS hardware in your on-prem data center for hybrid cloud
Storage type | Persistence | Scope | Use case
Instance Store | Ephemeral (lost on stop/terminate) | Local to host | Temp buffer, cache; highest I/O performance
EBS gp3 (SSD) | Persistent | One AZ, one instance at a time* | OS volumes, databases, general workloads
EBS io2 Block Express | Persistent | One AZ | Latency-sensitive databases (SAP HANA, Oracle)
EBS st1 / sc1 (HDD) | Persistent | One AZ | Throughput-heavy, sequential (log processing, cold data)
EFS (NFS) | Persistent | Multi-AZ, multi-instance | Shared file system across many EC2 instances
S3 | Persistent | Regional, globally accessible | Object storage: backups, media, static content
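Key distinction: An EBS volume is locked to one AZ and normally attaches to one instance at a time (*the exception is EBS Multi-Attach, which lets an io1/io2 volume attach to several instances in the same AZ, with limitations). EFS spans multiple AZs automatically. S3 is not mounted like a drive; it is accessed via API/URL.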
Class | Retrieval speed | Min storage duration | Best for
S3 Standard | Instant (ms) | None | Frequently accessed data; web assets, active content
S3 Intelligent-Tiering | Instant (ms) | None | Unknown or changing access patterns; auto-optimizes cost
S3 Standard-IA | Instant (ms, retrieval fee applies) | 30 days | Infrequent access; disaster recovery backups
S3 One Zone-IA | Instant (ms, retrieval fee applies) | 30 days | Infrequent, reproducible data; only one AZ (risk of loss)
S3 Glacier Instant Retrieval | Milliseconds | 90 days | Archived data accessed ~once a quarter
S3 Glacier Flexible Retrieval | Minutes to 12 hours | 90 days | Long-term backup archives with occasional retrieval
S3 Glacier Deep Archive | 12–48 hours | 180 days | 7–10 year regulatory retention; lowest cost storage in AWS
Access & Security
Bucket policies:JSON resource-based policies — grant/deny access to bucket/objects
Block Public Access:Account or bucket level setting — always enable unless you have a specific reason not to
Server-side encryption:SSE-S3 (AWS-managed keys), SSE-KMS (your KMS key), SSE-C (customer-provided key)
Presigned URLs:Time-limited, temporary access URLs for private objects (e.g. download link that expires)
ACLs:Legacy per-object access control; AWS recommends bucket policies instead
Functionality
Versioning:Keep all versions of every object; protects against accidental deletion/overwrites
Lifecycle rules:Automatically transition objects to cheaper classes or expire them by age
Replication:CRR (cross-region replication) for compliance/latency; SRR (same-region) for log aggregation
Static website hosting:Serve HTML/CSS/JS files directly from S3 with a public endpoint
Event notifications:Trigger Lambda, SQS, or SNS on object PUT/DELETE events
S3 Select:Query CSV/JSON/Parquet data in-place with SQL — no download needed
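A lifecycle rule like the one described above can be sketched as the configuration document that boto3's `put_bucket_lifecycle_configuration` accepts. The rule ID, prefix, and day counts are hypothetical; the snippet only builds and sanity-checks the document and does not call AWS.

```python
import json

# Hypothetical rule: move logs to cheaper classes over time, then expire them.
lifecycle_configuration = {
    "Rules": [
        {
            "ID": "archive-then-expire-logs",  # hypothetical rule name
            "Filter": {"Prefix": "logs/"},      # hypothetical key prefix
            "Status": "Enabled",
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 90, "StorageClass": "GLACIER"},
                {"Days": 365, "StorageClass": "DEEP_ARCHIVE"},
            ],
            "Expiration": {"Days": 2555},  # roughly 7 years
        }
    ]
}


def transitions_are_ordered(rule: dict) -> bool:
    """Check that transitions move forward in time and happen before expiry."""
    days = [transition["Days"] for transition in rule["Transitions"]]
    strictly_increasing = all(a < b for a, b in zip(days, days[1:]))
    return strictly_increasing and days[-1] < rule["Expiration"]["Days"]


rule = lifecycle_configuration["Rules"][0]
print(transitions_are_ordered(rule))
print(json.dumps(lifecycle_configuration, indent=2))
```

With credentials configured, this dictionary would be passed as the `LifecycleConfiguration` argument alongside a `Bucket` name.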
Block & File Storage
EBS volumes:gp2/gp3 (SSD, general), io1/io2 (provisioned IOPS), st1/sc1 (HDD)
EBS Snapshots:Point-in-time backups stored in S3; incremental; can copy cross-region
EFS:Elastic NFS file system; auto-scales; Linux workloads; two modes: Standard and One Zone
FSx for Windows:Managed Windows File Server (SMB protocol, Active Directory integration)
FSx for Lustre:High-performance parallel file system for HPC, ML, video processing
Data Transfer & Hybrid
Storage Gateway:Bridge on-prem to AWS — File Gateway (S3), Volume Gateway (EBS), Tape Gateway (Glacier)
AWS DataSync:Automate data transfer between on-prem/S3/EFS/FSx; up to 10x faster than manual
Snowcone:Smallest Snow device — 8 TB usable; portable, rugged
Snowball Edge:80 TB usable; Storage Optimized or Compute Optimized variant; edge processing
Snowmobile:Exabyte-scale — 100 PB per truck; for massive data center migrations
AWS Backup:Centralized, policy-based backup across EC2, RDS, DynamoDB, EFS, S3, etc.
Service | Type | Best for | Key fact
RDS | Relational (SQL) | OLTP; traditional apps needing ACID compliance | Managed; supports 6 engines
Aurora | Relational (MySQL/PostgreSQL compatible) | High-throughput SQL; production workloads | Up to 5× faster than MySQL; auto-healing, 6-copy replication across 3 AZs
Aurora Serverless v2 | Relational | Variable/unpredictable workloads | Scales in fractions of ACUs; pay per use
DynamoDB | NoSQL (key-value + document) | Serverless, single-digit ms latency, massive scale | Fully managed; auto-scales; no server to manage
ElastiCache for Redis | In-memory cache | Session management, leaderboards, pub/sub, caching | Sub-millisecond; supports rich data types
ElastiCache for Memcached | In-memory cache | Simple distributed caching (key-value only) | Multi-threaded; simpler than Redis
Redshift | Data warehouse (columnar SQL) | OLAP: analytics, BI, large-scale reporting | Petabyte-scale; Redshift Spectrum queries S3 directly
DocumentDB | Document (MongoDB-compatible) | JSON document storage; content catalogs, user profiles | Not actual MongoDB; AWS-built compatible engine
Neptune | Graph | Social networks, fraud detection, knowledge graphs | Supports Gremlin and SPARQL
Timestream | Time-series | IoT sensor data, telemetry, operational metrics | Auto-scales; faster and cheaper than relational for time-series
QLDB | Ledger (immutable) | Financial audit trails, supply chain provenance | Cryptographically verifiable transaction log
Keyspaces | Wide column (Cassandra-compatible) | Cassandra workloads without managing infrastructure | Serverless; pay per use
RDS — Multi-AZ vs Read Replicas
Multi-AZ:Synchronous standby; auto-failover ~60s; for HA not performance
Read Replicas:Asynchronous copy; scale read traffic; up to 15 per DB for MySQL, MariaDB, and PostgreSQL (5 for Oracle and SQL Server)
Cross-region RR:Read Replicas can be in different regions (disaster recovery + low latency reads)
Promote RR:Can promote a Read Replica to a standalone DB (breaks replication)
Aurora Extras
Storage:Auto-grows in 10 GB increments up to 128 TB; 6 copies across 3 AZs
Read Replicas:Up to 15 Aurora Read Replicas with sub-10ms replication lag
Failover:Automatic, faster than RDS Multi-AZ (~30s)
Global Database:Primary + up to 5 read-only regions; <1s replication lag
DynamoDB Key Concepts
Primary key:Partition key alone, or partition key + sort key (composite)
Capacity modes:Provisioned (RCU/WCU) or On-Demand (pay per request)
GSI/LSI:Global Secondary Index (different partition key and sort key; queries span all partitions); Local Secondary Index (same partition key, alternate sort key)
DynamoDB Streams:Ordered log of item-level changes — trigger Lambda in real time
DAX:DynamoDB Accelerator — in-memory cache; microsecond reads
TTL:Automatically delete items by timestamp attribute (no RCU charge)
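For the Provisioned capacity mode above, a short sketch of how RCU/WCU requirements are usually estimated. The unit definitions are not in the text above: one RCU covers one strongly consistent read per second of an item up to 4 KB (eventually consistent reads cost half), and one WCU covers one write per second of an item up to 1 KB. The workload numbers are hypothetical.

```python
import math

READ_UNIT_KB = 4   # one RCU reads up to 4 KB (strongly consistent)
WRITE_UNIT_KB = 1  # one WCU writes up to 1 KB


def read_capacity_units(reads_per_second: int, item_size_kb: float,
                        strongly_consistent: bool = True) -> int:
    """RCUs needed; eventually consistent reads cost half as much."""
    units_per_read = math.ceil(item_size_kb / READ_UNIT_KB)
    total = reads_per_second * units_per_read
    return total if strongly_consistent else math.ceil(total / 2)


def write_capacity_units(writes_per_second: int, item_size_kb: float) -> int:
    """WCUs needed; item size is rounded up to the next whole KB."""
    return writes_per_second * math.ceil(item_size_kb / WRITE_UNIT_KB)


# Hypothetical workload: 100 reads/s of 6 KB items, 50 writes/s of 2.5 KB items.
print(read_capacity_units(100, 6))
print(read_capacity_units(100, 6, strongly_consistent=False))
print(write_capacity_units(50, 2.5))
```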
Remember: RDS = managed SQL (you choose engine). Aurora = AWS-optimized MySQL/PostgreSQL (faster, better HA). DynamoDB = NoSQL, serverless, virtually unlimited horizontal scale. ElastiCache sits in front of any DB to cache hot reads.
VPC Building Blocks
Subnet:Divides VPC CIDR into smaller ranges; tied to one AZ; public (route to IGW) or private
Route Table:Rules controlling where traffic is directed; each subnet associates with one route table
Internet Gateway (IGW):Horizontally scaled, HA gateway; attaches to VPC for internet access
NAT Gateway:Managed service in public subnet; lets private instances initiate internet connections (outbound only)
Elastic IP:Static public IPv4 address; associated with an instance or NAT Gateway
VPC Endpoints:Private connection to AWS services (S3, DynamoDB) without internet; Gateway or Interface type
VPC Connectivity
VPC Peering:1-to-1 private connection between VPCs (same or different account/region); not transitive
Transit Gateway:Hub-and-spoke router; connect thousands of VPCs + on-prem; IS transitive
VPN (Site-to-Site):Encrypted IPSec tunnel from on-prem to AWS over public internet
Client VPN:OpenVPN-based; individual users connect to VPC securely
Direct Connect:Dedicated private fiber link to AWS; consistent bandwidth; NOT over internet; 1 Gbps or 10 Gbps
PrivateLink:Expose your service to other VPCs privately via Interface VPC Endpoints
Feature | Security Groups | Network ACLs (NACLs)
Level | Instance level (ENI) | Subnet level
State | Stateful (return traffic automatically allowed) | Stateless (must explicitly allow inbound AND outbound)
Rules | Allow rules only; no deny rules | Allow AND deny rules; evaluated in order by rule number
Default behavior | Deny all inbound, allow all outbound | Default NACL allows all; custom NACL denies all until rules added
Rule evaluation | All rules evaluated together | Rules evaluated lowest number first; stops at first match
Association | Multiple SGs per instance | One NACL per subnet
Elastic Load Balancer (ELB)
ALB:HTTP/HTTPS Layer 7; path-based & host-based routing; ideal for microservices
NLB:TCP/UDP Layer 4; ultra-low latency; handles millions of requests/sec; static IP
GWLB:Layer 3; distributes traffic to 3rd-party virtual appliances (firewalls, IDS/IPS)
CLB:Classic LB (legacy) — Layer 4 & 7; avoid for new architectures
CloudFront (CDN)
What:Content Delivery Network — caches content at 400+ global Edge Locations
Origins:S3, ALB, EC2, or any custom HTTP endpoint
Cache:Reduce latency + offload origin traffic; configurable TTL per behavior
Security:Integrates with WAF, Shield; can enforce HTTPS via the viewer protocol policy; OAC/OAI to restrict direct S3 access
Route 53
What:Managed authoritative DNS; domain registration; health checks
Simple:One record → one value; no health checks
Weighted:Split traffic by percentage — A/B testing, canary deployments
Latency-based:Route to region with lowest network latency
Failover:Active-passive; switch to secondary if health check fails
Geolocation/Geoproximity:Route based on user's geographic location
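Weighted routing, as listed above, sends each record a share of traffic equal to its weight divided by the sum of all weights. The simulation below is a local sketch of that idea using hypothetical endpoint names; it does not query Route 53.

```python
import random
from collections import Counter

# Hypothetical weighted records for a canary deployment: 90% stable, 10% canary.
records = {"stable.example.com": 90, "canary.example.com": 10}


def expected_share(records: dict, name: str) -> float:
    """Share of traffic a record should receive: weight / sum of weights."""
    return records[name] / sum(records.values())


def resolve(records: dict, rng: random.Random) -> str:
    """Pick one endpoint at random, proportionally to its weight."""
    names = list(records)
    weights = [records[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


rng = random.Random(42)  # fixed seed so the simulation is repeatable
lookups = 10_000
counts = Counter(resolve(records, rng) for _ in range(lookups))
canary_share = counts["canary.example.com"] / lookups

print(f"Expected canary share: {expected_share(records, 'canary.example.com'):.2f}")
print(f"Simulated canary share: {canary_share:.3f}")
```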
Stateless vs Stateful explained: Security Groups are stateful — if you allow port 443 inbound, the response traffic on ephemeral ports is automatically allowed back out. NACLs are stateless — you must explicitly create both the inbound allow rule AND the outbound allow rule for the same connection.
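The stateful/stateless difference can be sketched with a toy model. Everything here is a simplification for illustration: rules match on port only, and the ephemeral port range 1024–65535 is a commonly used assumption rather than a value from the text above.

```python
from dataclasses import dataclass


@dataclass
class NaclRule:
    number: int
    ports: range
    action: str  # "allow" or "deny"


def nacl_decision(rules: list, port: int) -> str:
    """Evaluate lowest rule number first; stop at the first match; else deny."""
    for rule in sorted(rules, key=lambda r: r.number):
        if port in rule.ports:
            return rule.action
    return "deny"


def nacl_permits(inbound: list, outbound: list, dest_port: int, client_port: int) -> bool:
    """Stateless: the request AND the response must each match an allow rule."""
    request_ok = nacl_decision(inbound, dest_port) == "allow"
    response_ok = nacl_decision(outbound, client_port) == "allow"
    return request_ok and response_ok


def security_group_permits(inbound_ports: set, dest_port: int) -> bool:
    """Stateful: allowing the inbound request implicitly allows the response."""
    return dest_port in inbound_ports


inbound = [NaclRule(100, range(443, 444), "allow")]
ephemeral_out = [NaclRule(100, range(1024, 65536), "allow")]

print(security_group_permits({443}, 443))             # inbound rule alone is enough
print(nacl_permits(inbound, [], 443, 50000))           # no outbound rule: blocked
print(nacl_permits(inbound, ephemeral_out, 443, 50000))
```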
AWS RESPONSIBILITY — "Security OF the Cloud"
Physical data centers & buildings  ·  Power & cooling & network hardware  ·  Hypervisor & host OS  ·  Managed service software (e.g. RDS engine patching, S3 hardware)  ·  Global fiber network & edge infrastructure
CUSTOMER RESPONSIBILITY — "Security IN the Cloud"
Guest OS patching on EC2  ·  Application code security  ·  IAM users, roles & permissions  ·  Data encryption (at rest and in transit)  ·  Security Group & NACL configuration  ·  Customer data & classification  ·  Network configuration (VPC, subnets, routing)
The line moves with managed services: For RDS, AWS patches the DB engine (their responsibility), but you configure Security Groups, IAM access, and encryption (your responsibility). For Lambda, AWS manages everything except your code and IAM permissions.
IAM Entities
Root user:Created with AWS account; has full access; lock away with MFA; never use for daily tasks
IAM User:Long-term credentials (username/password + access keys); represents a person or application
IAM Group:Logical collection of IAM Users; attach policies to group, not individual users
IAM Role:Temporary credentials; assumed by services, users, or cross-account principals; no long-term keys
IAM Policy:JSON document with Effect (Allow/Deny), Action, Resource, Condition fields
Explicit Deny:Always overrides any Allow, no matter which policy grants the Allow or how specific it is
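The policy fields and the deny-overrides-allow rule can be illustrated with a toy evaluator. This is a deliberately simplified sketch, not IAM's real evaluation logic: it handles a single policy, string-valued `Action` and `Resource` with `*` wildcards, and ignores `Condition`, SCPs, and permission boundaries. The bucket name is hypothetical.

```python
from fnmatch import fnmatchcase

# Hypothetical policy: allow all S3 actions on one bucket, but never deletes.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": "s3:*",
         "Resource": "arn:aws:s3:::example-bucket/*"},
        {"Effect": "Deny", "Action": "s3:DeleteObject",
         "Resource": "arn:aws:s3:::example-bucket/*"},
    ],
}


def evaluate(policy: dict, action: str, resource: str) -> str:
    """Return 'ExplicitDeny', 'Allow', or 'ImplicitDeny' for a request."""
    decision = "ImplicitDeny"  # least privilege: nothing is allowed by default
    for statement in policy["Statement"]:
        matches = (fnmatchcase(action, statement["Action"])
                   and fnmatchcase(resource, statement["Resource"]))
        if not matches:
            continue
        if statement["Effect"] == "Deny":
            return "ExplicitDeny"  # an explicit deny always wins
        decision = "Allow"
    return decision


obj = "arn:aws:s3:::example-bucket/report.csv"
print(evaluate(policy, "s3:GetObject", obj))
print(evaluate(policy, "s3:DeleteObject", obj))
print(evaluate(policy, "ec2:RunInstances", "arn:aws:ec2:us-east-1:123456789012:instance/*"))
```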
IAM Best Practices
Enable MFA:On root account and all privileged/admin users; virtual, hardware, or U2F key
Least privilege:Start with no permissions; grant only what is required for the task
No root for daily use:Create admin IAM user for day-to-day administration
Rotate access keys:Regularly rotate programmatic access keys; delete unused ones
Use Roles, not keys:Attach IAM Role to EC2 instead of embedding access keys in code
SCP (Organizations):Service Control Policies — org-wide guardrails; can restrict what even admins can do
Service | What it does | Key detail
KMS | Create, manage, and control encryption keys (CMKs) | Integrated with most AWS services; audit usage via CloudTrail
CloudHSM | Dedicated hardware security module in your VPC | FIPS 140-2 Level 3; you manage keys; KMS can use CloudHSM as backing store
Secrets Manager | Store, rotate, and retrieve secrets (DB passwords, API keys) | Auto-rotates RDS passwords; native Lambda rotation; replaces SSM Parameter Store for secrets
SSM Parameter Store | Hierarchical key-value store for config + secrets | Free tier for standard params; SecureString uses KMS; no auto-rotation
GuardDuty | Intelligent threat detection (ML-based anomaly detection) | Analyzes CloudTrail, VPC Flow Logs, DNS logs; no agents; works even if logging is disabled on resources
Inspector | Automated vulnerability scanning | Scans EC2 (via SSM agent) and ECR container images for CVEs and network exposure
Macie | Discover and protect sensitive data in S3 | ML-powered; finds PII, financial data, credentials; sends findings to Security Hub
Shield Standard | DDoS protection for all AWS customers | Free; protects against common L3/L4 attacks (SYN floods, reflection attacks)
Shield Advanced | Enhanced DDoS protection with 24/7 DRT access | ~$3,000/month; cost protection; works with ALB, CloudFront, Route 53, EC2, EIP
WAF | Web Application Firewall; filters HTTP/S traffic | Blocks SQLi and XSS; supports rate limiting and IP blocking; applies to ALB, CloudFront, API Gateway
Security Hub | Central security findings aggregator and compliance dashboard | Aggregates from GuardDuty, Inspector, Macie; maps to CIS, PCI-DSS, NIST standards
Detective | Investigate and analyze security findings | Uses ML + graph analysis on CloudTrail, VPC Flow, GuardDuty data; for forensics
Cognito | User identity and authentication for web/mobile apps | User Pools (user directory, sign-up/in); Identity Pools (federate access to AWS services)
CloudWatch — Performance Monitoring
Metrics:Collect & track time-series data (CPU, NetworkIn, DiskOps) from AWS services
Custom Metrics:Push your own app/infra metrics via PutMetricData API or CloudWatch Agent
Alarms:Alert when metric breaches threshold; trigger SNS, Auto Scaling, EC2 actions
Logs:Collect, store, query log data; Log Groups → Log Streams; Logs Insights for SQL-like queries
Dashboards:Cross-region, cross-account customizable monitoring views
Events/EventBridge:React to state changes in real time; schedule cron-like tasks
Agent:CloudWatch Agent required to collect memory/disk metrics from EC2 (not built-in)
CloudTrail — API Audit Logging
What:Records every API call made in your AWS account — who did what, when, from where
Captures:Management events (control plane) + Data events (S3 object ops, Lambda invocations) + Insight events
Retention:90-day event history free in console; deliver to S3 for indefinite retention
Integrity:Log file validation — detects if logs were tampered with (SHA-256 digest files)
Multi-region:Single trail can cover all regions; always enable in all regions
Use for:Security audits, compliance, "who deleted that resource?" investigations
AWS Config — Resource Compliance
What:Track and record configuration changes to AWS resources over time
Config Rules:Evaluate resources against desired configurations; AWS-managed or custom Lambda rules
Timeline:See config history for any resource — what changed, when, who triggered it
Remediation:Auto-remediate non-compliant resources via SSM Automation documents
Aggregation:Config Aggregator — multi-account, multi-region compliance view
Use for:Compliance (PCI-DSS, HIPAA, SOC2), drift detection, change management
Service | Purpose | Key detail
X-Ray | Distributed request tracing for microservices and Lambda | Visualize service maps; find bottlenecks; debug latency; requires X-Ray SDK in app
Trusted Advisor | Automated best practice recommendations | 5 check categories: Cost, Performance, Security, Fault Tolerance, Service Limits; Business/Enterprise plan unlocks all checks
AWS Health / Personal Health Dashboard | AWS service health + your account-specific events | Service Health Dashboard = global AWS status; Personal Health = your resources affected by AWS events
Compute Optimizer | Right-sizing recommendations to reduce cost/improve performance | Analyzes EC2, ASG, Lambda, EBS using ML; shows over-provisioned resources
Systems Manager (SSM) | Operations management for EC2 and on-prem instances | Session Manager (shell access without SSH keys or a bastion host), Run Command, Patch Manager, Parameter Store, Automation
CloudFormation | Infrastructure as Code; define AWS resources in JSON or YAML | Stack = group of resources; drift detection; rollback on failure; Change Sets to preview changes
CDK (Cloud Development Kit) | Define cloud infrastructure using Python, TypeScript, Java, etc. | Compiles to CloudFormation; higher-level abstractions (L1 = raw CFN, L2 = opinionated constructs)
Classic interview trap: CloudWatch = monitor HOW your resources are performing (CPU, latency, errors). CloudTrail = WHO made API calls (audit log). Config = WHAT changed in your config over time (compliance/drift). All three are different and complementary.
Lambda — Deep Dive
Runtimes:Node.js, Python, Java, Go, Ruby, .NET; or Custom Runtime (any language via bootstrap binary)
Memory:128 MB – 10,240 MB; CPU scales proportionally with memory allocation
Timeout:Max 15 minutes (900 seconds) per invocation
Ephemeral storage:/tmp — 512 MB (default) up to 10 GB; use for temp files during execution
Layers:Share code/dependencies across functions; up to 5 layers per function
Concurrency:Default 1,000 concurrent executions per region; Reserved or Provisioned Concurrency available
Invocation types:Synchronous (API GW, CLI) — wait for response; Asynchronous (S3, SNS) — fire and forget; Event Source Mapping (SQS, DynamoDB Streams, Kinesis)
API Gateway
What:Create, deploy, manage, and secure REST, HTTP, and WebSocket APIs at scale
REST API:Feature-rich — request/response transformation, usage plans, API keys, caching
HTTP API:Lower latency, lower cost; fewer features; good for Lambda proxy & HTTP backends
WebSocket API:Persistent connections for real-time apps (chat, dashboards, gaming)
Auth options:IAM, Amazon Cognito User Pools, Lambda Authorizer (custom JWT/OAuth)
Throttling:10,000 requests/sec default steady-state (5,000 burst); configurable per stage
Stages:Deploy to named stages (dev, staging, prod) with independent settings
SQS — Simple Queue Service
Standard queue:At-least-once delivery; best-effort ordering; nearly unlimited throughput
FIFO queue:Exactly-once processing; strict ordering; 300 msg/s (3,000 with batching)
Visibility timeout:Period a received message is hidden from other consumers (default 30s, max 12hr)
Message retention:Default 4 days; configurable 1 minute to 14 days
Max message size:256 KB (use S3 + Extended Client Library for larger payloads)
DLQ:Dead Letter Queue — messages that fail processing N times are moved here for inspection
Long Polling:Wait up to 20s for messages to arrive — reduces empty responses and cost
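The interaction between visibility timeout and the DLQ can be sketched with an in-memory toy queue. This is not the SQS API: class and method names are hypothetical, time is not modeled, and a message is dead-lettered as soon as its receive count reaches the limit (real SQS moves it on a later receive).

```python
from collections import deque


class SimulatedQueue:
    """Toy queue with a receive-count limit and a dead letter queue."""

    def __init__(self, max_receive_count: int):
        self.messages = deque()
        self.dead_letter_queue = []
        self.receive_counts = {}
        self.max_receive_count = max_receive_count

    def send(self, body: str) -> None:
        self.messages.append(body)

    def receive(self):
        """Hand out the next message; it is now hidden from other consumers."""
        if not self.messages:
            return None
        body = self.messages.popleft()
        self.receive_counts[body] = self.receive_counts.get(body, 0) + 1
        return body

    def delete(self, body: str) -> None:
        """Consumer finished successfully; the message is gone for good."""
        self.receive_counts.pop(body, None)

    def visibility_timeout_expired(self, body: str) -> None:
        """Consumer did not delete in time: redeliver, or dead-letter at the limit."""
        if self.receive_counts[body] >= self.max_receive_count:
            self.dead_letter_queue.append(body)
        else:
            self.messages.append(body)


queue = SimulatedQueue(max_receive_count=3)
queue.send("good-order")
queue.send("poison-order")  # hypothetical message that always fails processing

while (message := queue.receive()) is not None:
    if message == "good-order":
        queue.delete(message)
    else:
        queue.visibility_timeout_expired(message)

print(queue.dead_letter_queue)
```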
SNS — Simple Notification Service
What:Pub/Sub — publish a message once, deliver to many subscribers simultaneously
Subscribers:Lambda, SQS, HTTP/S endpoints, Email, SMS, Mobile Push (APNs, FCM)
Fan-out pattern:SNS topic → multiple SQS queues — parallel processing with different consumers
FIFO topics:Ordered, deduplication; only SQS FIFO queues can subscribe
Message filtering:Subscription filter policies — each subscriber receives only relevant messages
SQS vs SNS:SQS = pull-based queue (one consumer processes each message). SNS = push-based, all subscribers get every message.
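Fan-out with message filtering can be sketched locally as follows. The queue names and attribute values are hypothetical, and only exact string matching is modeled; real filter policies support more operators.

```python
# Each subscription has a filter policy: attribute name -> accepted values.
# An empty policy means the subscriber receives every message.
subscriptions = {
    "billing-queue": {"event_type": ["order_placed", "order_refunded"]},
    "shipping-queue": {"event_type": ["order_placed"]},
    "audit-queue": {},
}


def matches(filter_policy: dict, attributes: dict) -> bool:
    """A message matches when every policy key has an accepted value."""
    return all(
        attributes.get(key) in accepted
        for key, accepted in filter_policy.items()
    )


def publish(attributes: dict) -> list:
    """Return the subscribers that would receive a copy of the message."""
    return [
        name
        for name, policy in subscriptions.items()
        if matches(policy, attributes)
    ]


print(publish({"event_type": "order_placed"}))
print(publish({"event_type": "order_refunded"}))
```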
Kinesis — Real-time Streaming
Kinesis Data Streams:Real-time data ingestion; shards (1 MB/s in, 2 MB/s out each); retention 1–365 days
Kinesis Data Firehose:Fully managed delivery to S3, Redshift, OpenSearch, Splunk; near real-time (~60s)
Kinesis Data Analytics:SQL or Apache Flink queries on streaming data in real time
vs SQS:Kinesis = multiple consumers, ordered per shard, replay-able. SQS = one consumer group processes & deletes.
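The per-shard limits quoted above (1 MB/s in, 2 MB/s out) give a quick way to size a stream. The sketch below uses only those two limits; the traffic figures are hypothetical, and other limits such as records per second are ignored.

```python
import math

INGEST_MB_PER_SHARD = 1  # from the text
EGRESS_MB_PER_SHARD = 2  # from the text


def shards_needed(ingest_mb_per_s: float, egress_mb_per_s: float) -> int:
    """Smallest shard count that satisfies both the write and read limits."""
    for_ingest = math.ceil(ingest_mb_per_s / INGEST_MB_PER_SHARD)
    for_egress = math.ceil(egress_mb_per_s / EGRESS_MB_PER_SHARD)
    return max(for_ingest, for_egress, 1)  # a stream needs at least one shard


# Hypothetical stream: 12 MB/s written, 30 MB/s read by all consumers combined.
print(shards_needed(12, 30))
```

Here the read side is the bottleneck: 30 MB/s of reads needs 15 shards even though 12 would cover the writes.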
Step Functions
What:Orchestrate multi-step workflows as JSON-defined state machines; visual workflow designer
State types:Task, Choice (branching), Wait, Parallel, Map (iterate over array), Pass, Succeed, Fail
Standard:Up to 1 year duration; exactly-once workflow execution; for long-running workflows
Express:Up to 5 min; at-least-once; higher throughput; for high-volume, short-lived workflows
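A JSON-defined state machine using several of the state types above might look like the sketch below, built as a Python dictionary in Amazon States Language form. The workflow, state names, and Lambda ARN are hypothetical; the helper only checks that every transition points at a defined state and does not deploy anything.

```python
import json

state_machine = {
    "Comment": "Hypothetical order workflow",
    "StartAt": "ValidateOrder",
    "States": {
        "ValidateOrder": {
            "Type": "Task",
            # Hypothetical Lambda function ARN.
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:validate-order",
            "Next": "IsValid",
        },
        "IsValid": {
            "Type": "Choice",
            "Choices": [
                {"Variable": "$.valid", "BooleanEquals": True, "Next": "WaitForPayment"}
            ],
            "Default": "Rejected",
        },
        "WaitForPayment": {"Type": "Wait", "Seconds": 30, "Next": "Done"},
        "Rejected": {"Type": "Fail", "Error": "InvalidOrder", "Cause": "Validation failed"},
        "Done": {"Type": "Succeed"},
    },
}


def referenced_states(definition: dict) -> set:
    """Collect every state name that some transition points to."""
    targets = {definition["StartAt"]}
    for state in definition["States"].values():
        if "Next" in state:
            targets.add(state["Next"])
        if "Default" in state:
            targets.add(state["Default"])
        for choice in state.get("Choices", []):
            targets.add(choice["Next"])
    return targets


missing = referenced_states(state_machine) - set(state_machine["States"])
print("Undefined states:", missing)
print(json.dumps(state_machine, indent=2))
```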
EventBridge
What:Serverless event bus; route events between AWS services, SaaS apps, custom apps
Event buses:Default (AWS services), custom (your app), partner (SaaS: Zendesk, Stripe, etc.)
Rules:Match events by pattern; route to Lambda, SQS, SNS, Step Functions, API Gateway, etc.
Scheduler:Cron and rate-based schedules; replaced CloudWatch Events Scheduled Rules
Schema Registry:Discover, create, and manage event schemas; auto-generates code bindings
Serverless pattern to know: API Gateway → Lambda → DynamoDB is the classic serverless CRUD backend. Add Cognito for auth, CloudFront in front of API GW for caching/global edge, and SQS between Lambda functions for decoupling.
Region
What:Geographic cluster of data centers; 33+ regions worldwide
Choosing a region:Data residency laws, latency to users, service availability, pricing
Isolated:Regions are completely independent; disaster in one does NOT affect another
Availability Zone (AZ)
What:One or more discrete, isolated data centers within a region; 105+ AZs globally
Connected:Low-latency, high-bandwidth private fiber links between AZs in a region
Best practice:Deploy across ≥ 2 AZs for high availability
Edge Locations
What:CloudFront CDN PoPs; 400+ globally; cache and serve content close to users
Local Zones:AWS infra extensions to metro areas for ultra-low latency (gaming, live streaming)
Wavelength:AWS compute embedded in 5G networks; sub-10ms latency for mobile apps
ASG Core Concepts
Purpose:Automatically add or remove EC2 instances based on demand or schedule
Launch Template:Defines what instance to launch — AMI, instance type, security groups, user data, etc.
Min / Max / Desired:Always configure all three; Desired is current target; Min protects against scale-in removing everything
Multi-AZ:ASG distributes instances across AZs and rebalances automatically
ELB integration:ASG registers new instances with ELB; removes and drains unhealthy ones
Cooldown period:Default 300s after scaling action; prevents rapid repeated scaling
Scaling Policies
Target Tracking:Maintain a target metric (e.g. CPU = 50%); AWS adjusts capacity automatically — most common
Step Scaling:Scale by a configured amount when a CloudWatch alarm triggers; different steps for different breach levels
Simple Scaling:Single step on alarm; waits for cooldown before re-evaluating; legacy option
Scheduled Scaling:Scale based on known patterns (e.g. add 5 instances every weekday at 9am)
Predictive Scaling:ML-based; proactively scales before anticipated demand spikes
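The idea behind Target Tracking can be approximated with proportional arithmetic: if the metric is above target, add capacity in proportion, then clamp to Min/Max. This is a simplified sketch under that assumption, not the service's actual algorithm, and the numbers are hypothetical.

```python
import math


def target_tracking_desired(current_capacity: int, metric_value: float,
                            target_value: float, min_size: int, max_size: int) -> int:
    """Proportional estimate of desired capacity, clamped to the ASG bounds."""
    proposed = math.ceil(current_capacity * metric_value / target_value)
    return max(min_size, min(max_size, proposed))


# Hypothetical ASG: min 2, max 10, target CPU = 50%.
print(target_tracking_desired(4, 80, 50, min_size=2, max_size=10))  # scale out
print(target_tracking_desired(4, 20, 50, min_size=2, max_size=10))  # scale in
print(target_tracking_desired(8, 90, 50, min_size=2, max_size=10))  # capped at max
```

The clamp shows why Min and Max matter: Min stops scale-in from removing everything, and Max bounds cost during a spike.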
Strategy | RTO | RPO | Cost | How it works
Backup & Restore | Hours | Hours | $ | Regular backups to S3/Glacier; restore infra from scratch via CloudFormation on disaster
Pilot Light | Tens of minutes | Minutes | $$ | Core systems (DBs) always running in DR region; scale out app servers only after failover
Warm Standby | Minutes | Seconds | $$$ | Reduced-scale replica running continuously in DR region; scale to full size on failover
Multi-Site Active/Active | Near zero (<60s) | Near zero | $$$$ | Full-scale environment in 2+ regions simultaneously; Route 53 Weighted or Latency routing splits traffic
Recovery Objectives
RPO (Recovery Point Objective):Max acceptable data loss measured in time — "how old can the data be when we recover?"
RTO (Recovery Time Objective):Max acceptable time to restore service — "how long can we be down?"
Fault Tolerant:System continues operating with zero downtime despite component failure (harder, more expensive)
Highly Available:System recovers quickly from failure with minimal downtime (practical HA target)
HA Patterns
Multi-AZ RDS:Synchronous standby; auto failover; RTO typically 60–120 seconds
S3 CRR:Cross-Region Replication — async; enables cross-region HA for object storage
Route 53 Failover:Active-passive with health checks; automatically routes DNS to secondary endpoint
Aurora Global DB:RPO < 1s; RTO < 1 min; global write forwarding; cross-region reads
DynamoDB Global Tables:Active-active multi-region; auto-replication; single-digit ms reads anywhere
HA ≠ DR: High Availability addresses AZ-level failures (one data center goes down). Disaster Recovery addresses region-level failures (entire region is unavailable). Both are separate design concerns and interviewers love this distinction.
PILLAR 01
Operational Excellence
Run and monitor systems to deliver business value while continuously improving supporting processes. Key practices: Infrastructure as Code (CloudFormation/CDK), CI/CD pipelines, small & frequent reversible changes, anticipate failure (game days), runbooks and playbooks, annotate documentation. Key services: CodePipeline, CodeDeploy, Systems Manager, CloudFormation.
PILLAR 02
Security
Protect data, systems, and assets. Key practices: Strong identity foundation (IAM least privilege, MFA), enable traceability (CloudTrail, Config), apply security at all layers (SGs, NACLs, WAF), automate security best practices, encrypt data in transit and at rest (KMS), protect people from making mistakes (SCPs). Key services: IAM, KMS, GuardDuty, Security Hub, Shield, WAF, Macie.
PILLAR 03
Reliability
Recover from infrastructure or service failures, dynamically acquire computing resources to meet demand. Key practices: Test recovery procedures, automatically recover from failure (Auto Scaling, Multi-AZ), scale horizontally, stop guessing capacity, manage change with automation. Key services: Auto Scaling, ELB, Multi-AZ RDS, Route 53, Backup, CloudFormation.
PILLAR 04
Performance Efficiency
Use computing resources efficiently and maintain efficiency as demand changes. Key practices: Democratize advanced technologies (use managed services), go global in minutes (CloudFront, Multi-Region), use serverless architecture, experiment more often, mechanical sympathy (choose the right resource type). Key services: Lambda, Fargate, CloudFront, ElastiCache, DynamoDB, Compute Optimizer.
PILLAR 05
Cost Optimization
Avoid unnecessary costs. Key practices: Implement cloud financial management, adopt a consumption model (pay only for what you use), measure overall efficiency (CloudWatch, Cost Explorer), stop spending money on undifferentiated heavy lifting (use managed services), analyze and attribute expenditure (tagging, Cost Allocation Tags). Key services: Cost Explorer, Budgets, Savings Plans, Reserved Instances, Compute Optimizer, Trusted Advisor.
PILLAR 06
Sustainability
Minimize environmental impacts of running cloud workloads. Key practices: Understand your impact, establish sustainability goals, maximize utilization (right-size, reduce idle resources), anticipate and adopt more efficient offerings (Graviton/ARM processors), use managed services (AWS achieves higher utilization than individual customers), reduce downstream impact. Key services: Graviton instances, serverless, Compute Optimizer, Sustainability in the Well-Architected Tool.
What it is:The AWS Well-Architected Tool, a free AWS Console tool to review your architecture against the 6 pillars using questionnaires
Output:Improvement plan with prioritized recommendations and links to guidance
Lens library:Specialized lenses for SaaS, Serverless, Machine Learning, Analytics, Government, etc.
Partner programs:AWS Well-Architected Partner Program — APN partners can run formal reviews
Mnemonic: O · S · R · P · C · S → "Oh So Reliable, Performance Costs Something" — Operational Excellence, Security, Reliability, Performance Efficiency, Cost Optimization, Sustainability
Pay for what you use
No upfront costs for most services; pay per unit of consumption (seconds, GB, requests)
Compute billed per second (Linux EC2), per millisecond (Lambda), or per hour (Windows)
Pay less as you use more
Tiered pricing — the more you use, the less you pay per unit
Applies to S3 storage, EC2 data transfer, and many other services
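A minimal sketch of how graduated (tiered) pricing works: each tier's rate applies only to the usage that falls inside that tier. The tier sizes and per-GB rates below are illustrative placeholders, not current AWS prices, and `tiered_cost` is a hypothetical helper.

```python
def tiered_cost(usage_gb, tiers):
    """Total cost of usage_gb under graduated tiers.

    tiers is a list of (tier_size_gb, price_per_gb) pairs in order;
    a tier_size_gb of None means "everything above the previous tiers".
    """
    remaining = usage_gb
    total = 0.0
    for size_gb, price_per_gb in tiers:
        if remaining <= 0:
            break
        # Only the slice of usage inside this tier is billed at its rate.
        billed = remaining if size_gb is None else min(remaining, size_gb)
        total += billed * price_per_gb
        remaining -= billed
    return total


# Hypothetical rates, for illustration only.
EXAMPLE_TIERS = [(50_000, 0.023), (450_000, 0.022), (None, 0.021)]

monthly_cost = tiered_cost(60_000, EXAMPLE_TIERS)
blended_rate = monthly_cost / 60_000  # falls below the first-tier rate
```

With these placeholder tiers, the first 50,000 GB are charged at the top rate and only the remaining 10,000 GB at the cheaper one, so the blended per-GB rate drops as usage grows.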
Save with reservations
Commit to 1 or 3 years for significant discounts vs On-Demand
Reserved Instances (up to 72%), Savings Plans (up to 72% for EC2 Instance plans, up to 66% for Compute plans)
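To make the discount figures concrete, here is a rough sketch comparing a year of On-Demand usage with a commitment at the maximum quoted discount. The $0.10/hour rate is a made-up example, and real discounts depend on term, payment option, instance family, and region. Because a commitment is paid whether or not the instance runs, it only wins if the instance runs for more than (1 - discount) of the hours.

```python
HOURS_PER_YEAR = 8760


def annual_cost(on_demand_hourly, discount=0.0, hours=HOURS_PER_YEAR):
    """Yearly cost at an hourly rate reduced by a fractional discount."""
    return on_demand_hourly * (1.0 - discount) * hours


def break_even_utilization(discount):
    """Fraction of the year an instance must run before a commitment
    (paid for every hour) becomes cheaper than paying On-Demand."""
    return 1.0 - discount


on_demand = annual_cost(0.10)                  # hypothetical $0.10/hour rate
reserved = annual_cost(0.10, discount=0.72)    # best-case "up to 72%" discount
threshold = break_even_utilization(0.72)       # about 28% of the year
```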
Tool | Purpose | Key capability
Cost Explorer | Visualize, analyze, and forecast spend & usage | 12-month historical view; 12-month forecast; RI/SP recommendations; filter by service/tag/account
AWS Budgets | Set cost & usage thresholds with alerts | Alert at X% of budget; trigger SNS/email; Budgets Actions can stop EC2/RDS automatically
Cost & Usage Report (CUR) | Most granular cost data available | Hourly or daily CSV; every resource/usage type; load into Athena or Redshift for analysis
Pricing Calculator | Estimate cost for new architectures before building | calculator.aws; model any combination of services; export as CSV or share link
Savings Plans | Flexible discount model: commit $/hr compute spend | Compute SP (most flexible: EC2/Lambda/Fargate, any region/family); EC2 SP (higher discount, specific family)
Reserved Instances | Commit to specific instance for 1 or 3 years | Standard RI (up to 72%, fixed); Convertible RI (up to 54%, can change family/OS/tenancy)
Cost Allocation Tags | Tag resources to attribute costs to teams/projects | AWS-generated + user-defined; activated in Billing Console; appears in CUR
Compute Optimizer | Right-sizing recommendations | Identifies over-provisioned EC2, Lambda, EBS; estimates potential savings
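Cost attribution by tag boils down to grouping line items by a tag value and summing. The sketch below uses a simplified, hypothetical row shape (`cost` and `tags` keys); real Cost & Usage Report columns are named differently, and in practice the grouping is usually done in Athena or Cost Explorer rather than in application code.

```python
from collections import defaultdict


def cost_by_tag(line_items, tag_key, untagged="(untagged)"):
    """Sum the cost of each line item under the value of one tag key."""
    totals = defaultdict(float)
    for item in line_items:
        # Resources missing the tag are collected under a catch-all label.
        label = item.get("tags", {}).get(tag_key, untagged)
        totals[label] += item["cost"]
    return dict(totals)


# Hypothetical billing rows, for illustration only.
rows = [
    {"cost": 10.0, "tags": {"team": "web"}},
    {"cost": 5.0, "tags": {"team": "data"}},
    {"cost": 2.5, "tags": {}},
    {"cost": 1.5, "tags": {"team": "web"}},
]
by_team = cost_by_tag(rows, "team")
```

The catch-all bucket is the useful part: a large untagged total is a sign that the tagging policy is not being enforced.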
Plan | Starting price | Critical case response | Key benefits
Basic | Free (all accounts) | No technical support | Documentation, forums, Trusted Advisor (7 core checks), Personal Health Dashboard
Developer | $29/month (or 3% of monthly spend) | Not available | 1 primary contact, business hours email support, general guidance <24 hr, system impaired <12 hr
Business | $100/month (or % of spend) | Production system down: <1 hour | Unlimited contacts, 24/7 phone/chat/email, full Trusted Advisor, AWS Health API, Infrastructure Event Management (extra fee)
Enterprise On-Ramp | $5,500/month (or % of spend) | Business-critical system down: <30 min | Pool of Technical Account Managers, Concierge Support Team, annual architecture reviews, proactive programs
Enterprise | $15,000/month (or % of spend) | Business-critical system down: <15 min | Dedicated TAM, Concierge Support, proactive guidance, Well-Architected Reviews, training credits
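The "starting price (or % of spend)" wording means the monthly fee is whichever is larger: the minimum or the percentage of spend. A small sketch for the Developer plan, assuming that "greater of" reading and the $29 / 3% figures from the table; the higher plans use their own percentage schedules, which are not modeled here, and `developer_support_fee` is a hypothetical helper name.

```python
def developer_support_fee(monthly_aws_spend, minimum=29.0, rate=0.03):
    """Monthly Developer plan fee: the larger of the flat minimum
    and a percentage of that month's AWS spend (assumed reading)."""
    return max(minimum, rate * monthly_aws_spend)


small_account = developer_support_fee(500.0)    # 3% is $15, so the $29 minimum applies
larger_account = developer_support_fee(2000.0)  # 3% is $60, which exceeds the minimum
```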
Always Free (no expiry)
Lambda:1 million requests/month + 400,000 GB-seconds compute/month
DynamoDB:25 GB storage + 25 WCU + 25 RCU
CloudFront:1 TB data transfer out + 10 million HTTP/S requests/month
CloudWatch:10 custom metrics + 10 alarms + 1 million API requests
SNS:1 million publish API requests; email delivery always free
SQS:1 million requests/month
Cognito:50,000 monthly active users in User Pools
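The Lambda allowance is easier to reason about with the GB-seconds formula written out: memory in GB, times average duration in seconds, times invocations. This is a rough sketch using the free-tier figures listed above; the workload numbers are invented and `lambda_monthly_usage` is a hypothetical helper.

```python
FREE_REQUESTS = 1_000_000      # requests per month, from the list above
FREE_GB_SECONDS = 400_000      # GB-seconds per month, from the list above


def lambda_monthly_usage(memory_mb, avg_duration_ms, invocations):
    """Return total GB-seconds and the portions above the free allowance."""
    # (memory_mb / 1024) GB * (avg_duration_ms / 1000) s * invocations
    gb_seconds = memory_mb * avg_duration_ms * invocations / (1024 * 1000)
    return {
        "gb_seconds": gb_seconds,
        "billable_requests": max(0, invocations - FREE_REQUESTS),
        "billable_gb_seconds": max(0.0, gb_seconds - FREE_GB_SECONDS),
    }


# Hypothetical workload: 512 MB function, 200 ms average, 3 million calls/month.
usage = lambda_monthly_usage(512, 200, 3_000_000)
```

In this example the compute stays inside the 400,000 GB-second allowance, but two million requests fall outside the free request quota, which shows that the two limits are tracked separately.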
12-Month Free (new accounts only)
EC2:750 hours/month of t2.micro or t3.micro (Linux/Windows)
S3:5 GB standard storage + 20,000 GET + 2,000 PUT requests
RDS:750 hours/month of db.t2.micro or db.t3.micro (MySQL, PostgreSQL, MariaDB)
CloudFront:1 TB data transfer out
EBS:30 GB of General Purpose SSD (gp2/gp3) or magnetic storage
Elastic Load Balancer:750 hours/month of Classic or Application Load Balancer
If the question says…
"serverless"→ Lambda, Fargate, DynamoDB, Aurora Serverless, S3, API Gateway
"containers"→ ECS (Docker, AWS orchestrator), EKS (Kubernetes), Fargate (serverless containers), ECR (registry)
"shared file system / NFS"→ EFS (not EBS — EBS is one AZ, one instance)
"decouple / async / buffer"→ SQS (queue for decoupling), SNS (pub/sub fan-out)
"real-time data streaming"→ Kinesis Data Streams; Kinesis Firehose to deliver to S3/Redshift
"CDN / edge caching"→ CloudFront
"DNS / routing"→ Route 53
"cache / reduce DB load"→ ElastiCache (Redis or Memcached); DAX for DynamoDB specifically
"data warehouse / BI / OLAP"→ Redshift; Redshift Spectrum to query S3
"audit trail / who called what API"→ CloudTrail (NOT CloudWatch)
"performance monitoring / metrics"→ CloudWatch (metrics, alarms, logs, dashboards)
"config drift / compliance"→ AWS Config
"best practice recommendations"→ Trusted Advisor
"graph database"→ Neptune
"time-series / IoT data"→ Timestream
"immutable audit ledger"→ QLDB
More keywords…
"encrypt / manage keys"→ KMS (managed, software-based); CloudHSM (dedicated hardware FIPS 140-2 L3)
"detect threats / intrusion detection"→ GuardDuty (analyses logs, ML-based)
"scan for vulnerabilities / CVE"→ Amazon Inspector (EC2 and container images)
"find PII in S3"→ Amazon Macie
"DDoS protection"→ Shield Standard (free, automatic); Shield Advanced (paid, 24/7 DRT)
"block SQLi / XSS / web attacks"→ WAF (Web Application Firewall)
"store secrets / DB password rotation"→ Secrets Manager (auto-rotates); SSM Parameter Store (manual, cheaper)
"IaC / infrastructure as code"→ CloudFormation (JSON/YAML); CDK (Python/TypeScript/etc. compiles to CFN)
"hybrid cloud / on-prem integration"→ Direct Connect (fiber), Storage Gateway, Outposts (AWS in your DC)
"migrate database to AWS"→ DMS (Database Migration Service); Schema Conversion Tool (SCT) for heterogeneous migrations
"move petabytes physically"→ Snowball Edge (80 TB); Snowmobile (100 PB per truck)
"orchestrate multi-step workflow"→ Step Functions
"SSH without bastion / SSH keys"→ Systems Manager Session Manager
"user sign-up / authentication for your app"→ Cognito (User Pools = user directory; Identity Pools = AWS access federation)
"cross-account access"→ IAM Roles (assume role from another account); AWS Organizations + SCPs for guardrails
Q: What's the difference between S3 and EBS?
S3 = object storage, accessed via HTTP API/URL, not mountable as a block device, virtually unlimited capacity, reachable from anywhere over HTTPS (the data itself lives in one Region). EBS = block storage (like a hard drive), attached to one EC2 instance, tied to one AZ, low-latency random I/O. Choose EBS for OS and databases; choose S3 for static files, backups, media, and large-scale data.
Q: What's the difference between SQS and SNS?
SQS = pull-based queue; one consumer (or consumer group) processes and deletes each message; great for decoupling services. SNS = push-based pub/sub; one message published → all subscribers receive it simultaneously. Often used together: SNS fan-out to multiple SQS queues for parallel processing.
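The fan-out pattern can be modeled in a few lines. This is an in-memory toy, not the AWS SDK: the `Queue` and `Topic` classes are hypothetical stand-ins, and real SQS adds visibility timeouts, explicit deletes, and retries that are left out here.

```python
from collections import deque


class Queue:
    """Pull-based: messages wait until a consumer polls for them."""

    def __init__(self, name):
        self.name = name
        self._messages = deque()

    def send(self, body):
        self._messages.append(body)

    def poll(self):
        # Hand the oldest message to one consumer and remove it;
        # return None when the queue is empty.
        return self._messages.popleft() if self._messages else None


class Topic:
    """Push-based: every subscriber gets its own copy of each message."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, queue):
        self._subscribers.append(queue)

    def publish(self, body):
        for queue in self._subscribers:
            queue.send(body)


# Fan-out: one published event lands in both queues independently.
orders = Topic()
billing, shipping = Queue("billing"), Queue("shipping")
orders.subscribe(billing)
orders.subscribe(shipping)
orders.publish({"order_id": 42})
```

Each queue then drains at its own pace, so a slow shipping worker never blocks billing.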
Q: What's a Region vs an Availability Zone?
Region = independent geographic location (e.g. us-east-1 = Northern Virginia). Has multiple AZs. Data does not leave the region unless you explicitly configure it. AZ = one or more physically separate data centers within a region, connected by low-latency links. Deploy across ≥2 AZs for HA; across ≥2 Regions for DR.
Q: How do EC2 instances securely access S3 without hardcoding credentials?
Attach an IAM Role with the appropriate S3 permissions to the EC2 instance (the role is wrapped in an instance profile). The instance retrieves temporary, automatically rotated credentials via the Instance Metadata Service (IMDS), and the AWS SDK picks them up without any configuration. Never store access keys in code, in environment variables baked into AMIs, or in S3 buckets.
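A role for this setup needs two policy documents: a trust policy that lets the EC2 service assume the role, and a permissions policy that grants the S3 access. A minimal sketch of both, built as Python dicts; the bucket name is a placeholder, `build_ec2_s3_role_policies` is a hypothetical helper, and read-only `s3:GetObject` is just one example of a least-privilege grant.

```python
import json


def build_ec2_s3_role_policies(bucket_name):
    """Return (trust_policy, permissions_policy) for an EC2 role
    that may read objects from a single bucket."""
    trust_policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            # Who may assume the role: the EC2 service itself.
            "Principal": {"Service": "ec2.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }],
    }
    permissions_policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            # Object-level ARN: every key inside this one bucket.
            "Resource": f"arn:aws:s3:::{bucket_name}/*",
        }],
    }
    return trust_policy, permissions_policy


trust, permissions = build_ec2_s3_role_policies("example-bucket")
trust_json = json.dumps(trust, indent=2)  # JSON text ready to paste into IAM
```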
Q: What's "stateless" application design and why does it matter in AWS?
A stateless app doesn't store session data in the instance's local memory or disk. Any server can handle any request. This enables Auto Scaling — you can add/remove instances freely. Store state externally in ElastiCache (sessions), DynamoDB (data), or S3 (files). The opposite (stateful) makes scaling and failover much harder.
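A toy illustration of the idea: two app servers keep nothing locally and share an external session store, so either can serve the next request. `SessionStore` is a hypothetical in-memory stand-in for something like ElastiCache or DynamoDB.

```python
class SessionStore:
    """Stand-in for an external store shared by all instances."""

    def __init__(self):
        self._data = {}

    def get(self, session_id):
        return self._data.get(session_id, {})

    def put(self, session_id, session):
        self._data[session_id] = session


class AppServer:
    """Holds no session data itself; all state goes through the store."""

    def __init__(self, name, store):
        self.name = name
        self.store = store

    def add_to_cart(self, session_id, item):
        session = self.store.get(session_id)
        cart = session.get("cart", []) + [item]
        self.store.put(session_id, {**session, "cart": cart})
        return cart


store = SessionStore()
server_a = AppServer("a", store)
server_b = AppServer("b", store)

server_a.add_to_cart("session-1", "book")
cart = server_b.add_to_cart("session-1", "pen")  # a different server, same session
```

If `server_a` were terminated by a scale-in event between the two calls, the cart would still be intact, which is the property Auto Scaling relies on.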
Q: What is the AWS Shared Responsibility Model?
AWS is responsible for "Security OF the Cloud" — the physical infrastructure, hardware, hypervisor, managed service software. Customers are responsible for "Security IN the Cloud" — OS patching on EC2, application code, IAM configuration, data encryption, network/firewall config. The boundary shifts for managed services: RDS means AWS patches the DB engine, but you configure Security Groups and encryption.
Q: Multi-AZ vs Read Replicas in RDS — what's the difference?
Multi-AZ = synchronous replication to a standby in another AZ; automatic failover; the standby cannot serve read traffic; purely for high availability. Read Replicas = asynchronous replication; can serve read queries to scale read throughput; can be cross-region; can be promoted to standalone DB; NOT for automatic failover.
33+ Regions
105+ AZs
400+ Edge Locations
S3 durability: 11 nines (99.999999999%)
Lambda max timeout: 15 minutes
Lambda memory: 128 MB – 10 GB
Lambda concurrency default: 1,000/region
S3 max object: 5 TB
S3 multipart above: 5 GB
EC2 Reserved savings: up to 72%
Spot savings: up to 90%
Savings Plans savings: up to 66% (Compute SP), up to 72% (EC2 Instance SP)
DynamoDB: single-digit ms latency
RDS backup retention: 1–35 days
Aurora: up to 15 read replicas
SQS max retention: 14 days
SQS max msg size: 256 KB
SQS FIFO: 300 msg/s (3,000 w/ batching)
CloudTrail free retention: 90 days
Snowball Edge: ~80 TB usable
Snowmobile: 100 PB per truck
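To get a feel for what eleven nines means, treat the durability figure as the annual per-object survival probability and compute how often a loss would be expected. This is a back-of-the-envelope sketch under that assumption, using exact fractions to avoid floating-point noise; `years_per_expected_loss` is a hypothetical helper.

```python
from fractions import Fraction


def years_per_expected_loss(object_count, durability="0.99999999999"):
    """Average number of years between single-object losses, assuming
    durability is the annual per-object survival probability."""
    annual_loss_probability = 1 - Fraction(durability)
    expected_losses_per_year = object_count * annual_loss_probability
    return 1 / expected_losses_per_year


# Ten million stored objects: roughly one expected loss every 10,000 years.
interval = years_per_expected_loss(10_000_000)
```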