EC2

Instance
  • on launch:
    • can enable termination protection
    • can enable detailed monitoring (1-minute)
    • can join to a directory (Windows instances only)
    • can enable Elastic GPU (Windows instances only)
    • can attach AmazonEc2RoleforSSM role for connection with Systems Manager (Session Manager)
  • charged by the hour
  • on Linux, charged by the second
  • traditional EC2 instance types provide fixed CPU resources
  • instance types:
    • General Purpose
    • Compute Optimized: high performance processors
    • Memory Optimized: large data sets
    • Accelerated Computing: calculations, graphics processing, data pattern matching
    • Storage Optimized: low latency, high random I/O performance
EC2 burst balances
  • provide baseline level of CPU utilization with ability to burst CPU utilization above
  • pay only for baseline CPU plus any additional burst
    • lower compute costs
  • burstable performance instances use credits for CPU usage
  • earn credits when below CPU baseline
    • no credits earned when at the CPU baseline
  • spend credits when higher than CPU baseline
  • accrued credits: earned credits
      • can be used later to burst above the baseline
  • when more credits are spent than earned:
    • standard mode: instance uses accrued credits
      • if no accrued credits, instance slow to baseline CPU
    • unlimited mode: instance uses accrued credits
      • if no accrued credits, instance spends surplus credits
      • credits earned later pay back the surplus (a sort of debt)
  • T2 Unlimited allows applications to burst past the CPU performance baseline
    • can configure on instance launch
    • free for 12 months for new accounts
  • chargeable
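  • a minimal boto3 sketch of watching the credit mechanics above via CloudWatch (the instance ID is a placeholder):

    from datetime import datetime, timedelta, timezone

    import boto3

    cloudwatch = boto3.client("cloudwatch")

    now = datetime.now(timezone.utc)
    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/EC2",
        MetricName="CPUCreditBalance",          # accrued credits of a burstable instance
        Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
        StartTime=now - timedelta(hours=1),
        EndTime=now,
        Period=300,                             # 5-minute datapoints
        Statistics=["Average"],
    )
    for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
        print(point["Timestamp"], point["Average"])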
Change physical host
  • can stop and start EC2 instance to move it to a different physical host if EC2 status checks are failing or there is planned maintenance on current physical host
  • when stopped can modify
    • instance type
    • user data
    • kernel
    • RAM disk
IP address
  • public: changes when the instance is stopped and started
  • private: assigned automatically to the primary network interface (eth0) of all instances
  • Elastic IP: static public IP address (charged if unused)
    • max 5 per Region (can increase by contacting AWS Support)
    • region-specific
    • DNS records for Elastic IPs can be configured by filling a form
    • can associate single private IPv4 with single Elastic IP and vice versa
  • BYOIP: bring part or all publicly routable IPv4/IPv6 address range from on-premises network to AWS
    • not available in all Regions and for all resources
Bastion host
  • access VPC instances for management (SSH or RDP)
  • need security group with the relevant permissions
  • can restrict the CIDR ranges that can access the bastion host
  • autoscaling groups for HA
  • BEST PRACTICE: deploy Linux bastion hosts in two AZs
    • use auto-scaling and Elastic IP
  • can use Systems Manager Session Manager instead
Tag
  • can enforce standardized tagging via AWS Config or custom script
    • e.g. EC2 instances not properly tagged are stopped or terminated daily
  • up to 50 tags
Resource group
  • mapping of assets defined by tags
    • for metrics, alarms, config details, etc.
Instance store
  • provides temporary (non-persistent) block-level storage for instance
  • differs from EBS which provides persistent storage and is also a block storage service that can be root or additional volume
  • located on disks that are physically attached to the host computer
  • ideal for temporary storage of information that changes frequently
    • buffers, caches, scratch data, other temporary content
  • can specify instance store volumes for an instance only on launch
  • cannot move to another instance
  • instance type determines the size and the type of hardware of storage
  • some instance types use NVMe or SATA-based SSD to deliver high random I/O performance
  • good option when need storage with very low latency but don’t need data to persist when instance is terminated
  • good for distributed or replicated databases that need high I/O
  • included as part of instance’s usage cost
    • can be more cost-effective than EBS Provisioned IOPS
Monitor and logging
  • status checks: 1-minute checks that return pass or fail status
    • if all checks pass, status of instance is OK
    • otherwise is impaired
    • StatusCheckFailed_System: problems with instance that require AWS involvement
      • loss of network connectivity
      • loss of system power
      • software issues on physical host
      • hardware issues on physical host that impact network reachability
      • can recover, stop, terminate or reboot the instance
    • StatusCheckFailed_Instance: problems with instance that require user involvement
      • failed system status checks
      • incorrect networking or startup configuration
      • exhausted memory
      • corrupted file system
      • incompatible kernel
      • can stop, terminate or reboot the instance
  • unified CloudWatch Agent: see related section of CloudWatch
  • integrated with CloudTrail
    • can create a trail to enable continuous delivery of CloudTrail events to S3 bucket, including events for EC2 and EBS
    • a trail enables storing records indefinitely
    • if no trail is configured, can view the past 90 days of events in Event history
    • can use CloudTrail info to determine the request, the IP address, who made the request, when it was made, and additional details
  • detailed monitoring is chargeable
Regional Data Transfer
  • data between instances in different regions is charged
  • rates apply if:
    • the other instance is in a different AZ, regardless of the type of address used
    • public or Elastic IP addresses are used, regardless of the AZ of the other instance
On-demand instance
  • ideal for unpredictable workloads (dev/test)
    • max On-Demand instances running across instance family: 20
Spot instance
  • take advantage of the unused capacity in the cloud
  • up to a 90% discount
  • can use for stateless, fault-tolerant or flexible applications such as
    • big data
    • containerized workloads
    • CI/CD
    • web servers
    • high-performance computing
  • pricing is determined by long term trends in supply and demand for EC2 spare capacity:
    • don’t have to bid, just pay for the current hour
    • two-minute interruption notice when instances are about to be reclaimed by EC2 because EC2 needs capacity back
    • not interrupted because of higher competing bids
  • BEST PRACTICE: diversify to reduce impact of interruptions
    • can use RequestSpotFleet API to launch thousands of Spot Instances and diversify
    • can set up Spot Instances and Spot Fleets to respond to an interruption notice by stopping rather than terminating
  • each combination of instance family, instance size and AZ in every Region is a separate Spot pool
  • dynamic spot limit per Region
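  • a minimal boto3 sketch of launching a one-time Spot Instance through the regular RunInstances API (AMI ID and instance type are placeholders):

    import boto3

    ec2 = boto3.client("ec2")

    resp = ec2.run_instances(
        ImageId="ami-0123456789abcdef0",       # placeholder AMI
        InstanceType="t3.micro",
        MinCount=1,
        MaxCount=1,
        InstanceMarketOptions={
            "MarketType": "spot",
            "SpotOptions": {
                "SpotInstanceType": "one-time",
                # "stop" is also possible for persistent requests
                "InstanceInterruptionBehavior": "terminate",
            },
        },
    )
    print(resp["Instances"][0]["InstanceId"])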
Reserved instance (RI)
  • when a running instance matches an RI, the discount is applied automatically
  • Standard: commitment of 1 or 3 years, charged whether it’s on or off
    • 40-60% discount
    • can change AZ, instance size, networking type with the ModifyReservedInstances API
    • cannot change instance family, OS, tenancy, payment options
  • Scheduled: reserved for specific periods of time
    • charged hourly, billed in monthly increments over the term (1 year)
    • match capacity reservation to a predictable recurring schedule
  • Convertible: commitment of 1 or 3 years, charged whether it’s on or off
    • 31-54% discount
    • can change AZ, instance size, networking type with ExchangeReservedInstance API
    • can change instance family, OS, tenancy and payment options
  • discount applies to the selected AZ
  • if no AZ is specified no reservation is created but discount applies to any instance in the family in the Region (Regional RI)
  • max Reserved Instances purchase: 20
Dedicated host
  • physical servers dedicated to your use
  • available as On-Demand or Dedicated Host Reservation
  • useful for server-bound software licenses that use metrics like per-core, per-socket or per-VM
  • each dedicated host can only run 1 EC2 instance size and type
  • good for regulatory compliance or licensing requirements
  • complete isolation and predictable performance
  • most expensive option
  • billing is per host
Dedicated instance
  • virtualized instances on dedicated hardware
  • available as On-Demand, Reserved Instances or Spot Instances
  • also uses physically dedicated EC2 servers
  • does not provide the additional visibility and control of Dedicated Hosts
  • billing per instance
  • may share hardware with other instances from the same account that are not Dedicated Instances
  • costs an additional $2 per hour per Region

EBS

Persistence
  • persistent
  • can attach multiple volumes to an instance
  • a volume is attached to a single instance at a time for most use cases (Multi-Attach is limited to io1/io2)
  • replicated across multiple servers in an AZ
  • annual failure rate (AFR) of 0.1%-0.2%
  • can enable termination protection
  • root devices created under “/dev/sda1” or “/dev/xvda”
Volume
  • AZ specific
Volume type
  • SSD General Purpose (gp2/gp3)
    • balance price/performance
    • for boot volumes, low-latency interactive apps, dev and test
    • 1 GiB - 16 TiB
    • gp2
      • min 100 IOPS (at 33.33 GiB and below)
      • maximum of 16000 IOPS (at 5,334 GiB)
  • SSD Provisioned IOPS (io1/io2)
    • highest performance for latency-sensitive transactional workloads
    • for I/O intensive NoSQL and relational databases, boot volumes
    • 4 GiB - 16 TiB
    • IOPS to volume size (GiB) ratio of 50:1
      • e.g. 100 GiB, max 5,000 IOPS
      • e.g. 200 GiB, max 10,000 IOPS
    • max IOPS: 64,000
  • HDD Throughput Optimized (st1)
    • low cost for frequently accessed, throughput-intensive workloads
    • for big data, warehouses, log processing
    • 125 GiB - 16 TiB
    • max IOPS: 500
  • HDD Cold (sc1)
    • lowest cost for less frequently accessed
    • colder data
    • 125 GiB - 16 TiB
    • max IOPS: 250
Snapshot
  • can use to convert unencrypted volume to encrypted
  • no granular backup
  • saved incrementally
  • only accessed through EC2 APIs
  • Region-specific (volumes are AZ-specific)
  • can be taken of a non-root volume while running
  • a consistent snapshot needs writes to be paused
  • deletion process deletes only data not needed by other snapshots
  • can use to resize volume
  • can use to copy between Regions (create an AMI image)
  • create volume from snapshot, choosing AZ
  • charged for data traffic to S3 and S3 storage cost (only changed blocks)
Encryption
  • can encrypt both the boot and data volumes
    • data at rest inside the volume
    • all data moving between the volume and the instance
    • all snapshots created from the volume
    • all volumes created from those snapshots
  • same IOPS performance
  • uses AES-256 data key stored on disk after being encrypted with CMK
    • never appears on disk in plaintext
    • same data key shared by volumes created from snapshot
  • can check encryption status of volumes with AWS Config
    • no direct way to change state
  • cannot share encrypted volumes or snapshots created using the default CMK
  • can share snapshots with other accounts by keeping them private and selecting the accounts to share with
    • encrypted snapshots require a non-default CMK and cross-account permissions configured on the key
    • cannot make encrypted snapshots public
    • recommended that the receiving account re-encrypts the shared snapshot with its own CMK
RAID
  • can be used to increase IOPS
  • RAID 0 = striping: data written across multiple disks
    • increased performance
    • no added redundancy
  • RAID 1 = mirroring: 2 copies of the data
    • same performance
    • increased redundancy
  • RAID 10 = combination of RAID 1 and RAID 0
    • increased performance
    • increased redundancy
    • costs additional disks
  • can configure multiple striped gp2 or standard volumes (RAID 0)
  • can configure multiple striped PIOPS volumes (RAID 0)
  • configured through the guest OS
  • EBS-optimized EC2 instances also increase performance
  • not recommended for root/boot volumes
Monitoring and reporting
  • CloudWatch monitoring:
    • Basic: data available in 5-minute period
      • automatically sent by General Purpose (gp2), Throughput Optimized (st1) and Cold (sc1)
      • not charged
    • Detailed: automatically send 1-minute metrics
      • automatically sent by Provisioned IOPS SSD (io1)
      • charged
  • log to CloudTrail
  • Status checks:
    • ok:
      • enabled: I/O enabled
      • I/O performance status (only Provisioned IOPS volumes):
        • Normal
    • warning:
      • enabled: I/O enabled
      • disabled: volume is offline and pending recovery or waiting for user to enable I/O
      • I/O performance status (only Provisioned IOPS volumes)
        • Degraded: performance is below expectations
        • Severely Degraded: performance well below expectations
    • impaired:
      • enabled: I/O enabled
      • disabled: volume is offline and pending recovery or waiting for user to enable I/O
      • I/O performance status (only Provisioned IOPS volumes):
        • Stalled: performance severely impacted
        • Not Available: unable to determine I/O performance
    • insufficient-data

Elastic Load Balancer

Automatically distributes incoming traffic across multiple targets.

Security
  • ELB node within a subnet
    • ensure at least a /27 subnet with at least 8 free IP addresses
  • for ALB at least 2 subnets, for NLB at least 1 (2 recommended)
  • uses a DNS record with a 60-second TTL
  • automatically distributes traffic
    • supports only valid requests so DDoS attacks (UDP and SYN floods) are not able to reach EC2 instances
  • can add a WAF (AWS Web Application Firewall) to the ALB
Classic Load Balancer
  • (Layer 4/7): legacy, no longer recommended
  • protocols: all
Application Load Balancer
  • (Layer 7): routes based on content of requests
  • target: IP, instance, Lambda, containers
  • protocols: HTTP, HTTPS
  • support SSL certificates through AWS Certificate Manager
  • support SNI (Server Name Indication)
    • allows multiple websites to use a single secure listener
  • automatically scales its request handling capacity
  • host-based routing and path-based routing (like nginx)
  • microservices: load balancing to multiple ports on a single EC2 instance
  • integration with Cognito (user pools, SAML IdPs) for authentication, or with an OIDC IdP
  • health checks
  • sticky sessions (enabled at target group level): cookies to ensure client is bound to an individual back-end instance
    • duration-based cookies (AWSALB)
    • application-based cookies: custom cookie name per target group; WebSocket connections are sticky to follow the upgrade process
  • ECS: can use a single ALB; integrates with ECS containers using service load balancing
Network Load Balancer
  • (Layer 4): routes based on IP data
  • target: IP, instance, ALB
  • protocols: TCP, UDP, TLS
  • handle millions of requests per second, low latency
  • supports long-running/lived connections (ideal WebSocket)
Gateway Load Balancer
  • (Layer 3/4): firewalls, IDS/IPS systems
  • target: IP, instance
  • protocols: IP
Listener
  • define protocol and port to listen on
  • cannot have multiple listeners on the same port
  • up to 10
Listener rule
  • priority
  • one or more actions (one target group per action)
  • optional host/path condition
    • different target groups based on the content of the request
  • default rule directs to default target group
  • rules are evaluated in priority order; requests are forwarded to targets using round robin (see the sketch below)
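  • a minimal boto3 sketch of a path-based listener rule (listener and target group ARNs are placeholders):

    import boto3

    elbv2 = boto3.client("elbv2")

    elbv2.create_rule(
        ListenerArn="arn:aws:elasticloadbalancing:...:listener/app/my-alb/...",
        Priority=10,                                   # lower number = evaluated first
        Conditions=[{"Field": "path-pattern", "Values": ["/api/*"]}],
        Actions=[{
            "Type": "forward",
            "TargetGroupArn": "arn:aws:elasticloadbalancing:...:targetgroup/api/...",
        }],
    )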
Target group
  • logical grouping of targets
  • each listener contains a default rule, then can contain other rules that route requests to different target groups
  • an Auto Scaling Group can scale each target group individually
X-forwarded headers
  • the LB intercepts traffic between clients and servers, so servers would otherwise see only the IP address and protocol of the load balancer
  • non-standard HTTP headers with “X-Forwarded” prefix
  • X-Forwarded-For: help identify IP address of a client when using LB
    • use the attribute routing.http.xff_header_processing.mode to append, preserve or remove the X-Forwarded-For header
    • append mode (default): appends the client IP address to the header
    • preserve mode: keeps the header unmodified before sending it to the target
    • remove mode: removes the header before sending to the target
  • X-Forwarded-Proto: help identify protocol that client used (HTTP/S)
    • can be used by application to render response redirecting to appropriate URL
  • X-Forwarded-Port: help identify destination port client uses to connect
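  • a minimal sketch of recovering the client IP from X-Forwarded-For behind a load balancer in append mode (assuming a plain dict of request headers):

    def client_ip(headers):
        """Return the left-most X-Forwarded-For entry (the original client)."""
        xff = headers.get("X-Forwarded-For")
        if not xff:
            return None
        # header format in append mode: "client, proxy1, proxy2"
        return xff.split(",")[0].strip()

    print(client_ip({"X-Forwarded-For": "203.0.113.7, 10.0.0.12"}))  # -> 203.0.113.7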
Monitoring
  • CloudWatch (1min, only when requests active)
    • can trigger SNS notification
  • Access Logs: requester info, time of the request, client’s IP, request paths, etc.
    • default disabled
    • storable in S3
  • CloudTrail (capture API calls, storable in S3)

Auto Scaling

Adjust capacity to maintain steady, predictable performance at lowest cost.

Provides horizontal scaling (scale-out).

Auto Scaling Group
  • can merge multiple ASGs (e.g. single-AZ groups into one multi-AZ group)
  • no additional cost, pay for the resources
Launch Configuration
  • template used to create new EC2 instances
Launch Template
  • same information of Launch Configuration
  • multi-versions of a template
Integration with ELB
  • can attach one or more ELB
  • can attach one or more Target groups
Lifecycle hooks
  • the ASG pauses the instance in a wait state to allow custom actions to be performed
    • waits until the lifecycle action is completed (or times out)
  • can configure to send SNS notification when instance is launched/terminated/fails to launch/terminate
Security and HA
  • HA is when instances are launched in at least 2 AZ
  • cannot provide HA across multiple Regions
  • support IAM policies: uses service-linked roles
    • default is AWSServiceRoleForAutoScaling
  • no support resource-based policies and ACLs
Cooldown period
  • setting that ensures Auto Scaling does not launch or terminate instances before the previous scaling activity takes effect
Warm-up period
  • period during which a new EC2 instance (with a Step Scaling Policy) is not counted toward the group's metrics
Scaling options
  • Maintain: keep specific or minimum number of instances
  • Manual: maximum, minimum or specific number of instances
  • Scheduled: increase, decrease based on schedule
  • Dynamic: scale based on real-time metrics
  • Predictive: anticipates approaching traffic changes
    • predicts future traffic including regularly occurring spikes using machine learning algorithms
    • Target Tracking Policy: scales to keep metric at specific target value
      • e.g. want to keep CPU usage at 70%
    • Step Scaling Policy: scales based on set of scaling adjustments aka step adjustments
      • e.g. vary adjustments based on size of the alarm breach
  • Scaling based on SQS: custom CloudWatch metric that measures the number of messages in the queue per EC2 instance in the ASG (backlog per instance)
    • also tracks the number of messages available for retrieval
    • can base adjustments on the SQS metric ApproximateNumberOfMessagesVisible
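  • a minimal boto3 sketch of the Target Tracking Policy example above (keep average CPU at 70%; the ASG name is a placeholder):

    import boto3

    autoscaling = boto3.client("autoscaling")

    autoscaling.put_scaling_policy(
        AutoScalingGroupName="my-asg",
        PolicyName="keep-cpu-at-70",
        PolicyType="TargetTrackingScaling",
        TargetTrackingConfiguration={
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "ASGAverageCPUUtilization",
            },
            "TargetValue": 70.0,
        },
    )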
Termination policies
  • control instances that are terminated first when scale-in event occurs
  • default ensures EC2 instances span AZ evenly for HA
  • can enable Instance Protection to prevent Auto Scaling from scaling in and terminating EC2 instances
  • health checks grace period of 300 seconds
    • in this period can use CLI set-instance-health to change instance to “healthy”
  • can use Console or CLI to manually detach instances from AS group
  • can suspend/resume any scaling process to investigate issues or make changes without scaling
  • instances can be put in standby (still managed by AS and charged)
    • can be used for performing updates/changes/troubleshooting without health checks invoking replacement
Monitoring, reporting and logging
  • examples of metrics sent every 1 min
    • GroupInServiceInstances
    • GroupStandbyInstances
    • GroupTerminatingInstances: used to wait volume attachment in ACT migration
    • GroupTotalInstances
  • by default AS uses EC2 status checks
  • support ELB health checks
    • instance is unhealthy if ELB reports OutOfService
    • BEST PRACTICE: enable ELB health checks, otherwise, in case of unhealthy ELB, an instance will be removed from service by ELB but not terminated by AS
  • CloudTrail captures API calls for Auto Scaling as events
    • if no trail is configured, can still view the most recent (up to 90 days) events in the Event History
  • no additional charges for default CloudWatch metrics

ECS

Highly scalable, high performance container management service that supports Docker.

Security
  • EC2 instances use IAM role to access ECS
  • can use IAM to control access at the container level
  • the container agent calls the ECS API through the applied IAM role
    • the role must be applied before launch
  • need to assign extra permissions to tasks through separate IAM roles (IAM Roles for Tasks)
    • used to access services and resources
  • Compute SLA: guarantees a Monthly Uptime Percentage of at least 99.99%
X-Ray
  • as a daemon: X-Ray agent runs in a daemon container
  • as a “sidecar”: X-Ray agent runs alongside each container
    • the only option on Fargate
  • task definition:
    • X-Ray runs on port 2000 UDP
    • must specify the daemon address
    • must link containers together
  • docker image can be deployed alongside application:
    • docker pull amazon/aws-xray-daemon
Elastic Beanstalk
  • Single Container Docker: described in Dockerfile or Dockerrun.aws.json definition
  • Multicontainer Docker: same set of containers in each environment defined in a Dockerrun.aws.json
    • use when multiple Docker containers are needed on each instance
Elastic Container Registry
  • through IAM
Application Load Balancer
  • ALB supports target group that contains a set of instance ports
  • can specify a dynamic port in task definition which gives the container an unused port when it is scheduled on the EC2 instance
Cluster
  • Container instance
  • Task
Service
  • Service scheduler: schedules ECS tasks
    • ensures the specified number of tasks is constantly running and reschedules them when they fail
    • can ensure tasks are registered against an ELB
  • Custom scheduler: schedules ECS tasks
    • custom scheduler to meet business needs
    • leverages third party schedulers (Blox)
    • leverages same cluster state information provided by ECS API to make appropriate placement decisions
  • Task
Task definition
  • bind mounts: temporary storage
  • docker volumes: created in /var/lib/docker/volumes in EC2 instance
ECS container agent
  • allows container instances to connect to the cluster
  • runs on each infrastructure resource
  • included in ECS optimized AMI (can be installed)
  • manually install on non-AWS Linux instances
Launch types
  • Fargate Launch Type:
    • serverless managed by AWS
    • only supports images hosted on ECR or Docker Hub
    • no support for docker volumes
    • recently added support for EFS and EBS
    • charged for running tasks
  • EC2 Launch Type:
    • EC2 instances
    • only supports private repositories
    • EFS and EBS integration
    • charged for running EC2 instance
Auto Scaling
  • increase/decrease task count automatically
  • leverage Application Auto Scaling service
  • can use ECS CloudWatch metrics to deal with high demand peaks or low utilization
    • average CPU and memory usage
  • Target tracking scaling policies
    • based on a target value for a specific metric
    • like a thermostat
    • scaling is not performed when there is insufficient data
    • Application Auto Scaling rounds actual metric data points
    • fast scale out, normal scale in
    • during a deployment, scale out continues but scale in is suspended
    • ALBRequestCountPerTarget metric for target tracking scaling policies not supported for the blue/green deployment type
  • Step scaling policies
    • based on step adjustments based on size of alarm breach
    • can scale out when CPU utilization reaches a certain level, create an alarm using the CPUUtilization metric provided
  • Scheduled scaling
    • based on date and time
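  • a minimal boto3 sketch of target tracking on an ECS service via Application Auto Scaling (cluster and service names are placeholders):

    import boto3

    aas = boto3.client("application-autoscaling")

    aas.register_scalable_target(
        ServiceNamespace="ecs",
        ResourceId="service/my-cluster/my-service",
        ScalableDimension="ecs:service:DesiredCount",
        MinCapacity=1,
        MaxCapacity=10,
    )

    aas.put_scaling_policy(
        PolicyName="ecs-cpu-target-tracking",
        ServiceNamespace="ecs",
        ResourceId="service/my-cluster/my-service",
        ScalableDimension="ecs:service:DesiredCount",
        PolicyType="TargetTrackingScaling",
        TargetTrackingScalingPolicyConfiguration={
            "TargetValue": 70.0,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "ECSServiceAverageCPUUtilization",
            },
        },
    )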
Cluster Auto Scaling
  • Capacity Provider: can be associated with an EC2 Auto Scaling Group
    • managed scaling: automatically creates a scaling policy on the ASG based on a new CapacityProviderReservation metric
    • managed instance termination protection: container-aware termination of instances in the ASG on scale-in (instances running tasks are protected)

EKS

Managed service for running Kubernetes on AWS and on-premises.

  • Can run on EC2 or Fargate
  • integrates with ALB, IAM for RBAC, and VPC
  • provides a scalable and highly available Kubernetes control plane running across multiple AZs
  • automatically manages availability, scalability of Kubernetes API servers and etcd persistence layer
  • 3 AZs to ensure high availability (automatically detects and replaces unhealthy control plane nodes)
  • use when organization needs a consistent control plane for managing containers across hybrid clouds and multi cloud environments

Lambda

Elastic Beanstalk

Quickly deploy and manage applications in AWS Cloud. PaaS (Platform as a Service). Relies on CloudFormation.

EB CLI
  • can use from local repository
Application
  • collection of environments
  • environment configurations
  • application versions
Application version
  • refers to a specific, labeled iteration of deployable code
  • points to an S3 bucket containing the code (source bundle)
    • the bucket can be protected to prevent data loss
  • can apply to any environment
  • retention by lifecycle policies:
    • time-based: max age
    • count-based: max number to retain
  • cannot be deleted while in use
Environment
  • provisioning of all resources deployed by a version
  • can be dev, prod, uat, …
Environment tiers
  • web servers
    • standard app
    • HTTP requests over port 80
  • workers
    • specialized apps
    • background processing tasks
    • long-running tasks: listen for messages on an SQS queue
      • can decouple application tiers (more environments for more workers)
      • can define periodic tasks (cron.yaml)
Environment configurations
  • collection of parameters and settings of env
  • configuration template
Deployment options
  • Single instance: for dev
  • High availability with load balancer: for prod
Deployment policies
  • All at once: deploy all instances simultaneously
    • fastest (good for quick iterations)
    • outage (not ideal for critical systems)
    • on failure, need to manually redeploy the original version
    • no additional cost
  • Rolling: update few instances at time (batch)
    • each batch is out of service while it is being updated
    • running both versions simultaneously
    • environment capacity reduced by the batch size (number of instances)
    • long deployment time
    • reduced performance (not ideal for performance-sensitive systems)
    • on failure, need an additional rolling update to roll back
    • no additional cost
  • Rolling with additional batch: update new instances
    • full availability
    • running both versions simultaneously
    • additional batch removed in the end
    • set batch size according to the desired speed of deployment
    • good for production
    • small additional cost
  • Immutable: update new instances in new ASG swapping traffic once healthy
    • zero downtime
    • longest deployment
    • great for production
    • on failure quick rollback
    • high cost (double instances)
  • support Blue/green deployment: create new “staging” environment (green)
    • green env validated independently (can rollback)
    • Route 53 can use weighted policies to redirect % of traffic to green
    • “swap URLs” when done
    • zero downtime
    • longest deployment
    • on failure: swap URLs back
    • extra cost
.ebextensions
  • folder containing config files (YAML or JSON ending with .config)
  • .ebextensions/<filename>.config
  • in the application source code root
  • allows:
    • package to install
    • Linux user and groups
    • shell commands
    • services (RDS, ElastiCache, DynamoDB, etc)
    • load balancer
  • can modify default settings (option_settings)
  • resources added by it are deleted if the environment is terminated
RDS
  • only for development environments
  • data is lost when the environment is terminated
  • for production create RDS database outside Elastic Beanstalk
  • migration from environment to standalone
    • snapshot db
    • enable deletion protection on db
    • create new environment without RDS
    • make the application point to the existing db
    • perform blue/green deployment
    • swap environments
    • terminate old environment (db won’t be deleted)
  • connection env properties: RDS_HOSTNAME, RDS_PORT, etc.
SSL certificate
  • provisioned using ACM or CLI
    • onto load balancer from Console
    • code (.ebextensions/securelistener-alb.config)
  • redirecting HTTP to HTTPS
    • by application (e.g. nginx)
    • by ALB rule
    • ensure health checks are not redirected

S3

Object storage; durable, highly available, infinitely scalable data storage, low cost.

Redundancy, HA, scalability
  • redundantly stored across multiple AZs
  • automatically scales to high request rates (can offload using CloudFront edge locations)
Limits
  • can only store files
  • alternatively, can use S3 Object Tagging across bucket and/or prefixes
  • max file size: 5TB
Eventual consistency
  • eventual consistency for overwrite PUTs and DELETEs
  • atomic updates: either get the new object or the old one (when reading updated object), never get partial or corrupt data
REST web services interface
Websites
  • hosting static websites
  • returns an HTML document
  • cannot serve server-side dynamic content such as PHP
  • automatically scales
  • can use custom domain with a Route 53 alias
  • bucket name must be the same as the domain
  • can enable redirection (object/bucket level)
  • URL: <bucketname>.s3-website.amazonaws.com
  • doesn’t support HTTPS/SSL
  • supports only the GET and HEAD S3 APIs
  • needs an index document (default web page) and optionally an error document
  • access control is only public
User policies
  • IAM policies (programmatic access or Console)
  • granting permissions for all S3 operations
  • managing permission for users in account
  • granting object permissions to user within the account
Buckets
  • flat, unlimited, region-specific container of objects
  • doesn’t provide hierarchy of objects, object key name (prefix) mimics folders
  • ownership not transferable, the account owner is the owner rather than the IAM user
  • names cannot be changed, they are part of the URL
  • can backup into another bucket in another account
  • all private by default
  • bucket sub-resources: configuration containers associated with a bucket
    • lifecycle rules
    • Versioning
    • ACLs
    • Bucket policies
    • CORS: allow requests to a different origin using Access-Control-Allow-Origin header
Bucket lifecycle rules
  • set of rules that define actions applied to a group of objects
  • Transition actions: when objects transition to another storage class
    • billing changes do not occur until the object has actually transitioned (if there is a delay)
  • Expiration actions: when objects expire
    • costs depend on when objects expire
    • won't be charged for storage after the expiration time (even if deletion is delayed)
  • use examples:
    • expiration of logs
    • documents frequently accessed only for a certain period of time
    • upload some data for archival purposes, retained for regulatory compliance
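  • a minimal boto3 sketch of the log-expiration example above (bucket name and prefix are placeholders): transition to Glacier after 30 days, expire after 365 days

    import boto3

    s3 = boto3.client("s3")

    s3.put_bucket_lifecycle_configuration(
        Bucket="my-bucket",
        LifecycleConfiguration={
            "Rules": [{
                "ID": "archive-then-expire-logs",
                "Filter": {"Prefix": "logs/"},
                "Status": "Enabled",
                "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}],
                "Expiration": {"Days": 365},
            }],
        },
    )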
Versioning
  • stores versions of an object protecting against accidental deletion or overwrites
  • on delete, a delete marker is placed on the object
    • can delete the marker to make the object available again
  • enables roll-back and undelete
  • can be used for archive
  • old versions are billable
  • objects existing before versioning is enabled are not versioned retroactively (they have a NULL version ID)
  • can integrate with lifecycle rules
  • can enable MFA for delete and changing versioning settings
  • reverting is not replicated
  • when enabled, only owner can permanently delete objects
  • cannot be disabled, only suspended
Bucket ACL (Access Control Lists)
  • pre-defined groups:
    • authenticated user group
    • all users group
    • Log Delivery Group: enable S3 to write server access logs (need WRITE permissions)
  • permissions:
    • READ: list the objects in the bucket
    • WRITE: create, overwrite, delete any object in the bucket (recommended only for S3 Log Delivery Group)
    • READ_ACP: read ACL
    • WRITE_ACP: write ACL
    • FULL_CONTROL: all above permissions
Bucket policies
  • use AWS Policy Generator to create resource-based access policies
  • can grant permission to the bucket and objects:
    • individual users
    • AWS accounts
    • everyone (public/anonymous)
    • all authenticated AWS users
  • allow/deny based on elements:
    • requester
    • S3 actions
    • resources
    • aspects or conditions of the request (e.g. IP address, headers)
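  • a minimal boto3 sketch of a bucket policy combining requester, action, resource and condition (bucket name and CIDR are placeholders):

    import json

    import boto3

    s3 = boto3.client("s3")

    policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "PublicReadFromOffice",
            "Effect": "Allow",
            "Principal": "*",
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::my-bucket/public/*",
            # condition: only requests coming from this IP range
            "Condition": {"IpAddress": {"aws:SourceIp": "203.0.113.0/24"}},
        }],
    }

    s3.put_bucket_policy(Bucket="my-bucket", Policy=json.dumps(policy))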
Objects
  • unique key (ID or name)
  • identified through service endpoint, bucket name, key and (optionally) version
  • metadata
  • tags: can use for IAM policies, lifecycle policies and customized metrics
  • object sub-resources: configuration containers associated with an object
    • ACLs
    • restore: restoring an archive
Object ACL (Access Control Lists)
  • by default grants resource owner full control
  • pre-defined groups:
    • authenticated user group: all AWS accounts
      • all requests must be signed
      • any authenticated user
    • all users group: anyone in the world
      • request can be unsigned (anonymous)
      • BEST PRACTICE: never grant WRITE or FULL_CONTROL
  • permissions:
    • READ: read object data and metadata
    • READ_ACP: read ACL
    • WRITE_ACP: write ACL
    • FULL_CONTROL: all above permissions
Storage classes
  • S3 Standard: durable, immediately available, frequently accessed
  • S3 Intelligent-Tiering: moves data to the most cost-effective tier
  • S3 Standard-IA: durable, immediately available, infrequently accessed
  • S3 One Zone-IA: lower cost, infrequently accessed data with less resilience; single AZ
  • S3 Glacier: archiving storage for infrequently accessed data
Glacier
  • must complete a retrieval job before getting output
  • cannot be set as storage class at creation of object
  • encrypts data at rest using AES256 symmetric keys
  • supports transfer over SSL
  • multiple AZs (resilient to one AZ destruction)
  • PUT API for direct upload, cannot use Console
  • S3 lifecycle management for automatic migration
  • object metadata is not archived
  • up to 40TB
  • can use multipart upload
  • synchronous upload and asynchronous download
  • content not editable, data not available for real time access
  • if there are lots of small objects, aggregating them into an archive is preferred
    • archives support a description but no metadata
  • each upload has a unique archive ID
  • retrieval:
    • Expedited: 1-5 minutes, expensive
    • Standard: 3-5 hours, cheaper, 10GB free per month
    • Bulk: 5-12 hours, cheapest, use for large quantity of data
  • can retrieve part of archive
  • can send SNS notification when retrieval job completed
  • can retrieve specific objects in archive specifying byte range in HTTP GET (need to maintain a DB of byte ranges)
  • no charges for data transfer between EC2 and Glacier in the same Region; charged if data is deleted within 90 days
  • when restoring, pay for the Glacier archive, the requests and the restored data in S3
  • storage tiers:
    • Instant Retrieval:
      • retrieval in milliseconds
      • same performance as S3 standard
    • Flexible Retrieval, ideal for backup, disaster recovery with large sets of data:
      • retrieval in minutes to hours (free bulk retrievals)
    • Deep Archive, lowest cost for long-term retention (7-10 years)
      • ideal alternative to magnetic tape libraries
      • retrieval within 12 hours
Server-side encryption
  • protects data at rest encrypting objects with a unique key
  • encrypt the key itself with regularly-rotated master key using AES-256
  • need a bucket policy to enforce encryption of all objects in a bucket
  • when uploading/downloading need kms:Decrypt along with kms:ReEncrypt, kms:GenerateDataKey and kms:DescribeKey
  • encryption options:
    • SSE-S3: S3 managed keys
      • header x-amz-server-side-encryption=AES256
    • SSE-KMS: CMKs (Customer Master Keys) stored in KMS
      • additional benefits and charges
      • provides audit trails on who/when CMK was used
      • can use CMK or can select own key
        • an envelope key protects custom keys
      • header x-amz-server-side-encryption=aws:kms
    • SSE-C: client provided keys
      • the client manages the keys; S3 manages encryption/decryption
      • keys are not stored by AWS
      • if key lost, data cannot be decrypted
      • provide key with following request headers:
        • x-amz-server-side-encryption-customer-algorithm: must be AES256
        • x-amz-server-side-encryption-customer-key: 256-bit, base64-encoded key
        • x-amz-server-side-encryption-customer-key-MD5: digest of the encryption key (RFC1321) used for message integrity check
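  • a minimal boto3 sketch of an SSE-KMS upload; S3 adds the x-amz-server-side-encryption=aws:kms header on your behalf (bucket, key and KMS key alias are placeholders):

    import boto3

    s3 = boto3.client("s3")

    s3.put_object(
        Bucket="my-bucket",
        Key="reports/2024.csv",
        Body=b"col1,col2\n1,2\n",
        ServerSideEncryption="aws:kms",
        SSEKMSKeyId="alias/my-app-key",   # omit to use the default aws/s3 key
    )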
Client-side encryption
  • encrypt before upload using local encryption process
  • using CMK:
    • send a request to KMS, which returns a plaintext version of the data key (used to encrypt the object) and a cipher blob of the data key to upload as S3 object metadata
    • when downloading, the encrypted object comes with the cipher blob of the data key
    • the client then sends the cipher blob to KMS to get the plaintext data key and decrypt the object
  • master key stored in application:
    • S3 encryption client generates a data key locally used to encrypt object (one for each object)
    • the client encrypts the data key with the provided master key and uploads it as metadata
    • the client uploads the encrypted data with the encrypted data key as metadata (x-amz-meta-x-amz-key)
    • when downloading, the material description metadata tells which master key to use to decrypt the downloaded data key, which in turn decrypts the object
Event notifications
  • in response to PUTs, POSTs, COPYs or DELETEs enables to run workflow, send alerts, perform actions
    • publishes notifications on new object creation, object removal, object restore, RRS (Reduced Redundancy Storage) object loss, and replication events
    • sends event notification messages to destinations: SNS, SQS, Lambda function invocation; needs the related permissions
CRR Cross Region Replication
  • automatically 1:1 replicate data across Regions
  • asynchronous copying of objects
  • configured at the bucket level; can define a destination bucket in a different Region (can use a different storage class)
  • versioning must be enabled on both source and destination buckets
  • can configure separate lifecycle rules
  • can replicate KMS-encrypted objects by providing a destination KMS key
  • can be cross account
  • can replicate all objects or a subset selected by key name prefix
  • data in transit is encrypted with SSL
  • the needed permissions to replicate must be granted
  • charged for both inter-Region transfer and upload requests
  • replication is triggered by object upload/delete/change (including metadata or ACL changes)
  • by default only objects encrypted with SSE-S3 are replicated
  • not replicated:
    • object existing before enabling replication
    • objects on which bucket owner has no permissions
    • bucket-level subresources
    • objects that are replicated from another source
    • objects encrypted with SSE-C or SSE-KMS
  • deleting by object version ID deletes object, delete marker is not replicated
SRR Same Region Replication
  • destination bucket within the same Region
    • can be different storage class
  • automatic and asynchronous
  • replicated object can be owned by different accounts
  • ACL and tags are replicated
Multipart Upload
  • for files larger than 100MB; uploads object in parts independently, in parallel or any order
  • can use for objects from 5MB to 5TB
  • mandatory for objects larger than 5GB
  • improves throughput
  • can be paused/resumed
  • can use Transfer Acceleration
Requester pays function
  • the requester pays (no anonymous access)
  • doesn't work with static website hosting
Pre-signed URLs
  • provide temporary access to specific object to those who don’t have AWS credentials
  • must configure expiration date and time
  • usable both for download/upload
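  • a minimal boto3 sketch of generating a pre-signed GET URL (bucket and key are placeholders):

    import boto3

    s3 = boto3.client("s3")

    url = s3.generate_presigned_url(
        ClientMethod="get_object",
        Params={"Bucket": "my-bucket", "Key": "private/report.pdf"},
        ExpiresIn=3600,   # seconds until the URL expires
    )
    print(url)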
Copy
  • copies objects up to 5GB in a single atomic operation
  • objects are immutable, so copy is used to:
    • generate copies
    • rename objects
    • change the copy's storage class or at-rest encryption
    • move objects across locations/Regions
    • change object metadata
Transfer Acceleration
  • fast, easy and secure transfers of files over long distances between client and S3 bucket
  • leverages CloudFront’s globally distributed Edge Locations
    • charged only when there is a speed benefit
    • URL: <bucketname>.s3-accelerate.amazonaws.com
CloudWatch metrics
  • storage metrics are enabled by default and reported once per day
  • can enable 1-minute CloudWatch request metrics
  • can call S3 PUT Bucket Metrics API to enable and configure S3 storage metrics
  • can monitor:
    • S3 requests
    • bucket storage
    • bucket size
    • all requests
    • HTTP 4xx/5xx errors
S3 Server Access Logging
  • records actions taken by users, roles and services, providing log records for auditing and compliance
  • provides detailed records of requests
  • bucket being logged must not be the destination of logs (logging loop)
  • can use in combination with CloudTrail
  • CloudTrail is recommended for logging bucket and object-level actions

VPC

Logically isolated section of the AWS Cloud where resources are launched in a virtual network.

  • Region-wide; each Region has a default VPC (up to 5 VPCs per Region by default)
Dedicated tenancy
  • ensure dedicated hardware for instances in VPC
IP address
  • IP address space from master address range
  • CIDR block between /28 and /16 netmask
  • the first 4 and the last IP address of each subnet are reserved
Subnet
  • segment of VPC’s addresses range (cannot overlap each other), can place groups of isolated resources
  • map 1:1 with AZ, cannot span zones
  • public: traffic is routed to an Internet Gateway
    • “Auto-assign public IPv4 addresses” is true
    • route table must have a route for 0.0.0.0/0 to the Internet Gateway
  • private: no connection to internet
  • VPN-only: traffic routed to a private gateway for VPN connection
Internet Gateway
  • horizontally scaled, redundant, highly available VPC side of a connection to internet, attach to VPC
  • a VPC can have at most 1 IGW
NAT Gateway
  • highly available, managed by AWS, NAT to make resources in private subnet access internet
    • must live on a public subnet
    • uses Elastic IP for public IP
    • multi-AZ redundancy (HA)
    • up to 5 Gbps bandwidth that can scale up to 45 Gbps
    • cannot use for VPC peering, VPN, Direct Connect (they need specific routes)
    • not associated with any security group
    • no need to disable source/destination checks
    • no port forwarding
    • cannot use as bastion host
    • no metrics
NAT Instance
  • managed by user
    • must live on a single public subnet
    • must disable the source/destination check on the instance
    • need to be assigned to security groups
    • amount of traffic supported based on instance type
    • can lead to a bottleneck (not HA), but can be made HA with an ASG, multiple subnets in different AZs and a script to automate failover
    • scale up (instance type or enhanced networking)
    • scale out by using multiple NATs in multiple subnets
    • can use as bastion (jump) host
    • monitor traffic metrics
    • not supported for IPv6
Hardware VPN Connection
  • hardware-based VPN connection between VPC and corporate data center (site-to-site)
    • or home network, co-location facility etc.
VPN Connection
  • Virtual Private Gateway: VPC side of the VPN connection
  • Customer Gateway: your side of VPN connection
Router
  • directs traffic between gateways, subnets, AZs etc.
  • can redirect to external destinations
  • each subnet can have only one route table
    • one route table can be assigned to many subnets
  • the default rule allows VPC subnets to communicate with one another
Peering Connection
  • route traffic via private IP addresses between two peered VPC
VPC Endpoints
  • private connectivity to services
    • without gateway, NAT, VPN, firewall etc.
Egress-only Internet Gateway
  • stateful gateway that provides egress-only access for IPv6 traffic from the VPC to the internet
    • prevent inbound access to IPv6 instances
    • must create a custom route for ::/0
    • to be preferred to NAT Gateway for IPv6
Security groups
  • act like a firewall at instance level (network interface level)
  • second line of defense
  • implicit deny at the end; supports allow rules only
  • stateful
  • all outbound traffic allowed by default
  • cannot delete default SG of a VPC
  • a security group can be referenced as the source in rules of other security groups and in its own inbound rules
  • members of SG can be in any AZ or subnet of the VPC
  • changes take effect immediately
  • cannot block specific IP addresses; use NACLs for that (an allow rule is sketched below)
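  • a minimal boto3 sketch of adding an inbound allow rule restricted to a CIDR (group ID and CIDR are placeholders):

    import boto3

    ec2 = boto3.client("ec2")

    ec2.authorize_security_group_ingress(
        GroupId="sg-0123456789abcdef0",
        IpPermissions=[{
            "IpProtocol": "tcp",
            "FromPort": 22,
            "ToPort": 22,
            "IpRanges": [{"CidrIp": "203.0.113.0/24", "Description": "office SSH"}],
        }],
    )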
Network ACLs
  • act like firewall at subnet level (subnets must be associated with a NACL)
  • first line of defense
  • supports allow and deny rules, evaluated in order from the lowest rule number; ends with a catch-all deny
  • stateless, responses are subject to rules for direction of traffic
  • not apply to traffic within subnet (do not filter traffic between instances in same subnet)
  • custom NACL deny all traffic by default
  • can block specific IP addresses or ranges
  • software firewall installed on instances is recommended
VPC Flow Logs
  • diagnosing overly restrictive security group rules
  • monitoring traffic that is reaching instance
  • determining direction of traffic to and from the network interfaces
S3 Object Ownership
  • bucket-level setting to control ownership of objects uploaded to bucket and enable/disable ACL
  • when ACL disabled bucket owner owns all objects in the bucket
    • manage access exclusively using access management policies
  • BEST PRACTICE: keep ACLs disabled except in unusual circumstances where access must be controlled per object; with ACLs disabled, policies control access to every object in the bucket regardless of who uploaded it
  • Object Ownership has three settings:
    • Bucket owner enforced (default, ACLs disabled): the bucket owner automatically owns and has full control over every object; ACLs no longer affect permissions; the bucket uses policies to define access control
    • Bucket owner preferred (ACLs enabled): the bucket owner owns and has full control over new objects that other accounts write with the bucket-owner-full-control canned ACL
    • Object writer (ACLs enabled): the account that uploads an object owns it, has full control over it, and can grant other users access through ACLs

CloudFront

Web service that distributes content with low latency and high data transfer speeds; distribution of frequently accessed static content (popular images, videos, media files, software downloads).

  • can also be used for dynamic, streaming and interactive content
  • ingress to upload, egress to distribute
PCI DSS
  • PCI DSS compliant
  • BEST PRACTICE: not to cache credit card information at edge locations
HA
  • HIPAA compliant
  • DDoS protection (distributes traffic across multiple locations)
Support
  • support Perfect Forward Secrecy
    • new private key for each SSL session
  • support wildcard SSL certificates
  • support Dedicated IP
  • support custom SSL and SNI Custom SSL (cheaper)
Edge Location
  • where content is cached
  • requests are automatically routed to the nearest edge location
  • not tied to AZs or Regions
  • PUT/POST/PATCH/OPTIONS/DELETE proxy methods and dynamic content go directly to web origin from edge location (not passing for Regional Edge caches)
  • can write content
  • can upload file
Regional Edge Caches
  • between origin web servers and global edge locations
  • have larger cache-width than any individual edge location
  • longer duration of caching
  • aim to get closer to user
Origins
  • origin of the files that CDN will distribute
    • S3 bucket
      • static website
    • EC2 instance
      • use an AMI that installs the web server software
      • use an ELB in front and specify its URL as the domain name of the origin server
    • ELB
    • Route 53
    • external non-AWS (must specify DNS name and ports)
Distributions
  • CDN configuration:
    • content origins
    • access (public/restricted)
    • security (HTTP/HTTPS)
    • cookie or query-string forwarding
    • geo-restrictions (restrict file access at country level, can use 3rd party geo-location service for finer granularity)
    • access logs
    • must be disabled before delete
  • Web Distribution:
    • static and dynamic content (html, css, php, graphics)
    • distributed over HTTP/HTTPS
    • add/update/delete objects
    • submit forms
    • live streaming (real time event)
  • RTMP: streaming media files using Adobe Flash Media Server's RTMP protocol
    • allows consuming the media file before the download has finished
    • file must be stored in S3
  • need both distributions to serve real time streaming with media player
Expiration
  • objects are cached for 24 hours by default
  • expiration time controlled through TTL
High Availability with Origin Failover
  • uses an origin group, when primary origin fails automatically switches to second origin
  • works with Lambda@Edge functions
Encryption
  • in-transit encryption with ACM
    • ensure request/import certificate in Region us-east-1
Signed URLs
  • additional information (e.g. expiration date)
  • key pair
    • the private key signs a portion of the URL
    • the public key is held by CloudFront
  • use when:
    • restrict access to individual files
    • client doesn’t support cookies
  • need a signer
    • can be a trusted key group in CloudFront (recommended)
    • can be an account that contains key pair
      • only root can generate CloudFront key pair
      • when using root can only have up to 2 active key pairs per account
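  • a minimal sketch of generating a signed URL with botocore's CloudFrontSigner (key ID, key path and distribution domain are placeholders; assumes the cryptography package is installed):

    from datetime import datetime, timedelta, timezone

    from botocore.signers import CloudFrontSigner
    from cryptography.hazmat.primitives import hashes, serialization
    from cryptography.hazmat.primitives.asymmetric import padding

    def rsa_signer(message):
        # load the private key of the CloudFront key pair / trusted key group
        with open("private_key.pem", "rb") as f:
            key = serialization.load_pem_private_key(f.read(), password=None)
        return key.sign(message, padding.PKCS1v15(), hashes.SHA1())

    signer = CloudFrontSigner("KXXXXXXXXXXXXX", rsa_signer)   # placeholder public key ID
    url = signer.generate_presigned_url(
        "https://d111111abcdef8.cloudfront.net/private/video.mp4",
        date_less_than=datetime.now(timezone.utc) + timedelta(hours=1),
    )
    print(url)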
Signed Cookies
  • allow control who can access content
  • provide access to multiple restricted files
    • e.g. files in subscriber area
  • application must authenticate and send three Set-Cookie headers
  • use when:
    • multiple restricted files
    • don’t want to change current URL
Origin Access Identity (OAI)
  • used in combination with signed URLs and signed cookies to restrict direct access to the S3 bucket
  • prevent bypassing CloudFront
  • consists of a special CloudFront user associated with the distribution; the S3 bucket restricts access to that user
    • if users request files directly from S3, they are denied
WAF
  • web app firewall that monitors HTTP/S requests to control access to content
  • can be based on conditions in a web ACL associated with distribution (e.g. Origin IP address, values in query strings)
  • can also deliver custom error page (403)
Domain names
  • <name>.cloudfront.net
  • can add domain name using Route 53 alias
  • for other service providers can use CNAME
    • support wildcard CNAME
  • can move subdomains
    • need AWS support for root domain
Charges
  • can use reserved capacity
    • 12 months
    • 10TB of data transfer in single region
  • pay for:
    • data transfer out to internet
    • data transfer out to origin
    • number of requests
    • invalidation requests
    • dedicated IP custom SSL
    • field level encryption requests
  • don’t pay for:
    • transfer between regions and CloudFront
    • Regional edge cache
    • ACM SSL/TLS certificates
    • shared CloudFront certificates
Monitoring and auditing
  • distributions can deliver access logs and cookie logs (of the requests made) to an S3 bucket
  • can analyze access logs with Athena
  • integrated with CloudTrail
    • which requests
    • source IP address
    • etc.

Route 53

Highly available Domain Name System (DNS) service. Offers domain name registry, DNS resolution, health checking of resources.

DNS service
  • worldwide distributed DNS service, located alongside all edge locations
  • can use to route Internet traffic for domain registered with another domain registrar
  • uses UDP port 53
  • 100% availability SLA
  • max domain names: 50
    • can increase by contacting support
  • Private DNS: authoritative DNS within VPC without exposing DNS records
Domain
  • can transfer domains to Route 53 only if TLD is supported
  • transfer domain to another registrar needs AWS support
  • pay for domain names
Hosted zones
  • collections of records for a specific domain (DNS zone file)
  • can be public or private for VPC
    • private needs the enableDnsHostnames and enableDnsSupport properties
    • private needs DHCP options set
      • cannot automatically register EC2 instances, need scripting
  • automatically creates NS and SOA records
    • 4 unique name servers
    • NS specified by FQDN
  • health checks pointed at endpoints
    • IP addresses or domain names
    • can check status of their health checks
    • can check status of CloudWatch alarm
    • additional charging
      • different prices for AWS vs non-AWS endpoints
  • pay per hosted zone per month
Records
  • A, AAAA for IPv6, MX, NS, SOA, TXT, Alias
Alias
  • map resource in hosted zone
  • works like a CNAME: points to the DNS name of the service
  • can be:
    • ELB
    • CloudFront distribution
    • Beanstalk environment
    • S3 bucket as website
    • another record of the hosted zone
  • alias record and its target must exist in Route 53
  • can map custom domain names (e.g. api.example.com) to:
    • API Gateway custom regional APIs
    • API Gateway edge-optimized API
    • VPC interface endpoints
  • can map one DNS name to another target DNS name
  • can be used for resolving apex/naked domain names
  • use it when possible
  • supports wildcard entries except NS records
  • no charge for alias queries (unlike CNAME records)
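  • a minimal boto3 sketch of an alias A record pointing the zone apex at an ALB (zone IDs, domain and ALB DNS name are placeholders):

    import boto3

    route53 = boto3.client("route53")

    route53.change_resource_record_sets(
        HostedZoneId="Z0123456789EXAMPLE",                    # placeholder hosted zone
        ChangeBatch={
            "Changes": [{
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": "example.com",
                    "Type": "A",
                    "AliasTarget": {
                        # placeholder: the canonical hosted-zone ID of the ALB's Region
                        "HostedZoneId": "ZXXXXXXXXXXXXX",
                        "DNSName": "my-alb-123456789.us-east-1.elb.amazonaws.com",
                        "EvaluateTargetHealth": False,
                    },
                },
            }],
        },
    )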
Routing policies
  • determine how Route 53 responds to queries
  • Simple Routing Policy
    • A record associated with one or more IP addresses
    • uses round robin
    • doesn't support health checks
  • Failover Routing Policy
    • fails over to a secondary IP address associated with a health check
      • routes to the secondary only when the primary resource is unhealthy
    • can be used with ELB
  • Geo-location Routing Policy
    • used for localizing content and presenting in the language of users
    • protects distribution rights
    • can be used for spread load evenly between Regions
    • if overlapping will route to the smallest geographic region
    • can create a default IP that doesn’t map geographic location
    • higher pricing
  • Geo-proximity Routing Policy
    • routing traffic based on the location of resources
    • shift traffic from resources in one location to resources in another
    • requires Traffic Flow
    • higher pricing
  • Latency Routing Policy
    • database of latency from different parts of the world
    • improve performance by routing to Region with lowest latency
    • can create latency records for resource in multiple EC2 locations
    • queries are more expensive
  • Multi-value Answer Routing Policy
    • responds to DNS queries with up to 8 healthy records selected at random
  • Weighted Routing Policy
    • like Simple but can specify a weight per IP address
      • each record has a relative weight
Traffic flow
  • provide Global Traffic Management (GTM)
  • create routing configurations for resources using routing types such as failover and geolocation
  • create policies that route traffic based on specific constraints:
    • latency, endpoint health, load, geo-proximity, geography
  • versioning feature allows maintaining a history of changes to routing policies
    • easily roll back to a previous policy version
  • additional charges
Route 53 Resolver
  • bi-directional querying between on-premises and AWS over private connections
  • used for enabling DNS resolution for hybrid clouds
  • Resolver Endpoints (inbound): inbound query capability
    • allow DNS queries that originate on-premises to resolve domains hosted in Route 53
    • connectivity between on-premises DNS infrastructure and AWS through Direct Connect (DX) or VPN
  • conditional forwarding rules (outbound): outbound DNS queries
    • a rule triggers when a query is made for one of the configured domains
    • the query is then forwarded to your on-premises DNS servers
    • requires DX or VPN

API Gateway

Fully managed service to publish, maintain, monitor and secure APIs at any scale

  • can scale to any level of traffic received by an API
Endpoints
  • hostname for an API deployed to a specific Region
    • <api-id>.execute-api.<region>.amazonaws.com
  • Edge-optimized Endpoint (default) around the world
    • ideal for geographically distributed clients
    • requests are routed to nearest CloudFront Point of Presence (POP)
    • capitalizes names of HTTP headers (e.g. Cookie)
    • any custom domain name applies across all Regions
  • Regional Endpoint same region
    • intended for clients in same Region
    • reduces connection overhead when a small number of clients have high demand or clients run on EC2 instances in the same Region
    • custom domain with Route 53 can perform tasks such as latency-based routing
    • passes header names through as-is
    • custom domain name is specific to the Region
    • the same custom domain name can be reused in each Region when deploying to multiple Regions
  • Private Endpoint same VPC
    • only accessed from VPC
    • use an interface VPC Endpoint (ENI)
    • passes header names through as-is
API
  • collection of HTTP resources and methods integrated with backend
    • HTTP endpoints
    • Lambda functions
    • other services
  • organized in a resource tree according to application logic
  • API Gateway WebSocketAPI: collection of WebSocket routes integrated with backend
    • HTTP endpoints
    • Lambda functions
    • other services
    • API method invoked through front-end WebSocket connections associated with a registered custom domain name
  • methods: HTTP methods associated with an API resource
    • each resource URL can be GET, PUT, POST, DELETE and ANY (catch-all)
  • pay when your APIs are in use
    • no minimum fees or upfront commitments
  • pay only for received calls and the amount of data transferred out
    • not for Private APIs
API throttling
  • maximum concurrent requests within an account: 5000
  • maximum requests per second: 10,000
  • if over maximum then 429 (Too Many Requests) is returned
  • client can resubmit the failed requests in a rate-limiting way
  • Server-side throttling limits: applied across all clients
    • prevents the API from being overwhelmed by too many requests
  • Per-client throttling limits: apply to clients that use an API key associated with a usage plan as the client identifier
  • applies settings in the following order:
    • per-client, per-method throttling limits set for an API stage in a usage plan
    • per-client throttling set in a usage plan
    • default per-method limits (individual) in stage settings
    • account-level throttling
Usage plans
  • specifies who can access API stages and methods, how much and how fast
  • can configure throttling and quota limits enforced on individual client API keys
  • can use together with Lambda authorizers to control access to APIs
  • can generate API keys or import from external source (as CSV file)
Deployments
  • snapshot of API resources and methods
  • must be associated with a stage for anyone to access API
  • stage: logical reference to a lifecycle state of REST/WebSocket API
    • e.g. dev, beta, prod, v2
    • identified by API ID and stage name
    • stage variables: environment-variable-like key-value pairs for a stage
      • ARN, HTTP endpoint, parameter mapping templates
      • can use to configure HTTP endpoint for stage
      • can use to configure parameters passed to Lambda through mapping templates
        • passed to the “context” object of Lambda
        • used with Lambda aliases
Mapping templates
  • map/modify request/response parameters, body content, headers
  • can map JSON to XML
  • uses VTL (Velocity Template Language)
  • filter output results
  • can be specified as Integration request/response
  • data is referenced at runtime as context and stage variables
Integration and Method
  • Integration request: internal interface of API
    • map body and parameters of request to the formats required by backend
  • Integration response: internal interface of API
    • map status codes, headers, payload from the backend to the response format
  • Method request: public interface of API
    • defines parameters and body that client must send in request to access backend through API
  • Method response: public interface of API
    • defines status codes, headers and body models that client should expect in response from API
Integration type
  • AWS Integration: exposes AWS service actions
    • must configure Integration request and response setting up mappings
  • AWS_PROXY Integration: Lambda function invocation action (flexible, versatile and streamlined integration setup)
    • direct interaction between client and Lambda function (Lambda proxy integration) so Integration request/response are not needed
      • the incoming request from the client is the input of the function
    • preferred integration type to call Lambda function
  • HTTP Integration: exposes HTTP endpoints (custom integration)
    • must configure Integration request and response setting up data mappings
  • HTTP_PROXY Integration: HTTP endpoints invocation (flexible, versatile and streamlined integration setup)
    • Integration request/response are not needed
  • MOCK Integration: returns a response without sending the request further
    • useful for testing purposes, simulations
    • enables collaborative development: work against another team’s API without waiting for it to be complete
    • can return CORS-related headers to permit CORS access
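  • example sketch of the AWS_PROXY contract (the resource path and field names are illustrative): with Lambda proxy integration the whole request event is passed to the function, which must return statusCode/headers/body itself
      import json

      def handler(event, context):
          # event carries the HTTP method, path, headers, query string and body from API Gateway
          name = (event.get("queryStringParameters") or {}).get("name", "world")
          return {
              "statusCode": 200,
              "headers": {"Content-Type": "application/json"},
              "body": json.dumps({"message": f"hello {name}"}),
          }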
Caching
  • can add cache specifying size in GB
  • allows caching of endpoint responses
  • reduce number of calls to backend (improve latency)
  • TTL (default 300 seconds)
  • defined per stage
    • caches are provisioned for a specific stage
  • can encrypt caches
  • capacity from 0.5 GB to 237 GB
  • can override cache settings for specific methods
  • can flush entire cache immediately
  • clients can invalidate with Cache-Control: max-age=0 header
  • data caching charged at hourly rate
    • varies based on cache size
Security
  • Same Origin Policy: prevents cross-site scripting attacks
  • Cross-Origin Resource Sharing (CORS): allows restricted resources (e.g. fonts) to be requested from a domain outside the one that served the page
    • can enable if using JavaScript / AJAX
  • IAM Resource-Based Policies: JSON policy document to attach to an API to control invoke permission
    • users from specified account
    • source IP address ranges / CIDR blocks
    • VPCs or VPC endpoints
    • usable for every type of endpoint
  • IAM Identity-Based Policies:
    • verifies IAM permissions passed by the caller
    • leverages sigv4 capability where IAM credentials are passed in headers
    • handles authentication and authorization
    • great for user/roles within account
  • Lambda Authorizer: uses Lambda to validate token in header
    • can cache result of authentication
    • need to implement a function
    • user controls authentication process
      • with Cognito, by contrast, the authentication process is managed by AWS
    • Lambda must return an IAM policy for the user
    • handle authentication and authorization
    • good for OAuth, SAML, 3rd party auth
    • pay per Lambda invocation
  • Cognito User Pools and Identity Pools (see related paragraphs)
    • can deliver temporary, limited-privilege credentials
    • support unauthenticated users
    • identities are not credentials
    • identities are exchanged for credentials using STS
  • SDK Generation of iOS, Android and Javascript
  • reduced latency and DDoS protection using CloudFront
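  • example sketch of a token-based Lambda authorizer (the token check is a hypothetical placeholder): the function must return a principal ID plus an IAM policy allowing or denying execute-api:Invoke on the called method
      def handler(event, context):
          token = event.get("authorizationToken", "")
          effect = "Allow" if token == "expected-token" else "Deny"   # replace with real validation (JWT, OAuth, ...)
          return {
              "principalId": "user|anonymous",
              "policyDocument": {
                  "Version": "2012-10-17",
                  "Statement": [{
                      "Action": "execute-api:Invoke",
                      "Effect": effect,
                      "Resource": event["methodArn"],
                  }],
              },
          }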
Logging and monitoring
  • log API calls, latency, error rates to CloudWatch
  • can monitor through the API Gateway dashboard (visually)
  • meter utilization by third-party developers
  • integrated with CloudTrail: full auditable history of changes to API
  • important metrics:
    • IntegrationLatency: measure backend responsiveness
    • Latency: measure overall responsiveness
    • CacheHitCount and CacheMissCount: optimize cache capacities to achieve a desired performance
Open API / Swagger
  • can import/export Swagger / Open API 3.0 definitions (YAML or JSON): API definition as code
  • can import through POST request
    • with Swagger definition in the payload
  • can update existing through PUT request
  • can use the mode query parameter in the request URL to specify update options

RDS

Online Transaction Processing (OLTP) type of database.

Maintenance window
  • maintenance window to allow DB instances modifications to take place (scaling, software patching, etc)
    • can be defined or AWS will schedule a 30-minute window
Encryption
  • encrypt instances and snapshot at rest using KMS
  • also encrypt backups and Read Replicas
  • cannot encrypt an existing DB: create a snapshot, copy it, encrypt the copy, then restore an encrypted DB from the encrypted snapshot (see the sketch below)
  • if the master and a Read Replica are in different Regions, encrypt using the KMS key of the destination Region
  • support SSL encryption between applications and DB instances
  • an SSL certificate is generated for each instance
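  • sketch of the snapshot-copy-restore flow with boto3 (identifiers and the KMS alias are hypothetical; each step must complete before the next starts)
      import boto3

      rds = boto3.client("rds")

      # 1. snapshot the existing unencrypted instance
      rds.create_db_snapshot(DBInstanceIdentifier="mydb",
                             DBSnapshotIdentifier="mydb-snap")

      # 2. copy the snapshot, encrypting the copy with a KMS key
      rds.copy_db_snapshot(SourceDBSnapshotIdentifier="mydb-snap",
                           TargetDBSnapshotIdentifier="mydb-snap-encrypted",
                           KmsKeyId="alias/aws/rds")

      # 3. restore a new, encrypted instance from the encrypted copy
      rds.restore_db_instance_from_db_snapshot(
          DBInstanceIdentifier="mydb-encrypted",
          DBSnapshotIdentifier="mydb-snap-encrypted")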
DB Subnet Groups
  • collection of subnets (private) designated for DB instances
  • should have subnets in at least 2 AZs
  • recommended to configure group with subnets in each AZ
Scalability
  • can only scale up:
    • compute (not MS SQL, need to recreate instance from snapshot)
    • storage (cannot decrease allocated storage)
  • scaling compute causes downtime
    • can be immediate or within maintenance window
  • max DB size: 64TB
  • max MS SQL DB size: 16TB
Storage type
  • use EBS volumes for DB and log storage
  • General Purpose (SSD) - gp2:
    • moderate I/O requirement
    • cost effective
    • 3 IOPS/GB
    • burst up to 3000 IOPS
  • Provisioned IOPS (SSD):
    • use for I/O intensive workloads
    • low latency and consistent I/O
  • Magnetic: not recommended anymore
Multi-AZ
  • replica in another AZ (cannot choose): standby DB
  • recommends Provisioned IOPS
  • failover on:
    • loss of primary AZ
    • loss of network connectivity on primary
    • compute unit failure on primary
    • storage unit failure on primary
    • the primary DB instance class is changed
    • patching of the OS on the primary DB instance
    • manual failover (reboot selecting the option to failover)
  • use the second node on failover
  • can take 1 to a few minutes
  • application must have connection retries and use endpoint rather than IP address
  • alert on failover
  • standby DB cannot be used as read node
  • snapshots and automatic backups are performed on the standby
  • Read Replica support Multi-AZ:
    • combining them build a resilient disaster recovery strategy
    • simplify database updates
    • Read Replica in a different Region can be used as standby and promoted to new production database in case of regional disruption
    • allow to scale reads whilst having multi-AZ
  • in Multi-AZ deployments, version upgrades cause an outage
    • both primary and standby are upgraded at the same time
  • ensure security groups and NACLs allow application to communicate with both primary and standby
Read Replicas
  • used for read-heavy DBs and replication is asynchronous
    • workload sharing and offloading
  • read-only
  • created from a snapshot of the master instance
  • automated backups must be enabled on the primary
  • asynchronous replication to update the read replica whenever there is a change to the source DB instance
  • cannot enable automated backups on PostgreSQL read replicas
  • up to 5 read replicas of a production DB
  • cannot have more than four instances in replication chain
  • can have read replicas of read replicas for MySQL and MariaDB (not PostgreSQL)
  • can specify the AZ of read replica
  • storage type of replicas can be different
  • replica compute should be at least as performant as the source
  • in multi-AZ failover read replicas are switched to the new primary
  • must be explicitly deleted
  • if only source DB is deleted, replica becomes standalone single AZ instance
  • can promote replica to primary (takes several minutes)
  • promoted replicas retain backup retention window, backup window and DB parameter group
  • each replica has its own DNS endpoint
  • can create replicas of multi-AZ source
  • can be in another Region
Snapshots
  • enable backup and restore of DB instances to a known state as frequently as wanted
  • cannot be used for point-in-time recovery
  • stored in S3
  • remain until manually deleted
  • backup taken within defined window
  • I/O is suspended briefly while backups are taken
    • may increase latency on single-AZ deployments
  • restored DB is always a new instance with new endpoint
  • automated backups allow point-in-time restore up to the last 5 minutes
  • only default DB parameters and security groups are restored
  • BEST PRACTICE: take final snapshot before deleting an RDS instance
  • snapshots can be shared across accounts
Pricing
  • DB instance/hours
  • storage GB/month
  • I/O requests/month (for magnetic storage)
  • provisioned IOPS/month (for RDS provisioned IOPS SSD)
  • egress data transfer
  • backup storage (free up to the provisioned EBS volume size)
  • multi-AZ charges for:
    • Multi-AZ DB/hours
    • provisioned storage
    • double write I/Os
  • data transfer during replication from primary to standby IS NOT charged
  • Oracle and Microsoft SQL Licenses included
  • can use on-demand and reserved instance pricing
    • reserved instances (not changeable) based on
      • DB engine
      • DB instance class
      • deployment type (standalone/multi-AZ)
      • license model
      • Region
    • reserved instances can be moved between AZs in the same Region
    • available for multi-AZ
    • scaling is achieved through changing instance class and modifying storage capacity (additional storage allocation)

DynamoDB

Fully managed NoSQL database service. Stores three geographically distributed replicas to enable high availability and data durability.

  • not ideal for traditional RDBMS-style apps (joins or complex transactions), BLOB data, or large data with a low I/O rate
Storage
  • ideal for session data storage
  • BEST PRACTICE: keep item size small
  • BEST PRACTICE: compress larger attribute values
Authentication and access control
  • managed by IAM
  • identity-based policies
    • permissions policy to user or group
    • permissions policy to a role
  • doesn’t support resource-based policies
  • can use special IAM condition to restrict user access to their own records
Security
  • VPC endpoints
  • encryption at rest
  • encryption in transit (SSL/TLS)
Integrations
  • ElastiCache can be used in front of DynamoDB
  • triggers integrate with Lambda
  • RedShift: advanced business intelligence, can perform complex data analysis queries including joins
  • Apache Hive on EMR: allow querying using SQL-like language (HiveQL)
    • copy data from table to S3 bucket (vice versa)
    • copy data from table into HDFS (vice versa)
    • perform JOIN operations on tables
  • BEST PRACTICE: store objects larger than 400KB in S3
    • use pointers (S3 Object ID)
TTL
  • automatically delete items after expiry date/time
    • items marked for deletion
  • allows to remove irrelevant or old data
    • session data
    • event logs
    • temporary
  • reduce storage and manage table size over time
  • enabled per table by designating a TTL attribute that holds each item’s expiry timestamp
  • deletes expired items within 48 hours of expiration
  • deletes items in LSI / GSI
  • no extra cost nor capacity use
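  • example sketch (boto3; table and attribute names are hypothetical): enabling TTL and writing an item that expires in one hour
      import time
      import boto3

      ddb = boto3.client("dynamodb")

      # designate the attribute that holds the expiry timestamp (epoch seconds)
      ddb.update_time_to_live(
          TableName="Sessions",
          TimeToLiveSpecification={"Enabled": True, "AttributeName": "expires_at"},
      )

      # items carrying the attribute are deleted some time after the timestamp passes
      ddb.put_item(
          TableName="Sessions",
          Item={"session_id": {"S": "s-123"},
                "expires_at": {"N": str(int(time.time()) + 3600)}},
      )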
Exponential Backoff
  • on network errors retries, with progressively longer waits between retries for improved flow control
  • if retries still fail after about 1 minute, the request rate may be exceeding the provisioned throughput
    • consider offloading using DAX or ElastiCache
    • consider increasing the WCUs
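  • minimal backoff sketch (plain Python; the AWS SDKs already implement retries like this automatically): retry a throttled call with exponentially growing, jittered waits
      import random
      import time

      def call_with_backoff(operation, max_attempts=5):
          for attempt in range(max_attempts):
              try:
                  return operation()          # e.g. a DynamoDB read/write call
              except Exception:               # e.g. ProvisionedThroughputExceededException
                  if attempt == max_attempts - 1:
                      raise                   # give up after the last attempt
                  # wait 1, 2, 4, 8... seconds plus random jitter, capped at 20 s
                  time.sleep(min(2 ** attempt, 20) + random.random())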
Optimistic locking
  • strategy to prevent writes from being overwritten by the writes of others
Table
  • items: size of items cannot exceed 400KB
    • attributes
  • BEST PRACTICE: when storing time-series data use separate tables per day, week or month
  • BEST PRACTICE: store more-frequently and less-frequently accessed data in separate tables
  • BEST PRACTICE: design tables in a way that can use Query, Get, BatchGetItem
Primary keys
  • partition key: unique, input to an internal hash function that determines the partition or physical location
    • best practices:
      • use high-cardinality attributes
      • use composite attributes
      • cache popular items (use DAX for caching reads)
      • add a random suffix from a predetermined range to the partition key for write-heavy use cases (write sharding)
  • composite key: partition key + sort key in combination
    • two items may have same partition key but different sort key
    • items in same partition are sorted according to the sort key
Index
  • data structure that allows fast queries on specific attributes of a table
  • run search on the index instead of the entire dataset
  • Local Secondary Index (LSI): provides an alternative range key for table, local to the hash key
    • can have up to 5 LSI per table
    • must be a scalar String, Number or Binary
    • must be created at table creation time
    • same partition key as original table (different sort key)
    • queries based on this sort key are much faster
    • can query on additional values other than partition key / sort key
  • Global Secondary Index (GSI): speed up queries on non-key attributes
    • can create at any time
    • different partition key and different sort key
    • different view of data
    • speeds up queries relating to this alternative partition and sort key
    • is effectively a new “table” onto which attributes are projected
      • partition key and sort key of original table are always projected (KEYS_ONLY)
      • can specify extra attributes to project (INCLUDE)
      • can use all attributes from main table (ALL)
    • must define RCU/WCU for the index: has to be at least the same or more as in main table to avoid throttling on main table
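  • example sketch (boto3; table, index and attribute names are hypothetical): querying a GSI through its alternative partition/sort key
      import boto3
      from boto3.dynamodb.conditions import Key

      table = boto3.resource("dynamodb").Table("Orders")

      # the GSI "status-date-index" uses status as partition key and order_date as sort key
      resp = table.query(
          IndexName="status-date-index",
          KeyConditionExpression=Key("status").eq("SHIPPED")
                                 & Key("order_date").begins_with("2024-01"),
      )
      items = resp["Items"]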
Transaction
  • make coordinated, all-or-nothing changes to multiple items
  • provide atomicity, consistency, isolation and durability (ACID)
  • can check pre-requisite conditions before writing to a table
  • write API can group multiple Put, Update, Delete and ConditionCheck actions
  • submit the actions as a single TransactWriteItems that succeeds or fails as a unit
  • performs two underlying reads or writes: one to prepare the transaction and one to commit
    • visible in CloudWatch metrics
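  • example sketch (boto3; table and attribute names are hypothetical): an all-or-nothing transfer between two items using TransactWriteItems
      import boto3

      ddb = boto3.client("dynamodb")

      ddb.transact_write_items(TransactItems=[
          {"Update": {                                  # debit account A only if it has enough balance
              "TableName": "Accounts",
              "Key": {"account_id": {"S": "A"}},
              "UpdateExpression": "SET balance = balance - :amt",
              "ConditionExpression": "balance >= :amt",
              "ExpressionAttributeValues": {":amt": {"N": "100"}},
          }},
          {"Update": {                                  # credit account B; both updates succeed or neither does
              "TableName": "Accounts",
              "Key": {"account_id": {"S": "B"}},
              "UpdateExpression": "SET balance = balance + :amt",
              "ExpressionAttributeValues": {":amt": {"N": "100"}},
          }},
      ])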
Scan
  • return one or more items by accessing every item in a table or a secondary index
  • returns max 1 MB of data per call
  • use a lot of RCUs
  • can use ProjectionExpression to only return some attributes
  • can provide a filter expression to refine results
  • scan operations proceed sequentially
  • can request a parallel Scan for faster performance
    • need to provide Segment and TotalSegments parameters
    • can configure a parallel scan by dividing a table/index into segments and scanning each segment in parallel
    • BEST PRACTICE: avoid parallel scans if the table/index is already incurring heavy read/write activity from other operations
  • eventually consistent reads
  • for consistent copy of data set ConsistentRead parameter
  • could use up the provisioned throughput in just a single operation (if large table)
  • BEST PRACTICE: avoid scan if possible
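  • parallel scan sketch (boto3; table name and segment count are hypothetical): each worker scans its own segment and follows LastEvaluatedKey to paginate
      import boto3
      from concurrent.futures import ThreadPoolExecutor

      ddb = boto3.client("dynamodb")
      TOTAL_SEGMENTS = 4

      def scan_segment(segment):
          items, start_key = [], None
          while True:
              kwargs = {"TableName": "Events",
                        "Segment": segment, "TotalSegments": TOTAL_SEGMENTS}
              if start_key:
                  kwargs["ExclusiveStartKey"] = start_key   # continue from the previous page
              page = ddb.scan(**kwargs)
              items.extend(page["Items"])
              start_key = page.get("LastEvaluatedKey")
              if not start_key:
                  return items

      with ThreadPoolExecutor(TOTAL_SEGMENTS) as pool:
          all_items = [item for part in pool.map(scan_segment, range(TOTAL_SEGMENTS)) for item in part]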
Query
  • finds items in table based on primary key and a distinct value to search for
  • can use ProjectionExpression to only return some attributes
  • eventually consistent reads
  • for consistent copy of data set ConsistentRead parameter
  • more efficient than scan, doesn’t deteriorate on growing tables
Pages
  • smaller page size reduces impact of a query or scan
Stream
  • captures a time-ordered sequence of item-level modifications in any table and stores the information in an encrypted log for up to 24 hours
  • enable or modify stream with CreateTable and UpdateTable
  • log can be accessed using a dedicated endpoint, near-real time
  • by default records only Primary key
  • can be an event source for Lambda which can, for example, write to CloudWatch Logs
  • see how the stream is configured: StreamSpecification parameter
    • StreamEnabled
    • StreamViewType: information that will be written, can be:
      • KEYS_ONLY (only key attributes)
      • NEW_IMAGE (entire item after modification)
      • OLD_IMAGE (entire item before modification)
      • NEW_AND_OLD_IMAGES (both new and old item)
  • stream read request unit: each GetRecords API call to Streams
    • return up to 1MB of data
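  • example sketch (boto3; table name is hypothetical): enabling a stream that records both old and new item images
      import boto3

      ddb = boto3.client("dynamodb")

      ddb.update_table(
          TableName="Orders",
          StreamSpecification={"StreamEnabled": True,
                               "StreamViewType": "NEW_AND_OLD_IMAGES"},
      )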
Partitions
  • allocation of storage for a table, automatically replicated across multiple AZs
  • handled entirely by DynamoDB, allocates sufficient partitions so can handle provisioned throughput requirements
  • additional partitions are allocated when:
    • increase of provisioned throughput
    • existing partition fills capacity
Provisioned Capacity
  • evenly distributes provisioned throughput among partitions
  • read capacity units (RCUs); 1 RCU can perform:
    • 1 strongly consistent read request / second for items up to 4KB
    • 2 eventually consistent read request / second for items up to 4KB
    • half transactional read request / second (need 2 RCUs to perform one) for items up to 4KB
  • write capacity units (WCUs); 1 WCU can perform:
    • 1 standard write request / second for items up to 1KB
    • half transactional write request / second (need 2 WCUs to perform one) for items up to 1KB
  • Replicated Write Capacity Unit (rWCU): for global tables
  • if access pattern exceeds 3000 RCU or 1000 WCU for a single partition, request might be throttled
  • throttling occurs when configured RCU/WCU are exceeded (ProvisionedThroughputExceededException)
    • use burst capacity effectively: DynamoDB retains up to 5 minutes of unused read and write capacity which can be consumed quickly
  • reading/writing above the limit can be caused by:
    • uneven distribution based on partition key
    • frequent access to same key in partition (most popular item, hot key)
    • request rate greater than provisioned throughput
  • BEST PRACTICE: a larger number of smaller operations allows other requests to succeed without throttling
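  • worked capacity example (the item size and request rates are hypothetical): sizing RCUs/WCUs for 6 KB items
      import math

      item_size_kb = 6
      reads_per_second = 10
      writes_per_second = 10

      # strongly consistent reads: 1 RCU per 4 KB read per second
      rcus_strong = reads_per_second * math.ceil(item_size_kb / 4)    # 10 * 2 = 20 RCUs
      rcus_eventual = rcus_strong / 2                                  # eventually consistent: 10 RCUs

      # writes: 1 WCU per 1 KB written per second
      wcus = writes_per_second * math.ceil(item_size_kb / 1)           # 10 * 6 = 60 WCUs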
On-Demand Capacity
  • don’t need to specify requirements; instantly scales up and down based on activity (useful for unpredictable / spiky workloads)
  • can switch between Provisioned Capacity / On-Demand Capacity once per day
  • pay for what you use
Consistency models
  • eventually consistent reads (default): response might not reflect the result of a recent write operation (stale data)
    • repeat read request after a short time
  • strongly consistent reads: response with the most up-to-date data
    • might not be available during a network delay or outage (HTTP 500)
    • higher latency
    • not supported on Global Secondary Indexes
    • use more throughput capacity
Application Auto Scaling
  • enables table or GSI to increase provisioned read/write capacity to handle increases in traffic without throttling
  • create a scaling policy for a table/GSI; specify:
    • read/write/both capacity, minimum and maximum provisioned capacity
    • target utilization: percentage of consumed provisioned throughput at a point in time
    • target tracking: algorithm to adjust the provisioned throughput so that the actual capacity utilization remains at target utilization
  • if table/GSI is created using Console, auto scaling is enabled by default
DAX (DynamoDB Accelerator)
  • fully managed, HA, in-memory cache that delivers up to 10x performance improvement
  • microseconds performance, even at millions of requests per second
  • improve only READ performance
  • ideal for read-heavy and bursty workloads
    • auction applications
    • gaming
    • retail sites
    • special sales
  • data is written to the cache and back-end store at the same time
  • returns data if the item is in the cache (cache hit)
  • if not in cache performs eventually consistent GetItem operation
  • reduces provisioned read capacity
  • differences with ElastiCache:
    • optimized for DynamoDB
    • doesn’t support lazy loading
    • less management overhead
    • don’t need to modify application
    • less datastores supported
  • pay for the capacity provisioned
    • runs on EC2 instances cluster, charged by the node
API
  • PutItem: create data or full replacement (consumes WCU)
  • UpdateItem: partial update of attributes
    • add new item if not exists
  • Conditional writes accept a write/update only if conditions are met
    • for example, to avoid overwriting data changed by a concurrent writer (optimistic locking)
  • GetItem
    • can use parameter ConsistentRead
  • DeleteItem
  • DeleteTable
  • BatchWriteItem: put or delete up to 25 items in one call
    • reduces the number of API calls and so the latency
    • operations are done in parallel
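  • example sketch (boto3; table and attribute names are hypothetical): a conditional PutItem that only creates the item if the key is not already present
      import boto3

      ddb = boto3.client("dynamodb")

      # raises ConditionalCheckFailedException instead of silently overwriting an existing item
      ddb.put_item(
          TableName="Users",
          Item={"user_id": {"S": "u-123"}, "name": {"S": "Alice"}},
          ConditionExpression="attribute_not_exists(user_id)",
      )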
Global tables (Cross Region Replication)
  • global tables provide fully managed solution for deploying a multi-region, multi-master database
  • create identical tables in these regions and propagate ongoing data changes
  • ideal for massively scaled applications (globally dispersed users)
  • global table: collection of one or more replica tables
    • replica table: single table, part of global table
      • can have one replica table per Region
  • strongly consistent reads/writes are not supported across Regions
    • if required must be on the same Region
  • ensure that each replica table and secondary index in global table has identical write capacity to ensure proper replication

ElastiCache

Fully managed implementations of in-memory data stores: Redis and Memcached.

  • Improve latency and throughput for many read-heavy/compute-intensive workloads
Compute nodes
  • a maintenance window can be defined for software patching
  • nodes cannot be accessed from the Internet or from EC2 instances in other VPCs
  • on-demand or reserved instances (not spot)
Use cases
  • offload reads from a database
  • improve latency and throughput for many read-heavy/compute-intensive workloads
  • store the results of computations and session state
  • streaming data dashboards: landing spot for streaming sensor data on the factory floor, live real-time dashboard displays
Memcached
  • ideal for database caching: use Memcached in front of RDS
    • cache popular queries to offload work
  • simple
  • scalable in/out (adding/removing nodes)
  • scalable up/down (changing node family type)
  • supports multi-threading (uses multiple cores)
  • caches objects like database queries
  • caches contents of a DB
  • caches data from dynamically generated web pages
  • ideal for transient session data
  • ideal for high frequency counters
  • max number of nodes per Region: 100
  • max number of nodes per cluster: 1-20 (soft limits)
  • can integrate SNS for node failure/recovery notification
  • doesn’t support multi-AZ failover or replication
  • doesn’t support snapshots
  • can place nodes in different AZs
  • each node represents a partition of data
Redis
  • open-source in-memory key-value store
  • ideal for load-balanced web servers, store web session information
    • if a web server is lost, session info is not lost and can be picked up by another server
  • ideal for leaderboards
    • can provide live leaderboard for millions of users of mobile app
  • supports encryption
  • supports HIPAA and HA replication
  • supports clustering
  • supports complex data types (sets and lists)
  • data is persistent (can be used as datastore)
  • not multi-threaded
  • master/slave replication and multi-AZ for cross-AZ redundancy
  • automatic failover and backup/restore (backup clusters and metadata)
    • can restore creating a new Redis cluster and populating from a backup
  • shard: a subset of the cluster’s keyspace
    • include a primary node and 0 or more read replicas
      • even across Regions
  • scales by adding shards
  • clustering mode:
    • disabled: can have only one shard
      • replication from primary node is asynchronous
    • enabled: can have up to 15 shards
      • recommended to take snapshots from read replicas
        • snapshotting can slow down nodes
  • automatic and manual snapshots (S3)
  • can only move snapshots between Regions by exporting them
  • Multi-AZ failover: failures are detected by ElastiCache then automatically promotes the replica that has lowest replica lag
    • DNS records remain the same but point to the IP of the new primary
    • by enabling cluster mode and Multi-AZ failover you have fully automated, fault tolerant Redis
Cluster
  • collection of one or more nodes using same caching engine
  • cannot move a cluster from outside VPC into VPC
  • need to configure subnet groups for VPC hosting EC2 instances and cluster
  • if not using VPC, can control access to cluster through Cache Security Groups
  • applications connect to the cluster using endpoints
  • no charge for data transfer between EC2 and ElastiCache within same AZ
Node
  • runs an instance of Memcached or Redis protocol-compliant service
  • has its own DNS name and port
  • failed nodes are automatically replaced
  • controlled by VPC Security Groups and Subnet groups
  • deployed in clusters
  • can span more than one subnet
  • pay per node/hour
Lazy Loading
  • loads data into the cache only when necessary (if cache miss occurs)
  • avoids filling up cache with not requested data
  • if data is not in the cache, returns null then app fetches data from database and writes it into the cache (available next time)
  • can produce stale data without further strategies
  • available only in ElastiCache, not in DAX
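  • cache-aside (lazy loading) sketch with redis-py (the cluster endpoint and the load_user_from_db helper are hypothetical)
      import json
      import redis

      cache = redis.Redis(host="my-redis.example.cache.amazonaws.com", port=6379)

      def get_user(user_id):
          key = f"user:{user_id}"
          cached = cache.get(key)                  # 1. try the cache first
          if cached is not None:
              return json.loads(cached)            #    cache hit
          user = load_user_from_db(user_id)        # 2. cache miss: read from the database (hypothetical helper)
          cache.setex(key, 300, json.dumps(user))  # 3. write back with a 300-second TTL
          return user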
Write Through
  • cache is updated whenever a new write or update is made to underlying database
  • cache data remains up to date
  • add wait time to write operations
  • without a TTL will end up with a lot of cached data never read
TTL
  • mitigates drawbacks of cache strategies
  • specifies number of seconds until the key (data) expires
  • when reading an expired key, application checks value in database
    • is a cache miss for Lazy Loading

CloudWatch

Monitoring service used to collect and track metrics, collect and monitor log files, and set alarms. Monitors operational health.

Metrics
  • system-wide visibility into resource utilization
  • monitor application performance
  • time-ordered set of data points
  • exist within a Region
  • defined by name, namespace and zero or more dimensions
  • metrics retention:
    • data points of < 60 seconds available for 3 hours (high resolution)
      • (higher charges on high resolution metric)
    • data points of 60 seconds available for 15 days
    • data points of 300 seconds (5 min) available for 63 days
    • data points for 3600 seconds (1 h) available for 15 months
  • can publish custom metrics using CLI or API
    • statistic set: publish an aggregated set of data points
  • resolution:
    • standard resolution: one-minute granularity
    • high resolution: one second granularity
      • immediate insight into sub-minute activity
      • can specify a high-resolution alarm with a period of 10 or 30 seconds
Namespace
  • container for metrics; metrics in different namespaces are isolated
    • e.g.: ApiGateway, EC2, Lambda, etc.
Dimensions
  • the --dimensions parameter clarifies what the metric is and what data it stores
    • e.g.: by AutoScaling Group or Per-Instance metrics
Statistics
  • metric data aggregations
  • Minimum: lowest value; can use to determine low volumes of activity
  • Maximum: highest value; can use to determine high volumes of activity
  • Sum: total volume of a metric
  • Average: Sum / SampleCount
    • can use to determine the full scope of a metric (how close average is to max and min)
    • helps to know when to increase or decrease resources
  • SampleCount: count (number) of data points
  • pNN.NN: value of specified percentile (not available for negative values)
Alarms
  • automatically initiate actions
  • action is a notification sent to SNS topic or an Auto Scaling policy
  • invokes actions for sustained state changes only
Event
  • delivers near real-time stream of events describing changes in resources
    • can use to schedule automated actions that self-trigger using cron or rate expressions
    • targets includes: EC2 instances, Lambda functions, streams, delivery streams, log groups, ECS tasks, pipelines, SNS topics, SQS queues, etc.
API
  • PutMetricData: can specify each dimension as MyName=MyValue
    • aws cloudwatch put-metric-data --metric-name Buffers --namespace MyNameSpace [...] --dimensions InstanceId=139391,InstanceType=m1.small
    • publishes a single metric data point
    • create specified metric if not exist
    • every PutMetricData API call for custom metric is charged
  • GetMetricStatistics: specify each dimension as Name=MyName, Value=MyValue
    • --namespace MyNameSpace [...] --dimensions Name=InstanceId,Value=139391 Name=InstanceType,Value=m1.small
      • is the same for “put-metric-alarm” API
    • must specify a value for every defined dimension
      • e.g.: metric BucketSizeBytes includes BucketName and StorageType
    • aggregates data points based on the length of the period specified
    • maximum number of data points: 1,440
  • GetMetricData: retrieve up to 500 different metrics in a single request
  • PutMetricAlarm: creates or updates an alarm and associates with specified metric, metric math expression or anomaly detection model (this one cannot have Auto Scaling actions)
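  • example sketch (boto3; the instance ID and SNS topic ARN are hypothetical): an alarm on sustained high CPU that notifies an SNS topic
      import boto3

      cw = boto3.client("cloudwatch")

      cw.put_metric_alarm(
          AlarmName="high-cpu",
          Namespace="AWS/EC2",
          MetricName="CPUUtilization",
          Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
          Statistic="Average",
          Period=300,                 # 5-minute periods
          EvaluationPeriods=2,        # sustained for two consecutive periods
          Threshold=80,
          ComparisonOperator="GreaterThanThreshold",
          AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],
      )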
CloudWatch Logs
  • centralizes logs from systems, applications and services
  • help monitor and troubleshoot systems and apps using existing custom log files (log files from EC2, CloudTrail, Route 53)
  • Log retention: by default retained indefinitely
    • configurable from 1 day to 10 years
  • can use for real time application, system monitoring and long term log retention
  • CloudTrail logs can be sent to CloudWatch Logs for real-time monitoring
  • Log agent: install into EC2 instances to collect both logs and metrics (e.g. memory and disk utilization)
    • collects more system-level metrics from EC2 instances
    • collects system-level metrics from on-premises servers (hybrid environment or not managed by AWS)
    • retrieves custom metrics from apps and services using StatsD and collectd protocols
CloudWatch Logs Insights
  • interactively search and analyze log data in CloudWatch Logs
  • can perform queries (purpose-built query language)
    • include sample queries
  • can identify potential causes of issues and validate deployed fixes
  • discover fields in logs from services
    • Route 53
    • Lambda
    • CloudTrail
    • VPC
    • any application or custom log that emits log event as JSON
  • cannot access log events with timestamps that pre-date creation time of log group
CloudWatch Metric Filter
  • search and filter log data coming into CloudWatch Logs
  • define terms and patterns to look for in log data as it is sent
  • CloudWatch Logs uses these metric filters to turn log data into numerical metrics
    • can graph them
    • can set an alarm on them
  • can assign dimensions and a unit to the metric
    • changing the unit for the filter later will have no effect
  • supported only for log groups in the Standard log class
  • publishes metric data points only for events that happen after the filter was created
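  • example sketch (boto3; log group and namespace are hypothetical): counting "ERROR" occurrences in a log group as a custom metric
      import boto3

      logs = boto3.client("logs")

      logs.put_metric_filter(
          logGroupName="/my-app/production",
          filterName="ErrorCount",
          filterPattern="ERROR",                    # every matching log event adds 1 to the metric
          metricTransformations=[{
              "metricName": "ErrorCount",
              "metricNamespace": "MyApp",
              "metricValue": "1",
          }],
      )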

CloudTrail

A web service that records activity made on account, delivers log files to an S3 bucket.

Use case
  • enables governance, compliance and operational and risk auditing
  • visibility into user activity by recording actions
  • security analysis, resource change tracking and compliance auditing
Log
  • logs the history of API calls in the AWS account
  • logs:
    • identity of the API caller
    • time of the API call
    • source IP
    • request parameters
    • response
  • for an AWS account, trail can:
    • record events in all regions and deliver to an S3 bucket
    • records events in a single region and deliver to an S3 bucket
      • additional single trails can use the same or different bucket
  • can integrate with CloudWatch Logs to deliver data events through a CloudWatch Logs log stream
  • log file integrity validation feature: determines whether log file was unchanged, deleted or modified since delivered to S3 bucket
Events
  • data events: provide insight into the resource operations (data plane operations)
    • e.g.: S3 object-level API activity (GetObject, DeleteObject, PutObject API)
    • e.g.: Lambda function execution activity (Invoke API)
  • management events: provide insight into management operations (control plane operations)
    • include non-API events
    • e.g.: configuring security, registering devices, configuring rules for routing data
Encryption
  • log files are encrypted using S3 SSE
  • can enable KMS for additional security:
    • a single KMS key can be used to encrypt log files from all Regions
Multi account
  • can consolidate logs from multiple accounts using S3 bucket:
    • turn on CloudTrail in the paying account
    • create bucket policy that allows cross-account access
    • turn on CloudTrail in the other account and use the bucket of the paying account
Alarms
  • no native alarming

CloudFormation

Infrastructure as Code using a template (YAML or JSON). “Template-driven provisioning”.

  • best practices:
    • provides Python “helper scripts” which help install software and start services on EC2 instances
    • use Stack Policies to protect sensitive portions of stack
Infrastructure as Code
  • infrastructure is provisioned consistently
  • less time and effort than configure manually
  • can use version control and peer review templates
  • BEST PRACTICE: use a version control system
  • manage updates and dependencies
  • no charges, pay for resources
Template
  • can upload directly or use S3
  • read the template and makes API calls
  • resulting resources are the Stack
  • logical IDs to reference resources within the template
  • physical IDs to reference resources outside templates after being created
  • mandatory elements:
    • list of resources and config values
  • not mandatory elements:
    • template parameters (up to 60)
    • output values (up to 60)
    • list of data tables
Resources
  • mandatory
  • resources are declared and can reference each other
  • Type
  • Properties
Parameters
  • parameters
    • custom values as inputs, useful for template reuse
    • Type
    • Default
    • AllowedValues
    • Description
  • pseudo parameters
    • predefined parameters by CloudFormation
    • can use as argument of Ref function
    • AWS::AccountId
    • AWS::NotificationARNs: list of notification ARNs
    • AWS::Region
    • AWS::StackId
Mappings
  • matches key to corresponding set of named values
  • fixed variables good for differentiation between regions, environments, AMIs etc.
    • not user specific, use parameters
  • can set value based on Region
Outputs
  • output values that can be imported into other stacks (cross-stack references)
  • returned in response
  • cannot delete a stack if outputs are being referenced by another stack
  • can use Export Output Values to export the name of resource output for a cross-stack reference
    • export names must be unique within a Region
Conditions
  • statements that define the circumstances under which entities are created or configured (applies to resources and outputs):
  • Conditions:
    • CreateProdResources: !Equals [!Ref EnvType, prod]
  • intrinsic functions: If, Equals, Not
Transform
  • specifies one or more macros to process template
  • can reference additional code stored in S3
    • Lambda code or snippets of CloudFormation code
Intrinsic functions
  • Ref
  • GetAtt: value of an attribute from a resource
    • (YAML) !GetAtt logicalNameOfResource.attributeName
    • (JSON) { "Fn::GetAtt": ["logicalNameOfResource", "attributeName"] }
  • FindInMap: value corresponding to keys in a two-level map declared in Mappings section:
    • (YAML): !FindInMap [MapName, TopLevelKey, SecondLevelKey]
  • ImportValue: value of an output exported by another stack (cross-stack references)
  • Join: construct a string value
    • see example in the notebook
  • Sub: substitute variables in an input string with values specified
    • construct commands or outputs that include values that aren’t available until the stack is created or updated
Stack
  • entire environment; automatic rollback by default on error
  • updating stacks:
    • direct update
    • creating and executing ChangeSet
StackSet
  • create, update, delete stacks across multiple accounts and Regions with a single operation
    • can select target accounts
    • must set up a trust relationship between the administrator and target accounts
NestedStack
  • allow reuse code for common use cases
ChangeSet
  • summary of proposed changes to see how might impact existing resources
  • BEST PRACTICE: use to identify potential trouble spots in updates
Drift Detection
  • detects whether a stack’s actual configuration differs from its expected configuration
  • work on resources that support drift detection
    • resources not supporting are assigned with NOT_CHECKED
  • supports drift detection on private resource types whose provisioning type is FULLY_MUTABLE or IMMUTABLE
  • can perform drift detection on stacks with following statuses: CREATE_COMPLETE, UPDATE_COMPLETE, UPDATE_ROLLBACK_COMPLETE, and UPDATE_ROLLBACK_FAILED
  • does not detect drift on nested stacks that belong to the stack
    • can initiate a drift detection operation directly on a nested stack
Serverless Application Model
  • can use SAM to deploy serverless applications
  • extension to CloudFormation for serverless applications
  • simplified syntax for defining serverless resources: APIs, Lambda functions, DynamoDB tables, etc.
  • can use to package deployment code, upload it to S3 and deploy serverless application
  • AWS Serverless transform: takes an entire template in AWS SAM syntax and transforms and expands it into CloudFormation template
    • Transform: AWS::Serverless-2016-10-31
    • in Resources, set the type of a Lambda function to “AWS::Serverless::Function” and use the simplified syntax
  • resource examples:
    • AWS::Serverless::Function
    • AWS::Serverless::Api
    • AWS::Serverless::SimpleTable

Using CLI

Create stack
  • aws cloudformation create-stack
    • --stack-name
    • --template-body file:///filepath.yml
    • --parameters ParameterKey=Parm1,ParameterValue=test1 ParameterKey=Parm2,ParameterValue=test2
  • NoEcho doesn’t mask information stored in
    • Metadata template section
    • Outputs
    • Metadata of resource definition
    • do not use for sensitive information
Describing and listing stacks
  • aws cloudformation list-stacks:
    • get a list of any of the stacks created
    • --stack-status-filter
      • e.g. --stack-status-filter CREATE_COMPLETE
  • aws cloudformation describe-stacks
    • information on running stacks
    • --stack-name
View stack event history
  • aws cloudformation describe-stack-events
    • can track status of resources AWS CloudFormation is creating and deleting
    • --stack-name
    • --resource-status
List stack resources
  • aws cloudformation list-stack-resources
    • summary of each resource in the stack specified with the --stack-name parameter
Retrieve template
  • aws cloudformation get-template
    • --stack-name
Validate template
  • aws cloudformation validate-template
    • --template-body
    • --template-url
Upload local artifacts to S3 bucket
  • some resource properties require an S3 location (bucket and file name)
  • can specify local references instead (local artifacts)
    • e.g. instead of manually uploading Lambda function source code and specifying its S3 location, can reference the local path
  • can use package command to quickly upload them
    • uploads directly to S3
    • returns a copy of template replacing local references with S3 location
    • can use returned template to create or update a stack
  • local artifact: path to file or folder (like Lambda function code)
    • folder: command creates a .zip and upload it
Deploy template with transforms
  • uses a change set to deploy a template that includes transforms
  • can use aws cloudformation deploy
    • creates a change set
    • executes the change set
    • reduce number of required steps
    • --template /path/template.json
    • --stack-name my-new-stack
    • --parameter-overrides Key1=Value1 Key2=Value2

Kinesis

Collect, process and analyze real-time, streaming data (timely insight). Collection of services for processing streams of various data.

Security
  • control access / authorization using IAM policies
  • encryption in flight using HTTPS endpoints
  • encryption at rest using KMS
  • can encrypt data on the client side
  • VPC endpoints available
Differences with SQS
  • must provision throughput (not needed in SQS)
  • ordering at the shard level (in SQS no ordering guarantee except FIFO queues)
Differences with SNS
  • pull data (SNS push data to subscribers)
  • must provision throughput (not needed in SNS)
Stream
  • Shards: uniquely identified groups of data records in a stream; base throughput unit of a data stream
    • each shard ingests up to 1000 records/second
    • default limit of 500 shards, can be requested increase
    • data input max capacity: 1MB/sec
    • data output max capacity: 2MB/sec
    • max records/second per PUT: 1000
    • Record: data units stored in a Kinesis Stream
      • Partition Key: group data by shard
      • Sequence number
      • data blob (up to 1 MB, before base64 encoding)
  • Transient data store: retention from 24 hours (default) to 7 days
Kinesis Data Stream
  • real-time processing of streaming big data
  • useful for rapidly moving data off data producers, continuously processing the data
  • stores data for later processing by applications
  • use cases:
    • accelerated log data feed intake
    • real-time metrics and reporting
    • real time data analytics
    • complex stream processing
  • producers: continually push data to Kinesis Data Streams
    • creates the data that makes up the stream
    • can be used through:
      • Kinesis Stream API
      • Kinesis Producer Library (KPL)
      • Kinesis Agent
  • consumers: EC2 instances that analyze the data received from a stream (Kinesis Stream Applications)
    • process the data in real time
    • can store results using DynamoDB, Redshift or S3
  • resharding: adapt to changes in rate of data flow
    • shard split: divide one into two (increase cost)
    • shard merge: combine two into one
  • KMS master key for encryption
    • permissions to access the needed master key
  • replicates synchronously across 3 AZs
  • pay per shard
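  • producer sketch (boto3; stream name and payload are hypothetical): records with the same partition key always land on the same shard
      import json
      import boto3

      kinesis = boto3.client("kinesis")

      kinesis.put_record(
          StreamName="sensor-stream",
          Data=json.dumps({"sensor": "s-42", "temp": 21.5}).encode(),
          PartitionKey="s-42",          # hashed to pick the shard
      )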
Kinesis Data Firehose
  • easiest way to load streaming data into data stores and analytics tools
  • captures, transforms and loads streaming data
  • enables near real-time analytics with existing business intelligence tools and dashboards
  • Kinesis Data Stream can be the source
  • can configure to transform data before delivering it
    • can invoke Lambda function to transform
  • don’t need to write an application
  • can batch, compress and encrypt data before loading it
    • encrypt data with existing KMS key
  • replicates synchronously across 3 AZs
  • maximum size for a record (before Base64 encoding): 1000 KB
  • Source: where streaming data is continuously generated and captured
  • Delivery Stream: underlying entity of KDF
    • stores data records for up to 24 hours
  • Destination: data store where data will be delivered
    • S3
    • Redshift
    • Elasticsearch
    • Splunk
  • no shards, totally automated
Kinesis Data Analytics
  • process and analyze real-time streaming data
    • provides real-time analysis
  • can use SQL queries to process data streams
  • use cases:
    • generate time-series analytics
    • feed real-time dashboards
    • create real-time alert and notifications
  • can ingest data from Data Streams and Firehose
  • output to:
    • S3
    • Redshift
    • Elasticsearch
    • Kinesis Data Streams
  • input: streaming source for application
    • streaming data source: continuously generated data read into application
    • reference data source: static data that app uses to enrich data coming from streaming data sources
  • application code: SQL statements that process input and produce output
  • output: in-application streams to hold intermediate results
  • destinations: persist the result
    • Kinesis Data Streams
    • Kinesis Firehose
    • S3
    • Redshift
    • Elasticsearch
  • can use IAM to provide permission to read from source and write to destinations
Kinesis Client Library
  • Java library that helps read records from a Kinesis Stream with distributed applications sharing the read workload
  • provides a layer of abstraction specifically for processing data in a consumer role
  • intermediary between record processing and Data Streams
  • differs from Kinesis Data Streams API that helps creating streams, resharding, putting and getting records
  • worker (consumer)
    • can be:
      • EC2 instance
      • Beanstalk
      • on-premises servers
    • connect to the stream
    • enumerates the shards
    • coordinates shard association with other workers (if any)
    • instantiates a record processor for every shard
    • pull data from stream
    • pushes the records to the corresponding record processor
    • checkpoints processed records
    • balances shard-worker associations when worker instance count changes
    • balances shard-worker associations when shards are split or merged
  • each shard is processed by 1 worker
  • each shard has exactly 1 corresponding record processor
  • never need multiple instances to process 1 shard
  • 1 worker can process multiple shards
  • if there are 2 consumer instances, the load is balanced and half of the record processors run on each instance
  • scaling out consumers:
    • ensure number of instances not exceed number of shards
  • progress is checkpointed into DynamoDB
    • IAM access required
  • records are read in order at the shard level

OpenSearch (Elasticsearch)

Open source, distributed search and analytics suite based on Elasticsearch. Search and analytics engine built on Apache Lucene.

Use case
  • log analytics, full-text search, security intelligence, business analytics, operational intelligence
  • log analytics interactively, real-time application monitoring, website search, performance metric analysis
  • supports multiple query languages
    • DSL (Domain-Specific Language), SQL, PPL
  • integrates with Logstash, OpenTelemetry and ElasticSearch APIs
ELK stack
  • (Elasticsearch, Logstash, Kibana): aggregates logs from all systems and apps
  • analyzes these logs and creates visualizations
  • useful for infrastructure monitoring, troubleshooting, security analytics etc.
Cluster
  • need to specify number of instances, instance type and storage options
  • can perform upgrades without downtime
  • built-in monitoring and alerting with automatic notifications
VPC domain
  • domains can be launched into VPC
  • enable secure communication between other services in the VPC
  • extra layer of security
  • to be accessible from the internet, VPC domains require a VPN or proxy
  • the console displays less information for VPC domains
    • cluster health (not including shard information)
  • cannot apply IP-based access policies
  • cannot switch later to use public endpoints
  • cannot launch on VPC with dedicated tenancy
  • cannot change VPC
  • can change subnets and security groups settings
  • user must have access to the VPC to access Dashboards
Security
  • encryption of data at rest (AES-256)
  • uses KMS for storage and management encryption keys
  • can encrypt node to node communications using TLS 1.2
    • once enabled cannot be disabled
  • support access policies:
    • Resource-based policies
    • Identity-based policies
    • IP-based policies
  • fine-grained access control
    • role-based access control
    • security at the index, document and field level
    • multi-tenancy
    • HTTP basic authentication
  • supports authentication through SAML and Cognito

SNS

Fully managed messaging service for A2A (application-to-application) and A2P (application-to-person) communication.

  • Pub/sub provides messaging for high-throughput, push-based, many-to-many use cases
  • sending notification between distributed systems, microservices, event-driven serverless applications
  • can send:
    • SMS
    • email
    • SQS queues
    • trigger Lambda function
    • Kinesis Data Firehose
    • HTTP endpoint
    • platform application endpoint (mobile push)
  • inexpensive and based on a pay-as-you-go model
  • pub-sub model whereby users or applications subscribe to SNS topics:
    • “access point” allowing recipients to dynamically subscribe for identical copies of the same notification
    • stored redundantly across multiple AZs
    • instantaneous, push-based delivery
  • can fanout messages to many subscribers including SQS queues
    • SQS manages subscription and any necessary permissions
    • supported for A2A messaging
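  • publish sketch (boto3; the topic ARN is hypothetical): a single publish is fanned out to every subscriber of the topic
      import boto3

      sns = boto3.client("sns")

      sns.publish(
          TopicArn="arn:aws:sns:us-east-1:123456789012:order-events",
          Subject="OrderCreated",
          Message='{"order_id": "o-1001"}',
      )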

SQS

Distributed queue system that enables web service applications to quickly and reliably queue messages that one component in the application generates to be consumed by another component.

  • Send, store and receive messages between software components
  • temporary repository for messages awaiting processing
  • acts as a buffer between producer and receiver
  • resolves issues if the producer works faster than the consumer
  • allow decoupling / loose coupling
  • pull-based
  • guarantees that your messages will be processed at least once
Limits
  • messages up to 256 KB
  • messages can be kept in the queue from 1 minute to 14 days
    • default 4 days
CloudWatch integration
  • CloudWatch automatically collects SQS metrics every 5 minutes
  • considers a queue to be active for up to 6 hours if it contains any messages or if any API action accesses it
  • CloudTrail captures API calls from SQS and logs them to a specified S3 bucket
  • when using long-polling no charge in addition (no detailed monitoring)
Queue
  • queue name must be unique within a Region
  • queue policy
    • can specify permissions
    • finer grained control
    • control over the requests that come in
  • cannot change queue type after creation
Standard queue
  • default queue type
  • unlimited transactions/second (TPS)
  • guarantee message is delivered at least once
  • occasionally more than one copy of a message is delivered, possibly out of order
  • best-effort ordering: messages are generally delivered in the same order they are sent
  • can have different priorities
  • scaling is performed by creating more queues
  • data is stored within a single, highly available Region with multiple redundant AZs
First in First Out (FIFO) queue
  • exactly-once processing
  • order strictly preserved
  • message delivered once and remain available until consumer processes and deletes it
  • no duplicates
  • supports message groups: multiple ordered message streams within a single queue
  • max transactions/second (TPS): 300
    • otherwise offers the same capabilities as standard queues
  • deduplication
    • MessageDeduplicationId
    • deduplication interval of 5 minutes
    • content-based deduplication: the deduplication ID is generated as a SHA-256 hash of the message body
  • sequencing
    • MessageGroupId
    • strict ordering between messages
    • messages with different MessageGroupId may be received out of order
    • messages with same MessageGroupId delivered once
Visibility timeout
  • if the job is processed before the visibility timeout expires, the consumer deletes the message
  • if job not processed within visibility timeout, message become visible again and another reader will process it
  • could result in a message delivered twice
  • default timeout is 30 seconds
    • can be increased (max 12 hours)
Polling
  • short polling (default): returns immediately (even if queue is empty)
    • queries subset of available servers
    • ReceiveMessageWaitTime is set to 0
    • more requests, higher cost
  • long polling: doesn’t return until a message arrives in the queue or the long poll times out
    • can be enabled at queue level
    • can be enabled at API level using WaitTimeSeconds
    • eliminates false empty responses by querying all servers
    • waits until a message is available, before sending a response
    • requests contain at least one of the available messages up to the maximum number of messages specified in the ReceiveMessage action
    • should not be used if expect an immediate response
    • ReceiveMessageWaitTime set to non-zero value
      • up to 20 seconds
    • fewer requests and reduces cost
      • same charge per million of requests as short polling
SQS Delay Queues
  • postpones delivery of new messages to queue for a number of seconds
  • messages remain invisible over that period
  • default is 0 second, maximum 900 seconds (15 min)
  • when enabled, doesn’t affect delay of messages already in the Standard queue
  • when enabled, does affect the delay of messages already in a FIFO queue
  • use for:
    • large distributed applications that need to introduce a delay in processing
    • need to apply delay to an entire queue
    • update to sales or stock control databases before sending a notification to a customer confirming an online transaction
SQS Extended Client Library for Java
  • can use to manage large message payloads
  • uses S3 to store the message payloads
  • useful for storing and consuming messages up to 2 GB in size
  • can use for:
    • send message that references a single message object stored in an S3 bucket
    • get the corresponding message object from an S3 bucket
    • delete the corresponding message object from an S3 bucket
Security
  • can use IAM policies to control who can read/write messages
  • authentication can be used to secure messages in queues
  • in-flight security with HTTPS
  • can enable server-side encryption with KMS
    • encrypts only the message body, not the attributes
API
  • CreateQueue
  • DeleteQueue: requires the QueueUrl
  • PurgeQueue: deletes all messages in the specified QueueUrl
  • SendMessage
  • ReceiveMessage: retrieves up to 10 messages
    • can use WaitTimeSeconds to enable long-poll
  • DeleteMessage: ReceiptHandle to select the message
  • ChangeMessageVisibility: changes visibility timeout
  • manipulate up to 10 messages with a single action to reduce costs (see the batch sketch below):
    • SendMessageBatch
    • DeleteMessageBatch
    • ChangeMessageVisibilityBatch
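A batch sketch with boto3 (queue URL is a placeholder); up to 10 entries are sent in a single SendMessageBatch request, each with a caller-chosen Id:

    import boto3

    sqs = boto3.client("sqs")
    queue_url = "https://sqs.eu-west-1.amazonaws.com/123456789012/my-queue"  # placeholder

    # One API request instead of ten: fewer requests, lower cost
    sqs.send_message_batch(
        QueueUrl=queue_url,
        Entries=[{"Id": str(i), "MessageBody": f"message-{i}"} for i in range(10)],
    )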

Developer tools

CodeCommit

Fully managed source control service that hosts secure, highly scalable Git-based repositories.

  • can store anything from source code to binaries
  • repositories are private
  • scales seamlessly
  • integrates with Jenkins, CodeBuild and other CI/CD tools
  • can transfer files using HTTPS or SSH
  • encrypted repositories through KMS using customer-specific keys
  • monitor repositories via CloudTrail and CloudWatch
  • authentication:
    • Git credentials: username and password pair over HTTPS
    • SSH keys: public-private key pair
    • AWS access key
  • authorization:
    • user/roles with IAM
    • identity-based policies (not resource-based)
    • can attach tags or pass tags in a request; control access based on tags
  • notifications:
    • SNS, Lambda:
      • on deletion of branches
      • on pushes to the master branch
      • notify external build system
      • trigger Lambda function to perform codebase analysis
    • CloudWatch Event rules
      • on pull request updates (created, updated, deleted, commented)
      • on commit comment
      • event rules can target an SNS topic

CodeBuild

Fully managed CI service that compiles source code, runs tests and produces packages to deploy.

  • Scales continuously, multiple builds concurrently, builds are not left waiting in queue
  • alternative to other tools like Jenkins
  • can extend capabilities by custom Docker images
Build project
  • location of the source code
  • build environment to use
  • build commands to run and where to store the output
Build environment
  • operating system
  • programming language runtime
  • build tools (Maven, Gradle, npm, etc.)
  • Preconfigured: for Java, Python, Node.js, Ruby, Go, Android, .NET Core for Linux and Docker
  • Custom: custom environment
    • can package runtime and tools into a Docker image and upload to ECR, then specify the location of Docker image and CodeBuild will pull
Build specification
  • YAML file that describes collection of commands and settings to run a build
  • need a buildspec.yml at root of source code
  • can define specific commands such as installing tool packages, run unit tests and packaging code
  • has sample build specification files for common scenarios
    • Maven, Gradle or npm
  • can define environment variables:
    • plaintext variables
    • secure secrets using SSM parameter store
  • phases:
    • install: install dependencies
    • pre_build: final commands to run before the build
    • build
    • post_build: finishing touches (e.g. zip the output)
Build
  • source code from GitHub, CodeCommit, S3 etc.
  • queue timeout: a build that has not started after x minutes is removed from the queue
    • default timeout: 8 hours
    • can override timeout with value between 5 minutes and 8 hours
  • Artifacts: uploaded to S3
    • encrypted with KMS
  • IAM permissions for build
  • VPC for network security
  • pay based on time to complete builds
Cache
  • files to cache (usually dependencies) are stored in S3 for future builds
Monitoring and debugging
  • CloudTrail for logging API calls
  • CloudWatch alarms to detect failed builds
  • can run locally for deep troubleshooting using Docker
    • leverages CodeBuild agent

CodeDeploy

Deployment service that automates application deployments to EC2 instances, on-premises instances, serverless Lambda functions or Amazon ECS services.

  • Can deploy Lambda functions, configuration files, executables, packages, scripts, multimedia files, etc.
  • integrates with CI/CD tools (Jenkins, GitHub, Atlassian, CodePipeline)
  • fully managed
Application
  • defines what to deploy and how, on EC2/On-premises, Lambda or ECS
Deployment Group
  • set of target instances or environments (dev, test, prod, etc.)
  • deployment configuration: set of rules
    • success / failure conditions
  • notification configuration for deployment events
  • CloudWatch alarms to monitor a deployment
  • deployment rollback configuration
In-place deployment
  • only EC2/On-premises
  • application on each instance in the deployment group is stopped, latest application is installed, new version is started
  • can use a load balancer so that each instance is deregistered during its deployment and then restored after deployment is complete
Blue/green deployment on EC2
  • not available for on-premises instances
  • instances are provisioned for the replacement environment
  • latest application revision is installed on the replacement environment
  • optional wait time occurs for activities (testing, verification)
  • instances in the replacement environment are registered with the ELB, causing traffic to be rerouted to them
  • instances in the original environment are deregistered, can be terminated or kept running for other users
  • replacement:
    • can use Auto Scaling group as template for replacement environment (e.g. number of running instances)
    • can specify instances to be counted as replacement
Blue/green deployment on Lambda
  • traffic is shifted from current serverless environment to one with updated Lambda function versions
  • traffic shifting configured in deployment configuration:
    • linear: equal increments with an equal number of minutes between each increment
    • canary: two increments, from predefined canary options can specify percentage of traffic shifted in first increment and the interval before second increment
    • all-at-once: all traffic is shifted all at once
  • only way for Lambda compute platform deployments
  • do not need to specify deployment type
Blue/green deployment on ECS
  • traffic is shifted from the task set with the original version of the app to a replacement task set
  • traffic shifting configured in deployment configuration:
    • linear
    • canary
    • all-at-once
  • test listener can serve traffic to the replacement task while validation tests are run
Deployment on EC2
  • instances are identified by tags or Auto Scaling Group names
  • instances must have an IAM instance profile attached
  • CodeDeploy agent must be installed on each instance
  • instances are grouped by Deployment Group
appspec.yml
  • file must be at the root of the source code
    • can be appspec.yaml for ECS or Lambda
  • files: specifies how to source and copy from S3 / GitHub to filesystem
  • hooks: set of instructions to be run to deploy the new version
    • EC2 (examples):
      • ValidateService: last deployment lifecycle event
        • used to verify the deployment was completed successfully
      • AfterInstall: can use for tasks such as configuring application or changing file permissions
      • ApplicationStart: typically use to restart services that were stopped during ApplicationStop
      • AllowTraffic: during this deployment lifecycle event, internet traffic is allowed to access instances after a deployment
        • this event is reserved for the CodeDeploy agent and cannot be used to run scripts
    • ECS (examples):
      • BeforeInstall
      • AfterInstall
      • AfterAllowTestTraffic
      • BeforeAllowTraffic
      • AfterAllowTraffic
    • Lambda (examples):
      • BeforeAllowTraffic: specify task or functions to run before traffic is routed to new function
      • AfterAllowTraffic: specify the tasks or functions to run after the traffic has been routed to new function
  • for ECS:
    • must specify task definition ARN (TaskDefinition)
    • must specify where load balancer reroutes traffic during a deployment (LoadBalancerInfo)
  • for Lambda:
    • the Lambda function resource (e.g. myLambdaFunction)
    • CurrentVersion
    • TargetVersion
  • Revision: includes everything needed to deploy the new version
    • AppSpec file, application files, executables, config files

CodePipeline

Fully managed continuous delivery service: automate release pipelines (build, test, deploy) every time there is a code change. Enables to deliver features and updates rapidly and reliably.

  • Integrates with GitHub or custom plugins
  • structured in:
    • pipelines: workflow that describes how software changes go through release process
    • artifacts: files or changes worked on by stages
      • each pipeline stage can create artifacts
      • artifacts are passed, stored in S3 and then passed to the next stage
    • stages: can be build, deploy, test, load test, etc
      • can define manual approval
      • source: S3, CodeCommit, GitHub, ECR, Bitbucket Cloud
      • build: CodeBuild, Jenkins
      • deploy: CloudFormation, CodeDeploy, ECS, Elastic Beanstalk, Service Catalog, S3
    • actions: stages contain at least one action on artifacts as input, output or both
    • transitions: processing from one stage to another inside of a pipeline
  • code changes pushed to repository automatically enter workflow
  • can create SNS notification on state changes (CloudWatch Events)
    • failed pipelines
    • cancelled stages
  • can audit API calls with CloudTrail
  • need IAM service role attached to the pipeline with permissions
  • only pay for what you use; no upfront fees or long-term commitments

CodeStar

Unified user interface to easily manage software development activities. Fast setup of CD toolchain.

  • Makes it easy to work in a team, manage access and add owners, contributors and viewers
  • can easily track progress across development process
  • useful for unified development toolchain with collaboration between team members, synchronization, centralized management of CI/CD pipeline
  • templates:
    • project templates: websites, web applications, web services, Alexa skills, etc.
    • templates with code for getting started on supported programming languages: Java, JavaScript, PHP, Ruby, Python, etc.
  • support IDEs:
    • Cloud9 natively
    • Visual Studio, Eclipse, etc.
  • no additional charge

X-Ray

Helps to analyze, debug and trace distributed production applications, such as those built using a microservices architecture.

Capabilities
  • can understand how app and underlying services are performing to troubleshoot root cause of performance issues and errors
  • provide end-to-end view of requests
    • shows a map of the application’s underlying components
  • can analyze applications in development and in production, from simple three-tier apps to complex microservices apps
  • should not be used as an audit or compliance tool
    • doesn’t guarantee data completeness
  • can view and filter data by properties such as: annotation value, average latencies, HTTP response status, timestamp, database table used, etc.
Applications support
  • EC2 / On-premises
    • Linux system must run X-Ray daemon
    • EC2 need instance role
  • ECS/EKS/Fargate
    • create Docker image that runs daemon or use X-Ray Docker image
    • ensure port mapping and network settings and IAM task roles
  • Lambda
    • need X-Ray integration (active tracing) to be enabled in the function configuration
      • Lambda will run daemon
    • IAM role is the Lambda role
  • Elastic Beanstalk
    • set configuration in console or use .ebextensions/xray-daemon.config
SDK
  • captures metadata for requests made to MySQL and PostgreSQL databases (RDS, Aurora) and DynamoDB
  • captures metadata for requests to SQS and SNS
  • installed in the application and forwards to X-Ray Daemon which forwards to X-Ray API
  • interceptors: add them to code to trace incoming HTTP requests
  • client handler to instrument SDK client that app uses to call other AWS services
  • HTTP client to use to instrument calls to other internal and external HTTP web services
  • X-Ray console to visualize what is happening
X-Ray Agent
  • can assume a role to publish data into a different account
Trace
  • set of data points that share same trace ID
Segments
  • single component that encapsulates all data points
    • e.g. authorization services of distributed application
Subsegments
  • more granular timing information and details about downstream calls
  • additional details about a call to a service, external HTTP API or SQL database
  • can define arbitrary subsegments to instrument specific functions or lines of code
  • use for services that don’t send their own segments (DynamoDB):
    • subsegments generate inferred segments and downstream nodes on the service map
  • can see all downstream dependencies, even if they don’t support tracing or are external
Annotations
  • system-defined or user-defined data associated with a segment
  • system-defined: include data added to the segment by services
  • user-defined: metadata added to a segment by a developer
  • key/value pairs used to index traces and with filters
  • can use to record information on segments or subsegments (indexed for search)
Sampling
  • X-Ray (to be performant and cost-effective) doesn’t collect data for every request, but a statistically significant number of requests
Metadata
  • key/value pairs, not indexed and not used for searching
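A minimal instrumentation sketch with the X-Ray SDK for Python (aws-xray-sdk), assuming the X-Ray daemon is reachable and a hypothetical Orders DynamoDB table; it patches the AWS SDK, opens a segment and a subsegment, and records an annotation (indexed) and metadata (not indexed). In Lambda the segment is created automatically, so begin_segment is only needed in a standalone app:

    import boto3
    from aws_xray_sdk.core import xray_recorder, patch_all

    patch_all()  # automatically instrument supported libraries (boto3, requests, ...)

    xray_recorder.begin_segment("order-service")              # top-level segment
    subsegment = xray_recorder.begin_subsegment("process_order")
    subsegment.put_annotation("order_id", 42)                 # indexed: usable in filter expressions
    subsegment.put_metadata("debug", {"retries": 0})          # not indexed, not searchable
    boto3.client("dynamodb").get_item(                        # traced as a downstream subsegment
        TableName="Orders", Key={"id": {"S": "42"}}
    )
    xray_recorder.end_subsegment()
    xray_recorder.end_segment()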

IAM

Identity and Access Management Service, centralized control of account; enables shared access; not used for application-level authentication.

Infrastructure
  • global (not per region)
  • eventually consistent
  • replicate across multiple data centers
  • support PCI DSS compliance
CLI
  • by obtaining temporary security credentials from STS (aws sts get-session-token)
API
  • can request temporary security credentials, passing MFA parameters in the STS API request
  • can use IAM Query API to make direct calls
SDK
  • can use to make programmatic API calls
Root user
  • BEST PRACTICE: don’t use (except for billing)
  • BEST PRACTICE: don’t share root credentials
  • BEST PRACTICE: create IAM admin user with permissions instead
Principal
  • entity that can take action on resource
  • can be users or roles
  • first principal: administrative IAM user
  • support federated users
    • can configure Identity Federation allowing secure access to resources without creating an IAM user account
  • support programmatic access
User
  • entity representing a person or a service (service accounts)
  • up to 5000 users per account
  • unique ID
  • BEST PRACTICE: enable MFA
  • can define password policy enforcing password length, complexity, etc.
    • doesn’t apply to the root user
    • BEST PRACTICE: configure strong password policy
  • can allow/disallow to change password
  • groups: collection of users with policies attached
    • can use to assign permissions to users
    • cannot nest groups
Role
  • created and assumed by trusted entities
  • define a set of permissions for making service requests
  • can delegate permissions to resources for user and services without using permanent credentials
  • can be assigned to federated user who signs in using external IDP
  • EC2 instances:
    • applications retrieve temporary security credentials from instance metadata
    • BEST PRACTICE: use roles for applications on EC2 instances
    • instance profile: grants applications running on EC2 instances permission to make API requests
      • only one role can be attached to an instance at a time
      • CLI
        • aws iam create-instance-profile
        • aws iam add-role-to-instance-profile
        • aws iam list-instance-profiles(-for-role)
        • aws iam get-instance-profile
        • aws iam remove-role-from-instance-profile
        • aws iam delete-instance-profile
  • role delegation: trust between two accounts, one owning resource (trusting account), one containing users (trusted account)
    • can be
      • same account
      • separate accounts of same organization
      • different organizations accounts
    • permissions policy: grant user of role the required permissions on resource
    • trust policy: specify trusted account members that are allowed to assume role
  • BEST PRACTICE: use roles to delegate permissions
Requests
  • Action: operation that the principal wants to perform
    • defined by services
    • can be viewing, creating, editing, deleting, etc.
  • Resource: entity that exists within a service upon which actions are performed
  • Principal information
    • environment from which request is made
    • etc.
  • Request context
    • principal (requester)
    • aggregate permissions associated with principal
    • environment data: IP address, user agent, SSL status, etc.
    • resource data
Authentication
  • principal sending a request must be authenticated; methods:
    • console password
    • access key (for API and CLI): combination of access key and secret access key
      • can use for programmatic calls
      • can create, modify, view or rotate
      • secret access key is returned only at creation time (if lost requires new key)
      • must be stored securely
      • can disable a user access key (IAM Identity Center)
    • server certificate: SSL/TLS certificate to authenticate
      • use only when must support HTTPS in Region not supported by Certificate Manager
  • BEST PRACTICE: rotate keys and passwords regularly
Policies (authorization)
  • stored in JSON documents
  • values from the request context are matched against policies
  • can apply to users, groups and roles
  • most restrictive policy is applied
  • can use IAM policy simulator tool to understand effects
    • BEST PRACTICE: validate your policies
  • BEST PRACTICE: least-privilege principle
    • actions on specific resources at specific conditions
  • condition: element to apply further conditional logic
    • BEST PRACTICE: use policy conditions for extra security
  • User (identity) based policies
    • attached to identity (user, group or role)
    • specify what identity can do
    • IAM permissions boundaries: the maximum permissions an identity-based policy can grant to an entity
  • Resource-based policies
    • attached to resources (S3, SQS, VPC endpoints, KMS, etc.)
    • can specify who has access and what can perform on resource
    • only inline (not managed)
  • AWS Organizations service control policies (SCPs): specifies maximum permissions for an organization or OU
  • Session policies: parameters passed when programmatically create temporary session for a role or federated user
  • evaluation logic:
    • default: all requests denied
    • explicit allow overrides implicit deny
    • explicit deny overrides any explicit allow
  • Managed policy
    • created by AWS for common use cases
    • cannot change permissions
    • some policies are designed for specific job functions
      • Administrator
      • Billing
      • Database Administrator
      • Data Scientist
      • Power User
      • Network Administrator
      • Security Auditor
      • Support User
      • System Administrator
      • View-only User
  • Customer managed policy: standalone policy that administrator creates
  • Inline policy: 1:1 relationship between entity and policy
    • when deleting entity, inline policy is deleted
  • BEST PRACTICE: use managed policies instead of inline
  • the Billing managed policy alone is not enough: IAM access to Billing must also be activated for each user that needs it
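A least-privilege sketch with boto3 (policy name, bucket and role are hypothetical): a customer managed policy is created from a JSON document with a specific action, resource and condition, then attached to a role:

    import json
    import boto3

    iam = boto3.client("iam")

    policy_doc = {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",                        # everything else stays implicitly denied
            "Action": ["s3:GetObject"],               # specific action...
            "Resource": "arn:aws:s3:::my-bucket/*",   # ...on a specific resource
            "Condition": {"Bool": {"aws:SecureTransport": "true"}},  # extra conditional logic
        }],
    }

    policy = iam.create_policy(
        PolicyName="ReadMyBucketOnly",
        PolicyDocument=json.dumps(policy_doc),
    )
    iam.attach_role_policy(RoleName="app-role", PolicyArn=policy["Policy"]["Arn"])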
Security Token Service
  • web service that enables to request temporary, limited-privilege credentials for IAM users or federated users
  • use cases:
    • identity federation
    • enterprise identity federation
      • authenticate users in organization’s network
      • without creating AWS identities
      • single sign-on approach to temporary access
      • support SAML 2.0
    • custom federation broker
    • web identity federation
      • let users sign in using a known third-party identity provider such as Amazon, Facebook, Google or any OIDC-compatible IDP
      • exchange credentials from provider for temporary permissions to resources
    • Cognito recommended for identity federation for mobile applications
      • support same identity providers as STS
      • support unauthenticated access
      • let migrate user data when user sign in
      • provide API operations for sync user data between devices
    • roles for cross-account access
      • use identities from a different account of the same organization
      • delegation approach to temporary access
    • roles for EC2
      • EC2 instances that need access to resources
      • temporary security credentials available to all apps in the instance
      • don’t need to store any long-term credentials
  • available as global service
  • by default, all STS requests go to a single global endpoint (sts.amazonaws.com)
  • can send STS requests to endpoints of any Region (reduce latency)
  • support CloudTrail which records calls and deliver log into S3 bucket
  • BEST PRACTICE: monitor activity
  • similar to long-term access key credentials, except:
    • temporary
    • can be configured to last anywhere from a few minutes to several hours
    • not stored with the user but generated dynamically
    • the user can request new credentials if still permitted to do so
  • temporary security credentials have limited lifetime
    • no need to rotate or revoke
  • cannot be reused after they expire
  • security credentials consist of:
    • access key: access key ID + secret access key
    • session token
    • expiration or duration of validity
  • can use the API to request a session token (see the sketch after this list):
    • AssumeRole: IAM users
    • AssumeRoleWithSAML: user passing SAML auth response that indicates auth from known trusted IDP
    • AssumeRoleWithWebIdentity: user passing web identity token from known trusted IDP
    • GetSessionToken: IAM user or root
    • GetFederationToken: IAM user or root (MFA not supported)
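An AssumeRole sketch with boto3 (role ARN and session name are placeholders): the temporary credentials returned by STS are used to build a client in the trusting account:

    import boto3

    sts = boto3.client("sts")
    resp = sts.assume_role(
        RoleArn="arn:aws:iam::111122223333:role/prod-readonly",  # placeholder role
        RoleSessionName="audit-session",
        DurationSeconds=3600,                  # temporary: credentials expire on their own
    )

    creds = resp["Credentials"]                # AccessKeyId, SecretAccessKey, SessionToken, Expiration
    s3 = boto3.client(
        "s3",
        aws_access_key_id=creds["AccessKeyId"],
        aws_secret_access_key=creds["SecretAccessKey"],
        aws_session_token=creds["SessionToken"],
    )
    print(s3.list_buckets()["Buckets"])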
Cross-account access
  • useful for separate AWS accounts
  • e.g. development and production resources
  • resource-based policy: needed permissions on resource in different accounts
  • identity-based policy: assuming a role with needed permissions within different account
IAM Access Analyzer
  • help identify resources that are shared with external entity
  • help identify unused access
  • BEST PRACTICE: remove unnecessary credentials
  • validate policies against policy grammar and best practices
  • custom policy checks: help validate policies against specified security standards
  • generate policies based on access activity in CloudTrail logs
Access Advisor
  • use data analysis to help set permission guardrails confidently
  • provide service last accessed information for
    • accounts
    • organizational units
    • organization managed by Organizations
  • permissions guardrails help control which services can be accessed by developers and applications
  • can use service control policies (SCP) to restrict access to service
  • can determine the services not used by IAM users and roles

Cognito

Identity broker that lets you add user sign-up, sign-in and access control to web and mobile apps quickly and easily. Provides authentication, authorization and user management.

Federation
  • work with external IDP that support SAML or OpenID connect
    • social identity providers
  • federation allows users to authenticate with a Web IDP
    • the user authenticates first with the Web IDP and receives an authentication token
    • then exchanged for temporary credentials to assume an IAM role allowing access to required resources
    • can integrate custom IDP
Temporary credentials
  • no need for the application to store credentials locally
  • provides temporary security credentials to access app’s backend resources in any service behind API Gateway
User pools
  • authentication
  • user directories that provide sign-up and sign-in options for application users
  • is an IDP
  • built-in, customizable web UI
  • social sign-in
  • MFA
  • checks for compromised credentials
  • account takeover protection
  • phone and email verification
  • customized workflow through Lambda triggers
    • pre sign-up Lambda trigger
    • post confirmation Lambda trigger
    • pre authentication Lambda trigger
    • post authentication Lambda trigger
    • pre token generation Lambda trigger
    • custom message Lambda trigger
    • migrate user Lambda trigger
      • invoked when user doesn’t exist in the user pool
        • after Lambda returns success, Cognito creates the user in the user pool
      • invoked in the forgot-password flow
    • challenge Lambda trigger: user pool custom authentication flow
      • define auth challenge
        • invoked to initiate custom auth flow
      • create auth challenge
        • invoked after “define auth challenge” to create custom challenge
      • verify auth challenge response
        • invoked to verify response from end user for custom challenge is valid or not
      • can incorporate new challenge types
        • e.g. include CAPTCHAs
        • e.g. include dynamic challenge questions
      • API:
        • InitiateAuth
        • RespondToAuthChallenge
  • after authentication, issue a JWT to secure access to APIs
  • can be seen as Active Directory
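A sign-in sketch against a user pool with boto3 (the app client ID, username and password are placeholders, and the USER_PASSWORD_AUTH flow is assumed to be enabled on the app client); if no extra challenge (e.g. MFA) is required, the response carries the JWTs used to secure API calls:

    import boto3

    idp = boto3.client("cognito-idp")

    resp = idp.initiate_auth(
        ClientId="example-app-client-id",      # placeholder app client
        AuthFlow="USER_PASSWORD_AUTH",
        AuthParameters={"USERNAME": "alice", "PASSWORD": "correct-horse-battery"},
    )

    tokens = resp["AuthenticationResult"]      # IdToken, AccessToken, RefreshToken
    id_token = tokens["IdToken"]               # JWT presented to the backend / API Gateway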
Identity pools
  • authorization
  • create unique identities for users and authenticate them with an IDP
  • can obtain temporary, limited-privilege credentials
  • uses Push Synchronization to push updates and synchronize user data across multiple devices
  • silent push notification using SNS to all devices
  • support following IDP
    • public providers: login with Amazon, Facebook, Google
    • Amazon Cognito User Pool
    • OpenId or SAML IDP
    • Developer Authenticated Identities
  • can be seen as an IAM role
Cognito Sync
  • client library that enables cross-device syncing of application-related user data
  • cache data locally so app can read/write data regardless of device connectivity status
  • similar to AppSync
    • can synchronize across devices (but not across users, as AppSync does)
  • push sync
  • Cognito Streams: allow pushing dataset changes to a Kinesis stream in real time
  • Cognito Events: allow executing a Lambda function in response to events in Cognito
    • the function can evaluate and manipulate data before it is synced to other devices
    • e.g. issuing an award when a player reaches a new level

KMS

Highly available key storage, management and auditing solution used to encrypt data.

Key
  • Alias: unique alias and description
  • Key material: used to encrypt and decrypt data
  • Metadata
    • key ID
    • creation date
    • description
    • key state
  • can generate keys in KMS, in CloudHSM cluster or import them
  • cannot export keys (CloudHSM allows this)
  • support for symmetric and asymmetric keys
  • encryption keys are Regional
  • can directly encrypt data up to 4 KB in size
  • BEST PRACTICE: recommended to delete keys no longer in use
AWS Managed Key
  • can only be used by service that created them in a particular Region
    • created on the first time encryption is implemented in the service
  • do not pay a monthly fee
  • can be subject to fees for use in excess of free tier
    • cost covered in some services
Customer Managed Key
  • greater flexibility
  • can perform rotation, governing access and key policy configuration
  • can be enabled and disabled when no longer required
  • monthly fee and a fee for use in excess of free tier
AWS Owned key
  • collection of keys that service owns and manage in multiple accounts
    • cannot view, use, track or audit them
  • no charge
Data key
  • key used to encrypt data, including large amounts of data
  • can use KMS keys to generate, encrypt and decrypt data keys
  • KMS doesn’t store, manage or track data keys
  • must use and manage outside KMS
  • if a service encrypts large data, it uses data keys protected by a master key
  • no limits on number of data keys
  • integrated with client-side toolkit that use a method known as envelope encryption to encrypt data
    • generates data keys used to encrypt data and are themselves encrypted using master keys
    • the KMS master key encrypts the data key (the envelope key)
    • the decrypted data key is then used to decrypt the data
Usage policies
  • determine which users can use keys to encrypt/decrypt data and under which conditions
Master key
  • protected by hardware security modules (HSMs) and are only ever used within those modules
  • can submit data directly to KMS to be encrypted/decrypted using master keys
  • service stores data key along with encrypted data
  • when service needs to decrypt data, it requests KMS to decrypt the data key using the master key
  • all request to use master keys are logged to CloudTrail
  • can schedule a deletion (7 to 30 days)
    • during the waiting period can verify the impact of deletion in applications
    • can cancel key deletion
  • max master keys per account per Region: 1000 (excluding managed key)
Custom key store
  • combines CloudHSM with KMS
  • can configure a custom CloudHSM cluster and authorize KMS to use it as a dedicated key store rather than the default key store
  • master keys generated in custom key store never leave the cluster in plaintext
  • all the operations that use those keys are only performed in HSMs
API
  • can use KMS APIs directly to encrypt and decrypt data using master keys
  • encrypt (aws kms encrypt)
    • encrypts plaintext into ciphertext by using master key
    • can use to move encrypted data from one Region to another
  • decrypt (aws kms decrypt)
  • re-encrypt (aws kms re-encrypt)
    • can use to change the customer master key
    • can use when you manually rotate
  • enable-key-rotation
    • automatic rotation of key material for specified symmetric customer master key
    • cannot perform on key in different account
  • GenerateDataKey (aws kms generate-data-key)
  • GenerateDataKeyWithoutPlaintext (aws kms generate-data-key-without-plaintext)
    • generates unique symmetric data key
    • returns data key encrypted under a customer master key
  • GenerateDataKeyPair requests an asymmetric data key pair
  • GenerateDataKeyPairWithoutPlaintext
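An envelope-encryption sketch with boto3 (the key alias is hypothetical): GenerateDataKey returns a plaintext data key, used locally to encrypt the data, and the same key encrypted under the master key, stored alongside the data:

    import boto3

    kms = boto3.client("kms")

    # 1. ask KMS for a data key protected by the master key
    resp = kms.generate_data_key(KeyId="alias/my-app-key", KeySpec="AES_256")
    plaintext_key = resp["Plaintext"]       # use locally to encrypt the data, then discard
    encrypted_key = resp["CiphertextBlob"]  # store alongside the encrypted data

    # ... encrypt the data locally with plaintext_key (e.g. using a client-side library) ...

    # 2. later, ask KMS to decrypt the stored data key to recover the plaintext key
    plaintext_key_again = kms.decrypt(CiphertextBlob=encrypted_key)["Plaintext"]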

Secrets Manager

Protect secrets needed to access applications, services and IT resources. Enables you to rotate, manage and retrieve database credentials, API keys and other secrets.

  • can configure VPC endpoints to keep traffic within the AWS network
  • can use client-side caching libraries to improve the availability and reduce the latency
Secret storage
  • key/value type: String or Binary (encrypted)
  • store API keys, OAuth tokens
  • charges apply per secret
Rotation
  • Rotation with built-in integration for RDS, Redshift and DocumentDB
  • can extend rotation to other secrets by modifying sample Lambda functions
  • automatic key rotation for some services (RDS)
    • for others use Lambda
Security
  • encrypts secrets at rest using KMS
  • transmits secret securely over TLS
  • fine-grained permissions
    • IAM
    • resource-based policies
Auditing and monitoring
  • integrates with CloudTrail and CloudWatch
  • can audit secret rotation and usage, including for third-party services and on-premises resources
SSM Parameter Store
  • No native key-rotation, can use custom Lambda
  • key/value type: String, StringList, SecureString (encrypted)
  • hierarchical keys
  • free for standard, charged for advanced
  • can store:
    • passwords
    • database strings
    • AMI IDs
    • license codes
    • parameter values
  • can store as plain text or encrypted data
  • can reference in scripts, commands, SSM documents and config automation workflows by using unique name of the parameter
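A retrieval sketch with boto3 for both services (secret and parameter names are placeholders); the SecureString parameter is decrypted on request:

    import json
    import boto3

    # Secrets Manager: encrypted secret value, optionally auto-rotated
    secrets = boto3.client("secretsmanager")
    secret = secrets.get_secret_value(SecretId="prod/db-credentials")   # placeholder name
    creds = json.loads(secret["SecretString"])

    # SSM Parameter Store: hierarchical keys, SecureString decrypted with WithDecryption
    ssm = boto3.client("ssm")
    param = ssm.get_parameter(Name="/prod/app/db-password", WithDecryption=True)
    password = param["Parameter"]["Value"]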

Other services

AWS Systems Manager
  • operations hub for AWS applications and resources
  • secure end-to-end management solution for hybrid and multi cloud environments
    • enables secure operations at scale
  • AppConfig: help create, manage, and deploy application config and feature flags
    • support controlled deployments to applications of any size
    • can use with applications hosted on:
      • EC2 instances
      • Lambda functions
      • containers
      • mobile applications
      • edge devices
    • include validators
AWS Systems Manager Session Manager
  • can manage EC2 instances, edge devices, on-premises servers and VMs
  • can use interactive one-click browser-based shell or CLI
  • secure and auditable node management without:
    • open inbound ports
    • maintain bastion hosts
    • manage SSH keys
  • useful for:
    • improve security and audit posture
    • reduce operational overhead by centralizing access control
    • reduce inbound node access
    • monitor and track managed node access and activity
    • close down inbound ports
    • allow connections to managed nodes that don’t have public IP
    • grant and revoke access from single location
    • one solution to users for Linux, macOS and Windows Server
    • want to connect to a managed node with just one click or from the CLI
      • no use of SSH keys
AWS Systems Manager Parameter Store: see the SSM Parameter Store notes under Secrets Manager.
Resource Access Manager (RAM)
  • enable to share resources easily and securely with any account of Organization
  • can share Subnets, License Manager configurations, Route 53 resolvers, etc.
  • eliminates the need to create duplicate resources in multiple accounts
  • create resource centrally in multi-account environment:
    • create a Resource Share
    • specify resources
    • specify accounts
  • Reduces Operational Overhead
  • Improves Security and Visibility: leverages existing policies and permissions (IAM)
    • comprehensive visibility into shared resources to set alarms and view logs through CloudWatch and CloudTrail
  • Optimize Costs: leverage licenses in multiple parts of company
  • no additional cost
Cloud Development Kit (CDK)
  • open-source software development framework for defining cloud infrastructure in code and provisioning it through CloudFormation
    • Infrastructure as Code (IaC)
  • CDK Construct Library: pre-written modular pieces of code (constructs)
    • can integrate to develop infrastructure quickly
    • reduce complexity required to define and integrate services together
  • CDK Toolkit: command line tool for interacting with CDK apps
    • create, manage and deploy CDK projects
  • can define constructs with programming languages:
    • support TypeScript, JavaScript, Python, Java, C#, .Net and Go
  • can compose constructs into stacks and apps
  • can deploy CDK apps to CloudFormation to provision/update resources
  • creating an app:
    • create app from template
    • initialize app cdk init
    • build the app (optional, to catch syntax and type errors)
    • BEST PRACTICE: synthesize one or more stacks (cdk synth) to create a CloudFormation template
      • catch logical errors
    • deploy stacks to account cdk deploy
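A minimal CDK v2 app sketch in Python (assuming aws-cdk-lib and constructs are installed; the stack and bucket names are arbitrary): cdk synth turns it into a CloudFormation template and cdk deploy provisions it:

    import aws_cdk as cdk
    from aws_cdk import aws_s3 as s3
    from constructs import Construct

    class MyStack(cdk.Stack):
        def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
            super().__init__(scope, construct_id, **kwargs)
            # a construct from the CDK Construct Library; synthesizes to an S3 bucket resource
            s3.Bucket(self, "DataBucket", versioned=True)

    app = cdk.App()
    MyStack(app, "MyStack")
    app.synth()  # emits the CloudFormation template (what cdk synth runs)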
AppSync
  • can synchronize mobile app data across devices and users
  • support for additional devices and data types
  • based on GraphQL
Serverless Application Repository
  • managed repository for serverless applications
  • store and share reusable applications
  • don’t need to clone, build, package or publish source code before deploying
  • can use pre-built applications in your serverless architectures
    • help reduce duplicated work
    • ensure organizational best practices
    • get to market faster
  • integration with IAM: resource-level control of each application
    • can publish built applications and share with specific accounts
    • publicly shared apps include link to source code
  • package applications with SAM template that defines resources used
  • no additional cost, pay for resources
Step Functions
  • coordinate the components of distributed applications as a series of steps in a visual workflow
  • define steps of workflow in the JSON-based Amazon States Language
  • visual console graphs each step
  • start an execution, the console highlights real-time status
  • managed workflow and orchestration platform
  • scalable and highly available
  • can create tasks, sequential steps, parallel steps, branching paths or timers
  • apps can interact with and update the workflow via the Step Functions API
  • built-in error handling: retry failed, timed-out tasks, catch specific errors, recover gracefully
  • automatic scaling: underlying compute automatically scales in response to changing workloads
  • execution event history: detailed event log (where and why)
  • high availability: built-in fault tolerance; multiple AZ
  • administrative security: IAM policies to control access
  • pay only for the transition from one step to the next (state transition)
  • metered by state transition regardless of how long each state persists (up to one year)
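A sketch with boto3 and the Amazon States Language (role ARN and Lambda ARN are placeholders): a two-state machine with built-in retry is created and then started:

    import json
    import boto3

    sfn = boto3.client("stepfunctions")

    definition = {
        "StartAt": "ProcessOrder",
        "States": {
            "ProcessOrder": {
                "Type": "Task",
                "Resource": "arn:aws:lambda:eu-west-1:111122223333:function:process",  # placeholder
                "Retry": [{"ErrorEquals": ["States.TaskFailed"], "MaxAttempts": 3}],
                "Next": "Done",
            },
            "Done": {"Type": "Succeed"},
        },
    }

    machine = sfn.create_state_machine(
        name="order-workflow",
        definition=json.dumps(definition),
        roleArn="arn:aws:iam::111122223333:role/sfn-role",  # placeholder
    )
    sfn.start_execution(stateMachineArn=machine["stateMachineArn"], input='{"orderId": 42}')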
Fault injection simulator
  • fully managed service for running fault injection experiments
  • improve application performance, observability and resiliency
  • used in chaos engineering, stressing an application in testing/production environment by creating disruptive events
  • helps create real-world conditions needed to uncover the hidden bugs, monitoring blind spots, performance bottlenecks in distributed systems
Trusted advisor
  • inspect environment and makes recommendations to:
    • save money
    • improve system availability and performance
    • help close security gaps
  • Basic or Developer Support plan
    • access all checks in the Service Limits category
    • access 6 checks in the Security category
  • Business, Enterprise On-Ramp or Enterprise Support plan
    • can use API (as well as console)
    • access all checks
    • can use CloudWatch Events to monitor status of checks
AWS Billing - Consolidated billing
  • use in AWS Organizations
  • billing and payment for multiple accounts
  • can track charges across multiple accounts
  • can download combined cost and usage data
  • can combine usage across accounts and share:
    • volume pricing discounts
    • RI discounts
    • Saving Plans
  • no extra charge
AWS Budgets
  • can use to track and take action on costs and usage
  • monitor aggregate utilization and coverage metrics for RI or Savings Plan
  • can enable simple-to-complex cost and usage tracking
    • setting monthly cost budget with fixed target amount to track all costs
      • alerts on actual and forecasted costs
    • setting monthly cost budget with variable target amount to track all costs
    • setting daily utilization to track RI or Savings Plan
  • update up to 3 times a day
  • updates occur 8-12 hours after previous
  • types of budgets:
    • cost budgets: how much to spend on a service
    • usage budgets: how much to use on services
    • RI utilization budgets: see if RI are unused or under-utilized
      • receive alert when RI usage falls below threshold
    • RI coverage budgets
    • Savings Plans utilization budgets
    • Savings Plans coverage budgets

Use cases table

Action → Service
Manage hybrid and multi-cloud environments AWS Systems Manager
Create, manage and deploy application config and feature flags AppConfig
Generate resource-based access policies AWS Policy Generator
Access VPC instances for management (SSH or RDP)
  • AWS System Manager Session Manager (recommended)
  • EC2 Bastion host
Speed up queries on DynamoDB table on non-key attributes DynamoDB Global Secondary Index
Node management without open inbound ports, maintain bastion host and manage SSH keys AWS Systems Manager Session Manager
Upload libraries to Lambda functions without including them in deployment package Lambda Layer
Compute unpredictable workloads (dev and test) EC2 On-Demand
Centralize access control AWS Systems Manager Session Manager
Define a rule on object transition to another storage class S3 Bucket Lifecycle rules
Create or update an alarm and associated with specified metric, math expression or anomaly detection model CloudWatch API
Share resources (Subnets, License Manager, configs, Route 53 resolvers) with accounts or Organizations AWS Resource Access Manager
Increase by 10 times the performance of DynamoDB DynamoDB Accelerator (DAX)
Create resource in multi-account environment AWS Resource Access Manager
Restrict user access to his records of a DynamoDB table IAM Condition
Map custom domain names to API Gateway custom regional API Route 53 Alias
Define cloud infrastructure in code using programming languages Cloud Development Kit (CDK)
Coordinate components of distributed applications as series of steps in visual workflow AWS Step Functions
Run fault injection experiments (improve performance, observability and resiliency) AWS Fault Injection Simulator
Accelerated log data feed intake Kinesis
Log API calls from SQS to S3 bucket CloudTrail
Avoid API to being overwhelmed by too many requests API Gateway Server-side throttling limits
Log calls from STS to S3 bucket CloudTrail
Real-time processing of streaming big data Kinesis Data Stream
Specify percentage of consumed provisioned throughput of DynamoDB at a point in time DynamoDB Target utilization
Real-time data analytics with SQL Kinesis Data Analytics
Log bucket and object-level actions CloudTrail
Enable table or GSI to increase provisioned read/write capacity to handle traffic variations without throttling DynamoDB Application Auto Scaling
Real-time analytics with existing business intelligence tools and dashboards Kinesis Firehose
Log actions taken by users, roles, services on S3 objects for auditing and compliance S3 Server Access Logging
Capture, transform and load streaming data into data store or analytics tools Kinesis Firehose
Measure backend responsiveness of API CloudWatch IntegrationLatency metric
Configure a fully automated, fault tolerant in-memory storage ElastiCache Redis Multi-AZ
Map custom domain names to VPC interface endpoints Route 53 Alias
Leverage CloudFront Edge Locations to transfer files over long distances between client and bucket S3 Transfer Acceleration
Monitor health of serverless app via execution status Lambda Destination
Enforce standardized tagging
  • AWS Config
  • EC2 custom scripts
Read stream records with distributed applications sharding workload Kinesis Client Library
Read-heavy database replication RDS Read Replicas
Create streams, reshard, put and get records in streams Kinesis Data Stream API
Enable SSL certificates on Application Load Balancer AWS Certificate Manager
Send, store and receive messages between software components SQS
Introduce a delay in processing of large distributed applications SQS Delay Queues
Provide temporary access to specific S3 object to those who don’t have AWS credentials S3 pre-signed URLs
Increase read performance of auction applications, gaming, retail sites or special sites DynamoDB Accelerator (DAX)
Rename an S3 object, or change its storage class or at-rest encryption S3 Copy
Update sale or stock control database before sending notification to confirm transaction SQS Delay Queues
Setup a global table with replicas in different Regions DynamoDB Cross Region Replication
Send, get or delete message that references message object stored in S3 bucket SQS Extended Client Library
Back-up or restore a database RDS Snapshot
Deny request with specific header or IP address to access S3 bucket S3 Bucket Policies
Store data from streams
  • DynamoDB
  • Redshift
  • S3
  • Elasticsearch
Automate release of Lambda function
  • CodePipeline
  • CodeDeploy
Checkpoint progress of stream DynamoDB
Prevent cross-site scripting attacks on APIs API Gateway Same Origin Policy
Customize CloudFront content, request and response at lowest network latency Lambda@Edge
Encrypt streams KMS
Bring publicly routable IPv4/IPv6 address range from on-premises to AWS EC2 BYOIP
Localize content and presenting in the language of users Route 53 Geo-location Routing Policy
Change S3 object metadata S3 Copy
In-transit message encryption SQS HTTPS
Support live streaming (real time event) CloudFront Web Distribution
Allow signed request to read object ACL S3 Object ACL
Allow authentication to APIs with OAuth, SAML or 3rd party auth Lambda Authorizer
Protect object against accidental deletion S3 Versioning
Server-side message encryption KMS
Debug and trace distributed applications using microservices X-Ray
Control who can read/write messages IAM Policies
Host server-bound software licenses that use metrics like per-core, per-socket or per-VM EC2 Dedicated host
Ensure client to be bound to an individual back-end instance (e.g. WebSocket) ALB sticky sessions
Cache in-memory with less management overhead DynamoDB Accelerator (DAX)
Protect distribution rights Route 53 Geo-location Routing Policy
Retrieve up to 500 metrics in a single request CloudWatch API
Control access to cache cluster without using VPC subnet groups ElastiCache Cache Security Groups
Delegate permissions for user/services without permanent credentials IAM Role
Copy an EBS volume EBS snapshot
Storage for frequently accessed big data at low cost EBS HDD Throughput Optimized
See the underlying reads or writes performed by a DynamoDB Transaction CloudWatch
Host static website S3 Bucket static website
Scale out ECS tasks using CPUUtilization metric ECS Step Scaling Policies
Measure overall responsiveness of API CloudWatch Latency metric
Define APIs as code
  • Swagger
  • Open API
Upload dependency to Lambda function larger than 50MB S3
Understand effects of IAM policies IAM Policy Simulator
Specify maximum permissions for an organization AWS Organizations service control policies (SCP)
Increase IOPS redundancy at same performance EBS RAID 1
Execute advanced business intelligence and perform complex data analysis queries RedShift
Setup direct interaction between client and Lambda function through an API API Gateway AWS_PROXY Integration
Storage with low-latency for I/O intensive databases or boot volumes EBS SSD Provisioned IOPS
Allow restricted resources (e.g. fonts) to be requested from another domain outside through an API API Gateway Cross-Origin Resource Sharing
Scale an ELB target group Auto Scaling Group
Send notification to SNS topic or invoke Auto Scaling policy action on metric sustained state change CloudWatch Alarms
Take advantage of unused capacity in the cloud EC2 Spot Instance
Move S3 object across location S3 Copy
Route traffic based on location of resources Route 53 Geo-proximity Routing Policy
Identify resources shared with external entity IAM Access Analyzer
Attach boot volume for low latency apps for dev and test EBS SSD General Purpose
Automatically delete items in DynamoDB table DynamoDB TTL
Store BLOB data with low I/O rate RDS
Validate policies (against syntax, best practices or custom checks) IAM Access Analyzer
Determine request, IP address, who made the request and when on EC2 instance CloudTrail
Generate policies based on access activity in CloudTrail logs IAM Access Analyzer
Storage with low latency but don’t need persistence on instance termination EC2 Instance Store
Configure throttling and quota limits enforced on individual client API keys API Gateway Usage plans
Enable SSL on Elastic Beanstalk serverless application
  • AWS Certificate Manager
  • Elastic Beanstalk CLI
Perform query on DynamoDB table primary key on different sort key DynamoDB Local Secondary Index
Load data in ElastiCache cache only when necessary ElastiCache Lazy Loading
Control which services can be accessed (permissions guardrails) Access Advisor
Define a scaling policy to scale basing on set of step adjustments Auto Scaling Step Scaling Policy
Store infrequently accessed data in a durable, immediately available class S3 Standard-IA
Upload files larger than 100MB S3 Multipart Upload
Update dashboard to least amount of delay from 1KB SQS messages sent seldom SQS Long polling
Allow consuming media files before file finished download (media streaming) CloudFront RTMP Distribution
Get last accessed information for accounts or organizations Access Advisor
Serve Web Socket APIs API Gateway WebSocket API
Audit history of changes to API CloudTrail
Allow any authenticated user to read object data and metadata S3 Object ACL
Monitor HTTP/HTTPS requests to control access to CloudFront content AWS WAF
Restrict access to service with service control policies (SCP) Access Advisor
Grant access to bucket and its objects to anyone on internet S3 Bucket Policies
Need database for massively scaled applications and globally dispersed users DynamoDB Global tables
Retrieve archived data in milliseconds S3 Glacier Instant Retrieval
Encrypt RDS instances and snapshots at rest KMS
Deliver real-time stream of events following changes in resources to EC2 instances, Lambda functions or streams CloudWatch Events
Allow secure access to resources without creating IAM user
  • Cognito User pools (recommended)
  • IAM Identity Federation
Authenticate with external or custom IDP (JWT) Cognito User pools
Define a scaling option for scale based on real-time metrics Auto Scaling Dynamic scaling option
Improve performance by routing to Region with lowest latency Route 53 Latency Routing Policy
Make coordinated, all-or-nothing changes to multiple items in a DynamoDB table DynamoDB Transaction
Execute joins or complex transactions on database RDS
Monitor request, source IP etc. to a CloudFront distribution CloudTrail
Increase IOPS performance and redundancy EBS RAID 10
Encrypt an EBS volume EBS snapshot
Keep ElastiCache cache always update at every database write ElastiCache Write Through
Perform SQL-like JOIN operations on DynamoDB tables Apache Hive on EMR
Protect from DDoS attacks
  • Elastic Load Balancer
  • CloudFront
Storage for less frequently accessed colder data at low cost EBS HDD Cold
Enable S3 to write server access logs (S3 Log Delivery Group) S3 Bucket ACL
Auto scale ECS tasks based on existing Auto Scaling group ECS Cluster Auto Scaling Capacity Provider
Encrypt S3 data providing audit trails on who/when used CMK S3 SSE-KMS
Control access to APIs with usage plans Lambda authorizers
Store web session information so if server is lost, session info can be recovered by next server ElastiCache Redis
Handle millions of requests/second at low latency on network Network Load Balancer
Store frequently accessed data in a durable, immediately available class S3 Standard
Restrict access to S3 bucket, prevent bypassing CloudFront CloudFront Origin Access Identity
Route to a CloudFront distribution or an Elastic Load Balancer Route 53 Alias
Limit specific client’s requests to an API
  • API Gateway Per-client throttling limits
  • API Gateway usage plans
Create unique identities for users and authenticate them with IDP Cognito Identity pools
Disable a user access key IAM Identity Center
Detect whether a stack’s actual configuration differs from expected CloudFormation Drift detector
Reduce number of calls to backend of APIs improving latency API Gateway Cache
Configure database caching in front of RDS ElastiCache Memcached
Deliver real-time stream of events following changes in resources to ECS tasks, pipelines, SNS topic or SQS queues CloudWatch Events
DNS resolution for hybrid clouds Route 53 Resolver
Host on virtualized instance EC2 Dedicated instance
Ensure specified number of tasks constantly running and reschedule them on fail ECS Service scheduler
Deploy a multi-region, multi-master database DynamoDB Global tables
Auto scale based on number of messages in a queue per EC2 instance Auto Scaling - Scaling based on SQS
Certificate on Regions not supporting AWS Certificate Manager IAM server certificates
Configure in-memory storage for leaderboards ElastiCache Redis
Interactively search and analyze log data in CloudWatch Logs CloudWatch Logs Insight
Configure long term log retention CloudWatch Logs
Setup direct interaction between client and HTTP endpoint through an API API Gateway HTTP_PROXY Integration
Verify IAM permissions passed by a caller on APIs IAM Identity-Based Policies
Mitigate the drawbacks of the ElastiCache cache strategies ElastiCache TTL
Serve APIs reducing connection overhead for small number of clients with high demand API Gateway Regional Endpoint
Request temporary limited-privilege credentials for IAM or federated users
  • Cognito Identity pools (recommended)
  • IAM Security Token Service
Route by specifying a weight per IP address Route 53 Weighted Routing Policy
Log API calls, latency and error rates CloudWatch
Request temporary security credentials to access backend resources behind API Gateway Cognito
Increase/decrease number of ECS tasks based on CloudWatch alarm ECS Step Scaling Policy
Dynamic temporary credentials
  • IAM Security Token Service
  • Long-term key credentials
Create policies that route traffic based on latency, load or geo-proximity Route 53 Traffic flow
Build a resilient disaster recovery strategy for database
  • RDS Multi-AZ
  • RDS Read Replicas
Identify unused access IAM Access Analyzer
Perform authoritative DNS within VPC without exposing DNS records Route 53 Private DNS
Serve real time streaming with a media player
  • CloudFront Web Distribution
  • CloudFront RTMP Distribution
Check status of IP address or domain names or CloudWatch alarm Route 53 Health Checks
Find items in DynamoDB table by primary key DynamoDB Query
Cache complex data types ElastiCache Redis
View resource utilization CloudWatch
Improve latency and throughput for read-heavy/compute-intensive workloads ElastiCache
Configure multi-thread or multi-core in-memory cache ElastiCache Memcached
Define rules when log expires or documents are frequently accessed on certain period S3 Bucket Lifecycle rules
Add domain name to a CloudFront distribution Route 53 Alias
Prevent Auto Scaling to scale-in and terminate EC2 instances Auto Scaling termination policy
Retries network requests on DynamoDB on network errors DynamoDB Exponential Backoff
Configure in-memory cache that can be encrypted ElastiCache Redis
Cache in-memory always strongly consistently and optimized for DynamoDB DynamoDB Accelerator (DAX)
Write DynamoDB Stream log to CloudWatch logs Lambda
Define a scaling policy to scale keeping specific target value Auto Scaling Target Tracking Policy
Monitor CloudTrail logs in real-time CloudWatch Logs
Retrieve archived data in minutes/hours for disaster recovery S3 Glacier Flexible Retrieval
Don’t want to specify provisioned capacity of DynamoDB DynamoDB On-Demand Capacity
Support connection of firewalls or IPS systems on Layer 3 and 4 ISO/OSI Gateway Load Balancer
Push updates and synchronize user data across multiple devices
  • Cognito Push Synchronization
  • Cognito Sync
Increase/decrease number of ECS tasks based on CloudWatch metric ECS Target Tracking Scaling Policy
Increase IOPS performance at same redundancy EBS RAID 0
Need up to 64000 IOPS for a volume storage EBS SSD Provisioned IOPS
Check encryption status of EBS volumes AWS Config
Cache objects like database queries ElastiCache Memcached
Search and filter log data coming into CloudWatch Logs CloudWatch Logs Metric filters
Temporary storage of information changing frequently (buffers, caches, scratch data, etc.) EC2 Instance Store
Notify to SNS, SQS, or Lambda an event on objects in S3 S3 Event notifications
Route to a DNS name Route 53 CNAME
Push updates and synchronize user data across multiple devices and users AppSync
Allow all authenticated users to list objects in a bucket S3 Bucket ACL
Ensure an instance is removed from load balancer when unhealthy instead of terminated by Auto Scaling Group Auto Scaling ELB health checks
Publish a single metric data point CloudWatch API
Analyze CloudFront access logs AWS Athena
Route to an S3 Bucket as website Route 53 Alias
Configure a landing spot for streaming sensor data on factory floor ElastiCache
Serve API endpoint for geographically distributed clients around the world API Gateway Edge-optimized Endpoint
Send SNS notification when Auto scaling event terminates Auto Scaling lifecycle hooks
Store and persist session data DynamoDB
Avoid to be charged after expiration of object storage S3 Bucket Lifecycle rules
DNS querying between on-premises and AWS over private connections Route 53 Resolver
Transfer domain from Route 53 to another registrar AWS Support
Centralize logs from systems, applications and services CloudWatch Logs
Push Cognito data change to Kinesis stream in real-time Cognito streams
Enable long-running/lived connections (for WebSocket) Network Load Balancer
Execute Lambda function in response of Cognito events before sync other devices Cognito events
Store infrequently accessed data in a less resilient, single-AZ class at lower cost S3 One Zone-IA
Accept a write/update to a DynamoDB table only if conditions are met DynamoDB API Conditional writes
Collect system-level metric from EC2 instance
  • CloudWatch Log Agent
  • X-Ray
Replicate bucket across Regions S3 Cross Region Replication
Serve APIs only from a VPC using ENI API Gateway Private Endpoint
Control permissions to invoke API from specific users, source IPs, VPC endpoint, etc. IAM Resource-Based Policies
Specify capacity of DynamoDB DynamoDB Provisioned Capacity
Cache data from dynamically generated web pages ElastiCache Memcached
Control CDN content expiration time CloudFront TTL
Resolve apex/naked domain names Route 53 Alias
Retrieve archived data within 12 hours S3 Glacier Deep Archive
Offload workload of a database RDS Read Replicas
Remove session data or event logs from DynamoDB table DynamoDB TTL
Manage repository for serverless applications Serverless Application Repository
Allow to encrypt all objects in a bucket S3 Bucket Policies
Configure live real-time dashboard displays ElastiCache
Use pre-built applications in serverless architectures Serverless Application Repository
Compute with discounts reserving 1 or 3 years of instance EC2 Reserved Instance
Route randomly responding to DNS queries with up to 8 healthy records Route 53 Multi-value Answer Routing Policy
Configure in-memory store for high frequency counters ElastiCache Memcached
Capture and log time-ordered sequence of item-level modifications in DynamoDB table DynamoDB Stream
Offload S3 request rate CloudFront Edge Location
Validate token in header of an API request Lambda Authorizer
