AWS Step Functions
What are AWS Step Functions?
AWS Step Functions is a serverless orchestration service that lets you coordinate multiple AWS services and microservices into resilient workflows. You define state machines using Amazon States Language (JSON) to sequence tasks, handle retries, parallelism, and errors—allowing you to focus on business logic while AWS manages state, durability, and execution.
Workflow types
- Standard Workflows – ideal for long-running (days to years), auditable, and exactly‑once workflows with detailed execution history.
- Express Workflows – designed for high‑volume, event‑driven workloads with millisecond latency and short executions (up to 5 minutes), charging per request and duration.
Key features
- Declarative state machines (choices, parallels, maps, waits).
- Built‑in error handling with retry/backoff.
- Integration with 11+ AWS services out‑of‑the‑box (Lambda, ECS, Batch, SageMaker, DynamoDB, SNS, SQS, Glue, etc.).
- Visual workflow editor and execution logging in the console.
- Automatic state management, scheduling, and concurrency control.
- Child workflows and dynamic Parallel/Map states for large‑scale processing.
Use cases
- Order processing pipeline – validate order, charge payment, update inventory, notify customer; errors automatically retried, failures sent to DLQ.
- Video transcoding – fan‑out a video file to parallel Step Functions Map state invoking Lambda that triggers Elastic Transcoder, then merge results.
- Machine learning training – orchestrate data preprocessing on EMR, training in SageMaker, and model deployment with Lambda.
- IoT data processing – Express workflow triggered by Kinesis events to filter, enrich, and store telemetry.
Example state machine (JSON)
{
"Comment": "Simple order workflow",
"StartAt": "ValidateOrder",
"States": {
"ValidateOrder": {
"Type": "Task",
"Resource": "arn:aws:lambda:us-east-1:123:validate",
"Next": "ChargePayment",
"Retry": [{"ErrorEquals": ["Lambda.ServiceException"],"IntervalSeconds":2,"MaxAttempts":3}]
},
"ChargePayment": {"Type":"Task","Resource":"arn:aws:lambda:charge","Next":"UpdateInventory"},
"UpdateInventory": {"Type":"Task","Resource":"arn:aws:lambda:update","End":true}
}
}
Best practices
- Keep individual states idempotent and stateless.
- Use
HeartbeatSecondsandTimeoutSecondsto detect hung tasks. - Pass small payloads or S3 references to avoid state size limits (32 KB).
- Use Maps or Parallel states for batch/parallel workloads; handle large iterations with
MaxConcurrency. - Monitor metrics (
ExecutionsStarted,ExecutionsFailed,ThrottledEvents) and set CloudWatch alarms. - Archive history for Standard workflows if retention > 90 days is required (via CloudWatch Logs).
Service integrations
Lambda, ECS/Fargate, Batch, DynamoDB, SQS, SNS, Glue, SageMaker, EMR, CodeBuild, Step Functions (nested), Amazon API Gateway (for callbacks), Activity workers (for on‑prem tasks).
Pricing overview
- Standard: $0.025 per 1,000 state transitions.
- Express: $1.00 per 1M requests plus $0.000004 per GB‑second duration.
(Always verify current pricing on AWS site.)
Related Razorops articles
- Amazon Simple Workflow Service (SWF)
- AWS Lambda
- AWS EventBridge
- AWS Batch
- Top 50 AWS DevOps Interview Questions and Answers
Top 20 AWS Solutions Architect interview FAQs
-
Q: What is a VPC and why is it important? A: A Virtual Private Cloud isolates resources at the network level, providing subnets, routing, security groups, and private connectivity.
-
Q: Security groups vs NACLs—differences? A: Security groups are stateful and attached to instances; NACLs are stateless and apply to subnets.
-
Q: How would you design for high availability? A: Use multiple AZs, autoscaling groups, load balancers, and managed services (RDS Multi-AZ).
-
Q: What is Auto Scaling and how does it work? A: Auto Scaling adjusts EC2 capacity using policies, health checks, and scheduled scaling to meet demand and maintain availability.
-
Q: What is IAM best practice? A: Use least privilege, roles for services, enable MFA, rotate credentials, and use centralized identity providers.
-
Q: How does S3 provide durability? A: S3 replicates data across multiple AZs and performs continuous integrity checks (11 9s durability for standard storage).
-
Q: When to use SQS vs SNS? A: Use SQS for decoupled queue processing; use SNS for pub/sub notifications to multiple subscribers.
-
Q: What is cross-region replication (CRR) for S3? A: CRR asynchronously copies objects to another region for DR and compliance.
-
Q: What is an Auto Scaling group? A: ASG maintains EC2 fleet capacity based on policies, scales automatically, and replaces unhealthy instances.
-
Q: How does Route 53 routing policy work (simple, weighted, latency)? A: Simple returns a single resource; weighted distributes traffic by weight; latency routes to the lowest-latency region.
-
Q: When to use RDS Multi-AZ vs Read Replica? A: Multi-AZ for HA and failover; Read Replica for read scaling and analytics.
-
Q: What is CloudFormation and why use it? A: CloudFormation is AWS’s IaC service to provision resources declaratively and track drift.
-
Q: How do you optimize costs in AWS? A: Rightsize, use spot/reserved/savings plans, lifecycle policies for storage, and monitor spend with Cost Explorer.
-
Q: Explain S3 consistency model. A: S3 provides strong read-after-write consistency for new PUTs and updates across all regions.
-
Q: IAM best practices? A: Apply least privilege, use roles, enable MFA, rotate credentials, and use centralized identity providers.
-
Q: How do you design a data lake on AWS? A: Use S3 for storage, Glue Data Catalog for metadata, Lake Formation for access control, and Athena/EMR/SageMaker for processing.
-
Q: Differences between ALB and NLB? A: ALB is Layer 7 with routing features; NLB is Layer 4 optimized for throughput and static IPs.
-
Q: What is a VPC endpoint and when to use it? A: VPC endpoints provide private connectivity to AWS services without internet.
-
Q: How would you perform disaster recovery in AWS? A: Choose a DR strategy (Backup & Restore, Pilot Light, Warm Standby, Multi-Site) based on RTO/RPO and automate failover when possible.
-
Q: How do you monitor and troubleshoot AWS infrastructure? A: Use CloudWatch metrics/logs, CloudTrail for API audits, X-Ray for tracing, VPC Flow Logs for network debugging, and third-party monitoring/APM tools.
Enjoyed this article? Share it.