AWS Project Task List Based on the AWS Well-Architected Framework
If you have ever used the AWS Well-Architected Tool to perform a Well-Architected review of your workload, you have probably come across questions that made you go “That never even occurred to me” or “I never knew something like that even existed in AWS.”
Instead of doing a review after the fact and discovering such things, wouldn’t it be great if someone told you to do all that before you started the project? Wouldn’t it be nice to have an exhaustive checklist of actionable items that you can plan for, back in the planning stage of your project?
This article is an attempt at creating such a checklist of items that you can pick up when starting a project, prioritize based on your requirements and available time, put in your project management tool as actionable tasks, estimate effort and cost against, and finally implement!
If you have a standardized project management tool, you could templatize this checklist, bake it into the tool, embed estimates with the user stories and tasks, and refine the estimates over time, so you get all this ready to go with one click of the “New Project” button.
This list is based entirely on the questions asked by the AWS Well-Architected Framework. You should go through this list only after you have an initial draft of your AWS architecture diagram ready. You can refine the diagram based on the outcomes of this checklist.
For example, you should have decided whether to use EC2, Beanstalk, or Lambda for your compute, RDS or DynamoDB for your database, and so on. If you have no idea what your finished deliverable will look like or which AWS services it will use, this list will force you to think about that anyway, so it’s best to have the architecture ready first. 😊
There are questions in the Well-Architected Framework that call for org-wide changes, like having a defined mechanism for communication between the dev and ops teams. This list does not include such items; it is assumed that all such things are already in place. This list is specific to items you should think about just before you embark on a new AWS project.
Let’s begin.
Operations
- Determine your priorities. Evaluate customer requirements. Meet with stakeholders, including business, dev, and ops teams.
- Evaluate governance requirements:
- What guidelines does your organization define for AWS usage? Are you limited to certain regions/services? Will that work for your project?
- How will you enforce these org policies in your project? Manual review? During/after the project? Can it be automated, like enforcing encryption of all file and database storage using IAM policies or AWS Config rules? (See the Config sketch after this list.)
- How will you know if and when there is a change in these org guidelines? Identify the individual(s) responsible for this and add this task item to their clearly defined list of responsibilities.
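For example, here’s a minimal Python (boto3) sketch of enabling two AWS Config managed rules that flag unencrypted storage. The rule identifiers are real managed rules; the names you assign and the overall approach are just one way to automate this guardrail:

```python
import boto3

config = boto3.client("config")

# Flag S3 buckets without default server-side encryption,
# and EBS volumes that are not encrypted.
for name, identifier in [
    ("s3-encryption-enabled", "S3_BUCKET_SERVER_SIDE_ENCRYPTION_ENABLED"),
    ("ebs-volumes-encrypted", "ENCRYPTED_VOLUMES"),
]:
    config.put_config_rule(
        ConfigRule={
            "ConfigRuleName": name,
            "Source": {"Owner": "AWS", "SourceIdentifier": identifier},
        }
    )
```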
- Evaluate compliance requirements:
- Does your app deal with the kind of data that mandates compliance with regulations or industry standards? HIPAA, PCI, or just GDPR?
- How will you enforce adherence to these in your project? How will you test adherence? Can enforcement/testing be automated?
- How will you satisfy auditors of your compliance with them? Get the relevant docs from AWS Artifact, and plan for what needs to be added to them.
- Evaluate threat landscape:
- What threatens your business? Competition? Risks and liabilities? Operational risks? InfoSec threats?
- Catalog these in a risk registry and focus your efforts based on them.
- Assign clearly defined resource ownership:
- Who owns the application, the workload, individual infrastructure components?
- Where is this ownership info maintained? AWS resource tags maybe?
- Enforce ownership tags on AWS resources using IAM policies or Config rules.
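One way to enforce this is an IAM policy that denies resource creation unless an ownership tag is supplied. A minimal sketch, assuming an “Owner” tag key and EC2 launches as the example action:

```python
import json

# Deny launching EC2 instances unless an "Owner" tag is supplied
# in the request. Attach to the roles/groups that create infra.
enforce_owner_tag = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Deny",
            "Action": "ec2:RunInstances",
            "Resource": "arn:aws:ec2:*:*:instance/*",
            "Condition": {"Null": {"aws:RequestTag/Owner": "true"}},
        }
    ],
}
print(json.dumps(enforce_owner_tag, indent=2))
```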
- Plan to implement application telemetry:
- Look at every piece of your architecture. Think about what metrics it should emit to indicate the health of your app. If the default AWS metrics cover your needs, great. Otherwise, plan to add custom metrics to your app.
- Introduce user activity telemetry. Emit metrics to monitor user activity like click streams, transaction states, etc. Understanding your app’s usage patterns will help you plan features down the line.
- Every infra component emits its own metrics, but also have components emit custom metrics that show the health of their dependencies. EG: If an EC2 instance calls RDS, emit RDS response-time metrics from the EC2 instance and set alarms for when they get too high (see the sketch after this list).
- Implement traceability. EG: Use X-Ray to trace a single incoming HTTP request through API Gateway, Lambda, DynamoDB, and back.
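As a sketch of the dependency-metric idea above, here’s how an app on EC2 could time its database calls and publish the latency as a custom CloudWatch metric with Python (boto3). The namespace, metric name, and dimension are hypothetical:

```python
import time
import boto3

cloudwatch = boto3.client("cloudwatch")

def timed_db_call(run_query):
    """Run a database query and publish its latency as a custom metric."""
    start = time.monotonic()
    result = run_query()
    elapsed_ms = (time.monotonic() - start) * 1000
    cloudwatch.put_metric_data(
        Namespace="MyApp/Dependencies",  # hypothetical namespace
        MetricData=[{
            "MetricName": "RdsResponseTime",
            "Value": elapsed_ms,
            "Unit": "Milliseconds",
            "Dimensions": [{"Name": "Dependency", "Value": "orders-db"}],
        }],
    )
    return result
```

You can then set a CloudWatch alarm on this custom metric just like on any built-in one.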
- Decide on which version control system to use: CodeCommit, GitHub, GitLab, Bitbucket, etc. Plan to version control everything, app code, infra templates, etc.
- Plan to build a CI/CD pipeline. Include as much automated testing in it as possible. Automate rollbacks for failed deployments.
- ALL infra changes must be done via Infra as Code. No exceptions. Run changes to IaC templates through the same rigorous code review process that you use for your app code, like using pull requests.
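To illustrate, here’s a minimal AWS CDK v2 (Python) sketch that defines an encrypted, versioned, non-public S3 bucket entirely in code; the stack and bucket names are placeholders:

```python
import aws_cdk as cdk
from aws_cdk import aws_s3 as s3

class StorageStack(cdk.Stack):
    def __init__(self, scope, construct_id, **kwargs):
        super().__init__(scope, construct_id, **kwargs)
        # Every setting lives in code and goes through code review.
        s3.Bucket(
            self, "DataBucket",
            versioned=True,
            encryption=s3.BucketEncryption.S3_MANAGED,
            block_public_access=s3.BlockPublicAccess.BLOCK_ALL,
        )

app = cdk.App()
StorageStack(app, "StorageStack")
app.synth()
```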
- Automate patch management, like using SSM Patch Manager for EC2 instances.
- Isolate environments cleanly, like using separate AWS accounts controlled by Control Tower for dev and prod.
- Think about using techniques like canary deployment to prod instead of all-at-once. Automate this as well using canary metrics and rollbacks.
- Do you need blue/green deployments? How will you automate it? How will you determine the switchover/rollback points?
- Centralize logging and monitoring. Send logs from all components to CloudWatch, Elasticsearch, or any other preferred location. Depending on your choice of infra components, this might need additional configuration like CloudWatch agent on EC2s.
- Plan to create runbooks/playbooks, preferably automated, for every operational event you can think of.
Security
- If you’re starting your project with a clean slate, plan for all of the following:
- New AWS account creation, with properly defined billing, operations, and security contacts as per best practices.
- Control Tower setup. SSO setup: federate with the customer’s existing identity infra or create new identities.
- New infra security setup: strong password policies, MFAs everywhere, CloudTrail enabled everywhere and locked down, AWS Config rules enabled and alerting with automated remediation if possible, notifications from Trusted Advisor, Security Hub, etc.
- Clearly define who should be able to access what. Automate this as much as possible. EG: IAM policies can look at a user’s groups or resource tags to allow access.
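A minimal sketch of the tag-based (ABAC) approach: allow actions only when the resource’s tag matches the calling principal’s tag. The tag key and actions here are assumptions:

```python
import json

# Allow start/stop only on instances whose "Project" tag matches
# the caller's own "Project" tag (attribute-based access control).
abac_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["ec2:StartInstances", "ec2:StopInstances"],
            "Resource": "*",
            "Condition": {
                "StringEquals": {
                    "aws:ResourceTag/Project": "${aws:PrincipalTag/Project}"
                }
            },
        }
    ],
}
print(json.dumps(abac_policy, indent=2))
```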
- Establish an emergency access process. If devs need prod access during an event, have an appropriate IAM role with the right permissions ready to go.
- Define permission guardrails. For example, if your project team uses 2 AWS accounts in an AWS Organization, put them in an Organizational Unit (OU) and assign them Service Control Policies (SCPs) that deny access to resources nobody will ever need, like most AWS regions, exotic AWS services, etc.
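A minimal sketch of such an SCP, created via Python (boto3). The allowed regions and the exempted global services are illustrative; adjust them to your org:

```python
import json
import boto3

# Deny all activity outside the regions this project actually uses.
# Global services (IAM, Organizations, STS, Support) are exempted.
region_scp = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Deny",
        "NotAction": ["iam:*", "organizations:*", "sts:*", "support:*"],
        "Resource": "*",
        "Condition": {
            "StringNotEquals": {"aws:RequestedRegion": ["eu-west-1", "us-east-1"]}
        },
    }],
}

orgs = boto3.client("organizations")
orgs.create_policy(
    Name="restrict-regions",
    Description="Limit activity to approved regions",
    Type="SERVICE_CONTROL_POLICY",
    Content=json.dumps(region_scp),
)
```

Attach the resulting policy to the project OU rather than to individual accounts.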
- Use network boundaries to separate resources and control traffic at every boundary. EG: Keep EC2 and databases in separate subnets, preferably private subnets, and open security groups only for necessary ports.
- Consider deploying IDS/IPS with VPC traffic mirroring that can automatically take action on malicious network traffic.
- Consider Web Application Firewalls for all your application/API endpoints. Look for an auto-updating WAF solution that suits your budget/automation requirements.
- Use Amazon Inspector to frequently scan for vulnerabilities and GuardDuty to detect threats; patch findings promptly.
- Reduce attack surface. Harden operating systems. Minimize open ports, external dependencies, third-party libraries, etc.
- Enable action at a distance. Avoid direct SSH to EC2 instances; use SSM documents instead.
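For example, a Python (boto3) sketch of running a shell command through SSM Run Command instead of SSH; the instance ID is hypothetical:

```python
import boto3

ssm = boto3.client("ssm")
INSTANCE_ID = "i-0123456789abcdef0"  # hypothetical

# Run a shell command on the instance without opening port 22 at all.
response = ssm.send_command(
    InstanceIds=[INSTANCE_ID],
    DocumentName="AWS-RunShellScript",
    Parameters={"commands": ["uptime", "df -h"]},
)
command_id = response["Command"]["CommandId"]

# Fetch the output later; every invocation is logged and auditable.
output = ssm.get_command_invocation(CommandId=command_id, InstanceId=INSTANCE_ID)
print(output["StandardOutputContent"])
```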
- Validate software integrity. Introduce code signing and validation. Include automated static/dynamic code analysis in CI/CD pipeline using tools like Veracode.
- Classify your data. Is your S3 bucket storing PII? How is it protected? Are you storing legal documents? Do they need to be locked for the next 7 years? How will you ensure that no one can go and assign a public bucket policy to that bucket?
- Automate continuous data identification and classification using services like Macie.
- What about keys? Encrypt everything, of course, but what about the keys? Are you OK using AWS-managed keys in KMS? If not, plan for your own keys, their rotation, and their management. And please don’t reuse the same keys across data sets.
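A minimal Python (boto3) sketch of creating a dedicated customer-managed KMS key per data set, with automatic rotation enabled; the description and alias are placeholders:

```python
import boto3

kms = boto3.client("kms")

# One customer-managed key per data set, rotated automatically.
key = kms.create_key(Description="orders-db encryption key")
key_id = key["KeyMetadata"]["KeyId"]
kms.enable_key_rotation(KeyId=key_id)

# A friendly alias so application code never hard-codes key IDs.
kms.create_alias(AliasName="alias/orders-db", TargetKeyId=key_id)
```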
- Plan for game days to test your incident response processes.
Reliability
- Plan for service quotas and constraints. Estimate how much you’ll need and ask for more if needed. Think cross-account and cross-region as well. Architect around fixed quotas. Consider quota usage spikes during failovers.
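You can check and raise quotas programmatically via the Service Quotas API. A minimal Python (boto3) sketch using the real EC2 On-Demand vCPU quota code; the desired value is illustrative:

```python
import boto3

quotas = boto3.client("service-quotas")

# L-1216C47A = "Running On-Demand Standard instances" (vCPU quota).
quota = quotas.get_service_quota(ServiceCode="ec2", QuotaCode="L-1216C47A")
print("Current vCPU quota:", quota["Quota"]["Value"])

# Request an increase well before launch day, not during an incident.
quotas.request_service_quota_increase(
    ServiceCode="ec2", QuotaCode="L-1216C47A", DesiredValue=256.0
)
```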
- Ensure high availability at every layer:
- At the application’s public endpoints, consider using CDNs or managed AWS services that are highly available by default, like API Gateway.
- At compute layers like EC2, use auto-scaling groups spanning 2 or more AZs.
- At database layers, use capabilities like RDS’s built-in multi-AZ feature.
- If connecting to AWS privately (VPN or Direct Connect), provision redundant tunnels.
- Watch out for running out of IP addresses in subnets, especially subnets hosting load balancers (they auto-scale), auto-scaling EC2 groups, or container workloads (which can consume a lot of IPs).
- Architect using a hub-and-spoke pattern instead of many-to-many. EG: Use Transit Gateway from the get-go even if you’re only connecting 2 sites for now.
- Plan the network well, not just for now but for future expansion as well. No overlapping CIDRs anywhere.
- Segment your workload. Many small pieces, each specializing in one task, is better than one monolith trying to do everything. Clearly define the API service contract between these so teams can build independently.
- See whether parts of your system that you thought must be real-time can instead be scheduled batch jobs. It feels natural when you’re building with Kinesis to just do the analysis right there on the stream, but think about whether you really need that. It might be cheaper and more efficient to run the analysis as a batch job.
- Implement loosely coupled dependencies. EG: Instead of calling Lambdas from Lambdas, use SQS in between.
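A minimal sketch of that decoupling: the producer Lambda enqueues work in SQS instead of invoking the consumer directly, so a slow or down consumer doesn’t fail the producer. The queue URL and event shape are hypothetical:

```python
import json
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/orders"  # hypothetical

def handler(event, context):
    """Producer Lambda: enqueue work instead of calling the consumer.
    If the consumer is down, messages simply wait in the queue."""
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=json.dumps({"order_id": event["order_id"]}),
    )
    return {"status": "queued"}
```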
- Make responses idempotent. The same request should produce the same result irrespective of how many times it is issued.
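One common way to get idempotency is a DynamoDB conditional write keyed on a request ID: a replayed request fails the condition and is skipped. A minimal sketch with a hypothetical table name:

```python
import boto3
from botocore.exceptions import ClientError

table = boto3.resource("dynamodb").Table("processed-requests")  # hypothetical

def process_once(request_id, do_work):
    """Record the request ID with a conditional write; replays of the
    same request fail the condition and are skipped."""
    try:
        table.put_item(
            Item={"request_id": request_id},
            ConditionExpression="attribute_not_exists(request_id)",
        )
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return "duplicate request, already processed"
        raise
    return do_work()
```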
- Implement graceful degradation. Turn hard dependencies into soft dependencies. EG: If the microservice behind the Netflix homepage is down or slow, Netflix serves up a cached response instead of bringing down the entire page.
- Throttle requests. Even if you trust your caller 100%, set a sensible limit and throttle requests that exceed it.
- Set timeouts everywhere. We never want a component waiting on another component indefinitely.
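With boto3, for instance, you can set tight timeouts and bounded retries on every AWS client; the values below are illustrative, not recommendations:

```python
import boto3
from botocore.config import Config

# Never let an SDK call hang indefinitely: short connect/read timeouts
# plus a bounded, standard retry mode on every client.
tight = Config(
    connect_timeout=3,   # seconds
    read_timeout=5,      # seconds
    retries={"max_attempts": 3, "mode": "standard"},
)
dynamodb = boto3.client("dynamodb", config=tight)
```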
- Make services stateless as much as possible. Offload state to ElastiCache / DynamoDB.
- Automate backups at every layer (EBS, S3, etc). Test backups regularly.
- Plan for disaster recovery. Do you need a copy of your infra in another region? Should it be pilot light, warm, or active-active?
Performance
- Know the services at your disposal and use the best one for the job. EG: Use instance store on EC2 when you need extremely high IOPS, instead of EBS.
- Check out reference architectures. Chances are that someone has tried to implement parts of your system before you; look for reference architectures available online and learn from others’ mistakes.
- Emphasize right-sizing your compute resources. Actively monitor for idle compute and reduce it. If migrating from on-prem, analyze compute utilization on-prem to determine the correct size in the cloud.
- Utilize automated elasticity in the cloud, like auto-scaling for EC2s.
- Evaluate storage solutions available based on file size, caching, access patterns, latency, throughput, and persistence. Pick the right one for your use case.
- Don’t forget about your network’s impact on performance. For latency-sensitive workloads, consider Local Zones or Wavelength Zones. Examples:
- If your instances communicate across AZs or regions a lot, the latency will hurt user experience. Use placement groups to keep chatty instances close together.
- Offload TLS termination to load balancers so your instances can do more.
- HTTP/2 instead of HTTP/1.1 where possible.
- Use performance-related strategies like caching at every layer, using read-replicas, sharding/compressing data, buffer/stream results instead of blocking.
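As one example of caching at a layer, here’s a minimal read-through cache sketch against ElastiCache for Redis (the standard redis-py client is wire-compatible with it); the endpoint and TTL are assumptions:

```python
import json
import redis

cache = redis.Redis(
    host="my-cache.abc123.cache.amazonaws.com",  # hypothetical endpoint
    port=6379,
)

def get_product(product_id, load_from_db):
    """Read-through cache: serve from Redis if present, otherwise hit
    the database and cache the result with a short TTL."""
    key = f"product:{product_id}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)
    product = load_from_db(product_id)
    cache.set(key, json.dumps(product), ex=300)  # 5-minute TTL
    return product
```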
Cost
- Enforce cost allocation tags on everything.
- Set budgets and alerts. Study cost reports looking for improvement opportunities.
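For example, a monthly cost budget with an 80% actual-spend alert takes a few lines of Python (boto3); the account ID, amount, and email are placeholders:

```python
import boto3

budgets = boto3.client("budgets")

budgets.create_budget(
    AccountId="123456789012",  # placeholder
    Budget={
        "BudgetName": "new-project-monthly",
        "BudgetLimit": {"Amount": "500", "Unit": "USD"},
        "TimeUnit": "MONTHLY",
        "BudgetType": "COST",
    },
    NotificationsWithSubscribers=[{
        "Notification": {
            "NotificationType": "ACTUAL",
            "ComparisonOperator": "GREATER_THAN",
            "Threshold": 80.0,
            "ThresholdType": "PERCENTAGE",
        },
        "Subscribers": [
            {"SubscriptionType": "EMAIL", "Address": "team@example.com"}
        ],
    }],
)
```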
- Block accidental usage of costly resources, like expensive EC2 instance types, using org-level SCPs.
- Implement a decommissioning process for unutilized resources, preferably automated.
- When selecting unmanaged services over managed services solely for cost reasons, consider the cost of operations and management of the service/component.
- Software licensing. Keep looking for ways to move away from costly proprietary licensing, maybe to open-source alternatives. Are your licenses bound to CPUs? Does your license force you onto costly dedicated hosts or instances, like macOS on EC2?
- Align resource usage patterns with appropriate pricing models, like using reserved instances or savings plans for long-running EC2 instances, or inexpensive spot instances for interruptible workloads.
- Perform pricing analysis at the management account level to factor in consolidated billing and bulk usage discounts.
- Think about data transfer costs. Data leaving your AZ or region is often charged. Avoid it if possible. If data transfer is unavoidable, consider cost-reducing options like CDNs or Lambda@Edge.
Conclusion
That’s all for the checklist. Hopefully, each point in the list is thought-provoking for you and leads to a tiny improvement in your architecture.
About the Author ✍🏻
Harish KM is a Principal DevOps Engineer at QloudX & a top-ranked AWS Ambassador since 2020. 👨🏻‍💻
With over a decade of industry experience as everything from a full-stack engineer to a cloud architect, Harish has built many world-class solutions for clients around the world! 👷🏻‍♂️
With over 20 certifications in cloud (AWS, Azure, GCP), containers (Kubernetes, Docker) & DevOps (Terraform, Ansible, Jenkins), Harish is an expert in a multitude of technologies. 📚
These days, his focus is on the fascinating world of DevOps & how it can transform the way we do things! 🚀