Serverless Production Readiness Checklist
Going to production is both an exciting and terrifying event.
Doing it with a Serverless application can be even scarier if you are unaware of all the caveats.
This blog post covers my personal production readiness checklist for Serverless applications.
However, some items in the checklist are relevant for SaaS applications in general.
The checklist is split into five categories:
Observability & support readiness
Run performance tests
Your Serverless application works, and all the tests are green.
Great, but did you examine the overall performance and the hidden bottlenecks lurking around, waiting to crash your application?
Don't wait for performance issues to occur; handle them as soon as possible:
1. Use AWS X-Ray tracing to find bottlenecks in your code and refactor them.
2. Use AWS Lambda Power Tuning to balance cost and performance.
3. Consider switching to Graviton CPU for your Lambda functions.
Read more about tracing and performance monitoring here.
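Both X-Ray tracing and Graviton can be enabled at deployment time. A minimal AWS CDK (Python) sketch, assuming the code runs inside a Stack's `__init__` and that a handler file exists under a `src` directory (names here are placeholders):

```python
from aws_cdk import aws_lambda as lambda_

fn = lambda_.Function(
    self, "ApiHandler",
    runtime=lambda_.Runtime.PYTHON_3_12,
    handler="my_handler.handler",
    code=lambda_.Code.from_asset("src"),
    architecture=lambda_.Architecture.ARM_64,  # Graviton: better price/performance
    tracing=lambda_.Tracing.ACTIVE,            # enable AWS X-Ray active tracing
)
```

Switching `architecture` is usually a one-line change, but verify that any compiled dependencies in your deployment package have ARM builds first.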
Improve performance bottlenecks by adding caching mechanisms.
Many caching mechanisms can provide a substantial performance boost for your application.
For AWS Lambda, an in-memory cache (memoization) or a DynamoDB cache table (with or without DAX) is usually considered low-hanging fruit and can provide an immediate performance boost.
Read more about it in Yan Cui's post.
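In-memory memoization works well in Lambda because warm invocations reuse the same execution environment, so a module-level cache survives between calls. A minimal sketch (the `get_tenant_config` function and its return value are hypothetical placeholders for an expensive lookup):

```python
import functools
import time

def ttl_cache(ttl_seconds=60):
    """Memoize results in the execution environment's memory.

    Warm Lambda invocations reuse the module-level cache; entries expire
    after ttl_seconds so stale data is eventually refreshed.
    """
    def decorator(fn):
        cache = {}

        @functools.wraps(fn)
        def wrapper(*args):
            now = time.monotonic()
            hit = cache.get(args)
            if hit is not None and now - hit[0] < ttl_seconds:
                return hit[1]  # cache hit: skip the expensive call
            value = fn(*args)
            cache[args] = (now, value)
            return value
        return wrapper
    return decorator

@ttl_cache(ttl_seconds=300)
def get_tenant_config(tenant_id):
    # placeholder for an expensive call, e.g. a DynamoDB or SSM lookup
    return {"tenant": tenant_id, "plan": "basic"}
```

Keep TTLs short for data that changes often; a stale cache can be worse than a slow one.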
Penetration tests are a no-brainer and are mandatory for any service, especially a Serverless one.
Run penetration tests in a dedicated AWS account with a configuration that matches your production account.
Internal Security Review
Conduct an internal security review of the application with an AWS security expert to pinpoint points of failure and take measures to resolve them.
Use AWS Web Application Firewall (WAF): apply AWS-managed rules to your API Gateways and CloudFront distributions.
Read about WAF rule types here and AWS-managed rules here to get started.
AWS Lambda Inspection
Enable Amazon Inspector on your AWS Lambda functions and layers.
"Amazon Inspector scans functions and layers initially upon deployment and automatically rescans them when there are changes in the workloads, for example, when a Lambda function is updated or when a new vulnerability (CVE) is published."
Learn more about it here.
CI/CD Vulnerability Scanner
Use a vulnerability scanner in your CI/CD pipeline. While Amazon Inspector checks deployed Lambda layers and functions for vulnerabilities, it's always good to "shift left" and do these checks as early as possible, in the CI/CD pipeline.
I've used tools such as Snyk, which get the job done. Read more about it here.
IAC Security Best Practices
Use AWS IAM permissions and AWS best practices guidelines in your deployment framework.
You want to prevent security breaches such as an unprotected API Gateway, an open S3 bucket, or an overly permissive IAM role.
If you use CDK, you should implement CDK nag; otherwise, use cfn-nag.
Check out a working CDK nag code example in my blog post about CDK's best practices under the security guidelines.
Canary Deployment for AWS Lambda Functions
Canary deployment for AWS Lambda - for a production environment, use canary deployments with automatic rollbacks at the first sign of AWS Lambda error logs or triggered AWS CloudWatch alarms.
Canary deployments gradually shift traffic to your new AWS Lambda version and revert the shift at the first sign of errors.
One way to achieve that is to use AWS CodeDeploy with AWS Lambda.
Read more about it here.
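With AWS CDK, CodeDeploy canary deployments take only a few lines. A sketch, assuming it runs inside a Stack where `fn` is an existing `lambda_.Function` and `errors_alarm` is a CloudWatch alarm on its error metric (both defined elsewhere):

```python
from aws_cdk import aws_codedeploy as codedeploy, aws_lambda as lambda_

alias = lambda_.Alias(
    self, "LiveAlias",
    alias_name="live",
    version=fn.current_version,
)

codedeploy.LambdaDeploymentGroup(
    self, "CanaryDeploy",
    alias=alias,
    # shift 10% of traffic, wait 5 minutes, then shift the rest
    deployment_config=codedeploy.LambdaDeploymentConfig.CANARY_10PERCENT_5MINUTES,
    alarms=[errors_alarm],  # an alarm breach during the shift triggers automatic rollback
)
```

Make sure your callers invoke the alias, not `$LATEST`, or the traffic shifting has no effect.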
Canary Deployment for Lambda Dynamic Configuration
Canary deployments are also relevant in the domain of dynamic application configuration.
Feature flags are a type of dynamic configuration and allow you to quickly change the behavior of your AWS Lambda function. Being able to turn a feature on or off quickly is one way to improve feature release confidence.
I recommend using AWS AppConfig with AWS Lambda Powertools feature flags utility for the best feature flags experience.
For more details and code examples, click here and here.
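To make the idea concrete, here is a minimal, dependency-free sketch of rule-based flag evaluation: a flag's default value can be overridden by simple equality rules on the request context. The schema below is hypothetical and simplified; in practice, use the Powertools feature flags utility backed by AppConfig rather than rolling your own:

```python
def evaluate_flag(flags_config, name, context, default=False):
    """Return the flag value for this request context.

    A flag has a default value and optional rules; the first rule whose
    conditions all match the context wins (hypothetical schema).
    """
    flag = flags_config.get(name)
    if flag is None:
        return default  # unknown flag: fall back to the caller's default
    for rule in flag.get("rules", []):
        if all(context.get(k) == v for k, v in rule["when"].items()):
            return rule["value"]
    return flag.get("default", default)

# Example configuration, as it might be stored in AWS AppConfig
flags = {
    "premium_features": {
        "default": False,
        "rules": [{"when": {"tier": "premium"}, "value": True}],
    }
}
```

Keeping the configuration outside the code (in AppConfig) is what lets you flip a flag without redeploying the function.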
AWS Lambda cold starts can become an issue in business-critical or real-time use cases. Critical flows such as customer login and main page load come to mind.
For those use cases, and only in the production account, I'd suggest enabling provisioned concurrency, so your service is always up and ready to serve.
Read more about it here.
AWS Lambda service scales your functions according to load.
However, every AWS account and region has a maximum number of concurrent Lambda executions, shared across ALL Lambda functions.
If two Lambda functions are deployed in the same account and region, one function can scale drastically, exhaust the account-level concurrency quota, and cause starvation and throttling (HTTP 429 errors) for the other function.
Defining reserved concurrency per lambda function can prevent that.
Read more about it here.
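Both concurrency controls from the two items above can be set in AWS CDK. A sketch, assuming it runs inside a Stack and that the function name and numbers are placeholders you would tune per workload:

```python
from aws_cdk import aws_lambda as lambda_

fn = lambda_.Function(
    self, "CheckoutHandler",
    runtime=lambda_.Runtime.PYTHON_3_12,
    handler="checkout.handler",
    code=lambda_.Code.from_asset("src"),
    # cap this function so a traffic spike cannot starve its neighbors
    reserved_concurrent_executions=50,
)

# production only: keep warm environments for latency-critical paths
fn.current_version.add_alias(
    "live",
    provisioned_concurrent_executions=5,
)
```

Note that reserved concurrency is both a cap and a guarantee: the function can never exceed it, but that capacity is also carved out of the account quota for it alone.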
You should accept that it is just a matter of time until something terrible happens.
You can't prepare for everything, but you can take out insurance policies and ready your services for catastrophe.
Let's review some items that can prepare your service for the worst.
This one is tricky. Expect the unexpected. Outages and server errors are going to happen, even in Serverless.
You need to be prepared.
Use AWS Fault Injection Simulator to create chaos in your AWS account, have your AWS API calls fail, and see how your service behaves.
Try to design for failure as early as possible in your Serverless journey.
Back up your data and customers' data.
Enable hourly backups of your DynamoDB tables, Aurora databases, OpenSearch indexes, or any other database entity. It's better to be safe than sorry.
Some services, like DynamoDB, offer automatic backups and ease of restoration.
See more CDK backup best practices here.
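For DynamoDB, both continuous backups and scheduled backup plans can be enabled in AWS CDK. A sketch, assuming it runs inside a Stack (the table name and key are placeholders; the built-in plan shown is daily/weekly/monthly - for hourly backups you would add a custom rule):

```python
from aws_cdk import aws_backup as backup, aws_dynamodb as dynamodb

table = dynamodb.Table(
    self, "Orders",
    partition_key=dynamodb.Attribute(
        name="id", type=dynamodb.AttributeType.STRING
    ),
    # continuous backups: restore to any point in the last 35 days
    point_in_time_recovery=True,
)

# scheduled backups via AWS Backup, using a built-in retention plan
plan = backup.BackupPlan.daily_weekly_monthly5_year_retention(self, "Plan")
plan.add_selection(
    "Selection",
    resources=[backup.BackupResource.from_dynamo_db_table(table)],
)
```

Point-in-time recovery covers "oops" deletions within the retention window; the backup plan covers longer-term retention and compliance.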
Restore from Backup
Create a process for restoring production data from the backup.
Creating a backup is one thing, but restoring from a backup when the clock is ticking and upset customers are at your doorstep is another.
You should create a well-defined process to restore any database quickly and safely.
Develop the required scripts, define the restoration process (who runs it, when, how), test it in non-production environments, and train your support staff to use it.
Production Ad-hoc Actions
Create a process for making ad-hoc changes/scripts/fixes on the production account. Sometimes you can't wait for a bug fix to deploy from the dev account to production, and it can take too much time to go through all the CI/CD pipeline stages.
Sometimes you need a quick, audited, and safe way of changing production data.
Make sure it is audited, requires extra approvals, and does not break any regulations you are obligated to.
Design for data redundancy. AWS regions can go down. Yes, it happens.
However, you can't have your entire service go down with it.
Design for multiple region backups as explained here.
Simulate the outage in the primary region and verify that the application works in the secondary region.
Once the primary region is active again, validate that the customer data is in sync with the secondary region.
Observability & Support Readiness
Good observability and logging practices will make your developers and support teams happy.
In my eyes, the perfect debugging session is one in which I can trace a single user activity across multiple services with just one id - the infamous correlation id value.
One way to achieve this experience is to inject a correlation id value into your service logs.
In addition, you must pass this value along in every downstream service call via request/event headers (AWS SNS message attributes, HTTP headers, etc.).
See an example here with AWS Lambda Powertools Logger.
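The Powertools Logger handles this injection for you; for illustration, here is a minimal hand-rolled sketch using only the standard library, where a logging filter stamps every record with the correlation id (the logger name and JSON format are assumptions, not a Powertools API):

```python
import logging

class CorrelationIdFilter(logging.Filter):
    """Attach a correlation id to every log record, so a single user
    activity can be traced across services by one value."""

    def __init__(self, correlation_id):
        super().__init__()
        self.correlation_id = correlation_id

    def filter(self, record):
        record.correlation_id = self.correlation_id
        return True  # never drop records, only enrich them

def configure_logger(correlation_id):
    """Build a JSON-ish logger whose every line carries the correlation id."""
    logger = logging.getLogger("service")
    handler = logging.StreamHandler()
    handler.setFormatter(logging.Formatter(
        '{"level": "%(levelname)s", '
        '"correlation_id": "%(correlation_id)s", '
        '"message": "%(message)s"}'
    ))
    logger.addHandler(handler)
    logger.addFilter(CorrelationIdFilter(correlation_id))
    logger.setLevel(logging.INFO)
    return logger
```

In a Lambda handler, the id would typically come from an incoming event header (or be generated if absent) and then be forwarded on every outgoing call.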
Create AWS CloudWatch dashboards that provide a high-level overview of your service status for your SRE team.
It should contain digestible error logs and service information, so non-developers can quickly pinpoint errors and their root cause.
Leave the complicated dashboards containing low-level service CloudWatch metrics to the developer's dashboards.
Work closely with the SRE team: add precise log messages describing service issues, and build the dashboard together.
Read more about observability and CloudWatch here.
Define CloudWatch alerts on critical error logs or CloudWatch metrics that correlate to a severe service deficiency or denial of service.
These can include Lambda function crashes, latency issues, Lambda function timeouts, DynamoDB errors, Cognito login issues, etc.
Each alarm needs to be investigated and mitigated quickly.
Invest time and effort in support training for developers and SREs.
Each dashboard error log or CloudWatch metric must have a predefined action for the SRE to take.
Include guidelines such as "If you see 'X' and 'Y,' it most probably means that Z is an issue, follow these steps to mitigate Z."
Make sure the SREs understand the high-level event-driven architecture of the service so they can support it more efficiently.
Add KPI Metrics - key performance indicators are "special" metrics that, in theory, can predict the success of your service.
KPIs serve as a means to predict the future of the business. KPIs are strategically designed to support the business use case. They require a deep understanding of your business and users and, as such, require careful definition.
Fail to meet your well-defined KPIs, and your service is likely headed for trouble; meet them, and it has a better chance of succeeding.
Learn here how to implement KPIs in your Serverless service with AWS CloudWatch custom metrics.
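One low-overhead way to emit custom metrics from Lambda is the CloudWatch Embedded Metric Format (EMF): you print a structured JSON log line and CloudWatch Logs extracts the metric, with no extra API calls from the function. A minimal sketch (the namespace, dimension, and metric names are illustrative placeholders; the Powertools Metrics utility produces this format for you):

```python
import json
import time

def emit_kpi(metric_name, value, namespace="MyService/KPIs", unit="Count"):
    """Emit a custom CloudWatch metric as an EMF-formatted log line."""
    payload = {
        "_aws": {
            "Timestamp": int(time.time() * 1000),  # epoch milliseconds
            "CloudWatchMetrics": [{
                "Namespace": namespace,
                "Dimensions": [["Service"]],
                "Metrics": [{"Name": metric_name, "Unit": unit}],
            }],
        },
        # dimension values and metric values live at the top level
        "Service": "checkout",
        metric_name: value,
    }
    line = json.dumps(payload)
    print(line)  # Lambda stdout goes to CloudWatch Logs, which extracts the metric
    return line
```

Once the metric exists, you can graph it on your KPI dashboard and alarm on it like any other CloudWatch metric.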