
Improving Availability and Performance with Multi-Region Deployments

Race Brocx, Senior Software Engineer


In today’s fast-paced digital landscape, ensuring the availability and performance of critical applications is paramount to success. One common challenge organizations face is relying on a single region for infrastructure deployment, which can put stability and application uptime at significant risk. This article explores a real-world scenario, covering:

  • a client whose infrastructure was deployed solely in the US-East-1 region,

  • the challenges that deployment posed, and

  • how we successfully mitigated those risks through a multi-region deployment strategy.



Our client had the entire infrastructure for a critical application deployed in the US-East-1 region. This region is often regarded as the least stable, reportedly because it receives updates before others, which makes it a high-risk choice as the sole host of critical infrastructure. Recognizing the need for greater availability, our team set out to duplicate the infrastructure in the more stable US-West-2 region while improving performance through latency-based routing.


To address the challenges posed by relying solely on the US-East-1 region, we adopted a multi-region deployment approach with the following key steps:

  1. Leveraging AWS Cloud Development Kit: The existing infrastructure was defined in TypeScript with the AWS Cloud Development Kit (CDK), which synthesizes the code into CloudFormation templates for deployment. This allowed us to automate and streamline the provisioning of resources in both regions, minimizing manual effort and potential errors.

  2. Duplicating Infrastructure: Because the infrastructure was already defined in CDK, replicating it from US-East-1 to US-West-2 was straightforward and gave us a redundant setup with greater availability and fault tolerance.

  3. Latency-Based Routing: By setting up latency-based routing between the two regions with Route 53, we optimized performance by directing each request to the region with the lowest latency. This keeps the user experience consistent regardless of where users are located.
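
Steps 2 and 3 can be sketched in plain TypeScript that emits one latency record per region. This is a minimal illustration under stated assumptions, not the client's code: the domain, the endpoint hostnames, the TTL, and the `latencyRecords` helper are all hypothetical, and the record shape follows only a subset of the CloudFormation `AWS::Route53::RecordSet` properties (`Name`, `Type`, `TTL`, `SetIdentifier`, `Region`, `ResourceRecords`).

```typescript
// The two regions named in the article.
const REGIONS = ["us-east-1", "us-west-2"] as const;
type Region = (typeof REGIONS)[number];

// Subset of the CloudFormation AWS::Route53::RecordSet properties.
interface LatencyRecord {
  Name: string;
  Type: "CNAME";
  TTL: string;
  SetIdentifier: string; // must differ between records that share a name
  Region: Region;        // marks the record as a latency record for this region
  ResourceRecords: string[];
}

// Hypothetical helper: build one latency record per region for the same domain.
function latencyRecords(
  domain: string,
  endpoints: Record<Region, string>,
): LatencyRecord[] {
  return REGIONS.map((region): LatencyRecord => ({
    Name: domain,
    Type: "CNAME",
    TTL: "60", // assumed value, chosen only for illustration
    SetIdentifier: `${domain}-${region}`,
    Region: region,
    ResourceRecords: [endpoints[region]],
  }));
}

// Hypothetical domain and regional endpoints.
const exampleRecords = latencyRecords("app.example.com", {
  "us-east-1": "east.app.example.com",
  "us-west-2": "west.app.example.com",
});
console.log(JSON.stringify(exampleRecords, null, 2));
```

Both records share the same name; Route 53 distinguishes them by `SetIdentifier` and answers with the record whose `Region` offers the lowest latency to the requester.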



Throughout the deployment process, we encountered several challenges that required innovative solutions. Here are some key challenges and the strategies we employed to overcome them:


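1. Globally Unique Resources: We ran into issues with resources whose names must be unique beyond a single region, specifically IAM roles and S3 buckets. S3 bucket names are globally unique, and because IAM is a global service, a role name can be used only once per account regardless of region. To prevent name collisions, we created new IAM roles with region-specific suffixes, ensuring uniqueness across both the US-East-1 and US-West-2 deployments.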

2. S3 Resource Redundancy: Our S3 buckets primarily store ephemeral data that is not crucial enough to require additional redundancy, so we decided to deploy them exclusively in US-East-1. We then configured all resources, including those deployed to US-West-2, to point to the S3 buckets in US-East-1. If US-East-1 fails, we may lose access to that data, but because it is not crucial to the operation of the application, we decided that copying it to multiple regions would not be worth the cost.

3. DynamoDB Multi-Region Migration: Converting a non-global DynamoDB table to a global one, as the CloudFormation documentation suggests, carried a risk of data loss. We validated the migration first with a proof of concept: we created a DynamoDB table with CDK, filled it with test data, then updated the CDK configuration to add a replica region and redeployed. CDK generated a temporary Lambda-backed custom resource in the CloudFormation template, which used DynamoDB Streams to copy the table data to the replica. No data was lost, and the same approach let us convert a single-region DynamoDB table into a global table with replica tables in US-East-1 and US-West-2.

4. IAM Role Permissions: CloudFormation refused to create the IAM roles needed by the custom resource that CDK generates to set up the replica DynamoDB table. A permissions boundary in our deployment account requires a specific prefix in IAM role names, and because CDK generates these roles automatically, we could not name them directly. To work around this, we wrote a function that crawls the nested constructs of the CloudFormation stack and adds the required prefix to every role name generated by CDK.
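
The crawling approach can be sketched as a recursive walk over a construct tree. This is a simplified stand-in rather than the function used on the project: `TreeNode`, `prefixRoleNames`, the `acme-` prefix, and the fallback of naming an unnamed role after its node id are all hypothetical. In a real CDK app, the equivalent walk would likely go through `construct.node.children` (or a CDK Aspect) and set `roleName` on `iam.CfnRole` resources.

```typescript
// Minimal stand-in for a node in a CDK construct tree (not the CDK API).
interface TreeNode {
  id: string;
  resourceType: string; // e.g. "AWS::IAM::Role"
  roleName?: string;    // unset when the role name is auto-generated
  children: TreeNode[];
}

const ROLE_NAME_MAX = 64; // IAM limits role names to 64 characters

// Walk the tree depth-first and make sure every IAM role name carries the prefix.
// Returns the number of roles that gained the prefix.
function prefixRoleNames(node: TreeNode, prefix: string): number {
  let renamed = 0;
  if (node.resourceType === "AWS::IAM::Role") {
    // Assumption: an unnamed role gets a name derived from its node id.
    const current = node.roleName ?? node.id;
    if (!current.startsWith(prefix)) {
      node.roleName = (prefix + current).slice(0, ROLE_NAME_MAX);
      renamed += 1;
    } else {
      node.roleName = current;
    }
  }
  for (const child of node.children) {
    renamed += prefixRoleNames(child, prefix);
  }
  return renamed;
}

// Hypothetical tree: one generated role nested under a table, one named role.
const exampleTree: TreeNode = {
  id: "AppStack",
  resourceType: "AWS::CloudFormation::Stack",
  children: [
    {
      id: "Table",
      resourceType: "AWS::DynamoDB::Table",
      children: [
        { id: "ReplicaProviderRole", resourceType: "AWS::IAM::Role", children: [] },
      ],
    },
    { id: "ApiRole", resourceType: "AWS::IAM::Role", roleName: "acme-api-role", children: [] },
  ],
};
console.log(prefixRoleNames(exampleTree, "acme-"));
```

Because the walk visits every nested node, roles created deep inside generated constructs are handled the same way as roles declared at the top level.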


5. Seamless Deployment of DynamoDB Tables: First, we enabled the conversion of our DynamoDB tables to global tables by creating connected replica tables in US-West-2. Afterward, we tested the deployment of DynamoDB tables during the replication process. Remarkably, the original tables remained available during the deployment of replica tables, ensuring uninterrupted access to crucial data.
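
The configuration change behind the global-table conversion in items 3 and 5 is small: the table definition gains a replica region and is redeployed. The sketch below models that change with a plain object rather than CDK itself. `TableSettings`, `addReplica`, `homeRegion`, and the table name are hypothetical; in a CDK app the corresponding setting would be the `replicationRegions` property of `dynamodb.Table`.

```typescript
// Hand-written stand-in for a table definition (not the CDK TableProps type).
interface TableSettings {
  tableName: string;
  homeRegion: string;           // region of the original, single-region table
  replicationRegions: string[]; // regions that should hold a replica
}

// Return a copy of the settings with one more replica region.
// The home region and already-listed regions are left alone.
function addReplica(table: TableSettings, region: string): TableSettings {
  const alreadyCovered =
    region === table.homeRegion || table.replicationRegions.includes(region);
  if (alreadyCovered) {
    return table;
  }
  return { ...table, replicationRegions: [...table.replicationRegions, region] };
}

// Before: a single-region table. After: the same table with a us-west-2 replica.
const singleRegionTable: TableSettings = {
  tableName: "orders", // hypothetical name
  homeRegion: "us-east-1",
  replicationRegions: [],
};
const globalTable = addReplica(singleRegionTable, "us-west-2");
console.log(globalTable.replicationRegions);
```

Keeping the change to a single added region mirrors the proof of concept: the table itself is not redefined, so the original remains in place while the replica is created.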


The importance of avoiding single-region deployments, particularly in regions known for their instability, cannot be overstated. By replicating infrastructure, implementing latency-based routing, and leveraging AWS CDK, organizations can safeguard critical applications against outages, ensure global accessibility, and deliver exceptional user experiences. For our client, implementing a multi-region deployment strategy significantly improved both availability and performance.



On June 13, 2023, AWS experienced an outage that impacted Lambda and API Gateway services in the US-East-1 region. This outage could have been disastrous for our client; however, our multi-region deployment approach proved highly effective at both reducing application latency and maintaining availability. Thanks to the infrastructure replicated in US-West-2, our deployed services remained fully functional, and we watched traffic shift in real time from the affected US-East-1 region to US-West-2. This averted a potential catastrophe for our client and reinforced the benefits of multi-region redundancy in mission-critical applications.
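
The traffic shift described above is consistent with latency-based routing combined with health checks, which is an assumption here since the article does not say how regional health was evaluated. The following sketch models that selection logic in plain TypeScript. `RegionStatus`, `pickRegion`, and the latency figures are hypothetical, and returning `undefined` when no region is healthy is a simplification rather than a description of Route 53's actual behavior.

```typescript
// Hypothetical view of one region as seen by the router.
interface RegionStatus {
  region: string;
  latencyMs: number; // illustrative latency from the requester to the region
  healthy: boolean;  // result of an assumed health check
}

// Choose the healthy region with the lowest latency, or undefined if none is healthy.
function pickRegion(candidates: RegionStatus[]): string | undefined {
  const healthy = candidates.filter((c) => c.healthy);
  if (healthy.length === 0) {
    return undefined;
  }
  return healthy.reduce((best, c) => (c.latencyMs < best.latencyMs ? c : best)).region;
}

// Normal operation: a requester close to us-east-1 is routed there.
const normalDay: RegionStatus[] = [
  { region: "us-east-1", latencyMs: 20, healthy: true },
  { region: "us-west-2", latencyMs: 80, healthy: true },
];

// During an outage in us-east-1, the same requester is sent to us-west-2.
const outageDay: RegionStatus[] = [
  { region: "us-east-1", latencyMs: 20, healthy: false },
  { region: "us-west-2", latencyMs: 80, healthy: true },
];

console.log(pickRegion(normalDay), pickRegion(outageDay));
```

The trade-off is visible in the numbers: during the outage the requester accepts higher latency in exchange for a working endpoint.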





