We’d like to give you some additional information about the service disruption that occurred in the Seoul (AP-NORTHEAST-2) Region on November 22, 2018. Between 8:19 AM and 9:43 AM KST, EC2 instances experienced DNS resolution issues in the AP-NORTHEAST-2 region. This was caused by a reduction in the number of healthy hosts that were part of the EC2 DNS resolver fleet, which provides a recursive DNS service to EC2 instances. Service was restored when the number of healthy hosts was restored to previous levels. EC2 network connectivity and DNS resolution outside of EC2 instances were not affected by this event.
The root cause of DNS resolution issues was a configuration update which incorrectly removed the setting that specifies the minimum healthy hosts for the EC2 DNS resolver fleet in the AP-NORTHEAST-2 Region. This resulted in the minimum healthy hosts configuration setting being interpreted as a very low default value that resulted in fewer in-service healthy hosts. With the reduced healthy host capacity for the EC2 DNS resolver fleet, DNS queries from within EC2 instances began to fail. At 8:21 AM KST, the engineering team was alerted to the DNS resolution issue within the AP-NORTHEAST-2 Region and immediately began working on resolution. We identified root cause at 8:48 AM KST and we first ensured that there was no further impact by preventing additional healthy hosts from being removed from service; this took an additional 15 minutes. We then started restoring capacity to previous levels which took the bulk of the recovery time. At 9:43 AM KST, DNS queries from within EC2 instances saw full recovery.
We are taking multiple steps to prevent recurrence of this issue, some of which are already complete. We have immediately validated and ensured that every AWS region has the correct capacity settings for the EC2 DNS resolver service. We are implementing semantic configuration validation for all EC2 DNS resolver configuration updates, to ensure every region always has sufficient minimum healthy hosts. We are also adding throttling to ensure that only a limited amount of healthy host capacity can be removed from service each hour. This will prevent the downscaling of the EC2 DNS resolver fleet in the event of an invalid configuration parameter.
Finally, we want to apologize for the impact this event caused for our customers. While we’ve had a strong track record of availability with EC2 DNS, we know how critical this service is to our customers, their applications and end users, and their businesses. We will do everything we can to learn from this event and use it to improve our availability even further.