Summary of the Amazon EC2 DNS Resolution Issue in the Asia Pacific (Seoul) Region (AP-NORTHEAST-2)

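We would like to share additional information about the service disruption that occurred in the Seoul Region (AP-NORTHEAST-2) on November 22, 2018. Between 8:19 AM and 9:43 AM KST that day, EC2 instances in the Seoul Region experienced DNS resolution issues. The cause was a drop in the number of healthy hosts in the EC2 DNS resolver fleet, which provides recursive DNS service to EC2 instances. DNS resolution was restored once the number of healthy hosts returned to its previous level. Network connectivity for EC2 instances and DNS resolution outside of EC2 were not affected by this event.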

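The root cause of the DNS resolution issue was a configuration update that mistakenly removed the setting specifying the minimum number of healthy hosts for the EC2 DNS resolver fleet in the Seoul Region. As a result, the minimum healthy hosts setting was interpreted as a very low default value, and the number of healthy in-service hosts decreased. As the healthy host capacity of the EC2 DNS resolver fleet shrank, DNS queries from within customer EC2 instances began to fail. The AWS engineering team was notified of the DNS resolution issue in the Seoul Region at 8:21 AM KST and immediately began working to resolve it. AWS first ensured there would be no further impact by preventing any more healthy hosts from being removed from service; this took an additional 15 minutes. Service capacity was then restored to its previous level, which accounted for most of the recovery time. At 9:43 AM KST, DNS queries from EC2 instances had fully recovered.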

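AWS is taking multiple steps to prevent this issue from recurring, some of which are already complete. First, we immediately verified that the EC2 DNS resolver service in every AWS Region has the correct capacity settings. To ensure every Region always has a sufficient minimum number of healthy hosts, AWS has implemented semantic configuration validation for all EC2 DNS resolver configuration updates. We are also adding throttling so that only a limited amount of healthy host capacity can be removed from service per hour. These measures prevent the EC2 DNS resolver fleet from being scaled down even if an invalid configuration parameter is introduced.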

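Finally, we apologize for the impact this event had on our customers. EC2 DNS has provided high availability over the years. AWS understands how important this service is to our customers, their applications and end users, and their businesses. We have learned an important lesson and will do our utmost to further improve our availability.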

Summary of the Amazon EC2 DNS Resolution Issues in the Asia Pacific (Seoul) Region (AP-NORTHEAST-2)

We’d like to give you some additional information about the service disruption that occurred in the Seoul (AP-NORTHEAST-2) Region on November 22, 2018. Between 8:19 AM and 9:43 AM KST, EC2 instances experienced DNS resolution issues in the AP-NORTHEAST-2 Region. This was caused by a reduction in the number of healthy hosts in the EC2 DNS resolver fleet, which provides a recursive DNS service to EC2 instances. Service recovered when the number of healthy hosts returned to previous levels. EC2 network connectivity and DNS resolution outside of EC2 instances were not affected by this event.

The root cause of the DNS resolution issues was a configuration update that incorrectly removed the setting that specifies the minimum healthy hosts for the EC2 DNS resolver fleet in the AP-NORTHEAST-2 Region. With the setting missing, the minimum healthy hosts value was interpreted as a very low default, which left fewer healthy hosts in service. With the reduced healthy host capacity for the EC2 DNS resolver fleet, DNS queries from within EC2 instances began to fail. At 8:21 AM KST, the engineering team was alerted to the DNS resolution issue within the AP-NORTHEAST-2 Region and immediately began working on resolution. We identified the root cause at 8:48 AM KST and first ensured that there was no further impact by preventing additional healthy hosts from being removed from service; this took an additional 15 minutes. We then started restoring capacity to previous levels, which took the bulk of the recovery time. At 9:43 AM KST, DNS queries from within EC2 instances saw full recovery.
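
To make the failure mode concrete, here is a minimal sketch of how a removed setting can silently fall back to a low default. Every name and number in it (the `min_healthy_hosts` key, the default of 1, the fleet size of 100) is a hypothetical chosen for illustration; it does not describe the actual EC2 DNS resolver configuration.

```python
# Illustrative only: the key name, default value, and fleet size are assumptions.
DEFAULT_MIN_HEALTHY_HOSTS = 1  # assumed "very low" fallback value


def min_healthy_hosts(config: dict) -> int:
    """Return the configured minimum, or the default when the key is absent."""
    return config.get("min_healthy_hosts", DEFAULT_MIN_HEALTHY_HOSTS)


def hosts_removable(in_service: int, config: dict) -> int:
    """How many hosts a fleet manager would be allowed to take out of service."""
    return max(0, in_service - min_healthy_hosts(config))


# Configuration before the update: the minimum is set explicitly.
before = {"region": "ap-northeast-2", "min_healthy_hosts": 80}
# Configuration after the update: the setting was removed by mistake.
after = {"region": "ap-northeast-2"}

removable_before = hosts_removable(100, before)
removable_after = hosts_removable(100, after)
print(removable_before, removable_after)
```

In this sketch the update itself is syntactically valid, so nothing fails at load time; the only visible change is that far more hosts become eligible for removal.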

We are taking multiple steps to prevent recurrence of this issue, some of which are already complete. We have immediately validated and ensured that every AWS Region has the correct capacity settings for the EC2 DNS resolver service. We are implementing semantic configuration validation for all EC2 DNS resolver configuration updates, to ensure every Region always has sufficient minimum healthy hosts. We are also adding throttling to ensure that only a limited amount of healthy host capacity can be removed from service each hour. This will prevent the downscaling of the EC2 DNS resolver fleet in the event of an invalid configuration parameter.
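
The two safeguards described above, semantic configuration validation and hourly throttling of host removal, could look roughly like the following. This is a sketch under stated assumptions, not AWS's implementation: the `min_healthy_hosts` key, the 50% floor, and the sliding one-hour window are all hypothetical choices made for the example.

```python
import math
from collections import deque


class ConfigError(ValueError):
    """Raised when a configuration update fails semantic validation."""


def validate_config(config: dict, fleet_size: int, min_fraction: float = 0.5) -> None:
    """Reject updates whose minimum healthy host setting is missing or implausible.

    The required key and the min_fraction floor are assumptions for this sketch.
    """
    if "min_healthy_hosts" not in config:
        raise ConfigError("min_healthy_hosts is required; no default is applied")
    value = config["min_healthy_hosts"]
    if isinstance(value, bool) or not isinstance(value, int):
        raise ConfigError("min_healthy_hosts must be an integer")
    floor = math.ceil(fleet_size * min_fraction)
    if value < floor:
        raise ConfigError(f"min_healthy_hosts={value} is below the floor of {floor}")
    if value > fleet_size:
        raise ConfigError(f"min_healthy_hosts={value} exceeds fleet size {fleet_size}")


class RemovalThrottle:
    """Allow at most max_per_hour host removals within a sliding time window."""

    def __init__(self, max_per_hour: int, window_seconds: float = 3600.0):
        self.max_per_hour = max_per_hour
        self.window_seconds = window_seconds
        self._removals = deque()  # timestamps of removals still inside the window

    def try_remove(self, now: float) -> bool:
        """Record a removal at time `now` if the budget allows it."""
        # Drop removals that have aged out of the window.
        while self._removals and now - self._removals[0] >= self.window_seconds:
            self._removals.popleft()
        if len(self._removals) >= self.max_per_hour:
            return False
        self._removals.append(now)
        return True
```

The two checks are complementary: validation stops a bad update before it is applied, while the throttle bounds how quickly capacity can shrink if a bad value gets through anyway, leaving time for operators to react.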

Finally, we want to apologize for the impact this event caused for our customers. While we’ve had a strong track record of availability with EC2 DNS, we know how critical this service is to our customers, their applications and end users, and their businesses. We will do everything we can to learn from this event and use it to improve our availability even further.
