MySQL Flexible Server

Incident Report: MySQL Flexible Server Replacement Due to Index Change

Summary:

A MySQL Flexible Server was unintentionally replaced during a Terraform apply operation. The root cause was the removal of an item from the customer_aliases list, which shifted the list indices, left the last index out of bounds of the shortened list, and led to unintended resource destruction and recreation. The affected resource was the MySQL server corresponding to the alias "CDEF".

Incident Details:

Date of Incident: 17 May 2024
Affected Service: Azure MySQL Flexible Server
Terraform State Before Incident:

customer_aliases = [
  "ABCD",
  "CDEF",
  "ASDF"
]

Terraform State After Incident:

customer_aliases = [
  "ABCD",
  "ASDF"
]

Issue Observed: Removing "CDEF" from the customer_aliases list shifted the entry at index 2 ("ASDF", counting from zero) to index 1. Index 2 was then out of bounds for the shortened list, and Terraform attempted to replace the server instead of leaving it unchanged.

Root Cause Analysis

The root cause of this incident was the removal of an item from the customer_aliases list, which left the existing resource indices out of step with the shortened list. Three factors contributed:

Dynamic Indexing: The use of count.index in the resource naming and other properties tied the MySQL server instances to specific list indices.
List Modification: Modifying the customer_aliases list caused indices to shift, leading to resource destruction and recreation.
Terraform State Dependency: With count, Terraform's state tracks each resource instance by its numeric index, which makes it sensitive to changes in the order or length of the list.
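
The pattern described above can be sketched as follows. This is a minimal illustration and not the configuration from the incident: the resource label customer, the mysql- name prefix, and the resource_group_name and location variables are assumptions, and arguments such as credentials and SKU are omitted.

```hcl
variable "customer_aliases" {
  type    = list(string)
  default = ["ABCD", "CDEF", "ASDF"]
}

# Hypothetical inputs, not taken from the incident configuration.
variable "resource_group_name" {
  type = string
}

variable "location" {
  type = string
}

# Assumed shape of the count-based resource. Each instance is tracked by its
# list position:
#   azurerm_mysql_flexible_server.customer[0] -> "ABCD"
#   azurerm_mysql_flexible_server.customer[1] -> "CDEF"
#   azurerm_mysql_flexible_server.customer[2] -> "ASDF"
resource "azurerm_mysql_flexible_server" "customer" {
  count = length(var.customer_aliases)

  # The name is derived from the element at this index. Removing "CDEF" makes
  # customer[1] resolve to "ASDF", and a changed name forces a replacement;
  # customer[2] no longer has a list element and is planned for destruction.
  name                = "mysql-${lower(var.customer_aliases[count.index])}"
  resource_group_name = var.resource_group_name
  location            = var.location

  # Remaining arguments (administrator credentials, SKU, storage) omitted.
}
```

Under these assumptions, any removal other than the last list element changes the alias seen by every later index, so a single-line edit to the list can affect several servers at once.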

Remediation

Lock Critical Resources:

Apply resource locks to critical infrastructure to prevent unintended destruction or replacement.
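
One way to apply such a lock is an Azure management lock declared in Terraform. The sketch below is illustrative only: the lock name, the notes text, and the mysql_server_id variable are assumptions, not values from the incident.

```hcl
# Hypothetical input: the Azure resource ID of the server to protect.
variable "mysql_server_id" {
  type = string
}

# A CanNotDelete lock lets the resource be read and modified but not deleted.
resource "azurerm_management_lock" "mysql_no_delete" {
  name       = "mysql-no-delete"
  scope      = var.mysql_server_id
  lock_level = "CanNotDelete"
  notes      = "Prevents accidental deletion of the MySQL Flexible Server."
}
```

With a CanNotDelete lock in place, Azure rejects delete requests for the scoped resource until the lock is removed, so a destroy-and-recreate plan fails at apply time instead of removing the server. Terraform's lifecycle prevent_destroy setting is a separate, complementary guard that rejects such a plan before any API call is made.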
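
Use Unique Identifiers:

Instead of relying on list indices, key each resource by a unique identifier such as the customer alias. In Terraform this can be achieved with the for_each meta-argument, which tracks each instance by its key rather than by its position in a list.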
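
A minimal sketch of the for_each approach follows. The resource label customer, the mysql- name prefix, and the resource_group_name and location variables are assumptions for illustration; the moved blocks are an optional migration aid and assume Terraform 1.1 or later.

```hcl
variable "customer_aliases" {
  type    = list(string)
  default = ["ABCD", "CDEF", "ASDF"]
}

# Hypothetical inputs, not taken from the incident configuration.
variable "resource_group_name" {
  type = string
}

variable "location" {
  type = string
}

resource "azurerm_mysql_flexible_server" "customer" {
  # toset() converts the list to a set of strings. Each instance is tracked by
  # its alias, for example azurerm_mysql_flexible_server.customer["ASDF"].
  for_each = toset(var.customer_aliases)

  name                = "mysql-${lower(each.key)}"
  resource_group_name = var.resource_group_name
  location            = var.location

  # Remaining arguments (administrator credentials, SKU, storage) omitted.
}

# Optional: when converting existing count-based instances, moved blocks
# re-address them in state so that they are not destroyed and recreated.
moved {
  from = azurerm_mysql_flexible_server.customer[0]
  to   = azurerm_mysql_flexible_server.customer["ABCD"]
}

moved {
  from = azurerm_mysql_flexible_server.customer[1]
  to   = azurerm_mysql_flexible_server.customer["CDEF"]
}

moved {
  from = azurerm_mysql_flexible_server.customer[2]
  to   = azurerm_mysql_flexible_server.customer["ASDF"]
}
```

With this layout, removing "CDEF" from customer_aliases only affects the instance keyed "CDEF"; the instances keyed "ABCD" and "ASDF" keep their addresses regardless of list order or length.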

Review and Test Changes:

Thoroughly review and test any changes to Terraform configurations in a staging environment before applying them to production. In particular, check the terraform plan output for unexpected destroy or replace actions before running apply. Implement automated tests and continuous integration (CI) pipelines to detect potential issues early.

Conclusion

By implementing the above remediation steps, we can prevent similar incidents in the future. Using unique identifiers and the for_each construct allows for more resilient resource management, reducing the risk of unintended resource replacement due to index changes. Additionally, locking critical resources and thorough testing will enhance the overall stability of the infrastructure.