Incident Report: MySQL Flexible Server Replacement Due to Index Change
Summary:
An Azure MySQL Flexible Server was unintentionally replaced during a Terraform apply operation. The root cause was the removal of an item from the customer_aliases list. Because the servers were created with count and addressed by list position, removing the item shifted the indices of the entries after it, and Terraform planned to destroy and recreate resources whose index no longer matched their alias. The affected resource was the MySQL server corresponding to the alias "CDEF".
Incident Details:
- Date of Incident: 17 May 2024
- Affected Service: Azure MySQL Flexible Server
- Terraform State Before Incident:
customer_aliases = [
"ABCD",
"CDEF",
"ASDF"
]
- Terraform State After Incident:
customer_aliases = [
"ABCD",
"ASDF"
]
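- Issue Observed: Removing "CDEF" from the customer_aliases list caused "ASDF" to shift from index 2 to index 1. Because each server was bound to its list position, Terraform saw a different alias at index 1 and no entry at index 2, and therefore planned to destroy and recreate servers instead of leaving the remaining ones untouched.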
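The index shift can be illustrated with a minimal count-based configuration. This is a hypothetical reconstruction, not the configuration from the incident: the resource label `customer`, the `mysql-` name prefix, and the `resource_group_name` and `location` variables are assumptions made for the sketch, and most server arguments are omitted.

```hcl
# Hypothetical sketch: only customer_aliases and the use of count.index
# come from the incident report; every other name is an assumption.
variable "customer_aliases" {
  type    = list(string)
  default = ["ABCD", "CDEF", "ASDF"]
}

variable "resource_group_name" {
  type = string
}

variable "location" {
  type = string
}

resource "azurerm_mysql_flexible_server" "customer" {
  count = length(var.customer_aliases)

  # The alias is looked up by list position, so each server instance is
  # bound to an index rather than to the alias itself.
  name                = "mysql-${lower(var.customer_aliases[count.index])}"
  resource_group_name = var.resource_group_name
  location            = var.location

  # Other arguments (SKU, credentials, storage, ...) omitted from this sketch.
}

# Instance addresses with the original list:
#   azurerm_mysql_flexible_server.customer[0] -> "ABCD"
#   azurerm_mysql_flexible_server.customer[1] -> "CDEF"
#   azurerm_mysql_flexible_server.customer[2] -> "ASDF"
#
# Instance addresses after removing "CDEF":
#   customer[0] -> "ABCD"  (unchanged)
#   customer[1] -> "ASDF"  (name no longer matches state; changing the
#                           server name forces a new server to be created)
#   customer[2] -> not in configuration any more, planned for destroy
```

In a setup like this, running `terraform plan` after the list change would show a replacement at index 1 and a destroy at index 2, which is the signal to stop before applying.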
Root Cause Analysis
The root cause of this incident was the removal of an item from the customer_aliases list, which shifted the indices of the items that followed it. Three factors contributed:
- Dynamic Indexing: The use of count.index in the resource naming and other properties tied the MySQL server instances to specific list indices.
- List Modification: Modifying the customer_aliases list caused indices to shift, leading to resource destruction and recreation.
- Terraform State Dependency: When count is used, Terraform's state tracks each resource instance by its index, which makes it sensitive to changes in the order or length of the list.
Remediation
- Lock Critical Resources: Apply resource locks to critical infrastructure to prevent unintended destruction or replacement.
- Use Unique Identifiers: Instead of relying on list indices, use unique identifiers for each resource. This can be achieved by using the for_each feature in Terraform.
- Review and Test Changes: Thoroughly review and test any changes to Terraform configurations in a staging environment before applying them to production. Implement automated tests and continuous integration (CI) pipelines to detect potential issues early.
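The first two remediation items could look roughly like the sketch below. It assumes the same hypothetical names as a count-based setup (`customer`, the `mysql-` prefix, `resource_group_name`, `location`), assumes Terraform 1.1 or later for `moved` blocks, and assumes the current state holds "ABCD" at index 0 and "ASDF" at index 1. Verify the actual addresses with `terraform state list` before adapting it.

```hcl
# Hypothetical sketch: names other than customer_aliases are assumptions.
variable "customer_aliases" {
  type    = list(string)
  default = ["ABCD", "ASDF"]
}

variable "resource_group_name" {
  type = string
}

variable "location" {
  type = string
}

resource "azurerm_mysql_flexible_server" "customer" {
  # Instances are keyed by alias: customer["ABCD"], customer["ASDF"].
  # Removing one alias only affects the instance with that key.
  for_each = toset(var.customer_aliases)

  name                = "mysql-${lower(each.key)}"
  resource_group_name = var.resource_group_name
  location            = var.location

  # Other arguments (SKU, credentials, storage, ...) omitted from this sketch.

  lifecycle {
    # Makes any plan that would destroy this server fail with an error.
    prevent_destroy = true
  }
}

# Tell Terraform that the old index-based instances are the same objects
# as the new key-based ones, so the refactor itself replaces nothing.
# The index-to-alias mapping here is an assumption about the current state.
moved {
  from = azurerm_mysql_flexible_server.customer[0]
  to   = azurerm_mysql_flexible_server.customer["ABCD"]
}

moved {
  from = azurerm_mysql_flexible_server.customer[1]
  to   = azurerm_mysql_flexible_server.customer["ASDF"]
}

# Azure-side lock as a second line of defence: delete requests against
# the server are rejected until the lock is removed.
resource "azurerm_management_lock" "customer_db" {
  for_each = azurerm_mysql_flexible_server.customer

  name       = "lock-${lower(each.key)}"
  scope      = each.value.id
  lock_level = "CanNotDelete"
  notes      = "Protects the customer MySQL server from accidental deletion."
}
```

Note that `prevent_destroy` and the management lock also block intentional removals, so decommissioning a customer would need an explicit step to lift them first. Whether to use one or both is a judgment call for the team.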
Conclusion
By implementing the above remediation steps, we can prevent similar incidents in the future. Using unique identifiers and the for_each construct allows for more resilient resource management, reducing the risk of unintended resource replacement due to index changes. Additionally, locking critical resources and thorough testing will enhance the overall stability of the infrastructure.