How to Detect AWS ECS Deployment Failure
Deployment Issue and Alert Configuration
Recently we ran into an interesting issue: due to an incorrect cloud config entry, one of our
services fell into an endless loop of failed deployments.
To catch this earlier in the future, we decided to add an alert for it.
We deploy our services on AWS ECS.
Since we already use yet-another-cloudwatch-exporter
to export CloudWatch metrics into Prometheus, the natural approach was to
expose a new metric and add a Prometheus alert on top of it.
We exported one new metric from the ECS/ContainerInsights namespace:
apiVersion: v1alpha1
# [...]
discovery:
  jobs:
    - type: ECS/ContainerInsights
      regions:
        - eu-west-1
      dimensionNameRequirements: [ClusterName, ServiceName]
      statistics: [Average]
      nilToZero: true
      addCloudwatchTimestamp: false
      metrics:
        - name: DeploymentCount
          # [...]
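With this job in place, the exporter exposes a deployment-count gauge per cluster and service. The exact label set depends on the exporter version, but the resulting series should look roughly like this (the cluster and service names below are made-up examples):

aws_ecs_containerinsights_deployment_count_average{dimension_ClusterName="prod-cluster", dimension_ServiceName="payments-api", region="eu-west-1"} 1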
Then we added a new Prometheus alerting rule in config/apps-alerts.yml.
DeploymentCount reports how many deployments an ECS service currently has; during a healthy
rolling update it briefly rises to 2 and then drops back to 1, so a value that stays at 2 or
more for 15 minutes is a strong hint that a rollout is stuck:
---
groups:
  - name: apps-alerts
    rules:
      - alert: DeploymentAlert
        expr: sum by (dimension_ServiceName) (aws_ecs_containerinsights_deployment_count_average) >= 2
        for: 15m
        labels:
          severity: page
        annotations:
          summary: "Probable deployment problem: `{{ $labels.dimension_ServiceName }}`"
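Before wiring the file into Prometheus, it is worth validating it with promtool, which ships with Prometheus (the path below assumes the repository layout mentioned above):

promtool check rules config/apps-alerts.yml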
Finally, we referenced the new rule file in our environment-specific Prometheus configs
(prometheus-XXX-config.yml):
# [...]
rule_files:
  # [...]
  - apps-alerts.yml
# [...]
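A running Prometheus does not pick up new rule files on its own; it needs a configuration reload. Sending SIGHUP works, or, if the lifecycle API is enabled with --web.enable-lifecycle, an HTTP reload can be triggered (the address below is just an assumption for a local instance):

curl -X POST http://localhost:9090/-/reload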
Now, whenever a deployment gets stuck in a failing loop, we see a notification in our Slack.
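For completeness: the Slack delivery itself is handled by Alertmanager. A minimal sketch of a matching route and receiver might look like the following (the webhook URL, channel, and receiver names are placeholders, not our actual config):

route:
  receiver: default
  routes:
    - matchers:
        - severity = "page"
      receiver: slack-pages

receivers:
  - name: default
  - name: slack-pages
    slack_configs:
      - api_url: https://hooks.slack.com/services/...   # placeholder webhook URL
        channel: "#alerts"
        title: '{{ .CommonAnnotations.summary }}'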