AI-powered monitoring and observability stack for AWS using CloudWatch, Lambda-based anomaly detection, Grafana dashboards, and intelligent alerting
The AWS AIOps Monitoring Stack is a production-ready, Terraform-based solution for AI-powered IT operations on AWS. It combines CloudWatch metrics, log analysis, anomaly detection, and automated alerting so you can identify and resolve infrastructure issues proactively.
```text
┌─────────────────┐
│   CloudWatch    │
│   Log Groups    │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│   CloudWatch    │
│     Metrics     │
└────────┬────────┘
         │
         ├───────────────────┐
         │                   │
         ▼                   ▼
┌─────────────────┐ ┌─────────────────┐
│  Log Analyzer   │ │ Anomaly Scorer  │
│     Lambda      │ │     Lambda      │
└────────┬────────┘ └────────┬────────┘
         │                   │
         └─────────┬─────────┘
                   │
                   ▼
          ┌─────────────────┐
          │    SNS Topic    │
          └────────┬────────┘
                   │
         ┌─────────┴─────────┐
         │                   │
         ▼                   ▼
┌─────────────────┐ ┌─────────────────┐
│      Slack      │ │    PagerDuty    │
│   Integration   │ │   Integration   │
└─────────────────┘ └─────────────────┘
```
Before deploying this stack, ensure you have:

- GitHub CLI (`gh`) for repository creation (optional)

The AWS credentials used must have permissions for:
Clone the repository:

```bash
git clone https://github.com/hammadhaqqani/aws-aiops-monitoring-stack.git
cd aws-aiops-monitoring-stack
```
Copy the example variables file and customize it:

```bash
cd examples/complete
cp terraform.tfvars.example terraform.tfvars
```
Edit `terraform.tfvars` with your values:

```hcl
region = "us-east-1"
environment = "prod"
project_name = "my-aiops-stack"
slack_webhook_url = "https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
pagerduty_integration_key = "your-pagerduty-key"
sns_email_addresses = ["admin@example.com"]
log_groups = [
  "/aws/lambda/my-function-1",
  "/aws/lambda/my-function-2"
]
```
Initialize, review, and apply the configuration:

```bash
terraform init
terraform plan
terraform apply
```
After deployment, you’ll receive outputs including:
Access the dashboards:
- Main dashboard: https://console.aws.amazon.com/cloudwatch/home?region=us-east-1#dashboards:name=aiops-monitoring-main-prod
- Cost dashboard: https://console.aws.amazon.com/cloudwatch/home?region=us-east-1#dashboards:name=aiops-monitoring-cost-prod

The `cloudwatch-dashboards` module creates pre-configured CloudWatch dashboards for infrastructure and cost monitoring.
Usage:

```hcl
module "cloudwatch_dashboards" {
  source = "./modules/cloudwatch-dashboards"

  project_name = "my-project"
  environment  = "prod"
  log_groups   = ["/aws/lambda/function1"]
}
```
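The console URLs listed under "Access the dashboards" follow a fixed pattern: the region goes in the query string and the dashboard name, `aiops-monitoring-{kind}-{env}`, goes in the URL fragment. The Python sketch below rebuilds them to illustrate the shape of a "dashboard name to URL" map like the `dashboard_urls` output. It assumes only the two dashboard kinds shown in this README (`main` and `cost`); the helper names `dashboard_url` and `build_dashboard_urls` are hypothetical and are not the module's code.

```python
from urllib.parse import quote

# Assumption: the two dashboard kinds shown in this README.
DASHBOARD_KINDS = ("main", "cost")


def dashboard_url(region: str, name: str) -> str:
    """Return a CloudWatch console deep link to the named dashboard."""
    return (
        "https://console.aws.amazon.com/cloudwatch/home"
        f"?region={region}#dashboards:name={quote(name)}"
    )


def build_dashboard_urls(region: str, environment: str) -> dict:
    """Map each dashboard name (aiops-monitoring-{kind}-{env}) to its console URL."""
    urls = {}
    for kind in DASHBOARD_KINDS:
        name = f"aiops-monitoring-{kind}-{environment}"
        urls[name] = dashboard_url(region, name)
    return urls
```

With `region = "us-east-1"` and `environment = "prod"`, this reproduces the two URLs listed above.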
Outputs:

- `dashboard_urls`: Map of dashboard names to URLs

The `cloudwatch-alarms` module creates threshold-based alarms and composite alarms for infrastructure monitoring.
Usage:

```hcl
module "cloudwatch_alarms" {
  source = "./modules/cloudwatch-alarms"

  project_name  = "my-project"
  environment   = "prod"
  sns_topic_arn = aws_sns_topic.alerts.arn
  log_groups    = ["/aws/lambda/function1"]
}
```
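CloudWatch alarms move to `ALARM` when M of the last N datapoints breach the threshold (the alarm's "datapoints to alarm" and "evaluation periods" settings). A composite alarm combines the states of other alarms with a rule such as `ALARM("a") AND ALARM("b")`. The Python sketch below models those semantics in simplified form to show how threshold alarms feed an AND-style composite alarm. It does not reproduce the module's alarm definitions: missing-data handling is ignored, and the default settings (2 of 3 datapoints) are assumptions for illustration.

```python
import operator

# CloudWatch ComparisonOperator names mapped to Python comparisons.
COMPARISONS = {
    "GreaterThanThreshold": operator.gt,
    "GreaterThanOrEqualToThreshold": operator.ge,
    "LessThanThreshold": operator.lt,
    "LessThanOrEqualToThreshold": operator.le,
}


def alarm_state(datapoints, threshold, comparison="GreaterThanThreshold",
                evaluation_periods=3, datapoints_to_alarm=2):
    """Simplified M-of-N evaluation over the most recent datapoints.

    Missing-data treatment is ignored: too few datapoints -> INSUFFICIENT_DATA.
    """
    window = list(datapoints)[-evaluation_periods:]
    if len(window) < evaluation_periods:
        return "INSUFFICIENT_DATA"
    compare = COMPARISONS[comparison]
    breaching = sum(1 for value in window if compare(value, threshold))
    return "ALARM" if breaching >= datapoints_to_alarm else "OK"


def composite_all(*child_states):
    """Composite rule ALARM(a) AND ALARM(b) ...: ALARM only if every child is."""
    if child_states and all(state == "ALARM" for state in child_states):
        return "ALARM"
    return "OK"
```

For example, with a threshold of 90 and 2-of-3 evaluation, the series `[10, 95, 97]` is in `ALARM`, while `[95, 10, 20]` is `OK`. A composite alarm built with `composite_all` stays `OK` until every child alarm is in `ALARM`, which is one way to cut noise from a single flapping metric.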
Features:
The `anomaly-detection` module enables CloudWatch's machine-learning-based anomaly detection for key metrics.
Usage:

```hcl
module "anomaly_detection" {
  source = "./modules/anomaly-detection"

  project_name  = "my-project"
  environment   = "prod"
  sns_topic_arn = aws_sns_topic.alerts.arn
}
```
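CloudWatch anomaly detection trains a model on a metric's history and draws an expected band around it. Alarms fire when the metric leaves that band, whose width is expressed in standard deviations (for example `ANOMALY_DETECTION_BAND(m1, 2)`). The Python sketch below approximates the band with a plain mean plus or minus k population standard deviations over recent history. It is a simplified stand-in chosen for illustration: CloudWatch's model also accounts for trend and seasonal patterns, which this sketch ignores.

```python
import statistics
from typing import Sequence, Tuple


def anomaly_band(history: Sequence[float], width: float = 2.0) -> Tuple[float, float]:
    """Return (lower, upper): mean minus/plus `width` population standard deviations."""
    mean = statistics.fmean(history)
    spread = statistics.pstdev(history)
    return mean - width * spread, mean + width * spread


def is_anomalous(history: Sequence[float], value: float, width: float = 2.0) -> bool:
    """True when `value` falls outside the band computed from `history`."""
    lower, upper = anomaly_band(history, width)
    return value < lower or value > upper
```

For a history of `[8, 10, 12]` and a width of 2, the band is roughly 6.73 to 13.27, so 13 is treated as normal and 14 as anomalous.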
Features:
The `cost-anomaly` module integrates with AWS Cost Anomaly Detection to monitor spending automatically.
Usage:

```hcl
module "cost_anomaly" {
  source = "./modules/cost-anomaly"

  project_name  = "my-project"
  environment   = "prod"
  sns_topic_arn = aws_sns_topic.alerts.arn
  account_id    = "123456789012"
  threshold     = 50 # USD
}
```
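The `threshold = 50 # USD` input presumably sets the minimum dollar impact an anomaly needs before it triggers an alert. The Python sketch below shows that filtering step on a simplified, hypothetical `CostAnomaly` record, which is not the Cost Explorer API response shape. Whether the real subscription compares total or maximum impact, and whether the comparison is inclusive, are assumptions here.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class CostAnomaly:
    """Hypothetical, simplified anomaly record (not the Cost Explorer API shape)."""
    service: str
    impact_usd: float


def alertable(anomalies: Iterable[CostAnomaly], threshold_usd: float = 50.0) -> List[CostAnomaly]:
    """Keep anomalies whose impact meets the threshold (inclusive), largest first."""
    hits = [a for a in anomalies if a.impact_usd >= threshold_usd]
    return sorted(hits, key=lambda a: a.impact_usd, reverse=True)
```

Sorting by impact puts the most expensive anomaly at the top of any notification built from the result.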
Features:
The `notifications` module configures the Slack and PagerDuty integrations used for alerting.
Usage:

```hcl
module "notifications" {
  source = "./modules/notifications"

  project_name              = "my-project"
  environment               = "prod"
  sns_topic_arn             = aws_sns_topic.alerts.arn
  slack_webhook_url         = var.slack_webhook_url
  pagerduty_integration_key = var.pagerduty_integration_key
}
```
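When a CloudWatch alarm publishes to SNS, the message body is a JSON document with fields such as `AlarmName`, `NewStateValue`, `NewStateReason`, and `Region`. The Python sketch below shows one way a subscriber Lambda could turn such a message into a Slack incoming-webhook body (a JSON object with a `text` field) and a PagerDuty Events API v2 event (`routing_key`, `event_action`, `dedup_key`, `payload`). It only builds the payloads and does not send them. The function names, the message wording, and the choice of `critical` severity are hypothetical, not how the `notifications` module is implemented.

```python
import json
from typing import Optional


def slack_payload(alarm: dict) -> dict:
    """Slack incoming-webhook body: a JSON object with a 'text' field."""
    return {
        "text": (
            f"[{alarm['NewStateValue']}] {alarm['AlarmName']} "
            f"({alarm.get('Region', 'unknown region')}): "
            f"{alarm.get('NewStateReason', '')}"
        )
    }


def pagerduty_event(alarm: dict, routing_key: str) -> Optional[dict]:
    """PagerDuty Events API v2 body: trigger on ALARM, resolve on OK, else None."""
    state = alarm["NewStateValue"]
    # Reusing the alarm name as dedup_key lets the OK event resolve the same incident.
    base = {"routing_key": routing_key, "dedup_key": alarm["AlarmName"]}
    if state == "ALARM":
        return {
            **base,
            "event_action": "trigger",
            "payload": {
                "summary": f"{alarm['AlarmName']} is in ALARM",
                "source": alarm.get("Region", "aws"),
                "severity": "critical",  # assumption: every alarm pages as critical
            },
        }
    if state == "OK":
        return {**base, "event_action": "resolve"}
    return None  # e.g. INSUFFICIENT_DATA: no page


def handle_sns_record(record: dict, routing_key: str):
    """Parse one SNS record from a Lambda event and build both payloads."""
    alarm = json.loads(record["Sns"]["Message"])
    return slack_payload(alarm), pagerduty_event(alarm, routing_key)
```

In a Lambda handler, each entry of `event["Records"]` would be passed to `handle_sns_record`, and the resulting bodies POSTed to the Slack webhook URL and the PagerDuty events endpoint.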
Features:
The Log Analyzer Lambda (`lambdas/log-analyzer/`) analyzes CloudWatch Logs for patterns, errors, and anomalies.
Capabilities:
Trigger: EventBridge rule (every 5 minutes)
Input:

```json
{
  "log_groups": ["/aws/lambda/function1"],
  "hours": 1
}
```
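For an input like the one above, the analyzer has to turn `hours` into a time window and then classify the log lines it reads. The Python sketch below shows both steps under stated assumptions: the window is returned in epoch milliseconds, the unit that CloudWatch Logs APIs such as `FilterLogEvents` use for `startTime` and `endTime`, and the regex pattern set is hypothetical rather than the Lambda's actual rules. Fetching events from each log group is left out so the sketch runs without AWS credentials.

```python
import re
import time
from collections import Counter
from typing import Iterable, Optional, Tuple

# Hypothetical pattern set for illustration; the real Lambda's rules may differ.
PATTERNS = {
    "error": re.compile(r"\bERROR\b", re.IGNORECASE),
    "exception": re.compile(r"\b\w*(?:Exception|Error):"),
    "timeout": re.compile(r"\btimed? ?out\b", re.IGNORECASE),
}


def time_window_ms(hours: float, now: Optional[float] = None) -> Tuple[int, int]:
    """Return (start, end) of the last `hours` in epoch milliseconds."""
    end = time.time() if now is None else now
    return int((end - hours * 3600) * 1000), int(end * 1000)


def count_patterns(messages: Iterable[str]) -> Counter:
    """Count how many messages match each pattern (a message can match several)."""
    counts: Counter = Counter()
    for message in messages:
        for name, pattern in PATTERNS.items():
            if pattern.search(message):
                counts[name] += 1
    return counts
```

For example, `["ERROR Task failed", "ValueError: bad input", "Task timed out after 3.00 seconds"]` yields one hit each for `error`, `exception`, and `timeout`.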
Output:
The Anomaly Scorer Lambda (`lambdas/anomaly-scorer/`) calculates anomaly scores for CloudWatch metrics using statistical methods.
Capabilities:
Trigger: EventBridge rule or manual invocation
Input:

```json
{
  "metrics": [
    {
      "namespace": "AWS/Lambda",
      "metric_name": "Duration",
      "statistic": "Average"
    }
  ]
}
```
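The scorer's exact statistical method is not documented here. A common choice is a z-score: the distance of the newest datapoint from the historical mean, measured in standard deviations. The Python sketch below assumes that approach, and assumes each metric in the input has already been fetched as a list of values, oldest first. The `namespace:metric_name:statistic` key format in the test is hypothetical.

```python
import math
import statistics
from typing import Dict, Sequence


def anomaly_score(history: Sequence[float], latest: float) -> float:
    """Absolute z-score of `latest` relative to `history`."""
    if len(history) < 2:
        raise ValueError("need at least two historical datapoints")
    mean = statistics.fmean(history)
    stdev = statistics.stdev(history)  # sample standard deviation
    if stdev == 0:
        # A flat history makes any change infinitely surprising.
        return 0.0 if latest == mean else math.inf
    return abs(latest - mean) / stdev


def score_series(series: Dict[str, Sequence[float]]) -> Dict[str, float]:
    """Score the newest datapoint of each series against the points before it."""
    return {key: anomaly_score(values[:-1], values[-1]) for key, values in series.items()}
```

A series of `[10, 12, 14, 18]` scores 3.0: the history has mean 12 and sample standard deviation 2, and 18 sits three standard deviations above the mean.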
Output:
Pre-built CloudWatch dashboards are created automatically:

- `aiops-monitoring-main-{env}`
- `aiops-monitoring-cost-{env}`
Grafana dashboard JSON configurations are provided in `dashboards/grafana/`:

- `infrastructure-overview.json`
- `cost-analysis.json`
Import Instructions:
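Besides importing through the Grafana UI, these files can be loaded with Grafana's HTTP API, which accepts `POST /api/dashboards/db` with a body containing the `dashboard` model and an `overwrite` flag. The Python sketch below only builds that request with the standard library. The Grafana URL, API token, and optional folder UID are placeholders you would supply, and the helper name `build_import_request` is hypothetical.

```python
import json
import urllib.request
from typing import Optional


def build_import_request(grafana_url: str, token: str, dashboard: dict,
                         folder_uid: Optional[str] = None) -> urllib.request.Request:
    """Build (but do not send) a POST /api/dashboards/db request for one dashboard."""
    body = {
        # id = null asks Grafana to create the dashboard; overwrite=True replaces
        # an existing dashboard with the same uid.
        "dashboard": {**dashboard, "id": None},
        "overwrite": True,
    }
    if folder_uid is not None:
        body["folderUid"] = folder_uid
    return urllib.request.Request(
        grafana_url.rstrip("/") + "/api/dashboards/db",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

To send it, load a file such as `dashboards/grafana/cost-analysis.json` with `json.load`, build the request, and pass it to `urllib.request.urlopen`.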
| Service | Usage | Est. monthly cost (USD) |
|---|---|---|
| CloudWatch Metrics | ~100 metrics | $0.30 |
| CloudWatch Logs | 5 GB ingestion | $2.50 |
| CloudWatch Alarms | 20 alarms | $6.00 |
| Lambda Invocations | 8,640/month (5-min schedule) | $0.17 |
| Lambda Compute | 512 MB, 5-min runs | $2.00 |
| SNS | 1,000 notifications | $0.50 |
| Cost Anomaly Detection | Included | $0.00 |
| Total | | ~$11.50 |
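The total is the sum of the line items ($11.47, rounded to ~$11.50), and the invocation count corresponds to one function on a 5-minute schedule over a 30-day month (30 × 24 × 12 = 8,640). The Python sketch below reproduces both figures from the table. It uses `Decimal` to avoid floating-point rounding in currency sums and does not look up current AWS pricing; the 30-day month is an assumption.

```python
from decimal import Decimal

# Monthly line items from the estimate table, in USD.
LINE_ITEMS = {
    "CloudWatch Metrics": Decimal("0.30"),
    "CloudWatch Logs": Decimal("2.50"),
    "CloudWatch Alarms": Decimal("6.00"),
    "Lambda Invocations": Decimal("0.17"),
    "Lambda Compute": Decimal("2.00"),
    "SNS": Decimal("0.50"),
    "Cost Anomaly Detection": Decimal("0.00"),
}


def monthly_total(items: dict) -> Decimal:
    """Sum the line items exactly (no float rounding)."""
    return sum(items.values(), Decimal("0"))


def scheduled_invocations(interval_minutes: int = 5, days: int = 30) -> int:
    """Invocations of one function on a fixed-rate schedule over `days` days."""
    return days * 24 * 60 // interval_minutes
```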
Contributions are welcome! Please follow these guidelines:
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Run `terraform fmt` before committing
- Run `terraform validate`
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)

```bash
# Install pre-commit hooks (optional)
pre-commit install

# Format Terraform code
terraform fmt -recursive

# Validate Terraform
terraform validate

# Run security scan
tfsec .
```
This project is licensed under the MIT License - see the LICENSE file for details.
For issues, questions, or contributions:
If you find this useful, consider buying me a coffee!