aws-aiops-monitoring-stack

AWS AIOps Monitoring Stack

AI-powered monitoring and observability stack for AWS using CloudWatch, Lambda-based anomaly detection, Grafana dashboards, and intelligent alerting

🎯 Overview

The AWS AIOps Monitoring Stack is a production-ready, Terraform-based solution for AI-powered IT operations on AWS. It combines CloudWatch metrics, log analysis, anomaly detection, and automated alerting so that you can identify and resolve infrastructure issues before they escalate.

Key Capabilities

🏗️ Architecture

┌─────────────────┐
│  CloudWatch     │
│  Log Groups     │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  CloudWatch     │
│  Metrics        │
└────────┬────────┘
         │
         ├──────────────────┐
         │                  │
         ▼                  ▼
┌─────────────────┐  ┌─────────────────┐
│  Log Analyzer   │  │ Anomaly Scorer  │
│  Lambda         │  │ Lambda          │
└────────┬────────┘  └────────┬────────┘
         │                    │
         │                    │
         └────────┬───────────┘
                  │
                  ▼
         ┌─────────────────┐
         │  SNS Topic      │
         └────────┬────────┘
                  │
         ┌────────┴────────┐
         │                 │
         ▼                 ▼
┌─────────────────┐  ┌─────────────────┐
│  Slack          │  │  PagerDuty      │
│  Integration    │  │  Integration    │
└─────────────────┘  └─────────────────┘

Data Flow

  1. Logs flow from AWS services (Lambda, ECS, etc.) into CloudWatch Log Groups
  2. Metrics are collected by CloudWatch from various AWS services
  3. Log Analyzer Lambda processes logs every 5 minutes, detecting patterns and errors
  4. Anomaly Scorer Lambda analyzes metrics using statistical methods (Z-score, percentiles, trends)
  5. Alerts are published to SNS when anomalies or errors are detected
  6. Notifications are sent to Slack, PagerDuty, or email based on configuration
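Step 4 mentions Z-score analysis. The following is a minimal sketch of that idea, not the Anomaly Scorer's actual code: the function names and the threshold of 3.0 are assumptions chosen for illustration.

```python
import statistics


def z_score(history: list[float], latest: float) -> float:
    """Return how many sample standard deviations `latest` lies from the mean of `history`."""
    if len(history) < 2:
        return 0.0  # not enough data to estimate spread
    mean = statistics.fmean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return 0.0  # flat baseline; this sketch treats it as not anomalous
    return (latest - mean) / stdev


def is_anomalous(history: list[float], latest: float, threshold: float = 3.0) -> bool:
    """Flag a datapoint whose absolute Z-score meets the (assumed) threshold."""
    return abs(z_score(history, latest)) >= threshold
```

For example, with a baseline of `[100, 102, 98, 101, 99]`, a new value of `120` would be flagged while `101` would not. A real scorer would likely also handle the flat-baseline case more carefully than returning zero.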

✨ Features

Core Features

Advanced Features

📦 Prerequisites

Before deploying this stack, ensure you have:

  1. AWS Account with appropriate permissions
  2. Terraform >= 1.0 installed
  3. AWS CLI configured with credentials
  4. Python 3.11 (for local Lambda testing, optional)
  5. GitHub CLI (gh) for repository creation (optional)

Required AWS Permissions

The AWS credentials used must have permissions for:

Optional Integrations

🚀 Quick Start

1. Clone the Repository

git clone https://github.com/hammadhaqqani/aws-aiops-monitoring-stack.git
cd aws-aiops-monitoring-stack

2. Configure Variables

Copy the example variables file and customize:

cd examples/complete
cp terraform.tfvars.example terraform.tfvars

Edit terraform.tfvars with your values:

region       = "us-east-1"
environment  = "prod"
project_name = "my-aiops-stack"

slack_webhook_url         = "https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
pagerduty_integration_key = "your-pagerduty-key"
sns_email_addresses       = ["admin@example.com"]

log_groups = [
  "/aws/lambda/my-function-1",
  "/aws/lambda/my-function-2"
]

3. Initialize Terraform

terraform init

4. Review the Plan

terraform plan

5. Deploy

terraform apply

6. Verify Deployment

After deployment, you’ll receive outputs including:

Access the dashboards:

📚 Modules

CloudWatch Dashboards Module

Creates pre-configured CloudWatch dashboards for infrastructure and cost monitoring.

Usage:

module "cloudwatch_dashboards" {
  source = "./modules/cloudwatch-dashboards"
  
  project_name = "my-project"
  environment  = "prod"
  log_groups   = ["/aws/lambda/function1"]
}

Outputs:

CloudWatch Alarms Module

Creates threshold-based alarms and composite alarms for infrastructure monitoring.

Usage:

module "cloudwatch_alarms" {
  source = "./modules/cloudwatch-alarms"
  
  project_name  = "my-project"
  environment   = "prod"
  sns_topic_arn = aws_sns_topic.alerts.arn
  log_groups    = ["/aws/lambda/function1"]
}
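Composite alarms combine the states of other alarms through a rule expression. As a rough illustration, the helper below builds such an expression in the CloudWatch `ALARM("name")` syntax; the helper and the alarm names are hypothetical, and the module may construct its rules differently.

```python
def composite_alarm_rule(alarm_names: list[str], operator: str = "OR") -> str:
    """Join child alarm names into a CloudWatch composite alarm rule expression."""
    if operator not in ("AND", "OR"):
        raise ValueError("operator must be 'AND' or 'OR'")
    if not alarm_names:
        raise ValueError("at least one alarm name is required")
    # Each child alarm is referenced as ALARM("<name>") in the rule.
    return f" {operator} ".join(f'ALARM("{name}")' for name in alarm_names)


# Hypothetical alarm names, for illustration only.
rule = composite_alarm_rule(["high-errors", "high-latency"], operator="AND")
print(rule)
```

A string like this is what would be passed as the `alarm_rule` argument of a Terraform `aws_cloudwatch_composite_alarm` resource.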

Features:

Anomaly Detection Module

Enables CloudWatch anomaly detection for key metrics using ML-based algorithms.

Usage:

module "anomaly_detection" {
  source = "./modules/anomaly-detection"
  
  project_name  = "my-project"
  environment   = "prod"
  sns_topic_arn = aws_sns_topic.alerts.arn
}

Features:

Cost Anomaly Module

Integrates with AWS Cost Anomaly Detection for automated cost monitoring.

Usage:

module "cost_anomaly" {
  source = "./modules/cost-anomaly"
  
  project_name  = "my-project"
  environment   = "prod"
  sns_topic_arn = aws_sns_topic.alerts.arn
  account_id    = "123456789012"
  threshold     = 50  # USD
}

Features:

Notifications Module

Configures Slack and PagerDuty integrations for alerting.

Usage:

module "notifications" {
  source = "./modules/notifications"
  
  project_name              = "my-project"
  environment               = "prod"
  sns_topic_arn             = aws_sns_topic.alerts.arn
  slack_webhook_url         = var.slack_webhook_url
  pagerduty_integration_key = var.pagerduty_integration_key
}
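One way an SNS-subscribed Lambda could forward alerts to a Slack incoming webhook is sketched below. This is an assumption about the approach, not the module's actual code: the function names and the fallback subject are hypothetical, while the `Records[].Sns` event shape and the `{"text": ...}` webhook payload are the standard formats.

```python
import json
import urllib.request


def slack_payloads(sns_event: dict) -> list[dict]:
    """Convert each SNS record in a Lambda event into a Slack webhook payload."""
    payloads = []
    for record in sns_event.get("Records", []):
        sns = record["Sns"]
        subject = sns.get("Subject") or "AIOps alert"  # fallback subject is an assumption
        payloads.append({"text": f"*{subject}*\n{sns['Message']}"})
    return payloads


def post_to_slack(webhook_url: str, payload: dict) -> int:
    """POST one payload to the webhook and return the HTTP status code."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status
```

Keeping payload construction separate from the HTTP call makes the formatting logic testable without network access.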

Features:

🔧 Lambda Functions

Log Analyzer (lambdas/log-analyzer/)

Analyzes CloudWatch Logs for patterns, errors, and anomalies.

Capabilities:

Trigger: EventBridge rule (every 5 minutes)

Input:

{
  "log_groups": ["/aws/lambda/function1"],
  "hours": 1
}
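To illustrate how a handler might interpret this input, the sketch below converts `hours` into an epoch-millisecond window (the unit CloudWatch Logs queries expect) and counts log messages that match a few error patterns. The pattern set and function names are assumptions; the real Log Analyzer may match different patterns.

```python
import re
from collections import Counter
from datetime import datetime, timedelta, timezone

# Hypothetical patterns, for illustration only.
ERROR_PATTERNS = {
    "error": re.compile(r"\bERROR\b"),
    "timeout": re.compile(r"timed out", re.IGNORECASE),
    "exception": re.compile(r"\bException\b|Traceback"),
}


def time_window_ms(event: dict, now: datetime) -> tuple[int, int]:
    """Turn the event's `hours` field into (start, end) epoch milliseconds."""
    hours = event.get("hours", 1)
    start = now - timedelta(hours=hours)
    return int(start.timestamp() * 1000), int(now.timestamp() * 1000)


def count_patterns(messages: list[str]) -> Counter:
    """Count how many messages match each named pattern."""
    counts: Counter = Counter()
    for message in messages:
        for name, pattern in ERROR_PATTERNS.items():
            if pattern.search(message):
                counts[name] += 1
    return counts


event = {"log_groups": ["/aws/lambda/function1"], "hours": 1}
start_ms, end_ms = time_window_ms(event, datetime.now(timezone.utc))
```

The resulting `start_ms` and `end_ms` could then be passed as the `startTime` and `endTime` of a CloudWatch Logs query for each entry in `log_groups`.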

Output:

Anomaly Scorer (lambdas/anomaly-scorer/)

Calculates anomaly scores for CloudWatch metrics using statistical methods.

Capabilities:

Trigger: EventBridge rule or manual invocation

Input:

{
  "metrics": [
    {
      "namespace": "AWS/Lambda",
      "metric_name": "Duration",
      "statistic": "Average"
    }
  ]
}
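As a hedged sketch of how this input might be consumed, the helper below maps each entry in `metrics` to keyword arguments for the CloudWatch `get_metric_statistics` call. The 24-hour lookback and 300-second period are assumed defaults, not values taken from the project.

```python
from datetime import datetime, timedelta, timezone


def metric_query_kwargs(
    metric: dict,
    now: datetime,
    lookback_hours: int = 24,   # assumed default
    period_seconds: int = 300,  # assumed default
) -> dict:
    """Build keyword arguments for CloudWatch get_metric_statistics from one input entry."""
    return {
        "Namespace": metric["namespace"],
        "MetricName": metric["metric_name"],
        "StartTime": now - timedelta(hours=lookback_hours),
        "EndTime": now,
        "Period": period_seconds,
        "Statistics": [metric.get("statistic", "Average")],
    }


def queries_from_event(event: dict, now: datetime) -> list[dict]:
    """Return one query definition per metric listed in the event."""
    return [metric_query_kwargs(m, now) for m in event.get("metrics", [])]
```

With boto3 available, each dictionary could be passed as `cloudwatch.get_metric_statistics(**kwargs)`, and the returned datapoints fed to the scoring step.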

Output:

📊 Dashboards

CloudWatch Dashboards

Pre-built dashboards are automatically created:

  1. Main Dashboard (aiops-monitoring-main-{env})
    • Lambda metrics overview
    • Error logs
    • ALB metrics
    • ECS container metrics
  2. Cost Dashboard (aiops-monitoring-cost-{env})
    • Daily AWS charges
    • Cost trends
    • Lambda cost drivers

Grafana Dashboards

JSON configurations are provided in dashboards/grafana/:

  1. Infrastructure Overview (infrastructure-overview.json)
    • Lambda invocations and errors
    • Anomaly scores
    • ALB response times
    • Error rates and active alarms
  2. Cost Analysis (cost-analysis.json)
    • Daily charges
    • Cost by service
    • Cost anomaly detection
    • Monthly cost forecast

Import Instructions:

  1. Open Grafana
  2. Go to Dashboards → Import
  3. Upload the JSON file
  4. Configure data source (CloudWatch or Prometheus)

💰 Cost Estimation

Monthly Cost Breakdown (Estimated)

Service                 Usage                         Cost
CloudWatch Metrics      ~100 metrics                  $0.30
CloudWatch Logs         5 GB ingestion                $2.50
CloudWatch Alarms       20 alarms                     $6.00
Lambda Invocations      8,640/month (5-min schedule)  $0.17
Lambda Compute          512 MB, 5-min runs            $2.00
SNS                     1,000 notifications           $0.50
Cost Anomaly Detection  Included                      $0.00
Total                                                 ~$11.50/month
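The arithmetic behind two of these figures can be checked with a short script. It simply reuses the estimates listed above and assumes a 30-day month; it does not query current AWS pricing.

```python
# Monthly estimates copied from the breakdown above (USD).
ESTIMATES = {
    "CloudWatch Metrics": 0.30,
    "CloudWatch Logs": 2.50,
    "CloudWatch Alarms": 6.00,
    "Lambda Invocations": 0.17,
    "Lambda Compute": 2.00,
    "SNS": 0.50,
    "Cost Anomaly Detection": 0.00,
}


def scheduled_invocations(interval_minutes: int, days: int = 30) -> int:
    """Invocations per month for one function on a fixed-interval schedule."""
    return (60 // interval_minutes) * 24 * days


total = round(sum(ESTIMATES.values()), 2)
print(f"Invocations per month: {scheduled_invocations(5)}")
print(f"Estimated total: ${total:.2f}")
```

A 5-minute schedule gives 12 runs per hour, which is where the 8,640 invocations per month comes from, and the line items sum to $11.47, rounded to ~$11.50 in the total.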

Cost Optimization Tips

  1. Reduce Log Retention: Adjust log retention periods based on needs
  2. Optimize Lambda Memory: Tune memory size based on actual usage
  3. Filter Logs: Use metric filters to reduce log ingestion
  4. Consolidate Alarms: Use composite alarms to reduce alarm count
  5. Schedule Analysis: Adjust Lambda schedule frequency based on requirements

Free Tier Eligibility

🤝 Contributing

Contributions are welcome! Please follow these guidelines:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Make your changes with tests if applicable
  4. Follow Terraform best practices:
    • Use terraform fmt before committing
    • Validate with terraform validate
    • Document new variables and outputs
  5. Commit your changes (git commit -m 'Add amazing feature')
  6. Push to the branch (git push origin feature/amazing-feature)
  7. Open a Pull Request

Development Setup

# Install pre-commit hooks (optional)
pre-commit install

# Format Terraform code
terraform fmt -recursive

# Validate Terraform
terraform validate

# Run security scan
tfsec .

Code Style

📝 License

This project is licensed under the MIT License. See the LICENSE file for details.

🙏 Acknowledgments

📞 Support

For issues, questions, or contributions:


Built with ❤️ for AWS AIOps

Support

If you find this useful, consider buying me a coffee!
