grabbit ca7e92a1a1 🎉 Epic 3 Complete: Production Readiness & Observability
Successfully implemented comprehensive monitoring and alerting infrastructure for the Meteor platform across all three stories of Epic 3:

**Story 3.5: Core Business Metrics Monitoring**
- Instrumented NestJS web backend with CloudWatch metrics integration using prom-client
- Instrumented Go compute service with structured CloudWatch metrics reporting
- Created comprehensive Terraform infrastructure from scratch with modular design
- Built 5-row CloudWatch dashboard with application, error rate, business, and infrastructure metrics
- Added proper error categorization and provider performance tracking

**Story 3.6: Critical System Alerts**
- Implemented SNS-based alerting infrastructure via Terraform
- Created critical alarms for NestJS 5xx error rate (>1% threshold)
- Created Go service processing failure rate alarm (>5% threshold)
- Created SQS queue depth alarm (>1000 messages threshold)
- Added actionable alarm descriptions with investigation guidance
- Configured email notifications with manual confirmation workflow

**Cross-cutting Infrastructure:**
- Complete AWS infrastructure as code with Terraform (S3, SQS, CloudWatch, SNS, IAM, optional RDS/Fargate)
- Structured logging implementation across all services (NestJS, Go, Rust)
- Metrics collection following the "Four Golden Signals" observability approach
- Configurable thresholds and deployment-ready monitoring solution

The platform now has production-grade observability with comprehensive metrics collection, centralized monitoring dashboards, and automated critical system alerting.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-08-03 23:42:01 +08:00

Meteor Fullstack Infrastructure

This directory contains Terraform configuration for the Meteor fullstack application AWS infrastructure.

Overview

The infrastructure includes:

  • S3 bucket for storing meteor event files and media
  • SQS queue for processing meteor events with dead letter queue
  • CloudWatch dashboard for comprehensive monitoring
  • IAM policies and roles for service permissions
  • Optional RDS PostgreSQL instance
  • Optional VPC and Fargate configuration for containerized deployment
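
As an illustration of the "SQS queue with dead letter queue" item, the following is a minimal sketch of how such a pair is commonly declared. The resource names, queue names, and `maxReceiveCount` value are assumptions made for this example; the actual definitions live in sqs.tf and may differ.

```hcl
# Hypothetical names and values; see sqs.tf for the real definitions.
resource "aws_sqs_queue" "events_dlq" {
  name                      = "meteor-dev-events-dlq" # assumed naming scheme
  message_retention_seconds = 1209600                 # 14 days, the SQS maximum
}

resource "aws_sqs_queue" "events" {
  name = "meteor-dev-events" # assumed naming scheme

  # After maxReceiveCount unsuccessful receives, SQS moves the message
  # to the dead letter queue instead of redelivering it again.
  redrive_policy = jsonencode({
    deadLetterTargetArn = aws_sqs_queue.events_dlq.arn
    maxReceiveCount     = 5 # assumed value
  })
}
```

A long retention period on the dead letter queue leaves time to inspect failed messages before they expire.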

Quick Start

  1. Install Terraform (version >= 1.0)

  2. Configure AWS credentials:

    aws configure
    # OR set environment variables:
    export AWS_ACCESS_KEY_ID="your-access-key"
    export AWS_SECRET_ACCESS_KEY="your-secret-key"
    export AWS_DEFAULT_REGION="us-east-1"
    
  3. Copy and customize variables:

    cp terraform.tfvars.example terraform.tfvars
    # Edit terraform.tfvars with your desired configuration
    
  4. Initialize and apply:

    terraform init
    terraform plan
    terraform apply
    

Configuration Options

Basic Setup (Default)

  • Creates S3 bucket and SQS queue only
  • Uses external database and container deployment
  • Minimal cost option

With RDS Database

enable_rds = true
rds_instance_class = "db.t3.micro"  # or larger for production

With VPC and Fargate

enable_fargate = true
web_backend_cpu = 256
web_backend_memory = 512
compute_service_cpu = 256
compute_service_memory = 512
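
Optional components like these are usually switched on and off with a boolean variable and `count`. The sketch below shows that pattern for the RDS instance. Only the variable names `enable_rds` and `rds_instance_class` come from this README; the resource name, identifier, storage size, username, and password handling are assumptions, and rds.tf may be written differently.

```hcl
variable "enable_rds" {
  description = "Create the optional RDS PostgreSQL instance"
  type        = bool
  default     = false
}

variable "rds_instance_class" {
  description = "Instance class for the optional RDS instance"
  type        = string
  default     = "db.t3.micro"
}

# Hypothetical resource; all literal values below are assumed for illustration.
resource "aws_db_instance" "postgres" {
  count = var.enable_rds ? 1 : 0 # zero instances unless enable_rds = true

  identifier        = "meteor-dev-postgres"
  engine            = "postgres"
  instance_class    = var.rds_instance_class
  allocated_storage = 20 # GiB
  username          = "meteor"

  # Lets RDS generate the master password and keep it in Secrets Manager
  # (one possible approach; the project may manage the secret itself).
  manage_master_user_password = true

  skip_final_snapshot = true # development convenience only
}
```

Because the resource uses `count`, other configuration has to reference it with an index, for example `aws_db_instance.postgres[0].endpoint`.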

Environment Variables

After applying Terraform, configure your applications with these environment variables:

# From terraform output
AWS_REGION=$(terraform output -raw aws_region)
AWS_S3_BUCKET_NAME=$(terraform output -raw s3_bucket_name)
AWS_SQS_QUEUE_URL=$(terraform output -raw sqs_queue_url)

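# If using RDS (note: an RDS endpoint is normally host:port, so DATABASE_URL
# may also need the scheme, credentials, and database name)
DATABASE_URL=$(terraform output -raw rds_endpoint)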

# If using IAM user (not Fargate)
AWS_ACCESS_KEY_ID=$(terraform output -raw app_access_key_id)
AWS_SECRET_ACCESS_KEY=$(terraform output -raw app_secret_access_key)
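
For context, the commands above read values declared as Terraform outputs. The sketch below shows how the two IAM outputs might be declared. The output names match the commands above, but the resource names and the IAM user name are assumptions; outputs.tf and iam.tf hold the real definitions.

```hcl
# Hypothetical IAM user for the applications (used when not running on Fargate).
resource "aws_iam_user" "app" {
  name = "meteor-dev-app" # assumed name
}

resource "aws_iam_access_key" "app" {
  user = aws_iam_user.app.name
}

output "app_access_key_id" {
  value = aws_iam_access_key.app.id
}

output "app_secret_access_key" {
  value     = aws_iam_access_key.app.secret
  sensitive = true # redacted in plan/apply output
}
```

An output marked `sensitive` is shown as `<sensitive>` in the summary after `terraform apply`, but requesting it by name with `terraform output -raw` prints the value. The secret is also stored in the Terraform state, so the state file should be protected accordingly.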

CloudWatch Dashboard

The infrastructure creates a monitoring dashboard. For the us-east-1 region and the dev environment, it is available at:

https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#dashboards:name=meteor-dev-monitoring-dashboard

Dashboard Includes:

  • Application metrics: Request volume, response times, error rates
  • Business metrics: Event processing, validation performance
  • Infrastructure metrics: SQS queue depth, RDS performance, Fargate utilization
  • Custom metrics: From your NestJS and Go services
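
To show how a dashboard of this kind is expressed in Terraform, here is a minimal single-widget sketch. The dashboard name and region are taken from the URL above and the metric names from the Metrics Integration section of this README; the resource name, widget layout, and the absence of metric dimensions are assumptions. The real dashboard in cloudwatch.tf contains more widgets.

```hcl
# Hypothetical, reduced version of the dashboard with one widget.
resource "aws_cloudwatch_dashboard" "monitoring" {
  dashboard_name = "meteor-dev-monitoring-dashboard"

  dashboard_body = jsonencode({
    widgets = [
      {
        type   = "metric"
        x      = 0
        y      = 0
        width  = 12
        height = 6
        properties = {
          title  = "Web backend requests and errors"
          region = "us-east-1"
          stat   = "Sum"
          period = 300 # seconds
          # Each entry is [namespace, metric name]; dimensions would follow
          # as name/value pairs if the services publish them.
          metrics = [
            ["MeteorApp/WebBackend", "RequestCount"],
            ["MeteorApp/WebBackend", "ErrorCount"]
          ]
        }
      }
    ]
  })
}
```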

Metrics Integration

Your applications are already configured to send metrics to CloudWatch:

NestJS Web Backend

  • Namespace: MeteorApp/WebBackend
  • Metrics: RequestCount, RequestDuration, ErrorCount, AuthOperationCount, etc.

Go Compute Service

  • Namespace: MeteorApp/ComputeService
  • Metrics: MessageProcessingCount, ValidationCount, DatabaseOperationCount, etc.
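
These namespaces and metric names are what alarms and dashboard widgets refer to. As a sketch, an error-rate alarm for the web backend could combine `ErrorCount` and `RequestCount` with metric math. The alarm name, the 1% threshold, the periods, and the assumption that the metrics are published without dimensions are illustrative choices and not confirmed by this README.

```hcl
# Hypothetical alarm: fires when errors exceed 1% of requests.
resource "aws_cloudwatch_metric_alarm" "web_backend_error_rate" {
  alarm_name          = "meteor-dev-web-backend-error-rate" # assumed name
  alarm_description   = "Web backend error rate above 1%. Check recent deploys and backend logs."
  comparison_operator = "GreaterThanThreshold"
  threshold           = 1 # percent, assumed
  evaluation_periods  = 2
  treat_missing_data  = "notBreaching" # no traffic is not treated as an error

  # Metric math expression that the alarm evaluates.
  metric_query {
    id          = "error_rate"
    expression  = "100 * errors / requests"
    label       = "Error rate (%)"
    return_data = true
  }

  metric_query {
    id = "errors"
    metric {
      namespace   = "MeteorApp/WebBackend"
      metric_name = "ErrorCount"
      period      = 300
      stat        = "Sum"
    }
  }

  metric_query {
    id = "requests"
    metric {
      namespace   = "MeteorApp/WebBackend"
      metric_name = "RequestCount"
      period      = 300
      stat        = "Sum"
    }
  }
}
```

If the services attach dimensions to these metrics, the `metric` blocks need a matching `dimensions` map, otherwise the alarm sees no data.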

Cost Optimization

Development Environment

environment = "dev"
enable_rds = false                    # Use external database
enable_fargate = false                # Use external containers
cloudwatch_log_retention_days = 7     # Shorter retention

Production Environment

environment = "prod"
enable_rds = true
rds_instance_class = "db.t3.small"    # Appropriate size
enable_fargate = true                 # High availability
cloudwatch_log_retention_days = 30    # Longer retention

File Structure

infrastructure/
├── main.tf              # Provider and common configuration
├── variables.tf         # Input variables
├── outputs.tf           # Output values
├── s3.tf               # S3 bucket for event storage
├── sqs.tf              # SQS queues for processing
├── cloudwatch.tf       # Monitoring dashboard and alarms
├── iam.tf              # IAM roles and policies
├── rds.tf              # Optional PostgreSQL database
├── vpc.tf              # Optional VPC for Fargate
├── terraform.tfvars.example  # Example configuration
└── README.md           # This file

Deployment Integration

Docker Compose

Update your docker-compose.yml with Terraform outputs:

environment:
  - AWS_REGION=${AWS_REGION}
  - AWS_S3_BUCKET_NAME=${AWS_S3_BUCKET_NAME}
  - AWS_SQS_QUEUE_URL=${AWS_SQS_QUEUE_URL}

GitHub Actions

- name: Configure AWS credentials
  uses: aws-actions/configure-aws-credentials@v4
  with:
    aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
    aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
    aws-region: us-east-1

- name: Deploy infrastructure
  run: |
    cd infrastructure
    terraform init
    terraform apply -auto-approve

Security Best Practices

  1. IAM Permissions: Follow principle of least privilege
  2. S3 Security: All buckets have public access blocked
  3. Encryption: S3 server-side encryption enabled
  4. VPC: Private subnets for database and compute resources
  5. Secrets: RDS passwords stored in AWS Secrets Manager
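
Items 2 and 3 can be sketched in Terraform as follows. The resource names and the bucket name are assumptions, and the choice of `AES256` (SSE-S3) over a KMS key is also assumed; s3.tf contains the actual configuration.

```hcl
# Hypothetical bucket; the real name is derived in s3.tf.
resource "aws_s3_bucket" "events" {
  bucket = "meteor-dev-events" # assumed name
}

# Block every form of public access to the bucket.
resource "aws_s3_bucket_public_access_block" "events" {
  bucket = aws_s3_bucket.events.id

  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

# Encrypt new objects at rest by default.
resource "aws_s3_bucket_server_side_encryption_configuration" "events" {
  bucket = aws_s3_bucket.events.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "AES256" # assumed; "aws:kms" is the alternative
    }
  }
}
```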

Monitoring and Alerts

The infrastructure includes CloudWatch alarms for:

  • High error rates in web backend and compute service
  • High response times
  • SQS message age and dead letter queue messages
  • RDS CPU utilization (when enabled)

To add notifications:

  1. Create an SNS topic
  2. Add the topic ARN to alarm actions in cloudwatch.tf
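
The two steps above might look like the following sketch. The variable `alert_email`, the topic name, the queue name, and the alarm itself are hypothetical and only illustrate how a topic ARN is attached to an alarm; the 1000-message threshold is an example value.

```hcl
variable "alert_email" {
  description = "Address that receives alarm notifications (hypothetical variable)"
  type        = string
}

# Step 1: create the SNS topic and an email subscription.
resource "aws_sns_topic" "alerts" {
  name = "meteor-dev-alerts" # assumed name
}

resource "aws_sns_topic_subscription" "email" {
  topic_arn = aws_sns_topic.alerts.arn
  protocol  = "email"
  endpoint  = var.alert_email
}

# Step 2: reference the topic ARN in an alarm's actions.
resource "aws_cloudwatch_metric_alarm" "sqs_queue_depth" {
  alarm_name          = "meteor-dev-sqs-queue-depth" # assumed name
  namespace           = "AWS/SQS"
  metric_name         = "ApproximateNumberOfMessagesVisible"
  statistic           = "Maximum"
  period              = 300
  evaluation_periods  = 1
  comparison_operator = "GreaterThanThreshold"
  threshold           = 1000 # example value

  dimensions = {
    QueueName = "meteor-dev-events" # assumed queue name
  }

  alarm_actions = [aws_sns_topic.alerts.arn]
  ok_actions    = [aws_sns_topic.alerts.arn]
}
```

Email subscriptions stay pending until the recipient confirms them through the link AWS sends, so no notifications arrive before that manual step.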

Cleanup

To destroy all resources:

terraform destroy

Warning: This will delete all data in S3 and databases. For production, ensure you have backups.

Troubleshooting

Common Issues

  1. S3 bucket name conflicts: Bucket names must be globally unique

    • Solution: Change project_name or environment in variables
  2. RDS subnet group errors: Requires subnets in different AZs

    • Solution: Ensure enable_fargate = true when using RDS, so that the VPC and its subnets are created
  3. IAM permission errors: Check AWS credentials and permissions

    • Solution: Ensure the credentials in use have the permissions required to create these resources
  4. CloudWatch dashboard empty: Wait for applications to send metrics

    • Solution: Deploy and run your applications to generate metrics
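
For the bucket name conflict in item 1, the sketch below shows how a bucket name could be derived from `project_name` and `environment`, with an optional random suffix as one way to avoid collisions. The variable defaults, the naming scheme, and the use of the hashicorp/random provider are assumptions; the actual naming logic is in s3.tf.

```hcl
variable "project_name" {
  type    = string
  default = "meteor" # assumed default
}

variable "environment" {
  type    = string
  default = "dev" # assumed default
}

# Optional random suffix; requires the hashicorp/random provider.
resource "random_id" "bucket_suffix" {
  byte_length = 4 # rendered as 8 hexadecimal characters
}

resource "aws_s3_bucket" "events" {
  # Changing project_name or environment changes the bucket name;
  # the suffix makes a clash with another account's bucket unlikely.
  bucket = "${var.project_name}-${var.environment}-events-${random_id.bucket_suffix.hex}"
}
```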

Getting Help

  1. Check Terraform documentation: https://registry.terraform.io/providers/hashicorp/aws/latest/docs
  2. Review AWS service limits and quotas
  3. Set TF_LOG=DEBUG for verbose Terraform output, and check AWS CloudTrail events for detailed error messages