Successfully implemented comprehensive monitoring and alerting infrastructure for the Meteor platform across all three stories of Epic 3:

**Story 3.5: Core Business Metrics Monitoring**
- Instrumented NestJS web backend with CloudWatch metrics integration using prom-client
- Instrumented Go compute service with structured CloudWatch metrics reporting
- Created comprehensive Terraform infrastructure from scratch with modular design
- Built 5-row CloudWatch dashboard with application, error rate, business, and infrastructure metrics
- Added proper error categorization and provider performance tracking

**Story 3.6: Critical System Alerts**
- Implemented SNS-based alerting infrastructure via Terraform
- Created critical alarms for NestJS 5xx error rate (>1% threshold)
- Created Go service processing failure rate alarm (>5% threshold)
- Created SQS queue depth alarm (>1000 messages threshold)
- Added actionable alarm descriptions with investigation guidance
- Configured email notifications with manual confirmation workflow

**Cross-cutting Infrastructure:**
- Complete AWS infrastructure as code with Terraform (S3, SQS, CloudWatch, SNS, IAM, optional RDS/Fargate)
- Structured logging implementation across all services (NestJS, Go, Rust)
- Metrics collection following the "Four Golden Signals" observability approach
- Configurable thresholds and deployment-ready monitoring solution

The platform now has production-grade observability with comprehensive metrics collection, centralized monitoring dashboards, and automated critical system alerting.
Meteor Fullstack Infrastructure
This directory contains Terraform configuration for the Meteor fullstack application AWS infrastructure.
Overview
The infrastructure includes:
- S3 bucket for storing meteor event files and media
- SQS queue for processing meteor events with dead letter queue
- CloudWatch dashboard for comprehensive monitoring
- IAM policies and roles for service permissions
- Optional RDS PostgreSQL instance
- Optional VPC and Fargate configuration for containerized deployment
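To illustrate the queue and dead letter queue pairing listed above, a minimal sketch might look like the following. The resource labels, queue names, and the `maxReceiveCount` of 5 are assumptions for illustration and are not taken from `sqs.tf`.

```hcl
# Hypothetical dead letter queue: holds messages that repeatedly fail processing.
resource "aws_sqs_queue" "events_dlq" {
  name                      = "meteor-dev-events-dlq" # assumed name
  message_retention_seconds = 1209600                 # 14 days, the SQS maximum
}

# Hypothetical main queue: after 5 failed receives a message moves to the DLQ.
resource "aws_sqs_queue" "events" {
  name = "meteor-dev-events" # assumed name

  redrive_policy = jsonencode({
    deadLetterTargetArn = aws_sqs_queue.events_dlq.arn
    maxReceiveCount     = 5 # assumed retry budget
  })
}
```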
Quick Start
- Install Terraform (version >= 1.0)
- Configure AWS credentials:
    aws configure
    # OR set environment variables:
    export AWS_ACCESS_KEY_ID="your-access-key"
    export AWS_SECRET_ACCESS_KEY="your-secret-key"
    export AWS_DEFAULT_REGION="us-east-1"
- Copy and customize variables:
    cp terraform.tfvars.example terraform.tfvars
    # Edit terraform.tfvars with your desired configuration
- Initialize and apply:
    terraform init
    terraform plan
    terraform apply
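The version requirement from the first step is usually enforced in the configuration itself, so that `terraform init` fails early on an older binary. The sketch below shows one way to express it. The provider version constraint and the `aws_region` variable are assumptions, and the actual `main.tf` may differ.

```hcl
terraform {
  required_version = ">= 1.0" # matches the version stated in Quick Start

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = ">= 4.0" # assumed constraint, not taken from main.tf
    }
  }
}

# Hypothetical variable; terraform.tfvars would normally set it.
variable "aws_region" {
  description = "AWS region to deploy into"
  type        = string
  default     = "us-east-1"
}

provider "aws" {
  region = var.aws_region
}
```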
Configuration Options
Basic Setup (Default)
- Creates S3 bucket and SQS queue only
- Uses external database and container deployment
- Minimal cost option
With RDS Database
enable_rds = true
rds_instance_class = "db.t3.micro" # or larger for production
With VPC and Fargate
enable_fargate = true
web_backend_cpu = 256
web_backend_memory = 512
compute_service_cpu = 256
compute_service_memory = 512
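Feature flags such as `enable_rds` are commonly wired to resources through `count`, so that a disabled flag creates zero instances. The following is a sketch of that pattern only. The variable defaults, the `private_subnet_ids` variable, and the resource label are hypothetical and may not match `variables.tf` or `rds.tf`.

```hcl
variable "enable_rds" {
  description = "Create the optional RDS PostgreSQL instance"
  type        = bool
  default     = false # assumed default, consistent with the Basic Setup
}

# Hypothetical input; the real configuration may derive subnets from vpc.tf.
variable "private_subnet_ids" {
  description = "Subnets in at least two availability zones"
  type        = list(string)
  default     = []
}

# Created only when the flag is on: count is 1 or 0.
resource "aws_db_subnet_group" "example" {
  count      = var.enable_rds ? 1 : 0
  name       = "meteor-dev-db-subnets" # assumed name
  subnet_ids = var.private_subnet_ids
}
```

Note that Fargate accepts only certain CPU and memory pairings. 256 CPU units with 512 MiB, as in the example values above, is one of the valid combinations.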
Environment Variables
After applying Terraform, configure your applications with these environment variables:
# From terraform output
AWS_REGION=$(terraform output -raw aws_region)
AWS_S3_BUCKET_NAME=$(terraform output -raw s3_bucket_name)
AWS_SQS_QUEUE_URL=$(terraform output -raw sqs_queue_url)
# If using RDS (if rds_endpoint holds only host:port, build the full connection URL your application expects)
DATABASE_URL=$(terraform output -raw rds_endpoint)
# If using IAM user (not Fargate)
AWS_ACCESS_KEY_ID=$(terraform output -raw app_access_key_id)
AWS_SECRET_ACCESS_KEY=$(terraform output -raw app_secret_access_key)
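For the `terraform output -raw` commands above to work, each value has to be declared as an output. The sketch below shows how the two IAM user outputs could be defined. The output names come from the commands above, while the IAM user name and resource labels are assumptions.

```hcl
# Hypothetical IAM user for applications that run outside Fargate.
resource "aws_iam_user" "app" {
  name = "meteor-dev-app" # assumed name
}

resource "aws_iam_access_key" "app" {
  user = aws_iam_user.app.name
}

output "app_access_key_id" {
  description = "Access key ID for the application IAM user"
  value       = aws_iam_access_key.app.id
}

output "app_secret_access_key" {
  description = "Secret access key for the application IAM user"
  value       = aws_iam_access_key.app.secret
  sensitive   = true # needed because the provider marks the secret as sensitive
}
```

A sensitive output is redacted in the `terraform apply` summary but is still printed when requested by name with `terraform output -raw`, so treat the resulting shell variable as a secret.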
CloudWatch Dashboard
The infrastructure creates a monitoring dashboard. For the us-east-1 region and the dev environment, it is available at:
https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#dashboards:name=meteor-dev-monitoring-dashboard
Dashboard Includes:
- Application metrics: Request volume, response times, error rates
- Business metrics: Event processing, validation performance
- Infrastructure metrics: SQS queue depth, RDS performance, Fargate utilization
- Custom metrics: From your NestJS and Go services
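A dashboard of this kind is defined as a JSON body of widgets. The sketch below shows two widgets only, as an illustration of the format rather than the actual layout in `cloudwatch.tf`. The widget titles, the queue name, and the resource label are assumptions.

```hcl
resource "aws_cloudwatch_dashboard" "example" {
  dashboard_name = "meteor-dev-monitoring-dashboard"

  dashboard_body = jsonencode({
    widgets = [
      {
        type   = "metric"
        x      = 0
        y      = 0
        width  = 12
        height = 6
        properties = {
          title  = "Request volume" # assumed title
          region = "us-east-1"
          stat   = "Sum"
          period = 300
          # Custom metric: [namespace, metric name]
          metrics = [["MeteorApp/WebBackend", "RequestCount"]]
        }
      },
      {
        type   = "metric"
        x      = 12
        y      = 0
        width  = 12
        height = 6
        properties = {
          title  = "SQS queue depth" # assumed title
          region = "us-east-1"
          stat   = "Maximum"
          period = 300
          # AWS metric: [namespace, metric name, dimension name, dimension value]
          metrics = [["AWS/SQS", "ApproximateNumberOfMessagesVisible", "QueueName", "meteor-dev-events"]]
        }
      }
    ]
  })
}
```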
Metrics Integration
Your applications are already configured to send metrics to CloudWatch:
NestJS Web Backend
- Namespace: MeteorApp/WebBackend
- Metrics: RequestCount, RequestDuration, ErrorCount, AuthOperationCount, etc.
Go Compute Service
- Namespace: MeteorApp/ComputeService
- Metrics: MessageProcessingCount, ValidationCount, DatabaseOperationCount, etc.
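Publishing custom metrics requires the `cloudwatch:PutMetricData` permission. One way to keep that permission narrow is to restrict it to the two namespaces above, sketched here. The policy name and resource labels are assumptions, and `iam.tf` may grant this differently.

```hcl
data "aws_iam_policy_document" "put_metrics" {
  statement {
    effect  = "Allow"
    actions = ["cloudwatch:PutMetricData"]

    # PutMetricData does not support resource-level permissions,
    # so the namespace condition below does the scoping.
    resources = ["*"]

    condition {
      test     = "StringEquals"
      variable = "cloudwatch:namespace"
      values   = ["MeteorApp/WebBackend", "MeteorApp/ComputeService"]
    }
  }
}

resource "aws_iam_policy" "put_metrics" {
  name   = "meteor-dev-put-metrics" # assumed name
  policy = data.aws_iam_policy_document.put_metrics.json
}
```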
Cost Optimization
Development Environment
environment = "dev"
enable_rds = false # Use external database
enable_fargate = false # Use external containers
cloudwatch_log_retention_days = 7 # Shorter retention
Production Environment
environment = "prod"
enable_rds = true
rds_instance_class = "db.t3.small" # Appropriate size
enable_fargate = true # High availability
cloudwatch_log_retention_days = 30 # Longer retention
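The retention setting typically feeds the `retention_in_days` argument of each log group. The sketch below assumes a hypothetical log group name and adds a validation rule, since CloudWatch Logs accepts only specific retention values. The list in the rule is a subset of those values.

```hcl
variable "cloudwatch_log_retention_days" {
  description = "Days to keep application logs"
  type        = number
  default     = 7

  validation {
    # Subset of the retention periods CloudWatch Logs accepts.
    condition     = contains([1, 3, 5, 7, 14, 30, 60, 90, 180, 365], var.cloudwatch_log_retention_days)
    error_message = "Use a retention period supported by CloudWatch Logs, for example 7 or 30."
  }
}

resource "aws_cloudwatch_log_group" "web_backend" {
  name              = "/meteor/dev/web-backend" # assumed name
  retention_in_days = var.cloudwatch_log_retention_days
}
```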
File Structure
infrastructure/
├── main.tf # Provider and common configuration
├── variables.tf # Input variables
├── outputs.tf # Output values
├── s3.tf # S3 bucket for event storage
├── sqs.tf # SQS queues for processing
├── cloudwatch.tf # Monitoring dashboard and alarms
├── iam.tf # IAM roles and policies
├── rds.tf # Optional PostgreSQL database
├── vpc.tf # Optional VPC for Fargate
├── terraform.tfvars.example # Example configuration
└── README.md # This file
Deployment Integration
Docker Compose
Update your docker-compose.yml with Terraform outputs:
environment:
  - AWS_REGION=${AWS_REGION}
  - AWS_S3_BUCKET_NAME=${AWS_S3_BUCKET_NAME}
  - AWS_SQS_QUEUE_URL=${AWS_SQS_QUEUE_URL}
GitHub Actions
- name: Configure AWS credentials
  uses: aws-actions/configure-aws-credentials@v4
  with:
    aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
    aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
    aws-region: us-east-1
- name: Deploy infrastructure
  run: |
    cd infrastructure
    terraform init
    terraform apply -auto-approve
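Running `terraform apply` from a CI runner only works repeatedly if the state is stored somewhere that outlives the runner. This README does not say how state is stored, so the following is a sketch under the assumption that an S3 backend is used. The bucket name and key are hypothetical, and the bucket has to exist before `terraform init`.

```hcl
terraform {
  backend "s3" {
    bucket = "example-meteor-terraform-state" # hypothetical, must already exist
    key    = "meteor/dev/terraform.tfstate"   # hypothetical state path
    region = "us-east-1"
    # Backend blocks cannot reference variables, so these values are literals.
  }
}
```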
Security Best Practices
- IAM Permissions: Follow principle of least privilege
- S3 Security: All buckets have public access blocked
- Encryption: S3 server-side encryption enabled
- VPC: Private subnets for database and compute resources
- Secrets: RDS passwords stored in AWS Secrets Manager
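The S3 items in the list above correspond to two companion resources in the AWS provider. A minimal sketch, assuming a hypothetical bucket name and SSE-S3 (AES256) rather than a KMS key:

```hcl
resource "aws_s3_bucket" "events" {
  bucket = "meteor-dev-events-example" # hypothetical name
}

# Blocks every form of public access to the bucket.
resource "aws_s3_bucket_public_access_block" "events" {
  bucket                  = aws_s3_bucket.events.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

# Encrypts new objects at rest by default.
resource "aws_s3_bucket_server_side_encryption_configuration" "events" {
  bucket = aws_s3_bucket.events.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "AES256" # assumed; aws:kms is the alternative
    }
  }
}
```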
Monitoring and Alerts
The infrastructure includes CloudWatch alarms for:
- High error rates in web backend and compute service
- High response times
- SQS message age and dead letter queue messages
- RDS CPU utilization (when enabled)
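As an example of the alarm types listed above, the two SQS alarms could be sketched as follows. The queue names, the 10-minute age threshold, and the alarm names are assumptions, and the actual definitions in `cloudwatch.tf` may use different values.

```hcl
# Fires when the oldest message has waited longer than 10 minutes (assumed threshold).
resource "aws_cloudwatch_metric_alarm" "message_age" {
  alarm_name          = "meteor-dev-sqs-message-age" # assumed name
  alarm_description   = "Events are waiting too long; check compute service health and throughput."
  namespace           = "AWS/SQS"
  metric_name         = "ApproximateAgeOfOldestMessage"
  dimensions          = { QueueName = "meteor-dev-events" } # assumed queue name
  statistic           = "Maximum"
  period              = 300
  evaluation_periods  = 2
  comparison_operator = "GreaterThanThreshold"
  threshold           = 600 # seconds
  treat_missing_data  = "notBreaching"
}

# Fires as soon as any message lands in the dead letter queue.
resource "aws_cloudwatch_metric_alarm" "dlq_messages" {
  alarm_name          = "meteor-dev-dlq-messages" # assumed name
  alarm_description   = "Messages reached the dead letter queue; inspect the failed events."
  namespace           = "AWS/SQS"
  metric_name         = "ApproximateNumberOfMessagesVisible"
  dimensions          = { QueueName = "meteor-dev-events-dlq" } # assumed queue name
  statistic           = "Maximum"
  period              = 300
  evaluation_periods  = 1
  comparison_operator = "GreaterThanThreshold"
  threshold           = 0
  treat_missing_data  = "notBreaching"
}
```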
To add notifications:
- Create an SNS topic
- Add the topic ARN to the alarm actions in cloudwatch.tf
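The two steps above can be sketched as follows. The topic name, the `alert_email` variable, and the queue name are hypothetical. The 1000-message threshold is the queue depth figure used elsewhere in this project.

```hcl
# Hypothetical variable holding the address that receives alerts.
variable "alert_email" {
  description = "Email address for alarm notifications"
  type        = string
}

resource "aws_sns_topic" "alerts" {
  name = "meteor-dev-alerts" # assumed name
}

# Email subscriptions stay pending until the recipient confirms by hand.
resource "aws_sns_topic_subscription" "email" {
  topic_arn = aws_sns_topic.alerts.arn
  protocol  = "email"
  endpoint  = var.alert_email
}

resource "aws_cloudwatch_metric_alarm" "queue_depth" {
  alarm_name          = "meteor-dev-sqs-queue-depth" # assumed name
  namespace           = "AWS/SQS"
  metric_name         = "ApproximateNumberOfMessagesVisible"
  dimensions          = { QueueName = "meteor-dev-events" } # assumed queue name
  statistic           = "Maximum"
  period              = 300
  evaluation_periods  = 1
  comparison_operator = "GreaterThanThreshold"
  threshold           = 1000

  # Step 2: attach the topic so state changes send notifications.
  alarm_actions = [aws_sns_topic.alerts.arn]
  ok_actions    = [aws_sns_topic.alerts.arn]
}
```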
Cleanup
To destroy all resources:
terraform destroy
Warning: This will delete all data in S3 and databases. For production, ensure you have backups.
Troubleshooting
Common Issues
- S3 bucket name conflicts: Bucket names must be globally unique
  - Solution: Change project_name or environment in variables
- RDS subnet group errors: Requires subnets in different AZs
  - Solution: Ensure enable_fargate = true when using RDS
- IAM permission errors: Check AWS credentials and permissions
  - Solution: Ensure your AWS account has admin access or required permissions
- CloudWatch dashboard empty: Wait for applications to send metrics
  - Solution: Deploy and run your applications to generate metrics
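For the bucket name conflict in particular, one alternative to renaming the project is to append the AWS account ID to the bucket name, which makes collisions with other accounts unlikely. This is a sketch of that idea and not how `s3.tf` necessarily builds the name. The defaults and the `-events` suffix are assumptions.

```hcl
variable "project_name" {
  type    = string
  default = "meteor" # assumed default
}

variable "environment" {
  type    = string
  default = "dev"
}

# Looks up the account the current credentials belong to.
data "aws_caller_identity" "current" {}

locals {
  # Example result: meteor-dev-events-123456789012
  events_bucket_name = "${var.project_name}-${var.environment}-events-${data.aws_caller_identity.current.account_id}"
}

resource "aws_s3_bucket" "events" {
  bucket = local.events_bucket_name
}
```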
Getting Help
- Check Terraform documentation: https://registry.terraform.io/providers/hashicorp/aws/latest/docs
- Review AWS service limits and quotas
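- Check the AWS CloudTrail event history, or rerun Terraform with TF_LOG=DEBUG, for detailed error messages (Terraform does not create CloudFormation stacks, so CloudFormation events will be empty)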