Successfully implemented comprehensive monitoring and alerting infrastructure for the Meteor platform across all three stories of Epic 3:

**Story 3.5: Core Business Metrics Monitoring**
- Instrumented NestJS web backend with CloudWatch metrics integration using prom-client
- Instrumented Go compute service with structured CloudWatch metrics reporting
- Created comprehensive Terraform infrastructure from scratch with modular design
- Built 5-row CloudWatch dashboard with application, error rate, business, and infrastructure metrics
- Added proper error categorization and provider performance tracking

**Story 3.6: Critical System Alerts**
- Implemented SNS-based alerting infrastructure via Terraform
- Created critical alarms for NestJS 5xx error rate (>1% threshold)
- Created Go service processing failure rate alarm (>5% threshold)
- Created SQS queue depth alarm (>1000 messages threshold)
- Added actionable alarm descriptions with investigation guidance
- Configured email notifications with manual confirmation workflow

**Cross-cutting Infrastructure:**
- Complete AWS infrastructure as code with Terraform (S3, SQS, CloudWatch, SNS, IAM, optional RDS/Fargate)
- Structured logging implementation across all services (NestJS, Go, Rust)
- Metrics collection following the "Four Golden Signals" observability approach
- Configurable thresholds and deployment-ready monitoring solution

The platform now has production-grade observability with comprehensive metrics collection, centralized monitoring dashboards, and automated critical system alerting.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
# Meteor Fullstack Infrastructure

This directory contains the Terraform configuration for the Meteor fullstack application's AWS infrastructure.

## Overview

The infrastructure includes:

- **S3 bucket** for storing meteor event files and media
- **SQS queue** for processing meteor events, with a dead letter queue
- **CloudWatch dashboard** for comprehensive monitoring
- **IAM policies** and roles for service permissions
- **Optional RDS PostgreSQL** instance
- **Optional VPC and Fargate** configuration for containerized deployment

## Quick Start

1. **Install Terraform** (version >= 1.0)
2. **Configure AWS credentials**:
   ```bash
   aws configure
   # OR set environment variables:
   export AWS_ACCESS_KEY_ID="your-access-key"
   export AWS_SECRET_ACCESS_KEY="your-secret-key"
   export AWS_DEFAULT_REGION="us-east-1"
   ```

3. **Copy and customize variables**:
   ```bash
   cp terraform.tfvars.example terraform.tfvars
   # Edit terraform.tfvars with your desired configuration
   ```

4. **Initialize and apply**:
   ```bash
   terraform init
   terraform plan
   terraform apply
   ```

## Configuration Options

### Basic Setup (Default)
- Creates S3 bucket and SQS queue only
- Uses external database and container deployment
- Minimal cost option

### With RDS Database
```hcl
enable_rds = true
rds_instance_class = "db.t3.micro"  # or larger for production
```

### With VPC and Fargate
```hcl
enable_fargate = true
web_backend_cpu = 256
web_backend_memory = 512
compute_service_cpu = 256
compute_service_memory = 512
```

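
The optional components above are switched on by boolean variables. As a rough sketch of how such a toggle is commonly wired in Terraform (the resource name `meteor`, the defaults, and the trimmed argument list are assumptions for illustration, not the contents of `rds.tf`), a `count` expression can create the resource only when the flag is set:

```hcl
# Sketch only: variable names match the tfvars shown above;
# defaults and the resource name are hypothetical.
variable "enable_rds" {
  description = "Whether to create the optional RDS PostgreSQL instance"
  type        = bool
  default     = false
}

variable "rds_instance_class" {
  description = "Instance class for the optional RDS instance"
  type        = string
  default     = "db.t3.micro"
}

resource "aws_db_instance" "meteor" {
  # Zero instances when the flag is off, exactly one when it is on.
  count = var.enable_rds ? 1 : 0

  engine            = "postgres"
  instance_class    = var.rds_instance_class
  allocated_storage = 20
  username          = "meteor"

  # Let RDS generate the master password and keep it in Secrets Manager.
  manage_master_user_password = true
}
```

With this pattern, the default setup stays on the minimal-cost path because the resource is never created unless `enable_rds = true` is set in `terraform.tfvars`.
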
## Environment Variables

After applying Terraform, configure your applications with these environment variables:

```bash
# From terraform output
AWS_REGION=$(terraform output -raw aws_region)
AWS_S3_BUCKET_NAME=$(terraform output -raw s3_bucket_name)
AWS_SQS_QUEUE_URL=$(terraform output -raw sqs_queue_url)

# If using RDS
DATABASE_URL=$(terraform output -raw rds_endpoint)

# If using IAM user (not Fargate)
AWS_ACCESS_KEY_ID=$(terraform output -raw app_access_key_id)
AWS_SECRET_ACCESS_KEY=$(terraform output -raw app_secret_access_key)
```

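
Each `terraform output -raw <name>` call above reads an `output` block declared in the configuration. The following is a minimal sketch of what the two IAM user outputs might look like; the output names come from the commands above, while the resource names and the user name are hypothetical and may differ from the real `iam.tf` and `outputs.tf`:

```hcl
# Sketch only: resource names and the IAM user name are hypothetical.
resource "aws_iam_user" "app" {
  name = "meteor-dev-app"
}

resource "aws_iam_access_key" "app" {
  user = aws_iam_user.app.name
}

output "app_access_key_id" {
  description = "Access key ID for the application IAM user"
  value       = aws_iam_access_key.app.id
}

output "app_secret_access_key" {
  description = "Secret access key for the application IAM user"
  value       = aws_iam_access_key.app.secret

  # Required because the secret is a sensitive attribute; Terraform
  # refuses to expose it through an output that is not marked sensitive.
  sensitive = true
}
```

Note that requesting a single sensitive output by name with `terraform output -raw` prints the value in cleartext, so avoid running it where shell output is logged.
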
## CloudWatch Dashboard

The infrastructure creates a monitoring dashboard. For the `dev` environment in `us-east-1`, it is available at:
```
https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#dashboards:name=meteor-dev-monitoring-dashboard
```

### Dashboard Includes:
- **Application metrics**: Request volume, response times, error rates
- **Business metrics**: Event processing, validation performance
- **Infrastructure metrics**: SQS queue depth, RDS performance, Fargate utilization
- **Custom metrics**: From your NestJS and Go services

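
To make the dashboard definition more concrete, here is a hedged sketch of a single-widget dashboard in Terraform. The dashboard name is taken from the URL above; the widget layout, the title, and the assumption that the web backend publishes `RequestCount` and `ErrorCount` without dimensions are illustrative and may not match `cloudwatch.tf`:

```hcl
# Sketch only: one widget, hypothetical layout and metric selection.
resource "aws_cloudwatch_dashboard" "monitoring" {
  dashboard_name = "meteor-dev-monitoring-dashboard"

  dashboard_body = jsonencode({
    widgets = [
      {
        type   = "metric"
        x      = 0
        y      = 0
        width  = 12
        height = 6
        properties = {
          title  = "Web backend requests and errors"
          region = "us-east-1"
          stat   = "Sum"
          period = 300
          # Each entry is [namespace, metric name]; add dimension
          # name/value pairs if the services publish with dimensions.
          metrics = [
            ["MeteorApp/WebBackend", "RequestCount"],
            ["MeteorApp/WebBackend", "ErrorCount"],
          ]
        }
      }
    ]
  })
}
```

Building the body with `jsonencode` keeps the widget definitions in HCL, which makes it easier to interpolate the region or environment name than a hand-written JSON string.
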
## Metrics Integration

Your applications are already configured to send metrics to CloudWatch:

### NestJS Web Backend
- Namespace: `MeteorApp/WebBackend`
- Metrics: RequestCount, RequestDuration, ErrorCount, AuthOperationCount, etc.

### Go Compute Service
- Namespace: `MeteorApp/ComputeService`
- Metrics: MessageProcessingCount, ValidationCount, DatabaseOperationCount, etc.

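
These custom metrics can also drive alarms. The sketch below shows one way an error-rate alarm could be expressed with CloudWatch metric math over the `MeteorApp/WebBackend` namespace. The alarm name, the 1% threshold, the evaluation window, and the assumption that `ErrorCount` and `RequestCount` are published without dimensions are all illustrative choices, not a description of the existing alarms:

```hcl
# Sketch only: alarm name, threshold, and periods are hypothetical.
resource "aws_cloudwatch_metric_alarm" "web_backend_error_rate" {
  alarm_name          = "meteor-dev-web-backend-error-rate"
  alarm_description   = "Web backend error rate above 1% of requests"
  comparison_operator = "GreaterThanThreshold"
  threshold           = 1
  evaluation_periods  = 2

  # No data (for example, no traffic) should not trigger the alarm.
  treat_missing_data = "notBreaching"

  # The expression is the value the alarm evaluates.
  metric_query {
    id          = "error_rate"
    expression  = "100 * errors / requests"
    label       = "Error rate (%)"
    return_data = true
  }

  metric_query {
    id = "errors"
    metric {
      namespace   = "MeteorApp/WebBackend"
      metric_name = "ErrorCount"
      period      = 300
      stat        = "Sum"
    }
  }

  metric_query {
    id = "requests"
    metric {
      namespace   = "MeteorApp/WebBackend"
      metric_name = "RequestCount"
      period      = 300
      stat        = "Sum"
    }
  }
}
```

Alarming on a ratio rather than a raw error count keeps the threshold meaningful as traffic grows or shrinks.
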
## Cost Optimization

### Development Environment
```hcl
environment = "dev"
enable_rds = false                 # Use external database
enable_fargate = false             # Use external containers
cloudwatch_log_retention_days = 7  # Shorter retention
```

### Production Environment
```hcl
environment = "prod"
enable_rds = true
rds_instance_class = "db.t3.small"  # Appropriate size
enable_fargate = true               # High availability
cloudwatch_log_retention_days = 30  # Longer retention
```

## File Structure

```
infrastructure/
├── main.tf                   # Provider and common configuration
├── variables.tf              # Input variables
├── outputs.tf                # Output values
├── s3.tf                     # S3 bucket for event storage
├── sqs.tf                    # SQS queues for processing
├── cloudwatch.tf             # Monitoring dashboard and alarms
├── iam.tf                    # IAM roles and policies
├── rds.tf                    # Optional PostgreSQL database
├── vpc.tf                    # Optional VPC for Fargate
├── terraform.tfvars.example  # Example configuration
└── README.md                 # This file
```

## Deployment Integration

### Docker Compose
Update your `docker-compose.yml` with Terraform outputs:
```yaml
environment:
  - AWS_REGION=${AWS_REGION}
  - AWS_S3_BUCKET_NAME=${AWS_S3_BUCKET_NAME}
  - AWS_SQS_QUEUE_URL=${AWS_SQS_QUEUE_URL}
```

### GitHub Actions
```yaml
- name: Configure AWS credentials
  uses: aws-actions/configure-aws-credentials@v1
  with:
    aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
    aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
    aws-region: us-east-1

- name: Deploy infrastructure
  run: |
    cd infrastructure
    terraform init
    terraform apply -auto-approve
```

## Security Best Practices

1. **IAM Permissions**: Follow principle of least privilege
2. **S3 Security**: All buckets have public access blocked
3. **Encryption**: S3 server-side encryption enabled
4. **VPC**: Private subnets for database and compute resources
5. **Secrets**: RDS passwords stored in AWS Secrets Manager

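
As an illustration of the S3 items in the list above, blocking public access and enabling server-side encryption are typically expressed as two companion resources next to the bucket. This is a sketch under assumptions: the bucket name and resource labels are hypothetical, and `AES256` (SSE-S3) is one possible algorithm choice rather than a statement of what `s3.tf` uses:

```hcl
# Sketch only: bucket name and resource labels are hypothetical.
resource "aws_s3_bucket" "events" {
  bucket = "meteor-dev-events-example"
}

# Block every form of public access at the bucket level.
resource "aws_s3_bucket_public_access_block" "events" {
  bucket = aws_s3_bucket.events.id

  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

# Encrypt new objects at rest by default.
resource "aws_s3_bucket_server_side_encryption_configuration" "events" {
  bucket = aws_s3_bucket.events.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "AES256"
    }
  }
}
```

Switching `sse_algorithm` to `aws:kms` with a `kms_master_key_id` is the usual variation when a customer-managed key is required.
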
## Monitoring and Alerts

The infrastructure includes CloudWatch alarms for:
- High error rates in web backend and compute service
- High response times
- SQS message age and dead letter queue messages
- RDS CPU utilization (when enabled)

To add notifications:
1. Create an SNS topic
2. Add the topic ARN to alarm actions in `cloudwatch.tf`

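
The two notification steps above can be sketched in Terraform as follows. The topic name, the `alert_email` variable, the queue name, and the alarm settings are assumptions for illustration; only the general shape (an SNS topic whose ARN is listed in `alarm_actions`) follows from the steps:

```hcl
# Sketch only: names, the queue dimension, and thresholds are hypothetical.
variable "alert_email" {
  description = "Address that receives alarm notifications"
  type        = string
}

# Step 1: create the SNS topic and subscribe an email endpoint.
resource "aws_sns_topic" "alerts" {
  name = "meteor-dev-alerts"
}

resource "aws_sns_topic_subscription" "email" {
  topic_arn = aws_sns_topic.alerts.arn
  protocol  = "email"
  endpoint  = var.alert_email
}

# Step 2: reference the topic ARN from an alarm's actions.
resource "aws_cloudwatch_metric_alarm" "sqs_queue_depth" {
  alarm_name          = "meteor-dev-sqs-queue-depth"
  alarm_description   = "More than 1000 visible messages in the events queue"
  namespace           = "AWS/SQS"
  metric_name         = "ApproximateNumberOfMessagesVisible"
  statistic           = "Maximum"
  period              = 300
  evaluation_periods  = 1
  comparison_operator = "GreaterThanThreshold"
  threshold           = 1000

  dimensions = {
    QueueName = "meteor-dev-events"
  }

  alarm_actions = [aws_sns_topic.alerts.arn]
  ok_actions    = [aws_sns_topic.alerts.arn]
}
```

Email subscriptions stay pending until the recipient clicks the confirmation link that SNS sends, so no notifications arrive before that manual step.
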
## Cleanup

To destroy all resources:
```bash
terraform destroy
```

**Warning**: This will delete all data in S3 and databases. For production, ensure you have backups.

## Troubleshooting

### Common Issues

1. **S3 bucket name conflicts**: Bucket names must be globally unique
   - Solution: Change `project_name` or `environment` in variables

2. **RDS subnet group errors**: An RDS subnet group requires subnets in different AZs
   - Solution: Ensure `enable_fargate = true` when using RDS, since the VPC and its subnets are part of the optional Fargate configuration

3. **IAM permission errors**: Check AWS credentials and permissions
   - Solution: Ensure your AWS account has admin access or required permissions

4. **CloudWatch dashboard empty**: Wait for applications to send metrics
   - Solution: Deploy and run your applications to generate metrics

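
For the bucket name conflict in item 1, a different technique from the one described above is to append a random suffix to the bucket name so that collisions become unlikely. This sketch uses the `random_id` resource from the `hashicorp/random` provider; the naming pattern and variable defaults are hypothetical and are not how this repository currently names its bucket:

```hcl
# Sketch only: an alternative naming approach, not the existing s3.tf.
terraform {
  required_providers {
    random = {
      source = "hashicorp/random"
    }
  }
}

variable "project_name" {
  type    = string
  default = "meteor"
}

variable "environment" {
  type    = string
  default = "dev"
}

# Four random bytes rendered as eight hex characters.
resource "random_id" "bucket_suffix" {
  byte_length = 4
}

resource "aws_s3_bucket" "events" {
  bucket = "${var.project_name}-${var.environment}-events-${random_id.bucket_suffix.hex}"
}
```

The suffix is stored in state, so it stays stable across applies and only changes if the `random_id` resource is replaced.
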
### Getting Help

1. Check Terraform documentation: https://registry.terraform.io/providers/hashicorp/aws/latest/docs
2. Review AWS service limits and quotas
3. Set `TF_LOG=DEBUG` and re-run the failing command, or check the AWS CloudTrail event history, for detailed error messages