grabbit ca7e92a1a1 🎉 Epic 3 Complete: Production Readiness & Observability
Successfully implemented comprehensive monitoring and alerting infrastructure for the Meteor platform across all three stories of Epic 3:

**Story 3.5: Core Business Metrics Monitoring**
- Instrumented NestJS web backend with CloudWatch metrics integration using prom-client
- Instrumented Go compute service with structured CloudWatch metrics reporting
- Created comprehensive Terraform infrastructure from scratch with modular design
- Built 5-row CloudWatch dashboard with application, error rate, business, and infrastructure metrics
- Added proper error categorization and provider performance tracking

**Story 3.6: Critical System Alerts**
- Implemented SNS-based alerting infrastructure via Terraform
- Created critical alarms for NestJS 5xx error rate (>1% threshold)
- Created Go service processing failure rate alarm (>5% threshold)
- Created SQS queue depth alarm (>1000 messages threshold)
- Added actionable alarm descriptions with investigation guidance
- Configured email notifications with manual confirmation workflow

**Cross-cutting Infrastructure:**
- Complete AWS infrastructure as code with Terraform (S3, SQS, CloudWatch, SNS, IAM, optional RDS/Fargate)
- Structured logging implementation across all services (NestJS, Go, Rust)
- Metrics collection following the "Four Golden Signals" observability approach
- Configurable thresholds and deployment-ready monitoring solution

The platform now has production-grade observability with comprehensive metrics collection, centralized monitoring dashboards, and automated critical system alerting.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-08-03 23:42:01 +08:00


# Meteor Fullstack Infrastructure
This directory contains Terraform configuration for the Meteor fullstack application AWS infrastructure.
## Overview
The infrastructure includes:
- **S3 bucket** for storing meteor event files and media
- **SQS queue** for processing meteor events with dead letter queue
- **CloudWatch dashboard** for comprehensive monitoring
- **IAM policies** and roles for service permissions
- **Optional RDS PostgreSQL** instance
- **Optional VPC and Fargate** configuration for containerized deployment
## Quick Start
1. **Install Terraform** (version >= 1.0)
2. **Configure AWS credentials**:
```bash
aws configure
# OR set environment variables:
export AWS_ACCESS_KEY_ID="your-access-key"
export AWS_SECRET_ACCESS_KEY="your-secret-key"
export AWS_DEFAULT_REGION="us-east-1"
```
3. **Copy and customize variables**:
```bash
cp terraform.tfvars.example terraform.tfvars
# Edit terraform.tfvars with your desired configuration
```
4. **Initialize and apply**:
```bash
terraform init
terraform plan
terraform apply
```
## Configuration Options
### Basic Setup (Default)
- Creates S3 bucket and SQS queue only
- Assumes the database and containers are deployed outside this Terraform configuration
- Minimal cost option
### With RDS Database
```hcl
enable_rds = true
rds_instance_class = "db.t3.micro" # or larger for production
```
### With VPC and Fargate
```hcl
enable_fargate = true
web_backend_cpu = 256
web_backend_memory = 512
compute_service_cpu = 256
compute_service_memory = 512
```
## Environment Variables
After applying Terraform, configure your applications with these environment variables:
```bash
# From terraform output
AWS_REGION=$(terraform output -raw aws_region)
AWS_S3_BUCKET_NAME=$(terraform output -raw s3_bucket_name)
AWS_SQS_QUEUE_URL=$(terraform output -raw sqs_queue_url)
# If using RDS
DATABASE_URL=$(terraform output -raw rds_endpoint)
# If using IAM user (not Fargate)
AWS_ACCESS_KEY_ID=$(terraform output -raw app_access_key_id)
AWS_SECRET_ACCESS_KEY=$(terraform output -raw app_secret_access_key)
```
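If the applications read their configuration from a dotenv file, the outputs above can be captured in one step. The following is a minimal sketch, not part of the repository: the helper name `write_env_file` and the `.env` target path are assumptions, and only the three outputs that exist in every configuration are included.

```bash
# Sketch: write selected Terraform outputs to a dotenv file.
# Output names come from this README; the helper name and file path are hypothetical.
write_env_file() {
  local env_file="${1:-.env}"
  local pair var output value
  : > "$env_file"                      # create or truncate the target file
  for pair in \
    "AWS_REGION=aws_region" \
    "AWS_S3_BUCKET_NAME=s3_bucket_name" \
    "AWS_SQS_QUEUE_URL=sqs_queue_url"; do
    var="${pair%%=*}"                  # environment variable name (before '=')
    output="${pair#*=}"                # Terraform output name (after '=')
    value="$(terraform output -raw "$output")" || return 1
    printf '%s=%s\n' "$var" "$value" >> "$env_file"
  done
  chmod 600 "$env_file"                # restrict access in case secrets are added later
}
```

Run it from the `infrastructure/` directory after `terraform apply`, for example `write_env_file ../.env`. Docker Compose reads a `.env` file in the project directory for variable substitution, so the same file can feed the Compose setup described under Deployment Integration.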
## CloudWatch Dashboard
The infrastructure creates a monitoring dashboard. With region `us-east-1` and the dashboard name `meteor-dev-monitoring-dashboard`, it is available at:
```
https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#dashboards:name=meteor-dev-monitoring-dashboard
```
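The URL follows a fixed pattern, so it can be rebuilt for another region or environment. A small sketch, assuming a hypothetical helper named `dashboard_url`; the dashboard name used for other environments is also an assumption and should be checked against `cloudwatch.tf`.

```bash
# Sketch: build the CloudWatch console URL for a dashboard.
# Arguments: AWS region, dashboard name. The helper name is hypothetical.
dashboard_url() {
  local region="$1" name="$2"
  printf 'https://%s.console.aws.amazon.com/cloudwatch/home?region=%s#dashboards:name=%s\n' \
    "$region" "$region" "$name"
}
```

Calling `dashboard_url us-east-1 meteor-dev-monitoring-dashboard` prints the URL shown above.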
### Dashboard Includes:
- **Application metrics**: Request volume, response times, error rates
- **Business metrics**: Event processing, validation performance
- **Infrastructure metrics**: SQS queue depth, RDS performance, Fargate utilization
- **Custom metrics**: From your NestJS and Go services
## Metrics Integration
Your applications are already configured to send metrics to CloudWatch:
### NestJS Web Backend
- Namespace: `MeteorApp/WebBackend`
- Metrics: RequestCount, RequestDuration, ErrorCount, AuthOperationCount, etc.
### Go Compute Service
- Namespace: `MeteorApp/ComputeService`
- Metrics: MessageProcessingCount, ValidationCount, DatabaseOperationCount, etc.
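To confirm that metrics are actually arriving in these namespaces, the AWS CLI can list what CloudWatch has received. This is a sketch under the assumption that the AWS CLI is installed and configured for the same account and region; the helper name `list_meteor_metrics` is hypothetical.

```bash
# Sketch: list the metric names CloudWatch has received in each namespace.
# Namespaces come from this README; the helper name is hypothetical.
list_meteor_metrics() {
  local ns
  for ns in "MeteorApp/WebBackend" "MeteorApp/ComputeService"; do
    printf '== %s ==\n' "$ns"
    aws cloudwatch list-metrics \
      --namespace "$ns" \
      --query 'Metrics[].MetricName' \
      --output text
  done
}
```

An empty result under a namespace usually means that the service has not published any data points yet. A metric name may appear more than once when it is reported with several dimension combinations.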
## Cost Optimization
### Development Environment
```hcl
environment = "dev"
enable_rds = false # Use external database
enable_fargate = false # Use external containers
cloudwatch_log_retention_days = 7 # Shorter retention
```
### Production Environment
```hcl
environment = "prod"
enable_rds = true
rds_instance_class = "db.t3.small" # Appropriate size
enable_fargate = true # High availability
cloudwatch_log_retention_days = 30 # Longer retention
```
## File Structure
```
infrastructure/
├── main.tf # Provider and common configuration
├── variables.tf # Input variables
├── outputs.tf # Output values
├── s3.tf # S3 bucket for event storage
├── sqs.tf # SQS queues for processing
├── cloudwatch.tf # Monitoring dashboard and alarms
├── iam.tf # IAM roles and policies
├── rds.tf # Optional PostgreSQL database
├── vpc.tf # Optional VPC for Fargate
├── terraform.tfvars.example # Example configuration
└── README.md # This file
```
## Deployment Integration
### Docker Compose
Update your `docker-compose.yml` with Terraform outputs:
```yaml
environment:
- AWS_REGION=${AWS_REGION}
- AWS_S3_BUCKET_NAME=${AWS_S3_BUCKET_NAME}
- AWS_SQS_QUEUE_URL=${AWS_SQS_QUEUE_URL}
```
### GitHub Actions
```yaml
- name: Configure AWS credentials
  uses: aws-actions/configure-aws-credentials@v4
  with:
    aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
    aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
    aws-region: us-east-1
- name: Deploy infrastructure
  run: |
    cd infrastructure
    terraform init
    terraform apply -auto-approve
```
## Security Best Practices
1. **IAM Permissions**: Follow principle of least privilege
2. **S3 Security**: All buckets have public access blocked
3. **Encryption**: S3 server-side encryption enabled
4. **VPC**: Private subnets for database and compute resources
5. **Secrets**: RDS passwords stored in AWS Secrets Manager
## Monitoring and Alerts
The infrastructure includes CloudWatch alarms for:
- High error rates in web backend and compute service
- High response times
- SQS message age and dead letter queue messages
- RDS CPU utilization (when enabled)
To add notifications:
1. Create an SNS topic
2. Add the topic ARN to alarm actions in `cloudwatch.tf`
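The first step can be done from the AWS CLI. A minimal sketch, assuming a hypothetical helper `create_alert_topic` with a placeholder topic name and email address; if an SNS topic is already managed in Terraform, use that topic's ARN instead of creating a second one.

```bash
# Sketch: create an SNS topic, subscribe an email address, and print the topic ARN.
# Arguments: topic name, email address. The helper name is hypothetical.
create_alert_topic() {
  local topic_name="$1" email="$2" topic_arn
  topic_arn="$(aws sns create-topic --name "$topic_name" \
    --query 'TopicArn' --output text)" || return 1
  aws sns subscribe \
    --topic-arn "$topic_arn" \
    --protocol email \
    --notification-endpoint "$email" > /dev/null || return 1
  printf '%s\n' "$topic_arn"           # paste this ARN into the alarm actions
}
```

For example, `create_alert_topic meteor-dev-alerts ops@example.com` prints the ARN to add in `cloudwatch.tf`. Email subscriptions stay pending until the recipient confirms them through the link that SNS sends.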
## Cleanup
To destroy all resources:
```bash
terraform destroy
```
**Warning**: This will delete all data in S3 and databases. For production, ensure you have backups.
## Troubleshooting
### Common Issues
1. **S3 bucket name conflicts**: Bucket names must be globally unique
   - Solution: Change `project_name` or `environment` in variables
2. **RDS subnet group errors**: Requires subnets in different AZs
   - Solution: Ensure `enable_fargate = true` when using RDS, so that the optional VPC and its subnets are created
3. **IAM permission errors**: Check AWS credentials and permissions
   - Solution: Ensure your AWS credentials have the permissions required for the resources being created (admin access also works but is broader than needed)
4. **CloudWatch dashboard empty**: Wait for applications to send metrics
   - Solution: Deploy and run your applications to generate metrics
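For the IAM permission errors in item 3, it helps to confirm which identity Terraform will run as before looking at policies. A sketch using `aws sts get-caller-identity`; the helper name `check_aws_identity` is hypothetical.

```bash
# Sketch: report the AWS identity behind the current credentials.
# Returns non-zero when the credentials are missing or rejected.
check_aws_identity() {
  local arn
  if ! arn="$(aws sts get-caller-identity --query 'Arn' --output text 2>/dev/null)"; then
    echo "AWS credentials are missing or invalid" >&2
    return 1
  fi
  echo "Authenticated as: $arn"
}
```

If the reported ARN is not the user or role you expected, check `AWS_PROFILE` and the `AWS_ACCESS_KEY_ID` / `AWS_SECRET_ACCESS_KEY` variables in the current shell.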
### Getting Help
1. Check Terraform documentation: https://registry.terraform.io/providers/hashicorp/aws/latest/docs
2. Review AWS service limits and quotas
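3. Enable Terraform debug logging (`TF_LOG=DEBUG`) or check AWS CloudTrail events for detailed error messages; Terraform calls the AWS APIs directly, so failures do not appear as CloudFormation events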