S3 Lifecycle Policies for Blog Data Pipeline
Overview
This document defines the S3 lifecycle policies for the blog_data pipeline buckets, optimizing for cost while maintaining data availability and integrity.
Current Lifecycle Policies
1. URL Cache Bucket (ron-website-docs)
Purpose: Web scraping cache - stores cached web content to avoid repeated requests
Current Policy:
- Transition to STANDARD_IA: 30 days
- Transition to GLACIER: 90 days (conditional)
- Expiration: Based on the cache_retention_days variable (default: 30 days)
- Abort incomplete multipart uploads: 7 days
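The bullets above could be expressed in Terraform roughly as follows. This is a minimal sketch, not the contents of terraform/s3.tf: the resource label aws_s3_bucket.url_cache, the rule id, and the local name are hypothetical, and it assumes "conditional" means a transition is only emitted when cache_retention_days is larger than the transition age. Under that assumption, the default 30-day retention produces no transitions at all, because objects expire first.

```hcl
# Hypothetical sketch; names not taken from the source are marked as assumed.
locals {
  # Keep only transitions that happen before the object would expire.
  cache_transitions = [
    for t in [
      { days = var.transition_to_ia_days, storage_class = "STANDARD_IA" },
      { days = var.transition_to_glacier_days, storage_class = "GLACIER" },
    ] : t if t.days < var.cache_retention_days
  ]
}

resource "aws_s3_bucket_lifecycle_configuration" "url_cache" {
  bucket = aws_s3_bucket.url_cache.id # assumed resource label

  rule {
    id     = "cache_expiration" # assumed rule id
    status = "Enabled"

    filter {} # empty filter: the rule applies to every object

    dynamic "transition" {
      for_each = local.cache_transitions
      content {
        days          = transition.value.days
        storage_class = transition.value.storage_class
      }
    }

    expiration {
      days = var.cache_retention_days
    }

    abort_incomplete_multipart_upload {
      days_after_initiation = 7
    }
  }
}
```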
Rationale:
- Cache data is temporary and can be regenerated
- Older cache entries are less likely to be reused
- Cost optimization through early expiration
Status: Optimal - No changes needed
2. Raw Data Bucket (blog_data_raw)
Purpose: Stores raw extracted data from web scraping (source of truth)
Current Policy:
rule {
  id     = "raw_data_retention"
  status = "Enabled"

  transition {
    days          = 30
    storage_class = "STANDARD_IA"
  }

  abort_incomplete_multipart_upload {
    days_after_initiation = 7
  }
}

rule {
  id     = "old_versions_cleanup"
  status = "Enabled"

  noncurrent_version_expiration {
    noncurrent_days = 7
  }
}
Analysis:
- Transitions to STANDARD_IA after 30 days (cost optimization)
- Old versions expire after 7 days (keeps storage lean)
- Abort incomplete uploads after 7 days (cleanup)
- ⚠️ No expiration policy (keeps data indefinitely)
Recommendation: Keep current policy
- Raw data is the source of truth and should be retained
- 30-day transition to IA is appropriate (data accessed during pipeline runs)
- 7-day old version cleanup is aggressive but acceptable (pipeline overwrites data monthly)
Status: Optimal - No changes needed
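The old_versions_cleanup rule only has noncurrent versions to expire if versioning is enabled on the bucket. The source does not show that part of the configuration, so the following is a sketch under that assumption; the resource label aws_s3_bucket.blog_data_raw is hypothetical.

```hcl
# Assumed companion resource: noncurrent_version_expiration has no effect
# unless the bucket keeps object versions.
resource "aws_s3_bucket_versioning" "raw_data" {
  bucket = aws_s3_bucket.blog_data_raw.id # assumed resource label

  versioning_configuration {
    status = "Enabled"
  }
}
```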
3. Clean Data Bucket (blog_data_clean)
Purpose: Stores cleaned/processed data ready for Neo4j graph loading
Current Policy:
rule {
  id     = "clean_data_retention"
  status = "Enabled"

  transition {
    days          = 90
    storage_class = "STANDARD_IA"
  }

  abort_incomplete_multipart_upload {
    days_after_initiation = 7
  }
}

rule {
  id     = "old_versions_cleanup"
  status = "Enabled"

  noncurrent_version_expiration {
    noncurrent_days = 30
  }
}
Analysis:
- Transitions to STANDARD_IA after 90 days (cost optimization)
- Old versions expire after 30 days (reasonable retention)
- Abort incomplete uploads after 7 days (cleanup)
- ⚠️ No expiration policy (keeps data indefinitely)
Recommendation: Keep current policy
- Clean data can be regenerated from raw data if needed
- 90-day transition to IA is conservative (data may be accessed for debugging)
- 30-day old version retention allows rollback if issues are discovered
Status: Optimal - No changes needed
4. Kit Instructions Bucket (kit_instructions)
Purpose: Stores kit instruction PDFs and images
Current Policy:
- Transition to STANDARD_IA: 90 days
- Old versions expire: 30 days
- Abort incomplete uploads: 7 days
Status: Optimal - No changes needed
5. Design Files Bucket (design_files)
Purpose: Stores OpenRocket design files (.ork)
Current Policy:
- Transition to STANDARD_IA: 90 days
- Old versions expire: 30 days
- Abort incomplete uploads: 7 days
Status: Optimal - No changes needed
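The kit instructions and design files buckets share the same three settings, so one way to express them is a single resource applied to both. This is a sketch under assumptions: the aws_s3_bucket resource labels, the local name, and the rule id asset_retention are hypothetical, and the actual layout in terraform/s3.tf may differ.

```hcl
locals {
  # Assumed resource labels for the two buckets.
  asset_buckets = {
    kit_instructions = aws_s3_bucket.kit_instructions.id
    design_files     = aws_s3_bucket.design_files.id
  }
}

resource "aws_s3_bucket_lifecycle_configuration" "assets" {
  for_each = local.asset_buckets
  bucket   = each.value

  rule {
    id     = "asset_retention" # assumed rule id
    status = "Enabled"

    filter {} # empty filter: the rule applies to every object

    transition {
      days          = 90
      storage_class = "STANDARD_IA"
    }

    abort_incomplete_multipart_upload {
      days_after_initiation = 7
    }
  }

  rule {
    id     = "old_versions_cleanup"
    status = "Enabled"

    filter {}

    noncurrent_version_expiration {
      noncurrent_days = 30
    }
  }
}
```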
Cost Optimization Analysis
Storage Class Pricing (us-east-1)
- STANDARD: $0.023/GB/month
- STANDARD_IA: $0.0125/GB/month (46% savings)
- GLACIER: $0.004/GB/month (83% savings)
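The savings percentages follow directly from the per-GB prices listed above. As a small sketch, they can be reproduced with Terraform locals and inspected via terraform console or an output; the local and output names are illustrative only.

```hcl
locals {
  # Prices copied from the list above (USD per GB per month).
  price_per_gb_month = {
    STANDARD    = 0.023
    STANDARD_IA = 0.0125
    GLACIER     = 0.004
  }

  # Percentage saved relative to STANDARD, formatted to one decimal place.
  savings_vs_standard = {
    for class, price in local.price_per_gb_month :
    class => format("%.1f%%", (1 - price / local.price_per_gb_month["STANDARD"]) * 100)
  }
}

output "savings_vs_standard" {
  # STANDARD_IA => "45.7%", GLACIER => "82.6%", STANDARD => "0.0%"
  value = local.savings_vs_standard
}
```

Rounded to whole percentages these give the 46% and 83% figures quoted above.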
Current Cost Optimization
Raw Data Bucket:
- First 30 days: STANDARD ($0.023/GB/month)
- After 30 days: STANDARD_IA ($0.0125/GB/month)
- Savings: 46% after 30 days
Clean Data Bucket:
- First 90 days: STANDARD ($0.023/GB/month)
- After 90 days: STANDARD_IA ($0.0125/GB/month)
- Savings: 46% after 90 days
Estimated Monthly Costs
Assuming 1GB of data per bucket:
Raw Data:
- Month 1: $0.023 (all STANDARD)
- Month 2+: $0.0125 (all STANDARD_IA)
- First-year cost: ~$0.16/GB (1 × $0.023 + 11 × $0.0125)
Clean Data:
- Months 1-3: $0.023 (all STANDARD)
- Month 4+: $0.0125 (all STANDARD_IA)
- First-year cost: ~$0.18/GB (3 × $0.023 + 9 × $0.0125)
Total estimated first-year cost: ~$0.34 for 1 GB stored in each of the two buckets
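The estimate above can be checked with a few lines of arithmetic. The sketch below treats 30 days as one billing month and 90 days as three, assumes 1 GB is written once and never rewritten, and ignores request, retrieval, and minimum-duration charges; all local names are illustrative.

```hcl
locals {
  standard_price = 0.023  # USD per GB-month, from the pricing list
  ia_price       = 0.0125 # USD per GB-month, from the pricing list

  # Raw data: 1 month in STANDARD, then 11 months in STANDARD_IA.
  raw_first_year = 1 * local.standard_price + 11 * local.ia_price # 0.1605

  # Clean data: 3 months in STANDARD, then 9 months in STANDARD_IA.
  clean_first_year = 3 * local.standard_price + 9 * local.ia_price # 0.1815

  # Both buckets together, 1 GB each.
  total_first_year = local.raw_first_year + local.clean_first_year # 0.342
}

output "first_year_cost_usd" {
  value = {
    raw   = local.raw_first_year
    clean = local.clean_first_year
    total = local.total_first_year
  }
}
```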
Alternative Policies Considered
Option 1: Aggressive Cost Optimization
- Raw Data: Transition to GLACIER after 90 days
- Clean Data: Expire after 180 days (can regenerate from raw)
Pros:
- Maximum cost savings (83% for raw data)
- Minimal storage footprint
Cons:
- Glacier retrieval takes hours (not suitable for pipeline)
- Losing clean data requires re-running cleaning pipeline
- Risk of data loss if raw data is corrupted
Decision: Rejected - Pipeline needs fast access to data
Option 2: Conservative Retention
- Raw Data: Keep in STANDARD indefinitely
- Clean Data: Keep in STANDARD for 180 days
Pros:
- Fastest access to all data
- No retrieval delays
Cons:
- Higher costs (no IA savings)
- Unnecessary for infrequently accessed data
Decision: Rejected - Current policy provides better cost/performance balance
Option 3: Current Policy (Recommended)
- Raw Data: STANDARD → STANDARD_IA (30 days)
- Clean Data: STANDARD → STANDARD_IA (90 days)
Pros:
- Balances cost and performance
- Fast access during active pipeline runs
- Cost savings for older data
- No retrieval delays (IA has the same access latency as STANDARD)
Cons:
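- None significant at current volumes, though STANDARD_IA adds a per-GB retrieval fee, a 30-day minimum storage duration, and a 128 KB minimum billable object size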
Decision: APPROVED - Current policy is optimal
Implementation Status
All Policies Implemented
All lifecycle policies are already implemented in Terraform:
- terraform/s3.tf - Contains all bucket lifecycle configurations
- terraform/variables.tf - Defines lifecycle management variables
- Controlled by: var.enable_lifecycle_management (default: true)
Terraform Variables
variable "enable_lifecycle_management" {
  description = "Enable S3 lifecycle management for cost optimization"
  type        = bool
  default     = true
}

variable "transition_to_ia_days" {
  description = "Days after which to transition objects to Infrequent Access"
  type        = number
  default     = 30
}

variable "transition_to_glacier_days" {
  description = "Days after which to transition objects to Glacier"
  type        = number
  default     = 90
}
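The URL cache policy also refers to a cache_retention_days variable with a default of 30 days, which is not listed above. A definition in the same style might look like the sketch below; the description text and the validation block are assumptions, not taken from terraform/variables.tf.

```hcl
variable "cache_retention_days" {
  description = "Days to keep cached web content before it expires" # assumed wording
  type        = number
  default     = 30

  # Assumed guard: reject zero, negative, or fractional day counts.
  validation {
    condition     = var.cache_retention_days >= 1 && floor(var.cache_retention_days) == var.cache_retention_days
    error_message = "cache_retention_days must be a whole number of days, at least 1."
  }
}
```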
Deployment
Lifecycle policies are deployed automatically when:
- var.enable_lifecycle_management = true (default)
- Running terraform apply
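One common way to make a boolean variable control whether a resource is deployed is the count pattern. The sketch below applies it to the raw data rule; it is an illustration, not the actual contents of terraform/s3.tf, and the resource label aws_s3_bucket.blog_data_raw is an assumption.

```hcl
resource "aws_s3_bucket_lifecycle_configuration" "raw_data" {
  # With the flag set to false, count is 0 and no lifecycle configuration is created.
  count  = var.enable_lifecycle_management ? 1 : 0
  bucket = aws_s3_bucket.blog_data_raw.id # assumed resource label

  rule {
    id     = "raw_data_retention"
    status = "Enabled"

    filter {} # empty filter: the rule applies to every object

    transition {
      days          = var.transition_to_ia_days
      storage_class = "STANDARD_IA"
    }

    abort_incomplete_multipart_upload {
      days_after_initiation = 7
    }
  }
}
```

Setting enable_lifecycle_management to false and re-running terraform apply would then remove the configuration rather than leave it in place.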
Monitoring and Maintenance
Recommended Monitoring
- Storage Metrics:
  - Monitor bucket size growth
  - Track storage class distribution
  - Alert on unexpected growth
- Cost Metrics:
  - Monthly S3 costs per bucket
  - Storage class transition counts
  - Data retrieval costs (STANDARD_IA bills per GB retrieved, so expect close to $0 only while IA objects go unread)
- Lifecycle Metrics:
  - Objects transitioned to IA
  - Objects expired
  - Incomplete multipart uploads aborted
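The "alert on unexpected growth" item could be implemented with a CloudWatch alarm on the daily S3 storage metric. The following is a sketch only: the alarm name, the 5 GiB threshold, and the resource label aws_s3_bucket.blog_data_raw are placeholders, and no notification target is configured.

```hcl
resource "aws_cloudwatch_metric_alarm" "raw_bucket_size" {
  alarm_name        = "blog-data-raw-size-growth" # placeholder name
  alarm_description = "Raw data bucket STANDARD storage exceeds the expected size"

  namespace   = "AWS/S3"
  metric_name = "BucketSizeBytes"
  statistic   = "Average"
  period      = 86400 # S3 storage metrics are reported once per day

  evaluation_periods  = 1
  comparison_operator = "GreaterThanThreshold"
  threshold           = 5 * 1024 * 1024 * 1024 # placeholder: 5 GiB in bytes

  dimensions = {
    BucketName  = aws_s3_bucket.blog_data_raw.bucket # assumed resource label
    StorageType = "StandardStorage"                  # use "StandardIAStorage" to track IA
  }
}
```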
Maintenance Schedule
- Monthly: Review storage costs and usage
- Quarterly: Evaluate lifecycle policy effectiveness
- Annually: Reassess retention requirements
Conclusion
Priority 4: Add S3 Lifecycle Policies - COMPLETE
All S3 buckets in the blog_data pipeline already have optimal lifecycle policies implemented:
- Raw data bucket: 30-day transition to IA, indefinite retention
- Clean data bucket: 90-day transition to IA, indefinite retention
- Cache bucket: 30-day expiration with IA/Glacier transitions
- Kit instructions bucket: 90-day transition to IA
- Design files bucket: 90-day transition to IA
No changes needed - Current policies provide optimal cost/performance balance.
Estimated savings: ~46% lower per-GB storage rate for data older than its transition period, before any STANDARD_IA retrieval or minimum-duration charges.