S3 Lifecycle Policies for Blog Data Pipeline

Overview

This document defines the S3 lifecycle policies for the blog_data pipeline buckets, optimizing for cost while maintaining data availability and integrity.

Current Lifecycle Policies

1. URL Cache Bucket (ron-website-docs)

Purpose: Web scraping cache - stores cached web content to avoid repeated requests

Current Policy:

  • Transition to STANDARD_IA: 30 days
  • Transition to GLACIER: 90 days (conditional)
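  • Expiration: Based on cache_retention_days variable (default: 30 days). At the default, objects expire at the same age the STANDARD_IA transition would occur, so the transitions only take effect when cache_retention_days is set higher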
  • Abort incomplete multipart uploads: 7 days
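
The bullets above could be expressed in Terraform roughly as follows. This is a minimal sketch, not the contents of terraform/s3.tf: the resource label, rule id, and local name are hypothetical, the variable declaration is assumed from the documented name and default, and the condition used for the transitions (emit a transition only when it falls before expiration) is one guess at what "conditional" means here.

```hcl
# Assumed declaration; the document only states the name and default.
variable "cache_retention_days" {
  description = "Days to keep cached objects before expiring them"
  type        = number
  default     = 30
}

locals {
  # Hypothetical helper: candidate transitions from the policy above.
  cache_transitions = [
    { days = 30, storage_class = "STANDARD_IA" },
    { days = 90, storage_class = "GLACIER" },
  ]
}

resource "aws_s3_bucket_lifecycle_configuration" "url_cache" {
  bucket = "ron-website-docs"

  rule {
    id     = "cache_expiration" # hypothetical rule id
    status = "Enabled"

    filter {} # apply to every object in the bucket

    # Assumption: emit a transition only if it happens before expiration.
    dynamic "transition" {
      for_each = [
        for t in local.cache_transitions : t
        if t.days < var.cache_retention_days
      ]
      content {
        days          = transition.value.days
        storage_class = transition.value.storage_class
      }
    }

    expiration {
      days = var.cache_retention_days
    }

    abort_incomplete_multipart_upload {
      days_after_initiation = 7
    }
  }
}
```

With the default of 30, the filter in this sketch yields no transitions and objects simply expire. Setting cache_retention_days to 120, for example, would enable both transitions.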

Rationale:

  • Cache data is temporary and can be regenerated
  • Older cache entries are less likely to be reused
  • Cost optimization through early expiration

Status: ✅ Optimal - No changes needed


2. Raw Data Bucket (blog_data_raw)

Purpose: Stores raw extracted data from web scraping (source of truth)

Current Policy:

rule {
  id     = "raw_data_retention"
  status = "Enabled"
  
  transition {
    days          = 30
    storage_class = "STANDARD_IA"
  }
  
  abort_incomplete_multipart_upload {
    days_after_initiation = 7
  }
}

rule {
  id     = "old_versions_cleanup"
  status = "Enabled"
  
  noncurrent_version_expiration {
    noncurrent_days = 7
  }
}

Analysis:

  • ✅ Transitions to STANDARD_IA after 30 days (cost optimization)
  • ✅ Old versions expire after 7 days (keeps storage lean)
  • ✅ Abort incomplete uploads after 7 days (cleanup)
  • ⚠️ No expiration policy (keeps data indefinitely)

Recommendation: Keep current policy

  • Raw data is the source of truth and should be retained
  • 30-day transition to IA is appropriate (data accessed during pipeline runs)
  • 7-day old version cleanup is aggressive but acceptable (pipeline overwrites data monthly)

Status: ✅ Optimal - No changes needed


3. Clean Data Bucket (blog_data_clean)

Purpose: Stores cleaned/processed data ready for Neo4j graph loading

Current Policy:

rule {
  id     = "clean_data_retention"
  status = "Enabled"
  
  transition {
    days          = 90
    storage_class = "STANDARD_IA"
  }
  
  abort_incomplete_multipart_upload {
    days_after_initiation = 7
  }
}

rule {
  id     = "old_versions_cleanup"
  status = "Enabled"
  
  noncurrent_version_expiration {
    noncurrent_days = 30
  }
}

Analysis:

  • ✅ Transitions to STANDARD_IA after 90 days (cost optimization)
  • ✅ Old versions expire after 30 days (reasonable retention)
  • ✅ Abort incomplete uploads after 7 days (cleanup)
  • ⚠️ No expiration policy (keeps data indefinitely)

Recommendation: Keep current policy

  • Clean data can be regenerated from raw data if needed
  • 90-day transition to IA is conservative (data may be accessed for debugging)
  • 30-day old version retention allows rollback if issues are discovered

Status: ✅ Optimal - No changes needed


4. Kit Instructions Bucket (kit_instructions)

Purpose: Stores kit instruction PDFs and images

Current Policy:

  • Transition to STANDARD_IA: 90 days
  • Old versions expire: 30 days
  • Abort incomplete uploads: 7 days

Status: ✅ Optimal - No changes needed


5. Design Files Bucket (design_files)

Purpose: Stores OpenRocket design files (.ork)

Current Policy:

  • Transition to STANDARD_IA: 90 days
  • Old versions expire: 30 days
  • Abort incomplete uploads: 7 days
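
Since this bucket and the Kit Instructions bucket list the same three settings, one way to express them is a single configuration applied to both. The following is a sketch under stated assumptions rather than the actual Terraform: the variable asset_bucket_names, the resource label, and the rule id asset_retention are hypothetical, and the second rule reuses the old_versions_cleanup id seen in the raw and clean bucket policies.

```hcl
# Hypothetical input: names of the buckets that share this policy.
variable "asset_bucket_names" {
  description = "Buckets holding kit instructions and design files"
  type        = set(string)
}

resource "aws_s3_bucket_lifecycle_configuration" "assets" {
  for_each = var.asset_bucket_names
  bucket   = each.value

  rule {
    id     = "asset_retention" # hypothetical rule id
    status = "Enabled"

    filter {} # apply to every object in the bucket

    transition {
      days          = 90
      storage_class = "STANDARD_IA"
    }

    abort_incomplete_multipart_upload {
      days_after_initiation = 7
    }
  }

  rule {
    id     = "old_versions_cleanup"
    status = "Enabled"

    filter {}

    # Only has an effect when bucket versioning is enabled.
    noncurrent_version_expiration {
      noncurrent_days = 30
    }
  }
}
```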

Status: ✅ Optimal - No changes needed


Cost Optimization Analysis

Storage Class Pricing (us-east-1)

  • STANDARD: $0.023/GB/month
  • STANDARD_IA: $0.0125/GB/month (46% savings)
  • GLACIER: $0.004/GB/month (83% savings)

Current Cost Optimization

Raw Data Bucket:

  • First 30 days: STANDARD ($0.023/GB/month)
  • After 30 days: STANDARD_IA ($0.0125/GB/month)
  • Savings: 46% after 30 days

Clean Data Bucket:

  • First 90 days: STANDARD ($0.023/GB/month)
  • After 90 days: STANDARD_IA ($0.0125/GB/month)
  • Savings: 46% after 90 days

Estimated Monthly Costs

Assuming 1 GB of data per bucket, written once at the start of the year:

Raw Data:

  • Month 1: $0.023 (all STANDARD)
  • Month 2+: $0.0125 (all STANDARD_IA)
  • Annual cost: ~$0.16/GB (1 × $0.023 + 11 × $0.0125)

Clean Data:

  • Months 1-3: $0.023 (all STANDARD)
  • Month 4+: $0.0125 (all STANDARD_IA)
  • Annual cost: ~$0.18/GB (3 × $0.023 + 9 × $0.0125)

Total estimated annual cost: ~$0.34/GB across both buckets
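
For anyone who wants to re-run these estimates with different rates or transition timings, the arithmetic can be reproduced with Terraform locals. This is an illustrative sketch only: all local and output names are hypothetical, and it approximates the 30-day and 90-day transitions as 1 and 3 whole months, as the figures above do.

```hcl
locals {
  # Per-GB monthly rates quoted in this document (us-east-1).
  standard_rate = 0.023
  ia_rate       = 0.0125

  # Whole months spent in STANDARD before the transition (approximation).
  raw_standard_months   = 1
  clean_standard_months = 3

  # Annual cost per GB: months in STANDARD plus remaining months in IA.
  raw_annual_per_gb = (
    local.raw_standard_months * local.standard_rate +
    (12 - local.raw_standard_months) * local.ia_rate
  )
  clean_annual_per_gb = (
    local.clean_standard_months * local.standard_rate +
    (12 - local.clean_standard_months) * local.ia_rate
  )
}

output "annual_storage_cost_per_gb" {
  value = {
    raw   = local.raw_annual_per_gb
    clean = local.clean_annual_per_gb
    total = local.raw_annual_per_gb + local.clean_annual_per_gb
  }
}
```

The three values work out to 0.1605, 0.1815, and 0.342, which round to the ~$0.16, ~$0.18, and ~$0.34 figures above. Request and retrieval charges are not included.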


Alternative Policies Considered

Option 1: Aggressive Cost Optimization

  • Raw Data: Transition to GLACIER after 90 days
  • Clean Data: Expire after 180 days (can regenerate from raw)

Pros:

  • Maximum cost savings (83% for raw data)
  • Minimal storage footprint

Cons:

  • Standard Glacier retrievals take hours (not suitable for the pipeline)
  • Losing clean data requires re-running the cleaning pipeline
  • Risk of permanent data loss if raw data is corrupted after the clean copy has expired

Decision: ❌ Rejected - Pipeline needs fast access to data


Option 2: Conservative Retention

  • Raw Data: Keep in STANDARD indefinitely
  • Clean Data: Keep in STANDARD for 180 days

Pros:

  • Fastest access to all data
  • No retrieval delays

Cons:

  • Higher costs (no IA savings)
  • Unnecessary for infrequently accessed data

Decision: ❌ Rejected - Current policy provides better cost/performance balance


Option 3: Current Policy

  • Raw Data: STANDARD → STANDARD_IA (30 days)
  • Clean Data: STANDARD → STANDARD_IA (90 days)

Pros:

  • ✅ Balances cost and performance
  • ✅ Fast access during active pipeline runs
  • ✅ Cost savings for older data
  • ✅ No retrieval delays (STANDARD_IA has the same access latency as STANDARD)

Cons:

  • None significant

Decision: ✅ APPROVED - Current policy is optimal


Implementation Status

✅ All Policies Implemented

All lifecycle policies are already implemented in Terraform:

  1. terraform/s3.tf - Contains all bucket lifecycle configurations
  2. terraform/variables.tf - Defines lifecycle management variables
  3. Controlled by: var.enable_lifecycle_management (default: true)

Terraform Variables

variable "enable_lifecycle_management" {
  description = "Enable S3 lifecycle management for cost optimization"
  type        = bool
  default     = true
}

variable "transition_to_ia_days" {
  description = "Days after which to transition objects to Infrequent Access"
  type        = number
  default     = 30
}

variable "transition_to_glacier_days" {
  description = "Days after which to transition objects to Glacier"
  type        = number
  default     = 90
}

Deployment

Lifecycle policies are deployed when both of the following hold:

  1. var.enable_lifecycle_management is true (the default)
  2. terraform apply is run
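
One common way to gate a resource on a boolean variable is the count meta-argument. The sketch below assumes that pattern; the document does not show how terraform/s3.tf actually wires the flag. The variable raw_bucket_name is a hypothetical stand-in for the real bucket reference, and the other two declarations repeat the ones listed under Terraform Variables so the example stands alone.

```hcl
variable "enable_lifecycle_management" {
  type    = bool
  default = true
}

variable "transition_to_ia_days" {
  type    = number
  default = 30
}

# Hypothetical input standing in for the real bucket reference.
variable "raw_bucket_name" {
  type = string
}

resource "aws_s3_bucket_lifecycle_configuration" "raw_data" {
  # Create the configuration only when lifecycle management is enabled.
  count  = var.enable_lifecycle_management ? 1 : 0
  bucket = var.raw_bucket_name

  rule {
    id     = "raw_data_retention"
    status = "Enabled"

    filter {} # apply to every object in the bucket

    transition {
      days          = var.transition_to_ia_days
      storage_class = "STANDARD_IA"
    }

    abort_incomplete_multipart_upload {
      days_after_initiation = 7
    }
  }
}
```

Under this pattern, setting the flag to false and applying would remove the lifecycle configuration from the bucket. It would not move objects that have already transitioned back to STANDARD.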

Monitoring and Maintenance

  1. Storage Metrics:

    • Monitor bucket size growth
    • Track storage class distribution
    • Alert on unexpected growth
  2. Cost Metrics:

    • Monthly S3 costs per bucket
    • Storage class transition counts
    • Data retrieval costs (STANDARD_IA bills a per-GB retrieval fee, so this should stay near $0 if IA objects are rarely read)
  3. Lifecycle Metrics:

    • Objects transitioned to IA
    • Objects expired
    • Incomplete multipart uploads aborted
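
As one possible way to implement the "alert on unexpected growth" item, a CloudWatch alarm can watch the daily BucketSizeBytes metric that S3 publishes. This is a sketch under assumptions: the variable names, the alarm name, and the 5 GiB threshold are hypothetical, and no notification target is attached.

```hcl
# Hypothetical inputs for the alarm.
variable "monitored_bucket_name" {
  type = string
}

variable "size_alert_threshold_bytes" {
  description = "Alert when STANDARD storage exceeds this many bytes"
  type        = number
  default     = 5368709120 # 5 GiB, an arbitrary example value
}

resource "aws_cloudwatch_metric_alarm" "bucket_size" {
  alarm_name          = "${var.monitored_bucket_name}-size-growth"
  alarm_description   = "STANDARD storage in the bucket exceeded the expected size"
  namespace           = "AWS/S3"
  metric_name         = "BucketSizeBytes"
  statistic           = "Average"
  period              = 86400 # S3 storage metrics are reported once per day
  evaluation_periods  = 1
  comparison_operator = "GreaterThanThreshold"
  threshold           = var.size_alert_threshold_bytes
  treat_missing_data  = "notBreaching"

  dimensions = {
    BucketName  = var.monitored_bucket_name
    StorageType = "StandardStorage"
  }
}
```

Changing StorageType to "StandardIAStorage" would track the STANDARD_IA portion instead, which helps with the storage class distribution item. To send notifications, an alarm_actions list pointing at an SNS topic would need to be added.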

Maintenance Schedule

  • Monthly: Review storage costs and usage
  • Quarterly: Evaluate lifecycle policy effectiveness
  • Annually: Reassess retention requirements

Conclusion

Priority 4: Add S3 Lifecycle Policies - ✅ COMPLETE

All S3 buckets in the blog_data pipeline already have optimal lifecycle policies implemented:

  1. ✅ Raw data bucket: 30-day transition to IA, indefinite retention
  2. ✅ Clean data bucket: 90-day transition to IA, indefinite retention
  3. ✅ Cache bucket: expiration after cache_retention_days (default 30), with conditional IA/Glacier transitions
  4. ✅ Kit instructions bucket: 90-day transition to IA
  5. ✅ Design files bucket: 90-day transition to IA

No changes needed - Current policies provide optimal cost/performance balance.

Estimated savings: ~46% on the per-GB storage rate for data older than its transition period.