S3 Setup Guide

Overview

The blog_data pipeline stores all raw, cleaned, and cached data in AWS S3 buckets, so it can run on Prefect's hosted (SaaS) platform, where container filesystems are ephemeral and local files do not persist between runs.

Architecture

Data Flow:
┌─────────────────────────────────────────────────────────────┐
│ Web Scraping (Extraction)                                   │
│ - Scrapes data from various sources                         │
│ - Saves to: blog-data-raw S3 bucket                         │
└─────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────┐
│ Data Cleaning (Transformation)                              │
│ - Reads from: blog-data-raw S3 bucket                       │
│ - Cleans and validates data                                 │
│ - Saves to: blog-data-clean S3 bucket                       │
└─────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────┐
│ Graph Loading (Neo4j)                                       │
│ - Reads from: blog-data-clean S3 bucket                     │
│ - Loads into Neo4j database                                 │
└─────────────────────────────────────────────────────────────┘
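
A minimal sketch of how these three stages could be chained in a Prefect flow; the task names, prefixes, and return values below are illustrative, not the pipeline's actual definitions:

from prefect import flow, task

RAW_BUCKET = "blog-data-raw"
CLEAN_BUCKET = "blog-data-clean"

@task
def extract() -> str:
    # Scrape the sources and write raw CSVs to the raw bucket,
    # returning the S3 prefix that was written (body omitted).
    return f"s3://{RAW_BUCKET}/example-run/"

@task
def clean(raw_prefix: str) -> str:
    # Read raw CSVs, validate and clean them, write to the clean bucket (body omitted).
    return f"s3://{CLEAN_BUCKET}/example-run/"

@task
def load(clean_prefix: str) -> None:
    # Read cleaned CSVs and load them into Neo4j (body omitted).
    pass

@flow
def blog_data_pipeline() -> None:
    raw_prefix = extract()
    clean_prefix = clean(raw_prefix)
    load(clean_prefix)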

S3 Buckets

blog-data-raw

  • Purpose: Stores raw extracted data from web sources
  • Lifecycle: Data retained for 30 days (configurable)
  • Access: Read/write by extraction tasks
  • Format: CSV files organized by entity type

blog-data-clean

  • Purpose: Stores cleaned, validated data ready for Neo4j
  • Lifecycle: Data retained indefinitely
  • Access: Read by graph loading tasks
  • Format: CSV files with standardized schema

blog-data-cache

  • Purpose: Caches web content to avoid duplicate requests
  • Lifecycle: Session-based (cleared after extraction)
  • Access: Read/write by scraper
  • Format: HTML/JSON content with MD5 hash keys
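
As an illustration of the cache format, content can be stored under the MD5 hash of its URL; the helpers below are a hypothetical sketch, not the scraper's actual cache code:

import hashlib
import boto3

s3 = boto3.client("s3")
CACHE_BUCKET = "blog-data-cache"

def cache_put(url: str, html: str) -> str:
    # Key the object by the MD5 hash of the URL so a repeat request maps to the same key.
    key = hashlib.md5(url.encode("utf-8")).hexdigest()
    s3.put_object(Bucket=CACHE_BUCKET, Key=key, Body=html.encode("utf-8"))
    return key

def cache_get(url: str) -> str | None:
    # Return cached content if present, otherwise None so the caller fetches the page.
    key = hashlib.md5(url.encode("utf-8")).hexdigest()
    try:
        obj = s3.get_object(Bucket=CACHE_BUCKET, Key=key)
        return obj["Body"].read().decode("utf-8")
    except s3.exceptions.NoSuchKey:
        return None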

Configuration

Set these environment variables in .env.local:

# AWS Credentials
AWS_ACCESS_KEY_ID=your_access_key
AWS_SECRET_ACCESS_KEY=your_secret_key
AWS_REGION=eu-west-2

# S3 Buckets
AWS_BUCKET_NAME=blog-data-cache
AWS_BLOG_DATA_RAW_BUCKET=blog-data-raw
AWS_BLOG_DATA_CLEAN_BUCKET=blog-data-clean
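
Once .env.local has been loaded into the environment (for example with python-dotenv), an S3 client can be built from these variables. boto3 also picks up AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY from the environment automatically; the explicit form is shown only for clarity:

import os
import boto3

s3 = boto3.client(
    "s3",
    region_name=os.environ.get("AWS_REGION", "eu-west-2"),
    aws_access_key_id=os.environ["AWS_ACCESS_KEY_ID"],
    aws_secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"],
)

RAW_BUCKET = os.environ["AWS_BLOG_DATA_RAW_BUCKET"]
CLEAN_BUCKET = os.environ["AWS_BLOG_DATA_CLEAN_BUCKET"]
CACHE_BUCKET = os.environ["AWS_BUCKET_NAME"]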

Bucket Provisioning

The S3 buckets for this pipeline are created and managed by Terraform in the blog_infra repository; this repository does not define or apply any S3 infrastructure itself.

Key Implementation Details

Data Source Extraction

  • src/data_sources/base.py: save_to_csv() writes to S3
  • _save_single_csv_s3(): Single file saves
  • _save_split_csv_s3(): Split files by field value
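
A hedged sketch of the split-save pattern, writing one CSV per distinct value of a field; the actual _save_split_csv_s3() signature and key layout may differ:

import boto3
import pandas as pd

s3 = boto3.client("s3")

def save_split_csv_s3(df: pd.DataFrame, bucket: str, prefix: str, split_field: str) -> None:
    # Write one CSV object per distinct value of split_field, entirely in memory,
    # which is what keeps the pipeline independent of the local filesystem.
    for value, group in df.groupby(split_field):
        body = group.to_csv(index=False).encode("utf-8")
        s3.put_object(Bucket=bucket, Key=f"{prefix}/{value}.csv", Body=body)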

Data Cleaning

  • tasks/cleaning/utils.py: load_raw_csv() reads from raw bucket
  • save_clean_csv(): Writes to clean bucket
  • Both use boto3 for S3 operations
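
A hedged sketch of the read-from-raw / write-to-clean pattern these utilities follow; the real load_raw_csv() and save_clean_csv() signatures may differ:

import io
import boto3
import pandas as pd

s3 = boto3.client("s3")

def read_csv_from_s3(bucket: str, key: str) -> pd.DataFrame:
    # Stream the object body straight into pandas without touching local disk.
    obj = s3.get_object(Bucket=bucket, Key=key)
    return pd.read_csv(io.BytesIO(obj["Body"].read()))

def write_csv_to_s3(df: pd.DataFrame, bucket: str, key: str) -> None:
    # Serialize in memory and upload the bytes directly.
    s3.put_object(Bucket=bucket, Key=key, Body=df.to_csv(index=False).encode("utf-8"))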

URL Buffering

  • src/scraper.py: URLBuffer class manages URL deduplication
  • Buffer key format: {session_id}/{url_hash}
  • Avoids duplicate web requests within a session
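
A hypothetical in-memory sketch of the buffering idea; the real URLBuffer in src/scraper.py may differ, for example by persisting keys to the cache bucket:

import hashlib

class URLBufferSketch:
    def __init__(self, session_id: str) -> None:
        self.session_id = session_id
        self._seen: set[str] = set()

    def _key(self, url: str) -> str:
        # Mirrors the {session_id}/{url_hash} key format described above.
        url_hash = hashlib.md5(url.encode("utf-8")).hexdigest()
        return f"{self.session_id}/{url_hash}"

    def should_fetch(self, url: str) -> bool:
        # True only the first time a URL is seen within this session.
        key = self._key(url)
        if key in self._seen:
            return False
        self._seen.add(key)
        return True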

Troubleshooting

Access Denied Errors

  • Verify AWS credentials in .env.local
  • Check IAM user has S3 permissions
  • Ensure bucket names are correct
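
A quick sanity check for which identity boto3 is actually using and whether it can reach a bucket (bucket name shown as an example):

import boto3

# Print the ARN of the identity the configured credentials resolve to.
print(boto3.client("sts").get_caller_identity()["Arn"])

# head_bucket raises a ClientError (403 or 404) if access is denied or the bucket is missing.
boto3.client("s3").head_bucket(Bucket="blog-data-raw")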

Bucket Not Found

  • Verify buckets exist in AWS console
  • Check bucket names match configuration
  • Ensure correct AWS region

Slow Performance

  • Check network connectivity
  • Monitor S3 request metrics in AWS console
  • Consider enabling S3 Transfer Acceleration
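
If acceleration has been enabled on a bucket, boto3 can be pointed at the accelerated endpoint through botocore's client config; a minimal sketch:

import boto3
from botocore.config import Config

# Route S3 requests through the Transfer Acceleration endpoint
# (the bucket must already have acceleration enabled).
s3 = boto3.client("s3", config=Config(s3={"use_accelerate_endpoint": True}))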

Related Documentation

  • docs/ARCHITECTURE.md - Overall system architecture
  • README.md - Project overview