S3 Setup Guide

Overview

The blog_data pipeline stores all raw, cleaned, and cached data in AWS S3 buckets, so it can run on Prefect's hosted (SaaS) platform, where container filesystems are ephemeral and local files do not persist between runs.

Architecture

Data Flow:
┌─────────────────────────────────────────────────────────────┐
│ Web Scraping (Extraction)                                   │
│ - Scrapes data from various sources                         │
│ - Saves to: blog-data-raw S3 bucket                         │
└─────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────┐
│ Data Cleaning (Transformation)                              │
│ - Reads from: blog-data-raw S3 bucket                       │
│ - Cleans and validates data                                 │
│ - Saves to: blog-data-clean S3 bucket                       │
└─────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────┐
│ Graph Loading (Neo4j)                                       │
│ - Reads from: blog-data-clean S3 bucket                     │
│ - Loads into Neo4j database                                 │
└─────────────────────────────────────────────────────────────┘
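
A minimal sketch of how these three stages could be chained in a Prefect flow; the task names, prefixes, and return values below are illustrative, not the pipeline's actual definitions:

from prefect import flow, task

RAW_BUCKET = "blog-data-raw"
CLEAN_BUCKET = "blog-data-clean"

@task
def extract() -> str:
    # Scrape the sources and write raw CSVs to the raw bucket,
    # returning the S3 prefix that was written (body omitted).
    return f"s3://{RAW_BUCKET}/example-run/"

@task
def clean(raw_prefix: str) -> str:
    # Read raw CSVs, validate and clean them, write to the clean bucket (body omitted).
    return f"s3://{CLEAN_BUCKET}/example-run/"

@task
def load(clean_prefix: str) -> None:
    # Read cleaned CSVs and load them into Neo4j (body omitted).
    pass

@flow
def blog_data_pipeline() -> None:
    raw_prefix = extract()
    clean_prefix = clean(raw_prefix)
    load(clean_prefix)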

S3 Buckets

blog-data-raw

  • Purpose: Stores raw extracted data from web sources
  • Lifecycle: Data retained for 30 days (configurable)
  • Access: Read/write by extraction tasks
  • Format: CSV files organized by entity type

blog-data-clean

  • Purpose: Stores cleaned, validated data ready for Neo4j
  • Lifecycle: Data retained indefinitely
  • Access: Read by graph loading tasks
  • Format: CSV files with standardized schema

blog-data-cache

  • Purpose: Caches web content to avoid duplicate requests
  • Lifecycle: Session-based (cleared after extraction)
  • Access: Read/write by scraper
  • Format: HTML/JSON content with MD5 hash keys
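
As an illustration of the cache format, content can be stored under the MD5 hash of its URL; the helpers below are a hypothetical sketch, not the scraper's actual cache code:

import hashlib
import boto3

s3 = boto3.client("s3")
CACHE_BUCKET = "blog-data-cache"

def cache_put(url: str, html: str) -> str:
    # Key the object by the MD5 hash of the URL so a repeat request maps to the same key.
    key = hashlib.md5(url.encode("utf-8")).hexdigest()
    s3.put_object(Bucket=CACHE_BUCKET, Key=key, Body=html.encode("utf-8"))
    return key

def cache_get(url: str) -> str | None:
    # Return cached content if present, otherwise None so the caller fetches the page.
    key = hashlib.md5(url.encode("utf-8")).hexdigest()
    try:
        obj = s3.get_object(Bucket=CACHE_BUCKET, Key=key)
        return obj["Body"].read().decode("utf-8")
    except s3.exceptions.NoSuchKey:
        return None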

Configuration

Set these environment variables in .env.local:

# AWS Credentials
AWS_ACCESS_KEY_ID=your_access_key
AWS_SECRET_ACCESS_KEY=your_secret_key
AWS_REGION=eu-west-2

# S3 Buckets
AWS_BUCKET_NAME=blog-data-cache
AWS_BLOG_DATA_RAW_BUCKET=blog-data-raw
AWS_BLOG_DATA_CLEAN_BUCKET=blog-data-clean
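
Once .env.local has been loaded into the environment (for example with python-dotenv), an S3 client can be built from these variables. boto3 also picks up AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY from the environment automatically; the explicit form is shown only for clarity:

import os
import boto3

s3 = boto3.client(
    "s3",
    region_name=os.environ.get("AWS_REGION", "eu-west-2"),
    aws_access_key_id=os.environ["AWS_ACCESS_KEY_ID"],
    aws_secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"],
)

RAW_BUCKET = os.environ["AWS_BLOG_DATA_RAW_BUCKET"]
CLEAN_BUCKET = os.environ["AWS_BLOG_DATA_CLEAN_BUCKET"]
CACHE_BUCKET = os.environ["AWS_BUCKET_NAME"]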

Bucket Provisioning

The S3 buckets for this pipeline are created and managed by Terraform in the blog_infra repository; this repository does not define or apply any S3 infrastructure itself.

Key Implementation Details

Data Source Extraction

  • src/data_sources/base.py: save_to_csv() writes to S3
  • _save_single_csv_s3(): Single file saves
  • _save_split_csv_s3(): Split files by field value
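
A hedged sketch of the split-save pattern, writing one CSV per distinct value of a field; the actual _save_split_csv_s3() signature and key layout may differ:

import boto3
import pandas as pd

s3 = boto3.client("s3")

def save_split_csv_s3(df: pd.DataFrame, bucket: str, prefix: str, split_field: str) -> None:
    # Write one CSV object per distinct value of split_field, entirely in memory,
    # which is what keeps the pipeline independent of the local filesystem.
    for value, group in df.groupby(split_field):
        body = group.to_csv(index=False).encode("utf-8")
        s3.put_object(Bucket=bucket, Key=f"{prefix}/{value}.csv", Body=body)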

Data Cleaning

  • tasks/cleaning/utils.py: load_raw_csv() reads from raw bucket
  • save_clean_csv(): Writes to clean bucket
  • Both use boto3 for S3 operations
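
A hedged sketch of the read-from-raw / write-to-clean pattern these utilities follow; the real load_raw_csv() and save_clean_csv() signatures may differ:

import io
import boto3
import pandas as pd

s3 = boto3.client("s3")

def read_csv_from_s3(bucket: str, key: str) -> pd.DataFrame:
    # Stream the object body straight into pandas without touching local disk.
    obj = s3.get_object(Bucket=bucket, Key=key)
    return pd.read_csv(io.BytesIO(obj["Body"].read()))

def write_csv_to_s3(df: pd.DataFrame, bucket: str, key: str) -> None:
    # Serialize in memory and upload the bytes directly.
    s3.put_object(Bucket=bucket, Key=key, Body=df.to_csv(index=False).encode("utf-8"))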

URL Buffering

  • src/scraper.py: URLBuffer class manages URL deduplication
  • Buffer key format: {session_id}/{url_hash}
  • Avoids duplicate web requests within a session
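
A hypothetical in-memory sketch of the buffering idea; the real URLBuffer in src/scraper.py may differ, for example by persisting keys to the cache bucket:

import hashlib

class URLBufferSketch:
    def __init__(self, session_id: str) -> None:
        self.session_id = session_id
        self._seen: set[str] = set()

    def _key(self, url: str) -> str:
        # Mirrors the {session_id}/{url_hash} key format described above.
        url_hash = hashlib.md5(url.encode("utf-8")).hexdigest()
        return f"{self.session_id}/{url_hash}"

    def should_fetch(self, url: str) -> bool:
        # True only the first time a URL is seen within this session.
        key = self._key(url)
        if key in self._seen:
            return False
        self._seen.add(key)
        return True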

Troubleshooting

Access Denied Errors

  • Verify AWS credentials in .env.local
  • Check IAM user has S3 permissions
  • Ensure bucket names are correct
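
A quick sanity check for which identity boto3 is actually using and whether it can reach a bucket (bucket name shown as an example):

import boto3

# Print the ARN of the identity the configured credentials resolve to.
print(boto3.client("sts").get_caller_identity()["Arn"])

# head_bucket raises a ClientError (403 or 404) if access is denied or the bucket is missing.
boto3.client("s3").head_bucket(Bucket="blog-data-raw")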

Bucket Not Found

  • Verify buckets exist in AWS console
  • Check bucket names match configuration
  • Ensure correct AWS region

Slow Performance

  • Check network connectivity
  • Monitor S3 request metrics in AWS console
  • Consider enabling S3 Transfer Acceleration
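
If acceleration has been enabled on a bucket, boto3 can be pointed at the accelerated endpoint through botocore's client config; a minimal sketch:

import boto3
from botocore.config import Config

# Route S3 requests through the Transfer Acceleration endpoint
# (the bucket must already have acceleration enabled).
s3 = boto3.client("s3", config=Config(s3={"use_accelerate_endpoint": True}))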

Related Documentation

  • docs/ARCHITECTURE.md - Overall system architecture
  • README.md - Project overview