S3 Setup Guide
Overview
The blog_data pipeline uses AWS S3 buckets for data storage, allowing it to run on Prefect's SaaS platform, where container filesystems are ephemeral.
Architecture
Data Flow:
┌─────────────────────────────────────────────────────────────┐
│ Web Scraping (Extraction) │
│ - Scrapes data from various sources │
│ - Saves to: blog-data-raw S3 bucket │
└─────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────┐
│ Data Cleaning (Transformation) │
│ - Reads from: blog-data-raw S3 bucket │
│ - Cleans and validates data │
│ - Saves to: blog-data-clean S3 bucket │
└─────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────┐
│ Graph Loading (Neo4j) │
│ - Reads from: blog-data-clean S3 bucket │
│ - Loads into Neo4j database │
└─────────────────────────────────────────────────────────────┘
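In code, the three stages chain together as Prefect tasks inside a single flow. The sketch below is a minimal illustration under that assumption; the task and flow names (extract, clean, load_graph, blog_data_pipeline) are hypothetical and not necessarily the names used in this repo.

```python
# Minimal sketch of the three-stage flow (task and flow names are hypothetical).
from prefect import flow, task


@task
def extract():
    # Scrape sources and write raw CSVs to the blog-data-raw bucket.
    return "raw/"


@task
def clean(raw_prefix):
    # Read raw CSVs, validate and clean them, write to blog-data-clean.
    return "clean/"


@task
def load_graph(clean_prefix):
    # Read clean CSVs and load nodes/relationships into Neo4j.
    pass


@flow
def blog_data_pipeline():
    raw_prefix = extract()
    clean_prefix = clean(raw_prefix)
    load_graph(clean_prefix)


if __name__ == "__main__":
    blog_data_pipeline()
```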
S3 Buckets
blog-data-raw
- Purpose: Stores raw extracted data from web sources
- Lifecycle: Data retained for 30 days (configurable)
- Access: Read/write by extraction tasks
- Format: CSV files organized by entity type
blog-data-clean
- Purpose: Stores cleaned, validated data ready for Neo4j
- Lifecycle: Data retained indefinitely
- Access: Read by graph loading tasks
- Format: CSV files with standardized schema
blog-data-cache
- Purpose: Caches web content to avoid duplicate requests
- Lifecycle: Session-based (cleared after extraction)
- Access: Read/write by scraper
- Format: HTML/JSON content with MD5 hash keys
Configuration
Set these environment variables in .env.local:
# AWS Credentials
AWS_ACCESS_KEY_ID=your_access_key
AWS_SECRET_ACCESS_KEY=your_secret_key
AWS_REGION=eu-west-2
# S3 Buckets
AWS_BUCKET_NAME=blog-data-cache
AWS_BLOG_DATA_RAW_BUCKET=blog-data-raw
AWS_BLOG_DATA_CLEAN_BUCKET=blog-data-clean
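These variables are consumed by boto3 clients inside the tasks. A minimal sketch of that wiring, assuming python-dotenv is used to load .env.local; the repo's actual helper code may differ:

```python
# Minimal sketch: build an S3 client from the environment variables above.
# Assumes boto3 and python-dotenv are installed; this is illustrative wiring only.
import os

import boto3
from dotenv import load_dotenv

load_dotenv(".env.local")

s3 = boto3.client(
    "s3",
    aws_access_key_id=os.environ["AWS_ACCESS_KEY_ID"],
    aws_secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"],
    region_name=os.environ.get("AWS_REGION", "eu-west-2"),
)

raw_bucket = os.environ["AWS_BLOG_DATA_RAW_BUCKET"]
clean_bucket = os.environ["AWS_BLOG_DATA_CLEAN_BUCKET"]
cache_bucket = os.environ["AWS_BUCKET_NAME"]
```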
Bucket provisioning
S3 buckets for this pipeline are created and managed by Terraform in the blog_infra repository. This repo does not define or apply S3 infrastructure.
Key Implementation Details
Data Source Extraction
src/data_sources/base.py:
- save_to_csv(): Writes extracted data to S3
- _save_single_csv_s3(): Single-file saves
- _save_split_csv_s3(): Splits files by field value
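For orientation, a hedged sketch of the single-file save path, assuming pandas DataFrames and a plain boto3 put_object; this is illustrative, not the actual body of _save_single_csv_s3():

```python
# Illustrative sketch of a single-file CSV save to the raw bucket
# (not the actual _save_single_csv_s3 implementation).
import io
import os

import boto3
import pandas as pd


def save_single_csv_s3(df: pd.DataFrame, key: str) -> str:
    """Serialize a DataFrame to CSV and upload it to the raw bucket."""
    bucket = os.environ["AWS_BLOG_DATA_RAW_BUCKET"]
    buffer = io.StringIO()
    df.to_csv(buffer, index=False)
    boto3.client("s3").put_object(
        Bucket=bucket,
        Key=key,
        Body=buffer.getvalue().encode("utf-8"),
    )
    return f"s3://{bucket}/{key}"
```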
Data Cleaning
tasks/cleaning/utils.py:
- load_raw_csv(): Reads from the raw bucket
- save_clean_csv(): Writes to the clean bucket
- Both use boto3 for S3 operations
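A similar sketch of the cleaning helpers, again assuming pandas and boto3; the real signatures in tasks/cleaning/utils.py may differ:

```python
# Illustrative read/write helpers for the cleaning step
# (signatures may differ from tasks/cleaning/utils.py).
import io
import os

import boto3
import pandas as pd

s3 = boto3.client("s3")


def load_raw_csv(key: str) -> pd.DataFrame:
    """Download a CSV from the raw bucket into a DataFrame."""
    obj = s3.get_object(Bucket=os.environ["AWS_BLOG_DATA_RAW_BUCKET"], Key=key)
    return pd.read_csv(io.BytesIO(obj["Body"].read()))


def save_clean_csv(df: pd.DataFrame, key: str) -> None:
    """Upload a cleaned DataFrame as CSV to the clean bucket."""
    buffer = io.StringIO()
    df.to_csv(buffer, index=False)
    s3.put_object(
        Bucket=os.environ["AWS_BLOG_DATA_CLEAN_BUCKET"],
        Key=key,
        Body=buffer.getvalue().encode("utf-8"),
    )
```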
URL Buffering
src/scraper.py:
- URLBuffer class manages URL deduplication
- Buffer key format: {session_id}/{url_hash}
- Avoids duplicate web requests within a session
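The pattern is roughly the one sketched below: hash the URL with MD5, key the cache bucket by {session_id}/{url_hash}, and only fetch when no cached object exists. The class internals are illustrative, not the actual src/scraper.py implementation.

```python
# Illustrative version of the URL-buffer pattern: cache fetched content in the
# cache bucket under {session_id}/{url_hash} keys (details differ from scraper.py).
import hashlib
import os

import boto3


class URLBuffer:
    def __init__(self, session_id: str):
        self.session_id = session_id
        self.bucket = os.environ["AWS_BUCKET_NAME"]  # blog-data-cache
        self.s3 = boto3.client("s3")

    def _key(self, url: str) -> str:
        url_hash = hashlib.md5(url.encode("utf-8")).hexdigest()
        return f"{self.session_id}/{url_hash}"

    def get(self, url: str) -> bytes | None:
        """Return cached content for a URL, or None if it has not been fetched."""
        try:
            obj = self.s3.get_object(Bucket=self.bucket, Key=self._key(url))
        except self.s3.exceptions.NoSuchKey:
            return None
        return obj["Body"].read()

    def put(self, url: str, content: bytes) -> None:
        """Cache fetched content so the same URL is not requested again this session."""
        self.s3.put_object(Bucket=self.bucket, Key=self._key(url), Body=content)
```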
Troubleshooting
Access Denied Errors
- Verify AWS credentials in .env.local
- Check that the IAM user has S3 permissions
- Ensure bucket names are correct
Bucket Not Found
- Verify buckets exist in AWS console
- Check bucket names match configuration
- Ensure correct AWS region
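To tell these two failure modes apart quickly, a small diagnostic with boto3's head_bucket distinguishes a 403 (credentials or IAM permissions) from a 404 (missing or misnamed bucket). This is a standalone check, not part of the pipeline:

```python
# Quick diagnostic: distinguish "Access Denied" from "Bucket Not Found".
import os

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3", region_name=os.environ.get("AWS_REGION", "eu-west-2"))

for bucket in (
    os.environ.get("AWS_BLOG_DATA_RAW_BUCKET", "blog-data-raw"),
    os.environ.get("AWS_BLOG_DATA_CLEAN_BUCKET", "blog-data-clean"),
    os.environ.get("AWS_BUCKET_NAME", "blog-data-cache"),
):
    try:
        s3.head_bucket(Bucket=bucket)
        print(f"{bucket}: OK")
    except ClientError as err:
        code = err.response["Error"]["Code"]
        if code == "403":
            print(f"{bucket}: access denied - check credentials and IAM permissions")
        elif code == "404":
            print(f"{bucket}: not found - check the bucket name and region")
        else:
            print(f"{bucket}: error {code}")
```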
Slow Performance
- Check network connectivity
- Monitor S3 request metrics in AWS console
- Consider enabling S3 Transfer Acceleration
Related Documentation
- docs/ARCHITECTURE.md - Overall system architecture
- README.md - Project overview