Image Processing Guide

Last Updated: 2025-10-30
Status: Production
Related Diagram: terraform/diagram_7_image_processing.png

Overview

This document describes how product images are processed in the blog-data pipeline, from scraping vendor websites through Cloudinary CDN optimization to display on the blog frontend.

Architecture

Components

Web Scraper (Prefect Worker - ECS)
- Extracts product data from vendor websites
- Downloads product images from vendor servers
- Orchestrates the upload process
Cloudinary CDN
- Receives uploaded images
- Automatically optimizes and transforms images
- Delivers images globally via CDN
AWS Secrets Manager
- Stores Cloudinary API credentials securely
- Provides credentials to ECS tasks via IAM roles
S3 Raw Bucket
- Stores CSV files with Cloudinary URLs
- Acts as source of truth for product data
Vercel Frontend
- Reads CSV data from S3
- Loads optimized images from Cloudinary CDN
- Displays product pages to blog visitors

Data Flow

Vendor Website → Web Scraper → Image Download → Cloudinary Upload
                                                      ↓
                                                Auto Transform
                                                      ↓
                                                Global CDN ← Vercel ← Blog Visitor
                                                      ↓
                                                Cloudinary URL
                                                      ↓
                                                S3 Raw (CSV)

Image Processing Pipeline

Step 1: Image Discovery

Location: src/data_sources/{vendor}.py

During web scraping, the scraper identifies product images:

# Example from Estes scraper
image_tag = product_soup.select_one('img.product-image')
image_url = image_tag['src'] if image_tag else None  # None when the page has no product image

Output: Image URL from vendor website
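Vendor pages often give the src attribute as a relative or protocol-relative path, so the discovered value may need normalizing before it can be downloaded. The following is a minimal sketch using only the standard library; the helper name resolve_image_url is hypothetical and not part of the scraper code.

```python
from typing import Optional
from urllib.parse import urljoin, urlparse


def resolve_image_url(page_url: str, src: Optional[str]) -> Optional[str]:
    """Return an absolute http(s) image URL, or None if src is unusable.

    Hypothetical helper for illustration; not taken from the scraper.
    """
    src = (src or "").strip()
    if not src or src.startswith("data:"):
        return None  # Missing attribute or inline data URI: nothing to download
    absolute = urljoin(page_url, src)  # Handles relative and //host/path forms
    if urlparse(absolute).scheme not in ("http", "https"):
        return None
    return absolute
```

Returning None instead of raising lets the caller treat "no usable image" the same way as a product page without an image tag.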

Step 2: Image Download

Location: src/cloudinary_uploader.py

The scraper downloads the image from the vendor:

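import requests

response = requests.get(image_url, timeout=30)
response.raise_for_status()  # Fail on 4xx/5xx instead of treating an error page as image data
image_data = response.content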

Output: Raw image bytes (typically JPEG or PNG)
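Because the output is expected to be JPEG or PNG bytes, a cheap sanity check before upload is to look at the file signature. This sketch is an assumption about how such a check could look, not something the uploader is known to do; detect_image_format is a hypothetical name.

```python
from typing import Optional

# Leading signature bytes for the two formats the guide expects.
_SIGNATURES = {
    b"\xff\xd8\xff": "jpeg",
    b"\x89PNG\r\n\x1a\n": "png",
}


def detect_image_format(image_data: bytes) -> Optional[str]:
    """Return "jpeg" or "png" based on the file signature, else None."""
    for signature, name in _SIGNATURES.items():
        if image_data.startswith(signature):
            return name
    return None
```

A None result usually means the vendor returned something other than an image, such as an HTML error page served with a 200 status.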

Step 3: Credential Retrieval

Location: src/cloudinary_uploader.py → get_cloudinary_credentials_from_secrets_manager()

Fetches Cloudinary credentials from AWS Secrets Manager:

import json
import boto3

secret_name = "blog-data/cloudinary/credentials"
client = boto3.client('secretsmanager', region_name='eu-west-2')
secret = client.get_secret_value(SecretId=secret_name)
credentials = json.loads(secret['SecretString'])  # SecretString holds the JSON payload

Security:

ECS task role has secretsmanager:GetSecretValue permission (the boto3 call runs inside the container, so it uses the task role rather than the task execution role)
Credentials never stored in environment variables or code
All access logged in CloudTrail

Output: Cloudinary credentials (cloud_name, api_key, api_secret)
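Since the secret is a JSON string with three expected fields, it can help to validate it before configuring the Cloudinary client. The sketch below assumes the secret structure documented in this guide; parse_cloudinary_secret is a hypothetical helper, not the body of get_cloudinary_credentials_from_secrets_manager().

```python
import json
from typing import Dict

REQUIRED_KEYS = ("cloud_name", "api_key", "api_secret")


def parse_cloudinary_secret(secret_string: str) -> Dict[str, str]:
    """Parse a SecretString payload and check that all expected keys are set.

    Hypothetical helper; raises ValueError rather than returning partial data.
    """
    try:
        credentials = json.loads(secret_string)
    except json.JSONDecodeError as exc:
        raise ValueError("Secret is not valid JSON") from exc
    if not isinstance(credentials, dict):
        raise ValueError("Secret must be a JSON object")
    missing = [key for key in REQUIRED_KEYS if not credentials.get(key)]
    if missing:
        raise ValueError(f"Secret is missing keys: {', '.join(missing)}")
    return {key: credentials[key] for key in REQUIRED_KEYS}
```

The error message names only the missing keys, never their values, so it is safe to log.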

Step 4: Image Upload

Location: src/cloudinary_uploader.py → CloudinaryUploader.upload_image()

Uploads image to Cloudinary with metadata:

import cloudinary.uploader

result = cloudinary.uploader.upload(
    image_data,
    folder=f"kits/{manufacturer}",
    public_id=product_id,
    resource_type="image",
    overwrite=True,
    invalidate=True
)

Parameters:

folder: Organized by manufacturer (e.g., kits/estes, kits/loc)
public_id: Product identifier for consistent URLs
overwrite=True: Replace existing images
invalidate=True: Clear CDN cache for updates

Output: Cloudinary response with URL and metadata
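The parameters above can be collected in one place so that every vendor scraper passes the same options. This is a sketch under assumptions: build_upload_options is a hypothetical helper, and the slug rule (lowercase, runs of other characters replaced with "-") is an illustration rather than the pipeline's documented behavior.

```python
import re
from typing import Any, Dict


def build_upload_options(manufacturer: str, product_id: str) -> Dict[str, Any]:
    """Build keyword arguments matching the upload call described above.

    Hypothetical helper; the slug rule is an assumption for illustration.
    """
    slug = re.sub(r"[^a-z0-9]+", "-", manufacturer.lower()).strip("-")
    if not slug or not product_id:
        raise ValueError("manufacturer and product_id are required")
    return {
        "folder": f"kits/{slug}",      # e.g. kits/estes
        "public_id": str(product_id),  # stable ID so re-uploads replace the asset
        "resource_type": "image",
        "overwrite": True,
        "invalidate": True,
    }
```

The result would be passed as cloudinary.uploader.upload(image_data, **build_upload_options("Estes", "1234")).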

Step 5: Automatic Optimization

Performed by: Cloudinary CDN (automatic)

Cloudinary automatically applies transformations:

Format Conversion
- Converts to WebP for modern browsers
- Uses AVIF where the browser supports it, for maximum compression
- Serves the original format as a fallback
Quality Optimization
- Automatically adjusts quality based on content
- Reduces file size while maintaining visual quality
- Typically 40-80% smaller than original
Responsive Sizing
- Generates multiple sizes for different devices
- Serves appropriate size based on device/viewport
- Reduces bandwidth for mobile users

Example Transformations:

Original: https://res.cloudinary.com/ronaldhatcher/image/upload/kits/estes/1234.jpg
WebP: https://res.cloudinary.com/ronaldhatcher/image/upload/f_webp/kits/estes/1234.jpg
Thumbnail: https://res.cloudinary.com/ronaldhatcher/image/upload/w_300,h_300,c_fill/kits/estes/1234.jpg
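As the example URLs show, the transformation string sits between /image/upload/ and the asset path. A minimal sketch of building such URLs follows; with_transformation is a hypothetical helper and assumes a standard delivery URL that does not already carry a transformation segment.

```python
def with_transformation(url: str, transformation: str) -> str:
    """Insert a transformation segment right after "/image/upload/".

    Hypothetical helper; assumes the URL has no transformation segment yet.
    """
    marker = "/image/upload/"
    head, sep, tail = url.partition(marker)
    if not sep:
        raise ValueError(f"Not a Cloudinary image upload URL: {url}")
    return f"{head}{marker}{transformation}/{tail}"
```

For example, passing the original URL above with "w_300,h_300,c_fill" reproduces the thumbnail URL.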

Step 6: URL Storage

Location: src/data_sources/{vendor}.py

The image metadata is stored in the CSV:

df['image'] = primary_image_public_id  # Primary image public_id
df['images'] = uploaded_images  # Full list of image metadata (Cloudinary or S3)

CSV Structure:

product_id,name,manufacturer,image,images
1234,Alpha III,Estes,kits/estes/1234/primary,[{"public_id":"kits/estes/1234/primary","storage":"cloudinary",...}]

Note: The images field contains a JSON array with metadata for all uploaded images, including storage location (Cloudinary or S3), URLs, and other details.

Upload to S3:

s3_client.put_object(
    Bucket='blog-data-raw',
    Key=f'kits/{manufacturer}_kits.csv',
    Body=df.to_csv(index=False)
)
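Because the images field holds a JSON array inside a CSV cell, the cell has to be quoted on write and decoded on read. The sketch below shows that round trip with the standard library only (pandas to_csv applies the same minimal quoting by default); write_rows and read_rows are hypothetical names, and the column list is taken from the CSV structure above.

```python
import csv
import io
import json
from typing import Any, Dict, List

FIELDNAMES = ["product_id", "name", "manufacturer", "image", "images"]


def write_rows(rows: List[Dict[str, Any]]) -> str:
    """Serialize product rows to CSV text, storing `images` as a JSON string."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES)
    writer.writeheader()
    for row in rows:
        # json.dumps output contains commas and quotes; csv escapes them.
        writer.writerow({**row, "images": json.dumps(row["images"])})
    return buffer.getvalue()


def read_rows(csv_text: str) -> List[Dict[str, Any]]:
    """Parse CSV text back into rows, decoding the `images` JSON column."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        row["images"] = json.loads(row["images"])
        rows.append(row)
    return rows
```

Note that every other column comes back as a string, so product_id is "1234" rather than 1234 after reading.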

Step 7: Frontend Display

Location: Vercel Next.js application (blog_code repository)

The frontend loads images from the appropriate storage (Cloudinary or S3):

// Parse images metadata
const imageData = JSON.parse(kit.images)[0]
const imageUrl =
  imageData.storage === 'cloudinary' ? imageData.url : imageData.s3_url

<Image src={imageUrl} alt={kit.name} width={600} height={600} loading='lazy' />

Benefits:

Automatic format selection (WebP/AVIF)
Responsive sizing based on viewport
Lazy loading for performance
Global CDN delivery (low latency)

Configuration

Cloudinary Settings

Account: ronaldhatcher
Cloud Name: ronaldhatcher
Region: Auto (global CDN)

Folder Structure:

kits/
  ├── estes/
  │   ├── 1234.jpg
  │   ├── 5678.jpg
  │   └── ...
  ├── loc/
  │   ├── 9012.jpg
  │   └── ...
  └── rocketarium/
      ├── 3456.jpg
      └── ...

Credentials Storage

AWS Secrets Manager:

Secret Name: blog-data/cloudinary/credentials
Region: eu-west-2
KMS Encryption: Yes (using blog-data KMS key)
Rotation: Manual (not automated)

Secret Structure:

{
  "cloud_name": "your-cloud-name",
  "api_key": "YOUR_CLOUDINARY_API_KEY",
  "api_secret": "YOUR_CLOUDINARY_API_SECRET"
}

IAM Permissions

ECS Task Role (the role assumed by the container at runtime, which is what boto3 uses):

{
  "Effect": "Allow",
  "Action": "secretsmanager:GetSecretValue",
  "Resource": "arn:aws:secretsmanager:eu-west-2:*:secret:blog-data/cloudinary/credentials-*"
}

Error Handling

Missing Credentials

If Cloudinary credentials are not available:

uploader = create_cloudinary_uploader(logger, metrics, folder_prefix)
if uploader is None:
    logger.warning("Cloudinary credentials not found. Image upload disabled.")
    # Continue without images - CSV will not have images column

Result: Scraping continues, but images are not uploaded. CSV will be missing images column.

Upload Failures

If image upload fails:

try:
    uploaded_images = image_handler.upload_image(image_url, manufacturer, sku, image_type)
except Exception as e:
    logger.error(f"Failed to upload image for {sku}: {e}")
    uploaded_images = None  # Store None in CSV

Result: CSV will have None for that product's cloudinary_images field.

Missing Images in CSV

The cleaning pipeline handles missing image columns gracefully:

# In tasks/cleaning/clean_kits.py
"imageSrc": df_vendor.get("cloudinary_images"),  # Returns None if column missing

Result: Clean data has null image URLs, which the frontend handles gracefully.
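A value that was None in the scraper can come back from the CSV as an empty cell, a NaN, or the literal text "None", depending on how the file was written and read. The sketch below collapses those cases into a single None; normalize_image_value is a hypothetical helper and is not taken from clean_kits.py.

```python
import math
from typing import Any, Optional


def normalize_image_value(value: Any) -> Optional[str]:
    """Map the various "no image" encodings to None; pass real values through."""
    if value is None:
        return None
    if isinstance(value, float) and math.isnan(value):
        return None  # pandas reads empty CSV cells as NaN
    text = str(value).strip()
    if text == "" or text.lower() in ("none", "null", "nan"):
        return None
    return text
```

Normalizing once in the cleaning step means the frontend only has to check for null.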

Monitoring

Metrics Tracked

In ExtractionMetrics:

images_uploaded: Count of successfully uploaded images
images_failed: Count of failed uploads
upload_duration: Time spent uploading images

Logging:

INFO: Uploading image for product 1234 to Cloudinary
INFO: Successfully uploaded image: https://res.cloudinary.com/.../1234.jpg
INFO: Uploaded 45 images in 12.3 seconds
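The three counters and the summary log line can be pictured with a small dataclass. This is only a sketch: ImageUploadMetrics is a hypothetical stand-in for the image-related fields of ExtractionMetrics, whose real implementation is not shown in this guide.

```python
from dataclasses import dataclass


@dataclass
class ImageUploadMetrics:
    """Hypothetical stand-in for the image fields of ExtractionMetrics."""

    images_uploaded: int = 0
    images_failed: int = 0
    upload_duration: float = 0.0  # total seconds spent in upload attempts

    def record(self, succeeded: bool, seconds: float) -> None:
        """Count one upload attempt and add its elapsed time."""
        if succeeded:
            self.images_uploaded += 1
        else:
            self.images_failed += 1
        self.upload_duration += seconds

    def summary(self) -> str:
        """Format the end-of-run line in the style of the log sample above."""
        return (
            f"Uploaded {self.images_uploaded} images "
            f"in {self.upload_duration:.1f} seconds"
        )
```

Failed attempts still add to upload_duration here, which is a design choice: the duration then reflects time spent, not only time spent successfully.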

Cloudinary Dashboard

Monitor usage at: https://cloudinary.com/console

Key Metrics:

Storage used
Bandwidth consumed
Transformations performed
API calls made

Troubleshooting

Images Not Uploading

Symptom: CSV missing cloudinary_images column

Possible Causes:

Cloudinary credentials not in Secrets Manager
ECS task role lacks secretsmanager:GetSecretValue permission
Network connectivity issues to Cloudinary API

Solution:

# Verify secret exists
aws secretsmanager get-secret-value \
  --secret-id blog-data/cloudinary/credentials \
  --region eu-west-2

# Check ECS task logs
aws logs tail /ecs/prod/prefect-worker --follow

Images Not Displaying on Frontend

Symptom: Broken images on blog

Possible Causes:

Cloudinary URLs in CSV are incorrect
Cloudinary account suspended/over quota
CORS issues (unlikely with Cloudinary)

Solution:

# Test Cloudinary URL directly
curl -I https://res.cloudinary.com/ronaldhatcher/image/upload/kits/estes/1234.jpg

# Check Cloudinary dashboard for quota/issues

Slow Image Loading

Symptom: Images take a long time to load on the frontend

Possible Causes:

Not using Cloudinary transformations (loading full-size images)
Not using WebP/AVIF format
CDN cache not warmed up

Solution:

Ensure frontend uses Cloudinary transformation URLs
Use f_auto for automatic format selection
Use q_auto for automatic quality optimization

Best Practices

1. Always Use Transformations

// Good - inserts transformations after /upload/ in the delivery URL
<Image src={cloudinaryUrl.replace('/upload/', '/upload/f_auto,q_auto,w_600/')} />

// Bad - uses original
<Image src={cloudinaryUrl} />

2. Organize by Manufacturer

Keep folder structure consistent:

kits/{manufacturer}/{product_id}.{ext}

3. Use Consistent Public IDs

Use product ID as public_id for predictable URLs:

public_id = f"{product_id}"  # Not random strings

4. Handle Missing Images Gracefully

Always check for None/null:

cloudinary_url = df.get("cloudinary_images")  # None if the column is missing
if cloudinary_url is not None:
    ...  # Use Cloudinary
else:
    ...  # Use fallback or placeholder

5. Monitor Quota

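Cloudinary free tier limits (the free plan is credit-based, so these ceilings are shared across resources rather than separate allowances):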

25 GB storage
25 GB bandwidth/month
25,000 transformations/month

Monitor usage and upgrade if needed.

Related Documentation

CLOUDINARY_SETUP.md - Initial setup and configuration
CLOUDINARY_CONFIGURATION.md - Detailed configuration reference
Architecture Diagram 7 - Visual flow diagram
Cloudinary Documentation - Official docs

Future Enhancements

Planned

Automatic image resizing before upload (reduce upload time)
Lazy loading for all images (improve page load)
Progressive image loading (show low-res first)

Under Consideration

Video support for product demos
3D model support for rocket kits
AI-powered image tagging
Automatic background removal
