Image Processing Guide

Last Updated: 2025-10-30
Status: Production
Related Diagram: terraform/diagram_7_image_processing.png

Overview

This document describes how product images are processed in the blog-data pipeline, from scraping vendor websites through Cloudinary CDN optimization to display on the blog frontend.

Architecture

Components

  1. Web Scraper (Prefect Worker - ECS)

    • Extracts product data from vendor websites
    • Downloads product images from vendor servers
    • Orchestrates the upload process
  2. Cloudinary CDN

    • Receives uploaded images
    • Automatically optimizes and transforms images
    • Delivers images globally via CDN
  3. AWS Secrets Manager

    • Stores Cloudinary API credentials securely
    • Provides credentials to ECS tasks via IAM roles
  4. S3 Raw Bucket

    • Stores CSV files with Cloudinary URLs
    • Acts as source of truth for product data
  5. Vercel Frontend

    • Reads CSV data from S3
    • Loads optimized images from Cloudinary CDN
    • Displays product pages to blog visitors

Data Flow

Vendor Website → Web Scraper → Image Download → Cloudinary Upload
                                                      ↓
                                                Auto Transform
                                                      ↓
                                                Global CDN ← Vercel ← Blog Visitor
                                                      ↓
                                                Cloudinary URL
                                                      ↓
                                                S3 Raw (CSV)

Image Processing Pipeline

Step 1: Image Discovery

Location: src/data_sources/{vendor}.py

During web scraping, the scraper identifies product images:

# Example from Estes scraper
image_url = product_soup.select_one('img.product-image')['src']

Output: Image URL from vendor website
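Vendor pages often give the image src as a relative or protocol-relative path, and some embed data: URIs that cannot be downloaded. The following is a minimal sketch of normalizing the extracted value before download. The helper name resolve_image_url and the vendor.example host are illustrative assumptions, not part of the scraper code.

```python
from typing import Optional
from urllib.parse import urljoin, urlparse


def resolve_image_url(page_url: str, src: Optional[str]) -> Optional[str]:
    """Return an absolute http(s) image URL, or None if src is unusable.

    Hypothetical helper: the name and behavior are illustrative assumptions.
    """
    if not src or not src.strip():
        return None
    # urljoin handles absolute, root-relative, relative and //host paths.
    absolute = urljoin(page_url, src.strip())
    if urlparse(absolute).scheme not in ("http", "https"):
        return None  # e.g. data: URIs are skipped
    return absolute


page = "https://vendor.example/products/alpha"
print(resolve_image_url(page, "/img/alpha.jpg"))
print(resolve_image_url(page, "//cdn.vendor.example/alpha.png"))
```

Returning None for unusable values lets the caller skip the product image without raising inside the scraping loop.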

Step 2: Image Download

Location: src/cloudinary_uploader.py

The scraper downloads the image from the vendor:

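import requests
response = requests.get(image_url, timeout=30)
response.raise_for_status()  # Fail fast on 4xx/5xx instead of uploading an error page
image_data = response.content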

Output: Raw image bytes (typically JPEG or PNG)
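A successful HTTP response does not guarantee that the body is an image, since some servers return an HTML placeholder with status 200. One way to check before uploading is to inspect the leading magic bytes. This is a sketch only; detect_image_format is a hypothetical helper and the set of recognized formats is an assumption.

```python
from typing import Optional


def detect_image_format(data: bytes) -> Optional[str]:
    """Guess the image format from its leading magic bytes."""
    if data.startswith(b"\xff\xd8\xff"):
        return "jpeg"
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "png"
    if data[:6] in (b"GIF87a", b"GIF89a"):
        return "gif"
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return "webp"
    return None  # unknown, or not an image (e.g. an HTML error page)
```

A None result could be treated the same way as a failed download.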

Step 3: Credential Retrieval

Location: src/cloudinary_uploader.py → get_cloudinary_credentials_from_secrets_manager()

Fetches Cloudinary credentials from AWS Secrets Manager:

import json
import boto3

secret_name = "blog-data/cloudinary/credentials"
client = boto3.client('secretsmanager', region_name='eu-west-2')
secret = client.get_secret_value(SecretId=secret_name)
credentials = json.loads(secret['SecretString'])

Security:

  • ECS task role has secretsmanager:GetSecretValue permission (boto3 calls made inside the container use the task role, not the task execution role)
  • Credentials are never stored in environment variables or code
  • All access is logged in CloudTrail

Output: Cloudinary credentials (cloud_name, api_key, api_secret)
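Because the secret is free-form JSON, a missing or misspelled key only surfaces later as an authentication error. The sketch below validates the parsed SecretString up front. The function name parse_cloudinary_secret is an assumption; the three key names are the ones listed in the output above.

```python
import json
from typing import Dict

REQUIRED_KEYS = ("cloud_name", "api_key", "api_secret")


def parse_cloudinary_secret(secret_string: str) -> Dict[str, str]:
    """Parse a SecretString and check that all expected keys are present."""
    credentials = json.loads(secret_string)
    if not isinstance(credentials, dict):
        raise ValueError("Secret must be a JSON object")
    missing = [key for key in REQUIRED_KEYS if not credentials.get(key)]
    if missing:
        # Report key names only; never log the secret values themselves.
        raise ValueError(f"Cloudinary secret is missing keys: {', '.join(missing)}")
    return {key: credentials[key] for key in REQUIRED_KEYS}
```

The error message names the missing keys but never echoes their values, so it is safe to send to the task logs.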

Step 4: Image Upload

Location: src/cloudinary_uploader.py → CloudinaryUploader.upload_image()

Uploads image to Cloudinary with metadata:

import cloudinary.uploader

result = cloudinary.uploader.upload(
    image_data,
    folder=f"kits/{manufacturer}",
    public_id=product_id,
    resource_type="image",
    overwrite=True,
    invalidate=True
)

Parameters:

  • folder: Organized by manufacturer (e.g., kits/estes, kits/loc)
  • public_id: Product identifier for consistent URLs
  • overwrite=True: Replace existing images
  • invalidate=True: Clear CDN cache for updates

Output: Cloudinary response with URL and metadata
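Predictable URLs depend on the folder and public_id being derived the same way on every run. A minimal sketch of building the keyword arguments for the upload call shown above follows. The helper build_upload_options and its slug rule (lowercase, hyphen-separated) are assumptions and may differ from what the uploader actually does.

```python
import re
from typing import Any, Dict


def build_upload_options(manufacturer: str, product_id: str) -> Dict[str, Any]:
    """Build keyword arguments matching the upload call shown above."""
    # Lowercase and collapse runs of non-alphanumerics so "LOC Precision"
    # becomes "loc-precision" (this slug rule is an assumption).
    slug = re.sub(r"[^a-z0-9]+", "-", manufacturer.lower()).strip("-")
    public_id = str(product_id).strip()
    if not slug or not public_id:
        raise ValueError("manufacturer and product_id must be non-empty")
    return {
        "folder": f"kits/{slug}",
        "public_id": public_id,
        "resource_type": "image",
        "overwrite": True,
        "invalidate": True,
    }
```

The result can be unpacked into the call, for example cloudinary.uploader.upload(image_data, **build_upload_options("Estes", "1234")).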

Step 5: Automatic Optimization

Performed by: Cloudinary CDN (automatic)

Cloudinary applies the following transformations on the fly when the delivery URL requests them (for example, with f_auto and q_auto):

  1. Format Conversion

    • Converts to WebP for modern browsers
    • Uses AVIF where the browser supports it, for maximum compression
    • Maintains original format as fallback
  2. Quality Optimization

    • Automatically adjusts quality based on content
    • Reduces file size while maintaining visual quality
    • Typically 40-80% smaller than original
  3. Responsive Sizing

    • Generates multiple sizes for different devices
    • Serves appropriate size based on device/viewport
    • Reduces bandwidth for mobile users

Example Transformations:

  • Original: https://res.cloudinary.com/ronaldhatcher/image/upload/kits/estes/1234.jpg
  • WebP: https://res.cloudinary.com/ronaldhatcher/image/upload/f_webp/kits/estes/1234.jpg
  • Thumbnail: https://res.cloudinary.com/ronaldhatcher/image/upload/w_300,h_300,c_fill/kits/estes/1234.jpg
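The example URLs above share one pattern: the transformation parameters sit between /upload/ and the public ID, joined by commas. A small sketch of building such URLs is shown here. The function name transformed_url is hypothetical; the cloud name and paths are taken from the examples above.

```python
from typing import Sequence


def transformed_url(cloud_name: str, public_id: str,
                    transformations: Sequence[str] = ()) -> str:
    """Build a delivery URL with optional transformation parameters."""
    base = f"https://res.cloudinary.com/{cloud_name}/image/upload"
    if transformations:
        # Parameters of a single transformation are comma-separated.
        return f"{base}/{','.join(transformations)}/{public_id}"
    return f"{base}/{public_id}"


print(transformed_url("ronaldhatcher", "kits/estes/1234.jpg",
                      ["w_300", "h_300", "c_fill"]))
```

Passing ["f_auto", "q_auto"] instead requests automatic format and quality selection rather than a fixed format such as f_webp.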

Step 6: URL Storage

Location: src/data_sources/{vendor}.py

The image metadata is stored in the CSV:

df['image'] = primary_image_public_id  # Primary image public_id
df['images'] = uploaded_images  # Full list of image metadata (Cloudinary or S3)

CSV Structure:

product_id,name,manufacturer,image,images
1234,Alpha III,Estes,kits/estes/1234/primary,[{"public_id":"kits/estes/1234/primary","storage":"cloudinary",...}]

Note: The images field contains a JSON array with metadata for all uploaded images, including storage location (Cloudinary or S3), URLs, and other details.
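Storing a JSON array inside a CSV cell works only if the cell is quoted correctly, because the JSON itself contains commas and double quotes. The sketch below uses the standard csv module to show the round trip. Only the two metadata keys visible in the example row are included, and the variable names are illustrative. pandas' to_csv applies the same minimal quoting by default.

```python
import csv
import io
import json

images = [{"public_id": "kits/estes/1234/primary", "storage": "cloudinary"}]
row = {
    "product_id": "1234",
    "name": "Alpha III",
    "manufacturer": "Estes",
    "image": images[0]["public_id"],
    "images": json.dumps(images),  # JSON array stored as one string cell
}

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(row))
writer.writeheader()
writer.writerow(row)  # the csv module quotes the cell and doubles inner quotes

# Reading the row back and decoding the cell recovers the original list.
parsed = next(csv.DictReader(io.StringIO(buffer.getvalue())))
restored = json.loads(parsed["images"])
print(restored == images)
```

Any consumer of the file therefore needs a real CSV parser; splitting lines on commas would break the images cell apart.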

Upload to S3:

import boto3

s3_client = boto3.client('s3')
s3_client.put_object(
    Bucket='blog-data-raw',
    Key=f'kits/{manufacturer}_kits.csv',
    Body=df.to_csv(index=False).encode('utf-8')
)

Step 7: Frontend Display

Location: Vercel Next.js application (blog_code repository)

The frontend loads images from the appropriate storage (Cloudinary or S3):

// Parse images metadata
const imageData = JSON.parse(kit.images)[0]
const imageUrl =
  imageData.storage === 'cloudinary' ? imageData.url : imageData.s3_url

<Image src={imageUrl} alt={kit.name} width={600} height={600} loading='lazy' />

Benefits:

  • Automatic format selection (WebP/AVIF)
  • Responsive sizing based on viewport
  • Lazy loading for performance
  • Global CDN delivery (low latency)

Configuration

Cloudinary Settings

Account: ronaldhatcher
Cloud Name: ronaldhatcher
Region: Auto (global CDN)

Folder Structure:

kits/
  ├── estes/
  │   ├── 1234.jpg
  │   ├── 5678.jpg
  │   └── ...
  ├── loc/
  │   ├── 9012.jpg
  │   └── ...
  └── rocketarium/
      ├── 3456.jpg
      └── ...

Credentials Storage

AWS Secrets Manager:

  • Secret Name: blog-data/cloudinary/credentials
  • Region: eu-west-2
  • KMS Encryption: Yes (using blog-data KMS key)
  • Rotation: Manual (not automated)

Secret Structure:

{
  "cloud_name": "your-cloud-name",
  "api_key": "YOUR_CLOUDINARY_API_KEY",
  "api_secret": "YOUR_CLOUDINARY_API_SECRET"
}

IAM Permissions

ECS Task Role:

{
  "Effect": "Allow",
  "Action": "secretsmanager:GetSecretValue",
  "Resource": "arn:aws:secretsmanager:eu-west-2:*:secret:blog-data/cloudinary/credentials-*"
}

Error Handling

Missing Credentials

If Cloudinary credentials are not available:

uploader = create_cloudinary_uploader(logger, metrics, folder_prefix)
if uploader is None:
    logger.warning("Cloudinary credentials not found. Image upload disabled.")
    # Continue without images - CSV will not have images column

Result: Scraping continues, but no images are uploaded and the CSV will not contain the images column.

Upload Failures

If image upload fails:

try:
    uploaded_images = image_handler.upload_image(image_url, manufacturer, sku, image_type)
except Exception as e:
    logger.error(f"Failed to upload image for {sku}: {e}")
    uploaded_images = None  # Store None in CSV

Result: CSV will have None for that product's cloudinary_images field.

Missing Images in CSV

The cleaning pipeline handles missing image columns gracefully:

# In tasks/cleaning/clean_kits.py
"imageSrc": df_vendor.get("cloudinary_images"),  # Returns None if column missing

Result: Clean data will have null image URLs, which the frontend handles gracefully.
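A cell read back from the CSV can be absent, None, NaN, an empty string, or a JSON string, so a cleaning step may want a single function that maps all of these to either a URL or None. The sketch below is one way to do that. first_image_url is a hypothetical helper, and the url and s3_url keys are taken from the frontend snippet in Step 7.

```python
import json
import math
from typing import Any, Optional


def first_image_url(images_cell: Any) -> Optional[str]:
    """Return the first usable image URL from an images cell, else None."""
    # Missing column, None, or an empty string.
    if images_cell is None or images_cell == "":
        return None
    # NaN is how pandas represents an empty cell.
    if isinstance(images_cell, float) and math.isnan(images_cell):
        return None
    try:
        images = json.loads(images_cell) if isinstance(images_cell, str) else images_cell
    except json.JSONDecodeError:
        return None
    if not isinstance(images, list) or not images or not isinstance(images[0], dict):
        return None
    first = images[0]
    # Same rule as the frontend: Cloudinary entries use url, S3 entries s3_url.
    key = "url" if first.get("storage") == "cloudinary" else "s3_url"
    return first.get(key)
```

The literal string "None", which appears when a Python None is written to CSV as text, fails JSON decoding and is treated as missing.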

Monitoring

Metrics Tracked

In ExtractionMetrics:

  • images_uploaded: Count of successfully uploaded images
  • images_failed: Count of failed uploads
  • upload_duration: Time spent uploading images
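To illustrate how these three counters relate to the upload loop, here is a minimal sketch. UploadStats is a hypothetical stand-in for the image fields of ExtractionMetrics, and upload_all with its upload callable is an assumption, not the pipeline's actual code.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class UploadStats:
    """Hypothetical stand-in for the image counters in ExtractionMetrics."""
    images_uploaded: int = 0
    images_failed: int = 0
    upload_duration: float = 0.0  # seconds


def upload_all(image_urls: Iterable[str],
               upload: Callable[[str], object]) -> UploadStats:
    """Upload each image, counting successes and failures."""
    stats = UploadStats()
    started = time.perf_counter()
    for url in image_urls:
        try:
            upload(url)
            stats.images_uploaded += 1
        except Exception:  # one bad image should not stop the batch
            stats.images_failed += 1
    stats.upload_duration = time.perf_counter() - started
    return stats
```

Timing the whole batch rather than each call keeps the duration comparable with the "Uploaded N images in X seconds" log line.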

Logging:

INFO: Uploading image for product 1234 to Cloudinary
INFO: Successfully uploaded image: https://res.cloudinary.com/.../1234.jpg
INFO: Uploaded 45 images in 12.3 seconds

Cloudinary Dashboard

Monitor usage at: https://cloudinary.com/console

Key Metrics:

  • Storage used
  • Bandwidth consumed
  • Transformations performed
  • API calls made

Troubleshooting

Images Not Uploading

Symptom: CSV missing cloudinary_images column

Possible Causes:

  1. Cloudinary credentials not in Secrets Manager
  2. ECS task role lacks secretsmanager:GetSecretValue permission
  3. Network connectivity issues to Cloudinary API

Solution:

# Verify secret exists
aws secretsmanager get-secret-value \
  --secret-id blog-data/cloudinary/credentials \
  --region eu-west-2

# Check ECS task logs
aws logs tail /ecs/prod/prefect-worker --follow

Images Not Displaying on Frontend

Symptom: Broken images on blog

Possible Causes:

  1. Cloudinary URLs in CSV are incorrect
  2. Cloudinary account suspended/over quota
  3. CORS issues (unlikely with Cloudinary)

Solution:

# Test Cloudinary URL directly
curl -I https://res.cloudinary.com/ronaldhatcher/image/upload/kits/estes/1234.jpg

# Check Cloudinary dashboard for quota/issues

Slow Image Loading

Symptom: Images take a long time to load on the frontend

Possible Causes:

  1. Not using Cloudinary transformations (loading full-size images)
  2. Not using WebP/AVIF format
  3. CDN cache not warmed up

Solution:

  • Ensure frontend uses Cloudinary transformation URLs
  • Use f_auto for automatic format selection
  • Use q_auto for automatic quality optimization

Best Practices

1. Always Use Transformations

// Good - transformations go between /upload/ and the public ID
<Image src={cloudinaryUrl.replace('/upload/', '/upload/f_auto,q_auto,w_600/')} />

// Bad - uses original
<Image src={cloudinaryUrl} />

2. Organize by Manufacturer

Keep folder structure consistent:

kits/{manufacturer}/{product_id}.{ext}

3. Use Consistent Public IDs

Use product ID as public_id for predictable URLs:

public_id = str(product_id)  # Not random strings

4. Handle Missing Images Gracefully

Always check for None/null:

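# `product` is a single record (e.g. a dict), not the whole DataFrame
cloudinary_images = product.get("cloudinary_images")
if cloudinary_images:
    image_source = cloudinary_images  # Use Cloudinary
else:
    image_source = None  # Use fallback or placeholder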

5. Monitor Quota

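Cloudinary free tier limits (metered as a shared pool of monthly credits, so each figure below is the ceiling if the whole allowance goes to that one category):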

  • 25 GB storage
  • 25 GB bandwidth/month
  • 25,000 transformations/month

Monitor usage and upgrade if needed.
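A simple threshold check can turn the limits listed above into an early warning. The following is a sketch under stated assumptions: quota_warnings, the metric names, the 80% threshold, and the sample usage numbers are all made up for illustration. Only the limit values come from this guide.

```python
from typing import Dict, List


def quota_warnings(usage: Dict[str, float], limits: Dict[str, float],
                   threshold: float = 0.8) -> List[str]:
    """Return a message for each metric at or above threshold of its limit."""
    warnings = []
    for metric, limit in limits.items():
        used = usage.get(metric, 0.0)
        if limit > 0 and used / limit >= threshold:
            warnings.append(f"{metric}: {used:g} of {limit:g} ({used / limit:.0%})")
    return warnings


# Limits from the list above; usage values are invented sample data.
limits = {"storage_gb": 25, "bandwidth_gb": 25, "transformations": 25_000}
usage = {"storage_gb": 4.2, "bandwidth_gb": 21.0, "transformations": 9_000}
print(quota_warnings(usage, limits))
```

The usage figures can come from whatever source is convenient, for example values read off the Cloudinary dashboard.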

Future Enhancements

Planned

  • Automatic image resizing before upload (reduce upload time)
  • Lazy loading for all images (improve page load)
  • Progressive image loading (show low-res first)

Under Consideration

  • Video support for product demos
  • 3D model support for rocket kits
  • AI-powered image tagging
  • Automatic background removal