Image Processing Guide
Last Updated: 2025-10-30
Status: Production
Related Diagram: terraform/diagram_7_image_processing.png
Overview
This document describes how product images are processed in the blog-data pipeline, from scraping vendor websites through Cloudinary CDN optimization to display on the blog frontend.
Architecture
Components
- Web Scraper (Prefect Worker - ECS)
  - Extracts product data from vendor websites
  - Downloads product images from vendor servers
  - Orchestrates the upload process
- Cloudinary CDN
  - Receives uploaded images
  - Automatically optimizes and transforms images
  - Delivers images globally via CDN
- AWS Secrets Manager
  - Stores Cloudinary API credentials securely
  - Provides credentials to ECS tasks via IAM roles
- S3 Raw Bucket
  - Stores CSV files with Cloudinary URLs
  - Acts as source of truth for product data
- Vercel Frontend
  - Reads CSV data from S3
  - Loads optimized images from Cloudinary CDN
  - Displays product pages to blog visitors
Data Flow
Vendor Website → Web Scraper → Image Download → Cloudinary Upload
↓
Auto Transform
↓
Global CDN ← Vercel ← Blog Visitor
↓
Cloudinary URL
↓
S3 Raw (CSV)
Image Processing Pipeline
Step 1: Image Discovery
Location: src/data_sources/{vendor}.py
During web scraping, the scraper identifies product images:
# Example from Estes scraper
image_url = product_soup.select_one('img.product-image')['src']
Output: Image URL from vendor website
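Vendor pages often use relative or protocol-relative `src` values, and some embed `data:` URIs that cannot be downloaded. The sketch below shows one way to normalize a raw `src` before the download step. `resolve_image_url` is a hypothetical helper for illustration, not a function in the scraper.

```python
from typing import Optional
from urllib.parse import urljoin, urlparse


def resolve_image_url(page_url: str, src: Optional[str]) -> Optional[str]:
    """Turn a raw `src` attribute into an absolute http(s) URL, or None.

    Hypothetical helper for illustration; not part of the scraper.
    """
    if not src or not src.strip():
        return None
    # urljoin handles relative paths and protocol-relative URLs (//host/path).
    absolute = urljoin(page_url, src.strip())
    if urlparse(absolute).scheme not in ("http", "https"):
        return None  # e.g. data: URIs cannot be fetched over HTTP
    return absolute
```

Returning None instead of raising lets the caller skip the image and continue scraping, which matches the error handling described later in this guide.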
Step 2: Image Download
Location: src/cloudinary_uploader.py
The scraper downloads the image from the vendor:
import requests

response = requests.get(image_url, timeout=30)
response.raise_for_status()  # fail on 4xx/5xx instead of uploading an error page
image_data = response.content
Output: Raw image bytes (typically JPEG or PNG)
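Because the output is expected to be JPEG or PNG, the bytes can be checked cheaply before upload by looking at the file signature. This is a minimal sketch. `sniff_image_format` is a hypothetical name, and it only recognizes the two formats mentioned above.

```python
from typing import Optional

# Standard file signatures ("magic bytes") for the two expected formats.
_PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
_JPEG_SIGNATURE = b"\xff\xd8\xff"


def sniff_image_format(data: bytes) -> Optional[str]:
    """Return "png" or "jpeg" based on the leading bytes, else None."""
    if data.startswith(_PNG_SIGNATURE):
        return "png"
    if data.startswith(_JPEG_SIGNATURE):
        return "jpeg"
    return None
```

A None result usually means the vendor returned an HTML error page or an unsupported format, so the upload can be skipped and logged.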
Step 3: Credential Retrieval
Location: src/cloudinary_uploader.py → get_cloudinary_credentials_from_secrets_manager()
Fetches Cloudinary credentials from AWS Secrets Manager:
import json
import boto3

secret_name = "blog-data/cloudinary/credentials"
client = boto3.client('secretsmanager', region_name='eu-west-2')
secret = client.get_secret_value(SecretId=secret_name)
credentials = json.loads(secret['SecretString'])
Security:
- ECS task execution role has secretsmanager:GetSecretValue permission
- Credentials never stored in environment variables or code
- All access logged in CloudTrail
Output: Cloudinary credentials (cloud_name, api_key, api_secret)
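A malformed or incomplete secret is easier to diagnose if it is rejected as soon as it is parsed. The sketch below validates the three expected keys. `parse_cloudinary_secret` is a hypothetical helper, and the real `get_cloudinary_credentials_from_secrets_manager()` may handle this differently.

```python
import json
from typing import Dict

REQUIRED_KEYS = ("cloud_name", "api_key", "api_secret")


def parse_cloudinary_secret(secret_string: str) -> Dict[str, str]:
    """Parse the SecretString payload and check the expected keys exist.

    Hypothetical helper for illustration only.
    """
    credentials = json.loads(secret_string)
    if not isinstance(credentials, dict):
        raise ValueError("Cloudinary secret must be a JSON object")
    # Treat empty values the same as missing keys.
    missing = [key for key in REQUIRED_KEYS if not credentials.get(key)]
    if missing:
        raise ValueError(f"Cloudinary secret is missing: {', '.join(missing)}")
    return {key: str(credentials[key]) for key in REQUIRED_KEYS}
```

The error message names only the missing keys and never echoes secret values, so it is safe to log.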
Step 4: Image Upload
Location: src/cloudinary_uploader.py → CloudinaryUploader.upload_image()
Uploads image to Cloudinary with metadata:
result = cloudinary.uploader.upload(
    image_data,
    folder=f"kits/{manufacturer}",
    public_id=product_id,
    resource_type="image",
    overwrite=True,
    invalidate=True
)
Parameters:
- folder: Organized by manufacturer (e.g., kits/estes, kits/loc)
- public_id: Product identifier for consistent URLs
- overwrite=True: Replace existing images
- invalidate=True: Clear CDN cache for updates
Output: Cloudinary response with URL and metadata
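The upload parameters above can be assembled in one place so every vendor scraper produces the same folder and public ID. This is a sketch under assumptions. `build_upload_options` is a hypothetical helper, and the slug rule (lowercase, non-alphanumerics collapsed to hyphens) is an illustrative choice, not the documented behaviour of `CloudinaryUploader`.

```python
import re
from typing import Any, Dict


def build_upload_options(manufacturer: str, product_id: object) -> Dict[str, Any]:
    """Build keyword arguments for the upload call shown above.

    Hypothetical helper; the slug normalization is an assumption.
    """
    slug = re.sub(r"[^a-z0-9]+", "-", manufacturer.strip().lower()).strip("-")
    if not slug:
        raise ValueError("manufacturer must contain at least one letter or digit")
    return {
        "folder": f"kits/{slug}",
        "public_id": str(product_id),  # predictable ID, not a random string
        "resource_type": "image",
        "overwrite": True,
        "invalidate": True,
    }
```

The result can be passed straight through, for example `cloudinary.uploader.upload(image_data, **build_upload_options("Estes", 1234))`.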
Step 5: Automatic Optimization
Performed by: Cloudinary CDN (automatic)
Cloudinary applies these optimizations at delivery time when the URL requests them (for example with f_auto and q_auto):
- Format Conversion
  - Serves WebP to browsers that support it
  - Serves AVIF where supported, for maximum compression
  - Keeps the original format as the fallback
- Quality Optimization
  - Automatically adjusts quality based on content
  - Reduces file size while maintaining visual quality
  - Typically 40-80% smaller than original
- Responsive Sizing
  - Generates multiple sizes for different devices
  - Serves appropriate size based on device/viewport
  - Reduces bandwidth for mobile users
Example Transformations:
- Original: https://res.cloudinary.com/ronaldhatcher/image/upload/kits/estes/1234.jpg
- WebP: https://res.cloudinary.com/ronaldhatcher/image/upload/f_webp/kits/estes/1234.jpg
- Thumbnail: https://res.cloudinary.com/ronaldhatcher/image/upload/w_300,h_300,c_fill/kits/estes/1234.jpg
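The three URLs above share one pattern: the comma-separated transformation string sits between `/upload/` and the asset path. The sketch below builds such URLs. `build_delivery_url` is a hypothetical helper written for this guide, not part of the Cloudinary SDK.

```python
from typing import Sequence


def build_delivery_url(
    cloud_name: str,
    public_path: str,
    transformations: Sequence[str] = (),
) -> str:
    """Build an image delivery URL with optional transformations.

    Hypothetical helper; mirrors the URL pattern in the examples above.
    """
    base = f"https://res.cloudinary.com/{cloud_name}/image/upload"
    path = public_path.lstrip("/")
    if transformations:
        # Parameters are joined with commas into a single URL segment.
        return f"{base}/{','.join(transformations)}/{path}"
    return f"{base}/{path}"
```

For example, `build_delivery_url("ronaldhatcher", "kits/estes/1234.jpg", ["w_300", "h_300", "c_fill"])` reproduces the thumbnail URL listed above.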
Step 6: URL Storage
Location: src/data_sources/{vendor}.py
The image metadata is stored in the CSV:
df['image'] = primary_image_public_id # Primary image public_id
df['images'] = uploaded_images # Full list of image metadata (Cloudinary or S3)
CSV Structure:
product_id,name,manufacturer,image,images
1234,Alpha III,Estes,kits/estes/1234/primary,[{"public_id":"kits/estes/1234/primary","storage":"cloudinary",...}]
Note: The images field contains a JSON array with metadata for all uploaded images, including storage location (Cloudinary or S3), URLs, and other details.
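Since the frontend reads the images field with a JSON parser, the cell has to contain real JSON. A Python list written to CSV as-is is stored as its `repr` (single quotes), which is not valid JSON. The sketch below uses only the standard library to show an explicit `json.dumps` round trip. The row values are taken from the example above, and the metadata is trimmed to two keys for brevity.

```python
import csv
import io
import json

# Trimmed metadata for one uploaded image (illustrative subset of keys).
images = [{"public_id": "kits/estes/1234/primary", "storage": "cloudinary"}]

row = {
    "product_id": "1234",
    "name": "Alpha III",
    "manufacturer": "Estes",
    "image": images[0]["public_id"],
    "images": json.dumps(images),  # serialize explicitly so the cell is valid JSON
}

# Write one row to an in-memory CSV; csv handles quoting of the embedded JSON.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(row))
writer.writeheader()
writer.writerow(row)

# Read it back the way a consumer would and decode the JSON cell.
buffer.seek(0)
parsed = next(csv.DictReader(buffer))
restored = json.loads(parsed["images"])
```

The same idea applies to a DataFrame column: serializing each list with `json.dumps` before calling `to_csv` keeps the cell parseable downstream.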
Upload to S3:
s3_client.put_object(
    Bucket='blog-data-raw',
    Key=f'kits/{manufacturer}_kits.csv',
    Body=df.to_csv(index=False)
)
Step 7: Frontend Display
Location: Vercel Next.js application (blog_code repository)
The frontend loads images from the appropriate storage (Cloudinary or S3):
// Parse images metadata (inside the component that renders a kit)
const imageData = JSON.parse(kit.images)[0]
const imageUrl =
  imageData.storage === 'cloudinary' ? imageData.url : imageData.s3_url

return <Image src={imageUrl} alt={kit.name} width={600} height={600} loading='lazy' />
Benefits:
- Automatic format selection (WebP/AVIF)
- Responsive sizing based on viewport
- Lazy loading for performance
- Global CDN delivery (low latency)
Configuration
Cloudinary Settings
Account: ronaldhatcher
Cloud Name: ronaldhatcher
Region: Auto (global CDN)
Folder Structure:
kits/
├── estes/
│   ├── 1234.jpg
│   ├── 5678.jpg
│   └── ...
├── loc/
│   ├── 9012.jpg
│   └── ...
└── rocketarium/
    ├── 3456.jpg
    └── ...
Credentials Storage
AWS Secrets Manager:
- Secret Name: blog-data/cloudinary/credentials
- Region: eu-west-2
- KMS Encryption: Yes (using blog-data KMS key)
- Rotation: Manual (not automated)
Secret Structure:
{
  "cloud_name": "your-cloud-name",
  "api_key": "YOUR_CLOUDINARY_API_KEY",
  "api_secret": "YOUR_CLOUDINARY_API_SECRET"
}
IAM Permissions
ECS Task Execution Role:
{
  "Effect": "Allow",
  "Action": "secretsmanager:GetSecretValue",
  "Resource": "arn:aws:secretsmanager:eu-west-2:*:secret:blog-data/cloudinary/credentials-*"
}
Error Handling
Missing Credentials
If Cloudinary credentials are not available:
uploader = create_cloudinary_uploader(logger, metrics, folder_prefix)
if uploader is None:
    logger.warning("Cloudinary credentials not found. Image upload disabled.")
    # Continue without images - CSV will not have images column
Result: Scraping continues, but images are not uploaded. CSV will be missing images column.
Upload Failures
If image upload fails:
try:
    uploaded_images = image_handler.upload_image(image_url, manufacturer, sku, image_type)
except Exception as e:
    logger.error(f"Failed to upload image for {sku}: {e}")
    uploaded_images = None  # Store None in CSV
Result: CSV will have None for that product's cloudinary_images field.
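Transient network errors often succeed on a second attempt, so a small retry wrapper can reduce the number of None entries. This is a sketch of exponential backoff under stated assumptions. `upload_with_retry` is a hypothetical helper, the attempt count and delays are illustrative defaults, and the pipeline is not known to retry today.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def upload_with_retry(
    upload: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[T]:
    """Call `upload` up to `attempts` times, doubling the delay each time.

    Returns None when every attempt fails, matching the fallback above.
    """
    for attempt in range(1, attempts + 1):
        try:
            return upload()
        except Exception:
            if attempt == attempts:
                return None  # give up; caller stores None in the CSV
            sleep(base_delay * 2 ** (attempt - 1))
    return None  # only reached when attempts < 1
```

The `sleep` parameter is injectable so the backoff can be tested without waiting. In the handler above, the call would be wrapped as `upload_with_retry(lambda: image_handler.upload_image(image_url, manufacturer, sku, image_type))`.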
Missing Images in CSV
The cleaning pipeline handles missing image columns gracefully:
# In tasks/cleaning/clean_kits.py
"imageSrc": df_vendor.get("cloudinary_images"), # Returns None if column missing
Result: Clean data will have null for image URLs, which the frontend handles gracefully.
Monitoring
Metrics Tracked
In ExtractionMetrics:
- images_uploaded: Count of successfully uploaded images
- images_failed: Count of failed uploads
- upload_duration: Time spent uploading images
Logging:
INFO: Uploading image for product 1234 to Cloudinary
INFO: Successfully uploaded image: https://res.cloudinary.com/.../1234.jpg
INFO: Uploaded 45 images in 12.3 seconds
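The three counters and the summary log line can be modelled with a small dataclass. This is a minimal sketch. `ImageUploadMetrics` is a hypothetical stand-in whose field names are copied from the list above, and the real `ExtractionMetrics` class may be structured differently.

```python
from dataclasses import dataclass


@dataclass
class ImageUploadMetrics:
    """Hypothetical stand-in for the image fields of ExtractionMetrics."""

    images_uploaded: int = 0
    images_failed: int = 0
    upload_duration: float = 0.0  # seconds spent in upload calls

    def record(self, succeeded: bool, seconds: float) -> None:
        """Record one upload attempt and the time it took."""
        if succeeded:
            self.images_uploaded += 1
        else:
            self.images_failed += 1
        self.upload_duration += seconds

    def summary(self) -> str:
        """Format the summary line in the style of the log sample above."""
        return f"Uploaded {self.images_uploaded} images in {self.upload_duration:.1f} seconds"
```

Timing failed uploads as well as successful ones keeps `upload_duration` meaningful when a vendor's image host is slow or unreachable.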
Cloudinary Dashboard
Monitor usage at: https://cloudinary.com/console
Key Metrics:
- Storage used
- Bandwidth consumed
- Transformations performed
- API calls made
Troubleshooting
Images Not Uploading
Symptom: CSV missing cloudinary_images column
Possible Causes:
- Cloudinary credentials not in Secrets Manager
- ECS task role lacks secretsmanager:GetSecretValue permission
- Network connectivity issues to Cloudinary API
Solution:
# Verify secret exists
aws secretsmanager get-secret-value \
  --secret-id blog-data/cloudinary/credentials \
  --region eu-west-2
# Check ECS task logs
aws logs tail /ecs/prod/prefect-worker --follow
Images Not Displaying on Frontend
Symptom: Broken images on blog
Possible Causes:
- Cloudinary URLs in CSV are incorrect
- Cloudinary account suspended/over quota
- CORS issues (unlikely with Cloudinary)
Solution:
# Test Cloudinary URL directly
curl -I https://res.cloudinary.com/ronaldhatcher/image/upload/kits/estes/1234.jpg
# Check Cloudinary dashboard for quota/issues
Slow Image Loading
Symptom: Images take long to load on frontend
Possible Causes:
- Not using Cloudinary transformations (loading full-size images)
- Not using WebP/AVIF format
- CDN cache not warmed up
Solution:
- Ensure frontend uses Cloudinary transformation URLs
- Use f_auto for automatic format selection
- Use q_auto for automatic quality optimization
Best Practices
1. Always Use Transformations
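// Good - transformations go between /upload/ and the public ID
// (assumes cloudinaryUrl is a full delivery URL containing /upload/)
<Image src={cloudinaryUrl.replace('/upload/', '/upload/f_auto,q_auto,w_600/')} />
// Bad - loads the original, untransformed image
<Image src={cloudinaryUrl} />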
2. Organize by Manufacturer
Keep folder structure consistent:
kits/{manufacturer}/{product_id}.{ext}
3. Use Consistent Public IDs
Use product ID as public_id for predictable URLs:
public_id = f"{product_id}" # Not random strings
4. Handle Missing Images Gracefully
Always check for None/null:
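cloudinary_url = df.get("cloudinary_images")  # None if the column is missing
if cloudinary_url is not None:  # a Series has no single truth value, so compare to None
    ...  # Use Cloudinary
else:
    ...  # Use fallback or placeholder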
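The same defensive check applies to each row. The sketch below mirrors the frontend selection logic from Step 7 in Python and falls back to a placeholder when the cell is empty, malformed, or lacks a URL. `select_image_url` and the placeholder path are hypothetical names chosen for illustration.

```python
import json
from typing import Optional

PLACEHOLDER_URL = "/images/placeholder.png"  # hypothetical fallback asset


def select_image_url(images_cell: Optional[str], placeholder: str = PLACEHOLDER_URL) -> str:
    """Pick the first image URL from the JSON images cell, else a placeholder."""
    if not images_cell:
        return placeholder
    try:
        images = json.loads(images_cell)
    except (TypeError, ValueError):
        # TypeError covers non-string cells (e.g. NaN); ValueError covers bad JSON.
        return placeholder
    if not isinstance(images, list) or not images or not isinstance(images[0], dict):
        return placeholder
    first = images[0]
    key = "url" if first.get("storage") == "cloudinary" else "s3_url"
    return first.get(key) or placeholder
```

Every failure path returns the placeholder, so callers never have to handle None.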
5. Monitor Quota
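Cloudinary free tier limits (the free plan meters usage from a shared pool of monthly credits, so treat each figure as a ceiling and check the current plan):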
- 25 GB storage
- 25 GB bandwidth/month
- 25,000 transformations/month
Monitor usage and upgrade if needed.
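A simple threshold check can flag usage before a limit is reached. This sketch uses the figures listed above as defaults. `quota_warnings`, the metric names, and the 80% threshold are assumptions for illustration, and the usage numbers would have to come from the Cloudinary dashboard or another source not covered here.

```python
from typing import Dict, List

# Defaults copied from the limits listed above; adjust to the actual plan.
FREE_TIER_LIMITS: Dict[str, float] = {
    "storage_gb": 25.0,
    "bandwidth_gb": 25.0,
    "transformations": 25000.0,
}


def quota_warnings(
    usage: Dict[str, float],
    limits: Dict[str, float] = FREE_TIER_LIMITS,
    threshold: float = 0.8,
) -> List[str]:
    """Return one message per metric at or above `threshold` of its limit."""
    warnings = []
    for name, limit in limits.items():
        used = usage.get(name, 0.0)
        if limit > 0 and used / limit >= threshold:
            warnings.append(f"{name}: {used / limit:.0%} of limit used")
    return warnings
```

An empty list means all metrics are below the threshold. The messages can be sent through the same logger used elsewhere in the pipeline.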
Related Documentation
- CLOUDINARY_SETUP.md - Initial setup and configuration
- CLOUDINARY_CONFIGURATION.md - Detailed configuration reference
- Architecture Diagram 7 - Visual flow diagram
- Cloudinary Documentation - Official docs
Future Enhancements
Planned
- Automatic image resizing before upload (reduce upload time)
- Lazy loading for all images (improve page load)
- Progressive image loading (show low-res first)
Under Consideration
- Video support for product demos
- 3D model support for rocket kits
- AI-powered image tagging
- Automatic background removal