Data Extraction Flow
Last Updated: 2026-01-18
Scrapes raw data from supported sources and writes the raw layer to S3.
Source of truth
- Flow code: data/platform/flows/data_extraction.py
- Extractors: data/platform/src/data_sources/
What it extracts
See Data Sources Overview for per-source details.
Execution model
- Uses Prefect concurrency (.submit() + wait()) to run independent sources in parallel.
- Failures should be isolated to a single source (the flow should still attempt the others).
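The submit-then-wait pattern with per-source failure isolation can be sketched as follows. To stay self-contained, this sketch uses the standard library's concurrent.futures as a stand-in for Prefect; the real flow uses Prefect's own .submit() and wait(), whose signatures differ. The names run_sources and max_workers are hypothetical and do not come from the flow code.

```python
from concurrent.futures import ThreadPoolExecutor, wait


def run_sources(extractors, max_workers=4):
    """Run every extractor in parallel and keep failures separate.

    `extractors` maps a source name to a zero-argument callable.
    Returns (results, failures), both keyed by source name.
    Hypothetical helper: a stand-in for the Prefect flow body.
    """
    results, failures = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit everything first so independent sources overlap.
        futures = {pool.submit(fn): name for name, fn in extractors.items()}
        # Block until every source has either finished or raised.
        wait(futures)
        for future, name in futures.items():
            error = future.exception()
            if error is not None:
                # One failing source is recorded, not re-raised,
                # so the remaining sources are still collected.
                failures[name] = error
            else:
                results[name] = future.result()
    return results, failures
```

Collecting exceptions per source instead of re-raising them is what lets the remaining sources finish. The caller can then decide whether any entry in failures should fail the run as a whole.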
Output
Raw outputs are written to the raw layer bucket:
- BLOG_DATA_BUCKET_RAW
The bucket typically contains per-entity folders like kits/, motors/, clubs/, vendors/.
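As a rough illustration of the per-entity layout, the helper below builds an object key under one of the entity folders. Only the four folder names come from this document. The function name raw_key and the flat entity/filename key shape are assumptions, and the actual extractors may nest keys differently.

```python
# Entity folders named in this document; the real bucket may hold others.
ENTITY_FOLDERS = ("kits", "motors", "clubs", "vendors")


def raw_key(entity, filename):
    """Return an object key such as 'kits/page-1.json'.

    Hypothetical helper: the key shape below the entity folder is assumed.
    """
    if entity not in ENTITY_FOLDERS:
        raise ValueError(f"unknown entity folder: {entity!r}")
    # Strip leading slashes so the key never contains an empty segment.
    return f"{entity}/{filename.lstrip('/')}"
```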
Configuration
Common env vars:
- BLOG_DATA_BUCKET_RAW
- MAX_CONCURRENT_EXTRACTORS (limits parallelism)
- Scraper controls (see data/platform/src/config.py)
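A minimal sketch of reading the two named variables might look like the following. The function name load_extraction_config, the default of 4 for MAX_CONCURRENT_EXTRACTORS, and the choice to treat a missing bucket as an error are all assumptions. The authoritative handling is in data/platform/src/config.py.

```python
import os


def load_extraction_config(env=None):
    """Read extraction settings from environment variables.

    Hypothetical helper; the default and the validation rules are assumed.
    """
    env = os.environ if env is None else env

    bucket = env.get("BLOG_DATA_BUCKET_RAW")
    if not bucket:
        raise RuntimeError("BLOG_DATA_BUCKET_RAW must be set")

    # Assumed default of 4 parallel extractors when the variable is unset.
    limit = int(env.get("MAX_CONCURRENT_EXTRACTORS", "4"))
    if limit < 1:
        raise ValueError("MAX_CONCURRENT_EXTRACTORS must be at least 1")

    return {"bucket": bucket, "max_concurrent_extractors": limit}
```

Passing a plain dict as env makes the function easy to exercise without touching the process environment.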