Data Extraction Flow

Last Updated: 2026-01-18

Scrapes raw data from the supported sources and writes it to the raw layer in S3.

Source of truth

Flow code: data/platform/flows/data_extraction.py
Extractors: data/platform/src/data_sources/

What it extracts

See Data Sources Overview for per-source details.

Execution model

Independent sources run in parallel using Prefect concurrency: each source is started with .submit(), and the flow then blocks on wait().
A failure should be isolated to the source that raised it; the flow should still attempt the remaining sources.
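
The submit-then-wait pattern with per-source failure isolation can be sketched as follows. To stay runnable without Prefect installed, this sketch swaps in the standard library's concurrent.futures, which exposes a similar submit() and wait() shape; it is not the code in data_extraction.py. The extractor functions, the run_extractors helper, and the max_workers default are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor, wait


def extract_kits():
    # Hypothetical extractor: stands in for a real scraper.
    return ["kit-a", "kit-b"]


def extract_motors():
    # Hypothetical extractor that always fails, to demonstrate isolation.
    raise RuntimeError("source unavailable")


def run_extractors(extractors, max_workers=4):
    """Run every extractor in parallel and collect results and errors per source."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit everything first so the sources run concurrently.
        futures = {pool.submit(fn): name for name, fn in extractors.items()}
        # Block until every future has finished, successfully or not.
        wait(futures)
        for future, name in futures.items():
            exc = future.exception()
            if exc is not None:
                # One failing source is recorded; the others are unaffected.
                errors[name] = exc
            else:
                results[name] = future.result()
    return results, errors


results, errors = run_extractors({"kits": extract_kits, "motors": extract_motors})
```

Here the motors failure ends up in errors while the kits result is still returned, which is the behavior the flow is expected to have. The key design choice is to inspect each future individually rather than letting the first exception propagate.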

Output

Raw outputs are written to the raw layer bucket, whose name is set by this env var:

BLOG_DATA_BUCKET_RAW

The bucket typically contains per-entity folders like kits/, motors/, clubs/, vendors/.
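
As a rough illustration of the per-entity layout, the helper below builds an S3 URI for a raw file under its entity folder. The function name, the example bucket name, and the file name are hypothetical; the actual key naming used by the extractors is not specified here.

```python
def raw_object_uri(bucket, entity, filename):
    """Build an s3:// URI for a raw file stored under its entity folder."""
    # Accept "kits" or "kits/" so callers need not worry about the slash.
    folder = entity.strip("/")
    if not folder or not filename:
        raise ValueError("entity and filename must be non-empty")
    return f"s3://{bucket}/{folder}/{filename}"


# Hypothetical bucket and file names, for illustration only.
uri = raw_object_uri("example-raw-bucket", "kits/", "kits.json")
```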

Configuration

Common env vars:

BLOG_DATA_BUCKET_RAW
MAX_CONCURRENT_EXTRACTORS (caps how many extractors run in parallel)
Scraper controls (see data/platform/src/config.py)
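
A minimal sketch of reading the two env vars listed above. The ExtractionConfig class, the load_config helper, and the default limit of 4 are assumptions made for illustration; the real handling lives in data/platform/src/config.py and may differ.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class ExtractionConfig:
    raw_bucket: str
    max_concurrent_extractors: int


def load_config(env=None, default_concurrency=4):
    """Read extraction settings from a mapping (os.environ by default)."""
    env = os.environ if env is None else env
    bucket = env.get("BLOG_DATA_BUCKET_RAW")
    if not bucket:
        raise RuntimeError("BLOG_DATA_BUCKET_RAW is not set")
    # Env values are strings, so convert before validating.
    limit = int(env.get("MAX_CONCURRENT_EXTRACTORS", default_concurrency))
    if limit < 1:
        raise ValueError("MAX_CONCURRENT_EXTRACTORS must be at least 1")
    return ExtractionConfig(raw_bucket=bucket, max_concurrent_extractors=limit)


# Passing a plain dict keeps the example independent of the real environment.
config = load_config(
    {"BLOG_DATA_BUCKET_RAW": "example-raw-bucket", "MAX_CONCURRENT_EXTRACTORS": "2"}
)
```

Failing fast on a missing bucket name avoids running every scraper only to discover at write time that there is nowhere to put the output.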

Up next

Data Cleaning Flow

Normalizes raw data into clean, graph-ready CSVs.