Data Extraction Flow

Last Updated: 2026-01-18

Scrapes raw data from supported sources and writes it to the raw layer in S3.

Source of truth

  • Flow code: data/platform/flows/data_extraction.py
  • Extractors: data/platform/src/data_sources/

What it extracts

See Data Sources Overview for per-source details.

Execution model

  • Uses Prefect concurrency (.submit() + wait()) to run independent sources in parallel.
  • Failures should be isolated to a single source: if one source fails, the flow should still attempt the others.
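
The submit-then-wait pattern with per-source failure isolation can be sketched with the standard library. This is a stand-in for illustration only, not the flow's actual code: it uses concurrent.futures instead of Prefect, and run_extractors, the extractors mapping, and the max_workers default are all hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor, wait


def run_extractors(extractors, max_workers=4):
    """Run every extractor in parallel and keep failures separate per source.

    `extractors` maps a source name to a zero-argument callable (hypothetical
    shape; the real extractors live under data/platform/src/data_sources/).
    """
    results, failures = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit everything first so independent sources run concurrently.
        futures = {pool.submit(fn): name for name, fn in extractors.items()}
        # Block until every future has either finished or raised.
        wait(list(futures))
        for future, name in futures.items():
            error = future.exception()
            if error is not None:
                # One failing source is recorded and does not stop the rest.
                failures[name] = error
            else:
                results[name] = future.result()
    return results, failures
```

A caller could then log the failures mapping and decide whether a partial run counts as a success; that policy is not specified here.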

Output

Raw outputs are written to the raw layer bucket:

  • BLOG_DATA_BUCKET_RAW

The bucket typically contains per-entity folders like kits/, motors/, clubs/, vendors/.
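
As a rough illustration of how an object path under a per-entity folder might be built, the sketch below assumes a layout of entity/source/date. Only the top-level entity folders and the BLOG_DATA_BUCKET_RAW variable come from this page; the rest of the key layout, the .json suffix, and both function names are hypothetical.

```python
import os
from datetime import date


def raw_object_key(entity: str, source: str, run_date: date) -> str:
    # Hypothetical layout: <entity>/<source>/<YYYY-MM-DD>.json
    return f"{entity}/{source}/{run_date.isoformat()}.json"


def raw_object_uri(entity: str, source: str, run_date: date, env=None) -> str:
    # Read the bucket name from the environment unless a mapping is passed in.
    env = os.environ if env is None else env
    bucket = env["BLOG_DATA_BUCKET_RAW"]
    return f"s3://{bucket}/{raw_object_key(entity, source, run_date)}"
```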

Configuration

Common env vars:

  • BLOG_DATA_BUCKET_RAW
  • MAX_CONCURRENT_EXTRACTORS (limits parallelism)
  • Scraper controls (see data/platform/src/config.py)
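
A minimal sketch of reading these two variables, assuming the bucket is required and the concurrency limit is optional. ExtractionConfig, load_config, and the default of 4 are assumptions made for illustration; the actual settings are defined in data/platform/src/config.py.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class ExtractionConfig:
    raw_bucket: str
    max_concurrent_extractors: int


def load_config(env=None, default_concurrency=4):
    """Build a config from environment variables (default of 4 is assumed)."""
    env = os.environ if env is None else env
    bucket = env.get("BLOG_DATA_BUCKET_RAW")
    if not bucket:
        raise RuntimeError("BLOG_DATA_BUCKET_RAW must be set")
    limit = int(env.get("MAX_CONCURRENT_EXTRACTORS", default_concurrency))
    if limit < 1:
        raise ValueError("MAX_CONCURRENT_EXTRACTORS must be at least 1")
    return ExtractionConfig(raw_bucket=bucket, max_concurrent_extractors=limit)
```

Failing fast on a missing bucket keeps a misconfigured run from scraping sources whose output it cannot store.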