Data Cleaning Flow

Last Updated: 2026-01-18

Transforms raw scraped data into normalized, graph-ready CSVs.

Source of truth

  • Flow code: data/platform/flows/data_cleaning.py
  • Libraries: data/platform/src/libraries/

What it does

Typical cleaning responsibilities:

  • Manufacturer normalization (canonical IDs + alias resolution)
  • Validation (required fields, basic typing, referential integrity)
  • Normalization (dates, text cleanup, unit conversions where needed)
  • Output shaping for the graph loader
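As a rough illustration of the validation and normalization steps listed above, here is a minimal sketch. It is not taken from data_cleaning.py: the column names, the accepted date formats, and the helper names normalize_date and clean_row are all assumptions made for the example.

```python
from datetime import datetime

# Hypothetical column names; the real flow's schema may differ.
REQUIRED_FIELDS = ("id", "manufacturer", "published")
# Assumed input date formats, tried in order.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")


def normalize_date(value):
    """Return an ISO 8601 date string, or raise ValueError if no format matches."""
    text = value.strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {value!r}")


def clean_row(row):
    """Validate required fields, then normalize whitespace and dates.

    Expects a dict of strings, such as a row from csv.DictReader.
    """
    missing = [f for f in REQUIRED_FIELDS if not (row.get(f) or "").strip()]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    # Collapse runs of whitespace and trim both ends of every value.
    cleaned = {key: " ".join(str(val).split()) for key, val in row.items()}
    cleaned["published"] = normalize_date(cleaned["published"])
    return cleaned
```

Raising on the first invalid row keeps the sketch short; a real flow might instead collect rejected rows into a separate report so that one bad record does not stop the run.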

Output

Clean outputs are written to the clean layer bucket:

  • BLOG_DATA_BUCKET_CLEAN

Key dependency: manufacturer registry

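Cleaning resolves each manufacturer name against the manufacturer registry, mapping known aliases to a single canonical ID, so that the graph does not end up with duplicate manufacturer nodes.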
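The registry lookup can be pictured with a small sketch. The registry format shown here (canonical ID mapped to a list of known aliases) and the function names are assumptions for illustration only; the actual registry code lives under the libraries path above and may be structured differently.

```python
def _key(name):
    """Normalize a name for lookup: collapse whitespace, ignore case."""
    return " ".join(name.split()).casefold()


def build_alias_index(registry):
    """Invert {canonical_id: [aliases]} into {normalized alias: canonical_id}."""
    index = {}
    for canonical_id, aliases in registry.items():
        for alias in aliases:
            index[_key(alias)] = canonical_id
    return index


def resolve_manufacturer(name, index):
    """Return the canonical ID for a raw name, or None if it is not registered."""
    return index.get(_key(name))


# Hypothetical registry contents, not real data.
REGISTRY = {
    "acme": ["Acme", "ACME Corp.", "Acme Corporation"],
    "globex": ["Globex", "Globex Inc"],
}
INDEX = build_alias_index(REGISTRY)
```

With this index, resolve_manufacturer("  acme   corp. ", INDEX) and resolve_manufacturer("Acme Corporation", INDEX) both return "acme", so the two spellings collapse into one node. Returning None for unknown names leaves the caller to decide whether to reject the row or queue the name for review.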