Data Cleaning Flow
Last Updated: 2026-01-18
Transforms raw scraped data into normalized, graph-ready CSVs.
Source of truth
- Flow code: data/platform/flows/data_cleaning.py
- Libraries: data/platform/src/libraries/
What it does
Typical cleaning responsibilities:
- Manufacturer normalization (canonical IDs + alias resolution)
- Validation (required fields, basic typing, referential integrity)
- Normalization (dates, text cleanup, unit conversions where needed)
- Output shaping for the graph loader
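The responsibilities above can be illustrated with a minimal sketch. It is not the code in data_cleaning.py: the field names (id, manufacturer, released), the accepted date formats, and the helper names clean_row and to_csv are all assumptions made for illustration.

```python
import csv
import io
from datetime import datetime

# Hypothetical schema and date formats; the real flow defines its own.
REQUIRED_FIELDS = ("id", "manufacturer", "released")
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")


def normalize_date(raw):
    """Return an ISO 8601 date string, or None if no known format matches."""
    text = raw.strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None


def clean_row(row):
    """Validate and normalize one raw row; return (clean_row, errors)."""
    # Validation: every required field must be present and non-blank.
    errors = [
        f"missing {name}"
        for name in REQUIRED_FIELDS
        if not (row.get(name) or "").strip()
    ]
    if errors:
        return None, errors

    # Normalization: dates to ISO 8601, whitespace collapsed in text.
    released = normalize_date(row["released"])
    if released is None:
        return None, [f"unparseable date: {row['released']!r}"]

    clean = {
        "id": row["id"].strip(),
        "manufacturer": " ".join(row["manufacturer"].split()),
        "released": released,
    }
    return clean, []


def to_csv(rows):
    """Output shaping: serialize clean rows with a fixed column order."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=REQUIRED_FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```

Returning the errors alongside the row, rather than raising, lets a flow collect rejected rows for a report and keep processing the rest of the batch.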
Output
Clean outputs are written to the clean layer bucket:
BLOG_DATA_BUCKET_CLEAN
Key dependency: manufacturer registry
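Cleaning resolves every manufacturer name through the manufacturer registry, so that aliases of the same manufacturer map to one canonical ID and the graph does not end up with duplicate manufacturer nodes.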
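A rough sketch of alias resolution against a registry, to show how duplicates are avoided. The registry layout (canonical ID mapped to a list of aliases), the IDs, and the function names here are hypothetical; the actual registry format may differ.

```python
import re

# Hypothetical registry: canonical ID -> known aliases.
REGISTRY = {
    "mfr-acme": ["Acme", "Acme Corp", "ACME Corporation"],
    "mfr-globex": ["Globex", "Globex Inc."],
}


def _key(name):
    """Lowercase and collapse punctuation/whitespace so variants compare equal."""
    return re.sub(r"[^a-z0-9]+", " ", name.lower()).strip()


def build_alias_index(registry):
    """Invert the registry into a lookup from normalized alias to canonical ID."""
    index = {}
    for canonical_id, aliases in registry.items():
        for alias in aliases:
            key = _key(alias)
            existing = index.setdefault(key, canonical_id)
            if existing != canonical_id:
                # An alias claimed by two manufacturers would merge distinct nodes.
                raise ValueError(
                    f"alias {alias!r} maps to both {existing} and {canonical_id}"
                )
    return index


def resolve(name, index):
    """Return the canonical ID for a raw manufacturer name, or None if unknown."""
    return index.get(_key(name))
```

Usage: build the index once per run, then call resolve for each row. Names that return None are better routed to a review list than turned into new nodes automatically, since an unreviewed new node is how duplicates tend to appear.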