Main Pipeline Flow

Last Updated: 2026-01-18

The main pipeline flow coordinates the end-to-end ETL run by executing three sub-flows in order: extraction → cleaning → graph load.

Source of truth

  • Flow code: data/platform/flows/main_pipeline.py

High-level architecture

graph TD
    A[Main Pipeline] --> B[Data Extraction]
    A --> C[Data Cleaning]
    A --> D[Graph Load]
    B --> E[S3 raw layer]
    C --> F[S3 clean layer]
    D --> G[Neo4j]
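
The orchestration in the diagram can be sketched as three calls made in sequence, each consuming the previous stage's output. This is a minimal, dependency-free illustration, not the code in main_pipeline.py: the function names, the `raw/` and `clean/` key prefixes, and the return shapes are all assumptions, and in the real module each stage would presumably be a Prefect flow rather than a plain function.

```python
from typing import Dict, List, Union

# Hypothetical stand-ins for the three sub-flows. In the real module these
# would presumably be Prefect flows; plain functions keep the sketch runnable
# without any third-party dependency.


def extract_data(sources: List[str]) -> List[str]:
    """Pretend to write one raw CSV per source and return the object keys."""
    return [f"raw/{name}.csv" for name in sources]


def clean_data(raw_keys: List[str]) -> List[str]:
    """Pretend to clean each raw CSV and return the clean-layer keys."""
    return [key.replace("raw/", "clean/", 1) for key in raw_keys]


def load_graph(clean_keys: List[str]) -> int:
    """Pretend to load the clean CSVs into the graph; return files loaded."""
    return len(clean_keys)


def main_pipeline(sources: List[str]) -> Dict[str, Union[List[str], int]]:
    """Run extraction, then cleaning, then graph load, in that order."""
    raw_keys = extract_data(sources)
    clean_keys = clean_data(raw_keys)
    loaded = load_graph(clean_keys)
    return {"raw": raw_keys, "clean": clean_keys, "loaded": loaded}
```

Passing each stage's output to the next makes the ordering explicit: cleaning cannot start before extraction has returned, and the graph load only sees keys that cleaning produced.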

Inputs / outputs

  • Inputs: pipeline configuration and source definitions
  • Outputs:
    • Raw CSVs in BLOG_DATA_BUCKET_RAW
    • Clean CSVs in BLOG_DATA_BUCKET_CLEAN
    • Neo4j graph updated with the latest dataset
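
As an illustration of how the two bucket settings might be resolved before a run, the sketch below assumes that BLOG_DATA_BUCKET_RAW and BLOG_DATA_BUCKET_CLEAN are environment variables. That is an assumption; they could equally be Prefect variables or values from a config file. The `PipelineBuckets` class and `load_buckets` helper are hypothetical names.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Setting names taken from the outputs list above; treating them as
# environment variables is an assumption made for this sketch.
REQUIRED_SETTINGS = ("BLOG_DATA_BUCKET_RAW", "BLOG_DATA_BUCKET_CLEAN")


@dataclass(frozen=True)
class PipelineBuckets:
    """Hypothetical container for the raw and clean bucket names."""

    raw: str
    clean: str


def load_buckets(env: Optional[Mapping[str, str]] = None) -> PipelineBuckets:
    """Read both bucket names, failing fast if either is missing or empty."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_SETTINGS if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing required settings: {', '.join(missing)}")
    return PipelineBuckets(
        raw=env["BLOG_DATA_BUCKET_RAW"],
        clean=env["BLOG_DATA_BUCKET_CLEAN"],
    )
```

Checking both names up front means a misconfigured run fails before extraction starts, rather than partway through after the raw layer has already been written.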

Operational notes

  • Each sub-flow (extraction, cleaning, graph load) is designed to be runnable on its own, independently of the main pipeline.
  • Failures should surface in Prefect with enough log detail to identify the failing source or task.
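
One way to meet the logging requirement is to wrap each per-source call so that the stage and source name are logged and attached to the raised error. The sketch below uses the standard `logging` module; inside a Prefect run, the logger returned by Prefect's `get_run_logger()` could be used instead. `SourceFailure` and `run_for_source` are hypothetical names, not part of the existing flow code.

```python
import logging
from typing import Callable, TypeVar

T = TypeVar("T")

# Standard-library logger for the sketch; the logger name is an assumption.
logger = logging.getLogger("main_pipeline")


class SourceFailure(RuntimeError):
    """Hypothetical error type that carries the failing stage and source."""


def run_for_source(stage: str, source: str, task: Callable[[str], T]) -> T:
    """Run task(source); on error, log the stage and source, then re-raise."""
    try:
        return task(source)
    except Exception as exc:
        # logger.exception records the traceback alongside the message.
        logger.exception("stage=%s source=%s failed", stage, source)
        raise SourceFailure(
            f"{stage} failed for source '{source}': {exc}"
        ) from exc
```

Re-raising rather than swallowing the exception keeps the run marked as failed, while `raise ... from exc` preserves the original error as the cause for debugging.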