Main Pipeline Flow

Last Updated: 2026-01-18

Coordinates the end-to-end ETL run by executing three sub-flows in order: extraction → cleaning → graph load.

Source of truth

Flow code: data/platform/flows/main_pipeline.py

High-level architecture

graph TD
    A[Main Pipeline] --> B[Data Extraction]
    A --> C[Data Cleaning]
    A --> D[Graph Load]
    B --> E[S3 raw layer]
    C --> F[S3 clean layer]
    D --> G[Neo4j]
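
The ordering in the diagram can be sketched in plain Python. This is a minimal illustration, not the contents of main_pipeline.py: the function names `extract_data`, `clean_data`, and `load_graph` are hypothetical stand-ins, and the Prefect decorators the real flow presumably uses are left out so the snippet runs on its own.

```python
from typing import Callable, List

# Hypothetical stand-ins for the three sub-flows. The real implementations
# are not shown in this document; each one here only records that it ran.
def extract_data(completed: List[str]) -> None:
    completed.append("extraction")  # would write raw CSVs to the S3 raw layer


def clean_data(completed: List[str]) -> None:
    completed.append("cleaning")  # would write clean CSVs to the S3 clean layer


def load_graph(completed: List[str]) -> None:
    completed.append("graph load")  # would update Neo4j


def main_pipeline() -> List[str]:
    """Run the stages in order; each starts only after the previous one returns."""
    completed: List[str] = []
    stages: List[Callable[[List[str]], None]] = [extract_data, clean_data, load_graph]
    for stage in stages:
        stage(completed)  # an exception here stops the run before later stages
    return completed


if __name__ == "__main__":
    print(main_pipeline())
```

Because the stages are ordinary callables, each one can also be invoked directly, which mirrors the idea of sub-flows that run independently of the parent.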

Inputs / outputs

Inputs: configuration and source definitions
Outputs:
- Raw CSVs in BLOG_DATA_BUCKET_RAW
- Clean CSVs in BLOG_DATA_BUCKET_CLEAN
- Neo4j graph updated with the latest dataset
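
As an illustration of how the two bucket settings might be resolved before a run, here is a small sketch. It assumes that BLOG_DATA_BUCKET_RAW and BLOG_DATA_BUCKET_CLEAN are supplied as environment variables, which this document does not confirm; `PipelineBuckets` and `load_buckets` are hypothetical names introduced for the example.

```python
import os
from dataclasses import dataclass
from typing import Mapping

# Setting names taken from the outputs list above. Treating them as
# environment variables is an assumption made for this sketch.
REQUIRED_SETTINGS = ("BLOG_DATA_BUCKET_RAW", "BLOG_DATA_BUCKET_CLEAN")


@dataclass(frozen=True)
class PipelineBuckets:
    raw: str
    clean: str


def load_buckets(env: Mapping[str, str] = os.environ) -> PipelineBuckets:
    """Read both bucket names, failing early if either is missing or empty."""
    missing = [name for name in REQUIRED_SETTINGS if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing required settings: {', '.join(missing)}")
    return PipelineBuckets(
        raw=env["BLOG_DATA_BUCKET_RAW"],
        clean=env["BLOG_DATA_BUCKET_CLEAN"],
    )
```

Validating both names up front means a misconfigured run stops before any extraction work starts, rather than failing midway when the clean layer is first written.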

Operational notes

The sub-flows are designed to be runnable independently.
Failures should surface in Prefect with enough logging to identify the failing source/task.
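
One way to make a failure identifiable is to log the stage and source at the point of failure and then re-raise, so the orchestrator still sees the run as failed. The sketch below uses only the standard `logging` module; `run_stage` and `SourceFailure` are hypothetical names, and how the real flow wires its logging into Prefect is not shown in this document.

```python
import logging
from typing import Callable, Iterable, List

logger = logging.getLogger("main_pipeline")


class SourceFailure(RuntimeError):
    """Raised when a stage fails while processing a specific source."""


def run_stage(stage: str, sources: Iterable[str], handler: Callable[[str], None]) -> List[str]:
    """Apply handler to each source; on error, log stage and source, then re-raise."""
    done: List[str] = []
    for source in sources:
        try:
            handler(source)
        except Exception as exc:
            # logger.exception records the traceback alongside the message.
            logger.exception("stage=%s source=%s failed", stage, source)
            raise SourceFailure(f"{stage} failed for source {source!r}") from exc
        done.append(source)
    return done
```

Chaining with `from exc` keeps the original traceback attached, so the log line names the failing source while the underlying error remains available for debugging.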

Up next

Data Extraction Flow

Scrape sources into the raw layer in S3.