Main Pipeline Flow
Last Updated: 2026-01-18
Coordinates the end-to-end ETL run by executing extraction → cleaning → graph load.
Source of truth
- Flow code: data/platform/flows/main_pipeline.py
High-level architecture
graph TD
A[Main Pipeline] --> B[Data Extraction]
A --> C[Data Cleaning]
A --> D[Graph Load]
B --> E[S3 raw layer]
C --> F[S3 clean layer]
D --> G[Neo4j]
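The orchestration in the diagram can be sketched in plain Python. This is a minimal illustration, not the contents of main_pipeline.py: the function names `run_extraction`, `run_cleaning`, and `run_graph_load` are hypothetical, and in the real flow each stage would presumably be a Prefect flow rather than a plain function.

```python
from typing import Callable, List

# Hypothetical stand-ins for the three sub-flows. Each one only records that
# it ran; the real stages would do the work noted in the comments.
def run_extraction(log: List[str]) -> None:
    log.append("extraction")  # would write raw CSVs to the S3 raw layer


def run_cleaning(log: List[str]) -> None:
    log.append("cleaning")  # would write clean CSVs to the S3 clean layer


def run_graph_load(log: List[str]) -> None:
    log.append("graph_load")  # would update the Neo4j graph


# Stage order mirrors the description: extraction, then cleaning, then load.
STAGES: List[Callable[[List[str]], None]] = [
    run_extraction,
    run_cleaning,
    run_graph_load,
]


def main_pipeline() -> List[str]:
    """Run each stage in order; an exception in any stage stops the run."""
    log: List[str] = []
    for stage in STAGES:
        stage(log)
    return log


if __name__ == "__main__":
    print(main_pipeline())
```

Keeping the stages in an ordered list makes the sequence explicit and lets each stage be called on its own, which matches the goal of independently runnable sub-flows.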
Inputs / outputs
- Inputs: configuration + source definitions
- Outputs:
  - Raw CSVs in BLOG_DATA_BUCKET_RAW
  - Clean CSVs in BLOG_DATA_BUCKET_CLEAN
  - Neo4j graph updated with the latest dataset
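As a rough sketch of how the two bucket settings might be resolved, the snippet below assumes that BLOG_DATA_BUCKET_RAW and BLOG_DATA_BUCKET_CLEAN are environment variables holding S3 bucket names. That is an assumption, as is the helper name `output_locations`; the actual flow may read them from a different configuration source.

```python
import os
from typing import Dict, Mapping

# Assumed setting names, taken from the output list above.
RAW_VAR = "BLOG_DATA_BUCKET_RAW"
CLEAN_VAR = "BLOG_DATA_BUCKET_CLEAN"


def output_locations(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Return S3 URI prefixes for the raw and clean layers.

    Raises a ValueError naming every missing setting, so a misconfigured
    run fails before any extraction work starts.
    """
    missing = [name for name in (RAW_VAR, CLEAN_VAR) if not env.get(name)]
    if missing:
        raise ValueError(f"missing bucket settings: {', '.join(missing)}")
    return {
        "raw": f"s3://{env[RAW_VAR]}/",
        "clean": f"s3://{env[CLEAN_VAR]}/",
    }
```

Passing the mapping as a parameter keeps the helper testable without touching the real environment.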
Operational notes
- The sub-flows are designed to be runnable independently.
- Failures should surface in Prefect with enough logging to identify the failing source/task.
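One way to meet the logging goal above is to tag every failure with the task and source before re-raising. The sketch below uses the standard library `logging` module; the helper name `run_per_source` and the message format are illustrative assumptions. Inside a Prefect run, Prefect's `get_run_logger()` would be the usual substitute for the module-level logger.

```python
import logging
from typing import Callable, Iterable

logger = logging.getLogger("main_pipeline")


def run_per_source(
    sources: Iterable[str],
    task: Callable[[str], None],
    task_name: str,
) -> None:
    """Apply `task` to each source, logging which source failed.

    The exception is re-raised after logging so the orchestrator still
    marks the run as failed.
    """
    for source in sources:
        try:
            task(source)
        except Exception:
            # logger.exception records the traceback along with the message.
            logger.exception("task=%s source=%s failed", task_name, source)
            raise
```

Re-raising instead of swallowing the error keeps the failure visible to the orchestrator, while the log line identifies the failing source and task.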