Architecture Overview: Snowflake Data Engineering
A high-level introduction to SPCS, Tasks, and orchestrating workflows securely inside Snowflake.
Welcome to Snowflake Engineering. This portal is a deep technical dive for engineers looking to build beyond standard data warehousing.
By leveraging tools such as Snowpark Container Services (SPCS), we can securely host long-running ML jobs, run dbt transformations, and coordinate workflows with open-source tools like Airflow, all without sensitive data ever leaving Snowflake.
The Mental Model
In a modern Snowflake architecture, your code moves to the data, not the other way around.
Gone are the days when compute lived purely on external EC2 instances pulling millions of rows out of Snowflake. Today, we can run arbitrary Docker containers inside the data perimeter with SPCS, while using Tasks for lightweight orchestration.
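As a concrete sketch, those two building blocks map to plain SQL: a compute pool and service definition for SPCS, and `EXECUTE TASK` to kick off a task graph. All object names and the container image path below are placeholders, not a working deployment.

```python
# Hypothetical SQL for standing up an SPCS service and manually
# running a task graph. Names (inference_pool, ml_pipeline_root,
# the image path) are illustrative placeholders.

CREATE_COMPUTE_POOL = """
CREATE COMPUTE POOL IF NOT EXISTS inference_pool
  MIN_NODES = 1
  MAX_NODES = 3
  INSTANCE_FAMILY = CPU_X64_XS;
"""

CREATE_SERVICE = """
CREATE SERVICE inference_service
  IN COMPUTE POOL inference_pool
  FROM SPECIFICATION $$
    spec:
      containers:
      - name: inference
        image: /ml_db/public/image_repo/inference:latest
      endpoints:
      - name: api
        port: 8080
  $$;
"""

RUN_TASK = "EXECUTE TASK ml_pipeline_root;"

def run_statements(cursor, statements):
    """Execute each statement with a DB-API-style cursor
    (e.g. from the snowflake-connector-python package)."""
    for stmt in statements:
        cursor.execute(stmt)
```

In practice you would pass a `snowflake.connector` cursor to `run_statements`; the helper itself is connector-agnostic.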
Here is a system interaction diagram mapping how a complete workflow operates:
```mermaid
sequenceDiagram
    participant User
    participant Airflow as Apache Airflow
    participant Tasks as Snowflake Tasks
    participant SPCS as Snowpark Container Services
    participant DB as Snowflake Data Cloud
    User->>Airflow: Trigger Pipeline
    Airflow->>Tasks: Run `EXECUTE TASK`
    Tasks->>DB: Process Raw Data (dbt integration)
    Tasks->>SPCS: Trigger Model Inference API
    SPCS->>DB: Query processed features
    DB-->>SPCS: Return Tensor Arrays
    SPCS-->>Tasks: Return Inference Results (Status 200)
    Tasks-->>Airflow: Pipeline complete
    Airflow-->>User: Notification Sent
```
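The handoffs in the diagram can be sketched as plain Python control flow. Every hop here is stubbed out in memory; in a real deployment Airflow would run `EXECUTE TASK` over a Snowflake connection, and the inference call would be an HTTP request to the SPCS service endpoint.

```python
# Toy end-to-end run of the diagrammed flow. Each function is a
# stand-in for one participant; none of this touches Snowflake.

def run_dbt_models(raw_rows):
    """Stand-in for the Tasks/dbt step: derive features from raw rows."""
    return [{"feature": row["value"] * 2} for row in raw_rows]

def call_inference_api(features):
    """Stand-in for the SPCS endpoint: score each feature row."""
    return {"status": 200, "scores": [f["feature"] + 0.5 for f in features]}

def trigger_pipeline(raw_rows):
    """Airflow's role: sequence the steps and report completion."""
    features = run_dbt_models(raw_rows)
    result = call_inference_api(features)
    return {"pipeline": "complete", "scores": result["scores"]}

print(trigger_pipeline([{"value": 1}, {"value": 2}]))
# → {'pipeline': 'complete', 'scores': [2.5, 4.5]}
```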
In future tutorials we will look at how to use dbt-core with Airflow, submit ML jobs from stages, and deploy code to Snowflake with GitHub Actions. Although this high-level diagram showcases Snowflake Tasks with native dbt integration, dbt-core orchestrated by Airflow is a much more common pattern in the industry.
Why use Snowpark Container Services?
- Security: Since compute happens within the Snowflake boundary, sensitive PII/PHI data never travels over the public internet to external APIs.
- Scalability: Scale compute pool nodes up and down natively within Snowflake.
- Versatility: You can build standard REST APIs (FastAPI) or host robust long-running stream processors.
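To make the REST-API pattern concrete, here is a minimal sketch of a long-running inference service using only the Python standard library. A production SPCS container would more likely use FastAPI; the `/predict` route, port, and payload shape are illustrative assumptions.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class InferenceHandler(BaseHTTPRequestHandler):
    """Minimal JSON-over-HTTP handler; SPCS would route endpoint
    traffic to the port this server listens on."""

    def do_POST(self):
        if self.path != "/predict":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        # Placeholder "model": sum the feature vector.
        score = sum(payload.get("features", []))
        body = json.dumps({"score": score}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        # Keep container logs quiet in this sketch.
        pass

# Container entrypoint would run something like:
#   HTTPServer(("0.0.0.0", 8080), InferenceHandler).serve_forever()
```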
Next Steps
In the following tutorials, we will dissect each segment of this architectural flow, showing practical code deployments using Git integration directly in Snowflake!