Lakehouse
architecture.v2 // medallion_pattern
[Figure: medallion architecture diagram. Node labels: Kafka, CDC, Batch; Spark ETL, dbt models; BI Tools, ML Platform, APIs. Layers: Bronze (Raw Ingestion, 3 tables), Silver (Transformed, 3 tables), Gold (Serving Layer, 3 tables).]
format: apache_iceberg // catalog: polaris // engine: trino + spark

// weekly_dispatch — issue #47

Written for engineers who debug Spark jobs at 2 AM, migrate off legacy warehouses, and pitch Databricks-versus-Snowflake to skeptical VPs.

14,200+ subscribers // 47 issues published // 3,800+ discord members
Apache Iceberg, Delta Lake, Apache Hudi, Spark 4.0, Trino, dbt Core, Databricks, Snowflake, Apache Flink, Polaris Catalog
published 2026-02-24 // 14 min read

Iceberg v3, Catalog Wars, and Why Your Spark Tuning Is Probably Wrong

The Iceberg v3 spec ships three changes that matter: default values for columns, multi-argument transforms for partitioning, and variant types. The one that will bite you in production is the partition evolution behavior when you combine PARTITION BY DAY with the new WITHIN PARTITION SORT ORDER clause.

-- v2 behavior: hidden partitioning rewrites all files
ALTER TABLE events
  REPLACE PARTITION FIELD days(event_ts) WITH hours(event_ts);

-- v3: use WITHIN PARTITION SORT ORDER to avoid full rewrite
ALTER TABLE events
  SET WITHIN PARTITION SORT ORDER (event_ts ASC NULLS LAST);

The rewrite cost on a 40 TB events table is not something you want to discover at 2 AM. The sort order hint tells the engine to write new files in sorted order without touching existing partitions. In our testing on a 1-billion-row dataset, that cut compaction jobs by 73%.

Polaris, Unity Catalog, and Gravitino are not just table registries. They are becoming the execution planning layer. When your catalog can push predicate filters down into the manifest file list before a single file is opened, the query engine becomes almost irrelevant — Trino, Spark, and DuckDB all hit the same catalog API and get back the same pruned manifest.

The teams winning on cost right now are the ones who invested in catalog-level partition pruning, not in tuning their Spark executor memory. The compute is cheap. The I/O is what kills you.
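To make the pruning idea concrete, here is a minimal sketch of range-based file pruning over a manifest. It is a toy model, not any catalog's actual API: `ManifestEntry`, `prune`, the file paths, and the sizes are all hypothetical, and a real manifest carries far more metadata than a single min/max pair.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ManifestEntry:
    """Hypothetical, simplified stand-in for one data file's manifest record."""
    path: str
    min_event_date: date  # lower bound of the partition column in this file
    max_event_date: date  # upper bound of the partition column in this file
    size_bytes: int


def prune(entries, lo, hi):
    """Keep only files whose [min, max] range overlaps the predicate [lo, hi]."""
    return [
        e for e in entries
        if e.max_event_date >= lo and e.min_event_date <= hi
    ]


# Four made-up daily files of equal size.
manifest = [
    ManifestEntry(f"events/2026-02-{day}.parquet",
                  date(2026, 2, day), date(2026, 2, day), 512_000_000)
    for day in (20, 21, 22, 23)
]

# Predicate: event_date BETWEEN 2026-02-22 AND 2026-02-23
selected = prune(manifest, date(2026, 2, 22), date(2026, 2, 23))

# Bytes that never have to be opened because pruning happened on metadata.
bytes_skipped = (
    sum(e.size_bytes for e in manifest) - sum(e.size_bytes for e in selected)
)

print([e.path for e in selected])
print(bytes_skipped)
```

The point of the sketch is that the decision uses only metadata: no data file is opened to decide which files to skip, which is why the same pruned list can be handed to any engine.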

| Query | Trino 435 | Spark 4.0 | Winner |
|---|---|---|---|
| Q1 (agg) | 2.3s | 4.1s | Trino |
| Q7 (join) | 18.4s | 9.2s | Spark |
| Q21 (scan) | 41s | 38s | Spark |
| Q55 (window) | 7.8s | 14.2s | Trino |
| Q82 (subq) | 22s | 11s | Spark |

Trino wins on aggregation-heavy workloads. Spark wins on large multi-way joins and complex subqueries where the adaptive query execution planner has room to work. The takeaway: run both, route by query shape.
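One way to act on "route by query shape" is a small lookup that picks the faster engine per shape. The sketch below is only an illustration: the timings are copied from the benchmark results above, while `BENCHMARK`, `route`, and the shape labels as routing keys are hypothetical. It also assumes something upstream has already classified the query's shape, which is the hard part and is not shown.

```python
# Timings in seconds, copied from the benchmark results in this issue.
BENCHMARK = {
    "agg":    {"trino": 2.3,  "spark": 4.1},
    "join":   {"trino": 18.4, "spark": 9.2},
    "scan":   {"trino": 41.0, "spark": 38.0},
    "window": {"trino": 7.8,  "spark": 14.2},
    "subq":   {"trino": 22.0, "spark": 11.0},
}


def route(query_shape, default="spark"):
    """Return the engine with the lowest measured time for this query shape.

    Unknown shapes fall back to `default` (an arbitrary choice in this sketch).
    """
    timings = BENCHMARK.get(query_shape)
    if timings is None:
        return default
    return min(timings, key=timings.get)


for shape in BENCHMARK:
    print(shape, "->", route(shape))
```

A static table like this goes stale as data volumes and engine versions change, so in practice the timings would need to be refreshed from your own workloads.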

The dbt 1.8 adapter ships incremental_predicates with full Iceberg partition awareness. This means your incremental models can skip entire partition directories during the merge scan, rather than filtering rows after reading them.

-- models/events_silver.sql
{{config(
    materialized='incremental',
    incremental_strategy='merge',
    unique_key='event_id',
    partition_by=['event_date'],
    incremental_predicates=[
        "DBT_INTERNAL_DEST.event_date >= dateadd(day, -3, current_date)"
    ]
)}}
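As a rough illustration of what partition skipping buys you, the sketch below models the three-day lookback from the predicate above as a filter over partition dates. It does not call dbt or Iceberg: `partitions_to_scan`, the 30-day partition list, and the fixed `today` value are assumptions made up for the example.

```python
from datetime import date, timedelta


def partitions_to_scan(partition_dates, today, lookback_days=3):
    """Return the daily partitions that satisfy event_date >= today - lookback.

    Mirrors the predicate `event_date >= dateadd(day, -3, current_date)`;
    every other partition is skipped without reading any of its rows.
    """
    cutoff = today - timedelta(days=lookback_days)
    return sorted(d for d in partition_dates if d >= cutoff)


# Hypothetical table with 30 daily partitions ending on the run date.
today = date(2026, 2, 24)
all_partitions = [today - timedelta(days=n) for n in range(30)]

scanned = partitions_to_scan(all_partitions, today)
skipped = len(all_partitions) - len(scanned)

print(f"scanned {len(scanned)} partitions, skipped {skipped}")
```

With a row-level filter the merge would still read all 30 partitions before discarding most rows. With partition-aware predicates only the partitions at or after the cutoff are touched.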