Optimising table distribution in Azure Synapse: Lessons from a logistics data solution
When we recently migrated a logistics data warehouse to Azure Synapse Dedicated SQL Pools, the biggest challenge wasn't the migration itself; it was optimising performance without ballooning compute costs (DWUs).
In logistics, query patterns often involve large fact tables joined with multiple dimension tables: orders with customers, shipments with locations, inventory with products. Without proper distribution strategies, queries triggered excessive data movement across compute nodes, leading to high DWU consumption and slow performance.
That's where table distribution optimisation came in. Choosing the right distribution method for each table (hash, replicated, or round-robin) was the difference between a sluggish system and a cost-efficient, responsive warehouse.
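Before changing anything, it helps to audit what the migration actually produced. The catalog view sys.pdw_table_distribution_properties (specific to dedicated SQL pools) exposes each table's current policy; a minimal sketch:
-- List the distribution policy (ROUND_ROBIN, HASH, REPLICATE) of every user table
SELECT s.name AS schema_name,
       t.name AS table_name,
       dp.distribution_policy_desc
FROM sys.tables t
JOIN sys.schemas s ON t.schema_id = s.schema_id
JOIN sys.pdw_table_distribution_properties dp ON t.object_id = dp.object_id
ORDER BY s.name, t.name;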
Start with staging (round-robin distribution)
When ingesting raw logistics data (shipments, tracking scans, etc.), our first priority was speed, not optimisation. Round-robin distribution is ideal for staging because it spreads rows evenly across all 60 distributions without requiring a distribution key.
CREATE TABLE dbo.staging_shipment
WITH
( DISTRIBUTION = ROUND_ROBIN, -- spread rows evenly; no key decision needed yet
HEAP )                        -- skip index maintenance for the fastest possible load
AS
SELECT * FROM source.shipment_landing;
- Fast for initial loads (for the bulk file ingest itself, see the COPY INTO sketch below).
- Avoids the overhead of deciding on keys upfront.
- A transient table; we optimise later when promoting the data into production fact tables.
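For the file ingest we leaned on COPY INTO. The sketch below assumes the raw shipment files land as Parquet in ADLS Gen2 and that the pool's managed identity can read them; the storage path is a placeholder:
-- Bulk-load Parquet files into the round-robin staging table
COPY INTO dbo.staging_shipment
FROM 'https://<storage-account>.dfs.core.windows.net/landing/shipments/*.parquet'
WITH (
    FILE_TYPE = 'PARQUET',
    CREDENTIAL = (IDENTITY = 'Managed Identity')
);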
Optimise dimensions with replicated distribution
Dimensions in logistics (such as date, product, location) are usually small enough to replicate across all compute nodes. This eliminates data movement during joins.
CREATE TABLE dbo.dim_location
(
dim_location_key INT NOT NULL,
city NVARCHAR(100),
country NVARCHAR(50),
region NVARCHAR(50),
-- Dedicated SQL pools only accept informational keys: NONCLUSTERED + NOT ENFORCED
CONSTRAINT pk_dim_location PRIMARY KEY NONCLUSTERED (dim_location_key) NOT ENFORCED
)
WITH
( DISTRIBUTION = REPLICATE,
CLUSTERED COLUMNSTORE INDEX );
- Dimensions are used in almost every query.
- Replication gives every compute node its own local copy, so joins require no shuffling.
- Works best when tables stay under 2 GB and don't change frequently (when they do change, see the warm-up sketch below).
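One operational wrinkle: when a replicated dimension is reloaded, Synapse invalidates the cached copies and rebuilds them on the next query that touches the table, so that query pays the rebuild cost. A trivial post-load SELECT (the warm-up pattern Microsoft documents) absorbs it at a time of your choosing:
-- Touch the table after a reload to trigger the replicated-copy rebuild now,
-- rather than during the first business query of the day
SELECT TOP 1 * FROM dbo.dim_location;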
Optimise facts with hash distribution
For large fact tables like shipments or orders, the goal is to minimise shuffling during joins. Hash distribution assigns each row to a distribution based on the hash of a chosen column, so rows sharing a key value always land together.
CREATE TABLE dbo.fact_shipment
WITH
( DISTRIBUTION = HASH(customer_id), -- co-locate rows that share a customer_id
CLUSTERED COLUMNSTORE INDEX )       -- compressed storage suited to large fact scans
AS
SELECT *
FROM dbo.staging_shipment;
- Joins between fact_shipment and a customer dimension on customer_id now happen locally on each node.
- This reduces expensive cross-node data movement.
- Choosing the right key is crucial: it should be high-cardinality, frequently joined, and evenly distributed. If a key turns out badly skewed, it can be swapped via CTAS, as sketched below.
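Distribution can't be changed with ALTER TABLE, so the standard fix for a bad key is a CTAS into a freshly keyed table followed by a name swap. A hedged sketch; order_id is a hypothetical alternative key:
-- Rebuild the fact table on a different hash key
CREATE TABLE dbo.fact_shipment_rekey
WITH
( DISTRIBUTION = HASH(order_id),
CLUSTERED COLUMNSTORE INDEX )
AS
SELECT * FROM dbo.fact_shipment;

-- Swap names so downstream queries keep working
RENAME OBJECT dbo.fact_shipment TO fact_shipment_old;
RENAME OBJECT dbo.fact_shipment_rekey TO fact_shipment;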
Verify distribution health
After setting distribution, we checked for skew and data movement:
DBCC PDW_SHOWSPACEUSED('dbo.fact_shipment');
This showed whether some distributions held far more rows than others. Skew is bad: overloaded distributions become hot spots, and every query runs at the pace of the busiest one.
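When we wanted the per-distribution row counts as a result set rather than DBCC text output, we joined the catalog views directly. A sketch following the join pattern of Microsoft's documented vTableSizes example (counts can be overstated if the table carries secondary indexes):
-- Rows per distribution for dbo.fact_shipment; a distribution far above
-- the average signals skew worth fixing
SELECT nps.distribution_id,
       SUM(nps.row_count) AS row_count
FROM sys.tables t
JOIN sys.pdw_table_mappings tm ON t.object_id = tm.object_id
JOIN sys.pdw_nodes_tables nt ON tm.physical_name = nt.name
JOIN sys.dm_pdw_nodes_db_partition_stats nps
     ON nt.object_id = nps.object_id
    AND nt.pdw_node_id = nps.pdw_node_id
    AND nt.distribution_id = nps.distribution_id
WHERE t.name = 'fact_shipment'
GROUP BY nps.distribution_id
ORDER BY row_count DESC;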
To confirm whether queries triggered shuffles, we used:
EXPLAIN
SELECT f.shipment_id, l.city
FROM dbo.fact_shipment f
JOIN dbo.dim_location l ON f.dim_location_key = l.dim_location_key;
If distribution is good: no shuffle steps appear in the plan.
If bad: the plan contains a shuffle move step (SHUFFLE_MOVE in the EXPLAIN output, surfaced as ShuffleMoveOperation in the request-step DMVs), meaning Synapse is moving data between distributions at query time.
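The same EXPLAIN check pays off for aggregations, not just joins. A contrasting pair, assuming the hash key from earlier (carrier_id is a hypothetical non-key column):
-- Grouping on the distribution key: each distribution aggregates its own rows; no shuffle
SELECT customer_id, COUNT(*) AS shipments
FROM dbo.fact_shipment
GROUP BY customer_id;

-- Grouping on a non-key column: the plan typically adds a SHUFFLE_MOVE to
-- co-locate matching rows (or partial aggregates) before the final aggregate
SELECT carrier_id, COUNT(*) AS shipments
FROM dbo.fact_shipment
GROUP BY carrier_id;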
Closing thoughts
For our logistics project, optimising distribution strategies cut DWU usage significantly while speeding up delivery. The formula was simple:
- Staging = Round-robin (fast ingest).
- Dimensions = Replicated (small, heavily joined).
- Facts = Hash (large, join-heavy).
By verifying skew and monitoring data movement, we tuned the warehouse to balance performance and cost, ensuring the migration to Synapse delivered real value without unnecessary overhead.