data_skew | jun ren's digital garden

🗓️ 03022025 1215

DATA SKEW

Data skew issues occur when data is unevenly distributed across segments (shards) in a distributed database system

Causes

Uneven Distribution: If the distribution key (the column used to partition data across segments) is not chosen carefully, some segments may end up with significantly more data than others.
Hotspots: Certain values in the distribution key column may appear much more frequently than others, causing those segments to become overloaded.

Performance Bottlenecks:
- Overloaded segments (with more data) take longer to process, slowing down queries that depend on them.
- Other segments may sit idle, wasting computational resources.
Resource Imbalance:
- Skewed data distribution can lead to uneven usage of CPU, memory, and disk I/O, reducing the overall efficiency of the system.
Scalability Issues:
- As data grows, skewed segments can become even more overloaded, making it harder to scale the system effectively.