storage_disaggregation_architecture | jun ren's digital garden

🗓️ 31012025 1657
📎 #wip

STORAGE DISAGGREGATION ARCHITECTURE

A.K.A. Compute-Storage Separation

SUMMARY

The compute-storage separation (or storage disaggregation) architecture decouples

- the compute layer (responsible for data processing)
- the storage layer (responsible for data persistence)

so that the two can scale independently.

Ideal for
- real-time data processing systems
- cloud-native platforms
- data-intensive applications where dynamic resource allocation, cost-efficiency, and fault tolerance are critical.

Key Features

Decoupling of Compute and Storage Layers

Compute Layer

Handles real-time or batch data processing tasks such as

- aggregations
- transformations
- analytics

Compute nodes are typically ephemeral, processing data in memory for low-latency performance

Storage Layer

Handles persistent storage of

- raw data
- intermediate results
- processed outputs

using distributed file systems or object stores (e.g., Alibaba Cloud's Pangu, OSS, or HDFS).

Independent Scaling

Compute resources (CPU, memory) can be scaled independently based on workload demands.
Storage resources (disk capacity, I/O throughput) can grow as needed without affecting compute operations.
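As a minimal sketch of independent scaling, the toy model below sizes compute from the event rate and storage from the data volume, and changing one never touches the other. The `Cluster` type, the function names, and the per-node throughput figure are all hypothetical, chosen only for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Cluster:
    compute_nodes: int  # sized by workload (CPU, memory)
    storage_tb: int     # sized by data volume


def scale_compute(cluster, events_per_sec, events_per_node_per_sec):
    """Resize compute for the current load; storage capacity is left as-is."""
    needed = max(1, math.ceil(events_per_sec / events_per_node_per_sec))
    return Cluster(compute_nodes=needed, storage_tb=cluster.storage_tb)


def scale_storage(cluster, data_tb):
    """Grow storage to fit the data; the compute node count is left as-is."""
    return Cluster(
        compute_nodes=cluster.compute_nodes,
        storage_tb=max(cluster.storage_tb, data_tb),
    )
```

For example, a traffic spike from 20,000 to 50,000 events per second (assuming 10,000 events per second per node) changes only `compute_nodes`, while data growth changes only `storage_tb`.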

Resource Optimization

Pay-as-you-go

Compute is billed only while it is actively processing, and persistent storage is billed separately.

Avoid overprovisioning

Resources are allocated dynamically based on real-time demand.

Fault Tolerance and Reliability

Data in persistent storage survives compute node failures and restarts.
State snapshots and checkpoints are stored in the storage layer for recovery during failures.
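The checkpoint-and-recover idea can be sketched as follows. This is an illustration under stated assumptions, not how any particular engine implements it: a local directory stands in for the storage layer, and `CheckpointStore`, `Worker`, and `simulate_crash_and_recover` are hypothetical names.

```python
import json
import os
import tempfile


class CheckpointStore:
    """Stand-in for the storage layer: a directory that outlives any worker."""

    def __init__(self, root):
        self.root = root
        os.makedirs(root, exist_ok=True)

    def _path(self, job_id):
        return os.path.join(self.root, f"{job_id}.json")

    def save(self, job_id, state):
        tmp = self._path(job_id) + ".tmp"
        with open(tmp, "w") as f:
            json.dump(state, f)
        os.replace(tmp, self._path(job_id))  # rename so readers never see a partial file

    def load(self, job_id):
        if not os.path.exists(self._path(job_id)):
            return None
        with open(self._path(job_id)) as f:
            return json.load(f)


class Worker:
    """Ephemeral compute node: state lives in memory between checkpoints."""

    def __init__(self, job_id, store):
        self.job_id = job_id
        self.store = store
        # Resume from the last checkpoint if one exists, else start fresh.
        self.state = store.load(job_id) or {"offset": 0, "total": 0}

    def process(self, events, checkpoint_every=2):
        for value in events[self.state["offset"]:]:
            self.state["total"] += value
            self.state["offset"] += 1
            if self.state["offset"] % checkpoint_every == 0:
                self.store.save(self.job_id, self.state)


def simulate_crash_and_recover(events):
    with tempfile.TemporaryDirectory() as root:
        store = CheckpointStore(root)
        first = Worker("job-1", store)
        first.process(events[:3])  # the third event is processed but not checkpointed
        del first                  # simulated crash: in-memory state is lost
        second = Worker("job-1", store)
        second.process(events)     # replays from the last checkpointed offset
        return second.state["total"]
```

The replacement worker replays only the events after the last checkpoint, so the final total matches a run with no failure.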

Elasticity and Cost Efficiency

Compute workloads can be scaled up or down during traffic spikes without overcommitting storage resources.
Persistent data storage is unaffected by compute scaling, which reduces costs compared to tightly coupled architectures.
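A back-of-the-envelope comparison makes the cost argument concrete. All prices and workload figures below are invented for illustration. The sketch also assumes a coupled cluster must stay at peak size all month because its nodes hold the data.

```python
def coupled_cost(peak_nodes, hours, node_hour_price):
    """Coupled: nodes hold the data, so the peak-sized cluster runs the whole period."""
    return peak_nodes * hours * node_hour_price


def disaggregated_cost(node_hours_used, node_hour_price, stored_tb, tb_month_price):
    """Disaggregated: pay for compute actually used plus storage, billed separately."""
    return node_hours_used * node_hour_price + stored_tb * tb_month_price


# Hypothetical month (720 hours): 10 nodes for 72 peak hours, 2 nodes otherwise.
node_hours = 10 * 72 + 2 * (720 - 72)
coupled = coupled_cost(peak_nodes=10, hours=720, node_hour_price=1)
separated = disaggregated_cost(node_hours, node_hour_price=1,
                               stored_tb=50, tb_month_price=20)
```

With these made-up numbers the coupled cluster costs 7200 units and the disaggregated setup 3016. The gap depends entirely on how spiky the workload is, and a steady workload would see little or no saving.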

Performance Optimization

High-performance compute nodes can process large-scale data without being constrained by local disk capacity
Modern storage solutions (e.g., OSS, Pangu) provide fast access, allowing efficient streaming and batch operations.

How the Architecture Works

1 - Data Ingestion

Raw data is ingested from sources such as OSS, Kafka, or HDFS.
State snapshots and intermediate results during processing are saved periodically to distributed storage.

2 - Processing

Compute nodes (e.g., Flink clusters) process the data in memory for low-latency computation.
Ephemeral compute nodes execute real-time processing tasks like anomaly detection, aggregation, and transformation.

3 - Persistent Storage

Raw data, checkpoints, and processing results are stored in persistent layers (e.g., OSS, Pangu, or RDS).
This decoupling ensures that even during a node failure, the system can quickly restart processing by recovering state from the storage layer.
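The three steps above can be sketched end to end as follows. A plain Python dict stands in for the storage layer here, so the keys, function names, and record shape are all hypothetical. A real deployment would use an object store or file system client instead.

```python
import json
from collections import defaultdict


def ingest(store, key):
    """Step 1: read raw records from the storage layer."""
    return json.loads(store[key])


def process(records):
    """Step 2: aggregate in memory (sum of values per sensor)."""
    totals = defaultdict(int)
    for record in records:
        totals[record["sensor"]] += record["value"]
    return dict(totals)


def persist(store, key, result):
    """Step 3: write the result back to the storage layer."""
    store[key] = json.dumps(result, sort_keys=True)


def run_job(store, in_key, out_key):
    """A stateless job: everything it needs or produces lives in the store."""
    persist(store, out_key, process(ingest(store, in_key)))
```

Because `run_job` keeps nothing between calls, any compute node with access to the store can run it, or rerun it after a failure.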

Advantages of Compute-Storage Separation

Elasticity

Independent scaling of compute and storage enables efficient handling of dynamic workloads (e.g., traffic spikes).

Cost Efficiency

Storage and compute are billed separately, avoiding overprovisioning commonly seen in tightly coupled systems.

High Performance

High-performance compute nodes interact with distributed storage systems to deliver low-latency results.

Fault Tolerance

Persistent storage ensures reliable recovery during failures, with state snapshots available for resuming processing.

References

ChatGPT
