03112024 1135
ABSTRACT
- How to implement streaming data processing pipelines and why Flink manages state
- How to use event time to consistently compute accurate analytics
- How to build event-driven applications on continuous streams
- How Flink is able to provide fault-tolerant, stateful stream processing with exactly-once semantics
Stream processing
| Paradigm | Description |
|---|---|
| Batch Processing | For processing a bounded data stream |
| Stream Processing | For processing an unbounded data stream (the input might never end) |
Flink applications are composed of:
- Streaming data sources (message_queue)
- Operators that transform the data
- Sinks that receive the result streams (applications that need this data)
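The source → operator → sink shape can be sketched with plain Python generators (a conceptual model only, not the Flink API):

```python
# Minimal sketch of a streaming pipeline: source -> operator -> sink.
# Plain Python generators stand in for Flink's streaming runtime.

def source():
    # Stand-in for a streaming source such as a message queue.
    yield from ["click", "view", "click", "click"]

def count_operator(stream):
    # Operator: transforms the stream into running counts per event type.
    counts = {}
    for event in stream:
        counts[event] = counts.get(event, 0) + 1
        yield (event, counts[event])

def sink(stream):
    # Sink: hand results to a downstream consumer (here, just a list).
    return list(stream)

results = sink(count_operator(source()))
print(results)  # [('click', 1), ('view', 1), ('click', 2), ('click', 3)]
```

Because the operator is a generator, each event flows through the whole pipeline as it arrives, which is the essential difference from batch processing.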
Parallel dataflows
INFO
Programs in Flink are inherently parallel and distributed
- During execution, a stream has one or more stream partitions
- Each operator has one or more operator subtasks
- The operator subtasks are independent of one another
- They execute in different threads, possibly on different machines or containers
IMPORTANT
The number of operator subtasks is the parallelism of that particular operator
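A toy illustration of parallelism, assuming an operator with parallelism 3 (round-robin partitioning is one possible split; Flink's actual scheduling is more involved):

```python
# Sketch: an operator with parallelism 3 is three independent subtasks,
# each processing its own stream partition with its own local state.

PARALLELISM = 3

def subtask(partition):
    # Each subtask works only on its own partition -- here it just
    # counts the events it receives.
    return len(list(partition))

events = ["a", "b", "c", "d", "e", "f", "g"]

# Round-robin split of the stream into one partition per subtask.
partitions = [events[i::PARALLELISM] for i in range(PARALLELISM)]
results = [subtask(p) for p in partitions]
print(results)  # [3, 2, 2]
```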
Transporting data between operators
One-to-one forwarding
Preserves partitioning / ordering of elements
Redistributing streams
- Changes partitioning of streams
- Ordering is preserved within each pair of sending / receiving subtasks
- Each operator subtask sends data to different target subtasks, depending on the selected transformation
- Examples: `keyBy()`, `broadcast()`, `rebalance()`
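The three redistribution patterns can be modelled as functions that assign each element to one (or all) of `n` target subtasks; this is a sketch of the routing idea, not Flink's implementation:

```python
# Sketch of Flink's redistribution patterns, modelled as routing
# functions over n target subtasks.

def key_by(events, key_fn, n):
    """keyBy(): a hash of the key picks the target subtask, so all
    events with the same key land on the same subtask."""
    targets = [[] for _ in range(n)]
    for e in events:
        targets[hash(key_fn(e)) % n].append(e)
    return targets

def rebalance(events, n):
    """rebalance(): round-robin, evens out load regardless of keys."""
    targets = [[] for _ in range(n)]
    for i, e in enumerate(events):
        targets[i % n].append(e)
    return targets

def broadcast(events, n):
    """broadcast(): every subtask receives every element."""
    return [list(events) for _ in range(n)]

# (user_id, value) events; integer keys keep the hashing deterministic.
events = [(1, "a"), (2, "b"), (1, "c")]
print(key_by(events, lambda e: e[0], 2))   # same key -> same subtask
print(rebalance(events, 2))                # alternating subtasks
print(broadcast(events, 2))                # full copy per subtask
```

Note that within each (sender, receiver) subtask pair the relative order of elements is preserved, exactly as described above.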
Timely Stream Processing
TLDR
Consider the timestamp at which the event happened rather than when the event was received
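For instance, assigning events to tumbling windows by the timestamp embedded in each event keeps results correct even when events arrive out of order (a simplified sketch, ignoring watermarks and lateness):

```python
# Sketch: bucket events into 10-unit tumbling windows by the timestamp
# carried in the event (event time), not by arrival order.

def tumbling_windows(events, size):
    # events: (timestamp, value) pairs, possibly out of order.
    windows = {}
    for ts, value in events:
        window_start = (ts // size) * size
        windows.setdefault(window_start, []).append(value)
    return windows

# Out-of-order arrival: the event at t=12 arrives before the one at t=8,
# yet each still lands in its correct window.
events = [(3, "a"), (12, "b"), (8, "c"), (15, "d")]
print(tumbling_windows(events, 10))  # {0: ['a', 'c'], 10: ['b', 'd']}
```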
Stateful Stream Processing
ABSTRACT
Flink's operations can be stateful: how one event is handled can depend on the accumulated effect of all the events that came before it
State is managed locally on each parallel instance, either:
- On the JVM heap
- In on-disk data structures (if the state is too large for memory)
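A sketch of keyed state, where the output for each event depends on state accumulated from all previous events with the same key (the class name and dict-based state are illustrative, not Flink's state API):

```python
# Sketch of stateful processing: a running average per key, where each
# result depends on all earlier events with the same key.

class RunningAverage:
    def __init__(self):
        # Per-key (sum, count) -- kept locally, like Flink's keyed state.
        self.state = {}

    def process(self, key, value):
        total, count = self.state.get(key, (0, 0))
        total, count = total + value, count + 1
        self.state[key] = (total, count)
        return total / count

op = RunningAverage()
print(op.process("sensor-1", 10))  # 10.0
print(op.process("sensor-1", 20))  # 15.0  <- depends on the first event
print(op.process("sensor-2", 7))   # 7.0   <- independent per key
```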
Fault Tolerance via State Snapshots
- Flink provides fault-tolerant, exactly-once semantics (stream_processing_semantics) through a combination of state snapshots and stream replay
Snapshots
- Capture the entire state of the distributed pipeline asynchronously, recording:
  - Offsets into the input queues
  - State throughout the job graph that resulted from ingesting data up to that point
Failure
- Sources are rewound
- State is restored
- Processing resumes
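The checkpoint-and-replay cycle above can be sketched end to end; the checkpoint pairs an input offset with a state snapshot, so rewinding the source to that offset and restoring the state counts every event exactly once (a toy model, not Flink's distributed checkpointing protocol):

```python
# Sketch of snapshot-based recovery: checkpoint = (input offset, state),
# failure discards in-flight work, recovery rewinds and replays.

stream = ["a", "b", "a", "c", "a"]

def process(state, event):
    # The stateful operator: a running count per event type.
    state[event] = state.get(event, 0) + 1

state = {}
process(state, stream[0])
process(state, stream[1])
checkpoint = (2, dict(state))   # next offset to read + state snapshot

process(state, stream[2])       # work done after the checkpoint...
# -- failure here: everything since the checkpoint is lost --

# Recovery: restore the snapshotted state, rewind the source to the
# checkpointed offset, and resume processing.
offset, state = checkpoint[0], dict(checkpoint[1])
for event in stream[offset:]:
    process(state, event)

print(state)  # {'a': 3, 'b': 1, 'c': 1} -- each event counted exactly once
```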