03112024 1135
ABSTRACT
- How to implement streaming data processing pipelines and why Flink manages state
 - How to use event time to consistently compute accurate analytics
 - How to build event-driven applications on continuous streams
 - How Flink is able to provide fault-tolerant, stateful stream processing with exactly-once semantics
 
## Stream processing
| Paradigm | Description | 
|---|---|
| Batch Processing | For processing a bounded data stream | 
| Stream Processing | For processing an unbounded data stream (the input might never end) | 
Flink applications are composed of:
- Streaming data sources (message_queue)
- Operators - for transforming data
- Sinks - the result streams are sent to sinks (applications that need this data)
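The source → operator → sink shape can be sketched in plain Python (no Flink APIs; all names here are illustrative):

```python
def source():
    """Streaming data source (stands in for a message queue)."""
    for event in ["click", "view", "click"]:
        yield event

def operator(events):
    """Operator: transforms each element of the stream."""
    for event in events:
        yield event.upper()

def sink(events, out):
    """Sink: delivers the result stream to a consumer (here: a list)."""
    for event in events:
        out.append(event)

results = []
sink(operator(source()), results)
print(results)  # ['CLICK', 'VIEW', 'CLICK']
```

Because the source is a generator, elements flow through the operator to the sink one at a time rather than being materialized as a batch.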
 
### Parallel dataflows
INFO
Programs in Flink are inherently parallel and distributed
- During execution, a stream has one or more stream partitions
 - Each operator has one or more operator subtasks
- The operator subtasks are independent of one another
 - Execute in different threads / machines or containers
 
 
IMPORTANT
The number of operator subtasks is the parallelism of that particular operator
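A minimal sketch of the idea in plain Python (illustrative only; in Flink the subtasks would be separate threads, processes, or containers):

```python
# An operator with parallelism 3: each "subtask" is an independent
# instance of the operator, processing its own stream partition.
PARALLELISM = 3

def subtask(partition):
    # Independent map-operator instance; shares no state with its siblings.
    return [x * 2 for x in partition]

stream = list(range(9))
# Round-robin the stream into one partition per subtask.
partitions = [stream[i::PARALLELISM] for i in range(PARALLELISM)]
outputs = [subtask(p) for p in partitions]
print(outputs)  # [[0, 6, 12], [2, 8, 14], [4, 10, 16]]
```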
### Transporting data between operators
#### One-to-one forwarding
Preserves partitioning / ordering of elements
#### Redistributing streams
- Changes partitioning of streams
 - Ordering is preserved within each pair of sending / receiving subtasks
 - Each operator subtask sends data to different target subtasks, depending on the selected transformation
- Examples: `keyBy()`, `broadcast()`, `rebalance()`
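A `keyBy()`-style redistribution can be sketched in plain Python (not Flink API; the routing function is a stand-in for Flink's key-group hashing):

```python
import zlib

NUM_SUBTASKS = 2

def route(key):
    # Stable hash so the same key always maps to the same target subtask.
    return zlib.crc32(key.encode()) % NUM_SUBTASKS

records = [("user1", 1), ("user2", 1), ("user1", 2), ("user2", 2)]
inboxes = [[] for _ in range(NUM_SUBTASKS)]
for key, value in records:
    inboxes[route(key)].append((key, value))

# All records for a given key land on the same subtask, and arrive
# there in their original order; global ordering is NOT preserved.
for inbox in inboxes:
    print(inbox)
```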
 
## Timely Stream Processing
TLDR
Consider the timestamp at which the event occurred rather than when the event was received
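A plain-Python sketch of the difference (illustrative, not Flink's windowing API): events arrive out of order, but assigning them to windows by event time still yields accurate counts.

```python
from collections import Counter

# (event_timestamp_seconds, value) - arrival order != event-time order
events = [(5, "a"), (65, "b"), (10, "c"), (70, "d")]

WINDOW = 60  # one-minute tumbling windows

counts = Counter()
for ts, _ in events:
    counts[ts // WINDOW] += 1  # assign by event time, not arrival time

print(dict(counts))  # {0: 2, 1: 2}
```

Counting by arrival time instead would have put the late-arriving `(10, "c")` into whatever window happened to be open when it showed up.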
## Stateful Stream Processing
ABSTRACT
Flink's operations can be stateful: how one event is handled can depend on the accumulated effect of all the events that came before it
State is managed locally on each parallel instance, either on:
- the JVM heap
- on-disk data structures (if the state is too large for memory)
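A minimal sketch of a stateful operator in plain Python (illustrative; in Flink this per-key state would live on the JVM heap or in an on-disk state backend):

```python
state = {}  # local state of this parallel instance

def process(key):
    # The output for this event depends on all earlier events for the key.
    state[key] = state.get(key, 0) + 1
    return (key, state[key])

out = [process(k) for k in ["a", "b", "a", "a"]]
print(out)  # [('a', 1), ('b', 1), ('a', 2), ('a', 3)]
```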
 
## Fault Tolerance via State Snapshots
- Flink is able to provide fault-tolerant, exactly-once semantics (stream_processing_semantics) through a combination of state snapshots and stream replay
 
### Snapshots
- Capture the entire state of the distributed pipeline asynchronously
- Offsets into the input queues
 - State throughout the job graph that resulted from ingesting the data up to those offsets
 
 
### Failure
- Sources are rewound
 - State is restored
 - Processing resumes
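The rewind-restore-resume cycle can be sketched in plain Python (illustrative only; assumes a replayable source, which is what makes exactly-once state possible):

```python
stream = [1, 2, 3, 4, 5]  # a replayable source

# Run until a snapshot at offset 3, then pretend the job fails.
state = {"sum": 0}
for value in stream[:3]:
    state["sum"] += value
snapshot = {"offset": 3, "state": dict(state)}  # input offset + job-graph state

# Recovery: restore state, rewind the source to the offset, resume.
state = dict(snapshot["state"])
for value in stream[snapshot["offset"]:]:
    state["sum"] += value

print(state["sum"])  # 15 - same result as one uninterrupted run
```

Because the snapshot pairs the operator state with the exact source offset it corresponds to, replaying from that offset neither drops nor double-counts any event.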