03112024 1135
flink
ABSTRACT
- How to implement streaming data processing pipelines, and how and why Flink manages state
- How to use event time to consistently compute accurate analytics
- How to build event-driven applications on continuous streams
- How Flink is able to provide fault-tolerant, stateful stream processing with exactly-once semantics
Stream processing
Paradigm | Description |
---|---|
Batch Processing | Processes a bounded data stream |
Stream Processing | Processes an unbounded data stream (the input might never end) |
Flink applications are composed of:
- Streaming data sources (message_queue)
- Operators - transform the data
- Sinks - receive the result streams (applications that need this data)
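To make the source, operator, sink shape concrete, here is a minimal sketch in plain Python using generators. It is not the Flink API: the function names (`source`, `map_operator`, `filter_operator`, `sink`) and the sample records are made up for illustration, and everything runs in a single thread.

```python
from typing import Callable, Iterable, Iterator, List


def source(messages: Iterable[str]) -> Iterator[str]:
    """Streaming source: yields records one at a time (stand-in for a message queue)."""
    for message in messages:
        yield message


def filter_operator(stream: Iterable[str], predicate: Callable[[str], bool]) -> Iterator[str]:
    """Operator: drops records that do not satisfy the predicate."""
    for record in stream:
        if predicate(record):
            yield record


def map_operator(stream: Iterable[str], fn: Callable[[str], str]) -> Iterator[str]:
    """Operator: transforms each record as it flows past."""
    for record in stream:
        yield fn(record)


def sink(stream: Iterable[str], out: List[str]) -> None:
    """Sink: hands each result to a downstream consumer (here, a plain list)."""
    for record in stream:
        out.append(record)


# Wire the pieces together: source -> filter -> map -> sink.
results: List[str] = []
raw = source(["click", "", "view", "click"])
cleaned = filter_operator(raw, lambda r: r != "")
upper = map_operator(cleaned, str.upper)
sink(upper, results)
```

Because generators are lazy, each record travels through the whole chain before the next one is read, which is roughly how records flow through a streaming pipeline one at a time.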
Parallel dataflows
INFO
Programs in Flink are inherently parallel and distributed
- During execution, a stream has one or more stream partitions
- Each operator has one or more operator subtasks
- The operator subtasks are independent of one another and execute in different threads, possibly on different machines or containers
IMPORTANT
The number of operator subtasks is the parallelism of that particular operator
Transporting data between operators
One-to-one forwarding
Preserves partitioning / ordering of elements
Redistributing streams
- Changes partitioning of streams
- Ordering is preserved within each pair of sending / receiving subtasks
- Each operator subtask sends data to different target subtasks, depending on the selected transformation
- Examples: keyBy(), broadcast(), rebalance()
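The three redistribution patterns can be imitated in plain Python to show how each one decides which target subtask receives a record. This is a simplified sketch, not Flink code: `key_by`, `broadcast` and `rebalance` here are hypothetical helper functions, the CRC32 hash is an assumption chosen only because it is stable between runs (Flink's actual key assignment is more involved), and "subtasks" are just lists.

```python
import zlib
from typing import Any, Callable, List, Sequence


def key_by(records: Sequence[Any], key_fn: Callable[[Any], Any], parallelism: int) -> List[list]:
    """Hash partitioning: every record with the same key goes to the same target."""
    partitions: List[list] = [[] for _ in range(parallelism)]
    for record in records:
        # crc32 is stable across runs; Python's built-in hash() of str is not.
        index = zlib.crc32(str(key_fn(record)).encode("utf-8")) % parallelism
        partitions[index].append(record)
    return partitions


def broadcast(records: Sequence[Any], parallelism: int) -> List[list]:
    """Broadcast: every target receives a copy of every record."""
    return [list(records) for _ in range(parallelism)]


def rebalance(records: Sequence[Any], parallelism: int) -> List[list]:
    """Round-robin: records are spread evenly over the targets, ignoring keys."""
    partitions: List[list] = [[] for _ in range(parallelism)]
    for i, record in enumerate(records):
        partitions[i % parallelism].append(record)
    return partitions


events = [("alice", 1), ("bob", 2), ("alice", 3), ("carol", 4), ("bob", 5)]
keyed = key_by(events, lambda e: e[0], parallelism=2)
copied = broadcast(events, parallelism=2)
spread = rebalance(events, parallelism=2)
```

In all three helpers, records sent from one sender to one receiver keep their relative order, which mirrors the ordering guarantee noted above.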
Timely Stream Processing
TLDR
Consider the timestamp at which the event happened rather than when the event was received
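A small worked example of why this matters, in plain Python rather than Flink code. The records, the field names (`event_time`, `received_at`) and the 10-second window size are all assumptions made for illustration, and the sketch ignores how a real system decides when a window is complete.

```python
from collections import defaultdict
from typing import Dict, List


def count_per_window(events: List[dict], time_field: str, window_size_s: int) -> Dict[int, int]:
    """Count events per fixed-size window, keyed by the window's start time."""
    counts: Dict[int, int] = defaultdict(int)
    for event in events:
        window_start = (event[time_field] // window_size_s) * window_size_s
        counts[window_start] += 1
    return dict(counts)


# Listed in arrival order. Event "b" happened at t=8 but was delayed until t=21.
arrivals = [
    {"id": "a", "event_time": 3, "received_at": 4},
    {"id": "c", "event_time": 12, "received_at": 13},
    {"id": "b", "event_time": 8, "received_at": 21},
]

by_event_time = count_per_window(arrivals, "event_time", 10)    # {0: 2, 10: 1}
by_receive_time = count_per_window(arrivals, "received_at", 10)  # {0: 1, 10: 1, 20: 1}
```

Grouping by `event_time` puts the delayed event back in the window where it actually happened, so the result does not depend on network delays or on the order of arrival.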
Stateful Stream Processing
ABSTRACT
Flink's operations can be stateful: how one event is handled can depend on the accumulated effect of all the events that came before it
State is managed locally on each parallel instance, either:
- On the JVM heap
- In on-disk data structures (if the state is too large to fit in memory)
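As an illustration of a stateful operation, the sketch below keeps a running average per key in plain Python. It is not the Flink state API: the `RunningAverage` class and the sensor readings are hypothetical, and a dictionary in memory stands in for state held locally by one parallel instance.

```python
from typing import Dict, Tuple


class RunningAverage:
    """Stateful operator: the output for each event depends on all earlier events for that key."""

    def __init__(self) -> None:
        # Local state of this operator instance: key -> (count, total).
        self._state: Dict[str, Tuple[int, float]] = {}

    def process(self, key: str, value: float) -> float:
        count, total = self._state.get(key, (0, 0.0))
        count += 1
        total += value
        self._state[key] = (count, total)
        return total / count


op = RunningAverage()
readings = [("s1", 10.0), ("s2", 4.0), ("s1", 20.0), ("s1", 30.0)]
outputs = [op.process(key, value) for key, value in readings]
# outputs == [10.0, 4.0, 15.0, 20.0]
```

The third output is 15.0 rather than 20.0 only because the operator remembers the earlier reading for `s1`, which is what makes it stateful.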
Fault Tolerance via State Snapshots
- Flink is able to provide fault-tolerant, exactly-once semantics (stream_processing_semantics) through a combination of state snapshots and stream replay
Snapshots
- Capture the entire state of the distributed pipeline asynchronously: the offsets into the input queues, plus the state throughout the job graph that results from having ingested the data up to those offsets
Failure
- Sources are rewound
- State is restored
- Processing resumes
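The snapshot and recovery steps above can be sketched in plain Python. This is a toy model under stated assumptions, not how Flink implements checkpoints: the `CountingJob` class is hypothetical, the input is a replayable in-memory list, there is a single operator, and the snapshot is taken synchronously instead of asynchronously.

```python
import copy
from typing import Dict, List


class CountingJob:
    """Counts words from a replayable log, with snapshot and restore."""

    def __init__(self, log: List[str]) -> None:
        self.log = log                    # replayable input (stand-in for an input queue)
        self.offset = 0                   # position of the next record to read
        self.counts: Dict[str, int] = {}  # operator state

    def step(self) -> None:
        word = self.log[self.offset]
        self.counts[word] = self.counts.get(word, 0) + 1
        self.offset += 1

    def run_to_end(self) -> None:
        while self.offset < len(self.log):
            self.step()

    def snapshot(self) -> dict:
        # Record the input offset together with the state it produced.
        return {"offset": self.offset, "counts": copy.deepcopy(self.counts)}

    def restore(self, snap: dict) -> None:
        self.offset = snap["offset"]                 # the source is rewound
        self.counts = copy.deepcopy(snap["counts"])  # the state is restored


log = ["a", "b", "a", "c", "a", "b"]

job = CountingJob(log)
for _ in range(3):
    job.step()
checkpoint = job.snapshot()  # offset 3, counts {"a": 2, "b": 1}
job.step()
job.step()                   # progress made after the snapshot

# Simulated failure: the running job and its in-memory state are lost.
job = CountingJob(log)
job.restore(checkpoint)
job.run_to_end()             # processing resumes from offset 3
recovered = job.counts
```

Because the offset and the counts are saved together, the two records processed after the snapshot are replayed exactly once into the restored state, so `recovered` matches what a failure-free run would have produced.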