🗓️ 18022025 1441
📎

flink_savepoints

SUMMARY

Consistent image of the execution state of a streaming job, created via Flink’s [checkpointing mechanism](https://nightlies

Usage

Stop and resume
Fork / update flink job

Components

Directory with (large) binary files on stable storage

Net data of job's execution state image

Meta data file (relatively small)

Primarily contains pointers to all files on stable storage that are part of the Savepoint, in form of relative paths

In order to allow upgrades between programs and Flink versions, it is important to check out the following section about assigning IDs to your operators.

Operator UID

Used to scope the state of each operator

WARNING

It is highly recommended that you specify operator IDs via the uid(String) method

Generated IDs depend on the structure of your program and are sensitive to program changes

If you do not specify the IDs manually they will be generated automatically

You can automatically restore from the savepoint as long as these IDs do not change

Operator ID | State
------------+------------------------
source-id   | State of StatefulSource
mapper-id   | State of StatefulMapper

You can think of a savepoint as holding a map of Operator ID -> State for each stateful operator

Savepoint format

Canonical Format

:D
- Format that is unified across all state backends
- Most stable format
- Targeted at maintaining most compatibility with previous versions / schemas / modifications etc.
- Can store using one state backend and restore it using another
D:
- Slow

native format

Creates a snapshot in the format specific for the used state backend (e.g. SST files for RocksDB)

Claim mode

Determines determines who takes ownership of the files that make up a Savepoint or [externalized checkpoints](https://nightliesThe Claim Mode .apache.org/flink/flink-docs-release-1.20/docs/ops/state/checkpoints//#resuming-from-a-retained-checkpoint) after restoring i

Snapshots can be owned either by a user or Flink itself
If a snapshot is owned by a user, Flink will not delete its files
Flink can not depend on the existence of the files from such a snapshot, as it might be deleted outside of Flink’s control

TLDR

Each claim mode serves a specific purposes

Still, we (the ppl writing the docs) believe the default NO_CLAIM mode is a good tradeoff in most situations, as it provides clear ownership with a small price for the first checkpoint after the restore

NO_CLAIM (default)

In the NO_CLAIM mode Flink will not assume ownership of the snapshot
It will leave the files in user’s control and never delete any of the files
In this mode you can start multiple jobs from the same snapshot

NOTE

In order to make sure Flink does not depend on any of the files from that snapshot, it will force the first (successful) checkpoint to be a full checkpoint as opposed to an incremental one

Once the first full checkpoint completes, all subsequent checkpoints will be taken as usual/configured. Consequently, once a checkpoint succeeds you can manually delete the original snapshot. You can not do this earlier, because without any completed checkpoints Flink will - upon failure - try to recover from the initial snapshot.

NO_CLAIM claim mode

CLAIM

Flink claims ownership of the snapshot and essentially treats it like a checkpoint
- its controls the lifecycle
- Might delete it if it is not needed for recovery anymore
Hence, it is not safe to manually delete the snapshot or to start two jobs from the same snapshot
Flink keeps around a configured number of checkpoints.

CLAIM mode

References

https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/ops/state/savepoints/

Usage​

Components​

Operator UID​

Savepoint format

Canonical Format​

native format​

Claim mode​

NO_CLAIM (default)​

CLAIM​