Skip to main content

🗓️ 15012025 2139
📎

datastream_api

flink's data stream API

What can you stream using datastream APIs?

Anything that flink can serialize

Native serialization

Used for

  • basic types, i.e., String, Long, Integer, Boolean, Array
  • composite types: Java Tuples, POJOs, and Scala case classes
NOTE

Flink recognizes a data type as a POJO type (and allows “by-name” field referencing) if the following conditions are fulfilled:

  • The class is public and standalone (no non-static inner class)
  • The class has a public no-argument constructor
  • All non-static, non-transient fields in the class (and all superclasses) are either public (and non-final) or have public getter- and setter- methods that follow the Java beans naming conventions for getters and setters.

Other serializers

  • Falls back to Kryo for other types
  • Avro is well supported

Stream Execution Environment

  • Context in which Flink application runs
  • Starting point for building and executing Flink streaming jobs

Job Graph

  • Blueprint of data processing logic using the DataStream API
  • Built and attached to the StreamExecutionEnvironment
  • env.execute() - triggers packaging of graph and sending it to the JobManager

JobManager

  • Divides job into smaller, parallelisable tasks based on configured parallelism
  • Distributes jobs to TaskManagers

TaskManager

  • Handles actual computation
  • Each parallel slice of your job will be executed in a task slot
NOTE

A task slot represents allocated computational resources

Flink runtime: client, job manager, task managers

Stream Sources

Entry points for data into a Flink application.

TypeDescriptionExample
BoundedFinite dataFiles / Batch datasets
UnboundedInfinite data streamsMessage queues / socket connections
NOTE

Can define your own data source by implementing the SourceFunction interface for unbounded sources or RichSourceFunction for advanced features

Stream Sinks

Endpoints where processed data is sent or stored after computation

TypeDescription
StorageFiles, databases, distributed storage
Message QueueKafka / Rabbit MQ
MonitoringMonitoring tools like Prometheus or Grafana

Further Reading (I haven't read this)


References