Flink
→ DataStream API
→ connectors
- parquet connector
- hive read/write
- aws glue not yet supported
- file system, including writing partitioned parquet and adding the partitions to HMS
- iceberg
- hudi
- datagen, a source for sample data generation (see the sketch after this list)
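A minimal sketch of the DataGen source, assuming the flink-connector-datagen dependency and an existing StreamExecutionEnvironment env; the record shape, count, and rate are made up for illustration:

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.connector.source.util.ratelimit.RateLimiterStrategy;
import org.apache.flink.connector.datagen.source.DataGeneratorSource;
import org.apache.flink.connector.datagen.source.GeneratorFunction;

// generate 1000 sample strings, throttled to 10 records per second
GeneratorFunction<Long, String> generator = index -> "sample-" + index;
DataGeneratorSource<String> source = new DataGeneratorSource<>(
    generator,
    1000L,                             // total records to emit
    RateLimiterStrategy.perSecond(10), // throttle the source
    Types.STRING);

env.fromSource(source, WatermarkStrategy.noWatermarks(), "datagen").print();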
→ testing
→ deploy application
- deployment modes
- YARN
- standalone Kubernetes
- native Kubernetes
- Kubernetes operator
- S3 state backend
- s3p:// scheme (Presto S3 filesystem plugin) for S3 checkpoints (see the sketch after this list)
- native deploy with HA
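A minimal sketch of pointing checkpoint storage at the s3p:// scheme; the bucket name is an assumption:

// store checkpoints via the Presto S3 filesystem plugin (flink-s3-fs-presto)
env.getCheckpointConfig().setCheckpointStorage("s3p://my-bucket/checkpoints");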
→ recovery
By default, Flink creates checkpoints to guarantee exactly-once state consistency. However, it can also be configured to provide at-least-once guarantees:
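The snippets below use cpConfig for the job's CheckpointConfig; a minimal sketch of that setup (the 10 s interval is an assumption):

import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.enableCheckpointing(10_000L); // checkpoint every 10 s
CheckpointConfig cpConfig = env.getCheckpointConfig();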
// set mode to at-least-once
cpConfig.setCheckpointingMode(CheckpointingMode.AT_LEAST_ONCE);
By default, a failing checkpoint causes an exception that results in an application restart. You can disable this behavior and let the application continue after a checkpointing error.
// do not fail the job on a checkpointing error
cpConfig.setFailOnCheckpointingErrors(false);
// enable checkpoint compression
env.getConfig().setUseSnapshotCompression(true);
You can also enable externalized checkpoints to retain completed checkpoints after the application has stopped.
// enable externalized checkpoints
cpConfig.enableExternalizedCheckpoints(
    ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
→ Kafka sync
→ Kubernetes
- deploy on local kubernetes
# install a specific k3s version
root@localhost:~# export GITHUB_TOKEN=<GH-token>; curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION=v1.25.16+k3s3 sh -s
- a service account is needed
- example of a cluster role binding (see the commands after this list)
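A possible shape of those commands, following the Flink native-Kubernetes docs; the flink service account name and the default namespace are assumptions:

# create a service account for the Flink pods
kubectl create serviceaccount flink
# bind it to the edit cluster role so the JobManager can create and delete pods and config maps
kubectl create clusterrolebinding flink-role-binding --clusterrole=edit --serviceaccount=default:flink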
→ airflow
→ get filesystem
// resolve the Flink FileSystem behind an S3 path (requires the S3 filesystem plugin)
final org.apache.flink.core.fs.Path path =
    new org.apache.flink.core.fs.Path("s3a://test-bucket/");
org.apache.flink.core.fs.FileSystem fs = path.getFileSystem();
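A possible follow-up, assuming the S3 filesystem plugin is on the classpath: listing the bucket through the resolved FileSystem.

// list the bucket contents through the resolved FileSystem
for (org.apache.flink.core.fs.FileStatus status : fs.listStatus(path)) {
    System.out.println(status.getPath());
}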
→ Kafka
- best practices for kafka
- serialisation
- GenericRecord: see the Avro notes
- kafka globalConversion can add stuff also here
- INT96 is not supported in avro-parquet, so it is converted into a fixed 12-byte array
- avro parquet reader code
- transformation from INT96 to an Avro fixed(12) schema type (see the decoding sketch after this list)
- discussion about int96 in spark
- definition of INT96, a.k.a. IMPALA_TIMESTAMP
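A hedged sketch of that decoding: Parquet lays out INT96 timestamps as 12 little-endian bytes, 8 bytes of nanoseconds-of-day followed by a 4-byte Julian day. The class and method names are made up:

import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class Int96Decoder {
    private static final long JULIAN_DAY_OF_EPOCH = 2_440_588L; // Julian day of 1970-01-01
    private static final long MILLIS_PER_DAY = 86_400_000L;
    private static final long NANOS_PER_MILLI = 1_000_000L;

    // decode the 12-byte fixed produced by the INT96 -> fixed(12) transformation
    public static long toEpochMillis(byte[] int96) {
        ByteBuffer buf = ByteBuffer.wrap(int96).order(ByteOrder.LITTLE_ENDIAN);
        long nanosOfDay = buf.getLong(); // first 8 bytes
        int julianDay = buf.getInt();    // last 4 bytes
        return (julianDay - JULIAN_DAY_OF_EPOCH) * MILLIS_PER_DAY
            + nanosOfDay / NANOS_PER_MILLI;
    }
}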
→ Windowing
- tumbling window: splits time into discrete, non-overlapping chunks; together the chunks cover the total time exactly once.
- sliding window: splits time into overlapping chunks, so the same instant is counted n times, where n is the window size divided by the slide.
- cumulative window: re-counts the same part of a discrete chunk as the window grows. E.g. within a one-day chunk it fires at 1h, 2h, ... up to 24h, then resets and starts again at 1h, 2h, etc.
- session window: variable-length chunks based on data characteristics such as event frequency; a gap of inactivity closes the session.
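A minimal sketch of the first two assigners on a keyed stream; the events stream of (key, value) tuples is an assumption:

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.windowing.assigners.SlidingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

// tumbling: non-overlapping 1-minute chunks
DataStream<Tuple2<String, Long>> tumbling = events
    .keyBy(e -> e.f0)
    .window(TumblingEventTimeWindows.of(Time.minutes(1)))
    .sum(1);

// sliding: 1-minute windows advancing every 15 s, so each element lands in 60/15 = 4 windows
DataStream<Tuple2<String, Long>> sliding = events
    .keyBy(e -> e.f0)
    .window(SlidingEventTimeWindows.of(Time.minutes(1), Time.seconds(15)))
    .sum(1);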
→ Watermark
A watermark is nothing more than a special record injected into the stream that carries a timestamp t.
It flows through the operators and informs each operator that no more elements with a timestamp older than or equal to the watermark timestamp t should arrive after it.
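A minimal sketch of generating such watermarks with the bounded out-of-orderness strategy; the MyEvent type, its timestamp getter, and the 5 s bound are assumptions:

import java.time.Duration;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;

// watermarks trail the highest timestamp seen so far by 5 seconds
WatermarkStrategy<MyEvent> strategy = WatermarkStrategy
    .<MyEvent>forBoundedOutOfOrderness(Duration.ofSeconds(5))
    .withTimestampAssigner((event, recordTs) -> event.getTimestampMillis());

stream.assignTimestampsAndWatermarks(strategy);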