Flink
→ DataStream API
→ connectors
- parquet connector
- hive read/write
- aws glue not yet supported
- file system, including writing partitioned parquet and adding the partitions to HMS
- iceberg
- hudi
- datagen, a source for sample data generation (see the sketch after this list)
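A minimal sketch of the DataGen source, assuming the flink-connector-datagen dependency and an existing StreamExecutionEnvironment env; the record shape, count, and rate are made up for illustration:

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.connector.source.util.ratelimit.RateLimiterStrategy;
import org.apache.flink.connector.datagen.source.DataGeneratorSource;
import org.apache.flink.connector.datagen.source.GeneratorFunction;

// generate 1000 sample strings, throttled to 10 records per second
GeneratorFunction<Long, String> generator = index -> "sample-" + index;
DataGeneratorSource<String> source = new DataGeneratorSource<>(
    generator,
    1000L,                             // total records to emit
    RateLimiterStrategy.perSecond(10), // throttle the source
    Types.STRING);

env.fromSource(source, WatermarkStrategy.noWatermarks(), "datagen").print();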
→ testing
→ deploy application
- deployment modes
- YARN
- standalone Kubernetes
- native Kubernetes
- Kubernetes operator
- S3 state backend
- s3p:// scheme (Presto S3 filesystem plugin) for S3 checkpoints (see the sketch after this list)
- native deploy with HA
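A minimal sketch of pointing checkpoint storage at the s3p:// scheme; the bucket name is an assumption:

// store checkpoints via the Presto S3 filesystem plugin (flink-s3-fs-presto)
env.getCheckpointConfig().setCheckpointStorage("s3p://my-bucket/checkpoints");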
→ recovery
By default, Flink creates checkpoints to guarantee exactly-once state consistency. However, it can also be configured to provide at-least-once guarantees:
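The snippets below use cpConfig for the job's CheckpointConfig; a minimal sketch of that setup (the 10 s interval is an assumption):

import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.enableCheckpointing(10_000L); // checkpoint every 10 s
CheckpointConfig cpConfig = env.getCheckpointConfig();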
// set mode to at-least-once
cpConfig.setCheckpointingMode(CheckpointingMode.AT_LEAST_ONCE);
By default, a failing checkpoint causes an exception that results in an application restart. You can disable this behavior and let the application continue after a checkpointing error.
// do not fail the job on a checkpointing error
cpConfig.setFailOnCheckpointingErrors(false);
// enable checkpoint compression
env.getConfig().setUseSnapshotCompression(true);
You can also enable externalized checkpoints to retain completed checkpoints after the application has stopped.
// enable externalized checkpoints
cpConfig.enableExternalizedCheckpoints(
    ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
→ Kafka sync
→ Kubernetes
- deploy on local kubernetes
# install a specific k3s version
root@localhost:~# export GITHUB_TOKEN=<GH-token>; curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION=v1.25.16+k3s3 sh -s
- a service account is needed
- example of a cluster role binding (see the commands after this list)
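A possible shape of those commands, following the Flink native-Kubernetes docs; the flink service account name and the default namespace are assumptions:

# create a service account for the Flink pods
kubectl create serviceaccount flink
# bind it to the edit cluster role so the JobManager can create and delete pods and config maps
kubectl create clusterrolebinding flink-role-binding --clusterrole=edit --serviceaccount=default:flink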
→ airflow
→ get filesystem
// resolve the Flink FileSystem behind an S3 path (requires the S3 filesystem plugin)
final org.apache.flink.core.fs.Path path =
    new org.apache.flink.core.fs.Path("s3a://test-bucket/");
org.apache.flink.core.fs.FileSystem fs = path.getFileSystem();
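A possible follow-up, assuming the S3 filesystem plugin is on the classpath: listing the bucket through the resolved FileSystem.

// list the bucket contents through the resolved FileSystem
for (org.apache.flink.core.fs.FileStatus status : fs.listStatus(path)) {
    System.out.println(status.getPath());
}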
→ Kafka
- best practices for kafka
- serialisation
- GenericRecord: see the Avro notes
- kafka globalConversion can add stuff also here
- INT96 is not supported in avro-parquet, so it is converted into a fixed 12-byte array
- avro parquet reader code
- transformation from INT96 to an Avro fixed(12) schema type (see the decoding sketch after this list)
- discussion about int96 in spark
- definition of INT96, a.k.a. IMPALA_TIMESTAMP
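A hedged sketch of that decoding: Parquet lays out INT96 timestamps as 12 little-endian bytes, 8 bytes of nanoseconds-of-day followed by a 4-byte Julian day. The class and method names are made up:

import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class Int96Decoder {
    private static final long JULIAN_DAY_OF_EPOCH = 2_440_588L; // Julian day of 1970-01-01
    private static final long MILLIS_PER_DAY = 86_400_000L;
    private static final long NANOS_PER_MILLI = 1_000_000L;

    // decode the 12-byte fixed produced by the INT96 -> fixed(12) transformation
    public static long toEpochMillis(byte[] int96) {
        ByteBuffer buf = ByteBuffer.wrap(int96).order(ByteOrder.LITTLE_ENDIAN);
        long nanosOfDay = buf.getLong(); // first 8 bytes
        int julianDay = buf.getInt();    // last 4 bytes
        return (julianDay - JULIAN_DAY_OF_EPOCH) * MILLIS_PER_DAY
            + nanosOfDay / NANOS_PER_MILLI;
    }
}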
→ Windowing
- tumbling window: splits time into discrete, non-overlapping chunks; together the chunks cover the total time exactly once.
- sliding window: splits time into overlapping chunks, so the same instant is counted n times, where n is the window size divided by the slide.
- cumulative window: re-counts the same part of a discrete chunk as the window grows. E.g. within a one-day chunk it fires at 1h, 2h, ... up to 24h, then resets and starts again at 1h, 2h, etc.
- session window: variable-length chunks based on data characteristics such as event frequency; a gap of inactivity closes the session.
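A minimal sketch of the first two assigners on a keyed stream; the events stream of (key, value) tuples is an assumption:

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.windowing.assigners.SlidingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

// tumbling: non-overlapping 1-minute chunks
DataStream<Tuple2<String, Long>> tumbling = events
    .keyBy(e -> e.f0)
    .window(TumblingEventTimeWindows.of(Time.minutes(1)))
    .sum(1);

// sliding: 1-minute windows advancing every 15 s, so each element lands in 60/15 = 4 windows
DataStream<Tuple2<String, Long>> sliding = events
    .keyBy(e -> e.f0)
    .window(SlidingEventTimeWindows.of(Time.minutes(1), Time.seconds(15)))
    .sum(1);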
→ Watermark
A watermark is nothing more than a special record injected into the stream that carries a timestamp t.
It flows through the operators and informs each operator that no more elements with a timestamp older than or equal to the watermark timestamp t should arrive after it.
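A minimal sketch of generating such watermarks with the bounded out-of-orderness strategy; the MyEvent type, its timestamp getter, and the 5 s bound are assumptions:

import java.time.Duration;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;

// watermarks trail the highest timestamp seen so far by 5 seconds
WatermarkStrategy<MyEvent> strategy = WatermarkStrategy
    .<MyEvent>forBoundedOutOfOrderness(Duration.ofSeconds(5))
    .withTimestampAssigner((event, recordTs) -> event.getTimestampMillis());

stream.assignTimestampsAndWatermarks(strategy);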