#spark

ACID Tables

Comparison

Apache Iceberg

Delta Lake

Features

Limitations

Hudi

Blogs

Tips

GDPR Handling

Case of huge append-only tables:

One can use MOR tables and bulk-insert the data with a global sort for read performance; this adds base files only. Updates/deletes can then be applied by adding log files on a regular basis. Table compaction could run on a monthly basis, which would drastically reduce file amplification. As for the compaction strategy, an inline compaction with large resources could be triggered from time to time. One must also be careful with OCC (optimistic concurrency control) to avoid killing the long-running job. The compaction scope can likely be limited either by the number of log files per base file or by a day-based partition strategy; there are lots of strategies.
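A minimal PySpark sketch of that setup, assuming a Hudi table named "events" with illustrative record key/precombine/partition fields (the option keys are standard Hudi configs, and DayBasedCompactionStrategy is one of the stock strategies for bounding the compaction scope):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    # placeholder rows; a real job would read from the ingestion source
    df = spark.createDataFrame(
        [(1, "2023-06-05T21:00:00", "2023/06/05")],
        ["id", "ts", "day"],
    )

    hudi_opts = {
        "hoodie.table.name": "events",                         # assumed name
        "hoodie.datasource.write.table.type": "MERGE_ON_READ",
        "hoodie.datasource.write.recordkey.field": "id",       # assumed
        "hoodie.datasource.write.precombine.field": "ts",      # assumed
        "hoodie.datasource.write.partitionpath.field": "day",  # assumed
        # appends: bulk_insert with a global sort, so only base files are added
        "hoodie.datasource.write.operation": "bulk_insert",
        "hoodie.bulkinsert.sort.mode": "GLOBAL_SORT",
        # keep compaction out of the ingestion job; run it separately with
        # large resources, scoped day by day to bound each run
        "hoodie.compact.inline": "false",
        "hoodie.compaction.strategy":
            "org.apache.hudi.table.action.compact.strategy.DayBasedCompactionStrategy",
    }

    df.write.format("hudi").options(**hudi_opts).mode("append").save("s3://foo/bar")

Updates/deletes would then go through a separate upsert/delete writer on the same path, producing log files until the scheduled compaction merges them.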

Another approach would be to leverage the very fast COW (copy-on-write) feature.

Questions remain:

Case of normal tables

Either method can be used.

Metadata table

Hudi CLI

You can roll back commits, but only starting from the latest and moving backwards, one commit at a time; you cannot roll back an arbitrary commit directly.

connect --path s3://foo/bar
help commit rollback
commit rollback --commit 20230605210008584 --rollbackUsingMarkers false

Z-order

    "hoodie.clustering.plan.strategy.sort.columns": "col1,col2,col3", "hoodie.layout.optimize.build.curve.sample.size": "3",
    "hoodie.layout.optimize.curve.build.method": "sample",
    "hoodie.layout.optimize.strategy": "hilbert",
    "hoodie.bulkinsert.user.defined.partitioner.class": "org.apache.hudi.execution.bulkinsert.RowSpatialCurveSortPartitioner",

Bloom investigations

In Apache Spark:

In Apache Hudi:

For bulk_insert:

For insert:

Comment Investigation

Sorting

Custom sort

It does a sort (order by) on the user-specified columns and then a coalesce based on the shuffle parallelism (see the Spark sketch after the last mode below).

GlobalSortPartitionerWithRows

It does a sort (order by) on both the _hoodie_partition_path and _hoodie_record_key columns and then a coalesce based on the shuffle parallelism.

PartitionPathRepartitionAndSortPartitionerWithRows

It does a repartition based on _hoodie_partition_path and then a sortWithinPartitions, again on _hoodie_partition_path.

It may seem odd to both repartition and sort; the docs explain:

this sort mode does an additional step of sorting the records based on the partition path within a single Spark partition, given that data for multiple physical partitions can be sent to the same Spark partition and executor. If data is skewed (most records are intended for a handful of partition paths among all) then this can cause an imbalance among Spark executors.

PartitionSortPartitionerWithRows

Same as PartitionPathRepartitionAndSortPartitionerWithRows, but without the repartition before sorting.
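
Roughly, the four modes map onto plain Spark operations as follows (a sketch only; the placeholder DataFrame and n, standing for the configured shuffle parallelism, are assumptions):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    # placeholder rows with the Hudi meta columns plus one user column
    df = spark.createDataFrame(
        [("2023/06/05", "k1", 10), ("2023/06/06", "k2", 7)],
        ["_hoodie_partition_path", "_hoodie_record_key", "col1"],
    )
    n = 4  # stands for the configured shuffle parallelism

    # Custom sort: order by the user-specified columns, then coalesce
    custom = df.orderBy("col1").coalesce(n)

    # GlobalSortPartitionerWithRows: global order by partition path + record key
    global_sorted = (df.orderBy("_hoodie_partition_path", "_hoodie_record_key")
                       .coalesce(n))

    # PartitionPathRepartitionAndSortPartitionerWithRows: shuffle by partition
    # path, then sort within each Spark partition on the same column
    repart_sorted = (df.repartition(n, "_hoodie_partition_path")
                       .sortWithinPartitions("_hoodie_partition_path"))

    # PartitionSortPartitionerWithRows: same sort, but without the repartition
    part_sorted = df.sortWithinPartitions("_hoodie_partition_path")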

Z-order support

Sizing

Merge-on-read tables

Stats columns

Plugins

Trino

React?
