Parquet
→ Compression
Read the compression codec of a Parquet file from its column-chunk metadata:
import pyarrow.parquet as pq
parquet_file = pq.ParquetFile("/tmp/enc/part-00000-e9f69621-ef56-42b4-bf1b-997e8f2b088e-c000.snappy.parquet")
# metadata of the first column chunk in the first row group
parquet_file.metadata.row_group(0).column(0)
<pyarrow._parquet.ColumnChunkMetaData object at 0x7fa724db64f0>
file_offset: 4
file_path:
physical_type: INT32
num_values: 1
path_in_schema: foo
is_stats_set: True
statistics:
<pyarrow._parquet.Statistics object at 0x7fa725374180>
has_min_max: True
min: 1
max: 1
null_count: 0
distinct_count: 0
num_values: 1
physical_type: INT32
logical_type: None
converted_type (legacy): NONE
compression: SNAPPY
encodings: ('BIT_PACKED', 'PLAIN')
has_dictionary_page: False
dictionary_page_offset: None
data_page_offset: 4
total_compressed_size: 29
total_uncompressed_size: 27
→ Parquet-cli
→ Basic install
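In short:
- git clone https://github.com/apache/parquet-mr.git
- in parquet-mr/parquet-cli, run mvn install -DskipTests
- run mvn dependency:copy-dependencies
- then, still from the parquet-cli directory (the classpath is relative, and the jar version must match the one you built):
alias parquet-cli="java -cp 'target/parquet-cli-1.12.2.jar:target/dependency/*' org.apache.parquet.cli.Main"
e.g.:
parquet-cli meta file.parquet | grep column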
→ Docker installation
FROM maven:3.8.3-openjdk-8-slim AS builder
ARG VERSION_PARQUET_MR=1.12.2
RUN apt update && apt install -y git
RUN git clone --depth 1 -b "apache-parquet-${VERSION_PARQUET_MR}" https://github.com/apache/parquet-mr.git
WORKDIR /parquet-mr/parquet-cli
RUN mvn package -B -DskipTests
RUN mvn dependency:copy-dependencies
RUN mkdir /parquet-cli
RUN cp \
/parquet-mr/parquet-cli/target/parquet-cli-${VERSION_PARQUET_MR}-runtime.jar \
/parquet-cli/parquet-cli.jar
RUN cp -r \
/parquet-mr/parquet-cli/target/dependency \
/parquet-cli
FROM amazoncorretto:8-alpine3.20
RUN apk add --no-cache tini git
COPY --from=builder /parquet-cli /parquet-cli
ENTRYPOINT ["/sbin/tini", "--", "java", "-cp", "/parquet-cli/*:/parquet-cli/dependency/*", "org.apache.parquet.cli.Main"]
docker build -t parquet-cli:latest .
docker run -v /tmp/:/data parquet-cli:latest pages /data/file.parquet
→ Parquet row groups
- Dremio recommends one row group per Parquet file, and uses 256 MB row groups
- the Parquet documentation itself recommends large row groups, 512 MB to 1 GB
→ Parquet encoding