versioning

Parquet

Compression

Get a Parquet file's compression codec (stored per column chunk):

import pyarrow as pa
import pyarrow.parquet as pq
parquet_file = pq.ParquetFile("/tmp/enc/part-00000-e9f69621-ef56-42b4-bf1b-997e8f2b088e-c000.snappy.parquet")
parquet_file.metadata.row_group(0).column(0)

<pyarrow._parquet.ColumnChunkMetaData object at 0x7fa724db64f0>
  file_offset: 4
  file_path:
  physical_type: INT32
  num_values: 1
  path_in_schema: foo
  is_stats_set: True
  statistics:
    <pyarrow._parquet.Statistics object at 0x7fa725374180>
      has_min_max: True
      min: 1
      max: 1
      null_count: 0
      distinct_count: 0
      num_values: 1
      physical_type: INT32
      logical_type: None
      converted_type (legacy): NONE
  compression: SNAPPY
  encodings: ('BIT_PACKED', 'PLAIN')
  has_dictionary_page: False
  dictionary_page_offset: None
  data_page_offset: 4
  total_compressed_size: 29
  total_uncompressed_size: 27

Parquet-cli

Basic install

In short:

  1. git clone parquet-mr
  2. in the parquet-cli subdirectory, run mvn install -DskipTests
  3. then mvn dependency:copy-dependencies
  4. finally: alias parquet-cli="java -cp 'target/parquet-cli-1.12.2.jar:target/dependency/*' org.apache.parquet.cli.Main"

e.g.:

parquet-cli meta file.parquet | grep column

Docker installation

docker build -t parquet-cli:latest .
docker run -v /tmp/:/data parquet-cli:latest pages /data/file.parquet

Parquet row groups

Parquet encoding

React ?
