Output in parquet format increases file size

Asked 2 years ago, Updated 2 years ago, 46 views

I load an arbitrary file as binary using Apache Arrow's pyarrow,
put the bytes in a list, and write the list out in Parquet format.
I tested this with the source below, but the Parquet output is
about six times larger than the original file (roughly 450 MB versus 73 MB in the listing below).
When I opened the output file as a text file, the data appeared to be stored
without any compression, even though I pass compression="gzip".
Is there a way to get it compressed?

# -*- coding: utf-8 -*-
import pyarrow as pa
import pyarrow.parquet as pq

open_data_path = "test_log_70mb.txt"
file_list = []
# Read the whole file as a single bytes value.
with open(open_data_path, 'rb') as f:
    data = bytes(f.read())
    file_list.append(data)

# One binary column holding one value (the entire file).
pa_data = [
    pa.array(file_list)
]

pa_batch = pa.RecordBatch.from_arrays(pa_data, ["file_list"])
table = pa.Table.from_batches([pa_batch])
pq.write_table(table, "./test_parquet", compression="gzip")

Results

total 71734
-rwxrwxrwx 1 vagrant vagrant      423 Aug 25 02:05 test2.py
-rwxrwxrwx 1 vagrant vagrant 73454817 Aug 25 02:05 test_log_70mb.txt
vagrant@apex01:/vagrant/arrow_test$ python test2.py
vagrant@apex01:/vagrant/arrow_test$ ls -l
total 511880
-rwxrwxrwx 1 vagrant vagrant       423 Aug 25 02:05 test2.py
-rwxrwxrwx 1 vagrant vagrant  73454817 Aug 25 02:05 test_log_70mb.txt
-rwxrwxrwx 1 vagrant vagrant 450709316 Aug 25 02:05 test_parquet

Environment
python 2.7
pyarrow 0.6.0

python apache

2022-09-30 15:42

1 Answer

How about compression='snappy'?


2022-09-30 15:42
