Skip to content

将PostgreSQL的表转换成parquet的几种方法 #42

@Z-Xiao-M

Description

@Z-Xiao-M

茴香豆的茴字,怎样写的?

读过书,……我便要考你一考。茴香豆的茴字,怎样写的?

孔乙己喝过半碗酒,涨红的脸色渐渐复了原,旁人便又问道,“孔乙己,你当真认识字么?”孔乙己看着问他的人,显出不屑置辩的神气。他们便接着说道,“你怎的连半个秀才也捞不到呢?”孔乙己立刻显出颓唐不安模样,脸上笼上了一层灰色,嘴里说些话;这回可是全是之乎者也之类,一些不懂了。在这时候,众人也都哄笑起来:店内外充满了快活的空气。

在这些时候,我可以附和着笑,掌柜是决不责备的。而且掌柜见了孔乙己,也每每这样问他,引人发笑。孔乙己自己知道不能和他们谈天,便只好向孩子说话。有一回对我说道,“你读过书么?”我略略点一点头。他说,“读过书,……我便要考你一考。茴香豆的茴字,怎样写的?”我想,讨饭一样的人,也配考我么? 便回过脸去,不再理会。孔乙己等了许久,很恳切的说道,“不能写罢?……我教给你,记着!这些字应该记着。将来做掌柜的时候,写帐要用。”我暗想我和掌柜的等级还很远呢,而且我们掌柜也从不将茴香豆上帐;又好笑,又不耐烦,懒懒的答他道,“谁要你教,不是草头底下一个来回的回字么?”孔乙己显出极高兴的样子,将两个指头的长指甲敲着柜台,点头说,“对呀对呀!……回字有四样写法,你知道么?”我愈不耐烦了,努着嘴走远。孔乙己刚用指甲蘸了酒,想在柜上写字,见我毫不热心,便又叹一口气,显出极惋惜的样子。

有几回,邻舍孩子听得笑声,也赶热闹,围住了孔乙己。他便给他们茴香豆吃,一人一颗。孩子吃完豆,仍然不散,眼睛都望着碟子。孔乙己着了慌,伸开五指将碟子罩住,弯腰下去说道,“不多了,我已经不多了。”直起身又看一看豆,自己摇头说,“不多不多!多乎哉?不多也。”于是这一群孩子都在笑声里走散了。

怎么将PostgreSQL的表转换成parquet

学过PostregreSQL,……我便要考你一考。将PostgreSQL的表转换成parquet,该怎样做呢?😀

postgresql配合duckdb

PostgreSQL生成csv,再通过duckdb转换成parquet

来自老乡的blog https://github.com/digoal/blog/blob/master/202311/20231130_01.md

psql -c "copy (select id, md5(random()::text) as info, clock_timestamp() ts from generate_series(1,10000) id) to stdout with (format csv, header on)" | duckdb -c "COPY (SELECT * FROM read_csv('/dev/stdin', delim=',', header=true, columns={'id': 'INTEGER', 'info': 'VARCHAR', 'ts': 'timestamp'})) TO '/tmp/test.parquet' (FORMAT 'parquet', COMPRESSION 'ZSTD', ROW_GROUP_SIZE 100000);"

主要分成两个部分

-- postgresql执行
copy (select id, md5(random()::text) as info, clock_timestamp() ts from generate_series(1,10000) id) to stdout with (format csv, header on)
-- duckdb执行
COPY (SELECT * FROM read_csv('/dev/stdin', delim=',', header=true, columns={'id': 'INTEGER', 'info': 'VARCHAR', 'ts': 'timestamp'})) TO '/tmp/test.parquet' (FORMAT 'parquet', COMPRESSION 'ZSTD', ROW_GROUP_SIZE 100000);

简单演示

postgres@zxm-VMware-Virtual-Platform:~/test$ psql -c "copy (select id, md5(random()::text) as info, clock_timestamp() ts from generate_series(1,10000) id) to stdout with (format csv, header on)" | duckdb -c "COPY (SELECT * FROM read_csv('/dev/stdin', delim=',', header=true, columns={'id': 'INTEGER', 'info': 'VARCHAR', 'ts': 'timestamp'})) TO '/tmp/test.parquet' (FORMAT 'parquet', COMPRESSION 'ZSTD', ROW_GROUP_SIZE 100000);"  
postgres@zxm-VMware-Virtual-Platform:~/test$ file /tmp/test.parquet
/tmp/test.parquet: Apache Parquet
postgres@zxm-VMware-Virtual-Platform:~/test$ duckdb
DuckDB v1.3.2 (Ossivalis) 0b83e5d2f6
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
D select count(*) from '/tmp/test.parquet';
┌──────────────┐
│ count_star() │
│    int64     │
├──────────────┤
│    10000     │
└──────────────┘
D select * from '/tmp/test.parquet' limit 5;
┌───────┬──────────────────────────────────┬────────────────────────────┐
│  id   │               info               │             ts             │
│ int32 │             varchartimestamp          │
├───────┼──────────────────────────────────┼────────────────────────────┤
│     1 │ ebacc378cc0186bc3dc9eb55aae01e66 │ 2025-09-23 09:43:43.807629 │
│     2 │ 31c4f9c72131e3c345b67ca3cb0be76f │ 2025-09-23 09:43:43.807645 │
│     3 │ b280ef0e548239ddadcf3f20922b6203 │ 2025-09-23 09:43:43.807648 │
│     4 │ fd34351a122b6410380c486f295c849a │ 2025-09-23 09:43:43.80765  │
│     5 │ 84ebc3f89f473d213fbb5bb933428a9b │ 2025-09-23 09:43:43.807653 │
└───────┴──────────────────────────────────┴────────────────────────────┘
D 

duckdb-postgres

使用duckdb的插件duckdb-postgres,以前也叫做postgres_scanner来着,去拉取远端PostgreSQL的数据,然后使用duckdb的copy语句

postgres@zxm-VMware-Virtual-Platform:~/test$ psql
psql (16.10)
Type "help" for help.

postgres=# create table test as select id, md5(random()::text) as info, clock_timestamp() ts from generate_series(1,10000) id;
SELECT 10000
postgres=# \q
postgres@zxm-VMware-Virtual-Platform:~/test$ duckdb
DuckDB v1.3.2 (Ossivalis) 0b83e5d2f6
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
D INSTALL postgres;
D 
D LOAD postgres;
D ATTACH 'dbname=postgres user=postgres host=127.0.0.1 port=5432' AS db (TYPE postgres, SCHEMA 'public');
D select count(*) from db.test;
┌──────────────┐
│ count_star() │
│    int64     │
├──────────────┤
│    10000     │
└──────────────┘
D COPY db.test TO '/tmp/test2.parquet';
D  select count(*) from '/tmp/test2.parquet';
┌──────────────┐
│ count_star() │
│    int64     │
├──────────────┤
│    10000     │
└──────────────┘
D select * from '/tmp/test2.parquet' limit 5;
┌───────┬──────────────────────────────────┬───────────────────────────────┐
│  id   │               info               │              ts               │
│ int32 │             varchartimestamp with time zone    │
├───────┼──────────────────────────────────┼───────────────────────────────┤
│     1 │ 03035730b2e25daa5c0846ac4240c084 │ 2025-09-23 18:00:01.095988+08 │
│     2 │ 76d27fc563a6bb33e58a3671494129f6 │ 2025-09-23 18:00:01.096066+08 │
│     3 │ 8f508b8def12eab455165dd6eb8a4cd4 │ 2025-09-23 18:00:01.096071+08 │
│     4 │ 0e2af5b95ae49489e4c3d7bdae7bf393 │ 2025-09-23 18:00:01.096074+08 │
│     5 │ d084e3c4d1106debae2d9efdaa2d6449 │ 2025-09-23 18:00:01.096077+08 │
└───────┴──────────────────────────────────┴───────────────────────────────┘
D 

pg_duckdb

这是PostgreSQL的插件,pg_duckdb增强了PostgreSQL的COPY,原生的PostgreSQL的COPY是不支持指定格式为PARQUET的 https://www.postgresql.org/docs/16/sql-copy.html

postgres@zxm-VMware-Virtual-Platform:~/test$ psql
psql (16.10)
Type "help" for help.

postgres=# \dx
                  List of installed extensions
   Name    | Version |   Schema   |         Description          
-----------+---------+------------+------------------------------
 pg_duckdb | 1.1.0   | public     | DuckDB Embedded in Postgres
 plpgsql   | 1.0     | pg_catalog | PL/pgSQL procedural language
(2 rows)

postgres=# COPY test TO '/tmp/test3.parquet';
COPY 10000
postgres=# \! ls /tmp/test3.parquet -al
-rw------- 1 postgres postgres 406699  9月 23 18:11 /tmp/test3.parquet
postgres=# \! file /tmp/test3.parquet
/tmp/test3.parquet: Apache Parquet
postgres=# \q
postgres@zxm-VMware-Virtual-Platform:~/test$ duckdb 
DuckDB v1.3.2 (Ossivalis) 0b83e5d2f6
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
D select count(*) from '/tmp/test3.parquet';
┌──────────────┐
│ count_star() │
│    int64     │
├──────────────┤
│    10000     │
└──────────────┘
D select * from '/tmp/test3.parquet' limit 5;
┌───────┬──────────────────────────────────┬───────────────────────────────┐
│  id   │               info               │              ts               │
│ int32 │             varchartimestamp with time zone    │
├───────┼──────────────────────────────────┼───────────────────────────────┤
│     1 │ 03035730b2e25daa5c0846ac4240c084 │ 2025-09-23 18:00:01.095988+08 │
│     2 │ 76d27fc563a6bb33e58a3671494129f6 │ 2025-09-23 18:00:01.096066+08 │
│     3 │ 8f508b8def12eab455165dd6eb8a4cd4 │ 2025-09-23 18:00:01.096071+08 │
│     4 │ 0e2af5b95ae49489e4c3d7bdae7bf393 │ 2025-09-23 18:00:01.096074+08 │
│     5 │ d084e3c4d1106debae2d9efdaa2d6449 │ 2025-09-23 18:00:01.096077+08 │
└───────┴──────────────────────────────────┴───────────────────────────────┘
D .exit
postgres@zxm-VMware-Virtual-Platform:~/test$ psql
psql (16.10)
Type "help" for help.

postgres=# COPY test TO '/tmp/test4' WITH (FORMAT PARQUET);
COPY 10000
postgres=# \! ls /tmp/test4 -al
-rw------- 1 postgres postgres 406699  9月 23 18:19 /tmp/test4
postgres=# \! file /tmp/test4
/tmp/test4: Apache Parquet
postgres=# \q
postgres@zxm-VMware-Virtual-Platform:~/test$ duckdb 
DuckDB v1.3.2 (Ossivalis) 0b83e5d2f6
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
D select count(*) from read_parquet('/tmp/test4');
┌──────────────┐
│ count_star() │
│    int64     │
├──────────────┤
│    10000     │
└──────────────┘
D select * from read_parquet('/tmp/test4') limit 5;
┌───────┬──────────────────────────────────┬───────────────────────────────┐
│  id   │               info               │              ts               │
│ int32 │             varchartimestamp with time zone    │
├───────┼──────────────────────────────────┼───────────────────────────────┤
│     1 │ 03035730b2e25daa5c0846ac4240c084 │ 2025-09-23 18:00:01.095988+08 │
│     2 │ 76d27fc563a6bb33e58a3671494129f6 │ 2025-09-23 18:00:01.096066+08 │
│     3 │ 8f508b8def12eab455165dd6eb8a4cd4 │ 2025-09-23 18:00:01.096071+08 │
│     4 │ 0e2af5b95ae49489e4c3d7bdae7bf393 │ 2025-09-23 18:00:01.096074+08 │
│     5 │ d084e3c4d1106debae2d9efdaa2d6449 │ 2025-09-23 18:00:01.096077+08 │
└───────┴──────────────────────────────────┴───────────────────────────────┘
D 

pg_parquet

pg_parquet这也是PostgreSQL的插件,同pg_duckdb一致的用法,增强了PostgreSQL的COPY,不过它是一个rust写的插件。这里就贴一下官方的测试案例好了。

-- create composite types
CREATE TYPE product_item AS (id INT, name TEXT, price float4);
CREATE TYPE product AS (id INT, name TEXT, items product_item[]);

-- create a table with complex types
CREATE TABLE product_example (
    id int,
    product product,
    products product[],
    created_at TIMESTAMP,
    updated_at TIMESTAMPTZ
);

-- insert some rows into the table
insert into product_example values (
    1,
    ROW(1, 'product 1', ARRAY[ROW(1, 'item 1', 1.0), ROW(2, 'item 2', 2.0), NULL]::product_item[])::product,
    ARRAY[ROW(1, NULL, NULL)::product, NULL],
    now(),
    '2022-05-01 12:00:00-04'
);

-- copy the table to a parquet file
COPY product_example TO '/tmp/product_example.parquet' (format 'parquet', compression 'gzip');

-- show table
SELECT * FROM product_example;

-- copy the parquet file to the table
COPY product_example FROM '/tmp/product_example.parquet';

-- show table
SELECT * FROM product_example;

Metadata

Metadata

Assignees

No one assigned

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions